Machine Learning Notes - Adapted from Macalester’s STAT 253 SP25 Course by Brianna Heggeseth

Introduction

Machine Learning: using algorithms that can learn from existing patterns and make predictions

Supervised vs Unsupervised Learning

Supervised Learning

  • model relationship between input \(x = (x_1, x_2, \cdots, x_p)\) and output \(y\).

\[\begin{align*}y &= f(x) + \epsilon \\ &= \text{ trend in relationship } + \text{ residual deviation from trend }\end{align*}\]

  • Types:
    • Regression (\(y\) is quantitative)
    • Classification (\(y\) is categorical)

Unsupervised Learning

  • No output variable
  • Goal is to use \(x\) to understand/modify the structure of the data with respect to \(x\)
  • Types:
    • Clustering (group cases with similar \(x\)’s)
    • Dimension Reduction (turn the original set of \(p\) input variables into a set of \(k<p\) variables)

Regression

We are trying to build a model \(f(x)\) of some quantitative output.

\[y = f(x) + \epsilon\]

```{r}
#| message: false
#| warning: false
#| eval: false

library(tidymodels)

# model specification
model_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine(engine = "lm")

# model recipe
data_recipe <- recipe(y ~ ., data = ___)

# model workflow
model_wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(model_spec)

# cross validation
multiple_models <- model_wf %>%
  fit_resamples(
    resamples = vfold_cv(___, v = ___),
    metrics = metric_set(mae)
  )

# final model
final_model <- model_wf %>%
  fit(data = ___)
```

Model Evaluation

Once we have the model, we need to evaluate:

  1. Is the model wrong?
    1. e.g., a linear fit on quadratic data
    2. Look at residual plot
  2. Is the model strong?
    1. How well does our model explain the variability in response?
    2. Look at \(R^2\), adjusted \(R^2\)
  3. Does the model produce accurate predictions?
    1. Look at residual plot
    2. MSE, RMSE, MAE
  4. Is the model fair?
    1. Who collected/funded the data?
    2. How/why did they collect it?
    3. Implications of analysis
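A quick sketch of these checks in R, assuming final_model is a fitted workflow (as built above) and sample_data is a hypothetical dataset containing the outcome y:

```{r}
#| eval: false

library(tidymodels)

# add predictions and residuals to the data
eval_data <- final_model %>%
  augment(new_data = sample_data) %>%
  mutate(residual = y - .pred)

# Is the model wrong? Residuals vs predictions should show no pattern
ggplot(eval_data, aes(x = .pred, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Accurate predictions? In-sample MAE (prefer CV estimates; see below)
eval_data %>%
  mae(truth = y, estimate = .pred)
```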

Overfitting

Adding more predictors can result in an overfit model. Relying on in-sample metrics can also lead to overfitting.

  • In-sample metrics only tell you how the model performs on that dataset used to make the model.
  • Biased!
  • Overly optimistic!

Prevention

  • Training vs testing groups
  • k-Fold Cross Validation

Cross-Validation

Algorithm:

  • Split data into \(k\) folds
  • For each fold:
    • Remove the fold from the data set; this fold will be the testing set
    • Use the remaining data to train the model
    • Use the trained model to predict on the testing fold and calculate the error
  • Average all error estimates from each fold

This is a more accurate value for what you expect your model to be off by when tested on new data.

Common values of \(k\): 7, 10

Note: \(k=n\) is also known as Leave One Out Cross Validation (LOOCV)
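To make the algorithm concrete, here is a hand-rolled sketch of \(k\)-fold CV for a linear model; in practice, use fit_resamples() as above. sample_data with outcome y is hypothetical:

```{r}
#| eval: false

set.seed(253)
k <- 10
# randomly assign each case to one of k folds
folds <- sample(rep(1:k, length.out = nrow(sample_data)))

fold_mae <- sapply(1:k, function(i) {
  test  <- sample_data[folds == i, ]   # remove fold i: the testing set
  train <- sample_data[folds != i, ]   # train on the rest
  fit   <- lm(y ~ ., data = train)
  mean(abs(test$y - predict(fit, newdata = test)))  # MAE on the held-out fold
})

mean(fold_mae)  # CV estimate of the error expected on new data
```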

Model Selection

How can we decide what predictors to include in our model?

Methods:

  1. Variable selection: identify a subset of predictors
    • Best subset selection
    • Backward stepwise selection
    • Forward stepwise selection
  2. Shrinkage / regularization: shrink/regularize the coefficients of all predictors towards 0
    • LASSO, ridge regression, elastic net
  3. Dimension reduction: combine predictors into a smaller set of new predictors
    • Principal components regression

Variable Selection

Algorithms:

  • Best Subset Selection: build all \(2^p\) models that use any combination of the available predictors and identify the best model based on a chosen metric.
    • Computationally expensive
  • Backward Stepwise Selection: build a model with all \(p\) predictors. Remove the predictor with the largest p-value and refit with the remaining predictors. Repeat until you have \(p\) models, then identify the best according to some metric.
    • Greedy
    • Overestimates the significance of the remaining predictors
  • Forward Stepwise Selection: start with no predictors; add the predictor that most improves the model and repeat until all \(p\) predictors are included, then identify the best according to some metric. Often performs better than best subset selection (searching all \(2^p\) models invites overfitting) but worse than backward stepwise selection.
    • Computationally cheaper than best subset selection
    • Can produce odd combinations of predictors
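The tidymodels code in these notes doesn't do stepwise selection directly; one common option is the leaps package. A sketch, assuming a hypothetical sample_data with outcome y:

```{r}
#| eval: false

library(leaps)

# method = "exhaustive" (best subset), "backward", or "forward"
subset_results <- regsubsets(y ~ ., data = sample_data,
                             method = "backward", nvmax = 10)

# one BIC value per model size; smaller is better
summary(subset_results)$bic

# which predictors appear in the best model of each size
summary(subset_results)$which
```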

LASSO: Shrinkage / Regularization

Least squares with a penalty on coefficient size, so predictors that add complexity without improving fit are penalized.

Pros:

  • Helps with model selection and prevents overfitting
  • Isn’t greedy and doesn’t choose variables based on p-values

Note: predictors must be scaled first (the glmnet engine does this for us).

Tuning:

  • \(\lambda = 0\), LASSO = Least Squares
  • As \(\lambda\) increases, coefficients shrink towards zero and some become exactly zero (those variables drop out)

```{r}
#| message: false
#| warning: false
#| eval: false

# model specification
model_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine(engine = "glmnet") %>%
  set_args(mixture = 1, penalty = tune())

# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
  step_dummy(all_nominal_predictors())  # normalization done by glmnet

# model workflow
model_wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(model_spec)

# cross validation
set.seed(___)
multiple_models <- model_wf %>%
  tune_grid(
    grid = grid_regular(
      penalty(range = c(___, ___)),  # range is on log10 scale     
      levels = ___  # how many models to build
    ),
    resamples = vfold_cv(___, v = ___),
    metrics = metric_set(mae)
)

# model selection
multiple_models %>%
  collect_metrics()

autoplot(multiple_models)

best_param <- multiple_models %>%
  select_best(metric = "mae")

parsimonious_param <- multiple_models %>%   
  select_by_one_std_err(metric = "mae", 
                        desc(penalty))

# final model
final_model <- model_wf %>%
  finalize_workflow(parameters = parsimonious_param) %>%
  fit(data = ___)
```

Flexible Models

What if the relationship between \(x\) and \(y\) is more complicated? Use flexible models!

Parametric vs Nonparametric Models

  • Parametric: assumes a specific form for the relationship (e.g., linear) across the whole domain of the function
  • Nonparametric: makes fewer assumptions, allowing more flexibility in the relationship between \(x\) and \(y\)

Nonparametric Models

Need to consider the scale of variables when calculating distances between observations with more than one predictor (NORMALIZE!)

  • Normalization is an example of pre-processing; decisions there affect models and predictions
  • Might need to preprocess data in the variable recipe

KNN Regression and the Bias-Variance Tradeoff

  • Nonparametric
  • Assumption: Similar \(x\) values imply similar \(y\) values.
  • from the kknn package: KNN predicts \(y\) from a weighted average
    • more influence is given to closer neighbors

Algorithm:

  • Identify \(K\) nearest neighbors of \(x\) with respect to Euclidean distance
  • Observe the \(y\) values of these neighbors
  • \(f(x) = \text{average}(y \text{ neighbors})\)
```{r}
#| message: false
#| warning: false
#| eval: false

# model specification
model_spec <- nearest_neighbor() %>%
  set_mode("regression") %>%
  set_engine(engine = "kknn") %>%
  set_args(neighbors = tune())

# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# model workflow
model_wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(model_spec)

# cross validation
set.seed(___)
knn_models <- model_wf %>%
  tune_grid(
    grid = grid_regular(
      neighbors(range = c(1, ___)),     
      levels = ___  # how many models to build
    ),
    resamples = vfold_cv(___, v = ___),
    metrics = metric_set(mae)
)

# model selection
knn_models %>%
  collect_metrics()

autoplot(knn_models)

best_knn <- select_best(knn_models, metric = "mae")  # parsimony is irrelevant

# final model
final_model <- model_wf %>%
  finalize_workflow(parameters = best_knn) %>%
  fit(data = ___)
```

Bias-Variance Tradeoff

  • Bias: how far off are the model's predictions from the truth, on average?
    • Extremes: \(K=1\) is overfit (no bias); \(K=n\) predicts roughly the overall average everywhere (high bias; nearly flat, smoothed only by the weights)
  • Variance: how much does the model's shape change from dataset to dataset?
    • Overfit model: passes through each point in the data, high variance

LOESS & Splines

Splines

  • Fit polynomial models in small localized regions and make the boundaries smooth.
  • Can only spline one quantitative predictor
  • Natural vs basis splines
    • Natural: pass a deg_free argument, where deg_free = # of knots + 1
      • Variant of B-splines with an additional constraint at the boundaries to reduce variance
      • Knots typically placed at quantiles of the predictor
    • Basis: specify the knot locations yourself with options = list(knots = c(x_1, x_2))
      • You choose the knots
      • Functions can be unstable at the boundaries
```{r}
#| message: false
#| warning: false
#| eval: false

# model specification
model_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine(engine = "lm")

# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
  step_ns(x, deg_free = ___)  # natural splines, deg_free = # knots + 1
  # OR basis splines with specific knot locations:
  # step_bs(x, options = list(knots = c(___, ___)))

# model workflow
model_wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(model_spec)

# cross validation
multiple_models <- model_wf %>%
  fit_resamples(
    resamples = vfold_cv(___, v = ___),
    metrics = metric_set(mae)
  )

# final model
final_model <- model_wf %>%
  fit(data = ___)
```

LOESS

  • Fit regression models in small, localized regions, where nearby data have greater influence than far data
  • Can only LOESS one quantitative predictor
  • Nonparametric
  • Algorithm:
    • Define the span \(h \in (0,1]\). Then, to estimate \(f(x)\) at each possible predictor value \(x\) (see the sketch after this list):
      • Identify a neighborhood consisting of the \(100\times h\%\) of cases that are closest to \(x\)
      • Putting more weight on the closer neighbors, fit a linear model in this neighborhood
      • Use the local linear model to estimate \(f(x)\)
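These notes have no tidymodels chunk for LOESS; a minimal base-R sketch, assuming sample_data has one quantitative predictor x and outcome y:

```{r}
#| eval: false

# loess() lives in base R's stats package; span = h controls the neighborhood
loess_fit <- loess(y ~ x, data = sample_data, span = 0.5)

# or visualize the fit directly with ggplot2
library(ggplot2)
ggplot(sample_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.5, se = FALSE)
```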

Classification

Fitting a model to a categorical outcome.

\[y = f(x) + \epsilon\]

Logistic Regression

  • Classification model for a binary categorical \(y\)
  • Coefficients are interpreted on the log(odds) scale

\[\text{odds} = \frac{p}{1-p},\quad p = \frac{\text{odds}}{\text{odds} + 1}\]

| Scale     | Impossible  | 50/50 | Certain    |
|-----------|-------------|-------|------------|
| p         | 0           | 0.5   | 1          |
| odds      | 0           | 1     | \(\infty\) |
| log(odds) | \(-\infty\) | 0     | \(\infty\) |
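A quick numeric check of these conversions (the value of p is illustrative):

```{r}
#| eval: false

p <- 0.75
odds <- p / (1 - p)   # 3
log(odds)             # log(odds), about 1.1
odds / (odds + 1)     # back to p = 0.75
```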

If we want to make more concrete predictions, we use the probability to make a classification through a classification rule. Can be based on:

  • Predictor value
  • Probability
  • log(odds)

Visually, the classification rule partitions the data. If the classes overlap, our rule might not always be correct. We can create an in-sample confusion matrix from our classification rule to calculate the following:

|                 | Truth: No | Truth: Yes |
|-----------------|-----------|------------|
| Prediction: No  | TN        | FN         |
| Prediction: Yes | FP        | TP         |

Sensitivity (TP rate): \(\frac{TP}{TP + FN}\)

Specificity (TN rate): \(\frac{TN}{TN + FP}\)
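A sketch of computing the confusion matrix from cross-validated predictions; this assumes save_pred = TRUE in the resampling control, as in the code further below:

```{r}
#| eval: false

library(tidymodels)

# confusion matrix from CV predictions; y and .pred_class come from the resamples
multiple_models %>%
  collect_predictions() %>%
  conf_mat(truth = y, estimate = .pred_class)
```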

If you want to improve sensitivity, lower the classification threshold. Note that sensitivity and specificity trade off: increasing one typically decreases the other.

Tuning the classification rule: context

  • Consequences of misclassification
  • Prioritize high sensitivity to avoid FN / increase TP
  • Prioritize high specificity to avoid FP / increase TN

Receiver Operating Characteristic (ROC) Curve:

  • Sensitivity vs 1 - Specificity (FPR)

Area under the curve (AUC):

  • Probability that a randomly chosen case with \(y=1\) is ranked above a randomly chosen case with \(y=0\)
  • Measures the overall effectiveness of the classification model
  • The closer to 1 the better (0.5 corresponds to random guessing, where sensitivity = 1 - specificity)

```{r}
#| message: false
#| warning: false
#| eval: false

# Ensure outcome is categorical
sample_data <- sample_data %>% 
  mutate(y = as.factor(y))

# relevel outcome if you want to look at different category
sample_data <- sample_data %>%
  mutate(y = relevel(y, ref = "CATEGORY NOT INTERESTED IN"))

# Create logistic regression classification model specification 
model_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine(engine = "glm")

# Make data recipe
data_rec <- recipe(y ~ x, data = sample_data)

# Create model workflow
model_wf <- workflow() %>%
  add_recipe(data_rec) %>%
  add_model(model_spec)

# cross-validate with 10 folds
multiple_models <- model_wf %>%
  fit_resamples(
    resamples = vfold_cv(sample_data, v = 10),
    control = control_resamples(save_pred = TRUE, event_level = "second"),
    metrics = metric_set(accuracy)  # could also specify sensitivity, specificity, roc_auc
  )

# finalize model
final_model <- model_wf %>%
  fit(data = sample_data)
```
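A sketch of the ROC curve and AUC from the CV predictions above; .pred_CATEGORY is a placeholder for the predicted-probability column of the class of interest:

```{r}
#| eval: false

cv_preds <- multiple_models %>%
  collect_predictions()

# ROC curve: sensitivity vs 1 - specificity across all thresholds
cv_preds %>%
  roc_curve(truth = y, .pred_CATEGORY, event_level = "second") %>%
  autoplot()

# area under the ROC curve
cv_preds %>%
  roc_auc(truth = y, .pred_CATEGORY, event_level = "second")
```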

KNN & Trees

Pro:

  • Non-parametric classification
  • Flexibility
  • No assumptions on shape of relationship
  • Good with complicated relationships

Con:

  • Lack of insights
  • Cannot exploit simple relationships that do exist (e.g., linearity)
  • Computationally intensive

KNN

Goal: build classification model of categorical response variable

Algorithm:

  • ID nearest neighbors w.r.t. \(x\) values
  • Get \(y\) value of neighbors
  • Predict the most common neighbor value

Pro:

  • Flexible
  • Intuitive
  • More than 2 outcome categories

Con:

  • Lazy learner; must calculate distances at time of prediction for each point we classify
  • Computationally intensive
  • Provides classifications, but no real sense of relationships
```{r}
#| message: false
#| warning: false
#| eval: false

# Ensure outcome is categorical
sample_data <- sample_data %>% 
  mutate(y = as.factor(y))

# Create KNN classification model specification 
model_spec <- nearest_neighbor() %>%
  set_mode("classification") %>%
  set_engine(engine = "kknn") %>%
  set_args(neighbors = tune())

# Make data recipe, convert all predictors to numbers and normalize
data_rec <- recipe(y ~ x, data = sample_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Create model workflow
model_wf <- workflow() %>%
  add_recipe(data_rec) %>%
  add_model(model_spec)

# Cross-validate
multiple_models <- model_wf %>%
  tune_grid(
    grid = grid_regular(
      neighbors(range = c(1, 7)),
      levels = 10
    ),
    resamples = vfold_cv(sample_data, 10),
    metrics = metric_set(accuracy)
  )

# plot the models
multiple_models %>%
  autoplot()

# select best parameter
# NOTE: parsimony is irrelevant
best_param <- multiple_models %>%
  select_best(metric = "accuracy")

# finalize model
final_model <- model_wf %>%
  finalize_workflow(best_param) %>%
  fit(data = sample_data)
```

Classification Trees

Predict outcome from a series of yes/no questions.

Recursive binary splitting algorithm:

  1. Start with all the data
  2. Create the best binary split by:
    • Minimizing the Gini index \(G = p_1(1 - p_1) + p_2(1 - p_2) + \dots\) (see the sketch after this list)
    • i.e., maximizing node homogeneity
  3. Repeat step 2 until a stopping rule is reached; these pruning criteria are the tuning parameters:
    • Max depth (# of splits from root to leaf)
    • Min sample size in each node
    • Cost complexity \(\alpha\) (like \(\lambda\) for LASSO): \(\alpha = 0\) means no penalty; \(\alpha \gg 0\) means only make a split if it is very useful
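A tiny sketch of the Gini index as a function of the class proportions within a node:

```{r}
#| eval: false

# Gini index of a node, given a vector of class proportions
gini <- function(p) sum(p * (1 - p))

gini(c(0.9, 0.1))  # 0.18: a fairly pure node
gini(c(0.5, 0.5))  # 0.50: the most impure a two-class node can be
```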

Pro:

  • More insight into relationship
  • Eager learners: the fitted tree can classify any new point without revisiting the training data

Con:

  • Greedy
  • Computationally intensive

Note: KNN and trees make equivalent (degenerate) predictions when \(\alpha = \infty\) and \(K = n\): both essentially predict the overall majority class.

```{r}
#| message: false
#| warning: false
#| eval: false

# Ensure outcome is categorical
sample_data <- sample_data %>% 
  mutate(y = as.factor(y))

# Create classification tree classification model specification 
model_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine(engine = "rpart") %>%
  set_args(min_n = 2, tree_depth = 30, cost_complexity = tune())

# Make data recipe
# NOTE: preprocessing not absolutely necessary
data_rec <- recipe(y ~ x, data = sample_data)

# Create model workflow
model_wf <- workflow() %>%
  add_recipe(data_rec) %>%
  add_model(model_spec)

# Cross-validate
multiple_models <- model_wf %>%
  tune_grid(
    grid = grid_regular(
      cost_complexity(range = c(-5, 2)),  # log10 scale
      levels = 10
    ),
    resamples = vfold_cv(sample_data, 10),
    metrics = metric_set(accuracy)
  )

# plot the models
multiple_models %>%
  autoplot()

# select best parameter
best_param <- multiple_models %>%
  select_best(metric = "accuracy")

# select parsimonious parameter
parsimonious_param <- multiple_models %>%
  select_by_one_std_err(metric = "accuracy", desc(cost_complexity))

# finalize model
final_model <- model_wf %>%
  finalize_workflow(parsimonious_param) %>%
  fit(data = sample_data)
```

Bagging and Random Forests

Idea:

  1. Build a bunch of trees with high variance and low bias (unpruned)
  2. Combine their predictions, rather than relying on any single tree

Both bagging and random forests are examples of ensemble methods: combining outputs of multiple ML algorithms

  • In general: low variability; more stable predictions

Bagging

Bootstrap aggregation (bagging):

  • Each tree is created from a resampling with replacement
  • ~2/3 of original data will show up in resample, ~1/3 will not
  • All \(p\) predictors are considered for each split
  • Use resample to train, and remaining to test (out of bag error)

Forest = many unpruned trees fit to slightly different datasets (bootstrapping/resampling)

Overall prediction is from the combined prediction across all trees

  • Classification: Majority rules
  • Regression: Averages
```{r}
#| message: false
#| warning: false
#| eval: false

# Ensure outcome is categorical
sample_data <- sample_data %>% 
  mutate(y = as.factor(y))

# Create bagging classification model specification 
model_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine(engine = "ranger") %>%
  set_args(
    mtry = ___,  # bagging considers all p predictors at each split, so set mtry = p
    min_n = 2, 
    trees = 500, 
    probability = FALSE, 
    importance = "impurity")

# Make data recipe, preprocessing steps aren't absolutely necessary
data_rec <- recipe(y ~ x, data = sample_data)

# Create model workflow
model_wf <- workflow() %>%
  add_recipe(data_rec) %>%
  add_model(model_spec)

# NOTE: no tuning needed - OOB estimation instead

# finalize model
final_model <- model_wf %>%
  fit(data = sample_data)
```
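A sketch of pulling the OOB error out of the fitted forest; extract_fit_engine() returns the underlying ranger object, whose prediction.error element is the OOB estimate:

```{r}
#| eval: false

# out-of-bag (OOB) error estimate from the underlying ranger fit
final_model %>%
  extract_fit_engine() %>%
  pluck("prediction.error")
```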

Random Forest

At each split in the tree, randomly select and consider only a subset of predictors

  • Typical values are \(p/2\) and \(\sqrt{p}\)

Pros:

  • Low bias, low variance (maintains the low bias of trees, but reduces variance by averaging)
  • Less computationally intensive than bagging (only a subset of predictors is considered at each split)
  • Less greedy (surrogate predictors get a chance to shine)
  • Decorrelates the trees (via the random subset of predictors)

Cons:

  • Computationally expensive relative to a single tree
  • Can’t plot results, lose interpretability

Note: A forest is the best algorithm for reducing variance.

```{r}
#| eval: false

# The only change from the bagging code is the model specification
# Create random forest classification model specification 
model_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine(engine = "ranger") %>%
  set_args(
    mtry = ___, # number of predictors to consider at each split
    min_n = 2, 
    trees = 500, 
    probability = FALSE, 
    importance = "impurity")
```
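A sketch of examining variable importance from the fitted forest (possible because importance = "impurity" was set):

```{r}
#| eval: false

# impurity-based variable importance from the underlying ranger fit
final_model %>%
  extract_fit_engine() %>%
  ranger::importance() %>%
  sort(decreasing = TRUE)
```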

Unsupervised Learning

A set of data with no outcome

Instead of predicting, focus on

  • structure of the data
  • jumping off point for further investigations

Clustering

Goal: Identify similarities amongst rows, with respect to variables (features).

In general, clustering is a DISCOVERY PROCESS!

Hierarchical Clustering

Algorithm:

  • Calculate the Euclidean distance between all pairs of cases
  • Group (cluster) together the 2 closest points/clusters according to the chosen linkage type
  • Repeat with the new clusters until you end with 1 cluster

Linkage Types:

  • Complete: Max distance between any pair of cases
  • Single: min distance between any pair of cases
  • Average: average distance across all pairs of cases
  • Centroid: distance between cluster centroids

This creates a dendrogram, which is NOT a classification tree (we start from the leaves and end at the root).

General properties of dendrograms:

  • The more similar two clusters are, the sooner they fuse
  • Horizontal distance between leaves is meaningless
  • Height at which two clusters fuse = the distance between them

No “correct” way to cut a dendrogram

  • Higher cut, fewer clusters, easier to understand, but maybe too simple and not very meaningful.

Drawbacks:

  • Computationally expensive: need to calculate distances between each pair of points
  • Greedy: best local decision of which cluster might not be best split/most meaningful cluster

Code:

```{r}
#| eval: false

# Load packages
library(tidyverse)
library(cluster)      # to build the hierarchical clustering algorithm
library(factoextra)   # to draw the dendrograms

# convert column to rowname
sample_data <- sample_data %>%
  column_to_rownames("id")

# run clustering algorithm
# use "complete", "single", "average", or "centroid" linkage method
hier_model <- hclust(dist(scale(sample_data)), method = ___)

# visualizing the clustering: heat map (not clustering)
heatmap(scale(data.matrix(sample_data)), Colv = NA, Rowv = NA)

# visualizing the clustering: heat map (ordered by dendrogram/clustering)
heatmap(scale(data.matrix(sample_data)), Colv = NA)

# Dendrogram (change font size with cex)
fviz_dend(hier_model, cex = 1)
fviz_dend(hier_model, horiz = TRUE, cex = 1)  # plot horizontally to read longer labels

# Visualizing clusters in a dendrogram
# typically want to store in a new dataset so the cluster assignment isn't written into features used later
cluster_data <- sample_data %>%
  mutate(hier_cluster_k = as.factor(cutree(hier_model, k = ___)))

# visualize the clusters on the dendrogram
fviz_dend(hier_model, k = ___, cex = 1)
fviz_dend(hier_model, k = ___, horiz = TRUE, cex = 1)  # plot horizontally to read longer labels
```

K-Means Clustering

Goal: Split cases into \(k\) non-overlapping, homogeneous clusters

Homogeneous: the within-cluster variance \(W(c_k)\) of each cluster (\(c_1,\dots, c_k\)), i.e. the average squared distance between all pairs of cases in the cluster, is small

Algorithm:

  1. Pick \(k\) - from prior knowledge
  2. Initialize locations for \(k\) centroids, assign each case to nearest centroid
  3. Calculate new centroids (the mean of the cases in each cluster)
  4. Reassign each case to the cluster with the nearest centroid
  5. Repeat 3 and 4 until the assignments stabilize

Code:

```{r}
#| eval: false

# Load packages
library(tidyverse)
library(cluster)      # to build the hierarchical clustering algorithm
library(factoextra)   # to draw the dendrograms

# process the data
sample_data <- sample_data %>% 
  column_to_rownames("id")

# K-means can't handle NA: 2 options
# Omit missing cases (this can be bad if there are a lot of missing points!)
sample_data <- na.omit(sample_data)

# Impute the missing cases
library(VIM)
sample_data <- sample_data %>%
  VIM::kNN(imp_var = FALSE)

# Turn categorical features into dummy variables
# ONLY NEEDED IF YOU HAVE CATEGORICAL FEATURES (skip if all features are quantitative or logical)
sample_data <- data.frame(model.matrix(~ . - 1, sample_data))

# Run K-means
set.seed(___)
kmeans_model <- kmeans(scale(sample_data), centers = ___)

# Get the "the total within- cluster sum of squares"
# This is the total squared distance of each case from its assigned centroid
kmeans_model$tot.withinss

# Tuning K-means
tibble(K = 1:___) %>%
  mutate(SS = map(K, ~ kmeans(scale(sample_data), centers = .x)$tot.withinss)) %>% 
  unnest(cols = c(SS))

# Assign each sample case to a cluster
# We typically want to store this in a new dataset so that the cluster assignments aren't 
# accidentally used as features in a later analysis!
cluster_data <- sample_data %>% 
  mutate(kmeans_cluster_k = kmeans_model$cluster)
```

Dimension Reduction

Goal: Find similarities amongst columns by collapsing/combining features

Motivation: What if we have large data (especially \(p>n\))? We might want to

  • identify structure and patterns among features
  • conserve computational resources
  • facilitate regression and classification by eliminating multicollinear predictors

Combining correlated predictors:

  • Idea 1: kick one out; but we lose the variability/information in that predictor
  • Idea 2: find the direction that gives the best picture of the data (preserves the most variability, i.e. the most information)

Idea 2:

  • Principal components are linear combinations of original features where loadings refer to the coefficients.
  • PC1 contains the most variation.
  • We want to retain high variability and reduce correlation

Principal Component Analysis

Goal: Turn original set of predictors into \(k<p\) uncorrelated principal components (PCs) preserving the majority of information (variability)

Trade offs (relative to kicking out columns):

  • PCA preserves more information, but the new features lose their original meaning and can be difficult to interpret

Details:

  • Score plots: examine relationships among PCs
  • Loadings: coefficients of linear combinations
  • Loading plots: understand how PCs are constructed, thus how to interpret PC
```{r}
#| eval: false

# Load packages
library(tidyverse)

# process the data
sample_data <- sample_data %>% 
  column_to_rownames("id")

# PCA can't handle NAs
# Omit missing cases (this can be bad if there are a lot of missing points!)
sample_data <- na.omit(sample_data)

# Impute the missing cases
library(VIM)
sample_data <- sample_data %>%
  VIM::kNN(imp_var = FALSE)

# Turn categorical features into dummy variables
# ONLY NEEDED IF YOU HAVE CATEGORICAL FEATURES (skip if all features are quantitative or logical)
sample_data <- data.frame(model.matrix(~ . - 1, sample_data))

# Run PCA
# scale = TRUE, center = TRUE first standardizes the features
pca_results <- prcomp(sample_data, scale = TRUE, center = TRUE)

# ---------------------------------------------------------
# PC visualizations
# Get the loadings which define the PCs
pca_results %>% 
  pluck("rotation")

# Plot loadings for first "k" PCs (you pick k)
library(reshape2)
melt(pca_results$rotation[, 1:k]) %>% 
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ Var2) + 
    labs(y = "loadings", x = "original features", fill = "original features")

# Plot loadings for just the first PC
melt(pca_results$rotation) %>%
  filter(Var2 == "PC1") %>% 
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
    geom_bar(stat = "identity") +
    labs(y = "loadings", x = "original features", fill = "original features")

# Loadings plot for first 2 PCs
library(factoextra)
fviz_pca_var(pca_results, repel = TRUE)

# ---------------------------------------------------------
# Examine amount of variability captured by each PC
# Load package for tidy table
library(tidymodels)

# Numerical summaries: Measure information captured by each PC
pca_results %>% 
  tidy(matrix = "eigenvalues")

# Graphical summary 1: SCREE PLOT
# Plot % of variance explained by each PC
pca_results %>% 
  tidy(matrix = "eigenvalues") %>% 
  ggplot(aes(y = percent, x = PC)) + 
    geom_point(size = 2) + 
    geom_line() + 
    labs(y = "% of variance explained")

# Graphical summary 2: Plot cumulative % of variance explained by each PC
pca_results %>% 
  tidy(matrix = "eigenvalues") %>% 
  rbind(0) %>% 
  ggplot(aes(y = cumulative, x = PC)) + 
    geom_point(size = 2) + 
    geom_line() + 
    labs(y = "CUMULATIVE % of variance explained")

# ---------------------------------------------------------
# Examine scores (i.e. PC coordinates for data points)
# Numerical summary: check out the scores
pca_results %>% 
  pluck("x")

# Graphical summary: Score plot
# Plot PC1 scores (x-axis) vs PC2 scores (y-axis) of all data points
fviz_pca_ind(pca_results, repel = TRUE)
```

Principal Component Regression

Idea: Combining dimension reduction (PCA) and regression, given some \(y\) and \(p>n\)

Using PCR:

  1. Ignore \(y\) for now, do PCA
  2. Keep the first \(k\) PCs that retain sufficient information from the original predictors
  3. Model \(y\) by those first PCs

Note:

  • PCA might not produce strongest possible predictors of \(y\)
  • Partial least squares provides alternative

Code:

```{r}
#| eval: false

library(tidymodels)
library(tidyverse)

# STEP 1: specify linear regression model
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: variable recipe
pcr_recipe <- recipe(y ~ ., data = sample_data) %>%
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

# STEP 3: workflow
pcr_workflow <- workflow() %>%
  add_recipe(pcr_recipe) %>%
  add_model(lm_spec)

# STEP 4: Estimate multiple PCR models with varying numbers of PCs
# Largest range is p
# Put same number in for levels
set.seed(___)
pcr_models <- pcr_workflow %>%
  tune_grid(
    grid = grid_regular(num_comp(range = c(1, ___)), levels = ___),
    resamples = vfold_cv(sample_data, v = 10),
    metrics = metric_set(mae)
  )

# explore models
pcr_models %>%
  autoplot()

best_param <- pcr_models %>%
  select_best(metric = "mae")

parsimonious_param <- pcr_models %>%
  select_by_one_std_err(metric = "mae",
                        num_comp)  # simplest model: fewest PCs within one SE

final_model <- pcr_workflow %>% 
  finalize_workflow(___) %>% # your chosen parameter
  fit(data = ___)
```