Machine Learning Notes - Adapted from Macalester’s STAT 253 SP25 Course by Brianna Heggeseth
Introduction
Machine Learning: using algorithms that can learn from existing patterns and make predictions
Supervised vs Unsupervised Learning
Supervised Learning
- model relationship between input \(x = (x_1, x_2, \cdots, x_p)\) and output \(y\).
\[\begin{align*}y &= f(x) + \epsilon \\ &= \text{ trend in relationship } + \text{ residual deviation from trend }\end{align*}\]
- Types:
- Regression (\(y\) is quantitative)
- Classification (\(y\) is categorical)
Unsupervised Learning
- No output variable
- Goal is to use \(x\) to understand/modify the structure of the data with respect to \(x\)
- Types:
- Clustering (group cases with similar \(x\)'s)
- Dimension Reduction (turn the original set of \(p\) input variables into a set of \(k<p\) new variables)
Regression
We are trying to build a model \(f(x)\) of some quantitative output.
\[y = f(x) + \epsilon\]
```{r}
#| message: false
#| warning: false
#| eval: false
library(tidymodels)

# model specification
model_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine(engine = "lm")
# model recipe
data_recipe <- recipe(y ~ ., data = ___)
# model workflow
model_wf <- workflow() %>%
add_recipe(data_recipe) %>%
add_model(model_spec)
# cross validation
multiple_models <- model_wf %>%
fit_resamples(
resamples = vfold_cv(___, v = ___),
metrics = metric_set(mae)
)
# final model
final_model <- model_wf %>%
fit(data = ___)
```
Model Evaluation
Once we have the model, we need to evaluate:
- Is the model wrong?
- e.g., a linear fit on quadratic data
- Look at residual plot
- Is the model strong?
- How well does our model explain the variability in response?
- Look at \(R^2\), adjusted \(R^2\)
- Does the model produce accurate predictions?
- Look at residual plot
- MSE, RMSE, MAE (defined below)
- Is the model fair?
- Who collected/funded the data?
- How/why did they collect it?
- Implications of analysis
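For reference, the prediction error metrics above are:
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad \text{RMSE} = \sqrt{\text{MSE}}, \quad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]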
Overfitting
Adding more predictors can result in an overfit model, and relying on in-sample metrics can encourage overfitting.
- In-sample metrics only tell you how the model performs on that dataset used to make the model.
- Biased!
- Overly optimistic!
Prevention
- Training vs testing groups
- k-Fold Cross Validation
Cross-Validation
Algorithm:
- Split data into \(k\) folds
- For each fold:
- Remove the fold from the data set, this will be the testing set
- Use the rest of the data to train the model
- Calculate the prediction error of the trained model on the held-out testing fold
- Average all error estimates from each fold
This is a more accurate value for what you expect your model to be off by when tested on new data.
Common values of \(k\): 7, 10
Note: \(k=n\) is also known as Leave One Out Cross Validation (LOOCV)
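A minimal sketch of inspecting the cross-validated error, assuming the multiple_models object produced by fit_resamples() in the regression code above:
```{r}
#| eval: false
# average MAE across all folds
multiple_models %>%
collect_metrics()
# MAE estimate from each individual fold
multiple_models %>%
collect_metrics(summarize = FALSE)
```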
Model Selection
How can we decide what predictors to include in our model?
Methods:
- Variable selection: identify a subset of predictors
- Best subset selection
- Backward stepwise selection
- Forward stepwise selection
- Shrinkage / regularization: shrink/regularize the coefficients of all predictors towards 0
- LASSO, ridge regression, elastic net
- Dimension reduction: combine predictors into a smaller set of new predictors
- Principal components regression
Model Selection
Greedy algorithms:
- Best Subset Selection: build all \(2^p\) models that use any combination of the available predictors and identify the best model based on a chosen metric
- Computationally expensive
- Backward Stepwise Selection: build a model with all \(p\) predictors, remove the predictor with the largest p-value, and refit with the remaining predictors. Repeat until you are down to 1 predictor, giving \(p\) models; identify the best according to some metric.
- Greedy
- Overestimates the significance of the remaining predictors
- Forward Stepwise Selection: start with no predictors and, at each step, add the predictor that most improves the model. Computationally cheaper than best subset selection, but tends to perform worse than backward stepwise selection.
- Still somewhat computationally expensive
- Can produce odd combinations of predictors
LASSO: Shrinkage / Regularization
Least squares with a penalty on coefficient size, which discourages unnecessary complexity.
Pros:
- Helps with model selection and prevents overfitting
- Isn't greedy and doesn't choose variables based on p-values
Note:
- Predictors must be scaled (the glmnet engine does this for us)
Tuning:
- \(\lambda = 0\): LASSO = least squares
- As \(\lambda\) increases, coefficients shrink toward zero; some become exactly zero and those variables drop out
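Concretely, for a given penalty \(\lambda \ge 0\), LASSO chooses the coefficients that minimize the penalized least squares criterion:
\[\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\]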
```{r}
#| message: false
#| warning: false
#| eval: false
# model specification
model_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine(engine = "glmnet") %>%
set_args(mixture = 1, penalty = tune())
# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
step_dummy(all_nominal_predictors()) # normalization done by glmnet
# model workflow
model_wf <- workflow() %>%
add_recipe(data_recipe) %>%
add_model(model_spec)
# cross validation
set.seed(___)
multiple_models <- model_wf %>%
tune_grid(
grid = grid_regular(
penalty(range = c(___, ___)), # range is on log10 scale
levels = ___ # how many models to build
),
resamples = vfold_cv(___, v = ___),
metrics = metric_set(mae)
)
# model selection
multiple_models %>%
collect_metrics()
autoplot(multiple_models)
best_param <- multiple_models %>%
select_best(metric = "mae")
parsimonious_param <- multiple_models %>%
select_by_one_std_err(metric = "mae",
desc(penalty))
# final model
final_model <- model_wf %>%
finalize_workflow(parameters = parsimonious_param) %>%
fit(data = ___)
```
Flexible Models
What if the relationship between \(x\) and \(y\) is more complicated? Use flexible models!
Parametric vs Nonparametric Models
- Parametric: assumes a specific form for the relationship across the whole domain of the function
- Nonparametric: makes fewer assumptions, allowing more flexibility in the relationship between \(x\) and \(y\)
Nonparametric Models
Need to consider the scale of variables when calculating distances between observations with more than one predictor (NORMALIZE!)
- Normalization is an example of pre-processing; decisions here affect models and predictions
- Might need to preprocess data in variable recipe
KNN Regression and the Bias-Variance Tradeoff
- Nonparametric
- Assumption: Similar \(x\) values imply similar \(y\) values.
- From the kknn package: KNN predicts \(y\) from a weighted average of the neighbors' \(y\) values (see the formula after the algorithm below)
- Closer neighbors get more influence
Algorithm:
- Identify \(K\) nearest neighbors of \(x\) with respect to Euclidean distance
- Observe the \(y\) values of these neighbors
- \(f(x) = \text{average}(y \text{ neighbors})\)
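In symbols, a distance-weighted version (roughly what kknn does; the exact weighting kernel depends on its settings) is:
\[\hat{f}(x) = \frac{\sum_{i \in N_K(x)} w_i\, y_i}{\sum_{i \in N_K(x)} w_i}, \quad w_i \text{ decreasing in the distance between } x \text{ and } x_i\]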
```{r}
#| message: false
#| warning: false
#| eval: false
# model specification
model_spec <- nearest_neighbor() %>%
set_mode("regression") %>%
set_engine(engine = "kknn") %>%
set_args(neighbors = tune())
# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
# model workflow
model_wf <- workflow() %>%
add_recipe(data_recipe) %>%
add_model(model_spec)
# cross validation
set.seed(___)
knn_models <- model_wf %>%
tune_grid(
grid = grid_regular(
neighbors(range = c(1, ___)),
levels = ___ # how many models to build
),
resamples = vfold_cv(___, v = ___),
metrics = metric_set(mae)
)
# model selection
knn_models %>%
collect_metrics()
autoplot(knn_models)
best_knn <- select_best(knn_models, metric = "mae") # parsimony is irrelevant
# final model
final_model <- model_wf %>%
finalize_workflow(parameters = best_knn) %>%
fit(data = ___)
```
Bias-Variance Tradeoff
- Bias: on average, how far off are the model's predictions from the truth?
- Extremes: \(K=1\) overfits with essentially no bias; \(K=n\) would be just a flat average, smoothed somewhat by the distance weights
- Variance: how much does the fitted model change from dataset to dataset?
- Overfit model: passes through each point in the data, high variance
LOESS & Splines
Splines
- Fit polynomial models in small localized regions and make the boundaries smooth.
- Can only spline one quantitative predictor
- Natural vs basis
- Natural: pass the `deg_free` argument, where `deg_free` = # of knots + 1
- Variant of B-splines with an additional constraint at the boundaries to reduce variance
- Knots typically placed at quantiles of the predictor
- Basis: specify knot locations with `options = list(knots = c(x_1, x_2))`
- You choose the knots
- Functions can be unstable at the boundaries
```{r}
#| message: false
#| warning: false
#| eval: false
# model specification
model_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine(engine = "lm")
# model recipe
data_recipe <- recipe(y ~ ., data = ___) %>%
step_ns(x, deg_free = ___) # natural splines, deg_free = # knots + 1
# OR swap in: step_bs(x, options = list(knots = c(___, ___))) # basis splines, specific knot locations
# model workflow
model_wf <- workflow() %>%
add_recipe(data_recipe) %>%
add_model(model_spec)
# cross validation
multiple_models <- model_wf %>%
fit_resamples(
resamples = vfold_cv(___, v = ___),
metrics = metric_set(mae)
)
# final model
final_model <- model_wf %>%
fit(data = ___)
```
LOESS
- Fit regression models in small, localized regions, where nearby data have greater influence than far data
- Can only LOESS one quantitative predictor
- Nonparametric
- Algorithm:
- Define the span \(h \in (0,1]\). Perform the following to estimate \(f(x)\) at each possible predictor value \(x\):
- Identify a neighborhood consisting of the \(100\times h\%\) of cases that are closest to \(x\)
- Putting more weight on closer \(x\) neighbors, fit a linear model in this neighborhood
- Use the local linear model to estimate \(f(x)\)
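A minimal sketch of fitting and plotting a LOESS smooth, assuming a dataset sample_data with one quantitative predictor x and outcome y (hypothetical names):
```{r}
#| eval: false
library(tidyverse)
# fit LOESS with span h = 0.5 (stats::loess)
loess_fit <- loess(y ~ x, data = sample_data, span = 0.5)
# visualize the local linear smooth over the scatterplot
ggplot(sample_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "loess", span = 0.5, se = FALSE)
```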
Classification
Fitting a model to a categorical outcome.
\[y = f(x) + \epsilon\]
Logistic Regression
- Classification to model binary categorical \(y\).
- Interpret in log(odds)
\[\text{odds} = \frac{p}{1-p},\quad p = \frac{\text{odds}}{\text{odds} + 1}\]
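The fitted logistic regression model is linear on the log(odds) scale:
\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\]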
| Scale | Impossible | 50/50 | Certain |
|---|---|---|---|
| p | 0 | 0.5 | 1 |
| odds | 0 | 1 | \(\infty\) |
| log(odds) | -\(\infty\) | 0 | \(\infty\) |
If we want to make more concrete predictions, we use the probability to make a classification through a classification rule. Can be based on:
- Predictor value
- Probability
- log(odds)
Visually, the classification rule partitions the data. If the classes overlap, our rule might not always be correct. We can create an in-sample confusion matrix from our classification rule to calculate the following:
| | Truth: No | Truth: Yes |
|---|---|---|
| Prediction: No | TN | FN |
| Prediction: Yes | FP | TP |
Sensitivity (TP rate): \(\frac{TP}{TP + FN}\)
Specificity (TN rate): \(\frac{TN}{TN + FP}\)
If you want to improve sensitivity, lower the classification threshold. Note that there is a trade-off: increasing sensitivity typically decreases specificity.
Tuning the classification rule depends on context:
- Consequences of misclassification
- Prioritize high sensitivity to avoid FN / increase TP
- Prioritize high specificity to avoid FP / increase TN
Receiver Operating Characteristic (ROC) Curve:
- Sensitivity vs 1 - Specificity (FPR)
Area under the curve (AUC):
- Probability that a randomly chosen \(y=1\) case gets a higher predicted probability than a randomly chosen \(y=0\) case
- Measures overall effectiveness of the classification model
- The closer to 1 the better (0.5 corresponds to random guessing, where Sensitivity = 1 - Specificity)
```{r}
#| message: false
#| warning: false
#| eval: false
# Ensure outcome is categorical
sample_data <- sample_data %>%
mutate(y = as.factor(y))
# relevel outcome if you want to look at different category
sample_data <- sample_data %>%
mutate(y = relevel(y, ref = "CATEGORY NOT INTERESTED IN"))
# Create logistic regression classification model specification
model_spec <- logistic_reg() %>%
set_mode("classification") %>%
set_engine(engine = "glm")
# Make data recipe
data_rec <- recipe(y ~ x, data = sample_data)
# Create model workflow
model_wf <- workflow() %>%
add_recipe(data_rec) %>%
add_model(model_spec)
# cross-validate with 10 folds
multiple_models <- model_wf %>%
fit_resamples(
resamples = vfold_cv(sample_data, v = 10),
control = control_resamples(save_pred = TRUE, event_level = "second"),
metrics = metric_set(accuracy) # could also specify sensitivity, specificity, roc_auc
)
# finalize model
final_model <- model_wf %>%
fit(data = sample_data)
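# Assumed follow-up sketch (not from the original notes): pool the CV
# predictions saved above (save_pred = TRUE) and build a confusion matrix,
# from which sensitivity and specificity can be read off
multiple_models %>%
collect_predictions() %>%
conf_mat(truth = y, estimate = .pred_class)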
```
KNN & Trees
Pro:
- Non-parametric classification
- Flexibility
- No assumptions on shape of relationship
- Good with complicated relationships
Con:
- Lack of insights
- Ignores relationships that do exist
- Computationally intensive
KNN
Goal: build classification model of categorical response variable
Algorithm:
- Identify the \(K\) nearest neighbors with respect to the \(x\) values
- Get the \(y\) values of those neighbors
- Predict the most common \(y\) value among the neighbors
Pro:
- Flexible
- Intuitive
- More than 2 outcome categories
Con:
- Lazy learner; must calculate distances at time of prediction for each point we classify
- Computationally intensive
- Provides classifications, but no real sense of relationships
```{r}
#| message: false
#| warning: false
#| eval: false
# Ensure outcome is categorical
sample_data <- sample_data %>%
mutate(y = as.factor(y))
# Create KNN classification model specification
model_spec <- nearest_neighbor() %>%
set_mode("classification") %>%
set_engine(engine = "kknn") %>%
set_args(neighbors = tune())
# Make data recipe, convert all predictors to numbers and normalize
data_rec <- recipe(y ~ x, data = sample_data) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
# Create model workflow
model_wf <- workflow() %>%
add_recipe(data_rec) %>%
add_model(model_spec)
# Cross-validate
multiple_models <- model_wf %>%
tune_grid(
grid = grid_regular(
neighbors(range = c(1, 7)),
levels = 10
),
resamples = vfold_cv(sample_data, 10),
metrics = metric_set(accuracy)
)
# plot the models
multiple_models %>%
autoplot()
# select best parameter
# NOTE: parsimony is irrelevant
best_param <- multiple_models %>%
select_best(metric = "accuracy")
# finalize model
final_model <- model_wf %>%
finalize_workflow(best_param) %>%
fit(data = sample_data)
```
Classification Trees
Predict outcome from a series of yes/no questions.
Recursive binary splitting algorithm:
- Start with all the data
- Create the best binary split
- Minimizing the Gini index \(G = p_1(1 - p_1) + p_2(1 - p_2) + \dots\), summed over the outcome classes (worked example below)
- Maximizing node homogeneity
- Repeat step 2 within each resulting region until a stopping rule is reached (these stopping rules are the tuning parameters):
- Reached max depth (# of splits from root to leaf)
- Minimum sample size in each node
- Cost complexity \(\alpha\) (like \(\lambda\) for LASSO): \(\alpha = 0\) means no penalty; large \(\alpha\) means only make a split if it is very useful
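For example, a two-class node with class proportions 0.7 and 0.3 has \(G = 0.7(1-0.7) + 0.3(1-0.3) = 0.42\), while a pure node (proportions 1 and 0) has \(G = 0\), so smaller Gini means a more homogeneous node.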
Pro:
- More insight into relationship
- Eager learners - can use a tree to classify all new points
Con:
- Greedy
- Computationally intensive
Note: KNN and trees become equivalent in the extreme case \(\alpha = \infty\) and \(K = n\): both just predict the overall majority class.
```{r}
#| message: false
#| warning: false
#| eval: false
# Ensure outcome is categorical
sample_data <- sample_data %>%
mutate(y = as.factor(y))
# Create classification tree classification model specification
model_spec <- decision_tree() %>%
set_mode("classification") %>%
set_engine(engine = "rpart") %>%
set_args(min_n = 2, tree_depth = 30, cost_complexity = tune())
# Make data recipe
# NOTE: preprocessing not absolutely necessary
data_rec <- recipe(y ~ x, data = sample_data)
# Create model workflow
model_wf <- workflow() %>%
add_recipe(data_rec) %>%
add_model(model_spec)
# Cross-validate
multiple_models <- model_wf %>%
tune_grid(
grid = grid_regular(
cost_complexity(range = c(-5, 2)), # log10 scale
levels = 10
),
resamples = vfold_cv(sample_data, 10),
metrics = metric_set(accuracy)
)
# plot the models
multiple_models %>%
autoplot()
# select best parameter
best_param <- multiple_models %>%
select_best(metric = "accuracy")
# select parsimonious parameter
parsimonious_param <- multiple_models %>%
select_by_one_std_err(metric = "accuracy", desc(cost_complexity))
# finalize model
final_model <- model_wf %>%
finalize_workflow(parsimonious_param) %>%
fit(data = sample_data)
```
Bagging and Random Forests
Idea:
- Build a bunch of trees with high variance and low bias (unpruned)
- Aggregate (average) the trees' predictions to get one overall prediction
Both bagging and random forests are examples of ensemble methods: combining outputs of multiple ML algorithms
- In general: low variability; more stable predictions
Bagging
Bootstrap aggregation (bagging):
- Each tree is created from a resampling with replacement
- ~2/3 of original data will show up in resample, ~1/3 will not
- All \(p\) predictors are considered for each split
- Use resample to train, and remaining to test (out of bag error)
Forest = many unpruned trees from slightly different datasets (bootstrapping/resampling)
Overall prediction is from the combined prediction across all trees
- Classification: Majority rules
- Regression: Averages
```{r}
#| message: false
#| warning: false
#| eval: false
# Ensure outcome is categorical
sample_data <- sample_data %>%
mutate(y = as.factor(y))
# Create bagging classification model specification
model_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine(engine = "ranger") %>%
set_args(
mtry = ___, # bagging: set to the total number of predictors so all p are considered at each split
min_n = 2,
trees = 500,
probability = FALSE,
importance = "impurity")
# Make data recipe, preprocessing steps aren't absolutely necessary
data_rec <- recipe(y ~ x, data = sample_data)
# Create model workflow
model_wf <- workflow() %>%
add_recipe(data_rec) %>%
add_model(model_spec)
# NOTE: no tuning needed - OOB estimation instead
# finalize model
final_model <- model_wf %>%
fit(data = sample_data)
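# Assumed follow-up sketch (not from the original notes): pull the
# out-of-bag (OOB) error estimate from the underlying ranger fit
final_model %>%
extract_fit_engine() %>%
pluck("prediction.error")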
```
Random Forest
At each split in the tree, randomly select and consider only a subset of predictors
- Typical values are \(p/2\) and \(\sqrt{p}\)
Pros:
- Low bias, low variance (maintains the low bias of individual trees but reduces variance)
- Less computationally intensive than bagging (only a subset of predictors considered at each split)
- Less greedy (surrogates get a chance to shine)
- Decorrelate trees (random subset of predictors)
Cons:
- Computationally expensive
- Can’t plot results, lose interpretability
Note: among the methods covered here, random forests are especially effective at reducing variance.
```{r}
# The only thing you have to change for random forest from bagging code is model specification
# Create random forest classification model specification
model_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine(engine = "ranger") %>%
set_args(
mtry = ___, # number of predictors to consider at each split
min_n = 2,
trees = 500,
probability = FALSE,
importance = "impurity")
```
Unsupervised Learning
A set of data with no outcome
Instead of predicting, focus on
- structure of the data
- jumping off point for further investigations
Clustering
Goal: Identify similarities amongst rows, with respect to variables (features).
In general, clustering is a DISCOVERY PROCESS!
Hierarchical Clustering
Algorithm:
- Calculate the (Euclidean) distance between each pair of cases
- Group (cluster) together the 2 closest cases/clusters according to the chosen linkage type
- Repeat with the new clusters until everything has fused into 1 cluster
Linkage Types:
- Complete: Max distance between any pair of cases
- Single: min distance
- Average: average distance
- Centroid: distance between the cluster centroids
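In symbols, for clusters \(A\) and \(B\) with pairwise case distances \(d(a,b)\) and centroids \(\bar{x}_A, \bar{x}_B\):
\[d_{\text{complete}} = \max_{a \in A,\, b \in B} d(a,b), \quad d_{\text{single}} = \min_{a \in A,\, b \in B} d(a,b), \quad d_{\text{average}} = \frac{1}{|A||B|}\sum_{a \in A}\sum_{b \in B} d(a,b), \quad d_{\text{centroid}} = d(\bar{x}_A, \bar{x}_B)\]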
This creates a dendrogram, which is NOT a classification tree (we start from the leaves and end at the root).
General properties of dendrograms:
- The more similar a cluster is, the sooner it will fuse
- Horizontal distance between leaves is meaningless
- Height at which two clusters fuse = distance between those clusters
No “correct” way to cut a dendrogram
- Higher cut, fewer clusters, easier to understand, but maybe too simple and not very meaningful.
Drawbacks:
- Computationally expensive: need to calculate distances between each pair of points
- Greedy: the locally best merge at each step might not produce the best split / most meaningful clusters overall
Code:
```{r}
# Load packages
library(tidyverse)
library(cluster) # to build the hierarchical clustering algorithm
library(factoextra) # to draw the dendrograms
# convert column to rowname
sample_data <- sample_data %>%
column_to_rownames("id")
# run clustering algorithm
# use "complete", "single", "average", or "centroid" linkage method
hier_model <- hclust(dist(scale(sample_data)), method = ___)
# visualizing the clustering: heat map (not clustering)
heatmap(scale(data.matrix(sample_data)), Colv = NA, Rowv = NA)
# visualizing the clustering: heat map (ordered by dendrogram/clustering)
heatmap(scale(data.matrix(sample_data)), Colv = NA)
# Dendrogram (change font size with cex)
fviz_dend(hier_model, cex = 1)
fviz_dend(hier_model, horiz = TRUE, cex = 1) # plot horizontally to read longer labels
# Visualizing clusters in a dendrogram
# typically want to store in a new dataset so the cluster assignment isn't written into features used later
cluster_data <- sample_data %>%
mutate(hier_cluster_k = as.factor(cutree(hier_model, k = ___)))
# visualize the clusters on the dendrogram
fviz_dend(hier_model, k = ___, cex = 1)
fviz_dend(hier_model, k = ___, horiz = TRUE, cex = 1) # plot horizontally to read longer labels
```
K-Means Clustering
Goal: Split cases into \(k\) non-overlapping, homogeneous clusters
Homogeneous: the within-cluster variance \(W(c_k)\) of each cluster \(c_1,\dots, c_k\) (the average squared distance between all pairs of cases in the cluster) is small.
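In symbols (one standard formulation), K-means seeks the assignment of cases to clusters that minimizes the total within-cluster variation:
\[\min_{c_1,\dots,c_k} \sum_{m=1}^{k} W(c_m), \quad W(c_m) = \frac{1}{|c_m|}\sum_{i,\, i' \in c_m}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\]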
Algorithm:
- Pick \(k\) - from prior knowledge
- Initialize locations for \(k\) centroids, assign each case to nearest centroid
- Calculate new centroids (the mean of the cases assigned to each cluster)
- Reassign each case to the cluster with the nearest centroid
- Repeat steps 3 and 4 until the assignments stabilize
Code:
```{r}
# Load packages
library(tidyverse)
library(cluster) # clustering algorithms
library(factoextra) # cluster visualizations
# process the data
sample_data <- sample_data %>%
column_to_rownames("id")
# K-means can't handle NA: 2 options
# Omit missing cases (this can be bad if there are a lot of missing points!)
sample_data <- na.omit(sample_data)
# Impute the missing cases
library(VIM)
sample_data <- sample_data %>%
VIM::kNN(imp_var = FALSE)
# Turn categorical features into dummy variables
# (skip this step if all features are already quantitative or logical)
sample_data <- data.frame(model.matrix(~ . - 1, sample_data))
# Run K-means
set.seed(___)
kmeans_model <- kmeans(scale(sample_data), centers = ___)
# Get the total within-cluster sum of squares
# This is the total squared distance of each case from its assigned centroid
kmeans_model$tot.withinss
# Tuning K-means
tibble(K = 1:___) %>%
mutate(SS = map(K, ~ kmeans(scale(sample_data), centers = .x)$tot.withinss)) %>%
unnest(cols = c(SS))
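# Assumed follow-up sketch (not from the original notes): elbow plot of the
# total within-cluster SS against K -- look for where the curve bends
tibble(K = 1:___) %>%
mutate(SS = map_dbl(K, ~ kmeans(scale(sample_data), centers = .x)$tot.withinss)) %>%
ggplot(aes(x = K, y = SS)) +
geom_point() +
geom_line() +
labs(y = "Total within-cluster sum of squares")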
# Assign each sample case to a cluster
# We typically want to store this in a new dataset so that the cluster assignments aren't
# accidentally used as features in a later analysis!
cluster_data <- sample_data %>%
mutate(kmeans_cluster_k = kmeans_model$cluster)
```
Dimension Reduction
Goal: Find similarities amongst columns by collapsing/combining features
Motivation: What if we have large data (especially \(p>n\))? We might want to
- identify structure and patterns among features
- conserve computational resources
- facilitate regression and classification by eliminating multicollinear predictors
Combining linear predictors:
- Idea 1: kick one predictor out; we lose the variability/information in the dropped predictor
- Idea 2: find the direction that gives the best picture of the data (preserves the most variability / information)
Idea 2:
- Principal components are linear combinations of original features where loadings refer to the coefficients.
- PC1 contains the most variation.
- We want to retain high variability and reduce correlation
Principal Component Analysis
Goal: Turn original set of predictors into \(k<p\) uncorrelated principal components (PCs) preserving the majority of information (variability)
Trade offs (relative to kicking out columns):
- PCA preserves more information, but the new features lose their original meaning and are difficult to interpret
Details:
- Score plots: examine relationships among PCs
- Loadings: coefficients of linear combinations
- Loading plots: understand how PCs are constructed, thus how to interpret PC
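In symbols, each PC is a linear combination of the (standardized) original features, with loadings \(\phi_{jm}\) scaled so that \(\sum_{j=1}^{p} \phi_{jm}^2 = 1\):
\[\text{PC}_m = \phi_{1m} x_1 + \phi_{2m} x_2 + \cdots + \phi_{pm} x_p\]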
```{r}
# Load packages
library(tidyverse)
# process the data
sample_data <- sample_data %>%
column_to_rownames("id")
# PCA can't handle NAs
# Omit missing cases (this can be bad if there are a lot of missing points!)
sample_data <- na.omit(sample_data)
# Impute the missing cases
library(VIM)
sample_data <- sample_data %>%
VIM::kNN(imp_var = FALSE)
# Turn categorical features into dummy variables
# (skip this step if all features are already quantitative or logical)
sample_data <- data.frame(model.matrix(~ . - 1, sample_data))
# Run PCA
# scale = TRUE, center = TRUE first standardizes the features
pca_results <- prcomp(sample_data, scale = TRUE, center = TRUE)
# ---------------------------------------------------------
# PC visualizations
# Get the loadings which define the PCs
pca_results %>%
pluck("rotation")
# Plot loadings for first "k" PCs (you pick k)
library(reshape2)
melt(pca_results$rotation[, 1:k]) %>%
ggplot(aes(x = Var1, y = value, fill = Var1)) +
geom_bar(stat = "identity") +
facet_wrap(~ Var2) +
labs(y = "loadings", x = "original features", fill = "original features")
# Plot loadings for just the first PC
melt(pca_results$rotation) %>%
filter(Var2 == "PC1") %>%
ggplot(aes(x = Var1, y = value, fill = Var1)) +
geom_bar(stat = "identity") +
labs(y = "loadings", x = "original features", fill = "original features")
# Loadings plot for first 2 PCs
library(factoextra)
fviz_pca_var(pca_results, repel = TRUE)
# ---------------------------------------------------------
# Examine amount of variability captured by each PC
# Load package for tidy table
library(tidymodels)
# Numerical summaries: Measure information captured by each PC
pca_results %>%
tidy(matrix = "eigenvalues")
# Graphical summary 1: SCREE PLOT
# Plot % of variance explained by each PC
pca_results %>%
tidy(matrix = "eigenvalues") %>%
ggplot(aes(y = percent, x = PC)) +
geom_point(size = 2) +
geom_line() +
labs(y = "% of variance explained")
# Graphical summary 2: Plot cumulative % of variance explained by each PC
pca_results %>%
tidy(matrix = "eigenvalues") %>%
rbind(0) %>%
ggplot(aes(y = cumulative, x = PC)) +
geom_point(size = 2) +
geom_line() +
labs(y = "CUMULATIVE % of variance explained")
# ---------------------------------------------------------
# Examine scores (i.e. PC coordinates for data points)
# Numerical summary: check out the scores
pca_results %>%
pluck("x")
# Graphical summary: Score plot
# Plot PC1 scores (x-axis) vs PC2 scores (y-axis) of all data points
fviz_pca_ind(pca_results, repel = TRUE)
```
Principal Component Regression
Idea: Combining dimension reduction (PCA) and regression given some \(y\) and \(p>n\)
Using PCR:
- Ignore \(y\) for now, do PCA
- Keep the first \(k\) PCs that retain sufficient information from the original predictors
- Model \(y\) using those first \(k\) PCs
Note:
- PCA might not produce strongest possible predictors of \(y\)
- Partial least squares provides alternative
Code:
```{r}
library(tidymodels)
library(tidyverse)
# STEP 1: specify linear regression model
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
# STEP 2: variable recipe
pcr_recipe <- recipe(y ~ ., data = sample_data) %>%
update_role(data_id, new_role = "id") %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), num_comp = tune())
# STEP 3: workflow
pcr_workflow <- workflow() %>%
add_recipe(pcr_recipe) %>%
add_model(lm_spec)
# STEP 4: Estimate multiple PCR models with varying numbers of PCs
# Largest range is p
# Put same number in for levels
set.seed(___)
pcr_models <- pcr_workflow %>%
tune_grid(
grid = grid_regular(num_comp(range = c(1, ___)), levels = ___),
resamples = vfold_cv(sample_data, v = 10),
metrics = metric_set(mae)
)
# explore models
pcr_models %>%
autoplot()
best_param <- pcr_models %>%
select_best(metric = "mae")
parsimonious_param <- pcr_models %>%
select_by_one_std_err(metric = "mae",
num_comp) # fewer components = simpler model
final_model <- pcr_workflow %>%
finalize_workflow(___) %>% # your chosen parameter
fit(data = ___)
```