Chapter 6 Linear Model Selection and Regularization

6.1 Conceptual Exercises

6.1.1 Exercise 1.

  • Question (a)

The model with \(k\) predictors that has the smallest training RSS is the one from best subset selection, because this procedure keeps the model with the smallest training RSS after trying all possible models with \(k\) predictors. However, by luck, the model with \(k\) predictors selected by the forward or backward stepwise procedure can be the same as the one from best subset selection (see the small illustration at the end of this exercise).

  • Question (b)

We cannot say which model with \(k\) predictors has the smallest test RSS. Best subset selection can overfit (it only looks good on the training data), and forward and backward selection do not necessarily lead to the same model with \(k\) predictors, so any of the three could turn out to have the smallest test RSS.

  • Question (c)

The predictors in the \(k\)-variable model identified by forward (resp. backward) stepwise selection are a subset of the predictors in the \((k+1)\)-variable model identified by forward (resp. backward) stepwise selection, by definition of the procedures (step 2.a of both algorithms). The predictors in the \(k\)-variable model identified by backward (resp. forward) stepwise selection are not necessarily a subset of the predictors in the \((k+1)\)-variable model identified by forward (resp. backward) stepwise selection, because the two procedures explore the models differently. Finally, the predictors in the \(k\)-variable model identified by best subset selection are not necessarily a subset of the predictors in the \((k+1)\)-variable model identified by best subset selection, because for each \(k\) the best subset procedure considers all the models with \(k\) predictors, and the selected models can be very different for two different values of \(k\).
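To make question (a) concrete, here is a minimal sketch (assuming the leaps package and the built-in mtcars data, neither of which is part of the exercise): for every model size, the training RSS of best subset selection is never larger than the training RSS of forward or backward stepwise selection.

# Compare the training RSS of the three procedures for each model size
library(leaps)
fit_best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "exhaustive")
fit_fwd  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
fit_bwd  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")
cbind(best = summary(fit_best)$rss,
      forward = summary(fit_fwd)$rss,
      backward = summary(fit_bwd)$rss)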

6.1.2 Exercise 2.

  • Question (a)

The lasso, relative to least squares, is less flexible (because we add a constraint) and hence will give improved prediction accuracy when its increase in bias is smaller than its decrease in variance: as \(\lambda\) increases, the variance of the lasso estimator decreases substantially while its bias increases only slightly.

  • Question (b)

Ridge regression, relative to least squares, behaves in the same way as the lasso (the constraint is different, but the answer is the same).

  • Question (c)

Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is smaller than their decrease in bias, because non-linear methods tend to have high variance and low bias.

6.1.3 Exercise 3.

Suppose we estimate the regression coefficients in a linear regression model by minimizing

\[ \mid\mid Y - X\beta \mid\mid^2 \quad\text{subject to}\quad \mid\mid \beta \mid\mid_1 \leq s\]

for a particular value of \(s\). This minimization corresponds to the Lasso regression.

  • Question (a)

When \(s\) increases from \(0\), the constraint is relaxed and the model becomes more flexible. The training RSS therefore steadily decreases as \(s\) increases.

  • Question (b)

The test RSS decreases when \(s\) starts to increase from \(0\), reaches a minimum, and then increases again once the model becomes flexible enough to overfit.

  • Question (c)

The fewer the constraints the larger the variance. The variance steadily increases with \(s\).

  • Question (d)

Conversely, the fewer the constraints, the smaller the bias. The bias steadily decreases with \(s\).

  • Question (e)

By definition, the irreducible error does not depend on the model. It remains constant as \(s\) changes.

6.1.4 Exercise 4.

Suppose we estimate the regression coefficients in a linear regression model by minimizing

\[ \mid\mid Y - X\beta \mid\mid^2 + \lambda\mid\mid \beta \mid\mid_2^2\]

for a particular value of \(\lambda\). This minimization corresponds to the ridge regression.

  • Question (a)

When \(\lambda\) increases from \(0\), the penalty becomes stronger and the model becomes less flexible. The training RSS therefore steadily increases as \(\lambda\) increases.

  • Question (b)

The test RSS decreases when \(\lambda\) starts to increase from \(0\), reaches a minimum, and then increases again once the model becomes too constrained and starts to underfit.

  • Question (c)

The stronger the penalty, the smaller the variance. The variance steadily decreases with \(\lambda\).

  • Question (d)

Conversely, the stronger the penalty, the larger the bias. The bias steadily increases with \(\lambda\).

  • Question (e)

By definition, the irreducible error does not depend on the model. It remains constant as \(\lambda\) changes. A small simulation illustrating the behavior described in Exercises 3 and 4 follows.
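The sketch below is a minimal example, assuming the glmnet package; the data are simulated here and are not part of the exercise. It fits ridge regression over a grid of \(\lambda\) values and computes the training and test MSE: the training MSE increases with \(\lambda\), while the test MSE is roughly U-shaped. The same experiment with alpha = 1 illustrates Exercise 3 for the lasso (with the roles reversed, since a larger budget \(s\) corresponds to a smaller \(\lambda\)).

library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(2 * n * p), ncol = p)  # 100 training + 100 test observations
beta <- rnorm(p)
y <- as.vector(X %*% beta + rnorm(2 * n, sd = 3))
train <- 1:n; test <- (n + 1):(2 * n)
grid <- 10^seq(3, -2, length.out = 50)
fit <- glmnet(X[train, ], y[train], alpha = 0, lambda = grid)
# MSE of each column of predictions (one column per value of lambda)
mse <- function(pred, obs) colMeans((pred - obs)^2)
train_mse <- mse(predict(fit, newx = X[train, ]), y[train])
test_mse <- mse(predict(fit, newx = X[test, ]), y[test])
# Training MSE increases with lambda; test MSE is roughly U-shaped
plot(log10(fit$lambda), train_mse, type = "l", ylim = range(c(train_mse, test_mse)),
     xlab = "log10(lambda)", ylab = "MSE")
lines(log10(fit$lambda), test_mse, lty = 2)  # dashed line: test MSE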

6.1.5 Exercise 5.

It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\) and \(x_{11} + x_{21} = 0\) and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\widehat{\beta}_0 = 0\).

  • Question (a)

The ridge regression optimization problem is written:

\[ \arg \min_{\beta_1, \beta_2} \mathcal{L}(\beta_1, \beta_2) = \arg \min_{\beta_1, \beta_2} \left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right)^2 + \left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right)^2 + \lambda\left(\beta_1^2 + \beta_2^2\right)\]

  • Question (b)

Taking the partial derivatives with respect to \(\beta_1\) and \(\beta_2\) and setting them to \(0\), we obtain:

\[\frac{\partial \mathcal{L}(\beta_1, \beta_2)}{\partial\beta_1} = 0 \Leftrightarrow -2x_{11}\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right) - 2x_{21}\left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right) + 2\lambda\beta_1 = 0\]

\[\frac{\partial \mathcal{L}(\beta_1, \beta_2)}{\partial\beta_2} = 0 \Leftrightarrow -2x_{12}\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right) - 2x_{22}\left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right) + 2\lambda\beta_2 = 0\]

Using \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), we find

\[(\beta_1 + \beta_2)(x_{11}^2 + x_{21}^2) + \lambda\beta_1 = x_{11}y_1 + x_{21}y_2\]

and

\[(\beta_1 + \beta_2)(x_{12}^2 + x_{22}^2) + \lambda\beta_2 = x_{12}y_1 + x_{22}y_2.\]

Since \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), the two equations have the same right-hand side and the same first term on the left-hand side. Subtracting one from the other gives

\[\lambda\beta_1 = \lambda\beta_2,\]

so, for \(\lambda > 0\), the ridge coefficient estimates satisfy \(\widehat{\beta}_1 = \widehat{\beta}_2\).
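A quick numerical check of this result, using base R only (the data values and \(\lambda\) below are arbitrary choices satisfying the constraints of the exercise):

# Data satisfying x11 = x12, x21 = x22, x11 + x21 = 0, x12 + x22 = 0 and y1 + y2 = 0
x11 <- x12 <- 2; x21 <- x22 <- -2
y1 <- 3; y2 <- -3
lambda <- 1
ridge_crit <- function(b) {
  (y1 - b[1] * x11 - b[2] * x12)^2 + (y2 - b[1] * x21 - b[2] * x22)^2 +
    lambda * (b[1]^2 + b[2]^2)
}
optim(c(0, 0), ridge_crit)$par  # the two coordinates are (numerically) equal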

  • Question (c)

The lasso regression optimization problem is written:

\[ \arg \min_{\beta_1, \beta_2} \mathcal{L}(\beta_1, \beta_2) = \arg \min_{\beta_1, \beta_2} \left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right)^2 + \left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right)^2 + \lambda\left(\mid\beta_1\mid + \mid\beta_2\mid\right)\]

  • Question (d)

Taking the partial derivatives with respect to \(\beta_1\) and \(\beta_2\) and setting them to \(0\), we obtain:

\[\frac{\partial \mathcal{L}(\beta_1, \beta_2)}{\partial\beta_1} = 0 \Leftrightarrow -2x_{11}\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right) - 2x_{21}\left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right) + \lambda\frac{\mid\beta_1\mid}{\beta_1} = 0\]

\[\frac{\partial \mathcal{L}(\beta_1, \beta_2)}{\partial\beta_2} = 0 \Leftrightarrow -2x_{12}\left(y_1 - \beta_1x_{11} - \beta_2x_{12}\right) - 2x_{22}\left(y_2 - \beta_1x_{21} - \beta_2x_{22}\right) + \lambda\frac{\mid\beta_2\mid}{\beta_2} = 0\]

Using \(x_{11} = x_{12}\) and \(x_{21} = x_{22}\), we find

\[(\beta_1 + \beta_2)(x_{11}^2 + x_{21}^2) + \frac{\lambda}{2}\frac{\mid\beta_1\mid}{\beta_1} = x_{11}y_1 + x_{21}y_2\]

and

\[(\beta_1 + \beta_2)(x_{12}^2 + x_{22}^2) + \frac{\lambda}{2}\frac{\mid\beta_2\mid}{\beta_2} = x_{12}y_1 + x_{22}y_2.\]

As in question (b), the two equations have the same right-hand side and the same first term on the left-hand side. Subtracting one from the other gives

\[\frac{\mid\beta_1\mid}{\beta_1} = \frac{\mid\beta_2\mid}{\beta_2},\]

so the lasso coefficient estimates only have to satisfy \(\text{sign}(\beta_1) = \text{sign}(\beta_2)\) (when both are nonzero). This condition does not determine a unique solution. Indeed, writing \(x = x_{11} = x_{12} = -x_{21} = -x_{22}\) and \(y_2 = -y_1\), the lasso criterion equals \(2\left(y_1 - (\beta_1 + \beta_2)x\right)^2 + \lambda\left(\mid\beta_1\mid + \mid\beta_2\mid\right)\), and \(\mid\beta_1\mid + \mid\beta_2\mid = \mid\beta_1 + \beta_2\mid\) whenever \(\beta_1\) and \(\beta_2\) have the same sign. The criterion therefore depends on \((\beta_1, \beta_2)\) only through the sum \(\beta_1 + \beta_2\): every pair with the optimal sum and a common sign is a solution, so there are infinitely many lasso coefficient estimates.
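A quick numerical check of the non-uniqueness, with arbitrary data satisfying the constraints: all pairs \((\beta_1, \beta_2)\) sharing the same sum and the same sign give the same value of the lasso criterion.

x11 <- x12 <- 2; x21 <- x22 <- -2
y1 <- 3; y2 <- -3
lambda <- 1
lasso_crit <- function(b1, b2) {
  (y1 - b1 * x11 - b2 * x12)^2 + (y2 - b1 * x21 - b2 * x22)^2 +
    lambda * (abs(b1) + abs(b2))
}
s <- 0.5  # an arbitrary common value of beta1 + beta2
sapply(seq(0, s, length.out = 5), function(b1) lasso_crit(b1, s - b1))  # identical values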

6.1.6 Exercise 6.

  • Question (a)

The equation (6.12) with \(p = 1\) is written:

\[L = \left(y_1 - \beta_1\right)^2 + \lambda\beta_1^2\]

y <- 2; lambda <- 10
beta <- seq(-10, 10, 0.1)
ridge <- function(beta, y, lambda) return((y - beta)**2 + lambda*beta**2)
df <- tibble(beta = beta, L = ridge(beta, y, lambda))
Figure 6.1: The ridge regression optimization problem is solved by (6.14)
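As a sanity check, the minimizer of this one-dimensional criterion can be compared with the closed-form ridge solution given by (6.14), which reduces here to \(\widehat{\beta} = y / (1 + \lambda)\). A quick check using the objects defined in the chunk above:

optimize(ridge, interval = c(-10, 10), y = y, lambda = lambda)$minimum
y / (1 + lambda)  # 2 / 11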

  • Question (b)

The equation (6.13) with \(p = 1\) is written:

\[L = \left(y_1 - \beta_1\right)^2 + \lambda\mid\beta_1\mid\]

y <- 2; lambda <- 10
beta <- seq(-10, 10, 0.1)
# Lasso criterion for a single observation and a single coefficient
lasso <- function(beta, y, lambda) return((y - beta)**2 + lambda*abs(beta))
df <- tibble(beta = beta, L = lasso(beta, y, lambda))
Figure 6.2: The lasso regression optimization problem is solved by (6.15)
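Similarly, the minimizer can be compared with the soft-thresholding solution given by (6.15): since \(\mid y \mid = 2 \leq \lambda / 2 = 5\) here, the lasso solution is exactly \(0\).

optimize(lasso, interval = c(-10, 10), y = y, lambda = lambda)$minimum  # approximately 0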

6.1.7 Exercise 7.

We will now derive the Bayesian connection to the lasso and ridge regression.

  • Question (a)

Suppose the regression model: \(Y = X\beta + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, \sigma^2I_n)\).

The likelihood of the data can be written as:

\[\begin{align*} f(Y \mid X, \beta) &= \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\left(y_i - x_i\beta\right)^2\right)\\ &= \left(\frac{1}{2\pi\sigma^2}\right)^{n / 2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2\right) \end{align*}\]

  • Question (b)

We have the following prior on \(\beta\): \(p(\beta) = \frac{1}{2b}\exp(-\mid\beta\mid / b)\).

So, the posterior distribution for \(\beta\) is: \[\begin{align*} p(\beta \mid Y, X) &\propto f(Y \mid X, \beta)p(\beta) \\ &\propto \left(\frac{1}{2\pi\sigma^2}\right)^{n / 2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2\right)\frac{1}{2b}\exp(-\mid\beta\mid / b) \\ &\propto \frac{1}{2b}\left(\frac{1}{2\pi\sigma^2}\right)^{n / 2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 - \frac{\mid\beta\mid}{b} \right) \end{align*}\]

  • Question (c)

The mode of \(\beta\) under this posterior distribution is the maximizer of \(p(\beta \mid Y, X)\):

\[\begin{align*} \arg \max_\beta p(\beta \mid Y, X) &= \arg \min_\beta \frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 + \frac{\mid\beta\mid}{b} \\ &= \arg \min_\beta \sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 + \frac{2\sigma^2}{b}\mid\beta\mid \end{align*}\]

which is the lasso regression optimization problem with \(\lambda = 2\sigma^2 / b\). So, the lasso estimate is the mode of \(\beta\) under this posterior distribution.

  • Question (d)

We have the following prior on \(\beta\): \(p(\beta) = \prod_{i = 1}^p \frac{1}{\sqrt{2\pi c}}\exp(-\beta_i^2 / 2c)\).

So, the posterior distribution for \(\beta\) is:

\[\begin{align*} p(\beta \mid Y, X) &\propto f(Y \mid X, \beta)p(\beta) \\ &\propto \left(\frac{1}{2\pi\sigma^2}\right)^{n / 2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2\right)\left(\frac{1}{2\pi c}\right)^{p/2}\exp\left(- \frac{1}{2c}\sum_{i = 1}^p \beta_i^2\right) \\ &\propto \left(\frac{1}{2\pi c}\right)^{p/2}\left(\frac{1}{2\pi\sigma^2}\right)^{n / 2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 - \frac{1}{2c}\sum_{i = 1}^p \beta_i^2\right) \end{align*}\]

  • Question (e)

The mode of \(\beta\) under this posterior distribution is the maximizer of \(p(\beta \mid Y, X)\):

\[\begin{align*} \arg \max_\beta p(\beta \mid Y, X) &= \arg \min_\beta \frac{1}{2\sigma^2}\sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 + \frac{1}{2c}\sum_{i = 1}^p \beta_i^2 \\ &= \arg \min_\beta \sum_{i = 1}^n \left(y_i - x_i\beta\right)^2 + \frac{\sigma^2}{c}\sum_{i = 1}^p \beta_i^2 \end{align*}\]

which is the ridge regression optimization problem with \(\lambda = \sigma^2 / c\). Since the posterior distribution is also Gaussian, its mode coincides with its mean, so the ridge estimate is both the posterior mode and the posterior mean of \(\beta\).

6.2 Applied Exercises

6.2.1 Exercise 8.

  • Question (a)
set.seed(42)
n <- 100
X <- rnorm(n, 0, 1)
eps <- rnorm(n, 0, 1)
  • Question (b)
Y <- 5 + 3 * X + 0.4 * X**2 + X**3 + eps
  • Question (c)
X <- as_tibble(poly(X, degree = 10, raw = TRUE))
colnames(X) <- paste0("$X_{", colnames(X), "}$")
df <- X %>% add_column(Y, .before = 1)

First, we perform best subset selection in order to choose the best model containing the predictors \(X\), \(X^2, \dots, X^{10}\).

reg_subset <- regsubsets(Y ~ ., data = df, nvmax = 10)
reg_subset$xnames <- str_sub(reg_subset$xnames, 2, -2)
Figure 2.2: Selected variables for each criterion for best subset selection.

Figure 2.3: Best models according to \(C_p\), \(BIC\) and adjusted \(R^2\) for best subset selection.

Table 6.1: Coefficients for the best model according to \(BIC\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{5}\) \(X_{7}\) \(X_{9}\)
5.00352 3.05965 0.25633 0.81479 -0.17204 0.01039
Table 6.2: Coefficients for the best model according to \(C_p\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\)
4.86923 3.01089 0.65315 0.89162 -0.09081 -0.21299 0.01459 0.01568
Table 6.3: Coefficients for the best model according to Adjusted \(R^2\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\)
4.86923 3.01089 0.65315 0.89162 -0.09081 -0.21299 0.01459 0.01568
  • Question (d)
reg_subset <- regsubsets(Y ~ ., data = df, nvmax = 10, method = 'forward')
reg_subset$xnames <- str_sub(reg_subset$xnames, 2, -2)
Figure 6.3: Selected variables for each criterion for forward selection.

Figure 6.4: Best models according to \(C_p\), \(BIC\) and adjusted \(R^2\) for forward selection.

Table 6.4: Coefficients for the best model according to \(BIC\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{3}\) \(X_{5}\) \(X_{7}\) \(X_{9}\)
5.00245 2.95906 0.25966 0.19189 0.72191 -0.15637 0.00955
Table 6.5: Coefficients for the best model according to \(C_p\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{3}\) \(X_{5}\) \(X_{7}\) \(X_{9}\)
5.00245 2.95906 0.25966 0.19189 0.72191 -0.15637 0.00955
Table 6.6: Coefficients for the best model according to Adjusted \(R^2\).
Intercept \(X_{1}\) \(X_{2}\) \(X_{3}\) \(X_{4}\) \(X_{5}\) \(X_{7}\) \(X_{9}\) \(X_{10}\)
4.85097 3.02558 0.81393 -0.04738 -0.21319 0.92544 -0.2203 0.01611 0.00076
reg_subset <- regsubsets(Y ~ ., data = df, nvmax = 10, method = 'backward')
reg_subset$xnames <- str_sub(reg_subset$xnames, 2, -2)
Figure 6.5: Selected variables for each criterion for backward selection.

Figure 6.6: Best models according to \(C_p\), \(BIC\) and adjusted \(R^2\) for backward selection.

Table 6.7: Coefficients for the best model according to \(BIC\).
Intercept \(X_{1}\) \(X_{4}\) \(X_{5}\) \(X_{7}\) \(X_{9}\)
5.13379 3.07276 0.04595 0.8142 -0.17212 0.01048
Table 6.8: Coefficients for the best model according to \(C_p\).
Intercept \(X_{1}\) \(X_{4}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\)
4.98423 2.96772 0.5976 0.96214 -0.2685 -0.24729 0.03118 0.01974
Table 6.9: Coefficients for the best model according to Adjusted \(R^2\).
Intercept \(X_{1}\) \(X_{4}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\)
4.98423 2.96772 0.5976 0.96214 -0.2685 -0.24729 0.03118 0.01974
  • Question (e)
set.seed(42)
lasso_cv <- cv.glmnet(as.matrix(df[-1]), df$Y, alpha = 1)
Figure 6.7: Cross validation error as a function of \(\lambda\).

lasso_model <- glmnet(as.matrix(df[-1]), df$Y, alpha = 1)
lasso_coef <- predict(lasso_model, type = 'coefficients', s = lasso_cv$lambda.min)
Table 6.10: Coefficients for the model fitted by lasso.
(Intercept) \(X_{1}\) \(X_{2}\) \(X_{3}\) \(X_{4}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\) \(X_{10}\)
5.10693 3.06612 0.18405 0.91642 0 0 0 0 0 0 0

The lasso gives good coefficient estimates and sets to \(0\) all the coefficients that are not in the true model. So the lasso can also be used to perform model selection.

  • Question (f)
set.seed(42)
Y <- 5 + 0.2 * pull(X, 7) + eps
df <- X %>% add_column(Y, .before = 1)

First, we perform best subset selection for the \(Y\) vector with the \(X\) matrix as features.

reg_subset <- regsubsets(Y ~ ., data = df, nvmax = 10)
reg_subset$xnames <- str_sub(reg_subset$xnames, 2, -2)
Figure 6.8: Selected variables for each criterion for best subset selection.

Figure 6.9: Best models according to \(C_p\), \(BIC\) and adjusted \(R^2\) for best subset selection.

Table 6.11: Coefficients for the best model according to \(BIC\).
Intercept \(X_{3}\) \(X_{4}\) \(X_{5}\) \(X_{9}\)
4.95872 -1.1237 -0.03994 0.9405 0.01203
Table 6.12: Coefficients for the best model according to \(C_p\).
Intercept \(X_{3}\) \(X_{5}\) \(X_{6}\) \(X_{9}\) \(X_{10}\)
4.96071 -0.90561 0.8318 -0.02629 0.0146 0.00078
Table 6.13: Coefficients for the best model according to Adjusted \(R^2\).
Intercept \(X_{2}\) \(X_{3}\) \(X_{5}\) \(X_{6}\) \(X_{8}\) \(X_{9}\)
4.87656 0.22826 -0.91874 0.83297 -0.085 0.01372 0.01476

And now, we fit the lasso model on the data.

set.seed(42)
lasso_cv <- cv.glmnet(as.matrix(df[-1]), df$Y, alpha = 1)
Figure 6.10: Cross validation error as a function of \(\lambda\).

lasso_model <- glmnet(as.matrix(df[-1]), df$Y, alpha = 1)
lasso_coef <- predict(lasso_model, type = 'coefficients', s = lasso_cv$lambda.min)
Table 6.14: Coefficients for the model fitted by lasso.
(Intercept) \(X_{1}\) \(X_{2}\) \(X_{3}\) \(X_{4}\) \(X_{5}\) \(X_{6}\) \(X_{7}\) \(X_{8}\) \(X_{9}\) \(X_{10}\)
4.73572 0 0 0 0 0.00427 0 0.19456 0 0 0

The results obtained with best subset selection are very poor compared to the ones obtained with the lasso: best subset selection does not recover the true underlying model. In contrast, the model fitted by the lasso is almost perfect: all the coefficients are set to \(0\) except the one of interest, \(\beta_7\) (and \(\beta_5\), whose estimate is very close to \(0\)).

6.2.2 Exercise 9.

In this exercise, we will predict the number of applications received (Apps variable) using the other variables in the College dataset.

  • Question (a)
set.seed(42)
df <- as_tibble(College[-1])
idx <- df$Apps %>% createDataPartition(p = 0.7, list = FALSE, times = 1)
train <- slice(df, idx[,1]); Y_train <- as.vector(train$Apps); X_train <- as.matrix(select(train, -Apps))
test <- slice(df, -idx[,1]); Y_test <- as.vector(test$Apps); X_test <- as.matrix(select(test, -Apps))
  • Question (b)
lm_model <- lm(Apps ~ ., data = train)
pred_lm <- predict.lm(lm_model, newdata = test)

The mean squared error obtained using the linear model is \(1.4251116\times 10^{6}\).
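(For reference, the test errors reported in this exercise are presumably computed as below; this chunk is a sketch added here and is not part of the original output.)

mean((pred_lm - Y_test)^2)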

  • Question (c)
cv.out <- cv.glmnet(X_train, Y_train, alpha = 0)
ridge_mod <- glmnet(X_train, Y_train, alpha = 0, lambda = cv.out$lambda.min)
pred_ridge <- predict(ridge_mod, newx = X_test)

The mean squared error obtained using the ridge regression model is \(1.3541858\times 10^{6}\).

  • Question (d)
cv.out <- cv.glmnet(X_train, Y_train, alpha = 1)
lasso_mod <- glmnet(X_train, Y_train, alpha = 1, lambda = cv.out$lambda.min)
pred_lasso <- predict(lasso_mod, newx = X_test)

The mean squared error obtained using the lasso regression model is \(1.4228908\times 10^{6}\). The number of non-zero coefficients is 14.

  • Question (e)
pcr_mod <- pcr(Apps ~ ., data = train, scale = TRUE, validation = 'CV')
pred_pcr <- predict(pcr_mod, test, ncomp = which.min(pcr_mod$validation$adj))

The mean squared error obtained using the PCR model is \(1.4251116\times 10^{6}\). The number of components \(M\) selected with cross-validation is 16.

  • Question (f)
pls_mod <- plsr(Apps ~ ., data = train, scale = TRUE, validation = 'CV')
pred_pls <- predict(pls_mod, test, ncomp = which.min(pls_mod$validation$adj))

The mean squared error obtained using the PLS model is \(1.4230893\times 10^{6}\). The number of components \(M\) selected with cross-validation is 13.

  • Question (g)

Let’s compute the \(R^2\) for each model in order to compare them.

\(R^2\) for all the models
Model \(R^2\)
Linear 0.8984327
Ridge 0.9034876
Lasso 0.898591
PCR 0.8984327
PLS 0.8985768

So, all the models are quite comparable, with an \(R^2\) around \(0.9\). None of them performs markedly better than the others, and since the \(R^2\) is quite high, all of them fit the data pretty well. All the test errors are very similar.
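(A sketch of how these \(R^2\) values can be obtained, shown for the linear model; the other models are analogous. The helper below is an assumption, not taken from the original chunks.)

r2 <- function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
r2(pred_lm, Y_test)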

6.2.3 Exercise 10.

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated dataset.

  • Question (a)
set.seed(42)
n <- 1000; p <- 20
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
beta <- runif(20, min = 0.5, max = 1)
beta[sample(1:length(beta), 5)] <- 0
Y <- as.matrix(X) %*% beta + rnorm(n)
df <- data.frame(Y, X)
  • Question (b)
idx <- sample(1:length(Y), size = 100)
train <- df[idx,]
test <- df[-idx,]
  • Question (c)

Let’s perform best subset selection on the train set.

regfit_subset <- regsubsets(Y ~ ., data = train, nvmax = 20)
model_matrix <- model.matrix(Y ~ ., data = train)
val_errors_train <- rep(NA, 20)
for (i in 1:20) {
  coefi <- coef(regfit_subset, id = i)
  pred <- model_matrix[, names(coefi)] %*% coefi
  val_errors_train[i] <- mean((train$Y - pred)**2)
}
Figure 6.11: Train MSE associated with the best model of each size.

  • Question (d)

Let’s do the same thing on the test set.

regfit_subset <- regsubsets(Y ~ ., data = train, nvmax = 20)
model_matrix <- model.matrix(Y ~ ., data = test)
val_errors_test <- rep(NA, 20)
for (i in 1:20) {
  coefi <- coef(regfit_subset, id = i)
  pred <- model_matrix[, names(coefi)] %*% coefi
  val_errors_test[i] <- mean((test$Y - pred)**2)
}
Figure 6.12: Test MSE associated with the best model of each size.

  • Question (e)

The minimum value of the test MSE is 1.2052158. This minimum is reached for a model of size 15.

  • Question (f)

The model at which the test set MSE is minimized has the same number of coefficients as the true model. Moreover, all the coefficients that are set to \(0\) in the generating process are also equal to \(0\) in the model with the minimum test MSE. However, the estimated coefficients are not very close to the true ones (this is probably due to the noise in the generating process).
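(To check these claims, the selected coefficients can be printed next to the true \(\beta\); a quick sketch:)

best_size <- which.min(val_errors_test)
coef(regfit_subset, id = best_size)  # estimated coefficients of the best test-MSE model
beta                                 # true coefficients generated in question (a)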

  • Question (g)
beta <- c(0, beta)
names(beta) <- names(coef(regfit_subset, id = 20))
beta_errors <- rep(NA, 20)
for (i in 1:20) {
  coefi <- coef(regfit_subset, id = i)
  b <- merge(data.frame(beta=names(beta), beta), data.frame(beta=names(coefi),coefi), all.x = TRUE)
  b[is.na(b)] <- 0
  beta_errors[i] <- sqrt(sum((b[,2] - b[,3])**2))
}
Figure 6.13: Errors on the coefficients compared to the size of the model.

The \(\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta}_j^r)^2}\) curve looks quite similar to the test MSE curve. However, its minimum is not reached at the same model size.

6.2.4 Exercise 11.

We will now try to predict the per capita crime rate in the Boston data set.

# Load data
df <- as_tibble(Boston)
# Split into train and test sets
set.seed(42)
idx <- sample(1:nrow(df), size = 100)
train <- df[idx,]; model_train <- model.matrix(crim ~ ., data = train)
test <- df[-idx,]; model_test <- model.matrix(crim ~ ., data = test)
  • Question (a)

Let’s start with best subset selection.

regfit_subset <- regsubsets(crim ~ ., data = train, nvmax = ncol(train) - 1)
Figure 6.14: Best models according to \(BIC\) for best subset selection.

So, it appears that the best model according to best subset selection has two variables.
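(Presumably this size is the one minimizing the \(BIC\); a quick check:)

which.min(summary(regfit_subset)$bic)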

best_coeff <- coef(regfit_subset, id = 2)
pred_best <- model_test[, names(best_coeff)] %*% best_coeff

Then, let’s perform lasso regression.

set.seed(42)
lasso_cv <- cv.glmnet(x = as.matrix(train[,2:ncol(train)]), y = train[['crim']], alpha = 1)
lasso_mod <- glmnet(as.matrix(train[,2:ncol(train)]), train[['crim']], alpha = 1, lambda = lasso_cv$lambda.min)
pred_lasso <- predict(lasso_mod, newx = as.matrix(test[,2:ncol(test)]))

Another option is ridge regression.

set.seed(42)
ridge_cv <- cv.glmnet(x = as.matrix(train[,2:ncol(train)]), y = train[['crim']], alpha = 0)
ridge_mod <- glmnet(as.matrix(train[,2:ncol(train)]), train[['crim']], alpha = 0, lambda = ridge_cv$lambda.min)
pred_ridge <- predict(ridge_mod, newx = as.matrix(test[,2:ncol(test)]))

And finally, the PCR model.

set.seed(42)
pcr_mod <- pcr(crim ~ ., data = train, scale = TRUE, validation = 'CV')
pred_pcr <- predict(pcr_mod, test[,2:ncol(test)], ncomp = 7)
  • Question (b)

We are going to compare the different models using the MSE on the test set.

mse_best <- mean((pred_best - test$crim)**2)
mse_lasso <- mean((pred_lasso - test$crim)**2)
mse_ridge <- mean((pred_ridge - test$crim)**2)
mse_pcr <- mean((pred_pcr - test$crim)**2)
MSE for all the models
Model MSE
Best subset 41.8357193
Ridge 42.4799642
Lasso 40.5281645
PCR 45.7516391

So, it appears that the lasso has the smallest MSE on the test set. Thus, we choose the lasso model as the final model.

  • Question (c)

Our model does not involve all the features because the lasso sets some coefficients exactly to \(0\).

predict(lasso_mod, type = 'coefficient')
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) -2.57705466
## zn           .         
## indus        .         
## chas         .         
## nox          .         
## rm           .         
## age          .         
## dis          .         
## rad          0.46978606
## tax          .         
## ptratio      .         
## black        .         
## lstat        0.25150257
## medv        -0.06290058