Chapter 3 Linear Regression

3.1 Conceptual Exercises

3.1.1 Exercise 1.

Recall the model for the Advertising data:

\[sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper + \epsilon\]

After fitting the model, we obtain the following results:

Least squares coefficient estimates of the linear model:

              Coefficient  Std. error  t-statistic  p-value
  Intercept         2.939      0.3119         9.42  < 0.001
  TV                0.046      0.0014        32.81  < 0.001
  radio             0.189      0.0086        21.89  < 0.001
  newspaper        -0.001      0.0059        -0.18   0.8599

The null hypotheses to which the p-values given in the table correspond are, for each coefficient separately,

\[H_0 : \beta_j = 0\]

versus

\[H_1 : \beta_j \neq 0, \quad j = 0, 1, 2, 3.\]

These p-values mean that the two features TV and radio are significantly associated with sales in the Advertising data, holding the other predictors fixed. However, the feature newspaper is not statistically significant: given the TV and radio budgets, there is no evidence that newspaper spending has an additional influence on the sales.

3.1.2 Exercise 2.

In both \(K\)-nearest neighbors regression and classification, the starting point is to identify the \(K\) points that are closest to a given point \(x_0\). Let us call this set \(\mathcal{N}_0\). However, the quantity of interest is not the same. In the regression setting, we are interested in the estimate \(\hat{f}(x_0)\), which is obtained by averaging the responses of the \(K\) closest points to \(x_0\): \[\hat{f}(x_0) = \frac{1}{K}\sum_{x_i \in \mathcal{N}_0} y_i\]

In the classification setting, we first estimate the probability of belonging to each class \(j\), as the proportion of the \(K\) closest points to \(x_0\) that belong to class \(j\):

\[\mathbb{P}(Y = j \mid X = x_0) = \frac{1}{K}\sum_{x_i \in \mathcal{N}_0}\mathbb{1}(y_i = j)\]

Finally, mimicking the Bayes classifier, we assign \(x_0\) to the class with the largest estimated probability.
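To make the parallel concrete, here is a minimal sketch of both estimators in R for a one-dimensional predictor (the helper names knn_regress and knn_classify are ours, not from the book):

# K-nearest-neighbors estimators for a single predictor (illustrative sketch).
knn_regress <- function(x0, x, y, K = 5) {
  neighbors <- order(abs(x - x0))[1:K]  # indices of the K closest points to x0
  mean(y[neighbors])                    # average their responses
}
knn_classify <- function(x0, x, y, K = 5) {
  neighbors <- order(abs(x - x0))[1:K]
  probs <- table(y[neighbors]) / K      # estimated class probabilities
  names(probs)[which.max(probs)]        # class with the largest estimated probability
}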

3.1.3 Exercise 3.

The fitted model is:

\[Y = \widehat{\beta}_0 + \sum_{i = 1}^{5}\widehat{\beta}_iX_i.\]

  • Question (a)

For a fixed value of IQ (\(X_2\)) and GPA (\(X_1\)), females earn more on average than males because the coefficient of Gender (\(X_3\)) is positive. However, the coefficient of the interaction between GPA and Gender (\(X_5\)) is negative, so the female advantage is \(35 - 10 \times GPA\), which becomes negative as soon as the GPA exceeds 3.5. Therefore, for a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough (larger than 3.5).

  • Question (b)

The salary of a woman with an IQ of 110 and a GPA of 4.0 is (in thousands of dollars):

\[Y = 50 + 20\times 4.0 + 0.07\times 110 + 35\times 1 + 0.01\times 4.0\times 110 - 10\times 4.0\times 1 = 137.1.\]
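As a quick numerical check (a sketch, using the coefficient estimates given in the exercise):

gpa <- 4.0; iq <- 110; gender <- 1  # female
# Plug the values into the fitted model; this equals 137.1 (in thousands of dollars).
50 + 20 * gpa + 0.07 * iq + 35 * gender + 0.01 * gpa * iq - 10 * gpa * gender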

  • Question (c)

With only the value of \(\widehat{\beta}_4\), we cannot conclude whether there is evidence of a GPA/IQ interaction effect on the starting salary after graduation, even if this value is very small. To test that, we should compute the p-value associated with the test of \(H_0: \beta_4 = 0\) (the t-statistic of \(\widehat{\beta}_4\), or equivalently an F-test comparing the models with and without the interaction). If the p-value is very low (usually under 0.05), there is clear evidence of a relationship between the GPA/IQ interaction and the salary.

3.1.4 Exercise 4.

  • Question (a) and (b)

The true relationship between \(X\) and \(Y\) is linear, so the cubic regression model will overfit. On the training set, we expect the RSS for the cubic model to be lower than that of the linear model because its extra flexibility lets it follow the noise more closely. However, on the test set, it should be the opposite.

  • Question (c) and (d)

The true relationship between \(X\) and \(Y\) is not linear. As before, the cubic regression should have a lower training RSS because this model is more flexible than the linear model. However, we cannot tell which one will be lower on the test set because it depends on how far the true relationship is from linear. If the true relationship is almost linear, the test RSS will be lower for the linear model; conversely, if the true relationship is closer to cubic, the test RSS will be lower for the cubic model.

3.1.5 Exercise 5.

\[ \widehat{y}_i = x_i\widehat{\beta} = x_i \frac{\sum_{j=1}^n x_jy_j}{\sum_{k=1}^n x^2_k} = \sum_{j=1}^n\frac{x_jx_i}{\sum_{k=1}^n x^2_k}y_j\]

So, \(a_j = \frac{x_jx_i}{\sum_{k=1}^n x^2_k}\). We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.

3.1.6 Exercise 6.

The simple linear model is \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\). The estimators of \(\beta_0\) and \(\beta_1\) are:

\[\widehat{\beta_0} = \bar{y} - \widehat{\beta_1}\bar{x} \quad\text{and}\quad \widehat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

Let’s predict the value of \(y_i\) for \(x_i = \bar{x}\).

\[\widehat{\beta_0} + \widehat{\beta_1}\bar{x} = \bar{y} - \widehat{\beta_1}\bar{x} + \widehat{\beta_1}\bar{x} = \bar{y}\]

So, the least squares line always passes through the point \((\bar{x}, \bar{y})\).
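A minimal numerical check of this property (a sketch; the simulated data and variable names are ours):

set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)
# The prediction at the mean of x coincides with the mean of y.
c(prediction_at_xbar = unname(predict(fit, data.frame(x = mean(x)))), ybar = mean(y))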

3.1.7 Exercise 7.

Assume that \(\bar{x} = \bar{y} = 0\).

\[R^2 = 1 - \frac{\sum(y_i - \widehat{y}_i)^2}{\sum y_i^2} = \frac{\sum y_i^2 - \sum y_i^2 + 2\sum y_i\widehat{y}_i - \sum\widehat{y}_i^2}{\sum y_i^2}\]

We know that \(\widehat{y}_i = x_i\widehat{\beta}\). So

\[ R^2 = \frac{2\sum y_ix_i\widehat{\beta} - \sum x_i^2\widehat{\beta}^2}{\sum y_i^2} = \widehat{\beta}\,\frac{2\sum y_ix_i - \sum x_i^2\widehat{\beta}}{\sum y_i^2}\]

Now, replace \(\widehat{\beta}\) by \(\frac{\sum x_iy_i}{\sum x_i^2}\) and note that \(\sum x_i^2\widehat{\beta} = \sum x_iy_i\). Finally,

\[R^2 = \frac{2(\sum x_iy_i)^2 - (\sum x_iy_i)^2}{\sum x_i^2\sum y_i^2} = \frac{(\sum x_iy_i)^2}{\sum x_i^2\sum y_i^2} = Corr^2\]

In the case of simple linear regression of \(Y\) onto \(X\), the \(R^2\) statistic is equal to the square of the correlation between \(X\) and \(Y\).
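A quick numerical check of this identity (a sketch; the centering step enforces \(\bar{x} = \bar{y} = 0\) as assumed above):

set.seed(1)
x <- rnorm(50)
y <- 3 * x + rnorm(50)
x <- x - mean(x)  # center both variables so that their means are zero
y <- y - mean(y)
fit_no_intercept <- lm(y ~ x + 0)
c(r_squared = summary(fit_no_intercept)$r.squared, squared_corr = cor(x, y)^2)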

3.2 Applied Exercises

3.2.1 Exercise 8.

This exercise is about fitting a simple linear model to the Auto dataset. It contains 392 observations of 9 variables about vehicles. For a description of the variables, please refer to the R documentation by typing help(Auto) after loading the package ISLR.

  • Question (a)
lm_model <- lm(mpg ~ horsepower, data = auto)
lm_model %>% summary() %>% print_summary_lm()
Results of the linear model on the auto dataset.
  • Formula: mpg ~ horsepower
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 333 30.5 0 24 -13.57 -6.99 -5.97 -3.26 -0.34 2.76 7.11 8.77 16.92
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 39.93586 0.71750 55.65984 < 2.22e-16
    horsepower -0.15784 0.00645 -24.48914 < 2.22e-16
  • Residual standard error: 4.906 on 390 degrees of freedom.
  • Multiple \(R^2\): 0.606.
  • Adjusted \(R^2\): 0.605.
  • F-statistic: 599.718 on 1 and 390, p-value: <2e-16.

We test the hypothesis \(H_0: \beta_1 = 0\), using the F-statistic to determine whether or not we should reject the null hypothesis. As the p-value associated with the F-statistic is very low, there is clear evidence of a relationship between mpg and horsepower. Two measures of model accuracy characterize how strong the relationship between the response and the variable is: the RSE and the \(R^2\). For this linear model, the RSE is 4.91 mpg. Since the mean value of mpg is 23.45, this corresponds to a percentage error of roughly 21%. The \(R^2\) statistic is equal to 0.61, so horsepower explains roughly 61% of the variance in mpg. The variable horsepower has a negative relationship with the response mpg because \(\widehat{\beta}_1 < 0\). The predicted value of mpg for horsepower = 98 is 24.47. The 95% confidence interval is (23.97, 24.96) and the 95% prediction interval is (14.81, 34.12).
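The intervals quoted above can be obtained with predict (a sketch, assuming the lm_model object fitted in question (a)):

new_data <- data.frame(horsepower = 98)
predict(lm_model, new_data, interval = "confidence", level = 0.95)
predict(lm_model, new_data, interval = "prediction", level = 0.95)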

  • Question (b)
Figure 3.1: Least square regression plot for the auto dataset.
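One way this figure could have been produced (a sketch, assuming ggplot2 and the auto tibble used in question (a)):

ggplot(auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # add the least squares line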

  • Question (c)
Figure 3.2: Diagnostic plots of the least square regression fit.

The plot of the fitted values against the residuals shows a non-linear pattern in the data. So, a non-linear transformation of the data, such as a quadratic transformation, could improve the fit. There does not seem to be any other problem with the fit, apart from one observation with a fairly large Cook's distance, but that's all.
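The diagnostic plots can be reproduced with base R (a sketch, assuming the lm_model object from question (a)):

par(mfrow = c(2, 2))
plot(lm_model)  # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage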

3.2.2 Exercise 9.

This exercise is about fitting a multiple linear model to the Auto dataset. It contains 392 observations of 9 variables about vehicles. For a description of the variables, please refer to the R documentation by typing help(Auto) after loading the package ISLR.

  • Question (a)
Figure 3.3: Scatterplot matrix.

  • Question (b)
Figure 3.4: Correlation plot.

  • Question (c)
lm_model <- lm(mpg ~ ., data = auto)
lm_model %>% summary() %>% print_summary_lm()
Results of the linear model on the auto dataset.
  • Formula: mpg ~ .
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 392 22.65 0 10.88 -9.59 -5.09 -3.79 -2.16 -0.12 1.87 3.79 5.27 13.06
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -17.21843 4.64429 -3.70744 0.00024018
    cylinders -0.49338 0.32328 -1.52615 0.12779647
    displacement 0.01990 0.00752 2.64743 0.00844465
    horsepower -0.01695 0.01379 -1.22951 0.21963282
    weight -0.00647 0.00065 -9.92879 < 2.22e-16
    acceleration 0.08058 0.09884 0.81517 0.41547802
    year 0.75077 0.05097 14.72880 < 2.22e-16
    origin 1.42614 0.27814 5.12749 4.6657e-07
  • Residual standard error: 3.328 on 384 degrees of freedom.
  • Multiple \(R^2\): 0.821.
  • Adjusted \(R^2\): 0.818.
  • F-statistic: 252.428 on 7 and 384, p-value: <2e-16.

We have fitted a multiple regression model of mpg onto all the other variables in the auto dataset. In order to say that there is a relationship between the features and the response, we should test the hypothesis that all \(\beta_i = 0\). For this model, the p-value associated with the F-statistic is less than \(2\times 10^{-16}\), so there is clear evidence of a relationship between mpg and the features. Four variables have a statistically significant relationship with the response (they have very low p-values): displacement, weight, year and origin. The coefficient of the year variable suggests that mpg grows with the year, so recent cars consume less than old cars.

  • Question (d)
Figure 3.5: Diagnostic plots of the multiple least square regression fit.

There is a slight trend in the residuals, meaning that the linear model may not be the best model for these data. Moreover, the variance of the error terms seems to be non-constant. Based on these two observations, some transformation of the data could be a good idea. There are no clear outliers, but observation 14 appears to be a high-leverage point.
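The high-leverage observation mentioned above can be located directly from the hat values (a sketch, assuming the multiple-regression lm_model from question (c)):

h <- hatvalues(lm_model)
which.max(h)  # index of the observation with the largest leverage
mean(h)       # average leverage, (p + 1) / n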

  • Question (e)
# Recall that cylinders:year stands for the inclusion of the interaction term between
# cylinders and year in the regression, while cylinders*year adds cylinders, year and
# cylinders:year to the model. In the same way, .:. adds all the pairwise interactions
# to the model and .*. adds all the variables and their pairwise interactions.
lm_model_int <- lm(mpg ~ .*., data = auto)
lm_model_int %>% summary() %>% print_summary_lm()
Results of the linear model on the auto dataset.
  • Formula: mpg ~ . * .
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 392 18.77 0 6.74 -7.63 -4.12 -3.2 -1.45 0.06 1.27 2.73 4 11.14
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 35.47889 53.13579 0.66770 0.5047479
    cylinders 6.98858 8.24797 0.84731 0.3973815
    displacement -0.47854 0.18935 -2.52722 0.0119207
    horsepower 0.50343 0.34700 1.45082 0.1476933
    weight 0.00413 0.01759 0.23490 0.8144183
    acceleration -5.85917 2.17362 -2.69558 0.0073536
    year 0.69743 0.60967 1.14395 0.2533996
    origin -20.89557 7.09709 -2.94424 0.0034459
    cylinders:displacement -0.00338 0.00646 -0.52412 0.6005130
    cylinders:horsepower 0.01161 0.02420 0.47993 0.6315682
    cylinders:weight 0.00036 0.00090 0.39918 0.6899953
    cylinders:acceleration 0.27787 0.16642 1.66969 0.0958433
    cylinders:year -0.17413 0.09714 -1.79250 0.0738852
    cylinders:origin 0.40217 0.49262 0.81638 0.4148169
    displacement:horsepower -0.00008 0.00029 -0.29434 0.7686659
    displacement:weight 0.00002 0.00001 1.68205 0.0934193
    displacement:acceleration -0.00348 0.00334 -1.04107 0.2985339
    displacement:year 0.00593 0.00239 2.48202 0.0135156
    displacement:origin 0.02398 0.01947 1.23200 0.2187484
    horsepower:weight -0.00002 0.00003 -0.67321 0.5012433
    horsepower:acceleration -0.00721 0.00372 -1.93920 0.0532521
    horsepower:year -0.00584 0.00394 -1.48218 0.1391606
    horsepower:origin 0.00223 0.02930 0.07619 0.9393091
    weight:acceleration 0.00023 0.00023 1.02518 0.3059597
    weight:year -0.00022 0.00021 -1.05568 0.2918156
    weight:origin -0.00058 0.00159 -0.36379 0.7162292
    acceleration:year 0.05562 0.02558 2.17427 0.0303306
    acceleration:origin 0.45832 0.15666 2.92555 0.0036547
    year:origin 0.13926 0.07399 1.88212 0.0606197
  • Residual standard error: 2.695 on 363 degrees of freedom.
  • Multiple \(R^2\): 0.889.
  • Adjusted \(R^2\): 0.881.
  • F-statistic: 104.2 on 28 and 363, p-value: <2e-16.

Three interactions appear to be statistically significant: displacement:year, acceleration:origin and acceleration:year, because they have p-values smaller than 0.05.

  • Question (f)
lm_model_squ <- lm(mpg ~ . + I(horsepower^2), data = auto)
lm_model_squ %>% summary() %>% print_summary_lm()
Results of the linear model on the auto dataset.
  • Formula: mpg ~ . + I(horsepower^2)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 392 20.55 0 8.82 -8.55 -4.62 -3.6 -1.73 -0.22 1.59 3.22 5.26 12
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.32366 4.62477 0.28621 0.77487184
    cylinders 0.34891 0.30483 1.14459 0.25309407
    displacement -0.00756 0.00737 -1.02598 0.30554979
    horsepower -0.31946 0.03434 -9.30168 < 2.22e-16
    weight -0.00327 0.00068 -4.82000 2.0735e-06
    acceleration -0.33060 0.09918 -3.33315 0.00094235
    year 0.73534 0.04599 15.98852 < 2.22e-16
    origin 1.01441 0.25455 3.98505 8.0778e-05
    I(horsepower^2) 0.00101 0.00011 9.44884 < 2.22e-16
  • Residual standard error: 3.001 on 383 degrees of freedom.
  • Multiple \(R^2\): 0.855.
  • Adjusted \(R^2\): 0.852.
  • F-statistic: 282.813 on 8 and 383, p-value: <2e-16.

If we add the square of the horsepower variable to the model, the model remains significant (the p-value of the F-statistic is still very low). The RSE is smaller for this model than for the plain linear model; however, it is still larger than that of the model with all the interactions.

3.2.3 Exercise 10.

This exercise is about fitting a multiple linear model to the Carseats dataset. It contains 400 observations of 11 variables about sales of child car seats at different stores. For a description of the variables, please refer to the R documentation by typing help(Carseats) after loading the package ISLR.

  • Question (a)
lm_model <- lm(Sales ~ Price + Urban + US, data = carseats)
lm_model %>% summary() %>% print_summary_lm()
Results of the linear model on the carseats dataset.
  • Formula: Sales ~ Price + Urban + US
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 400 13.98 0 6.07 -6.92 -3.93 -3.19 -1.62 -0.06 1.58 3.18 4.24 7.06
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 13.04347 0.65101 20.03567 < 2.22e-16
    Price -0.05446 0.00524 -10.38923 < 2.22e-16
    UrbanYes -0.02192 0.27165 -0.08068 0.93574
    USYes 1.20057 0.25904 4.63467 4.8602e-06
  • Residual standard error: 2.472 on 396 degrees of freedom.
  • Multiple \(R^2\): 0.239.
  • Adjusted \(R^2\): 0.234.
  • F-statistic: 41.519 on 3 and 396, p-value: <2e-16.
  • Question (b)

The variable Price is highly significant (p-value very close to 0) and its coefficient is negative (-0.0544588). So, a $100 increase in the price the company charges for car seats is associated with an average decrease in sales of around 5446 units. The p-value of the coefficient of the dummy variable UrbanYes is very close to 1, which indicates that there is no statistical evidence of a difference in unit sales (in thousands) of child car seats between stores in urban and rural locations. Finally, the variable USYes is also significant and its coefficient is positive (1.2005727). So, a store located in the US sells, on average, around 1201 more units than a comparable store outside the US.

  • Question (c)

The model can be written as:

\[ y_i = \beta_0 + \beta_1x_{1, i} + \beta_2x_{2, i} + \beta_3x_{3, i} + \epsilon_i\]

where

  • \(y_i\) represents the Sales (Unit sales in thousands at each location);

  • \(x_{1, i}\) represents the Price (Price company charges for seats at each site);

  • \(x_{2, i} = 1\) if the store is in an urban location, \(0\) if the store is in a rural location (UrbanYes);

  • \(x_{3, i} = 1\) if the store is in the US, \(0\) if not (USYes).

  • Question (d)

We can reject the null hypothesis for the variables Price and USYes (cf. question (b)).

  • Question (e)
lm_model2 <- lm(Sales ~ Price + US, data = carseats)
lm_model2 %>% summary() %>% print_summary_lm()
Results of the linear model on the carseats dataset.
  • Formula: Sales ~ Price + US
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 400 13.98 0 6.07 -6.93 -3.94 -3.17 -1.63 -0.06 1.58 3.17 4.23 7.05
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 13.03079 0.63098 20.65179 < 2.22e-16
    Price -0.05448 0.00523 -10.41612 < 2.22e-16
    USYes 1.19964 0.25846 4.64148 4.7072e-06
  • Residual standard error: 2.469 on 397 degrees of freedom.
  • Multiple \(R^2\): 0.239.
  • Adjusted \(R^2\): 0.235.
  • F-statistic: 62.431 on 2 and 397, p-value: <2e-16.
  • Question (f)

The two models fit the data equally well; they have the same \(R^2\) (which is another argument for dropping the variable Urban). However, we can consider the second model to be better because it uses fewer features.

  • Question (g)
Table 3.1: 95% confidence intervals for the coefficients

                    2.5 %       97.5 %
  (Intercept)  11.7903202   14.2712653
  Price        -0.0647598   -0.0441954
  USYes         0.6915196    1.7077663
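This table can be reproduced with confint (a sketch, assuming the lm_model2 object from question (e)):

confint(lm_model2, level = 0.95)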
  • Question (h)
Figure 3.6: Diagnostic plots of the multiple least square regression fit.

After looking at the different diagnostic plots, there is no particular evidence of outliers or high-leverage observations in the model. However, the linear model may not be the best model for these data.

3.2.4 Exercise 11.

This exercise is about the t-statistic in models without an intercept. For that, we generate some data with the following “very” simple model:

\[y = 2x + \epsilon \quad\text{where}\quad \epsilon \sim \mathcal{N}(0,1).\]

set.seed(42)
x <- rnorm(100)
y <- 2*x + rnorm(100)
  • Question (a)
lm_model_no_inter <- lm(y ~ x + 0)
lm_model_no_inter %>% summary() %>% print_summary_lm()
  • Formula: y ~ x + 0
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 4.75 -0.09 0.82 -1.98 -1.49 -1.3 -0.59 -0.07 0.45 1.13 1.4 2.77
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    x 2.02449 0.0876 23.11114 < 2.22e-16
  • Residual standard error: 0.908 on 99 degrees of freedom.
  • Multiple \(R^2\): 0.844.
  • Adjusted \(R^2\): 0.842.
  • F-statistic: 534.125 on 1 and 99, p-value: <2e-16.

The p-value associated with the null hypothesis \(\beta = 0\) is very close to 0, so there is strong evidence that the coefficient is statistically different from 0. The estimate of \(\beta\) is 2.0244855, which is very close to the true value.

  • Question (b)
lm_model_no_inter2 <- lm(x ~ y + 0)
lm_model_no_inter2 %>% summary() %>% print_summary_lm()
  • Formula: x ~ y + 0
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 2.4 0.04 0.17 -1.57 -0.6 -0.49 -0.21 0.07 0.32 0.55 0.66 0.83
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    y 0.41671 0.01803 23.11114 < 2.22e-16
  • Residual standard error: 0.412 on 99 degrees of freedom.
  • Multiple \(R^2\): 0.844.
  • Adjusted \(R^2\): 0.842.
  • F-statistic: 534.125 on 1 and 99, p-value: <2e-16.

The value of the t-statistic is the same as for the previous model. So, we can reject the null hypothesis, and the estimate of \(\beta\) is, as before, quite close to its theoretical value.

  • Question (c)

We can write the model from question (a) as \(Y= \beta_1X + \epsilon\) and the one from question (b) as \(X = \beta_2Y + \epsilon\). The least squares estimates of the parameters are \(\widehat{\beta}_1 = (X^\top X)^{-1}X^\top Y\) and \(\widehat{\beta}_2 = (Y^\top Y)^{-1}Y^\top X\). Moreover, these quantities are scalars here, so \(X^\top Y = Y^\top X\). Therefore:

\[\widehat{\beta}_1 = (X^\top X)^{-1}X^\top Y = Y^\top X(X^\top X)^{-1} = (Y^\top Y)\widehat{\beta}_2(X^\top X)^{-1}.\]

And finally, the relationship between \(\widehat{\beta}_1\) and \(\widehat{\beta}_2\) is:

\[\frac{\widehat{\beta}_1}{\widehat{\beta}_2} = (Y^\top Y)(X^\top X)^{-1},\]

which reduces to \(\frac{Var(Y)}{Var(X)}\) in the univariate case (assuming centered variables).
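A quick numerical check of this relationship with the x and y simulated above (a sketch):

beta_1 <- sum(x * y) / sum(x^2)  # slope of y ~ x + 0
beta_2 <- sum(x * y) / sum(y^2)  # slope of x ~ y + 0
c(ratio = beta_1 / beta_2, sum_sq_ratio = sum(y^2) / sum(x^2))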

  • Question (d)
t_stat <- (sqrt(100 - 1) * sum(x * y)) / (sqrt(sum(x * x)*sum(y * y) - sum(x * y)**2))

Numerically, we find that the t-statistic is 23.111142, which is consistent with the result given by the lm function.

Algebraically, the derivations are pretty straightforward: \[\begin{align} \frac{\widehat{\beta}}{SE(\widehat{\beta})} &= \frac{\sum x_iy_i}{\sum x_i^2} \sqrt{\frac{(n-1)\sum x_i^2}{\sum (y_i - x_i\widehat{\beta})^2}} \\ &= \frac{\sqrt{n-1}\sum x_iy_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2 - 2\frac{\sum x_iy_i}{\sum x_i^2}\sum x_iy_i + \left(\frac{\sum x_iy_i}{\sum x_i^2}\right)^2\sum x_i^2}} \\ &= \frac{\sqrt{n-1}\sum x_iy_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2 - 2\frac{\left(\sum x_iy_i\right)^2}{\sum x_i^2} + \frac{\left(\sum x_iy_i\right)^2}{\sum x_i^2}}} \\ &= \frac{\sqrt{n-1}\sum x_iy_i}{\sqrt{\sum x_i^2\sum y_i^2 - 2\left(\sum x_iy_i\right)^2 + \left(\sum x_iy_i\right)^2}} \\ &= \frac{\sqrt{n-1}\sum x_iy_i}{\sqrt{\sum x_i^2\sum y_i^2 - \left(\sum x_iy_i\right)^2}} \end{align}\]

  • Question (e)

We can exchange \(X\) and \(Y\) in the formula of the t-statistic given in (d) without changing the result. So, the t-statistic for the regression of \(Y\) onto \(X\) is the same as the t-statistic for the regression of \(X\) onto \(Y\).

  • Question (f)
lm_model <- lm(y ~ x)
lm_model %>% summary() %>% print_summary_lm()
  • Formula: y ~ x
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 4.75 0 0.82 -1.89 -1.4 -1.21 -0.51 0.01 0.54 1.21 1.49 2.86
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.08837 0.09088 -0.97237 0.33326
    x 2.02716 0.08767 23.12391 < 2e-16
  • Residual standard error: 0.908 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.845.
  • Adjusted \(R^2\): 0.844.
  • F-statistic: 534.715 on 1 and 98, p-value: <2e-16.
lm_model2 <- lm(x ~ y)
lm_model2 %>% summary() %>% print_summary_lm()
  • Formula: x ~ y
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 2.4 0 0.17 -1.61 -0.64 -0.53 -0.25 0.03 0.27 0.51 0.62 0.79
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.04188 0.04119 1.01655 0.31187
    y 0.41689 0.01803 23.12391 < 2e-16
  • Residual standard error: 0.412 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.845.
  • Adjusted \(R^2\): 0.844.
  • F-statistic: 534.715 on 1 and 98, p-value: <2e-16.

Considering the two tables, we can argue that the t-statistic for \(H_0 : \beta_1 = 0\) is the same for both regressions.

3.2.5 Exercise 12.

This exercise involves simple linear regression without an intercept.

  • Question (a)

If we regress \(Y\) onto \(X\) without an intercept, we find that \(\widehat{\beta}_1 = \sum x_iy_i / \sum x_i^2\).

If we regress \(X\) onto \(Y\) without an intercept, we find that \(\widehat{\beta}_2 = \sum x_iy_i / \sum y_i^2\).

So, for \(\widehat{\beta}_1\) to be equal to \(\widehat{\beta}_2\), we need \(\sum x_i^2 = \sum y_i^2\).

First, assume that \(\mathbb{E}(X) = \mathbb{E}(Y)\). Then, \(\widehat{\beta}_1 = \widehat{\beta}_2 \iff \text{Var}(X) = \text{Var}(Y)\). If there is a linear relationship between \(X\) and \(Y\), there should exist \(a \in \mathbb{R}\) such that \(Y = aX\). And then \[\begin{align} \text{Var}(X) = \text{Var}(Y) &\iff \text{Var}(X) = \text{Var}(aX) \\ &\iff \text{Var}(X) = a^2\text{Var}(X) \\ &\iff a = 1 \text{ or } a = -1. \end{align}\]

Second, remove the assumption. If \(\sum x_i^2 = \sum y_i^2\), then

\[\text{Var}(X) + \mathbb{E}(X)^2 = \text{Var}(Y) + \mathbb{E}(Y)^2.\]

Again, if we assume a linear relationship between \(X\) and \(Y\), there should exist \(a \in \mathbb{R}\) such that \(Y = aX\). And then \[\begin{align} \text{Var}(X) + \mathbb{E}(X)^2 = \text{Var}(Y) + \mathbb{E}(Y)^2 &\iff \text{Var}(X) + \mathbb{E}(X)^2 = a^2\left(\text{Var}(X) + \mathbb{E}(X)^2\right)\\ &\iff a = 1 \text{ or } a = -1. \end{align}\]

Finally, if we do not assume a particular relationship between \(X\) and \(Y\), we can simulate two independent random variables with the same expected sum of squares; regressing one onto the other in either direction then gives approximately the same coefficient estimate (see question (c)).

  • Question (b)

See exercise 11.

  • Question (c)
# Both variables are simulated so that E[X^2] = E[Y^2] = 8, hence sum(x^2) is close to sum(y^2).
x <- rnorm(10000, 2, 2)
y <- rnorm(10000, sqrt(6), sqrt(2))
lm(y ~ x + 0) %>% summary() %>% print_summary_lm()
  • Formula: y ~ x + 0
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 10000 16.3 1.25 3.55 -6.02 -1.89 -1.13 -0.01 1.25 2.52 3.64 4.34 10.27
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    x 0.60567 0.008 75.66767 < 2.22e-16
  • Residual standard error: 2.261 on 9999 degrees of freedom.
  • Multiple \(R^2\): 0.364.
  • Adjusted \(R^2\): 0.364.
  • F-statistic: 5725.596 on 1 and 9999, p-value: <2e-16.
lm(x ~ y + 0) %>% summary() %>% print_summary_lm()
  • Formula: x ~ y + 0
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 10000 17.97 0.51 4.81 -8.95 -3.12 -2.32 -0.96 0.53 1.96 3.3 4.15 9.02
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    y 0.60118 0.00795 75.66767 < 2.22e-16
  • Residual standard error: 2.252 on 9999 degrees of freedom.
  • Multiple \(R^2\): 0.364.
  • Adjusted \(R^2\): 0.364.
  • F-statistic: 5725.596 on 1 and 9999, p-value: <2e-16.

3.2.6 Exercise 13.

  • Question (a)
set.seed(42)
x <- rnorm(n = 100, mean = 0, sd = 1)
  • Question (b)
eps <- rnorm(n = 100, mean = 0, sd = sqrt(0.25))
  • Question (c)
y <- -1 + 0.5*x + eps
df <- tibble(x, y)

The length of the vector y is 100 (the same as x and eps). The true value of \(\beta_0\) is -1 and that of \(\beta_1\) is 0.5.

  • Question (d)
Figure 3.7: Scatterplot of y against x.

  • Question (e)
lm <- lm(y ~ x)
lm %>% summary() %>% print_summary_lm()
  • Formula: y ~ x
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 2.38 0 0.2 -0.94 -0.7 -0.61 -0.25 0.01 0.27 0.61 0.74 1.43
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.04418 0.04544 -22.97996 < 2.22e-16
    x 0.51358 0.04383 11.71686 < 2.22e-16
  • Residual standard error: 0.454 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.583.
  • Adjusted \(R^2\): 0.579.
  • F-statistic: 137.285 on 1 and 98, p-value: <2e-16.

The estimates of the coefficients are very close to the true ones, and both are statistically significant. However, the \(R^2\) is moderate (about 0.58), which suggests that the model leaves a substantial part of the variability of the data unexplained.

  • Question (f)
Figure 3.8: Results of the linear regression of y against x.

  • Question (g)
lm_square <- lm(y ~ x + I(x^2))
lm_square %>% summary() %>% print_summary_lm()
  • Formula: y ~ x + I(x^2)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 2.36 0 0.2 -0.95 -0.7 -0.62 -0.25 0 0.27 0.6 0.74 1.41
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.04812 0.05627 -18.62702 < 2e-16
    x 0.51513 0.04591 11.21951 < 2e-16
    I(x^2) 0.00362 0.03020 0.11990 0.90481
  • Residual standard error: 0.456 on 97 degrees of freedom.
  • Multiple \(R^2\): 0.584.
  • Adjusted \(R^2\): 0.575.
  • F-statistic: 67.959 on 2 and 97, p-value: <2e-16.

Based on the table, there is no evidence that the quadratic term improves the model fit. In particular, the p-value of the coefficient of the quadratic term is very high, so this term is not statistically significant. Moreover, the \(R^2\) barely increases when we add the quadratic term.

  • Question (h), (i) and (j)
Figure 3.9: Different results for the \(R^2\), the adjusted \(R^2\) and the RSE

As we see on the graph, the \(R^2\) and the adjusted \(R^2\) decrease when the variance of the noise increases, while the residual standard error increases. This phenomenon is expected: the more noise we add, the harder it is to recover the true underlying function.

Figure 3.10: Confidence intervals for the estimated coefficients

The confidence intervals for the two coefficients show the same phenomenon as the \(R^2\), the adjusted \(R^2\) and the residual standard error. The more the variance of the noise increases, the more difficult it is to estimate the coefficients and to obtain a narrow confidence interval: the confidence interval is wider for large noise and the estimate can be further from the true value.
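A sketch of the kind of simulation behind these figures (the grid of noise variances is illustrative; x is the predictor simulated in question (a)):

sigma2_grid <- c(0.01, 0.25, 1, 4)
results <- sapply(sigma2_grid, function(s2) {
  eps <- rnorm(100, mean = 0, sd = sqrt(s2))   # noise with variance s2
  y <- -1 + 0.5 * x + eps                      # same true model as in question (c)
  fit <- summary(lm(y ~ x))
  c(r_squared = fit$r.squared, adj_r_squared = fit$adj.r.squared, rse = fit$sigma)
})
colnames(results) <- sigma2_grid
results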

3.2.7 Exercise 14.

This exercise is about the problem of collinearity in the features.

  • Question (a)
set.seed(42)
x1 <- runif(100)
x2 <- 0.5*x1 + rnorm(100)/10
y <- 2 + 2*x1 + 0.3*x2 + rnorm(100)

The linear model is the following:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon \text{ where } \beta_0 = 2,~ \beta_1 = 2,~ \beta_2 = 0.3 \text{ and } \epsilon \sim \mathcal{N}(0, 1).\]

  • Question (b)
ggplot() + 
  geom_point(aes(x = x1, y = x2), size = 6) +
  annotate("text", x = 0.1, y = 0.6, label = paste('Correlation:', round(cor(x1, x2), 2)), size = 8, col = 'red') +
  xlab("$X_1$") +
  ylab("$X_2$") +
  theme_custom()
Figure 3.11: Scatterplot of \(X_1\) against \(X_2\).

  • Question (c)
lm <- lm(y ~ x1 + x2)
lm %>% summary() %>% print_summary_lm()
  • Formula: y ~ x1 + x2
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 4.18 0 0.87 -2.1 -1.38 -1.26 -0.66 -0.03 0.65 1.27 1.41 2.08
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.04617 0.19102 10.71198 < 2e-16
    x1 1.55979 0.63982 2.43787 0.016595
    x2 0.98083 1.02652 0.95549 0.341706
  • Residual standard error: 0.941 on 97 degrees of freedom.
  • Multiple \(R^2\): 0.32.
  • Adjusted \(R^2\): 0.306.
  • F-statistic: 22.802 on 2 and 97, p-value: 7.64e-09.

The value of \(\widehat{\beta}_0\) is 2.0461711, which is quite close to the true value of \(\beta_0\). However, both \(\widehat{\beta}_1\) (1.5597949) and \(\widehat{\beta}_2\) (0.9808272) are quite far from their true values, even though we know that the true relationship between \(Y\), \(X_1\) and \(X_2\) is linear. Based on the p-values, we can reject the null hypothesis only for \(\beta_1\); \(\beta_2\) does not appear to be statistically significant. Finally, the \(R^2\) is not very good.

  • Question (d)
lm_1 <- lm(y ~ x1)
lm_1 %>% summary() %>% print_summary_lm()
  • Formula: y ~ x1
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 4.11 0 0.87 -1.97 -1.42 -1.22 -0.69 -0.02 0.56 1.24 1.59 2.14
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.02128 0.18915 10.68623 < 2.22e-16
    x1 2.09295 0.31293 6.68812 1.4022e-09
  • Residual standard error: 0.94 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.313.
  • Adjusted \(R^2\): 0.306.
  • F-statistic: 44.731 on 1 and 98, p-value: 1.4e-09.

The estimated coefficients are both statistically significant. The residual standard error and the \(R^2\) are roughly the same as for the model that also includes \(X_2\). We can reject the null hypothesis \(H_0: \beta_1 = 0\).

  • Question (e)
lm_2 <- lm(y ~ x2)
lm_2 %>% summary() %>% print_summary_lm()
  • Formula: y ~ x2
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 100 4.72 0 0.92 -2.45 -1.46 -1.23 -0.7 -0.06 0.66 1.23 1.58 2.27
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.29744 0.16483 13.93816 < 2.22e-16
    x2 3.16330 0.51481 6.14463 1.7269e-08
  • Residual standard error: 0.964 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.278.
  • Adjusted \(R^2\): 0.271.
  • F-statistic: 37.756 on 1 and 98, p-value: 1.73e-08.

The estimated coefficients are both statistically significant. The \(R^2\) is a little smaller (and the residual standard error a little larger) than for the model with \(X_1\). We can reject the null hypothesis \(H_0: \beta_1 = 0\).

  • Question (f)

At first sight, the different results seem to contradict each other, because they lead to very different coefficient estimates and statistical significance. But if we look at the true model, we see that \(X_1\) and \(X_2\) are highly correlated, so they explain the same variance in the data. In the complete model, they more or less share the explanation of that variance, but since each can explain roughly the same amount on its own, the models with only \(X_1\) or only \(X_2\) lead to similar results. Another problem that can arise in this situation is that the matrix \(X^\top X\) may be close to singular.
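The degree of collinearity can be quantified with variance inflation factors (a sketch, assuming the car package is available; lm is the full model fitted in question (c)):

car::vif(lm)  # large values indicate strong collinearity between x1 and x2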

  • Question (g)
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
  • Question (h)
lm <- lm(y ~ x1 + x2)
lm %>% summary() %>% print_summary_lm()
  • Formula: y ~ x1 + x2
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 101 4.45 0 0.91 -2.34 -1.41 -1.26 -0.71 -0.03 0.72 1.27 1.69 2.11
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.14668 0.19109 11.23357 < 2.22e-16
    x1 0.61282 0.51998 1.17854 0.2414353
    x2 2.57296 0.80975 3.17746 0.0019877
  • Residual standard error: 0.964 on 98 degrees of freedom.
  • Multiple \(R^2\): 0.323.
  • Adjusted \(R^2\): 0.309.
  • F-statistic: 23.327 on 2 and 98, p-value: 5.18e-09.

The new point changes the significance of the coefficients in this model: \(\beta_1\) is no longer significant, while \(\beta_2\) now is. The other quantities do not change much. The RSE barely increases, so the new point is not an outlier. The leverage statistic of the new point is 0.1 while the average leverage is 0.03, so the new point is a high-leverage point.
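The leverage values quoted above can be checked directly (a sketch; observation 101 is the added point and lm is the refitted model):

h <- hatvalues(lm)
h[101]   # leverage of the added point
mean(h)  # average leverage, (p + 1) / n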

# Fit the model with only X_1
lm_1 <- lm(y ~ x1)
lm_1 %>% summary() %>% print_summary_lm()
  • Formula: y ~ x1
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 101 5.7 0 1 -2.04 -1.48 -1.26 -0.73 -0.08 0.54 1.25 1.73 3.66
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.14814 0.19968 10.75791 < 2.22e-16
    x1 1.92083 0.33196 5.78629 8.4547e-08
  • Residual standard error: 1.007 on 99 degrees of freedom.
  • Multiple \(R^2\): 0.253.
  • Adjusted \(R^2\): 0.245.
  • F-statistic: 33.481 on 1 and 99, p-value: 8.45e-08.

The new point changes the RSE and the \(R^2\) a lot, so it should be considered an outlier for this model. The leverage statistic of the new point is 0.03 while the average leverage is 0.02, so the new point is also a high-leverage point.

# Fit the model with only X_2
lm_2 <- lm(y ~ x2)
lm_2 %>% summary() %>% print_summary_lm()
  • Formula: y ~ x2
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 101 4.68 0 0.92 -2.47 -1.47 -1.2 -0.68 -0.05 0.68 1.23 1.55 2.22
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.26526 0.16278 13.91614 < 2.22e-16
    x2 3.32847 0.49570 6.71473 1.1974e-09
  • Residual standard error: 0.966 on 99 degrees of freedom.
  • Multiple \(R^2\): 0.313.
  • Adjusted \(R^2\): 0.306.
  • F-statistic: 45.088 on 1 and 99, p-value: 1.2e-09.

The new point improves this model, so it is not an outlier. The leverage statistic of the new point is 0.02, which equals the average leverage, so the new point is not a high-leverage point.

3.2.8 Exercise 15.

This exercise is about the Boston dataset. We would like to predict the per capita crime rate based on the other features. It contains 506 observations of 14 variables about suburbs of Boston. For a description of the variables, please refer to the R documentation by typing help(Boston) after loading the package MASS.

boston <- as_tibble(Boston)
variable <- c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv")
  • Question (a)
lm_list <- vector("list", length = length(boston)-1)
for(i in seq(1, length(boston)-1)){
  name <- variable[i]
  lm_list[[i]] <- lm(boston$crim ~ boston[[name]])
}
names(lm_list) <- variable
Variable zn

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 88.95 0 71.01 -4.43 -4.4 -4.37 -4.22 -2.62 1.25 6.3 11.34 84.52
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 4.45369 0.41722 10.67475 < 2.22e-16
    boston[[name]] -0.07393 0.01609 -4.59378 5.5065e-06
  • Residual standard error: 8.435 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.04.
  • Adjusted \(R^2\): 0.038.
  • F-statistic: 21.103 on 1 and 504, p-value: 5.51e-06.

Variable indus

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 93.78 0 61.76 -11.97 -7.79 -5.54 -2.7 -0.74 0.71 3.59 8.63 81.81
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.06374 0.66723 -3.09301 0.0020913
    boston[[name]] 0.50978 0.05102 9.99085 < 2.22e-16
  • Residual standard error: 7.866 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.165.
  • Adjusted \(R^2\): 0.164.
  • F-statistic: 99.817 on 1 and 504, p-value: <2e-16.

Variable chas

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 88.97 0 73.76 -3.74 -3.72 -3.71 -3.66 -3.44 0.02 7.11 12.04 85.23
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 3.74445 0.39611 9.45302 < 2e-16
    boston[[name]] -1.89278 1.50612 -1.25673 0.20943
  • Residual standard error: 8.597 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.003.
  • Adjusted \(R^2\): 0.001.
  • F-statistic: 1.579 on 1 and 504, p-value: 0.209.

Variable nox

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 94.1 0 60.87 -12.37 -5.88 -4.96 -2.74 -0.97 0.56 3.41 9.41 81.73
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -13.71988 1.69948 -8.07299 5.0768e-15
    boston[[name]] 31.24853 2.99919 10.41899 < 2.22e-16
  • Residual standard error: 7.81 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.177.
  • Adjusted \(R^2\): 0.176.
  • F-statistic: 108.555 on 1 and 504, p-value: <2e-16.

Variable rm

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 93.8 0 70.43 -6.6 -4.98 -4.64 -3.95 -2.65 0.99 6.92 11.26 87.2
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 20.48180 3.36447 6.08767 2.2720e-09
    boston[[name]] -2.68405 0.53204 -5.04482 6.3467e-07
  • Residual standard error: 8.401 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.048.
  • Adjusted \(R^2\): 0.046.
  • F-statistic: 25.45 on 1 and 504, p-value: 6.35e-07.

Variable age

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 89.64 0 64.78 -6.79 -6.16 -5.6 -4.26 -1.23 1.53 4.65 9.91 82.85
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.77791 0.94398 -4.00208 7.2217e-05
    boston[[name]] 0.10779 0.01274 8.46282 2.8549e-16
  • Residual standard error: 8.057 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.124.
  • Adjusted \(R^2\): 0.123.
  • F-statistic: 71.619 on 1 and 504, p-value: 2.85e-16.

Variable dis

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 88.38 0 63.32 -6.71 -5.9 -5.44 -4.13 -1.53 1.52 4.9 9.32 81.67
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 9.49926 0.73040 13.00561 < 2.22e-16
    boston[[name]] -1.55090 0.16833 -9.21346 < 2.22e-16
  • Residual standard error: 7.965 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.144.
  • Adjusted \(R^2\): 0.142.
  • F-statistic: 84.888 on 1 and 504, p-value: <2e-16.

Variable rad

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 86.6 0 45.04 -10.16 -7.6 -5.08 -1.38 -0.14 0.66 1.7 3.31 76.43
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.28716 0.44348 -5.15735 3.6058e-07
    boston[[name]] 0.61791 0.03433 17.99820 < 2.22e-16
  • Residual standard error: 6.718 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.391.
  • Adjusted \(R^2\): 0.39.
  • F-statistic: 323.935 on 1 and 504, p-value: <2e-16.

Variable tax

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 90.21 0 48.86 -12.51 -6.59 -4.52 -2.74 -0.19 1.07 2.81 4.51 77.7
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -8.52837 0.81581 -10.45387 < 2.22e-16
    boston[[name]] 0.02974 0.00185 16.09939 < 2.22e-16
  • Residual standard error: 6.997 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.34.
  • Adjusted \(R^2\): 0.338.
  • F-statistic: 259.19 on 1 and 504, p-value: <2e-16.

Variable ptratio

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 91.01 0 67.77 -7.65 -6.22 -5.57 -3.99 -1.91 1.82 5.13 10.17 83.35
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -17.64693 3.14727 -5.60706 3.3953e-08
    boston[[name]] 1.15198 0.16937 6.80143 2.9429e-11
  • Residual standard error: 8.24 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.084.
  • Adjusted \(R^2\): 0.082.
  • F-statistic: 46.259 on 1 and 504, p-value: 2.94e-11.

Variable black

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 100.58 0 63.02 -13.76 -4.87 -3 -2.3 -2.09 -1.3 5.76 11.85 86.82
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 16.55353 1.42590 11.60916 < 2.22e-16
    boston[[name]] -0.03628 0.00387 -9.36695 < 2.22e-16
  • Residual standard error: 7.946 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.148.
  • Adjusted \(R^2\): 0.147.
  • F-statistic: 87.74 on 1 and 504, p-value: <2e-16.

Variable lstat

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 96.79 0 58.63 -13.93 -6.75 -5.38 -2.82 -0.66 1.08 3.91 8.14 82.86
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.33054 0.69376 -4.80072 2.087e-06
    boston[[name]] 0.54880 0.04776 11.49065 < 2.22e-16
  • Residual standard error: 7.664 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.208.
  • Adjusted \(R^2\): 0.206.
  • F-statistic: 132.035 on 1 and 504, p-value: <2e-16.

Variable medv

  • Formula: boston$crim ~ boston[[name]]
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 90.03 0 62.83 -9.07 -5.47 -4.9 -4.02 -2.34 1.3 6.4 11.43 80.96
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 11.79654 0.93419 12.62757 < 2.22e-16
    boston[[name]] -0.36316 0.03839 -9.45971 < 2.22e-16
  • Residual standard error: 7.934 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.151.
  • Adjusted \(R^2\): 0.149.
  • F-statistic: 89.486 on 1 and 504, p-value: <2e-16.

Looking at the results of the different fitted linear models, we see that almost every variable is statistically significant (except chas). We can split the other variables into three categories based on the \(R^2\). The variables with a large \(R^2\) (\(> 0.3\)) explain a larger share of the variance (rad and tax). Then, there are variables with a moderate contribution to the variance, roughly between \(0.1\) and \(0.2\): indus, nox, age, dis, black, lstat and medv. Finally, some have an \(R^2\) very close to \(0\): zn, rm and ptratio.
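The \(R^2\) values used for this grouping can be collected from the list of univariate fits (a sketch, assuming the lm_list built in question (a)):

r_squared <- sapply(lm_list, function(fit) summary(fit)$r.squared)
sort(round(r_squared, 3), decreasing = TRUE)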

  • Question (b)
lm <- lm(crim ~ ., data = boston)
lm %>% summary() %>% print_summary_lm()
Results of the linear model on the boston dataset.
  • Formula: crim ~ .
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 84.97 0 40.4 -9.92 -6.37 -4.51 -2.12 -0.35 1.02 2.49 4.66 75.05
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 17.03323 7.23490 2.35431 0.0189491
    zn 0.04486 0.01873 2.39431 0.0170249
    indus -0.06385 0.08341 -0.76558 0.4442940
    chas -0.74913 1.18015 -0.63478 0.5258670
    nox -10.31353 5.27554 -1.95497 0.0511520
    rm 0.43013 0.61283 0.70188 0.4830888
    age 0.00145 0.01793 0.08098 0.9354878
    dis -0.98718 0.28182 -3.50289 0.0005022
    rad 0.58821 0.08805 6.68045 6.4605e-11
    tax -0.00378 0.00516 -0.73319 0.4637927
    ptratio -0.27108 0.18645 -1.45390 0.1466113
    black -0.00754 0.00367 -2.05196 0.0407023
    lstat 0.12621 0.07572 1.66671 0.0962084
    medv -0.19889 0.06052 -3.28652 0.0010868
  • Residual standard error: 6.439 on 492 degrees of freedom.
  • Multiple \(R^2\): 0.454.
  • Adjusted \(R^2\): 0.44.
  • F-statistic: 31.47 on 13 and 492, p-value: <2e-16.

The multiple linear regression appears to perform quite well on these data. The model is significant and its \(R^2\) is better than that of every univariate linear model. We can reject the null hypothesis \(H_0: \beta_j = 0\) for the features zn, dis, rad, black and medv. The significant variables are not the same as in the univariate models; this may be the result of some collinearity problems.

  • Question (c)
coefs <- tibble(variable = names(lm$coefficients), beta_multi = lm$coefficients) %>% slice(-1)
coefs$beta_uni <- lm_list %>% map(~ .x$coefficients[-1]) %>% unlist()
Figure 3.12: Plot of the coefficients from the univariate models against the multivariate model.

As we see, the estimated coefficients can be very different depending on whether we fit a multivariate model or multiple univariate models.

  • Question (d)
lm_list <- vector("list", length = length(boston)-1)
for(i in seq(1, length(boston)-1)){
  name <- variable[i]
  lm_list[[i]] <- lm(boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3))
}
names(lm_list) <- variable
Variable zn

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 88.95 0 69.68 -4.82 -4.79 -4.76 -4.61 -1.29 0.47 5.91 10.94 84.13
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 4.84605 0.43298 11.19220 < 2.22e-16
    boston[[name]] -0.33219 0.10981 -3.02517 0.0026123
    I(boston[[name]]^2) 0.00648 0.00386 1.67912 0.0937505
    I(boston[[name]]^3) -0.00004 0.00003 -1.20301 0.2295386
  • Residual standard error: 8.372 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.058.
  • Adjusted \(R^2\): 0.053.
  • F-statistic: 10.349 on 3 and 502, p-value: 1.28e-06.

Variable indus

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 87.99 0 54.78 -8.28 -7.33 -6.22 -2.51 0.05 0.76 2.43 6.53 79.71
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 3.66257 1.57398 2.32694 0.020365
    boston[[name]] -1.96521 0.48199 -4.07729 5.2971e-05
    I(boston[[name]]^2) 0.25194 0.03932 6.40701 3.4202e-10
    I(boston[[name]]^3) -0.00698 0.00096 -7.29205 1.1964e-12
  • Residual standard error: 7.423 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.26.
  • Adjusted \(R^2\): 0.255.
  • F-statistic: 58.688 on 3 and 502, p-value: <2e-16.

Variable chas

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 88.97 0 73.76 -3.74 -3.72 -3.71 -3.66 -3.44 0.02 7.11 12.04 85.23
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 3.74445 0.39611 9.45302 < 2e-16
    boston[[name]] -1.89278 1.50612 -1.25673 0.20943
  • Residual standard error: 8.597 on 504 degrees of freedom.
  • Multiple \(R^2\): 0.003.
  • Adjusted \(R^2\): 0.001.
  • F-statistic: 1.579 on 1 and 504, p-value: 0.209.

Variable nox

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 87.41 0 52.01 -9.11 -7.35 -6.16 -2.07 -0.25 0.74 1.92 7.88 78.3
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 233.0866 33.6431 6.92821 1.3119e-11
    boston[[name]] -1279.3712 170.3975 -7.50816 2.7584e-13
    I(boston[[name]]^2) 2248.5440 279.8993 8.03340 6.8113e-15
    I(boston[[name]]^3) -1245.7029 149.2816 -8.34465 6.9611e-16
  • Residual standard error: 7.234 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.297.
  • Adjusted \(R^2\): 0.293.
  • F-statistic: 70.687 on 3 and 502, p-value: <2e-16.

Variable rm

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 105.7 0 68.97 -18.49 -5.13 -4.34 -3.47 -2.22 -0.01 6.82 11.58 87.22
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 112.62460 64.51724 1.74565 0.081483
    boston[[name]] -39.15014 31.31149 -1.25034 0.211756
    I(boston[[name]]^2) 4.55090 5.00986 0.90839 0.364109
    I(boston[[name]]^3) -0.17448 0.26375 -0.66153 0.508575
  • Residual standard error: 8.33 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.068.
  • Adjusted \(R^2\): 0.062.
  • F-statistic: 12.168 on 3 and 502, p-value: 1.07e-07.

Variable age

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 92.6 0 61.1 -9.76 -7.8 -6.42 -2.67 -0.52 0.02 4.38 8.77 82.84
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.54876 2.76914 -0.92042 0.3577971
    boston[[name]] 0.27365 0.18638 1.46826 0.1426608
    I(boston[[name]]^2) -0.00723 0.00364 -1.98779 0.0473773
    I(boston[[name]]^3) 0.00006 0.00002 2.72373 0.0066799
  • Residual standard error: 7.84 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.174.
  • Adjusted \(R^2\): 0.169.
  • F-statistic: 35.306 on 3 and 502, p-value: <2e-16.

Variable dis

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 87.14 0 53.43 -10.76 -8.45 -6.51 -2.59 0.03 1.27 3.41 6.72 76.38
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 30.04761 2.44587 12.28504 < 2.22e-16
    boston[[name]] -15.55435 1.73597 -8.96005 < 2.22e-16
    I(boston[[name]]^2) 2.45207 0.34642 7.07833 4.9412e-12
    I(boston[[name]]^3) -0.11860 0.02040 -5.81354 1.0888e-08
  • Residual standard error: 7.331 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.278.
  • Adjusted \(R^2\): 0.274.
  • F-statistic: 64.374 on 3 and 502, p-value: <2e-16.

Variable rad

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 86.6 0 44.39 -10.38 -7.81 -5.29 -0.41 -0.27 0.18 1.55 3.11 76.22
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.60554 2.05011 -0.29537 0.76783
    boston[[name]] 0.51274 1.04360 0.49132 0.62342
    I(boston[[name]]^2) -0.07518 0.14854 -0.50610 0.61301
    I(boston[[name]]^3) 0.00321 0.00456 0.70311 0.48231
  • Residual standard error: 6.682 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.4.
  • Adjusted \(R^2\): 0.396.
  • F-statistic: 111.573 on 3 and 502, p-value: <2e-16.

Variable tax

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 90.22 0 46.69 -13.27 -7.34 -5.14 -1.39 0.05 0.54 1.34 3.76 76.95
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 19.18358 11.79555 1.62634 0.10450
    boston[[name]] -0.15331 0.09568 -1.60235 0.10971
    I(boston[[name]]^2) 0.00036 0.00024 1.48766 0.13747
    I(boston[[name]]^3) 0.00000 0.00000 -1.16679 0.24385
  • Residual standard error: 6.854 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.369.
  • Adjusted \(R^2\): 0.365.
  • F-statistic: 97.805 on 3 and 502, p-value: <2e-16.

Variable ptratio

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 505 89.53 0 65.57 -6.83 -6.3 -5.99 -4.15 -1.65 1.41 4.73 9.51 82.7
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 477.18405 156.79498 3.04336 0.0024621
    boston[[name]] -82.36054 27.64394 -2.97933 0.0030287
    I(boston[[name]]^2) 4.63535 1.60832 2.88210 0.0041196
    I(boston[[name]]^3) -0.08476 0.03090 -2.74328 0.0063005
  • Residual standard error: 8.122 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.114.
  • Adjusted \(R^2\): 0.108.
  • F-statistic: 21.484 on 3 and 502, p-value: 4.17e-13.

Variable black

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 99.89 0 62.9 -13.1 -4.32 -2.99 -2.34 -2.13 -1.44 5.87 11.82 86.79
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 18.26370 2.30490 7.92385 1.4971e-14
    boston[[name]] -0.08356 0.05633 -1.48343 0.13859
    I(boston[[name]]^2) 0.00021 0.00030 0.71624 0.47418
    I(boston[[name]]^3) 0.00000 0.00000 -0.60777 0.54362
  • Residual standard error: 7.955 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.15.
  • Adjusted \(R^2\): 0.145.
  • F-statistic: 29.492 on 3 and 502, p-value: <2e-16.

Variable lstat

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 98.59 0 57.86 -15.23 -6.58 -4.93 -2.15 -0.49 0.07 4.03 8.3 83.35
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.20097 2.02865 0.59200 0.554115
    boston[[name]] -0.44907 0.46489 -0.96596 0.334530
    I(boston[[name]]^2) 0.05578 0.03012 1.85218 0.064587
    I(boston[[name]]^3) -0.00086 0.00057 -1.51702 0.129891
  • Residual standard error: 7.629 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.218.
  • Adjusted \(R^2\): 0.213.
  • F-statistic: 46.629 on 3 and 502, p-value: <2e-16.

Variable medv

  • Formula: boston$crim ~ boston[[name]] + I(boston[[name]]^2) + I(boston[[name]]^3)
  • Residuals
  • Name NA_num Unique Range Mean Variance Minimum Q05 Q10 Q25 Q50 Q75 Q90 Q95 Maximum
    Residuals 0 506 98.08 0 42.9 -24.43 -6.47 -4.42 -1.98 -0.44 0.44 3.59 7.06 73.65
  • Coefficients
  • Variable Estimate Std. Error t value Pr(>|t|)
    (Intercept) 53.16554 3.35631 15.84047 < 2.22e-16
    boston[[name]] -5.09483 0.43383 -11.74379 < 2.22e-16
    I(boston[[name]]^2) 0.15550 0.01719 9.04552 < 2.22e-16
    I(boston[[name]]^3) -0.00149 0.00020 -7.31197 1.0465e-12
  • Residual standard error: 6.569 on 502 degrees of freedom.
  • Multiple \(R^2\): 0.42.
  • Adjusted \(R^2\): 0.417.
  • F-statistic: 121.272 on 3 and 502, p-value: <2e-16.

Some variables are statistically significant up to their cubic term (indus, nox, dis, ptratio and medv). So, we can conclude that there is a non-linear relationship between the response and some of the predictors. However, we should also take a look at the collinearity between the predictors, because some of them seem to be highly correlated, and that can change the results a lot.