Linear Regression and ANOVA shaken and stirred (Part 2)

R

Linear models

These concepts are understood as separate concepts most of the times. The truth is they are extremely related to each other.

Author

Mauricio “Pachá” Vargas S.

Published

March 21, 2017

Updated 2020-03-02

Motivation

In the first part of this entry I showed some mtcars examples, where am can be useful to explain ANOVA as its observations are defined as: \[
am_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases}
\]

Now I’ll show another example to continue the last example from Part 1 and I’ll move to something involving more variables.

ANOVA with one dummy variable

Consider a model where the outcome is mpg and the design matrix is \(\renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}} X = (\vec{1} \: \vec{x}_2)\).

From the last entry let \[
x_2 = \begin{cases}1 &\text{ if car } i \text{ is automatic} \cr 0 &\text{ if car } i \text{ is manual}\end{cases}
\]

This will lead to this estimate: \[
\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 - \bar{y}_1\end{bmatrix}
\] Fitting the model gives:

In this case this means that the slope for the two groups is the same but the intercept is different, and therefore exists a negative effect of automatic transmission on miles per gallon in average terms.

Again I’ll verify the equivalency between lm and aov in this particular case:

Here it’s verified that \(TSS = ESS + RSS\) but aside from that I can extract information from aov:

summary(fit2)

Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.2 405.2 16.86 0.000285 ***
Residuals 30 720.9 24.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And check that, as expected, \(ESS\) is the variance explained by x2.

I also can run ANOVA over lm with:

anova(fit)

Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.15 405.15 16.86 0.000285 ***
Residuals 30 720.90 24.03
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table provides information on the effect of x2 over y. In this case the null hypothesis is rejected because of the large F-value and the associated p-values.

Considering a 0.05 significance threshold I can say, with 95% of confidence, that the regression slope is statistically different from zero or that there is a difference in group means between automatic and manual transmission.

ANOVA with three dummy variables

Now let’s explore something more complex than am. Reading the documentation I wonder if cyl has an impact on mpg so I explore that variable:

Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-6.2476 -2.2846 -0.4556 2.6774 7.2364
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.6476 0.7987 20.844 < 2e-16 ***
x1 10.0160 1.3622 7.353 3.44e-08 ***
x2 NA NA NA NA
x3 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.66 on 30 degrees of freedom
Multiple R-squared: 0.6431, Adjusted R-squared: 0.6312
F-statistic: 54.06 on 1 and 30 DF, p-value: 3.436e-08

Here the NAs mean that there are variables that are linearly related to the other variables (e.g. the variable pointed with NA is an average of one or more of the rest of the variables, like \(x_2 = 2x_1 + x_3\) or another linear combination), then there’s no unique solution to the regression without dropping variables.

My model will include these variables: \[
x_2 = \begin{cases}1 &\text{ if car } i \text{ has 6 cylinders} \cr 0 &\text{ otherwise }\end{cases}
\quad \quad
x_3 = \begin{cases}1 &\text{ if car } i \text{ has 8 cylinders} \cr 0 &\text{ otherwise }\end{cases}
\]

In this particular case regression coefficients are given by this estimate: \[
\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 \end{bmatrix}
\]

But R has a command called as.factor() that is useful in these cases and also can save you some lines of code in other cases:

fit2 <-lm(mpg ~as.factor(cyl), data = mtcars)summary(fit2)

Call:
lm(formula = mpg ~ as.factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
as.factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
as.factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09

The aov version of this is:

fit3 <-aov(mpg ~as.factor(cyl), data = mtcars)TukeyHSD(fit3)

Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars)
$`as.factor(cyl)`
diff lwr upr p adj
6-4 -6.920779 -10.769350 -3.0722086 0.0003424
8-4 -11.563636 -14.770779 -8.3564942 0.0000000
8-6 -4.642857 -8.327583 -0.9581313 0.0112287

As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.