# Linear Regression and ANOVA shaken and stirred (Part 2)

Tue, Mar 21, 2017*Updated 2020-03-02*

# Motivation

In the first part of this entry I showed some `mtcars`

examples, where `am`

can be useful to explain ANOVA as its observations are defined as:
\[
am_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases}
\]

Now I’ll show another example to continue the last example from Part 1 and I’ll move to something involving more variables.

# ANOVA with one dummy variable

Consider a model where the outcome is `mpg`

and the design matrix is \(\renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}} X = (\vec{1} \: \vec{x}_2)\).

From the last entry let \[ x_2 = \begin{cases}1 &\text{ if car } i \text{ is automatic} \cr 0 &\text{ if car } i \text{ is manual}\end{cases} \]

This will lead to this estimate: \[ \hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 - \bar{y}_1\end{bmatrix} \] Fitting the model gives:

```
y <- mtcars$mpg
x1 <- mtcars$am
x2 <- ifelse(x1 == 1, 0, 1)
fit <- lm(y ~ x2)
summary(fit)
```

```
Call:
lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-9.3923 -3.0923 -0.2974 3.2439 9.5077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.392 1.360 17.941 < 2e-16 ***
x2 -7.245 1.764 -4.106 0.000285 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
```

To see the relationship between the estimates and the group means I need additional steps:

```
x0 <- rep(1,length(y))
X <- cbind(x0,x2)
beta <- solve(t(X)%*%X) %*% (t(X)%*%y)
beta
```

```
[,1]
x0 24.392308
x2 -7.244939
```

I obtained the same estimates with the `lm`

command so now I calculate the group means:

```
x1 <- ifelse(x1 == 0, NA, x1)
x2 <- ifelse(x2 == 0, NA, x2)
m1 <- mean(y*x1, na.rm = TRUE)
m2 <- mean(y*x2, na.rm = TRUE)
beta0 <- m1
beta2 <- m2 - m1
beta0;beta2
```

`[1] 24.39231`

`[1] -7.244939`

In this case this means that the slope for the two groups is the same but the intercept is different, and therefore exists a negative effect of automatic transmission on miles per gallon in average terms.

Again I’ll verify the equivalency between `lm`

and `aov`

in this particular case:

```
y <- mtcars$mpg
x1 <- mtcars$am
x2 <- ifelse(x1 == 1, 0, 1)
fit2 <- aov(y ~ x2)
fit2$coefficients
```

```
(Intercept) x2
24.392308 -7.244939
```

I can calculate the residuals by hand:

```
fit3 <- lm(mpg ~ am, data = mtcars)
mean_mpg <- mean(mtcars$mpg)
fitted_mpg <- fit3$coefficients[1] + fit3$coefficients[2]*mtcars$am
observed_mpg <- mtcars$mpg
TSS <- sum((observed_mpg - mean_mpg)^2)
ESS <- sum((fitted_mpg - mean_mpg)^2)
RSS <- sum((observed_mpg - fitted_mpg)^2)
TSS;ESS;RSS
```

`[1] 1126.047`

`[1] 405.1506`

`[1] 720.8966`

Here it’s verified that \(TSS = ESS + RSS\) but aside from that I can extract information from `aov`

:

`summary(fit2)`

```
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.2 405.2 16.86 0.000285 ***
Residuals 30 720.9 24.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

And check that, as expected, \(ESS\) is the variance explained by `x2`

.

I also can run ANOVA over `lm`

with:

`anova(fit)`

```
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.15 405.15 16.86 0.000285 ***
Residuals 30 720.90 24.03
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The table provides information on the effect of `x2`

over `y`

. In this case the null hypothesis is rejected because of the large F-value and the associated p-values.

Considering a 0.05 significance threshold I can say, with 95% of confidence, that the regression slope is statistically different from zero or that there is a difference in group means between automatic and manual transmission.

# ANOVA with three dummy variables

Now let’s explore something more complex than `am`

. Reading the documentation I wonder if `cyl`

has an impact on `mpg`

so I explore that variable:

`str(mtcars)`

```
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```

`unique(mtcars$cyl)`

`[1] 6 4 8`

One (wrong) possibility is to write:

```
y <- mtcars$mpg
x1 <- mtcars$cyl; x1 <- ifelse(x1 == 4, 1, 0)
x2 <- mtcars$cyl; x2 <- ifelse(x1 == 6, 1, 0)
x3 <- mtcars$cyl; x3 <- ifelse(x1 == 8, 1, 0)
fit <- lm(y ~ x1 + x2 + x3)
summary(fit)
```

```
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-6.2476 -2.2846 -0.4556 2.6774 7.2364
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.6476 0.7987 20.844 < 2e-16 ***
x1 10.0160 1.3622 7.353 3.44e-08 ***
x2 NA NA NA NA
x3 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.66 on 30 degrees of freedom
Multiple R-squared: 0.6431, Adjusted R-squared: 0.6312
F-statistic: 54.06 on 1 and 30 DF, p-value: 3.436e-08
```

Here the `NAs`

mean that there are variables that are linearly related to the other variables (e.g. the variable pointed with `NA`

is an average of one or more of the rest of the variables, like \(x_2 = 2x_1 + x_3\) or another linear combination), then there’s no unique solution to the regression without dropping variables.

My model will include these variables: \[ x_2 = \begin{cases}1 &\text{ if car } i \text{ has 6 cylinders} \cr 0 &\text{ otherwise }\end{cases} \quad \quad x_3 = \begin{cases}1 &\text{ if car } i \text{ has 8 cylinders} \cr 0 &\text{ otherwise }\end{cases} \]

In this particular case regression coefficients are given by this estimate: \[ \hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 \end{bmatrix} \]

But R has a command called `as.factor()`

that is useful in these cases and also can save you some lines of code in other cases:

```
fit2 <- lm(mpg ~ as.factor(cyl), data = mtcars)
summary(fit2)
```

```
Call:
lm(formula = mpg ~ as.factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
as.factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
as.factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
```

The `aov`

version of this is:

```
fit3 <- aov(mpg ~ as.factor(cyl), data = mtcars)
TukeyHSD(fit3)
```

```
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars)
$`as.factor(cyl)`
diff lwr upr p adj
6-4 -6.920779 -10.769350 -3.0722086 0.0003424
8-4 -11.563636 -14.770779 -8.3564942 0.0000000
8-6 -4.642857 -8.327583 -0.9581313 0.0112287
```

As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.