`coef(glm(am ~ mpg, data = mtcars, family = binomial))`

```
(Intercept) mpg
-6.6035267 0.3070282
```

R

Statistics

Linear models

Logistic regression is using an exponential function and using the linear model repeated times to predict the probability of a binary outcome.

Author

Mauricio “Pachá” Vargas S.

Published

May 1, 2024

One of the most common comments I hear is that logistic regression (also called Binomial regression) is some kind of “advanced magic”, “machine learning”, “artificial intelligence” or “big data”. This is not true.

In this post, I will show you how logistic regression works and why it is not as complex as some people think.

Logistic regression consists in transforming the data and then using linear regression (or ordinary least squares) repeated multiple times until achieving convergence.

The differences with the classic linear model are:

- It uses a binary dependent variable (i.e., heads or tails when we flip a coin, which we can express as 0/1). The linear model takes a continuous variable (i.e., the temperature in a city) as the dependent variable.
- It uses an exponential function to transform the data before fitting a regression.
- It returns estimated probabilities (i.e., 0.42, 0.91, …, 0.47) that we can convert to 0/1 by using a threshold (i.e., all the values over 0.5 become 1).

Before we continue, and in case you want to read more, the Logistic regression is a part of the Generalized Linear Models (GLM) family. This family includes other models like Gaussian (i.e., “normal” or “classic” regression), Poisson, Gamma, and others that have the exponential function in their formula. Some books such as Casella and Berger refer to them as the “exponential family”.

In base R, we can use the `glm`

function to fit a logistic regression model. Let’s use the `mtcars`

dataset to predict if a car has automatic transmission (`am`

, coded as 0 for automatic and 1 for manual) as a function of the miles per gallon (`mpg`

).

R commands follow the previous explanation about a general family of models. We will use the `family`

argument to specify the type of model we want to fit, which in this case is `binomial`

.

Leaving the statistical significance aside, the second estimated coefficient (0.307) indicates that more miles per gallon are associated with a higher probability of having a manual transmission.

We can predict the probability of having a manual transmission for the cars in the dataset.

```
am_pred <- predict(glm(am ~ mpg, data = mtcars, family = binomial), type = "response")
head(am_pred)
```

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
0.4610951 0.4610951 0.5978984 0.4917199
Hornet Sportabout Valiant
0.2969009 0.2599331
```

To convert this probability to a binary outcome, we can use a threshold of 0.6 (arbitrary).

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
0 0 0 0
Hornet Sportabout Valiant
0 0
```

This can be compared with the actual data.

For a better comparison, we can use a confusion matrix, which shows the number of true positives, true negatives, false positives, and false negatives.

This matrix shows that the proposed model is not good at predicting the transmission type of the cars in the dataset. There are eight cases (out of 32) where the model predicted the incorrect transmission type, which can be fixed by adding more variables to the model, but that would be for another post.

It is possible to obtain the same coefficients from the `glm()`

function by transforming the data following the Binomial regression “recipe” and then using `lm()`

repeated times until reaching convergence. However, this is very unpractical and just for explanatory purposes.

```
# original variables
y <- mtcars$am
x <- mtcars$mpg
# apply the recipe: define the new variables
mu <- (y + 0.5) / 2
eta <- log(mu / (1 - mu))
z <- eta + (y - mu) / mu
# iterate with initial values for the difference and the sum of sq residuals
dif <- 1
rss1 <- 1
tol <- 1e-10
while (abs(dif) > tol) {
fit <- lm(z ~ x, weights = mu)
eta <- z - fit$residuals
mu <- exp(eta) / (1 + exp(eta))
z <- eta + (y - mu) / mu
res <- y - mu
rss2 <- sum(res^2)
dif <- rss2 - rss1
rss1 <- rss2
}
coef(lm(z ~ x, weights = mu))
```

```
(Intercept) x
-6.6035267 0.3070282
```

That’s it, in a few lines we wrote a straightforward code to fit a logistic regression. This is not magic, machine learning, or artificial intelligence. It is just a linear model repeated multiple times and it is tractable.

I wrote the The Hitchhiker’s Guide to Linear Models to covers regressions starting with high school math and then all the explanations until reaching Binomial, Poisson and Tobit models. You can get it for free or paying a suggested price of 10 USD.