Mauricio “Pachá” Vargas S.
May 1, 2024
One of the most common comments I hear is that logistic regression (also called Binomial regression) is some kind of “advanced magic”, “machine learning”, “artificial intelligence” or “big data”. This is not true.
In this post, I will show you how logistic regression works and why it is not as complex as some people think.
Logistic regression consists of transforming the data and then fitting a linear regression (ordinary least squares) repeatedly until convergence is achieved.
The differences with the classic linear model are that the outcome is a zero/one variable instead of a continuous one, that the model describes the probability of a one through the log-odds (logit) transformation instead of modelling the outcome directly, and that the coefficients are obtained by repeating a weighted linear regression until convergence instead of solving a single least squares problem.
Before we continue, and in case you want to read more, logistic regression is part of the Generalized Linear Models (GLM) family. This family includes other models such as Gaussian (i.e., “normal” or “classic” regression), Poisson, Gamma, and others whose formulas involve the exponential function. Some books, such as Casella and Berger, refer to them as the “exponential family”.
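To make the transformation explicit, here is one way to write the link used in this post, with the same mu and eta symbols that appear in the code at the end: the probability mu of a manual transmission is tied to the linear predictor eta through the log-odds.

$$
\eta = \beta_0 + \beta_1\,\text{mpg}, \qquad
\log\!\left(\frac{\mu}{1-\mu}\right) = \eta
\quad\Longleftrightarrow\quad
\mu = \frac{e^{\eta}}{1+e^{\eta}}
$$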
In base R, we can use the glm function to fit a logistic regression model. Let’s use the mtcars dataset to predict whether a car has a manual transmission (am, coded as 0 for automatic and 1 for manual) as a function of miles per gallon (mpg).
The R commands follow the previous explanation about a general family of models. We will use the family argument to specify the type of model we want to fit, which in this case is binomial.
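coef(glm(am ~ mpg, data = mtcars, family = binomial))
(Intercept) mpg
-6.6035267 0.3070282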
Leaving the statistical significance aside, the second estimated coefficient (0.307) indicates that more miles per gallon are associated with a higher probability of having a manual transmission.
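For a more concrete reading of that number, the coefficient is on the log-odds scale, so exponentiating it gives the multiplicative change in the odds of a manual transmission per additional mile per gallon. A minimal sketch, reusing the same model call:

# odds ratio for one additional mile per gallon
exp(coef(glm(am ~ mpg, data = mtcars, family = binomial))["mpg"])
# roughly 1.36: each extra mile per gallon multiplies the odds of a manual transmission by about 1.36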
We can predict the probability of having a manual transmission for the cars in the dataset.
am_pred <- predict(glm(am ~ mpg, data = mtcars, family = binomial), type = "response")
head(am_pred)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
0.4610951 0.4610951 0.5978984 0.4917199
Hornet Sportabout Valiant
0.2969009 0.2599331
To convert this probability to a binary outcome, we can use a threshold of 0.6 (arbitrary).
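One way to do it, as a minimal sketch (the name am_bin is hypothetical, it does not come from the original code):

am_bin <- ifelse(am_pred >= 0.6, 1, 0)
head(am_bin)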
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
0 0 0 0
Hornet Sportabout Valiant
0 0
This can be compared with the actual data.
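For example, a quick side-by-side look at the first cars (again with the hypothetical am_bin):

head(cbind(actual = mtcars$am, predicted = am_bin))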
For a better comparison, we can use a confusion matrix, which shows the number of true positives, true negatives, false positives, and false negatives.
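A minimal sketch with table(), with the actual transmission in the rows and the predicted transmission in the columns (still using the hypothetical am_bin and the 0.6 cutoff):

table(actual = mtcars$am, predicted = am_bin)
      predicted
actual  0  1
     0 18  1
     1  7  6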
This matrix shows that the proposed model is not particularly good at predicting the transmission type of the cars in the dataset. There are eight cases (out of 32) where the model predicts the incorrect transmission type. This could be improved by adding more variables to the model, but that is a topic for another post.
It is possible to obtain the same coefficients as the glm() function by transforming the data following the Binomial regression “recipe” and then calling lm() repeatedly until reaching convergence. However, this is impractical and shown here only for explanatory purposes.
# original variables
y <- mtcars$am
x <- mtcars$mpg
# apply the recipe: define the new variables
mu <- (y + 0.5) / 2
eta <- log(mu / (1 - mu))
z <- eta + (y - mu) / mu
# iterate with initial values for the difference and the sum of sq residuals
dif <- 1
rss1 <- 1
tol <- 1e-10
while (abs(dif) > tol) {
  # weighted least squares on the current working response
  fit <- lm(z ~ x, weights = mu)

  # update the linear predictor, the fitted probabilities and the working response
  eta <- z - fit$residuals
  mu <- exp(eta) / (1 + exp(eta))
  z <- eta + (y - mu) / mu

  # track the change in the sum of squared residuals to decide when to stop
  res <- y - mu
  rss2 <- sum(res^2)
  dif <- rss2 - rss1
  rss1 <- rss2
}
coef(lm(z ~ x, weights = mu))
(Intercept) x
-6.6035267 0.3070282
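As a quick sanity check, the two coefficient vectors can be compared numerically. A minimal sketch, comparing them up to a loose tolerance:

all.equal(
  unname(coef(lm(z ~ x, weights = mu))),
  unname(coef(glm(am ~ mpg, data = mtcars, family = binomial))),
  tolerance = 1e-6
)
# should return TRUE, since both approaches converge to the same coefficients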
That’s it: in a few lines we wrote straightforward code to fit a logistic regression. This is not magic, machine learning, or artificial intelligence. It is just a linear model fitted repeatedly, and it is tractable.
I wrote The Hitchhiker’s Guide to Linear Models, which covers regression starting from high school math and builds up the explanations until reaching Binomial, Poisson, and Tobit models. You can get it for free or pay a suggested price of 10 USD.