Logistic regression is not advanced ‘machine learning’ or ‘artificial intelligence’

R
Statistics
Linear models
Logistic regression uses an exponential function and fits the linear model repeatedly to predict the probability of a binary outcome.
Author

Mauricio “Pachá” Vargas S.

Published

May 1, 2024

Motivation

One of the most common comments I hear is that logistic regression (also called Binomial regression) is some kind of “advanced magic”, “machine learning”, “artificial intelligence” or “big data”. This is not true.

In this post, I will show you how logistic regression works and why it is not as complex as some people think.

Explanation

Logistic regression consists of transforming the data and then fitting a linear regression (ordinary least squares) repeatedly until convergence is reached.

The differences with the classic linear model are:

  1. It uses a binary dependent variable (e.g., heads or tails when we flip a coin, which we can express as 0/1). The linear model takes a continuous variable (e.g., the temperature in a city) as the dependent variable.
  2. It uses an exponential function to transform the data before fitting a regression.
  3. It returns estimated probabilities (e.g., 0.42, 0.91, …, 0.47) that we can convert to 0/1 by using a threshold (e.g., all the values over 0.5 become 1).

Before we continue, and in case you want to read more, logistic regression is part of the Generalized Linear Models (GLM) family. This family includes other models such as the Gaussian (i.e., “normal” or “classic” regression), Poisson, and Gamma, all of which have the exponential function in their formula. Books such as Casella and Berger refer to these distributions as the “exponential family”.
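To make the exponential transformation concrete, here is a minimal R sketch of the logistic (inverse logit) function, which maps any value of the linear predictor into a probability between 0 and 1. The name inv_logit is only for illustration; it is not a base R function.

inv_logit <- function(eta) exp(eta) / (1 + exp(eta))
inv_logit(c(-2, 0, 2))
[1] 0.1192029 0.5000000 0.8807971

This is the same expression, exp(eta) / (1 + exp(eta)), that appears in the manual fitting loop later in this post.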

Worked example

In base R, we can use the glm function to fit a logistic regression model. Let’s use the mtcars dataset to predict whether a car has a manual transmission (am, coded as 0 for automatic and 1 for manual) as a function of its fuel efficiency in miles per gallon (mpg).

The R command follows the previous explanation about a general family of models: the family argument specifies the type of model we want to fit, which in this case is binomial.

coef(glm(am ~ mpg, data = mtcars, family = binomial))
(Intercept)         mpg 
 -6.6035267   0.3070282 

Leaving the statistical significance aside, the second estimated coefficient (0.307) indicates that more miles per gallon are associated with a higher probability of having a manual transmission.
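A common way to read this coefficient is on the odds scale: exponentiating it gives the multiplicative change in the odds of having a manual transmission for each additional mile per gallon, roughly 1.36 here.

exp(coef(glm(am ~ mpg, data = mtcars, family = binomial))[["mpg"]])
[1] 1.359379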

We can predict the probability of having a manual transmission for the cars in the dataset.

am_pred <- predict(glm(am ~ mpg, data = mtcars, family = binomial), type = "response")
head(am_pred)
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
        0.4610951         0.4610951         0.5978984         0.4917199 
Hornet Sportabout           Valiant 
        0.2969009         0.2599331 

To convert these probabilities to a binary outcome, we can use a threshold of 0.6 (an arbitrary choice).

am_pred_binary <- ifelse(am_pred > 0.6, 1, 0)
head(am_pred_binary)
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
                0                 0                 0                 0 
Hornet Sportabout           Valiant 
                0                 0 

This can be compared with the actual data.

head(mtcars$am)
[1] 1 1 1 0 0 0

For a better comparison, we can use a confusion matrix, which shows the number of true positives, true negatives, false positives, and false negatives.

table(am_pred_binary, mtcars$am)
              
am_pred_binary  0  1
             0 18  7
             1  1  6

This matrix shows that the proposed model is not particularly good at predicting the transmission type of the cars in the dataset. There are eight cases (out of 32) where the model predicted the incorrect transmission type, which might be improved by adding more variables to the model, but that would be for another post.
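The same matrix can be summarised as an overall accuracy, i.e., the share of correct predictions (24 out of 32):

mean(am_pred_binary == mtcars$am)
[1] 0.75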

It is possible to obtain the same coefficients from the glm() function by transforming the data following the Binomial regression “recipe” and then using lm() repeatedly until reaching convergence. However, this is very impractical and shown here just for explanatory purposes.

# original variables
y <- mtcars$am
x <- mtcars$mpg

# apply the recipe: define the new variables
mu <- (y + 0.5) / 2       # initial fitted probabilities
eta <- log(mu / (1 - mu)) # linear predictor (logit of mu)
z <- eta + (y - mu) / mu  # working response for the weighted regression

# iterate with initial values for the difference and the sum of sq residuals

dif <- 1
rss1 <- 1
tol <- 1e-10

while (abs(dif) > tol) {
  fit <- lm(z ~ x, weights = mu)  # weighted least squares step
  eta <- z - fit$residuals        # fitted values on the linear predictor scale
  mu <- exp(eta) / (1 + exp(eta)) # back to probabilities with the inverse logit
  z <- eta + (y - mu) / mu        # update the working response
  res <- y - mu
  rss2 <- sum(res^2)
  dif <- rss2 - rss1              # stop when the sum of squared residuals stabilizes
  rss1 <- rss2
}

coef(lm(z ~ x, weights = mu))
(Intercept)           x 
 -6.6035267   0.3070282 
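
As an extra check, we can verify that both procedures agree up to numerical tolerance (this uses the z and mu left in the workspace by the loop above):

all.equal(
  unname(coef(glm(am ~ mpg, data = mtcars, family = binomial))),
  unname(coef(lm(z ~ x, weights = mu)))
)

With the convergence tolerance used in the loop, this should return TRUE or, at worst, report a tiny mean relative difference.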

That’s it: in a few lines we wrote straightforward code to fit a logistic regression. This is not magic, machine learning, or artificial intelligence. It is just a linear model fitted repeatedly, and it is tractable.

If you liked this post

I wrote The Hitchhiker’s Guide to Linear Models, which covers regression starting with high school math and builds up the explanations until reaching Binomial, Poisson, and Tobit models. You can get it for free or pay a suggested price of 10 USD.