Efficient fitting of linear and generalized linear models using just base R. As an alternative to lm() and glm(), this package provides elm() and eglm(), with a significant speedup when the number of observations is larger than the number of parameters to estimate. The speed gains come from reducing the N x P model matrix to a P x P matrix, and the best computational performance is obtained when R is linked against OpenBLAS, Intel MKL, or another optimized BLAS library. This implementation aims to be compatible with the 'broom' and 'sandwich' packages for summary statistics and clustering by providing S3 methods.
This package borrows ideas from the glm2, speedglm, fastglm, and fixest packages, but the implementations here keep the functions and outputs as close as possible to those of the stats package. This makes the functions provided here compatible with packages such as sandwich for robust estimation, even if that attenuates the speed gains.
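The N x P to P x P reduction can be sketched in base R: instead of decomposing the full N x P model matrix, one can solve the P x P normal equations built with crossprod(). This is only an illustration of the idea on simulated data, not the package's exact internals:

```r
set.seed(123)
n <- 1e4
p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% rnorm(p) + rnorm(n))

# QR decomposition of the full n x p matrix (the lm.fit approach)
beta_qr <- qr.coef(qr(X), y)

# Solving the p x p normal equations instead: X'X is only p x p,
# so the expensive step no longer scales with n at the solve stage
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))

all.equal(unname(beta_qr), beta_ne)
```

For well-conditioned problems both routes give the same coefficients; the normal-equations route trades some numerical stability for speed, which is why testing against lm/glm matters.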
The greatest strength of this package is its testing. With more than 1,600 tests (and counting), we aim to do exactly what lm/glm do, even in edge cases, but faster.
The ultimate aim of the project is to produce a package that uses the same Cdqrls function that the stats package uses for model fitting.
```r
formula <- "mpg ~ I(wt^2)"

lm(formula, data = mtcars)$coefficients
#> (Intercept)     I(wt^2)
#>  28.0510597  -0.7058275

elm(formula, data = mtcars)$coefficients
#> (Intercept)     I(wt^2)
#>  28.0510597  -0.7058275
```

Please read the documentation for the full details.
You can install the released version of eflm from CRAN with:

```r
install.packages("eflm")
```

And the development version with:

```r
remotes::install_github("pachamaltese/eflm")
```

Let's fit two computationally demanding models from Yotov et al. (2016).
The dataset for the benchmark was also taken from Yotov et al. (2016) and consists of a 28,152 x 8 data frame with 6 numeric and 2 categorical columns:
```r
# A tibble: 28,152 x 8
    year trade   dist  cntg  lang  clny exp_year imp_year
   <int> <dbl>  <dbl> <int> <int> <int> <chr>    <chr>
 1  1986 27.8  12045.     0     0     0 ARG1986  AUS1986
 2  1986  3.56 11751.     0     0     0 ARG1986  AUT1986
 3  1986 96.1  11305.     0     0     0 ARG1986  BEL1986
 4  1986  3.13 12116.     0     0     0 ARG1986  BGR1986
 5  1986 52.7   1866.     1     1     0 ARG1986  BOL1986
 6  1986 405.   2392.     1     0     0 ARG1986  BRA1986
 7  1986 48.3   9391.     0     0     0 ARG1986  CAN1986
 8  1986 23.6  11233.     0     0     0 ARG1986  CHE1986
 9  1986 109.   1157.     1     1     0 ARG1986  CHL1986
10  1986 161.  19110.     0     0     0 ARG1986  CHN1986
# … with 28,142 more rows
```

The variables are:
This test was conducted on a DigitalOcean droplet with 2 dedicated CPUs and 4 GB of RAM.
The general equation for this model is:
By running regressions on cumulative subsets of the data for 1986, …, 2006 (e.g. regress on 1986, then on 1986 and 1990, …, then on 1986 to 2006), we obtain the following fitting times and memory allocations depending on the dimensions of the design matrix:
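As a hedged illustration of this kind of model's structure, the sketch below fits a PPML-style gravity regression with base glm() on synthetic data shaped like the columns shown above (the real benchmark data ships with the package, and the exact benchmark formula may differ). Per the package description, eglm() accepts the same arguments as glm(), so the same call should work with eglm() once eflm is installed:

```r
# Synthetic stand-in for the benchmark data frame (assumed shape only)
set.seed(42)
d <- data.frame(
  trade    = rpois(300, 50),
  dist     = runif(300, 100, 20000),
  cntg     = rbinom(300, 1, 0.1),
  lang     = rbinom(300, 1, 0.2),
  clny     = rbinom(300, 1, 0.1),
  exp_year = factor(sample(paste0("E", 1986:1990), 300, replace = TRUE)),
  imp_year = factor(sample(paste0("I", 1986:1990), 300, replace = TRUE))
)

# PPML-style gravity specification with exporter-year and importer-year
# fixed effects entering as factors
fit <- glm(trade ~ log(dist) + cntg + lang + clny + exp_year + imp_year,
           family = quasipoisson(link = "log"), data = d)
length(coef(fit))
```

With many year levels the factor columns blow up the width of the design matrix, which is exactly the regime where reducing N x P to P x P pays off.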


This test was conducted on a DigitalOcean droplet with 2 dedicated CPUs and 4 GB of RAM.
The general equation for this model is:
By running regressions on cumulative subsets of the data for 1986, …, 2006 (e.g. regress on 1986, then on 1986 and 1990, …, then on 1986 to 2006), we obtain the following fitting times depending on the dimensions of the design matrix:


“If I have seen further it is by standing on the shoulders of Giants.”
The original generalized linear model implementation via iteratively reweighted least squares (IRLS) for any family was written in R by Simon Davies in December 1995. It was later improved by Thomas Lumley in April 1997, and then by other developers. In 2021, their work was adapted by me, Mauricio 'Pachá' Vargas Sepúlveda, for large datasets.
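The IRLS procedure that this lineage of code implements can be sketched in a few lines of base R. This is a deliberately simplified Poisson/log-link version for illustration only; the real glm.fit (and eglm) handle arbitrary families, offsets, prior weights, and step-halving:

```r
# Minimal IRLS sketch for a Poisson GLM with log link (illustration only)
irls_poisson <- function(X, y, tol = 1e-8, maxit = 25L) {
  beta <- numeric(ncol(X))
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)               # inverse of the log link
    z   <- eta + (y - mu) / mu    # working response
    w   <- mu                     # working weights (Poisson variance = mu)
    # Each step is a weighted least squares solve on a p x p system
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

X <- cbind(1, mtcars$wt)
y <- mtcars$carb
cbind(irls = irls_poisson(X, y),
      glm  = coef(glm(carb ~ wt, family = poisson(), data = mtcars)))
```

Both columns should agree to numerical tolerance, since glm() runs the same algorithm with more safeguards.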
This work received very positive feedback from: