Introducing Capybara: Fast and Memory Efficient Fitting of Linear Models With High-Dimensional Fixed Effects

R
Statistics
Linear models
C++
Short demonstration.
Author

Mauricio “Pachá” Vargas S.

Published

January 19, 2024

About

Capybara is fast, small-footprint software that provides efficient functions for demeaning variables before estimating a GLM via Iteratively Weighted Least Squares (IWLS). This technique is particularly useful when estimating linear models with multiple group fixed effects.

The software can estimate GLMs from the Exponential Family as well as Negative Binomial models, but the focus here is the Poisson estimator because it is the one used for structural counterfactual analysis in International Trade. Note that the IWLS estimator is equivalent to the PPML estimator from Santos Silva and Tenreyro 2006.
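To make the IWLS idea concrete, here is a minimal base R sketch of the iteration for a Poisson model with a log link, without any fixed effects. It is not Capybara's implementation: the simulated data and the `iwls_poisson()` helper are assumptions made up for illustration, and the result is only compared against base R's `glm()`.

```r
# Simulated data; every name here is illustrative and not part of capybara.
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
X <- cbind(intercept = 1, x = x)

# Hypothetical helper: IWLS for a Poisson model with log link.
iwls_poisson <- function(X, y, tol = 1e-10, max_iter = 50) {
  # Start from an intercept-only guess on the log scale
  beta <- c(log(mean(y)), rep(0, ncol(X) - 1))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)   # linear predictor
    mu <- exp(eta)            # fitted mean
    z <- eta + (y - mu) / mu  # working response
    w <- mu                   # working weights for Poisson/log
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_iwls <- iwls_poisson(X, y)
beta_glm <- coef(glm(y ~ x, family = poisson(link = "log")))

# The two sets of coefficients should agree up to numerical tolerance
print(cbind(iwls = unname(beta_iwls), glm = unname(beta_glm)))
```

Each pass is an ordinary weighted least squares fit, which is why demeaning the variables at every step is enough to absorb the fixed effects.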

Traditional QR estimation can be infeasible because of its additional memory requirements. The method, which is based on Halperin's 1962 article on vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
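The projection idea can be sketched in a few lines of base R: subtract the group means of one fixed effect, then the other, and repeat until nothing changes. This is a simplified illustration for two fixed effects and OLS, not Capybara's C++ code. The simulated data and the `demean_two_way()` helper are assumptions introduced for the example.

```r
# Simulated data with two grouping variables; names are illustrative only.
set.seed(42)
n <- 1000
f1 <- factor(sample(1:20, n, replace = TRUE))
f2 <- factor(sample(1:15, n, replace = TRUE))
x <- rnorm(n) + as.integer(f1) / 10
y <- 2 * x + as.integer(f1) / 5 - as.integer(f2) / 7 + rnorm(n)

# Hypothetical helper: alternate between removing the group means of f1
# and the group means of f2 until the vector stops changing.
demean_two_way <- function(v, f1, f2, tol = 1e-10, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    v_old <- v
    v <- v - ave(v, f1)  # remove means by the first fixed effect
    v <- v - ave(v, f2)  # remove means by the second fixed effect
    if (max(abs(v - v_old)) < tol) break
  }
  v
}

y_dm <- demean_two_way(y, f1, f2)
x_dm <- demean_two_way(x, f1, f2)

# Slope from the demeaned regression (no dummy columns are built)
beta_demeaned <- unname(coef(lm(y_dm ~ x_dm - 1)))

# Slope from the regression with explicit dummies, for comparison
beta_dummies <- unname(coef(lm(y ~ x + f1 + f2))["x"])

print(c(demeaned = beta_demeaned, dummies = beta_dummies))
```

The demeaned regression never builds the dummy columns, which is where the memory savings come from when the fixed effects have many levels.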

The software borrows heavily from Gaure's 2013 and Stammann's 2018 work on the OLS and IWLS estimators with large k-way fixed effects (i.e., the Lfe and Alpaca packages). The differences are that Capybara takes an elementary approach and uses minimal C++ code without parallelization, which achieves very good results considering its simplicity. I hope it is easy to maintain.

The summary tables are nothing like R’s default and borrow from the Broom package and Stata outputs. The default summary from this package is a Markdown table that you can insert in RMarkdown/Quarto or copy and paste to Jupyter.

Demo

Estimating the coefficients of a gravity model with importer-time and exporter-time fixed effects.

library(capybara)

mod <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel,
  family = poisson(link = "log")
)

summary(mod)
Formula: trade ~ dist + lang + cntg + clny | exp_year + imp_year

Family: Poisson

Estimates:

|      | Estimate | Std. error | z value    | Pr(> |z|)  |
|------|----------|------------|------------|------------|
| dist |  -0.0006 |     0.0000 | -8090.3257 | 0.0000 *** |
| lang |  -0.1081 |     0.0006 |  -181.4917 | 0.0000 *** |
| cntg |  -1.3474 |     0.0005 | -2584.7078 | 0.0000 *** |
| clny |  -1.0101 |     0.0009 | -1127.0449 | 0.0000 *** |

Significance codes: *** 99.9%; ** 99%; * 95%; . 90%

Pseudo R-squared: 0.3092 

Number of observations: Full 28566; Missing 0; Perfect classification 0 

Number of Fisher Scoring iterations: 12 

Installation

You can install the development version of capybara like so:

remotes::install_github("pachadotdev/capybara")

Examples

See the documentation in progress: https://pacha.dev/capybara.

Benchmarks

Median time for the different models in the book An Advanced Guide to Trade Policy Analysis.

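| package  | PPML    | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|---------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282ms   | 1.78s           | 1.1s        | 1.34s             | 2.18s                      | 4.48s         |
| Base R   | 36.2s   | 36.87s          | 9.81m       | 10.03m            | 10.41m                     | 10.4m         |
| Capybara | 159.2ms | 97.96ms         | 81.38ms     | 86.77ms           | 104.69ms                   | 130.22ms      |
| Fixest   | 33.6ms  | 191.04ms        | 64.38ms     | 75.2ms            | 102.18ms                   | 162.28ms      |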
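The post does not say which tool produced these timings. As a rough sketch of how a median time can be measured with base R alone, the example below times repeated fits of a small simulated Poisson model. The `median_elapsed()` helper and the data are assumptions for illustration, and the numbers it produces are unrelated to the benchmark table.

```r
# Hypothetical helper: run a function several times and return the median
# elapsed wall-clock time in seconds.
median_elapsed <- function(f, times = 5) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Small simulated data set; not the trade data used in the benchmarks.
set.seed(1)
d <- data.frame(x = rnorm(2000))
d$y <- rpois(2000, exp(0.2 + 0.1 * d$x))

# Median time of a base R Poisson fit over five runs
t_med <- median_elapsed(function() glm(y ~ x, family = poisson(), data = d))
print(t_med)
```

Reporting the median rather than the mean keeps a single slow run, for example one interrupted by garbage collection, from dominating the result.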

Memory allocation for the same models

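| package  | PPML     | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|----------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282.78MB | 321.5MB         | 270.4MB     | 308MB             | 366.5MB                    | 512.1MB       |
| Base R   | 2.73GB   | 2.6GB           | 11.9GB      | 11.9GB            | 11.9GB                     | 12GB          |
| Capybara | 339.13MB | 196.3MB         | 162.6MB     | 169.1MB           | 181.1MB                    | 239.9MB       |
| Fixest   | 44.79MB  | 36.6MB          | 28.1MB      | 32.4MB            | 41.1MB                     | 62.9MB        |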