class: center, middle, inverse, title-slide # Econometrics in R ## Estimation of Gravity Models ### Pachá ### satRday Santiago ### Dec 15, 2018 --- # Before we begin --- ## Acknowledgements Thanks to: * Anna-Lenna Woelver and Jan Pablo Burgard fron Trier University (gravity 1st version) * Joshua Kunst from PUC Chile (gravity re-writting using rlang) * Vilma Romero (made this cool template available) --- ## Contents of the talk * Lifegoals (or how to use R well) * A bit of statistical/economic theory * Our new R package `gravity` --- ## Where to reach me **Twitter and Github: `pachamaltese`** **Email: `m vargas at dcc dot uchile dot cl`** **Phone: `+1 XXX XXX XX XX or +56 X X XXX XX XX`** --- ## Lifegoals No matter if you are an expert or beginner user: * Don't use `setwd()` * Writing `setwd("~/github/satrday-talk/file.csv)` goes **against** reproducibility * Use RStudio projects instead --- ## Lifegoals No matter if you are an expert or beginner user: * Don't use `rm(list = ls())` * Restart the session instead --- ## Lifegoals If Jenny Bryan discovers you doing all of the previous, she will come into your office and [set your computer on fire](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) <img src="images/computeonfire.png" width="531" /> --- ## Lifegoals * Please use Git (i.e. Github, Gitlab, Bitbucket, etc) * Commit to save your progress * If you screw up your code you can always go back --- ## Lifegoals * If you work with others, using packrat is desirable * But it is really important that you write well (descriptive names + stylesheets) ```r # Don't do this x=read.csv("data_2_use_this.csv");x=x[,-1];a=x$x2 # Do this latam <- read_csv("latam_region_trade_data.csv") %>% select(-country_name) iso_codes <- select(latam, country_iso) ``` ifyoudontwritewellitshardtobeunderstood --- ## Lifegoals Finally, finish your sandwich or Solenya will come with a laser beam <img src="images/picklerick.jpg" width="614" /> --- ## Questions to the audience **Do you understand (a bit) linear algebra?** **Have you ever fitted a regression (in R) before today?** --- ## Linear regression $$ \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}} $$ Let `\(\vec{y} \in \R^n\)` be the outcome and `\(X \in \R^{n\times p}\)` be the design matrix in the context of a general model with intercept: `$$\vec{y} = X\vec{\beta} + \vec{e}$$` Being: $$ `\begin{equation*} \underset{n\times 1}{\vec{y}} = \begin{pmatrix}y_0 \cr y_1 \cr \vdots \cr y_n\end{pmatrix} \text{ and } \underset{n\times p}{X} = \begin{pmatrix}1 & x_{11} & & x_{1p} \cr 1 & x_{21} & & x_{2p} \cr & \ddots & \cr 1 & x_{n1} & & x_{np}\end{pmatrix} = (\vec{1} \: \vec{x}_1 \: \ldots \: \vec{x}_p) \end{equation*}` $$ --- ## Linear regression In linear models the aim is to minimize the error term by chosing `\(\hat{\vec{\beta}}\)`. One possibility is to minimize the squared error by solving this optimization problem: $$ `\begin{equation} \label{min} \displaystyle \min_{\vec{\beta}} S = \|\vec{y} - X\vec{\beta}\|^2 \end{equation}` $$ Books such as Baltagi (2011) discuss how to solve this, and different equivalent approaches result in this optimal estimator: $$ `\begin{equation} \label{beta} \hat{\vec{\beta}} = (X^tX)^{-1} X^t\vec{y} \end{equation}` $$ With one independent variable and intercept, this is `\(y_i = \beta_0 + \beta_1 x_{i1} + e_i\)`, `\(\vec{\beta}\)` can be written as: $$ `\begin{equation} \label{beta2} \hat{\beta}_1 = cor(\vec{y},\vec{x}) \cdot \frac{sd(\vec{y})}{sd(\vec{x})} \text{ and } \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{\vec{x}} \end{equation}` $$ --- ## Coding example with mtcars dataset Consider the model: `$$mpg_i = \beta_1 wt_i + \beta_2 cyl_i + e_i$$` This is how to write that model in R notation: ```r lm(mpg ~ wt + cyl, data = mtcars) ``` Or written in matrix form: ```r y <- mtcars$mpg x0 <- rep(1, length(y)) x1 <- mtcars$wt x2 <- mtcars$cyl X <- cbind(x0,x1,x2) ``` --- ## Coding example with mtcars dataset It's the same to use `lm` or to perform a matrix multiplication: ```r fit <- lm(y ~ x1 + x2) coefficients(fit) ``` ``` ## (Intercept) x1 x2 ## 39.686261 -3.190972 -1.507795 ``` ```r # versus beta <- solve(t(X)%*%X) %*% (t(X)%*%y) beta ``` ``` ## [,1] ## x0 39.686261 ## x1 -3.190972 ## x2 -1.507795 ``` --- ## Coding example with Galton dataset Now let's check the correlation procedure: ```r if (!require(pacman)) install.packages("pacman") pacman::p_load(HistData) y <- Galton$child x <- Galton$parent beta1 <- cor(y, x) * sd(y) / sd(x) beta0 <- mean(y) - beta1 * mean(x) c(beta0, beta1) ``` ``` ## [1] 23.9415302 0.6462906 ``` <img src="images/komolozupo.jpg" width="231" /> --- # Simple gravity model --- ## Simple gravity model The main reference for this section is Woelwer, Burgard, Kunst, and Vargas (2018) and the references therein. Gravity models in their traditional form are inspired by Newton law of gravitation: $$ `\begin{equation*} F_{ij}=G\frac{M_{i}M_{j}}{D^{2}_{ij}}. \end{equation*}` $$ The force `\(F\)` between two bodies `\(i\)` and `\(j\)` with `\(i \neq j\)` is proportional to the masses `\(M\)` of these bodies and inversely proportional to the square of their geographical distance `\(D\)`. `\(G\)` is a constant and as such of no major concern. --- ## Simple gravity model The underlying idea of a traditional gravity model, shown for international trade, is equally simple: $$ `\begin{equation*} X_{ij}=G\frac{Y_{i}^{\beta_{1}}Y_{j}^{\beta_{2}}}{D_{ij}^{\beta_{3}}}. \end{equation*}` $$ The trade flow `\(X\)` is explained by `\(Y_{i}\)` and `\(Y_{j}\)` that are the masses of the exporting and importing country (e.g. the GDP) and `\(D_{ij}\)` that is the distance between the countries. **This is also used to study urban policies and migration flows!** --- ## Simple gravity model Dummy variables such as common borders `\(contig\)` or regional trade agreements `\(rta\)` can be added to the model. Let `\(t_{ij}\)` be the transaction cost defined as: $$ `\begin{equation*} t_{ij}= D_{ij} \exp(contig_{ij} + rta_{ij}) \end{equation*}` $$ So that the model with friction becomes: $$ `\begin{equation*} X_{ij}=G\frac{Y_{i}^{\beta_{1}}Y_{j}^{\beta_{2}}}{t_{ij}^{\beta_{3}}}. \end{equation*}` $$ A logarithmic operator can be applied to form a log-linear model and use a standard estimation methods such as OLS: $$ `\begin{equation*} \log X_{ij}=\beta_{0}\log G +\beta_{1}\log Y_{i}+\beta_{2}\log Y_{j}+\beta_{3}\log D_{ij}+\beta_{4}contig_{ij}+\beta_{5}rta_{ij} \end{equation*}` $$ --- # Trade barriers model --- ## Trade barriers model * Basically the model proposes that the exports `\(X_{ij}\)` from `\(i\)` to `\(j\)` are determined by the supply factors in `\(i\)`, `\(Y_{i}\)`, and the demand factors in `\(j\)`, `\(Y_{j}\)`, as well as the transaction costs `\(t_{ij}\)`. * Next to information on bilateral partners `\(i\)` and `\(j\)`, information on the rest of the world is included in the gravity model with `\(Y=\sum_{i} Y_{i}= \sum_{j} Y_{j}\)` that represents the worldwide sum of incomes (e.g. the world's GDP). --- ## Trade barriers model * A key assumption is to take a fixed value `\(\sigma > 1\)` in order to account for the preference for a variation of goods (e.g. in this model goods can be replaced for other similar goods). * The Multilateral Resistance terms are included via the terms `\(P\)`, Inward Multilateral Resistance, and `\(\Pi\)`, Outward Multilateral Resistance. * The Inward Multilateral Resistance `\(P_i\)` is a function of the transaction costs of `\(i\)` to all trade partners `\(j\)`. * The Outward Multilateral Resistance `\(\Pi_{j}\)` is a function of the transaction costs of `\(j\)` to all trade partners `\(i\)` and their demand. * The Multilateral Resistance terms dependent on each other. Hence, the estimation of structural gravity models becomes *complex*. <img src="images/pikachu.jpg" width="227" /> --- ## Trade barriers model The econometric literature proposes the Multilateral Resistance model defined by the equations: $$ `\begin{equation*} X_{ij}=\frac{Y_{i}Y_{j}}{Y}\frac{t_{ij}^{1-\sigma}}{P_{j}^{1-\sigma}\Pi_{i}^{1-\sigma}} \end{equation*}` $$ with $$ `\begin{equation*} P_{i}^{1-\sigma}=\sum_{j}\frac{t_{ij}^{1-\sigma}}{\Pi_{j}^{1-\sigma}}\frac{Y_{j}}{Y};\:\Pi_{j}^{1-\sigma}=\sum_{i}\frac{t_{ij}^{1-\sigma}}{P_{i}^{1-\sigma}}\frac{Y_{i}}{Y} \end{equation*}` $$ <img src="images/raichu.jpg" width="224" /> --- # Model estimation --- ## Model estimation * To estimate gravity equations you need a square dataset including bilateral flows defined by the argument `dependent_variable`, a distance measure defined by the argument `distance` that is the key regressor, and other potential influences (e.g. contiguity and common currency) given as a vector in `additional_regressors` are required. * Some estimation methods require ISO codes or similar of type character variables to compute particular country effects. Make sure the origin and destination codes are of type "character". --- ## Model estimation * The rule of thumb for regressors or independent variables consists in: * All dummy variables should be of type numeric (0/1). * If an independent variable is defined as a ratio, it should be logged. * The user should perform some data cleaning beforehand to remove observations that contain entries that can distort estimates, notwithstanding the functions provided within gravity package will remove zero flows and distances. --- # Examples --- ## Double Demeaning * Double Demeaning subtracts importer and exporter averages on the left and right hand side of the respective gravity equation, and all unilateral influences including the Multilateral Resistance terms vanish. * Therefore, no unilateral variables may be added as independent variables for the estimation. --- ## Double Demeaning Our ddm function first logs the dependent variable and the distance variable. Afterwards, the dependent and independent variables are transformed in the following way (exemplary shown for trade flows, `\(X_{ij}\)`): $$ `\begin{equation*} (\log X_{ij})_{\text{DDM}} = (\log X_{ij}) - (\log X_{ij})_{\text{Origin Mean}} - (\log X_{ij})_{\text{Destination Mean}} + (\log X_{ij})_{\text{Mean}}. \end{equation*}` $$ One subtracts the mean value for the origin country and the mean value for the destination country and adds the overall mean value to the logged trade flows. This procedure is repeated for all dependent and independent variables. The transformed variables are then used for the estimation. <!-- DDM is easily applied, but, as shown in @Head2014, its very sensitive to missing data. --> --- ## Double Demeaning An example of how to apply the function ddm to an example dataset in gravity and the resulting output is shown in the following: ```r pacman::p_load(gravity) fit <- ddm( dependent_variable = "flow", distance = "distw", additional_regressors = c("rta", "comcur", "contig"), code_origin = "iso_o", code_destination = "iso_d", data = gravity_no_zeros ) ``` --- ## Double Demeaning The package returns lm or glm objects instead of summaries. Doing that allows to use our functions in conjunction with broom or other packages, for example: ```r pacman::p_load(dplyr, broom) tidy(fit) ``` ``` ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 dist_log_ddm -1.60 0.0331 -48.4 0. ## 2 rta_ddm 0.797 0.0700 11.4 6.54e-30 ## 3 comcur_ddm 0.174 0.146 1.19 2.34e- 1 ## 4 contig_ddm 1.00 0.120 8.36 6.62e-17 ``` --- ## Double Demeaning ```r glance(fit) %>% select(matches("squared")) ``` ``` ## # A tibble: 1 x 2 ## r.squared adj.r.squared ## <dbl> <dbl> ## 1 0.254 0.254 ``` --- ## Double Demeaning How to do it without the function? ```r d <- gravity_no_zeros %>% mutate( dist_log = log(distw), # Transforming data, logging distances y_log = log(flow) # Transforming data, logging flows ) ``` --- ## Double Demeaning ```r # Substracting the means d <- d %>% mutate( y_log_ddm = y_log, dist_log_ddm = dist_log ) %>% group_by(iso_o, add = FALSE) %>% mutate( ym1 = mean(y_log_ddm, na.rm = TRUE), dm1 = mean(dist_log_ddm, na.rm = TRUE) ) %>% group_by(iso_d, add = FALSE) %>% mutate( ym2 = mean(y_log_ddm, na.rm = TRUE), dm2 = mean(dist_log_ddm, na.rm = TRUE) ) ``` --- ## Double Demeaning ```r # Substracting the means d <- d %>% group_by(iso_o, add = FALSE) %>% mutate( y_log_ddm = y_log_ddm - ym1, dist_log_ddm = dist_log_ddm - dm1 ) %>% group_by(iso_d, add = FALSE) %>% mutate( y_log_ddm = y_log_ddm - ym2, dist_log_ddm = dist_log_ddm - dm2 ) ``` --- ## Double Demeaning ```r # Substracting the means d <- d %>% ungroup() %>% mutate( y_log_ddm = y_log_ddm + mean(y_log, na.rm = TRUE), dist_log_ddm = dist_log_ddm + mean(dist_log, na.rm = TRUE) ) ``` --- ## Double Demeaning ```r # Substracting the means for the other independent variables pacman::p_load(tidyr) d2 <- d %>% select(iso_o, iso_d, rta, comcur, contig) %>% gather(key, value, -iso_o, -iso_d) %>% mutate(key = paste0(key, "_ddm")) %>% group_by(iso_o, key, add = FALSE) %>% mutate(ddm = value - mean(value, na.rm = TRUE)) %>% group_by(iso_d, key, add = FALSE) %>% mutate(ddm = ddm - mean(value, na.rm = TRUE)) %>% ungroup() %>% mutate(value = ddm + mean(value, na.rm = TRUE)) %>% select(iso_o, iso_d, key, value) %>% spread(key, value) ``` --- ## Double Demeaning ```r # Model d <- left_join(d, d2, by = c("iso_o", "iso_d")) %>% select(y_log_ddm, ends_with("_ddm")) vars <- paste(c("dist_log_ddm", paste0(c("rta", "comcur", "contig"), "_ddm"), 0), collapse = " + ") form <- stats::as.formula(paste("y_log_ddm", "~", vars)) model_ddm <- stats::lm(form, data = d) tidy(model_ddm) ``` ``` ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 dist_log_ddm -1.60 0.0331 -48.4 0. ## 2 rta_ddm 0.797 0.0700 11.4 6.54e-30 ## 3 comcur_ddm 0.174 0.146 1.19 2.34e- 1 ## 4 contig_ddm 1.00 0.120 8.36 6.62e-17 ``` --- ## Double Demeaning Inside `ddm()` the argument `robust=TRUE` is equivalent to replace `lm` by `rlm` in the last part: ```r model_ddm <- MASS::rlm(form, data = d) tidy(model_ddm) ``` ``` ## # A tibble: 4 x 4 ## term estimate std.error statistic ## <chr> <dbl> <dbl> <dbl> ## 1 dist_log_ddm -1.60 0.0311 -51.7 ## 2 rta_ddm 0.656 0.0657 9.99 ## 3 comcur_ddm 0.507 0.137 3.70 ## 4 contig_ddm 1.12 0.112 9.93 ``` --- ## Double Demeaning ```r fit2 <- ddm( dependent_variable = "flow", distance = "distw", additional_regressors = c("rta", "comcur", "contig"), code_origin = "iso_o", code_destination = "iso_d", data = gravity_no_zeros, robust = TRUE ) tidy(fit2) ``` ``` ## # A tibble: 4 x 4 ## term estimate std.error statistic ## <chr> <dbl> <dbl> <dbl> ## 1 dist_log_ddm -1.60 0.0311 -51.7 ## 2 rta_ddm 0.656 0.0657 9.99 ## 3 comcur_ddm 0.507 0.137 3.70 ## 4 contig_ddm 1.12 0.112 9.93 ``` --- # Code and documentation **github.com/pachamaltese/gravity** **pacha.hk/gravity** --- # Questions? --- # References <a name=bib-Baltagi2011></a>[Baltagi, B. H.](#cite-Baltagi2011) (2011). _Econometrics_. Springer Texts in Business and Economics 978-3-642-20059-5. Springer. DOI: [10.1007/978-1-4614-3169-5](https://doi.org/10.1007%2F978-1-4614-3169-5). <a name=bib-Woelver2018></a>[Woelwer, A. L, J. P. Burgard, J. Kunst, et al.](#cite-Woelver2018) (2018). "Gravity: Estimation Methods for Gravity Models in R". In: _Journal of Open Source Software_ 31.3, p. 1038. DOI: [10.21105/joss.01038](https://doi.org/10.21105%2Fjoss.01038). --- <center> <h3> This work is licensed as **Creative Commons Attribution-NonCommercial 4.0 International** To view a copy of this license visit https://creativecommons.org/licenses/by-nc/4.0/ </h3> <center>