Thanks to:
gravity
Twitter and Github: pachamaltese
Email: m vargas at dcc dot uchile dot cl
Phone: +1 XXX XXX XX XX or +56 X X XXX XX XX
No matter if you are an expert or beginner user:
setwd()
setwd("~/github/satrday-talk/file.csv)
goes against reproducibilityNo matter if you are an expert or beginner user:
rm(list = ls())
If Jenny Bryan discovers you doing all of the previous, she will come into your office and set your computer on fire
# Don't do thisx=read.csv("data_2_use_this.csv");x=x[,-1];a=x$x2# Do thislatam <- read_csv("latam_region_trade_data.csv") %>% select(-country_name)iso_codes <- select(latam, country_iso)
ifyoudontwritewellitshardtobeunderstood
Finally, finish your sandwich or Solenya will come with a laser beam
Do you understand (a bit) linear algebra?
Have you ever fitted a regression (in R) before today?
\renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}}
Let \vec{y} \in \R^n be the outcome and X \in \R^{n\times p} be the design matrix in the context of a general model with intercept: \vec{y} = X\vec{\beta} + \vec{e}
Being:
\begin{equation*} \underset{n\times 1}{\vec{y}} = \begin{pmatrix}y_0 \cr y_1 \cr \vdots \cr y_n\end{pmatrix} \text{ and } \underset{n\times p}{X} = \begin{pmatrix}1 & x_{11} & & x_{1p} \cr 1 & x_{21} & & x_{2p} \cr & \ddots & \cr 1 & x_{n1} & & x_{np}\end{pmatrix} = (\vec{1} \: \vec{x}_1 \: \ldots \: \vec{x}_p) \end{equation*}
In linear models the aim is to minimize the error term by chosing \hat{\vec{\beta}}. One possibility is to minimize the squared error by solving this optimization problem:
\begin{equation} \label{min} \displaystyle \min_{\vec{\beta}} S = \|\vec{y} - X\vec{\beta}\|^2 \end{equation}
Books such as Baltagi (2011) discuss how to solve this, and different equivalent approaches result in this optimal estimator:
\begin{equation} \label{beta} \hat{\vec{\beta}} = (X^tX)^{-1} X^t\vec{y} \end{equation}
With one independent variable and intercept, this is y_i = \beta_0 + \beta_1 x_{i1} + e_i, \vec{\beta} can be written as:
\begin{equation} \label{beta2} \hat{\beta}_1 = cor(\vec{y},\vec{x}) \cdot \frac{sd(\vec{y})}{sd(\vec{x})} \text{ and } \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{\vec{x}} \end{equation}
Consider the model: mpg_i = \beta_1 wt_i + \beta_2 cyl_i + e_i
This is how to write that model in R notation:
lm(mpg ~ wt + cyl, data = mtcars)
Or written in matrix form:
y <- mtcars$mpgx0 <- rep(1, length(y))x1 <- mtcars$wtx2 <- mtcars$cylX <- cbind(x0,x1,x2)
It's the same to use lm
or to perform a matrix multiplication:
fit <- lm(y ~ x1 + x2)coefficients(fit)
## (Intercept) x1 x2 ## 39.686261 -3.190972 -1.507795
# versusbeta <- solve(t(X)%*%X) %*% (t(X)%*%y)beta
## [,1]## x0 39.686261## x1 -3.190972## x2 -1.507795
Now let's check the correlation procedure:
if (!require(pacman)) install.packages("pacman")pacman::p_load(HistData)y <- Galton$childx <- Galton$parentbeta1 <- cor(y, x) * sd(y) / sd(x)beta0 <- mean(y) - beta1 * mean(x)c(beta0, beta1)
## [1] 23.9415302 0.6462906
The main reference for this section is Woelwer, Burgard, Kunst, and Vargas (2018) and the references therein.
Gravity models in their traditional form are inspired by Newton law of gravitation:
\begin{equation*} F_{ij}=G\frac{M_{i}M_{j}}{D^{2}_{ij}}. \end{equation*}
The force F between two bodies i and j with i \neq j is proportional to the masses M of these bodies and inversely proportional to the square of their geographical distance D. G is a constant and as such of no major concern.
The underlying idea of a traditional gravity model, shown for international trade, is equally simple:
\begin{equation*} X_{ij}=G\frac{Y_{i}^{\beta_{1}}Y_{j}^{\beta_{2}}}{D_{ij}^{\beta_{3}}}. \end{equation*}
The trade flow X is explained by Y_{i} and Y_{j} that are the masses of the exporting and importing country (e.g. the GDP) and D_{ij} that is the distance between the countries.
This is also used to study urban policies and migration flows!
Dummy variables such as common borders contig or regional trade agreements rta can be added to the model. Let t_{ij} be the transaction cost defined as:
\begin{equation*} t_{ij}= D_{ij} \exp(contig_{ij} + rta_{ij}) \end{equation*}
So that the model with friction becomes:
\begin{equation*} X_{ij}=G\frac{Y_{i}^{\beta_{1}}Y_{j}^{\beta_{2}}}{t_{ij}^{\beta_{3}}}. \end{equation*}
A logarithmic operator can be applied to form a log-linear model and use a standard estimation methods such as OLS:
\begin{equation*} \log X_{ij}=\beta_{0}\log G +\beta_{1}\log Y_{i}+\beta_{2}\log Y_{j}+\beta_{3}\log D_{ij}+\beta_{4}contig_{ij}+\beta_{5}rta_{ij} \end{equation*}
Basically the model proposes that the exports X_{ij} from i to j are determined by the supply factors in i, Y_{i}, and the demand factors in j, Y_{j}, as well as the transaction costs t_{ij}.
Next to information on bilateral partners i and j, information on the rest of the world is included in the gravity model with Y=\sum_{i} Y_{i}= \sum_{j} Y_{j} that represents the worldwide sum of incomes (e.g. the world's GDP).
A key assumption is to take a fixed value \sigma > 1 in order to account for the preference for a variation of goods (e.g. in this model goods can be replaced for other similar goods).
The Multilateral Resistance terms are included via the terms P, Inward Multilateral Resistance, and \Pi, Outward Multilateral Resistance.
The Inward Multilateral Resistance P_i is a function of the transaction costs of i to all trade partners j.
The Outward Multilateral Resistance \Pi_{j} is a function of the transaction costs of j to all trade partners i and their demand.
The Multilateral Resistance terms dependent on each other. Hence, the estimation of structural gravity models becomes complex.
The econometric literature proposes the Multilateral Resistance model defined by the equations:
\begin{equation*} X_{ij}=\frac{Y_{i}Y_{j}}{Y}\frac{t_{ij}^{1-\sigma}}{P_{j}^{1-\sigma}\Pi_{i}^{1-\sigma}} \end{equation*} with \begin{equation*} P_{i}^{1-\sigma}=\sum_{j}\frac{t_{ij}^{1-\sigma}}{\Pi_{j}^{1-\sigma}}\frac{Y_{j}}{Y};\:\Pi_{j}^{1-\sigma}=\sum_{i}\frac{t_{ij}^{1-\sigma}}{P_{i}^{1-\sigma}}\frac{Y_{i}}{Y} \end{equation*}
To estimate gravity equations you need a square dataset including bilateral flows defined by the argument dependent_variable
, a distance measure defined by the argument distance
that is the key regressor, and other potential influences (e.g. contiguity and common currency) given as a vector in additional_regressors
are required.
Some estimation methods require ISO codes or similar of type character variables to compute particular country effects. Make sure the origin and destination codes are of type "character".
The rule of thumb for regressors or independent variables consists in:
The user should perform some data cleaning beforehand to remove observations that contain entries that can distort estimates, notwithstanding the functions provided within gravity package will remove zero flows and distances.
Double Demeaning subtracts importer and exporter averages on the left and right hand side of the respective gravity equation, and all unilateral influences including the Multilateral Resistance terms vanish.
Therefore, no unilateral variables may be added as independent variables for the estimation.
Our ddm function first logs the dependent variable and the distance variable.
Afterwards, the dependent and independent variables are transformed in the following way (exemplary shown for trade flows, X_{ij}): \begin{equation*} (\log X_{ij})_{\text{DDM}} = (\log X_{ij}) - (\log X_{ij})_{\text{Origin Mean}} - (\log X_{ij})_{\text{Destination Mean}} + (\log X_{ij})_{\text{Mean}}. \end{equation*}
One subtracts the mean value for the origin country and the mean value for the destination country and adds the overall mean value to the logged trade flows.
This procedure is repeated for all dependent and independent variables. The transformed variables are then used for the estimation.
An example of how to apply the function ddm to an example dataset in gravity and the resulting output is shown in the following:
pacman::p_load(gravity)fit <- ddm( dependent_variable = "flow", distance = "distw", additional_regressors = c("rta", "comcur", "contig"), code_origin = "iso_o", code_destination = "iso_d", data = gravity_no_zeros )
The package returns lm or glm objects instead of summaries. Doing that allows to use our functions in conjunction with broom or other packages, for example:
pacman::p_load(dplyr, broom)tidy(fit)
## # A tibble: 4 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 dist_log_ddm -1.60 0.0331 -48.4 0. ## 2 rta_ddm 0.797 0.0700 11.4 6.54e-30## 3 comcur_ddm 0.174 0.146 1.19 2.34e- 1## 4 contig_ddm 1.00 0.120 8.36 6.62e-17
glance(fit) %>% select(matches("squared"))
## # A tibble: 1 x 2## r.squared adj.r.squared## <dbl> <dbl>## 1 0.254 0.254
How to do it without the function?
d <- gravity_no_zeros %>% mutate( dist_log = log(distw), # Transforming data, logging distances y_log = log(flow) # Transforming data, logging flows )
# Substracting the meansd <- d %>% mutate( y_log_ddm = y_log, dist_log_ddm = dist_log ) %>% group_by(iso_o, add = FALSE) %>% mutate( ym1 = mean(y_log_ddm, na.rm = TRUE), dm1 = mean(dist_log_ddm, na.rm = TRUE) ) %>% group_by(iso_d, add = FALSE) %>% mutate( ym2 = mean(y_log_ddm, na.rm = TRUE), dm2 = mean(dist_log_ddm, na.rm = TRUE) )
# Substracting the meansd <- d %>% group_by(iso_o, add = FALSE) %>% mutate( y_log_ddm = y_log_ddm - ym1, dist_log_ddm = dist_log_ddm - dm1 ) %>% group_by(iso_d, add = FALSE) %>% mutate( y_log_ddm = y_log_ddm - ym2, dist_log_ddm = dist_log_ddm - dm2 )
# Substracting the meansd <- d %>% ungroup() %>% mutate( y_log_ddm = y_log_ddm + mean(y_log, na.rm = TRUE), dist_log_ddm = dist_log_ddm + mean(dist_log, na.rm = TRUE) )
# Substracting the means for the other independent variablespacman::p_load(tidyr)d2 <- d %>% select(iso_o, iso_d, rta, comcur, contig) %>% gather(key, value, -iso_o, -iso_d) %>% mutate(key = paste0(key, "_ddm")) %>% group_by(iso_o, key, add = FALSE) %>% mutate(ddm = value - mean(value, na.rm = TRUE)) %>% group_by(iso_d, key, add = FALSE) %>% mutate(ddm = ddm - mean(value, na.rm = TRUE)) %>% ungroup() %>% mutate(value = ddm + mean(value, na.rm = TRUE)) %>% select(iso_o, iso_d, key, value) %>% spread(key, value)
# Modeld <- left_join(d, d2, by = c("iso_o", "iso_d")) %>% select(y_log_ddm, ends_with("_ddm"))vars <- paste(c("dist_log_ddm", paste0(c("rta", "comcur", "contig"), "_ddm"), 0), collapse = " + ")form <- stats::as.formula(paste("y_log_ddm", "~", vars))model_ddm <- stats::lm(form, data = d)tidy(model_ddm)
## # A tibble: 4 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 dist_log_ddm -1.60 0.0331 -48.4 0. ## 2 rta_ddm 0.797 0.0700 11.4 6.54e-30## 3 comcur_ddm 0.174 0.146 1.19 2.34e- 1## 4 contig_ddm 1.00 0.120 8.36 6.62e-17
Inside ddm()
the argument robust=TRUE
is equivalent to replace lm
by rlm
in the last part:
model_ddm <- MASS::rlm(form, data = d)tidy(model_ddm)
## # A tibble: 4 x 4## term estimate std.error statistic## <chr> <dbl> <dbl> <dbl>## 1 dist_log_ddm -1.60 0.0311 -51.7 ## 2 rta_ddm 0.656 0.0657 9.99## 3 comcur_ddm 0.507 0.137 3.70## 4 contig_ddm 1.12 0.112 9.93
fit2 <- ddm( dependent_variable = "flow", distance = "distw", additional_regressors = c("rta", "comcur", "contig"), code_origin = "iso_o", code_destination = "iso_d", data = gravity_no_zeros, robust = TRUE )tidy(fit2)
## # A tibble: 4 x 4## term estimate std.error statistic## <chr> <dbl> <dbl> <dbl>## 1 dist_log_ddm -1.60 0.0311 -51.7 ## 2 rta_ddm 0.656 0.0657 9.99## 3 comcur_ddm 0.507 0.137 3.70## 4 contig_ddm 1.12 0.112 9.93
github.com/pachamaltese/gravity
pacha.hk/gravity
Baltagi, B. H. (2011). Econometrics. Springer Texts in Business and Economics 978-3-642-20059-5. Springer. DOI: 10.1007/978-1-4614-3169-5.
Woelwer, A. L, J. P. Burgard, J. Kunst, et al. (2018). "Gravity: Estimation Methods for Gravity Models in R". In: Journal of Open Source Software 31.3, p. 1038. DOI: 10.21105/joss.01038.
Creative Commons Attribution-NonCommercial 4.0 International
To view a copy of this license visit https://creativecommons.org/licenses/by-nc/4.0/
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |