Chapter 3 OLS estimates

The general equation for this model is: \[ \begin{align} \log X_{ij,t} =& \:\beta_0 + \beta_1 DIST_{i,j} + \beta_2 CNTG_{i,j} + \beta_3 LANG_{i,j} + \beta_4 CLNY_{i,j} + \beta_5 \log Y_{i,t} +\\ \text{ }& \beta_6 \log E_{j,t} + \varepsilon_{ij,t} \end{align} \]

We start by removing 0 flows. IMPORTANT: If we don’t do this, lm fails because log(0) = -Inf.

gravity2 <- gravity2 %>%
  filter(exporter != importer, trade > 0)

3.1 Adjust

We start with a linear model with usual standard errors:

model1 <- lm(log_trade ~ log_dist + cntg + lang + clny + log_output +
               log_expenditure, data = gravity2)

summary(model1) # no clustered std error here!
## 
## Call:
## lm(formula = log_trade ~ log_dist + cntg + lang + clny + log_output + 
##     log_expenditure, data = gravity2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5421  -0.8281   0.1578   1.0476   7.6585 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11.283080   0.151732  -74.36  < 2e-16 ***
## log_dist         -1.001607   0.014159  -70.74  < 2e-16 ***
## cntg              0.573805   0.074427    7.71 1.31e-14 ***
## lang              0.801548   0.033748   23.75  < 2e-16 ***
## clny              0.734853   0.070387   10.44  < 2e-16 ***
## log_output        1.190236   0.005402  220.32  < 2e-16 ***
## log_expenditure   0.907588   0.005577  162.73  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.743 on 25682 degrees of freedom
## Multiple R-squared:  0.7585, Adjusted R-squared:  0.7585 
## F-statistic: 1.345e+04 on 6 and 25682 DF,  p-value: < 2.2e-16

3.2 Compute clustered standard errors

In R we can work on top of the previous model to obtain a clustered variance matrix:

vcov_cluster1 <- vcovCL(model1, cluster = gravity2[, "pair"], 
                       df_correction = TRUE)

coef_test1 <- tidy(coeftest(model1, vcov_cluster1))

From here we can obtain correct (clustered) F distribution based statistics

wald1 <- tidy(waldtest(model1, vcov = vcov_cluster1, test = "F"))
fs1 <- wald1$statistic[2]
fp1 <- wald1$p.value[2]

Same for obtaining the root MSE

rss1 <- as.numeric(crossprod(model1$residuals))
rmse1 <- sqrt(rss1 / length(model1$residuals))

Now we can create a list with all the previous information:

model1_results <- list(
  summary = tibble(
    n_obs = nrow(gravity2),
    f_stat = fs1,
    prob_f = fp1,
    r_sq = summary(model1)$r.squared,
    root_mse = rmse1
  ),
  coefficients = coef_test1
)

model1_results
## $summary
## # A tibble: 1 × 5
##   n_obs f_stat prob_f  r_sq root_mse
##   <int>  <dbl>  <dbl> <dbl>    <dbl>
## 1 25689  4893.      0 0.759     1.74
## 
## $coefficients
## # A tibble: 7 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      -11.3     0.256      -44.1  0       
## 2 log_dist          -1.00    0.0230     -43.5  0       
## 3 cntg               0.574   0.137        4.18 2.90e- 5
## 4 lang               0.802   0.0682      11.8  7.88e-32
## 5 clny               0.735   0.114        6.46 1.05e-10
## 6 log_output         1.19    0.00931    128.   0       
## 7 log_expenditure    0.908   0.00973     93.3  0

3.3 RESET test

It is (extremely) important to conduct a misspecification test, this is easier to do in Stata but we can still work it out in R:

gravity2 <- gravity2 %>% 
  mutate(fit2 = predict(model1)^2)
 
model2 <- lm(log_trade ~ log_dist + cntg + lang + clny + log_output +
               log_expenditure + fit2, data = gravity2)

vcov_cluster2 <- vcovCL(model2, cluster = gravity2[, "pair"], 
                        df_correction = TRUE)

coef_test2 <- tidy(coeftest(model2, vcov_cluster2))

coef_test2
## # A tibble: 8 × 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     -11.6       0.275      -42.3  0       
## 2 log_dist         -1.05      0.0252     -41.6  0       
## 3 cntg              0.702     0.133        5.28 1.28e- 7
## 4 lang              0.835     0.0682      12.2  2.51e-34
## 5 clny              0.797     0.110        7.26 4.04e-13
## 6 log_output        1.24      0.0143      86.4  0       
## 7 log_expenditure   0.944     0.0135      70.1  0       
## 8 fit2             -0.00699   0.00143     -4.88 1.04e- 6