Chapter 3 OLS estimates
The general equation for this model is: \[ \begin{align} \log X_{ij,t} =& \:\beta_0 + \beta_1 DIST_{i,j} + \beta_2 CNTG_{i,j} + \beta_3 LANG_{i,j} + \beta_4 CLNY_{i,j} + \beta_5 \log Y_{i,t} +\\ \text{ }& \beta_6 \log E_{j,t} + \varepsilon_{ij,t} \end{align} \]
We start by removing 0 flows. IMPORTANT: If we don’t do this, lm fails because
log(0) = -Inf
.
<- gravity2 %>%
gravity2 filter(exporter != importer, trade > 0)
3.1 Adjust
We start with a linear model with usual standard errors:
<- lm(log_trade ~ log_dist + cntg + lang + clny + log_output +
model1 data = gravity2)
log_expenditure,
summary(model1) # no clustered std error here!
##
## Call:
## lm(formula = log_trade ~ log_dist + cntg + lang + clny + log_output +
## log_expenditure, data = gravity2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5421 -0.8281 0.1578 1.0476 7.6585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.283080 0.151732 -74.36 < 2e-16 ***
## log_dist -1.001607 0.014159 -70.74 < 2e-16 ***
## cntg 0.573805 0.074427 7.71 1.31e-14 ***
## lang 0.801548 0.033748 23.75 < 2e-16 ***
## clny 0.734853 0.070387 10.44 < 2e-16 ***
## log_output 1.190236 0.005402 220.32 < 2e-16 ***
## log_expenditure 0.907588 0.005577 162.73 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.743 on 25682 degrees of freedom
## Multiple R-squared: 0.7585, Adjusted R-squared: 0.7585
## F-statistic: 1.345e+04 on 6 and 25682 DF, p-value: < 2.2e-16
3.2 Compute clustered standard errors
In R we can work on top of the previous model to obtain a clustered variance matrix:
<- vcovCL(model1, cluster = gravity2[, "pair"],
vcov_cluster1 df_correction = TRUE)
<- tidy(coeftest(model1, vcov_cluster1)) coef_test1
From here we can obtain correct (clustered) F distribution based statistics
<- tidy(waldtest(model1, vcov = vcov_cluster1, test = "F"))
wald1 <- wald1$statistic[2]
fs1 <- wald1$p.value[2] fp1
Same for obtaining the root MSE
<- as.numeric(crossprod(model1$residuals))
rss1 <- sqrt(rss1 / length(model1$residuals)) rmse1
Now we can create a list with all the previous information:
<- list(
model1_results summary = tibble(
n_obs = nrow(gravity2),
f_stat = fs1,
prob_f = fp1,
r_sq = summary(model1)$r.squared,
root_mse = rmse1
),coefficients = coef_test1
)
model1_results
## $summary
## # A tibble: 1 × 5
## n_obs f_stat prob_f r_sq root_mse
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 25689 4893. 0 0.759 1.74
##
## $coefficients
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -11.3 0.256 -44.1 0
## 2 log_dist -1.00 0.0230 -43.5 0
## 3 cntg 0.574 0.137 4.18 2.90e- 5
## 4 lang 0.802 0.0682 11.8 7.88e-32
## 5 clny 0.735 0.114 6.46 1.05e-10
## 6 log_output 1.19 0.00931 128. 0
## 7 log_expenditure 0.908 0.00973 93.3 0
3.3 RESET test
It is (extremely) important to conduct a misspecification test, this is easier to do in Stata but we can still work it out in R:
<- gravity2 %>%
gravity2 mutate(fit2 = predict(model1)^2)
<- lm(log_trade ~ log_dist + cntg + lang + clny + log_output +
model2 + fit2, data = gravity2)
log_expenditure
<- vcovCL(model2, cluster = gravity2[, "pair"],
vcov_cluster2 df_correction = TRUE)
<- tidy(coeftest(model2, vcov_cluster2))
coef_test2
coef_test2
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -11.6 0.275 -42.3 0
## 2 log_dist -1.05 0.0252 -41.6 0
## 3 cntg 0.702 0.133 5.28 1.28e- 7
## 4 lang 0.835 0.0682 12.2 2.51e-34
## 5 clny 0.797 0.110 7.26 4.04e-13
## 6 log_output 1.24 0.0143 86.4 0
## 7 log_expenditure 0.944 0.0135 70.1 0
## 8 fit2 -0.00699 0.00143 -4.88 1.04e- 6