I am trying to understand which sample should be used in the first stage when estimating models with the control function (CF) approach and lagged explanatory variables. Below, I explain in detail what I mean.
The CF approach is an alternative to xtivreg, fe estimation. Suppose X is an endogenous independent variable. In the CF approach, we first run xtreg X Z C1 C2, fe (where C1 and C2 are controls and Z is an instrument), then predict the residuals with predict CF, resid, and then insert CF into the second stage: xtreg Y X C1 C2 CF, fe.
In this case, the coefficients for X, C1, and C2 should be the same in both xtreg Y X C1 C2 CF, fe and xtivreg Y C1 C2 (X = Z), fe, while the standard errors will differ unless the ones from xtreg, fe are adjusted via bootstrapping (I did not bootstrap here, to avoid additional confusion).
Indeed, here are the results of xtreg, fe and xtivreg, fe that I obtained using the nlswork data:
xtreg, fe (errors not bootstrapped):
Code:
webuse nlswork, clear
quietly xtreg tenure union south age c.age#c.age not_smsa, fe
predict cf, resid
xtreg ln_w tenure age c.age#c.age not_smsa cf, fe
Fixed-effects (within) regression Number of obs = 19,007
Group variable: idcode Number of groups = 4,134
R-sq: Obs per group:
within = 0.1328 min = 1
between = 0.2365 avg = 4.6
overall = 0.2073 max = 12
F(5,14868) = 455.53
corr(u_i, Xb) = 0.2033 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure | .2403531 .0151385 15.88 0.000 .2106797 .2700264
age | .0118437 .0036499 3.24 0.001 .0046894 .018998
|
c.age#c.age | -.0012145 .0000798 -15.22 0.000 -.0013709 -.001058
|
not_smsa | -.0167178 .0137527 -1.22 0.224 -.0436748 .0102393
cf | -.2227325 .0151602 -14.69 0.000 -.2524484 -.1930167
_cons | 1.678287 .0659452 25.45 0.000 1.549027 1.807548
-------------+----------------------------------------------------------------
sigma_u | .38999138
sigma_e | .25552281
rho | .69964877 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4133, 14868) = 8.30 Prob > F = 0.0000
xtivreg, fe:
Code:
xtivreg ln_w age c.age#c.age not_smsa (tenure = union south), fe
Fixed-effects (within) IV regression Number of obs = 19,007
Group variable: idcode Number of groups = 4,134
R-sq: Obs per group:
within = . min = 1
between = 0.1304 avg = 4.6
overall = 0.0897 max = 12
Wald chi2(4) = 147926.58
corr(u_i, Xb) = -0.6843 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure | .2403531 .0373419 6.44 0.000 .1671643 .3135419
age | .0118437 .0090032 1.32 0.188 -.0058023 .0294897
|
c.age#c.age | -.0012145 .0001968 -6.17 0.000 -.0016003 -.0008286
|
not_smsa | -.0167178 .0339236 -0.49 0.622 -.0832069 .0497713
_cons | 1.678287 .1626657 10.32 0.000 1.359468 1.997106
-------------+----------------------------------------------------------------
sigma_u | .70661941
sigma_e | .63029359
rho | .55690561 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4133,14869) = 1.44 Prob > F = 0.0000
------------------------------------------------------------------------------
Instrumented: tenure
Instruments: age c.age#c.age not_smsa union south
------------------------------------------------------------------------------
As you can see, the coefficients are the same; only the standard errors differ (they become equal once bootstrapped, which confirms that both approaches yield exactly the same results when the same instruments are used).
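For anyone who wants to check the bootstrap claim, here is a minimal sketch of how both stages could be resampled together so that the first-stage estimation error is reflected in the second-stage standard errors. The program name cf_fe, the variable newid, the seed, and the 200 replications are illustrative assumptions, and the sketch uses predict, e (the idiosyncratic residual documented for xtreg, fe) rather than resid. It is not the exact code behind the results above.

```stata
webuse nlswork, clear

* bootstrap relabels each resampled panel, so it needs its own id variable
generate long newid = idcode
xtset newid year

capture program drop cf_fe
program define cf_fe, eclass
    * re-declare the panel on the resampled ids (newid is a hypothetical name)
    quietly xtset newid year
    capture drop cf
    * first stage: re-estimated inside every bootstrap replication
    quietly xtreg tenure union south age c.age#c.age not_smsa, fe
    predict double cf, e
    * second stage: control function added as a regressor
    xtreg ln_w tenure age c.age#c.age not_smsa cf, fe
end

* resample whole panels (clusters), not individual observations
bootstrap _b, reps(200) seed(12345) cluster(idcode) idcluster(newid): cf_fe
```

Resampling by idcode keeps each woman's full history together, which seems the natural choice for a fixed-effects model. The resulting standard errors can then be compared with those reported by xtivreg, fe.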
However, my question is: which sample is it correct to use in the first stage once the explanatory variables are lagged?
When the explanatory variables are lagged by one year, the fixed-effects IV estimator produces the following:
Code:
xtivreg ln_w l.age cl.age#cl.age l.not_smsa (l.tenure = l.union l.south), fe
Fixed-effects (within) IV regression Number of obs = 7,500
Group variable: idcode Number of groups = 3,294
R-sq: Obs per group:
within = . min = 1
between = 0.0685 avg = 2.3
overall = 0.0571 max = 6
Wald chi2(4) = 80781.56
corr(u_i, Xb) = -0.5474 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure |
L1. | .1755435 .0389611 4.51 0.000 .0991811 .2519059
|
age |
L1. | .0106753 .0134104 0.80 0.426 -.0156085 .0369592
|
cL.age#|
cL.age | -.0008867 .0002305 -3.85 0.000 -.0013384 -.0004351
|
not_smsa |
L1. | -.0452809 .0509685 -0.89 0.374 -.1451773 .0546154
|
_cons | 1.671945 .2302329 7.26 0.000 1.220697 2.123194
-------------+----------------------------------------------------------------
sigma_u | .59050356
sigma_e | .54146412
rho | .54324114 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3293,4202) = 1.08 Prob > F = 0.0089
------------------------------------------------------------------------------
Instrumented: L.tenure
Instruments: L.age cL.age#cL.age L.not_smsa L.union L.south
------------------------------------------------------------------------------
The following CF model yields the same coefficients:
Code:
quietly xtreg l.tenure l.union l.south l.age cl.age#cl.age l.not_smsa, fe
capture drop cf
predict cf, resid
xtreg ln_w l.tenure l.age cl.age#cl.age l.not_smsa cf, fe
Fixed-effects (within) regression Number of obs = 7,500
Group variable: idcode Number of groups = 3,294
R-sq: Obs per group:
within = 0.1351 min = 1
between = 0.1783 avg = 2.3
overall = 0.1770 max = 6
F(5,4201) = 131.21
corr(u_i, Xb) = 0.1436 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure |
L1. | .1755435 .0205221 8.55 0.000 .1353094 .2157776
|
age |
L1. | .0106753 .0070637 1.51 0.131 -.0031732 .0245239
|
cL.age#|
cL.age | -.0008867 .0001214 -7.30 0.000 -.0011247 -.0006488
|
not_smsa |
L1. | -.0452809 .0268467 -1.69 0.092 -.0979147 .0073528
|
cf | -.1641325 .020582 -7.97 0.000 -.204484 -.1237809
_cons | 1.671945 .1212711 13.79 0.000 1.43419 1.909701
-------------+----------------------------------------------------------------
sigma_u | .41441731
sigma_e | .2852065
rho | .67859444 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3293, 4201) = 3.72 Prob > F = 0.0000
However, if I do not use lags in the first stage and instead lag the residual in the second stage, the coefficients differ (because the first stage is then estimated on a different, larger sample).
Code:
quietly xtreg tenure union south age c.age#c.age not_smsa, fe
capture drop cf
predict cf, resid
xtreg ln_w l.tenure l.age cl.age#cl.age l.not_smsa l.cf, fe
Fixed-effects (within) regression Number of obs = 7,500
Group variable: idcode Number of groups = 3,294
R-sq: Obs per group:
within = 0.1353 min = 1
between = 0.1785 avg = 2.3
overall = 0.1767 max = 6
F(5,4201) = 131.45
corr(u_i, Xb) = 0.1454 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure |
L1. | .2566965 .0304213 8.44 0.000 .1970547 .3163383
|
age |
L1. | .0144529 .006859 2.11 0.035 .0010056 .0279002
|
cL.age#|
cL.age | -.0013382 .0001577 -8.48 0.000 -.0016475 -.001029
|
not_smsa |
L1. | -.0346281 .027326 -1.27 0.205 -.0882015 .0189453
|
cf |
L1. | -.2452925 .0305005 -8.04 0.000 -.3050896 -.1854954
|
_cons | 1.710315 .1238945 13.80 0.000 1.467417 1.953214
-------------+----------------------------------------------------------------
sigma_u | .41454272
sigma_e | .28517027
rho | .67878182 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3293, 4201) = 3.72 Prob > F = 0.0000
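To make the sample difference visible, the estimation samples of the two first stages can be flagged and compared with the sample of the lagged IV model. This is only a sketch: the indicator names iv_sample, fs_levels, and fs_lags are hypothetical, and I do not state the counts because I have not tabulated them beyond the numbers of observations shown in the outputs above.

```stata
webuse nlswork, clear
xtset idcode year

* estimation sample of the lagged IV model
quietly xtivreg ln_w l.age cl.age#cl.age l.not_smsa (l.tenure = l.union l.south), fe
generate byte iv_sample = e(sample)

* first stage in levels: uses every observation with complete data
quietly xtreg tenure union south age c.age#c.age not_smsa, fe
generate byte fs_levels = e(sample)

* first stage in lags: uses only observations whose previous year is observed
quietly xtreg l.tenure l.union l.south l.age cl.age#cl.age l.not_smsa, fe
generate byte fs_lags = e(sample)

* compare the three samples
count if fs_levels
count if fs_lags
count if iv_sample
tabulate fs_lags iv_sample
```

Because the within transformation subtracts person-specific means computed over the estimation sample, a first stage fitted on a different set of years produces different demeaned variables and therefore different residuals, even for the same person-year.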
Is it completely incorrect to do this:
Code:
quietly xtreg tenure union south age c.age#c.age not_smsa, fe
capture drop cf
predict cf, resid
xtreg ln_w l.tenure l.age cl.age#cl.age l.not_smsa l.cf, fe
instead of this?
Code:
quietly xtreg l.tenure l.union l.south l.age cl.age#cl.age l.not_smsa, fe
capture drop cf
predict cf, resid
xtreg ln_w l.tenure l.age cl.age#cl.age l.not_smsa cf, fe
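One way to probe whether the discrepancy comes only from the sample would be to keep the first stage in levels but restrict it to the person-years that actually feed the lagged model, and then lag the residual. The names iv_sample, feeds_iv, and cf_lev are hypothetical, and predict, e is used here in place of resid. If the sample is the only source of the difference, I would expect the coefficients to match the xtivreg, fe output with lags, but I have not verified this.

```stata
webuse nlswork, clear
xtset idcode year

* flag the observations used by the lagged IV model
quietly xtivreg ln_w l.age cl.age#cl.age l.not_smsa (l.tenure = l.union l.south), fe
generate byte iv_sample = e(sample)

* a year t-1 row feeds the model only if year t is in the IV sample
generate byte feeds_iv = (f.iv_sample == 1)

* first stage in levels, restricted to those rows
quietly xtreg tenure union south age c.age#c.age not_smsa if feeds_iv, fe
predict double cf_lev if e(sample), e

* second stage with the lagged residual
xtreg ln_w l.tenure l.age cl.age#cl.age l.not_smsa l.cf_lev, fe
```

The lead operator f. returns missing when the following year is not observed, so comparing it with 1 leaves those rows out of the restricted first stage.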
Sorry for the long post. I just wanted to demonstrate my reasoning with examples.