764 Microeconometrics: Methods and Developments
exogenous sampling g(x) does not involve θ, so that inference on θ can be based
on the conditional log-likelihood based only on f(y|x,θ).
Under endogenous stratification, however, it can be shown that g(y,x|θ) takes a
more complicated form, and ML estimation needs to be based on the joint
log-likelihood based on g(y,x|θ). Standard estimators that instead continue to use
f(y|x,θ) are inconsistent. Examples include truncated regression (for example,
hours of work are modeled and only workers are surveyed), choice-based sampling
(for example, commute mode choice is modeled and bus-riders are deliberately
oversampled as there are relatively few bus-riders), on-site sampling and
case-control studies. Much of the econometrics literature has focused on choice-based
sampling in discrete choice models, with estimation by weighted MLE (see Manski
and Lerman, 1977) or more efficient GMM methods (see Imbens, 1992). A more
general presentation for endogenous stratification is given by Imbens and
Lancaster (1996). Wooldridge (2001) considers inverse-probability weighting
for m-estimators.
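The effect of choice-based sampling can be illustrated with a small simulation. The sketch below is not taken from the sources cited above: the population, the parameter values and the helper `fit_logit` are hypothetical, and the weights are assumed to take the form (population share of the choice) / (sample share of the choice), which is the usual form of weighted MLE under choice-based sampling.

```python
import numpy as np

def fit_logit(X, y, w=None, n_iter=50):
    """Weighted logit MLE by Newton-Raphson; w=None means unit weights."""
    w = np.ones(len(y)) if w is None else w
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # minus weighted Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)

# Hypothetical population in which y = 1 is the rare choice (e.g. bus).
N = 200_000
x = rng.normal(size=N)
beta_true = np.array([-3.0, 1.0])  # assumed intercept and slope
X_pop = np.column_stack([np.ones(N), x])
p_pop = 1.0 / (1.0 + np.exp(-X_pop @ beta_true))
y_pop = (rng.uniform(size=N) < p_pop).astype(float)

# Choice-based sample: equal numbers of y = 1 and y = 0 observations,
# so the rare choice is deliberately oversampled.
n_per = 5_000
idx1 = rng.choice(np.flatnonzero(y_pop == 1), n_per, replace=False)
idx0 = rng.choice(np.flatnonzero(y_pop == 0), n_per, replace=False)
idx = np.concatenate([idx1, idx0])
X, y = X_pop[idx], y_pop[idx]

# Weights: population share of each choice divided by its sample share.
Q1, H1 = y_pop.mean(), y.mean()
w = np.where(y == 1, Q1 / H1, (1.0 - Q1) / (1.0 - H1))

beta_unweighted = fit_logit(X, y)     # ignores the sampling scheme
beta_weighted = fit_logit(X, y, w)    # weighted MLE

print("true      ", beta_true)
print("unweighted", beta_unweighted)
print("weighted  ", beta_weighted)
```

In this logit design the distortion from ignoring the sampling scheme shows up mainly in the intercept, a special feature of the logit model with an intercept; in other models the slope coefficients are generally affected as well.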
Stratified surveys usually provide sample weights that can be used to obtain
population representative statistics. Under exogenous stratification, these sample
weights need not be used in the typical situation where correct specification of a
regression model is assumed. For example, assume that the regression function is
linear in x, so that y = x′β + u, E[u|x] = 0 and E[y|x] = x′β. Then OLS is consistent
even if the sample is not representative of the population in x. Weights are
needed in estimation if we wish to relax the assumption that
E[y|x] = x′β, due to nonlinearity or because β varies across strata. Then weighted
OLS should be used as it provides an estimate of the so-called census coefficient β∗
that has probability limit equal to the regression coefficient that would be obtained
by regression of y on x using the entire population (see DuMouchel and Duncan,
1983). For example, a weighted OLS regression of earnings on years of schooling
provides a consistent estimate of the population marginal effect on earnings of
one more year of schooling, without assuming that the model is linear. Note that
even if unweighted estimation is appropriate, weights may still be used in making
predictions from the model. For example, if E[y|x] is nonlinear in x then marginal
effects vary with evaluation point x, so that weights should be used to compute an
estimate of the population marginal effect.
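The role of the weights under exogenous stratification can be checked numerically. The following is a minimal sketch under assumed values: the population, the quadratic conditional mean, the inclusion probabilities and the helper `ols` are all hypothetical. Strata are defined by x alone, the high-x stratum is oversampled, and each sampled observation is weighted by the inverse of its inclusion probability.

```python
import numpy as np

def ols(x, y, w=None):
    """(Weighted) OLS of y on a constant and x; returns [intercept, slope]."""
    X = np.column_stack([np.ones(len(x)), x])
    w = np.ones(len(x)) if w is None else w
    XtWX = (X * w[:, None]).T @ X
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

rng = np.random.default_rng(1)

# Hypothetical population with a nonlinear conditional mean E[y|x] = 0.1 x^2,
# so a linear regression of y on x is misspecified.
N = 200_000
x_pop = rng.uniform(0.0, 10.0, size=N)
y_pop = 0.1 * x_pop**2 + rng.normal(size=N)

# Census coefficient: OLS of y on x using the entire population.
beta_census = ols(x_pop, y_pop)

# Exogenous stratification on x: the high-x stratum is oversampled.
prob = np.where(x_pop > 5.0, 0.10, 0.01)   # assumed inclusion probabilities
sampled = rng.uniform(size=N) < prob
x, y = x_pop[sampled], y_pop[sampled]
w = 1.0 / prob[sampled]                    # sample weights

beta_unweighted = ols(x, y)
beta_weighted = ols(x, y, w)

print("census    ", beta_census)
print("unweighted", beta_unweighted)
print("weighted  ", beta_weighted)
```

With a correctly specified linear conditional mean in place of the quadratic one, both estimators would be expected to recover the same coefficients, which is the case in which the weights need not be used.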
A major reason for stratification is to improve efficiency of estimates of the
population mean of a single variable, such as earnings or unemployment, when the
mean of that variable differs across strata. This efficiency gain can carry over to
regression, and some regression packages include commands that exploit it. These are
widely used in biostatistics but not in econometrics, in part because the efficiency
gains are felt to be small and in part because not all datasets provide the necessary
information on the strata. Bhattacharya (2005) presents results for m-estimation
and a good discussion of the issues.
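The efficiency gain for a population mean can be seen in a small Monte Carlo exercise. This is an illustrative sketch only: the two strata, their means and the sample sizes are hypothetical, and proportional allocation across strata is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: two equal-sized strata with different means.
N_h = 50_000
strata = [rng.normal(20.0, 5.0, N_h), rng.normal(60.0, 5.0, N_h)]
population = np.concatenate(strata)
share = np.array([0.5, 0.5])   # population shares of the two strata

n, reps = 200, 2_000
srs_means = np.empty(reps)
strat_means = np.empty(reps)
for r in range(reps):
    # Simple random sample of size n from the whole population.
    srs_means[r] = rng.choice(population, n, replace=False).mean()
    # Stratified sample: n/2 observations from each stratum, with the
    # stratum means combined using the population shares.
    stratum_means = [rng.choice(s, n // 2, replace=False).mean() for s in strata]
    strat_means[r] = share @ np.array(stratum_means)

var_srs = srs_means.var()
var_strat = strat_means.var()
print("variance of mean, simple random sample:", var_srs)
print("variance of mean, stratified sample:   ", var_strat)
```

The stratified estimator removes the between-strata component of the variance, so the gain is large here because the stratum means are far apart; it shrinks toward zero as the stratum means become similar.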
In addition to stratification, survey methods often induce dependence for
sub-groups of observations. For example, several households on the same block
may be interviewed. Then data in that sub-group are likely to be positively cor-
related and, even after controlling for regressors, model errors are likely to be