806 Computational Considerations in Microeconometrics
where each term in the sum on the right-hand side is the product of the mixing probability $\pi_j$ and the component (sub-population) density $f_j(y_i|\theta_j)$. Such models are sometimes referred to as models of permanent unobserved heterogeneity. In general the $\pi_j$ are unknown and hence need to be estimated along with all the other parameters; also $\pi_C = 1 - \sum_{j=1}^{C-1} \pi_j$. For identifiability the (label-switching) restriction $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_C$ is imposed; this can always be satisfied by rearrangement post-estimation, so it plays no role in estimation. The mixing probabilities $\pi_j$ may be further parameterized in terms of observed covariates using, e.g., the logit function.
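The mixture density above is straightforward to evaluate numerically. The following sketch, which assumes a two-component normal mixture purely for concreteness (the function names and simulated values are illustrative, not from the text), computes the sample log-likelihood $\sum_i \log \sum_j \pi_j f_j(y_i|\theta_j)$:

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, means, sds, pis):
    """Log-likelihood of a C-component normal finite mixture:
    sum_i log( sum_j pi_j * f_j(y_i | theta_j) )."""
    # component densities: one column per sub-population, shape (n, C)
    dens = np.column_stack([norm.pdf(y, m, s) for m, s in zip(means, sds)])
    # mixture density is the pi-weighted row sum; then sum the logs
    return np.log(dens @ pis).sum()

rng = np.random.default_rng(0)
# simulate 500 draws from a two-component mixture with pi = (0.6, 0.4)
z = rng.random(500) < 0.6
y = np.where(z, rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.5, 500))
ll = mixture_loglik(y, means=[0.0, 4.0], sds=[1.0, 1.5],
                    pis=np.array([0.6, 0.4]))
```

When the $\pi_j$ depend on covariates through a logit function, the fixed `pis` vector would simply be replaced by an $n \times C$ matrix of fitted probabilities, one row per observation.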
For given $C$, maximum likelihood is a natural estimator for the FM model (see McLachlan and Peel, 2000, Ch. 2). Lindsay (1983) showed that finding the MLE involves a standard convex maximization problem in which a concave function is maximized over a convex set. An implication is that, if the likelihood is bounded, there exists a distribution, in the class of discrete distribution functions $G$ with $n$ or fewer points of support, that maximizes the likelihood.
There are two commonly used computational approaches: direct gradient-based optimization, which solves the likelihood equations, and the expectation maximization (EM) algorithm described below (see McLachlan and Peel, 2000). If the likelihood is bounded, then under correct specification of the model it has a global maximum. But the global maximum may be difficult to locate if the component distributions are not well separated, and convergence may be sensitive to the starting values. One way to guard against such a possibility is to check for robustness of convergence from different starting values. In some cases, such as the mixture of normals, the likelihood is unbounded and no global maximizer exists, so convergence will be to a local maximum. In practice, especially when the sample is small, the presence of local maxima cannot be ruled out.
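The multiple-starting-values safeguard can be sketched as follows for direct gradient-based ML. The example assumes a two-component Poisson mixture (whose likelihood is bounded); the parameterization via `expit` and `exp` keeps the optimizer unconstrained, and all names and simulated values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import poisson

def negloglik(params, y):
    """Negative log-likelihood of a 2-component Poisson mixture.
    params = (logit(pi_1), log(lambda_1), log(lambda_2))."""
    pi1 = expit(params[0])
    lam = np.exp(params[1:])
    dens = pi1 * poisson.pmf(y, lam[0]) + (1 - pi1) * poisson.pmf(y, lam[1])
    # small constant guards against log(0) at wild trial parameter values
    return -np.log(dens + 1e-300).sum()

rng = np.random.default_rng(1)
# simulate a half-and-half mixture of Poisson(1) and Poisson(6) counts
y = np.where(rng.random(400) < 0.5, rng.poisson(1.0, 400),
             rng.poisson(6.0, 400))

# guard against local maxima: restart from several random points
fits = [minimize(negloglik, rng.normal(size=3), args=(y,), method="BFGS")
        for _ in range(5)]
best = min(fits, key=lambda r: r.fun)
```

Agreement of the best objective value across restarts is evidence (not proof) that the global maximum has been found; disagreement signals poorly separated components or a flat likelihood surface.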
In practice $C$ is unknown. For a given sample size $n$, the standard way of selecting $C$ is to treat this as a model selection problem and to use information criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) (see Deb and Trivedi, 2002, for a detailed application).
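Selecting $C$ by information criteria amounts to penalizing the maximized log-likelihood by the number of free parameters. A minimal sketch, in which the log-likelihood values for $C = 1, 2, 3$ are purely illustrative placeholders (not results from any cited application):

```python
import numpy as np

def aic_bic(loglik, n_params, n_obs):
    """AIC = 2k - 2*loglik; BIC = k*log(n) - 2*loglik. Smaller is better."""
    aic = 2 * n_params - 2 * loglik
    bic = np.log(n_obs) * n_params - 2 * loglik
    return aic, bic

# hypothetical maximized log-likelihoods for C = 1, 2, 3 components;
# a C-component univariate normal mixture has 3C - 1 free parameters
fits = {1: -1420.5, 2: -1361.2, 3: -1359.8}  # illustrative numbers only
for C, ll in fits.items():
    print(C, aic_bic(ll, n_params=3 * C - 1, n_obs=500))
```

Because BIC's penalty grows with $\log n$, it typically selects fewer components than AIC in large samples; in the illustrative numbers above the gain in log-likelihood from $C = 2$ to $C = 3$ is too small to justify the extra parameters under either criterion.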
15.5.2.1 EM algorithm for model estimation
If $C$ is given, the problem is to maximize the log-likelihood $\mathcal{L}(\pi, \theta \mid C, y)$. Let $d_i = (d_{i1}, \ldots, d_{iC})'$ define an indicator (dummy) variable such that $d_{ij} = 1$, $\sum_j d_{ij} = 1$, indicating that $y_i$ was drawn from the $j$th (latent) group or class for $i = 1, \ldots, n$. That is, each observation may be regarded as a draw from one of the $C$ latent classes or "types," each with its own distribution. The FM model specifies that $(y_i \mid d_i, \theta, \pi)$ are independently distributed with densities $\prod_{j=1}^{C} f(y_i|\theta_j)^{d_{ij}}$, and $(d_{ij} \mid \theta, \pi)$ are independent and identically distributed (i.i.d.) with multinomial distribution $\prod_{j=1}^{C} \pi_j^{d_{ij}}$, $0 < \pi_j < 1$, $\sum_{j=1}^{C} \pi_j = 1$. Hence the likelihood