584 Panel Data Methods
self-reported consumption. Cotinine has another advantage in studies of smoking
and health as it provides a way of measuring passive smoking, especially among
children.
12.3.3 Modeling costs and expenditure
Individual-level data on medical expenditures and costs of treatment is typically
distinguished by a spike at zero, if there are non-users in the data, and a strongly
skewed distribution with heavy tails. This kind of data is most often used in two
areas of application: risk adjustment and cost-effectiveness analysis. In risk adjust-
ment the emphasis is on predicting the treatment costs for particular types of
patient, often with very large datasets. Cost-effectiveness analyses tend to work
with smaller datasets and the scope for parametric modeling may be more limited
(Briggs et al., 2005). In the context of clinical trials, attention has focused on methods
to deal with censoring of cost data due to limited follow-up (e.g., Baser et al.,
2006; Raikou and McGuire, 2004, 2006).
The presence of a substantial proportion of zeros in the data has typically been
handled by using a two-part model, which distinguishes between a binary indi-
cator, used to model the probability of any costs, and a conditional regression
model for the positive costs. OLS applied to the level of costs (y) can perform
poorly, due to the high degree of skewness and excess kurtosis, and the positive
observations are often transformed prior to estimation. The most common transformation
is the logarithm of y, although the square root is sometimes used as
well. As the policy interest typically focuses on modeling costs on the original
scale, the regression results have to be retransformed back to that scale. This weak-
ens the case for working with transformed data and, in particular, problems arise
with the retransformation if there is heteroskedasticity in the data on the trans-
formed scale (Manning, 1998; Manning and Mullahy, 2001; Mullahy, 1998). Ai and
Norton (2000) provide standard errors for the retransformed estimates when there
is heteroskedasticity.
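The two-part model and the retransformation step can be illustrated with a minimal sketch in Python, using NumPy only. Everything in it is an assumption made for illustration rather than something taken from the studies cited: the simulated data, the parameter values, the function names, and the choice of Duan's smearing factor as the retransformation device. The smearing factor used here is a single constant, which is appropriate only when the errors on the log scale are homoskedastic.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logit(X, d, n_iter=25):
    """Part 1: logit for P(y > 0 | x), fitted by Newton-Raphson."""
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ gamma)
        score = X.T @ (d - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        gamma = gamma + np.linalg.solve(hessian, score)
    return gamma


def fit_log_ols(X, y):
    """Part 2: OLS of log(y) on x for positive costs, plus a smearing factor."""
    log_y = np.log(y)
    beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
    resid = log_y - X @ beta
    # Smearing factor: sample mean of the exponentiated log-scale residuals.
    smear = np.mean(np.exp(resid))
    return beta, smear


# Simulated data (hypothetical): a spike at zero and skewed positive costs.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # constant plus one covariate
any_cost = rng.random(n) < sigmoid(-0.5 + 0.8 * x)
log_cost = 1.0 + 0.5 * x + rng.normal(size=n)  # homoskedastic log-scale errors
y = np.where(any_cost, np.exp(log_cost), 0.0)

positive = y > 0
gamma_hat = fit_logit(X, positive.astype(float))
beta_hat, smear = fit_log_ols(X[positive], y[positive])

# Predicted mean cost on the original scale: P(y > 0 | x) * E(y | y > 0, x).
naive_pred = sigmoid(X @ gamma_hat) * np.exp(X @ beta_hat)  # ignores retransformation
smeared_pred = naive_pred * smear

print("mean observed cost:       ", y.mean())
print("mean naive prediction:    ", naive_pred.mean())
print("mean smeared prediction:  ", smeared_pred.mean())
```

In this simulated setting the naive prediction, which simply exponentiates the fitted log-scale values, understates mean costs, while the smeared prediction lies close to the observed mean. If the log-scale error variance depended on x, a single smearing factor would no longer be adequate, which is the retransformation problem described above.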
More recently, attention has shifted to other estimators. Basu et al. (2004) compare
log-transformed models to the Cox proportional hazard model. Gilleskie and
Mroz (2004) propose a flexible approach that divides the data into discrete intervals
then applies discrete hazard models, implemented as sequential logits. Conway and
Deb (2005) use a finite mixture model. Cooper et al. (2007) use hierarchical regressions
implemented using Bayesian Markov chain Monte Carlo (MCMC) estimation.
But the dominant approach in the recent literature has been the use of generalized
linear models (GLMs) (e.g., Buntin and Zaslavsky, 2004; Manning, 2006; Manning
et al., 2005; Manning and Mullahy, 2001). The GLM specifies a link function for
the relationship between the conditional mean, μ = E(y|x), and a linear function
of the covariates and specifies the form of the conditional variance, V(y|x), usually
assuming that it can be specified as a simple function of the mean. The models
are estimated using a quasi-likelihood approach derived from the quasi-score or
“estimating equations.” The most popular specification of the GLM for costs has
been the log-link with a gamma error (Blough et al., 1999; Manning et al., 2005;
Manning and Mullahy, 2001, 2005). Cantoni and Ronchetti (2006) propose a