The real danger in data snooping is the possibility that, by trial and
error, one hits upon a model that by chance performs well both in-sample and
out-of-sample but that will perform poorly in real-world forecasts. In the
next chapter, we explore at length the different ways in which data snooping
and other biases might enter the model discovery process. We also propose a
methodology to minimize the risk of biases, as explained in the last
section of this chapter.
Survivorship Biases and Other Sample Defects
Let us now see how samples might be subject to biases that reduce our ability
to correctly estimate model parameters. In addition to errors and missing data,
a well-known type of bias in financial econometrics is survivorship bias, a bias
exhibited by samples selected on the basis of criteria valid at the last date in the
sample time series. When survivorship bias is present in our data, the return
processes of firms that ceased to exist prior to that date are ignored.
For example, in the study of the performance of mutual funds, poorly
performing mutual funds often close down (and therefore drop out of the sample)
while better performing mutual funds continue to exist (and therefore remain
in the sample). In this situation, estimating past returns from the sample of
surviving funds alone would result in overestimation, because the poor returns
of the closed funds are excluded. As another example, suppose a sample contains
10 years of price data for all stocks that are in the S&P 500 today and that
existed for the last 10 years. This sample, although apparently well formed, is
biased: the selection is made on the stocks of companies that are in the
S&P 500 today, that is, those companies that have “survived” in sufficiently
good shape to still be in the S&P 500 aggregate.
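The mutual fund example can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a procedure taken from the text: annual returns are drawn as independent normal variables, and a fund is assumed to close as soon as its cumulative return falls below a threshold. The function name simulate_fund_returns and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_fund_returns(n_funds=1000, n_years=10, mean=0.05, vol=0.15,
                          closure_threshold=-0.25, seed=0):
    """Compare the mean annual return of all funds with that of survivors.

    Assumption (for illustration only): a fund closes as soon as its
    cumulative return since inception falls below closure_threshold.
    Returns (mean over all funds, mean over survivors, number of survivors).
    """
    rng = random.Random(seed)
    all_returns = []       # every annual return actually earned by any fund
    survivor_returns = []  # returns of funds still alive at the final date
    n_survivors = 0
    for _ in range(n_funds):
        wealth = 1.0
        history = []
        alive = True
        for _ in range(n_years):
            r = rng.gauss(mean, vol)
            history.append(r)
            wealth *= 1.0 + r
            if wealth - 1.0 < closure_threshold:
                alive = False  # the fund closes and leaves the sample
                break
        all_returns.extend(history)
        if alive:
            n_survivors += 1
            survivor_returns.extend(history)
    return (statistics.mean(all_returns),
            statistics.mean(survivor_returns),
            n_survivors)


if __name__ == "__main__":
    full_mean, survivor_mean, n_survivors = simulate_fund_returns()
    print(f"funds surviving to the final date: {n_survivors}")
    print(f"mean annual return, all funds:     {full_mean:.4f}")
    print(f"mean annual return, survivors:     {survivor_mean:.4f}")
```

Under these assumptions, the survivors-only average exceeds the average over all funds, because the returns of the closed funds, which include the losses that triggered closure, are left out of the survivor sample.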
Survivorship bias arises from the fact that many of the surviving entities
(mutual funds or individual stocks) successfully passed through some
difficult period. Surviving the difficulty is a form of reversion to the mean. An
asset manager may indeed produce trading profits by buying cheaply when a
company is facing difficulty and exploiting the subsequent recovery. At the
end of the period, however, we know which firms recovered, information that
was not available when the trades would have been made.
Survivorship bias is a consequence of selecting time series, asset price
time series in particular, based on criteria that apply at the end of the period.
Avoiding survivorship bias seems simple in principle. It might seem
sufficient to base any sample selection on the moment when the forecast begins,
so that no information unavailable at that time enters the strategy prior to
trading. However, the fact that companies are founded, merged, and closed plays
havoc with simple models. In fact, calibrating a simple model requires data for
assets that exist over the entire training period. This requirement in itself
introduces a potentially substantial training bias.
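The difference between selecting a sample at the start of the period, at the end of the period, and over the entire training period can be made concrete with a short sketch. The membership records, tickers, and function names below are hypothetical and serve only as an illustration; each asset is assumed to have a single uninterrupted membership interval.

```python
from datetime import date

# Hypothetical membership records: (ticker, date added, date removed or None).
MEMBERSHIP = [
    ("AAA", date(2005, 1, 3), None),
    ("BBB", date(2008, 6, 2), date(2015, 3, 31)),    # dropped during the sample
    ("CCC", date(2012, 9, 4), None),                 # added after the start
    ("DDD", date(2001, 5, 1), date(2011, 11, 30)),   # dropped during the sample
]


def universe_as_of(records, as_of):
    """Return the tickers that are members on the date as_of."""
    return {
        ticker
        for ticker, added, removed in records
        if added <= as_of and (removed is None or as_of < removed)
    }


def full_history_universe(records, start, end):
    """Return the tickers that are members over the whole period [start, end].

    Assumes one uninterrupted membership interval per ticker, so membership
    at both endpoints implies membership throughout.
    """
    return universe_as_of(records, start) & universe_as_of(records, end)


if __name__ == "__main__":
    start, end = date(2010, 1, 1), date(2020, 1, 1)
    print("selected at start:", sorted(universe_as_of(MEMBERSHIP, start)))
    print("selected at end:  ", sorted(universe_as_of(MEMBERSHIP, end)))
    print("full history:     ", sorted(full_history_universe(MEMBERSHIP, start, end)))
```

In this toy data set, selecting at the end date drops the two assets that disappeared during the period, while requiring a full history leaves only one asset out of four, which illustrates how the requirement of complete data can itself bias the training sample.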