296 The Basics of Financial Econometrics
Data Snooping
One of the most serious mistakes a financial econometrician can make when
formulating an investment strategy is to search for rare or unique patterns
that appear profitable in-sample but produce losses out-of-sample.
This mistake is made easy by the availability of powerful computers that can
explore large amounts of data: any large data set contains a huge number
of patterns, many of which look very profitable. Put another way, any
large data set, even one that is randomly generated, can be represented by
models that appear to produce large profits.
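The point can be illustrated with a short simulation. The sketch below, using only NumPy (all sizes and names are illustrative), generates a large number of purely random trading rules on purely random returns; the best rule nevertheless shows a sizable in-sample profit, even though no rule has any real predictive content.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_rules = 1000, 500

# Daily returns of a hypothetical asset: pure noise, zero true mean.
returns = rng.normal(loc=0.0, scale=0.01, size=n_days)

# Each "rule" is a random +1/-1 position for every day.
positions = rng.choice([-1, 1], size=(n_rules, n_days))

# In-sample profit of each rule (sum of signed returns).
profits = positions @ returns

print(f"best in-sample profit out of {n_rules} random rules: {profits.max():.3f}")
print(f"average profit across all rules: {profits.mean():.4f}")
```

The best rule looks clearly profitable, while the average across all rules is close to zero; selecting the in-sample winner is exactly the mistake described above.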
Given the scarcity of data and the fundamentally uncertain nature of any
financial econometric model, it is generally necessary to calibrate models
on one data set, the so-called training set, and to test them on another, the
test set. In other words, it is necessary to perform an out-of-sample
validation on a separate test set. The rationale for this procedure is that
any machine learning process, and indeed the calibration mechanism itself, is
a heuristic methodology, not a true discovery process. Models determined
through a machine learning process must be checked against the reality of
out-of-sample validation. Failing to do so, that is, training and testing on
the same data set, is referred to as data snooping.
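As a sketch of this procedure, assuming daily price data held in a NumPy array (the synthetic data and the 70/30 cutoff are illustrative), the split must be chronological, so that the test set lies strictly in the future relative to the training set:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic daily price path standing in for real market data.
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=1500)))

# Chronological split: never shuffle a time series, or future
# information leaks into the training set.
cutoff = int(0.7 * len(prices))
train, test = prices[:cutoff], prices[cutoff:]

# All model calibration uses `train` only; `test` is touched once,
# for the final out-of-sample validation.
print(len(train), len(test))
```

Shuffling before splitting, as is common with cross-sectional data, would itself be a form of data snooping here, since the model would be calibrated partly on observations that postdate the test period.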
Out-of-sample validation is typical of machine learning methods.
Learning involves models with essentially unbounded approximation capability,
constrained only by somewhat artificial mechanisms such as a penalty function.
This learning mechanism is often effective, but there is no guarantee that it
will produce a good model. The learning process is therefore best regarded as
a discovery heuristic. The true validation, playing the role of the experiment,
has to be performed on the test set. Needless to say, the test set must
be large and cover all possible patterns, at least in some approximate sense.
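A minimal sketch of this idea, using ridge regression on a polynomial model as the penalty-constrained learner (the degree, penalty weights, and sample sizes are all illustrative): the penalty restrains an over-flexible model during training, but the decisive number is the error measured on the held-out test set.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy observations of a smooth signal.
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

# Random train/test partition (acceptable here: observations are i.i.d.).
idx = rng.permutation(x.size)
tr, te = idx[:40], idx[40:]

def fit_ridge(x, y, degree, lam):
    """Solve (X'X + lam*I) theta = X'y for a polynomial design matrix."""
    X = np.vander(x, degree + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def mse(theta, x, y, degree):
    X = np.vander(x, degree + 1)
    return float(np.mean((X @ theta - y) ** 2))

degree = 9
theta_weak = fit_ridge(x[tr], y[tr], degree, lam=1e-6)    # almost no penalty
theta_strong = fit_ridge(x[tr], y[tr], degree, lam=1e-2)  # penalized

# The penalty can only worsen the in-sample fit; whether it actually
# helps must be read off the test-set error.
print("train MSE:", mse(theta_weak, x[tr], y[tr], degree),
      mse(theta_strong, x[tr], y[tr], degree))
print("test MSE: ", mse(theta_weak, x[te], y[te], degree),
      mse(theta_strong, x[te], y[te], degree))
```

Nothing in the training procedure guarantees that the penalized model generalizes; only the test-set error, computed on data the learner never saw, provides that evidence.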
Data snooping is not always easy to understand or detect. It results
from a defect of the training process that must be controlled but is very
difficult to avoid given the size of the data samples currently available.
Suppose samples spanning roughly 10 years are available.^7 One can partition
these data and perform a single test free from data snooping biases. However,
if the test fails, one has to start all over again and design a new strategy.
The process of redesigning the modeling strategy might have to be repeated
several times before an acceptable solution is found. Inevitably, repeating
the process on the same data entails the risk of data snooping.
^7 Technically, much longer data sets on financial markets, up to 50 years of price
data, are available. While useful for some applications, these data may be of
limited use for many financial econometric applications because changes in the
structure of the economy make older data unrepresentative of the markets asset
managers face today.