In practice there is usually only a single dataset of limited size. What can be
done? We could split the data into (perhaps 10) subsets and perform a
cross-validation on each. However, the overall result will only tell us whether a
learning scheme is preferable for that particular size, perhaps one-tenth of the
original dataset. Alternatively, the original dataset could be reused, for
example, with different randomizations of the dataset for each cross-validation.^2
However, the resulting cross-validation estimates will not be independent
because they are not based on independent datasets. In practice, this means that
a difference may be judged to be significant when it is not. Indeed, just
increasing the number of samples k, that is, the number of cross-validation runs,
will eventually yield an apparently significant difference because the value of the
t-statistic increases without bound.
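The growth of the statistic with k can be checked with a little arithmetic. The sketch below assumes the standard paired statistic takes the usual form, the mean difference divided by the square root of (variance of the differences / k). The function name `standard_t` and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def standard_t(mean_diff: float, var_diff: float, k: int) -> float:
    """Standard paired t-statistic: mean difference over its estimated standard error."""
    return mean_diff / math.sqrt(var_diff / k)


# Hypothetical values: a mean accuracy difference of 0.01 and a variance of
# 0.0025 in the differences, held fixed while the number of runs k grows.
for k in (10, 100, 1000, 10000):
    print(k, round(standard_t(0.01, 0.0025, k), 2))
```

With the mean and variance of the differences held fixed, the statistic grows in proportion to the square root of k, so reusing the same data often enough will push it past any significance threshold.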
Various modifications of the standard t-test have been proposed to circumvent
this problem, all of them heuristic and lacking sound theoretical justification.
One that appears to work well in practice is the corrected resampled t-test.
Assume for the moment that the repeated holdout method is used instead of
cross-validation, repeated k times on different random splits of the same dataset
to obtain accuracy estimates for two learning methods. Each time, n_1 instances
are used for training and n_2 for testing, and differences d_i are computed from
performance on the test data. The corrected resampled t-test uses the modified
statistic

\[
t = \frac{\bar{d}}{\sqrt{\left(\dfrac{1}{k} + \dfrac{n_2}{n_1}\right)\sigma_d^2}}
\]

in exactly the same way as the standard t-statistic. A closer look at the formula
shows that its value cannot be increased simply by increasing k. The same
modified statistic can be used with repeated cross-validation, which is just a
special case of repeated holdout in which the individual test sets for one
cross-validation do not overlap. For 10-fold cross-validation repeated 10 times,
k = 100, n_2/n_1 = 0.1/0.9, and \sigma_d^2 is based on 100 differences.
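As a minimal sketch, the corrected statistic can be computed from a list of k paired differences as follows. The function name `corrected_resampled_t`, the sample differences, and the use of the sample variance (k - 1 in the denominator) are assumptions made for illustration, not details fixed by the text.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic for k paired performance differences.

    Only the ratio n_test / n_train enters the formula, so fractions such as
    0.9 and 0.1 work as well as instance counts.
    """
    k = len(diffs)
    d_bar = mean(diffs)
    var_d = variance(diffs)  # assumption: sample variance with k - 1 denominator
    return d_bar / math.sqrt((1.0 / k + n_test / n_train) * var_d)


# Hypothetical accuracy differences from k = 10 random 90%/10% splits.
diffs = [0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02, 0.04, 0.01, 0.00]
t = corrected_resampled_t(diffs, n_train=0.9, n_test=0.1)
print(round(t, 3))

# As k grows, 1/k vanishes and the statistic approaches a finite limit
# governed by the n_test / n_train term.
many = diffs * 1000
limit = mean(many) / math.sqrt((0.1 / 0.9) * variance(many))
print(corrected_resampled_t(many, 0.9, 0.1) < limit)
```

The term n_2/n_1 does not shrink as k grows, which is why repeating the experiment more often cannot inflate the statistic indefinitely.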
5.6 Predicting probabilities
Throughout this section we have tacitly assumed that the goal is to maximize
the success rate of the predictions. The outcome for each test instance is either
correct, if the prediction agrees with the actual value for that instance, or
incorrect, if it does not. There are no grays: everything is black or white, correct or
^2 The method was advocated in the first edition of this book.