Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

First, reduce the difference to a zero-mean, unit-variance variable called the
t-statistic:

$$t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$$

where $\sigma_d^2$ is the variance of the difference samples. Then, decide on a confidence
level; generally, 5% or 1% is used in practice. From this the confidence limit z
is determined using Table 5.2 if k is 10; if it is not, a confidence table of the
Student’s distribution for the k value in question is used. A two-tailed test is
appropriate because we do not know in advance whether the mean of the x’s is
likely to be greater than that of the y’s or vice versa: thus for a 1% test we use
the value corresponding to 0.5% in Table 5.2. If the value of t according to the
preceding formula is greater than z, or less than -z, we reject the null hypothesis
that the means are the same and conclude that there really is a significant
difference between the two learning methods on that domain for that dataset size.
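The paired procedure just described can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the book: the function names and the error figures are hypothetical, the variance of the differences is assumed to be the sample variance (divisor k - 1), and the critical value 3.25 is the 0.5% entry for 9 degrees of freedom from a standard Student's t table.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return the t-statistic for paired samples x[i], y[i]."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]  # per-dataset differences
    d_mean = statistics.mean(d)
    d_var = statistics.variance(d)  # assumption: sample variance, divisor k - 1
    if d_var == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return d_mean / math.sqrt(d_var / k)


def reject_null(t, z):
    """Two-tailed decision: reject the null hypothesis if t > z or t < -z."""
    return t > z or t < -z


# Hypothetical error rates (%) of two schemes on the same k = 10 datasets.
x = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6, 12.2, 12.5]
y = [11.0, 11.2, 12.1, 11.8, 11.5, 11.1, 12.0, 11.3, 11.4, 11.9]

# Two-tailed 1% test with 9 degrees of freedom: use the 0.5% table entry.
Z_ONE_PERCENT_TWO_TAILED = 3.25

t = paired_t_statistic(x, y)
print("t =", t, "reject null:", reject_null(t, Z_ONE_PERCENT_TWO_TAILED))
```

The critical value is passed in rather than computed, which mirrors the table lookup in the text and keeps the sketch free of third-party dependencies.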
Two observations are worth making on this procedure. The first is technical:
what if the observations were not paired? That is, what if we were unable, for
some reason, to assess the error of each learning scheme on the same datasets?
What if the number of datasets for each scheme was not even the same? These
conditions could arise if someone else had evaluated one of the methods and
published several different estimates for a particular domain and dataset size—
or perhaps just their mean and variance—and we wished to compare this with
a different learning method. Then it is necessary to use a regular, nonpaired
t-test. If the means are normally distributed, as we are assuming, the difference
between the means is also normally distributed. Instead of taking the mean of
the difference, $\bar{d}$, we use the difference of the means, $\bar{x} - \bar{y}$. Of course, that’s the
same thing: the mean of the difference is the difference of the means. But the
variance of the difference $\bar{d}$ is not the same. If the variance of the samples
$x_1, x_2, \ldots, x_k$ is $\sigma_x^2$ and the variance of the samples $y_1, y_2, \ldots, y_\ell$ is $\sigma_y^2$,
the best estimate of the variance of the difference of the means is

$$\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{\ell}.$$

It is this variance (or rather, its square root) that should be used as the
denominator of the t-statistic given previously. The degrees of freedom, necessary for
consulting Student’s confidence tables, should be taken conservatively to be the
minimum of the degrees of freedom of the two samples. Essentially, knowing
that the observations are paired allows the use of a better estimate for the
variance, which will produce tighter confidence bounds.
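The nonpaired variant can be sketched the same way. Again, this is an illustrative sketch rather than the book's code: the function name is hypothetical, the sample variances are assumed to use the usual n - 1 divisor, and the degrees of freedom follow the conservative rule stated above (the minimum over the two samples).

```python
import math
import statistics


def unpaired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for two independent samples.

    The samples may have different sizes k and l.
    """
    k, l = len(x), len(y)
    # Estimated variance of the difference of the means: var_x / k + var_y / l.
    var_of_mean_diff = statistics.variance(x) / k + statistics.variance(y) / l
    if var_of_mean_diff == 0:
        raise ValueError("both samples have zero variance; t is undefined")
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(var_of_mean_diff)
    # Conservative choice: the smaller of the two samples' degrees of freedom.
    dof = min(k - 1, l - 1)
    return t, dof


# Hypothetical error rates (%): 10 estimates for one scheme, 6 for the other.
x = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6, 12.2, 12.5]
y = [11.0, 11.2, 12.1, 11.8, 11.5, 11.1]

t, dof = unpaired_t_statistic(x, y)
print("t =", t, "degrees of freedom =", dof)
```

The returned degrees of freedom determine which row of a Student's t table to consult for the confidence limit.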
The second observation concerns the assumption that there is essentially
unlimited data so that several independent datasets of the right size can be used.

$$\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{\ell}.$$

$$t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$$

