
the exemplar set. If its performance exceeds the upper threshold, it is used for
predicting the class of new instances. If its performance lies between the two, it
is not used for prediction but, whenever it is the closest exemplar to the new
instance (and thus would have been used for prediction if its performance
record had been good enough), its success statistics are updated as though it
had been used to classify that new instance.
To accomplish this, we use the confidence limits on the success probability
of a Bernoulli process that we derived in Section 5.2. Recall that we took a
certain number of successes S out of a total number of trials N as evidence
on which to base confidence limits on the true underlying success rate p.
Given a certain confidence level of, say, 5%, we can calculate upper and
lower bounds and be 95% sure that p lies between them.
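To make this concrete, here is a minimal sketch in Python of the bound
computation, assuming the normal-approximation (Wilson score) interval
derived in Section 5.2; the function name wilson_bounds and the two-sided
treatment of the confidence level are illustrative assumptions, not part of
the algorithm's specification.

from math import sqrt
from statistics import NormalDist

def wilson_bounds(s, n, alpha):
    # Two-sided confidence bounds on a Bernoulli success rate:
    # s successes observed in n trials, with total tail mass alpha
    # (alpha = 0.05 gives the 95% interval described above).
    if n == 0:
        return 0.0, 1.0                    # no evidence yet
    z = NormalDist().inv_cdf(1 - alpha / 2)
    f = s / n                              # observed success rate
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom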
To apply this to the problem of deciding when to accept a particular
exemplar, suppose that it has been used n times to classify other instances
and that s of these have been successes. That allows us to estimate bounds,
at a particular confidence level, on the true success rate of this exemplar.
Now suppose that the exemplar’s class has occurred c times out of a total
number N of training instances. This allows us to estimate bounds on the
default success rate, that is, the probability of successfully classifying
an instance of this class without any information about other instances.
For an exemplar to be accepted, we insist that the lower confidence bound
on its success rate exceed the upper confidence bound on the default
success rate. We use the same method to devise a criterion for rejecting a
poorly performing exemplar, requiring that the upper confidence bound on
its success rate lie below the lower confidence bound on the default
success rate.
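Expressed as code, the two criteria might look as follows. This sketch
builds on the wilson_bounds helper above; the argument names follow the
notation in the text (s successes in n uses for the exemplar, c occurrences
of its class among N training instances).

def is_acceptable(s, n, c, N, alpha):
    # Accept: the exemplar's lower bound clears the default rate's
    # upper bound.
    exemplar_low, _ = wilson_bounds(s, n, alpha)
    _, default_high = wilson_bounds(c, N, alpha)
    return exemplar_low > default_high

def is_rejectable(s, n, c, N, alpha):
    # Reject: the exemplar's upper bound falls below the default rate's
    # lower bound.
    _, exemplar_high = wilson_bounds(s, n, alpha)
    default_low, _ = wilson_bounds(c, N, alpha)
    return exemplar_high < default_low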
With a suitable choice of thresholds, this scheme works well. In a particular
implementation, called IB3 for Instance-Based Learner version 3, a confidence
level of 5% is used to determine acceptance, whereas a level of 12.5% is used
for rejection. The lower percentage figure produces a wider confidence interval,
which makes a more stringent criterion because it is harder for the lower bound
of one interval to lie above the upper bound of the other. The criterion for
acceptance is more stringent than that for rejection, making it more difficult for
an instance to be accepted. The reason for a less stringent rejection criterion is
that there is little to be lost by dropping instances with only moderately poor
classification accuracies: they will probably be replaced by a similar instance
later. Using these thresholds the method has been found to improve the per-
formance of instance-based learning and, at the same time, dramatically reduce
the number of exemplars—particularly noisy exemplars—that are stored.
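Putting the pieces together, the maintenance step for a single exemplar
might read as follows; the record structure (a successes counter, a trials
counter, and a state flag) is a hypothetical illustration, and the two
alpha values are IB3's published levels.

ACCEPT_ALPHA = 0.05     # wider interval, more stringent test
REJECT_ALPHA = 0.125    # narrower interval, easier to trigger

def update_exemplar(ex, correct, class_count, num_training):
    # Called whenever ex was the closest exemplar to a new instance:
    # update its success record, then re-test acceptance and rejection.
    ex.trials += 1
    if correct:
        ex.successes += 1
    if is_acceptable(ex.successes, ex.trials, class_count, num_training,
                     ACCEPT_ALPHA):
        ex.state = "accepted"   # eligible for future predictions
    elif is_rejectable(ex.successes, ex.trials, class_count, num_training,
                       REJECT_ALPHA):
        ex.state = "dropped"    # discard as a probable noisy exemplar
    # Otherwise the exemplar stays provisional and its statistics
    # continue to accumulate, as described above.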


Weighting attributes


The Euclidean distance function, modified to scale all attribute values to
between 0 and 1, works well in domains in which the attributes are equally
relevant.
