

The third region is where the boundary meets the lower border of the larger rectangle when projected upward and the left border of the smaller one when projected to the right. The boundary is linear in this region because it is equidistant from these two borders. The fourth is where the boundary lies to the right of the larger rectangle but below the bottom of that rectangle. In this case the boundary is parabolic because it is the locus of points equidistant from the lower right corner of the larger rectangle and the left side of the smaller one. The fifth region lies between the two rectangles: here the boundary is vertical. The pattern is repeated in the upper right part of the diagram: first parabolic, then linear, then parabolic (although this particular parabola is almost indistinguishable from a straight line), and finally linear as the boundary escapes from the scope of both rectangles.
This simple situation certainly defines a complex boundary! Of course, it is not necessary to represent the boundary explicitly; it is generated implicitly by the nearest-neighbor calculation. Nevertheless, the solution is still not a very good one. Whereas taking the distance from the nearest instance within a hyperrectangle is overly dependent on the position of that particular instance, taking the distance to the nearest point of the hyperrectangle is overly dependent on that corner of the rectangle: the nearest example might be a long way from the corner.
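
To make the nearest-point calculation concrete, here is a minimal sketch in Python; the function names and rectangle coordinates are purely illustrative, not taken from the figure or from any particular exemplar-based system. Each coordinate of the query point is clamped to the rectangle's bounds, so the distance is zero inside the rectangle, a perpendicular distance opposite a face, and a corner distance beyond a corner.

import math

def distance_to_hyperrectangle(point, lower, upper):
    # Euclidean distance from point to the nearest point of the axis-parallel
    # hyperrectangle with opposite corners lower and upper; zero if inside.
    total = 0.0
    for x, lo, hi in zip(point, lower, upper):
        nearest = min(max(x, lo), hi)   # clamp the coordinate to the rectangle
        total += (x - nearest) ** 2
    return math.sqrt(total)

# Two hypothetical rectangles: a query point takes the class of whichever
# rectangle is nearer, which implicitly generates the piecewise linear and
# parabolic boundary described in the text.
larger = ((0.0, 0.0), (4.0, 3.0))
smaller = ((6.0, 1.0), (8.0, 2.0))

def classify(point):
    d_large = distance_to_hyperrectangle(point, *larger)
    d_small = distance_to_hyperrectangle(point, *smaller)
    return "class of larger rectangle" if d_large <= d_small else "class of smaller rectangle"

print(classify((5.0, 0.5)))   # a point lying between the two rectangles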
A final problem concerns measuring the distance to hyperrectangles that
overlap or are nested. This complicates the situation because an instance may
fall within more than one hyperrectangle. A suitable heuristic for use in this case
is to choose the class of the most specific hyperrectangle containing the instance,
that is, the one covering the smallest area of instance space.
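
The most-specific-hyperrectangle heuristic is easy to state in code. The following sketch is only an illustration: it assumes each exemplar is stored as a (lower, upper, class) triple and, among the hyperrectangles that contain the query instance, returns the class of the one covering the smallest volume.

def contains(point, lower, upper):
    # True if point lies inside, or on the border of, the hyperrectangle.
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))

def volume(lower, upper):
    # Area (in two dimensions) or volume of an axis-parallel hyperrectangle.
    v = 1.0
    for lo, hi in zip(lower, upper):
        v *= hi - lo
    return v

def most_specific_class(point, exemplars):
    # exemplars: list of (lower, upper, class) triples.  Among those that
    # contain the point, choose the most specific, i.e., smallest, one.
    covering = [(volume(lo, hi), cls)
                for lo, hi, cls in exemplars
                if contains(point, lo, hi)]
    if not covering:
        return None   # no containing hyperrectangle: use the distance measure instead
    return min(covering, key=lambda pair: pair[0])[1]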
Whether or not overlap or nesting is permitted, the distance function should be modified to take account of both the observed prediction accuracy of exemplars and the relative importance of different features, as described in the preceding sections on pruning noisy exemplars and attribute weighting.
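
How exactly these two factors enter the calculation depends on the scheme; the sketch below shows one plausible combination, written for illustration only and not necessarily the form used in the preceding sections. Each attribute difference is scaled by an attribute weight, and the whole distance is multiplied by a reliability factor for the exemplar, so that unreliable exemplars appear farther away.

import math

def weighted_distance(point, lower, upper, attribute_weights, exemplar_weight):
    # Distance to a hyperrectangle with each attribute difference scaled by
    # its weight; exemplar_weight > 1 penalizes exemplars with poor observed
    # prediction accuracy by inflating their distance.
    total = 0.0
    for x, lo, hi, w in zip(point, lower, upper, attribute_weights):
        nearest = min(max(x, lo), hi)   # clamp the coordinate to the rectangle
        total += (w * (x - nearest)) ** 2
    return exemplar_weight * math.sqrt(total)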


Generalized distance functions


There are many different ways of defining a distance function, and it is hard to find rational grounds for any particular choice. An elegant solution is to consider one instance being transformed into another through a sequence of predefined elementary operations and to calculate the probability of such a sequence occurring if operations are chosen randomly. Robustness is improved if all possible transformation paths are considered, weighted by their probabilities, and the scheme generalizes naturally to the problem of calculating the distance between an instance and a set of other instances by considering transformations to all instances in the set. Through such a technique it is possible to consider each instance as exerting a "sphere of influence," but a sphere with soft boundaries.
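
As a toy illustration of this idea (a sketch only, not the exact scheme referred to above), suppose an instance is a single integer-valued attribute and the elementary operations are "add one," "subtract one," and "stop," chosen at random with fixed probabilities that are invented for the example. The probability of transforming one value into another is the total probability of all operation sequences that end at the target value, and a distance can be taken as its negative logarithm.

import math

P_STOP = 0.5    # probability of stopping and emitting the current value
P_UP = 0.25     # probability of adding one
P_DOWN = 0.25   # probability of subtracting one

def transformation_probability(a, b, max_steps=40):
    # Total probability of all operation sequences with at most max_steps
    # moves that transform value a into value b; longer paths contribute
    # negligibly because each extra step multiplies in a factor less than 1.
    prob = {a: 1.0}   # probability of being at each value without having stopped
    total = 0.0
    for _ in range(max_steps):
        total += prob.get(b, 0.0) * P_STOP   # sequences that stop at b now
        nxt = {}
        for v, p in prob.items():
            nxt[v + 1] = nxt.get(v + 1, 0.0) + p * P_UP
            nxt[v - 1] = nxt.get(v - 1, 0.0) + p * P_DOWN
        prob = nxt
    return total

def transformation_distance(a, b):
    # More probable transformations give smaller distances.
    return -math.log2(transformation_probability(a, b))

print(transformation_distance(3, 3))   # many short, likely paths: small distance
print(transformation_distance(3, 7))   # only longer, rarer paths: larger distance

The distance from an instance to a set of instances could be handled in the same spirit by summing the transformation probabilities to every member of the set before taking the logarithm.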
