4.7 INSTANCE-BASED LEARNING

If the circle in Figure 4.13 were any bigger, which it would be if the black instance were a little farther from the target, it would intersect the lower right-hand corner of the rectangle at the top left, and that rectangle would then have to be investigated too, despite the fact that the training instances that define it are a long way from the corner in question. The corners of rectangular regions are awkward.
The solution? Use hyperspheres, not hyperrectangles. Neighboring spheres may overlap, whereas rectangles can only abut, but this is not a problem because the nearest-neighbor algorithm for kD-trees described previously does not depend on the regions being disjoint. A data structure called a ball tree defines k-dimensional hyperspheres ("balls") that cover the data points and arranges them into a tree.
Figure 4.14(a) shows 16 training instances in two-dimensional space, overlaid by a pattern of overlapping circles, and Figure 4.14(b) shows a tree formed from these circles. Circles at different levels of the tree are indicated by different styles of dash, and the smaller circles are drawn in shades of gray. Each node of the tree represents a ball, and the node is dashed or shaded according to the same convention so that you can identify which level the balls are at. To help you understand the tree, numbers are placed on the nodes to show how many data points are deemed to be inside that ball. But be careful: this is not necessarily the same as the number of points falling within the spatial region that the ball represents. The regions at each level sometimes overlap, but points that fall into the overlap area are assigned to only one of the overlapping balls (the diagram does not show which one). Instead of the occupancy counts shown in Figure 4.14(b), the nodes of actual ball trees store the center and radius of their ball; leaf nodes record the points they contain as well.
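
To make the storage layout concrete, here is a minimal sketch in Python of such a node, assuming Euclidean distance and points represented as tuples of floats; the class name BallNode and its fields are illustrative, not taken from any particular library.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class BallNode:
    center: Point                         # center of this node's ball
    radius: float                         # radius of this node's ball
    points: Optional[List[Point]] = None  # data points, kept only at leaves
    left: Optional["BallNode"] = None
    right: Optional["BallNode"] = None

    def is_leaf(self) -> bool:
        return self.points is not None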
To use a ball tree to find the nearest neighbor to a given target, start by traversing the tree from the top down to locate the leaf that contains the target, and find the closest point to the target in that ball. This gives an upper bound on the target's distance from its nearest neighbor. Then, just as for the kD-tree, examine the sibling node. If the distance from the target to the sibling's center exceeds the sibling's radius plus the current upper bound, the sibling cannot possibly contain a closer point; otherwise it must be examined by descending the tree further. In Figure 4.15 the target is marked with a star and the black dot is its closest currently known neighbor. The entire contents of the gray ball can be ruled out: it cannot contain a closer point because its center is too far away. Proceed recursively back up the tree to its root, examining any ball that may possibly contain a point nearer than the current upper bound.
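
The search just described can be sketched as follows, reusing the BallNode class above. Rather than walking down to a leaf and then explicitly backing up, this recursive version visits the child whose ball center is nearer first, which has the same effect of tightening the upper bound before the sibling's pruning test is applied. The function name and details are illustrative.

from math import dist  # Euclidean distance (Python 3.8+)

def nearest(node, target, best=None, best_d=float("inf")):
    # Prune: if the target's distance to this ball's center exceeds the
    # ball's radius plus the current upper bound, nothing inside can be closer.
    if node is None or dist(target, node.center) > node.radius + best_d:
        return best, best_d
    if node.is_leaf():
        for p in node.points:             # scan the points stored at the leaf
            d = dist(target, p)
            if d < best_d:
                best, best_d = p, d
        return best, best_d
    # Visit the child with the nearer center first, so the upper bound
    # shrinks before the sibling is tested for pruning.
    children = [c for c in (node.left, node.right) if c is not None]
    children.sort(key=lambda c: dist(target, c.center))
    for child in children:
        best, best_d = nearest(child, target, best, best_d)
    return best, best_d

# Example call: point, d = nearest(root, (2.0, 3.0))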
Ball trees are built from the top down, and as with kD-trees the basic problem is to find a good way of splitting a ball containing a set of data points into two. In practice you do not have to continue until the leaf balls contain just two points: you can stop earlier, once a predetermined minimum number is reached, and the same goes for kD-trees. Here is one possible splitting method.
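
As a rough sketch of such a top-down build with a minimum leaf size, the following reuses BallNode from above together with one simple, commonly used splitting heuristic (two far-apart pivot points, each data point assigned to the nearer pivot); this is an illustration under those assumptions, not necessarily the particular method the text goes on to describe.

from math import dist

def bounding_ball(points):
    # Crude enclosing ball: centroid as center, farthest point sets the radius.
    center = tuple(sum(c) / len(points) for c in zip(*points))
    radius = max(dist(center, p) for p in points)
    return center, radius

def build(points, min_leaf=5):
    center, radius = bounding_ball(points)
    if len(points) <= min_leaf:           # stop early, as the text suggests
        return BallNode(center, radius, points=list(points))
    # Heuristic split: pick a point far from the center, then the point
    # farthest from it; each data point goes to whichever pivot is nearer.
    a = max(points, key=lambda p: dist(center, p))
    b = max(points, key=lambda p: dist(a, p))
    left = [p for p in points if dist(a, p) <= dist(b, p)]
    right = [p for p in points if dist(a, p) > dist(b, p)]
    if not left or not right:             # degenerate case: points coincide
        return BallNode(center, radius, points=list(points))
    return BallNode(center, radius,
                    left=build(left, min_leaf), right=build(right, min_leaf))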
