In a typical case, this algorithm is far faster than examining all points to find
the nearest neighbor. The work involved in finding the initial approximate
nearest neighbor—the black point in Figure 4.13—depends on the depth of the
tree, given by the logarithm of the number of nodes, log₂n. The amount of work
involved in backtracking to check whether this really is the nearest neighbor
depends a bit on the tree, and on how good the initial approximation is. But for
a well-constructed tree whose nodes are approximately square, rather than long
skinny rectangles, it can also be shown to be logarithmic in the number of nodes.
How do you build a good tree for a set of training examples? The problem
boils down to selecting the first training instance to split at and the direction of
the split. Once you can do that, apply the same method recursively to each child
of the initial split to construct the entire tree.
To find a good direction for the split, calculate the variance of the data points
along each axis individually, select the axis with the greatest variance, and create
a splitting hyperplane perpendicular to it. To find a good place for the hyper-
plane, locate the median value along that axis and select the corresponding
point. This makes the split perpendicular to the direction of greatest spread,
with half the points lying on either side. This produces a well-balanced tree. To
avoid long skinny regions it is best for successive splits to be along different axes,
which is likely because the dimension of greatest variance is chosen at each stage.
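Assuming the illustrative KDNode structure from the search sketch above, the construction just described might look like the following; the use of statistics.variance and the way ties are broken are assumptions of the sketch, not details from the book.

    from statistics import variance

    def build(points):
        # points: a list of equal-length numeric tuples.
        if not points:
            return None
        if len(points) == 1:
            return KDNode(point=points[0], axis=0)
        # Split perpendicular to the axis of greatest spread.
        axis = max(range(len(points[0])), key=lambda a: variance(p[a] for p in points))
        # The point with the median value along that axis becomes the split point.
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return KDNode(point=points[mid], axis=axis,
                      left=build(points[:mid]),
                      right=build(points[mid + 1:]))

Because the split is at the median, roughly half the points fall on each side, so the recursion bottoms out after about log₂n levels.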
However, if the distribution of points is badly skewed, choosing the median
value may generate several successive splits in the same direction, yielding long,
skinny hyperrectangles. A better strategy is to calculate the mean rather than the
median and use the point closest to that. The tree will not be perfectly balanced,
but its regions will tend to be squarish because there is a greater chance that dif-
ferent directions will be chosen for successive splits.
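The change for skewed data is small: instead of the median, choose the point whose value along the selected axis is closest to the mean. A sketch of that selection, intended as a drop-in replacement for the median step in the build sketch above (the function name is illustrative):

    def split_point_near_mean(points, axis):
        # Choose the instance whose coordinate on the split axis is closest to the mean.
        mean = sum(p[axis] for p in points) / len(points)
        return min(points, key=lambda p: abs(p[axis] - mean))

The remaining points are then partitioned by comparing their coordinate on that axis with the chosen point's.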
An advantage of instance-based learning over most other machine learning
methods is that new examples can be added to the training set at any time. To
retain this advantage when using a kD-tree, we need to be able to update it incre-
mentally with new data points. To do this, determine which leaf node contains
the new point and find its hyperrectangle. If it is empty, simply place the new
point there. Otherwise, split the hyperrectangle along its longest
dimension to preserve squareness. This simple heuristic does not guarantee that
adding a series of points will preserve the tree’s balance, nor that the hyperrec-
tangles will be well shaped for nearest-neighbor search. It is a good idea to
rebuild the tree from scratch occasionally—for example, when its depth grows
to twice the best possible depth.
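A sketch of this incremental update, again using the illustrative KDNode above, where an empty leaf region corresponds to a missing child. Tracking the region by shrinking a pair of bounds (lo, hi) during the descent is an assumption of the sketch about how the longest dimension would be found.

    def insert(node, point, lo, hi):
        # lo, hi: lower and upper corners of the hyperrectangle this subtree covers.
        if node is None:
            # Empty region: the new node splits it along its longest dimension,
            # which keeps the resulting regions as square as possible.
            axis = max(range(len(point)), key=lambda a: hi[a] - lo[a])
            return KDNode(point=point, axis=axis)
        lo, hi = list(lo), list(hi)
        if point[node.axis] < node.point[node.axis]:
            hi[node.axis] = node.point[node.axis]   # restrict to the lower half
            node.left = insert(node.left, point, lo, hi)
        else:
            lo[node.axis] = node.point[node.axis]   # restrict to the upper half
            node.right = insert(node.right, point, lo, hi)
        return node

Rebuilding is then just a matter of collecting the stored points and calling build again, for example once the depth exceeds about twice log₂n.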
As we have seen, kD-trees are good data structures for finding nearest neigh-
bors efficiently. However, they are not perfect. Skewed datasets present a basic
conflict between the desire for the tree to be perfectly balanced and the desire
for regions to be squarish. More importantly, rectangles—even squares—are not
the best shape to use anyway, because of their corners. If the dashed circle in
