Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

and select the smallest. This procedure is linear in the number of training instances: in other words, the time it takes to make a single prediction is proportional to the number of training instances. Processing an entire test set takes time proportional to the product of the number of instances in the training and test sets. Nearest neighbors can be found more efficiently by representing the training set as a tree, although it is not quite obvious how. One suitable structure is a kD-tree.This is a binary tree that divides the input space with a hyperplane and then splits each partition again, recursively. All splits are made parallel to one of the axes, either vertically or horizontally, in the two-dimensional case. The data structure is called a kD-treebecause it stores a set of points in k- dimensional space,kbeing the number of attributes. Figure 4.12(a) gives a small example with k=2, and Figure 4.12(b) shows the four training instances it represents, along with the hyperplanes that constitute the tree. Note that these hyperplanes are notdecision boundaries: decisions are made on a nearest-neighbor basis as explained later. The first split is horizon- tal (h),through the point (7,4)—this is the tree’s root. The left branch is not split further: it contains the single point (2,2), which is a leaf of the tree. The right branch is split vertically (v) at the point (6,7). Its left child is empty, and its right child contains the point (3,8). As this example illustrates, each region contains just one point—or, perhaps, no points. Sibling branches of the tree— for example, the two daughters of the root in Figure 4.12(a)—are not neces- sarily developed to the same depth. Every point in the training set corresponds to a single node, and up to half are leaf nodes.

130 CHAPTER 4| ALGORITHMS: THE BASIC METHODS

(2,2)

(7,4); h

(6,7); v

(3,8) (a) a

a

(2,2)

(7,4)

(6,7)

(3,8)

2

(b)^1 Figure 4.12A kD-tree for four training instances: (a) the tree and (b) instances and splits.

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Get our desktop app

Company

Features

Documentation

Resources