4.7 INSTANCE-BASED LEARNING


When comparing distances it is not necessary to perform the square root
operation; the sums of squares can be compared directly. One alternative to the
Euclidean distance is the Manhattan or city-block metric, where the difference
between attribute values is not squared but just added up (after taking the
absolute value). Others are obtained by taking powers higher than the square.
Higher powers increase the influence of large differences at the expense of small
differences. Generally, the Euclidean distance represents a good compromise.
Other distance metrics may be more appropriate in special circumstances. The
key is to think of actual instances and what it means for them to be separated
by a certain distance—what would twice that distance mean, for example?
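
To make this concrete, here is a minimal sketch of the squared Euclidean and
Manhattan metrics (the class and method names are our own, and attribute
values are assumed to be numeric and already normalized):

class DistanceMetrics {

    // Squared Euclidean distance: because the square root is monotonic,
    // distances can be compared without ever taking it.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Manhattan (city-block) metric: absolute differences are summed,
    // so large differences count for less than they do when squared.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
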
Different attributes are measured on different scales, so if the Euclidean
distance formula were used directly, the effects of some attributes might be
completely dwarfed by others that had larger scales of measurement. Consequently,
it is usual to normalize all attribute values to lie between 0 and 1, by
calculating

    a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}

where v_i is the actual value of attribute i, and the maximum and minimum are
taken over all instances in the training set.
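For example, if the values of attribute i range from 10 to 50 over the training
set, an actual value of 30 normalizes to (30 - 10)/(50 - 10) = 0.5.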
These formulae implicitly assume numeric attributes. Here, the difference
between two values is just the numerical difference between them, and it is this
difference that is squared and added to yield the distance function. For nominal
attributes that take on values that are symbolic rather than numeric, the
difference between two values that are not the same is often taken to be one, whereas
if the values are the same the difference is zero. No scaling is required in this
case because only the values 0 and 1 are used.
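
A per-attribute difference of this kind might be sketched as follows (the
method names are our own; numeric values are assumed already normalized to
the 0-1 range):

class AttributeDifference {

    // Nominal attributes: 1 if the two symbolic values differ, 0 if equal.
    static double difference(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0;
    }

    // Numeric attributes: plain (normalized) numeric difference.
    static double difference(double a, double b) {
        return Math.abs(a - b);
    }
}
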
A common policy for handling missing values is as follows. For nominal
attributes, assume that a missing feature is maximally different from any other
feature value. Thus if either or both values are missing, or if the values are
different, the difference between them is taken as one; the difference is zero only
if they are not missing and both are the same. For numeric attributes, the
difference between two missing values is also taken as one. However, if just one
value is missing, the difference is often taken as either the (normalized) size of
the other value or one minus that size, whichever is larger. This means that if
values are missing, the difference is as large as it can possibly be.
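
Under this policy, the difference functions above might be extended as in the
following sketch, where a missing value is represented as null (a
representation chosen here purely for illustration):

class MissingValueDifference {

    // Nominal: the difference is 1 unless both values are present and equal.
    static double nominalDifference(String a, String b) {
        if (a == null || b == null) return 1.0;  // either or both missing
        return a.equals(b) ? 0.0 : 1.0;
    }

    // Numeric (normalized to 0-1): both missing gives 1; one missing gives
    // the largest difference the missing value could possibly produce.
    static double numericDifference(Double a, Double b) {
        if (a == null && b == null) return 1.0;
        if (a == null || b == null) {
            double v = (a == null) ? b : a;   // the known value
            return Math.max(v, 1.0 - v);
        }
        return Math.abs(a - b);
    }
}
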


Finding nearest neighbors efficiently


Although instance-based learning is simple and effective, it is often slow. The
obvious way to find which member of the training set is closest to an unknown
test instance is to calculate the distance from every member of the training set
and use the closest one.
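
A sketch of this brute-force scan, using the squared Euclidean distance from
the earlier sketch, might look like this (again, the names are our own):

class LinearScan {

    // Returns the index of the training instance closest to the test
    // instance: one distance computation per training instance, which
    // is exactly what makes the obvious approach slow on large sets.
    static int nearestNeighbor(double[][] training, double[] test) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < training.length; i++) {
            double d = DistanceMetrics.squaredEuclidean(training[i], test);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}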

