Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

remarkably unperturbed. This line has a simple and natural interpretation.
Geometrically, it corresponds to finding the narrowest strip covering half of
the observations, where the thickness of the strip is measured in the vertical
direction. This strip is marked gray in Figure 7.6; you need to look closely to
see it. The least median of squares line lies at the exact center of this band.
Note that this notion is often easier to explain and visualize than the normal
least-squares definition of regression. Unfortunately, there is a serious
disadvantage to median-based regression techniques: they incur a high
computational cost, which often makes them infeasible for practical problems.
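
The strip interpretation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce Figure 7.6: the function name lms_line is hypothetical, candidate slopes are restricted to lines through pairs of observations (so the result approximates the true optimum), and "half of the observations" is taken to mean n // 2 + 1 points. For each candidate slope, the code finds the narrowest vertical strip covering that many points and places the line at the center of the strip.

```python
from itertools import combinations


def lms_line(xs, ys):
    """Approximate least-median-of-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, width), where width is the vertical
    thickness of the narrowest strip found.
    """
    n = len(xs)
    h = n // 2 + 1  # assumption: the strip must cover just over half the points
    best = None
    for i, j in combinations(range(n), 2):
        if xs[i] == xs[j]:
            continue  # this pair defines a vertical line, which has no slope
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        # Vertical offsets of every point from a line of this slope through the origin.
        r = sorted(y - slope * x for x, y in zip(xs, ys))
        # Narrowest window of h consecutive offsets = narrowest strip for this slope.
        k = min(range(n - h + 1), key=lambda s: r[s + h - 1] - r[s])
        width = r[k + h - 1] - r[k]
        if best is None or width < best[0]:
            # The fitted line runs through the center of the strip.
            best = (width, slope, (r[k] + r[k + h - 1]) / 2)
    if best is None:
        raise ValueError("need at least two observations with distinct x values")
    width, slope, intercept = best
    return slope, intercept, width
```

Even this restricted search sorts the offsets once for every pair of points, so its running time grows roughly as n^3 log n, which gives a feel for why the computational cost becomes a problem on larger datasets.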

Detecting anomalies


A serious problem with any form of automatic detection of apparently incorrect
data is that the baby may be thrown out with the bathwater. Short of consulting
a human expert, there is really no way of telling whether a particular instance
really is an error or whether it just does not fit the type of model that is
being applied. In statistical regression, visualizations help. It will usually
be visually apparent, even to the nonexpert, if the wrong kind of curve is being
fitted (a straight line is being fitted to data that lies on a parabola, for
example). The outliers in Figure 7.6 certainly stand out to the eye. But most
problems cannot be so easily visualized: the notion of “model type” is more
subtle than a regression line. And although it is known that good results are
obtained on most standard datasets by discarding instances that do not fit a
decision tree model, this is not necessarily of great comfort when dealing with
a particular new

314 CHAPTER 7| TRANSFORMATIONS: ENGINEERING THE INPUT AND OUTPUT


[Figure: phone calls (tens of millions), on a vertical axis from -5 to 25, plotted against year, 1950 to 1975, with two fitted lines labeled "least squares" and "least median of squares".]

Figure 7.6 Number of international phone calls from Belgium, 1950–1973.