Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Figure 1.3(b) is a more complex decision tree that represents the same
dataset. In fact, this is a more accurate representation of the actual dataset that
was used to create the tree. But it is not necessarily a more accurate representation
of the underlying concept of good versus bad contracts. Look down the left
branch. It doesn’t seem to make sense intuitively that, if the working hours
exceed 36, a contract is bad if there is no health-plan contribution or a full
health-plan contribution but is good if there is a half health-plan contribution.
It is certainly reasonable that the health-plan contribution plays a role in the
decision, but not that half is good while both full and none are bad. It seems likely
that this is an artifact of the particular values used to create the decision tree
rather than a genuine feature of the good versus bad distinction.
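
The oddity in the left branch is easier to see when the test is written out. The following is a minimal sketch in Python that covers only the part of Figure 1.3(b) described above. The function name, the lookup table, and the choice to return None for contracts of 36 hours or fewer are assumptions made for illustration, not part of the original tree.

```python
# Labels for the health-plan test as described in the text:
# only the middle value is good, both extremes are bad.
HEALTH_PLAN_LABEL = {"none": "bad", "half": "good", "full": "bad"}


def left_branch_label(working_hours, health_plan):
    """Label a contract using only the sub-branch discussed in the text.

    Returns "good" or "bad" when working_hours exceeds 36, and None
    otherwise, because the text does not describe that case.
    Raises KeyError for an unknown or missing health_plan value.
    """
    if working_hours <= 36:
        return None  # outside the part of the tree discussed here
    return HEALTH_PLAN_LABEL[health_plan]


for plan in ("none", "half", "full"):
    print(plan, "->", left_branch_label(40, plan))
```

Written this way, the label does not change in one direction as the contribution grows from none to full, which is what makes the branch look like an artifact of the training data.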
The tree in Figure 1.3(b) is more accurate on the data that was used to train
the classifier but will probably perform less well on an independent set of test
data. It is “overfitted” to the training data—it follows it too slavishly. The tree
in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of
pruning, which we will learn more about in Chapter 6.
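
The gap between training and test accuracy can be illustrated with a small sketch. The data below are invented for this example and are not the labor negotiations data, and the two classifiers are written by hand to stand in for a simple tree and a more complex one. Nothing here is a pruning algorithm; it only shows the effect that pruning is meant to counter.

```python
# Invented toy data: (first-year wage increase in percent, label).
# The intended concept is "good when the increase exceeds 2.5";
# the 4.5 -> "bad" training example is deliberate noise.
TRAIN = [(2.0, "bad"), (2.2, "bad"), (3.0, "good"),
         (4.0, "good"), (4.5, "bad"), (5.0, "good")]
TEST = [(1.5, "bad"), (2.4, "bad"), (3.5, "good"),
        (4.4, "good"), (4.6, "good"), (5.5, "good")]


def simple_tree(x):
    """One test only, in the spirit of a pruned tree."""
    return "good" if x > 2.5 else "bad"


def complex_tree(x):
    """Adds an extra split that exists only to fit the noisy example."""
    if x <= 2.5:
        return "bad"
    if 4.25 < x <= 4.75:
        return "bad"
    return "good"


def accuracy(classifier, data):
    """Fraction of (value, label) pairs that the classifier gets right."""
    correct = sum(classifier(x) == label for x, label in data)
    return correct / len(data)


for name, clf in (("simple", simple_tree), ("complex", complex_tree)):
    print(name, "train:", accuracy(clf, TRAIN), "test:", accuracy(clf, TEST))
```

On this toy data the complex classifier labels all six training examples correctly but only four of the six test examples, whereas the simple one misses the noisy training example and labels all six test examples correctly.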

Soybean classification: A classic machine learning success

An often-quoted early success story in the application of machine learning to
practical problems is the identification of rules for diagnosing soybean diseases.
The data is taken from questionnaires describing plant diseases. There are about

18 CHAPTER 1| WHAT’S IT ALL ABOUT?


Table 1.6 The labor negotiations data.

Attribute                        Type                         1      2      3      ...  40

duration                         years                        1      2      3           2
wage increase 1st year           percentage                   2%     4%     4.3%        4.5
wage increase 2nd year           percentage                   ?      5%     4.4%        4.0
wage increase 3rd year           percentage                   ?      ?      ?           ?
cost of living adjustment        {none, tcf, tc}              none   tcf    ?           none
working hours per week           hours                        28     35     38          40
pension                          {none, ret-allw, empl-cntr}  none   ?      ?           ?
standby pay                      percentage                   ?      13%    ?           ?
shift-work supplement            percentage                   ?      5%     4%          4
education allowance              {yes, no}                    yes    ?      ?           ?
statutory holidays               days                         11     15     12          12
vacation                         {below-avg, avg, gen}        avg    gen    gen         avg
long-term disability assistance  {yes, no}                    no     ?      ?           yes
dental plan contribution         {none, half, full}           none   ?      full        full
bereavement assistance           {yes, no}                    no     ?      ?           yes
health plan contribution         {none, half, full}           none   ?      full        half
acceptability of contract        {good, bad}                  bad    good   good        good
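
The question marks in Table 1.6 denote missing values. As a rough sketch of how the four instances shown (1, 2, 3, and 40) might be held in a program, the Python below stores the table column by column and counts the missing entries per instance. The names MISSING, TABLE, INSTANCE_IDS, and missing_per_instance are assumptions for illustration, and percentages are stored as plain numbers.

```python
MISSING = None  # stands in for the "?" entries in Table 1.6

INSTANCE_IDS = [1, 2, 3, 40]

# Column-oriented copy of the four instances shown in the table.
# Note that the string "none" is an ordinary nominal value, not a missing one.
TABLE = {
    "duration": [1, 2, 3, 2],
    "wage increase 1st year": [2.0, 4.0, 4.3, 4.5],
    "wage increase 2nd year": [MISSING, 5.0, 4.4, 4.0],
    "wage increase 3rd year": [MISSING, MISSING, MISSING, MISSING],
    "cost of living adjustment": ["none", "tcf", MISSING, "none"],
    "working hours per week": [28, 35, 38, 40],
    "pension": ["none", MISSING, MISSING, MISSING],
    "standby pay": [MISSING, 13.0, MISSING, MISSING],
    "shift-work supplement": [MISSING, 5.0, 4.0, 4.0],
    "education allowance": ["yes", MISSING, MISSING, MISSING],
    "statutory holidays": [11, 15, 12, 12],
    "vacation": ["avg", "gen", "gen", "avg"],
    "long-term disability assistance": ["no", MISSING, MISSING, "yes"],
    "dental plan contribution": ["none", MISSING, "full", "full"],
    "bereavement assistance": ["no", MISSING, MISSING, "yes"],
    "health plan contribution": ["none", MISSING, "full", "half"],
    "acceptability of contract": ["bad", "good", "good", "good"],
}


def missing_per_instance(table):
    """Count missing entries in each instance (one count per table column)."""
    rows = zip(*table.values())  # turn attribute columns into instance rows
    return [sum(value is MISSING for value in row) for row in rows]


missing_counts = dict(zip(INSTANCE_IDS, missing_per_instance(TABLE)))
print(missing_counts)
```

Keeping a separate marker for missing entries matters here because several attributes use "none" as a legitimate value.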
