Understanding Machine Learning: From Theory to Algorithms

18.2 Decision Tree Algorithms

and therefore all splitting rules are of the form 1_[x_i = 1] for some feature i ∈ [d].
We discuss the case of real-valued features in Section 18.2.3.
The algorithm works by recursive calls, with the initial call being ID3(S, [d]),
and returns a decision tree. In the pseudocode that follows, we use a call to a
procedure Gain(S, i), which receives a training set S and an index i and evaluates
the gain of a split of the tree according to the ith feature. We describe several
gain measures in Section 18.2.1.
ID3(S, A)
Input: training set S, feature subset A ⊆ [d]
if all examples in S are labeled by 1, return a leaf 1
if all examples in S are labeled by 0, return a leaf 0
if A = ∅, return a leaf whose value = majority of labels in S
else:
  Let j = argmax_{i ∈ A} Gain(S, i)
  if all examples in S have the same label
    Return a leaf whose value = majority of labels in S
  else
    Let T_1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j}).
    Let T_2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j}).
    Return the tree whose root asks the question "x_j = 1?", whose subtree for the
    answer x_j = 1 is T_1, and whose subtree for x_j = 0 is T_2.
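
To make the recursion concrete, here is a minimal Python sketch of ID3 for binary
features. The class names Leaf and Node, the helper majority_label, and the guard
against uninformative splits are our own illustrative additions and are not part of
the pseudocode; the gain argument stands for any of the measures described in
Section 18.2.1.

from collections import Counter

class Leaf:
    def __init__(self, label):
        self.label = label          # constant prediction at this leaf

class Node:
    def __init__(self, feature, yes, no):
        self.feature = feature      # root question: "x_feature = 1?"
        self.yes = yes              # subtree T_1, followed when x_feature = 1
        self.no = no                # subtree T_2, followed when x_feature = 0

def majority_label(S):
    # Majority vote over the labels of S, a list of (x, y) pairs.
    return Counter(y for _, y in S).most_common(1)[0][0]

def id3(S, A, gain):
    # S: list of (x, y) with x a tuple of 0/1 features and y in {0, 1};
    # A: set of still-available feature indices; gain: a Gain(S, i) function.
    labels = {y for _, y in S}
    if labels == {1}:
        return Leaf(1)
    if labels == {0}:
        return Leaf(0)
    if not A:
        return Leaf(majority_label(S))
    j = max(A, key=lambda i: gain(S, i))       # argmax of the gain over A
    S1 = [(x, y) for x, y in S if x[j] == 1]
    S0 = [(x, y) for x, y in S if x[j] == 0]
    if not S1 or not S0:
        # Guard (not in the pseudocode): the chosen split is uninformative,
        # so stop instead of recursing on an empty sample.
        return Leaf(majority_label(S))
    return Node(j,
                yes=id3(S1, A - {j}, gain),    # T_1
                no=id3(S0, A - {j}, gain))     # T_2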

18.2.1 Implementations of the Gain Measure


Different algorithms use different implementations of Gain(S, i). Here we present
three. We use the notation P_S[F] to denote the probability that an event F holds
with respect to the uniform distribution over S.
Train Error: The simplest definition of gain is the decrease in training error.
Formally, let C(a) = min{a, 1 − a}. Note that the training error before splitting on
feature i is C(P_S[y = 1]), since we took a majority vote among labels. Similarly,
the error after splitting on feature i is

P_S[x_i = 1] C(P_S[y = 1 | x_i = 1]) + P_S[x_i = 0] C(P_S[y = 1 | x_i = 0]).

Therefore, we can define Gain to be the difference between the two, namely,

Gain(S, i) := C(P_S[y = 1]) − ( P_S[x_i = 1] C(P_S[y = 1 | x_i = 1]) + P_S[x_i = 0] C(P_S[y = 1 | x_i = 0]) ).
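
As an illustration, a direct Python transcription of this train-error gain, using the
same (x, y)-pair representation of S as in the sketch above, might look as follows
(the function names C and train_error_gain are our own):

def C(a):
    # C(a) = min{a, 1 - a}: training error of a majority-vote leaf with P[y = 1] = a.
    return min(a, 1 - a)

def train_error_gain(S, i):
    # Decrease in training error obtained by splitting S on feature i.
    m = len(S)
    p_y1 = sum(y for _, y in S) / m                       # P_S[y = 1]
    error_before = C(p_y1)
    error_after = 0.0
    for b in (1, 0):
        side = [(x, y) for x, y in S if x[i] == b]
        if side:                                          # an empty side contributes 0
            p_side = len(side) / m                        # P_S[x_i = b]
            p_y1_side = sum(y for _, y in side) / len(side)   # P_S[y = 1 | x_i = b]
            error_after += p_side * C(p_y1_side)
    return error_before - error_after

For example, on S = [((1,), 1), ((1,), 1), ((1,), 0), ((0,), 0), ((0,), 0), ((0,), 0)]
the error before splitting is C(1/3) = 1/3, the error after splitting on the single
feature is (1/2)·C(2/3) + (1/2)·C(0) = 1/6, and the gain is therefore 1/6.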