Bayesian multinet. To obtain a prediction for a particular class value, take the
corresponding network’s probability and multiply it by the class’s prior
probability. Do this for each class and normalize the result as we did previously.
In this case we would not use the conditional likelihood to learn the network for
each class value.
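As a rough illustration of this combination step (a sketch, not the book’s code),
the following assumes network_prob holds the probability each class’s network
assigns to the instance and prior holds the class prior probabilities:

    # Sketch: combining per-class network outputs in a Bayesian multinet.
    # network_prob[c] is the probability the class-c network assigns to the
    # instance; prior[c] is that class's prior probability (both are
    # hypothetical inputs used only for illustration).
    def multinet_predict(network_prob, prior):
        scores = {c: network_prob[c] * prior[c] for c in prior}
        total = sum(scores.values())          # normalize as before
        return {c: s / total for c, s in scores.items()}

    # Example with made-up network probabilities and the weather-data priors:
    print(multinet_predict({"yes": 0.030, "no": 0.010}, {"yes": 9/14, "no": 5/14}))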
All the network learning algorithms introduced previously are score based.
A different strategy, which we will not explain here, is to piece a network
together by testing individual conditional independence assertions based on
subsets of the attributes. This is known as structure learning by conditional
independence tests.

Data structures for fast learning


Learning Bayesian networks involves a lot of counting. For each network
structure considered in the search, the data must be scanned afresh to obtain the
counts needed to fill out the conditional probability tables. Instead, could they
be stored in a data structure that eliminated the need for scanning the data over
and over again? An obvious way is to precompute the counts and store the
nonzero ones in a table—say, the hash table mentioned in Section 4.5. Even so,
any nontrivial dataset will have a huge number of nonzero counts.
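A minimal sketch of that idea (assuming instances are represented as
attribute-to-value dictionaries; this is not the book’s implementation) tallies
every combination of attribute-value pairs occurring in the data into a single
hash table:

    from collections import Counter
    from itertools import combinations

    # Sketch: precompute every nonzero count in one pass and keep it in a
    # hash table (a Counter keyed by frozensets of attribute-value pairs).
    # A count over a subset of the attributes is one in which the remaining
    # attributes are "missing".
    def precompute_counts(instances):
        counts = Counter()
        for inst in instances:
            pairs = list(inst.items())
            for r in range(1, len(pairs) + 1):
                for combo in combinations(pairs, r):
                    counts[frozenset(combo)] += 1
        return counts

    # Any count a conditional probability table needs is then a lookup, e.g.
    # counts[frozenset({("humidity", "normal"), ("play", "no")})]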
Again, consider the weather data from Table 1.2 (page 11). There are
five attributes, two with three values and three with two values. This gives
4 × 4 × 3 × 3 × 3 = 432 possible counts. Each component of the product
corresponds to an attribute, and its contribution to the product is one more than
the number of its values because the attribute may be missing from the count. All
these counts can be calculated by treating them as item sets, as explained in
Section 4.5, and setting the minimum coverage to one. But even without storing
counts that are zero, this simple scheme runs into memory problems very quickly.
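As a quick check of that arithmetic (a throwaway snippet, not from the book),
each attribute contributes one more than its number of values:

    # Outlook and temperature have three values; humidity, windy, and play
    # have two. Each contributes (number of values + 1), the extra case being
    # "attribute missing from the count".
    possible = 1
    for num_values in [3, 3, 2, 2, 2]:
        possible *= num_values + 1
    print(possible)   # 4 * 4 * 3 * 3 * 3 = 432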
It turns out that the counts can be stored effectively in a structure called an
all-dimensions (AD) tree, which is analogous to the kD-trees used for
nearest-neighbor search described in Section 4.7. For simplicity, we illustrate
this using a reduced version of the weather data that only has the attributes
humidity, windy, and play. Figure 6.22(a) summarizes the data. The number of
possible counts is 3 × 3 × 3 = 27, although only 8 of them are shown. For
example, the count for play = no is 5 (count them!).
Figure 6.22(b) shows an AD tree for this data. Each node says how many
instances exhibit the attribute values that are tested along the path from the root
to that node. For example, the leftmost leaf says that there is one instance with
values humidity = normal, windy = true, and play = no, and the rightmost leaf
says that there are five instances with play = no.
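As a rough sketch of such a structure (a simplification that stores a count for
every attribute-value combination actually present, ignoring the pruning that
makes real AD trees compact), consider the following, which uses the reduced
weather data of Figure 6.22(a):

    # Sketch of an AD-tree-like count structure: each node stores how many
    # instances match the attribute-value tests on the path from the root,
    # and children only expand attributes later in a fixed ordering.
    attributes = ["humidity", "windy", "play"]

    # Reduced weather data (humidity, windy, play), one tuple per instance,
    # following Table 1.2.
    data = [
        ("high", "false", "no"),    ("high", "true", "no"),
        ("high", "false", "yes"),   ("high", "false", "yes"),
        ("normal", "false", "yes"), ("normal", "true", "no"),
        ("normal", "true", "yes"),  ("high", "false", "no"),
        ("normal", "false", "yes"), ("normal", "false", "yes"),
        ("normal", "true", "yes"),  ("high", "true", "yes"),
        ("normal", "false", "yes"), ("high", "true", "no"),
    ]
    instances = [dict(zip(attributes, row)) for row in data]

    def build(insts, start=0):
        node = {"count": len(insts), "children": {}}
        for i in range(start, len(attributes)):
            attr = attributes[i]
            for value in sorted({inst[attr] for inst in insts}):
                subset = [inst for inst in insts if inst[attr] == value]
                node["children"][(attr, value)] = build(subset, i + 1)
        return node

    def lookup(node, tests):
        # Follow one branch per attribute-value test, in attribute order.
        for test in tests:
            node = node["children"][test]
        return node["count"]

    tree = build(instances)
    print(lookup(tree, [("humidity", "normal"), ("windy", "true"), ("play", "no")]))  # 1
    print(lookup(tree, [("play", "no")]))                                             # 5

The two lookups reproduce the counts mentioned above: one instance with
humidity = normal, windy = true, and play = no, and five instances with
play = no.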
It would be trivial to construct a tree that enumerates all 27 counts
explicitly. However, that would gain nothing over a plain table and is obviously not
