given network. It turns out that this can be implemented very efficiently by
changing the method for calculating the conditional probability tables so that
the resulting probability estimates implicitly contain information from all sub-
networks. The details of this approach are rather complex and will not be
described here.
The task of searching for a good network structure can be greatly simplified
if the right metric is used for scoring. Recall that the probability of a single
instance based on a network is the product of all the individual probabilities
from the various conditional probability tables. The overall probability of the
dataset is the product of these products for all instances. Because terms in a
product are interchangeable, the product can be rewritten to group together all
factors relating to the same table. The same holds for the log-likelihood, using
sums instead of products. This means that the likelihood can be optimized sep-
arately for each node of the network. This can be done by adding, or removing,
edges from other nodes to the node that is being optimized—the only constraint
is that cycles must not be introduced. The same trick also works if a local scoring
metric such as AIC or MDL is used instead of plain log-likelihood because the
penalty term splits into several components, one for each node, and each node
can be optimized independently.
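To make this decomposition concrete, the following Python sketch scores a single node locally from frequency counts; it is an illustrative assumption rather than the book's (or Weka's) actual code, and the data representation, function names, and the AIC-style penalty are all assumed for the example. Because the score of a node depends only on that node and its parents, each node's parent set can be chosen independently.

```python
import math
from collections import Counter

def node_log_likelihood(data, node, parents):
    """Log-likelihood contribution of one node given its parent set.
    `data` is a list of dicts mapping attribute name -> discrete value."""
    joint = Counter()          # counts of (parent configuration, node value)
    parent_counts = Counter()  # counts of parent configuration alone
    for row in data:
        pv = tuple(row[p] for p in parents)
        joint[(pv, row[node])] += 1
        parent_counts[pv] += 1
    # Sum over instances of log P(value | parent configuration), using
    # maximum-likelihood estimates derived from the counts themselves.
    return sum(n * math.log(n / parent_counts[pv])
               for (pv, _), n in joint.items())

def local_score(data, node, parents, n_values):
    """Penalized local score: log-likelihood minus the number of free
    parameters in this node's conditional probability table (AIC-style).
    `n_values[a]` is the number of distinct values attribute `a` can take."""
    k = (n_values[node] - 1) * math.prod(n_values[p] for p in parents)
    return node_log_likelihood(data, node, parents) - k
```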
Specific algorithms
Now we move on to actual algorithms for learning Bayesian networks. One
simple and very fast learning algorithm, called K2, starts with a given ordering
of the attributes (i.e., nodes). Then it processes each node in turn and greedily
considers adding edges from previously processed nodes to the current one. In
each step it adds the edge that maximizes the network’s score. When there is no
further improvement, attention turns to the next node. As an additional mech-
anism for overfitting avoidance, the number of parents for each node can be
restricted to a predefined maximum. Because only edges from previously pro-
cessed nodes are considered and there is a fixed ordering, this procedure cannot
introduce cycles. However, the result depends on the initial ordering, so it makes
sense to run the algorithm several times with different random orderings.
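The following Python sketch conveys the flavor of this greedy search; it builds on the local_score() sketch above and is an assumed illustration, not Weka's K2 implementation. Because only attributes earlier in the ordering are considered as parents, no cycles can arise, and max_parents caps the number of parents as an overfitting guard.

```python
def k2(data, ordering, n_values, max_parents=3, score=local_score,
       initial_parents=None):
    """Greedy K2-style structure search over a fixed attribute ordering."""
    parents = {a: list((initial_parents or {}).get(a, [])) for a in ordering}
    for i, node in enumerate(ordering):
        current = score(data, node, parents[node], n_values)
        while len(parents[node]) < max_parents:
            # Only edges from attributes earlier in the ordering are considered.
            candidates = [a for a in ordering[:i] if a not in parents[node]]
            scored = [(score(data, node, parents[node] + [c], n_values), c)
                      for c in candidates]
            if not scored:
                break
            best, choice = max(scored)
            if best <= current:       # no further improvement for this node
                break
            parents[node].append(choice)
            current = best
    return parents                    # parent lists define the network structure
```

Running such a sketch with several different random orderings and keeping the best-scoring structure corresponds to the advice above.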
The Naïve Bayes classifier is a network with an edge leading from the class
attribute to each of the other attributes. When building networks for classifica-
tion, it sometimes helps to use this network as a starting point for the search.
This can be done in K2 by forcing the class variable to be the first one in the
ordering and initializing the set of edges appropriately.
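In terms of the sketches above, such an initialization might look as follows; the helper name and keyword argument are assumptions introduced only to show the idea of seeding the search with the Naïve Bayes structure.

```python
def k2_from_naive_bayes(data, class_attr, attributes, n_values, **kwargs):
    """Run the k2() sketch starting from the Naive Bayes structure: the class
    attribute comes first in the ordering and is pre-installed as a parent of
    every other attribute."""
    ordering = [class_attr] + [a for a in attributes if a != class_attr]
    nb_edges = {a: [] if a == class_attr else [class_attr] for a in ordering}
    return k2(data, ordering, n_values, initial_parents=nb_edges, **kwargs)
```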
Another potentially helpful trick is to ensure that every attribute in the data
is in the Markov blanket of the node that represents the class attribute. A node’s
Markov blanket includes all its parents, children, and children’s parents. It can
be shown that a node is conditionally independent of all other nodes in the network given the values of the variables in its Markov blanket.
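For illustration, a Markov blanket can be read directly off the parent lists produced by the sketches above; this is an assumed helper, not part of the algorithms described in the text.

```python
def markov_blanket(node, parents):
    """Markov blanket of `node`: its parents, its children, and its
    children's other parents. `parents` maps each node to its parent list."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for c in children:
        blanket |= set(parents[c])   # add the children's other parents
    blanket.discard(node)
    return blanket
```

Checking whether every attribute appears in markov_blanket(class_attr, structure) is one way the heuristic just described could be tested.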