which is exactly the multiplication rule that we applied previously.
The two Bayesian networks in Figure 6.20 and Figure 6.21 are fundamentally
different. The first (Figure 6.20) makes stronger independence assumptions
because for each of its nodes the set of parents is a subset of the corresponding
set of parents in the second (Figure 6.21). In fact, Figure 6.20 is almost identical
to the simple Naïve Bayes classifier of Section 4.2. (The probabilities are
slightly different but only because each count has been initialized to 0.5 to avoid
the zero-frequency problem.) The network in Figure 6.21 has more rows in the
conditional probability tables and hence more parameters; it may be a more
accurate representation of the underlying domain.
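As a rough guide to the difference in size (purely for illustration, not something shown
in the figures themselves): a node with k possible values whose parents take
v1, v2, ..., vm values needs a conditional probability table with v1 × v2 × ... × vm rows,
and because the probabilities in each row must sum to 1, it contributes
(k - 1) × v1 × v2 × ... × vm free parameters. A two-valued node with a single two-valued
parent therefore needs 2 free parameters; give it a second, three-valued parent and it
needs 6. Adding a parent multiplies the size of the table rather than merely adding to it.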
It is tempting to assume that the directed edges in a Bayesian network represent
causal effects. But be careful! In our case, a particular value of play may
enhance the prospects of a particular value of outlook, but it certainly doesn't
cause it; it is more likely to be the other way round. Different Bayesian
networks can be constructed for the same problem, representing exactly the same
probability distribution. This is done by altering the way in which the joint
probability distribution is factorized to exploit conditional independencies. The
network whose directed edges model causal effects is often the simplest one with
the fewest parameters. Hence, human experts who construct Bayesian networks
for a particular domain often benefit from using directed edges to represent causal
effects. However, when machine learning techniques are applied to induce
models from data whose causal structure is unknown, all they can do is construct
a network based on the correlations that are observed in the data. Inferring
causality from correlation is always a dangerous business.
Learning Bayesian networks
The way to construct a learning algorithm for Bayesian networks is to define
two components: a function for evaluating a given network based on the data
and a method for searching through the space of possible networks. The quality
of a given network is measured by the probability of the data given the network.
We calculate the probability that the network accords to each instance and
multiply these probabilities together over all instances. In practice, this quickly
yields numbers too small to be represented properly (called arithmetic underflow),
so we use the sum of the logarithms of the probabilities rather than their product.
The resulting quantity is the log-likelihood of the network given the data.
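To make the scoring function concrete, here is a minimal Python sketch of it, assuming
the network is represented by a parents table (each attribute mapped to its parent
attributes) and a cpt lookup table (mapping an attribute, its parents' values, and its
own value to a probability); both data structures are hypothetical and chosen only for
this illustration.

import math

def log_likelihood(instances, parents, cpt):
    # instances: list of dicts mapping attribute name -> observed value
    # parents:   dict mapping attribute name -> tuple of its parent attributes
    # cpt:       dict mapping (attribute, parent values, value) -> probability
    # (Hypothetical data structures assumed for this sketch.)
    total = 0.0
    for instance in instances:
        for attribute, parent_attributes in parents.items():
            parent_values = tuple(instance[p] for p in parent_attributes)
            prob = cpt[(attribute, parent_values, instance[attribute])]
            # Summing logarithms instead of multiplying the probabilities
            # avoids arithmetic underflow on large datasets.
            total += math.log(prob)
    return total

A structure search can then compare candidate networks by this score.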
Assume that the structure of the network—the set of edges—is given. It’s easy
to estimate the numbers in the conditional probability tables: just compute the
relative frequencies of the associated combinations of attribute values in the
training data. To avoid the zero-frequency problem, each count is initialized with
a constant as described in Section 4.2. For example, to find the probability that
humidity = normal given that play = yes and temperature = cool (the last number