6.7 BAYESIAN NETWORKS 277
of the third row of the humiditynode’s table in Figure 6.21), observe from Table
1.2 (page 11) that there are three instances with this combination of attribute
values in the weather data, and no instances with humidity =highand the
same values for playand temperature. Initializing the counts for the two values
ofhumidityto 0.5 yields the probability (3 +0.5) / (3 + 0 +1) =0.875 for
humidity =normal.
The nodes in the network are predetermined, one for each attribute (includ-
ing the class). Learning the network structure amounts to searching through the
space of possible sets of edges, estimating the conditional probability tables for
each set, and computing the log-likelihood of the resulting network based on
the data as a measure of the network’s quality. Bayesian network learning
algorithms differ mainly in the way in which they search through the space of
network structures. Some algorithms are introduced below.
There is one caveat. If the log-likelihood is maximized based on the training
data, it will always be better to add more edges: the resulting network will simply
overfit. Various methods can be employed to combat this problem. One possi-
bility is to use cross-validation to estimate the goodness of fit. A second is to
add a penalty for the complexity of the network based on the number of param-
eters, that is, the total number of independent estimates in all the probability
tables. For each table, the number of independent probabilities is the total
number of entries minus the number of entries in the last column, which can
be determined from the other columns because all rows must sum to 1. Let K
be the number of parameters,LLthe log-likelihood, and Nthe number of
instances in the data. Two popular measures for evaluating the quality of a
network are the Akaike Information Criterion(AIC),
and the following MDL metricbased on the MDL principle:
In both cases the log-likelihood is negated, so the aim is to minimize these
scores.
A third possibility is to assign a prior distribution over network structures
and find the most likely network by combining its prior probability with the
probability accorded to the network by the data. This is the “Bayesian” approach
to network scoring. Depending on the prior distribution used, it can take
various forms. However, true Bayesians would average over all possible network
structures rather than singling out a particular network for prediction. Unfor-
tunately, this generally requires a great deal of computation. A simplified
approach is to average over all network structures that are substructures of a
MDL score=- +LLK
N
2log.AIC score=- +LL K,