Does this mean that we have to accept our fate and live with these shortcomings? No! There is a statistically based alternative: a theoretically well-founded
way of representing probability distributions concisely and comprehensibly in
a graphical manner. The structures are called Bayesian networks. They are drawn
as a network of nodes, one for each attribute, connected by directed edges in
such a way that there are no cycles—a directed acyclic graph.
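To make the structure concrete, here is a minimal sketch (illustrative only, not the book's code) that records such a graph as a mapping from each node to its parents, using the weather attributes discussed below, together with a simple check that the graph really is acyclic:

```python
# A hypothetical Bayesian network structure over the weather attributes,
# recorded as a mapping from each node to the list of its parents.
network = {
    "play":        [],        # the class attribute has no parents here
    "outlook":     ["play"],
    "temperature": ["play"],
    "humidity":    ["play"],
    "windy":       ["play"],
}

def topological_order(parents):
    """Return a node ordering that respects the edges, or fail on a cycle."""
    order, placed = [], set()
    while len(order) < len(parents):
        ready = [n for n, ps in parents.items()
                 if n not in placed and all(p in placed for p in ps)]
        if not ready:
            raise ValueError("not acyclic: the graph contains a cycle")
        order.extend(ready)
        placed.update(ready)
    return order

print(topological_order(network))  # e.g. ['play', 'outlook', 'temperature', ...]
```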
In our explanation of how to interpret Bayesian networks and how to learn
them from data, we will make some simplifying assumptions. We assume that
all attributes are nominal and that there are no missing values. Some advanced
learning algorithms can create new attributes in addition to the ones present in
the data—so-called hidden attributes whose values cannot be observed. These
can support better models if they represent salient features of the underlying
problem, and Bayesian networks provide a good way of using them at prediction time. However, they make both learning and prediction far more complex
and time consuming, so we will not consider them here.
Making predictions
Figure 6.20 shows a simple Bayesian network for the weather data. It has a node
for each of the four attributes outlook, temperature, humidity, and windy and one for the class attribute play. An edge leads from the play node to each of the
other nodes. But in Bayesian networks the structure of the graph is only half
the story. Figure 6.20 shows a table inside each node. The information in the
tables defines a probability distribution that is used to predict the class probabilities for any given instance.
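As a preview of that computation, the sketch below multiplies, for each class value, the class probability by the appropriate entry from each attribute's table and then normalizes. The probability values are placeholders, not those shown in Figure 6.20, and only two attribute tables are included for brevity:

```python
# Illustrative sketch only: placeholder probabilities, not the figure's.
# Each attribute node's table is indexed first by the value of its single
# parent, play, and then by the attribute's own value.
p_play = {"yes": 0.6, "no": 0.4}

cpt = {
    "outlook": {"yes": {"sunny": 0.2, "overcast": 0.5, "rainy": 0.3},
                "no":  {"sunny": 0.5, "overcast": 0.2, "rainy": 0.3}},
    "windy":   {"yes": {"true": 0.3, "false": 0.7},
                "no":  {"true": 0.6, "false": 0.4}},
}

def class_probabilities(instance):
    """Multiply the relevant table entries for each class value, then
    normalize so that the class probabilities sum to 1."""
    scores = {}
    for c, prior in p_play.items():
        score = prior
        for attr, value in instance.items():
            score *= cpt[attr][c][value]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(class_probabilities({"outlook": "sunny", "windy": "false"}))
```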
Before looking at how to compute this probability distribution, consider the
information in the tables. The lower four tables (for outlook, temperature,
humidity, and windy) have two parts separated by a vertical line. On the left are the values of play, and on the right are the corresponding probabilities for each
value of the attribute represented by the node. In general, the left side contains
a column for every edge pointing to the node, in this case just the play attribute. That is why the table associated with play itself does not have a left side: it has no parents. In general, each row of probabilities corresponds to one combination of values of the parent attributes, and the entries in the row show the
probability of each value of the node’s attribute given this combination. In
effect, each row defines a probability distribution over the values of the node’s
attribute. The entries in a row always sum to 1.
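A direct way to mirror this layout in code (a sketch with made-up numbers, not the figure's values) is to key each row by the combination of parent values and to verify that every row sums to 1:

```python
# Sketch of one node's table: each row is keyed by a combination of
# parent values and holds a distribution over the node's own values.
humidity_table = {
    ("yes",): {"high": 0.35, "normal": 0.65},  # parents: (play,)
    ("no",):  {"high": 0.75, "normal": 0.25},
}

# Each row must be a probability distribution: its entries sum to 1.
for combo, row in humidity_table.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, combo
```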
Figure 6.21 shows a more complex network for the same problem, where
three nodes (windy, temperature, and humidity) have two parents. Again, there
is one column on the left for each parent and as many columns on the right as
the attribute has values. Consider the first row of the table associated with the
temperature node. The left side gives a value for each parent attribute, play and outlook, and the right side gives a probability for each value of temperature.
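With two parents, the row key in the sketch above simply gains a second component. Continuing with made-up numbers (again not those of the figure), a lookup for the temperature node might read:

```python
# Two-parent table in the style of the temperature node: rows keyed by a
# (play, outlook) pair. The numbers are illustrative placeholders.
temperature_table = {
    ("yes", "sunny"):    {"hot": 0.2, "mild": 0.5, "cool": 0.3},
    ("yes", "overcast"): {"hot": 0.3, "mild": 0.4, "cool": 0.3},
    ("yes", "rainy"):    {"hot": 0.1, "mild": 0.5, "cool": 0.4},
    ("no",  "sunny"):    {"hot": 0.5, "mild": 0.4, "cool": 0.1},
    ("no",  "overcast"): {"hot": 0.3, "mild": 0.4, "cool": 0.3},
    ("no",  "rainy"):    {"hot": 0.2, "mild": 0.4, "cool": 0.4},
}

# P(temperature = mild | play = yes, outlook = sunny)
p = temperature_table[("yes", "sunny")]["mild"]
print(p)  # 0.5
```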