Does this mean that we have to accept our fate and live with these shortcomings? No! There is a statistically based alternative: a theoretically well-founded way of representing probability distributions concisely and comprehensibly in a graphical manner. The structures are called Bayesian networks. They are drawn as a network of nodes, one for each attribute, connected by directed edges in such a way that there are no cycles: a directed acyclic graph.
In our explanation of how to interpret Bayesian networks and how to learn
them from data, we will make some simplifying assumptions. We assume that
all attributes are nominal and that there are no missing values. Some advanced
learning algorithms can create new attributes in addition to the ones present in
the data—so-called hidden attributes whose values cannot be observed. These
can support better models if they represent salient features of the underlying
problem, and Bayesian networks provide a good way of using them at prediction time. However, they make both learning and prediction far more complex
and time consuming, so we will not consider them here.

Making predictions


Figure 6.20 shows a simple Bayesian network for the weather data. It has a node for each of the four attributes outlook, temperature, humidity, and windy, and one for the class attribute play. An edge leads from the play node to each of the other nodes. But in Bayesian networks the structure of the graph is only half the story. Figure 6.20 shows a table inside each node. The information in the tables defines a probability distribution that is used to predict the class probabilities for any given instance.
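To make the prediction step concrete, here is a minimal sketch (in Python, not from the book) of evaluating a network with the Figure 6.20 structure, in which play is the sole parent of every other node. The probability values are plain maximum-likelihood estimates computed from the standard 14-instance weather data; the tables printed in the figure may use smoothed estimates instead, so treat the numbers as illustrative.

# Conditional probability tables for the Figure 6.20 structure:
# play has no parents; every other node has play as its only parent.
# Values are maximum-likelihood estimates from the 14-instance weather
# data; the book's figure may show smoothed numbers instead.
cpt_play = {"yes": 9 / 14, "no": 5 / 14}

# Each table maps a play value (the parent) to a distribution over
# the node's own values, one "row" per parent value.
cpt = {
    "outlook": {
        "yes": {"sunny": 2 / 9, "overcast": 4 / 9, "rainy": 3 / 9},
        "no":  {"sunny": 3 / 5, "overcast": 0 / 5, "rainy": 2 / 5},
    },
    "temperature": {
        "yes": {"hot": 2 / 9, "mild": 4 / 9, "cool": 3 / 9},
        "no":  {"hot": 2 / 5, "mild": 2 / 5, "cool": 1 / 5},
    },
    "humidity": {
        "yes": {"high": 3 / 9, "normal": 6 / 9},
        "no":  {"high": 4 / 5, "normal": 1 / 5},
    },
    "windy": {
        "yes": {"true": 3 / 9, "false": 6 / 9},
        "no":  {"true": 3 / 5, "false": 2 / 5},
    },
}

def class_probabilities(instance):
    """Score each class as P(play) * prod P(attribute | play), normalized."""
    scores = {}
    for play_value, prior in cpt_play.items():
        score = prior
        for attribute, value in instance.items():
            score *= cpt[attribute][play_value][value]
        scores[play_value] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(class_probabilities(
    {"outlook": "sunny", "temperature": "cool",
     "humidity": "high", "windy": "true"}
))

For this particular structure the computation reduces to the familiar Naive Bayes rule: the class prior multiplied by the conditional probabilities of the observed attribute values, normalized over the classes.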
Before looking at how to compute this probability distribution, consider the information in the tables. The lower four tables (for outlook, temperature, humidity, and windy) have two parts separated by a vertical line. On the left are the values of play, and on the right are the corresponding probabilities for each value of the attribute represented by the node. In general, the left side contains a column for every edge pointing to the node, in this case just the play attribute. That is why the table associated with play itself does not have a left side: it has no parents. In general, each row of probabilities corresponds to one combination of values of the parent attributes, and the entries in the row show the probability of each value of the node's attribute given this combination. In effect, each row defines a probability distribution over the values of the node's attribute. The entries in a row always sum to 1.
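This table layout maps naturally onto a dictionary keyed by combinations of parent values. The following sketch (an illustration, not the book's code) stores one distribution per row and checks the invariant that every row sums to 1; a parentless node such as play is simply a table with a single row, keyed here by the empty tuple.

import math

# One row per combination of parent values (the table's left side);
# each row is a distribution over the node's own values (the right side).
# A parentless node such as play has a single row, keyed by ().
windy_table = {
    ("yes",): {"true": 3 / 9, "false": 6 / 9},
    ("no",):  {"true": 3 / 5, "false": 2 / 5},
}
play_table = {(): {"yes": 9 / 14, "no": 5 / 14}}

def check_rows(table):
    """Verify that every row of a conditional probability table sums to 1."""
    for parents, row in table.items():
        assert math.isclose(sum(row.values()), 1.0), parents

check_rows(windy_table)
check_rows(play_table)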
Figure 6.21 shows a more complex network for the same problem, where three nodes (windy, temperature, and humidity) have two parents. Again, there is one column on the left for each parent and as many columns on the right as the attribute has values. Consider the first row of the table associated with the temperature node. The left side gives a value for each of the parent attributes, one of which is play, and the right side gives one probability for each value of temperature.
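Looking up such a table is just a matter of selecting the row that matches the parents' values. The sketch below assumes, purely for illustration, that temperature's second parent in Figure 6.21 is outlook; both that assumption and the probability values are placeholders rather than the figure's actual entries.

# Hypothetical two-parent table for the temperature node (parents
# assumed, for illustration, to be play and outlook). The numbers are
# placeholders, not the figure's entries; keys are (play, outlook)
# combinations, and each row still sums to 1.
temperature_table = {
    ("yes", "sunny"):    {"hot": 0.4, "mild": 0.4, "cool": 0.2},
    ("yes", "overcast"): {"hot": 0.3, "mild": 0.5, "cool": 0.2},
    ("yes", "rainy"):    {"hot": 0.1, "mild": 0.5, "cool": 0.4},
    ("no", "sunny"):     {"hot": 0.6, "mild": 0.3, "cool": 0.1},
    ("no", "overcast"):  {"hot": 0.3, "mild": 0.4, "cool": 0.3},
    ("no", "rainy"):     {"hot": 0.1, "mild": 0.4, "cool": 0.5},
}

def prob(table, parent_values, value):
    """Select the row matching this combination of parent values,
    then read off the probability of the node's own value."""
    return table[parent_values][value]

# P(temperature = cool | play = yes, outlook = rainy)
print(prob(temperature_table, ("yes", "rainy"), "cool"))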
