Bayes was an eighteenth-century English philosopher who set out his theory
of probability in “An essay towards solving a problem in the doctrine of
chances,” published in the Philosophical Transactions of the Royal Society of
London (Bayes 1763); the rule that bears his name has been a cornerstone
of probability theory ever since. The difficulty with the application of Bayes’s
rule in practice is the assignment of prior probabilities. Some statisticians,
dubbed Bayesians, take the rule as gospel and insist that people make serious
attempts to estimate prior probabilities accurately—although such estimates are
often subjective. Others, non-Bayesians, prefer the kind of prior-free analysis
that typically generates statistical confidence intervals, which we will meet in the
next chapter. With a particular dataset, prior probabilities are usually reasonably
easy to estimate, which encourages a Bayesian approach to learning. The
independence assumption made by the Naïve Bayes method is a great stumbling
block, however, and some attempts are being made to apply Bayesian analysis
without assuming independence. The resulting models are called Bayesian
networks (Heckerman et al. 1995), and we describe them in Section 6.7.
Bayesian techniques had been used in the field of pattern recognition (Duda
and Hart 1973) for 20 years before they were adopted by machine learning
researchers (e.g., see Langley et al. 1992) and made to work on datasets with
redundant attributes (Langley and Sage 1994) and numeric attributes (John and
Langley 1995). The label Naïve Bayes is unfortunate because it is hard to use
this method without feeling simpleminded. However, there is nothing naïve
about its use in appropriate circumstances. The multinomial Naïve Bayes model,
which is particularly appropriate for text classification, was investigated by
McCallum and Nigam (1998).
The classic paper on decision tree induction is by Quinlan (1986), who
describes the basic ID3 procedure developed in this chapter. A comprehensive
description of the method, including the improvements that are embodied in
C4.5, appears in a classic book by Quinlan (1993), which gives a listing of the
complete C4.5 system, written in the C programming language. PRISM was
developed by Cendrowska (1987), who also introduced the contact lens dataset.
Association rules are introduced and described in the database literature
rather than in the machine learning literature. Here the emphasis is very much
on dealing with huge amounts of data rather than on sensitive ways of testing
and evaluating algorithms on limited datasets. The algorithm introduced in this
chapter is the Apriori method developed by Agrawal and his associates (Agrawal
et al. 1993a, 1993b; Agrawal and Srikant 1994). A survey of association-rule
mining appears in an article by Chen et al. (1996).
Linear regression is described in most standard statistical texts, and a particularly
comprehensive treatment can be found in a book by Lawson and Hanson
(1995). The use of linear models for classification enjoyed a great deal of popularity
in the 1960s; Nilsson (1965) provides an excellent reference. He defines
