Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

to contain attributes that are apparently highly predictive but nevertheless
irrelevant, and specialized statistical tests are needed to compare alternative
hypotheses. A third is that the iterative, improvement-driven development style
that characterizes data mining applications fails. It is impossible in principle to
create a fixed training-and-testing corpus for an interactive problem such as
programming by demonstration because each improvement in the agent alters
the test data by affecting how users react to it. A fourth is that existing
application programs provide limited access to application and user data: often the raw
material on which successful operation depends is inaccessible, buried deep
within the application program.
Data mining is already widely used at work. Text mining is starting to bring
the techniques in this book into our own lives, as we read our email and surf
the Web. As for the future, it will be stranger than we can imagine. The
spreading computing infrastructure will offer untold opportunities for learning. Data
mining will be there, behind the scenes, playing a role that will turn out to be
foundational.

8.6 Further reading

There is a substantial volume of literature that treats the topic of massive
datasets, and we can only point to a few references here. Fayyad and Smith
(1995) describe the application of data mining to voluminous data from
scientific experiments. Shafer et al. (1996) describe a parallel version of a top-down
decision tree inducer. A sequential decision tree algorithm for massive disk-
resident datasets has been developed by Mehta et al. (1996). The technique of
applying any algorithm to a large dataset by splitting it into smaller chunks and
bagging or boosting the result is described by Breiman (1999); Frank et al.
(2002) explain the related pruning and selection scheme.
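
The chunk-and-combine idea mentioned above can be illustrated with a minimal sketch. It is not the scheme of Breiman (1999) or Frank et al. (2002): it assumes a toy nearest-centroid base learner and plain unweighted majority voting, and every function name in it is hypothetical, chosen only for illustration.

```python
from collections import Counter
import random


def train_centroid(rows):
    """Toy base learner (an assumption): store the mean vector of each class."""
    sums, counts = {}, Counter()
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def train_on_chunks(rows, chunk_size):
    """Train one model per consecutive chunk, so no model sees the whole dataset."""
    return [train_centroid(rows[i:i + chunk_size])
            for i in range(0, len(rows), chunk_size)]


def vote(models, features):
    """Combine the chunk models by unweighted majority vote."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


def make_demo_rows(n_per_class=200, seed=0):
    """Synthetic two-class data: Gaussian clusters around (0, 0) and (5, 5)."""
    rng = random.Random(seed)
    rows = [([rng.gauss(0, 1), rng.gauss(0, 1)], "a") for _ in range(n_per_class)]
    rows += [([rng.gauss(5, 1), rng.gauss(5, 1)], "b") for _ in range(n_per_class)]
    rng.shuffle(rows)  # shuffle so each chunk sees a mix of both classes
    return rows


if __name__ == "__main__":
    models = train_on_chunks(make_demo_rows(), chunk_size=50)
    print(len(models), vote(models, [0.2, -0.1]), vote(models, [4.8, 5.3]))
```

Each model is trained on only one chunk, so memory use is bounded by the chunk size rather than by the size of the dataset. A boosting-style variant would instead weight the votes or choose later chunks according to earlier errors.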
Despite its importance, little seems to have been written about the general
problem of incorporating metadata into practical data mining. A scheme for
encoding domain knowledge into propositional rules and its use for both
deduction and induction has been investigated by Giraud-Carrier (1996). The
related area of inductive logic programming, which deals with knowledge
represented by first-order logic rules, is covered by Bergadano and Gunetti (1996).
Text mining is an emerging area, and there are few comprehensive surveys of
the area as a whole: Witten (2004) provides one. A large number of feature
selection and machine learning techniques have been applied to text categorization
(Sebastiani 2002). Martin (1995) describes applications of document clustering
to information retrieval. Cavnar and Trenkle (1994) show how to use n-gram
profiles to ascertain with high accuracy the language in which a document is
written. The use of support vector machines for authorship ascription is
