

7.6 Using unlabeled data


When introducing the machine learning process in Chapter 2 we drew a sharp
distinction between supervised and unsupervised learning—classification and
clustering. Recently researchers have begun to explore territory between the two,
sometimes called semisupervised learning, in which the goal is classification but
the input contains both unlabeled and labeled data. You can’t do classification
without labeled data, of course, because only the labels tell what the classes are.
But it is sometimes attractive to augment a small amount of labeled data with
a large pool of unlabeled data. It turns out that the unlabeled data can help you
learn the classes. How can this be?
First, why would you want it? Many situations present huge volumes of raw
data, but assigning classes is expensive because it requires human insight. Text
mining provides some classic examples. Suppose you want to classify Web pages
into predefined groups. In an academic setting you might be interested in faculty
pages, graduate student pages, course information pages, research group pages,
and department pages. You can easily download thousands, or millions, of
relevant pages from university Web sites. But labeling the training data is a
laborious manual process. Or suppose your job is to use machine learning to spot
names in text, differentiating among personal names, company names, and
place names. You can easily download megabytes, or gigabytes, of text, but
making this into training data by picking out the names and categorizing them
can only be done manually. Cataloging news articles, sorting electronic mail,
learning users’ reading interests—applications are legion. Leaving text aside,
suppose you want to learn to recognize certain famous people in television
broadcast news. You can easily record hundreds or thousands of hours of
newscasts, but again labeling is manual. In any of these scenarios it would be
enormously attractive to be able to leverage a large pool of unlabeled data to obtain
excellent performance from just a few labeled examples—particularly if you
were the graduate student who had to do the labeling!

Clustering for classification


How can unlabeled data be used to improve classification? Here’s a simple idea.
Use Naïve Bayes to learn classes from a small labeled dataset, and then extend
it to a large unlabeled dataset using the EM (expectation–maximization)
iterative clustering algorithm of Section 6.6. The procedure is this. First,
train a classifier using the labeled data. Second, apply it to the unlabeled data to label it
with class probabilities (the “expectation” step). Third, train a new classifier
using the labels for all the data (the “maximization” step). Fourth, iterate until
convergence. You could think of this as iterative clustering, where the starting
points are supplied by the small labeled dataset rather than chosen at random.
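
To make the procedure concrete, here is a minimal sketch in Python, assuming
documents are represented as bag-of-words count vectors and using a multinomial
Naive Bayes model implemented with NumPy. The names X_lab, y_lab, X_unlab, and
n_classes are hypothetical placeholders, not part of any particular toolkit.

import numpy as np

def train_nb(X, R, alpha=1.0):
    # Maximization step: estimate Naive Bayes parameters from the
    # (possibly fractional) class-membership weights in R (n x k).
    priors = R.sum(axis=0) / R.sum()                      # P(class)
    counts = R.T @ X + alpha                              # Laplace-smoothed word counts
    cond = counts / counts.sum(axis=1, keepdims=True)     # P(word | class)
    return np.log(priors), np.log(cond)

def class_probabilities(X, log_prior, log_cond):
    # Expectation step: P(class | document) for each row of X.
    log_joint = X @ log_cond.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def semisupervised_nb(X_lab, y_lab, X_unlab, n_classes, n_iter=20):
    R_lab = np.eye(n_classes)[y_lab]       # labeled docs keep fixed 0/1 weights
    # First: train a classifier using the labeled data alone.
    log_prior, log_cond = train_nb(X_lab, R_lab)
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        # Second: label the unlabeled data with class probabilities.
        R_unlab = class_probabilities(X_unlab, log_prior, log_cond)
        # Third: train a new classifier on all the data using those labels.
        log_prior, log_cond = train_nb(X_all, np.vstack([R_lab, R_unlab]))
        # Fourth: iterate (a fixed iteration count stands in here for a
        # convergence test on the log-likelihood).
    return log_prior, log_cond

Holding the labeled documents' membership weights fixed at 0 or 1 keeps the
scarce human labels from being washed out as EM reassigns the unlabeled pool;
in practice one would also monitor the log-likelihood to decide when to stop
iterating.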