and cluster labels are gleaned from the labeled data. The EM procedure guarantees to find model parameters that have equal or greater likelihood at each iteration. The key question, which can only be answered empirically, is whether these higher-likelihood parameter estimates will improve classification accuracy.

Intuitively, this might work well. Consider document classification. Certain phrases are indicative of the classes. Some occur in labeled documents, whereas others only occur in unlabeled ones. But there are probably some documents that contain both, and the EM procedure uses these to generalize the learned model to utilize phrases that do not appear in the labeled dataset. For example, both supervisor and PhD topic might indicate a graduate student's home page. Suppose that only the former phrase occurs in the labeled documents. EM iteratively generalizes the model to correctly classify documents that contain just the latter.
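
To make the bootstrapping loop concrete, here is a minimal Python sketch of semi-supervised EM wrapped around a multinomial Naïve Bayes model. It is an illustration under assumed conventions rather than the exact formulation used in the experiments described in this section: documents are represented as dense bag-of-words count matrices, class labels are integer indices, Laplace smoothing is used, and the helper names (train_nb, posteriors, em_nb) are invented for this example. Labeled documents keep their hard class assignments; unlabeled documents contribute fractional counts through their class posteriors, which are re-estimated at every iteration.

import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: estimate class priors and word probabilities from
    fractional class responsibilities R (n_docs x n_classes)."""
    priors = (R.sum(axis=0) + alpha) / (R.sum() + alpha * R.shape[1])
    counts = R.T @ X + alpha                       # expected word counts per class, smoothed
    word_probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(priors), np.log(word_probs)

def posteriors(X, log_priors, log_word_probs):
    """E-step: class posteriors P(class | document) under the
    conditional-independence (Naive Bayes) assumption."""
    log_joint = X @ log_word_probs.T + log_priors  # (n_docs x n_classes)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes, n_iters=10):
    """Semi-supervised EM: labeled documents keep hard (one-hot) labels,
    unlabeled documents get soft labels that are refined each iteration."""
    R_lab = np.eye(n_classes)[y_lab]
    model = train_nb(X_lab, R_lab)                 # initial model from labeled data only
    for _ in range(n_iters):
        R_unl = posteriors(X_unl, *model)          # E-step on unlabeled documents
        X_all = np.vstack([X_lab, X_unl])
        R_all = np.vstack([R_lab, R_unl])
        model = train_nb(X_all, R_all)             # M-step on labeled plus unlabeled data
    return model

Because the E-step produces probabilities rather than hard labels, an unlabeled document that the current model finds ambiguous contributes only weakly to any class, which is exactly the gradual-convergence behavior discussed next.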
This might work with any classifier and any iterative clustering algorithm.
But it is basically a bootstrapping procedure, and you must take care to ensure
that the feedback loop is a positive one. Using probabilities rather than hard
decisions seems beneficial because it allows the procedure to converge slowly
instead of jumping to conclusions that may be wrong. Naïve Bayes and the probabilistic EM procedure described in Section 6.6 are particularly apt choices
because they share the same fundamental assumption: independence between
attributes—or, more precisely, conditional independence between attributes
given the class.
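
In symbols, the shared assumption is that attribute values a_1, ..., a_k are conditionally independent given the class c, so that the class posterior factorizes (this is the standard Naïve Bayes formula; the notation here is ours):

\[
P(c \mid a_1, \ldots, a_k) \;\propto\; P(c) \prod_{i=1}^{k} P(a_i \mid c)
\]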
Of course, the independence assumption is universally violated. Even our little example used the two-word phrase PhD topic, whereas actual implementations would likely use individual words as attributes—and the example would have been far less compelling if we had substituted either of the single terms PhD or topic. The phrase PhD students is probably more indicative of faculty than graduate student home pages; the phrase research topic is probably less discriminating. It is the very fact that PhD and topic are not conditionally independent given the class that makes the example work: it is their combination that characterizes graduate student pages.
Nevertheless, coupling Naïve Bayes and EM in this manner works well in the domain of document classification. In a particular classification task, it matched the performance of a traditional learner while using fewer than one-third of the labeled training instances, supplemented by five times as many unlabeled ones. This is a good tradeoff when labeled instances are expensive but unlabeled ones are virtually free. With a small number of labeled documents, classification accuracy can be improved dramatically by incorporating many unlabeled ones.
Two refinements to the procedure have been shown to improve performance.
The first is motivated by experimental evidence that when there are many
labeled documents the incorporation of unlabeled data may reduce rather than
increase accuracy. Hand-labeled data is (or should be) inherently less noisy than