automatically labeled data. The solution is to introduce a weighting parameter
that reduces the contribution of the unlabeled data. This can be incorporated
into the maximization step of EM by maximizing the weighted likelihood of the
labeled and unlabeled instances. When the parameter is close to zero, unlabeled
documents have little influence on the shape of EM’s hill-climbing surface;
when close to one, the algorithm reverts to the original version in which the
surface is equally affected by both kinds of document.
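A minimal sketch of how such a weight might enter the maximization step for a Naive Bayes text model is given below; the function, its argument layout, and the Laplace smoothing are illustrative assumptions rather than a prescribed implementation.

    import numpy as np

    def weighted_m_step(counts_lab, resp_lab, counts_unlab, resp_unlab, lam=0.1):
        """Maximization step with down-weighted unlabeled documents.

        counts_*: (n_docs, n_words) word-count matrices
        resp_*:   (n_docs, n_classes) class responsibilities from the E-step
                  (0/1 indicators for the labeled documents)
        lam:      weight in [0, 1] on the unlabeled documents
        """
        # Expected word counts per class: labeled documents count fully,
        # unlabeled documents are scaled by lam (lam = 0 ignores them,
        # lam = 1 weights both kinds of document equally).
        counts = resp_lab.T @ counts_lab + lam * (resp_unlab.T @ counts_unlab)
        word_probs = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)
        # Class priors follow the same weighting.
        prior = resp_lab.sum(axis=0) + lam * resp_unlab.sum(axis=0)
        return word_probs, prior / prior.sum()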
The second refinement is to allow each class to have several clusters. As
explained in Section 6.6, the EM clustering algorithm assumes that the data is
generated randomly from a mixture of different probability distributions, one
per cluster. Until now, a one-to-one correspondence between mixture compo-
nents and classes has been assumed. In many circumstances this is unrealistic—
including document classification, because most documents address multiple
topics. With several clusters per class, each labeled document is initially distributed
randomly, in a probabilistic fashion, across the components of its class. The maximization
step of the EM algorithm remains as before, but the expectation step is modified so that
it not only probabilistically labels each example with the classes but also probabilistically
assigns it to the components within its class. The number of
clusters per class is a parameter that depends on the domain and can be set by
cross-validation.
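A sketch of the modified expectation step is shown below, again for a Naive Bayes text model; the mapping of components to classes (class_of) and the argument names are illustrative assumptions.

    import numpy as np

    def e_step_multi(counts, log_priors, log_word_probs, class_of):
        """Expectation step with several mixture components per class.

        counts:         (n_docs, n_words) word-count matrix
        log_priors:     (n_components,) log prior of each component
        log_word_probs: (n_components, n_words) log word probabilities
        class_of:       (n_components,) index of the class owning each component
        """
        # Log-likelihood of each document under each component.
        log_joint = counts @ log_word_probs.T + log_priors
        # Normalize across components to obtain component responsibilities.
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # A document's class probability is the sum of its responsibilities
        # over that class's components; a labeled document keeps its known
        # class but is redistributed across that class's components.
        class_probs = np.zeros((counts.shape[0], class_of.max() + 1))
        for k, c in enumerate(class_of):
            class_probs[:, c] += resp[:, k]
        return resp, class_probs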
Co-training
Another situation in which unlabeled data can improve classification perform-
ance is when there are two different and independent perspectives on the clas-
sification task. The classic example again involves documents, this time Web
documents, in which the two perspectives are the content of a Web page and the
links to it from other pages. These two perspectives are well known to be both
useful and different: successful Web search engines capitalize on them both,
using secret recipes. The text that labels a link to another Web page gives a
revealing clue as to what that page is about—perhaps even more revealing than
the page’s own content, particularly if the link is an independent one. Intuitively,
a link labeled my adviser is strong evidence that the target page is a faculty
member’s home page.
The idea, called co-training, is this. Given a few labeled examples, first learn
a different model for each perspective—in this case a content-based and a
hyperlink-based model. Then use each one separately to label the unlabeled
examples. For each model, select the example it most confidently labels as pos-
itive and the one it most confidently labels as negative, and add these to the pool
of labeled examples. Better yet, maintain the ratio of positive and negative exam-
ples in the labeled pool by choosing more of one kind than the other. In either
case, repeat the whole procedure, training both models on the augmented pool
of labeled examples, until the unlabeled pool is exhausted.
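The loop below sketches this procedure, assuming two feature views of each example and classifiers in the style of scikit-learn with fit and predict_proba; the function name, the round limit, and the pool handling are illustrative assumptions.

    import numpy as np

    def co_train(model_a, model_b, X_a, X_b, y, U_a, U_b,
                 n_pos=1, n_neg=1, rounds=30):
        """Co-training with two views of the same examples, such as the
        content of a Web page and the anchor text of links pointing to it.

        model_a, model_b: classifiers with fit() and predict_proba()
        X_a, X_b, y:      the two views of the labeled pool and its 0/1 labels
        U_a, U_b:         the two views of the unlabeled pool (numpy arrays)
        n_pos, n_neg:     examples each model moves per round; choosing more of
                          one kind preserves the class ratio in the labeled pool
        """
        for _ in range(rounds):
            if len(U_a) == 0:        # stop when the unlabeled pool is exhausted
                break
            model_a.fit(X_a, y)
            model_b.fit(X_b, y)
            new_labels = {}
            for model, U in ((model_a, U_a), (model_b, U_b)):
                probs = model.predict_proba(U)[:, 1]
                order = np.argsort(probs)
                for i in order[-n_pos:]:
                    new_labels[i] = 1          # most confidently positive
                for i in order[:n_neg]:
                    new_labels[i] = 0          # most confidently negative
            idx = sorted(new_labels)
            # Move the chosen examples, with both of their views, into the
            # labeled pool and remove them from the unlabeled pool.
            X_a = np.vstack([X_a, U_a[idx]])
            X_b = np.vstack([X_b, U_b[idx]])
            y = np.concatenate([y, [new_labels[i] for i in idx]])
            keep = np.setdiff1d(np.arange(len(U_a)), idx)
            U_a, U_b = U_a[keep], U_b[keep]
        return model_a, model_b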