There is some experimental evidence, using Naïve Bayes throughout as the
learner, that this bootstrapping procedure outperforms one that employs all the
features from both perspectives to learn a single model from the labeled data.
It relies on having two different views of an instance that are redundant but not
completely correlated. Various domains have been proposed, from spotting
celebrities in televised newscasts using video and audio separately to mobile
robots with vision, sonar, and range sensors. The independence of the views
reduces the likelihood of both hypotheses agreeing on an erroneous label.
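In outline, the procedure might be sketched as follows. The sketch is illustrative only: it assumes scikit-learn's MultinomialNB as the base learner, two hypothetical feature matrices Xa and Xb (one per view, holding non-negative counts), and index lists labeled_idx and unlabeled_idx; none of these names comes from the original experiments.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(Xa, Xb, y, labeled_idx, unlabeled_idx, rounds=10, per_view=1):
    """Co-training sketch. y need only be valid at labeled_idx; Xa and Xb
    are the two feature views (non-negative counts, as MultinomialNB expects)."""
    y = np.asarray(y).copy()
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    for _ in range(rounds):
        clf_a = MultinomialNB().fit(Xa[labeled], y[labeled])
        clf_b = MultinomialNB().fit(Xb[labeled], y[labeled])
        # Each view adds the few unlabeled examples it classifies most confidently.
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X[unlabeled])
            picks = np.argsort(probs.max(axis=1))[-per_view:]
            for p in sorted(picks, reverse=True):   # pop higher indices first
                idx = unlabeled.pop(int(p))
                y[idx] = clf.classes_[probs[p].argmax()]
                labeled.append(idx)
    return clf_a, clf_b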

EM and co-training


On datasets with two feature sets that are truly independent, experiments have shown that co-training gives better results than using EM as described previously. Even better performance, however, can be achieved by combining the two into a modified version of co-training called co-EM. Co-training trains two classifiers representing different perspectives, A and B, and uses both to add new examples to the training pool by choosing whichever unlabeled examples they classify most confidently as positive or negative. The new examples are few in number and deterministically labeled. Co-EM, on the other hand, trains classifier A on the labeled data and uses it to probabilistically label all the unlabeled data. Next it trains classifier B on both the labeled data and the unlabeled data with classifier A's tentative labels, and then it probabilistically relabels all the data for use by classifier A. The process iterates until the classifiers converge. This procedure seems to perform consistently better than co-training because it does not commit to the class labels generated by classifiers A and B but rather reestimates their probabilities at each iteration.
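A corresponding sketch of the co-EM iteration, under the same assumptions as before (scikit-learn's MultinomialNB and the hypothetical names Xa, Xb, labeled_idx, unlabeled_idx), might look like this; the soft labels enter training as probabilistically weighted copies of each unlabeled example.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_on_soft_labels(X, y, labeled_idx, unlabeled_idx, probs, classes):
    """Train on the labeled data (weight 1) plus every unlabeled example once
    per class, weighted by the other classifier's class probability."""
    Xs = [X[labeled_idx]] + [X[unlabeled_idx]] * len(classes)
    ys = [y[labeled_idx]] + [np.full(len(unlabeled_idx), c) for c in classes]
    ws = [np.ones(len(labeled_idx))] + [probs[:, k] for k in range(len(classes))]
    return MultinomialNB().fit(np.vstack(Xs), np.concatenate(ys),
                               sample_weight=np.concatenate(ws))

def co_em(Xa, Xb, y, labeled_idx, unlabeled_idx, iterations=10):
    """Co-EM sketch: the two views alternately assign probabilistic labels
    to all the unlabeled data, re-estimating them every iteration."""
    y = np.asarray(y)
    clf_a = MultinomialNB().fit(Xa[labeled_idx], y[labeled_idx])
    for _ in range(iterations):
        probs_a = clf_a.predict_proba(Xa[unlabeled_idx])      # A labels for B
        clf_b = fit_on_soft_labels(Xb, y, labeled_idx, unlabeled_idx,
                                   probs_a, clf_a.classes_)
        probs_b = clf_b.predict_proba(Xb[unlabeled_idx])      # B relabels for A
        clf_a = fit_on_soft_labels(Xa, y, labeled_idx, unlabeled_idx,
                                   probs_b, clf_b.classes_)
    return clf_a, clf_b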
The range of applicability of co-EM, like co-training, is still limited by the
requirement for multiple independent perspectives. But there is some experimental evidence to suggest that even when there is no natural split of features
into independent perspectives, benefits can be achieved by manufacturing such
a split and using co-training—or, better yet, co-EM—on the split data. This
seems to work even when the split is made randomly; performance could surely
be improved by engineering the split so that the feature sets are maximally
independent. Why does this work? Researchers have hypothesized that these
algorithms succeed partly because the split makes them more robust to the
assumptions that their underlying classifiers make.
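For illustration, such a split might be manufactured as follows; random_view_split is a hypothetical helper, and the co_em call refers to the sketch above.

import numpy as np

def random_view_split(X, seed=0):
    """Randomly partition the columns of X into two disjoint feature 'views'."""
    cols = np.random.default_rng(seed).permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, cols[:half]], X[:, cols[half:]]

# Usage with the earlier sketches (hypothetical X, y, and index lists):
# Xa, Xb = random_view_split(X)
# clf_a, clf_b = co_em(Xa, Xb, y, labeled_idx, unlabeled_idx)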
There is no particular reason to restrict the base classifier to Naïve Bayes.
Support vector machines probably represent the most successful technology for
text categorization today. However, for the EM iteration to work it is necessary
that the classifier labels the data probabilistically; it must also be able to use
probabilistically weighted examples for training. Support vector machines can
easily be adapted to do both. We explained how to adapt learning algorithms to
