both the number of instances and the number of attributes. For top-down deci-
sion tree inducers, we saw in Section 6.1 (pages 196–198) that training time is
linear in the number of attributes and, if the tree is uniformly bushy, log-linear
in the number of instances (if subtree raising is not used or, if it is, with a further
log factor).
When a dataset is too large for a particular learning algorithm to be applied,
there are three ways to make learning feasible. The first is trivial: instead of
applying the scheme to the full dataset, use just a small subset for training. Of
course, information is lost when subsampling is employed. However, the loss
may be negligible because the predictive performance of a learned model often
flattens out long before all the training data is incorporated into it. Whether this
is the case can easily be verified by observing the model’s performance on a holdout
test set for training sets of different sizes.
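This check is easy to script. The following minimal sketch, which assumes scikit-learn as a stand-in learner and an illustrative synthetic dataset (neither comes from the book), trains on progressively larger subsamples and reports accuracy on a fixed holdout set so that any flattening becomes visible:

    # Sketch: does predictive performance flatten out before all data is used?
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative synthetic data standing in for a large dataset.
    X, y = make_classification(n_samples=50000, n_features=20, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1)

    for n in (500, 2000, 8000, 32000):         # increasing training-set sizes
        model = DecisionTreeClassifier(random_state=1)
        model.fit(X_train[:n], y_train[:n])    # train on the first n instances only
        print(n, model.score(X_test, y_test))  # accuracy on the fixed holdout set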
This kind of behavior, called the law of diminishing returns, may arise because
the learning problem is a simple one, so that a small volume of training data is
sufficient to learn an accurate model. Alternatively, the learning algorithm might
be incapable of grasping the detailed structure of the underlying domain. This
is often observed when Naïve Bayes is employed in a complex domain: addi-
tional training data may not improve the performance of the model, whereas
a decision tree’s accuracy may continue to climb. In this case, of course, if
predictive performance is the main objective you should switch to the more
complex learning algorithm. But beware of overfitting! Take care not to assess
performance on the training data.
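To make the contrast concrete, here is a rough sketch along the same lines that compares the two kinds of learner on growing training sets; scikit-learn and the synthetic data are again assumed stand-ins, not the book's own tools. Note that both models are scored on a holdout set, never on the training data:

    # Sketch: Naive Bayes often flattens early; a decision tree may keep improving.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=50000, n_features=30,
                               n_informative=15, random_state=7)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

    for n in (1000, 5000, 20000, 40000):       # growing training sets
        nb = GaussianNB().fit(X_tr[:n], y_tr[:n])
        dt = DecisionTreeClassifier(random_state=7).fit(X_tr[:n], y_tr[:n])
        # report holdout accuracy of both learners at this training-set size
        print(n, round(nb.score(X_te, y_te), 3), round(dt.score(X_te, y_te), 3))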
Parallelization is another way of reducing the time complexity of learning.
The idea is to split the problem into smaller parts, solve each using a separate
processor, and combine the results. To do this, a parallelized version
of the learning algorithm must be created. Some algorithms lend themselves
naturally to parallelization. Nearest-neighbor methods, for example, can easily
be distributed among several processors by splitting the data into several parts
and letting each processor find the nearest neighbor in its part of the training
set. Decision tree learners can be parallelized by letting each processor build a
subtree of the complete tree. Bagging and stacking (although not boosting) are
naturally parallel algorithms. However, parallelization is only a partial remedy
because with a fixed number of processors, the algorithm’s asymptotic time
complexity cannot be improved.
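To illustrate the nearest-neighbor case, here is a minimal sketch that assumes Python's standard multiprocessing facilities and NumPy; the data and the four-way split are invented for the example. Each worker finds the nearest neighbor of a query point within its own part of the training set, and the partial results are combined by taking the overall minimum distance:

    # Sketch: distributing nearest-neighbor search over several processes.
    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    import numpy as np

    def local_nearest(part, query):
        """Return (distance, label) of the nearest instance within one data part."""
        X, y = part
        d = np.linalg.norm(X - query, axis=1)   # Euclidean distances to the query
        i = int(np.argmin(d))
        return d[i], y[i]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100000, 10))       # illustrative training data
        y = rng.integers(0, 2, size=100000)
        query = rng.normal(size=10)

        # Split the training set into four parts, one per worker process.
        parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
        with ProcessPoolExecutor(max_workers=4) as pool:
            partial_results = pool.map(partial(local_nearest, query=query), parts)
        dist, label = min(partial_results)      # combine: smallest distance wins
        print(dist, label)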
A simple way to apply any algorithm to a large dataset is to split the data into
chunks of limited size and learn models separately for each one, combining the
results using voting or averaging. Either a parallel bagging-like scheme or a
sequential boosting-like scheme can be employed for this purpose. Boosting has
the advantage that new chunks can be weighted based on the classifiers learned
from previous chunks, thus transferring knowledge between chunks. In both
cases memory consumption increases linearly with dataset size; hence some
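A rough sketch of this chunk-and-vote idea in its parallel, bagging-like form follows; scikit-learn, the synthetic data, and the chunk size are all assumptions made for illustration:

    # Sketch: chunk-wise training with majority voting (a bagging-like combination).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=60000, n_features=20, random_state=3)
    X_new = X[:5]                               # a few instances to classify

    chunk = 10000
    models = []
    for start in range(0, len(X), chunk):       # learn one model per chunk
        m = DecisionTreeClassifier(random_state=3)
        m.fit(X[start:start + chunk], y[start:start + chunk])
        models.append(m)

    votes = np.array([m.predict(X_new) for m in models])   # one row per model
    majority = np.array([np.bincount(col).argmax() for col in votes.T])
    print(majority)                             # majority-vote class per instance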

