form of pruning is necessary for very large datasets. This can be done by setting
aside some validation data and only adding a model from a new chunk to the
committee classifier if it increases the committee’s performance on the validation
set. The validation set can also be used to identify an appropriate chunk
size by running the method with several different chunk sizes in parallel and
monitoring performance on the validation set.
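To make the idea concrete, here is a minimal sketch in Java of chunk-by-chunk committee construction with validation-based pruning. The Model interface is a purely hypothetical stand-in for whatever learning scheme is applied to each chunk (it is not part of any particular library), and the two-class majority vote is just one of many possible combination schemes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a classifier built from one chunk of data;
// not part of any particular library.
interface Model {
    double classify(double[] instance); // returns 0 or 1 for a two-class problem
}

public class ChunkedCommittee {
    private final List<Model> committee = new ArrayList<>();

    // Committee prediction by simple majority vote.
    public double classify(double[] instance) {
        int votesForOne = 0;
        for (Model m : committee) {
            if (m.classify(instance) > 0.5) votesForOne++;
        }
        return 2 * votesForOne >= committee.size() ? 1.0 : 0.0;
    }

    // Add the model learned from a new chunk only if it improves the
    // committee's accuracy on the held-out validation data.
    public void maybeAdd(Model candidate, double[][] valData, double[] valLabels) {
        double before = accuracy(valData, valLabels);
        committee.add(candidate);
        double after = accuracy(valData, valLabels);
        if (after <= before && committee.size() > 1) {
            committee.remove(committee.size() - 1); // discard: no improvement
        }
    }

    private double accuracy(double[][] data, double[] labels) {
        if (committee.isEmpty()) return 0.0;
        int correct = 0;
        for (int i = 0; i < data.length; i++) {
            if (classify(data[i]) == labels[i]) correct++;
        }
        return (double) correct / data.length;
    }
}
```

The same validation data can drive the choice of chunk size: run several instances of this loop with different chunk sizes and keep the one whose committee performs best.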
The best but most challenging way to enable a learning paradigm to deal with
very large datasets would be to develop new algorithms with lower computational
complexity. In some cases, it is provably impossible to derive exact algorithms
with lower complexity. Decision tree learners that deal with numeric attributes
fall into this category. Their asymptotic time complexity is dominated by sorting
the numeric attribute values, which takes time proportional to n log n for n values
and must be performed at least once for any given dataset. However, stochastic
algorithms can sometimes be derived that approximate the true solution but require
much less time.
Background knowledge can make it possible to vastly reduce the amount of
data that needs to be processed by a learning algorithm. Depending on which
attribute is the class, most of the attributes in a huge dataset might turn out to
be irrelevant when background knowledge is taken into account. As usual, it
pays to carefully engineer the data that is passed to the learning scheme and
make the greatest possible use of any prior information about the learning
problem at hand. If insufficient background knowledge is available, the attribute
filtering algorithms described in Section 7.1 can often drastically reduce the
amount of data—possibly at the expense of a minor loss in predictive performance.
Some of these—for example, attribute selection using decision trees
or the 1R learning scheme—are linear in the number of attributes.
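As an illustration, the following Java sketch uses Weka’s attribute selection classes to rank attributes with a 1R-based evaluator and keep only the top-ranked ones before learning. The class and method names reflect one version of the Weka API and may differ in others; the file name and the number of attributes to retain are placeholders.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.OneRAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class FilterAttributes {
    public static void main(String[] args) throws Exception {
        // Load the full dataset (ARFF file supplied on the command line).
        Instances data = new Instances(new BufferedReader(new FileReader(args[0])));
        data.setClassIndex(data.numAttributes() - 1);

        // Rank each attribute by the accuracy of a simple 1R rule built on it,
        // then keep only the best-ranked attributes.
        AttributeSelection selector = new AttributeSelection();
        OneRAttributeEval evaluator = new OneRAttributeEval();
        Ranker search = new Ranker();
        search.setNumToSelect(10);          // placeholder: keep the 10 top-ranked attributes
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        // Produce a reduced dataset containing only the selected attributes.
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Attributes kept: " + (reduced.numAttributes() - 1));
    }
}
```

Because the 1R evaluation is linear in the number of attributes, this kind of preprocessing scales to datasets with very many attributes.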
Just to give you a feeling for the amount of data that can be handled by
straightforward implementations of machine learning algorithms on ordinary
microcomputers, we ran the decision tree learner J4.8 on a dataset with 600,000
instances, 54 attributes (10 numeric and 44 binary), and a class with seven
values. We used a Pentium 4 processor with a 2.8-GHz clock and a Java virtual
machine with a “just-in-time compiler.” It took 40 minutes to load the data
file, build the tree using reduced-error pruning, and classify all the training
instances. The tree had 20,000 nodes. Note that this implementation is written
in Java, and executing a Java program is often several times slower than running
a corresponding program written in C because the Java byte-code must be
translated into machine code before it can be executed. (In our experience the
difference is usually a factor of three to five if the virtual machine uses a
just-in-time compiler.)
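If you want to repeat this kind of experiment, a run like the one described above can be set up along the following lines. The sketch assumes the Weka J48 class (the implementation behind J4.8), whose option and method names may differ slightly between releases; the dataset file is supplied as a command-line argument.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class LargeTreeTest {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();

        // Load the training data (ARFF file supplied on the command line).
        Instances data = new Instances(new BufferedReader(new FileReader(args[0])));
        data.setClassIndex(data.numAttributes() - 1);

        // Build the tree using reduced-error pruning.
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);
        tree.buildClassifier(data);

        // Classify all training instances, as in the experiment described above.
        int correct = 0;
        for (int i = 0; i < data.numInstances(); i++) {
            if (tree.classifyInstance(data.instance(i)) == data.instance(i).classValue()) {
                correct++;
            }
        }
        System.out.println("Training accuracy: " + (double) correct / data.numInstances());
        System.out.println("Elapsed minutes: " + (System.currentTimeMillis() - start) / 60000);
    }
}
```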
There are datasets today that truly deserve the adjective massive. Scientific
datasets from astrophysics, nuclear physics, earth science, and molecular biology
are measured in hundreds of gigabytes—or even terabytes. So are datasets