calibration spectra and data, which are
based partially on random correlations and
not true chemical/spectral relationships, or
are very specific to the calibration sample.
This can occur because there are so many
wavelengths available. In theory, if one has
700 wavelengths and only 100 samples, then by
using enough wavelengths (100 unknowns can
be fitted exactly by 100 knowns) one gets a
perfect, but meaningless, fit.
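This perfect-but-meaningless fit is easy to demonstrate. The sketch below (an illustration on simulated data, not from the text) fits an ordinary least-squares model to 100 "spectra" of 100 wavelengths each, where both the spectra and the reference values are pure random noise, and still achieves an essentially exact fit:

```python
# Illustrative sketch: with as many wavelengths as samples, least squares
# fits even pure noise perfectly (100 unknowns, 100 knowns).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 100, 100
X = rng.normal(size=(n_samples, n_wavelengths))   # random "spectra"
y = rng.normal(size=n_samples)                    # random "reference" values

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = np.max(np.abs(X @ coef - y))
print(residual)  # essentially zero: a perfect fit with no chemical meaning
```

Such an equation describes only the calibration samples themselves and has no predictive value for future samples.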
One solution has been to limit the number
of wavelengths selected (e.g. one for every
ten samples in the calibration to a maximum
of ten or so). The problem then is that there
is a lot of potentially useful spectral
information available which is not being
used. The solution used to address both
problems simultaneously has been to use
what are generally known as ‘whole
spectrum’-based procedures.
As a result, most of the chemometric
efforts over the last 10 years or so have
revolved around procedures such as factor
analysis, principal components regression (PCR) and
partial least squares (PLS) regression
(Sharaf et al., 1986). PCR and PLS, in
particular, have enjoyed great success. In
these two procedures, the entire spectrum
is used. The spectra are decomposed into a
series of factors which represent the
variance in the spectra. In such a manner,
the information in the spectra is com-
pressed into a reduced series of factors
which can then be used in a regression
process to determine the analyte of
interest. Other procedures used include
Neural Networks (McClure et al., 1992),
genetic algorithms (Goldberg, 1989, used to
reduce the number of wavelengths to be
considered) and just about any method
ever devised to extract predictive informa-
tion from data. At present, efforts based on
PLS, PCR and Neural Networks seem to be
the most popular, with genetic algorithms
used to select wavelengths. Considerable
theoretical work has been carried out using
factor analysis, but it has not found much
use because one needs to know what the
spectra of the factors (components in the
samples) are before starting and, in
complex samples, such as feedstuffs, that is
virtually impossible.
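The compression-then-regression idea behind PCR and PLS can be sketched in a few lines. The following is a minimal PCR example on simulated spectra (an assumed illustration, not data from the text): the spectra are decomposed into factors by singular value decomposition, and the analyte is regressed on the factor scores rather than on the 700 individual wavelengths.

```python
# Minimal PCR sketch on simulated spectra: decompose spectra into factors,
# then regress the analyte of interest on the compressed factor scores.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_wavelengths = 60, 700
# Simulated spectra: two latent "chemical" components plus a little noise
latent = rng.normal(size=(n_samples, 2))
loadings = rng.normal(size=(2, n_wavelengths))
X = latent @ loadings + 0.01 * rng.normal(size=(n_samples, n_wavelengths))
y = latent @ np.array([1.0, -0.5])               # analyte driven by the latents

Xc = X - X.mean(axis=0)                          # mean-centre the spectra
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                            # number of factors retained
scores = U[:, :k] * s[:k]                        # 700 wavelengths -> 2 factors
b, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
y_hat = scores @ b + y.mean()
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(rmse)
```

The whole spectrum contributes to each factor, so no spectral information is discarded, yet only two regression coefficients are estimated instead of 700.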
Calibration validation and testing
Regardless of what method is used to
develop the calibration, there are many
steps which need to be performed in deter-
mining the final calibration. For example,
how many wavelengths are needed in an
MLR? If five are enough, then using more
just results in over-fitting. The same
applies to PLS; the number of factors
possible is one less than the number of
samples, but rarely are more than a dozen
or so needed, even for large data sets. How
does one decide how many to use? For
each procedure, PLS or MLR, and even in
many cases for each software package, a
number of statistical tests are available,
which in essence determine when the
increase in accuracy obtained by adding an
additional factor reaches the point of
diminishing returns.
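One common way to locate that point of diminishing returns is to compare validation error as factors are added. The sketch below (a hypothetical illustration, not a test from the text) builds PCR models with 1 to 10 factors on simulated data with three underlying components and picks the factor count with the lowest error on held-out samples:

```python
# Hypothetical sketch: choose the number of PCR factors at the point where
# validation error stops improving (diminishing returns).
import numpy as np

rng = np.random.default_rng(2)
n, p, true_k = 80, 200, 3
L = rng.normal(size=(n, true_k))                 # three latent components
X = L @ rng.normal(size=(true_k, p)) + 0.05 * rng.normal(size=(n, p))
y = L @ np.array([1.0, -1.0, 0.5]) + 0.05 * rng.normal(size=n)

train, val = np.arange(0, 60), np.arange(60, 80)
Xm, ym = X[train].mean(axis=0), y[train].mean()
U, s, Vt = np.linalg.svd(X[train] - Xm, full_matrices=False)

errors = []
for k in range(1, 11):                           # try 1..10 factors
    b, *_ = np.linalg.lstsq(U[:, :k] * s[:k], y[train] - ym, rcond=None)
    val_scores = (X[val] - Xm) @ Vt[:k].T        # project onto same factors
    pred = val_scores @ b + ym
    errors.append(np.sqrt(np.mean((y[val] - pred) ** 2)))

best_k = 1 + int(np.argmin(errors))
print(best_k, errors)
```

Because the simulated spectra contain three real components, the error drops sharply up to three factors and then flattens; additional factors only model noise.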
One simple method used with MLR is
to divide the calibration set into two sets of
data, a calibration and validation set. The
equations are then developed using the
calibration samples and the validation
samples are predicted. In developing
calibrations, it is not uncommon to try
various data pre-treatments (e.g. derivatives,
scatter corrections, etc., which are
used to help extract the information from
the spectra); the result is that one often has
many different calibrations to examine and
choose from. Since the validation set is
involved in the development process, it
becomes likely that one will find a set of
terms which also randomly does well on
the validation set, but not on future
samples. By placing restrictions on the
criteria for selecting how many terms one
can use, experience has shown that one
improves the likelihood that the final
equation selected will be valid for future
samples. The final test comes when one
applies the selected equation to a new set
of samples (test set).
Very similar procedures are used for
Neural Net calibrations. For PLS and PCR a
slight variation, called one-out (leave-one-out)
cross-validation, is used. In this procedure, each
sample is removed from the data set and a
calibration developed using the remaining
samples. This is repeated N times (each