This rings a bell! A similar expression for support vector machines enabled the
use of kernels. Indeed, we can apply exactly the same trick here and use a kernel
function instead of the dot product. Writing this function as $K(\cdot,\cdot)$ gives

$$\sum_j y(j)\, K(\mathbf{a}(j), \mathbf{a})$$

where $\mathbf{a}(j)$ is the $j$th misclassified training instance, $y(j)$ is its class (coded as $+1$ or $-1$), and $\mathbf{a}$ is the instance to be classified.
In this way the perceptron algorithm can learn a nonlinear classifier simply by
keeping track of the instances that have been misclassified during the training
process and using this expression to form each prediction.
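To make this concrete, the following sketch shows one way a kernel perceptron might be implemented in Python. The KernelPerceptron class, the choice of an RBF kernel, and the epochs parameter are illustrative assumptions on our part, not details given in the text:

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel; any kernel function could be substituted
    # (a linear kernel, np.dot(a, b), recovers the ordinary perceptron).
    return np.exp(-gamma * np.sum((a - b) ** 2))

class KernelPerceptron:
    # Stores the misclassified training instances a(j) and their labels y(j);
    # a prediction is the sign of sum_j y(j) * K(a(j), a).
    def __init__(self, kernel=rbf_kernel):
        self.kernel = kernel
        self.mistakes = []  # misclassified instances a(j)
        self.labels = []    # their class labels y(j), coded as -1 or +1

    def predict(self, a):
        score = sum(y * self.kernel(m, a)
                    for m, y in zip(self.mistakes, self.labels))
        return 1 if score >= 0 else -1

    def fit(self, X, y, epochs=5):
        for _ in range(epochs):
            for a, t in zip(X, y):
                if self.predict(a) != t:  # mistake: add instance to the sum
                    self.mistakes.append(a)
                    self.labels.append(t)
        return self

Because training only ever appends misclassified instances, the same fit method can be called again as new data arrives, which is what makes incremental learning straightforward.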
If a separating hyperplane exists in the high-dimensional space implicitly
created by the kernel function, this algorithm will learn one. However, it won’t
learn the maximum margin hyperplane found by a support vector machine
classifier. This means that classification performance is usually worse. On the
plus side, the algorithm is easy to implement and supports incremental learning.
This classifier is called the kernel perceptron. It turns out that all sorts of
algorithms for learning linear models can be upgraded by applying the kernel
trick in a similar fashion. For example, logistic regression can be turned into
kernel logistic regression. The same applies to regression problems: linear
regression can also be upgraded using kernels. A drawback of these advanced
methods for linear and logistic regression (if they are done in a straightforward
manner) is that the solution is not “sparse”: every training instance contributes
to the solution vector. In support vector machines and the kernel perceptron,
only some of the training instances affect the solution, and this can make a big
difference to computational efficiency.
The solution vector found by the perceptron algorithm depends greatly on
the order in which the instances are encountered. One way to make the
algorithm more stable is to use all the weight vectors encountered during
learning, not just the final one, letting them vote on a prediction. Each weight
vector contributes a certain number of votes. Intuitively, the “correctness” of a
weight vector can be measured roughly as the number of successive trials after
its inception in which it correctly classified subsequent instances and thus didn’t
have to be changed. This measure can be used as the number of votes given to
the weight vector, giving an algorithm known as the voted perceptron that
performs almost as well as a support vector machine. (Note that, as previously
mentioned, the various weight vectors in the voted perceptron don’t need to be
stored explicitly, and the kernel trick can be applied here too.)
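The voting scheme can be sketched along the following lines, again in Python; the function names and the epochs parameter are our own assumptions. For clarity this version stores each weight vector explicitly, although, as just noted, the vectors can be represented implicitly through the misclassified instances, which is also what allows the kernel trick to be applied:

import numpy as np

def voted_perceptron_train(X, y, epochs=5):
    # Record every weight vector together with its survival count: the number
    # of consecutive instances it classified correctly before being changed.
    w = np.zeros(X.shape[1])
    votes = 0
    history = []  # list of (weight vector, vote count) pairs
    for _ in range(epochs):
        for a, t in zip(X, y):
            if t * np.dot(w, a) > 0:
                votes += 1                  # survived another trial
            else:
                history.append((w, votes))  # retire w with its votes
                w = w + t * a               # standard perceptron update
                votes = 1
    history.append((w, votes))              # the final vector also votes
    return history

def voted_perceptron_predict(history, a):
    # Each stored weight vector casts its votes for the class it predicts.
    total = sum(v * np.sign(np.dot(w, a)) for w, v in history)
    return 1 if total >= 0 else -1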
Multilayer perceptrons
Using a kernel is not the only way to create a nonlinear classifier based on the
perceptron. In fact, kernel functions are a recent development in machine