6.1 Dual Representations
Many linear models for regression and classification can be reformulated in terms of
a dual representation in which the kernel function arises naturally. This concept will
play an important role when we consider support vector machines in the next chapter.
Here we consider a linear regression model whose parameters are determined by
minimizing a regularized sum-of-squares error function given by
$$J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr\}^{2} + \frac{\lambda}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \qquad (6.2)$$
where $\lambda \geq 0$. If we set the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$ equal to zero, we see that the solution for $\mathbf{w}$ takes the form of a linear combination of the vectors $\boldsymbol{\phi}(\mathbf{x}_n)$, with coefficients that are functions of $\mathbf{w}$, of the form
$$\mathbf{w} = -\frac{1}{\lambda}\sum_{n=1}^{N}\bigl\{\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr\}\boldsymbol{\phi}(\mathbf{x}_n) = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a} \qquad (6.3)$$
where $\boldsymbol{\Phi}$ is the design matrix, whose $n$th row is given by $\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$. Here the vector $\mathbf{a} = (a_1, \ldots, a_N)^{\mathrm{T}}$, and we have defined
$$a_n = -\frac{1}{\lambda}\bigl\{\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr\}. \qquad (6.4)$$
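As a quick numerical check of (6.3) and (6.4), the following sketch (hypothetical NumPy code, not part of the text; the data, dimensions, and value of $\lambda$ are arbitrary) computes the regularized least-squares solution for $\mathbf{w}$ directly, forms the coefficients $a_n$ according to (6.4), and confirms that $\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a}$ reproduces $\mathbf{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 20, 5                      # N data points, M basis functions (illustrative sizes)
Phi = rng.normal(size=(N, M))     # design matrix, nth row is phi(x_n)^T
t = rng.normal(size=N)            # target vector t = (t_1, ..., t_N)^T
lam = 0.1                         # regularization coefficient lambda > 0

# Regularized least-squares solution: w = (Phi^T Phi + lambda I_M)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual coefficients from (6.4): a_n = -(1/lambda) { w^T phi(x_n) - t_n }
a = -(Phi @ w - t) / lam

# Equation (6.3): w = Phi^T a
print(np.allclose(Phi.T @ a, w))  # True
```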
Instead of working with the parameter vector $\mathbf{w}$, we can now reformulate the least-squares algorithm in terms of the parameter vector $\mathbf{a}$, giving rise to a dual representation. If we substitute $\mathbf{w} = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a}$ into $J(\mathbf{w})$, we obtain
$$J(\mathbf{a}) = \frac{1}{2}\mathbf{a}^{\mathrm{T}}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a} - \mathbf{a}^{\mathrm{T}}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} + \frac{1}{2}\mathbf{t}^{\mathrm{T}}\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^{\mathrm{T}}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a} \qquad (6.5)$$
where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. We now define the Gram matrix $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}$, which is an $N \times N$ symmetric matrix with elements
$$K_{nm} = \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) \qquad (6.6)$$
where we have introduced the kernel function $k(\mathbf{x}, \mathbf{x}')$ defined by (6.1).
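To make (6.6) concrete, the following sketch (again hypothetical code, using an illustrative feature map $\boldsymbol{\phi}(x) = (1, x, x^2)^{\mathrm{T}}$ that is not taken from the text) builds the Gram matrix both as $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}$ and element by element from the induced kernel, and checks that the two constructions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(-1.0, 1.0, size=8)           # N = 8 scalar inputs (illustrative)

def phi(x):
    """Example feature map phi(x) = (1, x, x^2)^T, chosen only for illustration."""
    return np.array([1.0, x, x**2])

def k(x, xp):
    """Kernel induced by the feature map: k(x, x') = phi(x)^T phi(x')."""
    return phi(x) @ phi(xp)

Phi = np.stack([phi(xn) for xn in x])         # design matrix, N x M
K = Phi @ Phi.T                               # Gram matrix as in (6.6), N x N

K_elementwise = np.array([[k(xn, xm) for xm in x] for xn in x])
print(np.allclose(K, K_elementwise))          # True: K_nm = k(x_n, x_m)
```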
In terms of the Gram matrix, the sum-of-squares error function can be written as
$$J(\mathbf{a}) = \frac{1}{2}\mathbf{a}^{\mathrm{T}}\mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{a}^{\mathrm{T}}\mathbf{K}\mathbf{t} + \frac{1}{2}\mathbf{t}^{\mathrm{T}}\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^{\mathrm{T}}\mathbf{K}\mathbf{a}. \qquad (6.7)$$
Setting the gradient of $J(\mathbf{a})$ with respect to $\mathbf{a}$ to zero, we obtain the following solution
$$\mathbf{a} = \left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}. \qquad (6.8)$$
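Putting the pieces together, a small sketch (hypothetical NumPy code with arbitrary data and regularization value, not from the text) solves (6.8) for $\mathbf{a}$ and confirms that the recovered weight vector $\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{a}$ agrees with the regularized least-squares solution obtained directly in terms of $\mathbf{w}$, illustrating that the dual and primal formulations define the same predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 30, 4                                     # illustrative sizes
Phi = rng.normal(size=(N, M))                    # design matrix
t = rng.normal(size=N)                           # targets
lam = 0.5                                        # lambda > 0

# Dual solution (6.8): a = (K + lambda I_N)^{-1} t, with K = Phi Phi^T
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

# Primal regularized least-squares solution, for comparison
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# The two routes agree: w = Phi^T a, so predictions w^T phi(x) coincide
print(np.allclose(Phi.T @ a, w_primal))          # True
```

Note that solving (6.8) involves inverting an $N \times N$ matrix rather than the $M \times M$ matrix of the direct approach, which is the price paid for working entirely through the kernel function.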