216 Kernel Methods
first define a mapping ψ : R → R² as follows:

ψ(x) = (x, x²).

We use the term feature space to denote the range of ψ. After applying ψ the data can be easily explained using the halfspace h(x) = sign(⟨w, ψ(x)⟩ − b), where w = (0, 1) and b = 5.
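A minimal sketch of this example in Python (code is not part of the text; the numeric test points are illustrative). With w = (0, 1) and b = 5, the halfspace over the feature space labels x positively exactly when x² ≥ 5:

```python
def psi(x):
    # The feature map psi(x) = (x, x^2) from the text.
    return (x, x * x)

def h(z, w=(0.0, 1.0), b=5.0):
    # Halfspace over the feature space: h(z) = sign(<w, z> - b).
    s = w[0] * z[0] + w[1] * z[1] - b
    return 1 if s >= 0 else -1

# The positive region {x : x^2 >= 5} consists of two rays, so no single
# halfspace over R separates it; after applying psi it is linearly separable.
print(h(psi(3.0)))   # 9 - 5 >= 0  -> +1
print(h(psi(1.0)))   # 1 - 5 <  0  -> -1
print(h(psi(-3.0)))  # 9 - 5 >= 0  -> +1
```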
The basic paradigm is as follows:
- Given some domain set X and a learning task, choose a mapping ψ : X → F, for some feature space F, that will usually be R^n for some n (however, the range of such a mapping can be any Hilbert space, including spaces of infinite dimension, as we will show later).
- Given a sequence of labeled examples, S = (x_1, y_1), ..., (x_m, y_m), create the image sequence Ŝ = (ψ(x_1), y_1), ..., (ψ(x_m), y_m).
- Train a linear predictor h over Ŝ.
- Predict the label of a test point, x, to be h(ψ(x)).
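The steps above can be sketched in Python. The Perceptron used here is just one possible choice of linear learner, and the feature map and sample points are illustrative assumptions, not taken from the text:

```python
import numpy as np

def psi(x):
    # Step 1: choose a feature map; here psi(x) = (x, x^2) as an example.
    return np.array([x, x * x])

def perceptron(S_hat, epochs=100):
    # Step 3: train a linear predictor over the image sequence S_hat.
    # A bias term is folded in via an extra constant coordinate.
    d = len(S_hat[0][0])
    w = np.zeros(d + 1)
    for _ in range(epochs):
        updated = False
        for z, y in S_hat:
            z1 = np.append(z, 1.0)
            if y * np.dot(w, z1) <= 0:
                w += y * z1
                updated = True
        if not updated:  # all points correctly classified
            break
    return w

# Step 2: build the image sequence from a labeled sample S.
S = [(-3.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (3.0, 1)]
S_hat = [(psi(x), y) for x, y in S]
w = perceptron(S_hat)

# Step 4: predict the label of a test point x as h(psi(x)).
def predict(x):
    return 1 if np.dot(w, np.append(psi(x), 1.0)) > 0 else -1
```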
Note that, for every probability distribution D over X × Y, we can readily define its image probability distribution D_ψ over F × Y by setting, for every subset A ⊆ F × Y, D_ψ(A) = D(ψ⁻¹(A)).^1 It follows that for every predictor h over the feature space, L_{D_ψ}(h) = L_D(h ◦ ψ), where h ◦ ψ is the composition of h onto ψ.
The success of this learning paradigm depends on choosing a good ψ for a given learning task: that is, a ψ that will make the image of the data distribution (close to being) linearly separable in the feature space, thus making the resulting algorithm a good learner for the task. Picking such an embedding requires prior knowledge about the task. In practice, however, one often uses generic mappings that enrich the class of halfspaces and extend its expressiveness. One notable example is polynomial mappings, which generalize the ψ we have seen in the previous example.
Recall that the prediction of a standard halfspace classifier on an instance x is based on the linear mapping x ↦ ⟨w, x⟩. We can generalize linear mappings to a polynomial mapping, x ↦ p(x), where p is a multivariate polynomial of degree k. For simplicity, consider first the case in which x is 1-dimensional. In that case, p(x) = ∑_{j=0}^{k} w_j x^j, where w ∈ R^{k+1} is the vector of coefficients of the polynomial we need to learn. We can rewrite p(x) = ⟨w, ψ(x)⟩, where ψ : R → R^{k+1} is the mapping x ↦ (1, x, x², x³, ..., x^k). It follows that learning a degree-k polynomial over R can be done by learning a linear mapping in the (k + 1)-dimensional feature space.
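As a concrete sketch (Python, not from the text), the reduction can be seen by fitting the coefficient vector w with least squares, one possible linear learner, on samples of a known polynomial; the target p(x) = 2 − x + 3x² is an assumed example:

```python
import numpy as np

def psi(x, k):
    # psi : R -> R^{k+1}, x -> (1, x, x^2, ..., x^k)
    return np.array([x ** j for j in range(k + 1)])

# Learning a degree-k polynomial p(x) = <w, psi(x)> is a linear problem
# in R^{k+1}.  Fit w to noiseless samples of p(x) = 2 - x + 3x^2,
# whose coefficient vector is w = (2, -1, 3).
k = 2
xs = np.linspace(-2, 2, 20)
ys = 2 - xs + 3 * xs ** 2
Psi = np.stack([psi(x, k) for x in xs])   # m x (k+1) design matrix
w, *_ = np.linalg.lstsq(Psi, ys, rcond=None)
print(np.round(w, 6))  # recovers (2, -1, 3) up to numerical error
```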
More generally, a degree-k multivariate polynomial from R^n to R can be written as

p(x) = ∑_{J ∈ [n]^r : r ≤ k} w_J ∏_{i=1}^{r} x_{J_i}.    (16.1)
^1 This is defined for every A such that ψ⁻¹(A) is measurable with respect to D.
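The monomials appearing in equation (16.1) can be enumerated directly; the sketch below (Python, an illustration rather than part of the text) ranges over ordered index sequences J ∈ [n]^r for each r ≤ k, so the feature space has dimension n⁰ + n¹ + ... + n^k:

```python
import itertools

def poly_features(x, k):
    # Enumerate all monomials prod_{i=1}^{r} x_{J_i} for J in [n]^r, r <= k,
    # matching the sum in equation (16.1).  r = 0 gives the empty product, 1.
    n = len(x)
    feats = []
    for r in range(k + 1):
        for J in itertools.product(range(n), repeat=r):
            m = 1.0
            for i in J:
                m *= x[i]
            feats.append(m)
    return feats

# For x = (x1, x2) and k = 2 the features are
# (1, x1, x2, x1*x1, x1*x2, x2*x1, x2*x2); repeated monomials such as
# x1*x2 and x2*x1 appear because (16.1) ranges over ordered sequences J.
print(poly_features([2.0, 3.0], 2))
```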