Understanding Machine Learning: From Theory to Algorithms


Kernel Methods


first define a mapping $\psi : \mathbb{R} \to \mathbb{R}^2$ as follows:
$$\psi(x) = (x, x^2).$$
We use the term feature space to denote the range of $\psi$. After applying $\psi$, the data can be easily explained using the halfspace $h(x) = \operatorname{sign}(\langle \mathbf{w}, \psi(x)\rangle - b)$, where $\mathbf{w} = (0,1)$ and $b = 5$.
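As a minimal sketch (the specific sample points below are illustrative assumptions, not taken from the text), note that this fixed halfspace acts on the mapped data via $\langle (0,1), (x, x^2)\rangle - 5 = x^2 - 5$, so it labels a point positive exactly when $x^2 \ge 5$:

```python
# Sketch: the halfspace h(x) = sign(<w, psi(x)> - b) with w = (0, 1), b = 5,
# applied to mapped points psi(x) = (x, x^2). Sample points are illustrative.

def psi(x):
    return (x, x * x)

def h(x, w=(0.0, 1.0), b=5.0):
    z = psi(x)
    score = w[0] * z[0] + w[1] * z[1] - b   # equals x^2 - 5 here
    return 1 if score >= 0 else -1

for x in [-4.0, -1.0, 0.5, 3.0]:
    print(x, h(x))   # -4.0 and 3.0 get label +1; -1.0 and 0.5 get label -1
```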
The basic paradigm is as follows:


  1. Given some domain set $\mathcal{X}$ and a learning task, choose a mapping $\psi : \mathcal{X} \to \mathcal{F}$, for some feature space $\mathcal{F}$, that will usually be $\mathbb{R}^n$ for some $n$ (however, the range of such a mapping can be any Hilbert space, including spaces of infinite dimension, as we will show later).

  2. Given a sequence of labeled examples, $S = (x_1, y_1), \ldots, (x_m, y_m)$, create the image sequence $\hat{S} = (\psi(x_1), y_1), \ldots, (\psi(x_m), y_m)$.

  3. Train a linear predictor $h$ over $\hat{S}$.

  4. Predict the label of a test point, $x$, to be $h(\psi(x))$.
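As an end-to-end illustration of steps 1–4, here is a minimal sketch in Python. The feature map, the Perceptron-style training loop, and the toy labels are all illustrative assumptions; the text only requires some linear learner over the mapped sample.

```python
# Steps 1-4 of the feature-mapping paradigm on a toy 1-D problem.
import numpy as np

def psi(x):
    """Step 1: the feature mapping psi : R -> R^2, x -> (x, x^2)."""
    return np.array([x, x * x])

def train_linear(Z, y, epochs=100):
    """Step 3: train a linear predictor (with bias) over the mapped sample,
    here with a simple Perceptron update rule (an illustrative choice)."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(epochs):
        for z, label in zip(Z, y):
            if label * (np.dot(w, z) + b) <= 0:   # mistake -> update
                w, b = w + label * z, b + label
    return w, b

# Toy sample: not linearly separable in R, but separable after applying psi.
xs = np.array([-4.0, -3.0, -1.0, -0.5, 0.0, 0.7, 1.5, 2.8, 3.5])
ys = np.where(np.abs(xs) > np.sqrt(5.0), 1, -1)

# Step 2: the image sequence S_hat = ((psi(x_1), y_1), ..., (psi(x_m), y_m)).
Z = np.array([psi(x) for x in xs])
w, b = train_linear(Z, ys)

# Step 4: predict the label of a test point x as h(psi(x)).
def predict(x):
    return 1 if np.dot(w, psi(x)) + b >= 0 else -1

print([predict(x) for x in (-3.0, 0.1, 2.6)])
```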
Note that, for every probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we can readily define its image probability distribution $\mathcal{D}^{\psi}$ over $\mathcal{F} \times \mathcal{Y}$ by setting, for every subset $A \subseteq \mathcal{F} \times \mathcal{Y}$, $\mathcal{D}^{\psi}(A) = \mathcal{D}(\psi^{-1}(A))$.$^{1}$ It follows that for every predictor $h$ over the feature space, $L_{\mathcal{D}^{\psi}}(h) = L_{\mathcal{D}}(h \circ \psi)$, where $h \circ \psi$ is the composition of $h$ with $\psi$.
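To spell out why this identity holds (a short derivation; using the 0–1 loss here is an assumption for concreteness, since the chapter has not fixed a particular loss at this point):
$$
L_{\mathcal{D}^{\psi}}(h) \;=\; \Pr_{(z,y)\sim \mathcal{D}^{\psi}}\big[h(z)\neq y\big] \;=\; \Pr_{(x,y)\sim \mathcal{D}}\big[h(\psi(x))\neq y\big] \;=\; L_{\mathcal{D}}(h\circ\psi),
$$
where the middle equality holds because the event $\{(z,y) : h(z)\neq y\}$ receives, by the definition of $\mathcal{D}^{\psi}$, the same probability as its preimage under $(x,y)\mapsto(\psi(x),y)$.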
The success of this learning paradigm depends on choosing a good $\psi$ for a given learning task: that is, a $\psi$ that will make the image of the data distribution (close to being) linearly separable in the feature space, thus making the resulting algorithm a good learner for a given task. Picking such an embedding requires prior knowledge about that task. However, often some generic mappings that enable us to enrich the class of halfspaces and extend its expressiveness are used. One notable example is polynomial mappings, which are a generalization of the $\psi$ we have seen in the previous example.
Recall that the prediction of a standard halfspace classifier on an instance $\mathbf{x}$ is based on the linear mapping $\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x}\rangle$. We can generalize linear mappings to a polynomial mapping, $\mathbf{x} \mapsto p(\mathbf{x})$, where $p$ is a multivariate polynomial of degree $k$. For simplicity, consider first the case in which $x$ is 1 dimensional. In that case, $p(x) = \sum_{j=0}^{k} w_j x^j$, where $\mathbf{w} \in \mathbb{R}^{k+1}$ is the vector of coefficients of the polynomial we need to learn. We can rewrite $p(x) = \langle \mathbf{w}, \psi(x)\rangle$, where $\psi : \mathbb{R} \to \mathbb{R}^{k+1}$ is the mapping $x \mapsto (1, x, x^2, x^3, \ldots, x^k)$. It follows that learning a degree-$k$ polynomial over $\mathbb{R}$ can be done by learning a linear mapping in the $(k+1)$ dimensional feature space.
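The following is a minimal sketch of this reduction (the least-squares fit and the synthetic data are illustrative assumptions; the text only states that some linear learner is applied in the feature space): map each scalar $x$ to $(1, x, x^2, \ldots, x^k)$ and learn the coefficient vector $\mathbf{w}$ linearly.

```python
# Learning a degree-k polynomial over R as a linear predictor in the
# (k+1)-dimensional feature space psi(x) = (1, x, x^2, ..., x^k).
# The least-squares solver and the synthetic data are illustrative choices.
import numpy as np

def psi_poly(x, k):
    """Map a scalar x to (1, x, x^2, ..., x^k)."""
    return np.array([x ** j for j in range(k + 1)])

k = 3
xs = np.linspace(-2.0, 2.0, 50)
ys = 1.0 - 2.0 * xs + 0.5 * xs ** 3          # a "hidden" degree-3 polynomial

# Linear learning in the feature space: solve for w by least squares.
Phi = np.array([psi_poly(x, k) for x in xs])
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)

# p(x) = <w, psi(x)> recovers the coefficients (approximately [1, -2, 0, 0.5]).
print(np.round(w, 3))
```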
More generally, a degree $k$ multivariate polynomial from $\mathbb{R}^n$ to $\mathbb{R}$ can be written as
$$
p(\mathbf{x}) \;=\; \sum_{J \in [n]^r : r \le k} w_J \prod_{i=1}^{r} x_{J_i}. \tag{16.1}
$$
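A small sketch of Equation (16.1) (the helper names and the random coefficients are illustrative assumptions): enumerate the index sequences $J \in [n]^r$ for $r \le k$, with the empty sequence ($r = 0$) contributing the constant term, and sum the corresponding weighted monomials.

```python
# Evaluating p(x) = sum_{J in [n]^r, r <= k} w_J * prod_{i=1}^r x_{J_i}
# as in Eq. (16.1). Indices are 0-based here, and the coefficients are random
# placeholders chosen only for illustration.
from itertools import product
import numpy as np

def index_sequences(n, k):
    """All J in [n]^r for r = 0, 1, ..., k (r = 0 is the empty sequence)."""
    for r in range(k + 1):
        yield from product(range(n), repeat=r)

def poly_eval(w, x, n, k):
    # The empty product (for J = ()) evaluates to 1, giving the constant term.
    return sum(w[J] * np.prod([x[i] for i in J]) for J in index_sequences(n, k))

n, k = 2, 2
w = {J: np.random.randn() for J in index_sequences(n, k)}   # one w_J per J
x = np.array([0.5, -1.0])
print(poly_eval(w, x, n, k))
```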

$^{1}$ This is defined for every $A$ such that $\psi^{-1}(A)$ is measurable with respect to $\mathcal{D}$.
