216 Kernel Methods
first define a mapping ψ : R → R² as follows:

ψ(x) = (x, x²).

We use the term feature space to denote the range of ψ. After applying ψ the data can be easily explained using the halfspace h(x) = sign(⟨w, ψ(x)⟩ − b), where w = (0, 1) and b = 5.
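A minimal sketch of this example in Python (code is not part of the text; the numeric test points are illustrative). With w = (0, 1) and b = 5, the halfspace over the feature space labels x positively exactly when x² ≥ 5:

```python
def psi(x):
    # The feature map psi(x) = (x, x^2) from the text.
    return (x, x * x)

def h(z, w=(0.0, 1.0), b=5.0):
    # Halfspace over the feature space: h(z) = sign(<w, z> - b).
    s = w[0] * z[0] + w[1] * z[1] - b
    return 1 if s >= 0 else -1

# The positive region {x : x^2 >= 5} consists of two rays, so no single
# halfspace over R separates it; after applying psi it is linearly separable.
print(h(psi(3.0)))   # 9 - 5 >= 0  -> +1
print(h(psi(1.0)))   # 1 - 5 <  0  -> -1
print(h(psi(-3.0)))  # 9 - 5 >= 0  -> +1
```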
The basic paradigm is as follows:
- Given some domain set X and a learning task, choose a mapping ψ : X → F, for some feature space F, that will usually be R^n for some n (however, the range of such a mapping can be any Hilbert space, including spaces of infinite dimension, as we will show later).
- Given a sequence of labeled examples, S = (x_1, y_1), ..., (x_m, y_m), create the image sequence Ŝ = (ψ(x_1), y_1), ..., (ψ(x_m), y_m).
- Train a linear predictor h over Ŝ.
- Predict the label of a test point, x, to be h(ψ(x)).
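The steps above can be sketched in Python. The Perceptron used here is just one possible choice of linear learner, and the feature map and sample points are illustrative assumptions, not taken from the text:

```python
import numpy as np

def psi(x):
    # Step 1: choose a feature map; here psi(x) = (x, x^2) as an example.
    return np.array([x, x * x])

def perceptron(S_hat, epochs=100):
    # Step 3: train a linear predictor over the image sequence S_hat.
    # A bias term is folded in via an extra constant coordinate.
    d = len(S_hat[0][0])
    w = np.zeros(d + 1)
    for _ in range(epochs):
        updated = False
        for z, y in S_hat:
            z1 = np.append(z, 1.0)
            if y * np.dot(w, z1) <= 0:
                w += y * z1
                updated = True
        if not updated:  # all points correctly classified
            break
    return w

# Step 2: build the image sequence from a labeled sample S.
S = [(-3.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (3.0, 1)]
S_hat = [(psi(x), y) for x, y in S]
w = perceptron(S_hat)

# Step 4: predict the label of a test point x as h(psi(x)).
def predict(x):
    return 1 if np.dot(w, np.append(psi(x), 1.0)) > 0 else -1
```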
Note that, for every probability distribution D over X × Y, we can readily define its image probability distribution D_ψ over F × Y by setting, for every subset A ⊆ F × Y, D_ψ(A) = D(ψ⁻¹(A)).^1 It follows that for every predictor h over the feature space, L_{D_ψ}(h) = L_D(h ◦ ψ), where h ◦ ψ is the composition of h onto ψ.
The success of this learning paradigm depends on choosing a good ψ for a given learning task: that is, a ψ that will make the image of the data distribution (close to being) linearly separable in the feature space, thus making the resulting algorithm a good learner for the task. Picking such an embedding requires prior knowledge about the task. In practice, however, one often uses generic mappings that enrich the class of halfspaces and extend its expressiveness. One notable example is polynomial mappings, which generalize the ψ we have seen in the previous example.
Recall that the prediction of a standard halfspace classifier on an instance x is based on the linear mapping x ↦ ⟨w, x⟩. We can generalize linear mappings to a polynomial mapping, x ↦ p(x), where p is a multivariate polynomial of degree k. For simplicity, consider first the case in which x is 1-dimensional. In that case, p(x) = ∑_{j=0}^{k} w_j x^j, where w ∈ R^{k+1} is the vector of coefficients of the polynomial we need to learn. We can rewrite p(x) = ⟨w, ψ(x)⟩, where ψ : R → R^{k+1} is the mapping x ↦ (1, x, x², x³, ..., x^k). It follows that learning a degree-k polynomial over R can be done by learning a linear mapping in the (k + 1)-dimensional feature space.
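As a concrete sketch (Python, not from the text), the reduction can be seen by fitting the coefficient vector w with least squares, one possible linear learner, on samples of a known polynomial; the target p(x) = 2 − x + 3x² is an assumed example:

```python
import numpy as np

def psi(x, k):
    # psi : R -> R^{k+1}, x -> (1, x, x^2, ..., x^k)
    return np.array([x ** j for j in range(k + 1)])

# Learning a degree-k polynomial p(x) = <w, psi(x)> is a linear problem
# in R^{k+1}.  Fit w to noiseless samples of p(x) = 2 - x + 3x^2,
# whose coefficient vector is w = (2, -1, 3).
k = 2
xs = np.linspace(-2, 2, 20)
ys = 2 - xs + 3 * xs ** 2
Psi = np.stack([psi(x, k) for x in xs])   # m x (k+1) design matrix
w, *_ = np.linalg.lstsq(Psi, ys, rcond=None)
print(np.round(w, 6))  # recovers (2, -1, 3) up to numerical error
```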
More generally, a degree-k multivariate polynomial from R^n to R can be written as

p(x) = ∑_{J ∈ [n]^r : r ≤ k} w_J ∏_{i=1}^{r} x_{J_i}.    (16.1)
^1 This is defined for every A such that ψ⁻¹(A) is measurable with respect to D.
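The monomials appearing in equation (16.1) can be enumerated directly; the sketch below (Python, an illustration rather than part of the text) ranges over ordered index sequences J ∈ [n]^r for each r ≤ k, so the feature space has dimension n⁰ + n¹ + ... + n^k:

```python
import itertools

def poly_features(x, k):
    # Enumerate all monomials prod_{i=1}^{r} x_{J_i} for J in [n]^r, r <= k,
    # matching the sum in equation (16.1).  r = 0 gives the empty product, 1.
    n = len(x)
    feats = []
    for r in range(k + 1):
        for J in itertools.product(range(n), repeat=r):
            m = 1.0
            for i in J:
                m *= x[i]
            feats.append(m)
    return feats

# For x = (x1, x2) and k = 2 the features are
# (1, x1, x2, x1*x1, x1*x2, x2*x1, x2*x2); repeated monomials such as
# x1*x2 and x2*x1 appear because (16.1) ranges over ordered sequences J.
print(poly_features([2.0, 3.0], 2))
```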