Understanding Machine Learning: From Theory to Algorithms

16.3 Implementing Soft-SVM with Kernels

We now describe a procedure that directly tackles the Soft-SVM optimization problem in the feature space,


$$
\min_{w}\; \left( \frac{\lambda}{2}\,\|w\|^{2} \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\bigl\{0,\; 1 - y_i \langle w, \psi(x_i)\rangle \bigr\} \right), \qquad (16.5)
$$

while only using kernel evaluations. The basic observation is that the vector $w^{(t)}$ maintained by the SGD procedure we have described in Section 15.5 is always in the linear span of $\{\psi(x_1),\ldots,\psi(x_m)\}$. Therefore, rather than maintaining $w^{(t)}$ we can maintain the corresponding coefficients $\alpha$.

Formally, let $K$ be the kernel function, namely, for all $x, x'$, $K(x,x') = \langle \psi(x), \psi(x') \rangle$. We shall maintain two vectors in $\mathbb{R}^m$, corresponding to two vectors $\theta^{(t)}$ and $w^{(t)}$ defined in the SGD procedure of Section 15.5. That is, $\beta^{(t)}$ will be a vector such that


$$
\theta^{(t)} = \sum_{j=1}^{m} \beta_j^{(t)} \psi(x_j) \qquad (16.6)
$$

and $\alpha^{(t)}$ be such that


$$
w^{(t)} = \sum_{j=1}^{m} \alpha_j^{(t)} \psi(x_j). \qquad (16.7)
$$
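Note that once $w^{(t)}$ is written in this form, the inner products required by the SGD updates can be computed from the kernel alone: for every example $x_i$, linearity of the inner product gives

$$
\langle w^{(t)}, \psi(x_i) \rangle \;=\; \Bigl\langle \sum_{j=1}^{m} \alpha_j^{(t)} \psi(x_j),\; \psi(x_i) \Bigr\rangle \;=\; \sum_{j=1}^{m} \alpha_j^{(t)} K(x_j, x_i),
$$

which is exactly the quantity that appears in the update rule below.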

The vectors $\beta$ and $\alpha$ are updated according to the following procedure.


SGD for Solving Soft-SVM with Kernels

Goal: Solve Equation (16.5)
parameter: $T$
Initialize: $\beta^{(1)} = \mathbf{0}$
for $t = 1, \ldots, T$
  Let $\alpha^{(t)} = \frac{1}{\lambda t}\, \beta^{(t)}$
  Choose $i$ uniformly at random from $[m]$
  For all $j \neq i$ set $\beta_j^{(t+1)} = \beta_j^{(t)}$
  If ($y_i \sum_{j=1}^{m} \alpha_j^{(t)} K(x_j, x_i) < 1$)
    Set $\beta_i^{(t+1)} = \beta_i^{(t)} + y_i$
  Else
    Set $\beta_i^{(t+1)} = \beta_i^{(t)}$
Output: $\bar{w} = \sum_{j=1}^{m} \bar{\alpha}_j \psi(x_j)$ where $\bar{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^{(t)}$
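For concreteness, the following is a minimal sketch of this procedure in Python/NumPy. The function name sgd_soft_svm_kernel, the convention of passing a precomputed Gram matrix K, and the seed handling are our own choices and are not specified by the pseudocode above.

```python
import numpy as np

def sgd_soft_svm_kernel(K, y, lam, T, seed=0):
    """Sketch of SGD for Soft-SVM with kernels (Equation (16.5)).

    K   : (m, m) Gram matrix with K[i, j] = kernel(x_i, x_j)  (assumed precomputed)
    y   : (m,) array of labels in {-1, +1}
    lam : regularization parameter lambda
    T   : number of iterations
    Returns alpha_bar, so that w_bar = sum_j alpha_bar[j] * psi(x_j).
    """
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    beta = np.zeros(m)
    alpha_sum = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)          # alpha^(t) = beta^(t) / (lambda * t)
        alpha_sum += alpha
        i = rng.integers(m)               # choose i uniformly at random from [m]
        # y_i <w^(t), psi(x_i)> = y_i * sum_j alpha_j^(t) K(x_j, x_i)
        if y[i] * (alpha @ K[:, i]) < 1:
            beta[i] += y[i]               # only coordinate i of beta is updated
    return alpha_sum / T                  # alpha_bar = (1/T) sum_t alpha^(t)
```

To classify a new point $x$, one would output $\mathrm{sign}\bigl(\sum_{j=1}^{m} \bar{\alpha}_j K(x_j, x)\bigr)$, since $\langle \bar{w}, \psi(x)\rangle = \sum_{j} \bar{\alpha}_j K(x_j, x)$; again, only kernel evaluations are needed.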

The following lemma shows that the preceding implementation is equivalent
to running the SGD procedure described in Section 15.5 on the feature space.


lemma 16.3  Let $\hat{w}$ be the output of the SGD procedure described in Section 15.5, when applied on the feature space, and let $\bar{w} = \sum_{j=1}^{m} \bar{\alpha}_j \psi(x_j)$ be the output of applying SGD with kernels. Then $\bar{w} = \hat{w}$.


Proof  We will show that for every $t$ Equation (16.6) holds, where $\theta^{(t)}$ is the result of running the SGD procedure described in Section 15.5 in the feature space.
