Understanding Machine Learning: From Theory to Algorithms



problem with respect to $\mathbf{w}$ is unconstrained and the objective is differentiable;
thus, at the optimum, the gradient equals zero:

$$
\mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0
\quad\Rightarrow\quad
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i .
$$

This shows us that the solution must be in the linear span of the examples, a
fact we will use later to derive SVM with kernels. Plugging the preceding into
Equation (15.9) we obtain that the dual problem can be rewritten as

$$
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\frac{1}{2}\left\| \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i \right\|^2
+ \sum_{i=1}^{m} \alpha_i
\left( 1 - y_i \left\langle \sum_{j} \alpha_j y_j \mathbf{x}_j ,\, \mathbf{x}_i \right\rangle \right)
\right). \qquad (15.10)
$$

Rearranging yields the dual problem

$$
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_j , \mathbf{x}_i \rangle
\right). \qquad (15.11)
$$

Note that the dual problem only involves inner products between instances and
does not require direct access to specific elements within an instance. This prop-
erty is important when implementing SVM with kernels, as we will discuss in
the next chapter.
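As an illustration of this point (not taken from the book), here is a minimal NumPy sketch of evaluating the dual objective of Equation (15.11); the function name `dual_objective` and the representation of the instances as rows of an array `X` with labels `y` are assumptions made here. Only the Gram matrix of pairwise inner products is ever touched, never individual coordinates of an instance, which is exactly what makes the kernel trick possible.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Value of the dual objective (15.11) for a given alpha >= 0.

    Only the Gram matrix of pairwise inner products is used, so the same
    code would work if X @ X.T were replaced by any kernel matrix.
    """
    G = X @ X.T                      # G[i, j] = <x_i, x_j>
    ya = alpha * y                   # elementwise alpha_i * y_i
    return alpha.sum() - 0.5 * ya @ G @ ya

# Tiny usage example with random data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
alpha = np.abs(rng.normal(size=5))   # any nonnegative vector
print(dual_objective(alpha, X, y))
```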

15.5 Implementing Soft-SVM Using SGD


In this section we describe a very simple algorithm for solving the optimization
problem of Soft-SVM, namely,

$$
\min_{\mathbf{w}}
\left(
\frac{\lambda}{2} \|\mathbf{w}\|^2
+ \frac{1}{m} \sum_{i=1}^{m} \max\{0,\, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}
\right). \qquad (15.12)
$$
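As a small illustration (again not from the book), the objective of Equation (15.12) could be evaluated as follows, assuming the same `X`, `y` representation as before; the function name `soft_svm_objective` and the parameter name `lam` are choices made here. Such a function is handy for monitoring the progress of the solver described next.

```python
import numpy as np

def soft_svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective of Equation (15.12)."""
    margins = y * (X @ w)                   # y_i * <w, x_i> for all i
    hinge = np.maximum(0.0, 1.0 - margins)  # max{0, 1 - y_i <w, x_i>}
    return 0.5 * lam * (w @ w) + hinge.mean()
```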

We rely on the SGD framework for solving regularized loss minimization prob-
lems, as described in Section 14.5.3.
Recall that, on the basis of Equation (14.15), we can rewrite the update rule
of SGD as

$$
\mathbf{w}^{(t+1)} = -\frac{1}{\lambda t} \sum_{j=1}^{t} \mathbf{v}_j ,
$$

where $\mathbf{v}_j$ is a subgradient of the loss function at $\mathbf{w}^{(j)}$ on the random example
chosen at iteration $j$. For the hinge loss, given an example $(\mathbf{x}, y)$, we can choose $\mathbf{v}_j$
to be $0$ if $y\langle \mathbf{w}^{(j)}, \mathbf{x}\rangle \ge 1$ and $\mathbf{v}_j = -y\mathbf{x}$ otherwise (see Example 14.2). Denoting
$\boldsymbol{\theta}^{(t)} = -\sum_{j<t} \mathbf{v}_j$ we obtain the following procedure.
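The book states this procedure as pseudocode; the following is a minimal Python/NumPy sketch of the update rule described above. The function name `sgd_soft_svm`, the iteration budget `T`, the random seed, and the choice to output the average of the iterates are assumptions made here rather than anything prescribed by the text.

```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, seed=0):
    """SGD sketch for the Soft-SVM objective (15.12).

    Follows the update w^(t+1) = -(1/(lam*t)) * sum_{j<=t} v_j, where v_j is a
    hinge-loss subgradient on a randomly chosen example. X is an (m, d) array
    of instances (rows), y an (m,) array of +/-1 labels.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)           # theta = -(sum of subgradients v_j so far)
    w = np.zeros(d)               # current iterate w^(t)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)       # pick an example uniformly at random
        if y[i] * (X[i] @ w) < 1:
            theta += y[i] * X[i]  # v_t = -y_i x_i, so theta grows by y_i x_i
        # else: v_t = 0 and theta is unchanged
        w = theta / (lam * t)     # w^(t+1) = theta^(t+1) / (lam * t)
        w_sum += w
    return w_sum / T              # average of the iterates (a sketch choice)

# Toy usage on random, roughly linearly separable data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
w_hat = sgd_soft_svm(X, y, lam=0.1, T=5000)
print(w_hat)
```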