Understanding Machine Learning: From Theory to Algorithms



problem with respect to $\mathbf{w}$ is unconstrained and the objective is differentiable;
thus, at the optimum, the gradient equals zero:

$$
\mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0
\quad\Rightarrow\quad
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i .
$$

This shows us that the solution must be in the linear span of the examples, a
fact we will use later to derive SVM with kernels. Plugging the preceding into
Equation (15.9) we obtain that the dual problem can be rewritten as

$$
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\frac{1}{2}\left\| \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i \right\|^2
+ \sum_{i=1}^{m} \alpha_i
\left( 1 - y_i \left\langle \sum_{j} \alpha_j y_j \mathbf{x}_j ,\, \mathbf{x}_i \right\rangle \right)
\right). \qquad (15.10)
$$

Rearranging yields the dual problem

$$
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_j , \mathbf{x}_i \rangle
\right). \qquad (15.11)
$$

Note that the dual problem only involves inner products between instances and
does not require direct access to specific elements within an instance. This prop-
erty is important when implementing SVM with kernels, as we will discuss in
the next chapter.
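As an illustration of this point (not taken from the book), here is a minimal NumPy sketch of evaluating the dual objective of Equation (15.11); the function name `dual_objective` and the representation of the instances as rows of an array `X` with labels `y` are assumptions made here. Only the Gram matrix of pairwise inner products is ever touched, never individual coordinates of an instance, which is exactly what makes the kernel trick possible.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Value of the dual objective (15.11) for a given alpha >= 0.

    Only the Gram matrix of pairwise inner products is used, so the same
    code would work if X @ X.T were replaced by any kernel matrix.
    """
    G = X @ X.T                      # G[i, j] = <x_i, x_j>
    ya = alpha * y                   # elementwise alpha_i * y_i
    return alpha.sum() - 0.5 * ya @ G @ ya

# Tiny usage example with random data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
alpha = np.abs(rng.normal(size=5))   # any nonnegative vector
print(dual_objective(alpha, X, y))
```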

15.5 Implementing Soft-SVM Using SGD


In this section we describe a very simple algorithm for solving the optimization
problem of Soft-SVM, namely,

$$
\min_{\mathbf{w}}
\left(
\frac{\lambda}{2} \|\mathbf{w}\|^2
+ \frac{1}{m} \sum_{i=1}^{m} \max\{0,\, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}
\right). \qquad (15.12)
$$
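As a small illustration (again not from the book), the objective of Equation (15.12) could be evaluated as follows, assuming the same `X`, `y` representation as before; the function name `soft_svm_objective` and the parameter name `lam` are choices made here. Such a function is handy for monitoring the progress of the solver described next.

```python
import numpy as np

def soft_svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective of Equation (15.12)."""
    margins = y * (X @ w)                   # y_i * <w, x_i> for all i
    hinge = np.maximum(0.0, 1.0 - margins)  # max{0, 1 - y_i <w, x_i>}
    return 0.5 * lam * (w @ w) + hinge.mean()
```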

We rely on the SGD framework for solving regularized loss minimization prob-
lems, as described in Section 14.5.3.
Recall that, on the basis of Equation (14.15), we can rewrite the update rule
of SGD as

$$
\mathbf{w}^{(t+1)} = -\frac{1}{\lambda t} \sum_{j=1}^{t} \mathbf{v}_j ,
$$

where $\mathbf{v}_j$ is a subgradient of the loss function at $\mathbf{w}^{(j)}$ on the random example
chosen at iteration $j$. For the hinge loss, given an example $(\mathbf{x}, y)$, we can choose $\mathbf{v}_j$
to be $0$ if $y\langle \mathbf{w}^{(j)}, \mathbf{x}\rangle \ge 1$ and $\mathbf{v}_j = -y\mathbf{x}$ otherwise (see Example 14.2). Denoting
$\boldsymbol{\theta}^{(t)} = -\sum_{j<t} \mathbf{v}_j$ we obtain the following procedure.
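The book states this procedure as pseudocode; the following is a minimal Python/NumPy sketch of the update rule described above. The function name `sgd_soft_svm`, the iteration budget `T`, the random seed, and the choice to output the average of the iterates are assumptions made here rather than anything prescribed by the text.

```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, seed=0):
    """SGD sketch for the Soft-SVM objective (15.12).

    Follows the update w^(t+1) = -(1/(lam*t)) * sum_{j<=t} v_j, where v_j is a
    hinge-loss subgradient on a randomly chosen example. X is an (m, d) array
    of instances (rows), y an (m,) array of +/-1 labels.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)           # theta = -(sum of subgradients v_j so far)
    w = np.zeros(d)               # current iterate w^(t)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)       # pick an example uniformly at random
        if y[i] * (X[i] @ w) < 1:
            theta += y[i] * X[i]  # v_t = -y_i x_i, so theta grows by y_i x_i
        # else: v_t = 0 and theta is unchanged
        w = theta / (lam * t)     # w^(t+1) = theta^(t+1) / (lam * t)
        w_sum += w
    return w_sum / T              # average of the iterates (a sketch choice)

# Toy usage on random, roughly linearly separable data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
w_hat = sgd_soft_svm(X, y, lam=0.1, T=5000)
print(w_hat)
```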