problem with respect to $\mathbf{w}$ is unconstrained and the objective is differentiable;
thus, at the optimum, the gradient equals zero:
\[
\mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0
\;\;\Rightarrow\;\;
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i .
\]
This shows us that the solution must be in the linear span of the examples, a
fact we will use later to derive SVM with kernels. Plugging the preceding into
Equation (15.9) we obtain that the dual problem can be rewritten as
\[
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\frac{1}{2}\left\| \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i \right\|^2
+ \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left\langle \sum_{j} \alpha_j y_j \mathbf{x}_j ,\, \mathbf{x}_i \right\rangle \right)
\right). \tag{15.10}
\]
Rearranging yields the dual problem
\[
\max_{\boldsymbol{\alpha}\in\mathbb{R}^m:\,\boldsymbol{\alpha}\ge 0}
\left(
\sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_j , \mathbf{x}_i \rangle
\right). \tag{15.11}
\]
Note that the dual problem only involves inner products between instances and
does not require direct access to specific elements within an instance. This
property is important when implementing SVM with kernels, as we will discuss in
the next chapter.
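To make this concrete, the following minimal NumPy sketch (the function name `dual_objective` and the toy data are illustrative choices made here, not part of the text) evaluates the objective of Equation (15.11) given only the Gram matrix of inner products $G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j\rangle$:

```python
import numpy as np

def dual_objective(alpha, y, G):
    """Evaluate the dual objective of Equation (15.11).

    alpha : (m,) nonnegative dual variables
    y     : (m,) labels in {-1, +1}
    G     : (m, m) Gram matrix, G[i, j] = <x_i, x_j>

    Only inner products between instances are used; the coordinates
    of the x_i themselves are never accessed.
    """
    ay = alpha * y                      # elementwise alpha_i * y_i
    return alpha.sum() - 0.5 * ay @ G @ ay

# The Gram matrix is all the data the dual objective ever touches.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
G = X @ X.T                             # G[i, j] = <x_i, x_j>
print(dual_objective(np.array([0.1, 0.1, 0.2]), y, G))
```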
15.5 Implementing Soft-SVM Using SGD
In this section we describe a very simple algorithm for solving the optimization
problem of Soft-SVM, namely,
\[
\min_{\mathbf{w}}
\left(
\frac{\lambda}{2}\|\mathbf{w}\|^2
+ \frac{1}{m} \sum_{i=1}^{m} \max\{0,\, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}
\right). \tag{15.12}
\]
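As a small illustration (a sketch added here, not part of the original text; the function name and the use of NumPy are assumptions of this sketch), the objective of Equation (15.12) can be evaluated as follows:

```python
import numpy as np

def soft_svm_objective(w, X, y, lam):
    """Soft-SVM objective of Equation (15.12):
    (lam/2) * ||w||^2 + (1/m) * sum_i max{0, 1 - y_i <w, x_i>}."""
    margins = y * (X @ w)                    # y_i <w, x_i> for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge losses
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```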
We rely on the SGD framework for solving regularized loss minimization
problems, as described in Section 14.5.3.
Recall that, on the basis of Equation (14.15), we can rewrite the update rule
of SGD as
\[
\mathbf{w}^{(t+1)} = -\frac{1}{\lambda t} \sum_{j=1}^{t} \mathbf{v}_j ,
\]
where $\mathbf{v}_j$ is a subgradient of the loss function at $\mathbf{w}^{(j)}$ on the random example
chosen at iteration $j$. For the hinge loss, given an example $(\mathbf{x}, y)$, we can choose $\mathbf{v}_j$
to be $0$ if $y\langle \mathbf{w}^{(j)}, \mathbf{x}\rangle \ge 1$ and $\mathbf{v}_j = -y\mathbf{x}$ otherwise (see Example 14.2). Denoting
$\boldsymbol{\theta}^{(t)} = -\sum_{j<t} \mathbf{v}_j$ we obtain the following procedure.
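In code, a minimal NumPy sketch of such a procedure might look as follows (the function name, the uniform sampling with a fixed seed, and the choice to return the average of the iterates are illustrative choices of this sketch, not details taken from the text):

```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, seed=0):
    """SGD for the Soft-SVM objective of Equation (15.12).

    Maintains theta^(t) = -sum_{j<t} v_j, where v_j is the hinge-loss
    subgradient described above, sets w^(t) = theta^(t) / (lam * t),
    and returns the average of the iterates w^(1), ..., w^(T).
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)              # current iterate w^(t)
        i = rng.integers(m)                # choose an example uniformly at random
        if y[i] * np.dot(w, X[i]) < 1:     # hinge loss is active: v_t = -y_i x_i
            theta = theta + y[i] * X[i]    # theta^(t+1) = theta^(t) - v_t
        w_sum += w
    return w_sum / T
```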