Understanding Machine Learning: From Theory to Algorithms


13 Regularization and Stability


Combining the preceding with Equation (13.14) and again using the assumption
β ≤ λm/2 yields

\[
\begin{aligned}
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)
&\le \sqrt{2\beta\,\ell(A(S), z_i)}\,\|A(S^{(i)}) - A(S)\| + \frac{\beta}{2}\,\|A(S^{(i)}) - A(S)\|^2 \\
&\le \left(\frac{4\beta}{\lambda m} + \frac{8\beta^2}{(\lambda m)^2}\right)
      \left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{8\beta}{\lambda m}
      \left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{24\beta}{\lambda m}
      \left(\ell(A(S), z_i) + \ell(A(S^{(i)}), z')\right),
\end{aligned}
\]

where in the last step we used the inequality (a+b)^2 ≤ 3(a^2 + b^2). Taking expectation
with respect to S, z′, i, and noting that E[ℓ(A(S), z_i)] = E[ℓ(A(S^(i)), z′)] =
E[L_S(A(S))], we conclude the bound stated in Corollary 13.7 below.
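
The two elementary bounds used in the middle and last steps of the chain above can be checked directly; they are spelled out here for completeness:

\[
\frac{4\beta}{\lambda m} + \frac{8\beta^2}{(\lambda m)^2}
  = \frac{4\beta}{\lambda m}\left(1 + \frac{2\beta}{\lambda m}\right)
  \le \frac{4\beta}{\lambda m}\,(1 + 1)
  = \frac{8\beta}{\lambda m}
  \qquad (\text{using } \beta \le \lambda m/2),
\]
\[
(a+b)^2 = a^2 + 2ab + b^2 \le 2(a^2 + b^2) \le 3(a^2 + b^2).
\]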

Corollary 13.7  Assume that the loss function is β-smooth and nonnegative.
Then, the RLM rule with the regularizer λ‖w‖^2, where λ ≥ 2β/m, satisfies

\[
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right]
\;\le\; \frac{48\beta}{\lambda m}\,\mathbb{E}\left[L_S(A(S))\right].
\]
Note that if for all z we have ℓ(0, z) ≤ C, for some scalar C > 0, then for
every S,

\[
L_S(A(S)) \le L_S(A(S)) + \lambda\|A(S)\|^2 \le L_S(\mathbf{0}) + \lambda\|\mathbf{0}\|^2 = L_S(\mathbf{0}) \le C,
\]

where the second inequality holds because A(S) minimizes the regularized objective L_S(w) + λ‖w‖^2.

Hence, Corollary 13.7 also implies that

\[
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right]
\;\le\; \frac{48\beta C}{\lambda m}.
\]
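
To make the stability statement concrete, here is a small Python sketch (not from the book; the data model, constants, and function names are illustrative assumptions). It estimates the replace-one term E[ℓ(A(S^(i)), z_i) − ℓ(A(S), z_i)] by Monte Carlo for the RLM rule with the squared loss, a smooth and nonnegative loss, and lets one observe how the estimate shrinks as λ or m grows.

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """RLM rule for the squared loss with Tikhonov regularization:
    argmin_w (1/m) * sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2 (closed form)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def sq_loss(w, x, y):
    """Squared loss on a single example z = (x, y)."""
    return (x @ w - y) ** 2

def replace_one_gap(m, lam, d=5, trials=300, seed=0):
    """Monte Carlo estimate of E[ l(A(S^(i)), z_i) - l(A(S), z_i) ]."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)
    gaps = []
    for _ in range(trials):
        # Draw a training set S, an index i, and a replacement example z'.
        X = rng.normal(size=(m, d))
        y = X @ w_star + 0.1 * rng.normal(size=m)
        i = rng.integers(m)
        x_new = rng.normal(size=d)
        y_new = x_new @ w_star + 0.1 * rng.normal()
        # Train on S and on S^(i) (S with z_i replaced by z').
        w_S = rlm_ridge(X, y, lam)
        X_i, y_i = X.copy(), y.copy()
        X_i[i], y_i[i] = x_new, y_new
        w_Si = rlm_ridge(X_i, y_i, lam)
        # Loss difference on the replaced example z_i.
        gaps.append(sq_loss(w_Si, X[i], y[i]) - sq_loss(w_S, X[i], y[i]))
    return float(np.mean(gaps))

if __name__ == "__main__":
    for lam in (0.01, 0.1, 1.0):
        gap = replace_one_gap(m=200, lam=lam)
        print(f"lambda={lam:5.2f}  estimated stability term={gap:.6f}")
```

Increasing lam (or m) in this sketch makes the estimated gap smaller, in line with the 1/(λm) dependence in the corollary.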

13.4 Controlling the Fitting-Stability Tradeoff


We can rewrite the expected risk of a learning algorithm as

\[
\mathbb{E}_S\left[L_{\mathcal{D}}(A(S))\right]
= \mathbb{E}_S\left[L_S(A(S))\right]
+ \mathbb{E}_S\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right]. \tag{13.15}
\]

The first term reflects how well A(S) fits the training set, while the second term
reflects the difference between the true and empirical risks of A(S). As we have
shown in Theorem 13.2, the second term is equivalent to the stability of A. Since
our goal is to minimize the risk of the algorithm, we need the sum of both
terms to be small.
In the previous section we bounded the stability term. We showed
that the stability term decreases as the regularization parameter, λ, increases.
On the other hand, the empirical risk increases with λ. We therefore face a
tradeoff between fitting and stability.
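
This tradeoff can be observed numerically. The following Python sketch is illustrative and not from the book (the data model, sample sizes, and λ grid are assumptions): it sweeps λ for the squared-loss RLM rule, reporting the empirical risk L_S(A(S)), an estimate of the true risk obtained on a large fresh sample, and their gap. As λ grows the gap (the stability term) shrinks while the empirical risk grows.

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """argmin_w (1/m) * ||Xw - y||^2 + lam * ||w||^2 (closed form)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def mean_sq_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
d, m, m_test = 20, 50, 100_000
w_star = rng.normal(size=d)

# Training set S, and a large fresh sample used as a proxy for the true risk L_D.
X = rng.normal(size=(m, d))
y = X @ w_star + rng.normal(size=m)
X_test = rng.normal(size=(m_test, d))
y_test = X_test @ w_star + rng.normal(size=m_test)

print(f"{'lambda':>8}  {'L_S(A(S))':>10}  {'L_D(A(S))':>10}  {'gap':>8}")
for lam in (1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0):
    w = rlm_ridge(X, y, lam)
    fit = mean_sq_loss(w, X, y)             # empirical risk: grows with lambda
    risk = mean_sq_loss(w, X_test, y_test)  # estimate of the true risk
    print(f"{lam:8.4f}  {fit:10.4f}  {risk:10.4f}  {risk - fit:8.4f}")
```

The sum of the two columns being balanced at some intermediate λ is exactly the fitting-stability tradeoff this section discusses: λ must be large enough for stability but small enough for a good fit.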