Regularization and Stability

tion, and the algorithm balances between low empirical risk and “simpler,” or
“less complex,” hypotheses.
There are many possible regularization functions one can use, reflecting some
prior belief about the problem (similarly to the description language in Minimum
Description Length). Throughout this section we will focus on one of the simplest
regularization functions: $R(\mathbf{w}) = \lambda\|\mathbf{w}\|^2$, where $\lambda > 0$ is a scalar and the
norm is the $\ell_2$ norm, $\|\mathbf{w}\| = \sqrt{\sum_{i=1}^d w_i^2}$. This yields the learning rule:
$$ A(S) = \operatorname*{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \right). \tag{13.2} $$

This type of regularization function is often called Tikhonov regularization.
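
To make the rule concrete, here is a minimal Python sketch of Equation (13.2), not taken from the book: `rlm_objective` adds the Tikhonov term to an arbitrary empirical risk, and a generic numerical minimizer plays the role of $A(S)$. The data set, the loss, and the value of $\lambda$ below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rlm_objective(w, empirical_loss, lam):
    """Tikhonov-regularized objective of Eq. (13.2): L_S(w) + lam * ||w||^2."""
    return empirical_loss(w) + lam * np.dot(w, w)

# Illustrative sample and squared-loss empirical risk (names and values are made up).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.5])
empirical_loss = lambda w: np.mean((X @ w - y) ** 2) / 2

# Any off-the-shelf minimizer can implement the argmin; here scipy's default BFGS.
w_rlm = minimize(rlm_objective, x0=np.zeros(2), args=(empirical_loss, 0.1)).x
```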
As mentioned before, one interpretation of Equation (13.2) is using structural
risk minimization, where the norm of $\mathbf{w}$ is a measure of its "complexity." Recall
that in the previous chapter we introduced the notion of bounded hypothesis
classes. Therefore, we can define a sequence of hypothesis classes, $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots$,
where $\mathcal{H}_i = \{\mathbf{w} : \|\mathbf{w}\|^2 \le i\}$. If the sample complexity of each $\mathcal{H}_i$ depends
on $i$, then the RLM rule is similar to the SRM rule for this sequence of nested
classes.
A different interpretation of regularization is as a stabilizer. In the next section
we define the notion of stability and prove that stable learning rules do not
overfit. But first, let us demonstrate the RLM rule for linear regression with the
squared loss.

13.1.1 Ridge Regression


Applying the RLM rule with Tikhonov regularization to linear regression with
the squared loss, we obtain the following learning rule:

$$ \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d} \left( \lambda \|\mathbf{w}\|_2^2 + \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \left( \langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right)^2 \right). \tag{13.3} $$

Performing linear regression using Equation (13.3) is called ridge regression.
To solve Equation (13.3) we compare the gradient of the objective to zero and
obtain the set of linear equations

$$ (2\lambda m I + A)\,\mathbf{w} = \mathbf{b}, $$

where $I$ is the identity matrix and $A, \mathbf{b}$ are as defined in Equation (9.6), namely,

$$ A = \left( \sum_{i=1}^m \mathbf{x}_i \mathbf{x}_i^\top \right) \quad \text{and} \quad \mathbf{b} = \sum_{i=1}^m y_i \mathbf{x}_i. \tag{13.4} $$
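
For completeness, the gradient step that the text leaves implicit is the following short calculation (a sketch, using only the definitions above):

$$ \nabla_{\mathbf{w}} \left( \lambda \|\mathbf{w}\|_2^2 + \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \left( \langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right)^2 \right) = 2\lambda \mathbf{w} + \frac{1}{m} \sum_{i=1}^m \left( \langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right) \mathbf{x}_i. $$

Setting this to zero and multiplying through by $m$ gives $2\lambda m\,\mathbf{w} + A\mathbf{w} - \mathbf{b} = 0$, which is exactly $(2\lambda m I + A)\mathbf{w} = \mathbf{b}$.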

Since $A$ is a positive semidefinite matrix, the matrix $2\lambda m I + A$ has all its eigenvalues
bounded below by $2\lambda m$. Hence, this matrix is invertible and the solution
to ridge regression becomes

$$ \mathbf{w} = (2\lambda m I + A)^{-1} \mathbf{b}. \tag{13.5} $$
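
As a concrete illustration (not from the book), the following NumPy sketch forms $A$ and $\mathbf{b}$ and solves the linear system behind Equation (13.5); the names `X`, `y`, and `lam` are assumptions standing for the matrix whose rows are the instances $\mathbf{x}_i$, the targets $y_i$, and $\lambda$.

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Solve (2*lam*m*I + A) w = b, with A = sum_i x_i x_i^T and b = sum_i y_i x_i.

    X: (m, d) matrix whose rows are the instances x_i; y: (m,) targets; lam > 0.
    """
    m, d = X.shape
    A = X.T @ X           # sum_i x_i x_i^T
    b = X.T @ y           # sum_i y_i x_i
    # 2*lam*m*I + A has eigenvalues bounded below by 2*lam*m, so it is invertible;
    # solving the system directly is preferred over forming the inverse explicitly.
    return np.linalg.solve(2 * lam * m * np.eye(d) + A, b)

# Example usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
w_hat = ridge_regression(X, y, lam=0.01)
```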