Equation (25.4) and Equation (25.5) lead to the same solution, so the two problems are in some sense equivalent.
The $\ell_1$ regularization often induces sparse solutions. To illustrate this, let us start with the simple optimization problem
\[
\min_{w \in \mathbb{R}} \left( \frac{1}{2} w^2 - xw + \lambda |w| \right). \tag{25.6}
\]
It is easy to verify (see Exercise 2) that the solution to this problem is the "soft thresholding" operator
\[
w = \operatorname{sign}(x)\,\bigl[\, |x| - \lambda \,\bigr]_+ , \tag{25.7}
\]
where $[a]_+ \stackrel{\text{def}}{=} \max\{a, 0\}$. That is, as long as the absolute value of $x$ is smaller than $\lambda$, the optimal solution will be zero.
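To make this concrete, here is a minimal numerical sketch in Python (not part of the original text) of the soft thresholding operator (25.7), checked against a grid search over the objective in (25.6); the values of $x$ and $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft thresholding: sign(x) * max(|x| - lam, 0), the solution of (25.6)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Grid-search check of the objective 0.5*w^2 - x*w + lam*|w| from (25.6).
x, lam = 0.7, 1.0                       # illustrative values; since |x| < lam, the minimizer is 0
ws = np.linspace(-3.0, 3.0, 60001)
obj = 0.5 * ws**2 - x * ws + lam * np.abs(ws)
print(soft_threshold(x, lam), ws[np.argmin(obj)])   # both should be 0.0 (up to the grid)
```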
Next, consider a one dimensional regression problem with respect to the squared loss:
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda |w| \right).
\]
We can rewrite the problem as
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2} \left( \frac{1}{m} \sum_{i} x_i^2 \right) w^2 - \left( \frac{1}{m} \sum_{i=1}^{m} x_i y_i \right) w + \lambda |w| \right).
\]
For simplicity let us assume that $\frac{1}{m} \sum_{i} x_i^2 = 1$, and denote $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{m} x_i y_i$; then the optimal solution is
\[
w = \operatorname{sign}(\langle \mathbf{x}, \mathbf{y} \rangle)\,\bigl[\, |\langle \mathbf{x}, \mathbf{y} \rangle| / m - \lambda \,\bigr]_+ .
\]
That is, the solution will be zero unless the correlation between the feature $\mathbf{x}$ and the labels vector $\mathbf{y}$ is larger than $\lambda$.
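The following Python sketch (an illustration added here, using synthetic data and an arbitrary $\lambda$) computes this closed-form solution and compares it with a direct grid search over the regularized squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 100, 0.2                        # sample size and lambda chosen for illustration
x = rng.normal(size=m)
x *= np.sqrt(m / np.sum(x**2))           # rescale so that (1/m) * sum(x_i^2) == 1
y = 0.5 * x + rng.normal(size=m)

# Closed form: w = sign(<x,y>) * [ |<x,y>|/m - lam ]_+
corr = np.dot(x, y)
w_closed = np.sign(corr) * max(abs(corr) / m - lam, 0.0)

# Grid-search check of the objective (1/(2m)) * sum_i (x_i*w - y_i)^2 + lam*|w|
ws = np.linspace(-2.0, 2.0, 4001)
obj = np.array([np.sum((x * w - y)**2) / (2 * m) + lam * abs(w) for w in ws])
w_grid = ws[np.argmin(obj)]
print(w_closed, w_grid)                  # should agree up to the grid resolution
```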
Remark 25.4 Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions. Indeed, consider the problem above with an $\ell_2$ regularization, namely,
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda w^2 \right).
\]
Then, the optimal solution is
\[
w = \frac{\langle \mathbf{x}, \mathbf{y} \rangle / m}{\|\mathbf{x}\|^2 / m + 2\lambda} .
\]
This solution will be nonzero even if the correlation between $\mathbf{x}$ and $\mathbf{y}$ is very small. In contrast, as we have shown before, when using $\ell_1$ regularization, $w$ will be nonzero only if the correlation between $\mathbf{x}$ and $\mathbf{y}$ is larger than the regularization parameter $\lambda$.
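As a small added illustration (again with synthetic, nearly uncorrelated data and an arbitrary $\lambda$), the following sketch contrasts the two closed forms: the $\ell_1$ solution is exactly zero, while the $\ell_2$ solution merely shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 100, 0.2                        # illustrative values, not from the text
x = rng.normal(size=m)
x *= np.sqrt(m / np.sum(x**2))           # normalize so that (1/m) * sum(x_i^2) == 1
y = rng.normal(size=m)                   # labels essentially uncorrelated with x

corr = np.dot(x, y)

# l1 regularization: zero unless |<x,y>|/m exceeds lambda
w_l1 = np.sign(corr) * max(abs(corr) / m - lam, 0.0)

# l2 regularization: the closed form from Remark 25.4, nonzero whenever <x,y> != 0
w_l2 = (corr / m) / (np.dot(x, x) / m + 2 * lam)

print(w_l1, w_l2)                        # typically w_l1 == 0.0, w_l2 small but nonzero
```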