Understanding Machine Learning: From Theory to Algorithms


Feature Selection and Generation


Equation (25.4) and Equation (25.5) lead to the same solution, so the two problems
are in some sense equivalent.
The $\ell_1$ regularization often induces sparse solutions. To illustrate this, let us
start with the simple optimization problem

\[
\min_{w \in \mathbb{R}} \left( \frac{1}{2} w^2 - x w + \lambda |w| \right). \tag{25.6}
\]

It is easy to verify (see Exercise 2) that the solution to this problem is the "soft
thresholding" operator
\[
w = \mathrm{sign}(x)\, \bigl[\, |x| - \lambda \,\bigr]_+ , \tag{25.7}
\]
where $[a]_+ \overset{\text{def}}{=} \max\{a, 0\}$. That is, as long as the absolute value of $x$ is smaller
than $\lambda$, the optimal solution will be zero.
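As an informal numerical check (not part of the text), the following Python sketch compares the closed-form soft-thresholding solution of Equation (25.7) with a brute-force grid search over the objective in Equation (25.6); the function names and the grid are illustrative choices.

```python
import numpy as np

def soft_threshold(x, lam):
    """Closed-form minimizer of (1/2) w^2 - x w + lam |w|  (Equation 25.7)."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def objective(w, x, lam):
    """Objective of Equation (25.6); works for scalar or array w."""
    return 0.5 * w ** 2 - x * w + lam * abs(w)

lam = 1.0
grid = np.linspace(-5.0, 5.0, 200001)          # brute-force search grid for w
for x in [-3.0, -0.5, 0.2, 2.5]:
    w_closed = soft_threshold(x, lam)
    w_grid = grid[np.argmin(objective(grid, x, lam))]
    # The two values agree up to the grid resolution, and both are
    # (near) zero whenever |x| <= lam.
    print(f"x={x:+.2f}  closed-form w={w_closed:+.4f}  grid w={w_grid:+.4f}")
```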
Next, consider a one dimensional regression problem with respect to the squared
loss:
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda |w| \right).
\]

We can rewrite the problem as
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2} \left( \frac{1}{m} \sum_{i=1}^{m} x_i^2 \right) w^2 - \left( \frac{1}{m} \sum_{i=1}^{m} x_i y_i \right) w + \lambda |w| \right).
\]

For simplicity let us assume that $\frac{1}{m} \sum_i x_i^2 = 1$, and denote $\langle x, y \rangle = \sum_{i=1}^{m} x_i y_i$;
then the optimal solution is
\[
w = \mathrm{sign}(\langle x, y \rangle)\, \bigl[\, |\langle x, y \rangle| / m - \lambda \,\bigr]_+ .
\]
That is, the solution will be zero unless the correlation between the feature $x$
and the labels vector $y$ is larger than $\lambda$.
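The following Python sketch (not from the text) verifies this closed-form solution on synthetic data, assuming the normalization $\frac{1}{m} \sum_i x_i^2 = 1$; the data-generation choices and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 100, 0.1
x = rng.normal(size=m)
x *= np.sqrt(m) / np.linalg.norm(x)       # normalize so that (1/m) * sum_i x_i^2 = 1
y = 0.3 * x + rng.normal(scale=0.5, size=m)

corr = x @ y                              # <x, y> = sum_i x_i y_i
w_closed = np.sign(corr) * max(abs(corr) / m - lam, 0.0)

def objective(w):
    """(1/(2m)) * sum_i (x_i w - y_i)^2 + lam * |w|"""
    return (1.0 / (2 * m)) * np.sum((x * w - y) ** 2) + lam * abs(w)

grid = np.linspace(-2.0, 2.0, 40001)
w_grid = min(grid, key=objective)         # brute-force minimizer over the grid
print(f"closed form: {w_closed:.4f}   grid search: {w_grid:.4f}")
```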
Remark 25.4 Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions.
Indeed, consider the problem above with an $\ell_2$ regularization, namely,
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda w^2 \right).
\]

Then, the optimal solution, obtained by setting the derivative of the (now differentiable)
objective to zero, is
\[
w = \frac{\langle x, y \rangle / m}{\|x\|^2 / m + 2\lambda}.
\]

This solution will be nonzero even if the correlation between $x$ and $y$ is very small.
In contrast, as we have shown before, when using $\ell_1$ regularization, $w$ will be
nonzero only if the correlation between $x$ and $y$ is larger than the regularization
parameter $\lambda$.
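To illustrate Remark 25.4 numerically, here is a small Python sketch (an illustration under assumed synthetic data, not the book's code): for a feature that is nearly uncorrelated with the labels, the $\ell_2$ (ridge) closed form is nonzero whenever $\langle x, y \rangle \neq 0$, while the $\ell_1$ solution vanishes once $|\langle x, y \rangle|/m \le \lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 200, 0.2
x = rng.normal(size=m)
x *= np.sqrt(m) / np.linalg.norm(x)            # (1/m) * sum_i x_i^2 = 1
y = rng.normal(size=m)                         # labels generated independently of x

corr_over_m = (x @ y) / m                      # <x, y> / m, close to zero here
w_l1 = np.sign(corr_over_m) * max(abs(corr_over_m) - lam, 0.0)
w_l2 = corr_over_m / (np.linalg.norm(x) ** 2 / m + 2 * lam)

# The ridge solution is nonzero whenever <x, y> != 0, however small;
# the lasso solution is exactly zero once |<x, y>|/m <= lam.
print(f"<x,y>/m = {corr_over_m:+.4f}   l1 w = {w_l1:+.4f}   l2 w = {w_l2:+.4f}")
```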