Equation (25.4) and Equation (25.5) lead to the same solution, so the two problems are in some sense equivalent.
The $\ell_1$ regularization often induces sparse solutions. To illustrate this, let us start with the simple optimization problem
\[
\min_{w \in \mathbb{R}} \left( \frac{1}{2} w^2 - xw + \lambda |w| \right). \tag{25.6}
\]
It is easy to verify (see Exercise 2) that the solution to this problem is the "soft thresholding" operator
\[
w = \operatorname{sign}(x)\,\bigl[\, |x| - \lambda \,\bigr]_+ , \tag{25.7}
\]
where $[a]_+ \stackrel{\text{def}}{=} \max\{a, 0\}$. That is, as long as the absolute value of $x$ is smaller than $\lambda$, the optimal solution will be zero.
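To make this concrete, here is a minimal numerical sketch in Python (not part of the original text) of the soft thresholding operator (25.7), checked against a grid search over the objective in (25.6); the values of $x$ and $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft thresholding: sign(x) * max(|x| - lam, 0), the solution of (25.6)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Grid-search check of the objective 0.5*w^2 - x*w + lam*|w| from (25.6).
x, lam = 0.7, 1.0                       # illustrative values; since |x| < lam, the minimizer is 0
ws = np.linspace(-3.0, 3.0, 60001)
obj = 0.5 * ws**2 - x * ws + lam * np.abs(ws)
print(soft_threshold(x, lam), ws[np.argmin(obj)])   # both should be 0.0 (up to the grid)
```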
Next, consider a one dimensional regression problem with respect to the squared loss:
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda |w| \right).
\]
We can rewrite the problem as
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2} \left( \frac{1}{m} \sum_{i} x_i^2 \right) w^2 - \left( \frac{1}{m} \sum_{i=1}^{m} x_i y_i \right) w + \lambda |w| \right).
\]
For simplicity let us assume that $\frac{1}{m} \sum_{i} x_i^2 = 1$, and denote $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{m} x_i y_i$; then the optimal solution is
\[
w = \operatorname{sign}(\langle \mathbf{x}, \mathbf{y} \rangle)\,\bigl[\, |\langle \mathbf{x}, \mathbf{y} \rangle| / m - \lambda \,\bigr]_+ .
\]
That is, the solution will be zero unless the correlation between the feature $\mathbf{x}$ and the labels vector $\mathbf{y}$ is larger than $\lambda$.
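The following Python sketch (an illustration added here, using synthetic data and an arbitrary $\lambda$) computes this closed-form solution and compares it with a direct grid search over the regularized squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 100, 0.2                        # sample size and lambda chosen for illustration
x = rng.normal(size=m)
x *= np.sqrt(m / np.sum(x**2))           # rescale so that (1/m) * sum(x_i^2) == 1
y = 0.5 * x + rng.normal(size=m)

# Closed form: w = sign(<x,y>) * [ |<x,y>|/m - lam ]_+
corr = np.dot(x, y)
w_closed = np.sign(corr) * max(abs(corr) / m - lam, 0.0)

# Grid-search check of the objective (1/(2m)) * sum_i (x_i*w - y_i)^2 + lam*|w|
ws = np.linspace(-2.0, 2.0, 4001)
obj = np.array([np.sum((x * w - y)**2) / (2 * m) + lam * abs(w) for w in ws])
w_grid = ws[np.argmin(obj)]
print(w_closed, w_grid)                  # should agree up to the grid resolution
```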
Remark 25.4 Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions. Indeed, consider the problem above with an $\ell_2$ regularization, namely,
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda w^2 \right).
\]
Then, the optimal solution is
\[
w = \frac{\langle \mathbf{x}, \mathbf{y} \rangle / m}{\|\mathbf{x}\|^2 / m + 2\lambda} .
\]
This solution will be nonzero even if the correlation between $\mathbf{x}$ and $\mathbf{y}$ is very small. In contrast, as we have shown before, when using $\ell_1$ regularization, $w$ will be nonzero only if the correlation between $\mathbf{x}$ and $\mathbf{y}$ is larger than the regularization parameter $\lambda$.
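As a small added illustration (again with synthetic, nearly uncorrelated data and an arbitrary $\lambda$), the following sketch contrasts the two closed forms: the $\ell_1$ solution is exactly zero, while the $\ell_2$ solution merely shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 100, 0.2                        # illustrative values, not from the text
x = rng.normal(size=m)
x *= np.sqrt(m / np.sum(x**2))           # normalize so that (1/m) * sum(x_i^2) == 1
y = rng.normal(size=m)                   # labels essentially uncorrelated with x

corr = np.dot(x, y)

# l1 regularization: zero unless |<x,y>|/m exceeds lambda
w_l1 = np.sign(corr) * max(abs(corr) / m - lam, 0.0)

# l2 regularization: the closed form from Remark 25.4, nonzero whenever <x,y> != 0
w_l2 = (corr / m) / (np.dot(x, x) / m + 2 * lam)

print(w_l1, w_l2)                        # typically w_l1 == 0.0, w_l2 small but nonzero
```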