Equation (25.4) and Equation (25.5) lead to the same solution, the two problems
are in some sense equivalent.
The $\ell_1$ regularization often induces sparse solutions. To illustrate this, let us
start with the simple optimization problem
\[
\min_{w \in \mathbb{R}} \left( \frac{1}{2} w^2 - x w + \lambda |w| \right). \tag{25.6}
\]
It is easy to verify (see Exercise 2) that the solution to this problem is the “soft
thresholding” operator
\[
w = \operatorname{sign}(x)\, \big[\, |x| - \lambda \,\big]_+ , \tag{25.7}
\]
where $[a]_+ \stackrel{\text{def}}{=} \max\{a, 0\}$. That is, as long as the absolute value of $x$ is smaller
than $\lambda$, the optimal solution will be zero.
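Equation (25.7) is easy to sanity-check numerically. The following sketch (an illustration added here, not part of the text; the helper name `soft_threshold` is hypothetical) compares the closed form with a brute-force grid search over the objective of Equation (25.6):

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft thresholding: sign(x) * [|x| - lam]_+ , the minimizer of (25.6)."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

# Brute-force minimization of 0.5*w^2 - x*w + lam*|w| over a fine grid,
# compared with the closed-form soft-thresholding solution.
grid = np.linspace(-10.0, 10.0, 200001)
for x, lam in [(3.0, 1.0), (0.5, 1.0), (-2.0, 0.7)]:
    vals = 0.5 * grid**2 - x * grid + lam * np.abs(grid)
    w_grid = grid[np.argmin(vals)]
    w_closed = soft_threshold(x, lam)
    print(f"x={x:+.1f}, lam={lam:.1f}: closed form {w_closed:+.4f}, grid search {w_grid:+.4f}")
```

In particular, for the middle case ($|x| = 0.5 < \lambda = 1$) both approaches return zero, matching the observation above.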
Next, consider a one dimensional regression problem with respect to the squared
loss:
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda |w| \right).
\]
Expanding the square, we can rewrite the problem (omitting an additive constant that does not depend on $w$ and therefore does not affect the argmin) as
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2} \left( \frac{1}{m} \sum_{i=1}^{m} x_i^2 \right) w^2 - \left( \frac{1}{m} \sum_{i=1}^{m} x_i y_i \right) w + \lambda |w| \right).
\]
For simplicity let us assume that $\frac{1}{m} \sum_i x_i^2 = 1$, and denote $\langle x, y \rangle = \sum_{i=1}^{m} x_i y_i$;
then the optimal solution is
\[
w = \operatorname{sign}(\langle x, y \rangle)\, \big[\, |\langle x, y \rangle|/m - \lambda \,\big]_+ .
\]
That is, the solution will be zero unless the correlation between the feature $x$
and the labels vector $y$ is larger than $\lambda$.
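As a quick illustration (a minimal sketch on synthetic data, added here rather than taken from the text), one can verify this closed form on a random one-dimensional regression problem, after rescaling $x$ so that $\frac{1}{m}\sum_i x_i^2 = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 200, 0.1

# Synthetic one-dimensional data; rescale x so that (1/m) * sum(x_i^2) = 1.
x = rng.normal(size=m)
x /= np.sqrt(np.mean(x**2))
y = 0.4 * x + rng.normal(scale=0.5, size=m)

corr = x @ y                                               # <x, y>
w_closed = np.sign(corr) * max(abs(corr) / m - lam, 0.0)   # soft thresholding

# Brute-force check on the regularized empirical risk.
grid = np.linspace(-1.5, 1.5, 20001)
risk = ((grid[:, None] * x - y) ** 2).sum(axis=1) / (2 * m) + lam * np.abs(grid)
w_grid = grid[np.argmin(risk)]
print(f"closed form: {w_closed:.4f}   grid search: {w_grid:.4f}")
```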
Remark 25.4 Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions.
Indeed, consider the problem above with an $\ell_2$ regularization, namely,
\[
\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda w^2 \right).
\]
Then, the optimal solution is
\[
w = \frac{\langle x, y \rangle / m}{\|x\|^2 / m + 2\lambda}.
\]
This solution will be nonzero even if the correlation between $x$ and $y$ is very small.
In contrast, as we have shown before, when using $\ell_1$ regularization, $w$ will be
nonzero only if the correlation between $x$ and $y$ is larger than the regularization
parameter $\lambda$.
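To make the contrast concrete, the following sketch (synthetic data, an illustration added here rather than the book's own) evaluates both closed forms on a feature that is essentially uncorrelated with the labels; the $\ell_1$ solution is exactly zero whenever $|\langle x, y \rangle|/m \le \lambda$, while the ridge solution is merely small:

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 200, 0.3

# A feature that is (almost) uncorrelated with the labels.
x = rng.normal(size=m)
x /= np.sqrt(np.mean(x**2))       # enforce (1/m) * sum(x_i^2) = 1
y = rng.normal(size=m)            # labels drawn independently of x

corr = x @ y                      # <x, y>
w_l1 = np.sign(corr) * max(abs(corr) / m - lam, 0.0)   # soft thresholding (l1)
w_l2 = (corr / m) / (x @ x / m + 2 * lam)              # ridge closed form (l2)

print(f"<x,y>/m = {corr / m:+.4f}")
print(f"l1 solution: {w_l1:+.4f}  (exactly zero whenever |<x,y>|/m <= lambda)")
print(f"l2 solution: {w_l2:+.4f}  (small but typically nonzero)")
```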