Understanding Machine Learning: From Theory to Algorithms

25.1 Feature Selection

Then, we update $I_t = I_{t-1} \cup \{j_t\}$.
We now describe a more efficient implementation of the forward greedy selection approach for linear regression, called Orthogonal Matching Pursuit (OMP). The idea is to keep an orthonormal basis of the features aggregated so far. Let $V_t$ be a matrix whose columns form an orthonormal basis of the columns of $X_{I_t}$.
Clearly,
$$\min_{w} \|X_{I_t} w - y\|^2 \;=\; \min_{\theta \in \mathbb{R}^t} \|V_t \theta - y\|^2 .$$
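To see this equivalence concretely, here is a small NumPy check with made-up data (the matrix, targets, and index set below are illustrative, not from the text): minimizing over the selected columns of $X$ directly and minimizing over an orthonormal basis of those columns (obtained here with a reduced QR factorization) yield the same squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # hypothetical design matrix
y = rng.normal(size=50)         # hypothetical targets
I = [1, 4, 7]                   # currently selected feature indices

# Least squares over the selected columns of X directly.
w, *_ = np.linalg.lstsq(X[:, I], y, rcond=None)
r1 = np.linalg.norm(X[:, I] @ w - y) ** 2

# Least squares over an orthonormal basis V of the same column space.
V, _ = np.linalg.qr(X[:, I])    # V has orthonormal columns
theta = V.T @ y                 # minimizer of ||V theta - y||^2
r2 = np.linalg.norm(V @ theta - y) ** 2

# r1 and r2 agree because V spans the same subspace as X[:, I].
```

Note that with orthonormal columns the inner minimizer has the closed form $\theta = V^{\top} y$, which is what makes the basis worth maintaining.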

We will maintain a vector $\theta_t$ that minimizes the right-hand side of this equation. Initially, we set $I_0 = \emptyset$, $V_0 = \emptyset$, and $\theta_0$ to be the empty vector. At round $t$, for every $j$, we decompose $X_j = v_j + u_j$, where $v_j = V_{t-1} V_{t-1}^{\top} X_j$ is the projection of $X_j$ onto the subspace spanned by $V_{t-1}$, and $u_j$ is the part of $X_j$ orthogonal to $V_{t-1}$ (see Appendix C). Then,


$$
\begin{aligned}
\min_{\theta,\alpha} \|V_{t-1}\theta + \alpha u_j - y\|^2
&= \min_{\theta,\alpha} \left[ \|V_{t-1}\theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2\alpha \langle u_j,\, V_{t-1}\theta - y \rangle \right] \\
&= \min_{\theta,\alpha} \left[ \|V_{t-1}\theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2\alpha \langle u_j,\, -y \rangle \right] \\
&= \min_{\theta} \left[ \|V_{t-1}\theta - y\|^2 \right] + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2\alpha \langle u_j, y \rangle \right] \\
&= \|V_{t-1}\theta_{t-1} - y\|^2 + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2\alpha \langle u_j, y \rangle \right] \\
&= \|V_{t-1}\theta_{t-1} - y\|^2 - \frac{\langle u_j, y \rangle^2}{\|u_j\|^2} ,
\end{aligned}
$$
where the second equality holds because $u_j$ is orthogonal to the range of $V_{t-1}$, so $\langle u_j, V_{t-1}\theta \rangle = 0$.

It follows that we should select the feature
$$j_t = \operatorname*{argmax}_{j} \; \frac{\langle u_j, y \rangle^2}{\|u_j\|^2} .$$

The rest of the update is to set
$$V_t = \left[ V_{t-1},\; \frac{u_{j_t}}{\|u_{j_t}\|} \right], \qquad
\theta_t = \left[ \theta_{t-1};\; \frac{\langle u_{j_t}, y \rangle}{\|u_{j_t}\|} \right].$$
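Putting the selection rule and the update together, a single round of the procedure can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the book's pseudocode; the function name `omp_round` and the numerical tolerance are our own choices.

```python
import numpy as np

def omp_round(X, y, V, theta, I):
    """One round of OMP with Gram-Schmidt-style updates (illustrative sketch).

    X: (m, d) feature matrix; y: (m,) targets;
    V: (m, t-1) matrix with orthonormal columns spanning the selected features;
    theta: (t-1,) current coefficients; I: set of selected indices.
    """
    m, d = X.shape
    best_j, best_score, best_u = None, -np.inf, None
    for j in range(d):
        if j in I:
            continue
        # Decompose X_j = v_j + u_j: v_j = V V^T X_j, u_j orthogonal to span(V).
        u = X[:, j] - V @ (V.T @ X[:, j])
        nu = np.linalg.norm(u)
        if nu < 1e-12:                   # X_j already (numerically) in span(V)
            continue
        score = (u @ y) ** 2 / nu ** 2   # reduction in squared error
        if score > best_score:
            best_j, best_score, best_u = j, score, u
    assert best_j is not None, "no usable feature left"
    nu = np.linalg.norm(best_u)
    V = np.column_stack([V, best_u / nu])        # keep columns orthonormal
    theta = np.append(theta, (best_u @ y) / nu)  # coefficient on the new column
    return V, theta, I | {best_j}
```

To run it from scratch, start with `V = np.zeros((m, 0))`, `theta = np.zeros(0)`, and `I = set()`; each call then implements one round $t$ of the description above.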

The OMP procedure maintains an orthonormal basis of the selected features,
where in the preceding description, the orthonormalization property is obtained
by a procedure similar to Gram-Schmidt orthonormalization. In practice, the
Gram-Schmidt procedure is often numerically unstable. In the pseudocode that
follows we use SVD (see Section C.4) at the end of each round to obtain an
orthonormal basis in a numerically stable manner.
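A sketch of how such an SVD-based variant might look in NumPy is below. This is an assumption-laden illustration, not the book's pseudocode: rather than appending one normalized column per round, it recomputes an orthonormal basis of $X_{I_t}$ from the SVD at the end of each round, which is the numerically stable step the text refers to. The function name `omp_svd` and the tolerances are our own.

```python
import numpy as np

def omp_svd(X, y, T):
    """Run T rounds of OMP, re-orthonormalizing with an SVD each round (sketch)."""
    m, d = X.shape
    I = []
    V = np.zeros((m, 0))
    for _ in range(T):
        # Orthogonal parts u_j of all columns at once: u_j = X_j - V V^T X_j.
        U_perp = X - V @ (V.T @ X)
        norms = np.linalg.norm(U_perp, axis=0)
        # Score each feature by <u_j, y>^2 / ||u_j||^2; skip degenerate columns.
        scores = np.where(norms > 1e-12,
                          (U_perp.T @ y) ** 2 / np.maximum(norms, 1e-12) ** 2,
                          -np.inf)
        I.append(int(np.argmax(scores)))
        # Numerically stable orthonormal basis of the columns X_I via SVD.
        U, s, _ = np.linalg.svd(X[:, I], full_matrices=False)
        V = U[:, s > 1e-12]
    theta = V.T @ y   # least-squares coefficients in the orthonormal basis
    return I, V, theta
```

Recomputing the basis from the SVD each round costs more than a Gram-Schmidt append, but it avoids the gradual loss of orthogonality that repeated projections can cause.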
