25.1 Feature Selection 361
Then, we updateIt=It− 1 ∪{jt}.
We now describe a more efficient implementation of the forward greedy selec-
tion approach for linear regression which is calledOrthogonal Matching Pursuit
(OMP). The idea is to keep an orthogonal basis of the features aggregated so
far. LetVtbe a matrix whose columns form an orthonormal basis of the columns
ofXIt.
Clearly,
minw ‖XItw−y‖^2 = min
θ∈Rt
‖Vtθ−y‖^2.
We will maintain a vectorθtwhich minimizes the right-hand side of the equation.
Initially, we setI 0 =∅,V 0 =∅, andθ 1 to be the empty vector. At roundt, for
everyj, we decomposeXj=vj+ujwherevj=Vt− 1 Vt>− 1 Xjis the projection
ofXjonto the subspace spanned byVt− 1 andujis the part ofXjorthogonal to
Vt− 1 (see Appendix C). Then,
minθ,α‖Vt− 1 θ+αuj−y‖^2
= minθ,α
[
‖Vt− 1 θ−y‖^2 +α^2 ‖uj‖^2 + 2α〈uj,Vt− 1 θ−y〉
]
= minθ,α
[
‖Vt− 1 θ−y‖^2 +α^2 ‖uj‖^2 + 2α〈uj,−y〉
]
= minθ
[
‖Vt− 1 θ−y‖^2
]
+ minα
[
α^2 ‖uj‖^2 − 2 α〈uj,y〉
]
=
[
‖Vt− 1 θt− 1 −y‖^2
]
+ minα
[
α^2 ‖uj‖^2 − 2 α〈uj,y〉
]
=‖Vt− 1 θt− 1 −y‖^2 −
(〈uj,y〉)^2
‖uj‖^2
.
It follows that we should select the feature
jt= argmax
j
(〈uj,y〉)^2
‖uj‖^2
.
The rest of the update is to set
Vt=
[
Vt− 1 , ujt
‖ujt‖^2
]
, θt=
[
θt− 1 ; 〈ujt,y〉
‖ujt‖^2
]
.
The OMP procedure maintains an orthonormal basis of the selected features,
where in the preceding description, the orthonormalization property is obtained
by a procedure similar to Gram-Schmidt orthonormalization. In practice, the
Gram-Schmidt procedure is often numerically unstable. In the pseudocode that
follows we use SVD (see Section C.4) at the end of each round to obtain an
orthonormal basis in a numerically stable manner.