370 Feature Selection and Generation
in{ 0 , 1 }kthat indicates the closest centroid tox, whileφtakes as input an
indicator vector and returns the centroid representing this vector.
An important property of thek-means construction, which is key in allowing
kto be larger thand, is thatψmaps instances intosparsevectors. In fact, in
k-means only a single coordinate ofψ(x) is nonzero. An immediate extension of
thek-means construction is therefore to restrict the range ofψto be vectors with
at mostsnonzero elements, wheresis a small integer. In particular, letψandφ
be functions that depend onμ 1 ,...,μk. The functionψmaps an instance vector
xto a vectorψ(x)∈Rk, whereψ(x) should have at mostsnonzero elements.
The functionφ(v) is defined to be∑k
i=1viμi. As before, our goal is to have a
small reconstruction error, and therefore we can defineψ(x) = argmin
v‖x−φ(v)‖^2 s.t. ‖v‖ 0 ≤s,where‖v‖ 0 =|{j:vj 6 = 0}|. Note that whens= 1 and we further restrict‖v‖ 1 =
1 then we obtain thek-means encoding function; that is,ψ(x) is the indicator
vector of the centroid closest tox. For larger values ofs, the optimization problem
in the preceding definition ofψbecomes computationally difficult. Therefore, in
practice, we sometime use` 1 regularization instead of the sparsity constraint and
defineψto beψ(x) = argmin
v[
‖x−φ(v)‖^2 +λ‖v‖ 1]
,
whereλ >0 is a regularization parameter. Anyway, the dictionary learning
problem is now to find the vectorsμ 1 ,...,μksuch that the reconstruction er-
ror,∑m
i=1‖xi−φ(ψ(x))‖(^2) , is as small as possible. Even ifψis defined using
the` 1 regularization, this is still a computationally hard problem (similar to
thek-means problem). However, several heuristic search algorithms may give
reasonably good solutions. These algorithms are beyond the scope of this book.
25.4 Summary
Many machine learning algorithms take the feature representation of instances
for granted. Yet the choice of representation requires careful attention. We dis-
cussed approaches for feature selection, introducing filters, greedy selection al-
gorithms, and sparsity-inducing norms. Next we presented several examples for
feature transformations and demonstrated their usefulness. Last, we discussed
feature learning, and in particular dictionary learning. We have shown that fea-
ture selection, manipulation, and learning all depend on some prior knowledge
on the data.