25.3 Feature Learning

and given a document, (p_1, ..., p_d), where each p_i is a word in the document,
we represent the document as a vector x ∈ {0,1}^k, where x_i is 1 if w_i = p_j for
some j ∈ [d], and x_i = 0 otherwise. It was empirically observed in many text
processing tasks that linear predictors are quite powerful when applied to this
representation. Intuitively, we can think of each word as a feature that measures
some aspect of the document. Given labeled examples (e.g., topics of the doc-
uments), a learning algorithm searches for a linear predictor that weights these
features so that the right combination of word appearances is indicative of the
label.
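
A small Python sketch may make this representation concrete (the dictionary, the document, and the helper name bag_of_words are illustrative, not part of the text):

    def bag_of_words(document, dictionary):
        # Map a document (p_1, ..., p_d) to x in {0,1}^k, where x_i = 1 iff the
        # dictionary word w_i equals p_j for some j, and x_i = 0 otherwise.
        doc_words = set(document)
        return [1 if w in doc_words else 0 for w in dictionary]

    dictionary = ["learning", "linear", "predictor", "image"]   # w_1, ..., w_k (illustrative)
    document = ["a", "linear", "predictor", "on", "text"]       # p_1, ..., p_d (illustrative)
    print(bag_of_words(document, dictionary))                   # prints [0, 1, 1, 0]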


While in text processing there is a natural meaning to words and to the dic-
tionary, in other applications we do not have such an intuitive representation
of an instance. For example, consider the computer vision application of object
recognition. Here, the instance is an image and the goal is to recognize which
object appears in the image. Applying a linear predictor on the pixel-based rep-
resentation of the image does not yield a good classifier. What we would like
to have is a mapping ψ that would take the pixel-based representation of the
image and would output a bag of “visual words,” representing the content of the
image. For example, a “visual word” can be “there is an eye in the image.” If
we had such a representation, we could apply a linear predictor on top of
it to train a classifier for, say, face recognition. Our question is,
therefore, how can we learn a dictionary of “visual words” such that a bag-of-
words representation of an image would be helpful for predicting which object
appears in the image?


A first naive approach to dictionary learning relies on a clustering algorithm
(see Chapter 22). Suppose that we learn a function c : X → {1, ..., k}, where
c(x) is the cluster to which x belongs. Then, we can think of the clusters as
“words,” and of instances as “documents,” where a document x is mapped to
the vector ψ(x) ∈ {0,1}^k, where ψ(x)_i is 1 if and only if x belongs to the ith
cluster. Now, it is straightforward to see that applying a linear predictor on ψ(x)
is equivalent to assigning the same target value to all instances that belong to
the same cluster. Furthermore, if the clustering is based on distances from a
class center (e.g., k-means), then a linear predictor on ψ(x) yields a piecewise
constant predictor on x.
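
The following sketch illustrates this point (the centroids and the weight vector w are arbitrary illustrative values, and psi is the nearest-centroid indicator map described above):

    import numpy as np

    centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])   # k = 3 centroids (illustrative)

    def psi(x):
        # Indicator vector of the cluster of x: psi(x)_i = 1 iff i = c(x),
        # where c(x) is the index of the centroid closest to x.
        c = np.argmin(np.linalg.norm(centroids - x, axis=1))
        e = np.zeros(len(centroids))
        e[c] = 1.0
        return e

    w = np.array([-1.0, 2.0, 0.5])   # a linear predictor on the k-dimensional representation
    x1, x2 = np.array([0.2, -0.1]), np.array([0.1, 0.3])   # two instances in the same cluster
    print(w @ psi(x1), w @ psi(x2))  # identical values: the predictor is constant on each cluster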


Both the k-means and PCA approaches can be regarded as special cases of a
more general approach for dictionary learning which is called auto-encoders. In an
auto-encoder we learn a pair of functions: an “encoder” function, ψ : R^d → R^k,
and a “decoder” function, φ : R^k → R^d. The goal of the learning process is to
find a pair of functions such that the reconstruction error,

    ∑_i ‖x_i − φ(ψ(x_i))‖²,

is small. Of course, we can trivially set k = d and take both ψ and φ to be the identity
mapping, which yields a perfect reconstruction. We therefore must restrict ψ and
φ in some way. In PCA, we constrain k < d and further restrict ψ and φ to be
linear functions. In k-means, k is not restricted to be smaller than d, but now
ψ and φ rely on k centroids, μ_1, ..., μ_k, and ψ(x) returns an indicator vector
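
To make the objective concrete, here is a minimal numpy sketch of the PCA special case (linear encoder and decoder with k < d); the random data matrix X is an illustrative stand-in, and the top-k principal directions are taken from an SVD of the centered data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))    # 100 examples x_1, ..., x_m in R^d with d = 5 (illustrative)
    X = X - X.mean(axis=0)           # center the data

    k = 2
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k]                       # rows are the top-k principal directions, shape (k, d)

    psi = lambda x: W @ x            # linear encoder, R^d -> R^k
    phi = lambda z: W.T @ z          # linear decoder, R^k -> R^d

    # Reconstruction error: sum_i ||x_i - phi(psi(x_i))||^2
    error = sum(np.linalg.norm(x - phi(psi(x))) ** 2 for x in X)
    print(error)

With these linearity constraints and k < d, minimizing the reconstruction error over the choice of W is exactly what makes PCA a special case of the auto-encoder objective.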
