Understanding Machine Learning: From Theory to Algorithms

17.2 Linear Multiclass Predictors

Chapter 16 and as we will discuss in more detail in Chapter 25). Two examples
of useful constructions are given in the following.


The Multivector Construction:


Let $\mathcal{Y} = \{1,\ldots,k\}$ and let $\mathcal{X} = \mathbb{R}^n$. We define $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, where $d = nk$, as follows:


$$\Psi(x, y) = \big[\, \underbrace{0, \ldots, 0}_{\in\, \mathbb{R}^{(y-1)n}},\ \underbrace{x_1, \ldots, x_n}_{\in\, \mathbb{R}^{n}},\ \underbrace{0, \ldots, 0}_{\in\, \mathbb{R}^{(k-y)n}} \,\big]. \qquad (17.2)$$

That is, $\Psi(x, y)$ is composed of $k$ vectors, each of which is of dimension $n$, where we set all the vectors to be the all-zeros vector except the $y$'th vector, which is set to be $x$. It follows that we can think of $w \in \mathbb{R}^{nk}$ as being composed of $k$ weight vectors in $\mathbb{R}^n$, that is, $w = [w_1; \ldots; w_k]$, hence the name multivector construction. By the construction we have that $\langle w, \Psi(x, y)\rangle = \langle w_y, x\rangle$, and therefore the multiclass prediction becomes


$$h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}}\ \langle w_y, x\rangle.$$

A geometric illustration of the multiclass prediction over $\mathcal{X} = \mathbb{R}^2$ is given in the following.


[Figure: four weight vectors $w_1, w_2, w_3, w_4$ in $\mathbb{R}^2$ and the prediction regions they induce.]
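To make the construction concrete, here is a minimal sketch in Python (NumPy) of $\Psi$ and the induced predictor. The toy values of $n$, $k$, $w$, and $x$ are illustrative assumptions, and classes are indexed from $0$ rather than $1$ for convenience:

```python
import numpy as np

def psi(x, y, k):
    """Multivector feature map Psi(x, y) of Eq. (17.2): x is placed in
    the y-th of k blocks of length n (classes indexed 0, ..., k-1 here)."""
    n = len(x)
    out = np.zeros(k * n)
    out[y * n:(y + 1) * n] = x
    return out

def predict(w, x, k):
    """h(x) = argmax_y <w, Psi(x, y)>, computed as argmax_y <w_y, x>."""
    W = w.reshape(k, -1)  # view w as k stacked weight vectors w_1, ..., w_k
    return int(np.argmax(W @ x))

# Toy check with k = 3 classes and n = 2 features:
# <w, Psi(x, y)> equals <w_y, x> for every class y.
k, n = 3, 2
rng = np.random.default_rng(0)
w, x = rng.normal(size=k * n), rng.normal(size=n)
for y in range(k):
    assert np.isclose(w @ psi(x, y, k), w.reshape(k, n)[y] @ x)
print(predict(w, x, k))
```

The assertion verifies the identity $\langle w, \Psi(x, y)\rangle = \langle w_y, x\rangle$ that reduces the multiclass prediction to $k$ inner products.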

TF-IDF:


The previous definition of $\Psi(x, y)$ does not incorporate any prior knowledge about the problem. We next describe an example of a feature function $\Psi$ that does incorporate prior knowledge. Let $\mathcal{X}$ be a set of text documents and $\mathcal{Y}$ be a set of possible topics. Let $d$ be the size of a dictionary of words. For each word in the dictionary, whose corresponding index is $j$, let $TF(j, x)$ be the number of times the word corresponding to $j$ appears in the document $x$. This quantity is called Term-Frequency. Additionally, let $DF(j, y)$ be the number of times the word corresponding to $j$ appears in documents in our training set that are not about topic $y$. This quantity is called Document-Frequency and measures whether word $j$ is frequent in other topics. Now, define $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ to be such that


$$\Psi_j(x, y) = TF(j, x)\, \log\!\left(\frac{m}{DF(j, y)}\right),$$

where $m$ is the total number of documents in our training set. The preceding quantity is called term-frequency-inverse-document-frequency, or TF-IDF for short.
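As a minimal sketch of this feature function, the following Python snippet computes $\Psi_j(x, y)$ on a toy labeled corpus. The corpus, the tokenization, and the guard against $DF(j, y) = 0$ (a case the text leaves implicit) are all assumptions made for illustration:

```python
import math
from collections import Counter

# Toy labeled training corpus: (tokenized document, topic) pairs.
# The documents and topics are illustrative assumptions.
corpus = [
    (["ball", "goal", "match"], "sports"),
    (["election", "vote", "poll"], "politics"),
    (["goal", "vote", "debate"], "politics"),
]

def tf(j, x):
    """TF(j, x): number of times word j appears in document x."""
    return Counter(x)[j]

def df(j, y):
    """DF(j, y): occurrences of word j in training documents
    that are NOT about topic y."""
    return sum(tf(j, doc) for doc, topic in corpus if topic != y)

def psi_j(j, x, y):
    """Psi_j(x, y) = TF(j, x) * log(m / DF(j, y)), with m = |corpus|."""
    m = len(corpus)
    d = max(df(j, y), 1)  # floor at 1 to avoid division by zero (an assumption)
    return tf(j, x) * math.log(m / d)

x = ["goal", "goal", "match"]
# "goal" appears twice in x and once off-topic, so its weight for
# "sports" is 2 * log(3 / 1); a word never seen off-topic would score higher.
print(psi_j("goal", x, "sports"))
```

The $\log(m / DF(j, y))$ factor downweights words that are common in other topics, which is exactly the prior knowledge the construction is meant to encode.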
