Pattern Recognition and Machine Learning

4.2. Probabilistic Generative Models

The terms in the log likelihood function that depend on \(\pi\) are

\[
\sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}. \tag{4.72}
\]

Setting the derivative with respect to \(\pi\) equal to zero and rearranging, we obtain

\[
\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} \tag{4.73}
\]

where \(N_1\) denotes the total number of data points in class \(\mathcal{C}_1\), and \(N_2\) denotes the total number of data points in class \(\mathcal{C}_2\). Thus the maximum likelihood estimate for \(\pi\) is simply the fraction of points in class \(\mathcal{C}_1\), as expected. This result is easily generalized to the multiclass case where again the maximum likelihood estimate of the prior probability associated with class \(\mathcal{C}_k\) is given by the fraction of the training set points assigned to that class (Exercise 4.9).
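The differentiation step behind (4.73) is brief and the text leaves it implicit; the following one-line reconstruction is our own sketch, not text from the book:

\[
\frac{\partial}{\partial \pi} \sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}
= \frac{1}{\pi} \sum_{n=1}^{N} t_n - \frac{1}{1 - \pi} \sum_{n=1}^{N} (1 - t_n)
= \frac{N_1}{\pi} - \frac{N_2}{1 - \pi}.
\]

Setting this to zero gives \(N_1 (1 - \pi) = N_2 \pi\), and hence \(\pi = N_1/(N_1 + N_2) = N_1/N\).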
Now consider the maximization with respect to \(\boldsymbol{\mu}_1\). Again we can pick out of the log likelihood function those terms that depend on \(\boldsymbol{\mu}_1\), giving

\[
\sum_{n=1}^{N} t_n \ln \mathcal{N}(\mathbf{x}_n \,|\, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2} \sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) + \mathrm{const}. \tag{4.74}
\]

Setting the derivative with respect to \(\boldsymbol{\mu}_1\) to zero and rearranging, we obtain

\[
\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n \tag{4.75}
\]

which is simply the mean of all the input vectors \(\mathbf{x}_n\) assigned to class \(\mathcal{C}_1\). By a similar argument, the corresponding result for \(\boldsymbol{\mu}_2\) is given by

\[
\boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) \mathbf{x}_n \tag{4.76}
\]

which again is the mean of all the input vectors \(\mathbf{x}_n\) assigned to class \(\mathcal{C}_2\).
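In code, the estimates (4.73), (4.75), and (4.76) reduce to a class count and two per-class averages. The following NumPy sketch is our own illustration, not code from the book; the names X (the \(N \times D\) matrix of inputs) and t (the binary target vector) are assumptions:

import numpy as np

def fit_prior_and_means(X, t):
    # X: (N, D) array of input vectors x_n
    # t: (N,) binary array, t_n = 1 for class C1 and t_n = 0 for class C2
    N = len(t)
    N1 = t.sum()                  # number of points in class C1
    pi = N1 / N                   # (4.73): fraction of points in C1
    mu1 = X[t == 1].mean(axis=0)  # (4.75): mean of the inputs in C1
    mu2 = X[t == 0].mean(axis=0)  # (4.76): mean of the inputs in C2
    return pi, mu1, mu2

The boolean masks exploit the fact that \(t_n \in \{0, 1\}\): averaging X[t == 1] is the same as summing \(t_n \mathbf{x}_n\) over all \(n\) and dividing by \(N_1\).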
Finally, consider the maximum likelihood solution for the shared covariance matrix \(\boldsymbol{\Sigma}\). Picking out the terms in the log likelihood function that depend on \(\boldsymbol{\Sigma}\), we have


\[
\begin{aligned}
-\frac{1}{2} \sum_{n=1}^{N} t_n \ln |\boldsymbol{\Sigma}| &- \frac{1}{2} \sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) \\
- \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) \ln |\boldsymbol{\Sigma}| &- \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) (\mathbf{x}_n - \boldsymbol{\mu}_2)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_2) \\
&= -\frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{N}{2} \operatorname{Tr}\left\{ \boldsymbol{\Sigma}^{-1} \mathbf{S} \right\} \tag{4.77}
\end{aligned}
\]
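Setting the derivative of (4.77) with respect to \(\boldsymbol{\Sigma}\) to zero, using the standard result for the Gaussian maximum likelihood solution, gives \(\boldsymbol{\Sigma} = \mathbf{S}\), where \(\mathbf{S}\) is defined just beyond this excerpt as the weighted average of the two within-class covariance matrices. Assuming that definition, a minimal NumPy sketch, continuing the hypothetical names used above:

def fit_shared_covariance(X, t, mu1, mu2):
    # Pooled covariance S = (N1/N) S1 + (N2/N) S2, where Sk is the
    # sample covariance of the class-Ck inputs about mu_k (assumed
    # definition; it follows (4.77) in the text).
    N = len(t)
    d1 = X[t == 1] - mu1  # deviations of C1 points from mu1
    d2 = X[t == 0] - mu2  # deviations of C2 points from mu2
    # The N_k/N weights cancel the 1/N_k normalizers inside S1 and S2,
    # leaving a single pooled sum divided by N:
    return (d1.T @ d1 + d2.T @ d2) / N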