Understanding Machine Learning: From Theory to Algorithms

348 Generative Models

vector of features $x = (x_1, \ldots, x_d)$. But now the generative assumption is as follows. First, we assume that $P[Y = 1] = P[Y = 0] = 1/2$. Second, we assume that the conditional probability of $X$ given $Y$ is a Gaussian distribution. Finally, the covariance matrix of the Gaussian distribution is the same for both values of the label. Formally, let $\mu_0, \mu_1 \in \mathbb{R}^d$ and let $\Sigma$ be a covariance matrix. Then, the density distribution is given by

$$
P[X = x \mid Y = y] = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_y)^T \Sigma^{-1} (x - \mu_y)\right).
$$

As we have shown in the previous section, using Bayes’ rule we can write

$$
h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} \; P[Y = y]\, P[X = x \mid Y = y].
$$

This means that we will predict $h_{\text{Bayes}}(x) = 1$ iff

$$
\log\left(\frac{P[Y = 1]\, P[X = x \mid Y = 1]}{P[Y = 0]\, P[X = x \mid Y = 0]}\right) > 0.
$$

This ratio is often called the log-likelihood ratio. In our case, the log-likelihood ratio becomes

$$
\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) - \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1).
$$
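We can rewrite this as $\langle w, x \rangle + b$, where

$$
w = (\mu_1 - \mu_0)^T \Sigma^{-1} \quad \text{and} \quad b = \frac{1}{2}\left(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1\right). \tag{24.8}
$$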
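The step from the log-likelihood ratio to the linear form is not spelled out in the text. One way to check it, using only that $\Sigma^{-1}$ is symmetric (so that $x^T \Sigma^{-1} \mu = \mu^T \Sigma^{-1} x$), is to expand both quadratic forms. The equal priors and the shared normalization constant $(2\pi)^{d/2}|\Sigma|^{1/2}$ have already cancelled in the ratio, so only the exponents remain:

$$
\begin{aligned}
&\tfrac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) - \tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \\
&\quad = \tfrac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0\right) - \tfrac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1\right) \\
&\quad = (\mu_1 - \mu_0)^T \Sigma^{-1} x + \tfrac{1}{2}\left(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1\right).
\end{aligned}
$$

The quadratic term $x^T \Sigma^{-1} x$ cancels precisely because both classes share the same covariance matrix, which is why the resulting classifier is linear in $x$.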

As a result of the preceding derivation we obtain that, under the aforementioned generative assumptions, the Bayes optimal classifier is a linear classifier. Additionally, one may train the classifier by estimating the parameters $\mu_0$, $\mu_1$, and $\Sigma$ from the data, using, for example, the maximum likelihood estimator. With those estimators at hand, the values of $w$ and $b$ can be calculated as in Equation (24.8).
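The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `fit_gaussian_linear_classifier` and `predict` are hypothetical, labels are assumed to be in {0, 1} with both classes present, and the estimated covariance is assumed to be invertible. The shared covariance is estimated by averaging the outer products of the residuals around each example's own class mean, which is the maximum likelihood estimate under the shared-covariance assumption.

```python
import numpy as np


def fit_gaussian_linear_classifier(X, y):
    """Estimate mu_0, mu_1 and a shared Sigma, then return (w, b) as in Eq. (24.8).

    Hypothetical helper: assumes labels in {0, 1}, both classes present,
    and an invertible estimated covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Maximum likelihood estimates of the two class means.
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)

    # Shared covariance: residuals are taken around each example's class mean.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centered.T @ centered / X.shape[0]

    # Solve linear systems rather than forming Sigma^{-1} explicitly.
    sinv_mu0 = np.linalg.solve(sigma, mu0)
    sinv_mu1 = np.linalg.solve(sigma, mu1)

    # Sigma is symmetric, so Sigma^{-1}(mu1 - mu0) equals ((mu1 - mu0)^T Sigma^{-1})^T.
    w = sinv_mu1 - sinv_mu0
    b = 0.5 * (mu0 @ sinv_mu0 - mu1 @ sinv_mu1)
    return w, b


def predict(w, b, X):
    """Predict 1 iff <w, x> + b > 0, i.e. iff the log-likelihood ratio is positive."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)


# Small usage example on two well-separated clusters.
X_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                   [4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w_demo, b_demo = fit_gaussian_linear_classifier(X_demo, y_demo)
print(w_demo, b_demo, predict(w_demo, b_demo, X_demo))
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a numerical design choice of this sketch; the quantities computed are the same $w$ and $b$ as in Equation (24.8).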

24.4 Latent Variables and the EM Algorithm

In generative models we assume that the data is generated by sampling from a specific parametric distribution over our instance space $\mathcal{X}$. Sometimes, it is convenient to express this distribution using latent random variables. A natural example is a mixture of $k$ Gaussian distributions. That is, $\mathcal{X} = \mathbb{R}^d$ and we assume that each $x$ is generated as follows. First, we choose a random number in $\{1, \ldots, k\}$. Let $Y$ be a random variable corresponding to this choice, and denote $P[Y = y] = c_y$. Second, we choose $x$ on the basis of the value of $Y$ according to a Gaussian distribution

$$
P[X = x \mid Y = y] = \frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)\right). \tag{24.9}
$$
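The two-step generative process above translates directly into a sampler. The following is a minimal sketch under stated assumptions: the function name `sample_gaussian_mixture` and its parameters are hypothetical, components are indexed from 0 to k−1 rather than 1 to k (to match Python indexing), the weights `c` are assumed to sum to 1, and each covariance matrix is assumed to be symmetric positive semidefinite.

```python
import numpy as np


def sample_gaussian_mixture(n, c, mus, sigmas, seed=0):
    """Draw n points from a mixture of k Gaussians via the latent variable Y.

    Hypothetical helper: c[j] plays the role of c_y, mus[j] of mu_y and
    sigmas[j] of Sigma_y, with components indexed 0..k-1.
    """
    rng = np.random.default_rng(seed)
    c = np.asarray(c, dtype=float)
    mus = np.asarray(mus, dtype=float)
    k, d = mus.shape

    # Step 1: choose the latent component Y with P[Y = j] = c[j].
    y = rng.choice(k, size=n, p=c)

    # Step 2: draw each x from the Gaussian selected by its latent Y (Eq. (24.9)).
    x = np.empty((n, d))
    for j in range(k):
        idx = np.flatnonzero(y == j)
        if idx.size:  # skip components that were never chosen
            x[idx] = rng.multivariate_normal(mus[j], sigmas[j], size=idx.size)
    return x, y


# Usage: two components in R^2 with weights 0.3 and 0.7.
x_demo, y_demo = sample_gaussian_mixture(
    1000, [0.3, 0.7], [[0.0, 0.0], [10.0, 10.0]], [np.eye(2), np.eye(2)]
)
print(x_demo.shape, np.bincount(y_demo))
```

The sampler returns the latent labels alongside the points only for inspection; in the latent-variable setting the learner would observe `x` alone.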

