Understanding Machine Learning: From Theory to Algorithms




As before, given a specific value of θ, it is assumed that the conditional proba-
bility, P[X=x|θ], is known. In the drug company example, X takes values in
{0, 1} and P[X=x|θ] = θ^x (1−θ)^{1−x}.
Once the prior distribution over θ and the conditional distribution over X
given θ are defined, we again have complete knowledge of the distribution over
X. This is because we can write the probability over X as a marginal probability

P[X=x] = \sum_{\theta} P[X=x, \theta] = \sum_{\theta} P[\theta] \, P[X=x|\theta],

where the last equality follows from the definition of conditional probability. If
θ is continuous we replace P[θ] with the density function and the sum becomes
an integral:

P[X=x] = \int_{\theta} P[\theta] \, P[X=x|\theta] \, d\theta.

Seemingly, once we know P[X=x], a training set S = (x_1, ..., x_m) tells us
nothing as we are already experts who know the distribution over a new point
X. However, the Bayesian view introduces dependency between S and X. This is
because we now refer to θ as a random variable. A new point X and the previous
points in S are independent only conditioned on θ. This is different from the
frequentist philosophy in which θ is a parameter that we might not know, but
since it is just a parameter of the distribution, a new point X and previous points
S are always independent.
In the Bayesian framework, since X and S are not independent anymore, what
we would like to calculate is the probability of X given S, which by the chain
rule can be written as follows:

P[X=x|S] = \sum_{\theta} P[X=x|\theta, S] \, P[\theta|S] = \sum_{\theta} P[X=x|\theta] \, P[\theta|S].

The second equality follows from the assumption that X and S are independent
when we condition on θ. Using Bayes' rule we have

P[\theta|S] = \frac{P[S|\theta] \, P[\theta]}{P[S]},

and together with the assumption that points are independent conditioned on θ,
we can write

P[\theta|S] = \frac{P[S|\theta] \, P[\theta]}{P[S]} = \frac{1}{P[S]} \prod_{i=1}^{m} P[X=x_i|\theta] \, P[\theta].
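
Continuing the numerical sketch above (a uniform prior on [0,1] and a hypothetical sample S, both illustrative assumptions), the posterior P[θ|S] can be approximated on the same kind of grid:

```python
import numpy as np

# Hypothetical sample from the drug company example: 1 = the drug worked, 0 = it did not.
S = np.array([1, 0, 1, 1, 0, 1, 1, 1])
m, k = len(S), S.sum()          # sample size and number of ones

# Grid of theta values and an assumed uniform prior density on [0, 1] (illustrative choice).
theta = np.linspace(0.0, 1.0, 10_001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)

# Likelihood of the whole sample: P[S | theta] = prod_i theta^{x_i} (1 - theta)^{1 - x_i}.
lik_S = theta**k * (1.0 - theta)**(m - k)

# Posterior density P[theta | S] = P[S | theta] P[theta] / P[S], with P[S] approximated
# by numerically integrating the numerator over theta.
p_S = np.sum(lik_S * prior) * dtheta
posterior = lik_S * prior / p_S

print(theta[np.argmax(posterior)])  # posterior mode; here k / m = 0.75
```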

We therefore obtain the following expression for Bayesian prediction:

P[X=x|S] = \frac{1}{P[S]} \sum_{\theta} P[X=x|\theta] \prod_{i=1}^{m} P[X=x_i|\theta] \, P[\theta].    (24.16)

Getting back to our drug company example, we can rewrite P[X=x|S] as

P[X=x|S] = \frac{1}{P[S]} \int \theta^{\,x + \sum_i x_i} (1-\theta)^{\,1 - x + \sum_i (1 - x_i)} \, P[\theta] \, d\theta.
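
Finally, the Bayesian prediction itself can be approximated numerically. The sketch below again assumes a uniform prior over [0,1] and a hypothetical sample S; under that uniform prior the integral above for x = 1 works out to (∑_i x_i + 1)/(m + 2), which the code uses as a sanity check.

```python
import numpy as np

S = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical outcomes, 1 = the drug worked
m, k = len(S), S.sum()                    # sample size and number of successes

theta = np.linspace(0.0, 1.0, 100_001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)               # assumed uniform prior density on [0, 1]

# Numerator of the prediction for x = 1: integral over theta of
# theta^{1 + sum_i x_i} * (1 - theta)^{0 + sum_i (1 - x_i)} * P[theta].
numerator = np.sum(theta**(1 + k) * (1.0 - theta)**(m - k) * prior) * dtheta

# P[S] is the same integral without the factor contributed by the new point.
p_S = np.sum(theta**k * (1.0 - theta)**(m - k) * prior) * dtheta

p_x1_given_S = numerator / p_S
print(p_x1_given_S)         # approx. 0.7 for this sample
print((k + 1) / (m + 2))    # closed-form value under the uniform prior, for comparison
```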