Understanding Machine Learning: From Theory to Algorithms


Generative Models


To answer this question we need to define how we assess the quality of an approximated solution of the density estimation problem. Unlike discriminative learning, where there is a clear notion of “loss,” in generative learning there are various ways to define the loss of a model. On the basis of the previous subsection, one natural candidate is the expected log-loss as given in Equation (24.5).
In some situations, it is easy to prove that the maximum likelihood principle
guarantees low true risk as well. For example, consider the problem of estimating
the mean of a Gaussian variable of unit variance. We saw previously that the
maximum likelihood estimator is the average: $\hat\mu = \frac{1}{m}\sum_{i} x_i$. Let $\mu^\star$ be the optimal parameter. Then,

$$
\begin{aligned}
\mathop{\mathbb{E}}_{x\sim N(\mu^\star,1)}\bigl[\ell(\hat\mu,x)-\ell(\mu^\star,x)\bigr]
 &= \mathop{\mathbb{E}}_{x\sim N(\mu^\star,1)}\log\!\left(\frac{P_{\mu^\star}[x]}{P_{\hat\mu}[x]}\right)\\
 &= \mathop{\mathbb{E}}_{x\sim N(\mu^\star,1)}\left(-\tfrac{1}{2}(x-\mu^\star)^2+\tfrac{1}{2}(x-\hat\mu)^2\right)\\
 &= \frac{\hat\mu^2}{2}-\frac{(\mu^\star)^2}{2}+(\mu^\star-\hat\mu)\mathop{\mathbb{E}}_{x\sim N(\mu^\star,1)}[x]\\
 &= \frac{\hat\mu^2}{2}-\frac{(\mu^\star)^2}{2}+(\mu^\star-\hat\mu)\,\mu^\star\\
 &= \tfrac{1}{2}(\hat\mu-\mu^\star)^2. && (24.6)
\end{aligned}
$$

Next, we note that $\hat\mu$ is the average of $m$ Gaussian variables and therefore it is also distributed normally with mean $\mu^\star$ and variance $\sigma^\star/m$. From this fact we can derive bounds of the form: with probability of at least $1-\delta$ we have that $|\hat\mu - \mu^\star| \le \epsilon$, where $\epsilon$ depends on $\sigma^\star/m$ and on $\delta$.
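To make this concrete, the following short numerical sketch (the true mean $\mu^\star$, the sample sizes, and the random seed are illustrative choices, not prescribed by the text) computes the maximum likelihood estimate from $m$ samples, evaluates the closed-form excess risk $\frac{1}{2}(\hat\mu-\mu^\star)^2$ from Equation (24.6), and compares it with a Monte Carlo estimate of the expected log-loss difference; both quantities shrink as $m$ grows, in line with the concentration bound above.

```python
# Minimal sketch (illustrative parameters): excess log-loss of the ML estimate
# for a unit-variance Gaussian equals (mu_hat - mu_star)^2 / 2 and shrinks with m.
import numpy as np

rng = np.random.default_rng(0)
mu_star = 1.5  # assumed "true" mean for the illustration

for m in [10, 100, 1000, 10000]:
    sample = rng.normal(loc=mu_star, scale=1.0, size=m)
    mu_hat = sample.mean()  # maximum likelihood estimator: the empirical average

    # Closed-form excess risk from Equation (24.6).
    excess_risk = 0.5 * (mu_hat - mu_star) ** 2

    # Monte Carlo estimate of the same quantity on fresh test points, using the
    # log-loss l(mu, x) = (x - mu)^2 / 2 + const (the constant cancels in the difference).
    test = rng.normal(loc=mu_star, scale=1.0, size=200_000)
    mc = np.mean(0.5 * (test - mu_hat) ** 2 - 0.5 * (test - mu_star) ** 2)

    print(f"m={m:6d}  closed-form={excess_risk:.6f}  monte-carlo={mc:.6f}")
```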
In some situations, the maximum likelihood estimator clearly overfits. For example, consider a Bernoulli random variable $X$ and let $P[X=1] = \theta^\star$. As we saw previously, using Hoeffding’s inequality we can easily derive a guarantee on $|\theta^\star - \hat\theta|$ that holds with high probability (see Equation (24.2)). However, if our goal is to obtain a small value of the expected log-loss function as defined in Equation (24.5) we might fail. For example, assume that $\theta^\star$ is nonzero but very small. Then, the probability that no element of a sample of size $m$ will be 1 is $(1-\theta^\star)^m$, which is greater than $e^{-2\theta^\star m}$. It follows that whenever $m \le \frac{\log(2)}{2\theta^\star}$, the probability that the sample is all zeros is at least 50%, and in that case, the maximum likelihood rule will set $\hat\theta = 0$. But the true risk of the estimate $\hat\theta = 0$ is


$$
\begin{aligned}
\mathop{\mathbb{E}}_{x\sim\theta^\star}[\ell(\hat\theta,x)]
 &= \theta^\star\,\ell(\hat\theta,1)+(1-\theta^\star)\,\ell(\hat\theta,0)\\
 &= \theta^\star\log(1/\hat\theta)+(1-\theta^\star)\log(1/(1-\hat\theta))\\
 &= \theta^\star\log(1/0)=\infty.
\end{aligned}
$$

This simple example shows that we should be careful in applying the maximum
likelihood principle.
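This failure mode is easy to reproduce numerically. The sketch below draws a Bernoulli sample with a small $\theta^\star$ and a sample size at the boundary $m \approx \log(2)/(2\theta^\star)$ (both values are illustrative assumptions), computes the maximum likelihood estimate, and evaluates the true expected log-loss, which is infinite whenever the sample happens to contain no 1s.

```python
# Minimal sketch (illustrative theta_star and seed): with m <= log(2)/(2*theta_star),
# an all-zeros sample is likely, the ML rule sets theta_hat = 0, and the true
# expected log-loss of that estimate is infinite.
import numpy as np

rng = np.random.default_rng(0)
theta_star = 0.01                       # assumed small but nonzero true parameter
m = int(np.log(2) / (2 * theta_star))   # sample size in the "dangerous" regime

all_zero_prob = (1 - theta_star) ** m   # probability that the sample contains no 1s
print(f"m={m}, P(sample is all zeros)={all_zero_prob:.3f}")  # at least 0.5

sample = rng.binomial(1, theta_star, size=m)
theta_hat = sample.mean()               # maximum likelihood estimate

def true_log_loss(theta_hat, theta_star):
    # Expected log-loss E_{x ~ theta_star}[log(1 / P_theta_hat[x])]; it is infinite
    # whenever theta_hat assigns zero probability to an outcome that can occur.
    if theta_hat == 0.0 or theta_hat == 1.0:
        return float("inf")
    return theta_star * np.log(1 / theta_hat) + (1 - theta_star) * np.log(1 / (1 - theta_hat))

print(f"theta_hat={theta_hat}, true expected log-loss={true_log_loss(theta_hat, theta_star)}")
```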
To overcome overfitting, we can use the variety of tools we encountered pre-