Bandit Algorithms


14.2 The relative entropy


Suppose that Alice and Bob agree to use a code that is optimal when $X$ is sampled from distribution $Q$. Unbeknownst to them, however, $X$ is actually sampled from distribution $P$. The relative entropy between $P$ and $Q$ measures how much longer the messages are expected to be using the optimal code for $Q$ than what would be obtained using the optimal code for $P$. Letting $p_i = P(X = i)$ and $q_i = Q(X = i)$ and assuming Shannon coding, the expected code length under $P$ is $\sum_i p_i \lceil \log(1/q_i) \rceil$ for the code optimized for $Q$ and $\sum_i p_i \lceil \log(1/p_i) \rceil$ for the code optimized for $P$; dropping the ceilings $\lceil \cdot \rceil$ and taking the difference leads to the definition of relative entropy as

\[
D(P, Q) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{q_i}\right) - \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{p_i}\right) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{p_i}{q_i}\right). \tag{14.4}
\]
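To make (14.4) concrete, here is a minimal Python sketch (ours, not from the book; the helper `relative_entropy` is a hypothetical name) that evaluates the sum for distributions on $[N]$ given as probability vectors:

```python
import math

def relative_entropy(p, q):
    """D(P, Q) per (14.4): sum of p_i * log(p_i / q_i) over i with p_i > 0."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            d += pi * math.log(pi / qi)
    return d

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(relative_entropy(p, q))  # log 2, about 0.693
print(relative_entropy(q, p))  # inf: q_3 > 0 but p_3 = 0
```

Note the asymmetry: $D(P, Q)$ is finite while $D(Q, P) = \infty$, which matches the absolute continuity discussion below.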


From the coding interpretation one conjectures that $D(P, Q) \geq 0$. Indeed, this is easy to verify using Jensen's inequality.
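In detail, since $\log$ is concave, applying Jensen's inequality to the sum over $\{i : p_i > 0\}$ gives

\[
-D(P, Q) = \sum_{i : p_i > 0} p_i \log\left(\frac{q_i}{p_i}\right) \leq \log\left(\sum_{i : p_i > 0} p_i \cdot \frac{q_i}{p_i}\right) = \log\left(\sum_{i : p_i > 0} q_i\right) \leq \log 1 = 0.
\]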
Still poking around the definition, what happens when $q_i = 0$ and $p_i = 0$? This means that symbol $i$ is superfluous, and the value of $D(P, Q)$ should not be impacted by introducing superfluous symbols. And indeed it is not, since the sums in (14.4) are restricted to indices with $p_i > 0$. We also see that the necessary and sufficient condition for $D(P, Q) < \infty$ is that for each $i$ with $q_i = 0$ we also have $p_i = 0$. The condition we discovered is equivalent to saying that $P$ is absolutely continuous with respect to $Q$. Note that absolute continuity only implies a finite relative entropy when $X$ takes on finitely many values (Exercise 14.2).
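For a flavor of what can go wrong with countably infinite support (the subject of Exercise 14.2, so treat this as a hint): take $q_i = 2^{-i}$ and $p_i = 6/(\pi^2 i^2)$ for $i \in \{1, 2, \dots\}$. Then $P$ is absolutely continuous with respect to $Q$, yet $\log(p_i/q_i)$ grows linearly in $i$ while $\sum_i i \, p_i = \infty$, so $D(P, Q) = \infty$.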
This brings us back to defining relative entropy between probability measures $P$ and $Q$ on arbitrary measurable spaces $(\Omega, \mathcal{F})$. When the support of $P$ is uncountable, defining the entropy via communication is hard because infinitely many symbols are needed to describe some outcomes. This seems to be a fundamental difficulty. Luckily, the impasse resolves itself if we only consider relative entropy. While we cannot communicate the outcome itself, for any finite discretization of the possible outcomes the discretized values can be communicated finitely, and all our definitions still work. Formally, a discretization to $[N]$ is specified by an $\mathcal{F}/2^{[N]}$-measurable map $X : \Omega \to [N]$. Then the entropy of $P$ relative to $Q$ can be defined as

\[
D(P, Q) = \sup_{N \in \mathbb{N}^+} \sup_{X} D(P_X, Q_X), \tag{14.5}
\]


where $P_X$ is the pushforward of $P$ on $[N]$, defined by $P_X(A) = P(X \in A)$, and the inner supremum is over all $\mathcal{F}/2^{[N]}$-measurable maps $X : \Omega \to [N]$. Informally, we take all possible discretizations $X$ (with no limit on the 'fineness' of the discretization) and define $D(P, Q)$ as the excess information when we expect to see $X \sim Q_X$ while in reality $X \sim P_X$. As we shall soon see, this is indeed a reasonable definition.
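As a numerical sanity check on (14.5), the following sketch (ours, not from the book) discretizes two unit-variance Gaussians into equal-width bins and evaluates $D(P_X, Q_X)$; refining the bins drives the value up toward the closed-form relative entropy $(\mu_p - \mu_q)^2 / (2\sigma^2)$, here $0.5$. Equal-width interval bins are just one convenient family of discretizations; the supremum in (14.5) ranges over all measurable maps.

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discretized_kl(mu_p, mu_q, sigma, n_bins, lo=-10.0, hi=10.0):
    # D(P_X, Q_X) for the map X sending an outcome to its bin index among
    # n_bins equal-width bins on [lo, hi]; the mass beyond [lo, hi] is
    # negligible for these parameters.
    edges = [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]
    d = 0.0
    for a, b in zip(edges, edges[1:]):
        pi = norm_cdf(b, mu_p, sigma) - norm_cdf(a, mu_p, sigma)
        qi = norm_cdf(b, mu_q, sigma) - norm_cdf(a, mu_q, sigma)
        if pi > 0.0 and qi > 0.0:  # skip bins whose mass underflows to zero
            d += pi * math.log(pi / qi)
    return d

for n in (2, 8, 32, 128, 512):
    # Values increase with refinement toward (0 - 1)^2 / 2 = 0.5.
    print(n, discretized_kl(0.0, 1.0, 1.0, n))
```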

Theorem 14.1. Let $(\Omega, \mathcal{F})$ be a measurable space and let $P$ and $Q$ be measures