36.4 Information theoretic analysis
from context we write $H_P$. Let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. Then the conditional entropy of $X$ given $\mathcal{G}$ is
$$
H(X \,|\, \mathcal{G}) = E\left[\sum_{x \in \operatorname{range}(X)} P(X = x \,|\, \mathcal{G}) \log\left(\frac{1}{P(X = x \,|\, \mathcal{G})}\right)\right].
$$
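For a concrete instance of the definition, suppose $X$ is uniform on $\{1, 2, 3, 4\}$ and $\mathcal{G} = \sigma(\{X \leq 2\})$. On either atom of $\mathcal{G}$ the conditional distribution of $X$ is uniform over two values, so $P(X = x \,|\, \mathcal{G}) = 1/2$ for the two admissible values of $x$ and the inner sum equals $\log 2$ on every atom. Hence
$$
H(X \,|\, \mathcal{G}) = E[\log 2] = \log 2,
$$
while $H(X) = \log 4$; observing $\mathcal{G}$ halves the number of bits needed to encode $X$.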
Despite the nomenclature and the notation, $H(X \,|\, \mathcal{G})$ is a real value, not a random variable. Noting that $H(X \,|\, \mathcal{G}) = E\big[H_{P(\cdot\,|\,\mathcal{G})}(X)\big]$, a better nomenclature would have been the expected conditional entropy, but we follow the standard convention. Given $P$, the entropy of random variable $X$ is a measure of the amount of information in $X$, while the conditional entropy given $\mathcal{G}$ is the expected amount of information required to encode $X$ having observed the information in $\mathcal{G}$. The mutual information between $X$ and $\mathcal{G}$ is the difference between the entropy and the conditional entropy:
$$
I(X; \mathcal{G}) = H(X) - H(X \,|\, \mathcal{G}).
$$
The mutual information is always nonnegative, which should not be surprising because the information remaining in $X$ can only decrease as more information is observed. Another name for the mutual information is information gain.
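These quantities are straightforward to compute when $X$ is discrete and $\mathcal{G} = \sigma(Y)$ for a discrete $Y$. The Python sketch below (the joint pmf is an arbitrary choice made only for illustration) evaluates $H(X)$, $H(X \,|\, \sigma(Y))$ and $I(X; \sigma(Y))$ directly from the definitions above; the mutual information comes out nonnegative, as claimed.

```python
import math

# Arbitrary joint pmf P(X = x, Y = y), chosen only to illustrate the definitions.
joint = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.15, (1, 1): 0.45}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # marginal of X
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # marginal of Y

# H(X) = sum_x P(X = x) log(1 / P(X = x))
H_X = sum(p * math.log(1 / p) for p in p_x.values() if p > 0)

# H(X | sigma(Y)) = E[ sum_x P(X = x | Y) log(1 / P(X = x | Y)) ],
# the expectation being a weighted sum over the atoms {Y = y}.
H_X_given_Y = sum(
    p_y[y] * sum((joint[(x, y)] / p_y[y]) * math.log(p_y[y] / joint[(x, y)])
                 for x in xs if joint[(x, y)] > 0)
    for y in ys)

# I(X; sigma(Y)) = H(X) - H(X | sigma(Y)), the information gain.
I_XY = H_X - H_X_given_Y
assert I_XY >= 0
print(H_X, H_X_given_Y, I_XY)
```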
As with the entropy, we use these forms when the underlying measure is clear from context. If this is not the case, then the measure is shown in the subscript: $H(X) = H_P(X)$, $H(X \,|\, \mathcal{G}) = H_P(X \,|\, \mathcal{G})$, and $I(X; \mathcal{G}) = I_P(X; \mathcal{G})$. The following lemmas provide a chain rule for the mutual information and a simple connection to the relative entropy. The proofs are definitional and are left to Exercise 36.7.
Lemma 36.5. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and $(\mathcal{F}_t)_{t=0}^n$ a filtration of $\mathcal{F}$ with $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $P_t(\cdot) = P(\cdot \,|\, \mathcal{F}_t)$. Then,
$$
E\left[\sum_{t=1}^{n} I_{P_{t-1}}(X; \mathcal{F}_t)\right] = I_P(X; \mathcal{F}_n).
$$
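The lemma can be checked numerically on a toy model. In the Python sketch below (a model of our own choosing), $X$ is a fair coin, $Y_1, Y_2$ are conditionally independent noisy copies of $X$ with flip probability eps, and $\mathcal{F}_t = \sigma(Y_1, \dots, Y_t)$. Under $P_{t-1}$ the values of $Y_1, \dots, Y_{t-1}$ are fixed almost surely, so $I_{P_{t-1}}(X; \mathcal{F}_t)$ reduces to the mutual information between $X$ and $Y_t$ under the conditional measure.

```python
import math

def mutual_info(joint):
    """I(A; B) for a joint pmf given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Toy model: X is a fair coin; Y1, Y2 are conditionally independent noisy copies
# of X, each flipped with probability eps.  F_1 = sigma(Y1), F_2 = sigma(Y1, Y2).
eps = 0.2
def obs(y, x):                      # P(Y_t = y | X = x)
    return 1 - eps if y == x else eps

joint = {(x, (y1, y2)): 0.5 * obs(y1, x) * obs(y2, x)
         for x in (0, 1) for y1 in (0, 1) for y2 in (0, 1)}

# Right-hand side of the lemma: I_P(X; F_2) = I(X; (Y1, Y2)).
rhs = mutual_info(joint)

# t = 1 term: P_0 = P, so I_{P_0}(X; F_1) = I(X; Y1) under the original measure.
joint_x_y1 = {}
for (x, (y1, _)), p in joint.items():
    joint_x_y1[(x, y1)] = joint_x_y1.get((x, y1), 0.0) + p
term1 = mutual_info(joint_x_y1)

# t = 2 term: E[I_{P_1}(X; F_2)].  Under P_1 = P(. | Y1 = y1) the value of Y1 is
# fixed, so the mutual information with F_2 reduces to that between X and Y2.
term2 = 0.0
for y1 in (0, 1):
    p_y1 = sum(p for (_, (a, _)), p in joint.items() if a == y1)
    cond = {(x, y2): p / p_y1 for (x, (a, y2)), p in joint.items() if a == y1}
    term2 += p_y1 * mutual_info(cond)

assert abs((term1 + term2) - rhs) < 1e-9   # the chain rule of Lemma 36.5
```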
For the next lemma, for random variables $X, Y$ defined on a common probability space we introduce the abbreviation $I(X; Y) = I(X; \sigma(Y))$.
Lemma 36.6. Let $X$ and $Y$ be random variables on probability space $(\Omega, \mathcal{F}, P)$. If $X$ is discrete, then $I(X; Y) = E\big[D(P(Y \in \cdot \,|\, X), P(Y \in \cdot))\big]$.
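A quick numerical check of this identity, again with an arbitrary joint pmf chosen only for illustration: the left-hand side is computed from the entropy definitions, the right-hand side as an average over $x$ of relative entropies between the conditional and marginal laws of $Y$.

```python
import math

# Arbitrary joint pmf P(X = x, Y = y), chosen only for illustration.
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}
xs, ys = (0, 1), (0, 1)
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Left-hand side: I(X; Y) = H(X) - H(X | sigma(Y)).
H_X = sum(p * math.log(1 / p) for p in p_x.values())
H_X_given_Y = sum(
    p_y[y] * sum((joint[(x, y)] / p_y[y]) * math.log(p_y[y] / joint[(x, y)])
                 for x in xs)
    for y in ys)
lhs = H_X - H_X_given_Y

# Right-hand side: E[ D(P(Y in . | X), P(Y in .)) ], the expected relative
# entropy between the conditional law of Y given X and the marginal law of Y.
rhs = sum(
    p_x[x] * sum((joint[(x, y)] / p_x[x]) * math.log((joint[(x, y)] / p_x[x]) / p_y[y])
                 for y in ys)
    for x in xs)

assert abs(lhs - rhs) < 1e-9   # the identity of Lemma 36.6
```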
36.4.2 Bayesian adversarial bandits
In the adversarial bandit model studied in Part III the adversary secretly chooses a matrix $x \in [0, 1]^{n \times k}$ at the start of the game. The reward in round $t$ is $x_{t A_t}$. A Bayesian spin on this setup works as follows. The prior $Q$ is a probability measure on $([0, 1]^{n \times k}, \mathcal{B}([0, 1]^{n \times k}))$, which is known to the learner. At the start of the game a reward matrix $X$ is sampled from $Q$, but not revealed. The learner then chooses actions $(A_t)_{t=1}^n$ and the reward in round $t$ is $X_{t A_t}$. Formally, let