Bandit Algorithms

26.3 Bregman divergence

Figure 26.2: The Bregman divergence $D_f(x, y)$, shown as the gap at $x$ between $f(x)$ and the tangent $f(y) + \langle \nabla f(y), x - y \rangle$.

where $\nabla f(y) \in \mathbb{R}^d$ is the gradient of $f$ at $y$. To get a sense of the divergence function $D_f$, note that $D_f(x, y)$ is the difference between $f(x)$ and its first-order Taylor expansion about the point $y$. Since $f$ is convex, the linear approximation of $f$ is a lower bound on $f$ (why?), and so $D_f(x, y)$ is nonnegative over its domain with $D_f(x, x) = 0$.
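
The definition translates directly into code. The following is a minimal numerical sketch (assuming NumPy; the names `bregman`, `f` and `grad_f` are illustrative, not from the text), which also spot-checks the nonnegativity just argued:

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Spot check with the convex function f(x) = sum_i x_i^4 (illustrative choice):
f = lambda x: np.sum(x**4)
grad_f = lambda x: 4 * x**3
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert bregman(f, grad_f, x, y) >= 0            # D_f(x, y) >= 0
assert np.isclose(bregman(f, grad_f, x, x), 0)  # D_f(x, x) = 0
```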


Theorem 26.3. The following hold:

(a) $D_f(x, y) \ge 0$ for all $x, y \in A$;
(b) $D_f(x, x) = 0$ for all $x \in A$;
(c) $D_f(x, y)$ is convex as a function of $x$.


The square root of the Bregman divergence shares many properties with a
metric, and for some choices of $f$ it actually is a metric. In general, however, it is
not symmetric and does not satisfy the triangle inequality.
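
The failure of symmetry is easy to witness numerically. Here is a one-dimensional sketch (the choice $f(x) = x^4$ is illustrative, not from the text):

```python
# Bregman divergence of f(x) = x**4 in one dimension: grad f(y) = 4*y**3.
d = lambda x, y: x**4 - y**4 - 4 * y**3 * (x - y)
print(d(1.0, 0.0))  # 1.0
print(d(0.0, 1.0))  # 3.0 != 1.0, so D_f is not symmetric
```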

Example 26.4. Let $f(x) = \frac{1}{2}\|x\|_2^2$. Then $\nabla f(x) = x$ and
\[
D_f(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \frac{1}{2}\|x - y\|_2^2.
\]
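
A quick numerical confirmation of this identity (a sketch assuming NumPy; the variable names are illustrative):

```python
import numpy as np

f = lambda x: 0.5 * np.dot(x, x)      # f(x) = (1/2)||x||_2^2, so grad f(y) = y
x, y = np.array([3.0, -1.0]), np.array([1.0, 2.0])
lhs = f(x) - f(y) - np.dot(y, x - y)  # D_f(x, y) from the definition
rhs = 0.5 * np.dot(x - y, x - y)      # (1/2)||x - y||_2^2
assert np.isclose(lhs, rhs)
```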

Example 26.5. Let $A = [0, \infty)^d$, $\operatorname{dom}(f) = A$ and $f(x) = \sum_{i=1}^d (x_i \log(x_i) - x_i)$.
Then $\nabla f(x) = \log(x)$ and
\begin{align*}
D_f(x, y) &= \sum_{i=1}^d (x_i \log(x_i) - x_i) - \sum_{i=1}^d (y_i \log(y_i) - y_i) - \sum_{i=1}^d \log(y_i)(x_i - y_i) \\
          &= \sum_{i=1}^d x_i \log\left(\frac{x_i}{y_i}\right) + \sum_{i=1}^d (y_i - x_i).
\end{align*}

Notice that if $x, y \in \mathcal{P}_{d-1}$ are in the unit simplex, then $D_f(x, y)$ is the relative
entropy between probability vectors $x$ and $y$. The function $f$ is called the
unnormalized negentropy, which will feature heavily in many of the chapters
that follow.
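
As a sanity check, the reduction to relative entropy on the simplex can be verified numerically; there the correction term $\sum_{i=1}^d (y_i - x_i)$ vanishes because both vectors sum to one. A sketch assuming NumPy (the name `negentropy_div` is illustrative):

```python
import numpy as np

def negentropy_div(x, y):
    """D_f(x, y) for f(x) = sum_i (x_i log x_i - x_i), with x, y > 0 elementwise."""
    return np.sum(x * np.log(x / y)) + np.sum(y - x)

x = np.array([0.2, 0.5, 0.3])   # two points in the unit simplex
y = np.array([0.6, 0.1, 0.3])
kl = np.sum(x * np.log(x / y))  # relative entropy between x and y
assert np.isclose(negentropy_div(x, y), kl)  # correction term sums to zero
```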