is obsessed with convex functions because all local minima are global (see Fig. 26.1). This means that minimizing a convex function is usually possible (efficiently) using some variation of gradient descent. A function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is concave if $-f$ is convex.
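To make the gradient descent remark concrete, here is a minimal sketch (not from the text; the quadratic objective, step size and iteration count are arbitrary choices) showing gradient descent reaching the global minimizer of a convex function:

    import numpy as np

    # Gradient descent on the convex quadratic f(x) = 0.5 x^T A x - b^T x,
    # where A is positive definite, so the unique minimizer solves A x = b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    def grad_f(x):
        return A @ x - b

    x = np.zeros(2)
    step = 0.1                      # fixed step size, small enough for this A
    for _ in range(1000):
        x = x - step * grad_f(x)

    print(x, np.linalg.solve(A, b))  # iterate vs. exact minimizer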
26.2 Jensen’s inequality
One of the most important results for convex functions is Jensen’s inequality.
Theorem 26.2 (Jensen's inequality). Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be a convex function and let $X$ be an $\mathbb{R}^d$-valued random element on some probability space such that $\mathbb{E}[X]$ exists and $X \in \operatorname{dom}(f)$ holds almost surely. Then $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$.
If we allowed Lebesgue integrals to take on the value $+\infty$, the condition that $X$ is almost surely an element of the domain of $f$ could be removed and the result would still be true. Indeed, in this case we would immediately conclude that $\mathbb{E}[f(X)] = +\infty$ and Jensen's inequality would trivially hold.
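As a familiar one-dimensional instance of the theorem (an illustration, not part of the original text), take $f(x) = x^2$, which is convex on all of $\mathbb{R}$. Jensen's inequality then gives
\[
\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2, \qquad \text{equivalently} \qquad \operatorname{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0.
\]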
[Figure: graph of $f$ with support points $x_1, \ldots, x_5$ on the horizontal axis; the point $(\bar{x}, \sum_{k=1}^n p_k f(x_k))$ lies above the point $(\bar{x}, f(\bar{x}))$ on the graph.]
The basic inequality of (26.1) is trivially a special case of Jensen's inequality. Jensen's inequality is so central to convexity that it can actually be used as the definition (a function is convex if and only if it satisfies Jensen's inequality). The proof of Jensen's inequality using Definition 26.1 in full generality is left to the reader (Exercise 26.1). However, we cannot resist including here a simple 'graphical proof' that works in the simple case when $X$ is supported on $x_1, \ldots, x_n$ and $\mathbb{P}(X = x_k) = p_k$. Then, letting $\bar{x} = \sum_{k=1}^n p_k x_k$, one can notice that the point $(\bar{x}, \sum_k p_k f(x_k))$ lies in the convex hull of $\{(x_k, f(x_k))\}_k$, which is a convex subset of the epigraph $E_f \subset \mathbb{R}^{d+1}$. The result follows because $(\bar{x}, f(\bar{x}))$ is on the boundary of $E_f$, as shown in the figure. The direction of Jensen's inequality is reversed if 'convex' is replaced by 'concave'.
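The finite-support case above is also easy to check numerically. The following sketch (the convex function, support points and weights are arbitrary choices, not from the text) verifies $\sum_k p_k f(x_k) \geq f(\bar{x})$ for one example:

    import numpy as np

    # Finite-support check of Jensen's inequality: X takes value x_k
    # with probability p_k, and f is convex (here f(x) = exp(x)).
    x = np.array([-1.0, 0.0, 0.5, 2.0, 3.0])
    p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])   # probabilities, sum to 1

    f = np.exp
    x_bar = np.dot(p, x)          # E[X]
    lhs = np.dot(p, f(x))         # E[f(X)] = sum_k p_k f(x_k)
    rhs = f(x_bar)                # f(E[X])

    print(lhs, rhs, lhs >= rhs)   # Jensen: E[f(X)] >= f(E[X])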
It is not necessary to assume that $f$ is measurable in Theorem 26.2 because all convex functions are measurable (they are continuous on the interior of their domain).
26.3 Bregman divergence
Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and let $x, y \in \mathbb{R}^d$ with $f$ differentiable at $y$. Then the Bregman divergence at $y$ induced by $f$ is defined by
\[
D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle,
\]