is obsessed with convex functions because all local minima are global (see Fig. 26.1). This means that minimizing a convex function is usually possible (efficiently) using some variation of gradient descent. A function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is concave if $-f$ is convex.
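To make the gradient descent remark concrete, here is a minimal sketch (not from the text; the quadratic objective, step size and iteration count are arbitrary choices) showing gradient descent reaching the global minimizer of a convex function:

    import numpy as np

    # Gradient descent on the convex quadratic f(x) = 0.5 x^T A x - b^T x,
    # where A is positive definite, so the unique minimizer solves A x = b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    def grad_f(x):
        return A @ x - b

    x = np.zeros(2)
    step = 0.1                      # fixed step size, small enough for this A
    for _ in range(1000):
        x = x - step * grad_f(x)

    print(x, np.linalg.solve(A, b))  # iterate vs. exact minimizer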
26.2 Jensen’s inequality
One of the most important results for convex functions is Jensen’s inequality.
Theorem 26.2 (Jensen's inequality). Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be a convex function and let $X$ be an $\mathbb{R}^d$-valued random element on some probability space such that $\mathbb{E}[X]$ exists and $X \in \operatorname{dom}(f)$ holds almost surely. Then $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$.
If we allowed Lebesgue integrals to take on the value $+\infty$, the condition that $X$ is almost surely an element of the domain of $f$ could be removed and the result would still be true. Indeed, in this case we would immediately conclude that $\mathbb{E}[f(X)] = +\infty$ and Jensen's inequality would trivially hold.
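As a familiar one-dimensional instance of the theorem (an illustration, not part of the original text), take $f(x) = x^2$, which is convex on all of $\mathbb{R}$. Jensen's inequality then gives
\[
\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2, \qquad \text{equivalently} \qquad \operatorname{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0.
\]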
[Figure: graph of $f$ with support points $x_1, \ldots, x_5$ on the horizontal axis; the point $(\bar{x}, \sum_{k=1}^n p_k f(x_k))$ lies above the point $(\bar{x}, f(\bar{x}))$ on the graph.]
The basic inequality of (26.1) is trivially a special case of Jensen's inequality. Jensen's inequality is so central to convexity that it can actually be used as the definition (a function is convex if and only if it satisfies Jensen's inequality). The proof of Jensen's inequality using Definition 26.1 in full generality is left to the reader (Exercise 26.1). However, we cannot resist including here a simple 'graphical proof' that works in the simple case when $X$ is supported on $x_1, \ldots, x_n$ and $\mathbb{P}(X = x_k) = p_k$. Then, letting $\bar{x} = \sum_{k=1}^n p_k x_k$, one can notice that the point $(\bar{x}, \sum_k p_k f(x_k))$ lies in the convex hull of $\{(x_k, f(x_k))\}_k$, which is a convex subset of the epigraph $E_f \subset \mathbb{R}^{d+1}$. The result follows because $(\bar{x}, f(\bar{x}))$ is on the boundary of $E_f$, as shown in the figure. The direction of Jensen's inequality is reversed if 'convex' is replaced by 'concave'.
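The finite-support case above is also easy to check numerically. The following sketch (the convex function, support points and weights are arbitrary choices, not from the text) verifies $\sum_k p_k f(x_k) \geq f(\bar{x})$ for one example:

    import numpy as np

    # Finite-support check of Jensen's inequality: X takes value x_k
    # with probability p_k, and f is convex (here f(x) = exp(x)).
    x = np.array([-1.0, 0.0, 0.5, 2.0, 3.0])
    p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])   # probabilities, sum to 1

    f = np.exp
    x_bar = np.dot(p, x)          # E[X]
    lhs = np.dot(p, f(x))         # E[f(X)] = sum_k p_k f(x_k)
    rhs = f(x_bar)                # f(E[X])

    print(lhs, rhs, lhs >= rhs)   # Jensen: E[f(X)] >= f(E[X])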
It is not necessary to assume that $f$ is measurable in Theorem 26.2 because all convex functions are measurable (they are continuous on the interior of their domain).
26.3 Bregman divergence
Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and let $x, y \in \mathbb{R}^d$ with $f$ differentiable at $y$. Then the Bregman divergence at $y$ induced by $f$ is defined by
\[
D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle,
\]