Bandit Algorithms

37.6 Upper bound for easy games 464

Ca

Cb C ̃a

v

u

w

Figure 37.4The construction used in the proof of Lemma 37.17.

Suppose that Algorithm 25 is run on local ly observableG= (L,Φ)with parameters

η=^1 V

√

log(k/δ) nk

and γ=V kη and δ∈(0,1).

Then with probability at least 1 −δthe regret is bounded byRˆn≤CG

√

nlog(e/δ)), whereCGis a constant depending onG, but notnorδ.

By choosingδ= 1/nthe following corollary is obtained.

Corollary 37.16. Suppose that Algorithm 25 is run on locally observable G= (L,Φ)with the same choices ofηandγas Theorem 37.15 andδ= 1/n, then there exists a constantCG′ depending onG, but notnsuch that

Rn≤CG′

√

nlog(n).

37.6 Upper bound for easy games

In this section we prove Theorem 37.15. The first step is a simple lemma showing that the regret can be localised to the neighborhood of the played action.

Lemma37.17.There exists a constantεG> 0 depending only onGsuch that for all pairs of actionsa,a ̃∈ Aandu∈C ̃athere exists an actionb∈ Na∩A such that〈à−à ̃,u〉≤〈à−`b,u〉/εG.

Proof Sinceu∈C ̃a, 0≤〈à−à ̃,u〉. The result is trivial ifaand ̃aare neighbors or〈à−à ̃,u〉= 0. From now on assume that〈à−` ̃a,u〉>0 and thata,a ̃are not neighbors. Letvbe the centroid ofCa. The idea is to chooseb∈Na∩Aas that neighbor ofawhose cell is the one that the line segment that connectsv anduenters when leavingCa. To be precise, ifwlies in the intersection of the line segment connectingvanduand the boundary ofCathenbis a neighbor of ainAso thatw∈Ca∩Cb. Note thatwis well defined by the Jordan–Brouwer separation theorem (see the notes at the end of the chapter), andbis well defined becauseAis a maximal duplicate-free subset of the Pareto optimal actions. Using

Bandit Algorithms

√

√

√

37.6 Upper bound for easy games

Get our desktop app

Company

Features

Documentation

Resources