their destination if all the edges in their chosen path are present. This problem
can be formalized by letting A be the set of paths and
\[
\nu_\theta = \Big( B\Big(\prod_{e \in a} \theta_e\Big) : a \in A \Big)
\quad \text{and} \quad
\mathcal{E} = \big\{ \nu_\theta : \theta \in [0,1]^{|E|} \big\}\,.
\]
An important feature of structured bandits is that the learner can often
obtain information about some actions while never playing them.
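To make this construction concrete, here is a minimal sketch (in Python, with made-up edge reliabilities and paths, so all names and values are illustrative assumptions) of sampling rewards from a single instance ν_θ: the reward of a chosen path is a Bernoulli draw whose mean is the product of the reliabilities of its edges.

```python
import numpy as np

# Minimal sketch (hypothetical values): a structured bandit nu_theta in which
# each action is a path, given as a set of edge indices, and the reward of a
# path a is Bernoulli with mean prod_{e in a} theta_e.
rng = np.random.default_rng(0)

theta = np.array([0.9, 0.8, 0.95, 0.7])   # edge reliabilities, theta in [0, 1]^|E|
paths = [{0, 1}, {2, 3}, {0, 3}]          # the action set A: paths as sets of edges

def pull(path):
    """Sample one reward from nu_theta for the chosen path."""
    p = np.prod([theta[e] for e in path])  # prod_{e in a} theta_e
    return rng.binomial(n=1, p=p)          # Bernoulli reward in {0, 1}

means = [np.prod([theta[e] for e in a]) for a in paths]
print(means)  # approx [0.72, 0.665, 0.63]
```

Because paths can share edges, the mean rewards of different actions are coupled through θ, which is one way a learner can gain information about actions it never plays.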
4.4 The regret
In Chapter 1 we informally defined the regret as being the deficit suffered by the
learner relative to the optimal policy. Let ν = (P_a : a ∈ A) be a stochastic bandit
and define
\[
\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x)\,.
\]
Then let μ∗(ν) = max_{a∈A} μ_a(ν) be the largest mean of all the arms.
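As a simple illustration (an added example, not from the surrounding text): if P_a is a Bernoulli distribution with parameter p_a, the integral reduces to a finite sum,
\[
\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x) = 0 \cdot (1 - p_a) + 1 \cdot p_a = p_a\,.
\]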
We assume throughout that μ_a(ν) exists and is finite for all actions and
that argmax_{a∈A} μ_a(ν) is nonempty. The latter assumption could be relaxed
by carefully adapting all arguments using nearly optimal actions, but in
practice this is never required.
The regret of policy π on bandit instance ν is
\[
R_n(\pi, \nu) = n \mu^*(\nu) - \mathbb{E}\left[ \sum_{t=1}^{n} X_t \right], \tag{4.1}
\]
where the expectation is taken with respect to the measure on outcomes induced
by the interaction of π and ν. Minimizing the regret is equivalent to maximizing
the expectation of S_n, but the normalization inherent in the definition of the
regret is useful when stating results, which would otherwise need to be stated
relative to the optimal action.
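To connect definition (4.1) with something computable, the following sketch (a hypothetical setup: a three-armed Gaussian bandit with unit variance and a policy that chooses arms uniformly at random) estimates R_n by Monte Carlo, replacing the expectation with an average over independent runs.

```python
import numpy as np

# Minimal sketch: Monte Carlo estimate of the regret R_n(pi, nu) in (4.1) for a
# hypothetical Gaussian bandit nu and a policy pi that picks arms uniformly.
rng = np.random.default_rng(1)

mu = np.array([0.1, 0.5, 0.3])   # arm means mu_a(nu)
mu_star = mu.max()               # mu^*(nu), the largest mean
n, runs = 1000, 200              # horizon and number of independent runs

total = 0.0
for _ in range(runs):
    arms = rng.integers(0, len(mu), size=n)   # A_t drawn uniformly at random from A
    rewards = rng.normal(mu[arms], 1.0)       # X_t ~ N(mu_{A_t}, 1)
    total += rewards.sum()

regret = n * mu_star - total / runs           # n mu^*(nu) - E[sum_t X_t]
print(regret)  # close to n * (mu_star - mu.mean()) = 200 for this policy
```

For this particular policy the regret grows linearly in n, since the expected reward per round is just the average of the arm means.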
If the context is clear we will often drop the dependence on ν and π in various
quantities, for example by writing R_n = nμ∗ − E[∑_{t=1}^n X_t]. Similarly, the
limits in sums and maxima are abbreviated when we think you can work
out the ranges of symbols in a unique way. For example: μ∗ = max_i μ_i.