Bandit Algorithms

18.3 Exp4 218

18.3 Exp4

Exp4 is not just an increased version number, but stands forExponential weighting forExploration andExploitation withExperts. The idea of the algorithm is very simple. Since exponential weighting worked so well in the standard bandit problem we aim to adopt it to the problem at hand. However, since the goal is to compete with the best expert in hindsight, it is not the actions that we will score, but the experts. Exp4 thus maintains a probability distributionQtover experts and uses this to come up with the next action in the obvious way, by first choosing an expertMtat random fromQtand then following the chosen expert’s advice to chooseAt∼E(Mt)t. The reader is invited to check for themselves that this is the same as samplingAtfromPt=QtE(t)whereQtis treated as a row vector. Once the action is chosen, one can use their favorite reward estimation procedure to estimate the rewards for all the actions, which is then used to estimate how much total reward the individual experts would have made so far. The reward estimates are then used to updateQtusing exponential weighting. The pseudocode of Exp4 is given in Algorithm 11.

1:Input: n,k,M,η,γ 2:SetQ 1 = (1/M,..., 1 /M)∈[0,1]^1 ×M(a row vector) 3:fort= 1,...,ndo 4: Receive adviceE(t) 5: Choose the actionAt∼Pt, wherePt=QtE(t) 6: Receive the rewardXt=xtAt 7: Estimate the action rewards:Xˆti= 1−I{PAtit+=γi}(1−Xt) 8: Propagate the rewards to the experts:X ̃t=E(t)Xˆt 9: Update the distributionQtusing exponential weighting:

Qt+1,i= exp(η X ̃ti)Qti ∑ jexp(ηX ̃tj)Qtj

for alli∈[M]

10:end for Algorithm 11:Exp4.

The algorithm usesO(M) memory andO(M+k) computation per round (when sampling in two steps). Hence it is only practical when bothMandkare reasonably small.

18.4 Regret analysis

We restrict our attention to the case whenγ= 0, which is the original algorithm. The version whereγ >0 is called Exp4-IX and its analysis is left for Exercise 18.3.

Theorem18.1. Letγ = 0andη=

√

2 log(M)/(nk)and denote byRnthe

Bandit Algorithms

18.3 Exp4

18.4 Regret analysis

√

Get our desktop app

Company

Features

Documentation

Resources