The Nonstochastic Multi-Armed Bandit Problem
Part 2 and counting...

Shahaf Nacson, TAU
Nov 15, 2017
Reminder from last week
Background

Problem setup:
- K arms
  - Assume K is known to the player in advance
- Rewards x_i(t) are bounded in [0, 1]
  - Generalizes trivially to [a, b] via x ↦ (b − a)x + a
- Partial information
  - The player learns only the rewards of the arms he chose
- The slot machines need not have a fixed distribution
  - They can even be adversarial
Background

Problem setup (continued):
- The rewards assignment is determined in advance, i.e. before the first arm is pulled
- The assignment can be picked after the player's strategy is already known
- We want to minimize the regret
Notations

- K - number of possible actions (i.e. arms), usually indexed by i ∈ {1, ..., K}
- T - total time, usually indexed by t ∈ {1, ..., T}
  - One action per time step t
- x_i(t) - reward of arm i at time t; x_i(t) ∈ [0, 1]
- A - the player's strategy
  - Chooses arm i_t at time t (and receives reward x_{i_t}(t))
  - The player only knows the rewards x_{i_1}(1), ..., x_{i_t}(t) of the previously chosen actions i_1, ..., i_t
Notations, take 2

- A - the player's strategy
  - Can be viewed as a sequence I_1, I_2, ... where each I_t is a mapping

      I_t : ({1, ..., K} × [0, 1])^{t−1} → {1, ..., K}

    that is, from the previously chosen action indices and their rewards to the set of action indices
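To make the mapping concrete, here is a minimal sketch of the interaction protocol in Python (illustrative only; the names `play`, `strategy`, and the list-of-lists reward table are my own, not from the lecture):

```python
import random

def play(strategy, rewards, T):
    """Run one bandit game.

    strategy: maps the history [(i_1, r_1), ..., (i_{t-1}, r_{t-1})]
              to the next arm index -- exactly the mapping I_t above.
    rewards:  a T x K table, rewards[t][i] = x_i(t), fixed in advance.
    """
    history, gain = [], 0.0
    for t in range(T):
        i_t = strategy(history)   # I_t applied to past actions/rewards
        r_t = rewards[t][i_t]     # the player observes only this entry
        history.append((i_t, r_t))
        gain += r_t
    return gain                   # this is G_A(T), defined next

# Example: a strategy that ignores the history and pulls a uniform arm
K, T = 3, 5
uniform = lambda hist: random.randrange(K)
table = [[random.random() for _ in range(K)] for _ in range(T)]
print(play(uniform, table, T))
```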
Notations, take 3

- G_A(T) - total reward of strategy A at time horizon T:

    G_A(T) := \sum_{t=1}^{T} x_{i_t}(t)

  denoted G_A when the horizon T is obvious from context
Notations, take 4

Regret, take 1:
- Given a sequence of actions (j_1, ..., j_T), we denote

    G_{(j_1, ..., j_T)} := \sum_{t=1}^{T} x_{j_t}(t)

  the return of the sequence
- The (worst-case) regret is defined as G_{(j_1, ..., j_T)} − G_A(T)
Notations, take 5

Regret, take 2:
- G_max(T) - total reward of the best single arm at time horizon T:

    G_max(T) := \max_j \sum_{t=1}^{T} x_j(t)

  denoted G_max as well
- The weak regret is defined as G_max − G_A(T)
- We will consider the weak regret from now on and refer to it simply as "the regret"
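As a companion to the protocol sketch above: given full knowledge of the reward table (which the player never has), G_max and the weak regret are straightforward to compute. The function names here are mine, for illustration:

```python
def g_max(rewards):
    """rewards[t][j] = x_j(t); return the total of the best single arm,
    i.e. max_j sum_t x_j(t) -- the max is over one fixed arm held for
    all T rounds, not over per-round choices."""
    K = len(rewards[0])
    return max(sum(row[j] for row in rewards) for j in range(K))

def weak_regret(rewards, g_a):
    """Weak regret of a run that earned total reward g_a = G_A(T)."""
    return g_max(rewards) - g_a
```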
Goals
Goals

- A lower bound on the weak regret: Ω(√(KT))
  - Does not match the O(√(KT ln K)) upper bound of the previous week's algorithm
  - Closing the gap is still an open problem (today??)
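To get a feel for how wide (or narrow) that gap is, a quick back-of-the-envelope computation with constants ignored:

```python
from math import sqrt, log

K, T = 10, 100_000
lower = sqrt(K * T)            # Omega(sqrt(KT)) lower bound, constants dropped
upper = sqrt(K * T * log(K))   # O(sqrt(KT ln K)) upper bound, constants dropped
print(lower, upper, upper / lower)
# the ratio is sqrt(ln K), about 1.52 here: the gap is only a log factor
```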
Goals

- A lower bound of Ω(√(KT)) on the weak regret
- Upper bounds on the weak regret that hold with probability 1
  - If time permits...
Lower bounds on the weak regret
Theorem 5.1

For any number of actions K ≥ 2 and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak regret of any algorithm is Ω(√(KT)).
Proof overview

- Construct a random distribution of rewards s.t. every strategy reaches an expected regret matching our lower bound
- Find a lower bound on the expected gain of the best arm, G_max
  - Pretty straightforward
- Find an upper bound on the expected gain of any given strategy, G_A
  - Here is where all the magic happens
- Deduce a lower bound on their difference
- Proof by notations :)
Constructing the distribution

Before play begins, one action I is chosen uniformly at random to be the "good" action. Define binary rewards:
- if j = I, meaning j is the "good" action:
    Pr[x_j(t) = 1] = 1/2 + ε,  Pr[x_j(t) = 0] = 1/2 − ε
- if j ≠ I, meaning j is not the "good" action:
    Pr[x_j(t) = 1] = 1/2,  Pr[x_j(t) = 0] = 1/2
for some small, fixed ε ∈ (0, 1/2] to be chosen later down the road.

Then the expected reward of the best action is at least (1/2 + ε)T.
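A minimal sketch of this reward process (my own illustrative Python, compatible with the protocol sketch earlier; `make_reward_table` is a hypothetical name):

```python
import random

def make_reward_table(K, T, eps, rng=random):
    """Sample the T x K table from the construction: a hidden 'good' arm
    pays 1 with probability 1/2 + eps, every other arm with probability 1/2."""
    good = rng.randrange(K)  # the uniformly chosen good action I
    table = [[1 if rng.random() < (0.5 + eps if j == good else 0.5) else 0
              for j in range(K)]
             for _ in range(T)]
    return good, table
```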
Constructing the distribution

Translation of our problem: our goal now is to show that for any given strategy A, we can find an ε such that A's expected regret is Ω(√(TK)). We will soon see that ε depends only on the number of actions K and the total time T.
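To preview why ε can depend only on K and T: schematically, suppose the argument yields an inequality of the shape below for some constant c > 0 (this shape is my own stand-in; the exact inequality comes from the proof ahead). Optimizing over ε already produces the right rate:

```latex
\mathbb{E}[\mathrm{regret}] \;\ge\; \epsilon T\left(1 - c\,\epsilon\sqrt{T/K}\right)
\quad\Longrightarrow\quad
\epsilon = \tfrac{1}{2c}\sqrt{K/T}
\;\;\text{gives}\;\;
\mathbb{E}[\mathrm{regret}] \;\ge\; \tfrac{1}{4c}\sqrt{KT} \;=\; \Omega\!\left(\sqrt{KT}\right)
```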
Some more notations

- P*{·} - probability w.r.t. the aforementioned distribution
- P_i{·} - probability conditioned on i being the good action: P_i{·} = P*{· | i = I}
- P_unif{·} - probability w.r.t. the uniform distribution
- Same for expectations: E*[·], E_i[·], E_unif[·]

We want to show:

    E*[G_max − G_A] ≥ Ω(√(KT))
Some more notations (...)

- A - as before, the player's strategy
- r_t = x_{i_t}(t) - random variable denoting the reward received at time t
- r^t = ⟨r_1, ..., r_t⟩ - the sequence of rewards up to time t
- r = r^T - the entire reward sequence
- N_i - the number of times action i is chosen by A
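With all the pieces named, a short experiment makes the target inequality tangible. This sketch reuses the hypothetical `play`, `g_max`, and `make_reward_table` helpers from the earlier snippets, so it is illustrative rather than self-standing:

```python
import random
from statistics import mean
# assumes play(), g_max(), make_reward_table() as defined in the earlier sketches

K, T, eps, runs = 5, 2000, 0.05, 50
uniform = lambda hist: random.randrange(K)   # a naive history-ignoring strategy

regrets = []
for _ in range(runs):
    good, table = make_reward_table(K, T, eps)
    regrets.append(g_max(table) - play(uniform, table, T))

# For the uniform strategy the expected weak regret is roughly
# eps * T * (1 - 1/K) = 80 here; the theorem says *every* strategy
# suffers Omega(sqrt(KT)) for a suitably chosen eps.
print(mean(regrets))
```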