Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi, Università degli Studi di Milano
Joint work with: Claudio Gentile and Alberto Minora (Varese), Yishay Mansour (Tel-Aviv)
Themes

- Learning with partial and delayed feedback
- Distributed online learning
- Trade-off between quality and quantity of feedback information
The nonstochastic bandit problem

A sequential decision problem:
- K actions
- Unknown deterministic assignment of losses to actions: $\ell_t = \big(\ell_t(1), \dots, \ell_t(K)\big) \in [0,1]^K$ for $t = 1, 2, \dots$

For $t = 1, 2, \dots$:
1. Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
2. Player gets partial information: only $\ell_t(I_t)$ is revealed

[Figure: K slot machines, with only the played arm's loss (e.g., 0.3) revealed]

Applications: ad placement, recommender systems, online auctions, ...
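A minimal runnable sketch of this interaction protocol, assuming an oblivious adversary that fixes the whole loss matrix in advance (the uniform-random policy is a placeholder, not an algorithm from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 1000
losses = rng.uniform(size=(T, K))   # unknown deterministic losses in [0, 1]

total = 0.0
for t in range(T):
    I_t = rng.integers(K)           # placeholder policy: play uniformly at random
    total += losses[t, I_t]         # player incurs the loss of the chosen action
    feedback = losses[t, I_t]       # bandit feedback: only this value is revealed
```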
Regret

Regret of a randomized agent playing $I_1, I_2, \dots$:

$R_T \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^T \ell_t(I_t)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$, and we want $R_T = o(T)$

Lower bound: $R_T \gtrsim \sqrt{KT}$
The Exp3 algorithm [Auer et al., 2002]

Agent's strategy:

$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right)$ for $i = 1, \dots, K$

$\hat\ell_t(i) = \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)}$ if $I_t = i$, and $0$ otherwise

Only one non-zero component in $\hat\ell_t$.

Properties of the importance-weighting estimator:
- Unbiasedness: $\mathbb{E}_t\big[\hat\ell_t(i)\big] = \ell_t(i)$
- Variance control: $\mathbb{E}_t\big[\hat\ell_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i) \text{ observed}\big)}$
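A short runnable sketch of Exp3 with the importance-weighted estimator; the function name and loss-matrix interface are mine, but the update is exactly the one on the slide:

```python
import numpy as np

def exp3(losses, eta, rng=None):
    """Run Exp3 on a T x K loss matrix; returns the total incurred loss."""
    rng = rng or np.random.default_rng()
    T, K = losses.shape
    cum_est = np.zeros(K)              # cumulative importance-weighted estimates
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))  # shift for numerical stability
        p = w / w.sum()                # P_t(I_t = i)
        i = rng.choice(K, p=p)         # play I_t
        total += losses[t, i]
        # bandit setting: P_t(loss of i observed) = p[i], so dividing by p[i]
        # makes the estimate unbiased
        cum_est[i] += losses[t, i] / p[i]
    return total
```

With $\eta = \sqrt{2\ln K / (KT)}$ this matches the tuned bound on the next slide (up to constants).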
Regret bounds

Matching the lower bound up to logarithmic factors:

$R_T \le \dfrac{\ln K}{\eta} + \dfrac{\eta}{2}\,\mathbb{E}\left[\sum_{t=1}^T \sum_{i=1}^K P_t(I_t = i)\,\mathbb{E}_t\big[\hat\ell_t(i)^2\big]\right] \le \dfrac{\ln K}{\eta} + \dfrac{\eta}{2}\,\mathbb{E}\left[\sum_{t=1}^T \sum_{i=1}^K \dfrac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ observed}\big)}\right] = \dfrac{\ln K}{\eta} + \dfrac{\eta}{2}\,KT = \sqrt{2KT\ln K}$ for the tuned $\eta$

The full information (experts) setting:
- Agent observes the whole loss vector $\ell_t$ after each play, so $P_t\big(\ell_t(i) \text{ observed}\big) = 1$
- $R_T \lesssim \sqrt{T \ln K}$
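The last equality hides the standard tuning step; spelled out (a sketch via AM-GM, with constants restored):

```latex
\frac{\ln K}{\eta} + \frac{\eta}{2}\,KT
\;\ge\; 2\sqrt{\frac{\ln K}{\eta}\cdot\frac{\eta}{2}\,KT}
\;=\; \sqrt{2KT\ln K},
\qquad \text{with equality at } \eta^\star = \sqrt{\frac{2\ln K}{KT}}.
```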
Learning with delayed losses

At the end of each round $t > d$, the agent pays $\ell_t(I_t)$ and observes $\ell_{t-d}(I_{t-d})$.

Upper bound [Neu et al., 2010; Joulani et al., 2013]:

$R_T \lesssim \sqrt{(d+1)KT}$

Proof (by reduction): run $d+1$ instances of a bandit algorithm for the standard (no delay) setting in parallel; at each time step $t = (d+1)r + s$, use instance $s+1$ for the current play.

Lower bound:

$\max\Big\{\underbrace{\sqrt{KT}}_{\text{bandit lower bound}},\ \underbrace{\sqrt{(d+1)T\ln K}}_{\text{delayed experts lower bound}}\Big\} = \Omega\big(\sqrt{(d+K)T}\,\big)$
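A sketch of the reduction (the class and interface are mine, not the authors' code): rotate through $d+1$ independent Exp3 instances, so that by the time an instance plays again, the d-step-delayed feedback for its previous play has already arrived; from each instance's point of view there is no delay.

```python
import numpy as np

class Exp3State:
    def __init__(self, K, eta):
        self.cum_est = np.zeros(K)     # cumulative loss estimates
        self.eta = eta
    def distribution(self):
        w = np.exp(-self.eta * (self.cum_est - self.cum_est.min()))
        return w / w.sum()
    def update(self, i, loss, p_i):
        self.cum_est[i] += loss / p_i  # importance-weighted update

def delayed_bandit(losses, d, eta, rng=None):
    rng = rng or np.random.default_rng()
    T, K = losses.shape
    instances = [Exp3State(K, eta) for _ in range(d + 1)]
    pending = []                       # (arrival_round, instance, action, prob, loss)
    total = 0.0
    for t in range(T):
        s = t % (d + 1)                # t = (d+1)r + s: instance s plays this round
        p = instances[s].distribution()
        i = rng.choice(K, p=p)
        total += losses[t, i]
        pending.append((t + d, s, i, p[i], losses[t, i]))
        while pending and pending[0][0] <= t:   # deliver d-step-delayed feedback
            _, s0, i0, p0, loss0 = pending.pop(0)
            instances[s0].update(i0, loss0, p0)
    return total
```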
Simpler and better solution

Delayed importance-sampling estimates: run Exp3 and make an importance-weighted update whenever a loss becomes available:

$\hat\ell_t(i) = \dfrac{\ell_{t-d}(i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ observed}\big)}$ if $I_{t-d} = i$, and $0$ otherwise

Regret bound:

$R_T = d + \sqrt{(d+K)T\ln K}$

matching the lower bound up to logarithmic factors.
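A sketch of this simpler solution (function name and interface are mine): a single Exp3 instance that applies the importance-weighted update as soon as each delayed loss arrives, weighting by the probability that was in force when the action was played.

```python
import numpy as np

def delayed_exp3(losses, d, eta, rng=None):
    rng = rng or np.random.default_rng()
    T, K = losses.shape
    cum_est = np.zeros(K)
    history = []                     # (action, P_t(action)) for each round t
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        total += losses[t, i]
        history.append((i, p[i]))
        if t >= d:                   # loss from round t-d becomes available now
            i_old, p_old = history[t - d]
            cum_est[i_old] += losses[t - d, i_old] / p_old
    return total
```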
Properties of the delayed loss estimate

Recall the key step in the Exp3 analysis (a.k.a. "bandit magic"):

$\sum_{i=1}^K \dfrac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ observed}\big)} = K$

For the delayed loss estimate we have

$\sum_{i=1}^K \dfrac{P_t(I_t = i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ observed}\big)} \le Ke$ for $\eta \le \dfrac{1}{eK(d+1)}$
Cooperation with delay

- N agents sitting on the vertices of an unknown communication graph $G = (V, E)$
- Agents cooperate to solve a common bandit problem
- Each agent runs an instance of the same bandit algorithm

[Figure: a communication graph on 10 numbered vertices]
Some related works

- Cooperative nonstochastic bandits without delays [Awerbuch and Kleinberg, 2008]
- Cooperative stochastic bandits on dynamic P2P networks [Szorenyi et al., 2013]
- Stochastic bandits that compete for shared resources (cognitive radio networks)
- Distributed gradient descent
The communication protocol with fixed delay d

For each $t = 1, \dots, T$, each agent $v \in V$ does the following:
1. Plays an action $I_t(v)$ drawn according to its private distribution $p_t(v)$, observing loss $\ell_t\big(I_t(v)\big)$ (same loss vector for all agents)
2. Sends to its neighbors the message $m_t(v) = \big\langle t, v, I_t(v), \ell_t(I_t(v)), p_t(v) \big\rangle$
3. Receives messages from its neighbors, forwarding those that are not older than d

Consequences (see the sketch below):
- An agent receives a message from another agent with a delay equal to the shortest-path distance between them
- A message sent by agent v at time t is received by all agents whose shortest-path distance from v is at most d
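A small sketch of the reachability this flooding protocol induces: messages hop one edge per round and are dropped once older than d, so agent u learns v's round-t play exactly at round $t + \mathrm{dist}(u,v)$ whenever $\mathrm{dist}(u,v) \le d$. Helper names are mine; the graph is a plain adjacency-list dict.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from `source` in an adjacency-list graph."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def d_neighborhood(adj, v, d):
    """All agents that eventually receive v's messages: distance at most d."""
    return {u for u, dd in bfs_distances(adj, v).items() if dd <= d}
```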
Average welfare regret

$R_T^{\mathrm{coop}} = \dfrac{1}{N} \sum_{v \in V} \mathbb{E}\left[\sum_{t=1}^T \ell_t\big(I_t(v)\big)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$

Remarks:
- Clearly, $R_T^{\mathrm{coop}} \lesssim \sqrt{TK\ln K}$ when agents run vanilla Exp3 (no cooperation)
- By using the other agents' plays, each agent may estimate $\ell_t$ better (thus learning nearly at the full-information rate)
- In general, d trades off between quality and quantity of information
Cooperative delayed loss estimator

Each agent v uses the messages received from the other agents to estimate $\ell_t$ better:

$\hat\ell_t(i, v) = \dfrac{\ell_{t-d}(i)\,\mathbb{I}\{B_{t-d}(i,v)\}}{P_{t-d}\big(B_{t-d}(i,v)\big)}$ if $t > d$, and $0$ otherwise

where $B_{t-d}(i, v)$ is the event that some agent in the d-neighborhood of v played action i at time $t - d$.

Now $\hat\ell_t(\cdot, v)$ may have many non-zero components (a better estimate).
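A sketch of how agent v could form this estimate from received messages. Computing $P_{t-d}(B_{t-d}(i,v))$ as $1 - \prod_u (1 - p_u(i))$ assumes the neighborhood agents draw their actions independently, which the protocol's private randomization provides; names and the interface are illustrative, not the authors' code.

```python
import numpy as np

def coop_estimate(K, loss_td, played_td, dists_td):
    """Cooperative loss estimate for one agent v at round t > d.
    loss_td[i]   : loss of arm i at round t-d (known wherever observed);
    played_td    : dict agent -> action played at t-d, over v's d-neighborhood;
    dists_td     : dict agent -> that agent's K-vector distribution p_{t-d}."""
    est = np.zeros(K)
    for i in range(K):
        if any(a == i for a in played_td.values()):      # event B_{t-d}(i, v) occurred
            # P(B) = 1 - prod_u (1 - p_u(i)) under independent draws
            p_not = np.prod([1.0 - dists_td[u][i] for u in played_td])
            est[i] = loss_td[i] / (1.0 - p_not)
        # arms not played anywhere in the neighborhood keep estimate 0
    return est
```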