Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi, Università degli Studi di Milano (UNIMI). Joint work with Claudio Gentile and Alberto Minora (Varese) and Yishay Mansour (Tel-Aviv).

Themes: learning with partial and delayed feedback; distributed online learning; the trade-off between quality and quantity of feedback information.

The nonstochastic bandit problem
A sequential decision problem with K actions and an unknown deterministic assignment of losses to actions: $\ell_t = (\ell_t(1), \dots, \ell_t(K)) \in [0,1]^K$ for $t = 1, 2, \dots$
[Figure: a row of nine hidden losses; only the loss of the played action, 0.3, is revealed.]
For $t = 1, 2, \dots$
1. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
2. The player gets partial information: only $\ell_t(I_t)$ is revealed.
Applications: ad placement, recommender systems, online auctions, ...

Regret
Regret of a randomized agent $I_1, I_2, \dots$:
$$R_T \overset{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i=1,\dots,K} \sum_{t=1}^{T} \ell_t(i), \qquad \text{want } R_T = o(T)$$
Lower bound: $R_T \gtrsim \sqrt{KT}$
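
As a concrete illustration of the definition, the following minimal sketch computes the realized regret of one sequence of plays against the best fixed action in hindsight. The function name `realized_regret` and the toy loss matrix are assumptions made for illustration; since $R_T$ is an expectation over the player's randomization, a single run gives only one sample of it.

```python
def realized_regret(losses, plays):
    """losses[t][i] is the loss of action i at round t; plays[t] is the action chosen."""
    incurred = sum(losses[t][a] for t, a in enumerate(plays))
    num_actions = len(losses[0])
    # Cumulative loss of the best single action, chosen in hindsight.
    best_fixed = min(sum(row[i] for row in losses) for i in range(num_actions))
    return incurred - best_fixed


# Toy instance (hypothetical numbers): K = 2 actions, T = 3 rounds.
losses = [[0.3, 0.9], [0.5, 0.1], [0.2, 0.8]]
plays = [1, 0, 0]
regret = realized_regret(losses, plays)
```

Here the player incurs 0.9 + 0.5 + 0.2, while always playing action 0 would have cost 0.3 + 0.5 + 0.2.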

The Exp3 algorithm [Auer et al., 2002]
Agent's strategy:
$$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right), \qquad i = 1, \dots, K$$
$$\hat\ell_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } I_t = i \\ 0 & \text{otherwise} \end{cases}$$
Only one non-zero component in $\hat\ell_t$.
Properties of the importance weighting estimator:
Unbiasedness: $\mathbb{E}_t\big[\hat\ell_t(i)\big] = \ell_t(i)$
Variance control: $\mathbb{E}_t\big[\hat\ell_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i) \text{ observed}\big)}$
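
The update above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function name `exp3`, the callback `loss_fn(t, action)`, and the toy loss assignment are all hypothetical, and in the bandit setting the probability that $\ell_t(i)$ is observed is simply $P_t(I_t = i)$.

```python
import math
import random


def exp3(loss_fn, num_actions, horizon, eta, rng=None):
    """Run Exp3 for `horizon` rounds; loss_fn(t, action) returns a loss in [0, 1]."""
    rng = rng or random.Random(0)
    cum_est = [0.0] * num_actions  # cumulative importance-weighted loss estimates
    total_loss = 0.0
    probs = [1.0 / num_actions] * num_actions
    for t in range(horizon):
        # Subtract the minimum before exponentiating for numerical stability.
        m = min(cum_est)
        weights = [math.exp(-eta * (c - m)) for c in cum_est]
        z = sum(weights)
        probs = [w / z for w in weights]
        action = rng.choices(range(num_actions), weights=probs)[0]
        loss = loss_fn(t, action)  # only the played action's loss is revealed
        total_loss += loss
        # Importance-weighted estimate: non-zero only for the played action.
        cum_est[action] += loss / probs[action]
    return total_loss, probs


# Hypothetical toy problem: action 0 is consistently better than the others.
K, T = 3, 2000
eta = math.sqrt(2 * math.log(K) / (K * T))
total_loss, final_probs = exp3(lambda t, i: 0.1 if i == 0 else 0.9, K, T, eta)
```

Dividing by `probs[action]` is what makes the estimate unbiased: the loss of action i is recorded with probability $P_t(I_t = i)$ and scaled up by the inverse of that probability.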

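Regret bounds
Matching the lower bound up to logarithmic factors:
$$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2}\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{K} P_t(I_t = i)\, \mathbb{E}_t\big[\hat\ell_t(i)^2\big]\right]$$
$$\le \frac{\ln K}{\eta} + \frac{\eta}{2}\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{K} \frac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ is observed}\big)}\right]$$
$$= \frac{\ln K}{\eta} + \frac{\eta}{2} KT = \sqrt{2KT \ln K} \quad \text{for } \eta = \sqrt{\frac{2 \ln K}{KT}}$$
That is, regret of order $\sqrt{KT \ln K}$.
The full information (experts) setting:
The agent observes the vector of losses $\ell_t$ after each play, so $P_t\big(\ell_t(i) \text{ is observed}\big) = 1$ and $R_T \lesssim \sqrt{T \ln K}$.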
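
The tuning of $\eta$ behind the last step is not spelled out on the slide. A short worked derivation, assuming the standard choice that minimizes the right-hand side, with an illustrative instance ($K = 10$, $T = 10^4$, numbers chosen only as an example):

```latex
f(\eta) = \frac{\ln K}{\eta} + \frac{\eta}{2} KT,
\qquad
f'(\eta) = -\frac{\ln K}{\eta^{2}} + \frac{KT}{2} = 0
\;\Longrightarrow\;
\eta^{\star} = \sqrt{\frac{2 \ln K}{KT}},
\qquad
f(\eta^{\star}) = 2\sqrt{\frac{KT \ln K}{2}} = \sqrt{2KT \ln K}.

% Example: K = 10, T = 10^4
\eta^{\star} = \sqrt{\frac{2 \ln 10}{10^{5}}} \approx 0.0068,
\qquad
\sqrt{2 \cdot 10 \cdot 10^{4} \cdot \ln 10} \approx 679 .
```

In the full information case the inner sum over $i$ is at most 1 instead of $K$, which is where the factor $K$ disappears from the bound.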

Learning with delayed losses
At the end of each round $t > d$ the agent pays $\ell_t(I_t)$ and observes $\ell_{t-d}(I_{t-d})$.
Upper bound [Neu et al., 2010; Joulani et al., 2013]: $R_T \lesssim \sqrt{(d+1)KT}$
Proof (by reduction): Run $d + 1$ instances of a bandit algorithm for the standard (no delay) setting in parallel. At each time step $t = (d+1)r + s$, use instance $s + 1$ for the current play.
Lower bound:
$$\max\Big\{ \underbrace{\sqrt{KT}}_{\text{bandit lower bound}},\ \underbrace{\sqrt{(d+1)T \ln K}}_{\text{delayed experts lower bound}} \Big\} = \Omega\Big(\sqrt{(d+K)T}\Big)$$
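
The reduction can be sketched as follows. This is a minimal sketch, assuming Exp3 as the base algorithm; the class `Exp3Learner`, the function `run_with_delay`, and the toy losses are hypothetical names introduced here. Rounds are 0-indexed, so round t uses instance `t % (d + 1)`; its previous play happened at round t - (d + 1), whose feedback arrived at the end of round t - 1, so each instance always sees its own feedback before it plays again.

```python
import math
import random
from collections import deque


class Exp3Learner:
    """A no-delay Exp3 instance with separate act/update steps."""

    def __init__(self, num_actions, eta, rng):
        self.eta, self.rng = eta, rng
        self.cum_est = [0.0] * num_actions

    def act(self):
        m = min(self.cum_est)
        w = [math.exp(-self.eta * (c - m)) for c in self.cum_est]
        z = sum(w)
        p = [x / z for x in w]
        a = self.rng.choices(range(len(p)), weights=p)[0]
        return a, p[a]

    def update(self, action, loss, prob):
        self.cum_est[action] += loss / prob


def run_with_delay(loss_fn, num_actions, horizon, delay, eta, seed=0):
    rng = random.Random(seed)
    instances = [Exp3Learner(num_actions, eta, rng) for _ in range(delay + 1)]
    pending = deque()  # (arrival_round, instance_index, action, loss, prob)
    total_loss = 0.0
    for t in range(horizon):
        s = t % (delay + 1)  # round-robin over the d + 1 instances
        action, prob = instances[s].act()
        loss = loss_fn(t, action)
        total_loss += loss
        pending.append((t + delay, s, action, loss, prob))
        # Feedback for round t - delay becomes available at the end of round t.
        while pending and pending[0][0] <= t:
            _, idx, a, l, p = pending.popleft()
            instances[idx].update(a, l, p)
    return total_loss


# Hypothetical toy problem; each instance plays about T / (d + 1) rounds.
K, T, d = 3, 3000, 4
eta = math.sqrt(2 * math.log(K) * (d + 1) / (K * T))
total = run_with_delay(lambda t, i: 0.1 if i == 0 else 0.9, K, T, d, eta)
```

Each instance faces a standard bandit problem over roughly $T/(d+1)$ rounds, so summing $d + 1$ regrets of order $\sqrt{KT/(d+1)}$ gives the $\sqrt{(d+1)KT}$ rate, ignoring logarithmic factors.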

Simpler and better solution
Delayed importance sampling estimates: run Exp3 and make an importance-weighted update whenever a loss becomes available.
$$\hat\ell_t(i) = \begin{cases} \dfrac{\ell_{t-d}(i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ observed}\big)} & \text{if } I_{t-d} = i \\ 0 & \text{otherwise} \end{cases}$$
Regret bound: $R_T = O\Big(d + \sqrt{(d+K)T \ln K}\Big)$, matching the lower bound up to logarithmic factors.
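
A single Exp3 instance with delayed updates might look like the sketch below. The function name `delayed_exp3` and the step size used in the example are assumptions for illustration (the step size is chosen only to respect the condition $\eta \le 1/(eK(d+1))$ that appears in the analysis); the point is that the estimate is weighted by the probability in force at round $t - d$, when the action was actually drawn.

```python
import math
import random
from collections import deque


def delayed_exp3(loss_fn, num_actions, horizon, delay, eta, seed=0):
    rng = random.Random(seed)
    cum_est = [0.0] * num_actions
    in_flight = deque()  # (action, loss, prob at play time) for rounds not yet observed
    total_loss = 0.0
    for t in range(horizon):
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        z = sum(w)
        probs = [x / z for x in w]
        action = rng.choices(range(num_actions), weights=probs)[0]
        loss = loss_fn(t, action)  # paid now, observed only `delay` rounds later
        total_loss += loss
        in_flight.append((action, loss, probs[action]))
        if len(in_flight) > delay:
            a, l, p = in_flight.popleft()  # feedback from round t - delay
            cum_est[a] += l / p            # weighted by the round t - delay probability
    return total_loss


# Hypothetical toy problem and an illustrative (not prescribed) step size.
K, T, d = 3, 5000, 4
eta = min(1.0 / (math.e * K * (d + 1)), math.sqrt(math.log(K) / ((d + K) * T)))
total = delayed_exp3(lambda t, i: 0.1 if i == 0 else 0.9, K, T, d, eta)
```

Compared with the reduction, only one set of weights is maintained, and every observation improves the same distribution.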

Properties of the delayed loss estimate
Recall the key step in the Exp3 analysis (a.k.a. "bandit magic"):
$$\sum_{i=1}^{K} \frac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ is observed}\big)} = K$$
For the delayed loss estimate we have
$$\sum_{i=1}^{K} \frac{P_t(I_t = i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ is observed}\big)} \le Ke \qquad \text{for } \eta \le \frac{1}{eK(d+1)}$$
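
To get a feel for the magnitudes, here is a worked instance of the two quantities on this slide. The values $K = 10$ and $d = 4$ are chosen purely as an example.

```latex
K = 10,\; d = 4:
\qquad
\eta \le \frac{1}{e K (d+1)} = \frac{1}{50\,e} \approx 0.0074,
\qquad
\sum_{i=1}^{K} \frac{P_t(I_t = i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ is observed}\big)}
\le K e \approx 27.2
\quad\text{(versus exactly } K = 10 \text{ without delay)}.
```

So under this step-size condition the delay costs at most a constant factor $e$ in the key sum, regardless of $d$.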

Cooperation with delay
N agents sit on the vertices of an unknown communication graph $G = (V, E)$. Agents cooperate to solve a common bandit problem. Each agent runs an instance of the same bandit algorithm.
[Figure: example communication graph with vertices labeled 1 to 10.]

Some related works: cooperative nonstochastic bandits without delays [Awerbuch and Kleinberg, 2008]; cooperative stochastic bandits on dynamic P2P networks [Szorenyi et al., 2013]; stochastic bandits that compete for shared resources (cognitive radio networks); distributed gradient descent.

The communication protocol with fixed delay d
For each $t = 1, \dots, T$ each agent $v \in V$ does the following:
1. Plays an action $I_t(v)$ drawn according to his private distribution $p_t(v)$, observing loss $\ell_t\big(I_t(v)\big)$ (same loss vector for all agents).
2. Sends to his neighbors the message $m_t(v) = \big\langle t, v, I_t(v), \ell_t\big(I_t(v)\big), p_t(v) \big\rangle$.
3. Receives messages from his neighbors, forwarding those that are not older than d.
An agent receives a message from another agent with a delay equal to the length of the shortest path between them. A message sent by some agent v at time t will be received by all agents whose shortest-path distance from v is at most d.
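
The reach of a message under this protocol can be computed with a breadth-first search truncated at depth d. The sketch below is only an illustration: the function name `arrival_delays` and the path graph are hypothetical, and the adjacency-list representation is an assumption.

```python
from collections import deque


def arrival_delays(adjacency, source, max_delay):
    """Return {agent: hops} for every agent reached within max_delay hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_delay:
            continue  # messages older than max_delay are not forwarded
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1  # one hop costs one round of delay
                queue.append(w)
    return dist


# Hypothetical path graph 1 - 2 - 3 - 4 - 5 with delay d = 2.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
reached = arrival_delays(graph, source=1, max_delay=2)
```

With d = 2, the message sent by agent 1 reaches agents 2 and 3 (after one and two rounds respectively) and is never seen by agents 4 and 5.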

Average welfare regret
$$R_T^{\text{coop}} = \frac{1}{N} \sum_{v \in V} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t\big(I_t(v)\big)\right] - \min_{i=1,\dots,K} \sum_{t=1}^{T} \ell_t(i)$$
Remarks:
Clearly, $R_T^{\text{coop}} \lesssim \sqrt{TK \ln K}$ when agents run vanilla Exp3 (no cooperation).
By using other agents' plays, each agent may estimate $\ell_t$ better (thus learning nearly at the full information rate).
In general, d trades off between quality and quantity of information.

Cooperative delayed loss estimator
Each agent v uses the messages received from the other agents in order to estimate $\ell_t$ better:
$$\hat\ell_t(i, v) = \begin{cases} \dfrac{\ell_{t-d}(i)}{P_{t-d}\big(B_{t-d}(i, v)\big)}\, \mathbb{I}\{B_{t-d}(i, v)\} & \text{if } t > d \\ 0 & \text{otherwise} \end{cases}$$
$B_{t-d}(i, v)$ is the event that some agent in a d-neighborhood of v played action i at time $t - d$.
Now the vector $\hat\ell_t(\cdot, v)$ may have many non-zero components (better estimate).
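
A sketch of how one agent could evaluate this estimator from the messages it has received. It assumes that agents draw their actions independently given the past, so that $P\big(B(i, v)\big) = 1 - \prod_{v'} \big(1 - p(i, v')\big)$ over the agents $v'$ in the d-neighborhood of v (including v itself); this is why each message carries the sender's distribution. The function name `cooperative_estimate` and the example messages are hypothetical.

```python
def cooperative_estimate(num_actions, neighborhood_msgs):
    """Estimate the loss vector of one past round from neighborhood messages.

    neighborhood_msgs: list of (action, loss, probs), one per agent within
    distance d of v (v included), all referring to the same round t - d;
    probs is that agent's distribution over actions at that round.
    """
    estimate = [0.0] * num_actions
    for i in range(num_actions):
        # Probability that nobody in the neighborhood played i (independent draws assumed).
        none_played = 1.0
        for _, _, probs in neighborhood_msgs:
            none_played *= 1.0 - probs[i]
        q = 1.0 - none_played  # P(B(i, v))
        observed = [loss for action, loss, _ in neighborhood_msgs if action == i]
        if observed:  # event B(i, v) occurred: someone nearby played i
            estimate[i] = observed[0] / q
    return estimate


# Hypothetical messages from three agents, K = 3 actions.
msgs = [
    (0, 0.3, [0.5, 0.25, 0.25]),
    (2, 0.8, [0.2, 0.3, 0.5]),
    (0, 0.3, [0.4, 0.4, 0.2]),
]
est = cooperative_estimate(3, msgs)
```

Two of the three components are non-zero here, and each is divided by a probability larger than any single agent's own, which is the sense in which cooperation lowers the variance of the estimate.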
