From Bandits to Experts: A Tale of Domination and Independence

Nicolò Cesa-Bianchi
Università degli Studi di Milano

Joint work with: Noga Alon, Ofer Dekel, Tomer Koren

N. Cesa-Bianchi (UNIMI) Domination and Independence

Theory of repeated games
James Hannan (1922–2010) and David Blackwell (1919–2010)
Learning to play a game (1956): play a game repeatedly against a possibly suboptimal opponent.

Zero-sum 2-person games played more than once
$N \times M$ known loss matrix over $\mathbb{R}$:
$$
\begin{array}{c|cccc}
 & 1 & 2 & \cdots & M \\ \hline
1 & \ell(1,1) & \ell(1,2) & \cdots & \\
2 & \ell(2,1) & \ell(2,2) & \cdots & \\
\vdots & \vdots & \vdots & \ddots & \\
N & & & &
\end{array}
$$
Row player (player) has $N$ actions; column player (opponent) has $M$ actions.
For each game round $t = 1, 2, \dots$
Player chooses action $i_t$ and opponent chooses action $y_t$.
The player suffers loss $\ell(i_t, y_t)$ (= gain of opponent).
Player can learn from opponent's history of past choices $y_1, \dots, y_{t-1}$.

Prediction with expert advice
Volodya Vovk, Manfred Warmuth
Play an unknown loss matrix:
$$
\begin{array}{c|ccc}
 & t = 1 & t = 2 & \cdots \\ \hline
1 & \ell_1(1) & \ell_2(1) & \cdots \\
2 & \ell_1(2) & \ell_2(2) & \cdots \\
\vdots & \vdots & \vdots & \ddots \\
N & \ell_1(N) & \ell_2(N) & \cdots
\end{array}
$$
Opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function $\ell(i_t, y_t) = \ell_t(i_t)$.

Playing the experts game
[Figure: $N$ actions; losses hidden before play, all revealed afterwards]
For $t = 1, 2, \dots$
1. Loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player).
2. Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
3. Player gets feedback information: $\boldsymbol{\ell}_t = \big(\ell_t(1), \dots, \ell_t(N)\big)$.

Oblivious opponents
The loss process $\langle \ell_t \rangle_{t \ge 1}$ is deterministic and unknown to the (randomized) player $I_1, I_2, \dots$
Oblivious regret minimization:
$$R_T \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i) \stackrel{\text{want}}{=} o(T)$$

Bounds on regret [How to use expert advice, 1997]
Lower bound using random losses
Losses $\ell_t(i)$ are independent random coin flips $L_t(i) \in \{0, 1\}$.
For any player strategy,
$$\mathbb{E}\left[\sum_{t=1}^{T} L_t(I_t)\right] = \frac{T}{2}$$
Then the expected regret is
$$\mathbb{E}\left[\max_{i=1,\dots,N} \left(\frac{T}{2} - \sum_{t=1}^{T} L_t(i)\right)\right] = \big(1 - o(1)\big) \sqrt{\frac{T \ln N}{2}}$$
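The lower-bound quantity can be checked numerically. The sketch below is only an illustration: the function name `expected_gap`, the values $N = 10$ and $T = 1000$, the trial count and the seed are all assumptions, not part of the slides. It estimates $\mathbb{E}[\max_i (T/2 - \sum_t L_t(i))]$ by Monte Carlo and prints it next to $\sqrt{(T \ln N)/2}$.

```python
import math
import random


def expected_gap(n_actions, horizon, trials, seed=0):
    """Monte Carlo estimate of E[max_i (T/2 - sum_t L_t(i))] for fair coin-flip losses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The cumulative loss of one action is Binomial(T, 1/2): count the
        # one-bits among T independent fair random bits.
        cumulative = [bin(rng.getrandbits(horizon)).count("1") for _ in range(n_actions)]
        # max_i (T/2 - S_i) equals T/2 - min_i S_i.
        total += horizon / 2 - min(cumulative)
    return total / trials


N, T = 10, 1000  # illustrative values, not from the slides
estimate = expected_gap(N, T, trials=500)
asymptotic = math.sqrt(T * math.log(N) / 2)
print(f"estimated gap {estimate:.1f}, sqrt(T ln N / 2) = {asymptotic:.1f}")
```

For small $N$ the estimate sits noticeably below $\sqrt{(T \ln N)/2}$; the $1 - o(1)$ factor only approaches 1 as $N$ and $T$ grow.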

Exponentially weighted forecaster
At time $t$ pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \ell_s(i)\right)$$
The sum at the exponent is the total loss of action $i$ up to now.
Regret bound [How to use expert advice, 1997]
If $\eta = \sqrt{(8 \ln N)/T}$ then $R_T \le \sqrt{\dfrac{T \ln N}{2}}$
Matching lower bound including constants
Dynamic choice $\eta_t = \sqrt{(8 \ln N)/t}$ only loses small constants
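A minimal sketch of the forecaster follows, assuming full-information feedback and the standard tuning $\eta = \sqrt{(8 \ln N)/T}$. The function names and the toy loss sequence are hypothetical. The expected regret is computed exactly from the action probabilities instead of sampling $I_t$, so it can be compared directly with $\sqrt{(T \ln N)/2}$.

```python
import math


def hedge_probabilities(cumulative_losses, eta):
    """Distribution proportional to exp(-eta * cumulative loss of each action)."""
    shift = min(cumulative_losses)  # subtracting the minimum avoids underflow
    weights = [math.exp(-eta * (c - shift)) for c in cumulative_losses]
    z = sum(weights)
    return [w / z for w in weights]


def hedge_expected_regret(loss_rows, eta):
    """Expected regret on a fixed (oblivious) loss sequence.

    loss_rows[t][i] is the loss of action i at round t, assumed to lie in [0, 1].
    """
    n = len(loss_rows[0])
    cumulative = [0.0] * n
    expected_loss = 0.0
    for losses in loss_rows:
        p = hedge_probabilities(cumulative, eta)
        expected_loss += sum(pi * li for pi, li in zip(p, losses))
        # Full information: every action's loss is revealed after the round.
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return expected_loss - min(cumulative)


# Toy sequence (an assumption): actions 0 and 1 suffer loss 1 every third
# round, action 2 always suffers loss 0.2.
T, N = 300, 3
rows = [[float((t + i) % 3 == 0) for i in range(2)] + [0.2] for t in range(T)]
eta = math.sqrt(8 * math.log(N) / T)
regret = hedge_expected_regret(rows, eta)
bound = math.sqrt(T * math.log(N) / 2)
print(f"expected regret {regret:.2f}, bound {bound:.2f}")
```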

The bandit problem: playing an unknown game
[Figure: $N$ actions; only the loss of the played action is revealed]
For $t = 1, 2, \dots$
1. Loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player).
2. Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
3. Player gets feedback information: only $\ell_t(I_t)$ is revealed.
Many applications: ad placement, dynamic content adaptation, routing, online auctions.

Relationships between actions [Mannor and Shamir, 2011]

A graph of relationships over actions
[Figure: graph over the actions; some losses revealed, the others hidden]

Recovering expert and bandit settings
Experts: clique
Bandits: empty graph
[Figure: a clique with all losses revealed, and an empty graph with a single loss revealed]

Exponentially weighted forecaster: reprise
Player's strategy [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013]
$$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(i)\right) \qquad i = 1, \dots, N$$
$$\hat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Importance sampling estimator
Unbiasedness: $\mathbb{E}_t\big[\hat{\ell}_t(i)\big] = \ell_t(i)$
Variance control: $\mathbb{E}_t\big[\hat{\ell}_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i) \text{ observed}\big)}$
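The importance sampling estimator can be illustrated on a small graph. This is a sketch under assumptions: the graph is undirected, playing an action reveals its own loss and the losses of its neighbors, and the function names, the path graph and the numbers are all hypothetical. The loop averages the estimate over the random played action to check unbiasedness exactly.

```python
def observation_probabilities(p, neighbors):
    """q[i] = probability that the loss of i is observed (i or a neighbor is played)."""
    return [p[i] + sum(p[j] for j in neighbors[i]) for i in range(len(p))]


def loss_estimates(losses, played, p, neighbors):
    """Importance-weighted loss estimates after playing action `played`."""
    q = observation_probabilities(p, neighbors)
    observed = {played} | set(neighbors[played])
    return [losses[i] / q[i] if i in observed else 0.0 for i in range(len(p))]


# Hypothetical example: path graph 0 - 1 - 2 with an arbitrary distribution p.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
p = [0.5, 0.3, 0.2]
losses = [0.9, 0.1, 0.4]

# Averaging the estimate over the played action recovers the true losses.
expected = [0.0, 0.0, 0.0]
for played, prob in enumerate(p):
    estimate = loss_estimates(losses, played, p, neighbors)
    expected = [e + prob * x for e, x in zip(expected, estimate)]
print(expected)
```

Dividing by the observation probability rather than by $P_t(I_t = i)$ is what lets the neighbors' plays reduce the second moment of the estimate.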

Regret bounds
Analysis (undirected graphs)
$$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{i=1}^{N} \frac{P_t(I_t = i)}{P_t(I_t = i) + \sum_{j \in N_G(i)} P_t(I_t = j)}$$
Lemma: for any undirected graph $G = (V, E)$ and for any probability assignment $p_1, \dots, p_N$ over its vertices,
$$\sum_{i=1}^{N} \frac{p_i}{p_i + \sum_{j \in N_G(i)} p_j} \le \alpha(G)$$
$\alpha(G)$ is the independence number of $G$ (the size of the largest subset of $V$ such that no two distinct vertices in it are adjacent in $G$).
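The lemma can be spot-checked on a small graph. The sketch below is a sanity check, not a proof: it computes $\alpha(G)$ by brute force and evaluates the left-hand side for two distributions on the 5-cycle. The function names and the example distributions are assumptions.

```python
from itertools import combinations


def independence_number(n, edges):
    """Brute-force alpha(G): size of the largest vertex set with no internal edge."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in edge_set for pair in combinations(subset, 2)):
                return size
    return 0


def lemma_sum(p, edges):
    """sum_i p_i / (p_i + sum of p_j over the neighbors j of i)."""
    n = len(p)
    neighbors = {i: set() for i in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Vertices with p_i = 0 contribute nothing and are skipped to avoid 0/0.
    return sum(p[i] / (p[i] + sum(p[j] for j in neighbors[i]))
               for i in range(n) if p[i] > 0)


# Hypothetical example: the 5-cycle, whose independence number is 2.
cycle = [(i, (i + 1) % 5) for i in range(5)]
alpha = independence_number(5, cycle)
uniform = [0.2] * 5
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]
print(alpha, lemma_sum(uniform, cycle), lemma_sum(skewed, cycle))
```

On the empty graph the sum equals the number of vertices with positive probability, and on a clique it equals 1, which matches the bandit and expert extremes.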

Regret bounds
Analysis (undirected graphs)
$$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \alpha(G) = \sqrt{2\,T \alpha(G) \ln N} \quad \text{by choosing } \eta = \sqrt{\frac{2 \ln N}{T \alpha(G)}}$$
Special cases
Experts (clique): $\alpha(G) = 1$, so $R_T \le \sqrt{2\,T \ln N}$
Bandits (empty graph): $\alpha(G) = N$, so $R_T \le \sqrt{2\,T N \ln N}$
Minimax rate
The general bound is tight: $R_T = \Theta\big(\sqrt{T \alpha(G) \ln N}\big)$
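The tuning of $\eta$ can be worked out in one step. The derivation below assumes the usual approach of minimizing the right-hand side of the bound over $\eta > 0$; the symbols $f$ and $\eta^{\star}$ are introduced here for illustration only.

```latex
\[
  f(\eta) = \frac{\ln N}{\eta} + \frac{\eta}{2}\,T\,\alpha(G),
  \qquad
  f'(\eta) = -\frac{\ln N}{\eta^{2}} + \frac{T\,\alpha(G)}{2} = 0
  \;\Longrightarrow\;
  \eta^{\star} = \sqrt{\frac{2\ln N}{T\,\alpha(G)}},
\]
\[
  f(\eta^{\star})
  = \sqrt{\frac{T\,\alpha(G)\ln N}{2}} + \sqrt{\frac{T\,\alpha(G)\ln N}{2}}
  = \sqrt{2\,T\,\alpha(G)\ln N}.
\]
```

Both terms are equal at the minimizer, so the bound is of order $\sqrt{T \alpha(G) \ln N}$.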

More general feedback models
[Figure panels: Interventions; Directed]

Old and new examples
[Figure panels: Experts; Bandits; Cops & Robbers; Revealing Action]

Exponentially weighted forecaster with exploration
Player's strategy [Alon, C-B, Dekel and Koren, 2015]
$$P_t(I_t = i) \propto \frac{1 - \gamma}{Z_t} \exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(i)\right) + \gamma\, U_G(i) \qquad i = 1, \dots, N$$
$$\hat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
$U_G$ is the uniform distribution supported on a subset of $V$.
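The mixing step can be sketched as follows, assuming $Z_t$ is the normalizer of the exponential weights so that the mixture is already a probability distribution. The function name, the exploration support and all numbers are hypothetical.

```python
import math


def exploration_distribution(cum_estimates, eta, gamma, support):
    """(1 - gamma) * exponential weights + gamma * uniform distribution on `support`."""
    shift = min(cum_estimates)  # shifting by the minimum avoids underflow
    weights = [math.exp(-eta * (c - shift)) for c in cum_estimates]
    z = sum(weights)  # plays the role of the normalizer Z_t
    explore = gamma / len(support)
    return [(1 - gamma) * w / z + (explore if i in support else 0.0)
            for i, w in enumerate(weights)]


# Hypothetical numbers: four actions, exploration restricted to actions {0, 2}.
p = exploration_distribution([3.0, 1.0, 4.0, 1.5], eta=0.5, gamma=0.1, support={0, 2})
print(p, sum(p))
```

Every action in the support is played with probability at least $\gamma / |\text{support}|$, which lower-bounds the observation probability of the vertices it reveals.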

A characterization of feedback graphs
A vertex of $G$ is:
observable if it has at least one incoming edge (possibly a self-loop)
strongly observable if it has either a self-loop or incoming edges from all other vertices
weakly observable if it is observable but not strongly observable
[Figure: example graph on vertices 1 to 5]
3 is not observable
2 and 5 are weakly observable
1 and 4 are strongly observable
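The three definitions translate directly into a small classifier. In the sketch below the edge list is hypothetical: it was chosen only so that the labels agree with the slide's example (3 not observable, 2 and 5 weakly observable, 1 and 4 strongly observable) and is not the graph drawn on the slide. A directed pair `(u, v)` is read as "playing `u` reveals the loss of `v`".

```python
def classify_vertices(vertices, edges):
    """Label each vertex as 'strongly', 'weakly' or 'not' observable.

    `edges` holds directed pairs (u, v); a pair (v, v) is a self-loop.
    """
    labels = {}
    for v in vertices:
        in_neighbors = {u for (u, w) in edges if w == v}
        others = set(vertices) - {v}
        has_self_loop = v in in_neighbors
        # Assumption: a lone vertex without a self-loop is not treated as
        # strongly observable, since it has no incoming edge at all.
        seen_by_all_others = bool(others) and others <= in_neighbors
        if has_self_loop or seen_by_all_others:
            labels[v] = "strongly"
        elif in_neighbors:
            labels[v] = "weakly"
        else:
            labels[v] = "not"
    return labels


# Hypothetical edges reproducing the labels of the slide's example.
vertices = [1, 2, 3, 4, 5]
edges = {(1, 1), (4, 4), (1, 2), (4, 5), (3, 4)}
labels = classify_vertices(vertices, edges)
print(labels)
```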

Minimax rates
$G$ is strongly observable: $R_T = \widetilde{\Theta}\big(\sqrt{\alpha(G)\,T}\big)$, with $U_G$ uniform on $V$
$G$ is weakly observable: $R_T = \widetilde{\Theta}\big(T^{2/3}\,\delta(G)^{1/3}\big)$, with $U_G$ uniform on a weakly dominating set
$G$ is not observable: $R_T = \Theta(T)$
Weakly dominating set: $\delta(G)$ is the size of the smallest set that dominates all weakly observable nodes of $G$
[Figure: example graph on vertices 1 to 5]
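For small graphs $\delta(G)$ can be found by exhaustive search. The sketch below rests on one reading of the definition, stated here as an assumption: a set $D$ dominates a vertex $v$ when some member of $D$ has an edge into $v$. The function names and the example edges are hypothetical, and the search is exponential in the number of vertices.

```python
from itertools import combinations


def weakly_observable(vertices, edges):
    """Vertices with an incoming edge, no self-loop, and not observed by all others."""
    result = set()
    for v in vertices:
        in_neighbors = {u for (u, w) in edges if w == v}
        strongly = v in in_neighbors or (set(vertices) - {v}) <= in_neighbors
        if in_neighbors and not strongly:
            result.add(v)
    return result


def weak_domination_number(vertices, edges):
    """Brute-force delta(G) together with one smallest weakly dominating set."""
    targets = weakly_observable(vertices, edges)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(any((u, v) in edges for u in chosen) for v in targets):
                return size, chosen
    return None  # unreachable: the full vertex set always dominates the targets


# Hypothetical graph: 2 is revealed only by 1, and 5 only by 4.
vertices = [1, 2, 3, 4, 5]
edges = {(1, 1), (4, 4), (1, 2), (4, 5), (3, 4)}
delta, dominating_set = weak_domination_number(vertices, edges)
print(delta, dominating_set)
```

Restricting the exploration distribution $U_G$ to such a set keeps the number of explored actions at $\delta(G)$ rather than $N$.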
