From Bandits to Experts: A Tale of Domination and Independence

Nicolò Cesa-Bianchi, Università degli Studi di Milano (UNIMI)


  1. From Bandits to Experts: A Tale of Domination and Independence. Nicolò Cesa-Bianchi, Università degli Studi di Milano.

  2. Joint work with: Noga Alon, Ofer Dekel, Tomer Koren.

  3. Theory of repeated games
     James Hannan (1922–2010) and David Blackwell (1919–2010).
     Learning to play a game (1956): play a game repeatedly against a possibly suboptimal opponent.

  4. Zero-sum 2-person games played more than once
     Known $N \times M$ loss matrix over $\mathbb{R}$ with entries $\ell(i, j)$, $i = 1, \dots, N$, $j = 1, \dots, M$. The row player (player) has $N$ actions; the column player (opponent) has $M$ actions.
     For each game round $t = 1, 2, \dots$
     Player chooses action $i_t$ and opponent chooses action $y_t$.
     The player suffers loss $\ell(i_t, y_t)$ (= gain of opponent).
     Player can learn from opponent's history of past choices $y_1, \dots, y_{t-1}$.

  5. Prediction with expert advice (Volodya Vovk, Manfred Warmuth)
     Play an unknown loss matrix: rows are the actions $1, \dots, N$, columns are the rounds $t = 1, 2, \dots$, and the entry in row $i$, column $t$ is $\ell_t(i)$.
     Opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function $\ell(i_t, y_t) = \ell_t(i_t)$.

  8. Playing the experts game
     [Figure: $N$ actions, with the losses of all actions revealed]
     For $t = 1, 2, \dots$
     (1) Loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player).
     (2) Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
     (3) Player gets feedback information: $\ell_t = \big(\ell_t(1), \dots, \ell_t(N)\big)$.

  9. Oblivious opponents
     The loss process $\langle \ell_t \rangle_{t \ge 1}$ is deterministic and unknown to the (randomized) player $I_1, I_2, \dots$
     Oblivious regret minimization:
     $$R_T \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i = 1, \dots, N} \sum_{t=1}^{T} \ell_t(i) \stackrel{\text{want}}{=} o(T)$$
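As a concrete reading of the regret definition, the sketch below computes the realized regret of one sequence of plays against the best fixed action in hindsight; averaging this quantity over the player's randomization gives $R_T$. The function name `realized_regret` and the toy loss table are illustrative assumptions, not part of the slides.

```python
from typing import Sequence


def realized_regret(losses: Sequence[Sequence[float]], played: Sequence[int]) -> float:
    """Cumulative loss of the played actions minus that of the best fixed action.

    losses[t][i] is the loss of action i at round t; played[t] is the action chosen at round t.
    """
    n_actions = len(losses[0])
    player_total = sum(losses[t][i] for t, i in enumerate(played))
    best_fixed = min(sum(row[i] for row in losses) for i in range(n_actions))
    return player_total - best_fixed


# Toy example (made-up numbers): 3 rounds, 2 actions.
toy_losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
toy_plays = [1, 0, 0]
toy_regret = realized_regret(toy_losses, toy_plays)
```

Here the player accumulates loss 2 while the best fixed action (action 0) accumulates loss 1.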

  10. Bounds on regret [How to use expert advice, 1997]
      Lower bound using random losses
      Losses $\ell_t(i)$ are independent random coin flips $L_t(i) \in \{0, 1\}$.
      For any player strategy,
      $$\mathbb{E}\left[\sum_{t=1}^{T} L_t(I_t)\right] = \frac{T}{2}$$
      Then the expected regret is
      $$\mathbb{E}\left[\max_{i = 1, \dots, N} \sum_{t=1}^{T} \left(\frac{1}{2} - L_t(i)\right)\right] = \big(1 - o(1)\big) \sqrt{\frac{T \ln N}{2}}$$
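The coin-flip construction can be explored numerically. The following is a minimal Monte Carlo sketch, assuming fair independent coin flips as on the slide; the function name, the sample sizes and the seed are arbitrary choices made for illustration. For small $N$ and $T$ the estimate sits below $\sqrt{T \ln N / 2}$, since the $1 - o(1)$ factor approaches 1 only as $N$ and $T$ grow.

```python
import math
import random


def coin_flip_regret(n_actions: int, horizon: int, n_trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max_i sum_t (1/2 - L_t(i))] for fair coin-flip losses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Cumulative loss of each action: a sum of `horizon` fair coin flips.
        cumulative = [
            sum(rng.random() < 0.5 for _ in range(horizon)) for _ in range(n_actions)
        ]
        # Any player has expected loss horizon / 2; the best action has loss min(cumulative).
        total += horizon / 2 - min(cumulative)
    return total / n_trials


N, T = 10, 100
estimate = coin_flip_regret(N, T, n_trials=200)
limit = math.sqrt(T * math.log(N) / 2)
print(f"estimated regret: {estimate:.2f}, sqrt(T ln N / 2): {limit:.2f}")
```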

  11. Exponentially weighted forecaster
      At time $t$ pick action $I_t = i$ with probability proportional to
      $$\exp\left(-\eta \sum_{s=1}^{t-1} \ell_s(i)\right)$$
      The sum in the exponent is the total loss of action $i$ up to now.
      Regret bound [How to use expert advice, 1997]: if $\eta = \sqrt{(8 \ln N) / T}$ then
      $$R_T \le \sqrt{\frac{T \ln N}{2}}$$
      This matches the lower bound, including constants.
      The dynamic choice $\eta_t = \sqrt{(8 \ln N) / t}$ only loses small constants.
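A minimal sketch of the exponentially weighted forecaster with full feedback follows. It computes the expected regret exactly from the sampling probabilities instead of drawing actions. The tuning $\eta = \sqrt{8 \ln N / T}$ is the value that minimizes the standard bound $\ln N / \eta + \eta T / 8$ and yields $\sqrt{T \ln N / 2}$. The function name and the randomly generated loss table are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def hedge_expected_regret(losses: Sequence[Sequence[float]], eta: float) -> float:
    """Expected regret of the exponentially weighted forecaster on a fixed loss table."""
    n_actions = len(losses[0])
    cumulative: List[float] = [0.0] * n_actions
    expected_loss = 0.0
    for row in losses:
        # Shift by the minimum for numerical stability; the probabilities are unchanged.
        shift = min(cumulative)
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        norm = sum(weights)
        expected_loss += sum(w / norm * loss for w, loss in zip(weights, row))
        cumulative = [c + loss for c, loss in zip(cumulative, row)]
    return expected_loss - min(cumulative)


N, T = 5, 500
rng = random.Random(1)
loss_table = [[rng.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)
expected_regret = hedge_expected_regret(loss_table, eta)
bound = math.sqrt(T * math.log(N) / 2)
print(f"expected regret {expected_regret:.2f} vs bound {bound:.2f}")
```

The bound holds for every loss sequence in $[0, 1]$, so the comparison does not depend on the particular random table used here.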

  15. The bandit problem: playing an unknown game
      [Figure: $N$ actions, with only the loss of the played action revealed]
      For $t = 1, 2, \dots$
      (1) Loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player).
      (2) Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
      (3) Player gets feedback information: only $\ell_t(I_t)$ is revealed.
      Many applications: ad placement, dynamic content adaptation, routing, online auctions.

  16. Relationships between actions [Mannor and Shamir, 2011]

  19. A graph of relationships over actions
      [Figure: a graph over 10 actions; five losses are revealed (7, 3, 6, 7, 2) and the other five stay hidden]

  20. Recovering expert and bandit settings
      Experts: clique (every loss is revealed).
      Bandits: empty graph (only the loss of the played action is revealed).
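The two special cases can be sketched in a few lines, assuming the side-observation model of Mannor and Shamir in which playing an action reveals its own loss and the losses of its neighbors in an undirected graph. The helper names (`observed_losses`, `clique`, `empty_graph`) and the loss values are hypothetical.

```python
from typing import Dict, List, Set


def observed_losses(neighbors: Dict[int, Set[int]], played: int, losses: List[float]) -> Dict[int, float]:
    """Losses revealed after playing `played`: its own loss and those of its neighbors."""
    revealed = {played} | neighbors[played]
    return {i: losses[i] for i in sorted(revealed)}


def clique(n: int) -> Dict[int, Set[int]]:
    """Complete graph: every action is adjacent to every other action."""
    return {i: set(range(n)) - {i} for i in range(n)}


def empty_graph(n: int) -> Dict[int, Set[int]]:
    """Graph with no edges: no action has neighbors."""
    return {i: set() for i in range(n)}


round_losses = [0.7, 0.3, 0.2, 0.4]
experts_view = observed_losses(clique(4), played=1, losses=round_losses)
bandit_view = observed_losses(empty_graph(4), played=1, losses=round_losses)
```

With the clique the player sees all four losses, as in the experts game; with the empty graph the player sees only the loss of action 1, as in the bandit game.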

  21. Exponentially weighted forecaster: reprise
      Player's strategy [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013]
      $$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_s(i)\right), \qquad i = 1, \dots, N$$
      $$\widehat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
      Importance sampling estimator
      Unbiasedness: $\mathbb{E}_t\big[\widehat{\ell}_t(i)\big] = \ell_t(i)$
      Variance control: $\mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i) \text{ observed}\big)}$
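The unbiasedness and variance claims can be checked exactly on a small example by enumerating every possible played action. The sketch assumes an undirected feedback graph in which $\ell_t(i)$ is observed when the played action is $i$ or a neighbor of $i$, so that $P_t(\ell_t(i) \text{ observed}) = p_i + \sum_{j \in N_G(i)} p_j$. The function names, the path graph and the numbers are illustrative assumptions.

```python
from typing import Dict, List, Set


def observation_probs(p: List[float], neighbors: Dict[int, Set[int]]) -> List[float]:
    """Probability that each action's loss is observed: p_i plus the mass of its neighbors."""
    return [p[i] + sum(p[j] for j in neighbors[i]) for i in range(len(p))]


def loss_estimates(played: int, losses: List[float], p: List[float],
                   neighbors: Dict[int, Set[int]]) -> List[float]:
    """Importance-weighted estimates: loss / observation probability if observed, else 0."""
    q = observation_probs(p, neighbors)
    seen = {played} | neighbors[played]
    return [losses[i] / q[i] if i in seen else 0.0 for i in range(len(p))]


# Path graph 0 - 1 - 2 with made-up probabilities and losses.
path = {0: {1}, 1: {0, 2}, 2: {1}}
p = [0.5, 0.3, 0.2]
true_losses = [0.9, 0.1, 0.6]
q = observation_probs(p, path)

# Exact expectations over the played action k, drawn with probability p[k].
first_moment = [
    sum(p[k] * loss_estimates(k, true_losses, p, path)[i] for k in range(3)) for i in range(3)
]
second_moment = [
    sum(p[k] * loss_estimates(k, true_losses, p, path)[i] ** 2 for k in range(3)) for i in range(3)
]
```

The first moments reproduce the true losses, and each second moment stays below the inverse observation probability.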

  22. Regret bounds
      Analysis (undirected graphs)
      $$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{i=1}^{N} \frac{P_t(I_t = i)}{P_t(I_t = i) + \sum_{j \in N_G(i)} P_t(I_t = j)}$$
      Lemma. For any undirected graph $G = (V, E)$ and for any probability assignment $p_1, \dots, p_N$ over its vertices,
      $$\sum_{i=1}^{N} \frac{p_i}{p_i + \sum_{j \in N_G(i)} p_j} \le \alpha(G)$$
      $\alpha(G)$ is the independence number of $G$ (the size of the largest subset of $V$ such that no two distinct vertices in it are adjacent in $G$).
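The lemma can be sanity-checked numerically on a small graph. The sketch below computes $\alpha(G)$ by brute force (only feasible for a handful of vertices) and evaluates the left-hand side of the lemma for random probability assignments on a 5-cycle. The graph, the helper names and the seed are assumptions chosen for illustration; this is a check, not a proof.

```python
import itertools
import random
from typing import Dict, List, Set


def independence_number(neighbors: Dict[int, Set[int]]) -> int:
    """Brute-force alpha(G): size of the largest set with no two adjacent vertices."""
    vertices = list(neighbors)
    for size in range(len(vertices), 0, -1):
        for subset in itertools.combinations(vertices, size):
            if all(v not in neighbors[u] for u, v in itertools.combinations(subset, 2)):
                return size
    return 0


def lemma_sum(p: List[float], neighbors: Dict[int, Set[int]]) -> float:
    """Left-hand side of the lemma; terms with p_i = 0 contribute nothing."""
    return sum(
        p[i] / (p[i] + sum(p[j] for j in neighbors[i])) for i in neighbors if p[i] > 0
    )


# 5-cycle: vertex i is adjacent to i - 1 and i + 1 (mod 5).
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
alpha = independence_number(cycle)

rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    raw = [rng.random() for _ in range(5)]
    total = sum(raw)
    worst = max(worst, lemma_sum([r / total for r in raw], cycle))
```

On the 5-cycle the uniform assignment gives $5 \cdot \frac{1/5}{3/5} = 5/3$, below $\alpha(G) = 2$.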

  23. Regret bounds
      Analysis (undirected graphs)
      $$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \alpha(G) = \sqrt{2\,T \alpha(G) \ln N} \qquad \text{by choosing } \eta = \sqrt{\frac{2 \ln N}{T \alpha(G)}}$$
      Special cases
      Experts (clique): $\alpha(G) = 1$, so $R_T \le \sqrt{2\,T \ln N}$.
      Bandits (empty graph): $\alpha(G) = N$, so $R_T \le \sqrt{2\,T N \ln N}$.
      Minimax rate
      The general bound is tight: $R_T = \Theta\big(\sqrt{T \alpha(G) \ln N}\big)$.
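The choice of $\eta$ can be made explicit. Minimizing $\ln N / \eta + (\eta / 2)\, T \alpha(G)$ over $\eta$ gives $\eta = \sqrt{2 \ln N / (T \alpha(G))}$ and the value $\sqrt{2\,T \alpha(G) \ln N}$, that is, the rate $\sqrt{T \alpha(G) \ln N}$ with leading constant $\sqrt{2}$. The sketch below evaluates this for the two special cases; the function names and the values of $T$ and $N$ are illustrative assumptions.

```python
import math


def regret_bound(eta: float, horizon: int, alpha: int, n_actions: int) -> float:
    """Upper bound ln N / eta + (eta / 2) * T * alpha(G) from the analysis."""
    return math.log(n_actions) / eta + 0.5 * eta * horizon * alpha


def tuned_eta(horizon: int, alpha: int, n_actions: int) -> float:
    """Learning rate that minimizes regret_bound over eta."""
    return math.sqrt(2 * math.log(n_actions) / (horizon * alpha))


T, N = 10_000, 16
for name, alpha in [("experts (clique)", 1), ("bandits (empty graph)", N)]:
    eta = tuned_eta(T, alpha, N)
    print(f"{name}: eta = {eta:.4f}, bound = {regret_bound(eta, T, alpha, N):.1f}")
```

The bandit bound is larger than the experts bound by a factor $\sqrt{N}$, which is the price of observing one loss per round instead of all of them.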

  24. More general feedback models
      [Figure panels: Interventions; Directed]

  25. Old and new examples
      [Figure panels: Experts; Bandits; Cops & Robbers; Revealing Action]

  26. Exponentially weighted forecaster with exploration
      Player's strategy [Alon, C-B, Dekel and Koren, 2015]
      $$P_t(I_t = i) \propto \frac{1 - \gamma}{Z_t} \exp\left(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_s(i)\right) + \gamma\, U_G(i), \qquad i = 1, \dots, N$$
      $$\widehat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
      $U_G$ is the uniform distribution supported on a subset of $V$.
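The mixing step can be sketched as follows, assuming $Z_t$ is the normalizer of the exponential weights so that the mixture is itself a probability distribution. The function name, the cumulative loss estimates and the exploration set are hypothetical values used only to show the computation.

```python
import math
from typing import List, Set


def exploration_distribution(cum_est_losses: List[float], eta: float, gamma: float,
                             explore_set: Set[int]) -> List[float]:
    """Mix exponential weights with a uniform distribution on `explore_set`."""
    # Shift by the minimum for numerical stability; the normalized weights are unchanged.
    shift = min(cum_est_losses)
    weights = [math.exp(-eta * (c - shift)) for c in cum_est_losses]
    z = sum(weights)
    uniform_mass = 1.0 / len(explore_set)
    return [
        (1 - gamma) * w / z + (gamma * uniform_mass if i in explore_set else 0.0)
        for i, w in enumerate(weights)
    ]


probs = exploration_distribution([3.0, 1.0, 2.5, 0.0], eta=0.5, gamma=0.1, explore_set={0, 2})
```

Every action in the exploration set is played with probability at least $\gamma / |U|$, here $0.05$, no matter how large its estimated loss is.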

  27. A characterization of feedback graphs
      A vertex of $G$ is:
      observable if it has at least one incoming edge (possibly a self-loop);
      strongly observable if it has either a self-loop or incoming edges from all other vertices;
      weakly observable if it is observable but not strongly observable.
      [Figure: directed graph on vertices 1 to 5] In the example, 3 is not observable, 2 and 5 are weakly observable, and 1 and 4 are strongly observable.
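The three definitions translate directly into code. In the sketch below a directed edge `(u, v)` means that playing `u` reveals the loss of `v`. The edge set is a hypothetical graph chosen to reproduce the labels of the slide's example; the actual edges of the figure are not recoverable from the transcript.

```python
from typing import Dict, Iterable, Set, Tuple


def classify_vertices(vertices: Iterable[int], edges: Set[Tuple[int, int]]) -> Dict[int, str]:
    """Label each vertex as not, weakly, or strongly observable."""
    vertex_set = set(vertices)
    labels: Dict[int, str] = {}
    for v in sorted(vertex_set):
        sources = {u for (u, w) in edges if w == v}  # vertices with an edge into v
        others = vertex_set - {v}
        if not sources:
            labels[v] = "not observable"
        elif v in sources or others <= sources:
            # Self-loop, or incoming edges from all other vertices.
            labels[v] = "strongly observable"
        else:
            labels[v] = "weakly observable"
    return labels


# Hypothetical edge set consistent with the labels stated on the slide.
example_edges = {(1, 1), (1, 2), (4, 4), (4, 5)}
labels = classify_vertices([1, 2, 3, 4, 5], example_edges)
```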

  28. Minimax rates
      $G$ is strongly observable: $R_T = \widetilde{\Theta}\big(\sqrt{\alpha(G)\, T}\big)$; $U_G$ is uniform on $V$.
      $G$ is weakly observable: $R_T = \widetilde{\Theta}\big(T^{2/3} \delta(G)^{1/3}\big)$; $U_G$ is uniform on a weakly dominating set.
      $G$ is not observable: $R_T = \Theta(T)$.
      Weakly dominating set: $\delta(G)$ is the size of the smallest set that dominates all weakly observable nodes of $G$.
      [Figure: weakly dominating set in the example graph on vertices 1 to 5]
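For small graphs, $\delta(G)$ can be found by exhaustive search. The sketch assumes that a set $D$ dominates a vertex $w$ when some $d \in D$ has a directed edge $d \to w$, and it uses a hypothetical five-vertex graph in which every vertex is observable. The search is exponential in the number of vertices and is meant only to illustrate the definition.

```python
from itertools import combinations
from typing import List, Set, Tuple


def weakly_observable(vertices: List[int], edges: Set[Tuple[int, int]]) -> Set[int]:
    """Vertices that are observable but have neither a self-loop nor edges from all others."""
    weak = set()
    for v in vertices:
        sources = {u for (u, w) in edges if w == v}
        strongly = v in sources or (set(vertices) - {v}) <= sources
        if sources and not strongly:
            weak.add(v)
    return weak


def weak_domination_number(vertices: List[int], edges: Set[Tuple[int, int]]) -> int:
    """Size of the smallest set whose out-edges reach every weakly observable vertex."""
    targets = weakly_observable(vertices, edges)
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            reached = {w for (u, w) in edges if u in candidate}
            if targets <= reached:
                return size
    return len(vertices)


# Hypothetical graph: 1 and 4 have self-loops; 2, 3 and 5 are only seen from 1 or 4.
graph_vertices = [1, 2, 3, 4, 5]
graph_edges = {(1, 1), (1, 2), (4, 3), (4, 4), (4, 5)}
delta = weak_domination_number(graph_vertices, graph_edges)
```

In this graph the weakly observable vertices are 2, 3 and 5, and the smallest weakly dominating set is $\{1, 4\}$.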
