Reducing contextual bandits to supervised learning

Daniel Hsu (Columbia University)
Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire.

  1. Using no-exploration. Example: two contexts {X, Y}, two actions {A, B}. Suppose the initial policy says π̂(X) = A and π̂(Y) = B.

     Observed rewards       Reward estimates
         A     B                A     B
     X  0.7    —            X  0.7   0.5
     Y   —    0.1           Y  0.5   0.1

     New policy: π̂′(X) = π̂′(Y) = A.

     Observed rewards       Reward estimates       True rewards
         A     B                A     B                A     B
     X  0.7    —            X  0.7   0.5           X  0.7   1.0
     Y  0.3   0.1           Y  0.3   0.1           Y  0.3   0.1

     We never try action B in context X, even though its true reward is 1.0: Ω(1) regret.
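
The failure mode above is easy to reproduce. The following sketch (mine, not from the talk) runs a purely greedy learner on the example's reward table, using a default estimate of 0.5 for never-tried (context, action) pairs; it never plays B in context X and so never discovers that action's true reward of 1.0.

    import random

    # True (deterministic) rewards from the example; B is the best action in context X.
    true_reward = {("X", "A"): 0.7, ("X", "B"): 1.0,
                   ("Y", "A"): 0.3, ("Y", "B"): 0.1}
    DEFAULT = 0.5                 # estimate assumed for never-tried (context, action) pairs
    estimate = {}                 # holds rewards actually observed so far
    tried_B_in_X = 0

    random.seed(0)
    for t in range(10_000):
        x = random.choice("XY")
        # Greedy: play the argmax of the current estimates, with no exploration.
        a = max("AB", key=lambda act: estimate.get((x, act), DEFAULT))
        estimate[(x, a)] = true_reward[(x, a)]
        tried_B_in_X += (x == "X" and a == "B")

    print(tried_B_in_X)           # 0: the optimal pair (X, B) is never tried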

  2. Dealing with policies. Feedback in round t: the reward of the chosen action, r_t(a_t).
     ◮ Tells us about policies π ∈ Π such that π(x_t) = a_t.
     ◮ Not informative about other policies!
     Possible approach: track the average reward of each π ∈ Π.
     ◮ Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
     ◮ Statistically optimal regret bound O(√(K log(N) / T)) for K := |A| actions and N := |Π| policies after T rounds.
     ◮ Explicit bookkeeping is computationally intractable for large N.
     But perhaps the policy class Π has some structure...
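
For concreteness, here is the naive bookkeeping that the last bullet rules out (a sketch of mine, not an algorithm from the talk): after each round, update a running reward total for every policy that agrees with the chosen action. The per-round cost is Θ(N), which is exactly what becomes intractable when Π is a rich class.

    from collections import defaultdict

    def update_policy_stats(policies, totals, counts, x_t, a_t, r_t):
        """Naive per-policy bookkeeping: Θ(|Π|) work in every round."""
        for i, pi in enumerate(policies):
            if pi(x_t) == a_t:          # feedback only informs agreeing policies
                totals[i] += r_t
                counts[i] += 1

    # Toy policy class over contexts "X"/"Y" and actions "A"/"B" (N = 4 here,
    # but N is astronomically large for any interesting class).
    Pi = [lambda x, a1=a1, a2=a2: a1 if x == "X" else a2
          for a1 in "AB" for a2 in "AB"]
    totals, counts = defaultdict(float), defaultdict(int)
    update_policy_stats(Pi, totals, counts, "X", "A", 0.7)
    print([(counts[i], totals[i]) for i in range(len(Pi))])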

  3. Hypothetical “full-information” setting. If we observed rewards for all actions...
     ◮ As in supervised learning, we would have labeled data after t rounds: (x_1, ρ_1), ..., (x_t, ρ_t) ∈ X × R^A, under the correspondence
           context  −→  features
           actions  −→  classes
           rewards  −→  −costs
           policy   −→  classifier
     ◮ Can often exploit the structure of Π to get tractable algorithms.
     Abstraction: the arg max oracle (AMO),
           AMO({(x_i, ρ_i)}_{i=1}^t) := arg max_{π ∈ Π} Σ_{i=1}^t ρ_i(π(x_i)).
     We can't directly use this in the bandit setting.
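
To make the abstraction concrete, here is a brute-force stand-in for the AMO (my sketch, not the paper's implementation): given full reward vectors, it returns the policy in Π with the largest total reward. In practice, the whole point is to replace the explicit loop over Π with a cost-sensitive classification learner that exploits the structure of the class.

    def amo(data, policies):
        """arg max over π in Π of Σ_i ρ_i(π(x_i)).

        data     : list of (x_i, rho_i) pairs, where rho_i maps each action to a reward
        policies : explicit list standing in for the policy class Π
        """
        def total_reward(pi):
            return sum(rho[pi(x)] for x, rho in data)
        return max(policies, key=total_reward)

    # Tiny usage example with the contexts and actions from the earlier example.
    Pi = [lambda x, a1=a1, a2=a2: a1 if x == "X" else a2
          for a1 in "AB" for a2 in "AB"]
    data = [("X", {"A": 0.7, "B": 1.0}), ("Y", {"A": 0.3, "B": 0.1})]
    best = amo(data, Pi)
    print(best("X"), best("Y"))   # B A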

  4. Using the AMO with some exploration: explore-then-exploit.
     1. In the first τ rounds, choose a_t ∈ A uniformly at random to get unbiased estimates r̂_t of r_t for all t ≤ τ.
     2. Get π̂ := AMO({(x_t, r̂_t)}_{t=1}^τ).
     3. Henceforth use a_t := π̂(x_t), for t = τ+1, τ+2, ..., T.
     Regret bound with the best choice of τ: ∼ T^(−1/3) (sub-optimal). (Dependencies on |A| and |Π| hidden.)
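
A sketch of the explore-then-exploit recipe above, reusing the brute-force amo from the previous sketch. During the τ uniform-exploration rounds, the reward-vector estimate puts K · r_t(a_t) on the chosen action and 0 elsewhere, which is unbiased when a_t is uniform over K actions (this is the inverse propensity weighting introduced later in the talk). The environment sampler env is a hypothetical stand-in.

    import random

    def explore_then_exploit(env, policies, actions, tau, T):
        """env(t) -> (x_t, r_t), where r_t maps every action to its reward;
        only r_t[a_t] for the chosen a_t is treated as observed."""
        K = len(actions)
        data = []                                  # (x_t, estimated reward vector)
        total_reward = 0.0

        for t in range(tau):                       # 1. uniform exploration
            x, r = env(t)
            a = random.choice(actions)
            total_reward += r[a]
            rhat = {b: (K * r[a] if b == a else 0.0) for b in actions}
            data.append((x, rhat))

        pi_hat = amo(data, policies)               # 2. a single oracle call

        for t in range(tau, T):                    # 3. exploit pi_hat from then on
            x, r = env(t)
            total_reward += r[pi_hat(x)]
        return pi_hat, total_reward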

  5. Previous contextual bandit algorithms.
     ◮ Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995): optimal regret, but explicitly enumerates Π.
     ◮ Greedy (Langford & Zhang, NIPS 2007): sub-optimal regret, but one call to the AMO.
     ◮ Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011): near-optimal regret, but O(T^6) calls to the AMO.

  6. Our result. Let K := |A| and N := |Π|. Our result is a new, fast and simple algorithm.
     ◮ Regret bound: Õ(√(K log(N) / T)). Near optimal.
     ◮ Number of calls to the AMO: Õ(√(T K / log N)). Less than once per round!

  7. Rest of the talk. Components of the new algorithm (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
     1. "Classical" tricks: randomization, inverse propensity weighting.
     2. Efficient algorithm for balancing exploration/exploitation.
     3. Additional tricks: warm-start and epoch structure.

  8. Part 1: Classical tricks.

  9. What would've happened if I had done X? For t = 1, 2, ..., T:
     0. Nature draws (x_t, r_t) from distribution D over X × [0, 1]^A.
     1. Observe context x_t ∈ X. [e.g., user profile, search query]
     2. Choose action a_t ∈ A. [e.g., ad to display]
     3. Collect reward r_t(a_t) ∈ [0, 1]. [e.g., 1 if click, 0 otherwise]
     Q: How do I learn about r_t(a) for actions a I don't actually take?
     A: Randomize. Draw a_t ∼ p_t for some pre-specified probability distribution p_t.

  10. Inverse propensity weighting (Horvitz & Thompson, JASA 1952). Importance-weighted estimate of the reward from round t:
          r̂_t(a) := r_t(a_t) · 1{a = a_t} / p_t(a_t)   ∀ a ∈ A,
      i.e., r̂_t(a) = r_t(a_t) / p_t(a_t) if a = a_t, and 0 otherwise.
      Unbiasedness:
          E_{a_t ∼ p_t}[ r̂_t(a) ] = Σ_{a′ ∈ A} p_t(a′) · r_t(a′) · 1{a = a′} / p_t(a′) = r_t(a).
      Range and variance: both upper-bounded by 1 / p_t(a).
      Estimate of a policy's average reward: R̂ew_t(π) := (1/t) Σ_{i=1}^t r̂_i(π(x_i)).
      How should we choose the p_t?
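
A small numerical check (my sketch) of the estimator above: averaging r̂_t(a) over many draws a_t ∼ p_t recovers r_t(a) for every action, and no single estimate ever exceeds 1 / p_t(a). The reward and probability values are arbitrary.

    import random

    def ipw_estimate(a, a_t, r_observed, p_t):
        """r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t)."""
        return r_observed / p_t[a_t] if a == a_t else 0.0

    actions = ["A", "B", "C"]
    r_t = {"A": 0.7, "B": 1.0, "C": 0.1}      # true (unknown) rewards this round
    p_t = {"A": 0.5, "B": 0.3, "C": 0.2}      # the randomization distribution

    random.seed(0)
    n = 200_000
    sums = {a: 0.0 for a in actions}
    for _ in range(n):
        a_t = random.choices(actions, weights=[p_t[a] for a in actions])[0]
        for a in actions:
            sums[a] += ipw_estimate(a, a_t, r_t[a_t], p_t)

    print({a: round(sums[a] / n, 3) for a in actions})   # ≈ {'A': 0.7, 'B': 1.0, 'C': 0.1}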

  11. Hedging over policies. Get action distributions via policy distributions:
          (Q, x)  ⟼  p,
      where Q is a policy distribution (Q(π) : π ∈ Π), i.e., a probability distribution over policies in the policy class Π, and p is the induced action distribution.
      1: Pick an initial distribution Q_1 over policies Π.
      2: for round t = 1, 2, ... do
      3:     Nature draws (x_t, r_t) from distribution D over X × [0, 1]^A.
      4:     Observe context x_t.
      5:     Compute a distribution p_t over A (using Q_t and x_t).
      6:     Pick action a_t ∼ p_t.
      7:     Collect reward r_t(a_t).
      8:     Compute a new distribution Q_{t+1} over policies Π.
      9: end for
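
A sketch of the (Q, x) ⟼ p map: the induced action distribution puts Q(π) mass on the action π(x) for every policy in the support of Q. Two details are simplified here: any leftover mass of a sub-distribution is spread uniformly over actions (a simplification of mine), and the mixing with the uniform distribution is the smoothing mentioned later in the talk, with the floor μ passed in as a parameter.

    def action_distribution(Q, x, actions, mu=0.0):
        """Project a sparse policy (sub-)distribution Q onto actions for context x.

        Q  : dict mapping policy -> probability weight (may sum to at most 1)
        mu : minimum probability floor, i.e. mix with the uniform distribution
        """
        K = len(actions)
        q_mass = sum(Q.values())
        p = {a: 0.0 for a in actions}
        for pi, q in Q.items():
            p[pi(x)] += q
        for a in actions:
            p[a] += (1.0 - q_mass) / K           # leftover mass, spread uniformly
            p[a] = (1.0 - K * mu) * p[a] + mu    # smoothing: p(a) >= mu for all a
        return p

With μ = 0 and Q a point mass on a single policy π, this reduces to playing π(x) deterministically; a positive μ keeps every inverse propensity weight bounded by 1/μ.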

  12. Part 2: Efficient construction of good policy distributions.

  13. Our approach. Q: How do we choose Q_t for good exploration/exploitation?
      Caveat: Q_t must be efficiently computable and representable!
      Our approach:
      1. Define a convex feasibility problem (over distributions Q on Π) whose solutions yield (near-)optimal regret bounds.
      2. Design an algorithm that finds a sparse solution Q.
      The algorithm only accesses Π via calls to the AMO, so nnz(Q) = O(# AMO calls).

  14. The “good policy distribution” problem. Convex feasibility problem for the policy distribution Q:
          Σ_{π ∈ Π} Q(π) · R̂eg_t(π) ≤ √(K log(N) / t)          (Low regret)
          var(R̂ew_t(π)) ≤ b(π)   ∀ π ∈ Π                       (Low variance)
      Using a feasible Q_t in round t gives near-optimal regret. But there are |Π| variables and more than |Π| constraints...
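
Operationally, "feasible" can be checked as below (a simplified sketch under my own assumptions, reusing action_distribution from the earlier sketch): R̂eg_t(π) is measured against the empirically best policy under the IPW estimates, the variance of R̂ew_t(π) is replaced by the empirical proxy (1/t) · Σ_i 1 / p(π(x_i)), and the bound b is passed in as a function rather than the paper's exact expression.

    import math

    def rew_hat(pi, rounds):
        """IPW estimate of π's average reward: (1/t) · Σ_i r̂_i(π(x_i))."""
        return sum(rhat[pi(x)] for x, rhat in rounds) / len(rounds)

    def is_feasible(Q, policies, rounds, actions, K, N, mu, b):
        """rounds: list of (x_i, r̂_i) pairs, with r̂_i the IPW reward-estimate dict."""
        t = len(rounds)
        best = max(rew_hat(pi, rounds) for pi in policies)

        # (Low regret)  Σ_π Q(π) · R̂eg_t(π) ≤ √(K log(N) / t)
        regret_lhs = sum(q * (best - rew_hat(pi, rounds)) for pi, q in Q.items())
        if regret_lhs > math.sqrt(K * math.log(N) / t):
            return False

        # (Low variance)  empirical proxy for var(R̂ew_t(π)), for every π:
        # the average over logged contexts of 1 / (probability of π's action)
        for pi in policies:
            proxy = sum(1.0 / action_distribution(Q, x, actions, mu)[pi(x)]
                        for x, _ in rounds) / t
            if proxy > b(pi):
                return False
        return True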

  15. Solving the convex feasibility problem. Solver for the “good policy distribution” problem:
      Start with some Q (e.g., Q := 0), then repeat:
      1. If the “low regret” constraint is violated, fix it by rescaling: Q := cQ for some c < 1.
      2. Find the most violated “low variance” constraint (say, the one corresponding to policy π̃) and update Q(π̃) := Q(π̃) + α. If no such violated constraint exists, stop and return Q.
      (c < 1 and α > 0 have closed-form formulae.)
      (Technical detail: Q can be a sub-distribution that sums to less than one.)
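
A schematic of that solver (not the paper's exact pseudocode): the closed-form expressions for c and α, and the constraint checks, are abstracted behind callables so that nothing has to be invented here; the AMO-based find_violated subroutine is sketched after the next slide.

    def solve_good_Q(Q, low_regret_ok, find_violated, rescale_const, step_size,
                     max_iters=1000):
        """Coordinate descent on a sparse (sub-)distribution Q over policies.

        Q              : dict policy -> weight (start with {}, i.e. Q = 0)
        low_regret_ok  : Q -> bool, checks the (Low regret) constraint
        find_violated  : Q -> policy or None, most violated (Low variance) constraint
        rescale_const  : Q -> c in (0, 1), the closed-form rescaling factor
        step_size      : (Q, policy) -> α > 0, the closed-form coordinate step
        """
        for _ in range(max_iters):
            if not low_regret_ok(Q):                  # step 1: rescale Q := c · Q
                c = rescale_const(Q)
                Q = {pi: c * q for pi, q in Q.items()}
            pi_viol = find_violated(Q)                # step 2: most violated constraint
            if pi_viol is None:
                return Q                              # feasible: done
            Q[pi_viol] = Q.get(pi_viol, 0.0) + step_size(Q, pi_viol)
        return Q

Each iteration touches a single coordinate Q(π̃), so the returned Q has at most as many non-zeros as there were iterations, which is what makes it cheap to represent.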

  16. Implementation via the AMO. Finding a “low variance” constraint violation:
      1. Create fictitious rewards for each i = 1, 2, ..., t:
             r̃_i(a) := r̂_i(a) + μ / Q(a | x_i)   ∀ a ∈ A,   where μ ≈ √(log(N) / (Kt)).
      2. Obtain π̃ := AMO({(x_i, r̃_i)}_{i=1}^t).
      3. R̃ew_t(π̃) > threshold iff π̃'s “low variance” constraint is violated.
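
A sketch of that reduction, compatible with the solver skeleton above and reusing the earlier amo and action_distribution sketches; the threshold is whatever value the analysis prescribes and is simply passed in here.

    def make_find_violated(rounds, policies, actions, mu, threshold):
        """Build a find_violated(Q) callable for solve_good_Q.

        rounds : list of (x_i, r̂_i) logged so far
        mu     : smoothing constant, roughly √(log(N) / (K·t))
        """
        t = len(rounds)

        def find_violated(Q):
            # 1. fictitious rewards r̃_i(a) = r̂_i(a) + μ / Q(a | x_i)
            fictitious = []
            for x, rhat in rounds:
                p = action_distribution(Q, x, actions, mu)
                fictitious.append((x, {a: rhat[a] + mu / p[a] for a in actions}))
            # 2. one AMO call on the fictitious data set
            pi_tilde = amo(fictitious, policies)
            # 3. violated iff π̃'s fictitious average reward exceeds the threshold
            score = sum(rho[pi_tilde(x)] for x, rho in fictitious) / t
            return pi_tilde if score > threshold else None

        return find_violated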

  17. Iteration bound. The solver is coordinate descent for minimizing the potential function
          Φ(Q) := c_1 · E_x[ RE( uniform ‖ Q(·|x) ) ] + c_2 · Σ_{π ∈ Π} Q(π) · R̂eg_t(π).
      (The partial derivative with respect to Q(π) is the “low variance” constraint for π. Actually use (1 − ε) · Q + ε · uniform inside the RE expression.)
      The solver returns a feasible solution after Õ(√(Kt / log N)) steps.

  18. Algorithm.
      1: Pick an initial distribution Q_1 over policies Π.
      2: for round t = 1, 2, ... do
      3:     Nature draws (x_t, r_t) from distribution D over X × [0, 1]^A.
      4:     Observe context x_t.
      5:     Compute the action distribution p_t := Q_t(· | x_t).
      6:     Pick action a_t ∼ p_t.
      7:     Collect reward r_t(a_t).
      8:     Compute the new policy distribution Q_{t+1} using coordinate descent + AMO.
      9: end for
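
Finally, a minimal end-to-end sketch of that loop, assembled from the earlier helper sketches (action_distribution, solve_good_Q, make_find_violated). The smoothing schedule, threshold, rescaling constant, and step size below are illustrative placeholders rather than the paper's exact choices, and the warm-start and epoch tricks from the talk's outline are omitted.

    import math, random

    def contextual_bandit_loop(env, policies, actions, T, N):
        """env(t) -> (x_t, r_t), with r_t a dict over actions; only r_t[a_t] is used."""
        K = len(actions)
        Q = {}                                    # Q_1 = 0 (a sub-distribution)
        rounds = []                               # logged (x_i, r̂_i) pairs
        total = 0.0
        for t in range(1, T + 1):
            x, r = env(t)
            mu = min(1.0 / K, math.sqrt(math.log(N) / (K * t)))
            p = action_distribution(Q, x, actions, mu)
            a = random.choices(actions, weights=[p[b] for b in actions])[0]
            total += r[a]
            # inverse propensity weighted reward estimate for this round
            rhat = {b: (r[a] / p[a] if b == a else 0.0) for b in actions}
            rounds.append((x, rhat))
            # recompute the policy distribution via coordinate descent + AMO
            Q = solve_good_Q(
                Q,
                low_regret_ok=lambda q: True,         # placeholder constraint check
                find_violated=make_find_violated(rounds, policies, actions, mu,
                                                 threshold=2 * K),
                rescale_const=lambda q: 0.5,          # placeholder for the closed-form c
                step_size=lambda q, pi: mu,           # placeholder for the closed-form α
                max_iters=int(math.sqrt(K * t / max(math.log(N), 1.0))) + 1)
        return total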

  19. Recap. A feasible solution to the “good policy distribution” problem gives a near-optimal regret bound. The new coordinate descent algorithm repeatedly finds a violated constraint and adjusts Q to satisfy it.
