The Contextual Bandits Problem: A New, Fast, and Simple Algorithm


  1. Learning with Context and Policies
     • goal: learn through experimentation to do (almost) as well as best π ∈ Π
     • policies may be very complex and expressive ⇒ powerful approach
     • challenges:
       • Π extremely large
       • need to be learning about all policies simultaneously while also performing as well as the best
       • when action selected, only observe reward for policies that would have chosen same action
     • exploration versus exploitation on a gigantic scale!

  2. Formal Model (revisited)
     • repeat:
         1a. learner observes context x_t
         1b. reward vector r_t ∈ [0,1]^K chosen (but not observed)
          2. learner selects action a_t ∈ {1, ..., K}
          3. learner receives observed reward r_t(a_t)
     • goal: want high total (or average) reward relative to best policy π ∈ Π

  3. Formal Model (revisited)
     • repeat:
         1a. learner observes context x_t
         1b. reward vector r_t ∈ [0,1]^K chosen (but not observed)
          2. learner selects action a_t ∈ {1, ..., K}
          3. learner receives observed reward r_t(a_t)
     • goal: want high total (or average) reward relative to best policy π ∈ Π
     • i.e., want small regret:
           (1/T) Σ_{t=1}^T r_t(a_t)    [learner's average reward]

  4. Formal Model (revisited)
     • repeat:
         1a. learner observes context x_t
         1b. reward vector r_t ∈ [0,1]^K chosen (but not observed)
          2. learner selects action a_t ∈ {1, ..., K}
          3. learner receives observed reward r_t(a_t)
     • goal: want high total (or average) reward relative to best policy π ∈ Π
     • i.e., want small regret:
           max_{π ∈ Π} (1/T) Σ_{t=1}^T r_t(π(x_t))  −  (1/T) Σ_{t=1}^T r_t(a_t)
           [best policy's average reward]              [learner's average reward]
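To make the protocol concrete, here is a minimal Python sketch of the repeated interaction above; the environment and learner objects and their methods (context, rewards, act, update) are hypothetical stand-ins, not anything from the talk.

    def run_contextual_bandit(environment, learner, T, K):
        # Sketch of the interaction protocol; `environment` and `learner` are hypothetical.
        total = 0.0
        for t in range(T):
            x_t = environment.context(t)          # 1a. learner observes context x_t
            r_t = environment.rewards(t)          # 1b. reward vector in [0,1]^K, hidden from learner
            a_t = learner.act(x_t)                # 2.  learner selects action a_t in {0, ..., K-1}
            learner.update(x_t, a_t, r_t[a_t])    # 3.  learner observes only r_t(a_t)
            total += r_t[a_t]
        return total / T                          # learner's average reward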

  5. An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]
     • Exp4 solves this problem
     • maintains weights over all policies in Π

  6. An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]
     • Exp4 solves this problem
     • maintains weights over all policies in Π
     • regret is essentially optimal:  O(√(K ln|Π| / T))
     • even works for adversarial (i.e., non-random, non-iid) data

  7. An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]
     • Exp4 solves this problem
     • maintains weights over all policies in Π
     • regret is essentially optimal:  O(√(K ln|Π| / T))
     • even works for adversarial (i.e., non-random, non-iid) data
     • but: time/space are linear in |Π|
       • too slow if |Π| gigantic

  8. An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]
     • Exp4 solves this problem
     • maintains weights over all policies in Π
     • regret is essentially optimal:  O(√(K ln|Π| / T))
     • even works for adversarial (i.e., non-random, non-iid) data
     • but: time/space are linear in |Π|
       • too slow if |Π| gigantic
     • seems hopeless to do better for fully general policy spaces
     • this talk: aim for time/space only poly(log|Π|) when Π is "well structured"
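For intuition about the |Π| bottleneck, here is a rough Python sketch of Exp4 specialized to deterministic policies (an assumption; the environment interface and the single exploration parameter gamma are illustrative, not the authors' code). Note that the weight vector and every per-round update touch all of Π.

    import math, random

    def exp4(policies, environment, T, K, gamma=0.1):
        w = [1.0] * len(policies)                      # one weight per policy: space linear in |Pi|
        for t in range(T):
            x = environment.context(t)
            W = sum(w)
            p = [gamma / K] * K                        # uniform exploration floor
            for wi, pi in zip(w, policies):            # time linear in |Pi| per round
                p[pi(x)] += (1.0 - gamma) * wi / W
            a = random.choices(range(K), weights=p)[0]
            reward = environment.pull(t, a)            # observe reward of chosen action only
            r_hat = reward / p[a]                      # importance-weighted reward estimate
            for i, pi in enumerate(policies):
                if pi(x) == a:                         # only policies that agree with a get credit
                    w[i] *= math.exp(gamma * r_hat / K)
        return w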

  9. The (Fantasy) Full-Information Setting
     • say see rewards for all actions

  10. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)

            [ · ] = learner's action

  11. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0

            [ · ] = learner's action

  12. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0

            [ · ] = learner's action

      learner's total reward = 0.2 + ···

  13. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0
            (Female, 18, ...)     [1.0]       0.0        1.0
            (Female, 48, ...)      0.5       [0.1]       0.7
            ...

            [ · ] = learner's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···

  14. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0
            (Female, 18, ...)     [1.0]       0.0        1.0
            (Female, 48, ...)      0.5       [0.1]       0.7
            ...

            [ · ] = learner's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···

      • for any π, can compute rewards would have received

  15. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0 + 1.0 + 0.5 + ···

      • for any π, can compute rewards would have received

  16. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0 + 1.0 + 0.5 + ···

      • for any π, can compute rewards would have received
      • average is good estimate of π's expected reward

  17. The (Fantasy) Full-Information Setting
      • say see rewards for all actions

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0 + 1.0 + 0.5 + ···

      • for any π, can compute rewards would have received
      • average is good estimate of π's expected reward
      • choose empirically best π ∈ Π
      • regret = O(√(ln|Π| / T))
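A tiny Python check of the bookkeeping in the table above. The action indices are read off the slide's totals (taking π to agree with the learner on the second row), so treat them as illustrative.

    rewards = [
        [1.0, 0.2, 0.0],   # (Male, 50, ...)
        [1.0, 0.0, 1.0],   # (Female, 18, ...)
        [0.5, 0.1, 0.7],   # (Female, 48, ...)
    ]
    learner_actions = [1, 0, 1]   # 0-indexed actions the learner took
    pi_actions      = [2, 0, 0]   # actions a candidate policy pi would have taken

    learner_total = sum(r[a] for r, a in zip(rewards, learner_actions))  # 0.2 + 1.0 + 0.1 = 1.3
    pi_total      = sum(r[a] for r, a in zip(rewards, pi_actions))       # 0.0 + 1.0 + 0.5 = 1.5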

  18. "Arg-Max Oracle" (AMO)
      • to apply, just need "oracle" (algorithm/subroutine) for finding best π ∈ Π on observed rewards
      • input: (x_1, r_1), ..., (x_T, r_T)
            x_t = context
            r_t = (r_t(1), ..., r_t(K)) = rewards for all actions
      • output:
            π̂ = argmax_{π ∈ Π} Σ_{t=1}^T r_t(π(x_t))

  19. "Arg-Max Oracle" (AMO)
      • to apply, just need "oracle" (algorithm/subroutine) for finding best π ∈ Π on observed rewards
      • input: (x_1, r_1), ..., (x_T, r_T)
            x_t = context
            r_t = (r_t(1), ..., r_t(K)) = rewards for all actions
      • output:
            π̂ = argmax_{π ∈ Π} Σ_{t=1}^T r_t(π(x_t))
      • really just (cost-sensitive) classification:
            context ↔ example
            action  ↔ label/class
            policy  ↔ classifier
            reward  ↔ gain/(negative) cost

  20. "Arg-Max Oracle" (AMO)
      • to apply, just need "oracle" (algorithm/subroutine) for finding best π ∈ Π on observed rewards
      • input: (x_1, r_1), ..., (x_T, r_T)
            x_t = context
            r_t = (r_t(1), ..., r_t(K)) = rewards for all actions
      • output:
            π̂ = argmax_{π ∈ Π} Σ_{t=1}^T r_t(π(x_t))
      • really just (cost-sensitive) classification:
            context ↔ example
            action  ↔ label/class
            policy  ↔ classifier
            reward  ↔ gain/(negative) cost
      • so: if have "good" classification algorithm for Π, can use to find good policy (in full-information setting)
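As a stand-in for the oracle, a brute-force Python version that simply enumerates Π; this is only feasible for tiny policy classes, and the point of the abstraction is that a cost-sensitive classification learner plays this role when Π is large but structured.

    def amo(policies, data):
        # data: list of (x_t, r_t) pairs, where r_t is the full K-vector of rewards
        # returns the policy in Pi with the largest total reward on the data
        return max(policies, key=lambda pi: sum(r_t[pi(x_t)] for x_t, r_t in data))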

  21. But in the Bandit Setting...
      • ...only see rewards for actions taken

  22. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0
            (Female, 18, ...)     [1.0]       0.0        1.0
            (Female, 48, ...)      0.5       [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)

  23. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0
            (Female, 18, ...)     [1.0]       0.0        1.0
            (Female, 48, ...)      0.5       [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)

  24. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0
            (Female, 18, ...)     [1.0]       0.0        1.0
            (Female, 48, ...)      0.5       [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)

      learner's total reward = 0.2 + 1.0 + 0.1 + ···

  25. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···

      • for any policy π, only observe π's rewards on subset of rounds

  26. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0?? + 1.0 + 0.5?? + ···

      • for any policy π, only observe π's rewards on subset of rounds

  27. But in the Bandit Setting...
      • ...only see rewards for actions taken

            Context              Action 1   Action 2   Action 3
            (Male, 50, ...)        1.0       [0.2]       0.0 ✓
            (Female, 18, ...)     [1.0] ✓     0.0        1.0
            (Female, 48, ...)      0.5 ✓     [0.1]       0.7
            ...

            [ · ] = learner's action (only these rewards are observed)     ✓ = π's action

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0?? + 1.0 + 0.5?? + ···

      • for any policy π, only observe π's rewards on subset of rounds
      • might like to use AMO to find empirically good policy
      • problems:
        • only see some rewards
        • observed rewards highly biased (due to skewed choice of actions)

  28. Key Question
      • still: AMO is a natural primitive
      • key question: can we solve the contextual bandits problem given access to AMO?

  29. Key Question
      • still: AMO is a natural primitive
      • key question: can we solve the contextual bandits problem given access to AMO?
      • can we use an AMO on bandit data by somehow:
        • filling in missing data
        • overcoming bias

  30. Key Question
      • still: AMO is a natural primitive
      • key question: can we solve the contextual bandits problem given access to AMO?
      • can we use an AMO on bandit data by somehow:
        • filling in missing data
        • overcoming bias
      • want:
        • optimal regret
        • time/space bounds poly(log|Π|)

  31. Key Question
      • still: AMO is a natural primitive
      • key question: can we solve the contextual bandits problem given access to AMO?
      • can we use an AMO on bandit data by somehow:
        • filling in missing data
        • overcoming bias
      • want:
        • optimal regret
        • time/space bounds poly(log|Π|)
      • AMO is theoretical idealization
        • captures structure in policy space
        • in practice, can use off-the-shelf classification algorithm

  32. ε-Greedy / Epoch-Greedy   [Langford & Zhang]
      • partially solved by the ε-greedy/epoch-greedy algorithm
      • on each round, choose action:
        • according to "best" policy so far (with probability 1 − ε)
        • uniformly at random (with probability ε)

  33. ε-Greedy / Epoch-Greedy   [Langford & Zhang]
      • partially solved by the ε-greedy/epoch-greedy algorithm
      • on each round, choose action:
        • according to "best" policy so far (with probability 1 − ε)   [can find with AMO]
        • uniformly at random (with probability ε)

  34. ε-Greedy / Epoch-Greedy   [Langford & Zhang]
      • partially solved by the ε-greedy/epoch-greedy algorithm
      • on each round, choose action:
        • according to "best" policy so far (with probability 1 − ε)   [can find with AMO]
        • uniformly at random (with probability ε)
      • regret = O((K ln|Π| / T)^{1/3})
      • fast and simple, but not optimal
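A Python sketch of ε-greedy built on the AMO (illustrative, reusing the brute-force amo above; epoch-greedy additionally re-solves the oracle only once per epoch instead of every round). The recorded propensities make the fed-back reward vectors importance-weighted, anticipating the de-biasing trick discussed later.

    import random

    def epsilon_greedy(amo, policies, environment, T, K, epsilon=0.1):
        history, total = [], 0.0                      # history of (context, r_hat vector) pairs
        for t in range(T):
            x = environment.context(t)
            p = [1.0 / K] * K                         # first round: purely uniform
            if history:
                greedy = amo(policies, history)(x)    # action of empirically best policy so far
                p = [epsilon / K] * K
                p[greedy] += 1.0 - epsilon            # greedy w.p. 1 - eps, uniform w.p. eps
            a = random.choices(range(K), weights=p)[0]
            reward = environment.pull(t, a)
            total += reward
            r_hat = [0.0] * K
            r_hat[a] = reward / p[a]                  # inverse-propensity estimate fed to the AMO
            history.append((x, r_hat))
        return total / T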

  35. "Monster" Algorithm   [Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]
      • RandomizedUCB (aka "Monster") algorithm gets optimal regret using AMO
      • solves multiple optimization problems using ellipsoid algorithm
      • very slow: calls AMO about Õ(T^4) times on every round

  36. Main Result
      • new, simple algorithm for contextual bandits with AMO access
      • (nearly) optimal regret:  Õ(√(K ln|Π| / T))
      • fast: calls AMO far less than once per round!
      • on average, calls AMO  Õ(√(K / (T ln|Π|))) ≪ 1  times per round

  37. Main Result
      • new, simple algorithm for contextual bandits with AMO access
      • (nearly) optimal regret:  Õ(√(K ln|Π| / T))
      • fast: calls AMO far less than once per round!
      • on average, calls AMO  Õ(√(K / (T ln|Π|))) ≪ 1  times per round
      • rest of talk: sketching main ideas of the algorithm

  38. De-biasing Biased Estimates
      • selection bias is major problem:
        • only observe reward for single action
        • exploring while exploiting leads to inherently biased estimates

  39. De-biasing Biased Estimates
      • selection bias is major problem:
        • only observe reward for single action
        • exploring while exploiting leads to inherently biased estimates
      • nevertheless: can use simple trick to get unbiased estimates for all actions

  40. De-biasing Biased Estimates (cont.)
      • say  r(a) = (unknown) reward for action a
             p(a) = (known) probability of choosing a

  41. De-biasing Biased Estimates (cont.)
      • say  r(a) = (unknown) reward for action a
             p(a) = (known) probability of choosing a
      • define  r̂(a) = r(a)/p(a) if a chosen, 0 else
      • then  E[r̂(a)] = r(a)

  42. De-biasing Biased Estimates (cont.)
      • say  r(a) = (unknown) reward for action a
             p(a) = (known) probability of choosing a
      • define  r̂(a) = r(a)/p(a) if a chosen, 0 else
      • then  E[r̂(a)] = r(a) — unbiased!
      ∴ can estimate reward for all actions

  43. De-biasing Biased Estimates (cont.)
      • say  r(a) = (unknown) reward for action a
             p(a) = (known) probability of choosing a
      • define  r̂(a) = r(a)/p(a) if a chosen, 0 else
      • then  E[r̂(a)] = r(a) — unbiased!
      ∴ can estimate reward for all actions
      ∴ can estimate expected reward for any policy π:
            R̂(π) = (1/(t−1)) Σ_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]

  44. De-biasing Biased Estimates (cont.)
      • say  r(a) = (unknown) reward for action a
             p(a) = (known) probability of choosing a
      • define  r̂(a) = r(a)/p(a) if a chosen, 0 else
      • then  E[r̂(a)] = r(a) — unbiased!
      ∴ can estimate reward for all actions
      ∴ can estimate expected reward for any policy π:
            R̂(π) = (1/(t−1)) Σ_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]
      ∴ can estimate regret of any policy π:
            Regret̂(π) = max_{π̂ ∈ Π} R̂(π̂) − R̂(π)
      • can find maximizing π̂ using AMO
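In Python, the chain of estimates above might look like this (a sketch with my own names; history is the list of past (context, importance-weighted reward vector) pairs, and amo is the oracle from before).

    def ips_reward_vector(a_chosen, reward, p_chosen, K):
        # r_hat(a) = r(a)/p(a) if a was chosen, 0 otherwise; unbiased for every action
        r_hat = [0.0] * K
        r_hat[a_chosen] = reward / p_chosen
        return r_hat

    def estimated_reward(pi, history):
        # R_hat(pi): average importance-weighted reward of the actions pi would have taken
        return sum(r_hat[pi(x)] for x, r_hat in history) / max(1, len(history))

    def estimated_regret(pi, policies, history, amo):
        # Regret_hat(pi): gap to the empirically best policy, found with one AMO call
        best = amo(policies, history)
        return estimated_reward(best, history) - estimated_reward(pi, history)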

  45. Variance Control
      • estimates are unbiased — done?

  46. Variance Control
      • estimates are unbiased — done?
      • no! — variance may be extremely large

  47. Variance Control
      • estimates are unbiased — done?
      • no! — variance may be extremely large
      • can show  variance(r̂(a)) ≤ 1/p(a)

  48. Variance Control
      • estimates are unbiased — done?
      • no! — variance may be extremely large
      • can show  variance(r̂(a)) ≤ 1/p(a)
      ∴ to get good estimates, must ensure that 1/p(a) not too large
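A quick Monte Carlo sanity check of both points, in Python (the numbers are arbitrary): the estimate is unbiased, but its variance is roughly r(a)²(1 − p(a))/p(a), which blows up as p(a) shrinks and is bounded by 1/p(a).

    import random

    def check_ips(r_a=0.7, p_a=0.05, n=200_000):
        # simulate the importance-weighted estimate of a single action's reward
        samples = [(r_a / p_a) if random.random() < p_a else 0.0 for _ in range(n)]
        mean = sum(samples) / n                           # ~ 0.7  (unbiased)
        var = sum((s - mean) ** 2 for s in samples) / n   # ~ 0.49 * 0.95 / 0.05 = 9.31 <= 1/p_a = 20
        return mean, var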

  49. Randomizing over Policies
      • need to choose actions (semi-)randomly

  50. Randomizing over Policies
      • need to choose actions (semi-)randomly
      • approach: on each round,
        • compute distribution Q over policy space Π
        • randomly pick π ∼ Q
        • on current context x, choose action π(x)

  51. Randomizing over Policies
      • need to choose actions (semi-)randomly
      • approach: on each round,
        • compute distribution Q over policy space Π
        • randomly pick π ∼ Q
        • on current context x, choose action π(x)
      • Q induces distribution over actions (for any x):
            Q(a|x) = Pr_{π ∼ Q}[π(x) = a]

  52. Randomizing over Policies
      • need to choose actions (semi-)randomly
      • approach: on each round,
        • compute distribution Q over policy space Π
        • randomly pick π ∼ Q
        • on current context x, choose action π(x)
      • Q induces distribution over actions (for any x):
            Q(a|x) = Pr_{π ∼ Q}[π(x) = a]
      • seems will require time/space O(|Π|) to compute Q over space Π
        • will see later how to avoid!
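A Python sketch of the induced distribution, representing Q as a list of (policy, weight) pairs (an illustrative choice); with such a sparse representation the loop only pays for the policies Q actually puts weight on, which hints at how the O(|Π|) cost can be avoided.

    def induced_action_distribution(Q, x, K):
        # Q(a|x) = Pr_{pi ~ Q}[ pi(x) = a ], for Q given as (policy, weight) pairs
        p = [0.0] * K
        for pi, weight in Q:
            p[pi(x)] += weight
        return p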

  53. How to Pick Q
      • on each round, want to pick Q with:
        1. low (estimated) regret
           i.e., choose actions we think will give high reward

  54. How to Pick Q
      • on each round, want to pick Q with:
        1. low (estimated) regret
           i.e., choose actions we think will give high reward
        2. low (estimated) variance
           i.e., ensure future estimates will be accurate

  55. How to Pick Q
      • on each round, want to pick Q with:
        1. low (estimated) regret   [exploit]
           i.e., choose actions we think will give high reward
        2. low (estimated) variance   [explore]
           i.e., ensure future estimates will be accurate

  56. Low Regret
      • Regret̂(π) = estimated regret of π

  57. Low Regret
      • Regret̂(π) = estimated regret of π
      • so: estimated regret for random π ∼ Q is
            Σ_π Q(π) Regret̂(π) = E_{π ∼ Q}[Regret̂(π)]

  58. Low Regret
      • Regret̂(π) = estimated regret of π
      • so: estimated regret for random π ∼ Q is
            Σ_π Q(π) Regret̂(π) = E_{π ∼ Q}[Regret̂(π)]
      • want small:
            Σ_π Q(π) Regret̂(π) ≤ [small]

  59. Low Variance
      • 1/Q(a|x) = variance of estimate of reward for action a

  60. Low Variance
      • 1/Q(a|x) = variance of estimate of reward for action a
      • so 1/Q(π(x)|x) = variance if action chosen by π

  61. Low Variance
      • 1/Q(a|x) = variance of estimate of reward for action a
      • so 1/Q(π(x)|x) = variance if action chosen by π
      • can estimate expected variance for actions chosen by π:
            V̂_Q(π) = (1/(t−1)) Σ_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ) = Ê[1/Q(π(x)|x)]

  62. Low Variance
      • 1/Q(a|x) = variance of estimate of reward for action a
      • so 1/Q(π(x)|x) = variance if action chosen by π
      • can estimate expected variance for actions chosen by π:
            V̂_Q(π) = (1/(t−1)) Σ_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ) = Ê[1/Q(π(x)|x)]
      • want small:  V̂_Q(π) ≤ [small]  for all π ∈ Π

  63. Low Variance
      • 1/Q(a|x) = variance of estimate of reward for action a
      • so 1/Q(π(x)|x) = variance if action chosen by π
      • can estimate expected variance for actions chosen by π:
            V̂_Q(π) = (1/(t−1)) Σ_{τ=1}^{t−1} 1/Q(π(x_τ)|x_τ) = Ê[1/Q(π(x)|x)]
      • want small:  V̂_Q(π) ≤ [small]  for all π ∈ Π
      • detail: problematic if Q(a|x) too close to zero

  64. Low Variance
      • 1/Q^µ(a|x) = variance of estimate of reward for action a
      • so 1/Q^µ(π(x)|x) = variance if action chosen by π
      • can estimate expected variance for actions chosen by π:
            V̂_Q(π) = (1/(t−1)) Σ_{τ=1}^{t−1} 1/Q^µ(π(x_τ)|x_τ) = Ê[1/Q^µ(π(x)|x)]
      • want small:  V̂_Q(π) ≤ [small]  for all π ∈ Π
      • detail: problematic if Q(a|x) too close to zero
      • to avoid, "smooth" probabilities by occasionally picking action uniformly at random:
            Q^µ(a|x) = (1 − Kµ) Q(a|x) + µ
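A small Python sketch of the smoothed probabilities and the variance proxy V̂_Q(π), again with Q represented as (policy, weight) pairs (illustrative names and representation).

    def smoothed_prob(Q, x, a, K, mu):
        # Q^mu(a|x) = (1 - K*mu) Q(a|x) + mu, so it is never below mu
        q_ax = sum(weight for pi, weight in Q if pi(x) == a)
        return (1.0 - K * mu) * q_ax + mu

    def estimated_variance(pi, Q, contexts, K, mu):
        # V_hat_Q(pi): average of 1 / Q^mu(pi(x)|x) over past contexts; at most 1/mu
        return sum(1.0 / smoothed_prob(Q, x, pi(x), K, mu) for x in contexts) / max(1, len(contexts))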

  65. Pulling Together
      • want Q such that:
            Σ_π Q(π) Regret̂(π) ≤ [small]
            V̂_Q(π) ≤ [small]    for all π ∈ Π

  66. Pulling Together
      • want Q such that:
            Σ_π Q(π) Regret̂(π) ≤ [small]
            V̂_Q(π) ≤ [small]    for all π ∈ Π
            Σ_π Q(π) = 1

  67. Pulling Together
      • want Q such that:
            Σ_π Q(π) Regret̂(π) ≤ C_0
            C_1 · V̂_Q(π) ≤ C_0    for all π ∈ Π
            Σ_π Q(π) = 1
      • can fill in constants

  68. Pulling Together
      • want Q such that:
            Σ_π Q(π) Regret̂(π) ≤ C_0
            C_1 · V̂_Q(π) ≤ C_0 + Regret̂(π)    for all π ∈ Π
            Σ_π Q(π) = 1
      • can fill in constants
      • make easier by:
        • allowing higher variance for policies with higher regret
          (poor policies can be eliminated even with fairly poor performance estimates)
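To make the regret and variance constraints concrete, here is a Python sketch that merely checks feasibility of a candidate Q, reusing estimated_variance from the previous sketch and an est_regret function like the one built earlier from the AMO; the actual algorithm constructs such a Q (and avoids the explicit loop over all policies), so this is illustration only.

    def satisfies_op(Q, policies, contexts, est_regret, K, mu, C0, C1):
        # exploit constraint: low estimated regret on average over Q
        if sum(weight * est_regret(pi) for pi, weight in Q) > C0:
            return False
        # explore constraints: variance cap for every policy, loosened by its estimated regret
        for pi in policies:
            if C1 * estimated_variance(pi, Q, contexts, K, mu) > C0 + est_regret(pi):
                return False
        return True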
