Learning with Context and Policies

• goal: learn through experimentation to do (almost) as well as the best π ∈ Π
• policies may be very complex and expressive ⇒ powerful approach
• challenges:
  • Π is extremely large
  • need to be learning about all policies simultaneously while also performing as well as the best
  • when an action is selected, only observe the reward for policies that would have chosen the same action
  • exploration versus exploitation on a gigantic scale!
Formal Model (revisited)

• repeat:
    1a. learner observes context x_t
    1b. reward vector r_t ∈ [0,1]^K chosen (but not observed)
    2.  learner selects action a_t ∈ {1, ..., K}
    3.  learner receives observed reward r_t(a_t)
• goal: want high total (or average) reward relative to the best policy π ∈ Π
• i.e., want small regret:

      max_{π ∈ Π} (1/T) ∑_{t=1}^{T} r_t(π(x_t))  −  (1/T) ∑_{t=1}^{T} r_t(a_t)

      (best policy's average reward)               (learner's average reward)
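To make the protocol concrete, here is a minimal Python sketch of the interaction loop and the regret comparison; the environment object `env` with `get_context()`/`get_rewards()`, the `learner` interface, and the explicit `policies` list are hypothetical stand-ins introduced only for illustration.

```python
import numpy as np

def run_contextual_bandit(env, learner, policies, T, K):
    """Simulate the repeat loop above and measure average regret vs. the best policy in Pi."""
    learner_total = 0.0
    policy_total = np.zeros(len(policies))
    for t in range(T):
        x_t = env.get_context()              # 1a. learner observes context x_t
        r_t = env.get_rewards()              # 1b. reward vector in [0,1]^K (hidden from learner)
        a_t = learner.act(x_t)               # 2.  learner selects action a_t in {0,...,K-1}
        learner.update(x_t, a_t, r_t[a_t])   # 3.  only r_t(a_t) is revealed
        learner_total += r_t[a_t]
        for i, pi in enumerate(policies):    # bookkeeping only; the learner never sees this
            policy_total[i] += r_t[pi(x_t)]
    return policy_total.max() / T - learner_total / T   # regret
```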
An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]

• Exp4 solves this problem
• maintains weights over all policies in Π
• regret is essentially optimal: O(√(K ln|Π| / T))
• even works for adversarial (i.e., non-random, non-iid) data
• but: time/space are linear in |Π|
  • too slow if |Π| is gigantic
• seems hopeless to do better for fully general policy spaces
• this talk: aim for time/space only poly(log|Π|) when Π is "well structured"
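For reference, a rough sketch of the exponential-weights update behind Exp4, simplified to deterministic policies and a fixed exploration rate; this is a reading of the algorithm for illustration, not the authors' exact pseudocode, and it reuses the hypothetical `env` interface from the sketch above. Note that every round touches every policy, which is exactly the O(|Π|) time/space issue.

```python
import numpy as np

def exp4(env, policies, T, K, gamma=0.1):
    """Sketch of Exp4: exponential weights over an explicit list of policies."""
    w = np.ones(len(policies))
    for t in range(T):
        x = env.get_context()
        advice = np.array([pi(x) for pi in policies])      # each policy's recommended action
        probs = np.full(K, gamma / K)                      # uniform exploration term
        for i, a in enumerate(advice):
            probs[a] += (1 - gamma) * w[i] / w.sum()       # weighted vote of the policies
        a_t = np.random.choice(K, p=probs)
        r = env.get_rewards()[a_t]                         # only the chosen action's reward is used
        r_hat = np.zeros(K)
        r_hat[a_t] = r / probs[a_t]                        # importance-weighted reward estimate
        w *= np.exp(gamma * r_hat[advice] / K)             # boost policies that chose well
    return w
```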
The (Fantasy) Full-Information Setting

• suppose we could see the rewards for all actions:

      Context              Action 1   Action 2   Action 3
      (Male, 50, ...)        1.0        0.2        0.0
      (Female, 18, ...)      1.0        0.0        1.0
      (Female, 48, ...)      0.5        0.1        0.7
      ...

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0 + 1.0 + 0.5 + ···

• for any π, can compute the rewards it would have received
• the average is a good estimate of π's expected reward
• choose the empirically best π ∈ Π
• regret = O(√(ln|Π| / T))
“Arg-Max Oracle” (AMO)

• to apply this idea, just need an "oracle" (algorithm/subroutine) for finding the best π ∈ Π on observed rewards
• input: (x_1, r_1), ..., (x_T, r_T)
      x_t = context
      r_t = (r_t(1), ..., r_t(K)) = rewards for all actions
• output:

      π̂ = arg max_{π ∈ Π} ∑_{t=1}^{T} r_t(π(x_t))

• really just (cost-sensitive) classification:
      context ↔ example
      action  ↔ label/class
      policy  ↔ classifier
      reward  ↔ gain / (negative) cost
• so: if we have a "good" classification algorithm for Π, we can use it to find a good policy (in the full-information setting)
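To make the interface concrete, here is a brute-force oracle for a small, explicit policy class; the function name and argument layout are illustrative. In practice the loop over Π is replaced by a cost-sensitive classification algorithm, which is the whole point of the reduction.

```python
import numpy as np

def argmax_oracle(policies, data):
    """Brute-force AMO: given full-information data [(x_1, r_1), ..., (x_T, r_T)],
    with each r_t a length-K reward vector, return the policy with highest total reward."""
    best_pi, best_total = None, -np.inf
    for pi in policies:
        total = sum(r[pi(x)] for x, r in data)
        if total > best_total:
            best_pi, best_total = pi, total
    return best_pi
```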
But in the Bandit Setting...

• ...only see rewards for the actions taken (the other entries in the table below are never observed):

      Context              Action 1   Action 2   Action 3
      (Male, 50, ...)        1.0        0.2        0.0
      (Female, 18, ...)      1.0        0.0        1.0
      (Female, 48, ...)      0.5        0.1        0.7
      ...

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0?? + 1.0 + 0.5?? + ···

• for any policy π, only observe π's rewards on the subset of rounds where it agrees with the learner's action
• might like to use the AMO to find an empirically good policy
• problems:
  • only see some rewards
  • observed rewards are highly biased (due to the skewed choice of actions)
Key Question

• still: the AMO is a natural primitive
• key question: can we solve the contextual bandits problem given access to an AMO?
• can we use an AMO on bandit data by somehow:
  • filling in missing data
  • overcoming bias
• want:
  • optimal regret
  • time/space bounds poly(log|Π|)
• the AMO is a theoretical idealization
  • captures structure in the policy space
  • in practice, can use an off-the-shelf classification algorithm
ε-Greedy / Epoch-Greedy   [Langford & Zhang]

• partially solved by the ε-greedy / epoch-greedy algorithm
• on each round, choose an action:
  • according to the "best" policy so far (with probability 1 − ε)   [can find with AMO]
  • uniformly at random (with probability ε)
• regret = O((K ln|Π| / T)^{1/3})
• fast and simple, but not optimal
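A sketch of how such an algorithm can be wired to the oracle, assuming exploration rounds are turned into full-information examples by importance weighting (each observed reward is scaled by K, the inverse of the uniform probability 1/K); this is one simple variant for illustration, not the exact epoch-greedy algorithm, and it reuses the brute-force `argmax_oracle` above.

```python
import numpy as np

def epsilon_greedy(env, policies, T, K, eps=0.05):
    """Sketch of epsilon-greedy with an arg-max oracle over exploration data."""
    explore_data = []
    best_pi = policies[0]                       # arbitrary initial policy
    for t in range(T):
        x = env.get_context()
        r_full = env.get_rewards()              # hidden; only one entry is "observed" below
        if np.random.rand() < eps:
            a = np.random.randint(K)            # explore: uniform random action
            r_hat = np.zeros(K)
            r_hat[a] = r_full[a] * K            # importance weight 1/(1/K) = K
            explore_data.append((x, r_hat))
            best_pi = argmax_oracle(policies, explore_data)   # refit on exploration rounds
        else:
            a = best_pi(x)                      # exploit the current best policy
            _ = r_full[a]                       # reward collected; not used for learning here
    return best_pi
```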
“Monster” Algorithm   [Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]

• the RandomizedUCB (aka “Monster”) algorithm gets optimal regret using the AMO
• solves multiple optimization problems using the ellipsoid algorithm
• very slow: calls the AMO about Õ(T^4) times on every round
Main Result

• new, simple algorithm for contextual bandits with AMO access
• (nearly) optimal regret: Õ(√(K ln|Π| / T))
• fast: calls the AMO far less than once per round!
  • on average, calls the AMO Õ(√(K / (T ln|Π|))) ≪ 1 times per round
• rest of talk: sketching the main ideas of the algorithm
De-biasing Biased Estimates

• selection bias is a major problem:
  • only observe the reward for a single action
  • exploring while exploiting leads to inherently biased estimates
• nevertheless: can use a simple trick to get unbiased estimates for all actions
De-biasing Biased Estimates (cont.)

• say
      r(a) = (unknown) reward for action a
      p(a) = (known) probability of choosing a
• define
      r̂(a) = r(a)/p(a)   if a chosen
            = 0           otherwise
• then E[r̂(a)] = r(a) — unbiased!
  ∴ can estimate the reward for all actions
  ∴ can estimate the expected reward of any policy π:

      R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]

  ∴ can estimate the regret of any policy π:

      Regret̂(π) = max_{π̂ ∈ Π} R̂(π̂) − R̂(π)

  • can find the maximizing π̂ using the AMO
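The trick is short enough to write out; a minimal sketch, with `ips_reward_estimate` and `estimated_policy_value` as illustrative helper names. Feeding these estimates to the AMO yields the maximizing π̂ and hence the estimated regret of any policy.

```python
import numpy as np

def ips_reward_estimate(a_chosen, r_observed, p_chosen, K):
    """One round's inverse-propensity estimate of the full reward vector:
    r_hat(a) = r(a)/p(a) if a is the chosen action, else 0.
    Unbiased because E[r_hat(a)] = p(a) * r(a)/p(a) = r(a)."""
    r_hat = np.zeros(K)
    r_hat[a_chosen] = r_observed / p_chosen
    return r_hat

def estimated_policy_value(pi, history):
    """R_hat(pi): average the estimates along the actions pi would have taken;
    `history` is a list of (x_tau, r_hat_tau) pairs from past rounds."""
    return float(np.mean([r_hat[pi(x)] for x, r_hat in history]))
```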
Variance Control

• estimates are unbiased — done?
• no! — the variance may be extremely large
• can show: variance(r̂(a)) ≤ 1/p(a)
  ∴ to get good estimates, must ensure that 1/p(a) is not too large
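A quick check of that bound: r̂(a) equals r(a)/p(a) with probability p(a) and 0 otherwise, so

      E[r̂(a)²] = p(a) · r(a)²/p(a)² = r(a)²/p(a)
      variance(r̂(a)) = r(a)²/p(a) − r(a)² ≤ 1/p(a),   since r(a) ∈ [0, 1]

so the estimate blows up exactly when the chosen action had small probability.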
Randomizing over Policies

• need to choose actions (semi-)randomly
• approach: on each round,
  • compute a distribution Q over the policy space Π
  • randomly pick π ∼ Q
  • on the current context x, choose action π(x)
• Q induces a distribution over actions (for any x):

      Q(a|x) = Pr_{π∼Q}[π(x) = a]

• seems to require time/space O(|Π|) to compute Q over the space Π
  • will see later how to avoid this!
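Concretely, for a small explicit policy list the induced action distribution is just a weighted count; representing Q as a list of weights parallel to `policies` is an assumption of this sketch.

```python
import numpy as np

def action_distribution(Q, policies, x, K):
    """Q(a|x) = total weight of policies pi with pi(x) = a.
    The explicit loop over the policy list is the O(|Pi|) cost noted above."""
    q = np.zeros(K)
    for weight, pi in zip(Q, policies):
        q[pi(x)] += weight
    return q
```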
How to Pick Q

• on each round, want to pick Q with:
  1. low (estimated) regret   [exploit]
     i.e., choose actions we think will give high reward
  2. low (estimated) variance   [explore]
     i.e., ensure future estimates will be accurate
Low Regret

• Regret̂(π) = estimated regret of π
• so: the estimated regret of a random π ∼ Q is

      ∑_π Q(π) · Regret̂(π) = E_{π∼Q}[Regret̂(π)]

• want this small:

      ∑_π Q(π) · Regret̂(π) ≤ [small]
Low Variance

• 1/Q^μ(a|x) = variance of the estimate of the reward for action a
• so 1/Q^μ(π(x)|x) = variance if the action is chosen by π
• can estimate the expected variance over actions chosen by π:

      V̂_Q(π) = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q^μ(π(x_τ)|x_τ) = Ê[1/Q^μ(π(x)|x)]

• want small: V̂_Q(π) ≤ [small] for all π ∈ Π
• detail: problematic if Q(a|x) is too close to zero
  • to avoid, "smooth" the probabilities by occasionally picking an action uniformly at random:

      Q^μ(a|x) = (1 − Kμ) · Q(a|x) + μ
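A sketch of the smoothing step and of the empirical variance quantity, reusing `action_distribution` from the earlier sketch; the function names and argument layout are illustrative only.

```python
import numpy as np

def smoothed_action_probs(q, mu):
    """Q^mu(a|x) = (1 - K*mu) * Q(a|x) + mu: mix with the uniform distribution
    so every action has probability at least mu."""
    K = len(q)
    return (1 - K * mu) * q + mu

def estimated_variance(pi, past_contexts, Q, policies, K, mu):
    """V_hat_Q(pi): average over past contexts of 1 / Q^mu(pi(x)|x)."""
    vals = [1.0 / smoothed_action_probs(action_distribution(Q, policies, x, K), mu)[pi(x)]
            for x in past_contexts]
    return float(np.mean(vals))
```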
Pulling Together

• want Q such that:

      ∑_π Q(π) · Regret̂(π) ≤ C_0
      C_1 · V̂_Q(π) ≤ C_0 + Regret̂(π)   for all π ∈ Π
      ∑_π Q(π) = 1

• can fill in the constants
• make the problem easier by:
  • allowing higher variance for policies with higher regret
    (poor policies can be eliminated even with fairly poor performance estimates)
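Written as a feasibility check over a small explicit Π, the two constraint families read as follows; `est_regret`, `est_var`, and the constants `C0`, `C1` stand in for the quantities the analysis pins down, and the actual algorithm never enumerates Π like this.

```python
def satisfies_constraints(Q, est_regret, est_var, C0, C1):
    """Check a candidate distribution Q (weights over a small policy list) against:
      - low estimated regret:    sum_pi Q(pi) * Regret_hat(pi) <= C0
      - low estimated variance:  C1 * V_hat_Q(pi) <= C0 + Regret_hat(pi) for all pi
    est_regret[i] and est_var[i] are precomputed for the i-th policy."""
    if sum(q * reg for q, reg in zip(Q, est_regret)) > C0:
        return False
    return all(C1 * v <= C0 + reg for v, reg in zip(est_var, est_regret))
```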