Learning with Context and Policies

• goal: learn through experimentation to do (almost) as well as the best π ∈ Π
• policies may be very complex and expressive ⇒ powerful approach
• challenges:
  • Π is extremely large
  • need to be learning about all policies simultaneously while also performing as well as the best
  • when an action is selected, only observe the reward for policies that would have chosen the same action
  • exploration versus exploitation on a gigantic scale!
Formal Model (revisited)

• repeat:
    1a. learner observes context x_t
    1b. reward vector r_t ∈ [0,1]^K chosen (but not observed)
    2.  learner selects action a_t ∈ {1, ..., K}
    3.  learner receives observed reward r_t(a_t)
• goal: want high total (or average) reward relative to the best policy π ∈ Π
• i.e., want small regret:

      max_{π ∈ Π} (1/T) ∑_{t=1}^{T} r_t(π(x_t))  −  (1/T) ∑_{t=1}^{T} r_t(a_t)

      (best policy's average reward)               (learner's average reward)
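To make the protocol concrete, here is a minimal Python sketch of the interaction loop and the regret comparison; the environment object `env` with `get_context()`/`get_rewards()`, the `learner` interface, and the explicit `policies` list are hypothetical stand-ins introduced only for illustration.

```python
import numpy as np

def run_contextual_bandit(env, learner, policies, T, K):
    """Simulate the repeat loop above and measure average regret vs. the best policy in Pi."""
    learner_total = 0.0
    policy_total = np.zeros(len(policies))
    for t in range(T):
        x_t = env.get_context()              # 1a. learner observes context x_t
        r_t = env.get_rewards()              # 1b. reward vector in [0,1]^K (hidden from learner)
        a_t = learner.act(x_t)               # 2.  learner selects action a_t in {0,...,K-1}
        learner.update(x_t, a_t, r_t[a_t])   # 3.  only r_t(a_t) is revealed
        learner_total += r_t[a_t]
        for i, pi in enumerate(policies):    # bookkeeping only; the learner never sees this
            policy_total[i] += r_t[pi(x_t)]
    return policy_total.max() / T - learner_total / T   # regret
```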
An Algorithm that Solves this Problem   [Auer, Cesa-Bianchi, Freund, Schapire]

• Exp4 solves this problem
• maintains weights over all policies in Π
• regret is essentially optimal: O(√(K ln|Π| / T))
• even works for adversarial (i.e., non-random, non-iid) data
• but: time/space are linear in |Π|
  • too slow if |Π| is gigantic
• seems hopeless to do better for fully general policy spaces
• this talk: aim for time/space only poly(log|Π|) when Π is "well structured"
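For reference, a rough sketch of the exponential-weights update behind Exp4, simplified to deterministic policies and a fixed exploration rate; this is a reading of the algorithm for illustration, not the authors' exact pseudocode, and it reuses the hypothetical `env` interface from the sketch above. Note that every round touches every policy, which is exactly the O(|Π|) time/space issue.

```python
import numpy as np

def exp4(env, policies, T, K, gamma=0.1):
    """Sketch of Exp4: exponential weights over an explicit list of policies."""
    w = np.ones(len(policies))
    for t in range(T):
        x = env.get_context()
        advice = np.array([pi(x) for pi in policies])      # each policy's recommended action
        probs = np.full(K, gamma / K)                      # uniform exploration term
        for i, a in enumerate(advice):
            probs[a] += (1 - gamma) * w[i] / w.sum()       # weighted vote of the policies
        a_t = np.random.choice(K, p=probs)
        r = env.get_rewards()[a_t]                         # only the chosen action's reward is used
        r_hat = np.zeros(K)
        r_hat[a_t] = r / probs[a_t]                        # importance-weighted reward estimate
        w *= np.exp(gamma * r_hat[advice] / K)             # boost policies that chose well
    return w
```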
The (Fantasy) Full-Information Setting

• suppose we could see the rewards for all actions:

      Context              Action 1   Action 2   Action 3
      (Male, 50, ...)        1.0        0.2        0.0
      (Female, 18, ...)      1.0        0.0        1.0
      (Female, 48, ...)      0.5        0.1        0.7
      ...

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0 + 1.0 + 0.5 + ···

• for any π, can compute the rewards it would have received
• the average is a good estimate of π's expected reward
• choose the empirically best π ∈ Π
• regret = O(√(ln|Π| / T))
“Arg-Max Oracle” (AMO)

• to apply this idea, just need an "oracle" (algorithm/subroutine) for finding the best π ∈ Π on observed rewards
• input: (x_1, r_1), ..., (x_T, r_T)
      x_t = context
      r_t = (r_t(1), ..., r_t(K)) = rewards for all actions
• output:

      π̂ = arg max_{π ∈ Π} ∑_{t=1}^{T} r_t(π(x_t))

• really just (cost-sensitive) classification:
      context ↔ example
      action  ↔ label/class
      policy  ↔ classifier
      reward  ↔ gain / (negative) cost
• so: if we have a "good" classification algorithm for Π, we can use it to find a good policy (in the full-information setting)
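To make the interface concrete, here is a brute-force oracle for a small, explicit policy class; the function name and argument layout are illustrative. In practice the loop over Π is replaced by a cost-sensitive classification algorithm, which is the whole point of the reduction.

```python
import numpy as np

def argmax_oracle(policies, data):
    """Brute-force AMO: given full-information data [(x_1, r_1), ..., (x_T, r_T)],
    with each r_t a length-K reward vector, return the policy with highest total reward."""
    best_pi, best_total = None, -np.inf
    for pi in policies:
        total = sum(r[pi(x)] for x, r in data)
        if total > best_total:
            best_pi, best_total = pi, total
    return best_pi
```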
But in the Bandit Setting...

• ...only see rewards for the actions taken (the other entries in the table below are never observed):

      Context              Action 1   Action 2   Action 3
      (Male, 50, ...)        1.0        0.2        0.0
      (Female, 18, ...)      1.0        0.0        1.0
      (Female, 48, ...)      0.5        0.1        0.7
      ...

      learner's total reward = 0.2 + 1.0 + 0.1 + ···
      π's total reward       = 0.0?? + 1.0 + 0.5?? + ···

• for any policy π, only observe π's rewards on the subset of rounds where it agrees with the learner's action
• might like to use the AMO to find an empirically good policy
• problems:
  • only see some rewards
  • observed rewards are highly biased (due to the skewed choice of actions)
Key Question

• still: the AMO is a natural primitive
• key question: can we solve the contextual bandits problem given access to an AMO?
• can we use an AMO on bandit data by somehow:
  • filling in missing data
  • overcoming bias
• want:
  • optimal regret
  • time/space bounds poly(log|Π|)
• the AMO is a theoretical idealization
  • captures structure in the policy space
  • in practice, can use an off-the-shelf classification algorithm
ε-Greedy / Epoch-Greedy   [Langford & Zhang]

• partially solved by the ε-greedy / epoch-greedy algorithm
• on each round, choose an action:
  • according to the "best" policy so far (with probability 1 − ε)   [can find with AMO]
  • uniformly at random (with probability ε)
• regret = O((K ln|Π| / T)^{1/3})
• fast and simple, but not optimal
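A sketch of how such an algorithm can be wired to the oracle, assuming exploration rounds are turned into full-information examples by importance weighting (each observed reward is scaled by K, the inverse of the uniform probability 1/K); this is one simple variant for illustration, not the exact epoch-greedy algorithm, and it reuses the brute-force `argmax_oracle` above.

```python
import numpy as np

def epsilon_greedy(env, policies, T, K, eps=0.05):
    """Sketch of epsilon-greedy with an arg-max oracle over exploration data."""
    explore_data = []
    best_pi = policies[0]                       # arbitrary initial policy
    for t in range(T):
        x = env.get_context()
        r_full = env.get_rewards()              # hidden; only one entry is "observed" below
        if np.random.rand() < eps:
            a = np.random.randint(K)            # explore: uniform random action
            r_hat = np.zeros(K)
            r_hat[a] = r_full[a] * K            # importance weight 1/(1/K) = K
            explore_data.append((x, r_hat))
            best_pi = argmax_oracle(policies, explore_data)   # refit on exploration rounds
        else:
            a = best_pi(x)                      # exploit the current best policy
            _ = r_full[a]                       # reward collected; not used for learning here
    return best_pi
```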
“Monster” Algorithm   [Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin & Zhang]

• the RandomizedUCB (aka “Monster”) algorithm gets optimal regret using the AMO
• solves multiple optimization problems using the ellipsoid algorithm
• very slow: calls the AMO about Õ(T^4) times on every round
Main Result

• new, simple algorithm for contextual bandits with AMO access
• (nearly) optimal regret: Õ(√(K ln|Π| / T))
• fast: calls the AMO far less than once per round!
  • on average, calls the AMO Õ(√(K / (T ln|Π|))) ≪ 1 times per round
• rest of talk: sketching the main ideas of the algorithm
De-biasing Biased Estimates

• selection bias is a major problem:
  • only observe the reward for a single action
  • exploring while exploiting leads to inherently biased estimates
• nevertheless: can use a simple trick to get unbiased estimates for all actions
De-biasing Biased Estimates (cont.)

• say
      r(a) = (unknown) reward for action a
      p(a) = (known) probability of choosing a
• define
      r̂(a) = r(a)/p(a)   if a chosen
            = 0           otherwise
• then E[r̂(a)] = r(a) — unbiased!
  ∴ can estimate the reward for all actions
  ∴ can estimate the expected reward of any policy π:

      R̂(π) = (1/(t−1)) ∑_{τ=1}^{t−1} r̂_τ(π(x_τ)) = Ê[r̂(π(x))]

  ∴ can estimate the regret of any policy π:

      Regret̂(π) = max_{π̂ ∈ Π} R̂(π̂) − R̂(π)

  • can find the maximizing π̂ using the AMO
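The trick is short enough to write out; a minimal sketch, with `ips_reward_estimate` and `estimated_policy_value` as illustrative helper names. Feeding these estimates to the AMO yields the maximizing π̂ and hence the estimated regret of any policy.

```python
import numpy as np

def ips_reward_estimate(a_chosen, r_observed, p_chosen, K):
    """One round's inverse-propensity estimate of the full reward vector:
    r_hat(a) = r(a)/p(a) if a is the chosen action, else 0.
    Unbiased because E[r_hat(a)] = p(a) * r(a)/p(a) = r(a)."""
    r_hat = np.zeros(K)
    r_hat[a_chosen] = r_observed / p_chosen
    return r_hat

def estimated_policy_value(pi, history):
    """R_hat(pi): average the estimates along the actions pi would have taken;
    `history` is a list of (x_tau, r_hat_tau) pairs from past rounds."""
    return float(np.mean([r_hat[pi(x)] for x, r_hat in history]))
```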
Variance Control

• estimates are unbiased — done?
• no! — the variance may be extremely large
• can show: variance(r̂(a)) ≤ 1/p(a)
  ∴ to get good estimates, must ensure that 1/p(a) is not too large
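A quick check of that bound: r̂(a) equals r(a)/p(a) with probability p(a) and 0 otherwise, so

      E[r̂(a)²] = p(a) · r(a)²/p(a)² = r(a)²/p(a)
      variance(r̂(a)) = r(a)²/p(a) − r(a)² ≤ 1/p(a),   since r(a) ∈ [0, 1]

so the estimate blows up exactly when the chosen action had small probability.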
Randomizing over Policies

• need to choose actions (semi-)randomly
• approach: on each round,
  • compute a distribution Q over the policy space Π
  • randomly pick π ∼ Q
  • on the current context x, choose action π(x)
• Q induces a distribution over actions (for any x):

      Q(a|x) = Pr_{π∼Q}[π(x) = a]

• seems to require time/space O(|Π|) to compute Q over the space Π
  • will see later how to avoid this!
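Concretely, for a small explicit policy list the induced action distribution is just a weighted count; representing Q as a list of weights parallel to `policies` is an assumption of this sketch.

```python
import numpy as np

def action_distribution(Q, policies, x, K):
    """Q(a|x) = total weight of policies pi with pi(x) = a.
    The explicit loop over the policy list is the O(|Pi|) cost noted above."""
    q = np.zeros(K)
    for weight, pi in zip(Q, policies):
        q[pi(x)] += weight
    return q
```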
How to Pick Q

• on each round, want to pick Q with:
  1. low (estimated) regret   [exploit]
     i.e., choose actions we think will give high reward
  2. low (estimated) variance   [explore]
     i.e., ensure future estimates will be accurate
Low Regret

• Regret̂(π) = estimated regret of π
• so: the estimated regret of a random π ∼ Q is

      ∑_π Q(π) · Regret̂(π) = E_{π∼Q}[Regret̂(π)]

• want this small:

      ∑_π Q(π) · Regret̂(π) ≤ [small]
Low Variance

• 1/Q^μ(a|x) = variance of the estimate of the reward for action a
• so 1/Q^μ(π(x)|x) = variance if the action is chosen by π
• can estimate the expected variance over actions chosen by π:

      V̂_Q(π) = (1/(t−1)) ∑_{τ=1}^{t−1} 1/Q^μ(π(x_τ)|x_τ) = Ê[1/Q^μ(π(x)|x)]

• want small: V̂_Q(π) ≤ [small] for all π ∈ Π
• detail: problematic if Q(a|x) is too close to zero
  • to avoid, "smooth" the probabilities by occasionally picking an action uniformly at random:

      Q^μ(a|x) = (1 − Kμ) · Q(a|x) + μ
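A sketch of the smoothing step and of the empirical variance quantity, reusing `action_distribution` from the earlier sketch; the function names and argument layout are illustrative only.

```python
import numpy as np

def smoothed_action_probs(q, mu):
    """Q^mu(a|x) = (1 - K*mu) * Q(a|x) + mu: mix with the uniform distribution
    so every action has probability at least mu."""
    K = len(q)
    return (1 - K * mu) * q + mu

def estimated_variance(pi, past_contexts, Q, policies, K, mu):
    """V_hat_Q(pi): average over past contexts of 1 / Q^mu(pi(x)|x)."""
    vals = [1.0 / smoothed_action_probs(action_distribution(Q, policies, x, K), mu)[pi(x)]
            for x in past_contexts]
    return float(np.mean(vals))
```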
Pulling Together

• want Q such that:

      ∑_π Q(π) · Regret̂(π) ≤ C_0
      C_1 · V̂_Q(π) ≤ C_0 + Regret̂(π)   for all π ∈ Π
      ∑_π Q(π) = 1

• can fill in the constants
• make the problem easier by:
  • allowing higher variance for policies with higher regret
    (poor policies can be eliminated even with fairly poor performance estimates)
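Written as a feasibility check over a small explicit Π, the two constraint families read as follows; `est_regret`, `est_var`, and the constants `C0`, `C1` stand in for the quantities the analysis pins down, and the actual algorithm never enumerates Π like this.

```python
def satisfies_constraints(Q, est_regret, est_var, C0, C1):
    """Check a candidate distribution Q (weights over a small policy list) against:
      - low estimated regret:    sum_pi Q(pi) * Regret_hat(pi) <= C0
      - low estimated variance:  C1 * V_hat_Q(pi) <= C0 + Regret_hat(pi) for all pi
    est_regret[i] and est_var[i] are precomputed for the i-th policy."""
    if sum(q * reg for q, reg in zip(Q, est_regret)) > C0:
        return False
    return all(C1 * v <= C0 + reg for v, reg in zip(est_var, est_regret))
```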