
Weighted bandits or: How bandits learn distorted values that are not expected
Prashanth L.A., joint work with Aditya Gopalan, Michael Fu and Steve Marcus
University of Maryland, College Park and Indian Institute of Science


  1. Weighted bandits or: How bandits learn distorted values that are not expected
  Prashanth L.A.*, joint work with Aditya Gopalan†, Michael Fu* and Steve Marcus*
  * University of Maryland, College Park   † Indian Institute of Science

  2. Going to office, bandit style
  On every day:
  1. pick a route to the office
  2. reach the office and record the (suffered) delay

  3. Why not distort?
  Delays are stochastic. In choosing between routes, humans *need not* minimize expected delay.

  4. Why not distort?
  Two-route scenario 1: the average delay of Route 2 is slightly above that of Route 1, but Route 2 has a *small* chance of *very* low delay. I might prefer Route 2.
  Two-route scenario 2: the average delay of Route 2 is slightly below that of Route 1, but Route 2 has a *small* chance of *very* high delay, e.g. jammed traffic. I might prefer Route 1.


  6. What we do
  Multi-armed bandits + probability distortion.
  Rank-dependent expected utility: Quiggin (1982) [1]. Cumulative prospect theory: Tversky & Kahneman (1992).
  [Figure: weight w(p) = p^{0.69} / (p^{0.69} + (1 − p)^{0.69})^{1/0.69} plotted against probability p on [0, 1]; small probabilities are overweighted.]
  The weight-distorted value μ_k for any arm k ∈ {1, …, K} is
  μ_k = ∫_0^∞ w(P[Y_k > z]) dz − ∫_0^∞ w(P[−Y_k > z]) dz,
  where Y_k is the r.v. corresponding to stochastic costs from arm k, and the weight function w : [0, 1] → [0, 1] satisfies w(0) = 0 and w(1) = 1.
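The distorted-value formula above can be made concrete with a short sketch (not from the slides; function names are hypothetical, and costs are assumed nonnegative, so the second integral vanishes). For a discrete delay distribution the tail P[Y > z] is piecewise constant, so the integral reduces to a finite sum:

```python
import numpy as np

def w_cpt(p, eta=0.69):
    """Tversky-Kahneman probability weight; overweights small probabilities."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

def distorted_value_discrete(values, probs, w=w_cpt):
    """mu = integral_0^inf w(P[Y > z]) dz for a discrete nonnegative cost Y.
    The tail P[Y > z] is constant between consecutive support points, so the
    integral is a finite sum over the gaps between sorted values."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    tails = 1.0 - np.cumsum(p)                     # P[Y > v_i] after each jump
    gaps = np.diff(np.concatenate(([0.0], v)))     # widths of the constant pieces
    heights = np.concatenate(([1.0], tails[:-1]))  # tail value on each piece
    return float(np.sum(gaps * w(heights)))
```

For instance, a route with delay 10 w.p. 0.99 and 100 w.p. 0.01 has mean delay 10.9, below a constant delay of 11, yet its distorted value is about 13.6, above 11: exactly the preference reversal of two-route scenario 2.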

  7. 1-slide summary
  K-armed bandits:
  • Upper Confidence Bound (UCB) + distortions
  • Sublinear regret O(n^{(2−α)/2}), where α ∈ (0, 1] is the Hölder exponent of w
  • Matching lower bound
  Linearly parameterized bandits:
  • Optimism in the Face of Uncertainty Linear (OFUL) + arm-dependent noise model
  • Regret O(d √n polylog(n)) for sub-Gaussian cost distributions
  Application: traveler's route choice
  • optimize the route choice of a human traveler using the GLD traffic simulator
  • implement vanilla OFUL and weight-distorted OFUL (WOFUL)
  • exhibit a qualitative difference between WOFUL and OFUL routes

  8. Outline: K-armed bandits, Linear bandits, Routing application

  9. Bandit model
  Known: # of arms K and horizon n.
  Unknown: distributions F_k, k = 1, …, K, with distorted values μ_1, …, μ_K.
  Interaction: in each round m = 1, …, n,
  • pull arm I_m ∈ {1, …, K}
  • observe a sample cost from F_{I_m}
  Benchmark: μ* = min_{k ∈ {1,…,K}} μ_k, where μ_k := ∫_0^∞ w(1 − F_k(z)) dz.
  Regret: R_n = Σ_{k=1}^K T_k(n) μ_k − n μ* = Σ_{k=1}^K T_k(n) Δ_k,
  where T_k(n) is the # of times arm k is pulled up to time n and Δ_k = μ_k − μ* is the gap.
  Goal: minimize the expected regret E[R_n] = Σ_{k=1}^K E[T_k(n)] Δ_k.
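As a minimal illustration (hypothetical helper, not from the slides), the regret decomposition R_n = Σ_k T_k(n) Δ_k can be computed directly from pull counts once the arms' distorted values are known:

```python
import numpy as np

def empirical_regret(pull_counts, distorted_values):
    """R_n = sum_k T_k(n) mu_k - n mu_star = sum_k T_k(n) Delta_k."""
    T = np.asarray(pull_counts, dtype=float)
    mu = np.asarray(distorted_values, dtype=float)
    gaps = mu - mu.min()   # Delta_k = mu_k - mu_star (costs: min is best)
    return float(T @ gaps)
```

For example, pulling an arm with gap Δ = 1 five times contributes regret 5, while pulls of the best arm contribute nothing.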

  10. UCB values
  Running example: at each round t, select a tap; optimize the quality of the n selected beers.
  UCB(k) = μ̂_k − σ̂_k
  • μ̂_k: mean-reward estimate
  • σ̂_k: confidence width
  [1] Auer et al. (2002) Finite-time analysis of the multiarmed bandit problem. In: MLJ.

  11. Assumptions and Weighted UCB
  (A1) The weight w is Hölder continuous with constant L and exponent α ∈ (0, 1].
  (A2) The arms' costs are bounded by M > 0 a.s.
  Weighted UCB:
  Pull each arm once.
  For each round m = 1, 2, … do
    For each arm k = 1, …, K do
      compute an estimate μ̂_{k, T_k(m−1)} of the weight-distorted value μ_k
      UCB index: UCB(k, m) = μ̂_{k, T_k(m−1)} − LM (3 log m / (2 T_k(m−1)))^{α/2}
    Pull arm I_m = arg min_{k ∈ {1,…,K}} UCB(k, m).
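A runnable sketch of the Weighted UCB loop, assuming (A1)-(A2); the sampler interface, the defaults, and all function names are hypothetical. The index subtracts the width LM (3 log m / (2 T_k(m−1)))^{α/2} from a distorted-value estimate built from order statistics, and pulls the arg min:

```python
import numpy as np

def wucb(arms_sampler, n, K, L=1.0, M=1.0, alpha=1.0, weight=None, rng=None):
    """One run of Weighted UCB for costs (lower index = more attractive).
    arms_sampler(k, rng) -> one cost sample from arm k; weight: w on [0, 1]."""
    rng = np.random.default_rng(rng)
    if weight is None:
        weight = lambda p: p   # identity weight recovers plain UCB on means
    samples = [[] for _ in range(K)]

    def distorted_estimate(ys):
        ys = np.sort(np.asarray(ys, dtype=float))  # order stats Y_[1] <= ... <= Y_[j]
        j = len(ys)
        q = np.arange(j, 0, -1) / j                # (j + 1 - i)/j for i = 1..j
        return float(np.sum(ys * (weight(q) - weight(q - 1.0 / j))))

    for k in range(K):                             # pull each arm once
        samples[k].append(arms_sampler(k, rng))
    for m in range(K + 1, n + 1):
        idx = [distorted_estimate(samples[k])
               - L * M * (3.0 * np.log(m) / (2.0 * len(samples[k]))) ** (alpha / 2.0)
               for k in range(K)]
        k_star = int(np.argmin(idx))               # optimism for costs
        samples[k_star].append(arms_sampler(k_star, rng))
    return [len(s) for s in samples]
```

With two deterministic arms of cost 0 and 1 and the identity weight, the cheaper arm collects almost all pulls, as expected.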

  12. Weight-distorted value estimation
  Problem: estimate the weight-distorted value μ_k = ∫_0^∞ w(1 − F_k(z)) dz for some k ∈ {1, …, K}.
  Input: samples Y_{k,1}, …, Y_{k,j} from distribution F_k.
  Solution: μ̂_{k,j} := Σ_{i=1}^j Y_{[k,i]} ( w((j + 1 − i)/j) − w((j − i)/j) ),
  where Y_{[k,1]} ≤ … ≤ Y_{[k,j]} are the order statistics.
  Interpretation: μ̂_{k,j} = ∫_0^∞ w(1 − F̂_{k,j}(z)) dz, where F̂_{k,j}(x) := (1/j) Σ_{i=1}^j I[Y_{k,i} ≤ x] is the empirical distribution function for arm k.
  Sample complexity: under (A1) and (A2), ∀ ε > 0 and any k ∈ {1, …, K}, we have
  P( |μ̂_{k,j} − μ_k| > ε ) ≤ 2 exp( −2j (ε/LM)^{2/α} ).
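The interpretation above can be checked numerically with a short sketch (hypothetical function names): the order-statistics estimator coincides with ∫_0^∞ w(1 − F̂_{k,j}(z)) dz, here approximated on a grid, so agreement holds up to discretization error:

```python
import numpy as np

def w_cpt(p, eta=0.69):
    """Tversky-Kahneman probability weight from the slides."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

def l_estimator(ys, w):
    """mu_hat_{k,j} = sum_i Y_[k,i] (w((j+1-i)/j) - w((j-i)/j)),
    with Y_[k,1] <= ... <= Y_[k,j] the order statistics."""
    ys = np.sort(np.asarray(ys, dtype=float))
    j = len(ys)
    q = np.arange(j, 0, -1) / j          # (j + 1 - i)/j for i = 1, ..., j
    return float(np.sum(ys * (w(q) - w(q - 1.0 / j))))

def distorted_from_ecdf(ys, w, grid=4000):
    """Same quantity via the interpretation: integral of w(1 - F_hat(z)) dz,
    with F_hat the empirical distribution function (trapezoid rule on a grid)."""
    ys = np.asarray(ys, dtype=float)
    zs = np.linspace(0.0, ys.max(), grid)
    tail = np.array([(ys > z).mean() for z in zs])   # 1 - F_hat(z)
    wt = w(tail)
    return float(np.sum(0.5 * (wt[1:] + wt[:-1]) * np.diff(zs)))
```

With the identity weight the estimator reduces exactly to the sample mean, which is a quick sanity check of the coefficients.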

  13. Weight-distorted value estimation (contd.)
  Setting ε = LM (3 log m / (2j))^{α/2} in the concentration bound above shows that, w.h.p. (at least 1 − 2m^{−3}), μ_k lies in
  [ μ̂_{k,j} − LM (3 log m / (2j))^{α/2}, μ̂_{k,j} + LM (3 log m / (2j))^{α/2} ].

  14. How I learn to stop regretting...
  Upper bound (gap-dependent):
  E[R_n] ≤ Σ_{k: Δ_k > 0} 3 (2LM)^{2/α} log n / Δ_k^{2/α − 1} + MK (1 + 2π²/3).
  Upper bound (gap-independent):
  E[R_n] ≤ M K^{α/2} ( 2 (2L)^{2/α} · 3 log n )^{α/2} n^{(2−α)/2} + c.
  For α < 1, the bound above is worse than the usual UCB upper bound of O(√n).

  15. How I learn to stop regretting... (contd.)
  Lower bound: for any sub-polynomial regret algorithm, ∃ a stochastic environment and a Hölder weight w such that
  E[R_n] = Ω( Σ_{k: Δ_k > 0} (LM)^{2/α} log n / (4 Δ_k^{2/α − 1}) ).
  Notation: f(n) = Ω(g(n)) ⇔ f(n) ≥ c g(n) for some positive c and all n > n_0.

  16. Outline: K-armed bandits, Linear bandits, Routing application

  17. Linear bandit model
  Large set of arms: x_i ∈ R^d, i = 1, …, K, with K ≫ 1.
  Unknown parameter θ ∈ R^d.
  Gaussian noise: N_m := (N_m^1, …, N_m^d) is a random vector of i.i.d. standard Gaussian r.v.s.
  Interaction: choose x_{I_m} and observe the cost c_m := x_{I_m}^T (θ + N_m).
  Linearity ⇒ estimating θ is enough; no need to estimate the mean reward of all arms. Estimate θ using ridge regression.
  (Analogy: optimize the beer you drink; choose x_{I_m} before you get drunk.)


  19. Arm-dependent noise model
  Noise model: c_m := x_{I_m}^T (θ + N_m) for any I_m ∈ {1, …, K}.
  Previous linear bandit algorithms, e.g. OFUL [1], instead assume c_m := x_{I_m}^T θ + ξ_m, where ξ_m is standard Gaussian.
  Routing example: [figure: road network with nodes 1–12 between src and dst]
  Route: x is a collection of edges, encoded by a vector of 0-1 values.
  Dimension: d = # of lanes.
  Edge weight: for any edge j, θ_j specifies the edge delay.
  [1] Abbasi-Yadkori et al. (2011) Improved algorithms for linear stochastic bandits. In NIPS.
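A tiny hypothetical instance of this encoding (four edges and invented delays, not from the slides): a route is a 0-1 incidence vector x over edges, θ_j is edge j's delay, and the observed cost is x^T(θ + N_m):

```python
import numpy as np

# Hypothetical 4-edge network; a route is a 0-1 incidence vector over edges.
theta = np.array([2.0, 1.0, 3.0, 0.5])    # theta_j = mean delay of edge j
route_a = np.array([1.0, 1.0, 0.0, 0.0])  # uses edges 0 and 1
route_b = np.array([0.0, 0.0, 1.0, 1.0])  # uses edges 2 and 3

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)            # N_m: i.i.d. standard Gaussians per edge
cost_a = route_a @ (theta + noise)        # arm-dependent noise: x^T (theta + N_m)
```

Note that the noise entering the cost depends on which edges the chosen route uses, which is exactly what the standard OFUL noise model ξ_m does not capture.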

  20. WOFUL algorithm
  Initialization: A_1 = λ I_{d×d}, b_1 = 0, θ̂_1 = 0.
  For each round m = 1, 2, … do
    Confidence ellipsoid: C_m := { θ ∈ R^d : ‖θ − θ̂_m‖_{A_m} ≤ D_m },
    where D_m := ( 2 log( det(A_m)^{1/2} λ^{−d/2} / δ ) )^{1/2} + β √λ
    • ensures θ lies in C_m with high probability
    Arm selection + feedback: let (x_m, θ̃_m) := arg min_{(x, θ′) ∈ X × C_m} μ̃_x(θ′); choose arm x_m and observe the cost c_m.
    • OFUL's choice, optimizing x^T θ within the ellipsoid, won't work with probability distortions
    Update statistics (ridge regression): A_{m+1} = A_m + x_m x_m^T / ‖x_m‖², b_{m+1} = b_m + c_m x_m / ‖x_m‖, θ̂_{m+1} = A_{m+1}^{−1} b_{m+1}.
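The statistics update in the loop above can be sketched as follows (ridge-regression step only; the confidence ellipsoid and the joint arg min over X × C_m are omitted, and the function name is hypothetical):

```python
import numpy as np

def ridge_update(A, b, x, c):
    """One statistics update with normalized features, as on the slide:
    A <- A + x x^T / ||x||^2,  b <- b + c x / ||x||,  theta_hat = A^{-1} b."""
    x = np.asarray(x, dtype=float)
    nrm = np.linalg.norm(x)
    A = A + np.outer(x, x) / nrm**2
    b = b + c * x / nrm
    return A, b, np.linalg.solve(A, b)
```

With A initialized to λ I_{d×d} this is ridge regression on norm-scaled features; for unit-norm arms and noiseless costs c = x^T θ, the estimate θ̂ converges to θ as λ is dominated by the accumulated data.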

