  1. Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control. Prashanth L.A., joint work with Cheng Jie, Michael Fu, Steve Marcus and Csaba Szepesvári. University of Maryland, College Park.

  2. AI that benefits humans. A reinforcement learning (RL) setting with rewards evaluated by humans; cumulative prospect theory (CPT) captures human preferences. [Diagram: the Agent acts on the World; the reward signal is filtered through CPT before reaching the Agent.]

  3. CPT-value. For a given r.v. X, the CPT-value C(X) is

  C(X) := ∫_0^∞ w⁺(P(u⁺(X) > z)) dz − ∫_0^∞ w⁻(P(u⁻(X) > z)) dz,

  where the first integral covers gains and the second losses. Utility functions u⁺, u⁻ : ℝ → ℝ⁺ with u⁺(x) = 0 when x ≤ 0 and u⁻(x) = 0 when x ≥ 0. Weight functions w⁺, w⁻ : [0, 1] → [0, 1] with w(0) = 0, w(1) = 1.

  4. CPT-value: connection to expected value. With identity utilities and weights (u⁺(x) = (x)⁺, u⁻(x) = (x)⁻, w(p) = p), the definition above reduces to

  C(X) = ∫_0^∞ P(X > z) dz − ∫_0^∞ P(−X > z) dz = E[(X)⁺] − E[(X)⁻] = E[X],

  where (a)⁺ = max(a, 0) and (a)⁻ = max(−a, 0).
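To make the definition concrete, here is a minimal numerical sketch (not from the talk) that evaluates the two integrals for a Gaussian X with the identity utilities above, and checks that identity weights recover E[X]; all names and parameter values are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Illustrative choice: X ~ N(mu, sigma) with u+(X) = (X)+ and u-(X) = (X)-.
mu, sigma = 1.0, 2.0

def cpt_value(w_plus, w_minus):
    # For z >= 0: P(u+(X) > z) = P(X > z) and P(u-(X) > z) = P(-X > z).
    gains, _ = quad(lambda z: w_plus(norm.sf(z, mu, sigma)), 0, np.inf)
    losses, _ = quad(lambda z: w_minus(norm.cdf(-z, mu, sigma)), 0, np.inf)
    return gains - losses

identity = lambda p: p
print(cpt_value(identity, identity))   # ~1.0, i.e. E[X] = mu

# The Kahneman-Tversky weight from the next slide distorts the value:
w_kt = lambda p: p**0.69 / (p**0.69 + (1 - p)**0.69) ** (1 / 0.69)
print(cpt_value(w_kt, w_kt))           # != E[X] in general
```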

  5. Utility and weight functions. The weight function w(p) = p^0.69 / (p^0.69 + (1 − p)^0.69)^(1/0.69) overweights low probabilities and underweights high probabilities. For losses, the disutility −u⁻ is convex; for gains, the utility u⁺ is concave. [Figure: weight function w(p) vs. probability p on [0, 1], bowing above the diagonal near 0 and below it near 1; S-shaped utility curve with concave u⁺ for gains and convex −u⁻ for losses.]
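A quick numeric check of the distortion, assuming the 0.69 exponent quoted on the slide:

```python
# Low probabilities are overweighted, high probabilities underweighted.
w = lambda p: p**0.69 / (p**0.69 + (1 - p)**0.69) ** (1 / 0.69)
print(w(0.01))   # ~0.04 > 0.01
print(w(0.99))   # ~0.94 < 0.99
```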

  6. Prospect Theory (Amos Tversky, Daniel Kahneman). Kahneman & Tversky's (1979) paper "Prospect Theory: An Analysis of Decision under Risk" is the second most cited paper in economics over the period 1975–2000.

  7. Our Contributions
  • CPT-value estimation using empirical distribution functions
  • sample complexity bounds for estimation
  • SPSA-based policy gradient algorithm + asymptotic convergence of policy gradient
  • traffic signal control application

  Objective: with C(X^θ) := ∫_0^∞ w⁺(P(u⁺(X^θ) > z)) dz − ∫_0^∞ w⁻(P(u⁻(X^θ) > z)) dz, find θ* = arg max_{θ ∈ Θ} C(X^θ).

  8. CPT-value estimation. Problem: given samples X₁, …, Xₙ of X, estimate C(X) := ∫_0^∞ w⁺(P(u⁺(X) > z)) dz − ∫_0^∞ w⁻(P(u⁻(X) > z)) dz. Nice to have: sample complexity O(1/ε²) for accuracy ε.

  9. Empirical distribution functions (EDFs). Given samples X₁, …, Xₙ of X,

  F̂ₙ⁺(x) = (1/n) ∑_{i=1}^n 1(u⁺(Xᵢ) ≤ x),   F̂ₙ⁻(x) = (1/n) ∑_{i=1}^n 1(u⁻(Xᵢ) ≤ x).

  Using the EDFs, the CPT-value C(X) is estimated by

  C̄ₙ = ∫_0^∞ w⁺(1 − F̂ₙ⁺(x)) dx − ∫_0^∞ w⁻(1 − F̂ₙ⁻(x)) dx,
        [Part (I)]                [Part (II)]

  10. Computing Part (I): let X_[1], X_[2], …, X_[n] denote the order statistics of the sample. Then

  Part (I) = ∑_{i=1}^n u⁺(X_[i]) ( w⁺((n + 1 − i)/n) − w⁺((n − i)/n) ),

  and Part (II) is computed analogously with u⁻ and w⁻.
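The order-statistics form translates directly into code. Below is a sketch of the estimator under the slide's assumptions; the function names and the Kahneman-Tversky-style parameter values (0.88, 2.25, 0.69) are illustrative, not the authors' implementation:

```python
import numpy as np

# The helper computes the integral of w(1 - EDF(y)) over [0, inf) for a
# nonnegative sample y via the order-statistics identity on the slide;
# it gives Part (I) for y = u+(X) and Part (II) for y = u-(X).
def _distorted_mean(y, w):
    y = np.sort(y)                  # order statistics y_[1] <= ... <= y_[n]
    n = len(y)
    i = np.arange(1, n + 1)
    return np.sum(y * (w((n + 1 - i) / n) - w((n - i) / n)))

def cpt_estimate(x, u_plus, u_minus, w_plus, w_minus):
    return _distorted_mean(u_plus(x), w_plus) - _distorted_mean(u_minus(x), w_minus)

# Illustrative Kahneman-Tversky-style choices (their published estimates,
# used here only as an example):
u_plus = lambda x: np.maximum(x, 0.0) ** 0.88
u_minus = lambda x: 2.25 * np.maximum(-x, 0.0) ** 0.88   # loss aversion
w = lambda p: p**0.69 / (p**0.69 + (1 - p)**0.69) ** (1 / 0.69)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=10_000)
print(cpt_estimate(x, u_plus, u_minus, w, w))
```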

  11. Sample complexity.
  (A1) The weights w⁺, w⁻ are Hölder continuous, i.e., |w⁺(x) − w⁺(y)| ≤ H|x − y|^α for all x, y ∈ [0, 1].
  (A2) The utilities u⁺(X) and u⁻(X) are bounded above by M < ∞.

  Under (A1) and (A2), for any ε, δ > 0,

  P( |C̄ₙ − C(X)| ≤ ε ) > 1 − δ,   ∀ n ≥ ln(1/δ) · 4H²M² / ε^(2/α).

  Special case: Lipschitz weights (α = 1) give sample complexity O(1/ε²) for accuracy ε.
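Plugging numbers into the bound gives a feel for the constants; H, M, and α here are placeholders for whatever the chosen weight and utility functions yield:

```python
import math

def n_required(eps, delta, H, M, alpha):
    # n >= ln(1/delta) * 4 H^2 M^2 / eps^(2/alpha) from the slide's bound
    return math.log(1.0 / delta) * 4.0 * H**2 * M**2 / eps ** (2.0 / alpha)

# Lipschitz weights (alpha = 1): O(1/eps^2) samples.
print(n_required(eps=0.1, delta=0.05, H=1.0, M=1.0, alpha=1.0))   # ~1198
```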

  13. CPT-value optimization (control). Find θ* = arg max_{θ ∈ Θ} C(X^θ). RL application: θ = policy parameter, X^θ = return. Two-stage solution (gradient ascent):
  • inner stage (prediction): obtain samples of X^θ and estimate C(X^θ);
  • outer stage (control): update θ using ∇ᵢC(X^θ).
  Challenge: ∇ᵢC(X^θ) is not given.

  14. Update rule:

  θⁱ_{n+1} = Γᵢ( θⁱₙ + γₙ ∇̂ᵢC(X^{θₙ}) ),   i = 1, …, d,

  where Γ is a projection operator and γₙ are step sizes. Challenge: estimating ∇ᵢC(X^θ) given only biased estimates of C(X^θ). Solution: use SPSA [Spall '92] with the gradient estimate

  ∇̂ᵢC(X^θ) = ( C̄^{θₙ + δₙΔₙ} − C̄^{θₙ − δₙΔₙ} ) / (2δₙΔⁱₙ),

  where Δₙ is a vector of independent Rademacher r.v.s and δₙ > 0 vanishes asymptotically.
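A minimal sketch of the resulting SPSA loop, assuming a `cpt_at(theta)` routine that simulates returns at θ and runs the CPT estimator; the step-size and perturbation schedules are illustrative, and `np.clip` stands in for the projection Γ:

```python
import numpy as np

def spsa_cpt(cpt_at, theta0, n_iters=200, gamma0=0.1, delta0=1.0,
             lo=-10.0, hi=10.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iters + 1):
        gamma_n = gamma0 / n              # step sizes gamma_n
        delta_n = delta0 / n ** 0.101     # perturbations delta_n -> 0
        Delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher
        g_hat = (cpt_at(theta + delta_n * Delta)
                 - cpt_at(theta - delta_n * Delta)) / (2.0 * delta_n * Delta)
        theta = np.clip(theta + gamma_n * g_hat, lo, hi)    # projection Gamma
    return theta
```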

  15. Figure 1: overall flow of CPT-SPSA. Prediction: the CPT estimator returns a measurement with controlled bias (C̄ ≈ C(X) + ε), playing the role that a zero-mean noisy oracle f(x) + ξ plays in classical simulation optimization. Control (gradient ascent): perturb θₙ ± δₙΔₙ, draw mₙ samples at each perturbation to form C̄^{θₙ + δₙΔₙ} and C̄^{θₙ − δₙΔₙ}, then update θₙ → θₙ₊₁. Key question: how to choose mₙ so that the estimation bias can be ignored? Ensure 1/(δₙ mₙ^{α/2}) → 0.
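A small illustration of this condition under assumed schedules (δₙ = n^(−0.101), mₙ = n, α = 1):

```python
# The estimator's bias is O(m_n^(-alpha/2)), so SPSA can ignore it when
# 1 / (delta_n * m_n**(alpha / 2)) -> 0; these schedules are illustrative.
alpha = 1.0
for n in [10, 100, 1000, 10000]:
    delta_n = n ** -0.101
    m_n = n
    print(n, 1.0 / (delta_n * m_n ** (alpha / 2)))   # decreases toward 0
```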

  16. Application: traffic signal control.
  • For any path i = 1, …, M, let Xⁱ be the delay gain, calculated with a pre-timed traffic light controller as reference.
  • CPT captures the road users' evaluation of the delay gain Xⁱ.
  • Goal: maximize CPT(X¹, …, X^M) = ∑_{i=1}^M μᵢ C(Xⁱ), where μᵢ is the proportion of traffic on path i.
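A sketch of this weighted objective, reusing `cpt_estimate` and the utility/weight functions from the estimation sketch above; the traffic proportions and simulated delay gains are purely illustrative:

```python
import numpy as np

def traffic_cpt(delay_gains, mu):
    """CPT(X^1, ..., X^M) = sum_i mu_i * C(X^i), estimated from samples."""
    return sum(m * cpt_estimate(x, u_plus, u_minus, w, w)
               for m, x in zip(mu, delay_gains))

# Hypothetical delay-gain samples for M = 3 paths:
rng = np.random.default_rng(1)
paths = [rng.normal(loc, 5.0, size=5000) for loc in (2.0, -1.0, 0.5)]
print(traffic_cpt(paths, mu=[0.5, 0.3, 0.2]))
```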

  17. Figure 2: histograms of the CPT-value of the delay gain under three algorithms: (a) AVG-SPSA uses plain sample means (no utility/weight functions), (b) EUT-SPSA uses utilities but no weights, and (c) CPT-SPSA uses both. [Histogram bin frequencies omitted.]

  18. Conclusions
  • CPT: a very popular paradigm for modeling human decisions
  • Want AI to be beneficial to humans
  • We lay the foundations for using CPT in an RL setting
  • Prediction: sample means (TD) won't work, but empirical distributions do!
  • Control: no Bellman equation, but SPSA can be employed
  Future directions:
  • Crowdsourcing experiment to validate CPT online
  • Robustness to unknown utility and weight function parameters

  20. Thanks! Questions?
