Policy Gradients for CVaR-Constrained MDPs
Prashanth L.A.
INRIA Lille – Team SequeL
Motivation

"Risk is like fire: if controlled it will help you; if uncontrolled it will rise up and destroy you." — Theodore Roosevelt

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair." — Douglas Adams
Risk-Sensitive Sequential Decision-Making

Risk-neutral objective:
\[
\min_{\theta \in \Theta} G^{\theta}(s^0) = \mathbb{E}\left[\, \sum_{m=0}^{\tau-1} g(s_m, a_m) \,\middle|\, s_0 = s^0, \theta \,\right]
\]
where $G^{\theta}(s^0)$ is the expected total cost, $g(s_m, a_m)$ the single-stage cost, and $\theta$ parameterizes the policy.

Risk-sensitive alternative: a criterion that penalizes the variability induced by a given policy, i.e., minimize some measure of risk in addition to optimizing the usual criterion.
A brief history of risk measures

Risk measures considered in the literature:
- expected exponential utility (Howard & Matheson, 1972)
- variance-related measures (Sobel, 1982; Filar et al., 1989)
- percentile performance (Filar et al., 1995)

Open question: construct conceptually meaningful and computationally tractable risk-sensitive criteria. The literature so far contains mainly negative results (e.g., Sobel, 1982; Filar et al., 1989; Mannor & Tsitsiklis, 2011).
Conditional Value-at-Risk (CVaR)

For a random variable $X$ and confidence level $\alpha$:
\[
\text{VaR}_{\alpha}(X) := \inf\{\xi \mid P(X \le \xi) \ge \alpha\}, \qquad
\text{CVaR}_{\alpha}(X) := \mathbb{E}\left[X \mid X \ge \text{VaR}_{\alpha}(X)\right].
\]

Unlike VaR, CVaR is a coherent risk measure, i.e., convex, monotone, positive homogeneous, and translation equi-variant.
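As a concrete illustration (not part of the original slides), the sketch below estimates both quantities from i.i.d. loss samples: VaR is the empirical $\alpha$-quantile, and CVaR averages the tail at or beyond it. The helper name and the synthetic lognormal data are illustrative choices.

```python
import numpy as np

def empirical_var_cvar(losses, alpha):
    """Estimate VaR_alpha and CVaR_alpha from i.i.d. loss samples."""
    losses = np.sort(np.asarray(losses))
    var = np.quantile(losses, alpha)   # inf{xi : P(X <= xi) >= alpha}
    tail = losses[losses >= var]       # samples at or above VaR
    cvar = tail.mean()                 # E[X | X >= VaR_alpha(X)]
    return var, cvar

# Illustrative use on heavy-tailed synthetic losses.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
var95, cvar95 = empirical_var_cvar(samples, alpha=0.95)
print(f"VaR_0.95 ~ {var95:.2f}, CVaR_0.95 ~ {cvar95:.2f}")  # CVaR >= VaR
```

On these samples CVaR exceeds VaR, reflecting that CVaR accounts for how bad the tail losses are, not just where the tail begins.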
Practical Motivation: Portfolio Re-allocation

[Figure: simplex over Stock 1, Stock 2, and Stock 3, marking the current and target asset allocations]

- Portfolio composed of assets (e.g., stocks)
- Stochastic gains for buying/selling assets
- Aim: find an investment strategy that achieves a targeted asset allocation

A risk-averse investor would prefer a strategy that
1. quickly achieves the target asset allocation;
2. minimizes the worst-case losses incurred.
Our Contributions

- define a CVaR-constrained stochastic shortest path (SSP) problem
- derive CVaR estimation procedures using stochastic approximation
- propose policy gradient algorithms to optimize the CVaR-constrained SSP
- establish the asymptotic convergence of the algorithms
- adapt the proposed algorithms to incorporate importance sampling (IS)
CVaR-Constrained SSP
Stochastic Shortest Path

States: $S = \{0, 1, \ldots, r\}$
Actions: $A(s)$ = feasible actions in state $s$
Costs: $g(s, a)$, used in the objective, and $c(s, a)$, used in the constraint
CVaR-Constrained SSP

Minimize the total cost
\[
G^{\theta}(s^0) := \mathbb{E}\left[\, \sum_{m=0}^{\tau-1} g(s_m, a_m) \,\middle|\, s_0 = s^0 \,\right]
\]
subject to a CVaR constraint on the total constraint cost
\[
\text{CVaR}_{\alpha}\left( C^{\theta}(s^0) \right), \qquad C^{\theta}(s^0) := \sum_{m=0}^{\tau-1} c(s_m, a_m) \;\text{ with } s_0 = s^0.
\]
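To make the two quantities concrete, here is a Monte Carlo sketch, not taken from the slides: it simulates SSP episodes under a fixed policy and collects the per-episode cost sums that define $G^{\theta}(s^0)$ and $C^{\theta}(s^0)$. The `env`/`policy` interface is hypothetical.

```python
import numpy as np

def rollout_costs(env, policy, n_episodes, rng):
    """Simulate SSP episodes; return per-episode totals of both costs.

    Assumed (hypothetical) interface: env.reset() -> state;
    env.step(a) -> (state, g_cost, c_cost, done); policy(state, rng) -> action.
    """
    G, C = [], []
    for _ in range(n_episodes):
        s, g_tot, c_tot, done = env.reset(), 0.0, 0.0, False
        while not done:          # episode ends at the absorbing state (time tau)
            a = policy(s, rng)
            s, g, c, done = env.step(a)
            g_tot += g           # accumulates sum of g(s_m, a_m)
            c_tot += c           # accumulates sum of c(s_m, a_m)
        G.append(g_tot)
        C.append(c_tot)
    return np.array(G), np.array(C)

# G.mean() estimates the objective G^theta(s0); applying the earlier
# empirical_var_cvar helper to the C samples estimates CVaR_alpha(C^theta(s0)).
```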
Lagrangian Relaxation

The constrained problem
\[
\min_{\theta} G^{\theta}(s^0) \quad \text{s.t.} \quad \text{CVaR}_{\alpha}\left(C^{\theta}(s^0)\right) \le K_\alpha
\]
is relaxed to the saddle-point problem
\[
\max_{\lambda} \min_{\theta} \; L_{\theta, \lambda}(s^0) := G^{\theta}(s^0) + \lambda \left( \text{CVaR}_{\alpha}\left(C^{\theta}(s^0)\right) - K_\alpha \right).
\]
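One plausible way to attack such a saddle-point problem, sketched below under stated assumptions rather than as the paper's algorithm, is simultaneous descent in $\theta$ and ascent in $\lambda$ with noisy gradient estimates; `grad_L_theta` and `grad_L_lambda` are hypothetical estimator callbacks (e.g., built from simulated episodes).

```python
import numpy as np

def saddle_point_updates(grad_L_theta, grad_L_lambda, theta0, lam0,
                         n_iters, a_step, b_step):
    """Schematic descent-ascent on the Lagrangian L_{theta,lambda}.

    grad_L_theta / grad_L_lambda are assumed noisy gradient estimators.
    Descend in theta (inner min), ascend in lambda (outer max), with
    lambda projected to stay nonnegative as a Lagrange multiplier must.
    """
    theta, lam = np.asarray(theta0, dtype=float), float(lam0)
    for n in range(1, n_iters + 1):
        theta = theta - a_step(n) * grad_L_theta(theta, lam)             # min over theta
        lam = max(0.0, lam + b_step(n) * grad_L_lambda(theta, lam))      # max over lambda
    return theta, lam

# A common two-timescale choice puts theta on the faster schedule, e.g.
#   a_step = lambda n: 1.0 / n**0.55   (slower decay -> faster timescale)
#   b_step = lambda n: 1.0 / n
# so that theta nearly converges between successive lambda updates.
```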