Online Control with Adversarial Disturbances
Naman Agarwal, Google AI Princeton
Joint work with Brian Bullins, Elad Hazan, Sham Kakade, Karan Singh
Dynamical Systems with Control
$x_{t+1} = f(x_t, u_t)$, where $x_t$ is the state and $u_t$ is the control
• Robotics
• Autonomous Vehicles
• Data Center Cooling [Cohen et al. '18]
! " : State Our Setting ) " : Control Robustly Control a Noisy Linear Dynamical System • Known Dynamics ! "#$ = &! " + () " + * " • Fully Observable State
! " : State Our Setting ) " : Control Robustly Control a Noisy Linear Dynamical System • Known Dynamics ! "#$ = &! " + () " + * " • Fully Observable State Disturbance + , adversarially chosen ( ||+ , || ≤ / )
! " : State Our Setting ) " : Control Robustly Control a Noisy Linear Dynamical System • Known Dynamics ! "#$ = &! " + () " + * " • Fully Observable State Disturbance 0 1 adversarially chosen ( ||0 1 || ≤ 4 ) • Online and Adversarial Minimize Costs - ∑ , " (! " , ) " ) • General Convex Function
! " : State Our Setting ) " : Control Robustly Control a Noisy Linear Dynamical System • Known Dynamics ! "#$ = &! " + () " + * " • Fully Observable State Disturbance 0 1 adversarially chosen ( ||0 1 || ≤ 4 ) • Online and Adversarial Minimize Costs - ∑ , " (! " , ) " ) • General Convex Function vs. Linear Quadratic Regulator (LQR ): Adversarial vs Random Disturbance Online, Convex Costs vs Known Quadratic Loss
Goal – Minimize Regret
• Fixed time horizon $T$
• Produce actions $u_1, u_2, \ldots, u_T$ to minimize regret with respect to the best policy in hindsight:
$$\sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{K} \sum_{t=1}^{T} c_t\big(x_t(K),\, K x_t(K)\big)$$
• $u_t$ only knows $w_1, \ldots, w_t$, whereas the best linear policy is chosen knowing $w_1, \ldots, w_T$ (linear policies are optimal for LQR)
• Counterfactual regret: the comparator's state $x_t(K)$ depends on $K$
A sketch of how this comparator can be evaluated follows below.
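The comparator in the regret above can be evaluated by replaying the realized disturbances under a fixed linear policy and taking the best $K$ from a candidate set. This is a hedged sketch: the function names and the idea of searching over a finite candidate set are illustrative, not part of the talk (and it uses the $u_t = -K x_t$ sign convention, equivalent to the slide's $K x_t(K)$ up to relabeling $K$).

```python
# Counterfactual comparator: replay the same disturbance sequence w_1..w_T under a
# fixed linear policy, then take the best K in hindsight.
import numpy as np

def policy_cost(K, A, B, disturbances, cost):
    """Counterfactual cost of u_t = -K x_t(K) on the realized disturbance sequence."""
    x, total = np.zeros(A.shape[0]), 0.0
    for w in disturbances:
        u = -(K @ x)
        total += cost(x, u)
        x = A @ x + B @ u + w      # x_{t+1}(K) = (A - B K) x_t(K) + w_t
    return total

def best_in_hindsight(candidates, A, B, disturbances, cost):
    """Best fixed linear policy once the whole disturbance sequence is known."""
    return min(policy_cost(K, A, B, disturbances, cost) for K in candidates)

# Regret = (cost incurred online, where u_t is chosen knowing only w_1..w_t)
#          - best_in_hindsight(...), which sees all of w_1..w_T.
```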
Previous work: $H_\infty$ Control
• A min-max problem against the worst-case perturbation: $\min_{\pi} \max_{w_{1:T}} \sum_t c_t(x_t, u_t)$, with the disturbances $w_{1:T}$ chosen adversarially
• Compute: closed form for quadratic costs, but difficult for general costs
• Adaptivity: $H_\infty$ is pessimistic, whereas regret adapts to a favorable disturbance sequence
Main Result
An efficient online algorithm producing $u_1, \ldots, u_T$ such that
$$\sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{K \in \mathrm{Linear}} \sum_{t=1}^{T} c_t\big(x_t(K),\, K x_t(K)\big) \;\le\; O(\sqrt{T})$$
• Convexity through an improper relaxation
• Efficient → polynomial in the system parameters, logarithmic in $T$
Outline of the approach
1. Improper learning: can we even compute the best-in-hindsight policy? Use a "relaxed" policy class: the next control is a linear function of the previous disturbances $w_t$.
2. Strong stability ⇒ an error-feedback policy: learn the change to the action via a "small horizon" of previous disturbances.
3. Small horizon ⇒ an efficient reduction to Online Convex Optimization (OCO) with memory [Anava et al.]
A simplified code sketch of this disturbance-action policy follows below.
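Below is a simplified sketch, not the paper's exact algorithm, of the disturbance-action policy $u_t = -K x_t + \sum_{i=1}^{H} M_i w_{t-i}$ with the $M_i$ updated by online gradient descent. For brevity the gradient is taken only through $u_t$; the actual reduction to OCO with memory also accounts for how past $M_i$'s affect the current state. All names and hyperparameters ($H$, $\eta$, $Q$, $R$) are assumptions made for illustration.

```python
# Simplified disturbance-action controller: u_t = -K x_t + sum_i M_i w_{t-i},
# with the M_i's updated by online gradient descent (OGD). Not the exact
# algorithm from the paper; the gradient here flows only through u_t.
import numpy as np

def run_disturbance_action_controller(A, B, K, T, H, eta, disturbance, Q, R):
    n, m = B.shape
    M = [np.zeros((m, n)) for _ in range(H)]     # learned disturbance-feedback terms
    past_w = [np.zeros(n) for _ in range(H)]     # window of the last H disturbances
    x, total_cost = np.zeros(n), 0.0
    for t in range(T):
        u = -(K @ x) + sum(M[i] @ past_w[i] for i in range(H))
        total_cost += x @ Q @ x + u @ R @ u      # convex (here quadratic) cost
        x_next = A @ x + B @ u + disturbance(t)  # adversary supplies the bounded w_t
        # w_t is recoverable because A, B are known and the state is fully observed:
        w = x_next - A @ x - B @ u
        grad_u = 2.0 * (R @ u)                   # d cost / d u for the quadratic cost
        for i in range(H):
            M[i] -= eta * np.outer(grad_u, past_w[i])   # OGD step on the surrogate loss
        past_w = [w] + past_w[:-1]               # shift the disturbance window
        x = x_next
    return total_cost
```

With the same $A$, $B$, $K$, and disturbance sequence as in the earlier sketch, the cumulative cost of this controller can be compared directly against any fixed linear policy's counterfactual cost.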
Thank You!
For more details, please visit the poster: Pacific Ballroom #155
namanagarwal@google.com