Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel
Joint work with: Alon Cohen, Tomer Koren
Reinforcement Learning

At each round the agent plays an action u_t, then observes the next state x_{t+1} and suffers a cost c_t.

Discrete MDP vs. Linear Quadratic Regulator (LQR):
- Space: x_t ∈ S, u_t ∈ A (MDP) vs. x_t ∈ ℝ^d, u_t ∈ ℝ^k (LQR)
- Transition: unstructured, x_{t+1} ∼ P(·|x_t, u_t) vs. linear, x_{t+1} = A_⋆ x_t + B_⋆ u_t + w_t
- Costs: unstructured, c_t = c(x_t, u_t) vs. quadratic, c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t
- Optimal policy: dynamic programming vs. linear state feedback, u_t = −K_⋆ x_t
- Problem size: |S|, |A| vs. d, k, ∥A_⋆∥, ∥B_⋆∥
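To make the LQR column concrete, here is a minimal simulation sketch (not from the talk; the matrices A, B, Q, R, the gain K, and the noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 2, 1, 0.1                    # state dim, control dim, noise level (arbitrary)
A = np.array([[1.0, 0.1], [0.0, 1.0]])     # example A_* (illustrative, not from the talk)
B = np.array([[0.0], [0.1]])               # example B_*
Q, R = np.eye(d), np.eye(k)                # quadratic cost matrices
K = np.array([[1.0, 1.5]])                 # a stabilizing linear policy u_t = -K x_t for these matrices

x, total_cost, T = np.zeros(d), 0.0, 1000
for t in range(T):
    u = -K @ x                             # linear state feedback
    total_cost += x @ Q @ x + u @ R @ u    # c_t = x_t^T Q x_t + u_t^T R u_t
    w = sigma * rng.standard_normal(d)     # i.i.d. Gaussian noise w_t ~ N(0, sigma^2 I)
    x = A @ x + B @ u + w                  # linear transition x_{t+1} = A x_t + B u_t + w_t
print("empirical average cost:", total_cost / T)
```

Running this prints the empirical average of the quadratic cost incurred under the fixed linear policy.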
"Adaptive Control"

Setting:
- Transition: x_{t+1} = A_⋆ x_t + B_⋆ u_t + w_t
- Cost: c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t
- Optimal policy: u_t = −K_⋆ x_t
- i.i.d. noise: w_t ∼ 𝒩(0, σ² I)

Goal: minimize regret (cumulative cost) when A_⋆, B_⋆ are unknown.

Important milestones:
1. Non-efficient √T regret: Abbasi-Yadkori and Szepesvári (2011)
2. Efficient T^{2/3} regret: Dean et al. (2018)
3. First efficient √T regret: Cohen et al. (2019), Mania et al. (2019)

Is √T regret optimal? There were no previous lower bounds.
- Noise: typically √T regret in stochastic bandits.
- Objective structure: typically log T regret for strongly convex costs.
Main Results

log T regret is possible, sometimes…
- If A_⋆ is unknown (B_⋆ known) ⟹ an efficient algorithm with Õ(log T) regret.
- If B_⋆ is unknown (A_⋆ known) ⟹ an efficient algorithm with Õ(log T / λ_min(K_⋆ K_⋆^⊤)) regret.
Õ(·) only hides polynomial dependence on the problem parameters.

… but in general, √T regret is unavoidable.
- First* Ω(√T) regret lower bound for the adaptive LQR problem.
- Holds even when A_⋆ is known.
- The construction relies on small λ_min(K_⋆ K_⋆^⊤).
* concurrently with Simchowitz and Foster (2020)
Formalities

Linear Quadratic Control
- Choose u_1, u_2, … to minimize the infinite-horizon average cost J = lim_{T→∞} (1/T) 𝔼[ ∑_{t=1}^T c_t ].
- Optimal policy: u_t = −K_⋆ x_t; optimal infinite-horizon average cost: J(K_⋆).
- K_⋆ := K_⋆(A_⋆, B_⋆, Q, R) can be efficiently calculated (Riccati equation).

Learning Objective
Regret minimization under parameter uncertainty:
Regret = 𝔼[ ∑_{t=1}^T ( c_t − J(K_⋆) ) ]
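As a sketch of the Riccati route to K_⋆ mentioned above, the following uses scipy's discrete algebraic Riccati solver; the example matrices are assumptions, and J(K_⋆) = σ² tr(P) is the standard steady-state average cost for noise covariance σ²I:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def optimal_lqr(A, B, Q, R, sigma=1.0):
    """Return the optimal gain K_* and average cost J(K_*) for noise w_t ~ N(0, sigma^2 I)."""
    P = solve_discrete_are(A, B, Q, R)                  # fixed point of the Riccati equation
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_* = (R + B^T P B)^{-1} B^T P A
    J = sigma ** 2 * np.trace(P)                        # steady-state average cost sigma^2 tr(P)
    return K, J

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # example A_*
B = np.array([[0.0], [0.1]])             # example B_*
Q, R = np.eye(2), np.eye(1)
K_star, J_star = optimal_lqr(A, B, Q, R, sigma=0.1)
print("K_*:", K_star, "J(K_*):", J_star)
```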
Regret Reparameterization
Playing u_t = −K_t x_t ⟹ Regret ≈ 𝔼[ ∑_{t=1}^T ( J(K_t) − J(K_⋆) ) ]*
* as long as K_t does not change too often.

Strong Stability (Cohen et al. 2018)
Playing u_t = −K x_t ⟹ (1/T) 𝔼[ ∑_{t=1}^T c_t ] converges to J(K) exponentially fast.
Definition: K ∈ ℝ^{k×d} is (κ, γ)-strongly stable for (A_⋆, B_⋆) if there exist H, L such that:
1. A_⋆ + B_⋆ K = H L H^{−1}
2. ∥L∥ ≤ 1 − γ, and ∥H∥, ∥H^{−1}∥, ∥K∥ ≤ κ
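One way to test this definition numerically is to take the eigendecomposition of the closed-loop matrix as the candidate (H, L) pair. The sketch below is only a sufficient check (other factorizations may certify strong stability when this one fails), assumes the closed-loop matrix is diagonalizable, and the function name is illustrative:

```python
import numpy as np

def is_strongly_stable(K, A, B, kappa, gamma):
    """Test the (kappa, gamma)-strong-stability conditions for one particular
    factorization (H, L): the eigendecomposition of the closed-loop matrix."""
    M = A + B @ K                          # closed-loop matrix A_* + B_* K, as in the definition above
    eigvals, H = np.linalg.eig(M)          # M = H L H^{-1} with L = diag(eigvals)
    L = np.diag(eigvals)
    cond1 = np.linalg.norm(L, 2) <= 1 - gamma
    cond2 = max(np.linalg.norm(H, 2),
                np.linalg.norm(np.linalg.inv(H), 2),
                np.linalg.norm(K, 2)) <= kappa
    return bool(cond1 and cond2)
```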
A Recipe for √T Regret?

First-order estimation
Assuming J(K) is Lipschitz:
Regret ≈ 𝔼[ ∑_{t=1}^T ( J(K_t) − J(K_⋆) ) ] ⪅ 𝔼[ ∑_{t=1}^T ∥K_t − K_⋆∥ ]

Perform minimal exploration to get ∥K_t − K_⋆∥ ≤ 1/√T and then play K_t:
Regret ≈ √T + exploration cost

Challenges
- The estimation rate is ∥K_t − K_⋆∥ ⪆ 1/√T.
- Exploration can be expensive! e.g., ∥K_t − K_⋆∥ ≤ T^{−1/4} in previous work (see the rough accounting below).
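To see why exploration is the bottleneck, here is a back-of-the-envelope explore-then-commit accounting (not from the talk, stated under the Lipschitz assumption above): if injected exploration noise is played for T_0 rounds, each costing O(1) extra, and yields estimation error ∥K_{T_0} − K_⋆∥ ∝ 1/√T_0, then

\[
\mathrm{Regret} \;\lesssim\; \underbrace{T_0}_{\text{exploration cost}} \;+\; \underbrace{T \cdot \tfrac{1}{\sqrt{T_0}}}_{\text{cost of playing } K_{T_0}},
\]

which is minimized at T_0 ≈ T^{2/3} and recovers the T^{2/3} rate of Dean et al. (2018). Achieving the √T recipe therefore requires 1/√T estimation accuracy without paying more than O(√T) for exploration.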
Case 1: Unknown A_⋆ (Known B_⋆)

B_⋆ known ⟹ y_t = x_{t+1} − B_⋆ u_t is observed, and
y_t = A_⋆ x_t + noise,
so the unknown A_⋆ is "sensed" via x_t.

Least Squares Estimation (Â_t)
Error: ∥Â_t − A_⋆∥ ∝ σ / √( λ_min( ∑_{s=1}^{t} w_s w_s^⊤ ) ) ∝ T^{−1/2}
Free exploration by the noise w_{t−1}!
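A minimal sketch of this least-squares step (illustrative only; the array-shape conventions, with rows indexed by time, are assumptions):

```python
import numpy as np

def estimate_A(xs, us, B):
    """Least-squares estimate of A_* from one trajectory, given known B_*.
    xs: (t+1, d) array of states x_1..x_{t+1}; us: (t, k) array of inputs u_1..u_t."""
    X = xs[:-1]                                      # regressors x_s
    Y = xs[1:] - us @ B.T                            # targets y_s = x_{s+1} - B_* u_s = A_* x_s + w_s
    A_hat_T, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves min_A || X A^T - Y ||_F
    return A_hat_T.T
```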