Poster #50
Fingerprint Policy Optimisation for Robust Reinforcement Learning
Supratik Paul, Michael A. Osborne, Shimon Whiteson
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713)
Motivation
• Environment variable (EV)
  • E.g. wind conditions
  • Controllable during learning but not during execution
• Objective: find π* = argmax_π J(π) = argmax_π 𝔼_{EV∼p(EV)}[R(π)]
• Need to account for rare events
  • E.g. rare wind conditions leading to a crash
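As an informal illustration (not from the poster), the objective above is simply an expectation over environment variables, so it could be estimated by plain Monte Carlo; `sample_ev` and `rollout_return` below are hypothetical stand-ins for an EV sampler and a rollout routine.

```python
import numpy as np

def estimate_expected_return(policy, sample_ev, rollout_return, n_samples=1000):
    """Monte Carlo estimate of J(pi) = E_{EV ~ p(EV)}[R(pi)].

    sample_ev()             -- draws one environment variable (e.g. wind) from p(EV)
    rollout_return(pi, ev)  -- return of one episode run under policy pi with that EV
    """
    returns = [rollout_return(policy, sample_ev()) for _ in range(n_samples)]
    return float(np.mean(returns))
```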
Naïve application of f policy gradients Rare Events Trajectories ~ 𝜌 • Monte Carlo estimate of the Policy Gradient has very high variance ⟹ Doomed to failure 4
Fingerprint Policy Optimisation (FPO)
• At each iteration, select the parameters ψ of q_ψ(EV) that maximise the one-step expected return
Fingerprint Policy Optimisation (FPO)
• π′ = π + α∇J(π)
• J(π′) = f(π, ψ)
• Model J(π′) as a Gaussian Process with inputs (π, ψ)
  • π is high dimensional ⟹ replace it with a low-dimensional representation, the “fingerprint”
• Use Bayesian Optimisation to select ψ|π = argmax_ψ f(π, ψ)
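The sketch below is a schematic reading of this loop, not the authors' implementation: `fingerprint`, `policy_gradient_step`, and `evaluate_return` are hypothetical helpers, scikit-learn's GaussianProcessRegressor stands in for the GP, and a crude UCB-scored random search stands in for the Bayesian optimisation acquisition step.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fpo_step(policy_params, history, fingerprint, policy_gradient_step,
             evaluate_return, psi_dim=2, n_candidates=200, rng=None):
    """One schematic FPO-style iteration.

    history: list of ((fingerprint, psi), observed return) pairs from past iterations.
    """
    rng = rng or np.random.default_rng()
    phi = fingerprint(policy_params)          # low-dimensional policy fingerprint

    # Fit a GP mapping (fingerprint, psi) -> J(pi') on past observations.
    gp = None
    if history:
        X = np.array([np.concatenate([f, p]) for (f, p), _ in history])
        y = np.array([j for _, j in history])
        gp = GaussianProcessRegressor().fit(X, y)

    # "Bayesian optimisation" of psi, here reduced to a UCB-scored random search.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, psi_dim))
    if gp is not None:
        Xc = np.hstack([np.tile(phi, (n_candidates, 1)), candidates])
        mu, std = gp.predict(Xc, return_std=True)
        psi = candidates[np.argmax(mu + std)]
    else:
        psi = candidates[0]                   # no data yet: pick an arbitrary psi

    # Policy gradient step on trajectories whose EVs are sampled from q_psi,
    # then record the achieved expected return for future GP fits.
    new_params = policy_gradient_step(policy_params, psi)
    history.append(((phi, psi), evaluate_return(new_params)))
    return new_params, history
```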
Policy fingerprints
• Disambiguation, not accurate representation
• State/Action fingerprints: Gaussians fitted to the stationary state/action distribution induced by π
• A gross simplification, but good at disambiguating between policies
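A minimal sketch of such a state fingerprint, assuming a hypothetical `rollout_states` helper that returns the states visited in one episode; this is an illustration (a diagonal Gaussian summarised by per-dimension mean and standard deviation), not the authors' code.

```python
import numpy as np

def state_fingerprint(policy, rollout_states, n_episodes=10):
    """Summarise the state distribution induced by `policy` with a diagonal
    Gaussian; its per-dimension mean and std form the fingerprint."""
    states = np.concatenate([rollout_states(policy) for _ in range(n_episodes)], axis=0)
    return np.concatenate([states.mean(axis=0), states.std(axis=0)])
```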
Results
Half Cheetah
• Velocity target = 2 with probability 98%, with ‘normal’ reward
• Velocity target = 4 with probability 2%, with significantly higher reward
Ant
• Reward proportional to velocity
• 5% chance that velocity > 2 leads to joint damage with a large negative reward