Assessing Generalization in Deep Reinforcement Learning
Soo Jung Jang
Background
● Before (ex: factory robot): focus on one environment; generalization is not considered
● Now (ex: human-like intelligence): apply to multiple environments; generalization is important
● Paper’s Goal: Empirical study of generalization in deep RL across different (1) algorithms, (2) environments, and (3) metrics
Algorithms
● Vanilla (Baseline) Algorithms
  ○ A2C: Actor-Critic Family
  ○ PPO: Policy-Gradient Family
● Generalization-Tackling Algorithms
  ○ EPOpt: Robust Approach
  ○ RL2: Adapt Approach
● 6 Algorithms Total: A2C, PPO, EPOpt-A2C, EPOpt-PPO, RL2-A2C, RL2-PPO
Algorithms - Vanilla
● A2C / Actor-Critic Family
  ○ Critic: learns a value function
  ○ Actor: uses that value function to learn a policy that maximizes expected reward
● PPO / Policy-Gradient Family
  ○ Learns a sequence of improving policies
  ○ Maximizes a surrogate for the expected reward via gradient ascent (see the sketch below)
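As a concrete illustration (not taken from the paper or its code), here is a minimal sketch of PPO's clipped surrogate objective in PyTorch. The function name, the inputs (`log_probs_new`, `log_probs_old`, `advantages`), and the clip range of 0.2 are assumptions made for the sketch.

```python
import torch

def ppo_clipped_surrogate(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    Hypothetical inputs: per-action log-probabilities under the new and old
    policies, plus advantage estimates; all 1-D tensors of equal length.
    """
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped terms; PPO takes the elementwise minimum
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate so that minimizing this loss maximizes the surrogate reward
    return -torch.min(unclipped, clipped).mean()
```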
Algorithms - Generalization-Tackling
● EPOpt / Robust Approach
  ○ Maximize expected reward over the subset of environments with the lowest expected reward (i.e., maximize conditional value at risk; see the sketch below)
● RL2 / Adapt Approach
  ○ Learn an environment embedding at test time, “on the fly”
  ○ RNN takes the current trajectory as input; its hidden states serve as the embedding
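A minimal sketch (not the paper's implementation) of the EPOpt idea: collect rollouts across sampled environments and keep only the worst-performing fraction for the policy update. `sample_env_params` and `collect_episode` are hypothetical helpers supplied by the caller.

```python
def epopt_batch(sample_env_params, collect_episode, n_envs=100, epsilon=0.1):
    """Select the worst-performing rollouts for an EPOpt-style update.

    sample_env_params() draws one environment parameter vector;
    collect_episode(params) runs the current policy in that environment and
    returns (trajectory, episode_return). Both are hypothetical helpers.
    """
    rollouts = []
    for _ in range(n_envs):
        params = sample_env_params()
        trajectory, ep_return = collect_episode(params)
        rollouts.append((ep_return, trajectory))
    # Keep only the bottom-epsilon fraction of episodes by return (the CVaR idea);
    # the base policy-gradient update (A2C or PPO) is then applied to these alone.
    rollouts.sort(key=lambda r: r[0])
    k = max(1, int(epsilon * n_envs))
    return [traj for _, traj in rollouts[:k]]
```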
Algorithms - Network Architecture
● Feed Forward (FF)
  ○ Multi-layer perceptron (MLP)
● Recurrent (RC)
  ○ LSTM on top of an MLP
● The 4 non-RL2 algorithms are tested on both FF and RC
● The 2 RL2 algorithms are tested only on RC, since RL2 requires a recurrent policy (see the architecture sketch below)
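A sketch of the two architecture families in PyTorch; the hidden sizes and tanh activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class FFPolicy(nn.Module):
    """Feed-forward (FF) policy: a plain MLP."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class RCPolicy(nn.Module):
    """Recurrent (RC) policy: an LSTM stacked on top of an MLP encoder."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim); the LSTM state carries history
        h = self.encoder(obs_seq)
        out, state = self.lstm(h, state)
        return self.head(out), state
```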
Environments
● 6 environments (OpenAI): CartPole, MountainCar, Acrobot, Pendulum, HalfCheetah, Hopper
Metrics - Environment Parameters
● Deterministic (D)
  ○ Parameters fixed at their default values (fixed environment)
● Random (R)
  ○ Parameters uniformly sampled from a d-dimensional box (feasible environments)
● Extreme (E)
  ○ Parameters uniformly sampled from the union of 2 intervals that straddle the corresponding interval in R (edge cases); see the sampling sketch below
[Schematic: parameter sampling for d = 2 with 4 samples per scheme]
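An illustrative sampling sketch for the three parameter schemes; the interval bounds below are made-up placeholders for one parameter dimension, not the paper's actual ranges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ranges for a single environment parameter (not the paper's values):
DEFAULT = 1.0                      # D: deterministic default
R_LO, R_HI = 0.5, 1.5              # R: "feasible" interval
E_OUTER_LO, E_OUTER_HI = 0.2, 2.0  # E lies between the outer bounds and the R interval

def sample_param(scheme):
    """Sample one environment parameter under scheme 'D', 'R', or 'E'."""
    if scheme == "D":
        return DEFAULT
    if scheme == "R":
        return rng.uniform(R_LO, R_HI)
    if scheme == "E":
        # Union of two intervals straddling [R_LO, R_HI]: pick a side, then sample
        if rng.random() < 0.5:
            return rng.uniform(E_OUTER_LO, R_LO)
        return rng.uniform(R_HI, E_OUTER_HI)
    raise ValueError(scheme)

# A d-dimensional parameter vector is sampled per dimension, e.g. d = 2:
params = np.array([sample_param("E") for _ in range(2)])
```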
Metrics - Evaluation
● 3 evaluation metrics from the 3x3 train-test pairs of (D/R/E)
  1. Default: DD
  2. Interpolation: RR
  3. Extrapolation: mean of DR, DE, and RE (aggregation sketched below)
● Metric value (performance)
  ○ Success rate (%): the percentage of test episodes in which a predefined goal is achieved
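A small sketch of how the three metrics aggregate per-pair success rates; the numbers are made-up placeholders, not results from the paper.

```python
# Hypothetical per-pair success rates (%), keyed by train->test scheme
# (placeholder values only, NOT results from the paper):
success = {
    "DD": 95.0, "DR": 70.0, "DE": 40.0,
    "RR": 85.0, "RE": 55.0,
    # ... remaining pairs omitted
}

metrics = {
    "default": success["DD"],
    "interpolation": success["RR"],
    "extrapolation": (success["DR"] + success["DE"] + success["RE"]) / 3.0,
}
print(metrics)
```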
Experiment
● Compare performance across:
  ○ 10 algorithm-architecture combinations (6 algorithms / 2 architectures)
  ○ 6 environments
  ○ 3 metrics (default, interpolation, extrapolation)
● Methodology: train for 15000 episodes / test on 1000 episodes (see the evaluation sketch below)
● Fairness
  ○ No memory of the previous episode
  ○ Several sweeps of hyperparameters
  ○ Success rate instead of the reward itself
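A minimal sketch of how a success rate could be estimated over the test episodes; `run_episode` is a hypothetical helper that plays one freshly reset test episode with the trained policy (no memory carried over) and reports whether the goal was achieved.

```python
def evaluate_success_rate(run_episode, n_episodes=1000):
    """Estimate the success rate (%) of a trained agent over test episodes."""
    successes = sum(1 for _ in range(n_episodes) if run_episode())
    return 100.0 * successes / n_episodes
```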
Results
● Default > Interpolation > Extrapolation
● FF architecture > RC architecture
● Vanilla > Generalization-Tackling
● RL2 variants do not work
● EPOpt-PPO works well in continuous action spaces (Pendulum, HalfCheetah, Hopper)
Discussion Questions
● The generalization-tackling algorithms tested in this paper failed. What would be a potential strategy that makes generalization work? How would you solve this RL generalization problem?
● Why do you think the generalization-tackling algorithms and recurrent (RC) architectures perform worse than the vanilla algorithms and feed-forward (FF) architectures? When would you expect them to work better?
● Do you think the paper’s experimental methodology is fair? Is there a better way to evaluate generalization across different algorithms and architectures?