Learning Dexterity
Peter Welinder, September 9, 2018
Learning
Trends towards learning-based robotics
Reinforcement Learning: Go (AlphaGo Zero), Dota 2 (OpenAI Five)
What about robotics? RL is hard to apply directly because it consumes enormous amounts of experience: AlphaGo Zero played 5 million games, roughly 500 years of Go. Go: ~200 years of experience per day of training; Dota: ~200 years per day.
Simulators
Learning dexterity
24 joints: 20 actuated, 4 underactuated
Rotating a block. Challenges:
- RL in the real world
- high-dimensional control
- noisy and partial observations
- manipulating multiple objects
Approach
Reinforcement Learning + Domain Randomization
Reinforcement Learning: the agent (policy) receives states and rewards from the environment and outputs actions:
action_t = policy(state_t)
score = Σ_t reward(state_t, action_t)
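To make the loop concrete, here is a minimal sketch of one episode, assuming a Gym-style env.reset()/env.step() interface (the environment and policy are not part of the talk):

```python
# Minimal sketch of the agent-environment loop from the slide, assuming a
# Gym-style environment API (env.reset / env.step) and an arbitrary policy.
def run_episode(env, policy):
    state = env.reset()
    score = 0.0
    done = False
    while not done:
        action = policy(state)                 # action_t = policy(state_t)
        state, reward, done, _ = env.step(action)
        score += reward                        # score = sum_t reward(state_t, action_t)
    return score
```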
Reinforcement Learning: find the policy parameters that maximize total reward over episodes,
θ* = argmax_θ Σ_{τ ∈ episodes} reward(policy_θ, τ)
optimized with Proximal Policy Optimization (PPO), Schulman et al. (2017).
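PPO maximizes a clipped surrogate of this objective. A hedged numpy sketch of the clipped term from Schulman et al. (2017); the variable names are illustrative, not from the talk:

```python
import numpy as np

# Sketch of PPO's clipped surrogate objective (Schulman et al., 2017).
# logp_new / logp_old are per-timestep log-probabilities of the taken actions
# under the current and behavior policies; advantages are estimated returns
# minus a baseline.
def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    # Maximize the per-step minimum of the clipped and unclipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```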
Policy network: noisy observations (fingertip positions, object pose) and the goal are normalized, passed through a fully-connected ReLU layer and an LSTM, and mapped to an action distribution over finger joint positions.
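A hedged PyTorch sketch of this architecture; the layer sizes and the Gaussian action head are illustrative assumptions, not the talk's actual values:

```python
import torch
import torch.nn as nn

# Sketch of the policy on the slide: normalized noisy observations plus the
# goal -> fully-connected ReLU -> LSTM -> action distribution over finger
# joint position targets. Hidden sizes are guesses.
class Policy(nn.Module):
    def __init__(self, obs_dim, goal_dim, n_joints, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(obs_dim + goal_dim)        # "Normalization"
        self.fc = nn.Linear(obs_dim + goal_dim, hidden)     # "Fully-connected ReLU"
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mean = nn.Linear(hidden, n_joints)             # action distribution
        self.log_std = nn.Parameter(torch.zeros(n_joints))

    def forward(self, obs, goal, state=None):
        # obs, goal: (batch, time, dim) tensors
        x = torch.relu(self.fc(self.norm(torch.cat([obs, goal], dim=-1))))
        x, state = self.lstm(x, state)
        return torch.distributions.Normal(self.mean(x), self.log_std.exp()), state
```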
Distributed training with Rapid: rollout workers (6,000 CPU cores) collect experience and send it to optimizers (8 GPUs), which broadcast updated policy parameters back to the workers.
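A toy, self-contained sketch of this rollout-worker/optimizer pattern; the queues, the random "experience", and the averaging "update" are all stand-ins for Rapid's actual machinery:

```python
import multiprocessing as mp
import numpy as np

# Toy version of the slide's setup: workers pull the latest parameters,
# generate experience, and push it to a central optimizer loop.
def rollout_worker(params_q, exp_q):
    while True:
        params = params_q.get()
        if params is None:                    # shutdown sentinel
            break
        exp_q.put(params + np.random.randn(*params.shape))  # fake rollout data

def main(n_workers=4, n_iters=10):
    params_q, exp_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=rollout_worker, args=(params_q, exp_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    params = np.zeros(8)
    for _ in range(n_iters):
        for _ in range(n_workers):
            params_q.put(params)              # broadcast parameters
        batch = [exp_q.get() for _ in range(n_workers)]
        params = np.mean(batch, axis=0)       # stand-in "optimizer" step
    for _ in range(n_workers):
        params_q.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    main()
```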
Domain Randomization: Sadeghi & Levine (2017); Tobin et al. (2017); Peng et al. (2018)
Physics randomizations (a hedged sampling sketch follows the list):
- object dimensions
- object and robot link masses
- surface friction coefficients
- robot joint damping coefficients
- actuator force gains
- joint limits
- gravity vector
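A sketch of sampling one randomized simulation, covering the parameter families above; all ranges and scales are invented for illustration, as the talk does not give the actual values:

```python
import numpy as np

# Hypothetical per-episode randomization of the physics parameters listed
# on the slide. Every range below is an illustrative assumption.
def sample_physics_randomization(rng=np.random):
    return {
        "object_size_scale": rng.uniform(0.95, 1.05),          # object dimensions
        "mass_scale": rng.lognormal(mean=0.0, sigma=0.1),      # object/link masses
        "friction_scale": rng.lognormal(mean=0.0, sigma=0.2),  # surface friction
        "joint_damping_scale": rng.lognormal(mean=0.0, sigma=0.3),
        "actuator_gain_scale": rng.lognormal(mean=0.0, sigma=0.2),
        "joint_limit_offset_rad": rng.normal(0.0, 0.02),       # joint limits
        "gravity_offset": rng.normal(0.0, 0.4, size=3),        # gravity vector
    }
```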
Vision network: each of the three camera images passes through its own Conv → Pool → ResNet → SSM (spatial softmax) branch; the branch outputs are concatenated and dense layers predict object position and object rotation. A hedged sketch follows.
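A PyTorch sketch of this multi-camera architecture; channel counts are guesses and a single conv layer stands in for the ResNet blocks:

```python
import torch
import torch.nn as nn

# Sketch of the pose-estimation network in the diagram: per-camera
# Conv -> Pool -> ResNet -> spatial softmax branches, concatenated,
# then dense heads for object position and rotation.
class SpatialSoftmax(nn.Module):
    # Converts each feature map into expected (x, y) image coordinates.
    def forward(self, x):
        b, c, h, w = x.shape
        probs = torch.softmax(x.view(b, c, h * w), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w)
        return torch.cat([(probs * xs).sum((2, 3)), (probs * ys).sum((2, 3))], dim=1)

class PoseNet(nn.Module):
    def __init__(self, n_cameras=3, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, channels, 5, stride=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),  # ResNet stand-in
                SpatialSoftmax(),
            ) for _ in range(n_cameras)
        ])
        self.position = nn.Linear(n_cameras * 2 * channels, 3)  # x, y, z
        self.rotation = nn.Linear(n_cameras * 2 * channels, 4)  # quaternion

    def forward(self, images):  # images: list of (B, 3, H, W), one per camera
        feats = torch.cat([b(img) for b, img in zip(self.branches, images)], dim=1)
        return self.position(feats), self.rotation(feats)
```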
Train in Simulation
(A) Distributed workers collect experience on randomized environments at large scale.
(B) We train a control policy using reinforcement learning. It chooses the next action based on fingertip positions and the object pose. [Diagram: observed robot states → LSTM → actions]
(C) We train a convolutional neural network to predict the object pose given three simulated camera images. [Diagram: three CONV branches → object pose]
Transfer to the Real World
(D) We combine the pose estimation network and the control policy to transfer to the real world. [Diagram: three CONV branches → object pose; fingertip locations + object pose → LSTM → actions]
Results
Randomizations   Object tracking    Max number of successes   Median number of successes
All              Vision             46                        11.5
All              Motion tracking    50                        13
None             Motion tracking    6                         0
Training time [plot: consecutive goals achieved (0–50) vs. years of simulated experience (1–100, log scale), comparing All Randomizations to No Randomizations]
Grasp types: Tip Pinch, Palmar Pinch, Tripod, Quadpod, Power Grasp, 5-finger Precision Grasp
Thank You Visit openai.com for more information. FOLLOW @OPENAI ON TWITTER