Verification of Agents Learning through Reinforcement

Shashank Pathak 1,2   Giorgio Metta 1,2   Luca Pulina 3   Armando Tacchella 2

1 Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy
  Shashank.Pathak@iit.it – Giorgio.Metta@iit.it
2 Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy
  Armando.Tacchella@unige.it
3 POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy
  lpulina@uniss.it
Outline

- Reinforcement Learning
- Air hockey as case-study
- Air hockey as RL task
- Verification
- Repair
- Conclusion
Reinforcement Learning

Some relevant features:
- Learning through experiences, i.e. tuples $(S_t, A_t, R_t, S_{t+1})$
- The objective is to attain a policy $\pi(s_i) \to A_i$
- Moreover, the policy should maximize some measure of "rewards" $R_i$

Figure: Reinforcement Learning (agent–environment interaction loop)
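As an illustration of the experience tuples above, here is a minimal Python sketch of the agent–environment interaction loop. The `env` interface (`reset`/`step`) and the `policy` callable are hypothetical placeholders, not the air-hockey setup described later.

```python
def collect_episode(env, policy, max_steps=100):
    """Collect experience tuples (S_t, A_t, R_{t+1}, S_{t+1}) from one episode."""
    experiences = []
    s = env.reset()                        # hypothetical environment interface
    for _ in range(max_steps):
        a = policy(s)                      # action chosen by the current policy
        s_next, r, done = env.step(a)      # next state, reward and termination flag
        experiences.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return experiences
```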
Finite-window rewards

Assume a finite time horizon $T$ and a discount factor $\gamma \in [0, 1)$:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$

Define the value of a state as the expected value of this discounted return:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left( R_t \mid s_t = s \right)$$
$$V(s_t) \leftarrow V(s_t) + \alpha \left( R_t - V(s_t) \right)$$

This gives the temporal-difference update:
$$V(s_t) \leftarrow V(s_t) + \alpha \delta, \qquad \delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \qquad (1)$$
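A minimal sketch of the tabular TD(0) update of Eq. (1); the dictionary-based value table, step size and discount factor are illustrative assumptions.

```python
from collections import defaultdict

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """Apply V(s_t) <- V(s_t) + alpha * delta, with delta = r_{t+1} + gamma*V(s_{t+1}) - V(s_t)."""
    target = r_next if terminal else r_next + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha * delta
    return delta

# Usage: V = defaultdict(float); td0_update(V, s, r, s_next)
```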
Air hockey

Figure: Platform and simulator
Reasons for choosing air hockey

- Air hockey is a challenging platform and has been used in the past to demonstrate learning
- As a robotic setup, it has been included as one of the benchmarks for robotics & humanoids
- Our previous work was performed on the real air hockey setup with supervised learning
Simulator

- For the current study, we chose the simulator instead of the real setup
- Our goal was to demonstrate safety in a model-free learning approach and ways to improve it
- Sophisticated semi-supervised approaches are needed to apply RL on the real setup
- Showing the benefits of verification and repair is independent of these approaches
- Simulation, or at least some logging, would be required even if the real setup were used
Simulator ...

- The simulator was implemented in C++ using libraries such as OpenCV, Boost and Pantheios
- For simplicity, no game engine was used; instead, 2D physics was implemented directly
- Physical and geometric considerations were also taken into account
- Extensive logging and a GUI-based parameter search were provided
Learning Problem

- Given: an air hockey platform and a robotic arm
- Objective: to learn to defend the goal as well as possible
- The action of the robotic arm was constrained to be a minimum-jerk trajectory, for joint kinematics and safety (see the sketch below)
- The state was defined in trajectory space rather than Cartesian coordinates
- Discrete states and discrete actions were considered
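Not part of the original slides: a hedged sketch of the standard minimum-jerk position profile (the 5th-order polynomial of Flash & Hogan) that such trajectory constraints typically refer to; the specific parameterization used on the robot is an assumption.

```python
def minimum_jerk(x0, xf, T, t):
    """Minimum-jerk position profile from x0 to xf over duration T.

    Standard form x(t) = x0 + (xf - x0) * (10*tau^3 - 15*tau^4 + 6*tau^5),
    which has zero velocity and acceleration at both endpoints.
    """
    tau = min(max(t / T, 0.0), 1.0)        # normalized time in [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * s

# Usage: joint angle at t = 0.1 s of a 0.5 s motion from 0.0 rad to 0.8 rad
theta = minimum_jerk(0.0, 0.8, 0.5, 0.1)
```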
Learn

Algorithm 1: Pseudo-code for learning to play air hockey using Reinforcement Learning

  Initialize Q ← 0; Δt ← 20 ms
  function Learn(Ne, Nb, Nr)
      for all i ∈ {1, ..., Ne} do
          Send Start signal to Simulator
          j ← 1
          repeat
              Receive s_j ← (p_j, α_j, θ_j) from Simulator
              Δθ_j ← ComputePolicy(Q, s_j)
              Send (Δθ_j, Δt) to Simulator
              Receive s_{j+1} ← (p_{j+1}, α_{j+1}, θ_{j+1}) and f_{j+1} ← (m, g, w, r)
              r_{j+1} ← ComputeReward(s_{j+1}, f_{j+1})
              E_j ← (s_j, Δθ_j, r_{j+1}, s_{j+1}, f_{j+1})
              Q ← Update(Q, E_j)
              j ← j + 1
              if (j = Nb) then
                  for all k ∈ {1, ..., Nr} do
                      Choose random m ∈ {1, ..., Nb}
                      Q ← Update(Q, E_m)
                  end for
                  j ← 1
              end if
          until r = TRUE
      end for
      return Q
  end function
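A hedged Python sketch of the learning loop in Algorithm 1, with a tabular Q-update and the batch replay step. The simulator interface (`start`/`step`/`actions`), the epsilon-greedy action choice and the hyperparameter values are placeholders for illustration, not the actual C++ simulator API or the policy used in the paper.

```python
import random
from collections import defaultdict

def learn(sim, Ne, Nb, Nr, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning loop with periodic replay over a batch of Nb experiences."""
    Q = defaultdict(float)                           # Q[(state, action)] -> value

    def update(exp):
        s, a, r, s_next, next_actions = exp
        best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def choose_action(s):
        actions = sim.actions(s)                     # placeholder: admissible discrete actions
        if random.random() < epsilon:                # simple exploration for the sketch
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(Ne):                              # Ne learning episodes
        batch = []
        s = sim.start()                              # placeholder: start an episode in the simulator
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = sim.step(a)            # placeholder: send action, receive outcome
            exp = (s, a, r, s_next, sim.actions(s_next))
            update(exp)
            batch.append(exp)
            if len(batch) == Nb:                     # replay Nr randomly chosen past experiences
                for _ in range(Nr):
                    update(random.choice(batch))
                batch = []
            s = s_next
    return Q
```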
Verification of DTMC

- The discrete state-action space allowed us to model the learned policy as a Discrete Time Markov Chain (DTMC)
- The learnt policy π(s) → a was a softmax distribution over Q-values:
  $$\pi(s, a_i) = \frac{e^{\kappa Q(s, a_i)}}{\sum_{a \in A} e^{\kappa Q(s, a)}} \qquad (2)$$
- Next states were observed via simulation and their probabilities were adjusted empirically
- We considered two approaches: unsafe states as failures, and unsafe states as faults
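A minimal sketch of the softmax (Boltzmann) policy of Eq. (2); the temperature value kappa and the dictionary-based Q-table layout are assumptions.

```python
import math
import random

def softmax_policy(Q, s, actions, kappa=1.0):
    """Probability distribution over actions, proportional to exp(kappa * Q(s, a))."""
    prefs = [kappa * Q[(s, a)] for a in actions]
    m = max(prefs)                                   # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(Q, s, actions, kappa=1.0):
    probs = softmax_policy(Q, s, actions, kappa)
    return random.choices(actions, weights=probs, k=1)[0]
```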
Verification of DTMC: Unsafe states as failures

- Unsafe flag ⇒ halt
- On practical setups, there is usually a low-level controller
- Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.
- For the sake of generality, yet effectiveness, we used a safety-conscious rewarding schema while avoiding Lyapunov candidates
- In our case, the safety of the agent is the reachability probability of unsafe states
- Using this safety property, we used both PRISM and MRMC to obtain a quantitative measure of safety
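Not from the slides: a small sketch of the quantity the probabilistic model checkers compute here, namely the probability of eventually reaching an unsafe state in a DTMC, obtained by solving the standard linear system over the remaining states. The helper function and the toy transition matrix are illustrations, not the PRISM/MRMC interface.

```python
import numpy as np

def unsafe_reachability(P, unsafe):
    """Probability of eventually reaching an unsafe state, for each state of a DTMC.

    P is a row-stochastic transition matrix. With x_i = 1 on unsafe states, the remaining
    probabilities satisfy the linear system (I - P_TT) x_T = P_TU * 1. This sketch assumes
    the system is non-singular (e.g. no safe absorbing component among the other states).
    """
    n = P.shape[0]
    unsafe = sorted(set(unsafe))
    rest = [i for i in range(n) if i not in unsafe]
    A = np.eye(len(rest)) - P[np.ix_(rest, rest)]
    b = P[np.ix_(rest, unsafe)].sum(axis=1)
    x = np.zeros(n)
    x[unsafe] = 1.0
    x[rest] = np.linalg.solve(A, b)
    return x

# Toy 3-state DTMC in which state 2 is unsafe and absorbing
P = np.array([[0.0, 0.9, 0.1],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
print(unsafe_reachability(P, unsafe=[2])[0])   # reachability probability from state 0
```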
Repairing DTMC

- Intuition: the badness of a state depends on its forward proximity to a bad state
- In general, changing Q-values in a way similar to eligibility traces makes the policy safer
- While this is more effective than incorporating safety while learning, it could deteriorate the learnt policy
- Our experiments show this need not be the case
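A hedged sketch of this kind of repair: Q-values of state-action pairs along a counterexample path are penalized, with the penalty decaying with distance from the unsafe state, in the spirit of an eligibility trace. The penalty magnitude and decay factor are assumptions, not the values used in the work.

```python
def repair_path(Q, path, penalty=1.0, decay=0.8):
    """Penalize Q-values along a counterexample path ending in an unsafe state.

    path is a list of (state, action) pairs ordered from the start of the path to the
    unsafe state; pairs closer to the unsafe state receive a larger penalty.
    """
    for distance, (s, a) in enumerate(reversed(path)):
        Q[(s, a)] -= penalty * (decay ** distance)
    return Q
```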
Repairing DTMC: Using COMICS

- We used the tool COMICS to generate the counterexamples
- We then proceeded with repairing the paths
- The overall procedure is given below

Algorithm 2: Pseudo-code for verification and repair of Learn

 1: Given agent A, learning algorithm Learn, safety bound P_bound
 2: Using A, perform Learn
 3: Obtain policy π(s, a)
 4: Construct a DTMC D from policy π(s, a)
 5: Use MRMC or PRISM on D to obtain the probability P_unsafe of violating P
 6: repeat
 7:     repeat
 8:         Use COMICS to generate the set S_unsafe negating P with bound P_unsafe
 9:         Apply Repair on S_unsafe
10:     until S_unsafe = ∅
11:     P_unsafe ← P_unsafe − ε,  ε ∈ (0, P_unsafe − P_bound]
12: until P_unsafe < P_bound
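Not in the original slides: a hedged Python outline of the verify-and-repair loop of Algorithm 2. The functions `build_dtmc`, `model_check`, `counterexamples` and `repair` stand in for constructing the policy-induced DTMC, the PRISM/MRMC query, COMICS counterexample generation and the repair step respectively; their signatures are hypothetical.

```python
def verify_and_repair(agent_Q, build_dtmc, model_check, counterexamples, repair,
                      p_bound, eps=0.01):
    """Outline of Algorithm 2: verify the learned policy and repair it until the bound holds."""
    dtmc = build_dtmc(agent_Q)                       # DTMC induced by the learned (softmax) policy
    p_unsafe = model_check(dtmc)                     # stands in for a PRISM/MRMC reachability query
    while True:
        while True:
            s_unsafe = counterexamples(dtmc, p_unsafe)   # stands in for COMICS
            if not s_unsafe:
                break
            agent_Q = repair(agent_Q, s_unsafe)          # e.g. the penalty-based repair sketched earlier
            dtmc = build_dtmc(agent_Q)
        p_unsafe -= eps                              # epsilon chosen in (0, p_unsafe - p_bound]
        if p_unsafe < p_bound:
            return agent_Q
```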
Results

Figure: Six panels plotting the "Learn" and "Test" performance curves against the number of episodes (0 to 10000).
Thanks to the audience and to my colleagues Armando Tacchella, Giorgio Metta, & Luca Pulina!

Questions or comments?