

  1. AIPOL: Anti-Imitation-based Policy Learning
     Michèle Sebag, Riad Akrour, Basile Mayeur, Marc Schoenauer
     TAO, CNRS − INRIA − LRI, UPSud, Université Paris-Saclay, France
     ECML PKDD 2016, Riva del Garda

  2. Reinforcement Learning: the ultimate challenge
     ◮ Learning improves survival expectation
     RL and the value function:
     ◮ State space $S$, action space $A$
     ◮ Transition model $p(s, a, s')$
     ◮ Reward function $R: S \to \mathbb{R}$
     ◮ Policy $\pi: S \to A$
     For each $\pi$, define the reward expectation
     $$V^\pi(s) = R(s) + \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_{t+1}) \,\Big|\, s_0 = s,\ s_{t+1} \sim p(s_t, a_t = \pi(s_t), \cdot)\Big]$$
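
The reward expectation above can be made concrete with a tiny Monte Carlo estimator: roll the policy out from $s$ and average the discounted returns. The sketch below is illustrative only and assumes a hypothetical environment interface (`reset_to`, and `step` returning next state, reward and a termination flag), which is not part of the slides.

```python
import numpy as np

def mc_value_estimate(env, policy, s0, gamma=0.99, horizon=200, n_rollouts=50):
    """Monte Carlo estimate of V^pi(s0): average discounted return of rollouts
    that start in s0 and follow policy pi (hypothetical env interface)."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset_to(s0)                      # assumed: put the env in state s0
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(policy(s))      # assumed: (next state, reward, done)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```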

  3. 1: Do we really need the value function?
     YES: Bellman optimality equations
     $$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s, a) = R(s) + \gamma\, \mathbb{E}_{s' \sim p(s, a, \cdot)}\big[V^*(s')\big], \qquad \pi^*(s) = \arg\max_a Q^*(s, a)$$
     NO: learning a value function $Q: S \times A \to \mathbb{R}$ is more complex than learning a policy $\pi: S \to A$
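
Whatever side of the debate one takes, once some score over state-action pairs is available, greedification over a finite action set is a one-liner. The sketch below assumes a callable `Q(s, a)` and a finite `actions` list, both hypothetical placeholders.

```python
def greedy_policy(Q, actions):
    """pi(s) = argmax_a Q(s, a) over a finite action set (greedification)."""
    def pi(s):
        return max(actions, key=lambda a: Q(s, a))
    return pi

# usage sketch: pi = greedy_policy(Q, actions=[-1, 0, +1]); a = pi(current_state)
```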

  4. Value function and energy-based learning (LeCun et al., 2006)
     Goal: learn $h: X \to Y$, e.g. with $Y$ structured
     Energy-based learning:
     1. Learn $g: X \times Y \to \mathbb{R}$ s.t. $g(x, y_x) > g(x, y')$ for all $y' \neq y_x$
     2. Set $h(x) = \arg\max_y g(x, y)$
     EbL pros and cons: − more complex, + more robust
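
A minimal way to read step 1 is as a margin (hinge) loss that pushes $g(x, y_x)$ above every wrong candidate, with inference as an argmax over a finite candidate set. The hinge formulation and the finite `candidates` set below are illustrative assumptions, not the slides' exact training procedure.

```python
def ebl_hinge_loss(g, data, candidates, margin=1.0):
    """Sum of hinge terms enforcing g(x, y_x) > g(x, y') + margin for all y' != y_x."""
    return sum(max(0.0, margin + g(x, y) - g(x, y_x))
               for x, y_x in data for y in candidates if y != y_x)

def ebl_predict(g, x, candidates):
    """Energy-based inference: h(x) = argmax_y g(x, y)."""
    return max(candidates, key=lambda y: g(x, y))
```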

  5. 2: Which human expertise is required for RL?
     Setting        | Agent learns    | Human designs / yields
     RL             | Policy $\pi^*$  | $S$, $A$, $R$
     Inverse RL     | Reward $R$      | (optimal) trajectories [1] + RL
     Preference RL  | Policy return   | ranked trajectories [2,3,4,5] + DPS
     [1] Abbeel, P.: Apprenticeship Learning and Reinforcement Learning. PhD thesis, 2008
     [2] Fürnkranz, J. et al.: Preference-based Reinforcement Learning. MLJ, 2012
     [3] Wilson et al.: A Bayesian Approach for Policy Learning from Trajectory Preference Queries. NIPS 2012
     [4] Jain et al.: Learning Trajectory Preferences for Manipulators via Iterative Improvement. NIPS 2013
     [5] Akrour et al.: Programming by Feedback. ICML 2014

  6. This talk: relaxing the expertise requirement
     The expert is only required to know what can go wrong.
     Counter-trajectories: $CD \stackrel{\mathrm{def}}{=} (s_1, \ldots, s_T)$ s.t. $V^*(s_t) > V^*(s_{t+1})$,
     with $V^*$ the (unknown) optimal value function.
     Example (see the sketch below):
     ◮ Take a bicycle in equilibrium ($s_1$)
     ◮ Apply a random policy
     ◮ The bicycle soon falls down... ($s_T$)
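
The bicycle example translates into a very cheap data-collection loop: start from a good state, act randomly, record the degrading states. The sketch below assumes a hypothetical environment interface (`reset_to`, `step`) and a finite action set; it only illustrates how little expertise is required.

```python
import random

def generate_counter_trajectory(env, good_state, length=1000, actions=(-1, 0, +1)):
    """Record a counter-trajectory (CD): start from a 'good' state (e.g. the bicycle
    in equilibrium) and apply a random policy until the episode ends or `length`
    steps elapse; the visited states are expected to get worse and worse."""
    s = env.reset_to(good_state)                        # assumed: put env in a chosen state
    trajectory = [s]
    for _ in range(length):
        s, _, done = env.step(random.choice(actions))   # assumed: (next state, reward, done)
        trajectory.append(s)
        if done:                                        # e.g. the bicycle has fallen
            break
    return trajectory
```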

  7. Anti-Imitation Policy Learning, 1/3
     Given counter-trajectories $E = \{(s_{i,1}, \ldots, s_{i,T_i}),\ i = 1 \ldots n\}$,
     learn a pseudo-value $U^*$ s.t. $U^*(s_{i,t}) > U^*(s_{i,t+1})$.
     Formally: $U^* = \arg\min_U \{ \mathrm{Loss}(U, E) + \mathcal{R}(U) \}$ with
     ◮ $\mathrm{Loss}(U, E) = \sum_i \sum_{t < t'} \big[ U(s_{i,t'}) - U(s_{i,t}) + 1 \big]_+$
     ◮ $\mathcal{R}(U)$ a regularization term
     If the transition model is known, the AiPOL policy is (sketched below):
     $$\pi_{U^*}(s) = \arg\max_a\ \mathbb{E}_{s' \sim p(s, a, \cdot)}\big[U^*(s')\big]$$
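
The optimization above is an ordinary learning-to-rank problem. As a hedged illustration (not the paper's kernel Ranking SVM, see slide 11), the sketch below fits a linear pseudo-value $U(s) = \langle w, s \rangle$ by subgradient descent on the pairwise hinge loss of this slide, then greedifies over a sampled transition model when one is available; the `sample_next(s, a)` helper is an assumption.

```python
import numpy as np

def learn_pseudo_value(counter_trajs, dim, lr=0.01, reg=1e-3, epochs=100, margin=1.0):
    """Fit a linear pseudo-value U(s) = <w, s> by subgradient descent on
    sum_i sum_{t<t'} [U(s_{i,t'}) - U(s_{i,t}) + margin]_+  +  reg * ||w||^2 / 2,
    so that states occurring earlier in a counter-trajectory score higher."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = reg * w                                        # gradient of the regularizer
        for traj in counter_trajs:                            # traj: array of shape (T, dim)
            for t in range(len(traj)):
                for tp in range(t + 1, len(traj)):
                    if w @ traj[tp] - w @ traj[t] + margin > 0:   # active hinge pair
                        grad += traj[tp] - traj[t]            # subgradient of that pair
        w -= lr * grad
    return lambda s: float(w @ np.asarray(s))

def aipol_policy_known_model(U, sample_next, actions, n_samples=20):
    """pi_{U*}(s) = argmax_a E_{s' ~ p(s,a,.)} U*(s'), the expectation being
    estimated by sampling the (assumed known) transition model sample_next(s, a)."""
    def pi(s):
        return max(actions, key=lambda a: np.mean([U(sample_next(s, a))
                                                   for _ in range(n_samples)]))
    return pi
```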

  8. Anti-Imitation Policy Learning, 2/3
     If the transition model is unknown:
     1. Given the pseudo-value function $U^*$,
     2. Given $G = \{(s_i, a_i, s'_i),\ i = 1 \ldots m\}$ ordered s.t. $U^*(s'_i) > U^*(s'_{i+1})$,
     learn a pseudo Q-value s.t. $Q^*(s_i, a_i) > Q^*(s_{i+1}, a_{i+1})$.
     Formally (learning to rank): $Q^* = \arg\min_Q \{ \mathrm{Loss}(Q, G) + \mathcal{R}(Q) \}$ with
     ◮ $\mathrm{Loss}(Q, G) = \sum_{i < j} \big[ Q(s_j, a_j) - Q(s_i, a_i) + 1 \big]_+$
     ◮ $\mathcal{R}(Q)$ a regularization term
     The AiPOL policy is then (see the sketch below):
     $$\pi_{Q^*}(s) = \arg\max_a Q^*(s, a)$$
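
The same learning-to-rank machinery applies once the triplets are ordered by $U^*$ of their successor state. Again a hedged linear sketch, not the paper's kernel Ranking SVM: `featurize(s, a)` (e.g. the state concatenated with a one-hot action) is a hypothetical helper.

```python
import numpy as np

def learn_pseudo_q(triplets, U, featurize, dim, lr=0.01, reg=1e-3, epochs=100, margin=1.0):
    """Fit a linear pseudo Q-value Q(s, a) = <w, phi(s, a)>: triplets (s, a, s') are
    sorted so that better successors (higher U*(s')) come first, then the pairwise
    hinge loss sum_{i<j} [Q(s_j, a_j) - Q(s_i, a_i) + margin]_+ is minimized."""
    triplets = sorted(triplets, key=lambda tr: U(tr[2]), reverse=True)
    feats = [np.asarray(featurize(s, a), dtype=float) for s, a, _ in triplets]
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = reg * w
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):             # pair (i, j): i ranked above j
                if w @ feats[j] - w @ feats[i] + margin > 0:
                    grad += feats[j] - feats[i]            # subgradient of the active pair
        w -= lr * grad
    Q = lambda s, a: float(w @ np.asarray(featurize(s, a), dtype=float))
    pi = lambda s, actions: max(actions, key=lambda a: Q(s, a))   # greedy pi_{Q*}
    return Q, pi
```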

  9. Anti-Imitation Policy Learning, 3/3
     Proposition. If
     i) $V^*$ is continuous on $S$;
     ii) $U^*$ is monotonous w.r.t. $V^*$ on $S$;
     iii) there is a margin $M$ between the best and the other actions:
          $\forall a' \neq a = \pi_{U^*}(s),\quad \mathbb{E}\big[U^*(s'_{s,a})\big] > \mathbb{E}\big[U^*(s'_{s,a'})\big] + M$;
     iv) $U^*$ is Lipschitz with constant $L$;
     v) the transition model is $\beta$-sub-Gaussian:
          $\forall t \in \mathbb{R}^+,\quad \mathbb{P}\big( \| \mathbb{E}[s'_{s,a}] - s'_{s,a} \| > t \big) < 2\, e^{-\beta t^2}$;
     then, if $2L < M\beta$, $\pi_{U^*}$ is an optimal policy.

  10. Experimental validation
      Goals of the experiments:
      ◮ How many CDs?
      ◮ How much expertise in generating the CDs? (starting state, controller)
      Experimental setting:
                      | Mountain   | Bicycle | Pendulum
      # CD            | 1          | 20      | 1
      CD length       | 1,000      | 5       | 1,000
      Starting state  | target st. | random  | target st.
      Controller      | neutral    | random  | neutral

  11. Experimental setting, 2
      Learning to rank: Ranking SVM with a Gaussian kernel (Joachims, 2006)
                          | Mountain | Bicycle | Pendulum
      U*: C_1             | 10^3     | 10^3    | 10^-5
      U*: 1/sigma^2       | 10^-3    | 10^-3.5 | 1
      Q*: # constraints   | 500      | 5,000   | −
      Q*: C_2             | 10^3     | 10^3    | −
      Q*: 1/sigma^2       | 10^-3    | 10^-3   | −
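
The slides use Joachims' Ranking SVM with a Gaussian kernel; as a lightweight stand-in (an assumption, not the authors' tool chain), the sketch below approximates the Gaussian kernel with scikit-learn's random Fourier features (`RBFSampler`, whose `gamma` plays the role of 1/sigma^2) and runs a stochastic pairwise hinge descent, with `C` acting as the usual SVM trade-off constant.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

def rank_with_rbf_features(pairs, gamma=1e-3, C=1e3, n_components=200,
                           lr=0.01, epochs=2000, margin=1.0, seed=0):
    """Approximate Gaussian-kernel ranking: `pairs` is a list of (better, worse)
    state arrays; states are mapped through random Fourier features and a linear
    scorer is trained with a stochastic pairwise hinge + L2 penalty (strength 1/C)."""
    rng = np.random.default_rng(seed)
    all_states = np.array([s for pair in pairs for s in pair], dtype=float)
    rbf = RBFSampler(gamma=gamma, n_components=n_components, random_state=seed).fit(all_states)
    better = rbf.transform(np.array([p[0] for p in pairs], dtype=float))
    worse = rbf.transform(np.array([p[1] for p in pairs], dtype=float))
    w = np.zeros(n_components)
    for _ in range(epochs):
        i = rng.integers(len(pairs))                      # pick one ranking constraint
        grad = w / C                                      # L2 regularization step
        if w @ worse[i] - w @ better[i] + margin > 0:     # constraint violated
            grad += worse[i] - better[i]                  # hinge subgradient
        w -= lr * grad
    return lambda s: float(w @ rbf.transform(np.asarray(s, dtype=float).reshape(1, -1))[0])
```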

  12. Mountain Car, 1/3
      AiPOL vs SARSA depending on the friction.
      [Plot: steps to the goal (after learning) vs. friction level; Mountain Car, 20 runs]

  13. Mountain Car, 2/3
      AiPOL pseudo-value vs SARSA value (1,000 iterations).
      [Surface plots of the value over the (position, speed) state space]

  14. Mountain Car, 3/3
      AiPOL policy vs SARSA policy.
      [Policy maps over the (position, speed) state space; actions: forward, backward, neutral]

  15. Bicycle
      Sensitivity w.r.t. the number and length of CDs.
      [Plot: % of success vs. number of CDs, for CD lengths 2, 5 and 10]

  16. Inverted Pendulum
      Sensitivity w.r.t. the Ranking-SVM hyper-parameters.
      Interpretation:
      ◮ If the kernel width is too small, there is no generalization (the pendulum does not reach the top).
      ◮ If it is too large, $U^*$ is imprecise (the pendulum reaches the top and falls on the other side).

  17. AiPOL: Discussion
      Pro:
      ◮ Compared to inverse RL, AiPOL relaxes the expertise requirements, with lower computational
        requirements (greedification as opposed to full RL).
      Limitations:
      ◮ Latency of the transitions (e.g. the bicycle): use tuples $(s_i, a_i, s'_i, a'_i, s''_i)$ and require
        $Q(s_i, a_i) > Q(s_j, a_j)$ if $U^*(s''_i) > U^*(s''_j)$ (a sketch follows below).
      ◮ The cost of learning $Q^*$ is quadratic in the number of triplets.
      Further work:
      ◮ Non-reversible MDPs need to be addressed.
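
For completeness, a hedged sketch of the latency work-around mentioned above: build the Q-ranking constraints from 5-tuples, comparing the states reached two steps later. The tuple layout is an assumption, and the quadratic number of generated pairs is exactly the cost limitation noted on the slide.

```python
def latency_constraints(tuples, U):
    """From 5-tuples (s, a, s', a', s'') build pairwise constraints
    ((s_i, a_i) ranked above (s_j, a_j)) whenever U*(s''_i) > U*(s''_j),
    i.e. judge an action by the state reached two steps later."""
    constraints = []
    for i, (s_i, a_i, _, _, s2_i) in enumerate(tuples):
        for j, (s_j, a_j, _, _, s2_j) in enumerate(tuples):
            if i != j and U(s2_i) > U(s2_j):
                constraints.append(((s_i, a_i), (s_j, a_j)))  # left ranked above right
    return constraints
```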
