

1. Maximum Entropy Inverse Reinforcement Learning (Ziebart, Maas, Bagnell, Dey). Presenter: Naireen Hussain

2. Overview
- What is Inverse Reinforcement Learning (IRL)?
- What are the difficulties with IRL?
- Researchers' Contributions
- Motivation of the Max Entropy Set-Up
- Problem Set-Up
- Algorithm
- Experimental Set-Up
- Discussion
- Critiques and Limitations
- Recap

3. What is inverse reinforcement learning?
Given access to trajectories generated from an expert, can a reward function be learned that induces the same behaviour as the expert?
- A form of imitation learning
How is this different from the previous forms of RL we've seen before?

  4. What is inverse reinforcement learning? http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_12_irl.pdf

5. Difficulties with IRL
- Ill-posed problem: there is no unique set of weights describing the optimal behaviour
  - Different policies may be optimal for different reward weights (even when the weights are all zeros!)
  - Which policy is preferable?
- Match feature expectations [Abbeel & Ng, 2004]
  - No clear way to handle multiple optimal policies
- Use maximum margin planning [Ratliff, Bagnell & Zinkevich, 2006]
  - Maximize the margin between the reward of the expert and the reward of the best agent policy plus some similarity measure
  - Suffers in the presence of a sub-optimal expert, as no reward function makes the agent optimal and significantly better than any observed behaviour
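For reference, the feature-matching condition of Abbeel & Ng (2004) can be written as follows (f_ζ and the empirical average f̃ are defined on the Problem Set-Up slide below); any path distribution satisfying this single constraint reproduces the expert's expected reward under the unknown linear reward, which is exactly the ambiguity MaxEnt later resolves:

```latex
\sum_{\zeta} P(\zeta)\, \mathbf{f}_{\zeta} = \tilde{\mathbf{f}}
```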

6. Researchers' Contributions
- Created the Maximum Entropy IRL (MaxEnt) framework
- Provided an algorithmic approach to handle uncertainties in actions
- Efficient dynamic programming algorithm
- Case study of predicting drivers' behaviour
  - Prior work in this application was inefficient [Liao et al., 2007]
  - Largest IRL experiment in terms of dataset size at the time (2008)

7. Why use Max Entropy?
- Principle of Maximum Entropy [Jaynes, 1957]: the best distribution consistent with the current information is the one with the largest entropy
- Prevents issues with label bias
  - Portions of the state space with many branches are each biased towards being less likely, while areas with fewer branches receive higher probabilities (locally greedy)
- The consequences of label bias are: 1) the most rewarding path may not be the most likely; 2) two different but equally rewarded paths may have different probabilities (see the illustrative example below)
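A hypothetical worked example of label bias (the graph is illustrative, not taken from the paper): suppose three paths A, B, C have equal reward, where A is the only path leaving one branch point and B, C split off from another. A locally normalized, action-based model divides probability at each branch, while MaxEnt normalizes over whole paths:

```latex
P_{\text{local}}(A) = \tfrac{1}{2}, \quad P_{\text{local}}(B) = P_{\text{local}}(C) = \tfrac{1}{4}
\qquad \text{vs.} \qquad
P_{\text{MaxEnt}}(A) = P_{\text{MaxEnt}}(B) = P_{\text{MaxEnt}}(C) = \tfrac{1}{3}
```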

8. Problem Set-Up
- The agent is optimizing a reward function that linearly maps the features f_s of each state in the path ζ to a state reward value
- The reward is parameterized by the weights θ (see the formulas below)
- Expected empirical feature counts are computed from m demonstrations
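The two quantities referenced on this slide, written out as in the MaxEnt IRL paper (θ are the reward weights, f_s the state features, and ζ̃_i the i-th demonstrated path):

```latex
\text{reward}(\mathbf{f}_{\zeta}) = \theta^{\top} \mathbf{f}_{\zeta} = \sum_{s_j \in \zeta} \theta^{\top} \mathbf{f}_{s_j},
\qquad
\tilde{\mathbf{f}} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{f}_{\tilde{\zeta}_i}
```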

9. Algorithm Set-Up
- Paths are distributed according to a Boltzmann distribution over path rewards (see below)
- This formulation assumes deterministic MDPs
- ζ: path (must be finite for Z(θ) to converge, or use discounted rewards for infinite paths)
- θ: reward weights
- Z(θ): partition function (normalization constant)
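The distribution the slide refers to, for the deterministic case, following the MaxEnt IRL paper: path probability grows exponentially with path reward, with Z(θ) summing over all paths.

```latex
P(\zeta_i \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big(\theta^{\top} \mathbf{f}_{\zeta_i}\big),
\qquad
Z(\theta) = \sum_{\zeta} \exp\!\big(\theta^{\top} \mathbf{f}_{\zeta}\big)
```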

10. Algorithm Set-Up
- Observations are introduced here to make the stochastic MDP deterministic given the previous state distributions
- Two further simplifications are made:
  - The partition function is constant for all outcome samples
  - Transition randomness does not affect behaviour
- o: outcome sample
- T: transition distribution
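With these simplifications, the paper approximates the path distribution for stochastic dynamics by weighting the deterministic form with the transition probabilities along the path; notation is adapted here, so treat this as a sketch of the form rather than the exact equation:

```latex
P(\zeta \mid \theta, T) \approx \frac{\exp\!\big(\theta^{\top} \mathbf{f}_{\zeta}\big)}{Z(\theta, T)}
\prod_{(s_{t+1},\, a_t,\, s_t) \in \zeta} P_T(s_{t+1} \mid a_t, s_t)
```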

11. Maximum Likelihood Estimation
- Use the likelihood of observing the expert data, maximized over θ, as the objective
  - Convex for deterministic MDPs
- The gradient can intuitively be understood as the difference between the expert's empirical feature counts and the agent's expected feature counts (see below)
- A sample-based approach was used to compute the expert's feature counts
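The gradient described above, as given in the paper: the empirical feature counts minus the model's expected feature counts, which can be rewritten in terms of the expected state visitation frequencies D_{s_i}:

```latex
\nabla_{\theta} L(\theta) = \tilde{\mathbf{f}} - \sum_{\zeta} P(\zeta \mid \theta)\, \mathbf{f}_{\zeta}
= \tilde{\mathbf{f}} - \sum_{s_i} D_{s_i}\, \mathbf{f}_{s_i}
```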

12. 1) Start from a terminal state
2) Compute the partition function at each state and action to obtain local action probabilities
3) Compute state frequencies at each time step
4) Sum the agent's state frequencies over all time steps
5) This is similar to value iteration! (A code sketch of these steps follows.)
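A minimal Python/NumPy sketch of these steps, assuming a finite state space, a known transition array, and a fixed horizon; the function and variable names (expected_state_frequencies, transitions, start_dist, horizon) are illustrative, not from the paper:

```python
import numpy as np

def expected_state_frequencies(reward, transitions, terminal, start_dist, horizon):
    """Sketch of the MaxEnt IRL backward/forward passes (cf. Ziebart et al., 2008).

    reward      : (S,) array of state rewards theta^T f_s
    transitions : (S, A, S) array, transitions[s, a, s2] = P(s2 | s, a)
    terminal    : (S,) boolean mask of terminal (goal) states
    start_dist  : (S,) initial state distribution
    horizon     : number of backward recursions / forward time steps (>= 1)
    """
    S, A, _ = transitions.shape
    terminal = terminal.astype(float)

    # Steps 1-2: backward pass from the terminal states, partition values Z_s, Z_{s,a}.
    z_s = terminal.copy()
    for _ in range(horizon):
        # Z_{s,a} = sum_{s2} P(s2 | s, a) * exp(reward(s)) * Z_{s2}
        z_a = np.einsum('sak,k->sa', transitions, z_s) * np.exp(reward)[:, None]
        # Z_s = sum_a Z_{s,a} (+1 at terminal states)
        z_s = z_a.sum(axis=1) + terminal

    # Local action probabilities P(a | s) = Z_{s,a} / Z_s.
    policy = z_a / np.maximum(z_s, 1e-300)[:, None]

    # Step 3: forward pass, state visitation frequencies D_{s,t} at each time step.
    d = np.zeros((horizon, S))
    d[0] = start_dist
    for t in range(horizon - 1):
        # D_{s2,t+1} = sum_{s,a} D_{s,t} * P(a | s) * P(s2 | s, a)
        d[t + 1] = np.einsum('s,sa,sak->k', d[t], policy, transitions)

    # Step 4: summing over time gives the expected state visitation frequencies D_s,
    # which plug into the gradient  f_tilde - sum_s D_s f_s  from the previous slide.
    return d.sum(axis=0)
```

The backward recursion plays the role that the Bellman backup plays in value iteration, which is why step 5 calls the procedure similar.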

13. Experimental Set-Up
- The researchers investigated whether a reward function for predicting driving behaviour could be recovered
- Modelled the road network as an MDP
- Due to different start and end positions, each trip's MDP is slightly different
- Because of the differing MDPs, the reward weights are treated as independent of the goal, so a single set of weights θ can be learned from many different MDPs

14. Dataset Details
- Collected driving data covering 100,000 miles and 3,000 driving hours in Pittsburgh
- Fitted the GPS data to the road network to generate ~13,000 road trips
- Discarded noisy trips and trips that were too short (fewer than 10 road segments)
  - This was done to speed up computation time

15. Path Features
Four different road aspects were considered:
- Road type: interstate to local road
- Speed: high speed to low speed
- Lanes: multi-lane or single lane
- Transitions: straight, left, right, hard left, hard right
A total of 22 features were used to represent each state.

16. Results

Model          % Matching   % >90% Match   Log Prob   Reference
Time-Based     72.38        43.12          N/A        n/a
Max Margin     75.29        46.56          N/A        [Ratliff, Bagnell & Zinkevich, 2006]
Action         77.30        50.37          -7.91      [Ramachandran & Amir, 2007]
Action (Cost)  77.74        50.75          N/A        [Ramachandran & Amir, 2007]
MaxEnt         78.79        52.98          -6.85      [Ziebart et al., 2008]

- Time-Based: based on expected travel time; weights the cost of a unit distance of road to be inversely proportional to the road's speed
- Max Margin: maximizes the margin between the expert's reward and the reward of the best agent policy plus a similarity measure
- Action: locally probabilistic Bayesian IRL model
- Action (Cost): lowest-cost path under the weights predicted by the Action model

17. Discussion
(Figure: example graph comparing path probabilities under the Max Entropy and Action Based models.)
- MaxEnt removes the label bias present in locally greedy, action-based distributional models
- MaxEnt gives all paths equal probability when they have equal reward
- Action-based models (weighted on future expected rewards) look only locally to determine possible paths
- P(A->B) != P(B->A)

  18. Discussion The model learns to penalize slow roads and trajectories with many short paths

19. Discussion
It is possible to infer driving behaviour from partially observed paths with Bayes' theorem:
P(B | A) = P(A | B) P(B) / P(A)
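A sketch of how this applies to destination inference (the paper's exact expression additionally involves ratios of partition functions; this is just the generic Bayes form): given an observed partial path ζ_{A→B}, the posterior over destinations is

```latex
P(\text{dest} \mid \zeta_{A \to B}) =
\frac{P(\zeta_{A \to B} \mid \text{dest})\, P(\text{dest})}
{\sum_{d} P(\zeta_{A \to B} \mid d)\, P(d)}
```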

20. Discussion
- Possible to infer driving behaviour from partially observed paths
- Destination 2 is far less likely than Destination 1 because Destination 1 is far more common in the dataset

21. Critique / Limitations / Open Issues
- Tests for inferring goal locations were done with only 5 destination locations
  - It is easier to correctly predict the goals if they are relatively spread out rather than clustered close together
- Relatively small feature space
- Assumes the state transition dynamics are known
- Assumes a linear reward function
- Requires hand-crafted state features
- Extended to a Deep Maximum Entropy Inverse Reinforcement Learning model [Wulfmeier et al., 2016]

22. Contributions (Recap)
Problem: How to handle uncertainty in demonstrations due to sub-optimal experts, and how to handle the ambiguity of multiple reward functions explaining the same behaviour.
Limitations of Prior Work: Max-margin prediction cannot be used for inference (predicting the probability of a path) and cannot handle sub-optimal experts. Previous action-based probabilistic models that could handle inference suffered from label bias.
Key Insights and Contributions: MaxEnt takes a probabilistic approach that maximizes the entropy of the distribution over paths, giving a principled way to handle noise while preventing label bias. It also provides an efficient algorithm for computing the expected feature counts, leading to state-of-the-art performance at the time.
