Maximum Entropy Inverse RL, Adversarial Imitation Learning


  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Maximum Entropy Inverse RL, Adversarial Imitation Learning. Katerina Fragkiadaki.

  2. Reinforcement Learning. Given a dynamics model T (a probability distribution over next states given the current state and action) and a reward function R (describing the desirability of being in a state), reinforcement learning / optimal control produces a controller/policy π that prescribes the action to take in each state. Diagram: Pieter Abbeel.

  3. Inverse Reinforcement Learning. IRL reverses the diagram: given the dynamics model T and a finite set of demonstration trajectories, recover the reward function R and the policy π*. Diagram: Pieter Abbeel.

  4. Inverse Reinforcement Learning (continued). In contrast to the DAGGER setup, we cannot interactively query the expert for additional labels.

  5. Inverse Reinforcement Learning (continued). Mathematically, imitation boils down to a distribution-matching problem: the learner needs to come up with a reward/policy whose resulting state-action trajectory distribution matches the expert trajectory distribution.

  6. A simple example
  • Roads have unknown costs, linear in features.
  • Paths (trajectories) have unknown costs: the sum of the costs of the roads (states) along them.
  • Experts (taxi drivers) demonstrate Pittsburgh traveling behavior.
  • How can we learn to navigate Pittsburgh like a taxi (or Uber) driver?
  • Assumption: the cost is independent of the goal state, so it depends only on road features, e.g., traffic, width, tolls, etc.
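To make this cost model concrete, here is a minimal sketch (the road names, feature values, and weight vector theta are all invented for illustration; in IRL, theta is exactly what we do not know): each road segment gets a feature vector, its cost is linear in those features, and a path's cost is the sum of its segments' costs.

```python
import numpy as np

# Hypothetical road features: [is_bridge, miles_of_interstate, num_stoplights]
road_features = {
    "fort_pitt_bridge": np.array([1.0, 0.0, 0.0]),
    "i376_segment":     np.array([0.0, 2.5, 0.0]),
    "forbes_ave":       np.array([0.0, 0.0, 3.0]),
}

theta = np.array([5.0, 0.4, 1.2])   # cost weights; unknown in IRL, assumed here

def road_cost(road):
    """Cost of a single road segment: linear in its features."""
    return theta @ road_features[road]

def path_cost(path):
    """Cost of a path (trajectory): sum of the costs of its road segments."""
    return sum(road_cost(r) for r in path)

print(path_cost(["fort_pitt_bridge", "i376_segment", "forbes_ave"]))
```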

  7. State features. Features f can be:
  • # bridges crossed
  • # miles of interstate
  • # stoplights

  8. A good guess: match expected features. Feature matching: $\sum_{\text{Path } \tau_i} P(\tau_i) f_{\tau_i} = \tilde{f}$. "If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs."

  9. A good guess: match expected features (continued). Feature matching: $\sum_{\text{Path } \tau_i} P(\tau_i) f_{\tau_i} = \tilde{f}$, where $\tilde{f}$ denotes the demonstrated feature counts.

  10. A good guess: match expected features (continued). A policy induces a distribution over trajectories: $p(\tau) = p(s_1) \prod_t p(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$. Feature matching: $\sum_{\text{Path } \tau_i} P(\tau_i) f_{\tau_i} = \tilde{f}$ (the demonstrated feature counts).
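A small sketch of the two objects on slides 8-10, for a toy tabular MDP whose transition model, policy, features, and demonstrations are all made up: the trajectory probability induced by a policy, and the demonstrated feature counts f̃ that the model must match in expectation.

```python
import numpy as np

n_states, n_actions, n_features = 3, 2, 2
rng = np.random.default_rng(0)

p0 = np.full(n_states, 1.0 / n_states)                             # p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # p(a|s)
f = rng.random((n_states, n_features))                             # per-state features f_s

def traj_prob(traj):
    """p(tau) = p(s_1) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t, a_t)."""
    states, actions = traj
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

def feature_counts(states):
    """f_tau = sum of per-state features along the trajectory."""
    return f[states].sum(axis=0)

# Demonstrated feature counts: average over a set of expert trajectories.
demos = [[0, 2, 1], [0, 1, 1]]                      # expert state sequences (toy)
f_tilde = np.mean([feature_counts(s) for s in demos], axis=0)
print(traj_prob(([0, 2, 1], [1, 0])), f_tilde)
```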

  11. Ambiguity. However, many distributions over paths can match the feature counts, and some will be very different from the observed behavior. The model could produce a policy that avoids the interstate and bridges for all routes except one, which drives in circles on the interstate for 136 miles and crosses 12 bridges.

  12. Principle of Maximum Entropy. The probability distribution that best represents the current state of knowledge is the one with the largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information). Another way of stating this: take precisely stated prior data or testable information about a probability distribution. Consider the set of all trial probability distributions that would encode that prior data. The distribution with maximal information entropy is the best choice.
  • Maximizing entropy minimizes the amount of prior information built into the distribution.
  • Many physical systems tend to move towards maximal-entropy configurations over time.
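As a concrete illustration of the principle (the classic dice example, not from the slides): among all distributions over the faces {1,...,6} with a prescribed mean, the maximum-entropy one has the exponential-family form p(x) ∝ exp(λx), and λ can be found by a simple bisection so that the mean constraint holds.

```python
import numpy as np

x = np.arange(1, 7)          # die faces
target_mean = 4.5            # the "testable information" / prior data

def tilted(lam):
    """Exponential-family distribution p(x) proportional to exp(lam * x)."""
    w = np.exp(lam * x)
    return w / w.sum()

# Bisection on lam: the mean of tilted(lam) increases monotonically with lam.
lo, hi = -10.0, 10.0
for _ in range(100):
    lam = 0.5 * (lo + hi)
    if tilted(lam) @ x < target_mean:
        lo = lam
    else:
        hi = lam

p = tilted(lam)
print("max-entropy p:", np.round(p, 4))
print("mean:", p @ x, "entropy:", -(p * np.log(p)).sum())
```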

  13. Resolve Ambiguity by Maximum Entropy. Let's pick the policy that satisfies the feature-count constraints without over-committing: maximize the entropy of the path distribution, $\max_P -\sum_\tau P(\tau) \log P(\tau)$, subject to the feature-matching constraint $\sum_{\text{Path } \tau_i} P(\tau_i) f_{\tau_i} = \tilde{f}$ (the demonstrated feature counts), where the policy induces the trajectory distribution $p(\tau) = p(s_1) \prod_t p(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$.

  14. Maximum Entropy Inverse Optimal Control. Maximize the entropy over paths (be as uniform as possible):
  $\max_P \; -\sum_\tau P(\tau) \log P(\tau)$
  while matching feature counts (and being a probability distribution):
  $\sum_\tau P(\tau) f_\tau = \tilde{f}_{\text{dem}}, \qquad \sum_\tau P(\tau) = 1$

  15. From features to costs. Cost of a trajectory (linear): $c_\theta(\tau) = \theta^\top f_\tau = \sum_{s \in \tau} \theta^\top f_s$. Constraint: match the cost of the expert trajectories in expectation: $\int p(\tau)\, c_\theta(\tau)\, d\tau = \frac{1}{|D|} \sum_{\tau^* \in D} c_\theta(\tau^*) = \tilde{c}$. Maximum entropy problem:
  $\min_p \; -H(p(\tau)) \quad \text{s.t.} \quad \int p(\tau)\, c_\theta(\tau)\, d\tau = \tilde{c}, \quad \int p(\tau)\, d\tau = 1$

  16. From maximum entropy to exponential family.
  $\min_p \; -H(p(\tau)) \quad \text{s.t.} \quad \int p(\tau)\, c_\theta(\tau)\, d\tau = \tilde{c}, \quad \int p(\tau)\, d\tau = 1$
  $\Longleftrightarrow \quad \mathcal{L}(p, \lambda) = \int p(\tau) \log p(\tau)\, d\tau + \lambda_1 \Big( \int p(\tau)\, c_\theta(\tau)\, d\tau - \tilde{c} \Big) + \lambda_0 \Big( \int p(\tau)\, d\tau - 1 \Big)$
  $\frac{\partial \mathcal{L}}{\partial p} = \log p(\tau) + 1 + \lambda_1 c_\theta(\tau) + \lambda_0$
  $\frac{\partial \mathcal{L}}{\partial p} = 0 \;\Longleftrightarrow\; \log p(\tau) = -1 - \lambda_1 c_\theta(\tau) - \lambda_0 \;\Longrightarrow\; p(\tau) = e^{-1 - \lambda_0 - \lambda_1 c_\theta(\tau)} \propto e^{-\lambda_1 c_\theta(\tau)}$
  Absorbing the multiplier $\lambda_1$ into the learned weights $\theta$, $p(\tau) \propto e^{-c_\theta(\tau)}$.

  17. From maximum entropy to exponential family.
  • Maximizing the entropy of the distribution over paths subject to the feature constraints from observed data implies that we maximize the likelihood of the observed data under the maximum-entropy (exponential-family) distribution (Jaynes, 1957):
  $P(\tau_i \mid \theta) = \frac{1}{Z(\theta)} e^{-\theta^\top f_{\tau_i}} = \frac{1}{Z(\theta)} e^{-\sum_{s_j \in \tau_i} \theta^\top f_{s_j}}, \qquad Z(\theta) = \sum_\tau e^{-\theta^\top f_\tau}$
  • Strong preference for low-cost paths.
  • Equal-cost paths are equally probable.
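A tiny numeric sketch of the last two bullets (the three paths and their feature counts are invented): under p(τ) ∝ exp(−θᵀ f_τ), the lower-cost path receives more probability, and two paths with equal cost receive equal probability.

```python
import numpy as np

theta = np.array([2.0, 0.1, 0.5])     # cost weights (assumed for illustration)
# Feature counts f_tau = [bridges, interstate miles, stoplights] for three toy paths.
paths = {
    "A": np.array([0.0, 5.0, 1.0]),   # cost 1.0  (cheapest)
    "B": np.array([0.0, 10.0, 2.0]),  # cost 2.0
    "C": np.array([1.0, 0.0, 0.0]),   # cost 2.0  (same cost as B)
}

costs = {k: theta @ f for k, f in paths.items()}
Z = sum(np.exp(-c) for c in costs.values())
probs = {k: np.exp(-c) / Z for k, c in costs.items()}
print(costs)   # A: 1.0, B: 2.0, C: 2.0
print(probs)   # A gets the most mass; B and C get equal mass
```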

  18. Maximum Likelihood.
  $\max_\theta \; \log \prod_{\tau^* \in D} p(\tau^*) \;\Longleftrightarrow\; \max_\theta \; \sum_{\tau^* \in D} \log p(\tau^*)$
  $\max_\theta \; \sum_{\tau^* \in D} \log \frac{e^{-c_\theta(\tau^*)}}{Z}$
  $\max_\theta \; \sum_{\tau^* \in D} -c_\theta(\tau^*) \;-\; \sum_{\tau^* \in D} \log \sum_\tau e^{-c_\theta(\tau)}$
  $\max_\theta \; \sum_{\tau^* \in D} -c_\theta(\tau^*) \;-\; |D| \log \sum_\tau e^{-c_\theta(\tau)}$
  $\min_\theta \; \sum_{\tau^* \in D} c_\theta(\tau^*) \;+\; |D| \log \sum_\tau e^{-c_\theta(\tau)} \;\rightarrow\; J(\theta)$
  $\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \frac{d c_\theta(\tau^*)}{d\theta} + |D| \frac{1}{\sum_\tau e^{-c_\theta(\tau)}} \sum_\tau e^{-c_\theta(\tau)} \Big( -\frac{d c_\theta(\tau)}{d\theta} \Big) = \sum_{\tau^* \in D} \frac{d c_\theta(\tau^*)}{d\theta} - |D| \sum_\tau p(\tau \mid \theta) \frac{d c_\theta(\tau)}{d\theta}$
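For linear costs c_θ(τ) = θᵀ f_τ we have dc_θ(τ)/dθ = f_τ, so the gradient above is "demonstrated feature counts minus |D| times model-expected feature counts". The sketch below evaluates it on a toy problem small enough to enumerate all paths (the path set and demonstrations are made up) and checks it against a finite-difference gradient of J.

```python
import numpy as np

# All paths of a problem small enough to enumerate, as feature-count vectors f_tau.
all_paths = np.array([
    [0.0, 5.0, 1.0],
    [0.0, 10.0, 2.0],
    [1.0, 0.0, 0.0],
    [2.0, 3.0, 4.0],
])
demos = np.array([[0.0, 5.0, 1.0],     # demonstrated paths (toy, with a repeat)
                  [0.0, 5.0, 1.0],
                  [1.0, 0.0, 0.0]])
theta = np.array([1.0, 0.2, 0.3])

def J(theta):
    """J(theta) = sum_demo c_theta(tau*) + |D| log sum_tau exp(-c_theta(tau))."""
    costs = all_paths @ theta
    return (demos @ theta).sum() + len(demos) * np.log(np.exp(-costs).sum())

def grad_J(theta):
    costs = all_paths @ theta                   # c_theta(tau) = theta^T f_tau
    p = np.exp(-costs)
    p /= p.sum()                                # p(tau|theta) = exp(-c) / Z
    # dJ/dtheta = sum_demo f_tau*  -  |D| * sum_tau p(tau|theta) f_tau
    return demos.sum(axis=0) - len(demos) * (p[:, None] * all_paths).sum(axis=0)

eps = 1e-6
numeric = [(J(theta + eps * np.eye(3)[i]) - J(theta - eps * np.eye(3)[i])) / (2 * eps)
           for i in range(3)]
print(grad_J(theta))
print(np.round(numeric, 6))   # matches the analytic gradient
```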

  19. From trajectories to states.
  $p(\tau) = p(s_1) \prod_t p(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t), \qquad p(\tau) \propto e^{-c_\theta(\tau)}$
  $c_\theta(\tau) = \sum_{s \in \tau} c_\theta(s) \;\Longrightarrow\; p(\tau) \propto e^{-\sum_{s \in \tau} c_\theta(s)}$
  $\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s \in \tau^*} \frac{d c_\theta(s)}{d\theta} - |D| \sum_s p(s \mid \theta) \frac{d c_\theta(s)}{d\theta}$
  Successful imitation boils down to learning a policy that matches the state visitation distribution (or the state-action visitation distribution, via $\sum_{s,a} p(s, a \mid \theta)\, \frac{d c_\theta(s, a)}{d\theta}$).

  20. State densities. In the tabular case and for known dynamics, we can compute them with dynamic programming, assuming we have obtained the policy. Time-indexed state densities:
  $\mu_1(s) = p(s_1 = s)$
  for $t = 1, \dots, T$: $\;\mu_{t+1}(s) = \sum_a \sum_{s'} \mu_t(s')\, p(a \mid s')\, p(s \mid s', a)$
  $p(s \mid \theta, T) = \sum_t \mu_t(s)$
  $\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s_t \in \tau^*} \frac{d c_\theta(s)}{d\theta} - |D| \sum_s p(s \mid \theta, T) \frac{d c_\theta(s)}{d\theta}$
  For linear costs $c_\theta(s) = \theta^\top f_s$: $\;\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s \in \tau^*} f_s - |D| \sum_s p(s \mid \theta, T) f_s$
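A sketch of the forward dynamic program on this slide for a toy tabular MDP (dynamics, policy, features, and the single demonstration are all assumed): compute the time-indexed densities μ_t, sum them to get p(s | θ, T), and form the linear-cost gradient as demonstrated minus expected feature counts.

```python
import numpy as np

n_states, n_actions, T = 4, 2, 10
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # p(a|s)
mu1 = np.full(n_states, 1.0 / n_states)                            # p(s_1)

def state_visitations(pi):
    """mu_{t+1}(s) = sum_{s',a} mu_t(s') p(a|s') p(s|s',a); returns p(s|theta,T) = sum_t mu_t(s)."""
    mu = np.zeros((T, n_states))
    mu[0] = mu1
    for t in range(T - 1):
        # flow of probability mass from s' through action a into s
        mu[t + 1] = np.einsum("i,ia,iaj->j", mu[t], pi, P)
    return mu.sum(axis=0)

f = rng.random((n_states, 3))        # per-state features f_s
demo_states = [0, 1, 2, 2, 3]        # one demonstrated trajectory (toy), so |D| = 1
visit = state_visitations(pi)

# Linear costs: grad J = sum_{s in demos} f_s  -  |D| * sum_s p(s|theta,T) f_s
grad = f[demo_states].sum(axis=0) - 1 * (visit[:, None] * f).sum(axis=0)
print(grad)
```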

  21. Maximum Entropy Inverse RL: known dynamics, linear costs. [Algorithm box from the original slide.]
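The algorithm box itself did not survive the export, so here is a sketch of the loop it describes, in the style of Ziebart et al.'s MaxEnt IRL for known dynamics and linear costs (the toy MDP, demonstrations, horizon, learning rate, and iteration count are all assumptions): a backward soft-value-iteration pass yields the MaxEnt policy under the current cost weights, a forward pass yields expected feature counts, and θ is updated with the gradient from slide 20.

```python
import numpy as np
from scipy.special import logsumexp

# --- Toy tabular MDP with known dynamics and linear per-state costs (all assumed) ---
n_states, n_actions, n_features, T = 5, 2, 3, 5
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
p0 = np.full(n_states, 1.0 / n_states)                             # p(s_1)
f = rng.random((n_states, n_features))                             # per-state features f_s
demos = [[0, 1, 3, 3, 4], [0, 2, 3, 4, 4]]                          # expert state sequences

f_demo = sum(f[d].sum(axis=0) for d in demos)                       # demonstrated feature counts

def soft_policy(theta):
    """Backward pass: soft value iteration under cost c_theta(s) = theta^T f_s."""
    r = -(f @ theta)                              # reward = negative cost
    V = np.zeros(n_states)
    policies = []
    for _ in range(T):
        Q = r[:, None] + P @ V                    # Q(s,a) = r(s) + E_{s'}[V(s')]
        V = logsumexp(Q, axis=1)                  # soft max over actions
        policies.append(np.exp(Q - V[:, None]))   # pi_t(a|s) = exp(Q - V)
    return policies[::-1]                         # time-indexed policies, t = 1..T

def expected_feature_counts(policies):
    """Forward pass: expected feature counts under the MaxEnt policy."""
    mu = p0.copy()
    total = mu @ f
    for t in range(1, T):
        mu = np.einsum("i,ia,iaj->j", mu, policies[t - 1], P)
        total += mu @ f
    return total

# Gradient descent on J(theta) = sum_demo c_theta + |D| log Z (slide 18):
theta, lr = np.zeros(n_features), 0.05
for _ in range(200):
    E_f = expected_feature_counts(soft_policy(theta))
    grad = f_demo - len(demos) * E_f              # dJ/dtheta for linear costs
    theta -= lr * grad
print("learned cost weights:", np.round(theta, 3))
```

When the expected feature counts match the demonstrated ones, the gradient vanishes and θ stops changing, which is exactly the feature-matching condition from slides 8-14.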

  22. Maximum Entropy Inverse RL: demonstrated behavior vs. model behavior (expectation).
  Bridges crossed:      demonstrated 3,    expected ?
  Miles of interstate:  demonstrated 20.7, expected ?
  Stoplights:           demonstrated 10,   expected ?
  Cost weights: 5.0, 3.0

  23. Maximum Entropy Inverse RL: demonstrated behavior vs. model behavior (expectation).
  Bridges crossed:      demonstrated 3,    expected 4.7  (+1.7)
  Miles of interstate:  demonstrated 20.7, expected 16.2 (−4.5)
  Stoplights:           demonstrated 10,   expected 7.4  (−2.6)
  Cost weights: 5.0, 3.0

  24. Maximum Entropy Inverse RL: demonstrated behavior vs. model behavior (expectation).
  Bridges crossed:      demonstrated 3,    expected 4.7
  Miles of interstate:  demonstrated 20.7, expected 16.2
  Stoplights:           demonstrated 10,   expected 7.4
  Cost weights: 7.2, 5.0, 1.1
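To connect slides 22-24 to the gradient from slide 18: only the feature counts below come from the slides; the per-feature weight assignment and the step size are assumptions. The update direction is what matters: a feature the model uses more than demonstrated (bridges) gets a larger cost weight, and features it uses less than demonstrated (interstate miles, stoplights) get smaller ones.

```python
import numpy as np

# Feature counts from the slides: [bridges, interstate miles, stoplights]
f_demo  = np.array([3.0, 20.7, 10.0])    # demonstrated behavior
f_model = np.array([4.7, 16.2, 7.4])     # model behavior (expectation)

theta = np.array([5.0, 1.0, 3.0])        # current cost weights (assumed per-feature assignment)
lr = 0.2                                 # assumed step size

# Per-demonstration gradient of J: demonstrated minus expected feature counts.
grad = f_demo - f_model                  # [-1.7, +4.5, +2.6]
theta_new = theta - lr * grad
print(theta_new)   # bridge cost goes up; interstate and stoplight costs go down
```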
