Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability




1. Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
By Keith Bush and Joelle Pineau, NIPS 2009
Presented by Chenghui Cai, Duke University, ECE
February 19, 2010

2. Outline
◮ Introduction to RL
◮ Theory and Philosophy Behind RL
◮ Markov Models and RL
◮ Moving from Theory to Real Applications
◮ Background of This Paper
◮ Methods
◮ Experiments
◮ Conclusions and Discussion

3. RL
◮ Simple philosophy: the agent receives rewards for good behaviors and punishments for bad behaviors.
◮ General training data format: state (situation), action (decision), reward (label).
◮ Purpose of RL: learn a behavior policy.
◮ Fundamental: the Bellman optimality equation (sketched in code after this slide),
  V*(s_t) = max_{a_t} [ E( r(s_{t+1}) | s_t, a_t ) + γ E( V*(s_{t+1}) | s_t, a_t ) ]
◮ The most general learning framework, ordered by how explicit the labels are:
  Most (explicit labels):   Supervised Learning
  Intermediate (rewards):   Reinforcement Learning
  Least (no labels):        Unsupervised Learning
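To make the Bellman optimality backup concrete, here is a minimal value-iteration sketch for a hypothetical tabular MDP; the arrays `P` and `R`, their shapes, and the parameter names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Bellman optimality backup on a hypothetical tabular MDP.

    P[a, s, s'] : probability of reaching s' from s under action a
    R[s']       : reward received on entering state s'
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = E[r(s_{t+1}) | s, a] + gamma * E[V*(s_{t+1}) | s, a]
        Q = P @ (R + gamma * V)          # shape: (n_actions, n_states)
        V_new = Q.max(axis=0)            # V*(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Given the converged V, the greedy policy is π(s) = argmax_a Q(s, a).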

4. Markov Models and RL
◮ Markov chain, HMM, MDP and POMDP form a 2x2 taxonomy (adopted from pomdp.org):

                                      Do we have control over the state transitions?
                                      NO                          YES
  Are the states completely   YES    Markov Chain                MDP (Markov Decision Process)
  observable?                 NO     HMM (Hidden Markov Model)   POMDP (Partially Observable MDP)

◮ Two ways to learn the behavior policy:
  ◮ Model-based: learn the dynamics, then solve the Markov model;
  ◮ Model-free: learn the policy directly, e.g., RPR, Q-learning.
◮ Applications: robotics, decision making under uncertainty.

5. Background of This Paper
◮ Barriers to application: (1) the goal (reward) is not well-defined; (2) exploration is expensive; (3) the data do not preserve the Markov property.
◮ Solution 1: For many domains, particularly those governed by differential equations, leverage the induced locality (nearest neighbors, e.g., s(t+1) and s(t)) during function approximation to satisfy the Markov property.
◮ Solution 2: Reconstruct the state-spaces of partially observable systems: convert a high-order Markov property into a first-order Markov property, while preserving locality.
◮ Example: use manifold embeddings to reconstruct locally Euclidean state-spaces of forced, partially observable systems; the embedding can be found non-parametrically.

6. Summary of the Method
An offline RL procedure:
◮ Part 1, modeling phase: identify the appropriate embedding and define the local model.
◮ Part 2, learning phase: leverage the resulting locality and perform RL.

7. Modeling: Manifold Embeddings for RL 1/2
Purpose: use nonlinear dynamical systems theory to reconstruct complete state observability from incomplete observations via delay embeddings.
◮ Assume a real-valued vector space R^M; action a; state dynamics function f; and a deterministic policy a(t) = π(s(t)), where s(t) is the state:
  s(t+1) = f(s(t), a(t)) = f(s(t), π(s(t))) = φ(s(t))   (1)
◮ The system is observed via a function y, s.t.
  s̃(t) = y(s(t))   (2)
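A toy illustration of Eqs. (1)-(2): a hypothetical forced, damped pendulum whose full state is (angle, velocity) but which is observed only through its angle, so s̃(t) = y(s(t)) hides the velocity. The dynamics, damping constant, and step size are assumptions made purely for illustration.

```python
import numpy as np

def f(s, a, dt=0.05):
    """One Euler step of a hypothetical forced, damped pendulum: s = (theta, omega)."""
    theta, omega = s
    omega = omega + dt * (-np.sin(theta) - 0.1 * omega + a)   # forced dynamics, Eq. (1)
    theta = theta + dt * omega
    return np.array([theta, omega])

def y(s):
    """Partial observation, Eq. (2): only the angle is visible; the velocity is hidden."""
    return s[0]
```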

8. Modeling: Manifold Embeddings for RL 2/2
◮ Construct a vector s_E(t) such that s_E lies on a subset of R^E which is an embedding of s (see the sketch after this slide):
  s_E(t) = [ s̃(t), s̃(t−1), ..., s̃(t−(E−1)) ],   E > 2M   (3)
◮ Because embeddings preserve the connectivity of the original vector space R^M, in the context of RL the mapping ψ,
  s_E(t+1) = ψ(s_E(t)),   (4)
may be substituted for f, and the vectors s_E(t) may be substituted for the corresponding vectors s(t).
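A minimal sketch of the delay-coordinate construction in Eq. (3), assuming a 1-D array `obs` of scalar observations s̃(t); the `tau` argument generalizes the unit lag of Eq. (3) to the lagged form used later in Eq. (5).

```python
import numpy as np

def delay_embed(obs, E, tau=1):
    """Stack delay vectors s_E(t) = [s~(t), s~(t-tau), ..., s~(t-(E-1)tau)] as rows."""
    span = (E - 1) * tau
    rows = [obs[t - span : t + 1 : tau][::-1]   # most recent observation first
            for t in range(span, len(obs))]
    return np.asarray(rows)
```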

9. Modeling: Nonparametric Identification of Manifold Embeddings 1/2
Remaining problem: how to compute E, the embedding dimension.
Solution: a Singular Value Decomposition (SVD) algorithm.
◮ Given a sequence of state observations s̃ of length S̃, choose a sufficiently large fixed embedding dimension Ê.
◮ For each embedding window size T̂_min ∈ {Ê, ..., S̃}:
  1. Define a matrix S_Ê of row vectors s_Ê(t), t ∈ {T̂_min, ..., S̃}, by the rule:
     s_Ê(t) = [ s̃(t), s̃(t−τ), ..., s̃(t−(Ê−1)τ) ]   (5)
  2. Compute the SVD of the matrix S_Ê:  S_Ê = U Σ W*.
  3. Record the vector of singular values σ(T̂_min) from Σ.
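A sketch of the scan over candidate window sizes described above: for each T̂_min, build the lagged matrix of Eq. (5) and record its singular values. The choice τ ≈ T̂_min/(Ê−1), tying the lag to the window size, is my assumption about how the window is spanned; `delay_embed` is the sketch from the previous slide.

```python
import numpy as np

def singular_value_scan(obs, E_hat, T_candidates):
    """Return {T_hat: singular values of the lagged matrix S_Ehat}."""
    spectra = {}
    for T_hat in T_candidates:
        tau = max(1, T_hat // (E_hat - 1))                    # lag spanning the window (assumed)
        S = delay_embed(obs, E_hat, tau)                      # row vectors s_Ehat(t), Eq. (5)
        spectra[T_hat] = np.linalg.svd(S, compute_uv=False)   # sigma(T_hat)
    return spectra
```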

10. Modeling: Nonparametric Identification of Manifold Embeddings 2/2
◮ Estimate the embedding parameters T_min and E of s by analysis of the second singular value σ_2(T̂_min):
  1. The approximate window size T_min of s is the T̂_min value at the first local maximum of the sequence of all σ_2(T̂_min), for T̂_min ∈ {Ê, ..., S̃}.
  2. The approximate embedding dimension E is the number of non-trivial singular values of σ(T_min).
◮ Embedding s via the parameters T_min and E yields the matrix S_E of row vectors s_E(t), t ∈ {T_min, ..., S̃}.
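A sketch of the two estimates above, operating on the `spectra` dictionary returned by the previous sketch: T_min is taken at the first local maximum of the second singular value, and E counts the non-trivial singular values at T_min. The `trivial_ratio` threshold for "non-trivial" is an assumed heuristic, not from the paper.

```python
import numpy as np

def estimate_embedding_params(spectra, trivial_ratio=1e-2):
    windows = sorted(spectra)
    sigma2 = [spectra[T][1] for T in windows]      # second singular value per window
    # First local maximum of sigma2; fall back to the global maximum if none exists.
    T_min = next((windows[i] for i in range(1, len(windows) - 1)
                  if sigma2[i - 1] < sigma2[i] > sigma2[i + 1]),
                 windows[int(np.argmax(sigma2))])
    sv = spectra[T_min]
    E = int(np.sum(sv > trivial_ratio * sv[0]))    # count non-trivial singular values
    return T_min, E
```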

11. Modeling: Generative Local Models from Embeddings 1/2
Purpose: generate a local model able to simulate trajectories of the underlying system and prepare the observed "state" for RL.
◮ Consider a dataset D of temporally aligned sequences of s̃(t), a(t), and reward r(t), t ∈ {1, ..., S̃}.
◮ Applying the spectral embedding method above to D yields a sequence of vectors s_E(t), t ∈ {T_min, ..., S̃}.
◮ A local model M of D is the set of 3-tuples m(t) = { s_E(t), a(t), r(t) }, t ∈ {T_min, ..., S̃}.
◮ Define operations on these tuples: A(m(t)) = a(t), S(m(t)) = s_E(t), and Z(m(t)) = s_z(t), where s_z(t) = [ s_E(t), a(t) ]; and U(M, a) = M_a, where M_a is the subset of tuples in M containing action a.
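One possible data layout for the local model M and the accessor operations above; the class and field names are illustrative, inputs are assumed already temporally aligned, and indexing is zero-based.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModelTuple:
    s_E: np.ndarray   # embedded state, S(m(t))
    a: int            # action, A(m(t))
    r: float          # reward r(t)

def build_local_model(S_E, actions, rewards):
    """m(t) = {s_E(t), a(t), r(t)}; the three sequences are assumed aligned."""
    return [ModelTuple(s, a, r) for s, a, r in zip(S_E, actions, rewards)]

def U(model, a):
    """U(M, a): the subset of tuples whose action equals a."""
    return [m for m in model if m.a == a]
```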

12. Modeling: Generative Local Models from Embeddings 2/2
Consider a state vector x(i) in R^E indexed by simulation time i, and compute its locality, i.e., its nearest neighbor.
◮ The model's nearest neighbor of x(i) when taking action a(i), defined for a discrete action set (6) and for the continuous case (7):
  m(t_x(i)) = arg min_{m(t) ∈ U(M, a(i))} ‖ S(m(t)) − x(i) ‖,   a ∈ A   (6)
  m(t_x(i)) = arg min_{m(t) ∈ M} ‖ Z(m(t)) − [ x(i), ω a(i) ] ‖,   a ∈ A   (7)
where ω is a scaling parameter.
◮ The model gradient and numerical integration are defined as:
  ∇x(i) = S(m(t_x(i) + 1)) − S(m(t_x(i)))   (8)
  x(i+1) = x(i) + Δ_i ( ∇x(i) + η )   (9)
where η is a vector of noise and Δ_i is the integration step size.
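A sketch of one simulation step of the local model for the discrete-action case, combining the nearest-neighbor lookup of Eq. (6) with the gradient and integration of Eqs. (8)-(9). It assumes the `ModelTuple` layout from the previous sketch; the noise model and step-size handling are assumptions.

```python
import numpy as np

def step_local_model(model, x, a, dt=1.0, noise_std=0.0, rng=None):
    """Advance the simulated state x(i) by one step under action a."""
    rng = rng or np.random.default_rng()
    # Eq. (6): nearest stored tuple among those sharing the chosen action, U(M, a).
    candidates = [i for i, m in enumerate(model) if m.a == a]
    t_x = min(candidates, key=lambda i: np.linalg.norm(model[i].s_E - x))
    t_next = min(t_x + 1, len(model) - 1)                 # temporal successor, clamped
    grad = model[t_next].s_E - model[t_x].s_E             # Eq. (8): model gradient
    eta = rng.normal(0.0, noise_std, size=x.shape)        # noise vector
    return x + dt * (grad + eta), model[t_x].r            # Eq. (9) plus matched reward
```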
