Provably Efficient RL via Latent State Decoding
  1. Provably Efficient RL via Latent State Decoding. Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miro Dudík, John Langford

  2. RL theory vs practice
     Theory: simple tabular environments; no generalization.
     Practice: complex rich-observation environments; generalization via function approximation.
     Can we design provably sample-efficient RL algorithms for rich-observation environments?

  3. Block MDPs: a structured model for rich-observation RL
     • The agent only observes a rich context x (e.g., a visual signal).
     • The environment is summarized by a small hidden state space s (e.g., the agent's location).
     • For H steps: the agent sees context x, takes action a (e.g., Left), and the hidden state s transitions.
     • The state can be decoded from the observation.
     (A minimal code sketch of this model follows.)
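To make the model concrete, here is a minimal, self-contained sketch of a Block MDP in Python. Everything here (the class name `BlockMDP`, the Gaussian emissions, the sparse goal reward) is an illustrative assumption, not the paper's construction; the point is only the interface: hidden dynamics over M states and K actions for H steps, with the agent seeing contexts rather than states.

```python
import numpy as np

class BlockMDP:
    """Minimal Block MDP sketch (all names and parameters are illustrative
    assumptions). The agent never observes the hidden state s; it only sees
    a rich context x drawn from an emission distribution q(x | s).
    Well-separated emission means approximate the model's disjoint context
    blocks, so the state remains decodable from the context."""

    def __init__(self, M=3, K=2, H=5, context_dim=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.M, self.K, self.H = M, K, H
        # Latent dynamics: T[s, a] is a distribution over next hidden states.
        self.T = self.rng.dirichlet(np.ones(M), size=(M, K))
        # One emission mean per hidden state; contexts are noisy emissions.
        self.means = 10.0 * self.rng.standard_normal((M, context_dim))

    def _emit(self, s):
        # Rich context: a noisy high-dimensional emission from hidden state s.
        return self.means[s] + self.rng.standard_normal(self.means.shape[1])

    def reset(self):
        self.s = 0
        return self._emit(self.s)

    def step(self, a):
        self.s = self.rng.choice(self.M, p=self.T[self.s, a])
        # Hypothetical sparse reward: 1 upon reaching the last hidden state.
        reward = 1.0 if self.s == self.M - 1 else 0.0
        return self._emit(self.s), reward
```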

  4. Objective: find a decoder
     Idea: find a function f that decodes hidden states from contexts, f(context) = state.
     This reduces the problem to a tabular one (see the sketch below).
     Main challenge: there is no label, since we cannot observe the hidden states.
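The payoff of a decoder, sketched under the hypothetical `BlockMDP` interface above: once f maps contexts to state ids, any tabular method applies. Plain epsilon-greedy Q-learning serves as the stand-in tabular learner here, and the "cheating" decoder in the usage example peeks at the emission means purely for illustration, since learning f without such access is the actual problem.

```python
import numpy as np
from collections import defaultdict

def q_learning_with_decoder(env, f, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Given a decoder f(context) -> state id, run plain epsilon-greedy
    Q-learning over decoded states: the rich-observation problem has
    become tabular. Uses the hypothetical BlockMDP interface above."""
    Q = defaultdict(lambda: np.zeros(env.K))
    rng = np.random.default_rng(1)
    for _ in range(episodes):
        s = f(env.reset())                       # decode context -> state id
        for _ in range(env.H):
            a = (int(rng.integers(env.K)) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            x, r = env.step(a)
            s2 = f(x)                            # decode the next context
            Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
            s = s2
    return Q

# Usage with a cheating decoder (nearest emission mean), just to show the
# reduction; the real point of the paper is learning f without such access.
env = BlockMDP()
f = lambda x: int(np.argmin(((env.means - x) ** 2).sum(axis=1)))
Q = q_learning_with_decoder(env, f)
```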

  5. Approach
     Our approach: learn a function f that predicts, from the context, the conditional probability of each (previous state, action) pair; we assume access to a regression oracle to learn this function. For example, with states s1, s2 at level h and actions a1, a2, f(context) is a distribution over the four pairs (s1,a1), (s1,a2), (s2,a1), (s2,a2).
     Different conditional probabilities correspond to different states: clustering contexts by their predicted distributions classifies them into the states at level h+1 (e.g., s3 and s4). A sketch of this step follows.
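A minimal sketch of this decoding step, assuming a dataset of contexts at level h+1, each labelled with the id of the already-decoded (previous state, action) pair that generated it. Multinomial logistic regression stands in for the regression oracle, and a greedy L1 clustering with an assumed threshold `tau` stands in for the state-classification step; this conveys the flavor of the method, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def decode_level(contexts, prev_sa_ids, tau=0.2):
    """Sketch of one level of the decoding idea (not the paper's exact
    algorithm; `tau` is an assumed separation threshold).

    contexts:     array (n, d) of observed contexts at level h+1
    prev_sa_ids:  array (n,) of ids of the decoded (previous state, action)
                  pair that generated each context
    """
    # Step 1: the regression oracle. Multinomial logistic regression stands
    # in for any supervised learner that outputs the conditional
    # probabilities P(previous state, action | context).
    oracle = LogisticRegression(max_iter=1000)
    oracle.fit(contexts, prev_sa_ids)
    probs = oracle.predict_proba(contexts)   # one probability row per context

    # Step 2: contexts with different conditional probability vectors come
    # from different latent states, so greedily cluster the rows in L1
    # distance; each cluster becomes one decoded state at this level.
    centers, labels = [], []
    for p in probs:
        dists = [np.abs(p - c).sum() for c in centers]
        if centers and min(dists) < tau:
            labels.append(int(np.argmin(dists)))
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return oracle, np.array(labels)
```

The state ids produced at level h+1 can serve as the (previous state, action) labels for the next level, so the procedure runs forward through the horizon, one supervised-learning call per level, matching the H oracle calls in the guarantee below.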

  6. Guarantees
     Theorem: our algorithm can find a near-optimal decoder with poly(M, K, H) samples, in polynomial time, with H calls to a supervised-learning black box.
     (M = number of hidden states, K = number of actions, H = time horizon.)
     This gives statistical efficiency and computational efficiency while handling rich observations.
     Assumptions: the supervised learner is expressive enough, and the latent states are reachable and identifiable.
     (A PAC-style rendering of the guarantee follows.)
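For readers who prefer the statement in standard PAC form, here is a hedged rendering; the accuracy and confidence parameters $\epsilon$ and $\delta$ are assumptions of this rendering, since the slide leaves them implicit.

```latex
% Informal PAC-style reading; \epsilon and \delta are the standard accuracy
% and confidence parameters, which the slide leaves implicit.
\textbf{Theorem (informal).}
With probability at least $1 - \delta$, the algorithm returns a decoder
$\hat f$ whose induced tabular MDP supports near-optimal planning, using
\[
  n \;=\; \mathrm{poly}\!\bigl(M,\, K,\, H,\, \tfrac{1}{\epsilon},\, \log\tfrac{1}{\delta}\bigr)
\]
trajectories, polynomial computation, and $H$ calls to the
supervised-learning oracle.
```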

  7. Algorithm details and experiments @ Poster #208
