
Memory Augmented Policy Optimization (MAPO) for Program - PowerPoint PPT Presentation



  1. Memory Augmented Policy Optimization (MAPO) for Program Synthesis and Semantic Parsing. Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, Ni Lao

  2. Program Synthesis / Semantic Parsing. Question: how many more passengers flew to los angeles than to saskatoon?

  3. Program Synthesis / Semantic Parsing. Question: how many more passengers flew to los angeles than to saskatoon? Answer: 12,467. Program: (filter in rows ['saskatoon'] r.city) (filter in rows ['los angeles'] r.city) (diff v1 v0 r.passengers)

  8. Program Synthesis / Semantic Parsing. Question: how many more passengers flew to los angeles than to saskatoon? Answer: 12,467. Program (latent): (filter in rows ['saskatoon'] r.city) (filter in rows ['los angeles'] r.city) (diff v1 v0 r.passengers)

  9. Program Synthesis / Semantic Parsing. Question: how many more passengers flew to los angeles than to saskatoon? Answer: 12,467 (sparse). Program (latent): (filter in rows ['saskatoon'] r.city) (filter in rows ['los angeles'] r.city) (diff v1 v0 r.passengers)

  10. Policy Gradient. Diagram labels: actor, on-policy samples, learner, updated policy. Unbiased => optimal solution. High variance => slow training.

  11. Imitation Learning. Diagram labels: demonstration, actor, learner, updated policy. Biased => suboptimal solution. Low variance => fast training.

  12. Imitation Learning. Diagram labels: demonstration, actor, learner, updated policy. Biased => suboptimal solution. Low variance => fast training. Requires human supervision.

  13. MAPO. Diagram labels: actor, learner, updated policy. Unbiased => optimal solution. Low variance => fast training.

  14. MAPO. Diagram labels: memory buffer, high-reward samples, actor, learner, updated policy. Unbiased => optimal solution. Low variance => fast training.

  15. MAPO. Diagram labels: memory buffer, high-reward samples, samples inside memory, samples outside memory, actor, learner, updated policy. Unbiased => optimal solution. Low variance => fast training.

  16. Gradient Estimate. Diagram labels: expectation, program space.

  17. Gradient Estimate. Diagram labels: sampling, expectation, program space. Unbiased. High variance.

  18. MAPO. Diagram labels: gradient estimate; enumeration, programs inside memory; sampling, programs outside memory. Sampling from a smaller space => variance reduction. Unbiased.

  19. MAPO. Diagram labels: gradient estimate; enumeration, sampling, programs inside memory; sampling, programs outside memory. Stratified sampling => variance reduction. Unbiased.

  20. MAPO. [Equation lost in extraction. Surviving legend: one symbol = a program, another symbol = correct or not.]

  22. WikiTableQuestions: first SOTA using RL

  24. WikiSQL: strong vs. weak supervision! Strong supervision

  26. ● MAPO converges slower than iterative maximum likelihood, but reaches a better solution. ● REINFORCE doesn’t make much progress (<10% accuracy).

  27. ● MAPO converges slower than maximum likelihood training, but reaches a better solution. ● REINFORCE doesn’t make much progress (<10% accuracy).

  28. An efficient policy optimization method for learning to generate sequences from sparse rewards. https://github.com/crazydonkey200/neural-symbolic-machines https://arxiv.org/abs/1807.02322 Poster: Room 517 AB #137 http://crazydonkey200.github.io/
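
The three-step program on the Program Synthesis / Semantic Parsing slides can be traced by hand with a tiny interpreter. The sketch below is not the authors' executor: the function names `filter_in` and `diff` are hypothetical stand-ins for the slide's operators, `r.city` and `r.passengers` are mapped to plain dictionary keys, and the table values are illustrative placeholders chosen only so that the difference matches the answer 12,467 shown on the slides.

```python
# Toy interpreter for the slide's program. Hypothetical helper names and
# placeholder table values; not the executor used in the talk.

def filter_in(rows, values, column):
    """Keep the rows whose `column` entry is one of `values`."""
    return [row for row in rows if row[column] in values]

def diff(rows_a, rows_b, column):
    """Subtract the `column` value of rows_b from that of rows_a.

    Both selections are expected to contain exactly one row.
    """
    assert len(rows_a) == 1 and len(rows_b) == 1
    return rows_a[0][column] - rows_b[0][column]

# Placeholder table (values are assumptions for illustration).
table = [
    {"city": "los angeles", "passengers": 14749},
    {"city": "houston", "passengers": 5465},
    {"city": "saskatoon", "passengers": 2282},
]

# (filter in rows ['saskatoon'] r.city)      -> v0
v0 = filter_in(table, ["saskatoon"], "city")
# (filter in rows ['los angeles'] r.city)    -> v1
v1 = filter_in(table, ["los angeles"], "city")
# (diff v1 v0 r.passengers)                  -> answer
answer = diff(v1, v0, "passengers")

# Sparse binary reward: 1 only if the final answer is correct.
reward = 1.0 if answer == 12467 else 0.0
print(answer, reward)
```

The reward is computed only from the final answer, which is why the slides call the supervision sparse and the program latent: nothing in the training signal says which intermediate steps were right.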
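
The gradient-estimate slides contrast plain sampling over the whole program space with enumerating programs inside memory and sampling only outside it. A minimal sketch of that comparison follows, assuming a toy softmax policy over eight "programs" and a one-entry memory buffer. All names (`reinforce_estimate`, `mapo_estimate`, and so on) and all numbers are hypothetical, and the sketch covers only the stratified estimator, not the full training system from the talk.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(probs, a):
    """Gradient of log softmax(logits)[a] w.r.t. the logits: onehot(a) - probs."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

def add_scaled(acc, vec, scale):
    """In-place acc += scale * vec."""
    for i, v in enumerate(vec):
        acc[i] += scale * v

def exact_gradient(probs, rewards):
    """Exact gradient of the expected reward, by full enumeration."""
    g = [0.0] * len(probs)
    for a, p in enumerate(probs):
        add_scaled(g, grad_log_prob(probs, a), p * rewards[a])
    return g

def reinforce_estimate(probs, rewards, num_samples, rng):
    """Plain Monte Carlo estimate: sample from the whole program space."""
    g = [0.0] * len(probs)
    for a in rng.choices(range(len(probs)), weights=probs, k=num_samples):
        add_scaled(g, grad_log_prob(probs, a), rewards[a] / num_samples)
    return g

def mapo_estimate(probs, rewards, memory, num_samples, rng):
    """Enumerate programs inside memory, sample only programs outside it."""
    g = [0.0] * len(probs)
    for a in memory:  # exact contribution of the memory buffer
        add_scaled(g, grad_log_prob(probs, a), probs[a] * rewards[a])
    pi_b = sum(probs[a] for a in memory)  # total probability of memory
    outside = [a for a in range(len(probs)) if a not in memory]
    weights = [probs[a] for a in outside]  # renormalised by rng.choices
    for a in rng.choices(outside, weights=weights, k=num_samples):
        scale = (1.0 - pi_b) * rewards[a] / num_samples
        add_scaled(g, grad_log_prob(probs, a), scale)
    return g

def mean_squared_error(estimator, exact, trials, rng):
    total = 0.0
    for _ in range(trials):
        est = estimator(rng)
        total += sum((e - x) ** 2 for e, x in zip(est, exact))
    return total / trials

# Toy setup (assumed): program 0 is correct and stored in memory,
# program 1 is correct but not yet discovered, the rest are wrong.
probs = softmax([2.0] + [0.0] * 7)
rewards = [1.0, 1.0] + [0.0] * 6
memory = {0}
exact = exact_gradient(probs, rewards)

rng = random.Random(0)
mse_reinforce = mean_squared_error(
    lambda r: reinforce_estimate(probs, rewards, 4, r), exact, 2000, rng)
mse_mapo = mean_squared_error(
    lambda r: mapo_estimate(probs, rewards, memory, 4, r), exact, 2000, rng)
print("REINFORCE MSE:", mse_reinforce)
print("MAPO-style MSE:", mse_mapo)
```

Both estimators are unbiased, so their mean squared error equals their variance. The stratified version removes the variance that comes from whether a sample happens to land inside memory, which matches the "sampling from a smaller space" point on the slides.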
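
The stratified estimate described on the MAPO slides can be written out as follows. The notation is an assumption, since the slide equations did not survive extraction: $a$ is a program, $R(a) \in \{0, 1\}$ says whether it is correct, $\pi_\theta$ is the policy, $\mathcal{B}$ is the memory buffer, and $\pi_{\mathcal{B}}$ is the total probability the policy assigns to programs in memory.

```latex
\begin{aligned}
\pi_{\mathcal{B}} &= \sum_{a \in \mathcal{B}} \pi_\theta(a),
\qquad
\pi_\theta^{-}(a) = \frac{\pi_\theta(a)}{1 - \pi_{\mathcal{B}}}
\quad \text{for } a \notin \mathcal{B}, \\[4pt]
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right]
&= \underbrace{\sum_{a \in \mathcal{B}} \pi_\theta(a)\, R(a)\,
   \nabla_\theta \log \pi_\theta(a)}_{\text{inside memory: enumeration}}
 \;+\;
 \underbrace{(1 - \pi_{\mathcal{B}})\,
   \mathbb{E}_{a \sim \pi_\theta^{-}}\!\left[ R(a)\,
   \nabla_\theta \log \pi_\theta(a) \right]}_{\text{outside memory: sampling}}
\end{aligned}
```

The identity is exact, so replacing only the second expectation with a sample average keeps the estimate unbiased while confining sampling noise to the programs outside memory.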
