Memory Augmented Policy Optimization (MAPO) for Program Synthesis and Semantic Parsing
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, Ni Lao
Program Synthesis / Semantic Parsing
Question: how many more passengers flew to los angeles than to saskatoon?
Answer: 12,467
Latent program (only the answer is observed, so the reward is sparse):
(filter in rows ['saskatoon'] r.city)
(filter in rows ['los angeles'] r.city)
(diff v1 v0 r.passengers)
Policy Gradient (on-policy): the actor draws samples from the current policy, and the learner uses them to produce an updated policy. Unbiased => converges to an optimal solution, but high variance => slow training.
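A minimal sketch of this on-policy estimator, assuming a PyTorch-style policy with hypothetical sample_program / reward_fn interfaces (not the actual MAPO code):

import torch

def reinforce_loss(policy, contexts, reward_fn, num_samples=4):
    # Unbiased but high-variance on-policy estimate of -E[R(a)].
    # policy.sample_program(ctx) -> (program, log_prob)   (assumed interface)
    # reward_fn(ctx, program)    -> 0.0 or 1.0            (sparse binary reward)
    losses = []
    for ctx in contexts:
        for _ in range(num_samples):
            program, log_prob = policy.sample_program(ctx)
            reward = reward_fn(ctx, program)  # most samples score 0, so the gradient signal is noisy
            # Score-function (REINFORCE) term: the gradient of -R(a) * log pi(a) is unbiased.
            losses.append(-reward * log_prob)
    return torch.stack(losses).mean()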
Imitation Learning: the actor follows demonstrations, and the learner uses them to produce an updated policy. Low variance => fast training, but biased => suboptimal solution, and it requires human supervision.
MAPO: a memory buffer stores high-reward samples; the actor produces samples both inside and outside the memory, and the learner uses both to produce an updated policy. Unbiased => optimal solution, and low variance => fast training.
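A minimal sketch of the memory buffer idea, assuming a per-question dictionary of correct programs (the names here are illustrative, not the actual implementation):

from collections import defaultdict

class HighRewardMemory:
    # Per-question buffer of programs that have already obtained reward 1.
    def __init__(self):
        self._buffer = defaultdict(set)  # question id -> set of program strings

    def add(self, question_id, program, reward):
        # Only fully correct programs (sparse reward == 1) are memorized.
        if reward == 1.0:
            self._buffer[question_id].add(program)

    def programs(self, question_id):
        # Programs in memory are enumerated exactly during training; everything else is sampled.
        return sorted(self._buffer[question_id])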
Gradient estimate by sampling: the expectation of the gradient over the program space is approximated with Monte Carlo samples. Unbiased, but high variance.
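This sampling estimate is the standard score-function (REINFORCE) estimator; with the notation assumed here (a is a program, R(a) its binary reward, K the number of samples):

\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[R(a)]
  = \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\,\nabla_\theta \log \pi_\theta(a)\big]
  \approx \frac{1}{K} \sum_{k=1}^{K} R(a^{(k)})\,\nabla_\theta \log \pi_\theta(a^{(k)}),
  \qquad a^{(k)} \sim \pi_\theta .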
MAPO gradient estimate: enumerate the programs inside the memory and sample only from the programs outside the memory. This stratified sampling draws samples from a smaller space, which reduces variance while keeping the estimate unbiased.
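A rough sketch of this stratified estimator, using the same hypothetical PyTorch-style interface as above: the stratum inside the memory is enumerated and weighted by its exact probability mass, and the stratum outside it is estimated by sampling, so the combined estimate stays unbiased (refinements such as memory-weight clipping are omitted):

import torch

def mapo_loss(policy, ctx, memory_programs, reward_fn, num_outside_samples=4):
    # Stratified estimate of -E[R(a)]: exact enumeration inside memory, sampling outside.
    # policy.log_prob(ctx, program) and policy.sample_program(ctx) are assumed interfaces.

    # 1) Enumerate programs inside memory; their total probability mass is pi_B.
    inside_log_probs = [policy.log_prob(ctx, p) for p in memory_programs]
    if inside_log_probs:
        pi_b = torch.stack(inside_log_probs).exp().sum().detach()
        # Every memory program has reward 1; weight its log-prob by its (detached) probability.
        inside_term = -sum(lp.exp().detach() * lp for lp in inside_log_probs)
    else:
        pi_b = torch.tensor(0.0)
        inside_term = torch.tensor(0.0)

    # 2) Sample programs outside the memory (simple rejection against the buffer).
    memory_set = set(memory_programs)
    outside_losses = []
    for _ in range(num_outside_samples):
        program, log_prob = policy.sample_program(ctx)
        if program in memory_set:
            continue  # this stratum is already covered exactly by the enumeration above
        reward = reward_fn(ctx, program)
        outside_losses.append(-reward * log_prob)
    outside_term = torch.stack(outside_losses).mean() if outside_losses else torch.tensor(0.0)

    # 3) Combine the strata, weighting the sampled one by its remaining probability mass.
    return inside_term + (1.0 - pi_b) * outside_term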
MAPO objective: maximize the expected reward E_{a ~ π_θ}[R(a)], where a is a program and R(a) indicates whether it is correct or not.
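Written out with notation assumed here (B is the memory buffer of high-reward programs, pi_B the total probability the policy assigns to it), the expectation decomposes over the buffer as:

\mathbb{E}_{a \sim \pi_\theta}[R(a)]
  = \sum_{a \in \mathcal{B}} \pi_\theta(a)\,R(a)
  + (1 - \pi_{\mathcal{B}})\,\mathbb{E}_{a \sim \pi_\theta,\; a \notin \mathcal{B}}[R(a)],
  \qquad \pi_{\mathcal{B}} = \sum_{a \in \mathcal{B}} \pi_\theta(a).

The first term is computed exactly by enumeration; only the second term is estimated by sampling.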
WikiTableQuestions: the first state-of-the-art result achieved using RL.
WikiSQL: comparing weakly supervised MAPO against strongly supervised models.
● MAPO converges more slowly than iterative maximum likelihood training, but reaches a better solution.
● REINFORCE makes little progress (<10% accuracy).
An efficient policy optimization method for learning to generate sequences from sparse rewards.
Code: https://github.com/crazydonkey200/neural-symbolic-machines
Paper: https://arxiv.org/abs/1807.02322
Poster: Room 517 AB #137
http://crazydonkey200.github.io/