Neural Program Synthesis with Priority Queue Training Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, Quoc V. Le https://arxiv.org/abs/1801.03526
Why Program Synthesis?
● One of the hard AI reasoning domains
● A tool for planning in robotics
● Increased interpretability (humans can read code more easily than NN weights)
Deep Reinforcement Learning
● Value-based RL, e.g. Q-learning
● Policy-based RL, e.g. policy gradient
(Agent-environment diagram: https://becominghuman.ai/the-very-basics-of-reinforcement-learning-154f28a79071)
Deep RL for Combinatorial Optimization ● Neural Architecture Search with Reinforcement Learning
Deep RL for Combinatorial Optimization ● Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
Deep RL for Combinatorial Optimization ● Neural Combinatorial Optimization with Reinforcement Learning
"Fundamental" Program Synthesis ● Focus on algorithmic coding problems. ● No ground-truth program solutions. ● Simple Turing-complete language.
HelloWorld.bf (annotated Hello World in BF, from https://en.wikipedia.org/wiki/Brainfuck)
++++++++                Set Cell #0 to 8
[
    >++++               Add 4 to Cell #1; this will always set Cell #1 to 4
    [                   as the cell will be cleared by the loop
        >++             Add 2 to Cell #2
        >+++            Add 3 to Cell #3
        >+++            Add 3 to Cell #4
        >+              Add 1 to Cell #5
        <<<<-           Decrement the loop counter in Cell #1
    ]                   Loop till Cell #1 is zero; number of iterations is 4
    >+                  Add 1 to Cell #2
    >+                  Add 1 to Cell #3
    >-                  Subtract 1 from Cell #4
    >>+                 Add 1 to Cell #6
    [<]                 Move back to the first zero cell you find; this will be Cell #1, which was cleared by the previous loop
    <-                  Decrement the loop counter in Cell #0
]                       Loop till Cell #0 is zero; number of iterations is 8
The result of this is:
Cell No : 0   1   2    3    4   5   6
Contents: 0   0   72   104  88  32  8
Pointer : ^
>>.                     Cell #2 has value 72 which is 'H'
>---.                   Subtract 3 from Cell #3 to get 101 which is 'e'
+++++++..+++.           Likewise for 'llo' from Cell #3
>>.                     Cell #5 is 32 for the space
<-.                     Subtract 1 from Cell #4 for 87 to give a 'W'
<.                      Cell #3 was set to 'o' from the end of 'Hello'
+++.------.--------.    Cell #3 for 'rl' and 'd'
>>+.                    Add 1 to Cell #5 gives us an exclamation point
>++.                    And finally a newline from Cell #6
Anatomy of BF Turing complete! https://esolangs.org/wiki/Brainfuck#Computational_class
BF Execution Demo: Reverse a list
Why BF
● Turing complete, and in theory suitable for any algorithmic task.
● Many algorithms have surprisingly elegant BF implementations.
● No syntax errors (with a minor adjustment to the interpreter; see the interpreter sketch below).
● No names (variables, functions).
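A minimal BF interpreter sketch in Python (not the authors' implementation) showing why "no syntax errors" holds: non-command characters are ignored and unmatched brackets are treated as no-ops or jumps past the end, so every sampled character string executes. The bracket-handling convention and the behavior on exhausted input are assumptions here, not necessarily the paper's.

```python
def run_bf(code, inputs, max_steps=5000, tape_len=256, base=256):
    """Tolerant BF interpreter sketch: any string of characters runs."""
    # Precompute bracket jumps; unmatched brackets do not raise errors.
    jump, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            if stack:
                j = stack.pop()
                jump[i], jump[j] = j, i
            else:
                jump[i] = i              # unmatched ']' acts as a no-op
    for j in stack:
        jump[j] = len(code)              # unmatched '[' jumps past the end

    tape = [0] * tape_len
    ptr = pc = in_ptr = 0
    out, steps = [], 0
    while pc < len(code) and steps < max_steps:
        c = code[pc]
        if c == '>':   ptr = (ptr + 1) % tape_len            # move pointer right
        elif c == '<': ptr = (ptr - 1) % tape_len            # move pointer left
        elif c == '+': tape[ptr] = (tape[ptr] + 1) % base    # increment cell
        elif c == '-': tape[ptr] = (tape[ptr] - 1) % base    # decrement cell
        elif c == '.': out.append(tape[ptr])                 # write output
        elif c == ',':                                       # read next input (0 when exhausted)
            tape[ptr] = inputs[in_ptr] if in_ptr < len(inputs) else 0
            in_ptr += 1
        elif c == '[' and tape[ptr] == 0:
            pc = jump[pc]                # skip loop body when cell is zero
        elif c == ']' and tape[ptr] != 0:
            pc = jump[pc]                # jump back to the matching '['
        pc += 1
        steps += 1
    return out

# e.g. run_bf(">,[>,]<[.<].", [1, 2, 3, 4, 0]) -> [4, 3, 2, 1, 0] (list reversal)
```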
Training Setup
The RNN generates (by inference) a candidate program; the BF interpreter runs that code on the test-case inputs; a scoring function compares the program outputs with the expected outputs to produce a reward; the reward drives a gradient update of the RNN.
Training Setup: Reward Function
Score per test case: S(Y, Y*) = d(∅, Y*) - d(Y, Y*), where d is a variable-length Hamming distance with base B = 256 and ∅ is the empty output.
Example: input X = [1, 2, 3, 4, 0], expected output Y* = [4, 3, 2, 1, 0], program P = ",>,.<.", program output Y = P(X) = [2, 1].
d(Y, Y*) = 2 + 2 + B + B + B and d(∅, Y*) = B + B + B + B + B, so S = 2B - 4.
Total reward for a program: Reward = ∑ S(P(X), Y*) over the test cases.
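A sketch of this scoring in Python, reusing the run_bf sketch above. The exact distance convention is an assumption: per-position absolute difference, plus a penalty of B = 256 for every missing or extra output position.

```python
BASE = 256

def hamming_dist(y, y_star, base=BASE):
    """Variable-length Hamming distance (assumed convention): sum of
    per-position absolute differences, plus `base` per length mismatch."""
    d = sum(abs(a - b) for a, b in zip(y, y_star))
    return d + base * abs(len(y) - len(y_star))

def score(y, y_star):
    """S(Y, Y*) = d(empty, Y*) - d(Y, Y*)."""
    return hamming_dist([], y_star) - hamming_dist(y, y_star)

def reward(program, test_cases):
    """Total reward: sum of per-test-case scores."""
    return sum(score(run_bf(program, x), y_star) for x, y_star in test_cases)

# Slide example: reward(",>,.<.", [([1, 2, 3, 4, 0], [4, 3, 2, 1, 0])])
# P(X) = [2, 1], so d = 2 + 2 + 3*BASE vs d(empty) = 5*BASE, giving S = 2*BASE - 4 = 508
```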
Problems with policy gradient (REINFORCE)
● Catastrophic forgetting and unstable learning
● Sample inefficient
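For context, the policy gradient referred to here is the standard REINFORCE estimator (textbook form with a constant baseline b for variance reduction; not specific to this paper), where p_θ is the RNN's distribution over programs P and R is the reward:

```latex
\nabla_\theta \, \mathbb{E}_{P \sim p_\theta}[R(P)]
  = \mathbb{E}_{P \sim p_\theta}\!\left[(R(P) - b)\,\nabla_\theta \log p_\theta(P)\right]
  \approx \frac{1}{N}\sum_{i=1}^{N} (R(P_i) - b)\,\nabla_\theta \log p_\theta(P_i),
  \qquad P_i \sim p_\theta
```

High-variance, on-policy sample estimates of this gradient are what make REINFORCE sample-inefficient and prone to the instability noted above.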
Solution: Priority Queue Training (PQT)
Code is sampled from the RNN and scored by the reward function; the highest-reward unique programs seen so far are kept in a max-unique priority queue, and the queue contents are reused as training targets for the RNN (see the sketch below).
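A minimal sketch of the PQT loop, assuming a hypothetical policy object with `sample(batch_size)` (returns program strings) and `train_on(programs)` (one maximum-likelihood gradient step on those strings); these method names are illustrative, not the authors' API.

```python
import heapq

def pqt_train(policy, reward_fn, num_iters=1000, batch_size=64, k=10):
    """Priority Queue Training sketch: keep the top-k unique programs by
    reward and repeatedly train the policy to imitate them."""
    queue = []    # min-heap of (reward, program); lowest reward at the root
    seen = set()  # programs currently admitted to the queue (uniqueness)
    for _ in range(num_iters):
        for prog in policy.sample(batch_size):
            if prog in seen:
                continue
            r = reward_fn(prog)
            if len(queue) < k:
                heapq.heappush(queue, (r, prog))
                seen.add(prog)
            elif r > queue[0][0]:
                # New program beats the current worst queue entry: swap them.
                _, evicted = heapq.heappushpop(queue, (r, prog))
                seen.discard(evicted)
                seen.add(prog)
        # Supervised (log-likelihood) update toward the best programs so far.
        policy.train_on([prog for _, prog in queue])
```

In the paper this queue-based maximum-likelihood term can also be combined with the policy-gradient objective and an entropy bonus; the sketch shows the pure-PQT variant only.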
Results
Fixed Length Programs (each synthesized program has length 100)
remove:  <>[,[<+,.],<<]],[[-[+.>>][]>[>[>[[<+>>.+<>]>]<<>]],]>+-++--,>[+[[<----].->+]->]]]-[,.]+>>,-,,-]><,,]
reverse: ,[[>,<>]]-]+[<[.,++,<]<>[->.,+,[<+]<-]<,,<<>>[[[<+<[],.>->]>,<-]<]<>,-<,,[+>,<,><.[.<-+,+-<]+<[,+-<>
add:     ,>,[-<+>][,],>]<]-<.+,,+,<.,>]>,[><<-,][+-[.[[+<[.>]],>.[]-<,],+,[,->]>>->+,[+[>]-,-]--,.,>+-<<]]<,+
Synthesized vs "Ground Truth"
Task              | Synthesized                              | Experimenter's best solution
reverse           | ,[>,]+[,<.]                              | >,[>,]<[.<].
remove            | ,-[+.,-]+[,.]                            | ,[-[+.[-]],].
count-char        | ,[-[>]>+<<<<,]>.                         | >,[-[<->[-]]<+>,]<.
add               | ,[+>,<<->],<.,.                          | ,>,<[->+<]>.
bool-logic        | ,+>,<[,>],<+<.                           | ???
print             | ++++++++.---.+++++++..+++.               | ++++++++.---.+++++++..+++.
zero-cascade      | ,.,[.>.-<,[[[.+,>+[-.>]..<]>+<<]>+<<]]   | ,[.>[->+>.<<]>+[-<+>]<<,]
cascade           | ,[.,.[.,.[..,[....,[.....,[.>]<]].]]     | ,>>+<<[>>[-<+>]<[->+<<.>]>+<<,].
shift-left        | ,>,[.,]<.>.                              | ,>,[.,]<.,.
shift-right       | ,[>,]<.,<<<<<.[>.]                       | >,[>,]<.[-]<[<]>[.>].
unriffle          | -[,>,[.,>,]<[>,]<.]                      | >,[>,[.[-]],]<[.<].
remove-last       | ,>,[<.>>,].                              | ,>,[[<.[-]>[-<+>]],].
remove-last-two   | >,<,>>,[<.,[<.[>]],].                    | ,>,>,[[<<.[-]>[-<+>]>[-<+>]],].
echo-alternating  | ,[.,>,]<<<<.[>.]                         | >,[.,>,]<<[<]>[.>].
length            | ,[>+<,]>.                                | >+>,[[<]>+[>],]<[<]>-.
echo-second-seq   | ,[,]-[,.]                                | ,[,],[.,].
echo-nth-seq      | ,-[->-[,]<]-[,.]                         | ,-[->,[,]<],[.,].
What's next?
Scale up to harder coding problems and more complex programming languages:
● Augment RL with supervised training on a large corpus of programs.
● Give the code synthesizer access to auxiliary information, such as stack traces and program execution internals.
● Data augmentation techniques, such as Hindsight Experience Replay.
● Few-shot learning techniques can help with generalization issues, e.g. MAML.
Thank you! Questions? Thank you to my coauthors: Mohammad Norouzi, Jonathan Shen, Rui Zhao, Quoc V. Le.
Prior Work
● Algorithm induction
  ○ Neural Programmer, A. Neelakantan et al.
  ○ Neural Programmer-Interpreters, S. Reed et al.
● Domain-specific languages
  ○ RobustFill, J. Devlin et al.
  ○ DeepCoder, M. Balog et al.
  ○ TerpreT, A. Gaunt et al.
● Precursors to PQT
  ○ Noisy Cross-Entropy Method, I. Szita et al.
  ○ Neural Symbolic Machines, C. Liang et al.