
Deep Learning over Big Code for Program Analysis and Synthesis

Deep Learning over Big Code for Program Analysis and Synthesis
Swarat Chaudhuri, Vijay Murali, Chris Jermaine

Programming is hard. Program synthesis, debugging, verification, repair: can we automate these processes? Decades of prior …


  1. 3. Sketches. A sketch Y is a syntactic abstraction of a program
 • Sketches abstract away superficial differences and already-known knowledge
 [ call FileReader.new(String)
   call BufferedReader.new(FileReader)
   loop ([ BufferedReader.readLine() ]) {
     skip
   }
   call BufferedReader.close()
 ]

  2. 3. Sketches. The program-sketch relation is many-to-one
 • Abstraction function: maps a program Prog to its sketch Y
 • Concretization distribution P(Prog | Y): maps a sketch back to a distribution over programs
 • P(Prog | Y) is not learned from data: it is fixed and defined heuristically with domain knowledge

  3. 3. Sketches. New goal: "sketch learning"
 • Learn to generate sketches from evidence, i.e., learn the distribution P(Y | X)
 • Data is now triplets (Prog, X, Y)
 • A parameter θ parameterizes the distribution
 • Find an optimal value θ*

  4. 3. Sketches. Two-step synthesis (see the sketch below)
 1. Sample a sketch Y from the learned distribution, i.e., Y ~ P(Y | X)
 2. Synthesize a program Prog from Y, i.e., Prog ~ P(Prog | Y)
 • Implemented in a combinatorial synthesizer
 • Uses type-directed search to prune the search space
 • Incorporates the PL grammar, language-level rules, type-safety constraints, …
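A brief sketch of the two-step decomposition just described, using the notation of the surrounding slides (X evidence, Y sketch); this is a summary of the model, not the slide's exact formula:

 \[
   P(\mathrm{Prog} \mid X) \;=\; \sum_{Y} P(\mathrm{Prog} \mid Y)\, P(Y \mid X)
 \]
 % Step 1: sample a sketch        Y ~ P(Y | X; θ)        (learned from data)
 % Step 2: concretize the sketch  Prog ~ P(Prog | Y)     (fixed, heuristic)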

  5. 3. Sketches. Sketches can be defined in many ways, but one has to be careful…
 • Too concrete: patterns in the training data would get lost; learning P(Y | X) would suffer
 • Too abstract: concretizing sketches to code would become too hard to compute; P(Prog | Y) would suffer
 Our sketch language is designed for API-using Java programs

  6. 3. Sketches
 [Figure: elements of the sketch language — abstract API call, type, API call]

  7. [Architecture diagram: the BayouSynth pipeline. A feature extractor pulls evidences and sketches from a corpus of programs; statistical learning with a deep neural network produces a distribution P(Y | X) over sketches given evidence. At inference time, a draft program with evidences (e.g., foo(File f) { /// read file } with keywords "read", "file") is given, a sketch is inferred from the learned distribution, and combinatorial search with type-based pruning yields the synthesized program (e.g., foo(File f) { f.read(); f.close(); }).]

  8. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  9. Data-driven Correctness Analysis
 Underlying thesis: bugs are anomalous behaviors.
 [Engler et al., 2002; Hangal & Lam, 2002]
 A specification is a commonplace pattern in program behaviors seen in the real world. Learn specifications from examples of program behavior.
 [Ammons et al., 2002; Raychev et al., 2014]

  10. BayouDebug • A statistical framework for simultaneously learning a wide range of specifications from a large, heterogeneous corpus • Quantitatively estimates a program's "anomalousness" as a measure of its correctness • BayouDebug: a system for finding API usage errors in Java/Android code • Underlying probabilistic model similar to BayouSynth but "mirrored" • The program is given; we need to predict the likelihood of its behaviors

  11. BayouDebug • Originally called Salento • Source: github.com/capergroup/salento
 Bayesian Specification Learning for Finding API Usage Errors. Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Foundations of Software Engineering [FSE] 2017. https://arxiv.org/abs/1703.01370

  12. BayouDebug
 AlertDialog.Builder b = new AlertDialog.Builder(this);
 b.setTitle(R.string.title_variable_to_insert);
 if (focus.getId() == R.id.tmpl_item) {
   b.setItems(R.array.templatebodyvars, this);
 } else if (focus.getId() == R.id.tmpl_footer) {
   b.setItems(R.array.templateheaderfootervars, this);
 }
 b.show();
 This dialog box cannot be closed

  13. 1. Evidence
 • We have programs Prog and evidence X, as before:
 • The set of API calls in the program
 • The set of types in the program
 • …
 • Evidence can be easily extracted from programs

  14. 2. Behaviors
 • Y represents program behaviors:
 • Traces of API calls
 • Program state during execution (an abstraction)
 • Behaviors can also be extracted from programs, via dynamic or symbolic execution
 • Assumes a behavior model for the program, derived from an input distribution (dynamic) or from static analysis (symbolic)

  15. The behavior model: a generative probabilistic automaton
 [Murawski & Ouaknine, 2005]
 AlertDialog.Builder b = new AlertDialog.Builder(…);
 b.setTitle(…);
 if (…) {
   b.setItems(…);
 } else if (…) {
   b.setItems(…);
 }
 b.show();
 [Figure: the automaton for this snippet — states connected by transitions labeled with API calls (new AlertDialog.Builder(…), setTitle(…), setItems(…), show()) and their probabilities]
 Produced using static analysis

  16. Specification Learning
 • From data, learn a distribution over program behaviors given evidence, i.e., P(Y | X)
 • Data is in the form of (X, Y) pairs
 • As before, assume θ parameterizes the distribution
 • Find an optimal value θ* using maximum conditional likelihood estimation (max-CLE)
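A sketch of the max-CLE objective implied here, with θ parameterizing the behavior distribution and the data consisting of (X, Y) pairs (notation assumed from the surrounding slides):

 \[
   \theta^* \;=\; \arg\max_{\theta} \sum_{(X_i,\, Y_i) \in \text{data}} \log P(Y_i \mid X_i;\, \theta)
 \]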

  17. Correctness Analysis
 • Goal: check whether a test program Prog is correct
 • Look at two distributions:
 • P(Y | X): how programs that look like Prog tend to behave
 • P(Y | Prog): how Prog itself behaves
 • Cast correctness analysis as a statistical distance computation
 • Kullback-Leibler (KL) divergence between the two distributions (formula below)
 • High KL divergence ➡ Prog is anomalous
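For reference, the standard KL divergence referred to above, written in the notation assumed here (P(Y | Prog) for the test program's behavior distribution, P(Y | X) for the corpus-learned one):

 \[
   D_{\mathrm{KL}}\!\left( P(Y \mid \mathrm{Prog}) \,\|\, P(Y \mid X) \right)
     \;=\; \sum_{Y} P(Y \mid \mathrm{Prog}) \,\log \frac{P(Y \mid \mathrm{Prog})}{P(Y \mid X)}
 \]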

  18. [Architecture diagram: the BayouDebug pipeline. A feature extractor pulls API calls and behaviors from a corpus of programs; statistical learning with a deep neural network produces a distribution P(Y | X) over behaviors given features. At inference time, the features and behaviors of a test program (e.g., foo(File f) { f.read(); f.close(); }) are extracted, compared against P(Y | X), and an aggregate anomaly score is produced.]

  19. What we have covered • Formal methods have always relied on formal specifications • Uncertainty in specifications is an important consideration • ML models learned from Big Code are a new and hot way of dealing with uncertainty • PL ideas are still key • Syntactic abstractions are necessary for data-driven synthesis • Static/dynamic analysis is necessary for data-driven debugging • How to implement all of this? Coming up next…

  20. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  21. What is a Neural Network?
 • A logical circuit transforms binary input signals into binary outputs through logical operations
 • A neural network is a circuit where:
 • Inputs and outputs can be smooth (continuous)
 • Operations are differentiable (matrix multiply, exponentiate, …)
 [Figure: a small network computing output = softmax(W·x + …) from embedded inputs]

  22. Code Snippets
 • Most common machine learning libraries; we will use Tensorflow in this talk [Abadi et al. 2016]
 • Build a computation graph of the neural network in Python
 • Statically compile the graph into C++/CUDA
 • Set up training data for each input/output variable
 • Execute the graph with the data

 import tensorflow as tf

  23. Encodings
 • Neural networks work on various kinds of inputs and outputs, but differentiable operations work on real numbers
 • Transform raw inputs into a suitable representation (e.g., "I am a student" / "Je suis un étudiant" mapped to word indices)
 • Fixed vocabulary V; encode words uniquely
 • Naïve encoding: each word is its index (0, 1, 2, …). Problem?
 • One-hot encoding: the typical encoding for categorical data

  24. One-Hot Encoding
 • The one-hot encoding of word i is a |V|-length vector where all elements are 0 except a 1 at index i
 Word      | One-hot encoding
 word 0    | [ 1, 0, 0, … 0 ]
 word 1    | [ 0, 1, 0, … 0 ]
 word |V|-1 | [ 0, 0, 0, … 1 ]
 • Pros/cons:
 + Easy to encode; no unintended relationships between words
 - Length of encoding is affected by vocabulary size and infrequent words
 • All input evidences are assumed to have been converted to their one-hot representations
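A minimal sketch of one-hot encoding in Python; the small vocabulary here is hypothetical, purely for illustration:

 import numpy as np

 # Hypothetical fixed vocabulary of size |V|
 vocab = ["read", "file", "close", "write"]
 word_to_index = {w: i for i, w in enumerate(vocab)}

 def one_hot(word, vocab_size=len(vocab)):
     """Return a vocab_size-length vector: all zeros except a 1 at the word's index."""
     vec = np.zeros(vocab_size)
     vec[word_to_index[word]] = 1.0
     return vec

 print(one_hot("file"))  # [0. 1. 0. 0.]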

  25. Feed-Forward Neural Network
 • A simple architecture of a "cell" (Tensorflow term)
 • Signal flows from input to output
 • Real-valued weight and bias matrices W and b: y = σ(W·x + b), where σ is an "activation function"
 [Figure: x → (∗ W) → (+ b) → σ → y]

  26. Activation Functions
 • Non-linear functions that decide the output format of a cell
 • Sigmoid, output between 0 and 1: σ(x) = 1 / (1 + e^(−x))
 • tanh, output between -1 and 1
 • Rectified Linear Unit (ReLU), output between 0 and ∞: ReLU(x) = max(0, x)

  27. Implementing a FFNN: y = σ(W·x + b)

 # input_size: size of input vocabulary
 # output_size: size of output as needed
 x = tf.placeholder(tf.float32, [1, input_size])
 W = tf.get_variable('W', [input_size, output_size])
 b = tf.get_variable('b', [output_size])
 # y = sigmoid(x.W + b)
 y = tf.sigmoid(tf.add(tf.matmul(x, W), b))

  28. Hidden Layers
 • The notion of "internal state" can be implemented through hidden layers

 # num_units: number of units in the hidden layer
 ...
 W_h = tf.get_variable('W_h', [input_size, num_units])
 b_h = tf.get_variable('b_h', [num_units])
 h = tf.sigmoid(tf.add(tf.matmul(x, W_h), b_h))
 W = tf.get_variable('W', [num_units, output_size])
 b = tf.get_variable('b', [output_size])
 y = tf.sigmoid(tf.add(tf.matmul(h, W), b))

  29. Stacking hidden layers
 • Forms the "deep" in deep learning
 [Figure: x → (∗ W1) → (+ b1) → σ → h1 → (∗ W2) → (+ b2) → σ → h2 → … → y]
 • Weights/biases can be shared (all Wi and bi are the same)
 • A design choice that leads to different architectures

  30. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  31. Recurrent Neural Network
 • RNNs model sequences of things
 • Assume an input sequence x1, …, xn and an output sequence y1, …, yn
 • RNNs have a notion of hidden state across "time steps"
 • A feedback loop updates the hidden state at each step
 [Figure: the RNN unrolled over time steps, mapping x1, …, xn to y1, …, yn]

  32. Recurrent Neural Network
 • Model the hidden state at time step t as a function of the input xt and the previous hidden state h(t-1)
 • Each hidden state encodes the entire history (as permissible by memory) due to the feedback loop
 • Important property: the weights for the hidden state are shared across time steps
 • Most often we do not know the number of time steps a priori
 • Shared weights model the same function being applied at each time step
 • Keeps model parameters tractable and mitigates overfitting
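A sketch of the standard (vanilla) RNN recurrence being described, with weight matrices W, U, V shared across all time steps (the letter choices are assumptions, not the slide's own symbols):

 \[
   h_t = \sigma\!\left( W x_t + U h_{t-1} + b \right), \qquad
   y_t = \mathrm{softmax}\!\left( V h_t + c \right)
 \]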

  33. Implementing an RNN
 • Tensorflow provides an API for RNN cells
 • Configure the type of RNN cell (vanilla, LSTM, etc.)
 • Configure the activation function (sigmoid, tanh, etc.)

 # input: x = [x_1, x_2, ..., x_n]
 # expected output: y_ = [y_1, y_2, ..., y_n]
 # num_units: number of units in the hidden layer
 rnn = tf.nn.rnn_cell.BasicRNNCell(num_units, activation=tf.sigmoid)
 state = tf.zeros([1, rnn.state_size])
 y = []
 for i in range(len(x)):
     output, new_state = rnn(x[i], state)
     state = new_state
     logits = tf.add(tf.matmul(output, W_y), b_y)
     y.append(logits)

  34. RNNs for Program Synthesis
 • Consider a program as a sequence of tokens from a vocabulary of tokens
 Example: void read() throws IOException { ... }  ➝  void, read, LPAREN, RPAREN, throws, …
 • As data is noisy, we typically want to learn a distribution over programs
 • Output programs can be sampled from the learned distribution
 • For a program t1, …, tn where each ti is a token, each token is obtained from the history of preceding tokens
 • The RNN hidden state is capable of handling this history
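A sketch of the token-level factorization this describes (the token symbols t_i are an assumed notation):

 \[
   P(\mathrm{Prog}) \;=\; P(t_1, \ldots, t_n)
     \;=\; \prod_{i=1}^{n} P\!\left( t_i \mid t_1, \ldots, t_{i-1} \right)
 \]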

  35. RNNs for Program Synthesis
 • If we train an RNN to learn P(Prog), we can use it to generate code token-by-token
 • Synthesis strategy: sample a token at time step t and provide it back as input for time t+1
 • No evidence: unconditional program generation
 • No sketches: learning would be difficult
 • Not optimal, but still useful for introducing ML concepts
 [Figure: the RNN unrolled, each sampled output fed back as the next input]

  36. Output Distributions
 • First, we need the RNN output to be a distribution
 • Softmax activation function: converts a V-sized vector of real quantities into a categorical distribution over V classes (see the formula below)
 • Advantages over standard normalization:
 • Handles positive and negative values
 • Implies raw values are in log-space, which is common in MLE

 for i in range(len(x)):
     output, new_state = rnn(x[i], state)
     state = new_state
     logits = tf.add(tf.matmul(output, W_y), b_y)
     y.append(tf.nn.softmax(logits))
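The softmax definition referenced above, for a V-dimensional vector of logits z:

 \[
   \mathrm{softmax}(z)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad i = 1, \ldots, V
 \]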

  37. Loss Functions
 • The RNN we have built would likely not produce the expected outputs immediately
 • For training, define what it means for a model to be bad, and reduce it
 • Loss functions define how bad a model is with respect to the expected outputs in the training data
 • Cross-entropy (categorical)
 • Mean-squared error (real-valued)
 • Cross-entropy measures the distance between two distributions (formula below):
 • p: the ground-truth "distribution" (one-hot encoding)
 • q: the predicted distribution
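The cross-entropy formula implied here, with p the one-hot ground truth and q the predicted distribution; when p is one-hot at index k, it reduces to the negative log-probability of the correct class:

 \[
   H(p, q) \;=\; -\sum_{i=1}^{V} p_i \log q_i
     \;=\; -\log q_k \quad \text{(when } p \text{ is one-hot at } k\text{)}
 \]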

  38. Loss Functions
 • Example: vocabulary size 4; compare the expected (one-hot) output with the predicted distribution and compute the cross-entropy loss
 • The loss for an output sequence is typically the average over the sequence
 • Tensorflow's API has softmax and cross-entropy sequence loss built into a single call

 # expected output: y_ = [y_1, y_2, ..., y_n]
 ...
 for i in range(len(x)):
     output, new_state = rnn(x[i], state)
     state = new_state
     logits = tf.add(tf.matmul(output, W_y), b_y)
     y.append(logits)
 loss = tf.contrib.seq2seq.sequence_loss(y, y_, weights=tf.ones(...))

  39. Loss Functions
 • Tensorflow adds the loss operation to the computation graph
 [Figure: the unrolled RNN with outputs y1, …, yn compared against targets via softmax cross-entropy]

  40. Ingredients for Training
 • Neural Network: a complex architecture to model the generation of outputs from inputs
 • Training Data: ground-truth inputs and outputs
 • Loss Function: a high-dimensional function measuring error w.r.t. the ground truth
 • Gradient Descent: find the point where the function value is minimal

  41. Gradient Descent
 • Optimization algorithm to compute a (local) minimum
 • Iteratively move parameters in the direction of the negative gradient
 • This is why we need differentiable operations

 A single "step" of gradient descent, given a function f(x) and a loss:
   for each parameter p of the function:      # millions of parameters!
     p_grad = 0
     for each data point d in training data:
       g = gradient of loss w.r.t. p for d
       p_grad += g
     p += -p_grad * learning_rate

 How to train neural networks efficiently?
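A minimal, self-contained sketch of the update rule above on a toy problem (fitting a one-parameter linear model with NumPy); the function and data are illustrative, not from the slides:

 import numpy as np

 # Toy data: y ≈ 3x; we fit a single parameter w by gradient descent.
 x = np.array([0.0, 1.0, 2.0, 3.0])
 y = np.array([0.1, 3.2, 5.9, 9.1])

 w = 0.0                 # parameter to learn
 learning_rate = 0.01

 for step in range(200):
     # Loss: mean squared error L(w) = mean((w*x - y)^2)
     grad = np.mean(2 * (w * x - y) * x)   # dL/dw, aggregated over all data points
     w += -grad * learning_rate            # move against the gradient

 print(w)  # converges close to 3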

  42. Stochastic Gradient Descent
 • Stochastic Gradient Descent (SGD) approximates GD
 • Considers only a single data point for each update
 • Takes advantage of the redundancy often present in data
 • Requires more parameter updates, but each iteration is faster
 • In practice, mini-batch gradient descent: use a small number of data points (10-100)

 A single "step" of mini-batch gradient descent, given a function f(x) and a loss:
   for each parameter p of the function:      # still millions of parameters!
     p_grad = 0
     for each data point d in batch:
       g = gradient of loss w.r.t. p for d
       p_grad += g
     p += -p_grad * learning_rate

  43. Backpropagation
 • Reverse-mode automatic differentiation
 • The "magic sauce" of gradient descent & deep learning
 • Automatically computes partial derivatives of every parameter in the NN
 • During optimization, computes gradients in almost the same order of complexity as evaluating the function
 [Figure: the computation graph x → (∗ W) → (+ b) → y, with the output compared against the target to produce the loss L]

  44. Backpropagation
 • Each basic operation is associated with a gradient operation
 • Use the chain rule to compute the derivative of the loss w.r.t. each operation (example below)
 • Efficient: intermediate partial derivatives are computed once and reused
 • During SGD, all parameters can be updated in one sweep
 • The learning rate controls the amount of update
 [Figure: gradients ∂L/∂W and ∂L/∂b flowing backwards through the graph x → (∗ W) → (+ b) → y → L]
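A worked version of the chain-rule example for the small graph above (y = σ(Wx + b) with loss L); this is a standard derivation, not the slide's exact equations:

 \[
   \text{Let } u = Wx + b,\; y = \sigma(u). \qquad
   \frac{\partial L}{\partial W}
     = \frac{\partial L}{\partial y}\,
       \frac{\partial y}{\partial u}\,
       \frac{\partial u}{\partial W}
     = \frac{\partial L}{\partial y}\; \sigma'(u)\; x^{\top},
   \qquad
   \frac{\partial L}{\partial b}
     = \frac{\partial L}{\partial y}\; \sigma'(u)
 \]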

  45. Backpropagation
 A single "step" of gradient descent, given a function f(x) and a loss:
   grad = 0
   for each data point d in batch:
     g = gradient of loss w.r.t. each parameter for d    # millions of parameters!
     grad += g
   backprop_gradients(grad)
 • For RNNs – Backpropagation Through Time (BPTT)
 • The sequence has "indefinite length"; unroll the RNN into a multi-layer FFNN and backprop
 • Problem: due to repeated multiplication, gradients either explode (> 1) or vanish (< 1)
 • In practice, truncated BPTT: build the RNN with a fixed length and backprop up to that length

  46. Training in Tensorflow
 • Add a training operation for the loss function
 • Tensorflow automatically adds the backpropagation operations
 • Create a Tensorflow "session" to initialize variables
 • Feed mini-batches for each iteration as a dictionary

 ...
 y_ = tf.placeholder(tf.int32, [batch_size, rnn_length], ...)
 step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
 with tf.Session() as sess:
     tf.global_variables_initializer().run()
     for epoch in range(50):
         batches = get_mini_batches()
         for (batch_x, batch_y) in batches:
             sess.run(step, feed_dict={x: batch_x, y_: batch_y})

  47. Example: Character-level RNN
 • Training an RNN on Linux source to generate code character-by-character
 • A token-level model may be easier or harder:
 + The character vocabulary (ASCII) is simpler than a token vocabulary
 - A character model could generate malformed keywords (if, while, etc.) but a token model would not
 • Nevertheless, an interesting model to consider as an example
 http://karpathy.github.io/2015/05/21/rnn-effectiveness

  48. Sample output of the character-level RNN:
 static void do_command(struct seq_file *m, void *v) {
   int column = 32 << (cmd[2] & 0x80);
   if (state)
     cmd = (int)(int_state ^ (in_8(&ch->ch_flags) & Cmd) ? 2 : 1);
   else
     seq = 1;
   for (i = 0; i < 16; i++) {
     if (k & (1 << 1))
       pipe = (in_use & UMXTHREAD_UNCCA) + ((count & 0x00000000fffffff8) & 0x000000f) << 8;
     if (count == 0)
       sub(pid, ppc_md.kexec_handle, 0x20000000);
     pipe_set_bytes(i, 0);
   }
   /* Free our user pages pointer to place camera if all dash */
   subsystem_info = &of_changes[PAGE_SIZE];
   rek_controls(offset, idx, &soffset);
   /* Now we want to deliberately put it to device */
   control_check_polarity(&context, val, 0);
   for (i = 0; i < COUNTER; i++)
     seq_puts(s, "policy ");

  49. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  50. Conditional Generative Model
 • RNNs can learn to model the generation of sequences of data, i.e., P(Prog), where Prog is a sequence of tokens/characters
 • For synthesis we need a conditional generative model
 • Can we condition an RNN to generate sequences based on some input?
 • Specifically, can we make an RNN learn P(Y | X)?
 • We can then condition the generation of code on evidence
 • Encoder-Decoder architecture: often used in Neural Machine Translation (NMT), e.g., Google Translate

  51. Encoder-Decoder Architecture
 • Key insight: to learn a conditional distribution P(Y | X):
 • Use an encoder network to encode X into a hidden state h
 • Use a decoder network to generate Y from the encoded state
 [Figure: the input X passes through an encoder (W_h, b_h, σ) to produce h, which initializes the decoder RNN that emits y1, …, yn]

  52. Implementing an Encoder-Decoder
 • Simply compute the RNN initial state using the output of an FFNN

 # num_units, num_units_enc, num_units_dec: hidden state/encoder/decoder dimensionality
 ...
 h_enc = tf.sigmoid(tf.add(tf.matmul(x, W_enc), b_enc))
 # transform into hidden state dimensions
 W_h = tf.get_variable('W_h', [num_units_enc, num_units])
 b_h = tf.get_variable('b_h', [num_units])
 h = tf.sigmoid(tf.add(tf.matmul(h_enc, W_h), b_h))
 rnn = tf.nn.rnn_cell.BasicRNNCell(num_units_dec, ...)
 h_dec = tf.sigmoid(tf.add(tf.matmul(h, W_dec), b_dec))
 for i in range(len(y)):
     output, new_h_dec = rnn(y[i], h_dec)
     h_dec = new_h_dec
 ...

  53. Encoder-Decoder Characteristics
 1. The encoder and decoder must be trained together
 • Gradients from the decoder are passed all the way back to the encoder
 2. The hidden state is low-dimensional
 • Compared to the encoder inputs (one-hot) and decoder outputs (softmax)
 [Figure: one-hot inputs X → Encoder → hidden state → Decoder → softmax outputs Y]

  54. Encoder-Decoder Characteristics
 • "Bottleneck" due to (1) and (2)
 • The encoder learns to encode inputs in the most efficient way that is useful for the decoder
 • The hidden state acts as a regularizer – it captures the essence of the inputs that is necessary to produce the right outputs
 • Mitigates overfitting
 • For the synthesis problem:
 • Encoding multiple inputs (evidence): in sequence? Concatenate hidden states? Average?
 • Decoding into trees (sketches): representing structure using a sequence?
 • Inferring the most likely sketch? Is there a principled way to do this?

  55. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  56. Latent Intents
 • Each programming task has an intent Z
 • Example (abstractly): "file reading", "sorting"
 • There is a distribution over intents
 • Since we do not know anything about Z, it is latent; assume a prior P(Z)
 • We have evidence X about the intent: API calls, types, keywords. Example: readLine, swap
 • We have implementations of the intent: sketches Y, abstractions of the implementation
 • Given Z, X and Y are conditionally independent
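A sketch of the factorization this conditional independence gives (X evidence, Y sketch, Z latent intent; the notation follows the published Bayou model and is assumed here):

 \[
   P(X, Y, Z) \;=\; P(Z)\, P(X \mid Z)\, P(Y \mid Z),
   \qquad
   P(Y \mid X) \;=\; \int_Z P(Y \mid Z)\, P(Z \mid X)\, \mathrm{d}Z
 \]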

  57. P(Z | X): Intent from Evidence
 • How should we define P(Z | X)?
 • We can have multiple evidences X = {x1, …, xn}
 • We want each evidence to independently shift our belief on Z
 • Define a generative model of evidence from intent: f(xi) ~ Normal(Z, σ²I), where f is the encoding function
 • Models the assumption that the encoded value of each evidence is a sample from a Normal centered on Z
 • Prior: Z ~ Normal(0, I)
 • With some variance σ² (learned)

  58. P(Z | X): Intent from Evidence
 [Figure: evidences such as FileReader, swap, readLine are encoded as points f(xi) ~ N(Z, σ²I) around the latent intent Z ~ N(0, I)]
 From Normal-Normal conjugacy, the posterior P(Z | X) is itself a Normal distribution.
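A sketch of what that conjugacy gives for n evidences with encodings f(x_1), …, f(x_n), under the prior and likelihood assumed on the previous slide (standard Normal-Normal posterior):

 \[
   Z \sim \mathcal{N}(0, I), \quad f(x_i) \mid Z \sim \mathcal{N}(Z, \sigma^2 I)
   \;\;\Longrightarrow\;\;
   Z \mid x_1, \ldots, x_n \;\sim\;
   \mathcal{N}\!\left(
     \frac{\tfrac{1}{\sigma^2}\sum_{i=1}^{n} f(x_i)}{1 + \tfrac{n}{\sigma^2}},\;
     \frac{1}{1 + \tfrac{n}{\sigma^2}}\, I
   \right)
 \]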

  59. P(Z | X): Intent from Evidence
 [Animation: how the encoder maps evidence (e.g., AlertDialog, BufferedReader) to the latent space posterior]

  60. P(Y | Z): Sketch from Intent
 • A sketch is tree-structured, but RNNs work with sequences
 • Deconstruct the sketch into a set of production paths, based on the production rules in the sketch grammar
 • A path is a sequence of (node, edge-type) pairs, where:
 • the node is a node in the sketch, i.e., a term in the grammar
 • the edge type is the type of edge between a node and the next
 • A sibling edge connects terms in the RHS of the same rule 
 (sequential composition)
 • A child edge connects a term in the LHS with the RHS of a rule 
 (e.g., a loop condition with its body)

  61. P(Y | Z): Sketch from Intent
 4 paths in the sketch, for example:
 1. (try, ), (FR.new(String), ), (BR.new(FR), ), (while, ), (BR.readLine(), ), (skip, )
 2. (try, ), (catch, ), (FNFException, ), (printStackTrace(), )

  62. P(Y | Z): Sketch from Intent
 • Generate a sketch by recursively generating production paths
 • Basic step: given Z and a history of fired rules, what is the distribution over the next rule?
 • The distribution depends on the history and on Z – it is not context-free!
 • Sample a rule and recursively generate the tree
 • Implemented using an RNN: the neural hidden state can encode the history
 • Top-down Tree-Structured RNNs (Zhang et al., 2016)
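A simplified sketch of the factorization this describes, writing r_1, …, r_k for the rules fired along a production path (the indexing over paths is collapsed here for brevity):

 \[
   P(Y \mid Z) \;=\; \prod_{i=1}^{k} P\!\left( r_i \;\middle|\; r_1, \ldots, r_{i-1},\, Z \right)
 \]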

  63. P(Y | Z): Sketch from Intent
 [Figure: a production rule in the sketch grammar, and the distribution on rules that can be fired at a point given the history so far (e.g., 0.7 / 0.3); the history is encoded as a real vector]

  64. Putting it all together…
 • Originally, we were interested in the conditional distribution P(Y | X)
 • The derivation chains: our probabilistic model, the Monte-Carlo definition of expectation, and Jensen's inequality, giving a lower bound for CLE (see the sketch below)
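A sketch of the derivation these steps annotate, under the model above (X evidence, Y sketch, Z latent intent; P(Z | X) from the encoder, P(Y | Z) from the decoder):

 \begin{align*}
 \log P(Y \mid X)
   &= \log \int_Z P(Y \mid Z)\, P(Z \mid X)\, \mathrm{d}Z
     && \text{(from our probabilistic model)} \\
   &= \log \mathbb{E}_{Z \sim P(Z \mid X)}\big[ P(Y \mid Z) \big]
     && \text{(Monte-Carlo definition of expectation)} \\
   &\geq \mathbb{E}_{Z \sim P(Z \mid X)}\big[ \log P(Y \mid Z) \big]
     && \text{(Jensen's inequality: a lower bound for CLE)}
 \end{align*}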

  65. Putting it all together…
 • In English:
 • The encoder encodes evidence X into a distribution over the intent Z
 • A value of Z is sampled from that distribution
 • The decoder decodes Z into a sketch Y
 • Problem?
 [Figure: Evidence X → Encoder → Intent Z → Decoder → Sketch Y (… STOP)]
 Gradients cannot pass through a stochastic (sampling) operation!

  66. Reparameterization
 • Key intuition: all Normal distributions are scaled/shifted versions of N(0, I)
 • Sampling from N(μ, σ²) = sampling ε from N(0, I), multiplying by σ and adding μ
 • Instead of sampling Z ~ N(μ, σ²) directly, get a sample ε ~ N(0, I) and compute Z = μ + σ·ε
 • The encoder produces μ and σ as the parameters of N(μ, σ²)
 • ε is an input to the network, not part of it
 • Gradients can flow through!
 • [Kingma 2014]
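A minimal sketch of the reparameterization trick in the Tensorflow 1.x style used throughout these slides; the names (h_enc, latent_dim, log_sigma_sq) and shapes are assumptions for illustration, not the Bayou implementation:

 import tensorflow as tf

 # latent_dim: dimensionality of the latent intent Z (assumed)
 # h_enc: output of the evidence encoder (assumed to exist, shape [batch, enc_units])
 mu = tf.layers.dense(h_enc, latent_dim)            # mean of N(mu, sigma^2)
 log_sigma_sq = tf.layers.dense(h_enc, latent_dim)  # log-variance, for numerical stability
 sigma = tf.sqrt(tf.exp(log_sigma_sq))

 # epsilon is an input to the network, not a parameter of it
 epsilon = tf.random_normal(tf.shape(mu))

 # Z = mu + sigma * epsilon: a sample from N(mu, sigma^2) through which gradients can flow
 z = mu + sigma * epsilon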

  67. Reparameterization
 [Figure: Evidence X → Encoder → (μ, σ); ε sampled from N(0, I); Intent Z = μ + σ·ε → Decoder → Sketch Y; gradients flow back through the whole path]
 The Gaussian Encoder-Decoder (GED)

  68. What we have covered… • How to implement neural network architectures • Feedforward Neural Network • Recurrent Neural Network • How to build an Encoder-Decoder network for program synthesis • The GED is suited for synthesis, but it is not the only architecture that can be instantiated from the Bayou framework • How neural networks are trained • Gradient descent, backpropagation, reparameterization • Coming up next • How the PL parts interact with the ML parts in BayouSynth and BayouDebug

  69. What we have not covered... • Multi-modal evidences with different modes • API calls, types, keywords, etc. may each have a different variance towards Z • Getting a distribution over the top-k likely sketches instead of sampling a single sketch • Beam search • Top-Down Tree-Structured LSTM network • An architecture for learning tree-structured data • Handling complex evidences such as natural language • One-hot encoding would blow up; we need a more "dense" embedding

  70. Outline • Introduction to the Bayou framework • BayouSynth • Underlying probabilistic model • BayouDebug • Implementing BayouSynth with deep neural networks • Feed-forward Neural Network • Recurrent Neural Network • The Encoder-Decoder architecture • Gaussian Encoder-Decoder • Type-directed synthesis • Implementing BayouDebug • Latent Dirichlet Allocation and Topic-Conditioned RNN • Conclusion

  71. [Architecture diagram, repeated from earlier: the BayouSynth pipeline. A feature extractor pulls evidences and sketches from a corpus of programs; statistical learning with a deep neural network produces a distribution P(Y | X) over sketches given evidence. At inference time, a draft program with evidences (e.g., foo(File f) { /// read file } with keywords "read", "file") is given, a sketch is inferred, and combinatorial search with type-based pruning yields the synthesized program (e.g., foo(File f) { f.read(); f.close(); }).]
