Statistical Analysis of Computer Program Text
Charles Sutton, University of Edinburgh
Source code is a means of human communication
Development “out in the open”
[Figure: count (x1000) per year, 2011 to 2014, of posts (Stack Overflow), pull requests (Github), and repositories (Sourceforge).]
Probabilistic modelling
[Diagram: Problem → Model (family of distributions) and objective function; Data (source files) → Distribution → Do stuff]
• Supervised learning: data $(x_1, y_1) \dots (x_n, y_n)$; predict y from x using $p(y \mid x_{\text{test}})$
• Unsupervised learning: data $x_1 \dots x_n$; “explore” x using $p(z \mid x_{\text{test}})$; inspect the distribution $p(z \mid x_1 \dots x_n)$
Learning Natural Coding Conventions [Allamanis, Barr, Bird, Sutton; FSE 2014]
junit/src/test/java/junit/tests/runner/TextRunnerTest.java

    public class TextRunnerTest extends TestCase {
        void execTest(String testClass, boolean success) throws Exception {
            ...
            InputStream i = p.getInputStream();
            while ((i.read()) != -1);
            ...
        }
        ...
    }
junit/src/test/java/junit/tests/runner/TextRunnerTest.java

    public class TextRunnerTest extends TestCase {
        void execTest(String testClass, boolean success) throws Exception {
            ...
            InputStream i = p.getInputStream();
            while ((i.read()) != -1);
            ...
        }
        ...
    }

• Suggest alternate names for i: input, inputStream, is, stream
• Score and threshold: input (81.93%)
Language Models for Source Code
• Probability distribution over token sequences
• Consider a naive estimator
• In Naturalize: choose the name other programmers use in similar contexts
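To make the idea of "choosing the name other programmers use in similar contexts" concrete, here is a minimal sketch of ranking candidate names with a bigram language model. It is not the Naturalize implementation: the class `BigramNameScorer`, the `$NAME` placeholder, the add-one smoothing, and the tiny training corpus are all assumptions made for illustration, and the sketch only ranks candidates rather than applying a score threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy bigram language model used to rank candidate identifier names (illustrative only). */
final class BigramNameScorer {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Count the bigrams of one tokenized code snippet. */
    void train(List<String> tokens) {
        vocabulary.addAll(tokens);
        for (int k = 1; k < tokens.size(); k++) {
            String prev = tokens.get(k - 1);
            bigramCounts.merge(prev + " " + tokens.get(k), 1, Integer::sum);
            contextCounts.merge(prev, 1, Integer::sum);
        }
    }

    /** Log-probability of a token sequence under the add-one smoothed bigram model. */
    double logProb(List<String> tokens) {
        double total = 0.0;
        for (int k = 1; k < tokens.size(); k++) {
            String prev = tokens.get(k - 1);
            int pair = bigramCounts.getOrDefault(prev + " " + tokens.get(k), 0);
            int context = contextCounts.getOrDefault(prev, 0);
            total += Math.log((pair + 1.0) / (context + vocabulary.size()));
        }
        return total;
    }

    /** Substitute each candidate for the placeholder and return the highest-scoring name. */
    String suggest(List<String> template, String placeholder, List<String> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String name : candidates) {
            List<String> renamed = new ArrayList<>();
            for (String tok : template) {
                renamed.add(tok.equals(placeholder) ? name : tok);
            }
            double score = logProb(renamed);
            if (score > bestScore) {
                bestScore = score;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BigramNameScorer lm = new BigramNameScorer();
        // Hypothetical training snippets standing in for a real code corpus.
        lm.train(Arrays.asList("InputStream", "input", "=", "socket", ".", "getInputStream", "(", ")", ";"));
        lm.train(Arrays.asList("InputStream", "input", "=", "file", ".", "open", "(", ")", ";"));
        lm.train(Arrays.asList("InputStream", "stream", "=", "p", ".", "getInputStream", "(", ")", ";"));
        List<String> template =
            Arrays.asList("InputStream", "$NAME", "=", "p", ".", "getInputStream", "(", ")", ";");
        System.out.println(lm.suggest(template, "$NAME", Arrays.asList("i", "input", "stream")));
    }
}
```

A real system would train on a large corpus with a higher-order, better-smoothed model, and would only surface a suggestion when its score beats the current name by a margin.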
Naming Methods and Classes [Allamanis, Barr, Bird, Sutton; FSE 2015]
Name that Tune Java Method

    private void createDefaultShader() {
        String vertexShader = "literal_1";
        String fragmentShader = "literal_2";
        shader = new ShaderProgram(vertexShader, fragmentShader);
        if (shader.isCompiled() == false)
            throw new IllegalArgumentException("literal_3" + shader.getLog());
    }

Figure 1: This method is from libgdx’s CameraGroupStrategy.
libgdx: “Desktop/Android/Blackberry/iOS/HTML5 Java game development framework” http://libgdx.badlogicgames.com
Embedding Identifiers
Log Bilinear Context Model [Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

\[ P(t \mid c_{1:m}) = \frac{\exp\{ s_\theta(t, c_{1:m}) \}}{\sum_{t'} \exp\{ s_\theta(t', c_{1:m}) \}} \]

\[ s_\theta(t, c_{1:m}) = q_t^\top \hat{r}_c + b_t \]

• c = (private, void, (, ), {, String, vertexShader, =, “literal_1”, ;, String, …)
• t = createDefaultShader
• $q_v \in \mathbb{R}^D$ are “embeddings”: model parameters
• What about $\hat{r}_c$? More complex, we need to summarize many tokens

[Figure: embedding space showing $q_{\text{createDefaultShader}}$, $q_{\text{hashCode}}$, and $\hat{r}_c$.]
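The scoring rule on this slide can be sketched in a few lines: compute a score for each candidate name as the dot product of its embedding with a context vector plus a bias, then normalize with a softmax. The slide leaves the construction of the context vector open, so this sketch assumes the simplest possible summary, an average of per-token context embeddings; that choice, the class name `LogBilinearSketch`, and the hand-set toy embeddings are assumptions for illustration, not the model from the paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal log-bilinear scoring sketch with hand-set embeddings (illustrative only). */
final class LogBilinearSketch {

    /** Dot product of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += a[d] * b[d];
        }
        return sum;
    }

    /** Assumed context summary: the average of the context tokens' embeddings. */
    static double[] contextVector(List<String> context, Map<String, double[]> r, int dim) {
        double[] mean = new double[dim];
        int used = 0;
        for (String tok : context) {
            double[] v = r.get(tok);
            if (v == null) {
                continue; // tokens without an embedding are skipped
            }
            for (int d = 0; d < dim; d++) {
                mean[d] += v[d];
            }
            used++;
        }
        for (int d = 0; d < dim && used > 0; d++) {
            mean[d] /= used;
        }
        return mean;
    }

    /** Softmax over s(t, c) = q_t . r_c + b_t for every candidate target t. */
    static Map<String, Double> distribution(double[] rc, Map<String, double[]> q,
                                            Map<String, Double> b) {
        Map<String, Double> scores = new LinkedHashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : q.entrySet()) {
            double s = dot(e.getValue(), rc) + b.getOrDefault(e.getKey(), 0.0);
            scores.put(e.getKey(), s);
            max = Math.max(max, s);
        }
        double z = 0.0;
        for (double s : scores.values()) {
            z += Math.exp(s - max); // subtract the max for numerical stability
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / z);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional embeddings; a trained model would learn these.
        Map<String, double[]> q = new HashMap<>();
        q.put("createDefaultShader", new double[] {1.0, 0.0});
        q.put("hashCode", new double[] {0.0, 1.0});
        Map<String, double[]> r = new HashMap<>();
        r.put("vertexShader", new double[] {2.0, 0.0});
        r.put("String", new double[] {0.0, 0.0});
        double[] rc = contextVector(Arrays.asList("String", "vertexShader"), r, 2);
        System.out.println(distribution(rc, q, new HashMap<>()));
    }
}
```

Averaging ignores token order and position, which is one reason the slide describes the context summary as the more complex part of the model.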
Mining Idioms from Code [Allamanis and Sutton; FSE 2014]
Mined Idioms (General Java)
• Iterate through the elements of an Iterator
• Creating a logger for a class
• Looping through lines from a BufferedReader
• Defining a String constant
Mined Idioms (Library-Specific)
• Database transaction in Neo4j
• Get an HTML Document in jsoup
• Get the distance between two points in Android
• Show a small popup in Android
Model: Tree substitution grammars
Mining API Patterns
[Fowkes and Sutton; NIPS WS 2014]
http://arxiv.org/abs/1510.04130
Example pattern: TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken
API patterns from code
[Figure: top API call patterns mined from Twitter client code, in three columns, one per method: MAPO [Zhong et al, 2009], UP-Miner [Wang et al, 2013], and IIM (actually a slight extension). The patterns are built from calls such as TwitterFactory.<init>, TwitterFactory.getInstance, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken, Twitter.updateStatus, ConfigurationBuilder.<init>, ConfigurationBuilder.setOAuthConsumerKey, ConfigurationBuilder.setOAuthConsumerSecret, ConfigurationBuilder.setDebugEnabled, ConfigurationBuilder.build, Status.getUser, Status.getText, Status.getId, User.getId, User.getScreenName, AccessToken.getToken, and AccessToken.getTokenSecret.]
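Model

To sample a transaction $X^{(j)}$, $j \in 1, \dots, m$:
1. For each itemset $S \in \mathcal{I}$, sample $z_S^{(j)} \sim \mathrm{Bernoulli}(\pi_S)$.
2. Deterministically set $X^{(j)} = \bigcup_{S \,:\, z_S^{(j)} = 1} S$.

Parameters:
• $\mathcal{I}$: collection of “interesting” itemsets
• $\pi_S \in [0, 1]$ for each $S \in \mathcal{I}$: probability of occurrence

[Diagram: plate model with $\pi_S$ and $z_S^{(j)}$ for $S \in \mathcal{I}$, and $X^{(j)}$ for $j \in 1, \dots, m$.]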
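The two-step generative process on this slide translates directly into a sampler. The sketch below draws one Bernoulli indicator per itemset and returns the union of the selected itemsets; the class name `ItemsetModelSketch`, the example itemsets, and their probabilities are hypothetical, and inference (finding the interesting itemsets from observed transactions) is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Sampler for the itemset generative model described on the slide (illustrative only). */
final class ItemsetModelSketch {

    /**
     * Sample one transaction: include each itemset S independently with
     * probability pi_S, then return the union of the included itemsets.
     */
    static Set<String> sampleTransaction(Map<Set<String>, Double> pi, Random rng) {
        Set<String> transaction = new TreeSet<>();
        for (Map.Entry<Set<String>, Double> e : pi.entrySet()) {
            boolean z = rng.nextDouble() < e.getValue(); // z_S ~ Bernoulli(pi_S)
            if (z) {
                transaction.addAll(e.getKey()); // union over all S with z_S = 1
            }
        }
        return transaction;
    }

    public static void main(String[] args) {
        // Hypothetical itemsets and occurrence probabilities.
        Map<Set<String>, Double> pi = new LinkedHashMap<>();
        pi.put(new HashSet<>(Arrays.asList("TwitterFactory.<init>", "TwitterFactory.getInstance")), 0.8);
        pi.put(new HashSet<>(Arrays.asList("Status.getUser", "Status.getText")), 0.3);
        Random rng = new Random(0);
        for (int j = 0; j < 3; j++) {
            System.out.println(sampleTransaction(pi, rng));
        }
    }
}
```

Because items from different itemsets are merged into one transaction, the indicators are latent: observing a transaction does not directly reveal which itemsets produced it.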
Stepping Back
• Local conventions (naming, formatting): ngram models
• Mining idioms: probabilistic grammars
• Method naming: word embeddings
• Itemset mining (e.g. TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken): latent-variable modelling

Thanks!
• Miltiadis Allamanis
• Chris Bird, MSR
• Jaroslav Fowkes
• Earl Barr, UCL
• Hao Peng
Key concepts in probabilistic modelling
• Sufficiency
  • what statistics of the data am I memorizing?
• Latent variables, e.g.,
  • what tree macros were used to generate AST?
  • what item sets were used in a transaction?
Why patterns in software?
• Orthogonal interfaces: tools that “do one thing well” need to be combined well
• Surface-semantic correspondence: semantics available from glancing rather than reading

    void addOne(int[] arr) {
        for (int i = 0; i < arr.length; i++) {
            arr[i] += 1;
        }
    }

    void foo(int[] bar) {
        int baz = 0;
        while (true) {
            bar[baz] = bar[baz] + 1;
            baz = baz + 1;
            if (baz >= bar.length) break;
        }
    }

Natural code: Code with good correspondence?
A new type of program analysis

Static analysis
• Construct program abstraction (loses information)
• Why abstract: exact decision is undecidable for Turing-complete languages
• Then logical inference

Statistical analysis
• Construct program abstraction (loses information)
• Why: data sparsity, inductive bias
• Then statistical inference

“Semantic retreat”
• NLP → statistical NLP
• PL analysis → statistical PL analysis