
Statistical Analysis of Computer Program Text, Charles Sutton



  1. Statistical Analysis of Computer Program Text Charles Sutton University of Edinburgh

  2. Source code is a means of human communication

  3. Development “out in the open” [Figure: count (x1000) per year, 2011 to 2014, of posts (Stack Overflow), pull requests (Github), and repositories (Sourceforge)]

  4. Probabilistic modelling
     [Diagram labels: Problem, Model (family of distributions), (objective function), Data (source files), Distribution, Do stuff]
     Supervised learning: data $(x_1, y_1) \dots (x_n, y_n)$; predict $y$ from $x$: $p(y \mid x_{\text{test}})$
     Unsupervised: data $x_1 \dots x_n$; “explore” $x$: $p(z \mid x_{\text{test}})$; inspect distribution $p(z \mid x_1 \dots x_n)$

  5. Learning Natural Coding Conventions [Allamanis, Barr, Bird, Sutton; FSE 2014]

  6. junit/src/test/java/junit/tests/runner/TextRunnerTest.java
    public class TextRunnerTest extends TestCase {
        void execTest(String testClass, boolean success) throws Exception {
            ...
            InputStream i = p.getInputStream();
            while ((i.read()) != -1);
            ...
        }
        ...
    }


  7. junit/src/test/java/junit/tests/runner/TextRunnerTest.java
    public class TextRunnerTest extends TestCase {
        void execTest(String testClass, boolean success) throws Exception {
            ...
            InputStream i = p.getInputStream();
            while ((i.read()) != -1);
            ...
        }
        ...
    }
    Suggest alternate names for i: input, inputStream, is, stream
    Score and threshold: input (81.93%)

  8. Language Models for Source Code: a probability distribution over token sequences. Consider the naive estimator. In Naturalize: choose the name other programmers use in similar contexts.
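The formulas for this slide are not part of the transcript. As a sketch only, assuming the standard n-gram formulation (an assumption, not confirmed by the slide text), a language model over a token sequence $t_1, \dots, t_N$ and its count-based ("naive") estimator can be written as follows, where $c(\cdot)$ denotes how often a token sequence occurs in the training corpus:

```latex
P(t_1, \dots, t_N) = \prod_{i=1}^{N} P(t_i \mid t_1, \dots, t_{i-1})
                   \approx \prod_{i=1}^{N} P(t_i \mid t_{i-n+1}, \dots, t_{i-1})

\hat{P}(t_i \mid t_{i-n+1}, \dots, t_{i-1})
    = \frac{c(t_{i-n+1}, \dots, t_{i})}{c(t_{i-n+1}, \dots, t_{i-1})}
```

Under this reading, a candidate identifier name scores well when the token sequences it produces in its surrounding context have been seen often in other code, which matches the slide's "name other programmers use in similar contexts".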

  9. Naming Methods and Classes [Allamanis, Barr, Bird, Sutton; FSE 2015]

  10. Name that Tune Java Method
    private void createDefaultShader() {
        String vertexShader = "literal_1";
        String fragmentShader = "literal_2";
        shader = new ShaderProgram(vertexShader, fragmentShader);
        if (shader.isCompiled() == false)
            throw new IllegalArgumentException("literal_3" + shader.getLog());
    }
    Figure 1: This method is from libgdx’s CameraGroupStrategy. libgdx: “Desktop/Android/Blackberry/iOS/HTML5 Java game development framework” http://libgdx.badlogicgames.com

  11. Embedding Identifiers: Log Bilinear Context Model [Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]
      $P(t \mid c_{1:m}) = \frac{\exp\{s_\theta(t, c_{1:m})\}}{\sum_{t'} \exp\{s_\theta(t', c_{1:m})\}}$
      $s_\theta(t, c_{1:m}) = q_t^\top \hat{r}_c + b_t$
      $c$ = (private, void, (, ), {, String, vertexShader, =, “literal_1”, ;, String, …)
      $t$ = createDefaultShader
      $q_v \in \mathbb{R}^D$ are “embeddings” (model parameters) [Figure: embedding vectors $q_{\text{createDefaultShader}}$ and $q_{\text{hashCode}}$]
      What about $\hat{r}_c$? More complex, we need to summarize many tokens
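The scoring rule on this slide can be sketched in a few lines of Java. This is a minimal illustration, not the model from the talk: the context summary is taken to be the plain average of context-token embeddings (an assumption; the slide says the real summary is more complex), and the class name `LogBilinearSketch`, its method names, and all numbers are hypothetical.

```java
import java.util.List;

/** Toy log-bilinear scorer; the averaging context summary is an assumption. */
public class LogBilinearSketch {

    /** Summarizes the context as the mean of its token embeddings (assumed, not from the slide). */
    static double[] contextVector(List<double[]> contextEmbeddings) {
        int d = contextEmbeddings.get(0).length;
        double[] r = new double[d];
        for (double[] e : contextEmbeddings) {
            for (int k = 0; k < d; k++) {
                r[k] += e[k] / contextEmbeddings.size();
            }
        }
        return r;
    }

    /** Score s(t, c) = q_t . r_c + b_t. */
    static double score(double[] q, double[] r, double b) {
        double s = b;
        for (int k = 0; k < q.length; k++) {
            s += q[k] * r[k];
        }
        return s;
    }

    /** Softmax of the scores over all candidate targets, giving P(t | c) for each t. */
    static double[] distribution(double[][] targetEmbeddings, double[] biases, double[] r) {
        int n = targetEmbeddings.length;
        double[] scores = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < n; t++) {
            scores[t] = score(targetEmbeddings[t], r, biases[t]);
            max = Math.max(max, scores[t]);
        }
        double z = 0.0;
        double[] p = new double[n];
        for (int t = 0; t < n; t++) {
            p[t] = Math.exp(scores[t] - max); // subtract the max for numerical stability
            z += p[t];
        }
        for (int t = 0; t < n; t++) {
            p[t] /= z;
        }
        return p;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional embeddings for two candidate names.
        double[][] q = { {1.0, 0.0}, {0.0, 1.0} }; // createDefaultShader, hashCode
        double[] b = { 0.0, 0.0 };
        double[] r = contextVector(List.of(new double[] {2.0, 0.0}, new double[] {0.0, 0.0}));
        double[] p = distribution(q, b, r);
        System.out.println("P(createDefaultShader | c) = " + p[0]);
        System.out.println("P(hashCode | c) = " + p[1]);
    }
}
```

The candidate whose embedding points in the same direction as the context vector receives the higher probability; in a trained model both sets of embeddings and the biases would be learned rather than set by hand.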

  12. Mining Idioms from Code [Allamanis and Sutton; FSE 2014]

  13. Mined Idioms (General Java)
      • Iterate through the elements of an Iterator
      • Creating a logger for a class
      • Looping through lines from a BufferedReader
      • Defining a String constant

  14. Mined Idioms (Library-Specific)
      • Database transaction in Neo4j
      • Get an HTML Document in jsoup
      • Get the distance between two points in Android
      • Show a small popup in Android

  15. Model: Tree substitution grammars

  16. Mining API Patterns [Fowkes and Sutton; NIPS WS 2014] http://arxiv.org/abs/1510.04130
      TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken

  17. API patterns from code
      [Table: API call patterns mined by three methods, one column each: MAPO [Zhong et al, 2009], UP-Miner [Wang et al, 2013], IIM (actually a slight extension). The column layout is not recoverable. Calls appearing in the patterns: TwitterFactory.<init>, TwitterFactory.getInstance, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken, Twitter.updateStatus, ConfigurationBuilder.<init>, ConfigurationBuilder.setDebugEnabled, ConfigurationBuilder.setOAuthConsumerKey, ConfigurationBuilder.setOAuthConsumerSecret, ConfigurationBuilder.build, Status.getUser, Status.getText, Status.getId, User.getId, User.getScreenName, auth.AccessToken.getToken, auth.AccessToken.getTokenSecret, http.AccessToken.getToken, http.AccessToken.getTokenSecret]

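  18. Model
      Parameters:
      • $\mathcal{I}$: collection of “interesting” itemsets
      • $\pi_S \in [0, 1]$ for each $S \in \mathcal{I}$: probability of occurrence
      To sample a transaction:
      1. For each itemset $S \in \mathcal{I}$, sample $z_S \sim \mathrm{Bernoulli}(\pi_S)$.
      2. Deterministically set $X = \bigcup_{S : z_S = 1} S$.
      [Figure: graphical model $\pi_S \to z_S^{(j)} \to X^{(j)}$, with plates over $S \in \mathcal{I}$ and $j \in 1, \dots, m$]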
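The two sampling steps of this model can be sketched directly in Java. This is a minimal illustration under stated assumptions: the class name `ItemsetSamplerSketch`, the method name, and the example probabilities are hypothetical, and items are represented as plain strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Toy sampler for the itemset generative process; names and numbers are illustrative. */
public class ItemsetSamplerSketch {

    /**
     * Samples one transaction: include each itemset S independently with
     * probability pi_S, then return the union of the included itemsets.
     */
    static Set<String> sampleTransaction(Map<Set<String>, Double> itemsetProbs, Random rng) {
        Set<String> transaction = new TreeSet<>();
        for (Map.Entry<Set<String>, Double> e : itemsetProbs.entrySet()) {
            boolean z = rng.nextDouble() < e.getValue(); // z_S ~ Bernoulli(pi_S)
            if (z) {
                transaction.addAll(e.getKey()); // union of all itemsets with z_S = 1
            }
        }
        return transaction;
    }

    public static void main(String[] args) {
        // Made-up "interesting" itemsets with made-up occurrence probabilities.
        Map<Set<String>, Double> probs = new LinkedHashMap<>();
        probs.put(Set.of("TwitterFactory.<init>", "TwitterFactory.getInstance"), 0.8);
        probs.put(Set.of("Status.getUser", "Status.getText"), 0.3);
        Random rng = new Random(0);
        for (int j = 0; j < 3; j++) {
            System.out.println(sampleTransaction(probs, rng));
        }
    }
}
```

Mining then runs this process in reverse: given observed transactions, infer which itemsets and probabilities would plausibly have generated them. The inference procedure itself is not shown on the slide and is not sketched here.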

  19. Stepping Back

  20. Summary
      • Local conventions (naming, formatting): ngram models
      • Mining idioms: probabilistic grammars
      • Method naming: word embeddings
      • Itemset mining: latent-variable modelling (TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken)
      Thanks! Miltiadis Allamanis • Chris Bird, MSR • Jaroslav Fowkes • Earl Barr, UCL • Hao Peng

  21. Key concepts in probabilistic modelling
      • Sufficiency: what statistics of the data am I memorizing?
      • Latent variables, e.g.,
        • what tree macros were used to generate AST?
        • what item sets were used in a transaction?

  22. Why patterns in software?
      • Orthogonal interfaces: tools that “do one thing well” need to be combined well
      • Surface-semantic correspondence: semantics available from glancing rather than reading
    void addOne(int[] arr) {
        for (int i = 0; i < arr.length; i++) {
            arr[i] += 1;
        }
    }
    void foo(int[] bar) {
        int baz = 0;
        while (true) {
            bar[baz] = bar[baz] + 1;
            baz = baz + 1;
            if (baz >= bar.length) break;
        }
    }
      Natural code: code with good correspondence?

  23. A new type of program analysis
      • Static analysis: construct program abstraction (loses information). Why abstract: exact decision is undecidable for Turing-complete languages. Then logical inference.
      • Statistical analysis: construct program abstraction (loses information). Why: data sparsity, inductive bias. Then statistical inference.
      • “Semantic retreat”: NLP → statistical NLP; PL analysis → statistical PL analysis
