Leveraging a Corpus of Natural Language Descriptions for Program Similarity


  1. Leveraging a Corpus of Natural Language Descriptions for Program Similarity. Meital Zilberstein & Eran Yahav, Technion – Israel Institute of Technology. Onward! 2016

  2. Lots of snippets out there
     - As of Sep '16: >19M users, >5.9M registered users, >38M repositories, >12M questions, >19M answers
     - And also: Google Code, programming blogs, documentation sites, requirements documents, comments, identifiers, commits, etc.

  3. Similarity: Images vs. Programs
     - Code is not organized
     - We cannot accomplish even simple tasks (tasks that are handled increasingly well in other domains)

  4. Similarity: Images vs. Programs
     - Images already have some solutions
     - Find it somewhere on the web: Google image search (example: Lago di Canzolino, Italy)

  5. Similarity: Images vs. Programs
     - With code, given a program P, we still don't know what to do

  6. Why are Programs Hard?
     - A program is a data transformer
     - "Infinite data" ≫ "big data"
     - Potentially infinite number of runtime behaviors
     - Depends on inputs
     - Infinite code:

         # Python 2: runs whatever command the user types
         from subprocess import call
         cmd_to_run = raw_input()
         call(cmd_to_run.split())

  7. Why are Programs Hard?
     - Both snippets print the exact same value
     - Both are written in Java
     - The difference is syntactic

         // Snippet 1: round by scaling
         int scale = 100000;
         double x = (double) Math.round(8.912384 * scale) / scale;
         System.out.println(x);

         // Snippet 2: round by formatting
         DecimalFormat df = new DecimalFormat("#0.00000");
         System.out.println(df.format(8.912384));

  8. Syntactic Similarity is not Sufficient
     - Two approaches for similarity
     - Textual diff
     - "There's more than one way to do it" (Perl slogan)

  9. Syntactic Similarity is not Sufficient

         # Snippet 1: ask the filesystem whether the file exists
         import os
         if os.path.exists(filename):
             print("exist")
         else:
             print("no such file")

         # Snippet 2: try to open the file and handle the failure
         try:
             fh = open(f)
             print("exist")
         except IOError:
             print("no such file")
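
To make the point about textual diff concrete, here is a minimal sketch that scores the two file-existence snippets with Python's standard `difflib`. The constant names and the `textual_similarity` helper are illustrative assumptions, not part of the talk, and `difflib` stands in for whatever diff tool one might use.

```python
import difflib

# The two snippets from the slide, normalized so that both are valid Python.
CHECK_WITH_OS = '''import os
if os.path.exists(filename):
    print("exist")
else:
    print("no such file")
'''

CHECK_WITH_OPEN = '''try:
    fh = open(f)
    print("exist")
except IOError:
    print("no such file")
'''


def textual_similarity(a, b):
    """Return difflib's character-level similarity ratio, a float in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# The score reflects shared characters only, not shared behavior.
print(round(textual_similarity(CHECK_WITH_OS, CHECK_WITH_OPEN), 2))
```

Whatever the exact score, it comes from overlapping text such as the `print` calls, so it says little about whether the two snippets do the same job.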

  10. Syntactic Similarity is not Sufficient
      - Textual diff
      - Abstract Syntax Tree diff

          # Snippet 1
          from itertools import permutations
          permutations(["a", "b"])

          # Snippet 2
          from subprocess import call
          call(["ls", "-l"])

      - AST node labels shown on the slide: Module, Import, Expr, Call, Name, args, List, Str, Str
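
The following sketch uses Python's standard `ast` module to show why an AST diff can be misleading for the two snippets on this slide: once identifiers and literals are ignored, their trees have the same shape. The `node_types` helper and the constant names are assumptions made for illustration.

```python
import ast


def node_types(source):
    """List the AST node type names of `source` in the order ast.walk yields them."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


SNIPPET_A = 'from itertools import permutations\npermutations(["a", "b"])\n'
SNIPPET_B = 'from subprocess import call\ncall(["ls", "-l"])\n'

# Same node types in the same positions, although one snippet computes
# permutations and the other runs a shell command.
same_shape = node_types(SNIPPET_A) == node_types(SNIPPET_B)
print(same_shape)  # True
```

A structural comparison that drops names and constants would therefore treat the two snippets as identical, even though their functionality is unrelated.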

  11. Cross Language Similarity
      - Generation of all possible permutations of a string
      - Different algorithms
      - Similar functionality
      - Is the C snippet similar to the Python snippet?

      C:

          void permute(const char *s, char *out, int *used, int len, int lev) {
              if (len == lev) {
                  out[lev] = '\0';
                  puts(out);
                  return;
              }
              int i;
              for (i = 0; i < len; ++i) {
                  if (used[i]) continue;
                  used[i] = 1;
                  out[lev] = s[i];
                  permute(s, out, used, len, lev + 1);
                  used[i] = 0;
              }
              return;
          }

      Python:

          def p(head, tail=''):
              if len(head) == 0:
                  print(tail)
              else:
                  for i in range(len(head)):
                      p(head[0:i] + head[i+1:], tail + head[i])

  12. Our approach (simplified)

  13. Semantic Relatedness
      - First appeared in the NLP domain
      - Finer case of semantic similarity (is-a)
      - Can be established across different parts of speech
      - Based on functionality
      - Quantitative similarity
      - Semantic relatedness: inclusion, reversal
      - Are the two snippets below equivalent? No!

      Python:

          import random
          print(random.randint(min, max))

      Java:

          public static int getRandom(int min, int max) {
              Random rn = new Random();
              int range = max - min + 1;
              return rn.nextInt(range) + min;
          }

  14. Code Similarity Applications
      - Code similarity is a central challenge in many programming-related applications, such as:
        - Semantic code search
        - Automatic translation
        - Education
      - "I know how to get tomorrow's date in Java, it's easy! PHP though..."

      Java:

          Date d1 = new Date();
          Date d2 = new Date();
          d2.setTime(d1.getTime() + 1*24*60*60*1000);

      PHP:

          define('DATETIME_FORMAT', 'y-m-d H:i');
          $time = date(DATETIME_FORMAT, strtotime("+1 day", $time));

  15. Automatic Tagging of Snippets
      - Predict a set of textual labels
      - Semantics of the code fragment
      - Long-term goal: produce natural-language summaries for code snippets

          int foo = Integer.parseInt("1234");

      - Tags: string, int, converting

  16. Overview

  17. Leveraging Collective Knowledge
      - Stack Overflow
        - Community question-answering site
        - Programming-related questions
        - Each question is associated with a title, content, and tags
      - Implicit mapping between code fragments and their descriptions

  18. [Figure: a Stack Overflow question with its parts labeled: title, question, tags, votes, answers, code]

  19. Know your limits!
      - This work presents a radical departure from common approaches
      - Challenge: find representatives in the precomputed database
      - The results are biased by the quality of the database
      - We show that this approach is feasible for snippets that serve a common purpose

  20. The Importance of Data
      - [Figure: % matches (y-axis, 0 to 12) plotted against log₂(DB size) (x-axis, 9 to 17)]

  21. Data Coverage
      - "Although the number of legal statements in the language is theoretically infinite, the number of practically useful statements is much smaller, and potentially finite." (A Study of the Uniqueness of Source Code, Gabel et al.)
      - Software is usually an aggregation of much smaller parts
      - Code is repetitive and predictable
      - Syntactic similarity

  22. Going Back to our Example

  23. Text Similarity
      - Python code, partial description: "How to generate all permutations of a list in Python?"
      - C code, partial description: "Generating list of all possible permutations of a string"
      - Similarity score ≈ 0.8

  24. Text Processing
      - Input: "generating list of all possible permutations of a string in c ?"
      - Removing stop-words and punctuation: "generating list possible permutations string"
      - Lemmatization: "generate list possible permutation string"
      - Vector space model, using a model trained on 1M docs: w(1) w(2) w(3) ... w(n-1) w(n)
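
The preprocessing steps on this slide can be sketched in a few lines of Python. The stop-word set and the lemma table below are toy assumptions, chosen only so that the slide's example sentence comes out the same way; a real pipeline would use a full stop-word list and a proper lemmatizer, and the talk does not say which ones were used.

```python
import string

# Toy resources (assumptions for illustration, not the talk's actual lists).
STOP_WORDS = {"of", "all", "a", "in", "c", "to", "how", "the"}
LEMMAS = {"generating": "generate", "permutations": "permutation"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stop-words, then lemmatize."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


title = "generating list of all possible permutations of a string in c ?"
print(" ".join(preprocess(title)))  # generate list possible permutation string
```

The resulting token list is what a vector space model would then turn into a weight vector w(1) ... w(n).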

  25. Models – tf.idf
      - Term Frequency, Inverse Document Frequency: $\mathrm{tf.idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$
      - Each cell is:
        - Higher when the term occurs many times
        - Lower when the term occurs in many documents
      - [Figure: worked example with term counts for a wanted document and two training documents (Doc 1, Doc 2) over the terms list, string, permutation, generate, sort, and set; idf values are computed on the train set, with smoothing]
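
As a rough illustration of tf.idf weighting followed by a similarity score, here is a self-contained sketch. The slide mentions smoothing but does not give the formula, so the variant used here, log((1 + N) / (1 + df)) + 1, is an assumption, as are the three-document toy corpus, the helper names, and the use of cosine similarity as the comparison measure.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed idf per term: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + count)) + 1 for t, count in df.items()}


def tfidf_vector(doc, idf):
    """Map each term of `doc` to tf * idf; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy training set of already preprocessed descriptions (hypothetical data).
corpus = [
    ["generate", "list", "possible", "permutation", "string"],
    ["generate", "permutation", "list", "python"],
    ["sort", "list", "string"],
]
idf = idf_table(corpus)
vecs = [tfidf_vector(doc, idf) for doc in corpus]

# The two permutation descriptions score closer to each other than to the sort one.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Because "list" appears in every document, it gets the lowest idf, so the shared terms "generate" and "permutation" contribute more to the score.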
