Leveraging a Corpus of Natural Language Descriptions for Program Similarity Meital Zilberstein & Eran Yahav Technion – Israel Institute of Technology Onward! 2016 1 LEVERAGING A CORPUS OF NATURAL LANGUAGE DESCRIPTIONS FOR PROGRAM SIMILARITY - MEITAL ZILBERSTEIN & ERAN YAHAV
Lots of snippets out there
  Sep '16: >19M users, >5.9M registered users, >38M repositories, >12M questions, >19M answers
  And also: Google Code, programming blogs, documentation sites, requirements documents, comments, identifiers, commits, etc.
Similarity: Images vs. Programs
  Code is not organized
  Even simple tasks cannot be accomplished, although solutions for them are steadily improving in other domains
Similarity: Images vs. Programs
  Images already have some solutions
  Find an image somewhere on the web: Google image search
  [Image: Lago di Canzolino, Italy]
Similarity: Images vs. Programs
  With code we still don't know what to do
  [Figure: Program P]
Why are Programs Hard?
  A program is a data transformer
  "infinite data" ≫ "big data"
  Potentially infinite number of runtime behaviors
  Depends on inputs
      from subprocess import call
      cmd_to_run = raw_input()
      call(cmd_to_run.split())
  Infinite code
Why are Programs Hard?
  Print the exact same value
  Both written in Java
  Syntactic difference
      int scale = 100000;
      double x = (double)Math.round(8.912384 * scale) / scale;
      System.out.println(x);

      DecimalFormat df = new DecimalFormat("#0.00000");
      System.out.println(df.format(8.912384));
Syntactic Similarity is not Sufficient
  Two approaches for similarity
  Textual diff
  "There's more than one way to do it" (Perl slogan)
Syntactic Similarity is not Sufficient
      import os
      if os.path.exists(filename):
          print("exist")
      else:
          print("no such file")

      try:
          fh = open(f)
          print "exist"
      except:
          print "no such file"
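To make the limits of a textual diff concrete, here is a minimal sketch that scores the two file-existence checks from this slide with Python's standard `difflib`. The helper name `textual_similarity` and the choice of a character-level ratio are assumptions for illustration, not the measure used in the talk; both snippets are written with `print(...)` so they are plain strings in one dialect.

```python
import difflib

# The two file-existence checks from the slide, held as plain text.
snippet_a = (
    "import os\n"
    "if os.path.exists(filename):\n"
    "    print('exist')\n"
    "else:\n"
    "    print('no such file')\n"
)
snippet_b = (
    "try:\n"
    "    fh = open(f)\n"
    "    print('exist')\n"
    "except:\n"
    "    print('no such file')\n"
)

def textual_similarity(a, b):
    """Hypothetical helper: ratio in [0, 1] of matching characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# A purely textual score: it rewards the shared print statements
# and knows nothing about what either snippet actually does.
score = textual_similarity(snippet_a, snippet_b)
print(round(score, 2))
```

The score only reflects overlapping characters, so two snippets with the same purpose can score low, and two unrelated snippets that share boilerplate can score high.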
Syntactic Similarity is not Sufficient
  Textual diff
  Abstract Syntax Tree diff
      from itertools import permutations
      permutations(["a", "b"])

      from subprocess import call
      call(["ls", "-l"])
  [Figure: abstract syntax tree with nodes Module, Import, Expr, Call, Name, args, List, Str, Str]
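The AST comparison on this slide can be reproduced with Python's standard `ast` module. The sketch below is written for Python 3; the helper `shape` is a hypothetical name, and reducing a tree to its node type names is one simple way to compare structure, assumed here for illustration.

```python
import ast

def shape(node):
    """Hypothetical helper: tree of node type names, ignoring identifiers and literals."""
    return (type(node).__name__,
            [shape(child) for child in ast.iter_child_nodes(node)])

snippet_a = 'from itertools import permutations\npermutations(["a", "b"])\n'
snippet_b = 'from subprocess import call\ncall(["ls", "-l"])\n'

# Both snippets import one name and call it with a list of two strings,
# so their trees have the same structure despite unrelated behavior.
same_shape = shape(ast.parse(snippet_a)) == shape(ast.parse(snippet_b))
print(same_shape)
```

Generating permutations and listing a directory are unrelated tasks, yet a structural comparison cannot tell them apart, which is the slide's argument against relying on syntax alone.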
Cross Language Similarity
  Generation of all possible permutations of a string
  Different algorithms
  Similar functionality
  C
      void permute(const char *s, char *out, int *used, int len, int lev){
          if (len == lev) {
              out[lev] = '\0';
              puts(out);
              return;
          }
          int i;
          for (i = 0; i < len; ++i) {
              if (used[i]) continue;
              used[i] = 1;
              out[lev] = s[i];
              permute(s,out,used,len,lev+1);
              used[i] = 0;
          }
          return;
      }
  PYTHON
      def p(head, tail=''):
          if len(head) == 0:
              print tail
          else:
              for i in range(len(head)):
                  p(head[0:i] + head[i+1:], tail + head[i])
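To check that the two algorithms on this slide really compute the same thing, the sketch below ports both to Python 3. This is an adaptation, not the slide's code: the function names `permute_marking` and `permute_slicing` are hypothetical, and both versions collect results in a list instead of printing so that their outputs can be compared.

```python
from itertools import permutations

def permute_marking(s):
    """Port of the C version: fill an output buffer, marking used positions."""
    out = [""] * len(s)
    used = [False] * len(s)
    results = []

    def rec(lev):
        if lev == len(s):
            results.append("".join(out))
            return
        for i in range(len(s)):
            if used[i]:
                continue
            used[i] = True
            out[lev] = s[i]
            rec(lev + 1)
            used[i] = False

    rec(0)
    return results

def permute_slicing(head, tail=""):
    """Port of the Python version: move one character from head to tail."""
    if len(head) == 0:
        return [tail]
    results = []
    for i in range(len(head)):
        results.extend(permute_slicing(head[:i] + head[i + 1:], tail + head[i]))
    return results

# Different algorithms, same functionality.
reference = sorted("".join(p) for p in permutations("abc"))
print(sorted(permute_marking("abc")) == sorted(permute_slicing("abc")) == reference)
```

Comparing outputs works for this small example, but it does not scale to arbitrary programs, which is why the talk turns to natural-language descriptions instead.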
Our approach (simplified)
Semantic Relatedness
  First appeared in the NLP domain
  Finer case of semantic similarity (is-a)
  Can be established across different parts of speech
  Based on functionality
  Quantitative similarity
  Inclusion, reversal
  Equivalent? NO! Semantic relatedness
      import random
      print random.randint(min, max)

      public static int getRandom(int min, int max){
          Random rn = new Random();
          int range = max - min + 1;
          return rn.nextInt(range) + min;
      }
Code Similarity Applications
  Code similarity is a central challenge in many programming-related applications, such as:
  Semantic code search
  Automatic translation
  Education
  "I know how to get tomorrow's date in Java, it's easy!"
      Date d1 = new Date();
      Date d2 = new Date();
      d2.setTime(d1.getTime() + 1*24*60*60*1000);
  "PHP though..."
      define('DATETIME_FORMAT', 'y-m-d H:i');
      $time = date(DATETIME_FORMAT, strtotime("+1 day", $time));
Automatic Tagging of Snippets
  Predict a set of textual labels
  Semantics of the code fragment
  Long-term goal: produce natural-language summaries for code snippets
      int foo = Integer.parseInt("1234");
  Tags: string, int, converting
Overview
Leveraging Collective Knowledge
  Stack Overflow
  Community question-answering site
  Programming-related questions
  Each question is associated with a title, content, and tags
  Implicit mapping between code fragments and their descriptions
[Figure: annotated question page, with labels: title, question, tags, votes, answers, code]
Know your limits!
  This work presents a radical departure from common approaches
  Challenge: find representatives in the precomputed database
  The results are biased by the quality of the database
  We show that this approach is feasible for snippets that serve a common purpose
The Importance of Data
  [Plot: % matches (y-axis, 0 to 12) versus log2(DB size) (x-axis, 9 to 17)]
Data Coverage
  "Although the number of legal statements in the language is theoretically infinite, the number of practically useful statements is much smaller, and potentially finite." (Study of the Uniqueness of Source Code, Gabel et al.)
  Software is usually an aggregation of much smaller parts
  Code is repetitive and predictable
  Syntactic similarity
Going Back to our Example
Text Similarity
  Python code partial description: "How to generate all permutations of a list in Python?"
  C code partial description: "Generating list of all possible permutations of a string"
  Similarity score ≈ 0.8
Text Processing
  Input: generating list of all possible permutations of a string in c ?
  Removing stop-words & punctuation: generating list possible permutations string
  Lemmatization: generate list possible permutation string
  Trained model (1M docs)
  Vector space model: w(1) w(2) w(3) ... w(n-1) w(n)
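The first two steps of this pipeline can be sketched in a few lines of Python. Everything here is a toy stand-in: the `STOP_WORDS` set and the `LEMMAS` table are hypothetical and were chosen only so that the slide's example sentence comes out as shown; a real pipeline would use a full stop-word list and a proper lemmatizer.

```python
import re

# Hypothetical stop-word list, just large enough for the slide's example.
STOP_WORDS = {"of", "all", "a", "in", "c"}

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"generating": "generate", "permutations": "permutation"}

def preprocess(text):
    """Lowercase, drop punctuation and stop-words, then lemmatize each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]

title = "generating list of all possible permutations of a string in c ?"
print(" ".join(preprocess(title)))
```

The resulting token list is what would then be mapped to a weight vector by the trained model.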
Models – tf.idf
  $\mathrm{tf.idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$
  tf: term frequency; idf: inverse document frequency
  Each cell (term weight) is:
  Higher when the term occurs many times
  Lower when the term occurs in many documents
  [Figure: worked tf.idf example over a train set (Doc 1, Doc 2) and a wanted document, showing per-document term-count tables, a term/idf table with smoothing (permutation, generate, and sort at ~0.3), and the resulting tf × idf vector over the terms list, string, generate, permutation, sort]
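As a rough illustration of the tf.idf weighting, the sketch below computes a weight vector for a query document against a tiny training set. Several things are assumptions rather than details from the talk: the training documents are made up, idf is taken as log10(N / df), unseen terms get weight 0 (the slide's smoothing step is not modeled), and cosine similarity is used as the comparison measure.

```python
import math
from collections import Counter

def idf(term, docs):
    """Assumed form: log10(N / df); 0.0 for terms unseen in the train set."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tfidf_vector(doc, docs, vocab):
    """One weight per vocabulary term: raw term count times idf."""
    counts = Counter(doc)
    return [counts[t] * idf(t, docs) for t in vocab]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

# Hypothetical train set of two already-preprocessed documents.
train = [
    ["list", "list", "string", "generate"],
    ["list", "string", "sort", "sort", "sort"],
]
vocab = ["list", "string", "generate", "permutation", "sort"]
query = ["generate", "list", "possible", "permutation", "string"]

vec = tfidf_vector(query, train, vocab)
print([round(w, 3) for w in vec])
```

Terms that appear in every training document ("list", "string") get weight 0, so only the more distinctive terms contribute to the comparison.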