Mining Source Code Repositories at Massive Scale using Language Modeling Miltos Allamanis, Charles Sutton m.allamanis@ed.ac.uk csutton@inf.ed.ac.uk University of Edinburgh Supported by:
Polyglot programmers. Multitude of APIs & libraries. Transfer knowledge from available code.
Why Language Models? ● Statistical models ● Learn from data ● Abundance of code available online ● Non-language specific method [Hindle et al., ICSE 2012]
n-gram Language Models

    public void execute(Runnable task) {
        if (task == null)
            throw new NullPointerException();
        ForkJoinTask<?> job;
        if (task instanceof ForkJoinTask<?>) // avoid re-wrap
            job = (ForkJoinTask<?>) task;
        else
            job = new ForkJoinTask.AdaptedRunnableAction(task);
        externalPush(job);
    }
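An n-gram model treats code such as the method above as a flat token sequence and looks at windows of n consecutive tokens. The following is a minimal sketch of that windowing step only; the class name `NGrams` and the method `extract` are hypothetical, and the tokenization (splitting the code into strings beforehand) is assumed to have been done elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: slides a window of length n over a token sequence. */
final class NGrams {
    /** Returns every run of n consecutive tokens, in order of appearance. */
    static List<List<String>> extract(List<String> tokens, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> grams = new ArrayList<>();
        // The last window starts at index size - n; shorter inputs yield no n-grams.
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(new ArrayList<>(tokens.subList(i, i + n)));
        }
        return grams;
    }
}
```

For the six tokens `if ( task == null )`, a window of n = 3 yields four trigrams, starting with `if ( task`.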
n-gram Language Models. Predictability measures: n-gram Log Probability (NGLP) and Cross-Entropy (H).
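Both measures can be illustrated on a toy model. The sketch below assumes a bigram model (n = 2) with add-one smoothing, chosen only for brevity and not necessarily the setup behind the results in these slides; the class `BigramModel` and the `<s>` start marker are hypothetical. The log probability is taken base 2, and cross-entropy is the negative average log probability per token.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy bigram model; add-one smoothing is an assumption made for brevity. */
final class BigramModel {
    private static final String START = "<s>"; // hypothetical start-of-sequence marker
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Counts each (previous token, token) pair in one token sequence. */
    void train(List<String> tokens) {
        String prev = START;
        for (String token : tokens) {
            vocabulary.add(token);
            counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(token, 1, Integer::sum);
            prev = token;
        }
    }

    /** log2 P(token | prev); the extra +1 in the denominator reserves mass for unseen tokens. */
    double logProb(String prev, String token) {
        Map<String, Integer> following = counts.getOrDefault(prev, new HashMap<>());
        int contextTotal = 0;
        for (int c : following.values()) {
            contextTotal += c;
        }
        double p = (following.getOrDefault(token, 0) + 1.0)
                / (contextTotal + vocabulary.size() + 1.0);
        return Math.log(p) / Math.log(2);
    }

    /** Sum of log2 probabilities over the sequence (higher means more predictable). */
    double sequenceLogProb(List<String> tokens) {
        double total = 0.0;
        String prev = START;
        for (String token : tokens) {
            total += logProb(prev, token);
            prev = token;
        }
        return total;
    }

    /** Cross-entropy H in bits per token: the negative average log2 probability. */
    double crossEntropy(List<String> tokens) {
        return -sequenceLogProb(tokens) / tokens.size();
    }
}
```

A sequence the model has seen often gets a log probability closer to zero and a lower cross-entropy than a sequence of unseen token pairs.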
The Java GitHub Corpus: Java projects with >1 fork; deduplication through git commit SHAs. URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
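One way deduplication through commit SHAs could work is sketched below. The rule used here (a project is a duplicate if it shares at least one commit SHA with a project seen earlier) is an assumption for illustration, as the slide does not spell out the exact criterion; the class `CommitDeduplicator` and its input format are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: drop projects that share commit history with an earlier project. */
final class CommitDeduplicator {
    /**
     * Visits projects in the map's iteration order and keeps a project only if
     * none of its commit SHAs has been seen before.
     */
    static List<String> deduplicate(Map<String, Set<String>> commitsByProject) {
        Set<String> seenShas = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : commitsByProject.entrySet()) {
            boolean sharesHistory = false;
            for (String sha : entry.getValue()) {
                if (seenShas.contains(sha)) {
                    sharesHistory = true;
                    break;
                }
            }
            if (!sharesHistory) {
                kept.add(entry.getKey());
            }
            // Record SHAs of dropped projects too, so forks of forks are also caught.
            seenShas.addAll(entry.getValue());
        }
        return kept;
    }
}
```

Because the result depends on visiting order, a caller would likely pass an ordered map (for example a `LinkedHashMap` sorted by some preference such as fork count) so that the preferred copy of each project is the one kept.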
Language Models of Code
Learning about identifiers
Learning about identifiers API calls are predictable
n-gram log probability (NGLP) as a complexity metric. NGLP is data-driven: an n-gram is more complex if it is rarer.
Complexity trade-offs [Figure: two code snippets from elasticsearch, compared]
Identifier Information Metric (IIM)
Evaluate domain specificity of code. Larger IIM, more domain-specific identifiers. Use to evaluate code reusability.

    File                         H_full - H_collapsed
    ContinuationPending.java     5.2
    FastDtoa.java                5.0
    PrivateAccessClass.java      4.7
    UintMap.java                 1.2
    GeneratedClassLoader.java    1.1
    JSSetter.java                1.0
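The difference of cross-entropies can be sketched on toy data. The code below reads the slide's "H full - H collapsed" as the metric itself, which is an interpretation; it also swaps in an add-one-smoothed unigram model for the n-gram model to stay short. The class `IdentifierInformation`, the `<ID>` placeholder, the keyword list, and the identifier test are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of H_full - H_collapsed using a unigram stand-in model. */
final class IdentifierInformation {
    static final String ID = "<ID>"; // placeholder that replaces every identifier
    // Deliberately incomplete keyword list, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "public", "void", "if", "else", "new", "return", "throw",
            "null", "instanceof", "int", "for", "while", "class"));

    /** Crude identifier test: starts like a Java identifier and is not a listed keyword. */
    static boolean isIdentifier(String token) {
        return !token.isEmpty()
                && Character.isJavaIdentifierStart(token.charAt(0))
                && !KEYWORDS.contains(token);
    }

    /** Replaces every identifier with the single placeholder token. */
    static List<String> collapse(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(isIdentifier(token) ? ID : token);
        }
        return out;
    }

    /** Cross-entropy (bits per token) of test under an add-one unigram model fit on train. */
    static double crossEntropy(List<String> train, List<String> test) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : train) {
            counts.merge(token, 1, Integer::sum);
        }
        // +1 reserves probability mass for tokens never seen in training.
        double denominator = train.size() + counts.size() + 1.0;
        double sum = 0.0;
        for (String token : test) {
            double p = (counts.getOrDefault(token, 0) + 1.0) / denominator;
            sum += Math.log(p) / Math.log(2);
        }
        return -sum / test.size();
    }

    /** H_full - H_collapsed for one test file, given training tokens. */
    static double iim(List<String> train, List<String> test) {
        return crossEntropy(train, test) - crossEntropy(collapse(train), collapse(test));
    }
}
```

In this sketch, a file whose identifiers were never seen in training scores higher than a file that reuses familiar identifiers, since collapsing removes more of its unpredictability.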
Contributions ● GitHub Java Corpus ● New gigatoken language models ● API calls are predictable ● Data-driven code complexity metrics ● Metric of domain-specificity
n-gram Language Models
Language Models - Metrics Log Probability (NGLP) Cross Entropy (H)
Learning about identifiers: method and type identifiers are equally hard, irrespective of the amount of data.