Mining Source Code Repositories at Massive Scale using Language Modeling


  1. Mining Source Code Repositories at Massive Scale using Language Modeling ● Miltos Allamanis, Charles Sutton ● m.allamanis@ed.ac.uk, csutton@inf.ed.ac.uk ● University of Edinburgh

  2. Polyglot programmers ● Multitude of APIs & libraries ● Transfer knowledge from available code

  3. Why Language Models? ● Statistical models ● Learn from data ● Abundance of code available online ● Non-language specific method [Hindle et al., ICSE 2012]

  4. n-gram Language Models
       public void execute(Runnable task) {
           if (task == null)
               throw new NullPointerException();
           ForkJoinTask<?> job;
           if (task instanceof ForkJoinTask<?>) // avoid re-wrap
               job = (ForkJoinTask<?>) task;
           else
               job = new ForkJoinTask.AdaptedRunnableAction(task);
           externalPush(job);
       }

  8. n-gram Language Models ● Predictability measures ● n-gram Log Probability (NGLP) ● Cross-Entropy (H)
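
The slide names the two measures without giving formulas. As a hedged reference, the usual n-gram definitions are written out below; the symbols t_i (the i-th token), N (number of tokens), n (model order), and the base-2 logarithm are assumptions, not taken from the slides.

```latex
\begin{align*}
P(t_1, \dots, t_N) &\approx \prod_{i=1}^{N} P(t_i \mid t_{i-n+1}, \dots, t_{i-1}) \\
\mathrm{NGLP}(t_{i-n+1}, \dots, t_i) &= \log_2 P(t_i \mid t_{i-n+1}, \dots, t_{i-1}) \\
H &= -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(t_i \mid t_{i-n+1}, \dots, t_{i-1})
\end{align*}
```

Under these definitions, H is the negated average NGLP over a token sequence, so lower cross-entropy means more predictable code.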

  9. The Java GitHub Corpus ● Java projects with >1 fork ● Deduplication through git commit SHAs ● URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
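
The slide does not say exactly how commit SHAs are used for deduplication. One plausible reading, sketched below, is that a project is dropped when it shares any commit SHA with a project seen earlier, since forks share history. The class `ForkDeduplicator`, its method, and the input shape are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not the authors' tooling, only an illustration.
final class ForkDeduplicator {
    /**
     * Returns the names of projects to keep, in the map's iteration order.
     * A project is dropped if it shares at least one commit SHA with any
     * project seen before it (assumed criterion for "same history").
     */
    static List<String> deduplicate(Map<String, Set<String>> commitsByProject) {
        Set<String> seenShas = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : commitsByProject.entrySet()) {
            if (Collections.disjoint(seenShas, entry.getValue())) {
                kept.add(entry.getKey());
            }
            // Record SHAs even for dropped projects so later forks of a fork are caught.
            seenShas.addAll(entry.getValue());
        }
        return kept;
    }
}
```

Passing a `LinkedHashMap` makes the result deterministic: whichever copy of a project is listed first is the one that survives.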

  10. Language Models of Code

  11. Learning about identifiers

  12. Learning about identifiers ● API calls are predictable

  13. n-gram log probability (NGLP) as a complexity metric ● NGLP is data-driven ● An n-gram is more complex if it is rarer
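
To make "more complex if it is rarer" concrete, here is a minimal sketch of NGLP as a rarity score. It assumes a bigram model with add-one smoothing, chosen only for brevity; the slides do not state the model order or smoothing used in the talk, and the class `BigramModel` and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical bigram model; the talk's models are far larger.
final class BigramModel {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> contextTotals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Counts every adjacent token pair in the training sequence. */
    void train(List<String> tokens) {
        vocabulary.addAll(tokens);
        for (int i = 1; i < tokens.size(); i++) {
            String prev = tokens.get(i - 1);
            String next = tokens.get(i);
            counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(next, 1, Integer::sum);
            contextTotals.merge(prev, 1, Integer::sum);
        }
    }

    /** log2 P(next | prev) with add-one smoothing; one extra slot is reserved for unseen tokens. */
    double nglp(String prev, String next) {
        int pairCount = counts
                .getOrDefault(prev, Collections.<String, Integer>emptyMap())
                .getOrDefault(next, 0);
        int total = contextTotals.getOrDefault(prev, 0);
        double p = (pairCount + 1.0) / (total + vocabulary.size() + 1.0);
        return Math.log(p) / Math.log(2);
    }

    /** Negated average NGLP over the sequence, in bits per token. */
    double crossEntropy(List<String> tokens) {
        if (tokens.size() < 2) {
            throw new IllegalArgumentException("need at least two tokens");
        }
        double sum = 0.0;
        for (int i = 1; i < tokens.size(); i++) {
            sum += nglp(tokens.get(i - 1), tokens.get(i));
        }
        return -sum / (tokens.size() - 1);
    }
}
```

After training on tokenized code, a frequent pair such as `if` followed by `(` gets an NGLP closer to zero than a pair never seen in training, which is the sense in which rarer n-grams score as more complex.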

  14. Complexity trade-offs from elasticsearch

  15. vs from elasticsearch

  16. Identifier Information Metric (IIM) ● Evaluate domain specificity of code ● Larger IIM, more domain-specific identifiers ● Use to evaluate code reusability ● H_full - H_collapsed per file: ContinuationPending.java 5.2, FastDtoa.java 5.0, PrivateAccessClass.java 4.7, JSSetter.java 1.0, GeneratedClassLoader.java 1.1, UintMap.java 1.2
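
The slide gives the metric only as the difference H full - H collapsed. A rough sketch of how that difference could be computed follows, assuming "collapsed" means every identifier token is replaced by one placeholder before scoring. The class `IdentifierInformation`, the `IDENTIFIER` placeholder, the deliberately partial keyword list, and the two entropy functions passed in are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Hypothetical helper: illustrates the H_full - H_collapsed difference only.
final class IdentifierInformation {
    // Partial list of Java keywords and literals, enough for illustration.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "public", "private", "static", "final", "void", "class", "if", "else",
            "new", "throw", "return", "instanceof", "null", "true", "false"));

    /** True if the token looks like a Java identifier and is not in RESERVED. */
    static boolean isIdentifier(String token) {
        if (token.isEmpty() || RESERVED.contains(token)) {
            return false;
        }
        if (!Character.isJavaIdentifierStart(token.charAt(0))) {
            return false;
        }
        for (int i = 1; i < token.length(); i++) {
            if (!Character.isJavaIdentifierPart(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Replaces every identifier token with the single placeholder IDENTIFIER. */
    static List<String> collapse(List<String> tokens) {
        List<String> collapsed = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            collapsed.add(isIdentifier(token) ? "IDENTIFIER" : token);
        }
        return collapsed;
    }

    /** IIM = H_full(tokens) - H_collapsed(collapse(tokens)); both entropy functions are supplied by the caller. */
    static double iim(List<String> tokens,
                      ToDoubleFunction<List<String>> fullEntropy,
                      ToDoubleFunction<List<String>> collapsedEntropy) {
        return fullEntropy.applyAsDouble(tokens) - collapsedEntropy.applyAsDouble(collapse(tokens));
    }
}
```

The two entropy functions would normally come from language models trained on the full and the collapsed token streams respectively; a large difference means most of a file's unpredictability sits in its identifiers.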

  17. Contributions ● GitHub Java Corpus ● New gigatoken language models ● API calls are predictable ● Data-driven code complexity metrics ● Metric of domain-specificity

  19. n-gram Language Models

  20. Language Models - Metrics ● Log Probability (NGLP) ● Cross-Entropy (H)

  21. Learning about identifiers

  22. Learning about identifiers ● Method and type identifiers are equally hard to predict, irrespective of the amount of data.
