
Any-Code Completion



  1. Any-Code Completion. Example (Java): public static Path[] stat2Paths(FileStatus[] stats) { if (stats == null) return null; Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = stats[i].getPath(); } return ret; } Generated (Java): (25.2%) stats[i].getPath(); (3.3%) new Path(stats[i]); (2.5%) new Path(stats[i], charset)

  2. Overview: a Structural Language Model. [Figure: AST of stats[i].getPath(), with nodes MethodCall, ArrayAccess and Name, and leaf values stats, i, get, path.]

  3. http://AnyCodeGen.org

  4. Structural Language Models of Code. ICML’2020. Uri Alon (Technion), Roy Sadaka (Technion), Omer Levy (Tel-Aviv University, Facebook AI Research), Eran Yahav (Technion)

  5. Language modeling of code • Code completion • Validate existing code, detect unlikely code. Example (shown twice on the slide, side by side; the second copy sets stats.size() apart): public static Path[] stat2Paths(FileStatus[] stats) { if (stats == null) return null; Path[] ret = new Path[stats.size()]; for (int i = 0; i < stats.length; ++i) { ret[i] = stats[i].getPath(); } return ret; }

  6. Key Idea #1: predict a missing subtree. Instead of representing the task as “predict a missing sentence in a text”, represent the task as “predict a missing subtree in a tree”. Learn syntactic patterns instead of sequential patterns.

  7. Abstract Syntax Tree. Any valid code snippet can be parsed into an Abstract Syntax Tree (AST). The AST is composed of nodes and user-defined values in its leaves. [Figure: AST of stats[i].getPath(), with nodes MethodCall, ArrayAccess and Name, and leaf values stats, i, get, path.]
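
To make the tree on this slide concrete, here is a minimal Java sketch that builds an AST for stats[i].getPath() by hand and lists its nodes depth-first. The AstNode class and the exact parent/child layout are illustrative assumptions based on the labels in the figure, not the parser used in the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal AST node: a label plus ordered children.
final class AstNode {
    final String label;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String label, AstNode... kids) {
        this.label = label;
        for (AstNode k : kids) children.add(k);
    }

    // Pre-order (depth-first) listing of node labels.
    static List<String> dfs(AstNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(AstNode n, List<String> out) {
        out.add(n.label);
        for (AstNode c : n.children) collect(c, out);
    }

    // Assumed layout for stats[i].getPath(); leaves carry the user-defined values.
    static AstNode example() {
        return new AstNode("MethodCall",
            new AstNode("ArrayAccess",
                new AstNode("Name", new AstNode("stats")),
                new AstNode("Name", new AstNode("i"))),
            new AstNode("Name", new AstNode("get"), new AstNode("path")));
    }

    public static void main(String[] args) {
        System.out.println(dfs(example()));
    }
}
```

The depth-first listing is the kind of node ordering the later slides rely on when a tree is treated as a sequence of prediction steps.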

  8. Key Idea #2: a structural language model (SLM). In a natural-language model: $\Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \prod_{t=1}^{n} \Pr(y_t \mid y_{<t})$. But how can we compute the probability of a tree?

  9. Key Idea #2: a structural language model (SLM). Given a tree $A$ (can be an arbitrary graph), induce an ordering over its nodes: $a_0, a_1, \ldots, a_n \in A$ (in practice: DFS). A structural language model (SLM) computes the probability of the tree $A$: $\Pr(A) = \prod_{t=0}^{n} \Pr(a_t \mid a_{<t})$. But how can we represent the partial tree $a_{<t}$ when computing $\Pr(a_t \mid a_{<t})$?
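
The factorization on this slide can be sketched in a few lines of Java: walk the nodes in the chosen order and sum the log of each conditional probability. The NodeModel interface and the constant 0.5 "model" are hypothetical stand-ins for a trained network; only the product structure comes from the slide.

```java
import java.util.Arrays;
import java.util.List;

final class TreeProbability {
    // Hypothetical stand-in for a trained model: returns Pr(next | prefix).
    interface NodeModel {
        double condProb(List<String> prefix, String next);
    }

    // log Pr(A) = sum over t of log Pr(a_t | a_<t), for nodes a_0..a_n in order.
    static double logProb(List<String> nodes, NodeModel model) {
        double sum = 0.0;
        for (int t = 0; t < nodes.size(); t++) {
            sum += Math.log(model.condProb(nodes.subList(0, t), nodes.get(t)));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy model (an assumption): every node gets probability 0.5.
        NodeModel uniform = (prefix, next) -> 0.5;
        List<String> dfsOrder = Arrays.asList("Greater", "Name", "x", "IntExp", "1");
        System.out.println(Math.exp(logProb(dfsOrder, uniform)));
    }
}
```

Working in log space avoids underflow when the tree has many nodes; exponentiating at the end recovers the product.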

  10. The fundamental tradeoff in code representation. Learning effort: implicitly re-learn syntactic & semantic regularities (model size, data, time…). Analysis effort: requires expertise, language-specific, task-specific model. Representations along the axis: surface text (token stream), AST paths (sweet spot), handcrafted features, data flow analysis, control flow analysis, ... [“A General Path-based Representation …”, PLDI’2018] [“code2vec”, POPL’2019]

  11. Key Idea #3: a partial tree as AST paths. We compute the probability of a node, $\Pr(a_t \mid a_{<t})$, by considering the paths in the Abstract Syntax Tree (AST) from all leaves into $a_t$. [Figure: partial AST (Method, Root, IfExpr) with the node to predict marked ?]

  12. [Figure: partial AST (Method, Root, IfExpr) with the node to predict marked ?]

  13. AST Paths. AST paths are simple paths over nodes in the AST. In previous works [“code2seq”, ICLR’2019], we used AST paths to read code. In this work (SLM), we generate code by predicting the next node in a set of AST paths. [Figure: partial AST (Method, Root, IfExpr) with the node to predict marked ?]
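
As a rough illustration of "paths from all leaves into the node to predict", the Java sketch below walks a small partial tree with parent pointers and prints the simple path from each existing leaf, and from the root, to a placeholder target node. The Node class, the tree layout and the "?" placeholder are assumptions made for the example; the actual model encodes such paths with a neural encoder rather than printing labels.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AstPaths {
    // Hypothetical tree node with a parent pointer.
    static final class Node {
        final String label;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node parent) {
            this.label = label;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    // Chain of nodes from n up to the root (n first).
    static List<Node> upChain(Node n) {
        List<Node> chain = new ArrayList<>();
        for (Node c = n; c != null; c = c.parent) chain.add(c);
        return chain;
    }

    // Labels on the simple path from `from` to `to`: up to the lowest
    // common ancestor, then down. Both nodes must be in the same tree.
    static List<String> path(Node from, Node to) {
        List<Node> up = upChain(from);
        List<Node> down = upChain(to);
        int i = 0;
        while (!down.contains(up.get(i))) i++; // up.get(i) is the common ancestor
        List<String> labels = new ArrayList<>();
        for (int k = 0; k <= i; k++) labels.add(up.get(k).label);
        List<Node> tail = new ArrayList<>(down.subList(0, down.indexOf(up.get(i))));
        Collections.reverse(tail);
        for (Node n : tail) labels.add(n.label);
        return labels;
    }

    // Collects leaves under n, skipping the target placeholder itself.
    static void collectLeaves(Node n, Node target, List<Node> out) {
        if (n == target) return;
        if (n.children.isEmpty()) out.add(n);
        for (Node c : n.children) collectLeaves(c, target, out);
    }

    public static void main(String[] args) {
        Node root = new Node("Root", null);
        Node ifExpr = new Node("IfExpr", root);
        Node greater = new Node("Greater", ifExpr);
        Node name = new Node("Name", greater);
        new Node("x", name);
        Node target = new Node("?", greater); // node to predict next

        List<Node> leaves = new ArrayList<>();
        collectLeaves(root, target, leaves);
        for (Node leaf : leaves) System.out.println(path(leaf, target));
        System.out.println(path(root, target)); // the root path
    }
}
```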

  14. AST Paths capture long-range interactions. public static Path[] stat2Paths(FileStatus[] stats) { if (stats == null) return null; Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = stats[i].getPath(); } return ret; }

  15. Model • Any sequential encoder to encode each arbitrary-length path into a fixed-length vector separately (e.g., LSTM, transformer encoder) • Any contextualizer to let all paths interact (e.g., transformer encoder) • Attend to the contextualized paths using the root path as the query. [Figure: partial AST (Method, Root, IfExpr) with the node to predict marked ?]
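
The last bullet, attending to the path vectors with the root path as the query, can be sketched as plain dot-product attention. The slide does not say which attention variant is used, so unscaled dot-product scoring is an assumption here, and all vectors below are made-up numbers rather than encoder outputs.

```java
import java.util.Arrays;

final class AttentionSketch {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Numerically stable softmax over raw scores.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    // Attention weights of the query (root path) over the path vectors.
    static double[] weights(double[] query, double[][] paths) {
        double[] scores = new double[paths.length];
        for (int i = 0; i < paths.length; i++) scores[i] = dot(query, paths[i]);
        return softmax(scores);
    }

    // Context vector: attention-weighted sum of the path vectors.
    static double[] attend(double[] query, double[][] paths) {
        double[] w = weights(query, paths);
        double[] ctx = new double[query.length];
        for (int i = 0; i < paths.length; i++) {
            for (int d = 0; d < ctx.length; d++) ctx[d] += w[i] * paths[i][d];
        }
        return ctx;
    }

    public static void main(String[] args) {
        double[] rootPath = {1.0, 0.0};               // made-up query vector
        double[][] paths = {{2.0, 0.0}, {0.0, 2.0}};  // made-up path vectors
        System.out.println(Arrays.toString(weights(rootPath, paths)));
        System.out.println(Arrays.toString(attend(rootPath, paths)));
    }
}
```

In the model described on the slide, the resulting context vector would feed the prediction of the next node; that final step is left out of this sketch.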

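  16. Model. Pipeline: encode paths → contextualize → attend → predict node. [Figure: partial AST (Method, Root, IfExpr, Greater) with the node to predict marked ?; the paths are labeled Context and the root path is labeled Query.]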

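  17. Generate the Tree of: x > 1. [Figure: partial AST with Method, Root, IfExpr; next node: ?]

  18. Generate the Tree of: x > 1. [Figure: partial AST with Method, Root, IfExpr, Greater; next node: ?]

  19. Generate the Tree of: x > 1. [Figure: partial AST with Method, Root, IfExpr, Greater, Name; next node: ?]

  20. Generate the Tree of: x > 1. [Figure: partial AST with Method, Root, IfExpr, Greater, Name, x; next node: ?]

  21. Generate the Tree of: x > 1. [Figure: partial AST with Method, Root, IfExpr, Greater, Name, x, IntExp; next node: ?]

  22. Generate the Tree of: x > 1. [Figure: complete AST with Method, Root, IfExpr, Greater, Name, x, IntExp, 1.]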

  23. Copy Mechanism. Sources shown: full token copy, subtoken copy, vocabulary. Example: myNewFoo = myObj.getFoo(); myNewFoo.setFooId(id);
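
To show what "subtoken copy" operates on, the Java sketch below splits identifiers at camelCase boundaries and reports which subtokens of a target identifier already occur in the surrounding code. The splitting rule, the lowercasing and the copyable helper are illustrative assumptions; the real copy mechanism is a learned pointer over context tokens, not a set lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SubtokenSketch {
    // Splits at lower-to-upper camelCase boundaries and lowercases each piece.
    static List<String> subtokens(String identifier) {
        List<String> out = new ArrayList<>();
        for (String s : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            out.add(s.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Subtokens of `target` that also occur in some context identifier,
    // i.e. the ones a subtoken copy could in principle reuse.
    static List<String> copyable(String target, List<String> context) {
        Set<String> seen = new LinkedHashSet<>();
        for (String id : context) seen.addAll(subtokens(id));
        List<String> out = new ArrayList<>();
        for (String s : subtokens(target)) {
            if (seen.contains(s)) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> context = Arrays.asList("myNewFoo", "myObj", "getFoo", "id");
        System.out.println(subtokens("myNewFoo"));
        System.out.println(copyable("setFooId", context));
    }
}
```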

  24. Example - Java. public static Path[] stat2Paths(FileStatus[] stats) { if (stats == null) return null; Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = stats[i].getPath(); } return ret; } Generated (Java): (25.2%) stats[i].getPath(); (3.3%) new Path(stats[i]); (2.5%) new Path(stats[i], charset)

  25. Example - C#. public static string Camelize(this string input) { var word = input.Pascalize(); return word.Length > 0 ? word.Substring(0, 1).ToLower() + word.Substring(1) : word; } Generated (C#): (14.1%) word.Substring(0, 1); (8.2%) word.trim(); (5.8%) word.Substring(1)

  26. Java Results (trained on 1.3M examples). [Bar chart: acc@1, acc@5, tree@1 and tree@5 for LSTMs+attn+copy, Transformer-small+copy, Transformer-base+copy, seq2tree, seq2prod and SLM (this work); model sizes shown range from 12M to 45M. SLM (this work): acc@1 18.0, acc@5 24.8, tree@1 39.1, tree@5 55.3.] tree@k example: a.b > 1 and c.d > 2 both have the tree NAME.NAME > INT.
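
The difference between exact match (acc@k) and tree match (tree@k) in the a.b > 1 versus c.d > 2 example can be approximated with a small Java sketch that replaces every identifier with NAME and every integer literal with INT before comparing. This regex-based abstraction is a crude, token-level assumption for illustration (it would also rewrite keywords); the metric itself compares AST structure.

```java
final class TreeMatchSketch {
    // Abstracts leaf values: identifiers become NAME, integer literals become INT.
    // Identifiers are replaced first so that the inserted INT is not rewritten.
    static String shape(String expr) {
        return expr.replaceAll("[A-Za-z_][A-Za-z0-9_]*", "NAME")
                   .replaceAll("\\b\\d+\\b", "INT")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Exact match: same text after trimming.
    static boolean exactMatch(String predicted, String expected) {
        return predicted.trim().equals(expected.trim());
    }

    // Approximate tree match: same shape once leaf values are abstracted away.
    static boolean treeMatch(String predicted, String expected) {
        return shape(predicted).equals(shape(expected));
    }

    public static void main(String[] args) {
        System.out.println(shape("a.b > 1"));
        System.out.println(exactMatch("a.b > 1", "c.d > 2"));
        System.out.println(treeMatch("a.b > 1", "c.d > 2"));
    }
}
```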

  27. C# Results. [Bar chart: acc@1 and acc@5 for PHOG, GNN → NAG, two seq2seq+copy variants, seq2tree+copy and SLM (this work).]

  28. Error Analysis. What kind of mistakes are responsible for the gap between acc@k and tree@k? SLM (this work): acc@1 18.0, tree@1 39.1, acc@5 24.8, tree@5 55.3.

  29. Error Analysis. What kind of mistakes are responsible for the gap between acc@k and tree@k? 74%: single-token mismatch. 30%: single-subtoken mismatch. [Chart: the gaps between acc (18.0, 24.8) and tree (39.1, 55.3) annotated with single subtoken 30%, single token 44%, single token 74%.]

  30. Error Analysis. public float getProgress() { this.readLock.lock(); try { if (this.currentAttempt != null) { return this.currentAttempt.getProgress(); } return 0; } finally { this.readLock.unlock(); } } Generated (exact-match / tree-match / compiles): (31.3%) this.currentAttempt.getCount(): ✘ / ✔ / ✘; (30.6%) -1f: ✘ / ✘ / ✔; (1.5%) this.currentAttempt.get(): ✘ / ✔ / ✘; (1.2%) this.currentAttempt.getTime(): ✘ / ✔ / ✘; (0.9%) this.currentAttempt.getProgress(): ✔ / ✔ / ✔

  31. Error Analysis. public float getProgress() { this.readLock.lock(); try { if (this.currentAttempt != null) { return this.currentAttempt.getProgress(); } return 0; } finally { this.readLock.unlock(); } }

  32. http://AnyCodeGen.org

  33. http://AnyCodeGen.org

  34. Structural Language Models of Code. Key points: 1. Predicting a missing subtree in a tree. 2. A structural language model over trees: $\Pr(A) = \prod_{t=0}^{n} \Pr(a_t \mid a_{<t})$. 3. A partial AST as a set of paths. [Figure: complete AST with Method, Root, IfExpr, Greater, Name, x, IntExp, 1.] http://AnyCodeGen.org urialon@cs.technion.ac.il
