
Graph-based, Self-Supervised Program Repair from Diagnostic Feedback

ICML 2020
Michihiro Yasunaga, Percy Liang
Stanford University

Why program repair?
Programmers spend 75% of time fixing source code errors
Automatic program repair can


  1. Our contributions: 2. Self-supervised learning
     ○ Collect unlabeled programs
     ○ Corrupt and get diagnostic feedback (e.g. run compiler)
     ⇒ Extra training data: <broken code, feedback, fixed code>

     Working code              Corrupted
     5 int i, n;               5 int i, n;
     6 string A;               6 char A;
     7 cin >> n;               7 cin >> n .
     8 A.resize(n);            8 A.resize(n);
     9 for (i=0;i<n;i++){      9 for (i=0;i<n;i++){
     10 cin >> A[i];           10 cin >> A[j];
     11 cout << i; }           11 cout << i; }

     Compiler on the corrupted code: Error! line 7: expected ‘;’

  2. Our results
     Improved performance on two applications
     ● DeepFix: correct intro programming assignments in C
     ● SPoC: correct output of C++ program synthesis
     [Charts: DeepFix Test; SPoC TestP]

  3. Outline
     ● Innovations
        1. Reasoning via program-feedback graph
        2. Self-supervised learning
     ● Evaluations
        1. DeepFix
        2. SPoC
     ● Analysis & Examples
     ● Takeaways

  4. 1. Reasoning via program-feedback graph

  5. 1. Reasoning via program-feedback graph
     Challenges
     ● How to connect two modalities: program and feedback?
     ● How to model the reasoning of repair (e.g. tracking symbols)?

     Source code
     4 int main() {
     5 char tmp, a, b;
     6 map<string,int> mp;
     7 cin >> a >> b;
     8 int i, j;
     9 for (i = 0; i < a.size() ...
     10 tmp.push_back(a[i]);
     11 string tmp1 = tmp;

     Compiler message
     request for member ‘size’ in ‘a’, which is of non-class type ‘char’

  6. 1. Reasoning via program-feedback graph
     Our solution: program-feedback graph
     ● Join program & feedback through symbols relevant to program repair
       → shared/abstracted semantic space
     ● Reason over this space using graph attention

     Source code
     4 int main() {
     5 char tmp, a, b;
     6 map<string,int> mp;
     7 cin >> a >> b;
     8 int i, j;
     9 for (i = 0; i < a.size() ...
     10 tmp.push_back(a[i]);
     11 string tmp1 = tmp;

     Compiler message
     request for member ‘size’ in ‘a’, which is of non-class type ‘char’

     [Figure: graph nodes char, a, b (line 5); a, b (line 7); a, size (line 9); a (line 10); ‘size’, ‘a’, ‘char’ (compiler message)]

  9. 1. Reasoning via program-feedback graph
     How to construct graph?
     ● Recognize token types: <data type>, <identifier>, <operator>, <diagnostic argument>

     Source code
     4 int main() {
     5 char tmp, a, b;
     6 map<string,int> mp;
     7 cin >> a >> b;
     8 int i, j;
     9 for (i = 0; i < a.size() ...
     10 tmp.push_back(a[i]);
     11 string tmp1 = tmp;

     Compiler message
     request for member ‘size’ in ‘a’, which is of non-class type ‘char’

  18. 1. Reasoning via program-feedback graph
      How to construct graph?
      ● Recognize token types
      ● Nodes: diagnostic arguments, their occurrences, and all identifiers
      ● Edges: connect identical tokens to capture semantic correspondence

      Source code
      4 int main() {
      5 char tmp, a, b;
      6 map<string,int> mp;
      7 cin >> a >> b;
      8 int i, j;
      9 for (i = 0; i < a.size() ...
      10 tmp.push_back(a[i]);
      11 string tmp1 = tmp;

      Compiler message
      request for member ‘size’ in ‘a’, which is of non-class type ‘char’

      [Figure: graph nodes char, a, b (line 5); a, b (line 7); a, size (line 9); a (line 10); ‘size’, ‘a’, ‘char’ (compiler message)]
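
The construction steps above (recognize token types, collect nodes, connect identical tokens) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tiny `DATA_TYPES` and `KEYWORDS` lexicons, the regex tokenizer, and the name `build_graph` are all hypothetical, and the message parser assumes GCC-style curly quotes around diagnostic arguments.

```python
import re
from itertools import combinations

# Assumed, simplified lexicons; a real system would use a proper C++ lexer.
DATA_TYPES = {"int", "char", "string", "map"}
KEYWORDS = {"for", "if", "else", "while", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def diagnostic_arguments(message):
    """Return the symbols quoted in a compiler message (assumes curly quotes)."""
    return set(re.findall(r"‘\s*(.+?)\s*’", message))


def build_graph(code_lines, message):
    """Build (nodes, edges) from {line_no: text} and a compiler message.

    Nodes are (source, line_no, token) tuples: the diagnostic arguments,
    their occurrences in the code, and all identifiers.
    Edges link every pair of nodes that carry the same token.
    """
    args = diagnostic_arguments(message)
    nodes = [("msg", None, tok) for tok in sorted(args)]
    for line_no, text in code_lines.items():
        for tok in TOKEN_RE.findall(text):
            is_identifier = tok not in DATA_TYPES and tok not in KEYWORDS
            if tok in args or is_identifier:
                nodes.append(("code", line_no, tok))
    edges = [
        (i, j)
        for i, j in combinations(range(len(nodes)), 2)
        if nodes[i][2] == nodes[j][2]
    ]
    return nodes, edges


code = {5: "char tmp, a, b;", 9: "for (i = 0; i < a.size(); i++)"}
msg = "request for member ‘size’ in ‘a’, which is of non-class type ‘char’"
nodes, edges = build_graph(code, msg)
```

Here the message node for `a` ends up linked to each occurrence of `a` in the code, which is the "tracking symbols" behaviour the slides describe.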

  24. 1. Reasoning via program-feedback graph
      Model
      ● Initial encoding
      ● Graph attention
      ● Recontextualization
      ● Decoding

  25. 1. Reasoning via program-feedback graph
      Model (Initial encoding)

      Source code
      1 int main() {
      2 char tmp, a, b;
      3 map<string,int> mp;
      ...

      Compiler message
      9: request for member ‘size’ …

      [Figure: the inputs are the source code (Line 1, Line 2, Line 3, each a token sequence x_11, x_12, x_13, ...) and the feedback, i.e. the compiler message (line idx i_err and msg content m_1, m_2, m_3). Components shown: position embedding, LSTM_code^(1) on each code line, LSTM_msg^(1) on the message.]

  31. 1. Reasoning via program-feedback graph
      Model (Graph attention)
      ● Message passing across tokens with long-range dependencies

      [Figure: a graph attention layer sits on top of the initial encodings (position embedding, LSTM_code^(1), LSTM_msg^(1)). In the program-feedback graph, the token states of Line 1 (hx_11, hx_12, hx_13, ...), Line 2 (hx_21, hx_22, hx_23, ...), Line 3 (hx_31, hx_32, hx_33, ...) and of the compiler message (hm_1, hm_2, hm_3, ...) are aggregated with multi-head attention to produce an updated state (shown for hm_1).]
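
To make the "attention, then aggregate" step concrete, here is a minimal pure-Python sketch of one round of message passing over graph edges. It is a simplification and not the model from the slides: it uses a single head instead of multi-head attention, has no learned weights, and scores neighbours with a scaled dot product, which is an assumption made here for brevity. The function name `attend` is hypothetical.

```python
import math


def attend(node_vecs, edges):
    """One round of single-head attention message passing.

    node_vecs: list of equal-length float lists, one per graph node.
    edges: undirected (i, j) index pairs; every node also attends to itself.
    Returns the updated node vectors.
    """
    neighbors = {i: [i] for i in range(len(node_vecs))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, h_i in enumerate(node_vecs):
        scale = math.sqrt(len(h_i))
        scores = [
            sum(a * b for a, b in zip(h_i, node_vecs[j])) / scale
            for j in neighbors[i]
        ]
        # Numerically stable softmax over this node's neighbourhood.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the neighbours' vectors.
        updated.append([
            sum(w / total * node_vecs[j][d] for w, j in zip(weights, neighbors[i]))
            for d in range(len(h_i))
        ])
    return updated


vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attend(vecs, edges=[(0, 1)])
```

Because only graph neighbours are attended to, a token on line 9 can exchange information directly with its declaration on line 5, however far apart they are in the token sequence.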

  34. 1. Reasoning via program-feedback graph
      Model (Recontextualization)

      Source code
      1 int main() {
      2 char tmp, a, b;
      3 map<string,int> mp;
      ...

      Compiler message
      9: request for member ‘size’ …

      [Figure: on top of the graph attention outputs, LSTM_code^(2) is applied to each of Line 1, Line 2, Line 3 and LSTM_msg^(2) to the feedback (line idx, msg content); LSTM_code^(3) is drawn above the LSTM_code^(2) outputs.]

  37. 1. Reasoning via program-feedback graph
      Model (Decoding)
      ● Localize = 2 (MLP + softmax)
      ● Repair = "string tmp,a,b;" (Pointer-Generator Decoder)

      Source code
      1 int main() {
      2 char tmp, a, b;
      3 map<string,int> mp;
      ...

      Compiler message
      9: request for member ‘size’ …

      [Figure: both outputs are computed on top of LSTM_code^(3), which sits above LSTM_code^(2), LSTM_msg^(2) and the graph attention layer.]
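
The two decoding outputs can be illustrated with a small sketch. Localization is shown as a softmax over per-line scores (in the model these would come from the MLP; here they are given as plain numbers). The repair step is shown with the standard pointer-generator mixture of a vocabulary distribution and a copy distribution. The function names and the toy numbers are hypothetical, and the exact parametrization used in the talk may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(line_scores):
    """Return (index of the most likely erroneous line, probabilities).

    The index is 0-based, whereas the slide's "Localize = 2" is a 1-based
    line number.
    """
    probs = softmax(line_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def pointer_generator_mix(p_gen, vocab_probs, copy_attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen: probability of generating rather than copying.
    vocab_probs: {token: probability} from the generator.
    copy_attention: attention weight per source position (sums to 1).
    source_tokens: the token at each source position.
    """
    mixed = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for weight, tok in zip(copy_attention, source_tokens):
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - p_gen) * weight
    return mixed


line_idx, line_probs = localize([0.1, 2.0, -1.0])
next_token = pointer_generator_mix(
    0.5, {"string": 0.6, "int": 0.4}, [0.25, 0.75], ["char", "tmp"]
)
```

The copy term is what lets the decoder emit program-specific identifiers such as `tmp` that would not appear in a fixed output vocabulary.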

  39. 1. Reasoning via program-feedback graph
      Model overview

  40. 2. Self-supervised learning

  41. 2. Self-supervised learning
      Why?
      ● Labeled datasets of program repair are small (10-100K examples)
      ● Vast amount of unlabeled programs available online
      ● Can we leverage them to improve learning?
      [Figure: >> 1M submissions; > 30M repos]

  42. 2. Self-supervised learning
      Our idea (outline)
      Step 1. Collect unlabeled, working programs y
      Step 2. Design (randomized) program corruption procedure P
      Step 3. Corrupt and get diagnostic feedback (e.g. run compiler)
              ⇒ Extra training data: <broken code x, feedback f, fixed code y>
      Step 4. Use them for pre-training

  43. 2. Self-supervised learning
      1. Collect unlabeled programs
      ● Our target tasks (DeepFix & SPoC) are in C/C++
      ● Collect 300K working C++ programs from codeforces.com

  44. 2. Self-supervised learning
      2. How to design corruption procedure P?
      ● Look at common errors (know your enemy)

      Statistics
      Error type                Avg.   Experienced   Beginner   SPoC   Common compiler messages
      Expected ...              30%    9%            48%        35%    expected @@@ (e.g. expected ‘;’ before...); missing @@@ (e.g. missing ")
        ● operator/punctuator   23%                  37%        29%
        ● primary expression    7%                   11%        6%
      Identifier type           11%    9%            5%         18%    redeclaration/conflicting declaration; invalid conversion from <type> to <type>
      Identifier undeclared     42%    62%           33%        31%    @@@ was not declared
      Others                    17%    20%           14%        16%    ‘else’ without a previous ‘if’; no matching function for call to...

      [Gupta et al. 17] [Mesbah et al. 19] [Kulal et al. 19]

  50. 2. Self-supervised learning
      2. How to design corruption procedure P?
      ● Look at common errors
      ● Design perturbation modules M to cause those errors

      Perturbation module                                          Example
      Syntax (delete/insert/replace operators .;(){}'"+, etc.)     return 0; }      →  return 0; } }
      ID-type (delete/insert/replace type)                         string tmp;      →  char tmp;
      ID-typo (delete/insert/replace IDentifier)                   int a, b=0, m;   →  int a, m;
      Keyword (delete/insert/replace keyword/call)                 if (n >= 0)      →  while (n >= 0)

  55. 2. Self-supervised learning
      2. Our corruption procedure P
      ● Look at common errors
      ● Design perturbation modules M to cause those errors
      ● P: Sample 1-5 modules from M, and apply to program sequentially
        e.g. ID-type, ID-typo, Syntax.

      Working code
      5 int i, n;
      6 string A;
      7 cin >> n;
      8 A.resize(n);
      9 for (i = 0; i < n; i++){
      10 cin >> A[i];
      11 cout << i; }

      Perturbed 1: line 6 becomes "char A;"
      Perturbed 2: line 10 becomes "cin >> A[j];"
      Perturbed 3: line 7 becomes "cin >> n ."
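
The procedure P described above (sample 1-5 perturbation modules and apply them in sequence) can be sketched as follows. This is a hedged toy version: each module here implements only one of the delete/insert/replace variants, the regexes stand in for a real C++ lexer, and sampling the modules uniformly with replacement is an assumption, since the slides do not say how the sampling is done. All function names are hypothetical.

```python
import random
import re

TYPES = ["int", "char", "string", "double"]
RESERVED = set(TYPES) | {"if", "while", "for", "else", "return"}


def perturb_syntax(code, rng):
    """Syntax module (delete variant): drop one operator/punctuator."""
    positions = [i for i, ch in enumerate(code) if ch in ".;(){}'\"+,"]
    if not positions:
        return code
    i = rng.choice(positions)
    return code[:i] + code[i + 1:]


def perturb_id_type(code, rng):
    """ID-type module (replace variant): swap one type name for another."""
    matches = list(re.finditer(r"\b(?:int|char|string|double)\b", code))
    if not matches:
        return code
    m = rng.choice(matches)
    new_type = rng.choice([t for t in TYPES if t != m.group()])
    return code[:m.start()] + new_type + code[m.end():]


def perturb_id_typo(code, rng):
    """ID-typo module (replace variant): misspell one identifier use."""
    matches = [
        m for m in re.finditer(r"\b[A-Za-z_]\w*\b", code)
        if m.group() not in RESERVED
    ]
    if not matches:
        return code
    m = rng.choice(matches)
    return code[:m.end()] + "_" + code[m.end():]


def perturb_keyword(code, rng):
    """Keyword module (replace variant): swap one `if` and `while`."""
    matches = list(re.finditer(r"\b(?:if|while)\b", code))
    if not matches:
        return code
    m = rng.choice(matches)
    new_kw = "while" if m.group() == "if" else "if"
    return code[:m.start()] + new_kw + code[m.end():]


MODULES = [perturb_syntax, perturb_id_type, perturb_id_typo, perturb_keyword]


def corrupt(code, rng):
    """P: sample 1-5 modules (with replacement) and apply them in order."""
    for module in rng.choices(MODULES, k=rng.randint(1, 5)):
        code = module(code, rng)
    return code


working = "int i, n;\nstring A;\nif (n >= 0) cin >> n;\n"
broken = corrupt(working, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when regenerating a pre-training set.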

  61. 2. Self-supervised learning
      3. Prepare pre-training data
      ● 300K working programs from codeforces.com
      ● For each program, create corrupted versions by applying P
      ⇒ New program repair examples: <broken code, feedback, fixed code>

      Working code              Corrupted (after applying P)
      5 int i, n;               5 int i, n;
      6 string A;               6 char A;
      7 cin >> n;               7 cin >> n .
      8 A.resize(n);            8 A.resize(n);
      9 for (i=0;i<n;i++){      9 for (i=0;i<n;i++){
      10 cin >> A[i];           10 cin >> A[j];
      11 cout << i; }           11 cout << i; }

      Compiler on the corrupted code: Error! line 7: expected ‘;’
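
A minimal sketch of this data-preparation step, assuming a `g++` binary is available on the path: each working program is corrupted, the compiler is run on the result, and the triple <broken code, feedback, fixed code> is recorded. The function names, the `tries` parameter, and the choice to discard corruptions that still compile are assumptions of this sketch, not details confirmed by the slides.

```python
import os
import subprocess
import tempfile


def compiler_feedback(code, compiler="g++"):
    """Return the compiler's error output for `code`, or "" if it compiles.

    Uses -fsyntax-only so that no binary is produced.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prog.cpp")
        with open(path, "w") as handle:
            handle.write(code)
        result = subprocess.run(
            [compiler, "-fsyntax-only", path],
            capture_output=True,
            text=True,
        )
    return result.stderr if result.returncode != 0 else ""


def make_examples(programs, corrupt_fn, feedback_fn=compiler_feedback, tries=3):
    """Turn working programs into <broken, feedback, fixed> training examples.

    corrupt_fn: maps a working program to a corrupted one (the procedure P).
    feedback_fn: maps a program to diagnostic feedback, "" meaning no error.
    """
    examples = []
    for fixed in programs:
        for _ in range(tries):
            broken = corrupt_fn(fixed)
            feedback = feedback_fn(broken)
            if feedback:  # assumption: skip corruptions that still compile
                examples.append(
                    {"broken": broken, "feedback": feedback, "fixed": fixed}
                )
    return examples
```

Keeping `feedback_fn` pluggable means the same loop works with any diagnostic source, for example a different compiler or a stub during testing.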

  68. 2. Self-supervised learning
      What’s interesting?
      ● Typically, pre-training task ≠ target task (e.g. masked LM vs. QA)
      ● Here, targeted pre-training (pre-training task = target task = program repair)
        ○ More direct pre-training structure
        ○ Data distributions can be different between pre-training & target

  69. Evaluation 1: DeepFix

  70. Evaluation 1: DeepFix
      Task [Gupta et al., 17]
      ● Repair C programs
      ● May have multiple error lines
      ● Apply repair model iteratively (up to 5 times)
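
The iterative setting (one repair per attempt, up to 5 attempts) can be sketched as a simple loop. Both callbacks are placeholders: `get_feedback` stands in for running the compiler and `repair_model` for the trained model, and their signatures are assumptions made for this illustration.

```python
def iterative_repair(code_lines, get_feedback, repair_model, max_attempts=5):
    """Apply single-line repairs until the program compiles or attempts run out.

    get_feedback(lines) returns None when the program compiles, otherwise
    some diagnostic feedback.
    repair_model(lines, feedback) returns (line_index, replacement_line).
    Returns (final_lines, compiled).
    """
    lines = list(code_lines)
    for _ in range(max_attempts):
        feedback = get_feedback(lines)
        if feedback is None:
            return lines, True
        line_index, replacement = repair_model(lines, feedback)
        lines[line_index] = replacement
    return lines, get_feedback(lines) is None
```

Re-running the compiler after every edit matters because fixing one error can expose another one that was previously not reported.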

  71. Evaluation 1: DeepFix
      Our model outputs

      Input code
      4 int main() {
      5 int n;
      6 int * m[2];
      7 m[0] = malloc(n*sizeof(int));
      8 m[1] = malloc(n*sizeof(int));
      9 for (i = 0; i < n; i++) {
      10 m[0][i] = -1;
      11 m[1][i] = -1; }
      12 return 0 }
      Error message: line 9: ‘i’ undeclared

      DrRepair Attempt 1: line 5 becomes "int n, i;"
      Error message: line 12: expected ‘;’ before ‘}’

      DrRepair Attempt 2: line 12 becomes "return 0; }"
      Compiled!!

  77. Evaluation 1: DeepFix
      Results
      [Chart: Test (full repair accuracy)]
      ● Prior works do not use compiler messages
      ● Use of compiler messages is important
