Meaningful Variable Names for Decompiled Code: A Machine Translation Approach


  1. Meaningful Variable Names for Decompiled Code: A Machine Translation Approach
     Alan Jaffe, Jeremy Lacomis, Edward J. Schwartz*, Claire Le Goues, and Bogdan Vasilescu

  2. Problem: Obfuscated Variable Names in Code

     Minified JavaScript, original:

         function callback(error, response, body) {
           if (!error && response.statusCode == 200) {
             var info = JSON.parse(body);
             …

     Minified:

         function callback(o, s, a) {
           if (!o && s.statusCode == 200) {
             var c = JSON.parse(a);
             …

  4. Problem: Obfuscated Variable Names in Code

     Decompiled C Code, original source:

         cp = buf;
         (void)asxTab(level + 1);
         for (n = asnContents(asn, buf, 512); n > 0; n--) {
           printf(" %02X ", *(cp++));
         }

     Decompiled:

         v14 = &v15;
         asxTab(a2 + 1);
         for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
           v9 = (unsigned char *)(v14++);
           printf(" %02X ", *v9);
         }

  8. Problem: Obfuscated Variable Names in Code

     Minified JavaScript (same example as slide 2):
     • Software is “natural” [Hindle et al., 2011].
     • Use large corpora + machine learning to predict better identifier names.
     • Corpora are easy to generate!
     • Bavishi et al., Context2Name, 2017
     • Vasilescu et al., JSNaughty, 2017
     • Raychev et al., JSNice, 2015

  9. Problem: Obfuscated Variable Names in Code

     Can we use similar strategies for decompiled code?

     Decompiled C Code (same example as slide 4).

  10. Statistical Machine Translation (SMT)
      • Noisy channel model
      • English → French:
        Go do some research!
        Va faire de la recherche!

      $$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_e P(f \mid e)\,P(e)$$

      MOSES SMT:
      • $P(f \mid e)$: Translation Model (probability that f is a translation of e)
      • $P(e)$: Language Model (“fluency” of e)
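The decision rule on the SMT slides, choosing the e that maximizes P(f | e) · P(e), can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the candidate names, the probability values, and the `decode` helper are hypothetical, and a real system such as Moses scores whole phrases in context rather than single tokens.

```javascript
// Toy noisy-channel decoder: pick the candidate e maximizing P(f | e) * P(e).
// All names and probabilities are hypothetical, chosen only for illustration.

// Translation model, indexed as translationModel[e][f] = P(f | e).
const translationModel = {
  n: { v13: 0.6 },
  len: { v13: 0.7 },
  buf: { v13: 0.1 },
};

// Language model: P(e), the "fluency" of each candidate name.
const languageModel = { n: 0.5, len: 0.2, buf: 0.3 };

function decode(f, translationModel, languageModel) {
  let bestName = null;
  let bestScore = 0;
  for (const [e, pE] of Object.entries(languageModel)) {
    // Unseen (e, f) pairs get probability 0 in this unsmoothed sketch.
    const pFGivenE = (translationModel[e] || {})[f] || 0;
    const score = pFGivenE * pE;
    if (score > bestScore) {
      bestName = e;
      bestScore = score;
    }
  }
  return { name: bestName, score: bestScore };
}

const result = decode("v13", translationModel, languageModel);
console.log(result.name);
```

In this toy setup `len` has the highest translation probability for `v13`, but `n` wins overall because the language model considers it more fluent; that interplay is the point of combining the two models.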

  18. SMT Model for Natural Language
      • Aligned French/English corpus
      • English corpus

  19. SMT Model for Minified JavaScript
      • Aligned original/minified source corpus
      • Original source corpus
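One reason an aligned original/minified corpus is easy to generate is that renaming identifiers leaves the line structure untouched, so line i of the original pairs with line i of the minified output. The sketch below is a rough illustration under stated assumptions: `renameIdentifiers` and `alignLines` are hypothetical helpers, the snippet is adapted from the slide example and closed off so it is complete, and a regex stands in for the parser a real minifier would use (it would also rewrite matching words inside strings or comments, and supports at most 26 plain alphanumeric names).

```javascript
// Rename the given identifiers to a, b, c, ... (hypothetical toy minifier).
function renameIdentifiers(source, names) {
  const mapping = new Map(
    names.map((name, i) => [name, String.fromCharCode(97 + i)])
  );
  // Match any listed name as a whole word.
  const pattern = new RegExp(`\\b(?:${names.join("|")})\\b`, "g");
  return source.replace(pattern, (match) => mapping.get(match));
}

// Pair up lines by index; valid because renaming never adds or removes lines.
function alignLines(originalSource, minifiedSource) {
  const left = originalSource.split("\n");
  const right = minifiedSource.split("\n");
  return left.map((line, i) => [line, right[i]]);
}

const original = [
  "function callback(error, response, body) {",
  "  if (!error && response.statusCode == 200) {",
  "    var info = JSON.parse(body);",
  "  }",
  "}",
].join("\n");

const minified = renameIdentifiers(original, ["error", "response", "body", "info"]);
const pairs = alignLines(original, minified);
console.log(pairs[2]);
```

The short names produced here (a, b, c, d) differ from the ones on the slide (o, s, a, c); the particular names do not matter for alignment, only that the renaming is consistent.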

  20. Problem: Obfuscated Identifiers in Code

      Can we use SMT for decompiled code?

      Decompiled C Code (same example as slide 4).

  22. SMT Model for Decompiled Code?
      • Nontrivial: aligned original/decompiled source corpus
      • Original source corpus

  23. Difficulty: Decompilation Changes Structure

      Original Source (9 lines):

          #include <stdio.h>
          int main() {
            int cur = 0;
            while (cur <= 9) {
              printf("%d\n", cur);
              ++cur;
            }
            return 0;
          }

      Decompiled Code (8 lines):

          #include <stdio.h>
          int main() {
            int v1 = 0;
            int v2;
            for (v2 = 0; v2 < 10; ++v2)
              printf("%d\n", v2);
            return v1;
          }

      • Different line count.
      • Different numbers of variables.
      • Different types of loops.
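The three differences listed on this slide can be checked mechanically on the slide's own pair of programs. The following is only a sketch: the helper names are hypothetical, and the regular expressions are a crude stand-in for real C parsing that happens to be enough for this small example.

```javascript
// The two programs from the slide, kept verbatim via String.raw so that
// the C escape sequence \n stays literal.
const original = String.raw`#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}`;

const decompiled = String.raw`#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}`;

// Count non-blank lines.
function countLines(source) {
  return source.split("\n").filter((line) => line.trim() !== "").length;
}

// Find names declared as `int name;` or `int name = ...` (not functions).
function declaredInts(source) {
  const matches = source.matchAll(/\bint\s+([A-Za-z_]\w*)\s*[=;]/g);
  return [...matches].map((m) => m[1]);
}

// List the loop keywords that appear in the source.
function loopKeywords(source) {
  return source.match(/\b(?:for|while)\b/g) || [];
}

console.log(countLines(original), countLines(decompiled));
console.log(declaredInts(original), declaredInts(decompiled));
console.log(loopKeywords(original), loopKeywords(decompiled));
```

Because the line counts, variable sets, and loop forms all differ, pairing line i of the original with line i of the decompiled output, as works for minified JavaScript, is not possible here.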

  27. Decompiled Code Corpus Generation

      Decompiled Code:

          #include <stdio.h>
          int main() {
            int v1 = 0;
            int v2;
            for (v2 = 0; v2 < 10; ++v2)
              printf("%d\n", v2);
            return v1;
          }

      ❌

      Original Code:

          #include <stdio.h>
          int main() {
            int cur = 0;
            while (cur <= 9) {
              printf("%d\n", cur);
              ++cur;
            }
            return 0;
          }
