Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names

  1. Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names
     Bogdan Vasilescu (CMU, ISR), @b_vasilescu
     Casey Casalnuovo (UC Davis)
     Prem Devanbu (UC Davis), @devanbu

  2. Today

         var geom2d = function() {
           var t = numeric.sum;
           function r(n, r) {
             this.x = n;
             this.y = r;
           }
           u(r, {
             P: function e(n) {
               return t([ this.x * n.x, this.y * n.y ]);
             }
           });
           function u(n, r) {
             for (var t in r) n[t] = r[t];
             return n;
           }
           return { V: r };
         }();

  4. Today: data-driven method + tool

     Minified input:

         var geom2d = function() {
           var t = numeric.sum;
           function r(n, r) {
             this.x = n;
             this.y = r;
           }
           u(r, {
             P: function e(n) {
               return t([ this.x * n.x, this.y * n.y ]);
             }
           });
           function u(n, r) {
             for (var t in r) n[t] = r[t];
             return n;
           }
           return {
             V: r
           };
         }();

     Recovered output:

         var geom2d = function() {
           var sum = numeric.sum;
           function Vector2d(x, y) {
             this.x = x;
             this.y = y;
           }
           mix(Vector2d, {
             P: function dotProduct(vector) {
               return sum([ this.x * vector.x, this.y * vector.y ]);
             }
           });
           function mix(dest, src) {
             for (var k in src) dest[k] = src[k];
             return dest;
           }
           return {
             V: Vector2d
           };
         }();

  5. Why?
     • Programs are (also) written to be read
     “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” [Don Knuth]

  8. Why?
     • Programs are (also) written to be read
     • Well-chosen variable names are critical to source code readability, reusability, maintainability [many]
     • Example tasks:
       • reverse engineering binaries
       • reverse engineering obfuscated JavaScript
       • consistent styling in large, distributed teams
     Martin Vechev, “Probabilistic Learning From Big Code”. Keynote at ISSTA 2016

  9. Key ingredient
     • The “naturalness” of software [Hindle et al, 2011]

  10. Natural languages are complex Hmmmm….

  11. Natural languages are complex
      Tiger, Tiger burning bright
      In the forests of the night
      What immortal hand or eye,
      Could frame thy fearful symmetry?

  12. ..but most utterances are simple & repetitive
      TIGER!! RUN!!!

  15. English, தமிழ், German
      Can be: Rich, Powerful, Expressive
      ..but “in nature” is mostly: Simple, Repetitive, Boring
      Statistical Models

  17. The “naturalness of software” thesis [Hindle et al, 2011]
      Programming languages are complex...
      ...but natural programs are simple & repetitive,
      and this, too, CAN BE EXPLOITED!!
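
One rough way to see this repetitiveness is to count how many adjacent token pairs (bigrams) of an unseen snippet already occur in code seen before. The sketch below is only an illustration: the regex tokenizer, the function names `tokenize`, `bigrams`, and `seenBigramRate`, and the two snippets are assumptions made for this example, not part of the talk's tooling.

```javascript
// Split source text into identifiers, numbers, and single punctuation tokens.
// This regex is a simplification, not a full JavaScript lexer.
function tokenize(code) {
  return code.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// All adjacent token pairs, joined with a space so they can live in a Set.
function bigrams(tokens) {
  const pairs = [];
  for (let i = 0; i + 1 < tokens.length; i++) {
    pairs.push(tokens[i] + ' ' + tokens[i + 1]);
  }
  return pairs;
}

// Fraction of bigrams in newCode that already appear in trainingCode.
function seenBigramRate(trainingCode, newCode) {
  const seen = new Set(bigrams(tokenize(trainingCode)));
  const fresh = bigrams(tokenize(newCode));
  if (fresh.length === 0) return 0;
  const hits = fresh.filter((pair) => seen.has(pair)).length;
  return hits / fresh.length;
}

// Hypothetical snippets: same loop shape, different identifier names.
const training = 'for (var i = 0; i < n; i++) { total += a[i]; }';
const unseen = 'for (var j = 0; j < n; j++) { count += b[j]; }';
console.log(seenBigramRate(training, unseen).toFixed(2));
```

A rate well above zero for code that shares no variable names hints at how much of the surrounding structure is predictable, which is the regularity a statistical model can exploit.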

  20. Variable Name Autonym Guesser (AUTONYM)

      Minified source code:

          function u(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
          }

      Un-minified source code:

          function mix(dest, src) {
            for (var k in src) dest[k] = src[k];
            return dest;
          }

  21. Autonym pipeline: Minified Source Code → Pre-processing → Moses SMT → Post-processing → Un-Minified Source Code

  22. Autonym pipeline: Pre-processing → Moses SMT → Post-processing
      What’s the relevance of Machine Translation?

  31. Noisy channel translation model
      • Diagram components: language model, channel model, distorted message
      • Goal: recover the original message e (by Bayes' theorem, for a given distorted message)
      • The two terms: the translation (channel distortion) model and the language model p(e)
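
In the standard noisy channel formulation of statistical machine translation, the recovery goal can be written out with Bayes' theorem. The symbols are assumptions that follow the usual convention, since the slide equations did not survive extraction: $e$ is the original message and $f$ is the distorted message that is observed.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \; p(e \mid f) \\
        &= \arg\max_{e} \; \frac{p(f \mid e)\, p(e)}{p(f)} \\
        &= \arg\max_{e} \; \underbrace{p(f \mid e)}_{\text{translation (channel) model}} \;
           \underbrace{p(e)}_{\text{language model}}
\end{align*}
```

The denominator $p(f)$ can be dropped in the last step because it is constant for a given $f$ and does not affect the maximization.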

  34. Translating French to English
      • Translation model: trained on an aligned French-English corpus
      • Language model: trained on an English corpus

  37. Translating minified to clear JS
      • Data source: GitHub + minifier
      • Translation model: trained on an aligned clear-minified code corpus
      • Language model: trained on a clear code corpus

  41. Alignment

      Natural language: non-trivial alignment
      • Reordering
      • Different length
      • Dropped words

          EN: I know what you named your identifiers!
          NL: Ik weet wat je je ID's genoemd!

      Minification: straightforward alignment

          function u(n, r) {
          function mix(dest, src){
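
Because a minifier only renames identifiers, a clear line and its minified counterpart have the same number of tokens in the same order, so they can be aligned position by position. The sketch below illustrates that idea; the regex lexer, the small reserved-word list, and the function names `lexLine` and `alignNames` are assumptions for this example and not Autonym's actual pre-processing.

```javascript
// Words that a minifier never renames (a deliberately small, incomplete list).
const RESERVED = new Set(['function', 'for', 'var', 'in', 'return', 'this']);

// Simplified lexer: identifiers, numbers, and single punctuation characters.
function lexLine(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// Pair up tokens by position and collect [minified, clear] name pairs.
function alignNames(minifiedLine, clearLine) {
  const minTokens = lexLine(minifiedLine);
  const clearTokens = lexLine(clearLine);
  if (minTokens.length !== clearTokens.length) {
    throw new Error('Lines do not align token by token');
  }
  const pairs = [];
  minTokens.forEach((tok, i) => {
    const isName = /^[A-Za-z_$]/.test(tok) && !RESERVED.has(tok);
    if (isName && tok !== clearTokens[i]) {
      pairs.push([tok, clearTokens[i]]);
    }
  });
  return pairs;
}

const aligned = alignNames('function u(n, r) {', 'function mix(dest, src) {');
console.log(JSON.stringify(aligned)); // [["u","mix"],["n","dest"],["r","src"]]
```

No reordering, length change, or dropped tokens has to be modeled here, which is what makes this alignment so much simpler than the English-Dutch case.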

  47. Complications (1): Overloading
      Autonym: scope analysis

      Minified:

          function r(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
          }

      Recovered (first line shown):

          function mix(dest, src) {
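
In the slide's example the name `r` is overloaded: it is both the function's name and its second parameter. One way to keep such uses apart is to tag each identifier with the scope that binds it before translating. The sketch below handles only a single, flat function declaration; the function name `tagScopes`, the `@0`/`@1` tag format, and the regex-based parsing are assumptions for illustration, and a real tool would rely on a proper JavaScript parser.

```javascript
// Keywords left untouched (a small, incomplete list for this example).
const NOT_RENAMED = new Set(['function', 'for', 'var', 'in', 'return', 'this']);

// Tag identifiers of one flat function declaration with a scope id:
//   @0 for the enclosing scope (the function's own name, free variables),
//   @1 for names bound inside the function (parameters and `var` locals).
function tagScopes(code) {
  const header = /^\s*function\s+([A-Za-z_$][\w$]*)\s*\(([^)]*)\)/.exec(code);
  if (!header) throw new Error('Expected a function declaration');
  const params = header[2].split(',').map((p) => p.trim()).filter(Boolean);
  const body = code.slice(header[0].length);

  // Inner bindings: parameters plus every name declared with `var`.
  const inner = new Set(params);
  for (const m of body.matchAll(/\bvar\s+([A-Za-z_$][\w$]*)/g)) {
    inner.add(m[1]);
  }

  // The lookbehind skips property names (after a dot) and identifier tails.
  const taggedBody = body.replace(/(?<![.\w$])[A-Za-z_$][\w$]*/g, (name) => {
    if (NOT_RENAMED.has(name)) return name;
    return name + (inner.has(name) ? '@1' : '@0');
  });
  const taggedParams = params.map((p) => p + '@1').join(', ');
  return 'function ' + header[1] + '@0(' + taggedParams + ')' + taggedBody;
}

const tagged = tagScopes(
  'function r(n, r) { for (var t in r) n[t] = r[t]; return n; }'
);
console.log(tagged);
// function r@0(n@1, r@1) { for (var t@1 in r@1) n@1[t@1] = r@1[t@1]; return n@1; }
```

After tagging, `r@0` and `r@1` are distinct tokens, so a translation step could map them to different names such as `mix` and `src`.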

  49. Complications (2): Consistency (sentence-by-sentence translation)
      Autonym: language model scoring

      Minified:

          function r(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
          }

      Translated line by line (note `src` on the first line but `list` on the second):

          function mix(dest, src) {
            for (var k in list) dest[k] = list[k];
            return dest;
          }

      Idea: try all, and let the language model decide which is more natural, on average, across ALL lines.
      Components: translation model, language model
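
The idea on this slide can be sketched as follows: for each candidate name, substitute it on every line that uses the variable, score each line with a language model, and keep the candidate with the best average score. The add-one-smoothed bigram model, the tiny training corpus, the `NAME` placeholder, and the names `trainBigramModel`, `lineLogProb`, and `pickConsistentName` are all assumptions made for this example; the talk's pipeline uses Moses with its own language model.

```javascript
// Simplified lexer: identifiers, numbers, and single punctuation characters.
function words(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// Count bigrams and their left contexts over a (toy) corpus of clear code.
function trainBigramModel(corpusLines) {
  const contextCounts = new Map();
  const bigramCounts = new Map();
  const vocab = new Set();
  for (const line of corpusLines) {
    const toks = ['<s>', ...words(line)];
    toks.forEach((tok, i) => {
      vocab.add(tok);
      if (i === 0) return;
      const prev = toks[i - 1];
      contextCounts.set(prev, (contextCounts.get(prev) || 0) + 1);
      const key = prev + ' ' + tok;
      bigramCounts.set(key, (bigramCounts.get(key) || 0) + 1);
    });
  }
  return { contextCounts, bigramCounts, vocabSize: vocab.size };
}

// Log-probability of a line under the model, with add-one smoothing.
function lineLogProb(model, line) {
  const toks = ['<s>', ...words(line)];
  let logProb = 0;
  for (let i = 1; i < toks.length; i++) {
    const prev = toks[i - 1];
    const seen = model.bigramCounts.get(prev + ' ' + toks[i]) || 0;
    const context = model.contextCounts.get(prev) || 0;
    logProb += Math.log((seen + 1) / (context + model.vocabSize));
  }
  return logProb;
}

// Try every candidate on ALL lines and keep the most natural one on average.
function pickConsistentName(model, templateLines, candidates) {
  let best = null;
  let bestScore = -Infinity;
  for (const name of candidates) {
    const total = templateLines
      .map((tpl) => lineLogProb(model, tpl.split('NAME').join(name)))
      .reduce((a, b) => a + b, 0);
    const average = total / templateLines.length;
    if (average > bestScore) {
      bestScore = average;
      best = name;
    }
  }
  return best;
}

// Hypothetical clear-code corpus; real training data would be far larger.
const lm = trainBigramModel([
  'function extend(dest, src) {',
  'for (var k in src) dest[k] = src[k];',
  'function copy(dest, src) {',
  'for (var i in list) total += list[i];',
]);
const chosen = pickConsistentName(
  lm,
  ['function mix(dest, NAME) {', 'for (var k in NAME) dest[k] = NAME[k];'],
  ['src', 'list']
);
console.log(chosen); // src
```

Scoring the whole function at once, rather than each line separately, is what forces a single name for the variable everywhere it appears.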

  51. Evaluation
      • Held-out test set: 2,149 files
      • Comparison to JSNice [Raychev et al, 2015]
      • Metric: % names recovered
      • Global vs. local names (globals don’t change)

      Minified:

          var geom2d = function() {
            var t = numeric.sum;
            function r(n, r) {
              this.x = n;
              this.y = r;
            }
            ...

      Original:

          var geom2d = function() {
            var sum = numeric.sum;
            function Vector2d(x, y) {
              this.x = x;
              this.y = y;
            }
            ...
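
The metric can be sketched as the fraction of original names that a tool recovers, counted over local names only or over all names. This example assumes exact string matching and a made-up record layout (`original`, `recovered`, `isGlobal`); the function name `namesRecovered` and the sample recoveries are hypothetical and do not reproduce the evaluation harness or results from the talk.

```javascript
// Fraction of names recovered exactly; globals are skipped unless requested.
function namesRecovered(records, includeGlobals) {
  const scored = records.filter((r) => includeGlobals || !r.isGlobal);
  if (scored.length === 0) return 0;
  const hits = scored.filter((r) => r.recovered === r.original).length;
  return hits / scored.length;
}

// Made-up recoveries for illustration only (not measured results).
const sampleNames = [
  { original: 'geom2d', recovered: 'geom2d', isGlobal: true },
  { original: 'sum', recovered: 'sum', isGlobal: false },
  { original: 'Vector2d', recovered: 'Vector2d', isGlobal: false },
  { original: 'vector', recovered: 'v', isGlobal: false },
  { original: 'src', recovered: 'list', isGlobal: false },
];
console.log(namesRecovered(sampleNames, false)); // 0.5
console.log(namesRecovered(sampleNames, true));  // 0.6
```

Because global names survive minification unchanged, including them raises the score without reflecting any recovery work, which is why the local-only figure is the more informative one.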

  52. [Figure: % names recovered (2,149 test files), split into global and local names. Y-axis: % names recovered, from 0.00 to 1.00. X-axis categories: Autonym (Local), Autonym (All), JSNice (Local), JSNice (All), JSNaughty (Local).]

  53. Joining forces
      [Figure: per-file accuracy of the two tools. X-axis: Autonym file accuracy (0.00 to 1.00). Y-axis: JSNice file accuracy (0.00 to 1.00). Shading: frequency.]

  55. Becoming JSNaughty
      Autonym pipeline (Pre-processing → Moses SMT → Post-processing) combined with JSNice
