Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names
Bogdan Vasilescu (CMU, ISR; @b_vasilescu), Casey Casalnuovo (UC Davis), Prem Devanbu (UC Davis; @devanbu)
Today
    var geom2d = function() {
        var t = numeric.sum;
        function r(n, r) {
            this.x = n;
            this.y = r;
        }
        u(r, {
            P: function e(n) {
                return t([ this.x * n.x, this.y * n.y ]);
            }
        });
        function u(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
        }
        return {
            V: r
        };
    }();
Today: data-driven method + tool
Minified source code:
    var geom2d = function() {
        var t = numeric.sum;
        function r(n, r) {
            this.x = n;
            this.y = r;
        }
        u(r, {
            P: function e(n) {
                return t([ this.x * n.x, this.y * n.y ]);
            }
        });
        function u(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
        }
        return {
            V: r
        };
    }();
Un-minified source code:
    var geom2d = function() {
        var sum = numeric.sum;
        function Vector2d(x, y) {
            this.x = x;
            this.y = y;
        }
        mix(Vector2d, {
            P: function dotProduct(vector) {
                return sum([ this.x * vector.x, this.y * vector.y ]);
            }
        });
        function mix(dest, src) {
            for (var k in src) dest[k] = src[k];
            return dest;
        }
        return {
            V: Vector2d
        };
    }();
Why?
• Programs are (also) written to be read
    “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” [Don Knuth]
• Well-chosen variable names are critical to source code readability, reusability, maintainability [many]
• Example tasks:
    • reverse engineering binaries
    • reverse engineering obfuscated JavaScript
    • consistent styling in large, distributed teams
Martin Vechev, “Probabilistic Learning From Big Code”. Keynote at ISSTA 2016
Key ingredient • The “naturalness” of software [Hindle et al., 2011]
Natural languages are complex Hmmmm….
Natural languages are complex Tiger, Tiger burning bright In the forests of the night What immortal hand or eye, Could frame thy fearful symmetry?
..but most utterances are simple & repetitive TIGER!! RUN!!!
English, தமிழ் (Tamil), German
• Can be Rich, Powerful, Expressive
• ..but “in nature” is mostly Simple, Repetitive, Boring
→ Statistical Models
The “naturalness of software” thesis Programming Languages are complex... ...but Natural Programs are simple & repetitive. and this, too, CAN BE EXPLOITED!! [Hindle et al, 2011]
.org Variable Name Autonym Guesser (AUTONYM)
Minified Source Code:
    function u(n, r) {
        for (var t in r) n[t] = r[t];
        return n;
    }
Un-Minified Source Code:
    function mix(dest, src) {
        for (var k in src) dest[k] = src[k];
        return dest;
    }
Autonym pipeline: Minified Source Code → Pre-processing → Moses SMT → Post-processing → Un-Minified Source Code
What’s the relevance of Machine Translation?
Noisy channel translation model
• A language model generates a message $e$; a channel model distorts it into $f$ (the distorted message)
• Goal: recover $e$ (for a given $f$)
• Bayes' theorem: $\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)$
• $p(f \mid e)$: translation model (channel distortion); $p(e)$: language model
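A minimal sketch of noisy-channel decoding, assuming a fixed list of candidate messages: for an observed distorted message f, pick the candidate e that maximizes p(f | e) · p(e), computed in log space. The function name `decode`, the two lookup tables, and every probability in them are made-up illustrations, not values or code from the talk.

```javascript
// Toy noisy-channel decoder. All names and numbers are hypothetical.

// Language model p(e): how natural each candidate clear message is.
const languageModel = new Map([
  ["the cat", 0.6],
  ["cat the", 0.1],
  ["the hat", 0.3],
]);

// Channel (translation) model p(f | e), indexed by the observed message f.
const channelModel = new Map([
  ["le chat", new Map([
    ["the cat", 0.7],
    ["cat the", 0.7],
    ["the hat", 0.05],
  ])],
]);

// Return the candidate e maximizing log p(f | e) + log p(e),
// or null if f has no candidates.
function decode(f) {
  const candidates = channelModel.get(f);
  if (!candidates) return null;
  let best = null;
  let bestScore = -Infinity;
  for (const [e, pFGivenE] of candidates) {
    const pE = languageModel.get(e) ?? 0;
    const score = Math.log(pFGivenE) + Math.log(pE);
    if (score > bestScore) {
      bestScore = score;
      best = e;
    }
  }
  return best;
}

console.log(decode("le chat")); // → the cat
```

In this toy, the channel model alone cannot separate "the cat" from "cat the"; the language model breaks the tie in favor of the more natural word order.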
Translating French ($f$) to English ($e$)
• Translation model: from an Aligned French-English Corpus
• Language model: from an English Corpus
Translating minified ($f$) to clear JS ($e$)
• Translation model: from an Aligned Clear-Minified Code Corpus (GitHub + minifier)
• Language model: from a Clear Code Corpus
Alignment
• Natural language: non-trivial alignment
    • Reordering
    • Different length
    • Dropped words
    EN: I know what you named your identifiers!
    NL: Ik weet wat je je ID's genoemd!
• Minification: straightforward alignment
    function u(n, r) {
    function mix(dest, src){
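Because minification only renames identifiers, a minified line and its clear counterpart have the same token sequence apart from the names. The sketch below illustrates that one-to-one alignment with a deliberately crude regex tokenizer; `tokenize` and `alignNames` are hypothetical helpers written for this example, not part of Autonym or Moses.

```javascript
// Split a line of JavaScript into identifiers, numbers, and single
// punctuation characters. Crude: it ignores strings, comments, and
// multi-character operators.
function tokenize(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|\S/g) || [];
}

// Pair up tokens position by position and keep only the pairs that differ,
// i.e. the renamed identifiers.
function alignNames(minifiedLine, clearLine) {
  const minTokens = tokenize(minifiedLine);
  const clearTokens = tokenize(clearLine);
  if (minTokens.length !== clearTokens.length) {
    throw new Error("token counts differ; not a rename-only pair");
  }
  const pairs = [];
  minTokens.forEach((tok, i) => {
    if (tok !== clearTokens[i]) pairs.push([tok, clearTokens[i]]);
  });
  return pairs;
}

const pairs = alignNames("function u(n, r) {", "function mix(dest, src) {");
console.log(JSON.stringify(pairs)); // → [["u","mix"],["n","dest"],["r","src"]]
```

No reordering, length change, or dropped token has to be modeled here, which is what makes the minification case simpler than the English-Dutch one.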
Complications (1): Overloading
    function r(n, r) {
        for (var t in r) n[t] = r[t];
        return n;
    }
• The minified name r is overloaded: it names both the function and its second parameter
• Autonym has to map the two uses to different names:
    function mix(dest, src) {
• Fix: scope analysis
Complications (2): Consistency (sentence-by-sentence translation)
Minified:
    function r(n, r) {
        for (var t in r) n[t] = r[t];
        return n;
    }
Autonym, translating line by line:
    function mix(dest, src) {
        for (var k in list) dest[k] = list[k];
        return dest;
    }
• The same minified parameter r becomes src on one line and list on the next
• Fix: language model scoring
• Idea: try all candidates, and let the language model decide which is more natural, on average, across ALL lines
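The "try all, let the language model decide" idea can be sketched as follows: substitute each candidate name into every line that uses the variable, score each line with a language model, and keep the candidate with the best average score. This is only an illustration. The add-one-smoothed bigram model, the four-line training corpus, the `@@` placeholder, and the helper names are all assumptions made for this example; the talk does not specify the model at this level of detail.

```javascript
// Crude tokenizer: identifiers, numbers, single punctuation characters.
function tokenize(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|\S/g) || [];
}

// Train an add-one-smoothed bigram model on clear code and return a
// function that gives the average log-probability per token of a line.
function trainBigramModel(corpusLines) {
  const bigrams = new Map();
  const contexts = new Map();
  const vocab = new Set(["<s>"]);
  for (const line of corpusLines) {
    const tokens = ["<s>", ...tokenize(line)];
    tokens.forEach((tok) => vocab.add(tok));
    for (let i = 1; i < tokens.length; i++) {
      const prev = tokens[i - 1];
      const key = prev + " " + tokens[i];
      bigrams.set(key, (bigrams.get(key) || 0) + 1);
      contexts.set(prev, (contexts.get(prev) || 0) + 1);
    }
  }
  const v = vocab.size + 1; // one extra slot for unseen tokens
  return function scoreLine(line) {
    const tokens = ["<s>", ...tokenize(line)];
    if (tokens.length === 1) return 0; // empty line
    let logProb = 0;
    for (let i = 1; i < tokens.length; i++) {
      const prev = tokens[i - 1];
      const count = bigrams.get(prev + " " + tokens[i]) || 0;
      const prevCount = contexts.get(prev) || 0;
      logProb += Math.log((count + 1) / (prevCount + v));
    }
    return logProb / (tokens.length - 1);
  };
}

const PLACEHOLDER = "@@"; // marks every use of the variable being renamed

// Try each candidate on ALL lines; keep the one with the best average score.
function pickConsistentName(templateLines, candidates, scoreLine) {
  let best = null;
  let bestAvg = -Infinity;
  for (const name of candidates) {
    const total = templateLines
      .map((line) => scoreLine(line.split(PLACEHOLDER).join(name)))
      .reduce((a, b) => a + b, 0);
    const avg = total / templateLines.length;
    if (avg > bestAvg) {
      bestAvg = avg;
      best = name;
    }
  }
  return best;
}

// Hypothetical clear-code corpus and the translated function with the
// disputed name left open.
const scoreLine = trainBigramModel([
  "function extend(dest, src) {",
  "for (var k in src) dest[k] = src[k];",
  "for (var key in src) target[key] = src[key];",
  "for (var i in list) out.push(list[i]);",
]);
const lines = [
  "function mix(dest, @@) {",
  "for (var k in @@) dest[k] = @@[k];",
  "return dest;",
];
console.log(pickConsistentName(lines, ["src", "list"], scoreLine)); // → src
```

Averaging over all lines, rather than picking a winner per line, is what forces a single name for the variable throughout the function.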
Evaluation
• Held-out test set: 2,149 files
• Comparison to JSNice [Raychev et al., 2015]
• Metric: % names recovered
• Global vs. local names (globals don’t change)
Minified:
    var geom2d = function() {
        var t = numeric.sum;
        function r(n, r) {
            this.x = n;
            this.y = r;
        }
        ...
Original:
    var geom2d = function() {
        var sum = numeric.sum;
        function Vector2d(x, y) {
            this.x = x;
            this.y = y;
        }
        ...
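One plausible reading of the "% names recovered" metric, sketched under the assumption that a name counts as recovered only on an exact match with the original, and that globals can be excluded because minification leaves them unchanged. The function `fractionRecovered`, its input shape, and the sample tool output are hypothetical.

```javascript
// original: array of { name, isGlobal } for each identifier in the clear file.
// recovered: array of names proposed by a tool, in the same order.
// Returns the fraction of considered names that match exactly.
function fractionRecovered(original, recovered, { localsOnly = false } = {}) {
  let considered = 0;
  let hits = 0;
  original.forEach((orig, i) => {
    if (localsOnly && orig.isGlobal) return; // globals are never renamed
    considered++;
    if (recovered[i] === orig.name) hits++;
  });
  return considered === 0 ? 0 : hits / considered;
}

// Names from the geom2d example; the tool output is made up.
const original = [
  { name: "geom2d", isGlobal: true },
  { name: "sum", isGlobal: false },
  { name: "Vector2d", isGlobal: false },
  { name: "x", isGlobal: false },
  { name: "y", isGlobal: false },
];
const toolOutput = ["geom2d", "sum", "Vector", "x", "y"];

console.log(fractionRecovered(original, toolOutput)); // → 0.8
console.log(fractionRecovered(original, toolOutput, { localsOnly: true })); // → 0.75
```

Counting globals inflates the score, since they are "recovered" for free; that is why the local-only figure is the more informative one.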
% names recovered (2,149 test files)
[Figure: % names recovered over 2,149 files (axis 0.00 to 1.00) for Autonym (Local), Autonym (All), JSNice (Local), JSNice (All), and JSNaughty (Local); legend: Global, Local.]
Joining forces
[Figure: JSNice file accuracy (0.00 to 1.00) plotted against Autonym file accuracy (0.00 to 1.00); shading shows frequency (legend: 20, 40, 60).]
Becoming JSNaughty: Autonym (Pre-processing → Moses SMT → Post-processing) + JSNice