Neural-Augmented Static Analysis of Android Communication
Jinman Zhao, Aws Albarghouthi, Vaibhav Rastogi, Somesh Jha, Damien Octeau
University of Wisconsin-Madison, Google
Key Idea: Use machine learning to refine results from static analysis.
Static Analysis: False Positives
[Diagram: Program & Property → Static Analyzer → Must True / Unsure... / Must False. The Unsure results are annotated "False Positives" and "Ranking problem".]
Machine Learning to Augment
[Diagram: Program & Property → Static Analyzer → Must True / Unsure... / Must False. A model is trained on the Must True and Must False results and predicts a likelihood ∈ [0, 1] for the Unsure ones.]
Link Inference for Android Communication
[Diagram: Program & Property → Static Analyzer → Inter-Component Communication links, split into Must True Links, May Links, and Must False Links. A model is trained on the Must True and Must False links and predicts a likelihood ∈ [0, 1] for the May links.]
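The triage idea on this slide can be sketched as a ranking step: whatever model produces a likelihood in [0, 1], the may links are simply sorted by it. The `Link` type and `toy_score` below are hypothetical placeholders for illustration only; a trained model would take the scorer's place.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a may link: (intent description, filter description).
Link = Tuple[str, str]


def rank_may_links(
    may_links: List[Link], score: Callable[[Link], float]
) -> List[Tuple[float, Link]]:
    """Score every may link and return them from most to least likely."""
    scored = [(score(link), link) for link in may_links]
    for likelihood, _ in scored:
        if not 0.0 <= likelihood <= 1.0:
            raise ValueError("likelihood must lie in [0, 1]")
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(link: Link) -> float:
    """Placeholder scorer (an assumption, not the model from the talk)."""
    intent, flt = link
    return 1.0 if intent == flt else 0.25


ranked = rank_may_links([("VIEW", "SEND"), ("VIEW", "VIEW")], toy_score)
```

An analyst would then inspect `ranked` from the top instead of treating all may links as equally suspicious.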
Task: Link Inference in Android Communication
Android ICC: A User’s Experience
[Figure: a Restaurant entry ((xxx) xxx-xxxx, 1234 Alice St. Orlando, FL) with a "Send a message" action ("I’d like to make a reservation ..."). Labels: Inter-Component Communication, Intent, Component w/ Filter, Malicious APP.]
Android ICC: An Example
[Figure: code view of (part of) the resolution logic. Given an Intent and a Filter: ICC link? Yes!]
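The code shown on this slide is not recoverable from the text. As a rough illustration, the action and category tests described in the Android developer documentation can be sketched as follows. The `Intent` and `IntentFilter` classes here are simplified stand-ins (not the Android API), and the data (URI and MIME type) test is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Intent:
    """Simplified intent: an optional action and a set of categories."""
    action: Optional[str] = None
    categories: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class IntentFilter:
    """Simplified filter: a set of actions and a set of categories."""
    actions: FrozenSet[str] = frozenset()
    categories: FrozenSet[str] = frozenset()


def action_test(intent: Intent, flt: IntentFilter) -> bool:
    # A filter that lists no actions matches nothing.
    if not flt.actions:
        return False
    # An intent without an action passes if the filter lists any action.
    return intent.action is None or intent.action in flt.actions


def category_test(intent: Intent, flt: IntentFilter) -> bool:
    # Every category in the intent must also appear in the filter.
    return intent.categories <= flt.categories


def may_resolve(intent: Intent, flt: IntentFilter) -> bool:
    """True if the intent passes both the action and the category test."""
    return action_test(intent, flt) and category_test(intent, flt)


view = Intent("VIEW", frozenset({"DEFAULT"}))
browser = IntentFilter(frozenset({"VIEW"}), frozenset({"DEFAULT", "BROWSABLE"}))
sender = IntentFilter(frozenset({"SEND"}), frozenset({"DEFAULT"}))
```

With these definitions, `may_resolve(view, browser)` holds while `may_resolve(view, sender)` does not. Static analysis often cannot determine the exact strings, which is where may links come from.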
(Bigger part of) the resolution logic (Octeau et al., POPL’16)
Previous Work: PRIMO
● PRIMO (Octeau et al., POPL’16) uses a hand-crafted probabilistic model that assigns probabilities to ICC links inferred by static analysis.
○ Laborious, error-prone, and dependent on expert domain knowledge.
○ Hard to keep up with the constantly evolving Android system.
Questions
#1 How can we triage may links with minimal expert domain knowledge? Neural networks.
#2 How can we process inputs of complex data types in a systematic way? Type-directed encoder.
#3 How do our models perform? Very well!
#4 Are the models learning the right things? It seems so.
We are not trying to…
● Propose new NN module
● Eliminate use of domain knowledge
● Rule out manual effort
We are trying to…
● Propose systematic way to construct NN
● Provide decent performance without expert knowledge
● Use less labour with more automation
Approach Part 1: How can we triage may links with minimal expert domain knowledge?
Link-Inference Neural Network
LINN: An end-to-end encoder-and-classifier architecture.
[Diagram: an Intent and a Filter each pass through an Encoder; a Classifier maps the two encodings to a likelihood in [0,1]. Must links are used to train the model; May links are the inputs it predicts on.]
Approach Part 2: How can we process inputs of complex data types in a systematic way?
Model
[Diagram: Intent → Encoder and Filter → Encoder; both encodings → Classifier → [0,1].]
Type-Directed Encoder
TDE: mapping type signature to neural network architecture.
[Diagram: Type signature → (Rules) → Neural network template → (Instantiation) → Neural network.]
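An example: Encoding Product Types
Type: T := tuple(A, B)
Instance: t := (a, b), with a : A, b : B, t : T
Component encoders: encA maps a to a-en : $\mathbb{R}^n$; encB maps b to b-en : $\mathbb{R}^m$
Combiner: comb : $\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^l$
Tuple encoder: encT maps t to t-en = comb(a-en, b-en) : $\mathbb{R}^l$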
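A minimal sketch of the product-type rule, assuming `comb` is a single tanh layer over the concatenated component encodings. That choice, the randomly initialised (untrained) weights, and the toy `enc_a` and `enc_b` are illustrative assumptions; the deck itself names concat and TreeLSTM as possible instantiations of `comb`.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def make_comb(n: int, m: int, l: int, seed: int = 0) -> Callable[[Vector, Vector], Vector]:
    """Build comb : R^n x R^m -> R^l as one tanh layer (untrained weights)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n + m)] for _ in range(l)]

    def comb(a_en: Vector, b_en: Vector) -> Vector:
        if len(a_en) != n or len(b_en) != m:
            raise ValueError("component encodings have the wrong dimension")
        x = a_en + b_en  # concatenate the two encodings
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return comb


def make_tuple_encoder(enc_a, enc_b, comb) -> Callable[[Tuple], Vector]:
    """encT for T := tuple(A, B): encode each component, then combine."""
    def enc_t(t: Tuple) -> Vector:
        a, b = t
        return comb(enc_a(a), enc_b(b))
    return enc_t


# Toy component encoders (hypothetical): ints to R^2, strings to R^3.
def enc_a(a: int) -> Vector:
    return [float(a), 1.0]


def enc_b(b: str) -> Vector:
    return [float(len(b)), float(sum(c in "aeiou" for c in b)), 1.0]


enc_t = make_tuple_encoder(enc_a, enc_b, make_comb(n=2, m=3, l=4))
t_en = enc_t((3, "intent"))  # a vector in R^4
```

The point of the rule is compositionality: `enc_t` is itself an encoder, so it can serve as a component of a larger tuple.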
Rules for type-directed encoding
Android ICC: Our Abstraction
Type signatures
Intent: intent := tuple(act, cats)
    Action: act := optional(string)
    Categories: cats := set(string)
Filter: filter := tuple(acts, cats)
    Actions: acts := set(string)
    Categories: cats := set(string)
[Diagram: type tree for intent. tuple → act (optional → string → list → char) and cats (set → string → list → char).]
Type-Directed Encoder
[Diagram: Type signature → (Rules) → Neural network template, shown for intent.]
tuple (intent) → comb (intent-en)
optional (act) → union (act-en)
set (cats) → aggr (cats-en)
list (string) → flat (str-en)
char → enum (char-en)
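The rule application on this slide can be sketched as a recursive walk over the type signature. The pairing of type constructors with template operators (tuple/comb, optional/union, set/aggr, list/flat, char/enum) is read off the slide; the nested-tuple representation of signatures is an assumption made for this example.

```python
# Type constructor -> template operator, as paired on the slide.
RULES = {
    "tuple": "comb",
    "optional": "union",
    "set": "aggr",
    "list": "flat",
    "char": "enum",
}

# Type signatures as nested tuples: (constructor, *argument_types).
CHAR = ("char",)
STRING = ("list", CHAR)
INTENT = ("tuple", ("optional", STRING), ("set", STRING))
FILTER = ("tuple", ("set", STRING), ("set", STRING))


def template(signature):
    """Map a type signature to a template tree of operator names."""
    constructor, *arguments = signature
    return (RULES[constructor], *[template(arg) for arg in arguments])


intent_template = template(INTENT)
filter_template = template(FILTER)
```

Here `intent_template` is `("comb", ("union", ("flat", ("enum",))), ("aggr", ("flat", ("enum",))))`, mirroring the tree on the slide. Each operator is still abstract at this stage.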
Type-Directed Encoder: Instantiation
[Diagram: Neural network template → (Instantiation) → Neural network (typed-tree).]
comb → TreeLSTM
union → switch
aggr → TreeLSTM
flat → CNN
enum → lookup
Type-Directed Encoder: Instantiation
[Diagram: Neural network template → (Instantiation) → Neural network (str-rnn).]
comb → concat
union → switch
aggr → max
flat → RNN
enum → lookup
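To make the str-rnn choices concrete, here is a toy, untrained instantiation of the intent encoder in plain Python: lookup for enum, a bare recurrent update for flat, switch for union, element-wise max for aggr, and concatenation for comb. The dimension, the random weights, the zero vector for None, and the exact recurrence are all assumptions for illustration, not the configuration used in the talk.

```python
import math
import random
from typing import Dict, Iterable, List, Optional

DIM = 4
_rng = random.Random(0)
_embeddings: Dict[str, List[float]] = {}
# Recurrent weights (random here; learned in a real model).
W_H = [[_rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
NONE_VEC = [0.0] * DIM  # stands in for a learned encoding of None


def lookup(ch: str) -> List[float]:
    """enum -> lookup: one embedding vector per character."""
    if ch not in _embeddings:
        _embeddings[ch] = [_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return _embeddings[ch]


def rnn(chars: str) -> List[float]:
    """flat -> RNN: h_t = tanh(x_t + W_h h_{t-1}); return the final state."""
    h = [0.0] * DIM
    for ch in chars:
        x = lookup(ch)
        h = [math.tanh(x[i] + sum(W_H[i][j] * h[j] for j in range(DIM)))
             for i in range(DIM)]
    return h


def switch(value: Optional[str]) -> List[float]:
    """union -> switch: a dedicated vector for None, else encode the string."""
    return NONE_VEC if value is None else rnn(value)


def max_aggr(values: Iterable[str]) -> List[float]:
    """aggr -> max: element-wise max over member encodings (order-free)."""
    encodings = [rnn(v) for v in values]
    if not encodings:
        return [0.0] * DIM
    return [max(column) for column in zip(*encodings)]


def encode_intent(act: Optional[str], cats: Iterable[str]) -> List[float]:
    """comb -> concat: join the action and category encodings."""
    return switch(act) + max_aggr(cats)


intent_en = encode_intent("VIEW", ["DEFAULT", "BROWSABLE"])
```

Swapping `max_aggr` or the concatenation for another operator changes the architecture without touching the type signature, which is what makes the design space easy to explore.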
A systematic way to build and explore structured NN.
Experiments: Are our models correctly predicting links?
Setup
● Dataset of 10,500 Android APPs from Google Play.
● IC3 (Octeau et al., ICSE’15) for static analysis.
● PRIMO’s abstract matching for may/must partition.
● Simulated ground truth for may links.
● 4 instantiations of the TDE architecture.

                # pairs    # positive    # negative
training set    105,108    63,168        41,940
testing set     43,680     29,260        14,420
All instantiated models perform as well as PRIMO.
Correlation
Our best model (typed-tree) fills the correlation gap by 72% compared to PRIMO, despite the harder setting.
More Results for Our Best Model
ROC (left) and the distribution of predicted likelihood (right) from the typed-tree model.
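For readers less familiar with ROC curves, the area under the curve can be computed directly from predicted likelihoods as the probability that a randomly chosen positive link is scored above a randomly chosen negative one. The sketch below uses that pairwise definition on made-up numbers; it is not the evaluation code behind the slide.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up likelihoods for four candidate links (1 = real link, 0 = not).
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

This quadratic version is fine for small samples; for test sets of the size reported in the setup, a sort-based implementation would be preferable.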
Interpretability: How do we know the model is learning the right thing?
Sensitivity to Masking
● Picking distinctive values
● Ignoring less useful parts
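The slide does not spell out the masking procedure. One plausible reading is sketched here: replace one input field at a time with a mask token and record how much the predicted likelihood moves. The field names, the `MASK` token, and `toy_score` are hypothetical; a trained model would supply the real scorer.

```python
from typing import Callable, Dict

MASK = "<masked>"  # hypothetical mask token


def masking_sensitivity(
    fields: Dict[str, str], score: Callable[[Dict[str, str]], float]
) -> Dict[str, float]:
    """Absolute change in predicted likelihood when each field is masked."""
    baseline = score(fields)
    deltas = {}
    for name in fields:
        masked = dict(fields)  # copy so the original input stays intact
        masked[name] = MASK
        deltas[name] = abs(baseline - score(masked))
    return deltas


def toy_score(fields: Dict[str, str]) -> float:
    """Placeholder model: only cares whether the two actions agree."""
    same = fields["intent_action"] == fields["filter_action"]
    return 0.75 if same else 0.25


example = {
    "intent_action": "VIEW",
    "filter_action": "VIEW",
    "intent_category": "DEFAULT",
}
deltas = masking_sensitivity(example, toy_score)
```

A large delta marks a part the model relies on, while a delta near zero marks a part it ignores, matching the two observations listed on the slide.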
Learned Encodings
Semantically closer values receive more similar encodings. Visualized by t-SNE.
[Figure: t-SNE plot of learned encodings; labelled points include default, (.*), and None.]
Conclusion
● Neural-augmented static analysis
● Type-directed encoder
● Increased accuracy with less domain knowledge
● Interpretability study
Future Work
● Apply to other analysis tasks
● Push machine learning into static analysis procedure
Thanks for listening! Q & A