Learning a Variable-Clustering Strategy for Octagon from Labeled Data Generated by a Static Analysis

Kihong Heo (Seoul National University), Hakjoo Oh (Korea University), Hongseok Yang (University of Oxford)

SAS 2016 @ Edinburgh
Long Term Goal
• Self-evolving static analysis by learning from big data
  • data: similar code, old versions, user feedback, bug reports, test results, etc.
  • mature in other fields: …
[Figure: Static Analyzer + Big Data]
Long Term Goal
• Finding a good abstraction for adaptive static analysis
  • F ∈ Pgm × Π → A
• Machine Learning (learner) + Static Analysis (teacher)
• e.g., relation, context, flow, etc.
[Figure: soundness, scalability, precision (shown twice)]
Relational Analysis
• Tracking relationships among variables
• e.g., octagon analysis: (±x) − (±y) ≤ c

    int a = b;
    int c = input();            // User input
    for (i = 0; i < b; i++) {
      assert (i < a);           // Query 1
      assert (i < c);           // Query 2
    }

Variables: {a, b, c, i}
*Consider x − y ≤ c only, for simplicity

Constraints, filled in step by step (the entry in row x, column y bounds y − x):
• b − a ≤ 0, a − b ≤ 0
• c − a ≤ ∞, c − b ≤ ∞, a − c ≤ ∞, b − c ≤ ∞
• i − b ≤ -1
• i − a ≤ -1
• i − c ≤ ∞

         a    b    c    i
    a    0    0    ∞   -1
    b    0    0    ∞   -1
    c    ∞    ∞    0    ∞
    i    ∞    ∞    ∞    0

• Do we need c?
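The constraint i − a ≤ -1 is not written in the program; it follows from i − b ≤ -1 and b − a ≤ 0. A minimal sketch of that derivation, assuming a plain difference-bound matrix with shortest-path closure (the helper names `new_dbm`, `add_constraint`, `close`, and `proves_less_than` are illustrative, not taken from the analyzer):

```python
import math

INF = math.inf
VARS = ["a", "b", "c", "i"]

def new_dbm(variables):
    """m[x][y] is an upper bound on y - x; start with no information."""
    return {x: {y: (0 if x == y else INF) for y in variables} for x in variables}

def add_constraint(m, y, x, c):
    """Record y - x <= c, keeping the tighter of the old and new bound."""
    m[x][y] = min(m[x][y], c)

def close(m):
    """Shortest-path closure: y - x <= (k - x) + (y - k) for every k."""
    for k in m:
        for x in m:
            for y in m:
                m[x][y] = min(m[x][y], m[x][k] + m[k][y])
    return m

def proves_less_than(m, x, y):
    """x < y is proved when the matrix implies x - y <= -1."""
    return m[y][x] <= -1

m = new_dbm(VARS)
add_constraint(m, "b", "a", 0)    # a = b gives b - a <= 0
add_constraint(m, "a", "b", 0)    # ... and a - b <= 0
add_constraint(m, "i", "b", -1)   # loop guard i < b gives i - b <= -1
close(m)

print(proves_less_than(m, "i", "a"))  # Query 1
print(proves_less_than(m, "i", "c"))  # Query 2
```

Under these assumptions Query 1 is proved and Query 2 is not, because every entry involving c stays at ∞.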
Selective Relational Analysis
• Selectively tracking relationships among variables
  • within the same cluster

    int a = b;
    int c = input();            // User input
    for (i = 0; i < b; i++) {
      assert (i < a);           // Query 1
      assert (i < c);           // Query 2
    }

Clusters: {a, b, i} and {c}

         a    b    i
    a    0    0   -1
    b    0    0   -1
    i    ∞    ∞    0

    −∞ ≤ c ≤ +∞
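A rough sketch of what clustering buys, assuming one difference matrix per cluster and simply dropping any constraint that spans two clusters (`analyze`, `proved`, and `entries` are hypothetical helpers for this example only):

```python
import math

INF = math.inf

def analyze(clusters, constraints):
    """Build one closed difference matrix per cluster.

    Each constraint (y, x, c) means y - x <= c; constraints whose
    variables fall in different clusters are ignored.
    """
    dbms = []
    for cluster in clusters:
        m = {x: {y: (0 if x == y else INF) for y in cluster} for x in cluster}
        for (y, x, c) in constraints:
            if x in m and y in m:
                m[x][y] = min(m[x][y], c)
        for k in cluster:
            for x in cluster:
                for y in cluster:
                    m[x][y] = min(m[x][y], m[x][k] + m[k][y])
        dbms.append(m)
    return dbms

def proved(dbms, x, y):
    """x < y is proved if some cluster's matrix implies x - y <= -1."""
    return any(x in m and y in m and m[y][x] <= -1 for m in dbms)

def entries(dbms):
    """Number of matrix entries the analysis has to maintain."""
    return sum(len(m) ** 2 for m in dbms)

constraints = [("b", "a", 0), ("a", "b", 0), ("i", "b", -1)]
full = analyze([["a", "b", "c", "i"]], constraints)
selective = analyze([["a", "b", "i"], ["c"]], constraints)

print(entries(full), entries(selective))
print(proved(full, "i", "a"), proved(selective, "i", "a"))
```

In this toy setting the clustered analysis keeps fewer entries while answering both queries the same way as the full one.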
Previous Solution [PLDI’14]
• Variable clustering by impact pre-analysis
  • estimating the impact of relationships
  • more scalable than the baseline Octagon analysis
  • more scalable & precise than other clustering methods
Problem [PLDI’14]
• Variable clustering by impact pre-analysis
  • fully relational pre-analysis as an online estimator
  • e.g., 17 open source benchmarks (~100KLOC)
[Figure: bar chart of time for [PLDI’14], split into Var. Clustering and Main Analysis and annotated with 98%; time axis from 0 to 40000]
New Solution (This Work)
• Learning a variable-clustering strategy from big data
  • fully relational pre-analysis as an offline teacher
  • 33x faster yet similarly precise
[Figure: bar chart of time for [PLDI’14] and [ML-based], each split into Var. Clustering and Main Analysis; time axis from 0 to 40000]
Big Picture
• Learning a variable-clustering strategy from big data
[Diagram: Codebase → Static Analysis → Training Data (var. relationship) → Machine Learning → Classifier; Target Program → Variable Clustering (Π) → Clusters (var. relationship) → Results]
Training Data
• Pairs of two variables with label {⊕, ⊖}
  • ⊕: precise (< +∞), ⊖: imprecise (= +∞)

    int a = b;
    int c = input();            // User input
    for (i = 0; i < b; i++) {
      assert (i < a);           // Query 1
      assert (i < c);           // Query 2
    }

Octagon Analysis:

         a    b    c    i
    a    0    0    ∞   -1
    b    0    0    ∞   -1
    c    ∞    ∞    0    ∞
    i    ∞    ∞    ∞    0

⊕: {(a,b), (a,i), (b,a) …}
⊖: {(a,c), (b,c), (c,a) …}
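The labeling rule can be sketched directly from the matrix: a pair is ⊕ when its entry is finite and ⊖ when it is +∞. The snippet below assumes ordered pairs and the row/column layout used on this slide; `label_pairs` is an illustrative name.

```python
import math

INF = math.inf

# Result of the octagon analysis on the example (row x, column y bounds y - x).
octagon_result = {
    "a": {"a": 0,   "b": 0,   "c": INF, "i": -1},
    "b": {"a": 0,   "b": 0,   "c": INF, "i": -1},
    "c": {"a": INF, "b": INF, "c": 0,   "i": INF},
    "i": {"a": INF, "b": INF, "c": INF, "i": 0},
}

def label_pairs(m):
    """Split all ordered pairs of distinct variables into positive and negative."""
    positive, negative = set(), set()
    for x in m:
        for y in m:
            if x == y:
                continue
            if m[x][y] < INF:
                positive.add((x, y))   # precise: finite bound
            else:
                negative.add((x, y))   # imprecise: bound is +infinity
    return positive, negative

positive, negative = label_pairs(octagon_result)
print(sorted(positive))
print(sorted(negative))
```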
Training Data
• Automatically generated by impact pre-analysis [PLDI’14]
  • fully relational, yet more scalable than the full octagon
  • γ(★) = ℤ, γ(⊤) = ℤ ∪ {+∞}

    int a = b;
    int c = input();            // User input
    for (i = 0; i < b; i++) {
      assert (i < a);           // Query 1
      assert (i < c);           // Query 2
    }

Octagon Analysis:

         a    b    c    i
    a    0    0    ∞   -1
    b    0    0    ∞   -1
    c    ∞    ∞    0    ∞
    i    ∞    ∞    ∞    0

Impact Pre-analysis:

         a    b    c    i
    a    ★    ★    ⊤    ★
    b    ★    ★    ⊤    ★
    c    ⊤    ⊤    ★    ⊤
    i    ⊤    ⊤    ⊤    ★

⊕: {(a,b), (a,i), (b,a) …}
⊖: {(a,c), (b,c), (c,a) …}
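A simplified sketch of the idea behind the ★/⊤ abstraction: each entry records only whether a finite bound is known, so closure reduces to transitive reachability. This is an assumption-laden toy (no loops, widening, or input handling), not the pre-analysis from [PLDI’14]; `STAR`, `TOP`, and `abstract_close` are names chosen for this example.

```python
STAR, TOP = "★", "⊤"   # ★: some finite bound, ⊤: possibly +infinity
VARS = ["a", "b", "c", "i"]

def new_abstract_dbm(variables):
    """Only the diagonal (x - x <= 0) is known to be finite at the start."""
    return {x: {y: (STAR if x == y else TOP) for y in variables} for x in variables}

def abstract_close(m):
    """If x-to-k and k-to-y are both finite, then x-to-y is finite as well."""
    for k in m:
        for x in m:
            for y in m:
                if m[x][k] == STAR and m[k][y] == STAR:
                    m[x][y] = STAR
    return m

m = new_abstract_dbm(VARS)
m["a"]["b"] = STAR   # a = b: b - a is bounded
m["b"]["a"] = STAR   # a = b: a - b is bounded
m["b"]["i"] = STAR   # i < b: i - b is bounded
abstract_close(m)

for x in VARS:
    print(x, [m[x][y] for y in VARS])
```

On the example this reproduces the ★ pattern of the impact pre-analysis matrix without computing any numeric bound.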
Features
• 30 features of variable pairs
  • boolean predicates of (x,y) in program P

Positive situations for Octagon:
- x=y+k or y=x+k
- x<=y+k or y<=x+k
- x=malloc(y) or y=malloc(x)
- x[y] or y[x]
- …

Negative situations for Octagon:
- x=cy or y=cx (c != 1)
- x=yz or y=xz
- x=y/z or y=x/z
- …

General syntactic features:
- x or y is a field
- x and y represent sizes of arrays
- x or y is the size of a const string
- x or y is a global variable
- …

General semantic features:
- x or y has a finite interval
- x or y is a local var in a recursive function
- x, y are not accessed in the same function
- …
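A sketch of how a few of these boolean predicates might be computed for one variable pair. The `Stmt` record and its `kind` strings are a made-up toy program representation for this example; a real analyzer would extract the features from its own intermediate representation.

```python
from typing import NamedTuple

class Stmt(NamedTuple):
    """Toy statement: the kind says how lhs relates to rhs and the constant k."""
    kind: str   # "eq_plus": lhs = rhs + k, "le_plus": lhs <= rhs + k, "scale": lhs = k * rhs
    lhs: str
    rhs: str
    k: int

def relates(stmts, kind, x, y):
    """True if some statement of this kind mentions exactly the pair {x, y}."""
    return any(s.kind == kind and {s.lhs, s.rhs} == {x, y} for s in stmts)

def features(stmts, global_vars, x, y):
    """Feature vector (as a dict of booleans) for the variable pair (x, y)."""
    return {
        "x=y+k or y=x+k": relates(stmts, "eq_plus", x, y),
        "x<=y+k or y<=x+k": relates(stmts, "le_plus", x, y),
        "x=cy or y=cx (c != 1)": any(
            s.kind == "scale" and s.k != 1 and {s.lhs, s.rhs} == {x, y} for s in stmts
        ),
        "x or y is a global variable": x in global_vars or y in global_vars,
    }

program = [
    Stmt("eq_plus", "a", "b", 0),    # int a = b;
    Stmt("le_plus", "i", "b", -1),   # i < b, i.e. i <= b - 1
]

print(features(program, set(), "a", "b"))
print(features(program, set(), "i", "c"))
```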
Features
• Importance of features by Gini index
  • negative & general > positive & domain-specific
[Figure: the four feature groups (positive situations for Octagon, negative situations for Octagon, general syntactic features, general semantic features), with the top 5 most important features marked]
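For reference, the Gini-based importance of a boolean feature is usually measured as the drop in Gini impurity when the data is split on it. A small worked sketch on made-up toy samples (`feature_A` and `feature_B` are placeholders, not features from the slide):

```python
def gini(labels):
    """Gini impurity of a list of boolean labels: 1 - p^2 - (1 - p)^2."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def gini_decrease(samples, feature):
    """Impurity before the split minus the size-weighted impurity after it."""
    labels = [label for _, label in samples]
    yes = [label for fs, label in samples if fs[feature]]
    no = [label for fs, label in samples if not fs[feature]]
    n = len(samples)
    return gini(labels) - len(yes) / n * gini(yes) - len(no) / n * gini(no)

# Toy samples: (feature values, label), with True standing for the positive label.
samples = [
    ({"feature_A": True,  "feature_B": True},  True),
    ({"feature_A": True,  "feature_B": False}, True),
    ({"feature_A": False, "feature_B": True},  False),
    ({"feature_A": False, "feature_B": False}, False),
]

print(gini_decrease(samples, "feature_A"))  # feature_A separates the labels perfectly
print(gini_decrease(samples, "feature_B"))  # feature_B carries no information here
```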
Classifier
• Learning a binary classifier C : Var × Var → {⊕, ⊖}
• using an off-the-shelf ML algorithm: decision tree
• Why decision tree?
  • more expressive than linear models
  • e.g., Octagon with logistic regression: 10~12x slower
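To illustrate the expressiveness point, here is a tiny hand-rolled decision-tree learner over boolean features (a stand-in for an off-the-shelf library, written only for this example). The toy target is positive exactly when one of two hypothetical features holds, a pattern no single linear threshold over the two features can separate.

```python
def gini(labels):
    """Gini impurity of boolean labels, written as 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def learn(samples, features):
    """Greedy tree: a leaf is a bool, an inner node is (feature, yes_tree, no_tree)."""
    labels = [label for _, label in samples]
    majority = sum(labels) * 2 >= len(labels)
    if not features or len(set(labels)) <= 1:
        return majority

    def cost(f):
        yes = [label for fs, label in samples if fs[f]]
        no = [label for fs, label in samples if not fs[f]]
        return len(yes) * gini(yes) + len(no) * gini(no)

    best = min(features, key=cost)
    yes = [(fs, label) for fs, label in samples if fs[best]]
    no = [(fs, label) for fs, label in samples if not fs[best]]
    if not yes or not no:
        return majority
    rest = [f for f in features if f != best]
    return (best, learn(yes, rest), learn(no, rest))

def classify(tree, fs):
    """Walk from the root to a leaf following the feature values."""
    while not isinstance(tree, bool):
        feature, yes, no = tree
        tree = yes if fs[feature] else no
    return tree

# Toy training set: positive exactly when one of the two features holds.
samples = [
    ({"f1": True,  "f2": True},  False),
    ({"f1": True,  "f2": False}, True),
    ({"f1": False, "f2": True},  True),
    ({"f1": False, "f2": False}, False),
]
tree = learn(samples, ["f1", "f2"])
print(tree)
```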
Clustering Strategy
• ⊕-marked variable pairs go in the same cluster
  • naturally covers transitive relationships

    int a = b;
    int c = input();            // User input
    for (i = 0; i < b; i++) {
      assert (i < a);           // Query 1
      assert (i < c);           // Query 2
    }

Classifier output C(x,y): ⊕ (a,b), ⊕ (b,i), ⊖ (a,i), ⊖ (a,c), …
[Figure: graph over a, b, c, i with the two ⊕ pairs drawn as edges]
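One way to turn ⊕ marks into clusters is to take connected components, for example with union-find. The sketch below assumes the classifier is given as a Python callable; `cluster` and the lookup-based `C` are illustrative, with `C` hard-coding the marks shown on this slide.

```python
def cluster(variables, classifier):
    """Group variables into connected components of the positive-pair graph."""
    parent = {v: v for v in variables}

    def find(v):
        # Follow parent links to the representative, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x in variables:
        for y in variables:
            if x != y and classifier(x, y):
                parent[find(x)] = find(y)   # merge the two components

    groups = {}
    for v in variables:
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# Marks from the slide: (a,b) and (b,i) are positive, (a,i) and (a,c) are negative.
positive = {("a", "b"), ("b", "i")}

def C(x, y):
    return (x, y) in positive

print(cluster(["a", "b", "c", "i"], C))  # [['a', 'b', 'i'], ['c']]
```

Although (a,i) is marked ⊖, a and i still land in the same cluster through b, which is the transitive coverage the slide refers to.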
Experiments
• Implemented on top of [analyzer logo]
  • sound & global analyzer
  • a buffer overrun detector for full C
• 17 open source benchmarks (~100KLOC)