
Machine-Learning-Guided Selectively Unsound Static Analysis - Kihong Heo



  1. Machine-Learning-Guided Selectively Unsound Static Analysis
     Kihong Heo (Seoul National University), Hakjoo Oh (Korea University), Kwangkeun Yi (Seoul National University)
     26 May 2017, ICSE'17 @ Buenos Aires

  2. Goal
     [Figure: axes labeled False Positive and False Negative; points labeled Uniformly Sound and Uniformly Unsound]

  3. Goal
     [Figure: the same plot with a third point, Selectively Unsound, added to Uniformly Sound and Uniformly Unsound]

  4. Selectively Unsound Analysis
     • Selectively apply unsound strategies
     • e.g., unrolling loops, skipping library calls

         while(e){ C }   →   if(e){ C }
         A; lib(); B;    →   A; B;

     [Figure: three diagrams of program states and error states, labeled Uniformly Sound, Selectively Unsound, and Uniformly Unsound, with false positive and false negative regions marked]
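
The two example strategies can be read as rewrites applied only to chosen targets. The following is a minimal sketch under that reading, assuming a toy statement representation; the names `Stmt`, `LibCall`, `If`, `Loop`, and `apply_unsound` are hypothetical and do not come from the talk or the authors' tool.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Stmt:
    text: str                      # an ordinary statement such as "A" or "B"


@dataclass(frozen=True)
class LibCall:
    name: str                      # a call into an unanalyzed library


@dataclass(frozen=True)
class If:
    cond: str
    body: tuple


@dataclass(frozen=True)
class Loop:
    label: str                     # hypothetical identifier used to select the loop
    cond: str
    body: tuple


Node = Union[Stmt, LibCall, If, Loop]


def apply_unsound(prog: List[Node], loops: Set[str], libs: Set[str]) -> List[Node]:
    """Unroll the selected loops once and skip the selected library calls."""
    out: List[Node] = []
    for node in prog:
        if isinstance(node, Loop):
            body = tuple(apply_unsound(list(node.body), loops, libs))
            if node.label in loops:
                out.append(If(node.cond, body))      # while(e){C}  ->  if(e){C}
            else:
                out.append(Loop(node.label, node.cond, body))
        elif isinstance(node, If):
            out.append(If(node.cond, tuple(apply_unsound(list(node.body), loops, libs))))
        elif isinstance(node, LibCall) and node.name in libs:
            continue                                 # A;lib();B;  ->  A;B;
        else:
            out.append(node)
    return out


program = [Stmt("A"), LibCall("lib"), Stmt("B"), Loop("l1", "e", (Stmt("C"),))]
selective = apply_unsound(program, loops={"l1"}, libs={"lib"})
sound = apply_unsound(program, loops=set(), libs=set())
```

Passing empty sets leaves the program untouched, which corresponds to the uniformly sound case; passing every loop and library call corresponds to the uniformly unsound one.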

  5. Example
     • Sound buffer-overrun analyzer with interval domain
     • Soundly analyze all the loops

         str = "hello world";
         for(i=0; !str[i]; i++)     // buffer access 1
           skip;
         size = positive_input();
         for(i=0; i<size; i++)
           skip;
         ... = str[i];              // buffer access 2

  6. Example
     • Sound buffer-overrun analyzer with interval domain
     • Soundly analyze all the loops

         str = "hello world";       // str.size: [12, 12]
         for(i=0; !str[i]; i++)     // buffer access 1, i: [0, +oo]
           skip;
         size = positive_input();   // size: [0, +oo]
         for(i=0; i<size; i++)
           skip;
         ... = str[i];              // buffer access 2, i: [0, +oo]
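
The value `[0, +oo]` for `i` is what a conventional interval analysis with widening yields for a counting loop. The following is a minimal sketch, assuming intervals are `(lo, hi)` pairs and a textbook widening operator; the function names are illustrative, and the loop guard is deliberately ignored, mirroring a guard such as `i < size` with `size: [0, +oo]` that gives no useful upper bound.

```python
import math

INF = math.inf


def join(a, b):
    """Least upper bound of two intervals."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Textbook interval widening: unstable bounds jump to infinity."""
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)


def add_const(a, k):
    return (a[0] + k, a[1] + k)


def analyze_counting_loop(init):
    """Fixpoint at the loop head of `for (i = init; ...; i++)`."""
    head = init
    while True:
        after_body = add_const(head, 1)            # effect of i++
        new_head = widen(head, join(init, after_body))
        if new_head == head:                       # fixpoint reached
            return head
        head = new_head


def unrolled_once(init):
    """Unsound variant: the loop becomes an `if`, so i++ disappears."""
    return init


i_sound = analyze_counting_loop((0, 0))
i_unsound = unrolled_once((0, 0))
```

Here `i_sound` is `(0, inf)` and `i_unsound` is `(0, 0)`, matching the two annotations for `i` that appear across the example slides.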

  7. Example
     • Uniformly unsound buffer-overrun analyzer
     • Unsoundly unroll all the loops

         str = "hello world";
         i = 0;
         if (!str[i])               // buffer access 1
           skip;
         size = positive_input();
         i = 0;
         if (i < size)
           skip;
         ... = str[i];              // buffer access 2

  8. Example
     • Uniformly unsound buffer-overrun analyzer
     • Unsoundly unroll all the loops

         str = "hello world";
         i = 0;
         if (!str[i])               // buffer access 1, i: [0, 0]
           skip;
         size = positive_input();
         i = 0;
         if (i < size)
           skip;
         ... = str[i];              // buffer access 2, i: [0, 0]

  9. Example
     • Selectively unsound buffer-overrun analyzer
     • Unsoundly unroll only harmless loops

         str = "hello world";
         i = 0;
         if(!str[i])                // buffer access 1
           skip;
         size = positive_input();
         for(i = 0; i < size; i++)
           skip;
         ... = str[i];              // buffer access 2

  10. Example
      • Selectively unsound buffer-overrun analyzer
      • Unsoundly unroll only harmless loops

          str = "hello world";
          i = 0;
          if(!str[i])                // buffer access 1, i: [0, 0]
            skip;
          size = positive_input();
          for(i = 0; i < size; i++)
            skip;
          ... = str[i];              // buffer access 2, i: [0, +oo]
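
The three analyzers in this example differ only in the interval they compute for `i` at each access; the alarm decision itself can be the same bounds check against `str.size: [12, 12]`. The following is a minimal sketch, assuming an alarm is raised whenever the index interval is not provably inside `[0, size - 1]`. The name `may_overrun` is illustrative, and the intervals are copied from the slides rather than produced by the authors' tool.

```python
import math

INF = math.inf
STR_SIZE = 12                      # str.size: [12, 12], as annotated on the slides


def may_overrun(index, size=STR_SIZE):
    """Report an alarm unless the index interval lies within [0, size - 1]."""
    lo, hi = index
    return lo < 0 or hi > size - 1


# Index intervals at (buffer access 1, buffer access 2), as shown on the slides.
strategies = {
    "uniformly sound": ((0, INF), (0, INF)),
    "uniformly unsound": ((0, 0), (0, 0)),
    "selectively unsound": ((0, 0), (0, INF)),
}

alarms = {
    name: tuple(may_overrun(ix) for ix in accesses)
    for name, accesses in strategies.items()
}
```

Under this check the sound analyzer alarms on both accesses, the uniformly unsound one on neither, and the selective one only on buffer access 2, where `i` is bounded by the unconstrained input `size`.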

  11. Performance
      • Experiments with 2 analyzers & open-source software
      • Taint: 106 format string bugs / 13 programs
      • Interval: 138 buffer overrun bugs / 23 programs
      [Figure: bar charts of FPR and FNR (scale 0 to 100) for Taint Analysis and Interval Analysis, each with bars labeled Baseline, Selective, and Uniform]

  12. Setting
      • $F \in \mathit{Pgm} \times \Pi \to A$
      • Find a set of targets for unsound strategies, $\pi \in \Pi$
        • loops to analyze unsoundly ($\Pi = 2^{\mathit{Loop}}$)
        • library calls to analyze unsoundly ($\Pi = 2^{\mathit{Lib}}$)
      • Selectively apply unsound strategies to $p \in \pi$
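
The parameter space in this setting is a power set, so every choice of targets is simply a subset of the program's loops (or library calls). The following is a minimal sketch of that space, assuming two hypothetical loop labels; `powerset` follows the usual `itertools` recipe and is not part of the talk.

```python
from itertools import chain, combinations


def powerset(items):
    """All subsets of `items`, i.e. the parameter space 2^items."""
    items = sorted(items)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


loops = {"loop1", "loop2"}             # hypothetical loop labels
space = powerset(loops)

uniformly_sound = frozenset()          # no loop is analyzed unsoundly
uniformly_unsound = frozenset(loops)   # every loop is analyzed unsoundly
```

The uniformly sound and uniformly unsound analyzers from the earlier example are the two extreme points of this space; a selectively unsound analysis picks something in between. With n loops the space has 2^n elements.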

  13. System Overview
      [Diagram: components labeled Codebase, Training Data Generation, Training Data, Machine Learning, Classifier, Test Program, and Harmless Unsoundness (π), with phases labeled Training and Inferring, and the analyzer F]

  14. Training Data Generation
      • Given a codebase with known bugs and a sound static analyzer
      • Collect precision-decreasing yet harmless program components
      • e.g., unrolling a loop reduces only FPs but retains all TPs
      [Figure: the training program (loop 1 ... loop n) beside variants in which a single loop i is replaced by if i, each annotated with alarm counts. Extracted values: # true alarms 3, 4, 5, 5, 5; # false alarms 3, 10, 5, 10, 8]
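
The labeling rule on this slide (a component is harmless if analyzing it unsoundly removes false alarms while keeping every true alarm) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: `analyze` is a hypothetical callback that returns the set of reported alarms when the given components are analyzed unsoundly, and the alarm names and table are made up.

```python
def label_components(components, analyze, true_bugs):
    """Label each component True (harmless) or False.

    `analyze(unsound)` is a hypothetical callback returning the set of alarms
    reported when the components in `unsound` are analyzed unsoundly.
    """
    baseline = analyze(frozenset())            # fully sound run
    base_tp = baseline & true_bugs
    base_fp = baseline - true_bugs
    labels = {}
    for c in components:
        alarms = analyze(frozenset([c]))       # only c is analyzed unsoundly
        keeps_tp = (alarms & true_bugs) == base_tp
        fewer_fp = len(alarms - true_bugs) < len(base_fp)
        labels[c] = keeps_tp and fewer_fp
    return labels


# Made-up analysis results for a program with two loops and one known bug.
TRUE_BUGS = {"a2"}
TABLE = {
    frozenset(): {"a1", "a2", "a3"},
    frozenset(["loop1"]): {"a2", "a3"},        # drops false alarm a1, keeps a2
    frozenset(["loop2"]): {"a1"},              # loses the true alarm a2
}
labels = label_components(["loop1", "loop2"], TABLE.__getitem__, TRUE_BUGS)
```

In this toy run `loop1` is labeled harmless and `loop2` is not, because unrolling `loop2` would hide the known bug.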

  15. Features & Learning
      • Encode each program component as a feature vector
        $f(x) = \langle f_1(x), f_2(x), \ldots, f_n(x) \rangle$
        $f(\mathit{loop}_1) = \langle 1, 0, \ldots, 1 \rangle$
        $f(\mathit{loop}_2) = \langle 0, 1, \ldots, 1 \rangle$
        $f(\mathit{lib}_1) = \langle 0, 1, \ldots, 0 \rangle$
        $f(\mathit{lib}_2) = \langle 1, 1, \ldots, 1 \rangle$
      • Derive a classifier using an off-the-shelf algorithm, e.g., SVM
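
To make the learning step concrete, here is a minimal sketch of training a linear classifier on labeled feature vectors. The talk names SVM as its example algorithm; this sketch swaps in a plain perceptron so that it stays dependency-free, and the three-dimensional vectors and their labels are made up for illustration.

```python
def train_perceptron(samples, epochs=20):
    """Train a linear classifier on (feature_vector, label) pairs, label in {+1, -1}."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(model, x):
    """Return +1 (harmless to analyze unsoundly) or -1 for a feature vector."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Made-up training data: +1 means "harmless", -1 means "keep sound".
samples = [
    ((1, 0, 1), +1),
    ((0, 1, 1), -1),
    ((1, 1, 1), +1),
    ((0, 0, 1), -1),
]
model = train_perceptron(samples)
unseen = predict(model, (1, 1, 0))
```

A perceptron only finds some separating hyperplane when one exists, whereas an SVM maximizes the margin; either way, the trained model is then applied to the components of an unseen program.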

  16. Features
      • 22 features for loops

      | Feature | Property | Type | Description |
      |---|---|---|---|
      | Null | Syntactic | Binary | Whether the loop condition contains nulls or not |
      | Const | Syntactic | Binary | Whether the loop condition contains constants or not |
      | Array | Syntactic | Binary | Whether the loop condition contains array accesses or not |
      | Conjunction | Syntactic | Binary | Whether the loop condition contains && or not |
      | IdxSingle | Syntactic | Binary | Whether the loop condition contains an index for a single array in the loop |
      | IdxMulti | Syntactic | Binary | Whether the loop condition contains an index for multiple arrays in the loop |
      | IdxOutside | Syntactic | Binary | Whether the loop condition contains an index for an array outside of the loop |
      | InitIdx | Syntactic | Binary | Whether an index is initialized before the loop |
      | Exit | Syntactic | Numeric | The (normalized) number of exits in the loop |
      | Size | Syntactic | Numeric | The (normalized) size of the loop |
      | ArrayAccess | Syntactic | Numeric | The (normalized) number of array accesses in the loop |
      | ArithInc | Syntactic | Numeric | The (normalized) number of arithmetic increments in the loop |
      | PointerInc | Syntactic | Numeric | The (normalized) number of pointer increments in the loop |
      | Prune | Semantic | Binary | Whether the loop condition prunes the abstract state or not |
      | Input | Semantic | Binary | Whether the loop condition is determined by external inputs |
      | GVar | Semantic | Binary | Whether global variables are accessed in the loop condition |
      | FinInterval | Semantic | Binary | Whether a variable has a finite interval value in the loop condition |
      | FinArray | Semantic | Binary | Whether a variable has a finite size of array in the loop condition |
      | FinString | Semantic | Binary | Whether a variable has a finite string in the loop condition |
      | LCSize | Semantic | Binary | Whether a variable has an array of which the size is a left-closed interval |
      | LCOffset | Semantic | Binary | Whether a variable has an array of which the offset is a left-closed interval |
      | #AbsLoc | Semantic | Numeric | The (normalized) number of abstract locations accessed in the loop |
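
A few of the syntactic, binary loop features can be approximated directly from the text of a loop condition. The following is a rough sketch, assuming the condition is available as a C-like string; the regular expressions and the reading of "nulls" as `NULL` or `'\0'` are assumptions of this sketch, not the extraction used in the talk.

```python
import re


def syntactic_features(cond: str) -> dict:
    """Compute four binary features of a loop condition given as C-like text."""
    return {
        # Assumed reading of "nulls": the NULL macro or the '\0' character.
        "Null": int(bool(re.search(r"\bNULL\b|'\\0'", cond))),
        # A numeric literal that is not part of an identifier.
        "Const": int(bool(re.search(r"(?<![\w.])\d+\b", cond))),
        # An identifier followed by an opening bracket.
        "Array": int(bool(re.search(r"\w+\s*\[", cond))),
        "Conjunction": int("&&" in cond),
    }


first_loop = syntactic_features("!str[i]")
second_loop = syntactic_features("i < size")
```

The semantic features in the table depend on analysis results such as abstract states and intervals, so they cannot be computed from the condition text alone.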

  17. Features
      • 15 features for library calls

      | Feature | Property | Type | Description |
      |---|---|---|---|
      | Const | Syntactic | Binary | Whether the parameters contain constants or not |
      | Void | Syntactic | Binary | Whether the return type is void or not |
      | Int | Syntactic | Binary | Whether the return type is int or not |
      | CString | Syntactic | Binary | Whether the function is declared in string.h or not |
      | InsideLoop | Syntactic | Binary | Whether the function is called in a loop or not |
      | #Args | Syntactic | Numeric | The (normalized) number of arguments |
      | DefParam | Semantic | Binary | Whether a parameter is defined in a loop or not |
      | UseRet | Semantic | Binary | Whether the return value is used in a loop or not |
      | UptParam | Semantic | Binary | Whether a parameter is updated via the library call |
      | Escape | Semantic | Binary | Whether the return value escapes the caller |
      | GVar | Semantic | Binary | Whether a parameter points to a global variable |
      | Input | Semantic | Binary | Whether a parameter is determined by external inputs |
      | FinInterval | Semantic | Binary | Whether a parameter has a finite interval value |
      | #AbsLoc | Semantic | Numeric | The (normalized) number of abstract locations accessed in the arguments |
      | #ArgString | Semantic | Numeric | The (normalized) number of string arguments |

  18. Winning Features
      • Interval analysis
        • loops iterating on finite strings
        • library calls that return integers or manipulate strings

          str = "hello world";
          for (p = str; *p; p++) ...
          int r = lib1();
          lib2(str1, str2);

  19. Winning Features
      • Interval analysis
        • loops iterating on finite strings
        • library calls that return integers or manipulate strings

          str = "hello world";          // finite string
          for (p = str; *p; p++) ...    // array access, ptr increment
          int r = lib1();               // return integer
          lib2(str1, str2);             // str manipulation

  20. Winning Features
      • Taint analysis
        • library calls not propagating user inputs

          r1 = random();        r3 = fread(fd,buf,len)
          r2 = strlen(s)        r4 = recv(s,len,flags)

  21. Winning Features
      • Taint analysis
        • library calls not propagating user inputs

          # arguments, # abs. locations   <   # arguments, # abs. locations
          r1 = random();                      r3 = fread(fd,buf,len)
          r2 = strlen(s)                      r4 = recv(s,len,flags)

  22. Summary
      • First selectively unsound static analysis
        • more effective than uniformly sound / unsound ones
        • systematic way to tune unsoundness by ML
      [Figure: program states under Sound, Selectively Unsound, and Uniformly Unsound analyses]

  23. Summary
      • First selectively unsound static analysis
        • more effective than uniformly sound / unsound ones
        • systematic way to tune unsoundness by ML
      [Figure: program states under Sound, Selectively Unsound, and Uniformly Unsound analyses]
      Thank you
