DeBIN: Predicting Debug Information in Stripped Binaries
https://debin.ai
Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, Martin Vechev
Binaries with debug symbols
Descriptive names for functions and variables

Binary with debug symbols

  Assembly:
    80534BA: push %ebp
             push %edi
             push %esi
             ...

  Debug symbols:
    80534BA  rfc1035_init  int
    8053DB1  fopen64       int
    8063320  num_entries   int
    ...

Decompiled code (Hex-rays)

    int rfc1035_init() {
      ...
      if ( num_entries <= 0 ) {
        v0 = ("/etc/resolv.conf", 'r');
        ...
        if ( v0 || (v1 = fopen64("resolv.conf"))){
          // code to read and
          // manipulate DNS settings
        }
        ...
      }
      ...
    }
Stripped binaries
Non-descriptive names

Stripped binary

  Assembly:
    80534BA: push %ebp
             push %edi
             push %esi
             ...

  Debug symbols:
    ...

Decompiled code (Hex-rays)

    int sub_80534BA() {
      ...
      if ( dword_8063320 <= 0 ) {
        v0 = ("/etc/resolv.conf", 'r');
        ...
        if ( v0 || (v1 = sub_8053DB1("resolv.conf"))){
          ...
        }
        ...
      }
      ...
    }

Can we recover the debug symbols?
Yes, with roughly 65% accuracy!
Challenges

    <sum> start:
      mov 4(%esp), %ecx
      mov $0, %eax
      mov $1, %edx
      add %edx, %eax
      add $1, %edx
      cmp %ecx, %edx
      jne 8048400
      repz ret
    <sum> end

The function computes 1 + 2 + … + n.

1. No mapping from registers and memory offsets to semantic variables. Some locations store the value of a semantic variable; others store an intermediate (non-semantic) value.
2. No names and types. Nothing in the stripped code says which locations store the values of the unsigned integer variable n, or that the result is stored in an integer variable res.
DeBIN: Recovering debug information

Input: assembly with no debug information. Output: the same assembly plus recovered debug information.

    <sum> start:
      mov 4(%esp), %ecx
      mov $0, %eax
      mov $1, %edx
      add %edx, %eax
      add $1, %edx
      cmp %ecx, %edx
      jne 8048400
      repz ret
    <sum> end

  Recovered debug information:
    Name  Type
    sum   int
    n     uint
    i     uint
    res   int

DeBIN recovers location information, types, and names
DEMO
How does DeBIN work?
DeBIN: System overview

Learning phase: binaries with debug symbols are used to train two models, a variable recovery model and a names/types model.
Prediction phase: the two models take a stripped binary and produce a binary with debug symbols.

  Assembly (identical in the stripped binary and in the output):
    start:
      mov 4(%esp), %ecx
      mov $0, %eax
      mov $1, %edx
      add %edx, %eax

  Debug symbols (empty in the stripped binary, predicted in the output):
    Location  Name  Type
    start     sum   int
    4(%esp)   n     uint
    %eax      res   int
    %edx      i     uint
Step 1: Recovering variables
Learning how to recover variables

Pipeline: binaries with debug symbols -> feature templates -> extracted features -> binary feature vectors -> ensemble of trees

  >8K binaries with debug symbols
  >10K distinct features, for example:
    plus[%edx][1]
    inst[add][%edx]
    dep[%edx][%edx]
    dep[%ecx][%edx]
    ⋮
  >2M binary feature vectors, for example:
    001000001
    101010011
    011011011
    111011100
    000100100
    ⋮
  Ensemble of 100 decision trees
Variable recovery

  Assembly:
    mov 4(%esp), %ecx
    mov $0, %eax
    mov $1, %edx
    add %edx, %eax
    add $1, %edx.2
    cmp %ecx, %edx
    jne 8048400
    repz ret

  Register: %edx.2
  Features: plus[%edx][1], inst[add][%edx], ⋮
  Feature vector: 00100101010001

The feature vector is classified by extremely randomized trees into one of two classes:
  sem (DeBIN will predict name and type)
  tmp (stores an intermediate value)

Extremely randomized trees, Pierre Geurts, Damien Ernst, and Louis Wehenkel, Machine Learning 2006
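The step from extracted features to binary feature vectors can be sketched as follows. This is a minimal illustration, not DeBIN's code: the helper names `build_vocabulary` and `encode` and the two toy samples are hypothetical, and only the feature strings are taken from the slides. Each distinct feature gets one column, and a register occurrence is encoded as 1 in the columns of the features it has. Vectors of this shape are what a tree ensemble would then be trained on.

```python
def build_vocabulary(samples):
    """Assign one column index to every distinct feature string."""
    vocab = {}
    for features in samples:
        for feature in features:
            # setdefault only evaluates len(vocab) before inserting,
            # so indices are assigned in order of first appearance.
            vocab.setdefault(feature, len(vocab))
    return vocab


def encode(features, vocab):
    """Turn a list of feature strings into a 0/1 vector over the vocabulary."""
    vector = [0] * len(vocab)
    for feature in features:
        index = vocab.get(feature)
        if index is not None:  # features unseen during training are ignored
            vector[index] = 1
    return vector


# Hypothetical toy data: the features extracted for two register occurrences.
training_samples = [
    ["plus[%edx][1]", "inst[add][%edx]", "dep[%edx][%edx]"],
    ["dep[%ecx][%edx]", "inst[add][%edx]"],
]

vocab = build_vocabulary(training_samples)
vectors = [encode(sample, vocab) for sample in training_samples]
```

With more than 10K distinct features, a real implementation would likely use a sparse representation rather than Python lists; the dense form is kept here only for readability.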
Step 2: Predicting names and types
Probabilistic graphical model

  Unknown elements: EDX.2, EDX.3, ECX.1, …
  Known elements: 1
  Binary features (edges): cond-NE-EDX-ECX between EDX.3 and ECX.1, dep-EDX-EDX between EDX.2 and EDX.3
  Factors: connect the known element 1 with EDX.2 and EDX.3

  cond-NE-EDX-ECX
    EDX.3  ECX.1  weight
    i      n      0.5
    p      s      0.3
    a      b      0.1

  dep-EDX-EDX
    EDX.2  EDX.3  weight
    i      i      0.8
    i      j      0.6
    p      p      0.3

  Factor
    Known  EDX.2  EDX.3  weight
    1      i      i      0.8
    1      j      i      0.6
    1      p      p      0.3
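One way to read the weight tables above is as a scoring function over name assignments: each table contributes the weight of the row that matches the assignment, and the prediction is the assignment with the highest total (MAP inference). The sketch below makes two assumptions that the slides do not state: the score is a plain sum of matching weights, and combinations missing from a table contribute 0. It also uses exhaustive enumeration, which only works for toy graphs and is a stand-in for the inference procedure of the real system. The function names are hypothetical; the weights are copied from the tables.

```python
from itertools import product

# Weights copied from the slide. Missing combinations score 0 (assumption).
COND_NE_EDX3_ECX1 = {("i", "n"): 0.5, ("p", "s"): 0.3, ("a", "b"): 0.1}
DEP_EDX2_EDX3 = {("i", "i"): 0.8, ("i", "j"): 0.6, ("p", "p"): 0.3}
# Factor with the known element 1; keyed by the (EDX.2, EDX.3) names only.
FACTOR_1_EDX2_EDX3 = {("i", "i"): 0.8, ("j", "i"): 0.6, ("p", "p"): 0.3}

# Candidate names: every name that appears in any table.
CANDIDATES = sorted(
    {
        name
        for table in (COND_NE_EDX3_ECX1, DEP_EDX2_EDX3, FACTOR_1_EDX2_EDX3)
        for names in table
        for name in names
    }
)


def score(edx2, edx3, ecx1):
    """Total weight of all table rows that match the assignment."""
    return (
        COND_NE_EDX3_ECX1.get((edx3, ecx1), 0.0)
        + DEP_EDX2_EDX3.get((edx2, edx3), 0.0)
        + FACTOR_1_EDX2_EDX3.get((edx2, edx3), 0.0)
    )


def map_assignment():
    """Brute-force search for the highest-scoring assignment."""
    best = max(product(CANDIDATES, repeat=3), key=lambda names: score(*names))
    return dict(zip(("EDX.2", "EDX.3", "ECX.1"), best)), score(*best)


assignment, best_score = map_assignment()
```

Under these assumptions the search selects i for both EDX.2 and EDX.3 and n for ECX.1, since that assignment matches the top row of all three tables.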
Next: How are the features and their weights learned?
Learning how to predict names and types

Pipeline: binaries with debug symbols -> static analysis -> dependency graphs -> train model -> binary features and factors with weights

  >8,000 binaries with debug symbols
  Actual dependency graphs have >1K nodes
  Inputs to training: feature templates (binary features) and factor templates (3-factor, 4-factor)
  Training: find weights that maximize the probability of the correct names and types for all training samples

  Binary features
    name1  name2  weight
    i      n      0.4
    p      s      0.5
    a      b      0.2
    i      i      0.3
    i      j      0.6
    p      p      0.4

  3-factor
    1  i  i  weight 0.4
    1  j  i  weight 0.2
    1  p  p  weight 0.1

  4-factor
    1  i  i  k  weight 0.3
    1  j  i  a  weight 0.5
    1  p  p  v  weight 0.2
End-to-end recovery of debug information
Recovering debug information

  Stripped binary:
    <sum> start:
      mov 4(%esp), %ecx
      mov $0, %eax
      mov $1, %edx
      add %edx, %eax
      add $1, %edx.2
      cmp %ecx.1, %edx.3
      jne 8048400
      repz ret
    <sum> end

Step 1, variable recovery: registers and memory offsets are split into semantic variables and temporaries.
  Semantic variables: EDX.2, EDX.3, ECX.1
  Temporary: EDX.1
  Known elements: 0, 1, mov

Step 2, MAP inference: the semantic variables become the unknown elements of the graph (edges cond-NE-EDX-ECX and dep-EDX-EDX), scored with the learned weights.

    EDX.3  ECX.1  weight
    i      n      0.5
    p      s      0.3
    a      b      0.1

    EDX.2  EDX.3  weight
    p      p      0.4
    i      i      0.3
    i      j      0.2

    Known  EDX.2  EDX.3  weight
    1      i      i      0.8
    1      j      i      0.6
    1      p      p      0.3

Step 3, debug information: the predicted names and types are attached to their locations.

    Name  Type  Loc
    sum   int
    n     uint  ECX.1
    i     uint  EDX.2, EDX.3
    res   int
DeBIN implementation
Static analysis: BAP
  https://github.com/BinaryAnalysisPlatform/bap/
Learning and inference:
  http://scikit-learn.org
  http://nice2predict.org
https://debin.ai
830 Linux packages
x86, x64, ARM
DeBIN evaluation
1. How accurate is DeBIN’s variable recovery?
2. How accurate is DeBIN’s name and type prediction?
3. Is DeBIN useful for malware inspection?
Variable recovery accuracy

Every register and memory offset is either sem or tmp. TP, FP, TN and FN are counted relative to the set predicted as semantic registers and memory offsets.

  $\text{Accuracy} = \frac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|}$

Results
    Arch  Accuracy
    x86   87.1%
    x64   88.9%
    ARM   90.6%

DeBIN recovers variables with nearly 90% accuracy
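As a quick illustration of the accuracy metric, the sketch below computes it from the four confusion counts using the standard definition (correct decisions over all decisions). The function name and the counts in the example are hypothetical and are not DeBIN results.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of sem/tmp decisions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified location is required")
    return (tp + tn) / total


# Hypothetical counts: 60 locations correctly predicted as sem,
# 30 correctly predicted as tmp, 6 false sem, 4 missed sem.
example = accuracy(tp=60, tn=30, fp=6, fn=4)
```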
Name and type prediction accuracy

  Total names and types (P)
  Predicted names and types (PN)
  Correct predictions (CP): correctly predicted names and types

  $\text{Precision} = \frac{|CP|}{|PN|}$

  $\text{Recall} = \frac{|CP|}{|P|}$

  $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
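The three metrics can be computed directly from the set sizes defined on this slide. The sketch below is illustrative only: the function name is hypothetical, the counts in the example are made up, and returning 0 for F1 when there are no correct predictions is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(cp, pn, p):
    """Compute precision, recall and F1 from |CP|, |PN| and |P|."""
    if pn <= 0 or p <= 0:
        raise ValueError("|PN| and |P| must be positive")
    precision = cp / pn
    recall = cp / p
    if cp == 0:
        # Convention (assumption): F1 is 0 when nothing is predicted correctly.
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 100 names and types in total, 80 predicted, 50 correct.
precision, recall, f1 = precision_recall_f1(cp=50, pn=80, p=100)
```

When precision and recall are equal, F1 equals both of them.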
Evaluation of name and type prediction

    Arch  Target   Precision  Recall  F1
    x86   Name     62.6       62.5    62.5
    x86   Type     63.7       63.7    63.7
    x86   Overall  63.1       63.1    63.1
    x64   Name     63.5       63.1    63.3
    x64   Type     74.1       73.4    73.8
    x64   Overall  68.8       68.3    68.6
    ARM   Name     61.6       61.3    61.5
    ARM   Type     66.8       68.0    67.4
    ARM   Overall  64.2       64.7    64.5

Consistent precision/recall of roughly 65%