TypeMiner: Recovering Types in Binary Code using Machine Learning
Alwin Maier, Hugo Gascon, Christian Wressnegger, Konrad Rieck
DIMVA 2019
Institute of System Security, TU Braunschweig
Motivation
- decompilers profit from type information
- manual analysis of usage patterns
- What about automation?
Motivation: decompilation without type information

    ulong idx = 0;
    int *pt1, *pt2;
    if (0 < (int) len)
      do {
        pt1 = *(int **) (pts1 + idx * 8);
        pt2 = *(int **) (pts2 + idx * 8);
        *pt1 = *pt1 + *pt2;
        *(double *) (pt1 + 2) = *(double *) (pt1 + 2) + *(double *) (pt2 + 2);
        idx = idx + 1;
      } while (len != idx);
Motivation: decompilation with type information

    ulong idx = 0;
    struct point *pt1, *pt2;
    if (0 < len)
      do {
        pt1 = pts1[idx];
        pt2 = pts2[idx];
        pt1->x = pt1->x + pt2->x;
        pt1->y = pt2->y + pt1->y;
        idx = idx + 1;
      } while (len != idx);
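The two listings describe the same memory accesses. As a minimal sketch, assuming `struct point` consists of an `int x` followed by a `double y` (the field types are inferred from the casts in the untyped listing and are not stated on the slide), Python's `ctypes` can show why `(pt1 + 2)` on an `int *` ends up at field `y` on common 64-bit ABIs:

```python
import ctypes


class Point(ctypes.Structure):
    # Assumed layout, inferred from the casts in the untyped listing.
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_double)]


def field_offsets(struct_type):
    """Map each field name to its byte offset inside the structure."""
    return {name: getattr(struct_type, name).offset
            for name, _ in struct_type._fields_}


offsets = field_offsets(Point)
# With pt1 typed as int *, the expression (pt1 + 2) advances by two ints.
y_offset_via_int_pointer = 2 * ctypes.sizeof(ctypes.c_int)
print(offsets, y_offset_via_int_pointer)
```

Without the structure definition, a decompiler can only express the second field as pointer arithmetic plus a cast, which is what the first listing shows.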
Manual Rules vs. Machine Learning
Manual rules:
- require human expertise
- rules are processed manually
- require profound knowledge of the respective ISA
Machine learning:
- learns type recovery rules automatically
- rules are processed automatically
- training data can be generated for the respective ISA
TypeMiner Overview: Binary Code Analysis
- data dependence analysis
- data object graph, modeling (indirect) data dependencies
- extraction of data object traces by traversing the data object graph
[Figure: data object graph over the registers rdi, rcx, edx, r8, rsi, and xmm0, with mov, add, movsd, and addsd instructions as nodes]
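The traversal step can be pictured as enumerating bounded paths through a dependency graph. The following is a minimal sketch, not the authors' implementation: the adjacency-dictionary representation, the function name `extract_traces`, the node names, and the length limit are all assumptions made for illustration.

```python
def extract_traces(graph, start, max_len):
    """Enumerate paths of at most max_len nodes that begin at `start`.

    `graph` maps a node id to the ids of the nodes that depend on it.
    """
    traces = []

    def walk(node, path):
        path = path + [node]
        # Skip nodes already on the path so that cycles terminate.
        fresh = [n for n in graph.get(node, []) if n not in path]
        if not fresh or len(path) == max_len:
            traces.append(path)
            return
        for nxt in fresh:
            walk(nxt, path)

    walk(start, [])
    return traces


# Hypothetical toy graph; node names only loosely follow the slide's figure.
toy_graph = {
    "mov_1": ["add_1"],
    "add_1": ["movsd_1", "add_2"],
    "movsd_1": ["addsd_1"],
    "addsd_1": ["movsd_2"],
    "movsd_2": [],
    "add_2": [],
}
toy_traces = extract_traces(toy_graph, "mov_1", max_len=3)
```

Each returned path is one candidate trace for the data object accessed at the start node.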
TypeMiner Overview: Machine Learning
- type recovery of data objects (i.e. variables and parameters)
- classification model based on embedded data object traces
- prediction of data types in multiple classification steps
[Figure: classification hierarchy. Step 1: pointer vs. arithmetic. Step 2a (pointer): array, ptr2struct, ptr2char, ptr2func, other ptr. Step 2b (arithmetic): char, short int, int, long int, float, double, long double, _Bool. Step 3: signed vs. unsigned. The slide also shows a scikit logo.]
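The multi-step prediction can be sketched as a small cascade. This is an illustration under assumptions: the four classifiers are passed in as plain callables, the function name `recover_type` is hypothetical, and applying Step 3 only to the integer types listed in `INTEGER_TYPES` is a guess, since the slide does not say which arithmetic types reach that step.

```python
# Assumption: only these Step 2b results are refined by Step 3.
INTEGER_TYPES = {"char", "short int", "int", "long int"}


def recover_type(vector, step1, step2a, step2b, step3):
    """Run one embedded data object through the classification steps."""
    kind = step1(vector)            # "pointer" or "arithmetic"
    if kind == "pointer":
        return step2a(vector)       # e.g. "ptr2struct", "ptr2char", "array"
    base = step2b(vector)           # e.g. "int", "double", "_Bool"
    if base in INTEGER_TYPES:
        return f"{step3(vector)} {base}"   # "signed" or "unsigned" prefix
    return base


# Stub classifiers standing in for trained models.
predicted = recover_type(
    vector=[0, 1, 2],
    step1=lambda v: "arithmetic",
    step2a=lambda v: "ptr2char",
    step2b=lambda v: "int",
    step3=lambda v: "unsigned",
)
```

Splitting the decision this way lets each step be trained and evaluated separately.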
Trace Extraction and Normalization
Trace extraction:
- represent usage patterns
- start at access locations of data objects
- traverse the data object graph
Trace normalization:
- strip irrelevant information
- normalize each instruction in trace
- consider previous instructions
Example of a normalized trace:

    movsd | loc_w8 | obj(0)_w8
    addsd | obj(0)_w8 | loc_w8
    movsd | loc_w8 | obj(0)_w8
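A minimal sketch of how such normalized strings could be produced is shown below. It rests on an interpretation of the slide's example: `obj(k)_w8` is read as "the tracked data object at offset k, 8 bytes wide" and `loc_w8` as "some other 8-byte location". The operand dictionaries and both function names are hypothetical.

```python
def normalize_operand(kind, width, offset=None):
    """Render one operand as `obj(<offset>)_w<width>` or `loc_w<width>`."""
    if kind == "obj":
        return f"obj({offset})_w{width}"
    return f"loc_w{width}"


def normalize_instruction(mnemonic, operands):
    """Join the mnemonic and its normalized operands with ' | '."""
    parts = [mnemonic] + [normalize_operand(**op) for op in operands]
    return " | ".join(parts)


# Hypothetical pre-parsed operands; concrete registers are already stripped.
OBJ = {"kind": "obj", "width": 8, "offset": 0}
LOC = {"kind": "loc", "width": 8}
raw_trace = [("movsd", [LOC, OBJ]), ("addsd", [OBJ, LOC]), ("movsd", [LOC, OBJ])]
normalized = [normalize_instruction(m, ops) for m, ops in raw_trace]
```

Dropping register names and addresses makes traces from different functions and programs comparable.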
Training Phase
Training data:
- training data is based on real software projects written in C
- traces are extracted for each data object in the compiled program
- debugging information is used to label training data
Training process:
- embedding of traces in a vector space using an n-gram model
- classifiers (SVMs, random forests) are trained for each classification step
- cross-validation, grouped by compiled programs
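The n-gram embedding can be sketched with the standard library alone. Treating each normalized instruction as one token and using n = 2 are assumptions, as are the function names; the slide does not fix these details.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def embed_trace(trace, n=2):
    """Count the instruction n-grams of one normalized trace."""
    return Counter(ngrams(trace, n))


def to_vector(counts, vocabulary):
    """Project sparse n-gram counts onto a fixed vocabulary order."""
    return [counts.get(gram, 0) for gram in vocabulary]


MOVSD = "movsd | loc_w8 | obj(0)_w8"
ADDSD = "addsd | obj(0)_w8 | loc_w8"
counts_a = embed_trace([MOVSD, ADDSD, MOVSD])
counts_b = embed_trace([MOVSD, ADDSD])   # hypothetical second trace
vocabulary = sorted(set(counts_a) | set(counts_b))
vector_a = to_vector(counts_a, vocabulary)
vector_b = to_vector(counts_b, vocabulary)
```

Vectors of this form, labeled with types taken from debugging information, are the kind of input an SVM or random forest can be trained on.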
Type Recovery: Prediction Phase
- extract, normalize, and embed all traces of an unknown data object
- merge all traces into a single vector representation
- run through all classification steps to recover the data type
[Figure: example path through the classifiers. Step 1: arithmetic. Step 2b: int. Step 3: unsigned.]
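The slide does not say how the traces of one data object are merged. One plausible reading, shown here purely as an assumption, is to add up the sparse n-gram counts of all traces; taking the element-wise mean or maximum would be equally conceivable. The function name `merge_traces` and the feature names are hypothetical.

```python
from collections import Counter


def merge_traces(embedded_traces):
    """Sum the sparse feature counts of all traces of one data object."""
    merged = Counter()
    for counts in embedded_traces:
        merged.update(counts)   # Counter.update adds counts per key
    return merged


# Hypothetical embedded traces of a single unknown data object.
trace_vectors = [
    Counter({"movsd>addsd": 1, "addsd>movsd": 1}),
    Counter({"movsd>addsd": 2}),
]
merged_vector = merge_traces(trace_vectors)
```

The merged vector is then what the classification steps receive as input.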
Evaluation
Dataset:
- 14 popular open-source software projects
- optimized release configuration
- compiled for x86-64 architecture
- ground truth data types from debugging information
Experimental setup:
- 13 binary programs for training, one for testing
- evaluation of each classification step
- evaluation of different trace lengths
- comparison with manually created rules
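Training on 13 programs and testing on the remaining one amounts to a split grouped by program. The sketch below assumes that every program takes a turn as the test set, which the slide does not state explicitly; the sample tuples, feature values, and labels are made up for illustration.

```python
def leave_one_program_out(samples):
    """Yield (held_out, train, test) with one program held out per split.

    Each sample is a (program, features, label) tuple.
    """
    programs = sorted({program for program, _, _ in samples})
    for held_out in programs:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Hypothetical samples; only the program names come from the dataset.
samples = [
    ("bash", [1, 0], "int"),
    ("bash", [0, 1], "ptr2char"),
    ("gzip", [1, 1], "int"),
    ("sed", [0, 2], "ptr2struct"),
]
splits = list(leave_one_program_out(samples))
```

Grouping by program keeps data objects from the test binary out of the training set, so the score reflects performance on unseen software.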
Results: Overview
Accuracy:
- total amount of recovered types
- average over all software projects
- each sample contributes equally to the score
Sensitivity:
- amount of correctly classified data objects per type
- average over all software projects
- each type contributes equally to the score
[Figure: bar chart of accuracy and sensitivity (y-axis from 0 to 1) for Step 1, Step 2a, Step 2b, and Step 3]
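The difference between the two scores shows up on imbalanced data. In this sketch, accuracy weights every sample equally, while sensitivity is computed as the per-type recall averaged over the types. The function names and the toy labels are illustrative and are not results from the talk.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted type matches the true type."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """Per-type recall, averaged so that each type counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[t] / totals[t] for t in totals) / len(totals)


# Toy data: every short int is misclassified as int.
y_true = ["int"] * 8 + ["short int"] * 2
y_pred = ["int"] * 10
acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
```

On this toy data the frequent type dominates the accuracy, whereas the sensitivity is pulled down by the rare type that is never recognized.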
Results: Pointer Types
- performance increases with length of traces
- very good performance for pointer to char and pointer to structures
- TypeMiner fails to detect function pointers
- moderate performance for array types and other pointer types
[Figure: F1-score (0 to 1) for array types, ptr2struct, ptr2char, and other ptr; x-axis: max_t |t| from ≥ 1 to ≥ 5]
Results: Arithmetic Types
- performance increases with length of traces
- very good performance for int, long int, and double
- short int incorrectly predicted as int or long int
- good performance for char and _Bool
[Figure: F1-score (0 to 1) for _Bool, char, short int, int, long int, and double; x-axis: max_t |t| from ≥ 1 to ≥ 5]
Other Results
Signed vs. unsigned:
- ≈ 76 % accuracy
Pointer vs. arithmetic types:
- ≈ 92 % accuracy
Omitted types:
- union, enumeration, "void *"
Encountered dilemmas:
- different types, same semantics
- array of type T vs. pointer to type T
- detection of pointer types without being dereferenced
- structured data types
Summary
Recovery of data types using machine learning:
- extraction of traces (characteristic traits) in compiled C code
- automatic identification of data types using machine learning
- recovery of data types in multiple classification steps
Results:
- evaluation with 14 real-world software projects
- evaluation on x86-64 architecture
- correct recovery of data types in 76 % to 93 % of cases
Thanks for your attention. Questions?
Program    # Data Obj.   # Instr.
bash       6496          157 K
bc         422           10 K
bison      2470          58 K
cflow      768           18 K
gawk       3472          98 K
grep       1227          24 K
gtypist    145           5 K
gzip       424           10 K
indent     174           10 K
less       961           20 K
libpng     1968          33 K
nano       1526          34 K
sed        709           15 K
wget       2720          58 K
[Figure: two plots of F1-score (0 to 1) over max_t |t| from ≥ 1 to ≥ 5. Left, "Pointer vs. Arithmetic Types": series pointer and arithmetic. Right, "Signed vs. Unsigned Types": series signed and unsigned.]