Improving Malware Classification: Bridging the Static/Dynamic Gap
Authors: Blake Anderson, Curtis Storlie, Terran Lane
Vinit Singh
18th April 2017
CISC850 Cyber Analytics
INTRODUCTION
• Why is machine learning needed for malware detection?
• Why different types of data sources are needed, and how to combine them.
• A unified framework: a support vector machine trained with multiple kernel learning.
DATA SOURCES
• STATIC SOURCES: Binary, Disassembled Binary, Control Flow Graph (CFG)
• DYNAMIC SOURCES: Dynamic Instruction Traces (DIT), Dynamic System Call Traces (DST)
• MISCELLANEOUS FILE INFORMATION: Entropy, Packers, Number of instructions in the file, Number of vertices and edges in the CFG
METHOD
STEP 1: DATA REPRESENTATION
• Markov chain representation for the raw binary, disassembled binary, DIT, and DST
• Standard representation for the Control Flow Graph
• The miscellaneous file information is represented as a simple feature vector of length seven
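The Markov chain representation in Step 1 can be sketched roughly as follows. This is a minimal illustration, assuming each data source has already been reduced to a sequence of discrete symbols (for example, instruction mnemonics). The function name `transition_matrix` and the toy trace are hypothetical, and row-normalized transition counts are the usual way to estimate such a chain rather than a confirmed detail of the authors' pipeline.

```python
def transition_matrix(sequence):
    """Estimate first-order Markov transition probabilities from a symbol sequence.

    Returns (states, matrix), where matrix[i][j] is the estimated probability
    of moving from states[i] to states[j].
    """
    states = sorted(set(sequence))
    index = {s: i for i, s in enumerate(states)}
    n = len(states)

    # Count every observed transition between consecutive symbols.
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(sequence, sequence[1:]):
        counts[index[a]][index[b]] += 1

    # Normalize each row so it sums to 1 (rows with no outgoing transitions stay all zero).
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return states, matrix


# Hypothetical toy trace; real traces would come from a disassembler or tracer.
trace = ["mov", "push", "call", "mov", "push", "mov", "call"]
states, P = transition_matrix(trace)
```

The flattened entries of `P` are the kind of transition-probability features that a kernel can then compare between two files.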
STEP 2: KERNELS
• The Kernel Trick
• Exponential Kernel:
  x_i: features of the file information, or transition probabilities of the Markov chain
• Graphlet Kernel:
  G: graph
  k: number of nodes in each subgraph (graphlet)
  f_G: feature vector counting the number of times each unique subgraph of size k occurs in G
  D_G: normalized probability vector, D_G = f_G / (number of all graphlets of size k in G)
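The two kernels can be illustrated with a short sketch. The slide's formula for the exponential kernel did not survive, so the code assumes the common Gaussian form sigma^2 * exp(-||x - x'||^2 / (2 * lam^2)); the parameters `sigma` and `lam` are hypothetical names. For the graphlet kernel, the code assumes the usual definition as the inner product of the normalized vectors D_G and D_G'. Counting the graphlets themselves is out of scope here, so the count vectors are toy inputs.

```python
import math


def gaussian_kernel(x, y, sigma=1.0, lam=1.0):
    """Assumed form: sigma^2 * exp(-||x - y||^2 / (2 * lam^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma ** 2 * math.exp(-sq_dist / (2.0 * lam ** 2))


def normalize_counts(f):
    """D_G = f_G / (total number of graphlets of size k counted in G)."""
    total = sum(f)
    return [c / total for c in f] if total else [0.0] * len(f)


def graphlet_kernel(f_g, f_h):
    """Inner product of the two normalized graphlet-count vectors."""
    d_g, d_h = normalize_counts(f_g), normalize_counts(f_h)
    return sum(a * b for a, b in zip(d_g, d_h))


# Hypothetical graphlet counts for two control flow graphs.
f_a = [2, 1, 1]
f_b = [1, 1, 2]
k_ab = graphlet_kernel(f_a, f_b)
```

Normalizing by the total graphlet count keeps large and small graphs comparable, since the kernel then compares graphlet distributions rather than raw counts.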
Heatmaps for Individual Kernels
STEP 3: MULTIPLE KERNEL LEARNING
• Optimization problem for classical kernel learning:
  Subject to the constraint:
  Thus the decision function is:
• For multiple kernel learning, we additionally need to estimate the kernel weights β_k
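How the weights β_k enter the classifier can be sketched as follows. The code assumes the conventional multiple kernel learning setup, in which the combined kernel is a non-negative weighted sum K = sum_k β_k K_k and the decision function is sign(sum_i α_i y_i K(x_i, x) + b). It also assumes that α_i, b, and β_k have already been estimated by some solver; all numeric values below are toy placeholders, and the function names are hypothetical.

```python
def combine_kernels(kernel_matrices, betas):
    """Weighted sum of same-shaped kernel matrices: K = sum_k beta_k * K_k."""
    if len(kernel_matrices) != len(betas):
        raise ValueError("need one weight per kernel")
    if any(b < 0 for b in betas):
        raise ValueError("kernel weights must be non-negative")
    rows, cols = len(kernel_matrices[0]), len(kernel_matrices[0][0])
    return [
        [sum(b * K[i][j] for K, b in zip(kernel_matrices, betas)) for j in range(cols)]
        for i in range(rows)
    ]


def decision(k_column, alphas, labels, bias):
    """SVM decision for one instance, returned as +1 or -1.

    k_column[i] is the combined kernel value between training point i
    and the instance being classified.
    """
    score = sum(a * y * k for a, y, k in zip(alphas, labels, k_column)) + bias
    return 1 if score >= 0 else -1


# Toy placeholders: one static and one dynamic kernel over two training files.
K_static = [[1.0, 0.2], [0.2, 1.0]]
K_dynamic = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K_static, K_dynamic], betas=[0.25, 0.75])
```

A larger β_k means the corresponding data source contributes more to the similarity between two files.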
Heatmap of Combined Kernel
RESULTS
• Criterion 1: Accuracy, calculated using 10-fold cross-validation.
• Criterion 2: ROC curves / AUC values
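As a reminder of what an AUC value measures, here is a minimal sketch. It uses the rank-based interpretation: the AUC equals the probability that a randomly chosen positive (malware) instance receives a higher score than a randomly chosen negative (benign) one, with ties counted as one half. The function name `auc` and the toy labels and scores are hypothetical and are not results from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: 1 = malware, 0 = benign; scores are classifier outputs.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
value = auc(labels, scores)
```

The pairwise loop is quadratic in the number of instances, which is fine for illustration; a sort-based computation is preferable for large validation sets.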
• Criterion 3: Speed of classifying new instances
• Criterion 4: Testing on a large malware sample (accuracy on a validation set of 20k samples)
OBSERVATIONS
• Out of the 1,556 instances in the original dataset, a total of 19 were misclassified (false positives and false negatives combined).
• Static analysis alone does not work well when the training instances have been packed.
LIMITATIONS AND DRAWBACKS
• Selecting an appropriate value of n for n-gram analysis
• Collecting dynamic system traces is too time- and resource-intensive on a normal system
• Choosing optimal instruction call categories
• Intel Pin is not transparent to the program being traced while it collects instructions
RELATED WORK
• Use of single data sources
• Use of static data sources combined with ensemble learning
• Result Fusion Model
• Identifying packed and hidden code
CONCLUSION
• Not restricting malware classification to a single data source improves classification accuracy.
• In a resource-constrained environment, combining the static data sources can still give high accuracy and a low number of false positives.
• Static analysis is not an optimal solution when instances have been packed or have high entropy.