Improving Malware Classification: Bridging the Static/Dynamic Gap

Authors: Blake Anderson, Curtis Storlie, Terran Lane. Vinit Singh, 18th April 2017, CISC850 Cyber Analytics.


  1. Improving Malware Classification: Bridging the Static/Dynamic Gap. Authors: Blake Anderson, Curtis Storlie, Terran Lane. Vinit Singh, 18th April 2017. CISC850 Cyber Analytics

  2. INTRODUCTION
     • Why is machine learning needed in malware detection?
     • Why different types of data sources are needed, and how to combine them.
     • A unified framework: a support vector machine trained with multiple kernel learning.

  3. DATA SOURCES
     • STATIC SOURCES: binary, disassembled binary, control flow graph (CFG)
     • DYNAMIC SOURCES: dynamic instruction traces (DIT), dynamic system call traces (DST)
     • MISCELLANEOUS FILE INFORMATION: entropy, packers, instructions in the file, vertices and edges in the CFG
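The slide lists entropy among the miscellaneous file features without saying how it is computed. A minimal sketch, assuming Shannon entropy over the file's byte values (the function name `byte_entropy` is hypothetical):

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).

    Assumption: entropy is taken over single-byte frequencies; the slides
    do not specify the exact definition used.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy is often used as a hint that a file has been packed.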

  4. METHOD STEP 1: DATA REPRESENTATION
     • Markov chain representation for the raw binary, disassembled binary, DIT and DST
     • Standard representation for the control flow graph
     • The miscellaneous file information is represented as a simple feature vector of length seven
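The Markov chain representation can be illustrated as follows. This is only a sketch under assumptions: the sequence is a list of symbols (for example instruction mnemonics or system call names), the state set is given up front, and `transition_matrix` is a hypothetical helper name, not something taken from the slides.

```python
def transition_matrix(sequence, states):
    """Estimate Markov chain transition probabilities from one sequence.

    Row i, column j holds the fraction of times state i was immediately
    followed by state j. Rows for states that never occur stay all zero.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    # Count each consecutive pair (a, b) as one observed transition a -> b.
    for a, b in zip(sequence, sequence[1:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy instruction trace (made up for illustration).
trace = ["mov", "add", "mov", "jmp", "mov", "add"]
P = transition_matrix(trace, ["mov", "add", "jmp"])
```

Each program then becomes a fixed-size matrix regardless of trace length, which is what makes the traces comparable with a kernel.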

  5. STEP 2: KERNELS
     • The kernel trick
     • Exponential kernel, where x_i is the feature vector of the file information or the transition probabilities of the Markov chain
     • Graphlet kernel, where:
       G: the graph
       k: the number of nodes in each subgraph (graphlet)
       D_G: normalized probability vector, D_G = f_G / (number of all graphlets of size k in G)
       f_G: feature vector counting the number of times each unique subgraph of size k occurs in G
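The kernel formulas themselves are not present in the slide text, so the sketch below rests on two labeled assumptions: the exponential kernel is written in the common squared-exponential form with hypothetical parameters `sigma` and `lam`, and the graphlet kernel is taken to be the inner product of the normalized vectors D_G described above. Graphlet counting itself is omitted; the count vectors are assumed to be given.

```python
import math


def exponential_kernel(x, y, sigma=1.0, lam=1.0):
    """Assumed form: sigma^2 * exp(-||x - y||^2 / (2 * lam^2)).

    x and y are equal-length feature vectors, e.g. the file information
    vector or a flattened Markov chain transition matrix.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma ** 2 * math.exp(-sq_dist / (2 * lam ** 2))


def normalize(counts):
    """Turn graphlet counts f_G into the probability vector D_G."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def graphlet_kernel(f_g, f_h):
    """Assumed form: dot product of the normalized vectors D_G and D_H."""
    d_g, d_h = normalize(f_g), normalize(f_h)
    return sum(a * b for a, b in zip(d_g, d_h))
```

For example, `graphlet_kernel([2, 2], [1, 3])` compares a graph whose two graphlet types are equally frequent with one where the second type dominates.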

  6. Heatmaps for Individual Kernels

  7. STEP 3: MULTIPLE KERNEL LEARNING
     • Classical kernel learning: an optimization problem, subject to a constraint, which yields the decision function.
     • For multiple kernel learning we additionally need to estimate the kernel weights β_k.
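The formulas for this slide are not present in the text. As a hedged reconstruction, the textbook soft-margin SVM dual and the usual multiple kernel learning combination are shown here; this is the standard form and an assumption about what the slide displayed, with C the regularization constant, α_i the dual variables, y_i ∈ {−1, +1} the labels and b the bias.

```latex
% Classical kernel learning (SVM dual)
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)

% Subject to
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C

% Decision function
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b \right)

% Multiple kernel learning: K is a weighted sum of M base kernels
K(x_i, x_j) = \sum_{k=1}^{M} \beta_k \, K_k(x_i, x_j),
\qquad \beta_k \ge 0, \quad \sum_{k=1}^{M} \beta_k = 1
```

Under this form, each base kernel K_k corresponds to one data source, and the learned β_k indicate how much that source contributes to the combined kernel.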

  8. Heatmap of Combined Kernel

  9. RESULTS • Criterion 1: Accuracy, calculated using 10-fold cross-validation.
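A minimal sketch of how 10-fold cross-validation accuracy can be computed. The names `k_fold_indices`, `cross_val_accuracy` and the `fit_predict` callback are hypothetical, and the folds here are contiguous; in practice the data would normally be shuffled or stratified first, which the slides do not describe.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_accuracy(labels, fit_predict, k=10):
    """Fraction of samples predicted correctly across all k held-out folds.

    fit_predict(train_idx, test_idx) is an assumed callback that trains on
    the training indices and returns one predicted label per test index.
    """
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        predictions = fit_predict(train, test)
        correct += sum(p == labels[i] for p, i in zip(predictions, test))
    return correct / len(labels)
```

With a kernel method, `fit_predict` would slice the precomputed kernel matrix by the train and test indices rather than recompute it for each fold.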

  10. • Criterion 2: ROC curves / AUC values
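For reference, AUC can be computed directly from classifier scores. The sketch below uses the pairwise-ranking interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative); the function name `roc_auc` and the 0/1 label convention are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for malware (positive), 0 for benign (negative).
    scores: higher means more likely positive. Ties count as half a win.
    Runs in O(P * N) time, which is fine for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every malicious sample is scored above every benign one, while 0.5 corresponds to random ranking.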

  11. • Criterion 3: Speed to classify new instances

  12. • Criterion 4: Testing on a large malware sample. Accuracy on a validation set consisting of 20k samples

  13. OBSERVATIONS
     • A total of 19 false positives and false negatives were found among the 1556 instances of the original dataset.
     • Static analysis alone does not work well when the training instances have been packed.

  14. LIMITATIONS AND DRAWBACKS
     • Selecting an appropriate value of n for n-gram analysis
     • Collecting dynamic system traces is too time-consuming and resource-intensive on a normal system
     • Choosing optimal instruction call categories
     • Intel Pin is not transparent to the program it traces while collecting instructions

  15. RELATED WORK
     • Use of single data sources
     • Use of static data sources combined with ensemble learning
     • Result fusion model
     • Identifying packed and hidden code

  16. CONCLUSION
     • Not restricting malware classification to a single data source improves classification accuracy.
     • In a resource-constrained environment, combined static analysis can still give high accuracy and a low number of false positives.
     • Static analysis is not an optimal solution when instances have been packed or have high entropy.
