Similarity Metric Method for Binary Basic Blocks of Cross-Instruction Set Architecture
Xiaochuan Zhang (zhangxiaochuan@outlook.com)
Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Beijing, China
Content
• 01 Background
• 02 Methodology & Implementation
• 03 Experiment & Result
Background • Binary program similarity metrics can be used in malware classification, vulnerability detection, and authorship analysis. The similarity between basic blocks is the basis of these applications.
Background • Two steps of the basic block similarity metric: (1) Basic Block Embedding: each basic block (e.g., an ARM block "sub sp, sp, #72; ldr r7, [r11, #12]; ldr r8, [r11, #8]; ldr r0, .LCPI0_0" or an x86 block "movq %rdx, %r14; movq %rsi, %r15; movq %rdi, %rbx; movabsq $.L0, %rdi") is mapped to an embedding vector such as [0.24, 0.37, …, 0.93]; (2) Similarity Calculation: the two embeddings are compared to produce a similarity score in [0, 1].
Background • Types of basic block embedding methods:
• Manually constructed: each dimension corresponds to a manually selected static feature [1-3]
• Automatically learned: static word representation based methods [4-7]; INNEREYE-BB, an RNN-based method [8]
[1] Qian Feng, et al. Scalable Graph-based Bug Search for Firmware Images. CCS 2016.
[2] Xiaojun Xu, et al. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. CCS 2017.
[3] Gang Zhao, Jeff Huang. DeepSim: Deep Learning Code Functional Similarity. ESEC/FSE 2018.
[4] Yujia Li, et al. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. ICML 2019.
[5] Luca Massarelli, et al. SAFE: Self-Attentive Function Embeddings for Binary Similarity. DIMVA 2019.
[6] Uri Alon, et al. code2vec: Learning Distributed Representations of Code. PACMPL 3(POPL) 2019.
[7] Steven H. H. Ding, et al. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. S&P 2019.
[8] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019.
Background • INNEREYE-BB [1]: instruction tokens (e.g., ldr, r0, .LCPI0_115, bl, printf, FUNC, scanf, memcpy) are fed to an RNN with the recurrence h_i = G(t_i, h_{i-1}); the hidden states h_1, …, h_5 computed over the tokens t_1, …, t_5 yield the basic block embedding.
[1] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019.
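A minimal sketch of an RNN block encoder in the spirit of the recurrence above, assuming an LSTM cell and illustrative vocabulary and dimension sizes (none of these choices come from the original INNEREYE-BB configuration):

import torch
import torch.nn as nn

class RNNBlockEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)            # t_i: token embedding
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # plays the role of G(.)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        t = self.tok_emb(token_ids)              # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.rnn(t)                # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                    # final hidden state = block embedding

# Tokens such as "ldr", "r0", "bl", "printf" are first mapped to vocabulary ids
# (rare literals replaced by placeholders like FUNC) before the recurrence runs.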
Methodology & Implementation • Idealized Solution (based on a PERFECT TRANSLATION assumption): an x86 basic block is first translated into an ARM basic block by a neural machine translation model (encoding and decoding); each ARM basic block, original or translated, is then encoded into a token-embedding matrix and aggregated into a basic block embedding.
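A sketch of this idealized pipeline under the perfect-translation assumption; nmt_translate, arm_encoder, and the mean-pooling aggregation are placeholders, not the paper's implementation:

def embed_block(tokens, isa, nmt_translate, arm_encoder):
    # Under the perfect-translation assumption, both ISAs can share one
    # ARM-side encoder: x86 tokens are translated to ARM tokens first.
    if isa == "x86":
        tokens = nmt_translate(tokens)      # x86 instruction tokens -> ARM tokens
    token_matrix = arm_encoder(tokens)      # torch tensor, shape (n_tokens, emb_dim)
    return token_matrix.mean(dim=0)         # aggregate rows into one block embedding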
Methodology & Implementation • Practical Solution
Methodology & Implementation • x86-encoder pre-training
• Data: x86-ARM basic block pairs
• NMT model: Transformer [1]; other NMT models also work
• Optimization goal: minimize the translation loss
[1] Ashish Vaswani, et al. Attention is All you Need. NIPS 2017.
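A minimal pre-training sketch, assuming the x86-encoder is the encoder half of an x86-to-ARM Transformer NMT model trained with a token-level cross-entropy translation loss; vocabulary sizes, model width, and the omission of positional encodings are illustrative simplifications:

import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, x86_vocab=5000, arm_vocab=5000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(x86_vocab, d_model)   # x86 token embeddings
        self.tgt_emb = nn.Embedding(arm_vocab, d_model)   # ARM token embeddings
        # Positional encodings are omitted here for brevity.
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, arm_vocab)

    def forward(self, x86_ids, arm_ids):
        src = self.src_emb(x86_ids)                         # (batch, src_len, d_model)
        tgt = self.tgt_emb(arm_ids[:, :-1])                 # teacher forcing
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(dec)                                # logits over ARM tokens

model = TranslationModel()
criterion = nn.CrossEntropyLoss()                           # translation loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def pretrain_step(x86_ids, arm_ids):
    logits = model(x86_ids, arm_ids)
    loss = criterion(logits.reshape(-1, logits.size(-1)), arm_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After pre-training, the Transformer encoder (plus src_emb) is kept as the
# x86-encoder and the decoder is discarded.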
Methodology & Implementation • ARM-encoder training & x86-encoder fine-tuning
• Data: basic block triplets {anchor, positive, negative}; the anchor and the positive are a semantically equivalent basic block pair
• Optimization goal: minimize the margin-based triplet loss, i.e., the negative should end up at least a margin farther from the anchor than the positive is
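A minimal sketch of the margin-based triplet objective; the margin value and the use of Euclidean distance are assumptions for illustration:

import torch
import torch.nn.functional as F

def triplet_loss(anchor_emb, positive_emb, negative_emb, margin=0.5):
    # Pull the anchor towards the semantically equivalent block and push it
    # away from the negative until the gap exceeds the margin.
    d_pos = F.pairwise_distance(anchor_emb, positive_emb)   # (batch,)
    d_neg = F.pairwise_distance(anchor_emb, negative_emb)   # (batch,)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Equivalent behavior is also available via torch.nn.TripletMarginLoss(margin=0.5).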
Methodology & Implementation • Mixed negative sampling: 33% hard negatives (similar but not equivalent to the anchor) and 67% random negatives.
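A small sketch of the mixing policy; the helper functions are placeholders, and only the one-third versus two-thirds split comes from the slide:

import random

def sample_negative(anchor, sample_hard_negative, sample_random_negative, hard_ratio=1/3):
    # Roughly one third hard negatives, two thirds random negatives.
    if random.random() < hard_ratio:
        return sample_hard_negative(anchor)    # similar but not equivalent to the anchor
    return sample_random_negative()            # drawn uniformly from the corpus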
Methodology & Implementation • Hard negative sampling (case: the anchor is an x86 basic block): the anchor and randomly drawn x86 basic blocks rand_x86_1, …, rand_x86_n (each paired with an ARM basic block such as rand_ARM_t) are embedded with the pre-trained x86-encoder; candidates whose embeddings lie close to the anchor's embedding, yet come from different pairs, are treated as "similar but not equivalent" and supply the hard negatives.
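A mining sketch based on one plausible reading of the figure; whether the selected candidate's x86 block or its paired ARM block is used as the negative is not fully recoverable from the slide, so returning the paired ARM block below is an assumption:

import torch

def mine_hard_negative(anchor_x86_ids, candidates, x86_encoder):
    """candidates: list of (x86_token_ids, paired_arm_block) drawn from other pairs."""
    with torch.no_grad():
        anchor_emb = x86_encoder(anchor_x86_ids)                          # (emb_dim,)
        cand_embs = torch.stack([x86_encoder(x) for x, _ in candidates])  # (n, emb_dim)
        dists = torch.norm(cand_embs - anchor_emb, dim=-1)                # Euclidean
    closest = int(torch.argmin(dists))           # most similar, but not equivalent
    return candidates[closest][1]                # ARM block paired with that candidate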
Methodology & Implementation • Similarity Metric: the similarity score of two basic blocks is derived from the Euclidean distance between their embeddings; the original formula also involves the embedding dimension.
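The exact formula is not recoverable from the slide; the sketch below assumes one common way of turning a dimension-normalized Euclidean distance into a score in [0, 1]:

import torch

def similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    dim = emb_a.numel()                              # embedding dimension
    dist = torch.norm(emb_a - emb_b, p=2)            # Euclidean distance
    # Assumed normalization: distance scaled by sqrt(dim), squashed into (0, 1].
    return float(1.0 / (1.0 + dist / dim ** 0.5))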
Experiment & Result • Setup
• Prototype: MIRROR, https://github.com/zhangxiaochuan/MIRROR
• Dataset: MISA, 1,122,171 semantically equivalent x86-ARM basic block pairs, https://drive.google.com/file/d/1krJbsfu6EsLhF86QAUVxVRQjbkfWx7ZF/view
Experiment & Result • Comparison with baseline (results chart; higher is better)
Experiment & Result • Evaluation of negative sampling methods (results chart; higher is better)
Experiment & Result • Effectiveness of pre-training: is the pre-training phase redundant?
Experiment & Result • Effectiveness of pre-training (results chart; higher is better)
Experiment & Result • Visualization
Thanks! zhangxiaochuan@outlook.com