o-glassesX: Compiler Provenance Recovery with Attention Mechanism from a Short Code Fragment
Yuhei Otsubo ∗†, Akira Otsuka †, Mamoru Mimura ‡†, Takeshi Sakaki §, and Hiroshi Ukegawa ∗
∗ National Police Agency, Tokyo, Japan
† Institute of Information Security, Kanagawa, Japan
‡ National Defense Academy, Kanagawa, Japan
§ The University of Tokyo, Tokyo, Japan
Introduction
Forensic Scientists
(Illustration designed by macrovector / Freepik, http://www.freepik.com)
Computer Forensics
[Figure: attackers, a C2 server, and the victim PC.]
Find digital evidence: deleted malicious files, documents, memory, email.
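Author Identification / Compiler Provenance
[Figure: a malicious EXE (MZ, "This Program is …", abc123, JFIF) annotated with the clues it contains.]
Clues recoverable from a malicious EXE: specific strings, secret key, static link libraries, resources, compile time, language, API name.
Compiler provenance: compiler family, version, optimization level.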
Multiple Compiler Binary
[Figure: Make builds A.exe from a.cpp, b.cpp, r.rc, s.txt, and t.jpeg.]
What is the truth label of A.exe?
Multiple Compiler Binary
[Figure: a.cpp → A Compiler → a.obj; b.cpp → B Compiler → b.obj; x.cpp → X Compiler (?) → x.lib; r.rc → r.res; s.txt; t.jpeg; C Linker → A.exe.]
What is the truth label of A.exe?
- A Compiler?
- B Compiler?
- X Compiler?
- C Linker?
Easy to make the ground truth
What is the truth label of a.obj?
- A Compiler
What is the truth label of b.obj?
- B Compiler
What is the truth label of x.lib?
- Hmm... I think VC, because MS provides it!
Fragmented Files
Collect as much of the attacker's trace as possible, even from fragmented files.
[Figure: forensics on fragments of A.exe found in deleted files, recovered files, and memory.]
Preliminaries
o-glasses
x86 code (e.g., shellcode) detector [arXiv:1806.05328]
Input: 16 x86 instructions
Output: program or not
F1: 0.9995
[Figure: Binary → Convert → 128-bit-length x86 instructions → stack of CNN, BN, and FFN layers → Softmax.]
o-glasses (continued)
・ Applying to compiler identification
・ Black Box Problem
Attention Is All You Need [Łukasz Kaiser et al., NIPS, 2017]
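Basics of Attention
[Figure: attention data flow with tensor shapes.]
Query [q_length, depth] is derived from Input [q_length, depth].
Key [m_length, depth] and Value [m_length, depth] are derived from Memory [m_length, depth].
Att. W [q_length, m_length] = Softmax(matmul(Query, Key^T))
Output [q_length, depth] = matmul(Att. W, Value)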
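The data flow on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `softmax` and `attention` and the toy sizes are assumptions, and the scores are left unscaled because the slide shows only matmul followed by Softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # query: [q_length, depth], key/value: [m_length, depth]
    att_w = softmax(query @ key.T, axis=-1)   # [q_length, m_length]
    output = att_w @ value                    # [q_length, depth]
    return output, att_w

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
q_length, m_length, depth = 3, 5, 8
query = rng.normal(size=(q_length, depth))
key = rng.normal(size=(m_length, depth))
value = rng.normal(size=(m_length, depth))

output, att_w = attention(query, key, value)
print(output.shape, att_w.shape)
```

Each row of `att_w` sums to 1, so every output row is a weighted average of the rows of `value`.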
Basics of Attention: dictionary analogy
query = 'key2'
memory = {'key1':'value1', 'key2':'value2', 'key3':'value3', 'key4':'value4'}
memory[query] = 'value2'
(Shown alongside the attention data flow: Query, Key, Value, Att. W, Output.)
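Dot-Product Attention vs. Dictionary Object
Dictionary object:
query = 'key2'
memory = {'key1':'value1', 'key2':'value2', 'key3':'value3', 'key4':'value4'}
memory[query] = 'value2'
Dot-product attention:
Att. W = softmax(q · k) over every key k in memory (Key: [m_length, depth]).
The output is the sum of the values v weighted by Att. W (Value: [m_length, depth]).
The keys and values of the dictionary correspond to Key and Value, both derived from memory [m_length, depth].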
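The comparison can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: keys and values are encoded as one-hot vectors, and `sharpness` is a hypothetical scale factor that makes the softmax nearly one-hot so that the soft lookup agrees with the hard dictionary lookup.

```python
import numpy as np

memory = {'key1': 'value1', 'key2': 'value2',
          'key3': 'value3', 'key4': 'value4'}
names = list(memory)               # key order: key1..key4
value_names = list(memory.values())

# Assumption: one-hot vectors stand in for learned key/value embeddings.
keys = np.eye(len(names))          # [m_length, depth]
values = np.eye(len(names))        # [m_length, depth]

# Hypothetical scale factor; larger means a sharper (more dict-like) lookup.
sharpness = 10.0
query = sharpness * keys[names.index('key2')]   # [depth]

scores = keys @ query                           # q . k for every key
att_w = np.exp(scores - scores.max())
att_w /= att_w.sum()                            # softmax over the keys
output = att_w @ values                         # weighted sum of values

soft_result = value_names[int(output.argmax())]
hard_result = memory['key2']
print(soft_result, hard_result)
```

Unlike the dictionary, attention returns a blend of all values; it behaves like an exact lookup only when one attention weight dominates.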
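Self-Attention
[Figure: same data flow as basic attention, but Memory is replaced by Input, so Query, Key, and Value are all derived from Input.]
Query [q_length, depth], Key [m_length, depth], and Value [m_length, depth] are derived from Input.
Att. W [q_length, m_length] = Softmax(matmul(Query, Key^T))
Output [q_length, depth] = matmul(Att. W, Value)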
Positional Encoding (PE)
PE (Positional Encoding) adds information about the word position to the input word vectors for learning the context of words.
$$Y(X) = X + \alpha \, PE$$
$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$
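A minimal NumPy sketch of the sinusoidal encoding follows. The function names, the default value of `alpha`, and the toy input size (16 positions, depth 96) are assumptions for illustration; the slides do not say how α is chosen, and the sketch assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(length, d_model):
    # pos: [length, 1], i: [1, d_model // 2]; d_model is assumed to be even.
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)   # odd dimensions 2i+1
    return pe

def add_positional_encoding(x, alpha=1.0):
    # Y(X) = X + alpha * PE; alpha is treated as a plain scalar here.
    length, d_model = x.shape
    return x + alpha * positional_encoding(length, d_model)

pe = positional_encoding(16, 96)
y = add_positional_encoding(np.zeros((16, 96)), alpha=0.5)
print(pe.shape, y.shape)
```

Because the encoding depends only on the position and the dimension index, it can be precomputed once and reused for every input of the same shape.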
Proposed Method
o-glassesX
[Figure: o-glassesX architecture.]
Same as o-glasses: Binary → Convert → 128-bit-length x86 instructions → CNN.
Attention: BN and PE produce Att. Input; Query, Key, and Value are each produced by CNN layers; matmul → Softmax → Att. W → matmul → Att. Output.
Output: CNN and FFN layers → Softmax.
Preprocessing details (same as o-glasses)
Binary:
60 B9 67 01 00 00 EB 0F
Convert to x86 instructions:
60              PUSHA
B9 67 01 00 00  MOV ECX,0x167
EB 0F           JMP loc_17
Pad each instruction to 128-bit length (16 bytes):
60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
B9 67 01 00 00 00 00 00 00 00 00 00 00 00 00 00
EB 0F 00 00 00 00 00 00 00 00 00 00 00 00 00 00
128 bits per instruction:
0000011000000000 … … … … … 00000000
1001110111100110 … … … … … 00000000
1101011111110000 … … … … … 00000000
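The padding and bit-expansion steps can be sketched in plain Python. This is an illustration under assumptions: the instruction boundaries are taken as given (a disassembler is needed to find them and is out of scope here), the helper name `instruction_to_bits` is hypothetical, and the least-significant-bit-first order within each byte is inferred from the bit strings on the slide (0x60 appears as 00000110).

```python
INSTRUCTION_BYTES = 16  # 128 bits; an x86 instruction is at most 15 bytes long

def instruction_to_bits(instr: bytes) -> str:
    """Zero-pad one instruction to 16 bytes and expand it to 128 bits."""
    if len(instr) > INSTRUCTION_BYTES:
        raise ValueError("instruction longer than 16 bytes")
    padded = instr.ljust(INSTRUCTION_BYTES, b"\x00")
    # Each byte is written least-significant bit first (inferred from the slide).
    return "".join(format(b, "08b")[::-1] for b in padded)

# Instruction boundaries as shown on the slide.
instructions = [
    bytes.fromhex("60"),              # PUSHA
    bytes.fromhex("B9 67 01 00 00"),  # MOV ECX,0x167
    bytes.fromhex("EB 0F"),           # JMP loc_17
]

rows = [instruction_to_bits(i) for i in instructions]
for row in rows:
    print(row[:16], "...", row[-8:])
```

The first 16 bits of each row match the bit strings on the slide, which is what suggests the per-byte bit order used here.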
The 1st CNN Layer (same as o-glasses)
Each unit in a CNN has spatially local connections to the input units, called a kernel. Every kernel shares its weight parameters with the others in the same layer.
Each kernel covers a single instruction by adjusting the hyperparameters:
Kernel size = 128
Stride size = 128
Depth = 96
[Figure: each 128-bit-length instruction is mapped to one instruction vector.]
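With kernel size equal to stride, the kernels never overlap, so the layer applies the same weights to each instruction independently. The NumPy sketch below illustrates that equivalence; the helper `conv1d`, the random weights, and the count of 16 instructions are assumptions for illustration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instructions, kernel_size, stride, depth = 16, 128, 128, 96

# Flattened bit sequence: 16 instructions x 128 bits each.
bits = rng.integers(0, 2, size=n_instructions * kernel_size).astype(np.float64)
weights = rng.normal(size=(kernel_size, depth))  # shared by every kernel position
bias = np.zeros(depth)

def conv1d(signal, weights, bias, stride):
    # Plain 1-D convolution without padding; one output row per kernel position.
    k = weights.shape[0]
    starts = range(0, len(signal) - k + 1, stride)
    return np.stack([signal[s:s + k] @ weights + bias for s in starts])

conv_out = conv1d(bits, weights, bias, stride)                 # [16, 96]
# Equivalent view: one shared dense layer applied to each instruction.
dense_out = bits.reshape(n_instructions, kernel_size) @ weights + bias

print(conv_out.shape, np.allclose(conv_out, dense_out))
```

Each output row is the instruction vector for one instruction, which is why no kernel mixes bits from neighbouring instructions at this layer.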
Evaluation
Training Dataset
Collecting source code files from GitHub
Compiling with various compilers and options
Total: 19 labels
Compiler: 4 families (Visual C++, GCC, Clang and Intel C++ Compiler)
Opt. level: 2 types (maximum or not)
CPU arch.: 2 types (x86 or x86-64)

Compiler  Arch.    Label               #Binaries       #Code
VC2017    x86      VC17,32,none(Od)        1,170     369,605
VC2017    x86      VC17,32,max(Ox)         1,147     255,143
VC2017    x86-64   VC17,64,none(Od)        1,456     540,568
VC2017    x86-64   VC17,64,max(Ox)         1,242     542,020
VC2003    x86      VC03,32,none(Od)        1,350     292,277
VC2003    x86      VC03,32,max(Ox)         1,306     270,743
VC2003    x86-64   -                           -           -
GCC       x86      GCC,32,none(O0)         2,111     227,004
GCC       x86      GCC,32,max(O3)          1,844     239,821
GCC       x86-64   GCC,64,none(O0)         1,582     283,276
GCC       x86-64   GCC,64,max(O3)          1,580     287,775
Clang     x86      Clang,32,none(O0)       1,205     101,024
Clang     x86      Clang,32,max(O3)        1,196      86,521
Clang     x86-64   Clang,64,none(O0)       1,892     332,278
Clang     x86-64   Clang,64,max(O3)        1,883     246,500
ICC       x86      ICC,32,none(Od)         1,761   1,494,677
ICC       x86      ICC,32,max(Ox)          1,724   1,161,499
ICC       x86-64   ICC,64,none(Od)         1,796   1,419,705
ICC       x86-64   ICC,64,max(Ox)          1,728   1,046,958
-         -        Others                    101     912,855
Total                                     28,074  10,110,249
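The count of 19 labels follows from the table: every compiler has two architectures and two optimization levels, except VC2003, which has no x86-64 entries, plus one "Others" label. The short sketch below enumerates them; the dictionary name `configs` and the simplified label strings (without the flag names in parentheses) are assumptions for illustration.

```python
# Architectures available per compiler, as listed in the table.
configs = {
    "VC17": ["32", "64"],
    "VC03": ["32"],        # no x86-64 entries for VC2003
    "GCC": ["32", "64"],
    "Clang": ["32", "64"],
    "ICC": ["32", "64"],
}
opt_levels = ["none", "max"]

labels = [
    f"{compiler},{arch},{opt}"
    for compiler, archs in configs.items()
    for arch in archs
    for opt in opt_levels
]
labels.append("Others")

print(len(labels))
```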