Toward Automated Forensic Analysis of Obfuscated Malware
Ryan J. Farley
George Mason University, Department of Computer Science
Committee: Xinyuan Wang, Hakan Aydin, Songqing Chen, Brian Mark
Where Innovation Is Tradition
April 24, 2015
Overview
• The Need for Forensics
• Forensics Problems and Our Contribution
• Background
• Problem Model
• Challenges and Solutions
• Empirical Evaluation
• Conclusion
The Need for Forensics
Why?
• Malware is a serious threat
– Internet of [Insecure] Things
– Stuxnet, Regin
– Christmas holiday tradition
– Compromise is an eventuality
• Forensics seeks to understand the how
– Embrace the ownage
– Collect evidence, Analyze, Extrapolate
• Enables us to build better defenses
[Figure: IRC botnet with a vulnerable system, botmaster, IRC server, and bots. 1: Attacker infects host. 2: Host becomes a bot and joins botnet. 3: Bots log in to IRC server. 4: Botmaster sends commands to bots. 5: Bots send collected data to botmaster.]
Scenario: Vulnerable Web Server
[Figure labels: Memory, Attack, Vulnerable Process, Defense v1.0]
Scenario: Exploit, What Now?
[Figure labels: Memory, Attack, Vulnerable Process, Defense v1.0, Detection Mechanism, Memory Dump, Forensic Evidence]
Scenario: What Now?
• Upon first non-self system call
– Attack code fragments remain in memory
• Packing, self-modification, armoring
– Staged C2
• Can the fragments reveal clues?
– Robust system needed to generically model execution
Scenario: Build Better Defense
[Figure labels: Memory Dump, Forensic Evidence, Forensic Analysis, Defense v2.0]
Forensics Problems and Our Contribution
Problem
• Need to automate forensic response upon detection in memory
– Avoid substantial manual effort
• Automatically recover malcode
• Extract/unpack/recover attack code
– Memory dump, transient artifacts
[Figure: analysis engine. Labels: Input (Memory Dump, Process Context, Registers, Execution Trace, Log Files); Analysis Engine (Static, Dynamic, Hybrid); Output (Attack String, Malicious Code, Vulnerability); Arbitration; Obfuscation Removal (Obfuscated Code to Normalized Code)]
Problem
• Human oversight is costly
• Trade-off between
– Generic binary
– Malware specific
• Need
– Automated generic malware tool that approaches the detail of generic binary tools
[Figure: attack code analysis tools plotted by Human Oversight against Scope of Results. Labels: Heavyweight Binary Generic, Lightweight Malware Specific, Heavyweight Malware Specific]
Motivation, Existing Tools
• Only work within known boundaries
– Typically exclude support for code fragments
• e.g., shellcode
– Things get messy without given boundaries
• e.g., arbitrary byte streams
• Do not generically handle:
– Malformed, Misaligned
– Obfuscated, Armored
– Too specific or too abstracted
Solution: CodeXt
• Discovers executable code within memory dump
– Upon real-time detection
[Figure: DASOS forensic dump. Labels: Vital Runtime Information; Upon Detection, Write Dump to Disk; HDD]
Solution: CodeXt
• Extracts packed or obfuscated malcode
– First to generically handle Incremental and Shikata-Ga-Nai
[Figure: four memory snapshots (original memory, first, second, third) of a three-layer encoding. Labels: Decoder3 w/ K3, Decoder2 w/ K2, Decoder1 w/ K1, Encoded code and data, Layer 3 decoded, Layer 2 decoded, Layer 1 decoded, Transient code 1, Transient code 2]
Solution: CodeXt
• Uses data-flow analysis (taint tracking)
– Finds attack string within network traffic
• Models both shellcode and full executables
• Framework built upon S2E
– "Selective" means choosing between concrete execution in QEMU and symbolic execution in KLEE (LLVM)
[Figure: CodeXt architecture. Labels: Run-time memory dump, Run-time info, Offline Analysis, Dynamic Binary Analysis, Symbolic Execution, Intermediate results, Report of the attack, Recovered code, Obfuscation info]
Background
Background
• S2E, Selective Symbolic Execution
– KLEE for symbolic execution
– QEMU for concrete execution
• We extended QEMU to detect system calls
• KLEE
– Expressive IR allows low-level operations
• Down to the bit
– States = Shadow Memory + Constraints
– Memory = Expressions
• Even concrete values are expressions
Attack Code vs. Attack String
• Attack string:
– Crafted input to the process
– May include non-code
• Attack code:
– Executed within process
– May include immediate values (data)
• Removing layers of obfuscation
– How many, and by what function?
– What about self-destructive code?
Framing the Problem
• Assumptions
– All malicious code exists within dump
– Malicious code has not overwritten itself destructively
• Requirements
– No code semantics known
– Coding conventions irrelevant
– Capable of accuracy with self-modifying code
– Capable of modeling network-based server applications
CodeXt Output
• Instruction Trace of executed instructions
– Grouping of fragments into chunks
– Reveals original and unpacked malcode
– Assisted by a translation trace
• Data Trace of memory writes
– Intelligent memory update clustering
– Multi-layer snapshots
• Call Trace of system calls
– With CPU context
Data-flow Analysis Output
• For each labeled byte
– Follow propagation
– Generate trace
– Generate memory map
• Add events that qualify as success
– EIP contains tainted values
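The per-byte labeling above can be illustrated with a toy tracker. This is a minimal sketch under stated assumptions, not CodeXt's implementation: the `TaintTracker` class, its method names, and the convention of integers for memory addresses and strings for registers are all hypothetical.

```python
class TaintTracker:
    """Toy byte-level taint tracker (hypothetical; not CodeXt's code)."""

    def __init__(self):
        self.labels = {}  # location -> set of taint labels
        self.trace = []   # one (sources, destination, labels) entry per tainted move

    def label(self, location, name):
        """Attach a taint label to a location (int address or register name)."""
        self.labels.setdefault(location, set()).add(name)

    def propagate(self, sources, destination):
        """Model an operation that writes `destination` from `sources`."""
        merged = set()
        for src in sources:
            merged |= self.labels.get(src, set())
        if merged:
            self.labels[destination] = merged
            self.trace.append((tuple(sources), destination, frozenset(merged)))
        else:
            # Overwriting with untainted data clears any old labels.
            self.labels.pop(destination, None)
        return merged

    def memory_map(self):
        """Tainted memory addresses only (registers are excluded)."""
        return {loc: sorted(names) for loc, names in self.labels.items()
                if isinstance(loc, int)}

    def eip_tainted(self):
        """Success event from the slide: EIP holds tainted values."""
        return bool(self.labels.get("eip"))


if __name__ == "__main__":
    t = TaintTracker()
    t.label(0x1000, "a")                   # first labeled input byte
    t.propagate([0x1000], "eax")           # load into a register
    t.propagate(["eax"], "eip")            # indirect jump through eax
    print(t.eip_tainted(), t.memory_map())
```

Merging labels as a set union is one simple propagation policy; a real tracker would also need rules for arithmetic, partial overwrites, and control dependencies.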
Problems + Challenges + Solutions
Handling Byte Streams
• S2E expects well-structured binaries
– We wrap the binary for execution
• S2E uses basic block granularity
– Our modified QEMU translation returns more info
– We leverage translation and execution hooks to verify
[Figure labels: Buffer, Info, Host to Guest File Transfer, Wrapper, Guest OS, CodeXt S2E Plugin, S2E (Modified QEMU), Output]
Code Fragments
[Figure: S2E is run on the buffer at offset 1 through offset n, and the resulting fragments are matched]
• Fragmentation
– Clustering into Chunks, adjacency, execution trace
• Density
– Usage: Executed/Range
– Overlay: Unique executed/Range over snapshots
• Enclosure
– Continuous executable bytes adjacent to end
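One possible reading of the fragmentation and density bullets is sketched below. The function names, the `(start, length)` fragment format, and the interpretation of "Usage" as a single-snapshot ratio and "Overlay" as a ratio over the union of snapshots are assumptions made for illustration, not definitions from the slides.

```python
def cluster_chunks(fragments, gap=0):
    """Merge (start, length) fragments into (start, end) chunks by adjacency.

    Fragments whose ranges touch or overlap (within `gap` bytes) join one chunk.
    """
    chunks = []
    for start, length in sorted(fragments):
        end = start + length
        if chunks and start <= chunks[-1][1] + gap:
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            chunks.append((start, end))
    return chunks


def usage_density(executed, start, end):
    """Executed byte addresses inside [start, end) divided by the range size."""
    if end <= start:
        raise ValueError("empty range")
    hits = {addr for addr in executed if start <= addr < end}
    return len(hits) / (end - start)


def overlay_density(snapshots, start, end):
    """Unique executed addresses across all snapshots, divided by the range size."""
    if end <= start:
        raise ValueError("empty range")
    unique = set()
    for executed in snapshots:
        unique.update(addr for addr in executed if start <= addr < end)
    return len(unique) / (end - start)


if __name__ == "__main__":
    print(cluster_chunks([(0x10, 4), (0x14, 2), (0x20, 3)]))
    print(usage_density({0x10, 0x11, 0x12, 0x13}, 0x10, 0x18))
    print(overlay_density([{0x10, 0x11}, {0x11, 0x12, 0x13}], 0x10, 0x18))
```

A chunk with high density is a better candidate for real code than a sparse one, since stray bytes that happen to decode as instructions rarely execute contiguously.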
Defeating Obfuscation
• FPU instructions, fnstenv
– Added small change to QEMU to comply
• Intra-basic block self-modification
– We know address range of each translated block
– During execution we track writes
– If any write is to same block we retranslate block
• Emulator detection
– Tested for a set of obscure instructions used as canaries
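The intra-basic-block check described above reduces to an interval overlap test on each memory write. The sketch below assumes a hypothetical `Block` record and `on_memory_write` hook; neither is a QEMU or S2E API, and the real retranslation step is only represented by a counter.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Address range [start, end) of one translated basic block (hypothetical)."""
    start: int
    end: int
    translations: int = 1  # how many times this block has been translated


def write_overlaps(block, addr, size):
    """True if a write of `size` bytes at `addr` touches the block's range."""
    return addr < block.end and addr + size > block.start


def on_memory_write(current, addr, size):
    """Hook called for each write while `current` is executing.

    Returns True when the block modified itself and must be retranslated
    before execution continues; the counter stands in for that step.
    """
    if write_overlaps(current, addr, size):
        current.translations += 1
        return True
    return False


if __name__ == "__main__":
    block = Block(start=0x1000, end=0x1010)
    print(on_memory_write(block, 0x2000, 4))  # write elsewhere
    print(on_memory_write(block, 0x1008, 1))  # write into the running block
    print(block.translations)
```

Checking overlap rather than only the write's first address matters for multi-byte writes that begin just before the block and spill into it.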
Multipath, Arbitrary Bytes
• Multipath Execution
– Existing trace tool manages path merging
– KLEE manages state forking and resources
• Mark Arbitrary Bytes as Symbolic
[Figure: network traffic from the attacker carries the attack string into vulnerable process memory; the labeled network input bytes are a b c d e]
Executing Symbolic Code
• Taint labels can be searched upon events
– KLEE prefers constraints over solving
• Constraint cleanup
– Silent concretization
[Figure: exploited process memory with labeled bytes a b c d e, and the executed segments after decoding with their labels]
Executing Symbolic Code, cont'd
• Data-flow validity, intermingled code
• Symbolic EIP
• Periodic or triggered custom simplifier
• Inheritance enforcer
• Bit-wise and mov
[Figure: exploited process memory with labeled bytes a b c d e, and the executed segments after decoding with their labels]
Executable Modeling
• OS introspection
– Snag CR3 as PID
• Load and link overhead
– 95,000 instructions to ignore
– Canary
• Real-time attacks
– Buffer overflow
– Sockets
– SSL
Empirical Evaluation
Experiments, Part 1
• Hidden code search
– 1KB to 100KB buffers, 40B to 80B shellcodes
– Filled with either null, live-capture, or random bytes
– Varied assistance data: EIP, EAX, both, neither
• Accuracy
– De-obfuscation, Anti-emulation detection
– Various packers mentioned in previous research
– In-shop: Junk code insertion, Ranged xor, Incremental
• Symbolic Branching
Multi-Layered Encoders
[Figure: byte offsets 0 to 40. Legend: xor_key1, xor_key2 of xor_key1, xor_key2, junk inserted bytes]
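A layered encoding of the kind the figure legend suggests can be sketched with two overlapping ranged XOR passes. The helper name `ranged_xor`, the 40-byte payload, the key values, and the ranges are all hypothetical choices for illustration; the slides do not give the actual encoder parameters.

```python
def ranged_xor(buf, start, end, key):
    """XOR the bytes in buf[start:end] with a one-byte key; return a new bytes."""
    out = bytearray(buf)
    for i in range(start, end):
        out[i] ^= key
    return bytes(out)


# Hypothetical parameters: two one-byte keys applied over overlapping ranges.
KEY1, KEY2 = 0x5A, 0xC3
RANGE1 = (0, 25)    # bytes encoded with KEY1
RANGE2 = (15, 40)   # bytes encoded with KEY2; 15..24 carry both layers


def encode(payload):
    layer1 = ranged_xor(payload, *RANGE1, KEY1)
    return ranged_xor(layer1, *RANGE2, KEY2)


def decode(encoded):
    # Each XOR pass is its own inverse, so the layers peel off by reapplying them.
    layer1 = ranged_xor(encoded, *RANGE2, KEY2)
    return ranged_xor(layer1, *RANGE1, KEY1)


if __name__ == "__main__":
    payload = bytes(range(40))
    encoded = encode(payload)
    print(decode(encoded) == payload)
```

With plain XOR the layers commute, so decode order does not matter here; encoders whose key depends on previously decoded bytes do not have that property and must be unwound in order.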