Binary-level program analysis: Static Disassembly Gang Tan CSE 597 Spring 2019 Penn State University
Disassemblers • Disassembler – Convert machine code in a binary file into assembly code or code in an equivalent IR • Assume a decode function – decode(code, offset) returns the next instruction and the instruction’s size • Assuming code is a list of bytes, and offset is the beginning of the next instruction – E.g., assume code=“6A 03 83 C4 0C B8 CC CC CC CC” • decode(code,0) => (“push 3”, 2) – “6A 03” • decode(code,2) => (“add esp, 0x0C”, 3) – “83 C4 0C” • decode(code,5) => (“mov eax, 0xCCCCCCCC”, 5) – “B8 CC CC CC CC”
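The decode interface above can be sketched in Python. This is a toy under stated assumptions, not a real x86 decoder: it recognizes only the three encodings from the example, and the error behavior (raising `ValueError`) is an illustrative choice.

```python
def decode(code, offset):
    """Toy decoder: knows only the three encodings used in the example.

    Returns (instruction text, size in bytes); raises ValueError otherwise.
    """
    rest = code[offset:]
    if len(rest) >= 2 and rest[0] == 0x6A:           # push imm8 (sign extension ignored)
        return f"push {rest[1]}", 2
    if len(rest) >= 3 and rest[:2] == b"\x83\xC4":   # add esp, imm8
        return f"add esp, 0x{rest[2]:02X}", 3
    if len(rest) >= 5 and rest[0] == 0xB8:           # mov eax, imm32 (little-endian)
        imm = int.from_bytes(rest[1:5], "little")
        return f"mov eax, 0x{imm:08X}", 5
    raise ValueError(f"cannot decode at offset {offset}")


code = bytes.fromhex("6A 03 83 C4 0C B8 CC CC CC CC")
for off in (0, 2, 5):
    print(off, decode(code, off))
```

Each call returns the instruction together with its size, so the caller can compute where the next instruction starts.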
Dynamic vs Static Disassemblers • Dynamic disassemblers – The binary code is executed and the execution traces are recorded and decoded – Advantages: accurate, can disassemble obfuscated binary code – Disadvantages: takes time to record traces; only covers one execution path at a time • Static disassemblers – A binary file is disassembled without executing it – Advantages: fast; covers more than one execution path – Disadvantages: many challenges; vulnerable to code obfuscation
Static Disassemblers • Input: a binary file • Goal – Disassemble the executable sections in the binary file – May use other information in the binary file • E.g., symbol tables if they are available • Output: a Control-Flow Graph (CFG)
CFG • Nodes are basic blocks of assembly instructions – A basic block is a piece of straight-line code: no jumps into or out of the middle of a basic block • Directed edges connect basic blocks – An edge from b1 to b2 means that b2 may execute immediately after b1 finishes • A basic block may have multiple outgoing edges – E.g., when it ends with a conditional jump instruction
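One way to turn decoded instructions into basic blocks and edges is the textbook "leader" approach, sketched below. The slides do not prescribe this method. The instruction representation (`offset -> (size, kind, dest)`) and the sample program are hypothetical.

```python
def basic_blocks(instrs):
    """instrs: dict offset -> (size, kind, dest).

    kind is one of "other", "jmp", "cond-jmp", "ret"; dest is a jump target or None.
    Returns dict: block leader offset -> list of instruction offsets in the block.
    """
    leaders = {min(instrs)}
    for off, (size, kind, dest) in instrs.items():
        if kind != "other":
            if off + size in instrs:   # instruction after a control transfer
                leaders.add(off + size)
            if dest in instrs:         # jump target
                leaders.add(dest)
    blocks, current, prev_end = {}, None, None
    for off in sorted(instrs):
        # Start a new block at a leader or after a gap (e.g., skipped data).
        if off in leaders or off != prev_end:
            current = off
            blocks[current] = []
        blocks[current].append(off)
        prev_end = off + instrs[off][0]
    return blocks


def cfg_edges(instrs, blocks):
    """Edges between blocks, derived from each block's last instruction."""
    edges = set()
    for leader, offs in blocks.items():
        last = offs[-1]
        size, kind, dest = instrs[last]
        if kind in ("jmp", "cond-jmp") and dest in blocks:
            edges.add((leader, dest))          # taken branch
        if kind in ("other", "cond-jmp") and last + size in blocks:
            edges.add((leader, last + size))   # fall-through
    return edges


# Hypothetical program: a conditional jump at offset 2 skips the instruction at 4.
instrs = {
    0: (2, "other", None),
    2: (2, "cond-jmp", 6),
    4: (2, "other", None),
    6: (1, "other", None),
    7: (1, "ret", None),
}
blocks = basic_blocks(instrs)
edges = cfg_edges(instrs, blocks)
print(blocks, sorted(edges))
```

The block ending in the conditional jump gets two outgoing edges, matching the last bullet above.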
CFG Example (figure omitted)
Static Disassembly Challenges • Variable-sized instruction sets – Instruction boundaries are not known in stripped binaries • Embedded data in code – E.g., a compiler may embed jump tables in the code section – Note: compilers do less of this nowadays, but an obfuscator might still do it • Targets of indirect jumps/calls require static analysis – E.g., “jmp 16[ebp]”, “call eax”
Static Disassemblers • Some major algorithms – Linear sweep – Recursive traversal – Some mixed approach
Linear Sweep • Linear Sweep – Start at the entry point of a code section – Decode instructions one by one until the end or an illegal instruction is reached • The Unix utility program objdump adopts linear sweep
Linear Sweep Pseudo Code
– Input
  • code holds the bytes of the input code section
  • codeSize is the code section size
– Assume decode throws an exception when it fails (cannot decode, reaches the end of the code buffer, etc.)

  currOffset = 0;
  instrSet = {};
  while (currOffset < codeSize) {
    (instr, size) = decode(code, currOffset);
    instrSet = instrSet ∪ {(currOffset, instr, size)};
    currOffset += size;
  }
  buildCFG(instrSet);   // build basic blocks and add CFG edges
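The pseudocode can be sketched as runnable Python. The `decode` here is a toy that handles only three x86 encodings, and stopping at the first decode failure is one possible reading of "until the end or an illegal instruction is reached". CFG construction is left out.

```python
def decode(code, offset):
    """Toy decoder (assumption): three x86 encodings only, ValueError otherwise."""
    rest = code[offset:]
    if len(rest) >= 2 and rest[0] == 0x6A:           # push imm8
        return f"push {rest[1]}", 2
    if len(rest) >= 3 and rest[:2] == b"\x83\xC4":   # add esp, imm8
        return f"add esp, 0x{rest[2]:02X}", 3
    if len(rest) >= 5 and rest[0] == 0xB8:           # mov eax, imm32
        imm = int.from_bytes(rest[1:5], "little")
        return f"mov eax, 0x{imm:08X}", 5
    raise ValueError(f"cannot decode at offset {offset}")


def linear_sweep(code):
    """Decode instructions back to back from offset 0."""
    instrs = []
    offset = 0
    while offset < len(code):
        try:
            instr, size = decode(code, offset)
        except ValueError:
            break                     # illegal instruction: stop the sweep
        instrs.append((offset, instr, size))
        offset += size                # next instruction starts right after this one
    return instrs


result = linear_sweep(bytes.fromhex("6A0383C40CB8CCCCCCCC"))
for entry in result:
    print(entry)
```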
Linear Sweep
• Advantage: simple and easy to implement
• Disadvantage
  – Mistreats data as code when the two are mixed
  – Code section:  instr | instr | data | instr | instr
    The instruction just before the data could be (1) a jump over the data to the next instruction, or (2) a ret
  – Linear sweep:  instr | instr | wrong | wrong | wrong
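A small experiment makes the failure concrete. The byte sequence below is a hypothetical example: a short jump skips one data byte (0xB8), which is followed by four nops and a ret. The toy `decode` knows only these four opcodes.

```python
def decode(code, offset):
    """Toy decoder (assumption): jmp rel8, mov eax imm32, nop, ret only."""
    rest = code[offset:]
    if len(rest) >= 2 and rest[0] == 0xEB:           # jmp rel8
        rel = int.from_bytes(rest[1:2], "little", signed=True)
        return f"jmp {offset + 2 + rel}", 2
    if len(rest) >= 5 and rest[0] == 0xB8:           # mov eax, imm32
        imm = int.from_bytes(rest[1:5], "little")
        return f"mov eax, 0x{imm:08X}", 5
    if rest[:1] == b"\x90":
        return "nop", 1
    if rest[:1] == b"\xC3":
        return "ret", 1
    raise ValueError(f"cannot decode at offset {offset}")


def linear_sweep(code):
    instrs, offset = [], 0
    while offset < len(code):
        instr, size = decode(code, offset)
        instrs.append((offset, instr, size))
        offset += size
    return instrs


# jmp over one data byte (0xB8), then nop; nop; nop; nop; ret
code = bytes.fromhex("EB 01 B8 90 90 90 90 C3")
swept = linear_sweep(code)
for entry in swept:
    print(entry)
```

The data byte 0xB8 is read as the opcode of `mov eax, imm32`, so the four real nops are swallowed as its immediate. The sweep re-synchronizes only at the final `ret`.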
Recursive Traversal • Idea – Disassemble instructions by following the control flow graph as it is constructed during disassembly
Recursive Traversal Pseudo Code

  worklist = {0};
  processed = {};
  while (worklist <> {}) {
    offset = removeOneNode(worklist);
    processed = processed ∪ {offset};
    (instr, size) = decode(code, offset);
    switch (instr)
      case non-control-flow-instr: add(offset+size);
      case unconditional-jmp(dest): add(dest);
      case cond-jmp(dest1,dest2): add(dest1); add(dest2);
      …
  }

  Procedure add(offset):
    if (offset ∉ processed) then worklist = worklist ∪ {offset}
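A minimal Python sketch of the worklist algorithm follows. The toy `decode`, its `(kind, dest)` instruction representation, and the handling of `ret` (no static successor) are assumptions made for illustration. The input is a hypothetical sequence in which a short jump skips one data byte (0xB8) before four nops and a ret.

```python
def decode(code, offset):
    """Toy decoder (assumption): returns ((kind, dest), size)."""
    rest = code[offset:]
    if rest[:1] == b"\x90":                          # nop
        return ("other", None), 1
    if rest[:1] == b"\xC3":                          # ret
        return ("ret", None), 1
    if len(rest) >= 5 and rest[0] == 0xB8:           # mov eax, imm32
        return ("other", None), 5
    if len(rest) >= 2 and rest[0] in (0xEB, 0x74):   # jmp rel8 / je rel8
        rel = int.from_bytes(rest[1:2], "little", signed=True)
        kind = "jmp" if rest[0] == 0xEB else "cond-jmp"
        return (kind, offset + 2 + rel), 2
    raise ValueError(f"cannot decode at offset {offset}")


def recursive_traversal(code, entry=0):
    """Return the sorted offsets of instructions reached by following control flow."""
    worklist, processed, found = {entry}, set(), set()
    while worklist:
        offset = worklist.pop()
        processed.add(offset)
        try:
            (kind, dest), size = decode(code, offset)
        except ValueError:
            continue                            # undecodable: drop this path
        found.add(offset)
        if kind == "other":
            successors = [offset + size]        # fall through
        elif kind == "jmp":
            successors = [dest]
        elif kind == "cond-jmp":
            successors = [dest, offset + size]  # taken and fall-through
        else:
            successors = []                     # ret: target not known statically
        for s in successors:
            if s not in processed and 0 <= s < len(code):
                worklist.add(s)
    return sorted(found)


offsets = recursive_traversal(bytes.fromhex("EB 01 B8 90 90 90 90 C3"))
print(offsets)
```

The data byte at offset 2 is never decoded because no control-flow path leads to it.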
Recursive Traversal • Advantage – Recursive traversal can accommodate data embedded in code section • Disadvantage: – Hard to determine the control‐flow edges out of indirect jumps and calls • IDA Pro uses recursive traversal – A commercial disassembler – An incomplete control flow graph (CFG) is emitted – The CFG is incomplete because there is no edge for indirect branch or call instructions
Disassembling Obfuscated Code • “Static Disassembly of Obfuscated Binaries” by Kruegel et al., USENIX Security 2004 – Linear sweep and recursive traversal are combined – Heuristics are used to remove spurious nodes from initial CFGs • Obfuscated binaries – No symbol info – Obfuscated by, e.g., inserting unreachable junk data into the code section • E.g., “ins1; ins2” => “ins1; mov eax, someConst; jmp eax; junk bytes; ins2”
Disassembling Obfuscated Code • The algorithm: – Identify functions • Match binary code with common prologs • Common prolog: “push %ebp; mov %esp, %ebp”, i.e. 0x55 89 e5 – Construct intra‐procedural CFGs • Decode from every address; throw away illegal instructions – To accommodate variable‐sized instruction sets – May result in overlapping instructions • Identify all direct jump instructions in a function • Direct jump instructions whose targets are inside the function and direct conditional branch instructions are selected as jump candidates • An initial CFG is constructed by treating the entry instruction and jump candidates as the starting points using recursive traversal – Resolve block conflicts in the initial CFGs • Five steps are taken to remove spurious nodes
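Two of the steps above, prolog matching and decoding from every address, can be sketched as follows. The toy `decode` recognizes five encodings only and the sample bytes are hypothetical. A real x86 decoder would accept more byte values (for example, 0xE5 is a valid opcode in real x86, while this toy rejects it).

```python
PROLOG = bytes.fromhex("55 89 E5")   # push %ebp; mov %esp, %ebp


def find_prologs(code):
    """Offsets where the common function prolog occurs."""
    hits, start = [], 0
    while (i := code.find(PROLOG, start)) != -1:
        hits.append(i)
        start = i + 1
    return hits


def decode(code, offset):
    """Toy decoder (assumption): five encodings, ValueError otherwise."""
    rest = code[offset:]
    if rest[:1] == b"\x55":
        return "push ebp", 1
    if rest[:2] == b"\x89\xE5":
        return "mov ebp, esp", 2
    if len(rest) >= 5 and rest[0] == 0xB8:
        return "mov eax, imm32", 5
    if rest[:1] == b"\x90":
        return "nop", 1
    if rest[:1] == b"\xC3":
        return "ret", 1
    raise ValueError(f"cannot decode at offset {offset}")


def decode_everywhere(code):
    """Try every offset; keep the ones that decode to a legal instruction."""
    candidates = {}
    for off in range(len(code)):
        try:
            candidates[off] = decode(code, off)
        except ValueError:
            pass                      # throw away illegal instructions
    return candidates


def conflicts(candidates):
    """Pairs of candidate instructions whose byte ranges overlap."""
    offs = sorted(candidates)
    return [(a, b) for i, a in enumerate(offs) for b in offs[i + 1:]
            if a + candidates[a][1] > b]


code = bytes.fromhex("55 89 E5 B8 90 90 90 90 C3")
candidates = decode_everywhere(code)
print(find_prologs(code), sorted(candidates), conflicts(candidates))
```

The overlapping pairs are the raw material for the block-conflict resolution step: at most one instruction of each pair can be real.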
Example Program (figure omitted)
Block Conflict Resolution: initial control flow graph. Legend: blue nodes represent nodes in the real CFG; red nodes represent spurious nodes; node A is the entry node; pink dashed lines indicate a conflict between two nodes; solid arrows represent edges in the initial CFG.
Block Conflict Resolution • The first step removes nodes that conflict with valid nodes – The entry node must be valid – Nodes reachable from a valid node must be valid – Nodes in conflict with valid nodes must be invalid • The second step removes common ancestors of conflicting nodes – Assumption: valid nodes do not overlap – If two nodes in conflict share an ancestor, the ancestor must be invalid • The third step removes conflicting nodes with fewer predecessors – Assumes that valid nodes are more tightly integrated into a CFG – A node with more predecessors implies tighter integration – Clearly a heuristic • The fourth step removes conflicting nodes with fewer direct successors – Assumes that valid nodes are more tightly integrated into a CFG – More direct successors implies tighter integration – Also a heuristic • The last step removes nodes in conflict randomly – Pick one of two conflicting nodes at random – A last resort
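The first step can be sketched as a reachability computation followed by a filter. This is a simplified reading of the bullets above, not the paper's implementation. The graph, the node names X and Y, and the conflict pairs are hypothetical.

```python
from collections import deque


def reachable(succ, entry):
    """Nodes reachable from entry by following CFG edges (breadth-first)."""
    seen, queue = {entry}, deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def step_one(succ, conflicts, entry):
    """Mark nodes reachable from the entry as valid; remove nodes that conflict with them."""
    valid = reachable(succ, entry)
    removed = set()
    for a, b in conflicts:
        if a in valid and b not in valid:
            removed.add(b)
        elif b in valid and a not in valid:
            removed.add(a)
        # If neither node is valid, the conflict is left for the later steps.
    return valid, removed


# Hypothetical initial CFG: A is the entry; X overlaps C; Y overlaps X.
succ = {"A": ["C"], "C": ["D"], "X": ["D"], "Y": []}
conflicts = {("C", "X"), ("X", "Y")}
valid, removed = step_one(succ, conflicts, "A")
print(sorted(valid), sorted(removed))
```

Here X is removed because it conflicts with the valid node C. Y conflicts only with X, so this step does not decide it.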
Block Conflict Resolution: control flow graph after the first step (Node B is removed)
Block Conflict Resolution: control flow graph after the second step (Node J is removed)
Block Conflict Resolution: control flow graph after the third step (Node K is removed)
Block Conflict Resolution: control flow graph after the fourth step (Node C is removed)
Disassembler Accuracy
Percentage of instructions correctly disassembled by each tool on SPECint 95. All programs went through an obfuscation tool.

Program      Objdump   Linn/Debray   IDA Pro   This paper
compress95   56.07     69.96         24.19     91.04
gcc          65.54     82.18         45.09     88.45
go           66.08     78.12         43.01     91.81
ijpeg        60.82     74.23         31.46     91.60
li           56.65     72.78         29.07     89.86
m88ksim      58.42     75.66         29.56     90.39
perl         57.66     72.01         31.36     86.93
vortex       66.02     76.97         42.65     90.71
Mean         60.91     75.24         34.55     90.10
Student Presentations related to Static Disassembly • Shingled Graph Disassembly: Finding the Undecideable Path; presenter: Yi Zheng • Superset Disassembly: Statically Rewriting x86 Binaries Without Heuristics; presenter: Eric Pauley • Static Binary Rewriting without Supplemental Information; presenter: Tingwei Hua • Recognizing Functions in Binaries with Neural Networks; presenter: Ryan Sheatsley
Next: Static Analysis Basics • On a high-level language – Techniques applicable to assembly code – We will read papers that apply static analysis to assembly code • Dataflow analysis – First discuss the theory – Then go through an implementation in Datalog • Inter-procedural analysis; flow-sensitivity, path-sensitivity, context-sensitivity