
Decompilation, type inference and finding the code to decompile



  1. Decompilation, type inference and finding the code to decompile
     Alan Mycroft
     Computer Laboratory, University of Cambridge
     http://www.cl.cam.ac.uk/users/am/
     30 January 2012

  2. Structure
     • Part 1: What is decompilation and why is it hard?
     • Part 2: Type reconstruction in decompilation

  3. Problem: given a binary .EXE what does it do?
     • Run it: and get a virus
     • Run it in a sandbox: better
     • Run it in a program instrumenter (‘dynamic analysis’): even better
     But any form of dynamic analysis under-approximates program behaviour: consider a trojan which only attacks one username and only on a Sunday evening. Running = testing = only exploring some paths.
     • Decompile it: re-write the binary in a high-level language with the high-level program having exactly the same execution paths as the low-level one. Harder than it sounds (simple cases easy).
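To make the under-approximation point concrete, here is a minimal sketch of such a guard in C. The function name `trigger_fires`, the username "alice" and the 18:00 to 23:59 window are all hypothetical choices for illustration; the weekday follows the `struct tm` convention where 0 means Sunday.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical trigger condition: true only for user "alice" on a Sunday
   (wday == 0, as in struct tm) between 18:00 and 23:59. Everything here
   is an illustrative assumption, not taken from any real sample. */
bool trigger_fires(const char *user, int wday, int hour)
{
    assert(user != NULL);
    return strcmp(user, "alice") == 0   /* one specific username      */
        && wday == 0                    /* only on a Sunday           */
        && hour >= 18 && hour <= 23;    /* only in the evening        */
}
```

A test run under any other username, day or hour never takes the guarded branch, so a dynamic analysis would report nothing unusual, whereas a decompiler that preserves all execution paths would still show the branch.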

  4. Decompilation: legality
     • Isn’t this one of those things which is illegal? Or at best ‘shady’?
     • Depends. Lost source code; US and EU permit decompilation for interoperability. Always a ‘vaguely suspect’ activity.
     • New reason: Stuxnet, Duqu. Sophisticated malware written in high-level code.

  5. Decompilation: techniques
     • Not always possible. The program may read in some code and branch to it, or use various other assembler-level tricks such as updating a return address. Not a problem for ‘dynamic binary translation’ (DBT) tools, but these effectively use dynamic analysis.
     • Always trivially possible: just prepend an x86 interpreter in your favourite high-level language to the .EXE file. Cheating solution.
     • In practice we need to make some assumptions . . .

  6. Decompilation: functionality vs beauty
     Functionality: “if we decompile foo.exe to foo.c then recompiling to foo2.exe has the same I/O behaviour as foo.exe”. This is safety, which requires any analysis to over-estimate behaviour.
     Beauty: “the code is readable to humans” (most of the rest of this talk).
     While there’s not obviously a conflict, functionality means we must include all possible executions, which include some a human might wish to ignore . . .

  7. Decompilation: functionality vs beauty (2)

         int f(int *p)
         {   p[read()] += 1;   // might increment the return address
             return 0;
         }
         int main()
         {   int r, v[10];
             putvaluesin(v);
             r = f(v);         // f always returns zero.
             r++;              // perhaps "inc eax" [one byte]
             print(r);
         }

     Might this program print 0? What if we only had the assembler code version? We can’t decompile back to the above code, because the compiler (or options) might differ (stack offset between v and the return address). For safety we might have to assume that almost every indirect write might overwrite a return address (adding many un-beautiful lines).

  8. Decompilation: functionality vs beauty (3)

         f:      pushl   %ebp
                 movl    %esp, %ebp
                 pushl   %ebx
                 subl    $4, %esp
                 movl    8(%ebp), %ebx
                 call    read
                 ;;;;; here eax=-7 hits f’s return address
                 incl    (%ebx,%eax,4)
                 addl    $4, %esp
                 xorl    %eax, %eax
                 popl    %ebx
                 popl    %ebp
                 ret

         main:   pushl   %ebp
                 movl    %esp, %ebp
                 andl    $-16, %esp
                 pushl   %ebx
                 subl    $76, %esp
                 leal    24(%esp), %ebx
                 movl    %ebx, (%esp)
                 call    putvaluesin
                 movl    %ebx, (%esp)
                 call    f
                 incl    %eax
                 movl    %eax, (%esp)
                 call    print

  9. Decompiling .EXE
     Needs pipeline:
     • obtain machine code: not always easy if a packer is used, e.g. self-extracting archive
     • obtain assembler code: often a choice between readable assembly and missing some execution path
     • obtain high-level code (reconstruct loops, high-level expressions, types, even classes): again a choice between readable source and missing some behaviours. First part of the sub-pipeline here is partitioning the code into procedures, e.g. is a branch between two sections of assembler just a branch, or actually an optimised tailcall?

  10. Economic argument
      Decompilation can easily give a false impression of safety as it can miss malware-style attacks such as buffer overflow.
      However, even richly funded malware (e.g. Stuxnet) suffers from the “it’s not cost-effective to write everything in machine code” argument, with the result that much of it admits simple decompilation techniques.
      So, while malware will often contain “zero-day attacks” written in carefully crafted C or assembler, much of the high-level logic (both in malware and non-malware) will be written in “C which means C”.

  11. Analogy to testing and verification
      • Running in a sandbox, or DBT, is like testing.
      • Can use ‘coverage’ metrics to help identify non-executed paths.
      • Safe decompilation is like verification: we consider all paths.
      • When disassembling/decompiling for human readability we may ignore some paths (e.g. assumptions of possible destinations of indirect branches). This is verification subject to assumptions of various run-time invariants.
      • Determining whether some paths are feasible is a least-fixed-point problem. E.g. virtual calls can only be determined as targeting a particular destination if we can resolve an alias which is only resolvable if we know the virtual calls only target expected destinations . . .
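The circular dependency in the last bullet can be sketched as an iteration to a least fixed point. The model below is a deliberate simplification and entirely an assumption for illustration: blocks are bits in a mask, `stores_target` stands in for "this block may write these code addresses into a function pointer", and `feasible_blocks` is a hypothetical name. Starting from only the entry block and an empty target set, both sets grow together until nothing changes.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 8

/* Simplified, hypothetical CFG node. */
struct block {
    uint32_t direct_succ;    /* bitmask of direct successor blocks          */
    uint32_t stores_target;  /* blocks whose address this block may store
                                into a function pointer                     */
    int      indirect_call;  /* nonzero if the block ends in an indirect call */
};

/* Least fixed point: the smallest set of blocks (and of indirect-call
   targets) consistent with the CFG, grown from the entry block. */
uint32_t feasible_blocks(const struct block cfg[NBLOCKS], int entry)
{
    assert(entry >= 0 && entry < NBLOCKS);
    uint32_t reach = 1u << entry;   /* blocks known to be feasible    */
    uint32_t targets = 0;           /* known indirect-call targets    */
    for (;;) {
        uint32_t new_reach = reach, new_targets = targets;
        for (int b = 0; b < NBLOCKS; b++) {
            if (!(reach & (1u << b)))
                continue;                       /* block not (yet) feasible */
            new_reach   |= cfg[b].direct_succ;
            new_targets |= cfg[b].stores_target;
            if (cfg[b].indirect_call)
                new_reach |= targets;           /* may call any known target */
        }
        if (new_reach == reach && new_targets == targets)
            return reach;                       /* nothing changed: fixed point */
        reach = new_reach;
        targets = new_targets;
    }
}
```

Because both masks only ever gain bits, the loop terminates. A target stored only by an infeasible block is never added, which is the sense in which the least fixed point ignores "unexpected" destinations.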

  12. Decompilation: which high-level language?
      • since assembler code is type-unsafe, we probably need a type-unsafe language to express things.
      • however if we’ve already given up on some things (e.g. we’re assuming no wild writes change return addresses) then perhaps we are willing to only consider programs with type-sensible data flow?
      • if we’re decompiling type-safe assembler code (e.g. JVM) we can safely decompile to a type-safe high-level language.
      • however, may still need to recreate abstract data types whose interface has been compiled away (e.g. generics in Java or ADTs).

  13. Functionality and Beauty (partly) reconciled
      Could in principle decompile assembler to C which is then compiled with safe-C style checks.
      • Whenever there is a potential missed behaviour in the generated C (e.g. index out of bounds) then detect this at run-time and refine the decompilation.
      • Doesn’t work for spotting trojan malware which attempts to stay hidden unless some carefully crafted condition holds.
      E.g. Akritidis’ PhD work on cheap run-time checks for C mis-behaviour.
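As a rough illustration of what a safe-C style check might look like in the generated code, the sketch below guards the indexed increment from the earlier `f` example. The names `checked_increment` and `missed_behaviour_count`, and the idea of counting violations, are assumptions made for this sketch; they are not the mechanism of any particular checker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter of accesses that the decompiled C did not anticipate. */
static size_t missed_behaviours = 0;

/* Checked replacement for "p[i] += 1" on an array of known length n.
   Returns 1 if the access was in bounds, 0 if it was rejected. */
int checked_increment(int *p, size_t n, long i)
{
    assert(p != NULL);
    if (i < 0 || (size_t)i >= n) {
        missed_behaviours++;    /* signal: the decompilation needs refining */
        return 0;
    }
    p[i] += 1;
    return 1;
}

/* Number of out-of-bounds accesses seen so far. */
size_t missed_behaviour_count(void)
{
    return missed_behaviours;
}
```

With `v[10]`, an index of -7 (the value that reached the return address in the assembler listing) is rejected and counted rather than silently performed, which is the cue to go back and refine the decompiled source.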

  14. The interpreter problem
      What if one carefully decompiles a program and finds out that the .EXE consists of an interpreter (e.g. for some bytecode) which does decompile nicely, followed by another layer of code in some mysterious language?
      • Start again at the next level
      • Issues if encryption is added.

  15. Obfuscation to counter-attack decompilers
      There are various ways to make code hard to decompile. One (Lokhmotov’s masters thesis) is:
      • flatten a general CFG into a loop containing a dispatch to all the basic blocks in the CFG which then branch to the main loop.
      • dispatcher uses a new variable representing the PC within the original CFG.
      • can be strengthened by using a one-way hash function on the state.
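The flattening transformation can be sketched on a tiny example. The two functions below compute the same sum; the second is a hand-flattened version in which a new variable `pc` selects the next basic block inside a single dispatch loop. The function names and block numbering are illustrative assumptions, and the hash-based strengthening is not shown.

```c
#include <assert.h>

/* Original structured code: sum of 1..n. */
int sum_structured(int n)
{
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}

/* The same CFG flattened into one loop around a dispatcher. */
int sum_flattened(int n)
{
    int s = 0, i = 0;
    int pc = 0;                 /* new variable: the PC within the original CFG */
    for (;;) {
        switch (pc) {           /* dispatcher */
        case 0: s = 0; i = 1; pc = 1; break;     /* entry block          */
        case 1: pc = (i <= n) ? 2 : 3; break;    /* loop test            */
        case 2: s += i; i++; pc = 1; break;      /* loop body            */
        case 3: return s;                        /* exit block           */
        default: assert(0 && "unreachable"); return -1;
        }
    }
}
```

Every basic block now has the dispatcher as its only successor, so the loop structure of the original is no longer visible in the CFG and has to be recovered by reasoning about the values of `pc`.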

  16. The decompilation pipeline
      Input: assembler code
      Output: high-level code (e.g. C)
      • Partition code into procedures (may need code duplication). Need estimates of targets of indirect branches/calls.
      • Reconstruct control-flow (e.g. Cifuentes’ work). Irreducible CFG (perhaps produced by compiler optimisation) may need fixing up.
      • Transform to SSA form. Undoes register allocation etc.
      • Use dataflow analysis to reconstruct high-level expressions. Note C order-of-evaluation issues with f() + g() versus let x = f() in x + g() versus let y = g() in f() + y.
      • Generate high-level types, add casts if needed.
      These tasks are largely independent, apart from the first.
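The order-of-evaluation note can be made concrete with two side-effecting functions. In the sketch below, `f` and `g` append a letter to a small trace buffer; all names (`trace`, `log_call`, `f_then_g`, `g_then_f`, `unspecified`) are hypothetical. The two let-style forms fix the call order, whereas C leaves the order of the operands of `+` unspecified, so a decompiler that folds the assembler's definite call order back into `f() + g()` may change the behaviour.

```c
#include <assert.h>
#include <string.h>

static char trace[8];                    /* records the order of calls */

static void log_call(char c)
{
    size_t n = strlen(trace);
    assert(n + 1 < sizeof trace);        /* keep room for the terminator */
    trace[n] = c;
    trace[n + 1] = '\0';
}

static int f(void) { log_call('f'); return 1; }
static int g(void) { log_call('g'); return 2; }

/* Trace left behind by the most recent call below. */
const char *last_trace(void) { return trace; }

/* let x = f() in x + g(): f is sequenced before g. */
int f_then_g(void) { trace[0] = '\0'; int x = f(); return x + g(); }

/* let y = g() in f() + y: g is sequenced before f. */
int g_then_f(void) { trace[0] = '\0'; int y = g(); return f() + y; }

/* f() + g(): the result is fixed, the call order is not. */
int unspecified(void) { trace[0] = '\0'; return f() + g(); }
```

All three return 3, but only the first two guarantee what `last_trace()` reports afterwards; for `unspecified` either order is permitted.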
