


  1. Active Reversing Andrew Schaffer Greg Hoglund

  2. The goal Solve reverse engineering problems as quickly as possible without having to read disassembled code

  3. Advantages • Active Reversing reveals contextual relationships between user actions, behavior, code, and data • Active Reversing excels at classification and sorting problems • Active Reversing is really easy to use

  4. The business case • Active Reversing can save time – lots of time if used correctly • Active Reversing increases the labor pool – People without disassembly skills can participate • Active Reversing can be used in conjunction with traditional methods to increase productivity

  5. The need for data • Static analysis does not reveal data that is calculated at runtime nor does it illustrate motion – all of these things are left to assumption or prediction • The need for data is the reason that even die-hard static reverse-engineers always drop into a debugger at some point, or perform real input testing

  6. Here we are • We present a new methodology that is very data-flow centric • Our new method demands a whole new breed of tools • We have prototyped several of these new tools and we illustrate how to use them • Our company, HBGary, is committed to commercializing this new form of reverse engineering

  7. THE METHODOLOGY Part II

  8. The methodology Code and data flow is harvested at runtime, collected into sets, and blended together into a graph… ...this graph is refined iteratively until it solves the reverse engineering problem.

  9. Yes, a graph • Software is a bunch of small interrelated moving parts, naturally suited to a graph • But, to work, the graph must be able to illustrate the solution data – Relationship between objects or events, membership in a particular set, presence of specific data or content, etc – Almost anything can be a node, and edges represent relationships between arbitrary things, so this is actually quite flexible

  10. The “large graph problem” • Historically, graphs have been too large to interact with – The key word is “interact”

  11. Pretty, but dumb

  12. Ugly, and dumb

  13. Hyperbolic Graphing • Impressive and powerful, but not for us • Designed for large directed graphs, but clumsy when dealing with smaller, more manageable sets

  14. Stick to tradition • Smaller, more manageable graphs are best drawn in the traditional 2D layout with color and annotations

  15. Data reduction and refinement • The premise of Active Reversing is to show only what matters and nothing more • There is a significant reduction in the amount of data that must be analyzed • The refinement of the data converges upon the solution to the reverse engineering problem

  16. How: the working canvas The primary workspace is known as the “working canvas”

  17. Layers • Sets are layered onto the canvas, much in the same way that layers in Photoshop™ are combined into an image

  18. Set operations • Layers are an easy and convenient way to combine sets • All set operations (union, intersection, etc) can be represented using the layer system … via order, visibility, and blending mode
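
As a minimal sketch of how a layer stack could map onto set operations, the Python below flattens an ordered list of layers into the set shown on the canvas. The `Layer` class, the blending-mode names, and the function names in the example are hypothetical illustrations, not the authors' tool.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    members: frozenset
    mode: str = "union"      # hypothetical modes: "union", "intersect", "subtract"
    visible: bool = True


def flatten(layers):
    """Combine layers bottom-to-top into the set displayed on the canvas."""
    canvas = set()
    first = True
    for layer in layers:
        if not layer.visible:
            continue                      # hidden layers do not contribute
        if first or layer.mode == "union":
            canvas |= layer.members       # the bottom visible layer seeds the canvas
        elif layer.mode == "intersect":
            canvas &= layer.members
        elif layer.mode == "subtract":
            canvas -= layer.members
        else:
            raise ValueError(f"unknown blending mode: {layer.mode}")
        first = False
    return canvas


layers = [
    Layer("login coverage",
          frozenset({"recv_packet", "parse_login", "check_password", "log_event"})),
    Layer("string refs to 'password'",
          frozenset({"check_password", "change_password"}), mode="intersect"),
    Layer("startup noise", frozenset({"log_event"}), mode="subtract"),
]
print(flatten(layers))
```

Order matters here: swapping the intersect and subtract layers, or hiding one of them, changes the result, which is the same lever the slide describes as order, visibility, and blending mode.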

  19. Set harvesting • We will cover many tools for set harvesting – Dataflow tracing – Hit counting – Function coverage – String references – Symbolic information

  20. The methodology Harvest, combine, and refine!

  21. EXECUTION PARTITIONING Part III

  22. Active Reversing reveals contextual relationships between user actions, behavior, code, and data – to begin we start with code

  23. Assumptions about Behavior • Program behavior is in response to action that was just taken • Different behaviors are represented by different code – This is how compilers build software

  24. Examples of User Actions • Sending a packet • Causing a specific transaction, such as a login or copy-file command • Using a button or menu on the GUI • Moving a game character in 3-space • Unplugging or inserting hardware

  25. Execution Partitioning 101 • Rapidly locate the function(s) responsible for a particular program feature, isolate code by functionality – Incremental coverage sets – Noise removal
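
One way to picture incremental coverage sets and noise removal is plain set arithmetic over function-coverage runs. The sketch below assumes each run is already available as a set of function names; the run contents and the `partition` helper are invented for illustration.

```python
def partition(baseline_runs, action_runs):
    """Return functions seen in every action run but in no baseline run."""
    if not action_runs:
        return set()
    # Noise: anything that executes while the feature of interest is NOT used.
    noise = set().union(*baseline_runs)
    # Candidates: functions that executed every time the action was performed.
    candidates = set.intersection(*(set(run) for run in action_runs))
    return candidates - noise


# Hypothetical coverage sets: two idle runs, then two runs where "Save" was clicked.
baseline = [
    {"main_loop", "paint", "timer_tick"},
    {"main_loop", "paint", "net_keepalive"},
]
save_clicked = [
    {"main_loop", "paint", "on_save", "write_file"},
    {"main_loop", "timer_tick", "on_save", "write_file", "net_keepalive"},
]
print(partition(baseline, save_clicked))
```

Intersecting the action runs drops functions that only ran by coincidence, and subtracting the baseline removes background activity that runs regardless of the action.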

  26. Function Coverage

  27. Partitioning

  28. Execution Partitioning 201 • Change up the data content of the transaction to induce many possible responses

  29. Remember the data too! • It's not just about packets and menu items, but also about the data you type or insert • The contextual data associated w/ the user-initiated action plays a large part in how the program logic will respond – A packet w/ a bad checksum won’t get far – ‘$%%%%%$$$$$’ in the file-open dialog will do something different than ‘aZAzazzazAA’

  30. Partitioning more detail • General login processing • Handling of incorrect password • Handling of correct password • Handling of too many invalid attempts

  31. Example: File Paths • TBD

  32. Execution Partitioning 301 • Force error conditions, abortive logic, and exceptions through both data and direct action

  33. Remember the error state • Many user-initiated actions can induce both success and failure logic • Sending a good password versus sending a bad password • Moving before the spell-casting is complete • Unplugging the network cable when a file transfer is in progress

  34. Example: bad login • Response to bad password will cause some error handler to execute • Response to good password will execute a whole series of connection-initialization routines • The code for these two responses is physically separated in the program code
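
The login example can be sketched as grouping functions by exactly which runs they appeared in. The code below assumes one coverage set per login variant; the run labels and function names are hypothetical.

```python
from collections import defaultdict


def partition_by_runs(coverage):
    """Group functions by the exact combination of runs in which they executed."""
    seen_in = defaultdict(set)
    for label, functions in coverage.items():
        for fn in functions:
            seen_in[fn].add(label)
    groups = defaultdict(set)
    for fn, labels in seen_in.items():
        groups[frozenset(labels)].add(fn)
    return dict(groups)


# Hypothetical coverage sets harvested from three login attempts.
coverage = {
    "good":    {"recv_login", "hash_password", "init_session", "load_profile"},
    "bad":     {"recv_login", "hash_password", "send_error"},
    "lockout": {"recv_login", "hash_password", "send_error", "lock_account"},
}
for labels, functions in partition_by_runs(coverage).items():
    print(sorted(labels), "->", sorted(functions))
```

Functions present in all three runs correspond to general login processing, while those unique to one run point at the success, failure, or lockout handling.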

  35. DATA SAMPLING Part IV

  36. Active Reversing reveals contextual relationships between user actions, behavior, code, and data – now that we have code we can move on to data…

  37. Assumption: Data follows code • It makes sense that code that implements behavior must also touch data related to that behavior • Code and data flows are tightly coupled • They co-exist spatially in the context of the stack and the CPU registers

  38. Where we are • At this point in the process, your graph should be well partitioned • Because we know data follows code, we can begin examining dataflow by going to the already existing partition of interest

  39. Data sampling • Collect a detailed instruction-by-instruction sample history for a defined region of code – The collection space is bounded by the partition set thus granting a manageable computational overhead

  40. Example: looking for SQL statements • Find a region of code that is related to login • See if you can recover the SQL statements

  41. Data sample searching • Specific value search – You must know the specific value ahead of time • Can you query it from the software? (XYZ coordinate?) • Use regular expressions to perform detailed pattern scans over the sample set – Allows much larger sample sets to be analyzed in much shorter time if you already know what you’re looking for
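
A regular-expression scan over a sample set might look like the sketch below. It assumes each sample is a pair of (instruction address, raw bytes observed at that point); that layout, the addresses, and the pattern are illustrative assumptions rather than the format of any real tool.

```python
import re

# Printable ASCII run that starts with a common SQL verb (case-insensitive).
SQL_PATTERN = re.compile(rb"(?i)\b(?:SELECT|INSERT|UPDATE|DELETE)\b[\x20-\x7e]{4,}")


def search_samples(samples, pattern=SQL_PATTERN):
    """Return (address, matched_text) for every pattern hit in the sample set."""
    hits = []
    for address, data in samples:
        for match in pattern.finditer(data):
            hits.append((address, match.group().decode("ascii")))
    return hits


# Hypothetical samples collected inside the login partition.
samples = [
    (0x401A2C, b"\x00\x00SELECT id FROM users WHERE name='%s'\x00\x90"),
    (0x401B10, b"\x01\x02\x03\x04"),
    (0x401C44, b"update stats set hits=hits+1"),
]
for address, text in search_samples(samples):
    print(hex(address), text)
```

Because the pattern only requires a known verb followed by printable text, it finds statements whose exact content was not known ahead of time, which is the advantage over a specific value search.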

  42. Example: searching • Perform SQL search… TBD

  43. Tool: Data taps

  44. Tool: Statistical analysis on value series – Packet types over time
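
As a rough illustration of statistics over a value series, the sketch below buckets sampled values (for example, a packet-type field) into fixed time windows and counts each value per window. The series format and the packet-type values are made up for the example.

```python
from collections import Counter


def histogram_by_window(series, window):
    """Count each sampled value per time window.

    series: iterable of (timestamp_seconds, value) pairs.
    Returns {window_index: Counter({value: count})}.
    """
    buckets = {}
    for timestamp, value in series:
        index = int(timestamp // window)
        buckets.setdefault(index, Counter())[value] += 1
    return buckets


# Hypothetical packet-type samples taken from a data tap.
series = [(0.1, 0x01), (0.4, 0x01), (0.9, 0x02),
          (1.2, 0x01), (1.7, 0x01), (2.5, 0x02)]
for index, counts in sorted(histogram_by_window(series, 1.0).items()):
    print(index, dict(counts))
```

A value that appears rarely or at regular intervals stands out in such a table and can become the condition for a more targeted trace.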

  45. Tool: Conditional triggers – Trigger a deep trace on a specific data state and control flow location – Extends an existing partition, or builds a new partition by leveraging an existing one as a ‘jump off point’

  46. Example: Give me Warden! • Capture all the instructions of the warden client – Conditional deep trace on packet type (2E8?) – Add new functions into new set • Avoid adding functions from system DLLs
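
The conditional deep trace can be sketched over a simulated event stream. The code assumes each event is (call depth, module, function, data state), that the trace starts when a predicate holds at a chosen function, and that it ends when that call returns. The event layout, the function names, and the use of the slide's tentative `2E8` value as the packet type are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    location: str                        # function where the trigger is armed
    condition: Callable[[dict], bool]    # predicate over the sampled data state


def deep_trace(events, trigger, system_modules=frozenset()):
    """Collect functions that run beneath the triggering call, skipping system DLLs."""
    new_set = set()
    armed_depth = None
    for depth, module, function, state in events:
        if armed_depth is not None and depth <= armed_depth:
            armed_depth = None                       # the triggering call has returned
        if armed_depth is None:
            if function == trigger.location and trigger.condition(state):
                armed_depth = depth                  # start the deep trace here
                new_set.add(function)
        elif module not in system_modules:
            new_set.add(function)
    return new_set


events = [  # hypothetical trace
    (1, "game.exe", "dispatch_packet", {"packet_type": 0x10}),
    (2, "game.exe", "handle_chat", {}),
    (1, "game.exe", "dispatch_packet", {"packet_type": 0x2E8}),
    (2, "game.exe", "scan_memory", {}),
    (3, "kernel32.dll", "ReadProcessMemory", {}),
    (2, "game.exe", "hash_results", {}),
    (1, "game.exe", "idle", {}),
    (2, "game.exe", "render", {}),
]
trigger = Trigger("dispatch_packet", lambda s: s.get("packet_type") == 0x2E8)
print(deep_trace(events, trigger, system_modules={"kernel32.dll"}))
```

The existing partition supplies the trigger location (the jump-off point), and the returned set becomes a new layer that can be refined like any other.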

  47. Proximity Relevance • Cluster functions by relevance to a buffer or other memory range – Good for class reconstruction

  48. Locate the allocate and copy routines in the MIME decoding class • I need the allocation and copy routines so I can locate potential buffer overflows…
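
A simple reading of proximity relevance is to rank functions by how often they access a given memory range. The sketch below assumes an access log of (function, address) pairs; the log contents and function names are hypothetical.

```python
from collections import Counter


def rank_by_proximity(accesses, start, size):
    """Rank functions by the number of accesses inside [start, start + size)."""
    hits = Counter(fn for fn, addr in accesses if start <= addr < start + size)
    return hits.most_common()


# Hypothetical access log around a 0x100-byte MIME decode buffer at 0x5000.
accesses = [
    ("mime_alloc", 0x5000),
    ("mime_copy", 0x5004),
    ("mime_copy", 0x5010),
    ("mime_copy", 0x50FF),
    ("log_write", 0x9000),
    ("mime_alloc", 0x5100),   # one byte past the end of the buffer
]
print(rank_by_proximity(accesses, 0x5000, 0x100))
```

Functions that score against the same buffer are likely methods of the same class, which is why the slide suggests this for class reconstruction.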

  49. Freeform memory scanning – Scan all of memory for a value – Use hardware breakpoints to break on access • Limited to 4 at a time • Avoid stack addresses as they are in constant flux – Works well when you don’t have a well partitioned starting space
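
The scan-then-watch workflow can be sketched offline against memory snapshots. The code assumes regions are given as (base address, bytes) pairs and that the value is a little-endian 32-bit integer; the addresses and the stack range are invented, and setting the actual hardware breakpoints is left to the debugger.

```python
import struct


def scan_regions(regions, value, fmt="<I"):
    """Return every address where the packed value occurs in the snapshots."""
    needle = struct.pack(fmt, value)
    hits = []
    for base, data in regions:
        offset = data.find(needle)
        while offset != -1:
            hits.append(base + offset)
            offset = data.find(needle, offset + 1)
    return hits


def pick_watchpoints(hits, stack_range, limit=4):
    """Drop stack addresses and keep at most `limit` hardware breakpoint targets."""
    low, high = stack_range
    return [addr for addr in hits if not (low <= addr < high)][:limit]


needle = struct.pack("<I", 1337)
regions = [  # hypothetical heap and stack snapshots
    (0x10000, b"\x00" * 4 + needle + b"\xff" * 4 + needle),
    (0x7FFE0000, needle),
]
hits = scan_regions(regions, 1337)
print([hex(a) for a in pick_watchpoints(hits, (0x7FFE0000, 0x7FFF0000))])
```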

  50. Example: Finding the code that generates the login packets for WoW… • I need to find the login function for this game so I can build an emulation server… – Rabbit snare the login name – Dataflow trace – User-determined execution partitioning • which functions execute when we log in

  51. DATAFLOW TRACING Part V

  52. Dataflow • Trace every instruction and record how it affected the data • Trace all propagation of data • Record the arithmetic transformation at the time of propagation • View the transformation history on any data instance

  53. Functions use derived values and copies • In many cases, functions deal with copies of the original data, or values that were derived from the original data, so tracking just the initial memory range is not enough • Dataflow tracing reveals many more functions that deal with the subsequent data

  54. Tool: Follow a buffer – Follow a buffer, such as a packet, to track all derived values and copies of values that propagate into the program and reveal any function that touches any of these derived values
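
Following a buffer can be approximated with a small taint-propagation loop over a recorded trace. The sketch below assumes each trace entry is (function, destination, sources, operation) and, as a simplification, inherits history from the first tainted source only. The trace format, location names, and function names are hypothetical.

```python
def follow_buffer(trace, buffer_locations):
    """Propagate taint from the buffer and record each location's transformation history."""
    history = {loc: [] for loc in buffer_locations}   # tainted location -> list of steps
    touched = set()                                   # functions handling derived data
    for function, dest, sources, op in trace:
        tainted = [s for s in sources if s in history]
        if tainted:
            touched.add(function)
            # Simplification: inherit the history of the first tainted source only.
            history[dest] = history[tainted[0]] + [(function, op)]
        else:
            history.pop(dest, None)   # destination overwritten with unrelated data
    return touched, history


trace = [  # hypothetical instruction trace
    ("recv_packet", "eax", ["pkt[0]"], "mov"),
    ("recv_packet", "eax", ["eax"], "xor 0x5A"),
    ("decode_header", "hdr.len", ["eax"], "mov"),
    ("update_clock", "ebx", ["tick"], "mov"),
    ("alloc_body", "ecx", ["hdr.len"], "shl 2"),
    ("update_clock", "eax", ["tick"], "mov"),
]
touched, history = follow_buffer(trace, {"pkt[0]"})
print(sorted(touched))
print(history["ecx"])
```

Note that `alloc_body` never reads the packet itself, only a value derived from it, yet it still lands in the result set. Watching the original memory range alone would have missed it.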
