Vassil Roussev: The Current Forensic Workflow


  1. Vassil Roussev

  2. The Current Forensic Workflow
     Forensic Target (3TB)
      o Clone @150MB/s: ~5.5 hrs
      o Process @10MB/s*: ~82.5 hrs
     We can start working on the case after ~88 hours.
     * http://accessdata.com/distributed-processing

  3. Scalable Forensic Workflow
     Forensic Target (3TB): clone and process concurrently @150MB/s
     We can start working on the case immediately.

  4. Current Forensic Processing
      o Hashing/filtering/correlation
      o File carving/reconstruction
      o Indexing
     The ultimate goal of this work is to make similarity hash-based correlation scalable & I/O-bound.

  5. Motivation for the similarity approach: Traditional hash filtering is failing
     Known file filtering:
      o Crypto-hash known files, store in a library (e.g., NSRL)
      o Hash files on the target
      o Filter in/out depending on interest
     Challenges:
      o Static libraries are falling behind
         - Dynamic software updates, trivial artifact transformations
         - We need version correlation
      o Need to find embedded objects
         - Block/file in file/volume/network trace
      o Need higher-level correlations
         - Disk-to-RAM
         - Disk-to-network

  6. Scenario #1: Fragment Identification
     Source artifacts (files) vs. disk fragments (sectors) and network fragments (packets)
     Given a fragment, identify its source:
      o Fragments of interest are 1-4KB in size
      o Fragment alignment is arbitrary

  7. Scenario #2: Artifact Similarity
     Similar files (shared content/format); similar drives (shared blocks/files)
     Given two binary objects, detect similarity/versioning:
      o Similarity here is purely syntactic
      o Relies on commonality of the binary representations

  8. Solution: Similarity Digests
     [Diagram: sdhash generates a digest (sdbf) for each artifact; comparing two digests yields a score]
      o Is this fragment present on the drive? -> score 0..100
      o Are these artifacts correlated? -> score 0..100
     All correlations are based on bitstream commonality.

  9. Quick Review: Similarity digests & sdhash

  10. Generating sdhash fingerprints (1)
      Digital artifact (block/file/packet/volume) as byte stream … Features: all 64-byte sequences
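
     As a minimal sketch of this step (illustrative only, not the libsdbf implementation), the feature set is every 64-byte window of the input, taken at every byte offset:

        def all_features(data: bytes, size: int = 64):
            """Yield every 64-byte sliding-window sequence of the artifact."""
            for i in range(len(data) - size + 1):
                yield data[i:i + size]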

  11. Generating sdhash fingerprints (2)
      Digital artifact … Select characteristic features (statistically improbable/rare)

  12. Generating sdhash fingerprints (3)
      Feature Selection Process: all features -> weak-feature filter (H_norm, 0..1000; drops data with low information content) -> rare/local feature selector
      [Figure: (a) H_norm probability distribution for doc files]
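
     To make the weak-feature filter concrete, here is a hedged Python sketch that scores each 64-byte feature by normalized Shannon entropy on the 0..1000 H_norm scale from the slide and drops low-information features; the cutoff value and the use of plain entropy are illustrative assumptions, not the exact sdhash selection rules.

        import math
        from collections import Counter

        def h_norm(feature: bytes) -> int:
            """Normalized Shannon entropy of a 64-byte feature, scaled to 0..1000."""
            counts = Counter(feature)
            n = len(feature)
            entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
            max_entropy = math.log2(min(n, 256))  # upper bound for an n-byte window
            return int(1000 * entropy / max_entropy)

        def filter_weak(features, cutoff: int = 100):
            """Drop features with low information content (e.g., runs of zeros)."""
            return [f for f in features if h_norm(f) > cutoff]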

  13. Generating sdhash fingerprints (4)
      Selected features are SHA-1 hashed and inserted into a sequence of Bloom filters: Artifact SD fingerprint = bf1 + bf2 + bf3 + … (sdbf), with each filter covering ~8-10KB of source data on average
      Bloom filter:
       o local SD fingerprint
       o 256 bytes
       o up to 128/160 features
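
     A minimal sketch of digest construction, assuming each selected feature is SHA-1 hashed and a few hash-derived indices set bits in a 256-byte (2048-bit) Bloom filter, with a new filter started every 128 features; the number of sub-hashes (5 x 11-bit indices) is an illustrative choice, not necessarily sdhash's exact parameters.

        import hashlib

        BF_BYTES = 256          # 2048-bit Bloom filter (local SD fingerprint)
        MAX_FEATURES = 128      # start a new filter after this many insertions
        NUM_INDICES = 5         # sub-hashes used per feature (illustrative)

        def insert(bf: bytearray, feature: bytes) -> None:
            """SHA-1 the feature and set NUM_INDICES bits derived from the digest."""
            digest = hashlib.sha1(feature).digest()
            for k in range(NUM_INDICES):
                # take 2 bytes per index, keep the low 11 bits (0..2047)
                idx = int.from_bytes(digest[2 * k:2 * k + 2], "big") & 0x7FF
                bf[idx // 8] |= 1 << (idx % 8)

        def build_sdbf(features):
            """Pack selected features into a sequence of Bloom filters (the sdbf)."""
            filters, bf, count = [], bytearray(BF_BYTES), 0
            for f in features:
                if count == MAX_FEATURES:
                    filters.append(bytes(bf))
                    bf, count = bytearray(BF_BYTES), 0
                insert(bf, f)
                count += 1
            filters.append(bytes(bf))
            return filters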

  14. Bloom filter (BF) comparison
      bf_A and bf_B are compared via bitwise AND -> BF Score in 0..100
      o Based on BF theory, overlap due to chance is analytically predictable.
      o Additional BF overlap is proportional to the overlap in features.
      o BF Score is tuned such that BF_Score(A_random, B_random) = 0.
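
     As a hedged illustration of the idea (not the exact sdhash scoring formula): count the bits shared after a bitwise AND, subtract the overlap expected by chance given each filter's bit count, and scale the remainder to 0..100.

        def popcount(bf: bytes) -> int:
            return sum(bin(b).count("1") for b in bf)

        def bf_score(bf_a: bytes, bf_b: bytes, bits: int = 2048) -> int:
            """Illustrative BF score: AND-overlap beyond chance, scaled to 0..100."""
            common = popcount(bytes(a & b for a, b in zip(bf_a, bf_b)))
            expected = popcount(bf_a) * popcount(bf_b) / bits   # chance overlap
            max_overlap = min(popcount(bf_a), popcount(bf_b))
            if max_overlap <= expected:
                return 0
            return max(0, round(100 * (common - expected) / (max_overlap - expected)))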

  15. SDBF fingerprint comparison
      Every filter bf_A^i of SD_A (i = 1..n) is scored against every filter bf_B^j of SD_B (j = 1..m):
       o max_i = max over j of BF_Score(bf_A^i, bf_B^j)
      SD_Score(A, B) = Average(max_1, max_2, …, max_n)
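
     A direct transcription of that formula, reusing the illustrative bf_score() sketched above and assuming A is the digest whose filters define the row-wise maxima, as on the slide:

        def sd_score(sd_a: list, sd_b: list) -> int:
            """SD_Score(A, B) = average over A's filters of the best BF score against B."""
            row_maxima = [max(bf_score(bf_a, bf_b) for bf_b in sd_b) for bf_a in sd_a]
            return round(sum(row_maxima) / len(row_maxima))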

  16. Scaling up: Block-aligned digests & parallelization

  17. Block-aligned similarity digests (sdbf-dd)
      The artifact is split into fixed 16KB blocks; each block's features are SHA-1 hashed into its own Bloom filter: Artifact SD fingerprint = bf1 + bf2 + bf3 + … (sdbf-dd)
      Bloom filter:
       o local SD fingerprint
       o 256 bytes
       o up to 192 features
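
     A minimal sketch of the block-aligned variant, reusing the illustrative helpers from the earlier sketches: each fixed 16KB block maps to exactly one Bloom filter, which is what makes the computation embarrassingly parallel. The process-pool parallelization shown here is an assumption about how one might exploit that, not the sdhash-dd implementation.

        from concurrent.futures import ProcessPoolExecutor

        BLOCK_SIZE = 16 * 1024   # fixed 16KB blocks
        DD_MAX_FEATURES = 192    # per-block feature budget for sdbf-dd

        def block_filter(block: bytes) -> bytes:
            """One Bloom filter per 16KB block (direct mapping to source data)."""
            bf = bytearray(BF_BYTES)
            for f in filter_weak(all_features(block))[:DD_MAX_FEATURES]:
                insert(bf, f)
            return bytes(bf)

        def build_sdbf_dd(data: bytes, workers: int = 8):
            blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
            with ProcessPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(block_filter, blocks))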

  18. Advantages & challenges for block-aligned similarity digests (sdbf-dd)
      Advantages
       o Parallelizable computation
       o Direct mapping to source data
       o Shorter (1.6% vs 2.6% of source) -> faster comparisons (fewer BFs)
      Challenges
       o Less reliable for smaller files
       o Sparse data
       o Compatibility with sdbf digests
      Solution
       o Increase features per BF for sdbf filters: 128 -> 160
       o Use 192 features per BF for sdbf-dd filters
       o Use compatible BF parameters to allow sdbf-to-sdbf-dd comparisons

  19. sdhash 1.6: sdbf vs. sdbf-dd accuracy

  20. Sequential throughput: sdhash 1.3
      Hash generation rate
       o Six-core Intel Xeon X5670 @ 2.93GHz: ~27MB/s per core
       o Quad-Core Intel Xeon @ 2.8GHz: ~20MB/s per core
      Hash comparison (single core)
       o 1MB vs. 1MB: 0.5ms
       o T5 corpus (4,457 files, all pairs): 10 mln file comparisons in ~15 min, i.e., ~667K file comparisons per minute

  21. sdhash 1.6: File-parallel generation rates on 27GB real data (in RAM)

  22. sdhash 1.6: Optimal file-parallel generation: 5GB synthetic target (RAM)

  23. sdhash-dd: Hash generation rates, 10GB in-RAM target

  24. Throughput summary: sdhash 1.6
      Parallel hash generation
       o sdbf: file-parallel execution -> 260 MB/s on a 12-core/24-thread machine
       o sdbf-dd: block-parallel execution -> 370 MB/s (plain SHA-1: 330MB/s)
      Optimized hash comparison rates
       o 24 threads: 86.6 mln BF comparisons/s -> 1.4 TB/s for small-file comparison (<16KB), i.e., we can search for a small file in a reference set of 1.4TB in 1s
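
     A back-of-the-envelope check of the 1.4 TB/s figure, under the assumption that each reference Bloom filter covers one 16KB block (sdbf-dd) and that a sub-16KB query digest fits in a single filter, so every BF comparison effectively checks 16KB of reference data:

        bf_rate = 86.6e6               # BF comparisons per second (24 threads)
        block = 16 * 1024              # reference bytes covered by one sdbf-dd filter
        print(bf_rate * block / 1e12)  # ~1.42 TB of reference data scanned per second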

  25. The Envisioned Architecture

  26. The Current State
      libsdbf (core library)
       o API: C/C++, C#, Python
       o CLI (sdhash): files - sdhash-file, network - sdhash-pcap, disk - sdhash-dd
       o Service: sdbf_d; cluster - sdbfCluster; clients - sdbfWeb, sdbfViz

  27. Todo List (1)
      libsdbf
       o C++ rewrite (v2.0)
       o TBB parallelization
      sdhash-file
       o More command line options / compatibility w/ ssdeep
       o Service-based processing (w/ sdbf_d)
      GPU acceleration
      sdhash-pcap
       o Pcap-aware processing: payload extraction, file discovery, timelining

  28. Todo List (2)
      sdbf_d
       o Persistence: XML
       o Service interface: JSON
       o Server clustering
      sdbfWeb
       o Browser-based management/query
      sdbfViz
       o Large-scale visualization & clustering

  29. Further Development
      Integration w/ RDS
       o sdhash-set: construct SDBFs from existing SHA-1 sets
       o Compare/identify whole folders, distributions, etc.
      Structural feature selection
       o E.g., exe/dll, pdf, zip, …
      Optimizations
       o Sampling
       o Skipping (under a minimum continuous block assumption)
       o Cluster "core" extraction/comparison
      Representation
       o Multi-resolution digests
       o New crypto hashes
       o Data offsets

  30. Thank you!
      http://roussev.net/sdhash
       o wget http://roussev.net/sdhash/sdhash-1.6.zip
       o make
       o ./sdhash
      Contact: Vassil Roussev, vassil@roussev.net
      Reminder
       o DFRWS'12: Washington DC, Aug 6-8
       o Paper deadline: Feb 20, 2012
       o Data sniffing challenge to be released shortly
