Docovery: Toward Generic Automatic Document Recovery (Tomasz Kuchta)

Docovery: Toward Generic Automatic Document Recovery. Tomasz Kuchta, Miguel Castro, Cristian Cadar, Manuel Costa. ASE’14, 18th September 2014. This work is supported by Microsoft Research through its PhD Scholarship Programme.


  1. Docovery: Toward Generic Automatic Document Recovery. Tomasz Kuchta, Miguel Castro, Cristian Cadar, Manuel Costa. ASE’14, 18th September 2014. This work is supported by Microsoft Research through its PhD Scholarship Programme. Microsoft is a registered trademark of Microsoft Corporation.

  2. The user is unable to open a document.

  3. Document is corrupt: storage failure, network transfer failure, power outage.

  4. Application has bugs: buffer overflows, divisions by zero, assertion failures, exceptions, incompatibility across versions / applications.

  5. Such problems are highly user-visible. They account for a large number of security vulnerabilities.

  6. The root cause of the problem: the application is unable to handle corrupt or uncommon documents. Example: pine – a text-mode e-mail client. A special “From:” field crashes the program: From: "\"\"\"\"\"\"\"\"\"...\"\"\"\"\"\"\""@host.fubar

  7. What can we do about that? Try to fix the program: automatic patch generation [GenProg, WCCI’08, ICSE’09; SemFix, ICSE’13; etc.]. Try to protect the program: automatic input filter generation [Vigilante, SOSP’05; Shieldgen, S&P’07; etc.].

  8. What can we do about that? Try to fix the document: use format specification [DS repair, OOPSLA’03]; learn and apply the correct values [SOAP, ICSE’12]; truncate the document; try to guess the right value. Or …

  9. Is it possible to fix a broken document, without assuming any input format, in a way that preserves the original contents as much as possible?

  13. Step 1: Identify potentially corrupt bytes (byte #4: 'x', byte #8: 'y'). Technique: taint tracking.

  14. Step 1: Identify potentially corrupt bytes (byte #4: 'x', byte #8: 'y'). Technique: taint tracking. Step 2: Change the bytes to execute another path (byte #4: 'z', byte #8: 'y'). Technique: symbolic execution.

  15. Step 1: Identify potentially corrupt bytes (byte #4: 'x', byte #8: 'y'). Technique: taint tracking. Step 2: Change the bytes to execute another path (byte #4: 'z', byte #8: 'y'). Technique: symbolic execution. Step 3: Pick the best candidate. Technique: Levenshtein distance and manual inspection.

  16. Docovery process: broken document execution, then alternative paths exploration.

  17. Broken document execution: taint tracking. Track the flow of data from a source (input) to a sink (point of crash), identifying potentially corrupt bytes (byte #4: 'x', byte #8: 'y'). Byte-level precision; no control flow dependencies; no address tainting.
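
The taint-tracking idea on the slide above can be illustrated with a minimal sketch. It is not Docovery's implementation (which works on LLVM bitcode); the `Tainted` class, the toy parser and its divisor formula are all hypothetical, chosen so that byte #4 = 'x' and byte #8 = 'y' reach a division-by-zero sink. Only data flow is propagated: branch conditions and addresses carry no taint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """An integer value plus the set of input byte offsets it was computed from."""
    value: int
    taint: frozenset = frozenset()

    def __add__(self, other: "Tainted") -> "Tainted":
        return Tainted(self.value + other.value, self.taint | other.taint)

    def __sub__(self, other: "Tainted") -> "Tainted":
        return Tainted(self.value - other.value, self.taint | other.taint)


def read_input(data: bytes) -> list:
    """Source: every input byte starts out tainted with its own offset."""
    return [Tainted(byte, frozenset({offset})) for offset, byte in enumerate(data)]


def potentially_corrupt_bytes(data: bytes) -> frozenset:
    """Hypothetical toy parser whose sink is a division by (byte4 + 1 - byte8)."""
    cells = read_input(data)
    divisor = cells[4] + Tainted(1) - cells[8]  # constants carry no taint
    if divisor.value == 0:
        # Sink reached: the division would crash, so report the offsets
        # whose data flowed into the divisor.
        return divisor.taint
    return frozenset()


print(sorted(potentially_corrupt_bytes(b"....x...y")))  # prints [4, 8]
```

The reported offsets are the bytes that a later step would mark as symbolic.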

  18. Broken document execution: collecting alternative paths. Mark the potentially corrupt bytes as symbolic; lazily verify feasibility.

  19. Alternative paths exploration: path selection. The last N deepest paths are collected. Start from the paths closest to the crash point.

  20. Alternative paths exploration: negate the K-th constraint and drop the remaining ones; ask the constraint solver for a satisfying assignment. Path P3: C1 ∧ C2 ∧ ¬C3. Path P2: C1 ∧ ¬C2.
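
The constraint negation described above can be sketched as follows. This is an illustration only: the constraints are hypothetical Python predicates over byte #4, and `brute_force_solve` is a stand-in that enumerates byte values instead of calling a real constraint solver.

```python
from itertools import product


def negate(constraint):
    """Return a predicate that holds exactly when `constraint` does not."""
    return lambda env: not constraint(env)


def alternative_paths(path):
    """Yield (K, constraints): C1..C(K-1) kept, CK negated, the rest dropped.

    The deepest K comes first, i.e. the alternative closest to the crash point.
    """
    for k in range(len(path), 0, -1):
        yield k, list(path[:k - 1]) + [negate(path[k - 1])]


def brute_force_solve(constraints, symbolic_offsets):
    """Stand-in for a constraint solver: try every value of the symbolic bytes."""
    for values in product(range(256), repeat=len(symbolic_offsets)):
        env = dict(zip(symbolic_offsets, values))
        if all(check(env) for check in constraints):
            return env
    return None  # no assignment exists, so the alternative path is infeasible


# Hypothetical path constraints collected while running the broken document.
path = [
    lambda env: env[4] > 0x20,   # C1
    lambda env: env[4] != 0x7A,  # C2
    lambda env: env[4] == 0x78,  # C3: byte #4 is 'x', leading to the crash
]

for k, constraints in alternative_paths(path):
    print(k, brute_force_solve(constraints, [4]))
```

Each satisfying assignment gives new values for the symbolic bytes, which are written back into a copy of the document to form a candidate.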

  21. Alternative paths exploration: candidate execution. Store the candidate; re-run the program natively; successful if not crashing.
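
A minimal sketch of the candidate execution step, under stated assumptions: the helper name `runs_without_crash` is hypothetical, the program is assumed to take the document path as its last argument, and "crashing" is taken to mean termination by a signal (a negative return code from `subprocess` on POSIX) or a timeout. A non-zero exit code alone is not treated as a crash here.

```python
import os
import subprocess
import sys
import tempfile


def runs_without_crash(command, candidate, timeout=10.0):
    """Store the candidate in a temporary file and re-run `command` on it natively.

    Returns True when the process exits on its own (with any exit code) and
    False when it is killed by a signal or exceeds the timeout.
    """
    fd, path = tempfile.mkstemp(suffix=".candidate")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(candidate)
        try:
            result = subprocess.run(
                list(command) + [path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        # On POSIX a negative return code means "terminated by signal".
        return result.returncode >= 0
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Stand-in "application": a Python one-liner that just reads the file.
    reader = [sys.executable, "-c", "import sys; open(sys.argv[1], 'rb').read()"]
    print(runs_without_crash(reader, b"Lorem ipsum"))
```

Candidates that pass this check still need the evaluation step that follows, since a run can avoid crashing and yet report an error.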

  22. Evaluating candidate documents. Levenshtein distance (edit distance): byte-level similarity metric; independent of document format; smaller distance = higher similarity. Semi-automatic evaluation of program output: looking for warnings / errors, exit code; similarity to the correct output.
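
The byte-level edit distance mentioned above can be sketched with the textbook dynamic-programming algorithm. The function names and the sample byte strings are illustrative assumptions, not taken from the Docovery code.

```python
def levenshtein(a: bytes, b: bytes) -> int:
    """Minimum number of byte insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_candidates(original: bytes, candidates):
    """Order candidate documents from most to least similar to the original."""
    return sorted(candidates, key=lambda candidate: levenshtein(original, candidate))


original = b"Lorem ipsum\x08\x08\x09"
candidates = [b"Lorem ipsum", b"Lorem ipsum\x08\x08\x0a"]
print([levenshtein(original, c) for c in candidates])  # prints [3, 1]
```

This version takes time proportional to the product of the two lengths, so for megabyte-sized documents a practical tool would likely restrict the comparison to the changed regions.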

  23. Implementation: built on top of KLEE [OSDI’08]; using ZESTI functionality [ICSE’12]; interprets LLVM bitcode of C applications.

  24. Benchmarks: pr – a pagination utility; pine – a text-mode e-mail client; dwarfdump – a debug information display tool; readelf – an ELF file information display tool.

      | Benchmark | Document type | Document sizes | Max number of changed bytes |
      |---|---|---|---|
      | pr | Plain text | up to 256 pages / 1080 KB | 1 |
      | pine | MBOX mailbox | up to 320 e-mails / 2.3 MB | 24 |
      | dwarfdump | DWARF executables | up to 1.1 MB | 1 |
      | readelf | ELF object files | up to 1.5 MB | 8 |

  25. Bugs: known, real-world bugs injected manually. pr, pine, readelf – buffer overflow; dwarfdump – division by zero.

      | Benchmark | ‘Buggy’ sequence |
      |---|---|
      | pr | Lorem ipsum... 0x08 0x08...0x09 EOF |
      | pine | ...From: " \"\"\"\"\"\"\"\...\"\"\"\" "@host.fubar... |
      | dwarfdump | ...GCC: (Ubuntu/Linaro 4.6.3... 0x00 0x00... |
      | readelf | ... 0xFD 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF ... |

  26. Taint tracking results (regardless of document size).

      | Benchmark | Document | Number of potentially corrupt bytes |
      |---|---|---|
      | pr | 1 – 256 pages / 4.4 – 1080 KB | 1 |
      | pine | 5 – 320 e-mails / 13 KB – 2.3 MB | 25 |
      | dwarfdump | 62 KB – 1.1 MB | 2 |
      | readelf | 54 KB – 1.5 MB | 16 |

      pr: Lorem ipsum...08 08...09 EOF
      pine: "\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"..."@host
      dwarfdump: ...GCC: (Ubuntu/Linaro 4.6.3...00 00...
      readelf: ...40 01 00 00 00 00 00 00...FD FF FF FF FF FF FF FF...

  27. Candidates for pr. All the candidates print out correctly.

      | Document | ‘Buggy’ sequence |
      |---|---|
      | Original | Lorem ipsum...0x08 0x08... 0x09 EOF |
      | Candidate A | Lorem ipsum...0x08 0x08... 0x00 EOF |
      | Candidate B | Lorem ipsum...0x08 0x08... 0x0C EOF |
      | Candidate C | Lorem ipsum...0x08 0x08... 0x0A EOF |

  28. Candidates for pine.

      | Document | ‘Buggy’ sequence |
      |---|---|
      | Original | From: " \"\"\"\"................\" "@host.fubar |
      | Candidate A | From: "\"\ ... \ 0x0E... \ 0x0E \"...\""@host.fubar |
      | Candidate B | From: "\"\...\ \ \ 0x0E.. \ 0x0E \"..\""@host.fubar |
      | Candidate C | From: "\"\. .. \ 0x00 \" ........... \""@host.fubar |

  29. Candidates for dwarfdump. Candidate A: debug dump, success return code. Candidate B: error.

      | Document | ‘Buggy’ sequence |
      |---|---|
      | Original | ...GCC: (Ubuntu/Linaro 4.6.3... 0x00 0x00 ... |
      | Candidate A | ...GCC: (Ubuntu/Linaro 4.6.3... 0x01 0x00... |
      | Candidate B | ...GCC: (Ubuntu/Linaro 4.6.3...0x00 0x01 ... |

  30. Candidates for readelf. Candidate A: most of the output, but with a warning. Candidate B: almost no output and an error. Candidate C: almost no output (no debug data).

      | Document | ‘Buggy’ sequence |
      |---|---|
      | Original | … 40 01 00 00 00 00 00 00 … FD FF FF FF FF FF FF FF … |
      | Candidate A | … 40 01 00 00 00 00 00 00 … F0 01 00 00 00 00 00 80 … |
      | Candidate B | … FE FF FF FF FF FF FF FF … FD FF FF FF FF FF FF FF … |
      | Candidate C | … 00 00 00 00 00 00 00 00 … FD FF FF FF FF FF FF FF … |

  31. Performance varies across applications. Sometimes, the recovery is cheap. [Charts: "dwarfdump – total recovery time", time 0–40 s against document sizes 62–1089 KB; "readelf – total recovery time", time 0–50 s against document sizes 54–1615 KB.]

  32. Performance varies across applications. Sometimes, the recovery is expensive. [Charts: "pr – total recovery time", time 0–2000 s against document sizes 4.4–1080 KB; "pine – total recovery time", time 0–2000 s against 5–320 e-mails.]

  33. Performance depends on the executed path. [Charts: "dwarfdump – total recovery time (-r)" and "dwarfdump – total recovery time", both against document sizes 62–1089 KB; one time axis spans 0–40 s, the other 0–4000 s.]
