InfoTracker: Pedigree Tracking in the Face of Ancillary Content - PowerPoint PPT Presentation


  1. InfoTracker: Pedigree Tracking in the Face of Ancillary Content. Eugene Creswick, Terrance Goan and Emi Fujioka. Stottler Henke Associates Inc., 1107 NE 45th St., Suite 310, Seattle, WA 98105. 206-675-1169, FAX: 206-545-7227. rcreswick@stottlerhenke.com, http://www.stottlerhenke.com

  2. Track Document Pedigree

  4. Applications > Plagiarism; Information Flow; Security Policies

  5. The Challenge

  6. The Challenge > Common content confuses comparisons

  9. The Challenge > Related Work: Suffix Tree Document Models; Fuzzy Fingerprints; Hoad & Zobel's Fingerprints

  10. Solution

  11. Solution > Ignore the ancillary content

  12. Solution > How?

  13. Solution > How? Use Contrasting Corpora: Open Content vs. Sensitive Content. Examples shown on the slide: Intellectual Property, Resumes, Published work, Footers, Introductions, Secrets, Web Content, Homework, Headers, Assignments

  14. Algorithm

  15. Algorithm > Index Both Corpora with one Suffix Tree
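The idea of indexing both corpora in one structure can be sketched as follows. This is only an illustration, not the InfoTracker implementation: it uses a naive word-level suffix trie rather than a true suffix tree, and the names `SuffixTrie`, `add`, `lookup`, `max_depth` and the sample documents are all assumptions made for the example. Each node records which (corpus, document) pairs contain the phrase leading to it, so a single lookup tells whether a phrase occurs in open content, sensitive content, or both.

```python
class SuffixTrie:
    """Naive word-level suffix trie shared by both corpora (illustrative only)."""

    def __init__(self, max_depth=8):
        # Each node holds child nodes and the (corpus, doc_id) pairs seen on this path.
        self.root = {"children": {}, "docs": set()}
        self.max_depth = max_depth  # hypothetical cap on indexed phrase length

    def add(self, corpus, doc_id, text):
        """Insert every suffix of the document, truncated to max_depth words."""
        words = text.lower().split()
        for start in range(len(words)):
            node = self.root
            for word in words[start:start + self.max_depth]:
                node = node["children"].setdefault(
                    word, {"children": {}, "docs": set()}
                )
                node["docs"].add((corpus, doc_id))

    def lookup(self, phrase):
        """Return the set of (corpus, doc_id) pairs containing the phrase."""
        node = self.root
        for word in phrase.lower().split():
            node = node["children"].get(word)
            if node is None:
                return set()
        return set(node["docs"])


# Made-up documents, one per corpus.
index = SuffixTrie()
index.add("open", "news-1", "Cheap hotel rooms near the airport")
index.add("sensitive", "report-7", "They used hotel rooms as their hideout")

in_both = index.lookup("hotel rooms")               # found in both corpora
sensitive_only = index.lookup("as their hideout")   # found only in the sensitive corpus
```

A production system would more likely use a compact suffix tree or suffix array; the trie is used here only because it fits in a few lines.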

  16. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”

  17. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”

  18. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”

  19. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”; Sensitive: “as their hideout”

  20. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”; Sensitive: “as their hideout”; Open: “their hideout”

  22. Algorithm > Filter the resulting string overlaps (diagram: aligned character strings for Query Doc., Sens. Overlap, Open Overlap and Resulting Overlap(s); some resulting overlaps are marked Too Short)
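The filtering step can be read as interval subtraction: take the query spans that match sensitive content, cut out whatever also matches open content, and discard leftovers that are too short. The sketch below follows that reading. The function name `subtract_spans`, the half-open character offsets, and the `min_len` threshold are assumptions for illustration; the slides do not give the actual threshold or data layout.

```python
def subtract_spans(sensitive_spans, open_spans, min_len):
    """Remove open-corpus spans from sensitive-corpus spans.

    Spans are (start, end) half-open character offsets into the query.
    Leftover pieces shorter than min_len are dropped ("Too Short").
    """
    result = []
    for s_start, s_end in sensitive_spans:
        pieces = [(s_start, s_end)]
        for o_start, o_end in open_spans:
            next_pieces = []
            for p_start, p_end in pieces:
                # No intersection: keep the piece unchanged.
                if o_end <= p_start or o_start >= p_end:
                    next_pieces.append((p_start, p_end))
                    continue
                # Keep whatever sticks out on the left and right of the open span.
                if p_start < o_start:
                    next_pieces.append((p_start, o_start))
                if o_end < p_end:
                    next_pieces.append((o_end, p_end))
            pieces = next_pieces
        result.extend(p for p in pieces if p[1] - p[0] >= min_len)
    return result


# Made-up offsets: the first sensitive span survives in two pieces,
# the second is reduced to fragments shorter than min_len and is dropped.
remaining = subtract_spans(
    sensitive_spans=[(0, 40), (50, 60)],
    open_spans=[(10, 15), (52, 58)],
    min_len=5,
)
```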

  23. Algorithm > Ranking

  24. Algorithm > Ranking > Overlap-based Ranking

  25. Algorithm > Ranking > Overlap-based Ranking (diagram labels: Q, A, B, C)

  26. Algorithm > Ranking > Overlap Frequency for Ranking. A: “the Indonesian island of Sumatra.” B: “Northwest coast of the” C: “the Indonesian island of Sumatra.” Unique text: lower frequency, greater impact. Common text: higher frequency, less impact.
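One way to make "lower frequency, greater impact" concrete is to divide each overlap's length by the number of candidate documents that share it. The slides do not state the actual scoring formula, so the weighting below (`len(s) / doc_freq[s]`) and the function name `rank_by_overlap_frequency` are assumptions chosen only to illustrate the principle, using the three strings from the slide.

```python
from collections import Counter


def rank_by_overlap_frequency(overlaps):
    """Rank documents by overlap length, discounted by how common each overlap is.

    overlaps: dict mapping doc_id -> list of strings that document shares with the query.
    """
    # Count in how many documents each overlapping string appears.
    doc_freq = Counter()
    for strings in overlaps.values():
        doc_freq.update(set(strings))

    scores = {
        doc_id: sum(len(s) / doc_freq[s] for s in strings)
        for doc_id, strings in overlaps.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_overlap_frequency({
    "A": ["the Indonesian island of Sumatra."],
    "B": ["Northwest coast of the"],
    "C": ["the Indonesian island of Sumatra."],
})
```

Under this weighting, B's shorter but unique overlap outranks the longer string that A and C have in common.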

  27. Evaluation

  28. Evaluation > InfoTracker was compared to Vector Space Cosine Similarity: TF-IDF weighted vectors; no stop words
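For readers unfamiliar with the baseline, a minimal vector space comparison with TF-IDF weights might look like the sketch below. It is not the evaluation code from the talk: the whitespace tokenizer, the `log(N / df)` form of IDF, and the reading of "No stop words" as "stop words are removed" are all assumptions, as are the function names and the sample documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs, stop_words=frozenset()):
    """Build sparse TF-IDF vectors (dicts) for a list of document strings."""
    tokenized = [
        [t for t in doc.lower().split() if t not in stop_words] for doc in docs
    ]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vecs = tfidf_vectors(
    ["hotel rooms downtown", "hotel rooms as their hideout", "island of sumatra"],
    stop_words=frozenset({"as", "of", "their"}),
)
sim_related = cosine(vecs[0], vecs[1])    # shares "hotel" and "rooms"
sim_unrelated = cosine(vecs[0], vecs[2])  # shares no terms
```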

  29. Evaluation > Data Set: Open Content vs. Sensitive Content. Labels shown on the slide: Intellectual Property, Resumes, Web Content (on-line news, blogs, etc...), Footers, Related Work, Published work, Headers

  30. Evaluation > Data Set: 272 SBIR proposals (234 historical proposals, 38 query proposals)

  31. Evaluation > Oracle (image from: http://www.marketoracle.co.uk)

  32. Evaluation > Results

  33. Evaluation > Results > InfoTracker improved precision / recall. Vector Space: precision 0.119, recall 0.764. InfoTracker: precision 0.167, recall 0.913.

  34. Contributions / Future Work

  35. Contributions / Future Work > Ancillary content can be managed: contrasting corpora; manual/actively learned tags; detecting document sections

  36. Contributions / Future Work > (Re)evaluate on open data: compare with differing corpora; The Linux Doc. Project

  37. Contributions / Future Work > Algorithmic Improvements: active learning; document time stamps; overlap size / encapsulation

  38. Questions?

  39. Evaluation > Calculating Precision / Recall

  40. Evaluation > Calculating Precision / Recall: consider the top 23 results (to allow for perfect recall).
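Precision and recall over a fixed number of top results can be computed as in the sketch below. The cutoff of 23 comes from the slide; presumably no query has more than 23 relevant documents, though that reading is an assumption. The function name and the sample identifiers are made up for the example.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k entries of a ranked result list."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Made-up ranking: one of the three relevant documents appears in the top 3.
precision, recall = precision_recall_at_k(
    ranked_ids=["d3", "d1", "d7", "d2"],
    relevant_ids={"d1", "d2", "d9"},
    k=3,
)
```

With a fixed cutoff, precision is bounded by the number of relevant documents divided by k, which may be part of why the reported precision values are low in absolute terms.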

  41. Trimming Results > Ranking Scores Plummet Quickly

  43. Trimming Results > Trimming improves precision, retains recall
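The slides do not say how the trimming cutoff was chosen. Since ranking scores are described as dropping off quickly, one plausible approach is to keep only results whose score is at least some fraction of the top score. The sketch below shows that idea; the function name `trim_by_relative_score` and the `min_fraction` parameter are hypothetical and not taken from the talk.

```python
def trim_by_relative_score(ranked, min_fraction=0.1):
    """Keep results scoring at least min_fraction of the best score.

    ranked: list of (doc_id, score) pairs sorted by descending score.
    """
    if not ranked:
        return []
    top_score = ranked[0][1]
    if top_score <= 0:
        return []
    return [(doc_id, score) for doc_id, score in ranked if score >= min_fraction * top_score]


# Made-up scores that fall off sharply after the second result.
trimmed = trim_by_relative_score(
    [("a", 100.0), ("b", 40.0), ("c", 3.0), ("d", 1.0)],
    min_fraction=0.1,
)
```

Cutting the long tail of low-scoring results removes mostly irrelevant documents, which is consistent with the claim that trimming improves precision while retaining recall.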
