
Secure Indexing/Search for Regulatory-Compliant Record Retention



  1. Secure Indexing/Search for Regulatory-Compliant Record Retention

  2. There is a need for trustworthy record keeping. Records: email, instant messaging, files. Drivers: digital information explosion, corporate misconduct, soaring discovery costs, focus on compliance (HIPAA). Figures: IDC forecasts 60B business emails annually; spending on eDiscovery is growing at 65% CAGR; the average F500 company has 125 non-frivolous lawsuits at any given time. Sources: IDC, Network World (2003), Socha / Gelbmann (2004). Q. Zhu, W. W. Hsu: Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. SIGMOD'2006, 395-406, 2006

  3. What is trustworthy record keeping? Establish solid proof of events that have occurred. Timeline: Alice commits a record to the storage device; an adversary later regrets the record; Bob issues a query. Bob should get back Alice's data.

  4. This leads to a unique threat model. Commit is trustworthy: the record is created properly. Query is trustworthy: the record is queried properly. In between, the adversary has super-user privileges: access to the storage device and access to any keys. The adversary could be Alice herself.

  5. Traditional schemes do not work. We cannot rely on Alice's signature, because the adversary has access to any keys and could be Alice herself.

  6. WORM storage helps address the problem. Write Once Read Many (WORM): new records can be written, but overwrite and delete are rejected, so the adversary cannot delete Alice's record.

  7. WORM storage helps address the problem. Write Once Read Many (WORM): new records can be written, but overwrite and delete are rejected, so the adversary cannot delete Alice's record. Built on top of conventional rewritable magnetic disk, with write-once semantics enforced through software, with file modification and premature deletion operations disallowed.
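
The write-once semantics described on this slide can be illustrated with a small in-memory sketch. The class name `WormStore` and its methods are hypothetical and not taken from the presentation; a real WORM product enforces these rules in the storage layer, not in application code.

```python
class WormStore:
    """Toy write-once store: records can be created and read, never changed."""

    def __init__(self):
        self._records = {}

    def write(self, record_id, data):
        # File modification is disallowed: a committed id cannot be rewritten.
        if record_id in self._records:
            raise PermissionError(f"record {record_id!r} is already committed")
        self._records[record_id] = bytes(data)

    def read(self, record_id):
        return self._records[record_id]

    def delete(self, record_id):
        # Premature deletion is disallowed (retention expiry is not modeled here).
        raise PermissionError("premature deletion is disallowed")


store = WormStore()
store.write("alice-1", b"original record")

try:
    store.write("alice-1", b"tampered record")
    overwrite_blocked = False
except PermissionError:
    overwrite_blocked = True

try:
    store.delete("alice-1")
    delete_blocked = False
except PermissionError:
    delete_blocked = True

print(overwrite_blocked, delete_blocked, store.read("alice-1"))
```

Both the overwrite and the delete attempt are rejected, and the original record remains readable.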

  8. Index required due to high volume of records. Timeline: Alice commits a record and the index is updated; the adversary later regrets the record; Bob queries from the index.

  9. In effect, records can be hidden/altered by modifying the index: hide record B from the index, or replace B with B'. The index must also be secured (fossilized).

  10. A B-tree for an increasing sequence can be created on WORM. (Figure: example tree over the keys 2, 4, 7, 11, 13, 19, 23, 29, 31.)

  11. B+tree index is insecure, even on WORM. The path to an element depends on elements inserted later, so the adversary can attack it. (Figure: the example tree with the additional keys 25, 26, 27 and 30.)

  12. Is this a real threat? Would someone want to delete a record a day after it is created? Intrusion detection logging: once the adversary gains control, he would like to delete records of his initial attack. Record regretted moments after creation. Email best practice: must be committed before it is delivered.

  13. Several levels of indexing. Keywords map to posting lists of document ids: Query: 1, 3, 11, 17; Data: 3, 9; Base: 3, 19; Worm: 7, 36; Index: 3. To find documents containing the keywords "Query" and "Data" and "Base": retrieve the lists for Query, Data and Base, and intersect the document ids in the lists.
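
The intersection step can be sketched as follows, using the posting lists that appear on the slides. The function names `intersect_sorted` and `search` are illustrative, and the sketch assumes each posting list is kept sorted by document id.

```python
posting_lists = {
    "query": [1, 3, 11, 17],
    "data": [3, 9],
    "base": [3, 19],
    "worm": [7, 36],
    "index": [3],
}


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(keywords, lists):
    """Return ids of documents that contain every keyword."""
    if not keywords:
        return []
    # Start from the shortest list so intermediate results stay small.
    ordered = sorted((lists.get(k, []) for k in keywords), key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_sorted(result, other)
    return result


print(search(["query", "data", "base"], posting_lists))  # [3]
```

Document 3 is the only id present in all three lists, so it is the only match.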

  14. GHT: A Generalized Hash Tree Fossilized Index. The tree grows from the root down to the leaves without relocating committed entries. "Balanced" without requiring dynamic adjustments to its structure. For hash-based schemes, it is a dynamic hashing scheme that does not require rehashing.

  15. GHT. Defined by {M, K, H}. M = {m_0, m_1, …}, where m_i is the size of a tree node (number of buckets) at level i. K = {k_0, k_1, …}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i. H = {h_0, h_1, …}, where h_i is a hash function for level i. Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2. Different H values lead to different GHT variants.
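
To make the roles of M and K concrete, the sketch below computes how many nodes and buckets each level holds for the example parameters. It assumes a single root node at level 0, which is how the slide's figure appears to be drawn; the function names are illustrative.

```python
def nodes_per_level(K, levels):
    """Node counts for levels 0..levels-1: level i+1 has K[i] times as many nodes as level i."""
    counts = [1]  # assumption: a single root node at level 0
    for i in range(levels - 1):
        counts.append(counts[-1] * K[i])
    return counts


def buckets_per_level(M, K, levels):
    """Total buckets per level: (number of nodes) * (buckets per node)."""
    return [n * M[i] for i, n in enumerate(nodes_per_level(K, levels))]


M = [4, 4, 4, 4]  # m_0 = m_1 = ... = 4
K = [2, 2, 2, 2]  # k_0 = k_1 = ... = 2

node_counts = nodes_per_level(K, 4)
bucket_counts = buckets_per_level(M, K, 4)
print(node_counts)    # [1, 2, 4, 8]
print(bucket_counts)  # [4, 8, 16, 32]
```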

  16. Standard (Default) GHT: Thin Tree. Same {M, K, H} definition as on the previous slide, with m_0 = m_1 = … = 4 and k_0 = k_1 = … = 2. (Figure: tree with levels labeled h_0, h_1, h_2.)

  17. Standard (Default) GHT: Thin Tree. Same {M, K, H} definition, with m_0 = m_1 = … = 4 and k_0 = k_1 = … = 2. h_0 = x mod 4, h_1 = x mod 8. What about h_2? x mod 16?

  18. Standard (Default) GHT: Thin Tree. Same {M, K, H} definition, with m_0 = m_1 = … = 4 and k_0 = k_1 = … = 2. h_0 = x mod 4, h_1 = x mod 8, h_2 = h_3 = … = x mod 8.

  19. GHT Variant (Fat Tree). With m_0 = m_1 = … = 4 and k_0 = k_1 = … = 2: h_0 = x mod 4, h_1 = x mod 8, h_2 = x mod 16, and in general h_i = x mod (4 * 2^i). Can tolerate non-ideal hash functions better because there are many more potential target buckets at each level. Hashing at different levels is independent. Can allocate different levels to different disks and access them in parallel. Expensive to maintain children pointers in each node: the number of pointers grows exponentially.
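
The difference between the two variants shows up in the range of each level's hash function. The sketch below tabulates those ranges for m = 4 and k = 2 as given on the slides. The helper `fat_location`, which splits a fat-tree hash value into a node index and an entry within that node, is an assumed reading of how a level-wide hash maps onto nodes, not something the slides spell out.

```python
def thin_range(level, m=4, k=2):
    """Thin tree: h_0 = x mod m, and h_i = x mod (k*m) for every level i >= 1."""
    return m if level == 0 else k * m


def fat_range(level, m=4, k=2):
    """Fat tree: h_i = x mod (m * k^i), i.e. every bucket of the level is a target."""
    return m * k ** level


def fat_location(level, x, m=4, k=2):
    """Assumed mapping of a fat-tree hash value to (node index in level, entry in node)."""
    return divmod(x % fat_range(level, m, k), m)


thin_ranges = [thin_range(i) for i in range(4)]
fat_ranges = [fat_range(i) for i in range(4)]
print(thin_ranges)          # [4, 8, 8, 8]
print(fat_ranges)           # [4, 8, 16, 32]
print(fat_location(2, 29))  # (3, 1), since 29 mod 16 = 13
```

The thin tree only ever chooses among the buckets of one node's children, while the fat tree's choice widens with depth, which is why it needs pointers from each node to many more potential children.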

  20. GHT (Standard) Insertion. Bucket address = (Level, Child (left or right), Entry within bucket). Example buckets: (0, 0, 1), (1, 1, 2), (2, 0, 1).

  21. GHT Insertion. Insert a key whose hash values at the various levels are h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3. Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are already occupied (collision).

  22. GHT Insertion. Insert a key whose hash values at the various levels are h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3. Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are already occupied (collision), so the key is committed to bucket (3, 0, 3). If the hash functions are uniform, the tree grows top-down in a balanced fashion.
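
A minimal sketch of standard (thin-tree) insertion, assuming m = 4 buckets per node and k = 2 children per node. The interpretation that a level-i hash value h selects child `h // 4` of the current node and entry `h % 4` within it is inferred from the slide's example (h1 = 6 gives address (1, 1, 2)); `Node`, `insert` and `slide_hashes` are illustrative names, and `slide_hashes` simply replays the hash values printed on the slide.

```python
M, K = 4, 2  # buckets per node, growth factor (values from the slides)


class Node:
    def __init__(self):
        self.buckets = [None] * M   # each bucket is written at most once
        self.children = [None] * K  # children are allocated lazily


def insert(root, key, hashes):
    """Commit key to the first free bucket on its path; return (level, child, entry)."""
    node, level, child = root, 0, 0
    while True:
        h = hashes(level, key)
        entry = h
        if level > 0:
            # Below the root, the hash picks one bucket among the K children.
            child, entry = divmod(h, M)
            if node.children[child] is None:
                node.children[child] = Node()
            node = node.children[child]
        if node.buckets[entry] is None:
            node.buckets[entry] = key  # committed entries are never relocated
            return (level, child, entry)
        level += 1  # collision: try the next level down


def slide_hashes(level, key):
    """Stub that returns the slide's values: h0 = 1, h1 = 6, h2 = 1, h3 = 3."""
    return [1, 6, 1, 3][level]


root = Node()
addresses = [insert(root, k, slide_hashes) for k in ("a", "b", "c", "d")]
print(addresses)  # [(0, 0, 1), (1, 1, 2), (2, 0, 1), (3, 0, 3)]
```

Because every key in this toy run hashes to the same values, the first three inserts fill the occupied buckets from the slide and the fourth lands at (3, 0, 3). A real deployment would use a different, well-mixed hash function per level.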

  23. GHT Search. Search for a key whose hash values at the various levels are h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3. Similar to insertion: probe buckets (0, 0, 1), (1, 1, 2), (2, 0, 1) and (3, 0, 3) in turn. Need to deal with duplicate key values. Only for point queries; cannot support range search.
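
A point lookup can be sketched by walking the same probe path that insertion uses and collecting every matching bucket, which is one simple way to handle duplicate keys. The sketch assumes m = 4 and k = 2, and uses a salted SHA-256 as a stand-in per-level hash family; that hash choice and all names here are assumptions, not the presentation's design. A small `insert` helper is included so the example runs on its own.

```python
import hashlib

M, K = 4, 2  # buckets per node, growth factor (values from the slides)


def level_hash(level, key):
    """Assumed hash family: range M at level 0 and K*M at deeper levels."""
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (M if level == 0 else K * M)


class Node:
    def __init__(self):
        self.buckets = [None] * M
        self.children = [None] * K


def probe_path(root, key, create):
    """Yield (node, (level, child, entry)) for each bucket on the key's path."""
    node, level, child = root, 0, 0
    while True:
        h = level_hash(level, key)
        entry = h
        if level > 0:
            child, entry = divmod(h, M)
            if node.children[child] is None:
                if not create:
                    return  # path ends: nothing was ever stored deeper
                node.children[child] = Node()
            node = node.children[child]
        yield node, (level, child, entry)
        level += 1


def insert(root, key):
    for node, address in probe_path(root, key, create=True):
        if node.buckets[address[2]] is None:
            node.buckets[address[2]] = key
            return address


def search(root, key):
    """Return the addresses of all buckets on the path that hold key."""
    found = []
    for node, address in probe_path(root, key, create=False):
        stored = node.buckets[address[2]]
        if stored is None:
            break  # first empty bucket: no later insert of key can lie beyond it
        if stored == key:
            found.append(address)
    return found


root = Node()
placed = {k: insert(root, k) for k in (f"doc-{i}" for i in range(20))}
duplicate = insert(root, "doc-0")  # same key again, stored deeper on the path
print(search(root, "doc-0") == [placed["doc-0"], duplicate])
print(search(root, "missing"))
```

The lookup depends only on the key's own hash values and on buckets that were already written when the key was committed, which is the property that makes the path independent of later insertions. Range queries are not possible because hashing destroys key order.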

  24. Summary. Trustworthy record keeping is important. However, we also need to ensure efficient retrieval. Existing indexing structures may be manipulated. GHT is a "trustworthy" index structure: once a record is committed, it cannot be manipulated!

  25. Most business records are unstructured, searched by inverted index. Keywords and posting lists: Query: 1, 3, 11, 17; Data: 3, 9; Base: 3, 19; Worm: 7, 36; Index: 3. One WORM file for each posting list. S. Mitra, W. W. Hsu, M. Winslett: Trustworthy Keyword Search for Regulatory-Compliant Record Retention. VLDB'2006, 1001-1012, 2006

  26. Index must be updated as new documents arrive. Doc 79 contains Query, Data and Index, so 79 is appended to those posting lists: Query: 1, 3, 11, 17, 79; Data: 3, 9, 79; Base: 3, 19; Worm: 7, 36; Index: 3, 79. 500 keywords = 500 disk seeks, so ~1 sec per document.

  27. Amortize cost by updating in batch. Docs 79 to 83 are held in a buffer; Query occurs in docs 79, 81 and 83, so 79, 81, 83 are appended to its posting list together. Existing posting lists: Query: 1, 3, 11, 17; Data: 3, 9; Base: 3, 19; Worm: 7, 36; Index: 3. 1 seek per keyword in batch. Large buffer to benefit infrequent terms. Over 100,000 documents to achieve 2 docs/sec.
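
The batching idea can be sketched as a buffer that groups pending document ids by keyword and performs one append per keyword at flush time. `BatchedIndexer` is a hypothetical name, each append stands in for one disk seek, and the keyword contents of docs 80 to 83 are made up for illustration (the slide only shows that Query occurs in docs 79, 81 and 83).

```python
from collections import defaultdict


class BatchedIndexer:
    def __init__(self, posting_lists):
        self.posting_lists = posting_lists  # keyword -> append-only list of doc ids
        self.buffer = defaultdict(list)     # keyword -> doc ids awaiting a flush
        self.appends = 0                    # one append models one disk seek

    def add(self, doc_id, keywords):
        """Buffer a document instead of touching the posting lists right away."""
        for kw in set(keywords):
            self.buffer[kw].append(doc_id)

    def flush(self):
        """Write the whole batch: a single append per distinct keyword."""
        for kw, doc_ids in self.buffer.items():
            self.posting_lists.setdefault(kw, []).extend(doc_ids)
            self.appends += 1
        self.buffer.clear()


posting_lists = {
    "query": [1, 3, 11, 17],
    "data": [3, 9],
    "base": [3, 19],
    "worm": [7, 36],
    "index": [3],
}

# Hypothetical document contents, chosen so that Query occurs in 79, 81 and 83.
docs = {
    79: ["query", "data", "index"],
    80: ["base"],
    81: ["query"],
    82: ["worm"],
    83: ["query", "data"],
}

indexer = BatchedIndexer(posting_lists)
for doc_id, keywords in docs.items():
    indexer.add(doc_id, keywords)
indexer.flush()

unbatched_appends = sum(len(set(kws)) for kws in docs.values())
print(unbatched_appends, indexer.appends)  # 8 5
print(posting_lists["query"])              # [1, 3, 11, 17, 79, 81, 83]
```

The saving grows with the number of buffered documents that share a keyword, which is why infrequent terms only benefit from a large buffer. The cost is that buffered documents are not yet in the index.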

  28. Index is not updated immediately. Between Alice's commit of a record and the index update, the adversary can alter or omit entries held in the buffer. Prevailing practice: email must be committed before it is delivered.
