Secure Indexing/Search for Regulatory-Compliant Record Retention
There is a need for trustworthy record keeping
• Records: email, instant messaging, files
• Corporate misconduct, digital information explosion, soaring discovery costs, focus on compliance (HIPAA)
• Average F500 company has 125 non-frivolous lawsuits at any given time
• IDC forecasts 60B business emails annually
• Spending on eDiscovery growing at 65% CAGR
Sources: IDC, Network World (2003), Socha / Gelbmann (2004)
Q. Zhu, W. W. Hsu: Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. SIGMOD’2006, 395-406, 2006
What is trustworthy record keeping?
• Establish solid proof of events that have occurred
• [Figure: timeline. Alice commits a record to the storage device; the adversary later regrets the record; Bob queries.]
• Bob should get back Alice’s data
This leads to a unique threat model
• Commit is trustworthy: record is created properly
• Query is trustworthy: record is queried properly
• Adversary has super-user privileges
  • Access to storage device
  • Access to any keys
• Adversary could be Alice herself
Traditional schemes do not work
• Cannot rely on Alice’s signature
WORM storage helps address the problem
• Write Once Read Many (WORM)
• [Figure labels: Record; Overwrite/Delete; New Record.]
• Adversary cannot delete Alice’s record
WORM storage helps address the problem
• Write Once Read Many (WORM): adversary cannot delete Alice’s record
• Built on top of conventional rewritable magnetic disk, with write-once semantics enforced through software: file modification and premature deletion operations are disallowed
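The software-enforced write-once behaviour described above can be illustrated with a small in-memory sketch. This is not how any particular WORM product works: the class `WormStore`, its method names, and the per-record retention period are all hypothetical, introduced only to show "no modification, no premature deletion" as code.

```python
import time

class WormStore:
    """Hypothetical in-memory stand-in for a software-enforced WORM volume."""

    def __init__(self, clock=time.time):
        self._records = {}    # name -> (data, expiry timestamp)
        self._clock = clock   # injectable clock, so retention can be tested

    def commit(self, name, data, retention_seconds):
        # Write-once: a committed name can never be written again.
        if name in self._records:
            raise PermissionError(f"{name!r} is already committed; overwrite disallowed")
        self._records[name] = (bytes(data), self._clock() + retention_seconds)

    def read(self, name):
        # Read-many: reads are always allowed.
        return self._records[name][0]

    def delete(self, name):
        # Deletion is refused until the (assumed) retention period has passed.
        _, expiry = self._records[name]
        if self._clock() < expiry:
            raise PermissionError(f"{name!r} is still under retention")
        del self._records[name]
```

Note that there is deliberately no update method: the only way to change what the store holds is to commit a new record under a new name.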
Index required due to high volume of records
• [Figure: timeline. Alice commits a record and updates the index; the adversary regrets; Bob queries from the index.]
In effect, records can be hidden/altered by modifying the index
• Hide record B from the index
• Or replace B with B’
• The index must also be secured (fossilized)
B+tree for increasing sequence can be created on WORM
• [Figure: B+tree whose leaves hold keys 2, 4, 7, 11, 13, 19, 23, 29, 31, with index keys 7, 13, 23, 31.]
B+tree index is insecure, even on WORM
• [Figure: the B+tree over keys 2, 4, 7, 11, 13, 19, 23, 29, 31 with later-added entries 25, 26, 27, 30.]
• Path to an element depends on elements inserted later, so the adversary can attack it
Is this a real threat?
• Would someone want to delete a record a day after it is created?
• Intrusion detection logging
  • Once the adversary gains control, he would like to delete records of his initial attack
  • Record regretted moments after creation
• Email best practice: must be committed before it is delivered
Several levels of indexing
• [Figure: document 1 contains "query"; document 3 contains "data", "base", "index".]

Keyword   Posting list
Query     1, 3, 11, 17
Data      3, 9
Base      3, 19
Worm      7, 36
Index     3

• To find documents containing keywords "Query" and "Data" and "Base": retrieve the lists for Query, Data and Base, and intersect the document ids in the lists
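The conjunctive lookup described on this slide amounts to intersecting sorted posting lists. The sketch below uses the posting lists from the slide; the function names `intersect` and `conjunctive_query` and the dict layout are illustrative assumptions, not part of the original material.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def conjunctive_query(index, keywords):
    """Return ids of documents containing every keyword."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(k, []) for k in keywords), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Posting lists as shown on the slide.
index = {
    "query": [1, 3, 11, 17],
    "data": [3, 9],
    "base": [3, 19],
    "worm": [7, 36],
    "index": [3],
}
print(conjunctive_query(index, ["query", "data", "base"]))  # [3]
```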
GHT: A Generalized Hash Tree
• A fossilized index
• Tree grows from the root down to the leaves without relocating committed entries
• "Balanced" without requiring dynamic adjustments to its structure
• For hash-based schemes: a dynamic hashing scheme that does not require rehashing
GHT
• Defined by {M, K, H}
• M = {m_0, m_1, ...}, where m_i is the size of a tree node (number of buckets) at level i
• K = {k_0, k_1, ...}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i
• H = {h_0, h_1, ...}, where h_i is a hash function for level i
• Example: m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2
• Different H values lead to different GHT variants
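As a quick check of the definitions above, the number of nodes and buckets at each level follows directly from M and K. This is a small sketch; the function names are hypothetical and the sequences are truncated to a few levels for illustration.

```python
from math import prod

def nodes_at_level(K, i):
    """Level 0 has one node; each level i+1 has k_i times the nodes of level i."""
    return prod(K[:i])

def buckets_at_level(M, K, i):
    """Total buckets at level i: nodes at that level times buckets per node."""
    return M[i] * nodes_at_level(K, i)

# The slide's example parameters: m_i = 4 and k_i = 2 for every level.
M = [4, 4, 4, 4]
K = [2, 2, 2]
for level in range(4):
    print(level, nodes_at_level(K, level), buckets_at_level(M, K, level))
```

With these parameters the node counts are 1, 2, 4, 8 and the bucket counts are 4, 8, 16, 32 for levels 0 through 3.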
Standard (Default) GHT: Thin Tree
• Defined by {M, K, H}, with m_i the node size at level i, k_i the growth factor for level i, and h_i the hash function for level i
• m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2
• [Figure: tree with levels labeled h_0, h_1, h_2.]
Standard (Default) GHT: Thin Tree
• m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2
• h_0 = x mod 4
• h_1 = x mod 8
• What about h_2? x mod 16?
Standard (Default) GHT: Thin Tree
• m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2
• h_0 = x mod 4
• h_1 = x mod 8
• h_2 = h_3 = ... = x mod 8
GHT Variant (Fat Tree)
• m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2
• h_0 = x mod 4, h_1 = x mod 8, h_2 = x mod 16; in general h_i = x mod (4 * 2^i)
• Can tolerate non-ideal hash functions better because there are many more potential target buckets at each level
• Hashing at different levels is independent
• Can allocate different levels to different disks and access them in parallel
• Expensive to maintain children pointers in each node: the number of pointers grows exponentially
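The difference between the two variants is the range of the hash function at each level. The sketch below tabulates those ranges for the slide's parameters (m = 4, k = 2) and maps a fat-tree hash value to a node and bucket. The function names, and the convention that a hash value h addresses node h // m and entry h % m within a level, are assumptions made for illustration.

```python
def thin_range(level, m=4, k=2):
    """Standard GHT: the hash picks among the buckets of one node's children."""
    return m if level == 0 else m * k

def fat_range(level, m=4, k=2):
    """Fat tree: the hash picks among all buckets at the level, m * k**level."""
    return m * k ** level

def fat_address(x, level, m=4, k=2):
    """Assumed addressing: (level, node index within the level, entry in node)."""
    node, entry = divmod(x % fat_range(level, m, k), m)
    return level, node, entry

for level in range(4):
    print(level, thin_range(level), fat_range(level))

print(fat_address(29, 2))  # 29 mod 16 = 13, so node 3, entry 1: (2, 3, 1)
```

The fat-tree range doubles at every level (4, 8, 16, 32, ...), while the thin-tree range stays at 8 below the root, which is why a fat-tree node must be able to reach any node on the next level.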
GHT (Standard) Insertion
• Bucket = (Level, Child: left or right, Entry within bucket)
• [Figure: buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) marked in the tree.]
GHT Insertion
• Insert a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3
• Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are occupied (collision)
GHT Insertion
• Insert a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3
• Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are occupied (collision), so the key is placed in bucket (3, 0, 3)
• If hash functions are uniform, the tree grows top-down in a balanced fashion
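The insertion walk on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dict-based layout (a bucket is identified by the tuple of child choices from the root plus the entry number), the SHA-256-based `level_hash`, and all function names are assumptions. The slide's triple (level, child, entry) records only the last child choice, so the sketch keeps the full path internally and returns the triple for display.

```python
import hashlib

M = 4  # buckets per node (m_i = 4 on the slides)
K = 2  # growth factor (k_i = 2 on the slides)

def level_hash(key, level):
    """Assumed hash family: one independent SHA-256-based hash per level.
    Range is M at the root and K*M (the children of one node) below it."""
    span = M if level == 0 else K * M
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % span

def insert(buckets, key, value, hash_fn=level_hash):
    """Place (key, value) in the first free bucket on the key's probe path.

    buckets maps (path, entry) -> (key, value). Existing entries are never
    moved or rewritten. Returns the slide-style address (level, child, entry).
    """
    path, level = (), 0
    child, entry = 0, hash_fn(key, 0)
    while (path, entry) in buckets:      # occupied: collision, go one level down
        level += 1
        child, entry = divmod(hash_fn(key, level), M)
        path += (child,)
    buckets[(path, entry)] = (key, value)
    return (level, child, entry)

# Replay the slide: three keys fill the occupied buckets, then the new key.
slide_hashes = {"a": [1], "b": [1, 6], "c": [1, 6, 1], "new": [1, 6, 1, 3]}
replay = lambda key, level: slide_hashes[key][level]
tree = {}
placed = [insert(tree, k, k.upper(), replay) for k in ("a", "b", "c", "new")]
print(placed)  # [(0, 0, 1), (1, 1, 2), (2, 0, 1), (3, 0, 3)]
```

Because an insertion only ever writes to a previously empty bucket, the structure maps naturally onto write-once storage.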
GHT Search
• Search for a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3
• Similar to insertion: probe (0, 0, 1), (1, 1, 2), (2, 0, 1), (3, 0, 3)
• Need to deal with duplicate key values
• Only for point queries; cannot support range search
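A point lookup follows the same probe path as insertion. The sketch below assumes the tree is stored as a dict mapping (path of child choices, entry) to (key, value), and that insertion always uses the first free bucket on the path, so an empty bucket ends the search. The names `search` and `replay` and the hand-built table are illustrative only. To handle duplicate keys, the walk does not stop at the first match.

```python
M = 4  # buckets per node, as on the slides

def search(buckets, key, hash_fn):
    """Return every value stored under key along its probe path."""
    matches = []
    path, level = (), 0
    entry = hash_fn(key, 0)
    while (path, entry) in buckets:          # an empty bucket ends the path
        stored_key, value = buckets[(path, entry)]
        if stored_key == key:
            matches.append(value)            # keep going: duplicates may follow
        level += 1
        child, entry = divmod(hash_fn(key, level), M)
        path += (child,)
    return matches

# Hash values from the slide for the searched key; levels beyond those
# shown default to 0 purely so the example can run.
slide_hashes = {"new": [1, 6, 1, 3], "absent": [2]}
def replay(key, level):
    values = slide_hashes[key]
    return values[level] if level < len(values) else 0

# The four buckets from the slide, keyed by (path, entry).
tree = {
    ((), 1): ("a", "A"),
    ((1,), 2): ("b", "B"),
    ((1, 0), 1): ("c", "C"),
    ((1, 0, 0), 3): ("new", "N"),
}
print(search(tree, "new", replay))     # ['N']
print(search(tree, "absent", replay))  # []
```

Since each key hashes to a single probe path, there is no ordering among neighbouring buckets to exploit, which matches the slide's remark that range search is not supported.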
Summary
• Trustworthy record keeping is important
• However, need to also ensure efficient retrieval
• Existing indexing structures may be manipulated
• GHT is a "trustworthy" index structure
  • Once a record is committed, it cannot be manipulated!
Most business records are unstructured, searched by inverted index

Keyword   Posting list
Query     1, 3, 11, 17
Data      3, 9
Base      3, 19
Worm      7, 36
Index     3

• One WORM file for each posting list
S. Mitra, W. W. Hsu, M. Winslett: Trustworthy Keyword Search for Regulatory-Compliant Record Retention. VLDB’2006, 1001-1012, 2006
Index must be updated as new documents arrive
• New document 79 contains Query, Data and Index, so 79 is appended to each of those posting lists

Keyword   Posting list
Query     1, 3, 11, 17, 79
Data      3, 9, 79
Base      3, 19
Worm      7, 36
Index     3, 79

• 500 keywords = 500 disk seeks
• ~1 sec per document
Amortize cost by updating in batch
• [Figure: documents 79 to 83 are held in a buffer; the buffered entry for Query is 79, 81, 83, to be appended to the posting list 1, 3, 11, 17.]
• 1 seek per keyword in batch
• Large buffer to benefit infrequent terms
• Over 100,000 documents to achieve 2 docs/sec
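The saving from batching can be sketched by counting posting-list appends, treating each append as one seek. This is a simplified model under stated assumptions: the function names are hypothetical, and the document contents are illustrative (document 79's terms follow the earlier slide; documents 81 and 83 are assumed to contain only "query", matching the buffered entry shown).

```python
from collections import defaultdict

def seeks_unbatched(docs):
    """One seek per distinct keyword per document."""
    return sum(len(set(terms)) for terms in docs.values())

def build_batch(docs):
    """Group buffered documents by keyword, keeping doc ids in increasing order."""
    batch = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            batch[term].append(doc_id)
    return batch

def flush(index, batch):
    """Append each keyword's buffered ids in one go; returns the seek count."""
    seeks = 0
    for term, doc_ids in batch.items():
        index.setdefault(term, []).extend(doc_ids)
        seeks += 1
    return seeks

index = {"query": [1, 3, 11, 17], "data": [3, 9], "base": [3, 19],
         "worm": [7, 36], "index": [3]}
buffered = {79: ["query", "data", "index"], 81: ["query"], 83: ["query"]}

print(seeks_unbatched(buffered))            # 5
print(flush(index, build_batch(buffered)))  # 3
print(index["query"])                       # [1, 3, 11, 17, 79, 81, 83]
```

The gain comes only from keywords shared across buffered documents, which is why rare terms need a large buffer before batching helps them.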
Index is not updated immediately
• [Figure: timeline. Alice commits a record; its index entries wait in the buffer, where the adversary can alter or omit them before they reach the index.]
• Prevailing practice: email must be committed before it is delivered