Succinct Data Structures for Retrieval and Approximate Membership

Succinct Data Structures for Retrieval and Approximate Membership
Martin Dietzfelbinger, Technische Universität Ilmenau. Joint work with Rasmus Pagh. February 18, 2008.
Dietzfelbinger + Pagh, Dagstuhl, Feb. 18, 2008
Retrieval Adam 0 Peter


  1. Basic approach: Hash-read-add
     If $M_A$ has full row rank, then the system
     $$M_A \cdot \begin{pmatrix} a_0 \\ \vdots \\ a_{m-1} \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}$$
     has a solution $(a_0, \dots, a_{m-1})^T$ (goes into $T[0 \,..\, m-1]$).
     If $r = 1$: linear algebra in $\mathbb{Z}_2$.
     If $r \ge 2$: work in parallel in the components.
     Alternative: $R \subseteq GF(q)$, any finite field (like $\mathbb{Z}_p$); calculate in $GF(q)$.
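The hash-read-add scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hash functions (`positions`, built on BLAKE2b), the retry over seeds, and the example keys and values are all assumptions. Values are $r$-bit integers, so XOR on integers does the "work in parallel in the components" step. The builder gives up (returns `None`) whenever the rows are linearly dependent, that is, whenever $M_A$ lacks full row rank.

```python
import hashlib

def positions(key, m, k, seed):
    """Hypothetical hash functions h_1..h_k: k distinct cells of T (needs k <= m)."""
    cells, i = [], 0
    while len(cells) < k:
        digest = hashlib.blake2b(f"{seed}|{i}|{key}".encode(), digest_size=8).digest()
        cell = int.from_bytes(digest, "big") % m
        if cell not in cells:
            cells.append(cell)
        i += 1
    return cells

def build(f, m, k, seed):
    """Solve M_A * a = f over GF(2) (bitwise on r-bit values); None if rank-deficient."""
    pivots = {}  # pivot column -> (row bitmask, right-hand side)
    for key, val in f.items():
        mask = 0
        for p in positions(key, m, k, seed):
            mask |= 1 << p
        while mask:
            col = mask.bit_length() - 1      # highest set column of this row
            if col not in pivots:
                pivots[col] = (mask, val)
                break
            pmask, pval = pivots[col]
            mask ^= pmask                    # eliminate the pivot column
            val ^= pval
        else:
            return None                      # row reduced to zero: dependent rows
    T = [0] * m                              # free variables stay 0
    for col in sorted(pivots):               # back-substitution, lowest pivot first
        mask, val = pivots[col]
        rest = mask ^ (1 << col)
        while rest:
            j = rest.bit_length() - 1
            val ^= T[j]
            rest ^= 1 << j
        T[col] = val
    return T

def query(T, key, k, seed):
    """Read k cells and add (XOR) them."""
    v = 0
    for p in positions(key, len(T), k, seed):
        v ^= T[p]
    return v

# Hypothetical example: n = 5 keys with 2-bit values, m = 8 cells, k = 3 reads.
f = {"Adam": 0, "Peter": 1, "Eve": 3, "Mary": 2, "John": 1}
for seed in range(100):
    T = build(f, m=8, k=3, seed=seed)
    if T is not None:
        break
```

A query never looks at the keys themselves, only at `T`, so for a key outside the stored set it returns an arbitrary value.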

  5. Previous work
     [1] [Seiden, Hirschberg 1994] “Ordered perfect hashing” (special case with $f(x_i) = i - 1$). Propose hash-read-add scheme, experiments, no analysis.
     [2] [Majewski, Wormald, Havas, Czech 1996] (Ordered) perfect hashing, $r \ge 1$, $m = O(n)$. Analysis via random (hyper)graph theory.
     [3] [Chazelle, Kilian, Rubinfeld, Tal 2004] Implicit in work on “Bloomier filter”, $r \ge 1$, $m = O(n)$. Ad-hoc analysis.

  9. Previous work
     In [2] + [3]: Use sufficient condition “hypergraph $G_A$ (with hyperedges $A(x_i)$) is acyclic”.
     Equivalent: Each nonempty subset of the $A(x_i)$’s covers at least one node only once.
     Equivalent: $M_A$ can be brought into echelon form by row and column exchanges.

  12. Previous work

      [m]:     7 3 1 6 0 2 4 5
      A(x_1):  1 0 1 0 0 0 1 0
      A(x_4):  0 1 1 0 0 1 0 0
      A(x_3):  0 0 1 1 0 0 1 0
      A(x_2):  0 0 0 1 1 1 0 0
      A(x_5):  0 0 0 0 1 1 0 1

      Thresholds (acyclicity whp if $m \ge (1+\gamma) n > \gamma_k n$):

      $k$:         2    3      4      5      6      asympt.
      $\gamma_k$:  2    1.222  1.295  1.425  1.570  $\ln k$

      Advantage: Solve linear system in time $O(nk)$.
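One common way to exploit acyclicity is "peeling": repeatedly remove a hyperedge that owns a node no other remaining hyperedge covers, then assign table cells in reverse removal order. The sketch below is an illustration under that assumption, not code from the talk; the function names and the values in `f` are hypothetical. The hyperedges are read off the 0/1 matrix on this slide (for example, $A(x_1)$ has ones in the columns of nodes 7, 1 and 4).

```python
def peel(edges, m):
    """Return [(key, free_node), ...] in peel order, or None if the hypergraph is not acyclic."""
    incident = [set() for _ in range(m)]
    for key, nodes in edges.items():
        for v in nodes:
            incident[v].add(key)
    stack = [v for v in range(m) if len(incident[v]) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue                      # degree changed since v was pushed
        (key,) = incident[v]              # the only remaining edge covering v
        order.append((key, v))
        for u in edges[key]:
            incident[u].discard(key)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if len(order) == len(edges) else None

def assign(edges, f, m):
    """Fill T so that the XOR over each edge's cells equals f[key]; O(total edge size)."""
    order = peel(edges, m)
    if order is None:
        return None
    T = [0] * m
    for key, free in reversed(order):     # last peeled edge is assigned first
        acc = f[key]
        for u in edges[key]:
            if u != free:
                acc ^= T[u]
        T[free] = acc
    return T

# Hyperedges taken from the matrix on the slide; the values f are made up.
edges = {"x1": [7, 1, 4], "x4": [3, 1, 2], "x3": [1, 6, 4],
         "x2": [6, 0, 2], "x5": [0, 2, 5]}
f = {"x1": 3, "x2": 0, "x3": 2, "x4": 1, "x5": 3}
T = assign(edges, f, m=8)
```

Each edge is touched a constant number of times per node, which matches the $O(nk)$ bound stated on the slide.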

  16. New in this context: Calkin’s theorem
      Theorem [Calkin 1997]. There are constants $\beta_k < 1$, $k = 3, 4, \dots$, with the following properties:
      • If $1 + \gamma > \beta_k^{-1}$ and $m \ge (1+\gamma) n$ and $M_A$ is as before, then $M_A$ has full row rank with probability $1 - 1/n^{\varepsilon}$.
      • $\beta_k^{-1} \approx 1 + e^{-k}/\ln 2$, for growing $k$.

      Thresholds:

      $k$:             2    3       4      5      6       asympt.
      $\beta_k^{-1}$:  2    1.1243  1.034  1.011  1.0038  $1 + e^{-k}/\ln 2$
      $\gamma_k$:      2    1.222   1.295  1.425  1.570   $\ln k$
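As a quick sanity check of the asymptotic formula, the following sketch compares $1 + e^{-k}/\ln 2$ with the tabulated $\beta_k^{-1}$ values for $k = 3, \dots, 6$. The function name is hypothetical and the table values are copied from the slide; nothing here is computed from Calkin's actual definition of $\beta_k$.

```python
import math

def beta_inv_approx(k):
    """Asymptotic approximation 1 + e^{-k} / ln 2 from the slide."""
    return 1 + math.exp(-k) / math.log(2)

# Tabulated thresholds from the slide.
table = {3: 1.1243, 4: 1.034, 5: 1.011, 6: 1.0038}

for k, tabulated in table.items():
    print(k, tabulated, round(beta_inv_approx(k), 4))
```

The approximation underestimates the threshold for small $k$ and gets closer as $k$ grows, as expected for a statement that holds "for growing $k$".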

  21. New retrieval structures
      Theorem 1. A retrieval data structure using the hash-read-add scheme with $k$ accesses for a query and space $(1+\gamma) r n$ bits can be built if $1 + \gamma > \beta_k^{-1} \approx 1 + e^{-k}/\ln 2$. Construction time: $O(n^3)$.
      Proof: Existence: Calkin. Construction: Solve a linear system.
      Theorem 2. . . . same . . . Construction time $O(n^{1+\delta})$.
      Proof: Theorem 1 plus “splitting”.

  22. Construction time $O(n^{1+\delta})$
      [Figure: $h_{\text{split}}$ maps $S \subseteq U$ to $t = n^{1-\delta/2}$ chunks $S_0, S_1, S_2, \dots$, each with its own segment of the table $T$.]

  26. Construction time $O(n^{1+\delta})$
      Construction time $O(n^{1+\delta})$, for $\delta > 0$ constant: “Split”.
      Introduce extra first level of hashing, using hash function $h_{\text{split}}: U \to [t]$.
      Splits $S$ into $t = n^{1-\delta/2}$ chunks $S_0, \dots, S_{t-1}$ of size $O(n^{\delta/2})$.
      Construct separate retrieval data structure for each of the chunks: construction time $t \cdot O((n^{\delta/2})^3) = O(n^{1+\delta})$. Extra space: $o(n)$.
      Retrieval for $y$: access retrieval data structure for $S_{h_{\text{split}}(y)}$.
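The exponent arithmetic behind the construction time can be written out as below. This assumes, as the slide does, that every chunk has size $O(n^{\delta/2})$ and that each chunk is built in cubic time as in Theorem 1.

```latex
\[
t \cdot O\bigl((n^{\delta/2})^{3}\bigr)
  = n^{1-\delta/2} \cdot O\bigl(n^{3\delta/2}\bigr)
  = O\bigl(n^{1-\delta/2+3\delta/2}\bigr)
  = O\bigl(n^{1+\delta}\bigr).
\]
```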

  32. Approximate membership
      $x \in S \;\to\;$ Answer “yes”
      $x \notin S \;\to\; \Pr(\text{Answer “no”}) \ge 1 - 2^{-r}$
      Standard implementation: Bloom filters. Space $\approx nr/\ln 2$.
      Can show ([Carter et al. 1978]): Need $\ge nr - O(1)$ bits.
      Or (folklore): Use (minimal) perfect hashing to store an $r$-bit fingerprint for $x \in S$. Space: $nr + O(n)$ bits.

  40. Approximate membership
      General construction: Assume any algorithm for retrieval structure for range $R = \{0,1\}^r$ plus one fully random hash function $q: U \to R$ (fingerprint).
      Given $S$, build retrieval structure $D_{\text{retr}}$ for $f = q|_S$.
      On query $y$: Retrieve $s = D_{\text{retr}}(y)$; answer “yes” if $q(y) = s$, “no” otherwise.
      Performance: Error probability $\le 2^{-r}$.
      New: Space $(1 + e^{-k}) nr$ bits, evaluation time $O(k)$. Construction time $O(n^3)$ resp. $O(n^{1+\delta})$.
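The general construction can be sketched in a few lines. Everything named here is an assumption made for illustration: `fingerprint` uses BLAKE2b as a stand-in for the fully random $q$, and `DictRetrieval` is a deliberately naive, non-succinct placeholder for "any retrieval structure" (exact on $S$, arbitrary elsewhere). A space-efficient retrieval structure with the same call interface could be passed in its place.

```python
import hashlib

def fingerprint(key, r, seed=0):
    """Hypothetical stand-in for the fully random q: U -> {0,1}^r."""
    digest = hashlib.blake2b(f"{seed}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << r)

class DictRetrieval:
    """Placeholder retrieval structure (NOT succinct): exact on S, returns 0 elsewhere."""
    def __init__(self, f):
        self._f = dict(f)

    def __call__(self, key):
        return self._f.get(key, 0)

class ApproxMembership:
    """Answers 'yes' iff the retrieved value equals the query's fingerprint."""
    def __init__(self, S, r, build_retrieval=DictRetrieval, seed=0):
        self.r, self.seed = r, seed
        # f = q restricted to S
        self.d_retr = build_retrieval({x: fingerprint(x, r, seed) for x in S})

    def __contains__(self, y):
        s = self.d_retr(y)
        return fingerprint(y, self.r, self.seed) == s

S = [f"key{i}" for i in range(1000)]
amq = ApproxMembership(S, r=8)
false_positives = sum(f"other{i}" in amq for i in range(10000))
```

Members are always accepted because the retrieval structure returns exactly $q(x)$ for $x \in S$. A non-member is accepted only if its fingerprint happens to equal whatever the retrieval structure returns, which for a random $q$ occurs with probability $2^{-r}$.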

  43. Retrieval: Construction in linear time
      Theoretical construction, works for very large $n$.
      Extra level of hashing: $h_{\text{split}}: U \to [t]$.
      [Figure: $h_{\text{split}}$ maps $S \subseteq U$ to $t = n/b$ chunks $S_0, S_1, S_2, \dots$, each with its own segment of the table $T$.]

  52. Construction in linear time
      Chunk size: $b = \frac{1}{2}\sqrt{\log n}$. Number of chunks: $t = n/b$.
      $h_{\text{split}}$ splits $S$ into chunks $S_0, \dots, S_{t-1}$ of expected size $b$.
      Construct separate retrieval data structure for each chunk: Allocate space $b' = (1+\gamma) b$ for each chunk, $1 + \gamma > \beta_k^{-1}$.
      Construction time $O(b)$! Why? Can arrange that matrix size $((1+\gamma) b)^2$ is $< \frac{1}{2}\log n$. Use table-lookup for the linear algebra.
      Total construction time: $O(n)$. Total space $(1+\gamma) nr + o(n)$ bits.
      Done?
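The size condition can be checked directly from the chunk size $b = \frac{1}{2}\sqrt{\log n}$. The second line below is an assumption about why table lookup is affordable (binary logarithm, one table entry per possible 0/1 matrix); the slide itself only says "use table-lookup".

```latex
\[
\bigl((1+\gamma)\,b\bigr)^{2}
  = \frac{(1+\gamma)^{2}}{4}\,\log n
  < \frac{1}{2}\log n
  \iff 1+\gamma < \sqrt{2},
\]
\[
\#\{\text{0/1 matrices with } s < \tfrac{1}{2}\log n \text{ entries}\}
  = 2^{s} < 2^{\frac{1}{2}\log n} = \sqrt{n}.
\]
```

Both requirements on $\gamma$ are compatible for every $k \ge 3$, since the tabulated thresholds satisfy $\beta_k^{-1} \le 1.1243 < \sqrt{2}$.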

  55. Construction in linear time
      No! “Bad chunks”: (1) Overflow (2) Construction fails.
      [Figure: $h_{\text{split}}$ maps $S \subseteq U$ to $t = n/b$ chunks $S_0, S_1, S_2, \dots$ with segments in the table $T$; one chunk is marked “bad”.]

  58. Construction in linear time
      Nice: #(keys in bad chunks) $= o(n)$.
      The (keys from) bad chunks are accommodated in a secondary structure with table $T'[0 \,..\, o(n)]$, hash functions $h'_1, \dots, h'_3$, with $o(n)$ construction time (e.g. [Chazelle et al. 2004]).
      Flag: $v[i] = 0$ if chunk $i$ is bad and $v[i] = 1$ otherwise.

  61. Construction in linear time
      Segment for chunk $S_i$ in $T$: $T[d_i \,..\, d_{i+1} - 1]$.
      Lookup operation: $i \leftarrow h_{\text{split}}(x)$, then . . .
      $$f(x) = v[i] \cdot \bigoplus_{1 \le \ell \le k} T[h_\ell(x) + d_i] \;\oplus\; \overline{v[i]} \cdot \bigoplus_{1 \le \ell \le 3} T'[h'_\ell(x)],$$
      where $\overline{v[i]} = 1 - v[i]$ selects the secondary structure for bad chunks.
      Time $O(k)$. Nonadaptive reads!
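The lookup rule translates almost literally into code. In this sketch the hash functions are passed in as plain callables and the tiny tables are made up by hand; both are assumptions for illustration, and no construction step is shown. Both XOR sums are always computed, so the cells read do not depend on the flag, which mirrors the "nonadaptive reads" remark.

```python
def lookup(x, h_split, hs, hs2, T, T2, d, v):
    """f(x) = v[i] * XOR_l T[h_l(x) + d_i]  XOR  (1 - v[i]) * XOR_l T'[h'_l(x)]."""
    i = h_split(x)
    main = 0
    for h in hs:                 # k reads in the segment of chunk i
        main ^= T[h(x) + d[i]]
    secondary = 0
    for h in hs2:                # 3 reads in the secondary table T'
        secondary ^= T2[h(x)]
    # v[i] is 0 or 1, so the products select one of the two sums.
    return (v[i] * main) ^ ((1 - v[i]) * secondary)

# Hand-made toy instance: two chunks with segments T[0..1] and T[2..3].
h_split = {"a": 0, "b": 1}.get
hs = [lambda x: 0, lambda x: 1]                   # k = 2, offsets within a segment
hs2 = [lambda x: 0, lambda x: 1, lambda x: 2]     # three secondary hash functions
T, d = [5, 3, 0, 0], [0, 2, 4]
T2 = [7, 1, 2]
v = [1, 0]                                        # chunk 0 is good, chunk 1 is bad

value_a = lookup("a", h_split, hs, hs2, T, T2, d, v)   # 5 ^ 3
value_b = lookup("b", h_split, hs, hs2, T, T2, d, v)   # 7 ^ 1 ^ 2
```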

  64. Almost optimal space, logarithmic evaluation time
      Use random sets $A(x) = \{h_1(x), \dots, h_{k(x)}(x)\} \subseteq [n]$, with $E(k(x)) = \Theta(\log n)$, binomially distributed.
      [Cooper, 1999] $\Rightarrow \Pr(M_A \text{ is regular}) \ge 0.28$.
