
Space Efficient Data Structures and FM index. Venkatesh Raman, The Institute of Mathematical Sciences, Chennai. NISER Bhubaneshwar, February 9, 2019. Sections: Introduction, Data Structures, Libraries, Conclusions.


  1. “Space for Data”
     Definition (Information-theoretic Lower Bound): If an object $x$ is chosen from a set $S$, then in the worst case we need $\log_2 |S|$ bits to represent $x$.
     • $x$ is a binary tree of $n$ nodes.
     • $S$ is the set of all binary trees of $n$ nodes.
     • $\log_2 |S| = \log_2 \left( \frac{1}{n+1} \binom{2n}{n} \right) = 2n - O(\log n)$ bits.
     Note that the standard binary tree representation uses $\Theta(1)$ pointers per node, or $\Theta(n)$ pointers in total; each pointer is an address needing $\log n$ bits, so the total is $\Theta(n \log n)$ bits, a factor of $\log n$ more than necessary.
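The gap between the lower bound and the pointer representation can be checked numerically. The sketch below is illustrative only: the function names are not from the slides, and it assumes the usual Catalan-number count of binary trees that the formula above uses.

```python
from math import comb, log2

def binary_tree_count(n):
    """Number of binary trees on n nodes: the Catalan number C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def lower_bound_bits(n):
    """Information-theoretic lower bound log2 |S| for n-node binary trees."""
    return log2(binary_tree_count(n))

for n in (10, 100, 1000):
    # Compare the lower bound with 2n; the difference grows only like log n.
    print(n, round(lower_bound_bits(n), 1), 2 * n)
```

The printed lower bound stays just below 2n, whereas a pointer-based tree with two (log n)-bit pointers per node would need about 2n log n bits.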

  2. “Space for Data” (continued)
     The same information-theoretic lower bound, $\log_2 |S|$ bits for an object $x$ chosen from a set $S$, applies to other classes:
     • $x$ is a triangulated planar graph of $n$ nodes.
     • $S$ is the set of all triangulated planar graphs with $n$ nodes.
     • $\log_2 |S| \sim 3.24\,n$ bits.
     There are also bounds for general graphs, chordal graphs, and bounded-treewidth graphs.

  3. Overview
     • Introduction
     • Data Structures: Goals; Bit Vectors; Strings from a larger alphabet; Sparse Bit Vectors; Trees; Burrows-Wheeler Transform and Indexing
     • Libraries
     • Conclusions

  4. Succinct Data Structures
     The aim is to store the data using
       space usage = “space for data” + “space for index”,
     where the “space for index” is a lower-order term, and to perform operations directly on this representation.
     • For static data structures, we often get $O(1)$-time operations.
     • The representation is often tightly tied to the set of operations.
     • They work in practice!

  7. Bit Vectors
     Data: Sequence $X$ of $n$ bits, $x_1, \ldots, x_n$.
     ITLB (information-theoretic lower bound): $n$ bits; total space $n + o(n)$ bits.
     Operations:
     • $\mathrm{rank}_1(i)$: number of 1s in $x_1, \ldots, x_i$.
     • $\mathrm{select}_1(i)$: position of the $i$-th 1.
     Also $\mathrm{rank}_0$, $\mathrm{select}_0$. Ideally all in $O(1)$ time.
     Example: $X = 01101001$, $\mathrm{rank}_1(4) = 2$, $\mathrm{select}_0(4) = 7$.
     Operations introduced in [Elias, J. ACM ’75], [Tarjan and Yao, C. ACM ’78], [Chazelle, SIAM J. Comput. ’85], [Jacobson, FOCS ’89].
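As a reference point for the definitions above, here is a minimal linear-time sketch of rank and select with 1-indexed positions. The function names and the string encoding of the bit vector are assumptions made for illustration; the structures discussed on the following slides aim to answer the same queries in O(1) time.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b among x_1..x_i (1-indexed, inclusive)."""
    return bits[:i].count(b)

def select(bits, b, i):
    """1-indexed position of the i-th occurrence of bit b, or None if absent."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == i:
                return pos
    return None

X = "01101001"
print(rank(X, "1", 4))    # rank_1(4) from the slide example
print(select(X, "0", 4))  # select_0(4) from the slide example
```

The two operations are inverse in the sense that rank(X, b, select(X, b, i)) equals i whenever the i-th occurrence exists.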

  10. Bit Vectors: Implementing rank_1
      [Figure: a 32-bit stretch of the bit vector, with rank_1 sampled once every $(\log n)/2$ bits; sampled values 657, 661, 664, 668.]
      • Naive solution: store the answer to every $\mathrm{rank}_1$ query. Space: $O(n \log n)$ bits.
      • Sample: store the answer only at every $((\log n)/2)$-th position. Space: $O(n)$ bits.
      • How to support $\mathrm{rank}_1$ in $O(1)$ time?

  18. Bit Vectors: Implementing rank_1
      [Figure: bit vector with rank_1 sampled every $(\log n)/2$ bits; sampled values 657, 661, 664, 668.]
      • Scanning the $(\log n)/2$ block takes $O(\log n)$ time.
      • We will use what is called the “Four Russians trick”.
      • Let $k = (\log n)/2$. Create a table $A$ with $2^{k + \log_2 k} = O(\sqrt{n} \log n) = o(n)$ entries.
      • $A[y_1 \ldots y_{\log_2 k}\, x_1 \ldots x_k]$ = number of 1s in $x_1 \ldots x_{y+1}$, where $y = y_1 \ldots y_{\log_2 k}$.
      • $\mathrm{rank}_1(x) = 657 + \underbrace{A[101\,11010011]}_{3}$.
      • $O(n)$ bits, $O(1)$ time.
      • Many theoretical SDS: decompose + sample + table lookup.
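The sample-plus-table scheme above can be sketched as follows. This is an illustrative sketch, not the speaker's implementation: the class name, the list-of-ints input, and the block size k = 8 are assumptions, and the 32 example bits are read off the slide figure. The table is indexed as table[pattern][y], which corresponds to the slide's A[y x_1..x_k].

```python
class SampledRank:
    """rank_1 via per-block samples plus a lookup table over all k-bit patterns."""

    def __init__(self, bits, k):
        self.k = k
        self.samples = []  # samples[j] = number of 1s before block j
        self.blocks = []   # blocks[j] = block j packed into an int, x_1 as top bit
        ones = 0
        for start in range(0, len(bits), k):
            block = bits[start:start + k]
            block = block + [0] * (k - len(block))  # pad the last block
            self.samples.append(ones)
            self.blocks.append(int("".join(map(str, block)), 2))
            ones += sum(block)
        # table[pattern][y] = number of 1s among the first y + 1 bits of pattern
        self.table = [
            [bin(pattern >> (k - 1 - y)).count("1") for y in range(k)]
            for pattern in range(1 << k)
        ]

    def rank1(self, i):
        """Number of 1s in x_1..x_i, using one sample and one table lookup."""
        if i == 0:
            return 0
        j, y = divmod(i - 1, self.k)
        return self.samples[j] + self.table[self.blocks[j]][y]

bits = [int(c) for c in "11010011" "01000101" "10000111" "11111111"]
r = SampledRank(bits, k=8)
print(r.table[0b11010011][0b101])  # the slide's table entry A[101 11010011]
print(r.rank1(6))
```

In the slide the sample preceding the block is 657 because the block sits inside a longer vector; here the vector starts at the block, so the corresponding sample is 0.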

  22. Bit Vectors: Implementing rank_1
      Improve the redundancy with a two-level approach.
      • Store the answer for every $\log^2 n$ positions. This takes only $O(n \log n / \log^2 n) = O(n / \log n) = o(n)$ bits.
      • Then, for every $(\log n)/2$ positions, store the answer relative to the enclosing block of $\log^2 n$ positions. This takes $O(n \log\log n / \log n) = o(n)$ bits.
      • Then store, as before, a table to find answers within $(\log n)/2$ positions.

  25. Bit Vectors: Implementing rank_1
      Two-level approach.
      [Figure: the bit vector divided into blocks of $0.5 \log n$ bits and superblocks of $t \log n$ bits; each block stores a count in $\log\log n$ bits (5, 3, 4, 8 in the example) and each superblock stores a count in $\log n$ bits (657 in the example).]
      $$\text{Space} = n + O\!\left( \frac{n}{t \lg n} \cdot \lg n + \frac{n}{\lg n} \cdot \lg\lg n \right) + O(\sqrt{n} \cdot \lg n) = n + O(n \log\log n / \log n) \text{ bits},$$
      choosing $t = \Theta(\log n / \log\log n)$.
      • Redundancy $O(n \lg\lg n / \lg n)$ bits, optimal for $O(1)$-time operations [Golynski, TCS ’07].
      • Supporting $\mathrm{select}_1$ is similar, though a bit more complicated.
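A minimal sketch of the two-level directory, under stated assumptions: the class and parameter names are hypothetical, block sizes are passed in rather than derived from log n, and the final in-block count is done by a short scan that stands in for the lookup table.

```python
class TwoLevelRank:
    """rank_1 with absolute superblock counts and relative block counts."""

    def __init__(self, bits, block, blocks_per_super):
        self.bits = bits
        self.block = block
        self.per_super = blocks_per_super
        self.super_counts = []  # 1s before each superblock (about log n bits each)
        self.block_counts = []  # 1s since the superblock start (small numbers)
        total = within = 0
        for b, start in enumerate(range(0, len(bits), block)):
            if b % blocks_per_super == 0:
                self.super_counts.append(total)
                within = 0
            self.block_counts.append(within)
            ones = sum(bits[start:start + block])
            within += ones
            total += ones

    def rank1(self, i):
        """Number of 1s in x_1..x_i (1-indexed)."""
        if i == 0:
            return 0
        b, y = divmod(i - 1, self.block)
        start = b * self.block
        in_block = sum(self.bits[start:start + y + 1])  # stand-in for the table
        return self.super_counts[b // self.per_super] + self.block_counts[b] + in_block

bits2 = [int(c) for c in "11010011010001011000011111111111" * 3]
t = TwoLevelRank(bits2, block=4, blocks_per_super=4)
print(t.rank1(40))
```

The relative counts never exceed the superblock length, which is why they fit in far fewer bits than the absolute counts.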

  32. Bit Vectors: Implementing select_1; the idea
      • We will try to manage using $O(n / \log\log n)$ extra bits.
      • Store the answer for every $(\lg n \cdot \lg\lg n)$-th 1; this takes at most $n / \lg\lg n$ bits.
      • If the range $r$ between two consecutive stored answers has size more than $(\lg n \cdot \lg\lg n)^2$, store the positions of all $\lg n \cdot \lg\lg n$ 1s in the range; this takes $(\lg n)^2 \lg\lg n$ bits, which is at most $r / \lg\lg n$.
      • Otherwise recurse. After a couple of levels, the range will be small enough ($O((\lg\lg n)^4)$) that a table lookup can complete the job.
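The first level of this idea, storing the position of every s-th 1, can be sketched as below. This is a deliberate simplification and not the full scheme: the recursion on dense ranges and the final table lookup are replaced by a forward scan from the nearest sample, and the class name and sampling rate s are assumptions.

```python
class SampledSelect:
    """select_1 using sampled positions of every s-th 1, then a forward scan."""

    def __init__(self, bits, s):
        self.bits = bits
        self.s = s
        ones = [p for p, x in enumerate(bits, start=1) if x == 1]
        self.count = len(ones)
        # samples[j] = 1-indexed position of the (j * s + 1)-th 1
        self.samples = ones[::s]

    def select1(self, i):
        """1-indexed position of the i-th 1."""
        if not 1 <= i <= self.count:
            raise IndexError("no such 1")
        j, remaining = divmod(i - 1, self.s)
        pos = self.samples[j]
        while remaining:  # scan forward past `remaining` further 1s
            pos += 1
            if self.bits[pos - 1] == 1:
                remaining -= 1
        return pos

bits3 = [int(c) for c in "0110100100001011"]
sel = SampledSelect(bits3, s=3)
print([sel.select1(i) for i in range(1, sel.count + 1)])
```

The scan is short when the 1s between two samples are close together; the sparse case, where they are far apart, is exactly the one the slide handles by storing all positions explicitly.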

  35. Wavelet Tree – Representing strings from a larger alphabet
      Data: Sequence $S[1..n]$ of symbols from an alphabet of size $\sigma$.
      Operations, all in $O(\log \sigma)$ time:
      • $\mathrm{rank}(c, i)$: number of $c$’s in $S[1..i]$.
      • $\mathrm{select}(c, i)$: position of the $i$-th $c$.
      • $\mathrm{access}(i)$: return $S[i]$.
      Store $\log_2 \sigma$ bit vectors: $n \log \sigma + o(n \log \sigma)$ bits, where $n \log \sigma$ is the raw size [Grossi, Vitter, SJC ’05].
      [Figure: wavelet tree for the sequence 4 3 0 5 3 2 3 2 6 3 1 1; the root bit vector is 1 0 0 1 0 0 0 0 1 0 0 0, and its left child holds the subsequence 3 0 3 2 3 2 3 1 1.]
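A pointer-based sketch of the wavelet tree on the slide's example sequence follows. It is illustrative only: the class layout is an assumption, plain prefix-sum arrays stand in for the succinct rank structures on each bit vector, and select is omitted for brevity.

```python
class WaveletTree:
    """Each node splits its value range in half and stores one bit per symbol."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.bits = None
        if lo == hi or not seq:
            return  # leaf: a single symbol value (or an empty subsequence)
        mid = (lo + hi) // 2
        self.bits = [1 if c > mid else 0 for c in seq]
        # prefix[i] = number of 1s among the first i bits (stand-in for rank_1)
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.left = WaveletTree([c for c in seq if c <= mid], lo, mid)
        self.right = WaveletTree([c for c in seq if c > mid], mid + 1, hi)

    def access(self, i):
        """Return S[i] (1-indexed) by following the stored bits downward."""
        node = self
        while node.bits is not None:
            if node.bits[i - 1]:
                i, node = node.prefix[i], node.right
            else:
                i, node = i - node.prefix[i], node.left
        return node.lo

    def rank(self, c, i):
        """Number of occurrences of c in S[1..i]."""
        if not self.lo <= c <= self.hi:
            return 0
        node = self
        while node.bits is not None and i > 0:
            mid = (node.lo + node.hi) // 2
            ones = node.prefix[i]
            if c > mid:
                i, node = ones, node.right
            else:
                i, node = i - ones, node.left
        return i

S = [4, 3, 0, 5, 3, 2, 3, 2, 6, 3, 1, 1]
wt = WaveletTree(S, 0, 7)
print(wt.bits)       # root bit vector, as in the slide figure
print(wt.left.bits)  # bit vector of the left child
print(wt.rank(3, 12))
```

Both queries touch one node per level, which is where the O(log σ) bound comes from once each per-node rank is answered in constant time.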

  37. A Bit Vector with only m 1s
      Two views of the same data:
      • Data: Sequence $X$ of $n$ bits, $x_1, \ldots, x_n$, with $m$ 1s. Operation: $\mathrm{select}_1(i)$.
      • Data: Set $X = \{x_1, \ldots, x_m\} \subseteq \{1, \ldots, n\}$, $x_1 < x_2 < \ldots < x_m$. Operation: $\mathrm{access}(i)$: return $x_i$.
      ITLB: $\log_2 \binom{n}{m} = m \log_2 (n/m) + O(m)$ bits.
      [Elias, J. ACM ’75], [Grossi/Vitter, SICOMP ’06], [Raman et al., TALG ’07].

  38. Elias-Fano Representation
      Bucket the keys according to their most significant $b$ bits.
      Example: $b = 3$, $\lceil \log_2 n \rceil = 5$, $m = 7$.
      Keys: x_1 = 01000, x_2 = 01001, x_3 = 01011, x_4 = 01101, x_5 = 10000, x_6 = 10010, x_7 = 10111.
      Bucket  Keys
      000     −
      001     −
      010     x_1, x_2, x_3
      011     x_4
      100     x_5, x_6
      101     x_7
      110     −
      111     −

  39. Elias-Fano
      ⊲ Store only the low-order bits.
      ⊲ Keep the sizes of all buckets.
      Bucket  Size  Data (low-order bits)
      000     0     −
      001     0     −
      010     3     00 (x_1), 01 (x_2), 11 (x_3)
      011     1     01 (x_4)
      100     2     00 (x_5), 10 (x_6)
      101     1     11 (x_7)
      110     0     −
      111     0     −
      Example: select(6).

  45. Elias-Fano
      • Choose $b = \lfloor \log_2 m \rfloor$ bits. In each bucket: $(\lceil \log_2 n \rceil - \lfloor \log_2 m \rfloor)$-bit keys.
      • $m \log_2 n - m \log_2 m + O(m) = m \log_2 (n/m) + O(m)$ bits for the lower part.
      Encoding bucket sizes:
      Bucket no.:   000 001 010 011 100 101 110 111
      Bucket size:  0   0   3   1   2   1   0   0
      • Use a unary encoding: 0, 0, 3, 1, 2, 1, 0, 0 → 110001010010111.
      • $z$ buckets, total size $m$ ⇒ $m + z = O(m)$ bits ($z = 2^{\lfloor \log_2 m \rfloor}$).
      • Overall space of the E-F bit vector is $m \log (n/m) + O(m)$ bits.
      • In which bucket is the 6th key? ⊲ “$\mathrm{rank}_1$ of the 6th 0”.
      • $\mathrm{select}_1$ in $O(1)$ time.
      • Redundancy can be made $o(m)$, and membership and $\mathrm{rank}_1$ can also be supported (RRR01).
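The running example can be reproduced with a short sketch. The function names are hypothetical, the keys are the seven 5-bit values from the earlier slide written in decimal, and a linear scan stands in for the constant-time select_0 and rank_1 on the unary-coded bucket sizes.

```python
def ef_encode(keys, width, b):
    """Split sorted keys into low-order bits and unary-coded bucket sizes."""
    low_bits = width - b
    lows = [x & ((1 << low_bits) - 1) for x in keys]
    sizes = [0] * (1 << b)
    for x in keys:
        sizes[x >> low_bits] += 1
    # Each bucket of size s becomes s zeros followed by a single 1.
    upper = "".join("0" * s + "1" for s in sizes)
    return lows, upper

def ef_select(lows, upper, low_bits, i):
    """Return the i-th smallest key (1-indexed)."""
    if not 1 <= i <= len(lows):
        raise IndexError("no such key")
    zeros = 0
    for pos, bit in enumerate(upper, start=1):  # stand-in for select_0(i)
        if bit == "0":
            zeros += 1
            if zeros == i:
                break
    bucket = pos - i  # number of 1s before the i-th 0, i.e. rank_1 at that 0
    return (bucket << low_bits) | lows[i - 1]

keys = [0b01000, 0b01001, 0b01011, 0b01101, 0b10000, 0b10010, 0b10111]
lows, upper = ef_encode(keys, width=5, b=3)
print(upper)
print(ef_select(lows, upper, low_bits=2, i=6))
```

For the 6th key, the 6th 0 sits at position 10 of the unary string and is preceded by four 1s, so the key lies in bucket 100; appending its low-order bits 10 gives 10010.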

  52. Tree Representations
      Data: $n$-node binary tree.
      Operations: Navigation (left child, right child, parent).
      • Visit the nodes in level order and output 1 for an internal node and 0 for an external node ($2n + 1$ bits) [Jacobson, FOCS ’89]. Store the sequence of bits as a bit vector.
      [Figure: example tree with level-order bit string 1 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0.]
      • Number internal nodes by the position of their 1 in the bit string.
      • Left child = $2 \cdot \mathrm{rank}_1(i)$. E.g. left child of node 7 = 7 * 2 = 14.
      • Right child = $2 \cdot \mathrm{rank}_1(i) + 1$.
      • Parent = $\mathrm{select}_1(\lfloor i/2 \rfloor)$.
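The navigation formulas can be tried on the bit string from the figure. This is a sketch with naive linear-time rank and select, and all function names are assumptions. One reading note, also an assumption: with this bit string the 7th internal node sits at position 9, and its left child is at position 14, which appears to be the numbering the slide's example uses.

```python
BITS = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # level-order string

def rank1(i):
    """Number of 1s among positions 1..i."""
    return sum(BITS[:i])

def select1(j):
    """Position of the j-th 1 (1-indexed)."""
    seen = 0
    for pos, bit in enumerate(BITS, start=1):
        seen += bit
        if bit and seen == j:
            return pos
    raise IndexError("no such 1")

def left_child(i):
    return 2 * rank1(i)

def right_child(i):
    return 2 * rank1(i) + 1

def parent(i):
    return select1(i // 2)

internal = [pos for pos, bit in enumerate(BITS, start=1) if bit]
for i in internal:
    print(i, left_child(i), right_child(i))
```

Every position other than the root is the left or right child of exactly one internal node, so the parent formula simply inverts the child formulas.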

  54. Tree Representations
      • “Optimal” representations of many kinds of trees, e.g. ordinal trees (rooted, arbitrary-degree, (un-)labelled trees, e.g. XML documents) and tries.
      • Wide range of $O(1)$-time operations, e.g. ordinal trees in $2n + o(n)$ bits [Navarro, Sadakane, TALG ’12].

  60. Pattern Matching – Compressed Text Indexing
      Data: Sequence $T$ (“text”) of $m$ symbols from an alphabet of size $\sigma$.
      ITLB: $m \log_2 \sigma$ bits.
      Operation: Given a pattern $P$, determine whether $P$ occurs (exactly) in $T$ (and report the number of occurrences, starting positions, etc.).
      • For a human genome sequence, $m$ is about 3 billion ($3 \times 10^9$) characters, and $\sigma = 4$.
      • The standard data structure is the suffix tree, which answers this query in $O(|P|)$ time but takes $O(m \log m)$ bits of space.
      • In practice, a suffix tree is about 10-30 times larger than the text.
      • A number of SDS have been developed: we’ll focus on the FM-Index [Ferragina, Manzini, JACM ’05].

  61. Previous Popular Solution – Suffix Trees

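  62. Suffix trie: making it smaller
      T = abaaba$
      Idea 1: Coalesce non-branching paths into a single edge with a string label.
      [Figure: suffix trie of T with coalesced edges, e.g. edge labels $ and aba$.]
      This reduces the number of nodes and edges, and guarantees that internal nodes have more than one child.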

  63. Suffix tree
      T = abaaba$
      [Figure: suffix tree of T with edge labels a, ba, $, aba$.]
      With respect to m:
      • How many leaves? m
      • How many non-leaf nodes? ≤ m − 1
      • ≤ 2m − 1 nodes total, or O(m) nodes.
      • Is the total size O(m) now? No: the total length of the edge labels is quadratic in m.

  64. Suffix tree
      T = abaaba$
      Idea 2: Store T itself in addition to the tree. Convert the tree’s edge labels to (offset, length) pairs with respect to T.
      [Figure: suffix tree of T with labels replaced: $ → (6, 1), a → (0, 1), ba → (1, 2), aba$ → (3, 4).]
      Space required for the suffix tree is now O(m).

  65. Suffix tree: leaves hold offsets
      T = abaaba$
      [Figure: the same suffix tree with (offset, length) edge labels; each leaf stores the starting offset (0 to 6) of its suffix.]
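The two ideas, coalescing non-branching paths and replacing edge labels by (offset, length) pairs, can be sketched with a naive quadratic-time construction. This is only an illustration under stated assumptions: the function names and the nested-list tree layout are hypothetical, the text must end in a unique terminator such as $, and practical suffix-tree builders use linear-time algorithms instead.

```python
def build_trie(text):
    """Insert every suffix; each trie edge is [child dict, offset in text, leaf id]."""
    root = {}
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            entry = node.setdefault(text[pos], [{}, pos, None])
            node = entry[0]
        entry[2] = start  # the last edge of this suffix ends in its leaf
    return root

def compress(trie):
    """Coalesce non-branching paths; edges become ((offset, length), leaf, subtree)."""
    edges = []
    for _, (child, offset, leaf) in sorted(trie.items()):
        length = 1
        while len(child) == 1:  # exactly one way forward: absorb that character
            ((child, _, leaf),) = child.values()
            length += 1
        edges.append(((offset, length), leaf, compress(child)))
    return edges

def walk(edges):
    """Return (leaf offsets in left-to-right order, internal nodes below, labels)."""
    leaves, internal, labels = [], 0, set()
    for label, leaf, sub in edges:
        labels.add(label)
        if sub:
            sub_leaves, sub_internal, sub_labels = walk(sub)
            leaves += sub_leaves
            internal += 1 + sub_internal
            labels |= sub_labels
        else:
            leaves.append(leaf)
    return leaves, internal, labels

T = "abaaba$"
tree = compress(build_trie(T))
leaves, internal_below_root, labels = walk(tree)
print(leaves)
print(internal_below_root + 1)  # non-leaf nodes, counting the root
print(sorted(labels))
```

Each edge now costs two integers regardless of how long its label is, which is what brings the space down from quadratic to O(m).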

  68. Previous Popular Solution – Suffix Trees
      • A (compressed) trie containing all the suffixes of $T$. The tree contains $m + 1$ leaves and at most $m$ other nodes.
      • Each leaf is labelled with the starting position of the suffix ending at that leaf.
