Introduction CST design CST in practice
Compressed Suffix Trees in Practice
Simon Gog
Computing and Information Systems, The University of Melbourne
February 13th 2013
Outline
1 Introduction: Basic data structures; The suffix tree
2 CST design: NAV (tree topology and navigation); CSA (lexicographic information); LCP (longest common prefixes)
3 CST in practice: The sdsl library
Succinct data structures (1)
Data structure D = representation of an object X + operations on X.
Example: rank bit vector. A bit vector b of length n, for instance b = (0,1,0,1,1,0,1,1), plus the operations
- access b[i] in O(1) time
- $\mathrm{rank}(i) = \sum_{j=0}^{i-1} b[j]$ in O(n) time, giving the rank values (0,0,1,1,2,3,3,4)
- in n bits of space
Succinct data structure D: the space of D is close to the information-theoretic lower bound to represent X, while operations can still be performed efficiently.
Succinct data structures (1), continued
Example: rank bit vector with precomputed rank values. For the bit vector b = (0,1,0,1,1,0,1,1) of length n:
- access b[i] in O(1) time
- $\mathrm{rank}(i) = \sum_{j=0}^{i-1} b[j]$ in O(1) time, read from the stored rank values (0,0,1,1,2,3,3,4)
- in n + n log n bits of space
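The two variants of the rank example can be sketched in a few lines of Python. This is only an illustration of the time/space trade-off on the slide. The names `rank_naive` and `RankTable` are made up for this sketch and are not part of sdsl.

```python
from itertools import accumulate


def rank_naive(b, i):
    """Number of 1-bits in b[0..i-1], found by scanning: O(i) time, no extra space."""
    return sum(b[:i])


class RankTable:
    """Hypothetical helper: stores one precomputed counter per position.

    Queries take O(1) time, but the table costs about n log n extra bits.
    """

    def __init__(self, b):
        self.b = list(b)
        # prefix[i] = b[0] + ... + b[i-1], with prefix[0] = 0
        self.prefix = [0] + list(accumulate(self.b))

    def access(self, i):
        return self.b[i]

    def rank(self, i):
        return self.prefix[i]


b = [0, 1, 0, 1, 1, 0, 1, 1]          # the bit vector from the slide
table = RankTable(b)
ranks = [table.rank(i) for i in range(len(b))]
print(ranks)
```

Practical succinct rank structures sit between these two extremes: they keep only sampled counters, so the extra space is o(n) bits while queries stay constant time.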
Succinct data structures (2)
Can succinct data structures replace classic uncompressed data structures in practice?
- Less memory ⇒ fewer CPU cycles!?
- Less memory ⇒ lower cost!?
Problems:
- In theory: develop succinct data structures.
- In practice: the constants in the O(1)-time terms are large; the o(n)-space term is not negligible; complex data structures are hard to implement.
Succinct data structures (2): Less memory ⇒ fewer CPU cycles!?
Approximate capacity and access latency per level of the memory hierarchy:
Level      Capacity       Access latency (CPU cycles)
CPU        ≈ 100 B        ≈ 1
L1-Cache   ≈ 10 KB        ≈ 5
L2-Cache   ≈ 512 KB       ≈ 10-20
L3-Cache   ≈ 1-8 MB       ≈ 20-100
DRAM       ≈ 4 GB         ≈ 100-500
Disk       ≈ x · 100 GB   ≈ 10^6
Succinct data structures (2): Less memory ⇒ lower cost!?
Pricing of Amazon's Elastic Compute Cloud (EC2) service in July 2011:
Instance name                       Main memory   Price per hour
Micro                               613.0 MB      0.02 US$
High-Memory Quadruple Extra Large   68.4 GB       2.00 US$
The classic index data structure: the suffix tree (ST)
Let T be a text of length n over an alphabet Σ of size σ.
Suffix tree:
- index data structure for T (construction in O(n) time)
- can be used to solve many problems in optimal time complexity, e.g. in bioinformatics and data compression
- uses O(n log n) bits! In practice (ASCII alphabet) ≥ 17 times the size of T
- cannot handle "The Attack of Massive Data": DNA sequencing data (NGS), ...
Example: ST of T = umulmundumulmum$
n = 16, Σ = {$, d, l, m, n, u}, σ = 6
[Figure: suffix tree of T; edges are labeled with substrings of T, leaves with the suffix start positions 0 to 15.]
Classic implementation uses pointers, each of size 4 or 8 bytes!
Example: ST of T = umulmundumulmum$
[Figure: suffix tree of T.]
Operations: root(), is_leaf(v), parent(v), degree(v), child(v, c), select_child(v, i), depth(v), edge(v, d), lca(v, w), sl(v), wl(v, c)
CSTs
Goal of a CST implementation: replace the fastest uncompressed ST implementations in different scenarios:
(a) both fit in RAM and we measure time
(b) both fit in RAM and we measure resource costs
(c) only the CST fits in RAM and we measure time
Proposals which might work for (a) and (b):
- Sadakane's CST (cst_sada)
- Fully Compressed Suffix Tree (Russo et al.)
- CSTs based on interval representation of nodes (Fischer et al., cstY; Ohlebusch et al., cst_sct3)
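To make the interval representation of nodes concrete, here is a minimal Python sketch of a few suffix tree operations in which a node is a pair (lb, rb) of suffix array positions. It is a naive illustration under stated assumptions, not the implementation used in cstY or cst_sct3. The arrays are built by brute force, `depth` scans the LCP array instead of using a range minimum query, and the helper names `common_prefix` and `children` are introduced only for this sketch.

```python
T = "umulmundumulmum$"
n = len(T)

# Suffix array by direct sorting; fine for a 16-character example.
SA = sorted(range(n), key=lambda i: T[i:])


def common_prefix(i, j):
    """Length of the longest common prefix of the suffixes T[i:] and T[j:]."""
    k = 0
    while i + k < n and j + k < n and T[i + k] == T[j + k]:
        k += 1
    return k


# LCP[k] compares the suffixes at SA[k-1] and SA[k]; LCP[0] is 0 by convention.
LCP = [0] + [common_prefix(SA[k - 1], SA[k]) for k in range(1, n)]


def root():
    return (0, n - 1)


def is_leaf(v):
    lb, rb = v
    return lb == rb


def depth(v):
    """String depth: length of the path label from the root to v."""
    lb, rb = v
    if lb == rb:
        return n - SA[lb]
    return min(LCP[lb + 1:rb + 1])


def children(v):
    """Child intervals of v in lexicographic order (empty for a leaf)."""
    lb, rb = v
    if lb == rb:
        return []
    d = depth(v)
    cuts = [k for k in range(lb + 1, rb + 1) if LCP[k] == d]
    starts = [lb] + cuts
    ends = [k - 1 for k in cuts] + [rb]
    return list(zip(starts, ends))


def degree(v):
    return len(children(v))


def child(v, c):
    """Child of v whose edge label starts with character c, or None."""
    d = depth(v)
    for w in children(v):
        if T[SA[w[0]] + d] == c:
            return w
    return None


# Descend along "m" and then "u": the interval size counts occurrences of "mu".
v = child(child(root(), "m"), "u")
print(v, v[1] - v[0] + 1)
```

Because a node is just two integers, no pointers are stored. The cost moves into the shared SA and LCP arrays, which is where the compressed components of a CST come in.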
Big picture of CST design
[Figure: building blocks of a CST. Recoverable labels: NAV, RMQ, excess, 2n bits, CSA, Ψ, LF, LCP, PLCP.]
Big picture of CST design
[Figure: suffix tree of T = umulmundumulmum$ with the three components shown below it.]
SA  = 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5
LCP = 0 0 0 3 0 1 5 2 2 0 0 4 1 2 6 1
BPS = (()()(()())(()((()())()()))()((()())(()(()()))()))
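The suffix array and LCP array shown for T = umulmundumulmum$ can be recomputed with a short Python sketch. The suffix array is built by plain sorting, and the LCP array with the well-known linear-time method of Kasai et al. Both are techniques swapped in here for illustration; the slides do not say how the arrays were constructed.

```python
def suffix_array(text):
    """Suffix array by direct sorting of suffixes (simple, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_kasai(text, sa):
    """LCP array from the text and its suffix array (method of Kasai et al.).

    lcp[k] is the length of the longest common prefix of the suffixes
    starting at sa[k-1] and sa[k]; lcp[0] is defined as 0.
    """
    n = len(text)
    rank = [0] * n                 # inverse suffix array
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):             # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]    # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1             # suffix i+1 shares at least h-1 characters
        else:
            h = 0
    return lcp


T = "umulmundumulmum$"
sa = suffix_array(T)
lcp = lcp_kasai(T, sa)
print(sa)
print(lcp)
```

Plain sorting works here because `$` is smaller than every letter in ASCII, which matches the alphabet order $ < d < l < m < n < u used in the example.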
Example: Compressing NAV
[Figure: suffix tree of T = umulmundumulmum$.]
Tree uncompressed: O(n log n) bits
BPS_dfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
Compressed: 4n bits
Example: Compressing NAV (construction of BPS_dfs)
The sequence is written during a depth-first traversal of the tree: an opening parenthesis when a node is entered and a closing parenthesis when it is left.
- Enter node 0 (the root): BPS_dfs = (
- Enter node 1 (the leaf labeled 15): BPS_dfs = ((
- Leave node 1: BPS_dfs = (()
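The balanced parentheses sequence of the example can be reproduced with a short Python sketch. It assumes the LCP array printed on the "Big picture" slide and derives the tree shape from it by recursively splitting suffix array intervals at their minimum LCP value. This is a convenient shortcut for a small example, not how a CST library builds its NAV component. The functions `find_close`, `is_leaf` and `subtree_size` are naive linear-scan stand-ins for the constant-time navigation a succinct tree offers.

```python
# LCP array of T = umulmundumulmum$, copied from the "Big picture" slide.
LCP = [0, 0, 0, 3, 0, 1, 5, 2, 2, 0, 0, 4, 1, 2, 6, 1]


def bps_dfs(lcp, lb, rb):
    """Parentheses of the subtree whose leaves are suffix array positions lb..rb."""
    if lb == rb:
        return "()"                      # a leaf
    d = min(lcp[lb + 1:rb + 1])          # string depth of this internal node
    cuts = [k for k in range(lb + 1, rb + 1) if lcp[k] == d]
    starts = [lb] + cuts
    ends = [k - 1 for k in cuts] + [rb]
    inner = "".join(bps_dfs(lcp, s, e) for s, e in zip(starts, ends))
    return "(" + inner + ")"             # enter node, visit children, leave node


def find_close(bps, i):
    """Index of the ')' matching the '(' at index i (linear scan)."""
    excess = 0
    for k in range(i, len(bps)):
        excess += 1 if bps[k] == "(" else -1
        if excess == 0:
            return k
    raise ValueError("unbalanced sequence")


def is_leaf(bps, i):
    """A node is a leaf if its '(' is immediately followed by ')'."""
    return bps[i + 1] == ")"


def subtree_size(bps, i):
    """Number of nodes in the subtree rooted at the '(' at index i."""
    return (find_close(bps, i) - i + 1) // 2


bps = bps_dfs(LCP, 0, len(LCP) - 1)
print(bps)
print(len(bps), subtree_size(bps, 0))
```

The example tree has 25 nodes, so the sequence takes 50 bits. Since a suffix tree with n leaves has fewer than 2n nodes, two bits per node stays within the 4n bits quoted on the slide (64 bits for n = 16).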