In-Place (Bijective) BWT Transforms Dominik Köppl Kyushu University Daiki Hashimoto Tohoku University Diptarama Ayumi Shinohara
data structures
● Burrows-Wheeler Transform (BWT) [Burrows, Wheeler '94]
● Bijective BWT (BBWT) [Gil, Scott '12]
BWT of bacabbabb
T = bacabbabb$
● list all suffixes of T, each with its preceding character (the first suffix wraps around to $)
● align left and sort the suffixes lexicographically ($ is the smallest character)
● BWT = the preceding characters read in sorted order

  prev. char   suffix (lex. order)
  b            $
  b            abb$
  c            abbabb$
  b            acabbabb$
  b            b$
  b            babb$
  $            bacabbabb$
  a            bb$
  a            bbabb$
  a            cabbabb$

BWT(T) = bbcbbb$aaa
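The construction on this slide can be reproduced with a few lines of Python. This is a minimal sketch for illustration only: it sorts explicit suffix copies, so it is neither in-place nor efficient, and the function name `bwt_naive` is an assumption made here, not part of the talk.

```python
def bwt_naive(text, sentinel="$"):
    """BWT of text + sentinel by sorting all suffixes (illustration only)."""
    t = text + sentinel
    # Sort the suffix start positions by the suffix they start.
    # '$' is smaller than any lowercase letter in ASCII, so plain
    # string comparison matches the order used on the slide.
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # Output the character preceding each suffix; for i = 0 the
    # index -1 wraps around to the sentinel.
    return "".join(t[i - 1] for i in order)


print(bwt_naive("bacabbabb"))  # bbcbbb$aaa
```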
the BBWT is the BWT of the Lyndon factorization of an input text with respect to ≺_ω
1. Lyndon factorization
2. ≺_ω order
Lyndon words
● a Lyndon word is smaller than
  – any of its proper suffixes
  – any of its other rotations
● examples: a, aabab
● not Lyndon words:
  – abaab (the rotation aabab is smaller)
  – abab (abab is not smaller than its suffix ab)
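The suffix characterization above translates directly into a check. The sketch below is a quadratic-time illustration, and the name `is_lyndon` is chosen here for the example.

```python
def is_lyndon(w):
    """True if the non-empty string w is smaller than all its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


for word in ["a", "aabab", "abaab", "abab"]:
    print(word, is_lyndon(word))
```

For the four words from the slide this reports `True` for a and aabab, and `False` for abaab and abab.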
Lyndon factorization [Chen+ '58]
● input: text T
● output: factorization T = T_1 T_2 ⋯ T_t with
  – each T_x is a Lyndon word
  – T_x ≥_lex T_{x+1}
● the factorization is uniquely defined (Chen-Fox-Lyndon theorem)
● computable in linear time [Duval '88]
example
T = bacabbabb
Lyndon factorization: b|ac|abb|abb
– b, ac, abb, and abb are Lyndon words
– b >_lex ac >_lex abb ≥_lex abb
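The factorization in this example can be computed with the standard textbook formulation of Duval's algorithm. The sketch below is one common way to write it; the function name `lyndon_factorization` is an assumption for this example, and the factors are returned as copies rather than as positions.

```python
def lyndon_factorization(t):
    """Lyndon factors of t, left to right (Duval's algorithm, linear time)."""
    factors = []
    i, n = 0, len(t)
    while i < n:
        # t[i:j] is a power of a Lyndon word of length j - k,
        # possibly followed by a proper prefix of that word.
        j, k = i + 1, i
        while j < n and t[k] <= t[j]:
            # A larger character restarts the comparison at i;
            # an equal character continues the periodic run.
            k = i if t[k] < t[j] else k + 1
            j += 1
        # Emit every full repetition of the Lyndon word found.
        while i <= k:
            factors.append(t[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("bacabbabb"))  # ['b', 'ac', 'abb', 'abb']
```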
≺_ω order
● u ≺_ω w :⟺ uuuu⋯ <_lex wwww⋯
● example: ab <_lex aba, but aba ≺_ω ab, because abaabaaba⋯ <_lex abababab⋯
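The infinite strings in the definition never have to be materialized. A minimal sketch, assuming non-empty inputs: by the periodicity lemma of Fine and Wilf, two infinite powers that agree on their first |u| + |w| characters are equal, so comparing prefixes of that length is enough. The name `omega_less` is chosen here for the example.

```python
def omega_less(u, w):
    """True if u ≺_ω w, i.e. uuu... is lexicographically smaller than www...

    Assumes u and w are non-empty.
    """
    m = len(u) + len(w)
    # Prefixes of length m of the infinite powers of u and w.
    prefix_u = (u * (m // len(u) + 1))[:m]
    prefix_w = (w * (m // len(w) + 1))[:m]
    return prefix_u < prefix_w


print("ab" < "aba")              # True: ab <_lex aba
print(omega_less("aba", "ab"))   # True: aba ≺_ω ab
```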
BBWT of bacabbabb
Lyndon factorization: b|ac|abb|abb
● collect all rotations of every Lyndon factor
  – b: b
  – ac: ac, ca
  – abb (twice): abb, bba, bab
● sort the rotations with respect to ≺_ω
● BBWT = the last characters of the sorted rotations

  rotation (≺_ω order)   last char
  abb                    b
  abb                    b
  ac                     c
  bab                    b
  bab                    b
  bba                    a
  bba                    a
  b                      b
  ca                     a

BBWT(T) = bbcbbaaba
BWT(T$) = bbcbbb$aaa
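The definition on this slide can be executed literally: factorize, list all rotations, sort them by their infinite powers, and read off the last characters. The sketch below does exactly that and is meant only as a reference for small inputs. It uses O(n^2) extra space, so it is the opposite of in-place. The names `lyndon_factorization` and `bbwt_naive` are assumptions for this example.

```python
def lyndon_factorization(t):
    """Lyndon factors of t (Duval's algorithm)."""
    factors, i, n = [], 0, len(t)
    while i < n:
        j, k = i + 1, i
        while j < n and t[k] <= t[j]:
            k = i if t[k] < t[j] else k + 1
            j += 1
        while i <= k:
            factors.append(t[i:i + j - k])
            i += j - k
    return factors


def bbwt_naive(t):
    """BBWT of t by sorting all rotations of all Lyndon factors in ≺_ω order."""
    n = len(t)
    rows = []
    for factor in lyndon_factorization(t):
        for r in range(len(factor)):
            rot = factor[r:] + factor[:r]
            # A prefix of length 2n of the infinite power of rot is a
            # sufficient sort key, since any two rotations together have
            # at most 2n characters.
            key = (rot * (2 * n // len(rot) + 1))[:2 * n]
            rows.append((key, rot[-1]))
    rows.sort(key=lambda row: row[0])
    return "".join(last for _, last in rows)


print(bbwt_naive("bacabbabb"))  # bbcbbaaba
```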
motivation
properties of the BBWT:
● no $ necessary
● BBWT is more compressible than BWT for various inputs [Gil, Scott '12]
● BBWT is indexable (full-text index)
● computable in O(n) time with O(n) words [Bannai+ '19]
however, O(n) words can be too much for large n
in-place computation
● Σ: alphabet, σ := |Σ|: alphabet size
● T: text, n := |T|
● L: workspace of n lg σ bits, initially storing the text (e.g., L = T := bacabbabb)
● aim: in-place transforms T ↔ BWT ↔ BBWT with |L| + O(lg n) bits of workspace
known solutions

  input   output   workspace        time                  reference
  text    BWT      in-place         O(n^2)                Crochemore+ '15
  BWT     text     in-place         O(n^{2+ε})            Crochemore+ '15
  text    BBWT     O(n lg σ) bits   O(n lg n / lg lg n)   Bonomo+ '14

σ: alphabet size, n: text length, ε: a constant with 0 < ε < 1
in-place conversions
● text → BWT: O(n^2), BWT → text: O(n^{2+ε}) (known)
● text ↔ BBWT: O(n^2) in one direction, O(n^{2+ε}) in the other
● BWT ↔ BBWT: O(n^{2+ε})
working space: n lg σ + O(lg n) bits (including text)
forward search
T = bacabbabb$
● F: first characters of the sorted suffixes, L: BWT(T)
● can calculate with rank and select on F and L

  F.rank_{F[i]}(i)   F   L   L.rank_{L[i]}(i)
  1                  $   b   1
  1                  a   b   2
  2                  a   c   1
  3                  a   b   3
  1                  b   b   4
  2                  b   b   5
  3                  b   $   1
  4                  b   a   1
  5                  b   a   2
  1                  c   a   3

FL mapping: FL(i) = L.select_{F[i]}(F.rank_{F[i]}(i))
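To see the FL mapping at work, the sketch below follows it from the row whose L entry is $ and reads T from left to right out of F. It is an illustration under simplifying assumptions: F is materialized with `sorted`, which is not in-place, indices are 0-based while ranks are 1-based counts, and the function names are hypothetical.

```python
def fl_mapping(L, i):
    """FL(i) = L.select_{F[i]}(F.rank_{F[i]}(i)) with 0-based row indices."""
    F = sorted(L)                   # explicit F column (not in-place)
    c = F[i]
    rank = F[:i + 1].count(c)       # occurrences of c in F[0..i]
    seen = 0
    for j, ch in enumerate(L):      # select: position of the rank-th c in L
        if ch == c:
            seen += 1
            if seen == rank:
                return j
    raise ValueError("L is not a permutation of F")


def text_from_bwt_forward(L, sentinel="$"):
    """Recover T (without the sentinel) from L = BWT(T$), reading left to right."""
    F = sorted(L)
    i = L.index(sentinel)           # row of the whole text T$
    out = []
    for _ in range(len(L) - 1):
        out.append(F[i])
        i = fl_mapping(L, i)
    return "".join(out)


print(text_from_bwt_forward("bbcbbb$aaa"))  # bacabbabb
```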
backward search
T = bacabbabb$, F = $aaabbbbbc, L = bbcbbb$aaa
FM index [Ferragina, Manzini '00]
LF mapping:
LF(i) := F.select_{L[i]}(L.rank_{L[i]}(i))
       = F.select_{L[i]}(1) + L.rank_{L[i]}(i) - 1
       = |{j : L[j] < L[i]}| + L.rank_{L[i]}(i)
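The last line of the derivation needs only L, which is what makes the LF mapping attractive when no extra space is available. A minimal sketch with 0-based indices follows (so the 1-based formula is shifted by one). The names `lf` and `invert_bwt` are assumptions for this example, and each call scans L in O(n) time.

```python
def lf(L, i):
    """LF(i) = |{j : L[j] < L[i]}| + L.rank_{L[i]}(i), shifted to 0-based rows."""
    c = L[i]
    smaller = sum(1 for ch in L if ch < c)          # |{j : L[j] < L[i]}|
    rank = sum(1 for ch in L[:i + 1] if ch == c)    # occurrences of c in L[0..i]
    return smaller + rank - 1


def invert_bwt(L, sentinel="$"):
    """Recover T (without the sentinel) from L = BWT(T$), reading right to left."""
    i = 0                       # row 0 is the suffix "$"; L[0] is the last char of T
    out = []
    for _ in range(len(L) - 1):
        out.append(L[i])
        i = lf(L, i)
    return "".join(reversed(out))


print(invert_bwt("bbcbbb$aaa"))  # bacabbabb
```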
LF: time complexity
If we store BWT(T) in L:
– L[i] = BWT[i]: O(1) time ⇒ for any c: L.rank_c(i) in O(n) time
– LF(i) = |{j : L[j] < L[i]}| + L.rank_{L[i]}(i), where each of the two terms takes O(n) time
FL: time complexity
● FL(i) = L.select_{F[i]}(F.rank_{F[i]}(i)) = L.select_{F[i]}(i - |{j : L[j] < F[i]}|)
● if we know F[i]: FL(i) in O(n) time
● however, the fastest in-place computation of F[i] takes O(n^{1+ε}) time [Munro, Raman '96] for any constant ε with 0 < ε < 1
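The two steps can be sketched without storing F. Note the swapped-in technique: `f_at` below finds F[i] by a naive quadratic-time scan with constant extra space, not by the selection algorithm of Munro and Raman cited above; once F[i] is known, `fl` needs a single O(n) pass. Indices are 0-based and both function names are hypothetical.

```python
def f_at(L, i):
    """F[i]: the (i+1)-th smallest character of L, found without sorting L.

    Naive O(n^2)-time scan; a stand-in for a faster in-place selection.
    """
    for c in L:
        smaller = sum(1 for ch in L if ch < c)
        not_larger = sum(1 for ch in L if ch <= c)
        if smaller <= i < not_larger:
            return c
    raise IndexError("row index out of range")


def fl(L, i):
    """FL(i) = L.select_{F[i]}(i - |{j : L[j] < F[i]}|) with 0-based indices."""
    c = f_at(L, i)
    k = i - sum(1 for ch in L if ch < c)    # 0-based rank of row i among the c's
    for j, ch in enumerate(L):
        if ch == c:
            if k == 0:
                return j
            k -= 1
    raise ValueError("inconsistent input")


print([fl("bbcbbb$aaa", i) for i in range(10)])
# [6, 7, 8, 9, 0, 1, 3, 4, 5, 2]
```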
road map
1. text ↔ BBWT: O(n^2) in one direction, O(n^{2+ε}) in the other
2. BWT ↔ BBWT: O(n^{2+ε})
working space: n lg σ + O(lg n) bits (including text)
text → BBWT [Bonomo+ '14]
for each Lyndon factor T_x with x = 1 up to t:
    prepend T_x[|T_x|] to BBWT
    p ← 1 (insert position in BBWT)
    for each i = |T_x| - 1 down to 1:
        p ← LF(p) + 1
        insert T_x[i] at BBWT[p]
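The pseudocode above can be tried out directly. The sketch below follows it step by step with 0-based indices, so that the slide's p ← LF(p) + 1 becomes the sum of the two counts. It is an illustration under assumptions: the BBWT is kept in a Python list and the factors are copied, so it is not in-place, and the function names are chosen here.

```python
def lyndon_factorization(t):
    """Lyndon factors of t (Duval's algorithm)."""
    factors, i, n = [], 0, len(t)
    while i < n:
        j, k = i + 1, i
        while j < n and t[k] <= t[j]:
            k = i if t[k] < t[j] else k + 1
            j += 1
        while i <= k:
            factors.append(t[i:i + j - k])
            i += j - k
    return factors


def bbwt_by_insertion(t):
    """BBWT of t, built factor by factor with LF-guided insertions."""
    bbwt = []
    for factor in lyndon_factorization(t):
        bbwt.insert(0, factor[-1])          # prepend T_x[|T_x|]
        p = 0                               # insert position (0-based)
        for ch in reversed(factor[:-1]):    # i = |T_x| - 1 down to 1
            c = bbwt[p]
            smaller = sum(1 for x in bbwt if x < c)
            rank = sum(1 for x in bbwt[:p + 1] if x == c)
            p = smaller + rank              # 0-based form of p <- LF(p) + 1
            bbwt.insert(p, ch)              # insert T_x[i] at BBWT[p]
    return "".join(bbwt)


print(bbwt_by_insertion("bacabbabb"))  # bbcbbaaba
```

Running it on the prefixes b|ac and b|ac|abb gives cba and bcbaba.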
text → BBWT
T = bacabbabb
● Lyndon factorization: b|ac|abb|abb
● first: insert b

  F   L
  b   b

● how to calculate the final F and L?

  F.rank   F   L   L.rank
  1        a   b   1
  2        a   b   2
  3        a   c   1
  1        b   b   3
  2        b   b   4
  3        b   a   1
  4        b   a   2
  5        b   b   5
  1        c   a   3
BBWT(T_1 T_2)
T = b|ac|abb|abb = T_1 T_2 T_3 T_4
● next Lyndon factor: ac

  BBWT(T_1)     after prepending c     after inserting a
  F   L         F   L                  F   L
  b   b         b   c                  a   c
                c   b                  b   b
                                       c   a
BBWT(T_1 T_2 T_3)
T = b|ac|abb|abb
● next Lyndon factor: abb

  BBWT(T_1 T_2)     after prepending b
  F   L             F   L
  a   c             a   b
  b   b             b   c
  c   a             b   b
                    c   a