Burrows–Wheeler Transform


  1. Burrows–Wheeler Transform
     Idea: The Burrows–Wheeler Transform (BWT) produces from a string S of length n a new string T of length n, and so is not a compression method in itself. Compression algorithms can be based on the idea that it is easier to compress T than S; the popular bzip utilities are examples.
     End-of-string assumption: We assume that S[n] is a special character $ that occurs nowhere else in S and is greatest in a sorted ordering of the characters of S. This assumption is not necessary but simplifies our presentation.

  2. Matrix Definition of the BWT
     Let M(S) be the matrix of all cyclic rotations of S, listed in lexicographic order. We refer to M as the Burrows–Wheeler matrix for S, and the Burrows–Wheeler Transform of S, denoted BWT(S), is the last column of M.
     Example:

              1 2 3 4 5 6 7 8 9 10
         S =  b r a t a t b a t $

              1 2 3 4 5 6 7 8 9 10
          1   a t a t b a t $ b r
          2   a t b a t $ b r a t
          3   a t $ b r a t a t b
          4   b a t $ b r a t a t
          5   b r a t a t b a t $
          6   r a t a t b a t $ b
          7   t a t b a t $ b r a
          8   t b a t $ b r a t a
          9   t $ b r a t a t b a
         10   $ b r a t a t b a t

     Reading column 10 from top to bottom gives BWT(S) = r t b t $ b a a a t.
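The matrix definition can be sketched directly in Python. This is a minimal illustration, not an efficient implementation: it materializes all n rotations. The names `sort_key`, `bwt_matrix`, and `bwt` are chosen here for illustration and do not come from the slides. The sort key encodes the slides' assumption that $ is greater than every other character.

```python
def sort_key(text):
    # Compare character by character, treating '$' as the greatest character
    # (the end-of-string assumption used in the slides).
    return [(ch == "$", ch) for ch in text]


def bwt_matrix(s):
    """Return M(S): all cyclic rotations of s in sorted order."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    return sorted(rotations, key=sort_key)


def bwt(s):
    """Return BWT(S): the last column of M(S)."""
    return "".join(row[-1] for row in bwt_matrix(s))


if __name__ == "__main__":
    for row in bwt_matrix("bratatbat$"):
        print(" ".join(row))
    print(bwt("bratatbat$"))  # rtbt$baaat
```

Because n rows of n characters are built and sorted, this sketch uses quadratic space and is suitable only for small examples.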

  3. Suffix Trie Definition of the BWT
     • The Burrows–Wheeler Transform of S, BWT(S), is a new string of exactly n characters that is a permutation of the characters of S corresponding to a pre-order traversal of the suffix trie of S (assuming children are visited in sorted order).
     • Specifically, perform a pre-order traversal of the suffix trie and record the suffix start positions found at the leaves, in the order visited. Subtract 1 from each index (with the convention that 0 = n), and then list the characters of S at the resulting positions.

  4. Example: The string S = bratatbat$ is shown with its positions labeled from 1 through 10, and below it the corresponding suffix trie.

              1 2 3 4 5 6 7 8 9 10
         S =  b r a t a t b a t $

     [Figure: suffix trie of S, with leaves labeled by the suffix start positions 1 through 10]

     Traversal in pre-order gives the sequence:
         3 5 8 7 1 2 4 6 9 10
     Subtracting 1 from each number (where 0 maps back to 10) gives
         2 4 7 6 10 1 3 5 8 9
     which corresponds to the string:
         r t b t $ b a a a t
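The index arithmetic in this example can be checked with a short sketch. As a stated assumption, the code does not build a suffix trie: it sorts the suffixes directly, which visits them in the same order as a pre-order traversal with children in sorted order, but it is not linear time. The function name `bwt_from_suffix_order` is illustrative only.

```python
def sort_key(text):
    # '$' is treated as the greatest character, as in the slides.
    return [(ch == "$", ch) for ch in text]


def bwt_from_suffix_order(s):
    """Return (suffix order, shifted positions, BWT string), all 1-based."""
    n = len(s)
    # Start positions of the suffixes in sorted order; this stands in for
    # the leaf order of a pre-order traversal of the suffix trie.
    order = sorted(range(1, n + 1), key=lambda i: sort_key(s[i - 1:]))
    # Subtract 1 from each position, with 0 mapping back to n.
    shifted = [i - 1 if i > 1 else n for i in order]
    return order, shifted, "".join(s[p - 1] for p in shifted)


if __name__ == "__main__":
    order, shifted, text = bwt_from_suffix_order("bratatbat$")
    print(order)    # [3, 5, 8, 7, 1, 2, 4, 6, 9, 10]
    print(shifted)  # [2, 4, 7, 6, 10, 1, 3, 5, 8, 9]
    print(text)     # rtbt$baaat
```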

  5. Although the suffix trie and matrix definitions of the BWT are equivalent, when considering efficient implementation of the BWT it will be convenient to use sometimes one and sometimes the other when motivating specific algorithms.

  6. Intuition:
     • By examining the two equivalent definitions of the BWT, one can gain an intuition as to why it may be more straightforward to compress BWT(S) than S.
     • We are already familiar with the idea of using a context to predict the next character.
     • For example, after seeing "elephan" in English text, we know that with high likelihood the next character is a "t".
     • BWT(S) clusters symbols according to their context, so that runs of identical symbols and runs of symbols drawn from a small subset occur often within BWT(S).
     • Thus it is more straightforward to compress BWT(S) than S.

  7. Prefix v. Suffix Contexts
     • The contexts that are effectively employed by the BWT are the strings that follow a character, rather than the ones that precede it, as is the case, for example, with PPM methods.
     • If we want to reflect preceding contexts, we can simply compute BWT(S^R), where S^R denotes the string that is S reversed except for the final $, which we leave at the right end.
     Note: It is not actually necessary to reverse S. Instead, we could define the sorting of the rows of the matrix to be done by visiting the characters of a row from the second to last back to the first.
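Both routes to preceding contexts can be sketched as follows. This is an illustration under the slides' $-is-greatest assumption, using the quadratic rotation-sorting approach; the names `reverse_keep_sentinel` and `bwt_preceding_context` are hypothetical. The first function builds S^R, and the second instead sorts the rotations of S by reading each row from its second-to-last character back to its first.

```python
def sort_key(text):
    # '$' is treated as the greatest character, as in the slides.
    return [(ch == "$", ch) for ch in text]


def bwt(s):
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    return "".join(row[-1] for row in sorted(rotations, key=sort_key))


def reverse_keep_sentinel(s):
    """Return S^R: s reversed, except that the final '$' stays at the end."""
    if not s.endswith("$"):
        raise ValueError("expected a string ending in '$'")
    return s[:-1][::-1] + "$"


def bwt_preceding_context(s):
    """Sort rotations of s by reading each row from second-to-last to first."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    ordered = sorted(rotations, key=lambda row: sort_key(row[-2::-1]))
    return "".join(row[-1] for row in ordered)


if __name__ == "__main__":
    s = "bratatbat$"
    print(reverse_keep_sentinel(s))  # tabtatarb$
    print(bwt(reverse_keep_sentinel(s)) == bwt_preceding_context(s))  # True
```

The two results agree because each rotation of S^R is a rotation of S with its first n-1 characters read backwards and its last character kept in place.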

  8. O(n) computation of BWT(S)
     For a string S of length n, the straightforward computation of BWT(S) based on the Burrows–Wheeler matrix takes time and space at least proportional to n^2, since M has n^2 entries. For an O(n) computation, we can employ the suffix trie definition:
     • Construct a suffix trie for S in such a way that the children of a node are accessed in lexicographic order.
     • Traverse the leaves of the suffix trie in pre-order and output S[i–1] at the leaf for position i (except output S[n] when i = 1).
     Assuming that the alphabet size is constant with respect to n, the suffix trie construction is linear time and space, and so is the pre-order traversal, for a total of O(n) time and space.

  9. Computation of the inverse BWT
     Given the index q of the row in the BWT matrix M that contains S, S can be recovered from BWT(S) in linear time.
     Computing q, the index of S in M, from BWT(S): If $ is the q-th character in BWT(S), then S is the q-th row in M.
     Computing the first column of M from BWT(S):
     • We already have the last column (since it is BWT(S)).
     • The first column, which we denote by F[1]...F[n], is just the characters of S listed in sorted order (each character occurs in a block of positions in F).
     • Since the characters of BWT(S) are the same as the characters of S, we can simply sort the characters of BWT(S) to get F.
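These two steps can be sketched in a few lines. The function names `row_of_s` and `first_column` are illustrative, and the key function encodes the assumption that $ sorts after every other character.

```python
def char_key(ch):
    # '$' is treated as the greatest character, as in the slides.
    return (ch == "$", ch)


def row_of_s(bwt_text):
    """Return q (1-based): the row of M that holds S, i.e. where '$' sits in BWT(S)."""
    return bwt_text.index("$") + 1


def first_column(bwt_text):
    """Return F: the characters of BWT(S) in sorted order."""
    return "".join(sorted(bwt_text, key=char_key))


if __name__ == "__main__":
    t = "rtbt$baaat"
    print(row_of_s(t))      # 5
    print(first_column(t))  # aaabbrttt$
```

A comparison sort is used here for brevity; for a constant-size alphabet, a counting sort would keep this step linear.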

  10. The Inverse BWT is Well Defined
     • Let M_1 = F, and let M_2 be the matrix of two columns that is formed by placing BWT(S) in the first column, F in the second column, and then rearranging the rows in sorted order.
     • Then M_2 lists all pairs of characters of S in sorted order, and so it is the first 2 columns of M.
     • M_3, the first 3 columns of M, is formed by prepending the column BWT(S) to M_2 and sorting the rows.
     • This process can be continued until we have M_n = M.
     • We can then read the q-th row of M to recover S.
     It is not very practical to actually construct M; we shall see how to avoid it.

  11. Example: In our previous example of S = bratatbat$, to form M_2:

         form (BWT(S) F)          sort to get M_2
              1 2                      1 2
          1   r a                  1   a t
          2   t a                  2   a t
          3   b a                  3   a t
          4   t b                  4   b a
          5   $ b                  5   b r
          6   b r                  6   r a
          7   a t                  7   t a
          8   a t                  8   t b
          9   a t                  9   t $
         10   t $                 10   $ b

     and to form M_3:

         form (BWT(S) M_2)        sort to get M_3
              1 2 3                    1 2 3
          1   r a t                1   a t a
          2   t a t                2   a t b
          3   b a t                3   a t $
          4   t b a                4   b a t
          5   $ b r                5   b r a
          6   b r a                6   r a t
          7   a t a                7   t a t
          8   a t b                8   t b a
          9   a t $                9   t $ b
         10   t $ b               10   $ b r
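The prepend-and-sort construction of M_1, M_2, ..., M_n can be sketched as below. As the slides note, building M this way is not practical for long strings; the sketch exists only to make the construction concrete. The names `first_k_columns` and `inverse_bwt_naive` are illustrative.

```python
def sort_key(text):
    # '$' is treated as the greatest character, as in the slides.
    return [(ch == "$", ch) for ch in text]


def first_k_columns(bwt_text, k):
    """Return M_k, the first k columns of M, as a list of row strings."""
    n = len(bwt_text)
    rows = [""] * n
    for _ in range(k):
        # Prepend the BWT column to the current rows, then sort the rows.
        rows = sorted((bwt_text[i] + rows[i] for i in range(n)), key=sort_key)
    return rows


def inverse_bwt_naive(bwt_text):
    """Recover S by building all of M and reading the row that ends in '$'."""
    q = bwt_text.index("$")  # 0-based index of the row holding S
    return first_k_columns(bwt_text, len(bwt_text))[q]


if __name__ == "__main__":
    t = "rtbt$baaat"
    print(first_k_columns(t, 2))
    print(first_k_columns(t, 3))
    print(inverse_bwt_naive(t))  # bratatbat$
```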

  12. Lemma: Let S be a string ending in a special character $ that occurs nowhere else in S, and suppose a character c occurs at more than one position in S. Let c_1 and c_2 denote any two of these occurrences of c. Then c_1 comes before c_2 in F(S) if and only if c_1 comes before c_2 in BWT(S).
     Proof:
     • Suppose that in F(S), c_1 is at position i and c_2 is at position j, and suppose that in BWT(S), c_1 is at position x and c_2 is at position y.
     • Let I and J denote the i-th and j-th rows of M less their first characters.
     • Then i < j implies I < J (I ≠ J since S ends in $), which implies that Ic < Jc. Since Ic and Jc are the rows of M that end in c_1 and c_2, that is, rows x and y, this implies that x < y. Symmetric reasoning applies for i > j.
     • For the reverse direction, let X and Y be the x-th and y-th rows of M less their last characters.
     • Then x < y implies X < Y (X ≠ Y since S ends in $), which implies that cX < cY. Since cX and cY are rows i and j of M, this implies that i < j. Symmetric reasoning applies for x > y.

  13. Example: Using again S = bratatbat$, we can check that the three a's occur in the same order in columns 1 and 10, the two b's occur in the same order in columns 1 and 10, and the three t's occur in the same order in columns 1 and 10.

              1 2 3 4 5 6 7 8 9 10
         S =  b r a t a t b a t $

              1 2 3 4 5 6 7 8 9 10
          1   a t a t b a t $ b r
          2   a t b a t $ b r a t
          3   a t $ b r a t a t b
          4   b a t $ b r a t a t
          5   b r a t a t b a t $
          6   r a t a t b a t $ b
          7   t a t b a t $ b r a
          8   t b a t $ b r a t a
          9   t $ b r a t a t b a
         10   $ b r a t a t b a t
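The same check can be done mechanically by tagging every character of columns 1 and 10 with its position in S. The sketch below is an illustration only; the function name `occurrence_orders` is hypothetical. For each character it records the positions in S of its copies, top to bottom, once for the first column and once for the last. The lemma says the two records agree.

```python
def sort_key(text):
    # '$' is treated as the greatest character, as in the slides.
    return [(ch == "$", ch) for ch in text]


def occurrence_orders(s):
    """Map each character to the S-positions (1-based) of its copies,
    listed top to bottom, in the first and in the last column of M."""
    n = len(s)
    # 0-based start position of the rotation in each row of M.
    starts = sorted(range(n), key=lambda i: sort_key(s[i:] + s[:i]))
    first_col, last_col = {}, {}
    for p in starts:
        first_col.setdefault(s[p], []).append(p + 1)
        last = (p - 1) % n  # the last character of the row precedes its first
        last_col.setdefault(s[last], []).append(last + 1)
    return first_col, last_col


if __name__ == "__main__":
    f, l = occurrence_orders("bratatbat$")
    print(f["a"], l["a"])  # [3, 5, 8] [3, 5, 8]
    print(f["b"], l["b"])  # [7, 1] [7, 1]
    print(f["t"], l["t"])  # [4, 6, 9] [4, 6, 9]
```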

  14. Idea:
     • Trace between BWT(S) and F(S) to discover the characters of S.
     • At stage 1, we know that S[1] = F[q].
     • Now suppose that F[q] is the k-th copy of that character in F. By the lemma, the same occurrence is the k-th copy of that character in BWT(S); suppose it is the r-th character of BWT(S).
     • Then we set q = r and set S[2] = F[q].
     • Now we can repeat the same process for S[3], and so on.

  15. Example:

              1 2 3 4 5 6 7 8 9 10
          1   a t a t b a t $ b r
          2   a t b a t $ b r a t
          3   a t $ b r a t a t b
          4   b a t $ b r a t a t
          5   b r a t a t b a t $
          6   r a t a t b a t $ b
          7   t a t b a t $ b r a
          8   t b a t $ b r a t a
          9   t $ b r a t a t b a
         10   $ b r a t a t b a t

     Stage 1: S[1] = F[q] = F[5] = b. We see that this is the second b in F. The second b in BWT(S) is at the end of row 6. Set q = 6.
     Stage 2: S[2] = F[q] = F[6] = r. We see that this is the first r in F. The first r in BWT(S) is at the end of row 1. Set q = 1.
     Stage 3: S[3] = F[q] = F[1] = a. We see that this is the first a in F. The first a in BWT(S) is at the end of row 7. Set q = 7.
     Stage 4: S[4] = F[q] = F[7] = t. We see that this is the first t in F. The first t in BWT(S) is at the end of row 2. Set q = 2.
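The tracing procedure can be sketched without ever building M. The function name `inverse_bwt` and its helper tables are illustrative and not taken from the slides. The code assumes $ sorts last and uses a comparison sort to obtain F, so it runs in O(n log n) rather than linear time; a counting sort over a constant-size alphabet would remove the logarithmic factor.

```python
def char_key(ch):
    # '$' is treated as the greatest character, as in the slides.
    return (ch == "$", ch)


def inverse_bwt(bwt_text):
    """Recover S from BWT(S) by tracing between F and BWT(S)."""
    n = len(bwt_text)
    f = sorted(bwt_text, key=char_key)

    # rank_in_f[i] = k when f[i] is the k-th copy of its character in F.
    seen, rank_in_f = {}, []
    for ch in f:
        seen[ch] = seen.get(ch, 0) + 1
        rank_in_f.append(seen[ch])

    # index_in_bwt[(c, k)] = 0-based index of the k-th copy of c in BWT(S).
    seen, index_in_bwt = {}, {}
    for i, ch in enumerate(bwt_text):
        seen[ch] = seen.get(ch, 0) + 1
        index_in_bwt[(ch, seen[ch])] = i

    q = bwt_text.index("$")  # row of M that holds S (0-based)
    out = []
    for _ in range(n):
        ch = f[q]
        out.append(ch)                         # S[stage] = F[q]
        q = index_in_bwt[(ch, rank_in_f[q])]   # jump to the same copy in BWT(S)
    return "".join(out)


if __name__ == "__main__":
    print(inverse_bwt("rtbt$baaat"))  # bratatbat$
```

Continuing the stages of the example past stage 4 with this procedure visits rows 5, 6, 1, 7, 2, 8, 4, 3, 9, 10 in turn (1-based) and spells out S.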
