Universal Sequence Maps of Arbitrary Discrete Sequences
By Almeida and Vinga
Presented by Chris Standish (chriss@cs.tamu.edu)
1 Motivation
• Sequence alignment techniques assume that contiguity is conserved between homologous segments.
• This assumption is violated by genetic processes such as:
  - recombination: the exchange of regions of the genome between paired homologous chromosomes during meiosis (crossover).
  - genome shuffling: an experimental technique that allows recombination between multiple parents, with the goal of improving individual genes.
• Alignment-free techniques attempt to overcome this limitation.
The alignment-free technique we will discuss is called Universal Sequence Maps (USM) [1]. It is based on a sequence representation technique called “Chaos Game Representation” [2].
2 Chaos Game Representation
• Chaos Game Representation (CGR) maps a nucleotide sequence into a continuous two-dimensional space on the unit square, CGR-space.
• CGR has the property that each unique genomic sequence $S = s^{(1)} s^{(2)} \ldots s^{(i)} \ldots$, of any length, has a unique position in CGR-space.
• The iterative function used in CGR is defined by [3]:
$$CGR_j(s^{(0)}) = \tfrac{1}{2}$$
$$CGR_j(s^{(i)}) = CGR_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - CGR_j(s^{(i-1)})\right)$$
where $u_j^{(i)}$ is the $j$th bit of the binary encoding for sequence symbol $s^{(i)}$, and $1 \le j \le 2$.
• This mapping is a bijection and so it is invertible.
3 CGR Binary Encoding
• Each unique symbol in the genomic sequence is encoded in binary.
• For CGR the binary encoding is defined as:
Unit  Code
A     00
T     01
C     10
G     11
Figure 1: CGR Representation of ATGCGAGATGT.
Figure 2: Using quadrants we can recover the original sequence.
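The CGR iteration and the quadrant-based recovery can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and assigning the first code bit to the x axis and the second to the y axis is an assumption, since the slides do not fix the axis order.

```python
# Codes from the CGR encoding table; using the first bit for x and the
# second bit for y is an assumption made for this sketch.
CODES = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}
SYMBOLS = {bits: symbol for symbol, bits in CODES.items()}


def cgr_forward(sequence):
    """Return the CGR point reached after each symbol of the sequence."""
    x, y = 0.5, 0.5  # CGR_j(s^(0)) = 1/2
    points = []
    for symbol in sequence:
        ux, uy = CODES[symbol]
        # Move halfway from the current point toward the symbol's corner.
        x, y = 0.5 * (x + ux), 0.5 * (y + uy)
        points.append((x, y))
    return points


def cgr_recover(point, length):
    """Read `length` symbols back from a single CGR point via quadrants."""
    x, y = point
    recovered = []
    for _ in range(length):
        ux, uy = int(x >= 0.5), int(y >= 0.5)  # quadrant gives the code bits
        recovered.append(SYMBOLS[(ux, uy)])
        x, y = 2 * x - ux, 2 * y - uy  # undo one halving step
    return "".join(reversed(recovered))


points = cgr_forward("ATGCGAGATGT")
print(cgr_recover(points[-1], len(points)))
```

The last point alone is enough to read the sequence back, most recent symbol first, which is the quadrant argument of Figure 2.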
4 Universal Sequence Maps
• USM generalizes the 2-D CGR-space to an $n$-dimensional space, where $n$ depends on the number of unique symbols in the sequence.
• In addition to the “forward” map of CGR, there is also a “backward” map.
• Together, the “forward” and “backward” maps allow us to estimate the length of a similar segment.
4.1 Binary Encoding of Unique Symbols
• As a concrete example, we let these two stanzas of a poem be our two sequences.
    I am a poet. I am very fond of bananas.
    I am of very fond bananas. Am I a poet?
• There are nineteen unique symbols, so we encode each symbol with a 5-bit binary code, where $5 = \lceil \log_2(19) \rceil$.
• Each unique symbol is placed in a unique corner of the 5-dimensional unit hypercube.
4.2 USM Binary Code
Unit     Code    Unit  Code
(space)  00000   m     01010
.        00001   n     01011
?        00010   o     01100
A        00011   p     01101
I        00100   r     01110
a        00101   s     01111
b        00110   t     10000
d        00111   v     10001
e        01000   y     10010
f        01001
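The code table above is consistent with sorting the unique symbols by character code and numbering them in binary. A minimal sketch under that assumption follows; the function name `build_codes` is hypothetical, and the sort order is inferred from the table rather than stated on the slides.

```python
STANZA_1 = "I am a poet. I am very fond of bananas."
STANZA_2 = "I am of very fond bananas. Am I a poet?"


def build_codes(*sequences):
    """Map each unique symbol to an n-bit string, n = ceil(log2(#symbols))."""
    # Assumption: symbols are ordered by character code before numbering.
    symbols = sorted(set("".join(sequences)))
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1, without floats.
    n = max(1, (len(symbols) - 1).bit_length())
    codes = {symbol: format(i, f"0{n}b") for i, symbol in enumerate(symbols)}
    return codes, n


codes, n = build_codes(STANZA_1, STANZA_2)
for symbol, code in codes.items():
    print(repr(symbol), code)
```

Each code string names one corner of the $n$-dimensional unit hypercube, one bit per coordinate.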
4.3 Forward Map
• The forward map defines a position in an $n$-dimensional unit hypercube for each prefix $S_i = s^{(1)} s^{(2)} \ldots s^{(i)}$ of sequence $S = s^{(1)} s^{(2)} \ldots s^{(k)}$.
• For each coordinate $j = 1, \ldots, n$ of $n$-space, the $USM_j$ coordinates for each prefix $S_i$ are determined as follows:
$$USM_j(s^{(0)}) = \mathrm{Unif}([0,1])$$
$$USM_j(s^{(i)}) = USM_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - USM_j(s^{(i-1)})\right) = \tfrac{1}{2}\,USM_j(s^{(i-1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $u_j^{(i)} \in \{0,1\}$ and $1 \le i \le k$, $1 \le j \le n$.
• Notice the above formulas use the binary encoding $u^{(i)}$ of $s^{(i)}$.
4.4 Backward Map
• The backward map defines another $n$-dimensional unit hypercube.
• For each coordinate $j = 1, \ldots, n$ of $n$-space, the $USM_{n+j}$ coordinates for each suffix $s^{(i)} \ldots s^{(k)}$ are determined as follows:
$$USM_{n+j}(s^{(k+1)}) = \mathrm{Unif}([0,1])$$
$$USM_{n+j}(s^{(i)}) = \tfrac{1}{2}\,USM_{n+j}(s^{(i+1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $i = k, k-1, \ldots, 1$.
• Together the forward and backward maps transform each position of a sequence to a point in $2n$-dimensional space.
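As a rough sketch of the two maps, the Python below takes a sequence that has already been encoded as bit tuples and returns one $2n$-dimensional point per position. The function names are hypothetical, and drawing the Unif([0, 1]) seeds from a seeded `random.Random` is an assumption made so that runs are repeatable.

```python
import random


def usm_forward(bit_seq, rng):
    """Forward USM coordinates: one n-vector per position of the sequence."""
    n = len(bit_seq[0])
    x = [rng.random() for _ in range(n)]  # USM_j(s^(0)) ~ Unif([0, 1])
    coords = []
    for u in bit_seq:
        # USM_j(s^(i)) = 1/2 * USM_j(s^(i-1)) + 1/2 * u_j^(i)
        x = [0.5 * xj + 0.5 * uj for xj, uj in zip(x, u)]
        coords.append(x)
    return coords


def usm_backward(bit_seq, rng):
    """Backward USM coordinates: the same recursion run from the last symbol."""
    return usm_forward(bit_seq[::-1], rng)[::-1]


def usm_points(bit_seq, rng):
    """Concatenate forward and backward coordinates into 2n-vectors."""
    fwd, bwd = usm_forward(bit_seq, rng), usm_backward(bit_seq, rng)
    return [f + b for f, b in zip(fwd, bwd)]


# Three symbols from the 5-bit code table: 'a', space, 'm'.
bits = [(0, 0, 1, 0, 1), (0, 0, 0, 0, 0), (0, 1, 0, 1, 0)]
for point in usm_points(bits, random.Random(0)):
    print([round(c, 4) for c in point])
```

In both halves of each point, a coordinate is at least 0.5 exactly when the corresponding bit of the symbol at that position is 1, which is what makes the maps invertible.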
5 Sequence Similarity
• The distance between two sequences in USM-space estimates sequence similarity.
• The “bi-directional distance” measure $D$ between two sequences $A = a^{(1)} a^{(2)} \ldots a^{(r)} \ldots$ and $B = b^{(1)} b^{(2)} \ldots b^{(s)} \ldots$ is defined as:
$$D(a^{(r)}, b^{(s)}) = d_f(a^{(r)}, b^{(s)}) + d_b(a^{(r)}, b^{(s)})$$
• The “forward distance” is defined as:
$$d_f(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left| USM_j(b^{(s)}) - USM_j(a^{(r)}) \right|\right)$$
• The forward distance measures the number of similar contiguous symbols preceding $a^{(r)}$.
• The “backward distance” is defined as:
$$d_b(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left| USM_{j+n}(b^{(s)}) - USM_{j+n}(a^{(r)}) \right|\right)$$
• The backward distance measures the number of similar contiguous symbols succeeding $b^{(s)}$.
• Both distance measures $d_f$ and $d_b$ overestimate the true number of similar contiguous symbols. So $D$ overestimates the total number of similar contiguous symbols.
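A short derivation, using only the forward recursion, shows why $-\log_2$ of the coordinate gap counts shared symbols and why it overestimates. Assume the $m$ symbols ending at $a^{(r)}$ and $b^{(s)}$ are identical, so the same bits $u_j$ are added on both sides at each step and cancel in the difference:

$$
\begin{aligned}
USM_j(a^{(r)}) - USM_j(b^{(s)})
  &= \tfrac{1}{2}\left(USM_j(a^{(r-1)}) - USM_j(b^{(s-1)})\right) \\
  &= \cdots = 2^{-m}\left(USM_j(a^{(r-m)}) - USM_j(b^{(s-m)})\right).
\end{aligned}
$$

Since every coordinate lies in $[0, 1]$, the remaining difference is at most 1 in absolute value, so

$$
\max_{1 \le j \le n}\left|USM_j(a^{(r)}) - USM_j(b^{(s)})\right| \le 2^{-m}
\quad\Longrightarrow\quad
d_f(a^{(r)}, b^{(s)}) \ge m .
$$

Equality would require some coordinate to differ by exactly 1 just before the shared run; in general the gap is smaller, so $d_f$ exceeds $m$. The same argument applies to $d_b$ with the backward recursion.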
Algorithm 5.1 USM Compare
Input: Two sequences
Output: A matrix D of bi-directional distance values
1. Identify the unique symbols in the input sequences
2. Find the dimension n of the unit hypercube
3. Map each unique symbol to a unique corner of the unit hypercube
4. Iteratively generate forward USM coordinates for each input sequence
5. Iteratively generate backward USM coordinates for each input sequence
6. Find the bi-directional distance matrix D
7. Return D
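The steps of Algorithm 5.1 can be sketched end to end as follows. This is one possible reading, not the authors' implementation: the name `usm_compare`, the seeded random start points, the symbol ordering, and returning infinity when two points coincide exactly are all assumptions of the sketch.

```python
import math
import random


def usm_compare(seq_a, seq_b, seed=0):
    """Return the matrix D[r][s] of bi-directional distance values."""
    # Steps 1-3: unique symbols, hypercube dimension, symbol -> corner.
    symbols = sorted(set(seq_a) | set(seq_b))
    n = max(1, (len(symbols) - 1).bit_length())
    corner = {s: [(i >> (n - 1 - j)) & 1 for j in range(n)]
              for i, s in enumerate(symbols)}
    rng = random.Random(seed)  # assumption: seeded Unif([0, 1]) start points

    def walk(seq):
        """Run the USM recursion along seq, one n-vector per symbol."""
        x = [rng.random() for _ in range(n)]
        out = []
        for s in seq:
            x = [0.5 * xj + 0.5 * uj for xj, uj in zip(x, corner[s])]
            out.append(x)
        return out

    def dist(p, q):
        """-log2 of the largest coordinate gap between two points."""
        gap = max(abs(pj - qj) for pj, qj in zip(p, q))
        return math.inf if gap == 0 else -math.log2(gap)

    # Steps 4-5: forward coordinates, then backward (reversed) coordinates.
    fwd_a, fwd_b = walk(seq_a), walk(seq_b)
    bwd_a, bwd_b = walk(seq_a[::-1])[::-1], walk(seq_b[::-1])[::-1]

    # Step 6: D = d_f + d_b for every pair of positions (r, s).
    return [[dist(fwd_a[r], fwd_b[s]) + dist(bwd_a[r], bwd_b[s])
             for s in range(len(seq_b))]
            for r in range(len(seq_a))]


D = usm_compare("ATGA", "CTGA")
for row in D:
    print([round(value, 2) for value in row])
```

For the two short sequences used here, the final positions share the three symbols TGA in the forward direction and one symbol in the backward direction, so the last entry of D is at least 4.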
6 Recovering the Sequence
• In theory we can recover the original sequence from a given point in USM-space since the forward and backward mappings are bijections, i.e., each has an inverse.
• In practice we are limited by the precision of the machine.
• As an example, the USM coordinates of the 16th character ’a’ are:
$$[\,\underbrace{0.0156,\ 0.0138,\ 0.6314,\ 0.0001,\ 0.5338}_{\text{forward}},\ \underbrace{0.0703,\ 0.3004,\ 0.5169,\ 0.2742,\ 0.5652}_{\text{backward}}\,]$$
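• We recover the original sequence by inverting the corresponding map. For the inverse forward map:
$$USM_j(a^{(15)}) = 2\,USM_j(a^{(16)}) - u_j^{(16)}$$
$$2\begin{bmatrix} 0.0156 \\ 0.0138 \\ 0.6314 \\ 0.0001 \\ 0.5338 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.0312 \\ 0.0276 \\ 0.2628 \\ 0.0002 \\ 0.0676 \end{bmatrix} \rightarrow \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \rightarrow \text{a space}$$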
Inverse forward map    Inverse backward map
00100  I               00101  a
00000  (space)         01010  m
00101  a               00000  (space)
01010  m               10001  v
00000  (space)         01000  e
00101  a               01110  r
00000  (space)         10010  y
01101  p               00000  (space)
01100  o               01001  f
01000  e               01100  o
10000  t               01011  n
00001  .               00111  d
00000  (space)         00000  (space)
00100  I               01100  o
00000  (space)         01001  f
00101  a               00000  (space)
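One inversion step can be written directly from the formula: read each bit by comparing the coordinate with 0.5, then double and subtract. The sketch below applies it to the rounded coordinates quoted for the 16th character; `invert_step` is a hypothetical helper name.

```python
def invert_step(coords):
    """Undo one USM step: return (bits of this symbol, previous coordinates)."""
    bits = [int(c >= 0.5) for c in coords]  # upper half of the axis means bit 1
    previous = [2.0 * c - u for c, u in zip(coords, bits)]
    return bits, previous


# Rounded coordinates of the 16th character 'a' from the slides.
forward_16 = [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
backward_16 = [0.0703, 0.3004, 0.5169, 0.2742, 0.5652]

bits_16, forward_15 = invert_step(forward_16)    # bits of 'a', then step back
bits_15, _ = invert_step(forward_15)             # bits of character 15
_, backward_17 = invert_step(backward_16)        # step toward character 17
bits_17, _ = invert_step(backward_17)            # bits of character 17

print(bits_16, bits_15, bits_17)
```

Each step doubles any rounding error in the coordinates, which is why the number of symbols that can be recovered is bounded by the precision of the stored values.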
Figure 3: Distance matrices D for USM (left) versus bUSM (right) on the two stanzas. Brighter regions indicate larger distance values.
Figure 4: Distance matrices D for a 100-nucleotide mRNA using both USM (left) and bUSM (right).
7 Boolean USM
There are some problems with USMs:
• The length of the sequence that can be recovered is limited by the machine's precision.
• The distance measure D overestimates the true number of similar symbols in a similar segment.
The boolean USM (bUSM) procedure [4] fixes both these problems by:
• Replacing arithmetic operations with equivalent boolean operations.
• Changing the symbol encoding scheme.
8 Boolean USM Coordinates
• We will look at the forward map only.
• Conceptually, we represent a coordinate (dimension) $c$ in boolean USM-space as an infinite bit sequence.
• Each coordinate in bUSM-space represents an infinite bit history of the symbols that have been seen so far.
$$c = \bigvee_{i=1}^{\infty} R^i a_i$$
where $a_i \in \{0, 1\}$, $\bigvee_{i=1}^{\infty}$ represents bit-wise logical OR, and $R^i$ is the right shift operator repeated $i$ times.
9 Tail Symbols
• We also need to add two new symbols to the set of sequence symbols called “tail symbols,” one for each sequence.
• Suppose we have the following encoding, where $V_t$ and $W_t$ are the tail symbols for the sequences V = ATGA and W = CTGA respectively.
Unit  Code
A     000
T     001
G     010
C     011
V_t   100
W_t   101
• Conceptually we add these tail symbols to the beginning and the end of the original sequences.
10 bUSM Recursion Formula
• Consider a computer word of 8 bits, i.e., our coordinate bit history is 8 bits long.
• The bUSM recursion is initialized with the tail symbol for each sequence.
$$USM(v^{(0)}) = \begin{bmatrix} 00000000 \\ 00000000 \\ 11111111 \end{bmatrix}$$
• The bUSM recursion is written as:
$$USM_j(v^{(k)}) = R^1\left(USM_j(v^{(k-1)})\right) \odot u_j^{(k)}$$
where $u_j^{(k)} \in \{0, 1\}$, $1 \le j \le n$. $\odot$ is defined below.
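11 Example
• For instance, for the first symbol of V, A = (000), we get:
$$USM(v^{(1)}) = R^1 \begin{bmatrix} 00000000 \\ 00000000 \\ 11111111 \end{bmatrix} \odot \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 00000000 \\ 00000000 \\ 01111111 \end{bmatrix}$$
• Here $\odot$ means: in the right-shifted coordinate histories, replace the first-column entry with a 1 if the corresponding entry in the column vector is 1.
• For the second symbol of V, T = (001), the column vector lists the code bits from last to first:
$$USM(v^{(2)}) = R^1 \begin{bmatrix} 00000000 \\ 00000000 \\ 01111111 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10000000 \\ 00000000 \\ 00111111 \end{bmatrix}$$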
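The recursion in this example can be reproduced with integer shifts and ORs. The sketch below assumes 8-bit words and the tail-symbol encoding table; the names `busm_init` and `busm_step` are hypothetical, and rows are stored from the last code bit to the first, matching the order of the column vectors in the example.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1  # 0b11111111
N_DIMS = 3
# Encoding table with tail symbols Vt and Wt.
CODES = {"A": 0b000, "T": 0b001, "G": 0b010, "C": 0b011,
         "Vt": 0b100, "Wt": 0b101}


def code_bit(symbol, j):
    """Bit j of the symbol's code, counting from the last (rightmost) bit."""
    return (CODES[symbol] >> j) & 1


def busm_init(tail):
    """Fill each coordinate history with the tail symbol's bit."""
    return [MASK * code_bit(tail, j) for j in range(N_DIMS)]


def busm_step(coords, symbol):
    """Shift each history right by one and set its first bit from the symbol."""
    return [(c >> 1) | (code_bit(symbol, j) << (WORD_BITS - 1))
            for j, c in enumerate(coords)]


coords = busm_init("Vt")
for symbol in "ATGA":
    coords = busm_step(coords, symbol)
    print(symbol, [format(c, "08b") for c in coords])
```

Because only shifts and ORs are used, the histories are exact up to the word length, in contrast to the floating-point coordinates of the arithmetic USM.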