


  1. On-line Searching in IUPAC Nucleotide Sequences
     Jan Holub (joint work with Petr Procházka)
     The Prague Stringology Club, Faculty of Information Technology, Czech Technical University in Prague
     LSD & LAW 2019, February 7, 2019

  2. Outline
     1. Motivation
     2. Basic Concepts
     3. BADPM data structures
     4. BADPM pattern preprocessing
     5. BADPM searching
     6. BADPM complexities
     7. Experiments

     LSD & LAW 2019: J. Holub: On-line Searching in IUPAC Nucleotide Sequences – 2 / 21

  3. Motivation
     - DNA sequencing of populations of many individuals.
     - 1000 Genomes Project, UK10K project.
     - Pan-genomics: a consensus sequence is a way of representing the sequenced population.
     - A consensus sequence can be expressed as a so-called degenerate string.
     - Need for fast on-line algorithms that search for different patterns in the consensus sequence.

  4. Basic Concepts: IUPAC alphabet

     | IUPAC symbol | Subset         | Bit coding |
     |--------------|----------------|------------|
     | A            | { A }          | 0001       |
     | C            | { C }          | 0010       |
     | G            | { G }          | 0100       |
     | T            | { T }          | 1000       |
     | R            | { A, G }       | 0101       |
     | Y            | { C, T }       | 1010       |
     | S            | { C, G }       | 0110       |
     | W            | { A, T }       | 1001       |
     | K            | { G, T }       | 1100       |
     | M            | { A, C }       | 0011       |
     | B            | { C, G, T }    | 1110       |
     | D            | { A, G, T }    | 1101       |
     | H            | { A, C, T }    | 1011       |
     | V            | { A, C, G }    | 0111       |
     | N            | { A, C, G, T } | 1111       |
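The bit coding in the table makes a compatibility test between two IUPAC symbols a single bitwise AND. The following is a minimal sketch of that idea; the names `BASE_BIT`, `IUPAC_SUBSET`, `mask` and `compatible` are illustrative and do not come from the slides.

```python
# Bit coding from the table: one bit per base (A = 0001, ..., T = 1000).
BASE_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# Subset of bases denoted by each IUPAC symbol, as listed in the table.
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mask(symbol):
    """Return the 4-bit code of an IUPAC symbol (OR of its base bits)."""
    bits = 0
    for base in IUPAC_SUBSET[symbol]:
        bits |= BASE_BIT[base]
    return bits


def compatible(x, y):
    """Two symbols are compatible when their subsets share at least one base."""
    return mask(x) & mask(y) != 0
```

For instance, `mask("R")` is `0b0101`, and `compatible("R", "Y")` is false because { A, G } and { C, T } are disjoint.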

  5. Basic Concepts: DNA Consensus Sequence

     | Species              | Sequence                                      |
     |----------------------|-----------------------------------------------|
     | homo sapiens         | T C T A G C A C T T A C T C T A T G C C T G C |
     | pan paniscus         | T C T A G C A C T T A C T C T A T G C C T G C |
     | chlorocebus sabaeus  | T C C A G C A C T T A C T C T G T G C C C G C |
     | macaca fascicularis  | T C C A G C A C T T A C T C T G T G C C C A C |
     | macaca mulatta       | T C C A G C A C T T A C T C T G T G C C C A C |
     | papio anubis         | T C C A G C A C T T A C T C T G T G C C C G C |
     | callithrix jacchus   | T C C A G C G C T T A C T C T A T A C C T A A |
     | CONSENSUS            | T C Y A G C R C T T A C T C T R T R C C Y R M |

     Figure 1: Consensus sequence over the IUPAC alphabet for different species (chromosome 7: 55 187 593 – 55 187 615).
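The consensus row of Figure 1 can be reproduced by taking, for each column of the alignment, the IUPAC symbol whose subset equals the set of bases seen in that column. The sketch below assumes equal-length, already aligned rows; `IUPAC_SUBSET`, `SYMBOL_OF` and `consensus` are illustrative names, not part of the slides.

```python
# Subset of bases denoted by each IUPAC symbol.
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Reverse lookup: set of bases -> IUPAC symbol.
SYMBOL_OF = {frozenset(bases): sym for sym, bases in IUPAC_SUBSET.items()}


def consensus(rows):
    """Column-wise consensus of aligned solid sequences of equal length."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    return "".join(SYMBOL_OF[frozenset(column)] for column in zip(*rows))


rows = [
    "TCTAGCACTTACTCTATGCCTGC",  # homo sapiens
    "TCTAGCACTTACTCTATGCCTGC",  # pan paniscus
    "TCCAGCACTTACTCTGTGCCCGC",  # chlorocebus sabaeus
    "TCCAGCACTTACTCTGTGCCCAC",  # macaca fascicularis
    "TCCAGCACTTACTCTGTGCCCAC",  # macaca mulatta
    "TCCAGCACTTACTCTGTGCCCGC",  # papio anubis
    "TCCAGCGCTTACTCTATACCTAA",  # callithrix jacchus
]
print(consensus(rows))
```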

  6. Basic Concepts: Degenerate Pattern Matching
     Problem: given a degenerate text $T$ and a degenerate pattern $P$ of length $m$, find all the occurrences of $P$ in $T$, i.e., find all $i$ such that for all $j \in [1, m]$, $T_{i+j-1} \cap P_j \neq \emptyset$.
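The problem statement translates directly into a brute-force check, which is useful as a reference when testing faster methods. This is only a naive baseline written for illustration, not BADPM; positions are 0-based here (the slide uses 1-based indices), and the names `IUPAC_SUBSET` and `naive_search` are illustrative.

```python
# Subset of bases denoted by each IUPAC symbol.
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def naive_search(text, pattern):
    """Return all 0-based i such that text[i + j] and pattern[j] share a base for every j."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        if all(
            set(IUPAC_SUBSET[text[i + j]]) & set(IUPAC_SUBSET[pattern[j]])
            for j in range(m)
        ):
            hits.append(i)
    return hits
```

With the text `"TCYAGCRCTT"` (a prefix of the consensus in Figure 1), the pattern `"YAG"` occurs only at position 2 and `"CRC"` only at position 5.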

  7. BADPM: Basic Properties
     - Byte-Aligned Degenerate Pattern Matching (BADPM).
     - Sublinear average time complexity in searching over consensus DNA sequences.
     - Extremely fast for long patterns because of long shifts.
     - Simple pattern preprocessing: tabulating all pattern factors.
     - Processing at the byte level (omitting most of the bitwise operations).
     - Easy to combine with an n-gram inverted index.

  8. BADPM: Data Structures
     [Figure: a source sequence (... A C V T A A T ... T A R T B) and its encoded form. Bases are packed four per byte using A → 00, C → 01, G → 10, T → 11. Labelled parts: baseSeq, variantPos, variantNum, variants, and the dictionary of the preprocessed pattern.]

  9. BADPM: Data structures (2)
     - Consensus sequence divided into:
       - Base sequence, consisting of only solid symbols.
       - Variants: encoded variants (given by the degenerate symbols) in terms of a whole byte.
     - Base sequence and variants are encoded using bytes substituting 4-grams of symbols/bases.
     - Auxiliary array variantPos stores the positions of "degenerate bytes" in the base sequence.
     - Auxiliary array variantNum stores the number of "byte variants" for a given byte.
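One way to read the description above is sketched below: each 4-gram of the consensus sequence becomes one byte of the base sequence, and a 4-gram containing degenerate symbols also contributes its other expansions to the variants. Several details are assumptions made for illustration, since the slides do not spell them out: the base byte takes the first base (in A, C, G, T order) of every degenerate symbol, `variant_num` counts only the extra bytes, and the input length is a multiple of four. The function names are hypothetical.

```python
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2-bit coding from the slides
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack(gram):
    """Pack four solid symbols into one byte, first symbol in the top bits."""
    byte = 0
    for symbol in gram:
        byte = (byte << 2) | CODE[symbol]
    return byte


def encode(consensus):
    """Split a consensus sequence into base_seq, variant_pos, variant_num, variants."""
    if len(consensus) % 4:
        raise ValueError("sketch assumes a length that is a multiple of 4")
    base_seq, variants = bytearray(), bytearray()
    variant_pos, variant_num = [], []
    for start in range(0, len(consensus), 4):
        choices = [IUPAC_SUBSET[s] for s in consensus[start:start + 4]]
        expansions = [pack(gram) for gram in product(*choices)]
        base_seq.append(expansions[0])  # assumption: first expansion is the base byte
        if len(expansions) > 1:
            variant_pos.append(start // 4)            # index of the degenerate byte
            variant_num.append(len(expansions) - 1)   # assumption: extra bytes only
            variants.extend(expansions[1:])
    return bytes(base_seq), variant_pos, variant_num, bytes(variants)
```

For `"ACVTAATTARTB"` this yields the base bytes for ACAT, AATT and AATC, with two extra bytes for the first 4-gram (ACCT, ACGT) and five for the third (AATG, AATT, AGTC, AGTG, AGTT).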

  10. BADPM: Data structures (3)
      - Dictionary of all possible two-byte values ($256^2 = 65\,536$ values).
      - Dictionary entries point to lists of occurrences (of a two-byte value) in the encoded pattern $P_C$.
      - List elements:
        - Byte offset in terms of the encoded pattern $P_C$.
        - Alignment to the encoded pattern $P_C$.

  11. BADPM: Pattern Preprocessing
      [Figure: the preprocessing process and the resulting preprocessed pattern. The pattern (A C G T A A T T A A T ...) is encoded with A → 00, C → 01, G → 10, T → 11 at each of the alignments 0, 1, 2 and 3. Every two-byte value found (for example 6 927 at alignment 0, 27 708 at alignment 1, 45 296 at alignment 2 and 50 115 at alignment 3, all at offset 0) selects a dictionary entry that stores (offset, alignment) pairs.]

  12. BADPM: Pattern Preprocessing (2)
      For each alignment $a \in \{0, 1, 2, 3\}$:
      1. Scan all relevant double-byte values.
      2. Store the byte offset (in terms of the encoded pattern $P_E$) and the alignment $a$ in the corresponding list (the dictionary entry corresponding to the double-byte value).
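The two steps above can be sketched as follows. This is one possible reading, with assumptions stated plainly: alignment `a` means the byte grid starts at pattern symbol `a`, a double-byte value covers eight consecutive symbols with the first symbol in the top bits, and degenerate symbols are expanded into every solid 8-gram they admit. A Python `dict` stands in for the 65 536-entry table, and the names `key_of` and `preprocess` are illustrative.

```python
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2-bit coding from the slides
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def key_of(gram8):
    """Pack eight solid symbols into a 16-bit value (two bytes, big-endian)."""
    value = 0
    for symbol in gram8:
        value = (value << 2) | CODE[symbol]
    return value


def preprocess(pattern):
    """Map each double-byte value to the (offset, alignment) pairs where it occurs."""
    table = {}
    for alignment in range(4):
        offset = 0
        while alignment + 4 * offset + 8 <= len(pattern):
            start = alignment + 4 * offset
            window = pattern[start:start + 8]
            # Expand degenerate symbols into all solid 8-grams of this window.
            for gram in product(*(IUPAC_SUBSET[s] for s in window)):
                table.setdefault(key_of(gram), []).append((offset, alignment))
            offset += 1
    return table
```

Under these assumptions, `preprocess("ACGTAATTAAT")` produces the keys 6 927, 27 708, 45 296 and 50 115 for alignments 0 to 3, which are the values that appear in the preprocessing figure, and the window `ACVTAATT` expands to the keys 4 879, 5 903 and 6 927 shown in the data-structure figure.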

  13. BADPM: Pattern Preprocessing Space: $O(m \alpha^2 \log m)$
      [Figure: the preprocessed pattern as a dictionary with entries 0 to 65 535, each pointing to a list of (offset, alignment) pairs $o_1, a_1, \ldots, o_i, a_i, \ldots$. The figure annotates the factors $O(m)$, $O(\alpha^2)$ and $O(\log m)$, whose product gives the bound.]

  17. BADPM: Pattern Preprocessing Time: $O(m \alpha^2)$
      - Scan $O(m)$ bytes of the encoded pattern $P_E$.
      - Check $O(\alpha^2)$ double-byte values at each position (the pathological case is a pattern such as ...NNNNNNNN...).
      - Store the offset and alignment for each double-byte value in the corresponding list ($O(1)$ time).
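To see why a run of N symbols is the pathological case, one can count how many solid double-byte values a single eight-symbol window expands to: it is the product of the subset sizes of its symbols. The helper below is an illustrative sketch (the name `double_byte_values` is hypothetical).

```python
from math import prod

# Number of bases in the subset denoted by each IUPAC symbol.
SUBSET_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}


def double_byte_values(window):
    """Count the solid 8-grams (two-byte values) that one pattern window expands to."""
    if len(window) != 8:
        raise ValueError("a double-byte window covers exactly 8 symbols")
    return prod(SUBSET_SIZE[s] for s in window)
```

A solid window such as `ACGTAATT` yields one value, `ACVTAATT` yields three, and `NNNNNNNN` yields 4^8 = 65 536, i.e. every entry of the dictionary.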

  18. BADPM Searching
      1. Read a short (two-byte) value from baseSeq and check the dictionary.
      2. Byte-level check according to the offset.
      3. Prefix and suffix check according to the alignment.

      Figure 2: BADPM: Conceptual schema of searching.
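The three steps above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the text is assumed to be solid (no variants), the byte-level and prefix/suffix checks are collapsed into one symbol-by-symbol verification, and the fixed stride is a simple stand-in for the shift rule, which the slides do not specify. The stride relies on the fact that, for a pattern of length m ≥ 11, every occurrence fully contains at least ⌊(m − 7)/4⌋ consecutive byte-aligned double-byte windows of the text. All names are hypothetical.

```python
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def key_of(gram8):
    """Pack eight solid symbols into a 16-bit value (two bytes, big-endian)."""
    value = 0
    for symbol in gram8:
        value = (value << 2) | CODE[symbol]
    return value


def preprocess(pattern):
    """Double-byte value -> list of (byte offset, alignment) pairs in the pattern."""
    table = {}
    for alignment in range(4):
        for start in range(alignment, len(pattern) - 7, 4):
            window = pattern[start:start + 8]
            for gram in product(*(IUPAC_SUBSET[s] for s in window)):
                table.setdefault(key_of(gram), []).append(
                    ((start - alignment) // 4, alignment))
    return table


def search(text, pattern):
    """Find 0-based occurrences of a degenerate pattern in a solid text."""
    n, m = len(text), len(pattern)
    if m < 11:
        raise ValueError("sketch needs m >= 11 so an aligned double-byte fits")
    table = preprocess(pattern)
    stride = (m - 7) // 4  # assumption: fixed stride instead of the real shift rule
    hits = set()
    for k in range(0, (n - 8) // 4 + 1, stride):
        # Step 1: read a two-byte value at byte index k and check the dictionary.
        for offset, alignment in table.get(key_of(text[4 * k:4 * k + 8]), ()):
            start = 4 * k - 4 * offset - alignment
            # Steps 2 and 3, collapsed: verify the whole candidate symbol by symbol.
            if 0 <= start <= n - m and all(
                text[start + j] in IUPAC_SUBSET[pattern[j]] for j in range(m)
            ):
                hits.add(start)
    return sorted(hits)
```

With a 16-symbol pattern the stride is 2, so only every second double-byte of the text is read, which hints at where the sublinear average behaviour for long patterns comes from.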

  19. BADPM Searching: Example
      [Figure, shown in four animation steps: the pattern A C A A G T T A T A T A T G G C is searched in an encoded sequence around byte position i. Labelled parts: pattern, baseSeq, variants (containing the byte A C G A), variantPos (containing i), variantNum (containing 1) and the dictionary. The dictionary is consulted for the two-byte values 4 284 (A C A A G T T A) and 6 332 (A C G A G T T A).]
