
Dynamic mappers of NGS reads Karel Břinda (LIGM Université Paris-Est)



  1. Dynamic mappers of NGS reads. Karel Břinda (LIGM Université Paris-Est), Valentina Boeva (Institut Curie), Gregory Kucherov (LIGM Université Paris-Est)

  2. Introduction
     ◦ Read mapping is a bottleneck in NGS data processing (e.g., for variant calling).
     ◦ A lot of effort is constantly invested in the development of new mappers.
     ◦ None of them supports dynamic updates of the reference during the mapping.

  3. Idea: update the reference during the mapping. Only a few papers on this topic exist:
     ◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.
     ◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.
     ◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.

  4. Mapping – from static to dynamic
     1. Static mapping
        ◦ Classical mappers, no updates.
     2. Iterative referencing
        ◦ Uses standard mappers; each mapping run is followed by variant calling, repeated over many iterations.
     3. Dynamic mapping
        ◦ The mapper dynamically updates its index according to the reads already mapped.

  5. 1) Static mapping (standard mappers)
     [Diagram: reads 1, 2, …, n → static mapper (read mapping against the reference index, 1 iteration) → output SAM/BAM file]

  6. 2) Iterative referencing (Ghanayim & Geiger, 2013)
     [Diagram: in every iteration, all reads 1, 2, …, n → static mapper (read mapping against the reference index) → SAM/BAM file → pileup, consensus → statistics → update of the reference]

  7. 3) Dynamic mapping (no existing mapper until now)
     [Diagram: read 1, read 2, …, read n are processed one per iteration by a dynamic mapper (read mapping against the reference index); statistics drive an update of the reference; output SAM/BAM file]

  8. Estimating the usefulness

                              Memory requirements   Speed   Quality of alignment
     Static mapping                    +              ++             -
     Iterative referencing             +              --             ++
     Dynamic mapping                   --             +              +

  9. Dynamic mappers

  10. Difficulties – dynamic data structures
     Two basic types of mappers:
     ◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.)
     ◦ Hash-table based (e.g., SHRiMP 2, SToRM, etc.)
     Data structures must be dynamic:
     ◦ Difficult to make dynamic versions
     ◦ More memory needed
     ◦ Worse cache optimization (=> significant decrease of speed)
     Dynamic FM-index, already studied:
     ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–Wheeler transform. Theoretical Computer Science 410 (43), 2009.
     ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8 (2), 2010.
     ◦ Implementation: http://dfmi.sourceforge.net/

  11. Difficulties – statistics and reference
     ◦ To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column).
     ◦ It is difficult to deal with insertions.
     ◦ The coordinates of already mapped reads can change during the mapping.
     ◦ Possible solution: padded reference with many initial placeholders ('*' character), followed by small final post-processing corrections of the SAM file.

     Example (memory needed for statistics for a single nucleotide):
       'A' counter   'C' counter   'G' counter   'T' counter   DEL counter   Sum
       3 bits        3 bits        3 bits        3 bits        3 bits        15 bits

     Example (padded reference, an insertion at pos. 14):
       Positions: 1 3 5 7 9 11 13 15 17 19
       Sequence:  C * * A * * G * * C * * G C * C * * A * …
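The two examples on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bit order of the counters, the use of '-' for the DEL counter, the saturation policy (a full counter simply stops growing) and the helper names are all hypothetical; the slides give only the 5 × 3 = 15 bit budget and the padded-reference idea.

```python
# Hypothetical layout: five 3-bit counters (A, C, G, T, DEL) in one 15-bit integer.
SYMBOLS = "ACGT-"            # '-' stands for the DEL counter (assumed symbol)
BITS = 3
MAX_COUNT = (1 << BITS) - 1  # a 3-bit counter holds 0..7


def get_count(packed, symbol):
    """Read one counter out of the packed 15-bit value."""
    shift = SYMBOLS.index(symbol) * BITS
    return (packed >> shift) & MAX_COUNT


def increment(packed, symbol):
    """Add one observation; a full counter stays at 7 (assumed policy)."""
    shift = SYMBOLS.index(symbol) * BITS
    if (packed >> shift) & MAX_COUNT == MAX_COUNT:
        return packed
    return packed + (1 << shift)


def consensus(packed):
    """Most frequent symbol in the column (ties resolved in SYMBOLS order)."""
    return max(SYMBOLS, key=lambda s: get_count(packed, s))


def padded_to_unpadded(padded_ref, padded_pos):
    """Convert a 1-based padded position to a 1-based position without '*'."""
    if padded_ref[padded_pos - 1] == "*":
        raise ValueError("position points at an unused placeholder")
    return sum(1 for c in padded_ref[:padded_pos] if c != "*")


column = 0
for base in "AAC-":
    column = increment(column, base)
print(get_count(column, "A"), consensus(column))

# Padded reference from the slide: position 14 holds an inserted base.
print(padded_to_unpadded("C**A**G**C**GC*C**A*", 14))
```

Because placeholders are only filled and never shift their neighbours, padded coordinates of already mapped reads stay valid during mapping; a conversion such as `padded_to_unpadded` would be needed only in the final SAM post-processing step.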

  12. Difficulties – remapping, unmapping
     ◦ When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped.
     ◦ Possible solutions:
        ◦ Ignore it.
        ◦ Iterate over the set of reads several times and take only the last reported alignment for each read.

     Example:
       Reference: ... AAAAATATATAT AT CGATCTGC ... (CC _)
       Reads: 1: ATCTATATATCG   2: C CGATCTGC   3: CC CGATCTG   4: AT CC CGATC
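The second possible solution (several passes over the reads, keeping only the last reported alignment) amounts to a simple overwrite per read name. A minimal sketch, assuming alignments arrive as `(read_name, position)` pairs in the order they were reported and that `None` marks an unmapped read; this record shape is hypothetical and not taken from the slides.

```python
def last_alignments(records):
    """Keep only the most recently reported alignment of every read.

    records: iterable of (read_name, position) in reporting order;
             position is None when a pass left the read unmapped.
    """
    final = {}
    for name, position in records:
        final[name] = position  # a later pass overwrites an earlier one
    return final


passes = [
    ("read1", 100), ("read2", 250),   # first pass
    ("read1", 104), ("read2", None),  # second pass, after reference updates
]
print(last_alignments(passes))
```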

  13. Simulating dynamic mapping

  14. Dynamic mapping
     [Diagram: read 1, read 2, …, read n are processed one per iteration by a dynamic mapper (read mapping against the reference index); statistics drive an update of the reference; output SAM/BAM file]

  15. Simulation (ideal approach)
     [Diagram: a static mapper is rerun in every iteration; iteration 1 maps read 1, iteration 2 maps reads 1 and 2, …, the last iteration maps reads 1, 2, …, n; each run produces a SAM/BAM file, then pileup and consensus → statistics → update of the reference (index)]

  16. 1 Simulation (feasible approach: 𝑒 iterations)
     [Diagram: as in the ideal approach, but each iteration of the static mapper adds a batch of d reads instead of a single read; each run produces a SAM/BAM file, then pileup and consensus → statistics → update of the reference (index)]

  17. Our pipeline
     Goals:
     ◦ Simulating a dynamic mapper using existing static mappers
     ◦ Estimating the usefulness of dynamic mapping
     ◦ Making general statements about its benefit
     Implementation:
     ◦ A set of several scripts (BASH, Python) and programs (C++)
     ◦ It uses standard bioinformatics software (SAMtools suite, etc.) and mappers (any mapper can be incorporated)
     ◦ Updates are made by our own simple variant caller (simulating the real capabilities of a mapper)
     ◦ Currently only SNP updates (no indels) and single-end reads are supported
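The batch-wise simulation can be illustrated with a toy, self-contained Python sketch. It is not the pipeline described above: the brute-force `best_hit` function stands in for a real static mapper, the majority-vote update stands in for the simple variant caller, and the parameters `max_mismatches` and `min_support` are hypothetical. Like the pipeline, it handles only SNP updates and single-end reads.

```python
from collections import Counter


def best_hit(reference, read):
    """Toy static mapper: leftmost position with the fewest mismatches."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(window, read))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def simulate_dynamic(reference, reads, d, max_mismatches=2, min_support=2):
    """Map reads in batches of d; update the reference after every batch."""
    ref = list(reference)
    pileup = [Counter() for _ in ref]  # per-position nucleotide counts
    alignments = []                    # (read, position or None, mismatches)
    for start in range(0, len(reads), d):
        current = "".join(ref)         # reference as seen by this batch
        for read in reads[start:start + d]:
            pos, mm = best_hit(current, read)
            if pos is None or mm > max_mismatches:
                alignments.append((read, None, mm))
                continue
            alignments.append((read, pos, mm))
            for offset, base in enumerate(read):
                pileup[pos + offset][base] += 1
        # SNP-only update: replace a base when the majority disagrees with it.
        for i, column in enumerate(pileup):
            if column:
                base, count = column.most_common(1)[0]
                if count >= min_support and base != ref[i]:
                    ref[i] = base
    return "".join(ref), alignments


reference = "GATTACAGGCTTAACCGT"
reads = ["ACAGTCTT", "TTACAGTC", "AGTCTTAA"]  # sampled from a donor with one SNP
updated, alignments = simulate_dynamic(reference, reads, d=2)
print(updated)
print(alignments)
```

With `d=2` the first two reads support the SNP, the reference is updated, and the third read then maps without mismatches. Setting `d` to the number of reads reproduces static mapping, where the third read still carries one mismatch.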

  18. Comparing mappers and alignments

  19. Comparison of mappers
     Typical approach:
     1. Taking several mappers as black boxes.
     2. Simulating reads.
     3. Mapping by the selected mappers.
     4. Applying the same threshold on mapping qualities for all reads.
     5. Comparing.
     …it is not very useful.

  20. Comparison of mappers/alignments
     Typical approach (steps 1–5 as on slide 19) …it is not very useful.
     [Figure: comparison at threshold 20 (on mapping qualities)]
     Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997

  21. Comparison of mappers/alignments
     Typical approach (steps 1–5 as on slide 19) …it is not very useful.
     It is important to consider all thresholds on mapping qualities!
     Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997
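Considering all thresholds means sweeping the mapping-quality cutoff and recording, for each cutoff, how many reads remain and how many of those are wrong. A minimal sketch, assuming each mapped read of a simulated data set is given as a `(mapq, is_correct)` pair; this input shape and the function name are hypothetical and are not LAVEnder's interface.

```python
def threshold_curve(mapped_reads):
    """Sweep every distinct mapping-quality threshold.

    mapped_reads: list of (mapq, is_correct) pairs for mapped reads.
    Returns (threshold, reads kept, wrongly mapped reads kept) triples,
    from the strictest threshold to the most permissive one.
    """
    thresholds = sorted({mapq for mapq, _ in mapped_reads}, reverse=True)
    curve = []
    for threshold in thresholds:
        kept = [ok for mapq, ok in mapped_reads if mapq >= threshold]
        curve.append((threshold, len(kept), kept.count(False)))
    return curve


example = [(60, True), (60, True), (30, True), (30, False), (0, False)]
for threshold, kept, wrong in threshold_curve(example):
    print(threshold, kept, wrong)
```

Plotting the fraction of wrong reads against the number of reads kept, one point per threshold, gives a whole curve per mapper instead of a single point at, say, threshold 20.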

  22. LAVEnder
     ◦ New evaluation software for comparing alignments (C++, Python)
     ◦ It creates interactive HTML reports for a set of BAM files
     Support of:
     ◦ DWGsim read simulator (will be extended)
     ◦ Single-end reads
     Availability:
     ◦ Currently a private repository on GitHub
     ◦ In case of interest, don’t hesitate to contact me at karel.brinda@univ-mlv.fr

  23. Example of a comparison
     ◦ Human chromosome 21
     ◦ Sequencing error rate: 0.04
     ◦ Mutation rate: 0.10
     ◦ Single-end reads
     ◦ Simulated by DWGsim
     ◦ Aligned by BWA-MEM
     [Plot; axis labels: "Fraction of wrongly mapped reads in mapped reads", "Part of all reads in %"]

  24. EXPERIMENTS

  25. Setup
     ◦ Mappers: BWA-ALN, BWA-MEM
     ◦ Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21
     ◦ Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM
     ◦ Sequencing error rate: 0.01
     ◦ Read length: 100
     ◦ Read simulator: DWGsim
     ◦ Evaluator: LAVEnder

  26. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  27. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  28. BWA-ALN, human chromosome 21. Rate of mutations: 0.01; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  29. BWA-ALN, human chromosome 21. Rate of mutations: 0.01; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  30. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  31. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  32. BWA-ALN, human chromosome 21. Rate of mutations: 0.03; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  33. BWA-ALN, human chromosome 21. Rate of mutations: 0.03; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  34. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  35. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  36. BWA-ALN, human chromosome 21. Rate of mutations: 0.05; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plot: mapping of all reads without any updates]

  37. BWA-ALN, human chromosome 21. Rate of mutations: 0.05; rate of seq. errors: 0.01; read length: 100; average coverage: 10. [Plots: iterative referencing; dynamic mapping]
