Dynamic mappers of NGS reads
Karel Břinda (LIGM, Université Paris-Est), Valentina Boeva (Institut Curie), Gregory Kucherov (LIGM, Université Paris-Est)
Introduction
◦ Read mapping is a bottleneck in NGS data processing (e.g., for variant calling).
◦ A lot of effort is constantly invested in the development of new mappers.
◦ None of them supports dynamic updates of the reference during the mapping.
Idea: update the reference during the mapping
Only a few papers on this topic exist:
◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.
◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.
◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.
Mapping – from static to dynamic
1. Static mapping
◦ Classical mappers, no updates
2. Iterative referencing
◦ A standard mapper is used; each mapping run is followed by variant calling, repeated over many iterations
3. Dynamic mapping
◦ The mapper dynamically updates its index according to the reads already mapped
1) Static mapping (standard mappers)
[Diagram: reads 1..n are mapped in 1 iteration by a static mapper against a fixed reference (index); output: SAM/BAM file.]
2) Iterative referencing (Ghanayim & Geiger, 2013)
[Diagram: every iteration maps reads 1..n with a static mapper; statistics (pileup, consensus) from the SAM/BAM file drive an update of the reference (index) before the next iteration.]
3) Dynamic mapping (no existing mapper until now)
[Diagram: a dynamic mapper handles the reads in iterations 1..n; statistics collected during read mapping drive updates of the reference (index) while the mapping is running; output: SAM/BAM file.]
Estimating the usefulness
                        Memory requirements   Speed   Quality of alignment
Static mapping          +                     ++      -
Iterative referencing   +                     --      ++
Dynamic mapping         --                    +       +
Dynamic mappers
Difficulties – dynamic data structures
Two basic types of mappers:
◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM)
◦ Hash-table based (e.g., SHRiMP 2, SToRM)
Data structures must be dynamic:
◦ Dynamic versions are difficult to build
◦ More memory is needed
◦ Worse cache optimization (=> significant decrease in speed)
Dynamic FM-index, already studied:
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows-Wheeler transform. Theoretical Computer Science 410(43), 2009.
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8(2), 2010.
◦ Implementation: http://dfmi.sourceforge.net/
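To make "dynamic" concrete for the hash-table family, here is a minimal sketch. It is not the index of any mapper named above: the function names build_kmer_index and apply_snp are hypothetical, and a plain Python dictionary of k-mer start positions stands in for a real seed index. It shows that a single-base update touches at most k entries, which is the kind of local repair a dynamic index has to support.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the set of its start positions."""
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].add(start)
    return index


def apply_snp(reference: str, index: dict, k: int, pos: int, base: str) -> str:
    """Substitute one base and repair only the k-mers that overlap `pos`.

    Assumption: SNP updates only, so no coordinate ever shifts.
    """
    updated = reference[:pos] + base + reference[pos + 1:]
    first = max(0, pos - k + 1)
    last = min(pos, len(reference) - k)
    for start in range(first, last + 1):
        old_kmer = reference[start:start + k]
        index[old_kmer].discard(start)
        if not index[old_kmer]:
            del index[old_kmer]  # keep the table free of empty entries
        index[updated[start:start + k]].add(start)
    return updated
```

After apply_snp, the table should equal the one obtained by rebuilding the index from scratch on the updated sequence; the point of the dynamic version is that it avoids that rebuild.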
Difficulties – statistics and reference
To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column).
Example (memory needed for statistics for a single nucleotide):
‘A’ counter   ‘C’ counter   ‘G’ counter   ‘T’ counter   DEL counter   Sum
3 bits        3 bits        3 bits        3 bits        3 bits        15 bits
It is difficult to deal with insertions. The coordinates of already mapped reads can change during the mapping.
◦ Possible solution: padded reference with many initial placeholders (‘*’ character), followed by small final post-processing corrections of the SAM file.
Example (padded reference, an insertion at pos. 14):
Positions 1 to 20: C * * A * * G * * C * * G C * C * * A * …
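The two examples above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the names increment, get_count, consensus and unpadded_position are hypothetical, counters saturate at 7 (the slide does not say what happens when a 3-bit counter is full), and ties in the consensus go to the first symbol in the order A, C, G, T, DEL.

```python
SYMBOLS = "ACGT-"              # '-' stands for the DEL counter
BITS = 3                       # bits per counter, as in the example above
MAX_COUNT = (1 << BITS) - 1    # a 3-bit counter holds 0..7


def get_count(column: int, symbol: str) -> int:
    """Read one 3-bit counter out of the packed 15-bit column."""
    shift = SYMBOLS.index(symbol) * BITS
    return (column >> shift) & MAX_COUNT


def increment(column: int, symbol: str) -> int:
    """Add one observation; assumption: a full counter simply saturates."""
    if get_count(column, symbol) == MAX_COUNT:
        return column
    return column + (1 << (SYMBOLS.index(symbol) * BITS))


def consensus(column: int, default: str) -> str:
    """Most frequent symbol of the column, or `default` if nothing was seen."""
    best = max(SYMBOLS, key=lambda s: get_count(column, s))
    return best if get_count(column, best) > 0 else default


def unpadded_position(padded: str, padded_pos: int) -> int:
    """0-based coordinate of padded[padded_pos] once '*' placeholders are removed."""
    if padded[padded_pos] == "*":
        raise ValueError("position holds a placeholder, not a base")
    return padded_pos - padded[:padded_pos].count("*")
```

With the padded reference from the example, the inserted base at padded position 14 (index 13 when counting from 0) is the sixth real base, so unpadded_position returns 5. That conversion is the kind of correction the SAM post-processing step would have to apply.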
Difficulties – remapping, unmapping
When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped.
Example:
Reference: ... AAAAATATATAT AT CGATCTGC ... (annotation on the slide: CC _)
Reads:
1: ATCTATATATCG
2: C CGATCTGC
3: CC CGATCTG
4: AT CC CGATC
Possible solutions:
◦ Ignore it
◦ Iterate over the set of reads several times and take only the last reported alignment for each read
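The second option is a small bookkeeping step. A minimal sketch, assuming each pass over the reads yields (read name, position) pairs and that None marks an unmapped read; the function name last_alignments and this record layout are hypothetical:

```python
def last_alignments(passes):
    """Keep, for every read, only the alignment reported in the latest pass.

    `passes` is a sequence of passes over the read set; each pass is a
    sequence of (read_name, position) pairs, with position None if unmapped.
    """
    final = {}
    for alignments in passes:
        for read_name, position in alignments:
            final[read_name] = position  # later passes overwrite earlier ones
    return final
```

A read that mapped in an early pass and is reported as unmapped later ends up unmapped, which covers the "unmapping" case as well as remapping.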
Simulating dynamic mapping
Dynamic mapping
[Diagram: a dynamic mapper handles the reads in iterations 1..n; statistics collected during read mapping drive updates of the reference (index) while the mapping is running; output: SAM/BAM file.]
Simulation (ideal approach)
[Diagram: a static mapper is used; successive iterations cover read 1, then reads 1-2, ..., then reads 1..n; after each iteration, statistics (pileup, consensus) from the SAM/BAM file drive an update of the reference (index).]
Simulation (feasible approach: d reads per iteration)
[Diagram: a static mapper is used; successive iterations cover d reads, then 2d reads, then 3d reads, and so on; after each iteration, statistics (pileup, consensus) from the SAM/BAM file drive an update of the reference (index).]
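The batch scheme can be tried on toy data. The sketch below is not the pipeline itself and makes several assumptions: map_read is a hypothetical stand-in for a static mapper (best ungapped placement by Hamming distance), only the new batch is mapped in each iteration while the pileup accumulates, updates are SNP-only, and a base replaces the reference base only when at least min_support reads agree.

```python
from collections import Counter


def map_read(reference: str, read: str, max_mismatches: int):
    """Toy static mapper: leftmost best ungapped hit, or None if too distant."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        dist = sum(a != b for a, b in zip(window, read))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def simulate_dynamic_mapping(reference, reads, d, max_mismatches=1, min_support=2):
    """Map reads in batches of d and update the reference after every batch."""
    pileup = [Counter() for _ in reference]
    alignments = []
    for batch_start in range(0, len(reads), d):
        # One iteration: map d reads against the current reference.
        for read in reads[batch_start:batch_start + d]:
            pos = map_read(reference, read, max_mismatches)
            alignments.append(pos)
            if pos is not None:
                for offset, base in enumerate(read):
                    pileup[pos + offset][base] += 1
        # Update step: SNPs only, so coordinates never shift.
        consensus = []
        for ref_base, column in zip(reference, pileup):
            base, count = column.most_common(1)[0] if column else (ref_base, 0)
            consensus.append(base if count >= min_support else ref_base)
        reference = "".join(consensus)
    return reference, alignments
```

Setting d to the number of reads gives the static behaviour (one mapping pass, one update at the end), so the same function can be used to compare the two regimes. For example, a read that carries two SNPs and is rejected with max_mismatches=1 in the static run can become mappable once an earlier batch has already introduced one of the SNPs into the reference.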
Our pipeline
Goals:
◦ Simulating a dynamic mapper using existing static mappers
◦ Estimating the usefulness of dynamic mapping
◦ Making general statements about its benefit
Implementation:
◦ A set of scripts (BASH, Python) and programs (C++)
◦ It uses standard bioinformatics software (SAMtools suite, etc.) and mappers (any mapper can be incorporated)
◦ Updates are made by our own simple variant caller (simulating the real capabilities of a mapper)
◦ Currently only SNP updates (no indels) and single-end reads are supported
Comparing mappers and alignments
Comparison of mappers
Typical approach:
1. Taking several mappers as black boxes.
2. Simulating reads.
3. Mapping by the selected mappers.
4. Applying the same threshold on mapping qualities for all reads.
5. Comparing.
…it is not very useful.
Comparison of mappers/alignments
[Figure: result of the typical approach with threshold 20 on mapping qualities. Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997]
Comparison of mappers/alignments
It is important to consider all thresholds on mapping qualities!
[Figure. Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997]
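Considering all thresholds amounts to sweeping the mapping-quality cutoff and recording two numbers at each value. A minimal sketch, assuming every simulated read is summarised as a (mapping quality, is_correct) pair with None as the quality of unmapped reads; the function name threshold_curve and this record layout are hypothetical and not LAVEnder's interface:

```python
def threshold_curve(records):
    """Evaluate every mapping-quality threshold instead of a single one.

    Returns one tuple per distinct threshold, from strictest to loosest:
    (threshold, part of all reads kept, fraction of wrongly mapped reads
    among the kept reads).
    """
    total = len(records)
    mapped = [(q, ok) for q, ok in records if q is not None]
    curve = []
    for threshold in sorted({q for q, _ in mapped}, reverse=True):
        kept = [ok for q, ok in mapped if q >= threshold]
        wrong = kept.count(False)
        curve.append((threshold, len(kept) / total, wrong / len(kept)))
    return curve
```

Plotting the second value against the third for each mapper gives one curve per mapper, so two mappers can be compared at equal error rates rather than at an arbitrary shared cutoff such as 20.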
LAVEnder
A new evaluation tool for comparing alignments (C++, Python)
It creates interactive HTML reports for a set of BAM files
Supports:
◦ DWGsim read simulator (support will be extended)
◦ Single-end reads
Availability:
◦ Currently a private repository on GitHub
◦ In case of interest, don’t hesitate to contact me at karel.brinda@univ-mlv.fr
Example of a comparison
• Human chromosome 21
• Sequencing error rate: 0.04
• Mutation rate: 0.10
• Single-end reads
• Simulated by DWGsim
• Aligned by BWA-MEM
[Figure: axes are "Fraction of wrongly mapped reads in mapped reads" and "Part of all reads in %".]
EXPERIMENTS
Setup
Mappers: BWA-ALN, BWA-MEM
Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21
Mutation rates: 0.01-0.05 for BWA-ALN, 0.15 for BWA-MEM
Sequencing error rate: 0.01
Read length: 100
Read simulator: DWGsim
Evaluator: LAVEnder
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]
BWA-ALN, human chromosome 21. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, human chromosome 21. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]
BWA-ALN, human chromosome 21. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, human chromosome 21. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]
BWA-ALN, human chromosome 21. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figure: mapping of all reads without any updates]
BWA-ALN, human chromosome 21. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Figures: iterative referencing; dynamic mapping]