Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Damla Senol Cali, Jeremie S. Kim, Saugata Ghose, Can Alkan and Onur Mutlu
Contact: dsenol@andrew.cmu.edu
February 16, 2019
Nanopore Sequencing & Tools
Damla Senol Cali, Jeremie S. Kim, Saugata Ghose, Can Alkan, and Onur Mutlu. "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions." Briefings in Bioinformatics (2018).
Damla Senol Cali 2 02/16/2019
Executive Summary
- Motivation: Nanopore sequencing is an emerging and promising technology, thanks to its ability to generate long reads and its portability.
- Problem:
  - The technology has high error rates.
  - Tools are critically important to 1) overcome the high error rates of the technology, and 2) enable fast, real-time data analysis.
- Goal: Analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data.
- Key Contributions:
  - Analysis of the tools in multiple dimensions: accuracy, performance, memory usage and scalability
  - New bottlenecks and tradeoffs that different combinations of tools lead to
  - Guidelines for both practitioners and tool developers
Outline
- Background and Motivation
  - Nanopore Sequencing Technology
  - Comparison with Prior Technologies
  - Nanopore Genome Assembly Pipeline
  - Our Goal
- Experimental Methodology
- Results and Analysis
- Conclusion
Nanopore Sequencing Technology
- Nanopore sequencing is an emerging and promising single-molecule DNA sequencing technology.
- The first nanopore sequencing device, MinION, was made commercially available by Oxford Nanopore Technologies (ONT) in May 2014.
  - Inexpensive
  - Long read length (> 882 Kbp)
  - Produces data in real time
  - Pocket-sized and portable
Nanopore Sequencing
- A nanopore is a nano-scale hole.
- In nanopore sequencers, an ionic current passes through the nanopores.
- When the DNA strand passes through the nanopore, the sequencer measures the change in current.
- This change is used to identify the bases in the strand, because the different bases have different electrochemical structures.
Why Nanopore Sequencing?

(Prior) High-Throughput Sequencing Technologies
- Require an amplification step before the sequencing process,
- Require labeling of the DNA or nucleotide for detection during sequencing,
- Generate billions of short but accurate reads,
- Provide high throughput, high speed and low cost,
- Suffer from a massive amount of data and short reads, which pose challenges due to the repetitive sequences in the genome.

Nanopore Sequencing Technology
- Does not require an amplification step before the sequencing process,
- Does not require any labeling of the DNA or nucleotide for detection during sequencing,
- Allows sequencing of very long reads, and
- Provides portability, low cost and high throughput.
- One major drawback: high error rates (~10-15%)
Nanopore Genome Assembly Pipeline
- Basecalling: raw signal data -> DNA reads
- Read-to-Read Overlap Finding: DNA reads -> overlaps
- Assembly: overlaps -> draft assembly
- Read Mapping (Optional): DNA reads and draft assembly -> mappings of reads against draft assembly
- Polishing (Optional): mappings of reads against draft assembly -> improved assembly
Our Goal
- Comprehensively analyze the multiple steps and the associated state-of-the-art tools in genome assembly pipelines using nanopore sequence data in terms of accuracy, performance, memory usage, and scalability.
- Reveal bottlenecks and trade-offs that different combinations of tools lead to.
- Provide guidelines for both practitioners, such that they can determine the appropriate tools and tool combinations that can satisfy their goals, and tool developers, such that they can make design choices to improve current and future tools.
Outline
- Background and Motivation
- Experimental Methodology
- Results and Analysis
- Conclusion
Experimental Methodology
Experimental Methodology (cont.)

Accuracy Metrics
- Average Identity
  - Percentage similarity between the assembly and the reference genome
  - Higher (≃100%) is preferred
- Coverage
  - Ratio of the number of aligned bases in the reference genome to the length of the reference genome
  - Higher (≃100%) is preferred
- Number of mismatches
  - Total number of single-base differences between the assembly and the reference genome
  - Lower (≃0) is preferred
- Number of indels
  - Total number of insertions and deletions between the assembly and the reference genome
  - Lower (≃0) is preferred

Performance Metrics
- Wall clock time
- Peak memory usage
- Parallel speedup
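The metric definitions above can be illustrated with a minimal Python sketch. It assumes the assembly has already been aligned to the reference and that the alignment is given as two equal-length gapped strings (with `-` marking a gap); this input format, the function names `accuracy_metrics` and `parallel_speedup`, the choice to count each gap column as one indel, and the single-thread-over-multi-thread definition of speedup are illustrative assumptions, not the tooling used in the study.

```python
def accuracy_metrics(ref_aln: str, asm_aln: str, ref_length: int) -> dict:
    """Compute identity, coverage, mismatches and indels from a gapped
    pairwise alignment (hypothetical input format: '-' marks a gap)."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned strings must have equal length")
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")

    matches = mismatches = indels = 0
    for r, a in zip(ref_aln, asm_aln):
        if r == "-" and a == "-":
            continue                  # empty column, carries no information
        if r == "-" or a == "-":
            indels += 1               # assumption: one indel per gap column
        elif r == a:
            matches += 1
        else:
            mismatches += 1           # single-base difference

    aligned_ref_bases = matches + mismatches   # reference bases paired with a base
    identity = 100.0 * matches / aligned_ref_bases if aligned_ref_bases else 0.0
    coverage = 100.0 * aligned_ref_bases / ref_length
    return {
        "identity_pct": identity,
        "coverage_pct": coverage,
        "mismatches": mismatches,
        "indels": indels,
    }


def parallel_speedup(t_single_thread: float, t_multi_thread: float) -> float:
    """Assumed definition: single-thread wall clock time divided by
    multi-thread wall clock time."""
    return t_single_thread / t_multi_thread


if __name__ == "__main__":
    # Toy alignment: 7 matches, 1 mismatch, 2 gap columns, reference of 16 bases.
    print(accuracy_metrics("ACGT-ACGTA", "ACCTTAC-TA", ref_length=16))
    print(parallel_speedup(120.0, 30.0))
```

In practice these numbers would come from a whole-genome alignment tool rather than a hand-written loop; the sketch only makes the definitions concrete.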
Outline
- Background and Motivation
- Experimental Methodology
- Results and Analysis
  - Basecalling Tools
    - Accuracy
    - Performance
  - Read-to-Read Overlap Finding Tools
  - Assembly Tools
  - Read Mapping and Polishing Tools (optional)
- Conclusion
Nanopore Genome Assembly Pipeline
- Basecalling: raw signal data -> DNA reads
  - Tools: Metrichor, Nanonet, Scrappie, Nanocall, DeepNano
- Read-to-Read Overlap Finding: DNA reads -> overlaps
  - Tools: GraphMap, Minimap
- Assembly: overlaps -> draft assembly
  - Tools: Canu, Miniasm
- Read Mapping: DNA reads and draft assembly -> mappings of reads against draft assembly
  - Tools: BWA-MEM, Minimap, (GraphMap)
- Polishing: mappings of reads against draft assembly -> improved assembly
  - Tools: Nanopolish, Racon
Basecalling Tools
- Metrichor
  - ONT's cloud-based basecaller
  - Uses recurrent neural networks (RNN) for basecalling
- Nanonet
  - ONT's offline and open-source alternative to Metrichor
  - Uses RNN for basecalling
- Scrappie
  - ONT's newest basecaller, which explicitly addresses basecalling errors in homopolymer regions
- Nanocall [David+, Bioinformatics 2016]
  - Uses Hidden Markov Models (HMM) for basecalling
- DeepNano [Boža+, PloS One 2017]
  - Uses RNN for basecalling
Nanopore Genome Assembly Pipeline
- Basecalling: raw signal data -> DNA reads
  - Tools: Metrichor, Nanonet, Scrappie, Nanocall, DeepNano
- Read-to-Read Overlap Finding: DNA reads -> overlaps
  - Tools: GraphMap, Minimap
- Assembly: overlaps -> draft assembly
  - Tools: Canu, Miniasm
- Read Mapping: DNA reads and draft assembly -> mappings of reads against draft assembly
  - Tools: BWA-MEM, Minimap, (GraphMap)
- Polishing: mappings of reads against draft assembly -> improved assembly
  - Tools: Nanopolish, Racon

Pipelines evaluated for the basecalling tools
- Pipeline A: [Basecalling tool] + Canu
- Pipeline B: [Basecalling tool] + GraphMap + Miniasm
- Pipeline C: [Basecalling tool] + Minimap + Miniasm
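To make the evaluated scenario space concrete, the following Python sketch enumerates every combination of basecaller and pipeline named on this slide. The tool and pipeline names come from the slide; the dictionary layout and the function name `enumerate_scenarios` are hypothetical and only illustrate the combinatorics, not how the experiments were actually scripted.

```python
from itertools import product

# Basecallers and downstream tool chains as listed on the slide.
BASECALLERS = ["Metrichor", "Nanonet", "Scrappie", "Nanocall", "DeepNano"]
PIPELINES = {
    "A": ["Canu"],
    "B": ["GraphMap", "Miniasm"],
    "C": ["Minimap", "Miniasm"],
}


def enumerate_scenarios() -> list:
    """Return one record per (basecaller, pipeline) combination."""
    scenarios = []
    for basecaller, (name, tools) in product(BASECALLERS, PIPELINES.items()):
        scenarios.append({"pipeline": name, "steps": [basecaller, *tools]})
    return scenarios


if __name__ == "__main__":
    for scenario in enumerate_scenarios():
        print(f"Pipeline {scenario['pipeline']}: " + " + ".join(scenario["steps"]))
```

Five basecallers crossed with three pipelines give fifteen scenarios, one for each basecaller and pipeline pair.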
Basecalling – Accuracy
[Figure: Accuracy Analysis Results for Basecalling Tools. The plot shows identity (%) and coverage (%) on a percentage axis (0 to 100), and the number of mismatches and the number of indels on a count axis (# KBp, 0 to 450), for Metrichor, Nanonet, Scrappie, Nanocall and DeepNano under Pipelines A, B and C.]
Observation 1-a: Metrichor, Nanonet and Scrappie have similar identity and coverage trends among all of the evaluated scenarios.
Basecalling – Accuracy
[Figure: Accuracy Analysis Results for Basecalling Tools. The plot shows identity (%) and coverage (%) on a percentage axis (0 to 100), and the number of mismatches and the number of indels on a count axis (# KBp, 0 to 450), for Metrichor, Nanonet, Scrappie, Nanocall and DeepNano under Pipelines A, B and C.]
Observation 1-b: However, Nanocall and DeepNano cannot reach these three basecallers' accuracies: they have lower identity and lower coverage.