Optimizing early steps of long-read genome assembly

Pierre MARIJON, Maël KERBIRIOU, Jean-Stéphane VARRÉ, Rayan CHIKHI
November 20, 2018
盆栽 team, Lille

What's a long read?
Third generation reads are:
• Long: > 10 kb (Jain et al. 2018)
• Erroneous: ≈ 16% error rate (Jain et al. 2018)
• Chimeric (Laver et al. 2016)

Sequencing faster, cheaper, stronger

What can we do with long reads?
By mapping against a reference:
• read correction
• variant calling
• ...
By mapping against themselves:
• self-correction
• assembly
• ...

Long-read mapping
Many tools:
• minimap[2]
• mhap
• ngmlr
• graphmap
• daligner
• ...
Some output formats:
• MHAP: read1 read2 0.14 1955 0 998 20480 21581 0 45 19527 19801
• PAF (Pairwise mApping Format): read1 21581 998 20480 + read2 19801 45 19527 1955 19482 255
• SAM
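To make the PAF example line above easier to read, here is a minimal sketch of a parser for the 12 mandatory PAF columns. The names PafRecord and parse_paf_line are hypothetical helpers introduced for illustration; real PAF files are tab-separated, and the sketch splits on any whitespace so that the line from the slide parses as written.

```python
from dataclasses import dataclass


@dataclass
class PafRecord:
    """The 12 mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int          # number of matching bases in the mapping
    block_length: int     # total length of the alignment block
    mapping_quality: int  # 255 means "missing"


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line; optional trailing tags are ignored."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


record = parse_paf_line(
    "read1 21581 998 20480 + read2 19801 45 19527 1955 19482 255"
)
# Length of the mapped region on the query read.
print(record.query_end - record.query_start)
```

On the slide's example, the mapped region covers 19482 bases of read1, which is also the value stored in the alignment block length column.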

Correction?
Correction involves a lot of operations and costs time and memory.
I just want to detect chimeras.

What is a chimera?
"Error profile of a typical long read. The average error rate is say 12% but it varies and occasionally is pure junk." Gene Myers, https://dazzlerblog.wordpress.com/2017/04/22/1344/
Chimeric read: a read in which some part is not well supported (i.e. covered) by the other reads of the dataset.
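The definition above can be sketched as a coverage computation: collect the intervals of a read that are covered by overlaps with other reads, then look for a poorly covered region strictly inside the read. This is only an illustration of the idea, not yacrd's actual implementation; the function names, the (start, end) interval representation and the min_coverage threshold are assumptions.

```python
def uncovered_regions(read_length, overlaps, min_coverage=1):
    """Return half-open (start, end) regions whose coverage is below min_coverage.

    overlaps is a list of (start, end) positions on the read, each one
    coming from an overlap with another read.
    """
    events = []
    for start, end in overlaps:
        events.append((start, 1))   # coverage rises at the overlap start
        events.append((end, -1))    # and drops at the overlap end
    events.sort()

    regions = []

    def add_region(start, end):
        # Merge with the previous region when the two are adjacent.
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))

    coverage = 0
    position = 0
    for pos, delta in events:
        if pos > position and coverage < min_coverage:
            add_region(position, pos)
        position = pos
        coverage += delta
    if position < read_length:
        # Nothing covers the tail of the read.
        add_region(position, read_length)
    return regions


def has_internal_gap(read_length, overlaps, min_coverage=1):
    """True when a poorly covered region lies strictly inside the read."""
    return any(
        start > 0 and end < read_length
        for start, end in uncovered_regions(read_length, overlaps, min_coverage)
    )


# Hypothetical read of 1000 bases covered on both sides but not in the middle.
print(uncovered_regions(1000, [(0, 400), (600, 1000)]))
print(has_internal_gap(1000, [(0, 400), (600, 1000)]))
```

In this sketch a gap that touches a read end is not counted as evidence of a chimera, since an uncovered extremity may simply be a poorly supported end; that choice is an assumption of the example.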

Yet Another Chimeric Read Detector

Yet Another Chimeric Read Detector
Test dataset: 20x synthetic long reads of T. roseus, generated with LongISLND using a PacBio error model

Yet Another Chimeric Read Detector
                           DAScrubber   minimap2 + yacrd
wallclock time (seconds)   48.13        365.79
precision                  100.00%      87.70%
sensitivity                70.34%       71.16%
DAScrubber was run through https://github.com/rrwick/DASCRUBBER-wrapper
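For readers unfamiliar with the two metrics above, here is a short sketch of how precision and sensitivity can be computed when the true chimeric reads are known, as with a synthetic dataset. The function name and the read identifiers are hypothetical and are not taken from the benchmark.

```python
def precision_and_sensitivity(predicted, truth):
    """Compare predicted chimeric read names against the known ones."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # precision: fraction of reported reads that are truly chimeric
    precision = true_positives / len(predicted) if predicted else float("nan")
    # sensitivity: fraction of truly chimeric reads that were reported
    sensitivity = true_positives / len(truth) if truth else float("nan")
    return precision, sensitivity


# Hypothetical example: 3 reads reported, 4 reads truly chimeric, 2 in common.
precision, sensitivity = precision_and_sensitivity(
    ["r1", "r2", "r3"], ["r1", "r2", "r4", "r5"]
)
print(f"precision={precision:.2%} sensitivity={sensitivity:.2%}")
```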

Another trouble: the disk space
18 flowcells produce ≈ 180Gb-540Gb
A summary of troubles and some possible solutions: https://blog.pierre.marijon.fr/binary-mapping-format/

Filter Pairwise Alignment
FPA can filter on:
• type:
  • containment
  • internal match
  • dovetails
  • self match
• overlap length
• read name match against a regex
FPA can also rename your reads, compress its output (gzip, bzip, lzma) and convert your pairwise alignments into an overlap graph (GFA1).
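The overlap types listed above can be told apart from the PAF coordinates alone. The sketch below follows the classification rule described for minimap and miniasm (Li 2016); it is not fpa's own code, and the function name and the two thresholds (max_overhang, overhang_ratio) are illustrative assumptions.

```python
def classify_overlap(q_name, q_len, q_start, q_end, strand,
                     t_name, t_len, t_start, t_end,
                     max_overhang=1000, overhang_ratio=0.8):
    """Classify one pairwise overlap from its PAF coordinates."""
    if q_name == t_name:
        return "self match"
    if strand == "-":
        # Express target coordinates in the orientation of the query.
        t_start, t_end = t_len - t_end, t_len - t_start
    # Unaligned bases that should have matched if the reads truly overlap.
    overhang = min(q_start, t_start) + min(q_len - q_end, t_len - t_end)
    map_len = max(q_end - q_start, t_end - t_start)
    if overhang > min(max_overhang, map_len * overhang_ratio):
        return "internal match"
    if q_start <= t_start and q_len - q_end <= t_len - t_end:
        return "containment"  # query contained in target
    if q_start >= t_start and q_len - q_end >= t_len - t_end:
        return "containment"  # target contained in query
    return "dovetail"


def keep_overlap(overlap_type, overlap_length, min_length=2000):
    """Hypothetical filter: keep dovetails longer than min_length."""
    return overlap_type == "dovetail" and overlap_length > min_length


# The PAF example line from the mapping slide.
kind = classify_overlap("read1", 21581, 998, 20480, "+",
                        "read2", 19801, 45, 19527)
print(kind, keep_overlap(kind, 19482))
```

With these thresholds the example line is classified as a containment (read2 lies inside read1), so a filter that keeps only dovetails would drop it.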

Filter Pairwise Alignment
                                    wallclock time (s)   output length (Mb) (% space saved)   throughput (kb/s)
minimap2                            866                  565                                  652.320
minimap2 + fpa no filter            869                  565 (0%)                             650.047
minimap2 + fpa ovl length > 2000    868                  452 (20%)                            520.468
minimap2 + fpa dovetails only       869                  401 (29%)                            462.007
Dataset: SQK-MAP-006 2D nanopore reads, http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/
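The derived columns of this table can be rechecked with a few lines of arithmetic, reading the first number of each row as the wallclock time in seconds and the second as the output size in Mb. The function names are hypothetical, and the throughput obtained from the rounded sizes differs slightly from the values on the slide, which were presumably computed from exact file sizes.

```python
def space_saved_percent(baseline_mb, filtered_mb):
    """Percentage of output size saved relative to the unfiltered output."""
    return 100.0 * (baseline_mb - filtered_mb) / baseline_mb


def throughput_kb_per_s(output_mb, wallclock_s):
    """Output written per second of wallclock time, in kb/s."""
    return 1000.0 * output_mb / wallclock_s


# Rows of the table: (label, wallclock time in s, output size in Mb).
rows = [
    ("minimap2", 866, 565),
    ("minimap2 + fpa no filter", 869, 565),
    ("minimap2 + fpa ovl length > 2000", 868, 452),
    ("minimap2 + fpa dovetails only", 869, 401),
]
baseline_mb = rows[0][2]
for label, seconds, size_mb in rows:
    saved = space_saved_percent(baseline_mb, size_mb)
    rate = throughput_kb_per_s(size_mb, seconds)
    print(f"{label}: {saved:.0f}% saved, {rate:.1f} kb/s")
```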

Filter Pairwise Alignment
                      minimap2 + miniasm   minimap2 + fpa + miniasm   diff
PAF file size (Mb)    565                  452                        -20%
assembly time (s)     6.5                  6                          0.5
assembly result                                                       ∅
Dataset: SQK-MAP-006 2D nanopore reads, http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/

Conclusion
What we have:
• more and more third generation sequencing data
• analyses that generate even more intermediate data
• simple algorithms that can save time and space
What we need:
• a compressed pairwise alignment format
• more precise detection of poor-quality regions

Questions?
yacrd: https://gitlab.inria.fr/pmarijon/yacrd
fpa: https://gitlab.inria.fr/pmarijon/fpa
twitter: @pierre_marijon
Slides are available on my website: https://pierre.marijon.fr
