CONSENT: Scalable self-correction of long reads with multiple sequence alignment


  1. CONSENT: Scalable self-correction of long reads with multiple sequence alignment. Pierre Morisse (1), Camille Marchet (2), Antoine Limasset (2), Arnaud Lefebvre (1), Thierry Lecroq (1). (1) Normandie Univ, UNIROUEN, LITIS, Rouen 76000, France. (2) Lille Univ, CNRS, CRIStAL, Lille 59000, France. JOBIM 2019, Nantes, July 5th.

  2. Introduction, Workflow, Experiments, Conclusion. Introduction: Context. 2011: inception of third-generation sequencing technologies. Two main actors: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Sequencing of much longer reads, tens of kbp on average. Expected to solve various problems in the genome assembly field. But also very noisy (10-30% error rates), most errors being indels. Morisse et al. CONSENT 2/31

  3. Introduction: Error correction. Correction is an efficient way to handle these errors. Two approaches: hybrid correction, which makes use of complementary short reads, and self-correction, which corrects the long reads solely based on the information they contain.

  4. Introduction: Self-correction. Third-generation sequencing technologies evolve fast: error rates have greatly decreased and now reach 10-12% on average; read length is ever-growing, especially with ONT ultra-long reads (up to 1 Mbp). Error correction is still the first step of many analysis projects. Self-correction is now much more developed.

  5. Introduction: Self-correction. State-of-the-art: (1) compute overlaps between the long reads (LRs); (2) compute consensus from the overlaps.

  6. Introduction. Pseudo multiple sequence alignment (MSA): build a directed acyclic graph (DAG) to represent the pseudo MSA and compute consensus. De Bruijn graph (DBG): divide the alignments into small windows and correct the windows independently with DBGs. [Figure: example alignments of reads R1, R2 and R3, with the corresponding DAG.]

  8. Introduction: Contribution. Major issue: no self-correction tool scales to ONT ultra-long reads. We introduce CONSENT, a new self-correction method that: combines the two previous approaches (MSA + DBG); computes actual MSA; compares well to the state-of-the-art, and scales better; is also able to polish contigs.

  9. Pre-treatment. Overlap the long reads: currently with Minimap2 [Li, 2018], but the method does not depend on the aligner.

  10. First step: retrieve alignment piles. Select a long read A to correct.

  11. First step: retrieve alignment piles. Retrieve the long reads overlapping A.

  12. First step: retrieve alignment piles. Get the alignment pile: A with the overlapping reads R1, R2, R3, R4, R5, R6.
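
As an illustration of what an alignment pile could look like in code, here is a minimal sketch that groups overlaps by target read. It assumes the overlaps come as PAF lines (the tab-separated format written by Minimap2); the `Overlap` type, the `build_piles` function and the toy records are hypothetical and are not taken from CONSENT itself.

```python
from collections import defaultdict
from typing import NamedTuple


class Overlap(NamedTuple):
    """One overlap of a query read on a target read (hypothetical type)."""
    query: str
    target: str
    t_start: int
    t_end: int
    strand: str


def build_piles(paf_lines):
    """Group overlaps by target read: piles[read] lists the reads overlapping it."""
    piles = defaultdict(list)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        # PAF has 12 mandatory columns; skip malformed lines and self-overlaps.
        if len(f) < 12 or f[0] == f[5]:
            continue
        piles[f[5]].append(Overlap(f[0], f[5], int(f[7]), int(f[8]), f[4]))
    return dict(piles)


# Toy PAF records (made up for the example): R1 and R2 overlap read A.
paf = [
    "R1\t900\t0\t600\t+\tA\t1000\t400\t1000\t500\t600\t60",
    "R2\t800\t100\t800\t-\tA\t1000\t0\t700\t600\t700\t60",
    "A\t1000\t0\t1000\t+\tA\t1000\t0\t1000\t1000\t1000\t60",
]
piles = build_piles(paf)
```

Here `piles["A"]` holds the two overlaps of R1 and R2 on A; the self-overlap of A is ignored.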

  13. First step: retrieve alignment piles. Trim the alignment pile of A (R1 to R6).

  15. Second step: divide piles into windows. For correction, we will only consider windows that have a fixed length and are supported by at least c reads. Example on the previous pile (A, R1 to R6), with c = 4.
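
The window selection can be sketched as follows. This is a simplified illustration under stated assumptions: windows are taken as consecutive, non-overlapping slices of the read, and a read "supports" a window only if its overlap spans the window from end to end. The function name `supported_windows`, the interval coordinates and the window length are hypothetical.

```python
def supported_windows(read_len, intervals, window_len, c):
    """Return (start, end) windows of fixed length spanned by at least c reads.

    intervals: (start, end) positions, on the read to correct, of each overlap.
    """
    windows = []
    for start in range(0, read_len - window_len + 1, window_len):
        end = start + window_len
        # Count the overlaps that cover the whole window.
        support = sum(1 for s, e in intervals if s <= start and e >= end)
        if support >= c:
            windows.append((start, end))
    return windows


# Six made-up overlaps (standing in for R1 to R6) on a 1000-base read, c = 4.
intervals = [(0, 600), (100, 900), (0, 1000), (250, 1000), (500, 1000), (0, 450)]
windows = supported_windows(1000, intervals, 100, 4)
```

With these toy intervals, the first and last windows are spanned by only three reads and are therefore left out.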

  17. Third step: compute consensus of a window. Compute the MSA of the sequences, then compute the consensus from the MSA. Unlike other methods, an actual MSA is computed ⇒ POA [Lee et al., 2002].
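
To make "compute consensus from the MSA" concrete, here is a minimal sketch that takes an already computed MSA and applies a per-column majority vote. This is a deliberately simplified stand-in: it does not implement POA, which works on a partial order graph rather than on rows and columns. The function name, the `.` gap symbol and the toy alignment are assumptions made for the example.

```python
from collections import Counter


def consensus_from_msa(rows, gap="."):
    """Majority vote per column; columns where the gap wins are dropped."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all MSA rows must be non-empty and of equal length")
    out = []
    for column in zip(*rows):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            out.append(base)
    return "".join(out)


# Toy MSA of four sequences; each row carries at most one error.
msa = ["ACCAAGGT", "AC.AAGGT", "ACCAA.GT", "ACCTAGGT"]
consensus = consensus_from_msa(msa)
```

Each isolated deletion or substitution is outvoted by the other three rows, so the consensus recovers the underlying sequence.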

  18. Third step: compute consensus of a window. POA (Partial Order Alignment): multiple sequence alignment strategy based on partial order graphs. Two interests: (1) computes an actual multiple sequence alignment; (2) directly builds the DAG representing the multiple sequence alignment.

  21. Third step: compute consensus of a window. Segmentation strategy: in practice, we use windows of a few hundred bases. POA is time-consuming, even on such windows, so we developed a segmentation strategy: compute MSA and consensus for smaller sequences ⇒ faster.

  22. Third step: compute consensus of a window. Segmentation strategy, 1: compute shared anchors between the window's sequences.
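
One possible way to compute shared anchors is sketched below. The slide does not define what an anchor is, so this sketch assumes that an anchor is a k-mer occurring exactly once in a sequence and found in at least `min_support` of the window's sequences; the function name, `k` and `min_support` are hypothetical.

```python
from collections import Counter


def shared_anchors(sequences, k, min_support):
    """k-mers occurring exactly once in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Only k-mers that are unique within this sequence count as support.
        support.update(kmer for kmer, n in counts.items() if n == 1)
    return {kmer for kmer, n in support.items() if n >= min_support}


# Three toy sequences that differ by one substitution each.
seqs = ["GATCGGGTAT", "GATCGGTTAT", "GATCCGGTAT"]
anchors = shared_anchors(seqs, 4, 3)
```

Lowering `min_support` yields more anchors, at the price of anchors that are missing from some sequences.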

  24. Third step: compute consensus of a window. Segmentation strategy, 2: search for the longest chain of anchors such that, for all consecutive anchors $A_i$, $A_{i+1}$: (1) $A_i$ is followed by $A_{i+1}$ in at least $N$ sequences; (2) $A_{i+1}$ is never followed by $A_i$.
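
The chaining step can be sketched with a simple dynamic program. The sketch below rests on several assumptions that are not stated on the slide: "followed by" is read as "starts at an earlier position in the same sequence", only the first occurrence of each anchor is used, and the chain is required to follow the anchor order of the first sequence. The function name and the toy data are hypothetical.

```python
def longest_anchor_chain(sequences, anchors, n_min):
    """Longest chain of anchors, ordered as in sequences[0], in which every
    consecutive pair (a, b) has a before b in at least n_min sequences and
    b before a in none."""
    # First occurrence of each anchor in each sequence (-1 if absent).
    pos = {a: [s.find(a) for s in sequences] for a in anchors}
    ordered = sorted((a for a in anchors if pos[a][0] >= 0),
                     key=lambda a: pos[a][0])
    if not ordered:
        return []

    def compatible(a, b):
        forward = sum(1 for pa, pb in zip(pos[a], pos[b]) if 0 <= pa < pb)
        backward = sum(1 for pa, pb in zip(pos[a], pos[b]) if 0 <= pb < pa)
        return forward >= n_min and backward == 0

    best = [1] * len(ordered)   # best[i]: longest chain ending at ordered[i]
    prev = [-1] * len(ordered)  # back-pointers to rebuild the chain
    for i, b in enumerate(ordered):
        for j in range(i):
            if best[j] + 1 > best[i] and compatible(ordered[j], b):
                best[i], prev[i] = best[j] + 1, j

    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# In the third toy sequence, CCCA comes first, so it cannot join the chain.
seqs = ["AAACTGGGTTCCCA", "AAACGGGTACCCA", "CCCATAAACTGGGT"]
chain = longest_anchor_chain(seqs, ["AAAC", "GGGT", "CCCA"], 2)
```

The anchor CCCA is rejected because it precedes the other two anchors in one sequence, which violates the second condition.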

  28. Third step: compute consensus of a window. Segmentation strategy, 3: compute MSA / consensus for the sequences bordered by anchors. [Figure: one consensus (cons.) computed for each segment bordered by anchors.]

  31. Fourth step: polish the window's consensus. Approach: in the consensus, solid k-mers are in uppercase and weak k-mers in lowercase, e.g. GATCGGGTcatTGCCCGTGTTTATGCGTgtg. Build a DBG from the window's sequences and correct the lowercase regions.
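
The uppercase/lowercase marking can be illustrated with a short sketch. It assumes that a k-mer is "solid" when it occurs at least `min_count` times across the window's sequences, and that a base stays uppercase as soon as one solid k-mer covers it; both rules, the function name and the parameters are assumptions for the example. The DBG-based correction of the lowercase regions is not shown.

```python
from collections import Counter


def mark_weak_regions(consensus, sequences, k, min_count):
    """Uppercase bases covered by at least one solid k-mer, lowercase the rest."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    upper = consensus.upper()
    solid = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            solid[i:i + k] = [True] * k
    return "".join(b if s else b.lower() for b, s in zip(upper, solid))


# Toy window: the end of the consensus is not supported by the sequences.
window_seqs = ["GATCGGAT", "GATCGGTT", "GATCGGAA"]
marked = mark_weak_regions("GATCGGCAT", window_seqs, 3, 2)
```

The last three bases are covered by no solid 3-mer, so they come out in lowercase and would be the region left for the DBG to correct.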

  32. Fifth step: anchor the consensus to the read. By alignment: local alignment, around the positions of the window. Repeat with other windows.
