Relaxations of the Seriation Problem and Applications to de novo Genome Assembly

Thesis defense (soutenance de thèse) by Antoine Recanati, supervised by Alexandre d'Aspremont, 29 November 2018.


  1. Relaxations of the Seriation Problem and Applications to de novo Genome Assembly. Thesis defense by Antoine Recanati, supervised by Alexandre d'Aspremont, 29 November 2018

  2. Introduction

  3. Genome sequencing: ...ATGGCGTGCAATG... shown with its complementary strand ...TACCGCACGTTAC...

  8. DNA sequencing. Genome is cut into overlapping fragments (reads). Ex: ATGGCGTGCAATG is cut into CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC. Image: Nik Spencer/Nature

  9. Assembly. Goal: assemble reads together to reconstruct the full sequence. The position and ordering of the reads are unknown. Ex: the reads CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC are to be assembled into ATGGCGTGCAATG.

  18. Genome assembly: mapping. If a reference genome is available: map the fragments to it, then derive a consensus sequence. Reference (proxy): AAGGCGTGCATTG. Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC. Assembly: ATGGCGTGCAATG.
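
The mapping step on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (substitution errors only, no insertions or deletions, brute-force search over offsets); the function names `hamming`, `map_read` and `consensus_from_mapping` are hypothetical and do not come from the talk.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference):
    """Offset in `reference` where `read` aligns with the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=lambda i: hamming(read, reference[i:i + len(read)]))

def consensus_from_mapping(reads, reference):
    """Map every read to the reference, then take a per-position majority vote."""
    votes = [Counter() for _ in reference]
    for read in reads:
        start = map_read(read, reference)
        for j, base in enumerate(read):
            votes[start + j][base] += 1
    # Positions covered by no read fall back to the reference base.
    return "".join(
        v.most_common(1)[0][0] if v else ref_base
        for v, ref_base in zip(votes, reference)
    )

reference = "AAGGCGTGCATTG"  # proxy reference from the slide
reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
assembly = consensus_from_mapping(reads, reference)
print(assembly)  # ATGGCGTGCAATG
```

The reference differs from the sequenced genome at two positions, yet each read still maps to its true offset, and the vote among the reads recovers the sequenced genome rather than the reference.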

  23. Genome assembly: de novo. No reference available. Greedy assembly: take one read, “add” the one with the largest overlap, etc., until all reads are included. Reads: CGTGCAA, ATGGCGT, TGCAATG, GGCGTGC. Result: ATGGCGTGCAATG.
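
A minimal sketch of the greedy procedure described on this slide, assuming error-free reads and exact suffix-prefix overlaps. The helper names `overlap` and `greedy_assemble` are hypothetical, and real assemblers use approximate overlaps and handle reverse complements, which this sketch ignores.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Grow one contig by repeatedly adding the read with the largest overlap."""
    remaining = list(reads)
    contig = remaining.pop(0)  # start from an arbitrary read
    while remaining:
        candidates = []
        for r in remaining:
            candidates.append((overlap(contig, r), "right", r))  # r extends the contig to the right
            candidates.append((overlap(r, contig), "left", r))   # r extends the contig to the left
        k, side, read = max(candidates, key=lambda c: c[0])
        if side == "right":
            contig = contig + read[k:]
        else:
            contig = read[:len(read) - k] + contig
        remaining.remove(read)
    return contig

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
contig = greedy_assemble(reads)
print(contig)  # ATGGCGTGCAATG
```

On this toy example the greedy choice is always correct, but spurious short overlaps already appear (for instance the contig ending in ATG also overlaps the start of ATGGCGT), which hints at why greedy methods can fail on repeats.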

  24. De novo assembly paradigms • Greedy methods • De Bruijn graphs • Overlap-Layout-Consensus

  25. Overlap-Layout-Consensus • Compute overlaps between all read pairs • Find a tiling of the reads consistent with the overlaps • Average read values to create a consensus sequence. Ex (sequence ATGGCGTGCAATG): ATGGCGT and GGCGTGC overlap on GGCGT; GGCGTGC and CGTGCAA on CGTGC; CGTGCAA and TGCAATG on TGCAA; ATGGCGT and CGTGCAA on CGT; GGCGTGC and TGCAATG on TGC.
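
The three OLC stages can be illustrated on the same toy reads. The sketch below is an assumption-laden simplification, not the method developed in the talk: overlaps are exact suffix-prefix matches, the layout is found by following the best successor of each read, and the consensus is a per-column majority vote. The names `overlap` and `olc_assemble` are hypothetical.

```python
from collections import Counter

def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def olc_assemble(reads):
    n = len(reads)
    # Overlap: suffix-prefix overlap length for every ordered pair of reads.
    ov = [[overlap(reads[i], reads[j]) if i != j else 0 for j in range(n)]
          for i in range(n)]
    # Layout: start from the read with the weakest incoming overlap,
    # then repeatedly follow the unused read with the best overlap.
    start = min(range(n), key=lambda j: max(ov[i][j] for i in range(n)))
    order, offsets, used = [start], [0], {start}
    while len(order) < n:
        i = order[-1]
        j = max((c for c in range(n) if c not in used), key=lambda c: ov[i][c])
        offsets.append(offsets[-1] + len(reads[i]) - ov[i][j])
        order.append(j)
        used.add(j)
    # Consensus: majority vote in each column of the tiling.
    length = max(off + len(reads[i]) for i, off in zip(order, offsets))
    votes = [Counter() for _ in range(length)]
    for i, off in zip(order, offsets):
        for p, base in enumerate(reads[i]):
            votes[off + p][base] += 1
    consensus = "".join(v.most_common(1)[0][0] for v in votes)
    return ov, order, consensus

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
ov, order, consensus = olc_assemble(reads)
print([reads[i] for i in order])  # ['ATGGCGT', 'GGCGTGC', 'CGTGCAA', 'TGCAATG']
print(consensus)                  # ATGGCGTGCAATG
```

With noisy reads the overlap stage would need approximate matching, and the layout stage is where an ordering problem over the overlap matrix appears.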

  26. Modern sequencing technologies • 2nd gen. (SGS): short (∼100 bp), accurate (< 2% error) reads (Illumina/Solexa), with pairing information. De Bruijn graph methods (graphs built on k-mers) are preferred. • 3rd gen.: long (∼10,000 bp), noisy (∼10% error) reads (Pacific Biosciences [PacBio], Oxford Nanopore Technologies [ONT]). Comeback of OLC methods. • The two can be combined to get both accuracy and length (hybrid methods)
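
To make the k-mer based graph mentioned for short reads concrete, here is a small sketch of a de Bruijn graph construction. It assumes the common convention where nodes are (k-1)-mers and each k-mer found in a read contributes one edge; the function name `de_bruijn_graph` and the choice k = 4 are illustrative only.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix (k-1)-mer -> suffix (k-1)-mer
    return graph

reads = ["CGTGCAA", "ATGGCGT", "TGCAATG", "GGCGTGC"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Assembly then amounts to finding a path through this graph that uses the edges, which works well when reads are accurate enough for k-mers to match exactly.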

  27. De novo assembly methods with ONT reads. State of the art: Canu (successor to Celera Assembler). Heavy pre-processing, many heuristics • correction: uses (hash-based) overlaps for consensus • trimming: recalculate overlaps to filter low-coverage/high-error regions • re-computation of overlaps with specific target error rates (uses an a priori error model) • assemble unitigs (unambiguous sequences) first, then incremental scaffolding

  28. De novo assembly methods with ONT reads • ONT-only assemblers (non-hybrid): active field of research, 2015 to present • Canu: complex pipeline, high-quality consensus. • Miniasm: borrows ideas from Canu's assembly step, no pre-processing, smart heuristics. Ultra-fast, low-quality. • Is there a naive OLC approach with a clean mathematical formulation?

  29. Outline: Introduction; De novo Genome Assembly; Seriation; Application of the Spectral Method to Genome Assembly; Robust Seriation; Multi-dimensional spectral ordering; Conclusion
