De novo genome assembly Dr Torsten Seemann IMB Winter School - - PowerPoint PPT Presentation

De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane – Mon 1 July 2013

Introduction

Ideal world I would not need to give this talk! AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG AGATTCGTATAAGTCT AGGATTCGCTATAGAT TCAGGCTCTGATATAT TTCGCGGGATTAGCTA Human DNA Non-existent 46 complete USB3 device haplotype chromosome sequences

Real world • Can’t sequence full-length native DNA – no instrument exists (yet) • But we can sequence short fragments – 100 at a time (Sanger) – 100,000 at a time (Roche 454) – 1,000,000 at a time (PGM) – 10,000,000 at a time (Proton, MiSeq) – 100,000,000 at a time (HiSeq)

Make a DNA library • DNA preparation – depends on sequencing platform being used • Typical steps – Shearing: chop DNA into smaller fragments – Size selection: choose the size range you need – Adaptor ligation: add special sequence to ends • Now ready to sequence!

Instruments Platform Method Read Length Yield Quality Value synthesis + ++++ +++++ ++++ Illumina 250 fluorescence ligation + ++++ +++ +++ SOLiD 75 fluorescence non-term NTP + ++ +++ +++ PGM 300 pH wells non-term NTP + +++ ++ +++ Proton 400 pH wells non-term NTP + + +++ ++ Roche 454 600 luminescence ++ + ++ PacBio 12000 synthesis + ZMW

Which sequencing platform? Long reads Pick High Low yield any 3 cost High quality

De novo assembly The process of reconstructing the original DNA sequence from the fragment reads alone. • Instinctively like a jigsaw puzzle – Find reads which “fit together” (overlap) – Could be missing pieces (sequencing bias) – Some pieces will be dirty (sequencing errors)

An example

A small “genome” I’ll return them tomorrow! Friends, Romans, countrymen, lend me your ears;

Shakespearomics • Reads Oops! ds, Romans, count I dropped ns, countrymen, le them. Friends, Rom send me your ears; crymen, lend me

Shakespearomics • Reads I ’ m good ds, Romans, count with words. ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears;

Shakespearomics • Reads We have a ds, Romans, count consensus! ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; • Majority consensus Friends, Romans, countrymen, lend me your ears;

So far, so good.

The awful truth “Genome assembly is impossible.” He wears glasses so he must be smart A/Prof. Mihai Pop World leader in de novo assembly research.

Methods

Approaches • greedy assembly • overlap :: layout :: consensus • de Bruijn graphs • string graphs • seed and extend … all essentially doing the same thing, but taking different short cuts.

Assembly recipe • Find all overlaps between reads – hmm, sounds like a lot of work … • Build a graph – a picture of read connections • Simplify the graph – sequencing errors will mess it up a lot • Traverse the graph – trace a sensible path to produce a consensus

Find read overlaps • If we have N reads of length L – we have to do ½ N(N-1) ~ O(N ² ) comparisons – each comparison is an ~ O(L ² ) alignment – use special tricks/heuristics to reduce these! • What counts as “ overlapping ” ? – minimum overlap length eg. 20bp – minimum %identity across overlap eg. 95% – choice depends on L and expected error rate

N=6 → 15 alignment scores 1 2 3 4 5 6 Read# 1 - - - - - - 2 80 - - - - - 3 95 85 - - - - 4 0 30 20 - - - 5 0 0 25 70 - - 6 0 35 25 60 50 -

Graph construction Thicker lines mean stronger evidence for overlap Node/Vertex Edge/Arc

A more realistic graph

What ruins the graph? • Read errors – introduce false edges and nodes • Non-haploid organisms – heterozygosity causes lots of detours • Repeats – if longer than read length – causes nodes to be shared, locality confusion

Graph simplification • Squash small bubbles – collapse small errors (or minor heterozygosity) • Remove spurs – short “dead end” hairs on the graph • Join unambiguously connected nodes – reliable stretches of unique DNA

Graph traversal • For each unconnected graph – at least one per replicon in original sample • Find a path which visits each node once – Hamiltonian path/cycle is NP-hard (this is bad) – solution will be a set of paths which terminate at decision points • Form a consensus sequences from paths – use all the overlap alignments – each of these collapsed paths is a contig

Contigs Contiguous, unambiguous stretches of assembled DNA sequence • Contigs ends correspond to – Real ends (for linear DNA molecules) – Dead ends (missing sequence) – Decision points (forks in the road)

Repeats

What is a repeat? A segment of DNA which occurs more than once in the genome sequence • Very common – Transposons (self replicating genes) – Satellites (repetitive adjacent patterns) – Gene duplications (paralogs)

Dot plots Self similarity plot, genome versus itself

Effect on assembly The repeated element is collapsed into a single contig

Repeat mis-assembly excision collapsed tandem I II III d a b c a b c c b I III a d a c b II b c rearrangement I II III IV c e a b d f I III II IV a e c a d b f

The law of repeats • It is impossible to resolve repeats of length S unless you have reads longer than S. • It is impossible to resolve repeats of length S unless you have reads longer than S.

Scaffolding

Beyond contigs Contig sizes are limited by: • the length of repeats in your genome – Can’t change this! • the length (or “ span ” ) of the reads – Wait for new technology – Use “tricks” with existing technology

Types of reads • Example fragment – atcgtatgatcttgagattctctcttcccttatagctgctata • “Single-end” read – atcgtatgatcttgagattctctcttcccttatagctgctata – Sequence one end of the fragment • “Paired-end” read – atcgtatgatcttgagattctctcttcccttatagctgctata – Sequence both ends of same fragment – we can exploit this information!

Scaffolding • Paired-end reads – known sequences at either end – roughly known distance between ends – unknown sequence between ends • Most ends will occur in same contig – if our contigs are longer than pair distance • Some ends will be in different contigs – evidence that these contigs are linked!

Contigs to scaffolds Paired-end read Contigs Scaffold Gap Gap

Assumptions

What can we assemble? • Genomes – A single organism eg. its chromosomal DNA • Meta-genomes – Genomic DNA from a mixture of organisms • Transcriptomes – A single organism’s RNA inc. mRNA, ncRNA • Meta-transcriptomes – RNA from a mixture of organisms 2:30pm

Genomes • Expect uniformity – Each part of genome represented by roughly equal number of reads • Average depth of coverage – Genome: 4 Mbp – Yield: 4 million x 50 bp reads = 200 Mbp – Coverage: 200 ÷ 4 = 50x (reads per bp)

Meta-genomes • Expect proportionality & uniformity – Each genome represented by proportion of reads similar to their proportion in mixture • Example – Mix of 3 species: ¼ Staph, ¼ Clost, ½ Ecoli – Say we get 4M reads – Then we expect about: 1M from Staph, 1M from Clost, 2M from Ecoli

Meta-genome issues • Closely related species – will have very similar reads – lots of shared nodes in the graph • Conserved sequence – bits of DNA common to lots of organisms – “hub” nodes in the graph • Untangling is difficult – need longer reads

Assessment

Assessing assemblies • We desire – Total length similar to genome size – Fewer, larger contigs – Correct contigs • Metrics – No generally useful objective measure – Longest contig, total bp, N50, …

The “N50” The length of that contig from which 50% of the bases are in it and shorter contigs • Imagine we got 7 contigs with lengths: – 1,1,3,5,8,12,20 • Total – 1+1+3+5+8+12+20 = 50 • N50 is the “halfway sum” = 25 – 1+1+3+5+8+12 = 30 ( ≥ 25) so N50 is 12

N50 concerns • Optimizing for N50 – encourages mis-assemblies! • An aggressive assembler may over-join: – 1,1,3,5, 8,12 ,20 (previous) – 1,1,3,5, 20 ,20 (now) – 1+1+3+5+20+20 = 50 (unchanged) • N50 is the “halfway sum” (still 25) – 1+1+3+5+20= 30 ( ≥ 25) so N50 is 20 (was 12)

Validation • Self consistency – Align read back to contigs – Check for errors or discordant pairs • Second opinion – Use two complementary sequencing methods – Target troublesome areas for PCR – Use a genome wide “optical map”

How do I do it?

Example • Culture your bacterium • Extract your genomic DNA • Send it to AGRF for Illumina sequencing – 100bp paired end • Get back two files: – MRSA_R1.fastq.gz – MRSA_R2.fastq.gz • Now what?

De novo genome assembly Dr Torsten Seemann IMB Winter School - - PowerPoint PPT Presentation

De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane Mon 1 July 2013 Introduction Ideal world I would not need to give this talk! AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG

De Novo Genome Analysis . . . . . Ketil Malde Analysis Annotation evaluation Assembly

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

De novo genome assembly versus mapping to a reference genome Beat Wolf PhD. Student in Computer

BayeHem: Bayesian Optimisation of Genome Assembly 1. Genome Assembly 2. Bayesian Optimisation

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

SciLifeLab Drug Discovery Workshop Uppsala 1 June 2015 Nanna Lneborg Novo Seeds Novo Seeds

Bioinformatics Seminars Series: Assembly Validation Francesco Vezzi KTH: Royal Institute of

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi

Relaxations of the Seriation Problem and Applications to de novo Genome Assembly Soutenance de

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

DNA Assembly and Finishing DNA Assembly and Finishing Latin American Course on Bioinformatics

Summary Company Auto Components LED Financials 3 Fiem Industries Ltd. (FIEM) was founded

Timelapse Photography A (VE RY) BRIE F INTRODUCTION Derek Carlin | New Westminster Photography

APEC Human Resource Development Working Group Network on Economic Development Management

Repetitive DNA and next-generation sequencing: computational challenges and solutions Todd J.

17 March 2015, San Jose The research has been supported by grant No. 2012/05/B/ST6/03026 from the

Computing and Deep Learning Johnny Israeli COMPUTE TRENDS GPU-Computing perf 10 1.5X per year

Universal Network Design and Assembly Introduction DNA Assembly This year, we improved upon

De novo genome assembly Dr Torsten Seemann IMB Winter School - - PowerPoint PPT Presentation

De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane Mon 1 July 2013 Introduction Ideal world I would not need to give this talk! AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG

De Novo Genome Analysis . . . . . Ketil Malde Analysis Annotation evaluation Assembly

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

De novo genome assembly versus mapping to a reference genome Beat Wolf PhD. Student in Computer

BayeHem: Bayesian Optimisation of Genome Assembly 1. Genome Assembly 2. Bayesian Optimisation

Introduction to Bioinformatics Genome sequencing &amp; assembly Genome sequencing &amp; assembly

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Genome Sequencing &amp; Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

SciLifeLab Drug Discovery Workshop Uppsala 1 June 2015 Nanna Lneborg Novo Seeds Novo Seeds

Bioinformatics Seminars Series: Assembly Validation Francesco Vezzi KTH: Royal Institute of

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi

Relaxations of the Seriation Problem and Applications to de novo Genome Assembly Soutenance de

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics &amp; Computational

DNA Assembly and Finishing DNA Assembly and Finishing Latin American Course on Bioinformatics

Summary Company Auto Components LED Financials 3 Fiem Industries Ltd. (FIEM) was founded

Timelapse Photography A (VE RY) BRIE F INTRODUCTION Derek Carlin | New Westminster Photography

APEC Human Resource Development Working Group Network on Economic Development Management

Repetitive DNA and next-generation sequencing: computational challenges and solutions Todd J.

17 March 2015, San Jose The research has been supported by grant No. 2012/05/B/ST6/03026 from the

Computing and Deep Learning Johnny Israeli COMPUTE TRENDS GPU-Computing perf 10 1.5X per year

Universal Network Design and Assembly Introduction DNA Assembly This year, we improved upon

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational