short read genome assembly Sorin Istrail CSCI1820 Short-read genome - PowerPoint PPT Presentation

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1

Genomathica Assembler • Mathematica notebook for genome assembly simulation • Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_asse mbler.nb • Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta • Remember to – Change the input genome to your FASTA file ’ s location – Evaluate each cell initially, then you only need to evaluate the last two cells to re-run the assembly, and display the results respectively – Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/ 2

coverage = 1 • Sequence reads are in black • Contiguous strings of assembled DNA ( contigs ) are in red

coverage = 2 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red

coverage = 2, paired ends

Sample prep Sequence data Raw Sequence Reads • wet-lab experimental methods to isolate, prepare, and sequence the DNA • results in a number of large FASTQ files • FASTQC can be used to check basic statistics of the files – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • many tools available for QC – e.g. http://hannonlab.cshl.edu/fastx_toolkit/

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

http://www.ncbi.nlm.nih.gov/Traces/sra/

Genome Assembly Software • Overlap-layout-consensus • Celera: http://wgs-assembler.sourceforge.net/ • K-mer based • Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ • SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html • ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/ • IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

Two graph models • A first graph model – Nodes (vertices) are contiguous sequences of k characters (k-mer) – Directed edge from v i to v j if v i [2..k]=v j [1..k-1] A C G T T C ACG CGT GTT TTC

Two graph models • De-bruijn graph – Nodes (vertices) are contiguous sequences of k-1 characters (k-1-mer) – Directed edge from v i to v j if v i [1..k-1]+v j [k-1] are a valid k-mer A C G T T C ACG CGT GTT TTC AC CG GT TT TC

Note edges that are not reflected in the input! Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap 17

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 ACGT CGTA GACG k=3 1 1 1 ACG CGT GTA GAC

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 1 TACG GTAC ACGT CGTA GACG k=3 1 1 2 1 ACG CGT GTA TAC GAC 1

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT CGTT k=4 1 1 1 1 1 TACG GTAC ACGT CGTA GACG GTT k=3 1 1 2 2 1 ACG CGT GTA TAC GAC 2

Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap – nodes as (k-1)-mers, edges form k-mers 21

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 ACG CGT GTA GAC k=3 1 1 1 1 AC CG GT TA GA

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 2 1 ACG CGT GTA TAC GAC 1 k=3 1 3 2 2 AC CG GT TA GA 1

Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT GTT k=4 1 1 2 2 1 ACG CGT GTA TAC GAC 2 1 TT GT k=3 1 1 4 2 2 AC CG GT TA GA 2

Genome Assembly • Building the k-mer graph – G(k): nodes as k-mers, edges (k-1) overlap – H(k): nodes as (k-1)-mers, edges form k-mers • H(k)=G(k-1) – So it really does not matter which you choose to implement • Where does the complexity come from? – Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions 25

Popping bubbles Error occurs in the middle of a read and is propagated to many k-mers.

Trimming tips Error creates an erroneous ending k-mer

Chimeric extensions Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Repetitive regions • Satellites, SINEs, LINEs • Homologous Genes – Ortholog: descended from the same ancestral sequence and separated by speciation – Paralog: genes created by a duplication event 29

30 Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Velvet assembler • Four stages – Hashing reads into k-mers – Constructing the de Bruijn graph (not all 4^k k- mers, only those that exist in input) – Correct errors – Resolve repeats • But what after? – Paper gives very little information on this... 32

The Chinese postman problem (CPP) • Compute a closed tour of minimum length that visits each edge at least once – Similar to what we want except we may want to visit edges more than once due to repeats • How do we deal with repeats? – Also, the starting and ending vertices are distinct in genome assembly • How can we convert the closed tour to an open one? 33

Your homework • You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching /Edmonds-Johnson-chinese-postman.pdf • You are not even required to model genome assembly as CPP • But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour. 34

Evaluating assembly • The Assemblathon2 study lists 102 measures for evaluating assembly quality. • Bradnam et al. ( 2013) Assemblathon 2: evaluating de novo methods fo genome assembly in three vertebrate species 1. NG50 scaffold length: a length x where all scaffolds of length x or longer consists of at least 50% of the genome size 2. NG50 contig length: a length x where all contigs of length x or longer consists of at least 50% of the genome size 3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding. 4. CEGMA: Number of 458 core genes mapped

Evaluating assembly 5. Fosmid coverage: How many validated fosmid regions were captured in assembly 6. Fosmid validity: Percentage of assembly validated by validated fosmid regions 7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold multiplied by the percentage of uniquely mapping tag pairs that map with correct distance. Rewards short-range accuracy. 8. and 9. Using local and global alignments of optimal map data, how well the assembly is ordered. 10. REAPR summary score: a tool that evalutes accuracy of assembly using paired reads

short read genome assembly Sorin Istrail CSCI1820 Short-read genome - PowerPoint PPT Presentation

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1 Genomathica Assembler Mathematica notebook for genome assembly simulation Assembler can be found at:

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

BayeHem: Bayesian Optimisation of Genome Assembly 1. Genome Assembly 2. Bayesian Optimisation

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

CUDA-Accelerated Short-Read Alignment to a Large Reference Genome Richard Wilton Department of

Optical-Kermit: Optical map guided genome assembly Miika Leinonen, Leena Salmela University of

Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Conference Welcome Paul Holme Chair NWPN Apprenticeships The Leeds Way Treat 2 million

Improving Information Services for Education The Strategic Vision New services will

Open Source and Capacity in the HISP Network 02.10.2017 Action and research in the HISP network

Data Linkage Research at the ANU Peter Christen Department of Computer Science, Faculty of

Google Matrix Analysis of DNA Sequences Vivek Kandiah and Dima Shepelyansky Laboratoire de

Assembly Assembly Assembling with Repeats Assembling with Repeats Mate Pairs Mate Pairs Whole

Design Patterns for Efficient Graph Algorithms in MapReduce Algorithms in MapReduce Jimmy Lin and

Chapter 6 Dynamic Programming CS 573: Algorithms, Fall 2013 September 12, 2013 6.1 Maximum

Sambuz

Useful Links

Newsletter

Mail Us

short read genome assembly Sorin Istrail CSCI1820 Short-read genome - PowerPoint PPT Presentation

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1 Genomathica Assembler Mathematica notebook for genome assembly simulation Assembler can be found at:

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

BayeHem: Bayesian Optimisation of Genome Assembly 1. Genome Assembly 2. Bayesian Optimisation

Introduction to Bioinformatics Genome sequencing &amp; assembly Genome sequencing &amp; assembly

10X Genome Assembly Technology and Single Cell CNV Credit: 10X Genomics Diana Burkart-Waco DNA

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics &amp; Computational

Genome Sequencing &amp; Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

CUDA-Accelerated Short-Read Alignment to a Large Reference Genome Richard Wilton Department of

Optical-Kermit: Optical map guided genome assembly Miika Leinonen, Leena Salmela University of

Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Conference Welcome Paul Holme Chair NWPN Apprenticeships The Leeds Way Treat 2 million

Improving Information Services for Education The Strategic Vision New services will

Open Source and Capacity in the HISP Network 02.10.2017 Action and research in the HISP network

Data Linkage Research at the ANU Peter Christen Department of Computer Science, Faculty of

Google Matrix Analysis of DNA Sequences Vivek Kandiah and Dima Shepelyansky Laboratoire de

Assembly Assembly Assembling with Repeats Assembling with Repeats Mate Pairs Mate Pairs Whole

Design Patterns for Efficient Graph Algorithms in MapReduce Algorithms in MapReduce Jimmy Lin and

Chapter 6 Dynamic Programming CS 573: Algorithms, Fall 2013 September 12, 2013 6.1 Maximum

Sambuz

Useful Links

Newsletter

Mail Us

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference