Glenn Tesler University of California, San Diego Department of - PowerPoint PPT Presentation

Glenn Tesler
University of California, San Diego
Department of Mathematics
Joint work with:
• Jeff McLean and Roger Lasken’s labs at JCVI
• Pavel Pevzner’s labs at UCSD and St. Petersburg

• Genome sequencing
  ◦ Conventional
  ◦ Metagenomics
  ◦ Single Cell
• De Bruijn graphs & SPAdes genome assembler
• P. gingivalis found in a hospital sink drain

• The E. coli genome is ~4.6 million nucleotides long. Represent it as a (circular) string over the alphabet {A, C, G, T}:

E. coli K-12 substr. MG1655
1-50             AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA
51-100           AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
101-150          TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA
151-200          GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCAT
. . .
4639651-4639675  AAAAACGCCTTAGTAAGTATTTTTC

• The human genome is ~3 billion nucleotides long, split into chromosomes represented as linear strings over {A, C, G, T}.
• Current technologies read ~25 – 10000 consecutive nucleotides. We focus on the popular Illumina GAIIx, with 100 nt reads.

• Fragment many copies of the same genome. Positional information is lost.
• Sequence reads (25 to 10000 nt) at one or both ends of fragments.
• Find overlapping reads:
                    ACGTAGAATCGACCATG...
      ...AACATAGTTGACGTAGAATC
• Merge overlapping reads into contigs:
      ...AACATAGTTGACGTAGAATCGACCATG...
[Figure: three contigs separated by two gaps; one region is marked "Coverage here = 2".]
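The overlap-and-merge step on this slide can be sketched in a few lines. This is only an illustration using the two reads shown above; the function names `overlap` and `merge` and the `min_len` parameter are assumptions made for the example, not part of any assembler discussed in the talk, and real assemblers must also tolerate sequencing errors and reverse complements.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Merge b onto the end of a if they overlap; otherwise return None."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# The two example reads from the slide (exact overlap, no errors).
left = "AACATAGTTGACGTAGAATC"
right = "ACGTAGAATCGACCATG"

shared = overlap(left, right)
contig = merge(left, right)
print(shared, contig)
```

Trying every suffix length is quadratic in the read length, which is fine for a toy example but is one reason overlap-based assembly becomes expensive with tens of millions of reads.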

• Problem: Given a collection of reads (short substrings of the genome sequence in the alphabet A, C, G, T), reconstruct the genome from which the reads are derived.
• Challenges:
  ◦ Repeats in the genome (here GACTGGGAT occurs twice):
      Genome:        …ACCCAGTT GACTGGGAT CCTTTTTAAA GACTGGGAT TTTAACGCGTAAG…
      Sample reads:  CAGTT GACTG
                     ACTGGGAT CC
                     GACTGGGAT T
  ◦ Sequencing errors (vary by platform and protocol), including:
      CCTTTTTATAGACTG    Substitution
      CCTTTTTA-AGACTGG   Deletion
      CCTTTCTTAAAGACT    Insertion
      CCTTTTTTTTAAAGA    Homopolymer
      CCTTTTTTCGCGTAA    Chimeric read
  ◦ Size of the data, e.g. 30 million reads of length 100 nt in a 7 GB file.

• Traditional microbial genome sequencing requires isolating a pure strain and reproducing it in a ‘culture’ under controlled laboratory conditions. But >99% of bacteria cannot be cultured.
• Metagenomics enables studies of organisms not easily cultured in a laboratory. It uses collective sequencing of non-identical cells.
• Until recently, metagenomics was the only option for studies of microbial communities. However, metagenomics provides information about only a few genes (across many species), and/or information about the dominant species.
[Figure labels: gene 1, gene 2, gene 3.]

• Single Cell Bacterial Genomics: Complementing gene-centric metagenomics data with whole-genome assembly of uncultivated organisms.
• 1000s of genes sequenced from a single cell.

• Roger Lasken’s lab developed Multiple Displacement Amplification (MDA).
• More effective than PCR for amplification of a single cell.
• Commercially available kits: TempliPhi and GenomiPhi (GE Healthcare) and REPLI-g (Qiagen).
• REPLI-g: fragments ~2 – 100 kb; usually > 10 kb on average.
[Figure: Genomic DNA.]
F.B. Dean, J.R. Nelson, T.L. Giesler, R.S. Lasken (2001). Genome Res. 11:1095-9
F.B. Dean, S. Hosono, L. Fang, et al. (2002). PNAS 99:5261-6

[Figure, built up over four further slides that repeat the MDA text and citations: Genomic DNA, followed by 1st, 2nd, 3rd, and 4th generation copies.]

• Lander-Waterman model predicts ~15x coverage needed for complete E. coli assembly.
• Assumes uniform coverage; error-free reads; and no repeats in genome.
• For our single cell E. coli assembly, 600x average coverage still has some gaps since there are positions with no reads.
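The ~15x figure can be checked with the standard Lander-Waterman estimate for the expected number of contigs, N·e^(−cσ), where N is the number of reads, c = N·L/G is the coverage, and σ = 1 − T/L for a minimum detectable overlap T. The sketch below plugs in the E. coli genome length and 100 nt reads from this talk; the function name `expected_contigs` and the choice T = 0 are assumptions made for the example, and the model's idealizations (uniform coverage, no errors, no repeats) are exactly the ones listed above.

```python
import math


def expected_contigs(genome_len: int, read_len: int, coverage: float,
                     min_overlap: int = 0) -> float:
    """Lander-Waterman expected number of contigs at a given coverage.

    Assumes reads start uniformly at random, contain no errors,
    and the genome has no repeats.
    """
    n_reads = coverage * genome_len / read_len
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-coverage * sigma)


G = 4_600_000   # E. coli genome length in nucleotides (from the talk)
L = 100         # read length in nucleotides (from the talk)

for c in (5, 10, 15, 20):
    print(f"{c:>3}x coverage: {expected_contigs(G, L, c):.3g} expected contigs")
```

Under these assumptions the estimate is still above one at 10x and falls below one by 15x, which is consistent with the slide. Single cell data violate the uniform-coverage assumption, so the same formula says little about the 600x single cell dataset.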

A cutoff threshold will eliminate about 25% of valid data in the single cell case, whereas it eliminates noise in the normal multicell case.
Chitsaz et al., Nat. Biotechnol. (2011).

• Vertices: k-mers from the sequence
• Edges: (k+1)-mers from the sequence
• k=3: 4-mer wxyz gives wxy → xyz
• Genome: Eulerian path through graph (using edge multiplicities)
Genome: ABCDEFGHIJCDEFGKL
[Figure: De Bruijn graph on the vertices ABC, BCD, CDE, DEF, EFG, FGH, GHI, HIJ, IJC, JCD, FGK, GKL. One edge is labeled ABCD, and two edges are marked "(twice)".]
P. Pevzner, J Biomol Struct Dyn (1989) 7:63–73
R. Idury, M. Waterman, J Comput Biol (1995) 2:291–306
P. Pevzner, H. Tang, M. Waterman, PNAS (2001) 98(17):9748-53
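The construction on this slide is small enough to reproduce directly. The sketch below builds the edge list for the toy genome with k = 3, keeping a count for each (k+1)-mer as its edge multiplicity. It treats the genome as a linear string, and the name `de_bruijn_edges` is chosen for this example only.

```python
from collections import Counter


def de_bruijn_edges(seq: str, k: int) -> Counter:
    """Map each (k+1)-mer of seq to its number of occurrences.

    Each (k+1)-mer wxyz... is an edge from its first k letters
    to its last k letters; the count is the edge multiplicity.
    """
    return Counter(seq[i:i + k + 1] for i in range(len(seq) - k))


genome = "ABCDEFGHIJCDEFGKL"
k = 3

edges = de_bruijn_edges(genome, k)
vertices = {e[:-1] for e in edges} | {e[1:] for e in edges}
repeated = sorted(e for e, mult in edges.items() if mult > 1)

for edge, mult in edges.items():
    print(f"{edge[:-1]} -> {edge[1:]}   edge {edge}, multiplicity {mult}")
print(len(vertices), "vertices;", "edges used twice:", repeated)
```

The two edges with multiplicity 2 come from the repeated substring CDEFG, and an Eulerian path that uses every edge as many times as its multiplicity spells the genome back out.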

• Vertices: k-mers from the reads
• Edges: (k+1)-mers from the reads
• k=3: 4-mer wxyz gives wxy → xyz
• Reads: short walks through graph (red)
• Genome: long walk through graph
• We lose exact repeat multiplicities
Reads (but order would be random in real data): ABCDEFG, DEFGHIJ, GHIJCDE, IJCDEFG, CDEFGKL
[Figure: De Bruijn graph on the vertices ABC, BCD, CDE, DEF, EFG, FGH, GHI, HIJ, IJC, JCD, FGK, GKL, with one edge labeled ABCD.]
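The point about losing repeat multiplicities can be seen by building the graph twice, once from the toy genome and once from the five reads on this slide. In this sketch (the helper name `kmers` is an assumption for the example) the two graphs have the same edge set, but the number of times an edge is seen in the reads reflects how many reads happen to cover it rather than how often it occurs in the genome.

```python
from collections import Counter


def kmers(s: str, n: int) -> list:
    """All length-n substrings of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


k = 3
genome = "ABCDEFGHIJCDEFGKL"
reads = ["ABCDEFG", "DEFGHIJ", "GHIJCDE", "IJCDEFG", "CDEFGKL"]

genome_edges = Counter(kmers(genome, k + 1))
read_edges = Counter(e for r in reads for e in kmers(r, k + 1))

same_graph = set(genome_edges) == set(read_edges)
print("same edge set:", same_graph)
for e in sorted(genome_edges):
    print(e, "in genome:", genome_edges[e], "in reads:", read_edges[e])
```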

[Figure: the same graph with each non-branching path condensed into a single edge, labeled ABCDE, CDEFG, EFGHIJCDE, and EFGKL.]
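The four labels on this slide are what one gets by merging every non-branching path of the k = 3 graph into one edge. A minimal sketch of that condensation follows; the function name `condense` is an assumption for the example, and it deliberately ignores isolated cycles, which do not arise in this toy graph.

```python
from collections import defaultdict


def condense(edge_kmers, k: int) -> list:
    """Merge non-branching paths of a de Bruijn graph into single labels.

    edge_kmers: iterable of (k+1)-mers; each is an edge from its
    first k letters to its last k letters. Isolated cycles are skipped.
    """
    out = defaultdict(list)
    indeg = defaultdict(int)
    for e in sorted(set(edge_kmers)):
        out[e[:k]].append(e[1:])
        indeg[e[1:]] += 1

    def is_branch(v: str) -> bool:
        # A vertex can be passed through only if it has exactly
        # one incoming edge and one outgoing edge.
        return not (indeg[v] == 1 and len(out[v]) == 1)

    paths = []
    for u in list(out):              # snapshot: the dicts grow on lookup
        if is_branch(u):
            for v in out[u]:
                label = u + v[-1]
                while not is_branch(v):
                    v = out[v][0]
                    label += v[-1]
                paths.append(label)
    return sorted(paths)


genome = "ABCDEFGHIJCDEFGKL"
k = 3
edge_kmers = [genome[i:i + k + 1] for i in range(len(genome) - k)]
condensed = condense(edge_kmers, k)
print(condensed)
```

Here CDE (two incoming edges) and EFG (two outgoing edges) are the only branching vertices apart from the two ends, so every condensed edge starts and stops at one of those four vertices.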

Genome length: 4.6 million bases
Reads:
• Illumina GA IIx platform, paired end sequencing
• 100 bases/read
• Reads are in pairs spanning ~250 bases (varies)
• ~30 million reads (15 million read pairs)
• ~600x coverage
• ~7 GB FASTQ file
De Bruijn graph:
• Can set k between ~25 – 70.
• We used parameters: 55-mer vertices, 56-mer edges
Graph size:
• Initially: ~200 million vertices (55-mers)
• Output: ~200 – 2000 contigs (varies by assembler), ~4.6 million bases
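The figures on this slide hang together under simple arithmetic, sketched below. The variable names are chosen for this example, and reading "7 GB" as 7 × 10^9 bytes is an assumption.

```python
genome_len = 4_600_000      # bases
n_reads = 30_000_000        # 15 million read pairs
read_len = 100              # bases per read
k = 55                      # vertex size; edges are (k+1)-mers
fastq_bytes = 7e9           # assumption: "7 GB" taken as 7 * 10**9 bytes

# Average coverage = total sequenced bases / genome length.
coverage = n_reads * read_len / genome_len

# A read of length L contains L - n + 1 substrings of length n.
vertices_per_read = read_len - k + 1
edges_per_read = read_len - (k + 1) + 1

# Average FASTQ record size per read.
bytes_per_read = fastq_bytes / n_reads

print(f"coverage          ~{coverage:.0f}x")
print(f"55-mers per read   {vertices_per_read}")
print(f"56-mers per read   {edges_per_read}")
print(f"bytes per read    ~{bytes_per_read:.0f}")
```

The computed coverage is roughly 650x, which the slide rounds to ~600x, and each 100-base read contributes 46 vertex k-mers and 45 edges to the graph before any duplicates are merged.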

[Figure: SPAdes pipeline. Stage labels: Error correction; De Bruijn graph; Mate pairs processing and repeats; Postprocessing. Step labels: De Bruijn graph construction, Tip clipping, Bulge removal, Chimeric removal, Distance estimation, Repeat resolution, Gap closing, Contig refinement.]
• De Bruijn graph assembler.
• Adapted to handle conventional and single cell datasets.
• Instead of global thresholds, uses local coverage, topology, and lengths to decide how to process the assembly graph.

Bulge from error in middle of read (paths P and Q; the base set apart on the slide is bracketed):
    TCGGTGAAAGAGCTTT
    CGGTGAA[C]GAGCTTTG
    GGTGAAAGAGCTTTGA
    GTGAAAGAGCTTTGAT
Tip from error near start/end of read (paths P and Q; the base set apart on the slide is bracketed):
    TCGGTGAAAGAGCTTT
    CG[C]TGAAAGAGCTTTG
    GGTGAAAGAGCTTTGA
    GTGAAAGAGCTTTGAT
Chimeric connection joining two distant parts of genome (paths Q1, P, and Q2):
    TCGGTGAAAGAGCTTT
    CGGTGAAAGAGCTTTG
    ACATCGTAAGCTTTGC
    TCGTAGTAGCCGATTC
    CGTAGTAGCCGATTCG

[Figure: panels (d) and (h) from Nurk et al. (2013), Journal of Computational Biology.]
