Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15

G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing. genome (unknown) sequenced reads : overlapping sub-sequences, covering the genome redundantly read contig assembly hypothesis of the genome high-quality vs low-quality assemblies 2/15

M OTIVATION Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to study novel organisms. Most assemblers cut reads into k -mers (de Bruijn graph method). read ACTGATGAC ACT CTG TGA k-mers GAT (k=3) ATG TGA GAC Practical issue: assemblers rely on the user to set the parameter k . → What could go wrong if k is incorrectly set? 3/15

M OTIVATION : OPTIMAL k NEEDED Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies NG50: maximum ℓ such that ( � | contig i |≥ ℓ | contig i | ) larger than | genome | / 2 Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k 8.5e+07 NG50 Assembly size 8.0e+07 40 60 80 4000 2000 0 40 60 80 k Fact: Genome assembly is not robust with respect to k . Our motivation: help bioinformaticians obtain the best possible assembly by finding optimal k automatically 4/15

E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. 5/15

E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. Actually, most of the time: - Bioinformaticians run [assembler] many times with k = 21 , . . . , 91, or - “Our colleagues had good results with k = 51 on [some other bacterial dataset]” . 5/15

H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig In DNA/RNA/metaDNA/metaRNA assembly: 6/15

H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers 6/15

H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k 6/15

H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k - Also, larger k -mers: more likely to contain errors (unusable k -mers) Our hypothesis: use the largest k -mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k -mers. → So, when are sufficiently many (non-erroneous) k -mers seen? 6/15

k - MER HISTOGRAMS Common practice: compute the k -mer abundance histogram. - x axis: abundance - y axis: number of k -mers having abundance x (seen x times) Abundance of each distinct 3-mer: Example reads dataset: ACT: 1 ACTCA CTC: 1 GTCA TCA: 2 GTC: 1 3-mers: 3-mer abundance: ACT CTC x y TCA 1 3 GTC 2 1 TCA 3 0 4 0 For a dataset and a value of k , methods that build histograms already exist ( k -mer counting, e.g. Jellyfish, DSK, . . . ). 7/15

D ISSECTION OF A k - MER HISTOGRAM Chr 14 ( ≈ 88 Mbp) GAGE dataset; histogram k = 21 k = 21 1e+09 Erroneous k-mers 1e+07 Genomic non-repeated k-mers Genomic repeated k-mers, sequencing artifacts, .. 1e+05 0 20 40 60 80 120 Genomic area ≈ number of distinct k -mers covering the genome ≈ size of the assembly → How to determine exactly this area? 8/15

H ISTOGRAM MODEL We use Quake’s model: [DR Kelley 2010] Erroneous k -mers Pareto distribution with shape α , α k = 21 1e+09 pdf = x α + 1 Genomic k -mers Mixture of n Gaussians, 1e+07 weighted by a Zeta distribution of shape s : 1e+05 w 1 X 1 + . . . + w n X n X i ∼ N ( i µ 1 , ( i σ 1 ) 2 ) 0 20 40 60 80 120 P ( w i = k ) = k − s /ζ ( s ) Full model Mixture weighted by ( p e , 1 − p e ) . Numerical optimization (R) is used to fit the model to actual histograms. 9/15

S EEN SO FAR ⇒ good k value - Genome is sufficiently covered by k -mers = - Requires to know the number of genomic k -mers - Can be estimated with a k -mer histogram and the Quake model To find the optimal k , one can compare histograms for different values of k . k = 21 k = 41 k = 81 1e+09 5e+08 1e+08 Number of kmers Number of kmers Number of kmers 1e+07 5e+06 1e+06 1e+05 5e+04 1e+04 0 20 40 60 80 120 0 20 40 60 80 120 0 20 40 60 80 120 Abundance Abundance Abundance Chr 14 ( ≈ 88 Mbp) GAGE dataset; histograms for three values of k → Issue: computing a single histogram (using k -mer counting) is time and memory expensive 10/15

S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value DSK S. aureus 2min chr14 48min B. impatiens 7.5hour 11/15

S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value Memory usage of DSK Sampling method Sampling method (GB) S. aureus 2min 11sec 0.1 chr14 48min 7min 0.1 B. impatiens 7.5hour 1.2hour 0.4 We developed a fast and memory-efficient histogram sampling technique. Sample 1 k -mer out of r , in k -mer space (the same k -mer seen in two different reads will be either consistently sampled, either consistently ignored) ● 1e+08 - Chr 14 ( ≈ 88 Mbp) k = 41 Number of k−mers ● - continuous line = exact histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+06 ● - dots = sampled histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● - sampling errors are visible for low ● ● ● ● ● ● ● ● ● ● number of k -mers (log scale) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+04 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● 0 20 40 60 80 100 Abundance 11/15

T OOLS , D ATASETS Software: KmerGenie ( http://kmergenie.bx.psu.edu ) Evaluation on actual datasets from GAGE (assembly benchmark): [Salzberg 2011] Dataset S. aureus human chr 14 B. impatiens Genome size 2.9 Mbp 88 Mbp 250 Mbp Coverage 167x 70x 247x Avg read length 101 bp 101 bp 124 bp Selected a typical assembler for each dataset, executed ∀ k : Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013] 12/15

K MER G ENIE R ESULTS : ACCURACY Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark). S. aureus Chr. 14 B. impatiens Predicted Predicted Predicted Velvet Velvet SOAPdenovo2 NG50 Predicted size NG50 Predicted size NG50 Predicted size 9.0e+07 3500000 3.0e+08 3000000 8.5e+07 2.5e+08 2500000 8.0e+07 20 40 60 20 40 60 80 20 40 60 80 20000 10000 4000 10000 5000 2000 0 0 0 20 40 60 20 40 60 80 20 40 60 80 k k k vertical lines corresponds to predicted best k 13/15

C ONCLUSION / PERSPECTIVES - KmerGenie helps choose the k-mer size for de novo assembly - Experiments: choices are close to the best possible - Methods: ◮ Best k maximizes the number of genomic k -mers ◮ Quake’s statistical model ◮ Efficient k -mer histogram sampling Perspectives: - Increase robustness (high-coverage, longer reads) - Improve statistical model - Estimation of Velvet’s cov_cutoff = ⇒ zero-parameter assembler - Extract information from histograms for transcriptome and meta-genomes 14/15

U SING K MER G ENIE curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz cd kmergenie-1.5397 make Usage for a single file: ./kmergenie reads.fastq Usage for a list of files: ls -1 *.fastq > list_reads ./kmergenie list_reads It returns: best k: 47 As well as a set of kmer histograms to visualize. Thank you for your attention! 15/15

Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA

Natural Selection 02-223 How to Analyze Your Own Genome

Week 3 Video 4 Automated Feature Generation Automated Feature Selection Automated Feature

Natural Selection 02-223 How to Analyze Your Own Genome 2.

Genome Wide SNP Selection with Entropy Based Methods Zhenqiu Liu University of Maryland

Bag of Nave Bayes: biomarker selection and classification from Genome-Wide SNP data Francesco

Bacterial Genome Annotation Lucile Soler Annotation course 9 th -11 th may 2017 Bacterial genome

SELECTION Deterministic Stochastic Proportionate selection: Roulette Wheel Selection

Automatic Creation of Tile Size Selection Models Tomofumi Yuki Lakshminarayanan Renganarayanan

A Model Selection Approach for Genome Wide Association Studies Florian Frommlet, Piotr Twarog,

Population Substructure and Control Selection in Genome-wide Association Studies Kai Yu, Ph.D.

Scaling up phylogene/c networks to genome-size data

Automated Size Analysis for OCL Fang Yu , Tevfik Bultan, Erik Peterson Department of Computer

CS681: Advanced Topics in Computational Biology Week 7 Lectures 2-3 Can Alkan EA224

CS681: Advanced Topics in Computational Biology Week 8 Lecture 1 Can Alkan EA224

Department of Navy Risk-Informed Remedy SelectionDepartment Richard Mach1 of Navy

NIGMS Lewis-Sigler Institute for Integrative Genomics Princeton University Genome Sizes and Gene

Meta-algorithmic techniques in SAT solving: Automated configuration, selection and beyond

BAYESIAN OPTIMIZATION FOR AUTOMATED MODEL SELECTION Gustavo Malkomes Chip Schaff Roman Garnett

Department of Navy Risk Informed Remedy Selection Federal Remediation Technologies Roundtable

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Automated Verification of Shape and Size Properties via Separation Logic Huu Hai Nguyen 1 ,

An Empirical Evaluation and Comparison of f Manual and Automated Test Selection Milos Gligoric,

Genome 562 January 2015 Week 1 Genome 562 p.1/6 Early workers in theoretical population

Risk Pools and Adverse Selection General Idea Quality of Pool More Important Than Size Photo