informed and automated k mer size selection for genome
play

Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA


  1. Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15

  2. G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing. genome (unknown) sequenced reads : overlapping sub-sequences, covering the genome redundantly read contig assembly hypothesis of the genome high-quality vs low-quality assemblies 2/15

  3. M OTIVATION Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to study novel organisms. Most assemblers cut reads into k -mers (de Bruijn graph method). read ACTGATGAC ACT CTG TGA k-mers GAT (k=3) ATG TGA GAC Practical issue: assemblers rely on the user to set the parameter k . → What could go wrong if k is incorrectly set? 3/15

  4. M OTIVATION : OPTIMAL k NEEDED Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies NG50: maximum ℓ such that ( � | contig i |≥ ℓ | contig i | ) larger than | genome | / 2 Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k 8.5e+07 NG50 Assembly size 8.0e+07 40 60 80 4000 2000 0 40 60 80 k Fact: Genome assembly is not robust with respect to k . Our motivation: help bioinformaticians obtain the best possible assembly by finding optimal k automatically 4/15

  5. E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. 5/15

  6. E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. Actually, most of the time: - Bioinformaticians run [assembler] many times with k = 21 , . . . , 91, or - “Our colleagues had good results with k = 51 on [some other bacterial dataset]” . 5/15

  7. H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig In DNA/RNA/metaDNA/metaRNA assembly: 6/15

  8. H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers 6/15

  9. H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k 6/15

  10. H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k - Also, larger k -mers: more likely to contain errors (unusable k -mers) Our hypothesis: use the largest k -mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k -mers. → So, when are sufficiently many (non-erroneous) k -mers seen? 6/15

  11. k - MER HISTOGRAMS Common practice: compute the k -mer abundance histogram. - x axis: abundance - y axis: number of k -mers having abundance x (seen x times) Abundance of each distinct 3-mer: Example reads dataset: ACT: 1 ACTCA CTC: 1 GTCA TCA: 2 GTC: 1 3-mers: 3-mer abundance: ACT CTC x y TCA 1 3 GTC 2 1 TCA 3 0 4 0 For a dataset and a value of k , methods that build histograms already exist ( k -mer counting, e.g. Jellyfish, DSK, . . . ). 7/15

  12. D ISSECTION OF A k - MER HISTOGRAM Chr 14 ( ≈ 88 Mbp) GAGE dataset; histogram k = 21 k = 21 1e+09 Erroneous k-mers 1e+07 Genomic non-repeated k-mers Genomic repeated k-mers, sequencing artifacts, .. 1e+05 0 20 40 60 80 120 Genomic area ≈ number of distinct k -mers covering the genome ≈ size of the assembly → How to determine exactly this area? 8/15

  13. H ISTOGRAM MODEL We use Quake’s model: [DR Kelley 2010] Erroneous k -mers Pareto distribution with shape α , α k = 21 1e+09 pdf = x α + 1 Genomic k -mers Mixture of n Gaussians, 1e+07 weighted by a Zeta distribution of shape s : 1e+05 w 1 X 1 + . . . + w n X n X i ∼ N ( i µ 1 , ( i σ 1 ) 2 ) 0 20 40 60 80 120 P ( w i = k ) = k − s /ζ ( s ) Full model Mixture weighted by ( p e , 1 − p e ) . Numerical optimization (R) is used to fit the model to actual histograms. 9/15

  14. S EEN SO FAR ⇒ good k value - Genome is sufficiently covered by k -mers = - Requires to know the number of genomic k -mers - Can be estimated with a k -mer histogram and the Quake model To find the optimal k , one can compare histograms for different values of k . k = 21 k = 41 k = 81 1e+09 5e+08 1e+08 Number of kmers Number of kmers Number of kmers 1e+07 5e+06 1e+06 1e+05 5e+04 1e+04 0 20 40 60 80 120 0 20 40 60 80 120 0 20 40 60 80 120 Abundance Abundance Abundance Chr 14 ( ≈ 88 Mbp) GAGE dataset; histograms for three values of k → Issue: computing a single histogram (using k -mer counting) is time and memory expensive 10/15

  15. S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value DSK S. aureus 2min chr14 48min B. impatiens 7.5hour 11/15

  16. S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value Memory usage of DSK Sampling method Sampling method (GB) S. aureus 2min 11sec 0.1 chr14 48min 7min 0.1 B. impatiens 7.5hour 1.2hour 0.4 We developed a fast and memory-efficient histogram sampling technique. Sample 1 k -mer out of r , in k -mer space (the same k -mer seen in two different reads will be either consistently sampled, either consistently ignored) ● 1e+08 - Chr 14 ( ≈ 88 Mbp) k = 41 Number of k−mers ● - continuous line = exact histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+06 ● - dots = sampled histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● - sampling errors are visible for low ● ● ● ● ● ● ● ● ● ● number of k -mers (log scale) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+04 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● 0 20 40 60 80 100 Abundance 11/15

  17. T OOLS , D ATASETS Software: KmerGenie ( http://kmergenie.bx.psu.edu ) Evaluation on actual datasets from GAGE (assembly benchmark): [Salzberg 2011] Dataset S. aureus human chr 14 B. impatiens Genome size 2.9 Mbp 88 Mbp 250 Mbp Coverage 167x 70x 247x Avg read length 101 bp 101 bp 124 bp Selected a typical assembler for each dataset, executed ∀ k : Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013] 12/15

  18. K MER G ENIE R ESULTS : ACCURACY Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark). S. aureus Chr. 14 B. impatiens Predicted Predicted Predicted Velvet Velvet SOAPdenovo2 NG50 Predicted size NG50 Predicted size NG50 Predicted size 9.0e+07 3500000 3.0e+08 3000000 8.5e+07 2.5e+08 2500000 8.0e+07 20 40 60 20 40 60 80 20 40 60 80 20000 10000 4000 10000 5000 2000 0 0 0 20 40 60 20 40 60 80 20 40 60 80 k k k vertical lines corresponds to predicted best k 13/15

  19. C ONCLUSION / PERSPECTIVES - KmerGenie helps choose the k-mer size for de novo assembly - Experiments: choices are close to the best possible - Methods: ◮ Best k maximizes the number of genomic k -mers ◮ Quake’s statistical model ◮ Efficient k -mer histogram sampling Perspectives: - Increase robustness (high-coverage, longer reads) - Improve statistical model - Estimation of Velvet’s cov_cutoff = ⇒ zero-parameter assembler - Extract information from histograms for transcriptome and meta-genomes 14/15

  20. U SING K MER G ENIE curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz cd kmergenie-1.5397 make Usage for a single file: ./kmergenie reads.fastq Usage for a list of files: ls -1 *.fastq > list_reads ./kmergenie list_reads It returns: best k: 47 As well as a set of kmer histograms to visualize. Thank you for your attention! 15/15

Recommend


More recommend