
SPRING: a next-generation compressor for FASTQ data (Shubham Chandak)

SPRING: a next-generation compressor for FASTQ data Shubham Chandak Stanford University ISMB/ECCB 2019 Joint work with Kedar Tatwawadi, Stanford University Idoia Ochoa, UIUC Mikel Hernaez, UIUC Tsachy Weissman, Stanford


  1. SPRING: a next-generation compressor for FASTQ data Shubham Chandak Stanford University ISMB/ECCB 2019

  2. Joint work with • Kedar Tatwawadi, Stanford University • Idoia Ochoa, UIUC • Mikel Hernaez, UIUC • Tsachy Weissman, Stanford University

  3. Outline • Introduction and motivation • FASTQ format and compression results • Algorithms - SPRING and others • SPRING as a practical tool • Next steps

  4. Genome sequencing • Genome: long string of bases {A, C, G, T}, ~3 billion bases • Sequenced as noisy paired substrings (reads) • Coverage/depth: ~30x–60x • [Figure: genome AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT with paired reads drawn beneath it; length labels ~100–150 bases and ~300–500 bases]

  5. Typical workflows

  6. Typical workflows • Sequencing → Raw reads → Alignment to reference → Aligned reads → Variant calling w.r.t. reference → VCF (tabular data)

  7. Typical workflows • Sequencing → Raw reads → Alignment to reference → Aligned reads → Variant calling w.r.t. reference → VCF (tabular data) • Sequencing → Raw reads → Assembly → Assembled genome

  8. Why store raw reads?

  9. Why store raw reads? • Pipelines improve with time - need raw data for reanalysis

  10. Why store raw reads? • Pipelines improve with time - need raw data for reanalysis • For temporary storage - alignment and assembly time-consuming

  11. Why store raw reads? • Pipelines improve with time - need raw data for reanalysis • For temporary storage - alignment and assembly time-consuming • Can’t perform alignment when reference genome not available – e.g., de novo assembly or metagenomics

  12. Why store raw reads? • Pipelines improve with time - need raw data for reanalysis • For temporary storage - alignment and assembly time-consuming • Can’t perform alignment when reference genome not available – e.g., de novo assembly or metagenomics • Can get better compression than aligned data compression if significant variation from reference (more on this later)!

  13. FASTQ format

  14. FASTQ format We’ll mostly focus on reads in this talk.
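The slide's FASTQ illustration did not survive extraction. As a reminder of the layout, here is a minimal sketch of a reader, assuming the common four-line record form (an `@` header line, the read, a `+` separator line, and a quality string of the same length as the read). The function name `parse_fastq` and the sample record are made up for illustration and are not taken from the talk.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, bases, qualities) from a four-line-per-record FASTQ stream.

    Assumes records are not line-wrapped; this is a sketch, not a full parser.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(qualities):
            raise ValueError("read and quality string differ in length")
        yield header[1:], bases, qualities


# Made-up example record, for illustration only.
example = "@read1\nACGATCGTAC\n+\nIIIIIIIIII\n"
records = list(parse_fastq(io.StringIO(example)))
```

Each record carries three kinds of data (identifier, bases, quality values); the rest of the talk concentrates on the bases.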

  15. Read compression

  16. Read compression • For a typical 25x human dataset: • Uncompressed: 79 GB (1 byte/base)

  17. Read compression • For a typical 25x human dataset: • Uncompressed: 79 GB (1 byte/base) • Gzip: ~20 GB (2 bits/base) – still far from optimal

  18. Read compression • For a typical 25x human dataset: • Uncompressed: 79 GB (1 byte/base) • Gzip: ~20 GB (2 bits/base) – still far from optimal • Order of read pairs in FASTQ irrelevant – can this help?

  19. Read compression results (25x human) • Uncompressed: 79 GB • Gzip: ~20 GB

  20. Read compression results (25x human) • Uncompressed: 79 GB • Gzip: ~20 GB • FaStore (allow reordering): 6 GB • Reference: Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

  21. Read compression results (25x human) • Uncompressed: 79 GB • Gzip: ~20 GB • FaStore (allow reordering): 6 GB • SPRING (no reordering): 3 GB • SPRING (allow reordering): 2 GB • Reference: Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

  22. Read compression results (25x human / 100x human) • Uncompressed: 79 GB / 319 GB • Gzip: ~20 GB / ~80 GB • FaStore (allow reordering): 6 GB / 13.7 GB • SPRING (no reordering): 3 GB / 10 GB • SPRING (allow reordering): 2 GB / 5.7 GB • Reference: Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756
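To put the 25x figures on a common scale, the sizes can be converted to bits per base. The sketch below assumes 1 GB = 10^9 bytes and takes the base count from the uncompressed size at 1 byte/base, as stated on the earlier slide; the helper name `bits_per_base` is an illustration, not part of SPRING.

```python
# Sizes for the 25x human dataset, in GB, as listed on the slide.
SIZES_GB = {
    "Uncompressed": 79,
    "Gzip": 20,
    "FaStore (allow reordering)": 6,
    "SPRING (no reordering)": 3,
    "SPRING (allow reordering)": 2,
}

# Assumption: 79 GB at 1 byte/base means about 79e9 bases (GB = 10^9 bytes).
TOTAL_BASES = 79e9


def bits_per_base(size_gb, n_bases=TOTAL_BASES):
    """Convert a compressed size in GB to average bits spent per base."""
    return size_gb * 1e9 * 8 / n_bases


rates = {name: round(bits_per_base(gb), 2) for name, gb in SIZES_GB.items()}
for name, rate in rates.items():
    print(f"{name}: {rate} bits/base")
```

Under these assumptions Gzip lands at roughly 2 bits/base, matching the slide, while SPRING with reordering comes out near 0.2 bits/base.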

  23. Key idea AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT • Storing reads equivalent to

  24. Key idea AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT • Storing reads equivalent to • Store genome

  25. Key idea AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT • Storing reads equivalent to • Store genome • Store read positions in genome (+ gap between paired reads)

  26. Key idea AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT • Storing reads equivalent to • Store genome • Store read positions in genome (+ gap between paired reads) • Store noise in reads

  27. Key idea AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT • Storing reads equivalent to • Store genome • Store read positions in genome (+ gap between paired reads) • Store noise in reads • Entropy calculations show this outperforms previous compressors
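The decomposition above can be illustrated with a toy encoder: a read is replaced by its position in the genome plus the bases where it differs. This is a minimal sketch of the idea only, assuming substitution noise and a known position; `encode_read` and `decode_read` are hypothetical names, and the noisy read is made up from the slide's genome string.

```python
GENOME = "AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT"


def encode_read(genome, read, pos):
    """Describe a read as (position, length, noise), where noise lists the
    (offset, base) pairs at which the read differs from the genome."""
    reference = genome[pos:pos + len(read)]
    noise = [(i, b) for i, (r, b) in enumerate(zip(reference, read)) if r != b]
    return pos, len(read), noise


def decode_read(genome, pos, length, noise):
    """Rebuild the read from the genome, its position, and its noise."""
    bases = list(genome[pos:pos + length])
    for offset, base in noise:
        bases[offset] = base
    return "".join(bases)


# Made-up read: GENOME[4:14] is "ATGTCGTATA"; offset 5 is changed from G to C.
noisy_read = "ATGTCCTATA"
encoded = encode_read(GENOME, noisy_read, 4)
restored = decode_read(GENOME, *encoded)
```

With low noise, the `noise` list is short, so most of the cost moves into the genome (stored once) and the positions.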

  28. Key idea • But... How to get the genome from the reads?

  29. Key idea • But... How to get the genome from the reads? • Genome assembly too expensive - big challenges: • resolve repeats • get very long pieces of genome from shorter assemblies

  30. Key idea • But... How to get the genome from the reads? • Genome assembly too expensive - big challenges: • resolve repeats • get very long pieces of genome from shorter assemblies • Solution: Don’t need perfect assembly for compression!

  31. SPRING workflow Raw reads

  32. SPRING workflow • Raw reads → Approximate assembly → Contigs

  33. SPRING workflow • Raw reads → Approximate assembly → Contigs → Encode • Encoded streams: assembled sequence; read position in assembled sequence; gap b/w paired reads; noisy bases + positions; etc.

  34. SPRING workflow • Raw reads → Approximate assembly → Contigs → Encode → BSC → Compressed file • Encoded streams: assembled sequence; read position in assembled sequence; gap b/w paired reads; noisy bases + positions; etc. • BSC: https://github.com/IlyaGrebnov/libbsc

  35. SPRING workflow • Raw reads → Approximate assembly → Contigs → Encode → BSC → Compressed file • Encoded streams: assembled sequence; read position in assembled sequence; gap b/w paired reads; noisy bases + positions; etc. • In “allow reordering” mode: reorder by position in approximate assembly • BSC: https://github.com/IlyaGrebnov/libbsc
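One way to see why reordering by position can help: once reads are sorted by position, the positions can be stored as small gaps instead of large absolute values. The slides do not say how SPRING encodes its position stream, so the delta coding below is only an assumed illustration, with made-up positions and hypothetical function names.

```python
from itertools import accumulate


def delta_encode(positions):
    """Sort positions and store the first value followed by successive gaps."""
    ordered = sorted(positions)
    if not ordered:
        return []
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return [ordered[0]] + gaps


def delta_decode(deltas):
    """Recover the sorted positions from the gaps by cumulative summation."""
    return list(accumulate(deltas))


# Made-up read positions in an approximate assembly.
positions = [1000, 12, 1003, 15, 500]
deltas = delta_encode(positions)
recovered = delta_decode(deltas)
```

The original read order is lost in this representation, which is exactly what "allow reordering" permits.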

  36. Approx. assembly/reordering step (simplified)

  37. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables

  38. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance

  39. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG

  40. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • Candidate next read: ACGATCGTACGTATACGGGTACG

  41. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • Candidate next read: ACGATCGTACGTATACGGGTACG • Index match found but Hamming distance too large → shift search substring by one

  42. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG

  43. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • No index match found → shift search substring by one

  44. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • Candidate next read (search substring shifted by two): GATCGTACGTATGATGGTCATTA

  45. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • Candidate next read (search substring shifted by two): GATCGTACGTATGATGGTCATTA • Next read found! • Repeat process with the new read

  46. Approx. assembly/reordering step (simplified) • Index reads by specific substrings using hash tables • For the current read, try to find an overlapping read within small Hamming distance • Example (reads indexed by prefix for simplicity): • Current read: ACGATCGTACGTACGATCGTCAG • Candidate next read (search substring shifted by two): GATCGTACGTATGATGGTCATTA • Next read found! • Repeat process with the new read • If no match found at any shift, pick arbitrary remaining read & start new contig
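The simplified step above can be sketched as a greedy loop over a prefix index. This is an illustration of the slide's description, not SPRING's actual implementation: the prefix length `K = 8`, the threshold `MAX_HAMMING = 4`, and the function names are assumptions chosen so that the three reads from the example behave as shown on the slides.

```python
from collections import defaultdict

K = 8            # assumed length of the indexed prefix
MAX_HAMMING = 4  # assumed mismatch threshold for accepting an overlap


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_next(reads, index, unused, current, k, max_hamming):
    """Shift the search substring along the current read until an unused
    read with a matching prefix and a small enough Hamming distance appears."""
    cur = reads[current]
    for shift in range(len(cur) - k + 1):
        for cand in index.get(cur[shift:shift + k], []):
            if cand not in unused:
                continue
            overlap = min(len(cur) - shift, len(reads[cand]))
            distance = hamming(cur[shift:shift + overlap], reads[cand][:overlap])
            if distance <= max_hamming:
                return cand, shift
    return None


def reorder_reads(reads, k=K, max_hamming=MAX_HAMMING):
    """Return contigs as lists of (read_index, shift_from_previous_read)."""
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)  # index every read by its prefix

    unused = set(range(len(reads)))
    contigs = []
    while unused:
        current = min(unused)  # arbitrary remaining read starts a new contig
        unused.remove(current)
        contig = [(current, 0)]
        while True:
            match = find_next(reads, index, unused, current, k, max_hamming)
            if match is None:
                break
            current, shift = match
            unused.remove(current)
            contig.append((current, shift))
        contigs.append(contig)
    return contigs


reads = [
    "ACGATCGTACGTACGATCGTCAG",  # current read from the slides
    "ACGATCGTACGTATACGGGTACG",  # rejected: 7 mismatches at shift 0
    "GATCGTACGTATGATGGTCATTA",  # accepted: 3 mismatches at shift 2
]
contigs = reorder_reads(reads)
```

With these assumed parameters, the first candidate is rejected at shift 0, shift 1 finds no index match, and the third read is accepted at shift 2; the rejected read then starts its own contig.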
