
SPRING: A next generation compressor for FASTQ data, by Shubham Chandak

Shubham Chandak, Stanford University. Allerton Conference, 3rd October 2018. Joint work with Kedar Tatwawadi (Stanford University), Idoia Ochoa (UIUC), Mikel Hernaez (UIUC), and Tsachy Weissman (Stanford University).


  1. SPRING: A next generation compressor for FASTQ data
     Shubham Chandak, Stanford University
     Allerton Conference, 3rd October 2018

  2. Joint work with
     ◮ Kedar Tatwawadi, Stanford University
     ◮ Idoia Ochoa, UIUC
     ◮ Mikel Hernaez, UIUC
     ◮ Tsachy Weissman, Stanford University

  3. Outline: Introduction, High-Throughput Sequencing, Entropy of reads, Methods, Results

  4. High-Throughput Sequencing
     [Figure: Genome (~3 billion bases) → ~300–500 bases → ~100–150 bases]

  5. FASTQ format
     File 1:
       @ERR174324.1 HSQ1009_86:1:1101:1192:2116/1                  ← read identifier
       ATTCNGTCACTTCTCACCAGGCCCCTCATTCAACACTGGGAATTAAAATTCGAC...   ← read
       +
       CCCF#2ADHHHHHJJJIJJJJIJJJJJJJJGIJJJJJJJJIJJJIJJJJJGIJJ...   ← quality scores
       ⋮
     File 2:
       @ERR174324.2 HSQ1009_86:1:1101:1192:2116/2
       CAGANAGAGACTCTGTCTCAAAAAAACAAACAAACAAACAAACAAAAAGTCTTA...
       +
       CCCF#2ADHFHHHJIJJJJJJJJJJJJJJJJJJIJJJJHIIJJJJJJJJIIIJJ...
       ⋮
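The four-line record layout on this slide can be read with a few lines of Python. This is a minimal sketch, not SPRING's own parser: the names `FastqRecord` and `parse_fastq` are hypothetical, the sample record is shortened to its first 10 bases, and it assumes each record occupies exactly four lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    read: str        # bases (A, C, G, T, N)
    quality: str     # one quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        read = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(read) != len(quality):
            raise ValueError("read and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), read, quality)


# Shortened version of the first record shown on the slide.
sample = (
    "@ERR174324.1 HSQ1009_86:1:1101:1192:2116/1\n"
    "ATTCNGTCAC\n"
    "+\n"
    "CCCF#2ADHH\n"
)
records = list(parse_fastq(io.StringIO(sample)))
```

The three fields of each record correspond to the three streams the talk treats separately: read identifiers, reads, and quality scores.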

  6. Read order - unpaired
     [Figure: reads 1 to 6 in their original order in the FASTQ file, next to the same reads in a new (arbitrary) order]

  7. Read order - paired
     [Figure: read pairs 1 to 6 in their original order in the FASTQ files, next to a new order that preserves read pairing but orders the pairs arbitrarily]

  8–9. Entropy of reads (ordered)
     Simple case: genome of length $m$, $n$ noiseless unpaired reads.
     $$H(\text{ordered reads}) = H(\text{genome}) + H(\text{ordered reads} \mid \text{genome}) - H(\text{genome} \mid \text{ordered reads})$$
     For typical datasets, the last term is negligible:
     $$H(\text{ordered reads}) \approx \underbrace{H(\text{genome})}_{\text{Store genome}} + \underbrace{n \log_2 m}_{\text{Store positions of reads in genome}}$$
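One way to read the $n \log_2 m$ term, written out as a short derivation. It rests on assumptions the slide does not state explicitly: every read has the same length, and the start positions $P_1, \dots, P_n$ (hypothetical names) are drawn independently and uniformly from the $m$ genome positions, independently of the genome itself.

```latex
\begin{align*}
H(\text{ordered reads} \mid \text{genome})
  &\le H(P_1, \dots, P_n)
     && \text{noiseless reads are determined by genome and positions} \\
  &= \sum_{i=1}^{n} H(P_i)
     && \text{independent positions} \\
  &= n \log_2 m
     && \text{each position uniform over } m \text{ values}
\end{align*}
```

The first step holds with equality when distinct start positions always produce distinct reads.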

  10. Entropy of reads (unordered)
      $$H(\text{unordered reads}) \approx \underbrace{H(\text{genome})}_{\text{Store genome}} + \underbrace{\log_2 \binom{m+n-1}{m-1}}_{\text{Store positions of reads in genome}}$$
      ◮ $\binom{m+n-1}{m-1}$ = number of ways to distribute $n$ indistinguishable balls into $m$ distinguishable boxes.
      ◮ Achievability: sort reads by genome position and entropy code differences of read positions.
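The achievability argument can be illustrated with a small simulation. This is a sketch under assumed toy parameters (`m = 100_000` positions, `n = 50_000` reads, uniform random positions); the helper names are hypothetical, and the empirical entropy of the gaps stands in for an ideal entropy coder.

```python
import math
import random
from collections import Counter


def delta_encode(sorted_positions):
    """Replace each sorted position by its difference from the previous one."""
    gaps, previous = [], 0
    for position in sorted_positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def delta_decode(gaps):
    """Invert delta_encode by taking running sums."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def empirical_entropy_bits(symbols):
    """Entropy in bits per symbol of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
m, n = 100_000, 50_000  # toy genome length and read count (assumed values)
positions = sorted(rng.randrange(m) for _ in range(n))
gaps = delta_encode(positions)

bits_per_read_sorted = empirical_entropy_bits(gaps)  # cost with order discarded
bits_per_read_ordered = math.log2(m)                 # cost of an explicit position
print(f"{bits_per_read_sorted:.2f} vs {bits_per_read_ordered:.2f} bits per read")
```

Because the sorted positions are close together, the gaps are small and cost far fewer bits per read than the $\log_2 m$ needed for an explicit position.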

  11. Entropy of reads (example)
      Example: for the human genome and read length 100,

      Coverage   Entropy of ordered reads   Entropy of unordered reads
      50x        6.7 GB                     1.1 GB
      100x       12.8 GB                    1.4 GB

      Table 1: Coverage = average number of reads covering a base in the genome
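The position terms of the two formulas can be evaluated numerically. This sketch assumes a genome length of 3.1 billion bases, a value the slides do not give, and it leaves out $H(\text{genome})$, so its outputs are only the position costs and are not meant to reproduce Table 1 exactly. The function names are hypothetical.

```python
import math


def log2_binomial(a: int, b: int) -> float:
    """log2 of the binomial coefficient C(a, b), via log-gamma to avoid huge integers."""
    nats = math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return nats / math.log(2)


def ordered_position_bits(m: int, n: int) -> float:
    """n log2 m: one explicit genome position per read."""
    return n * math.log2(m)


def unordered_position_bits(m: int, n: int) -> float:
    """log2 C(m+n-1, m-1): positions of n reads with their order discarded."""
    return log2_binomial(m + n - 1, m - 1)


GENOME_LENGTH = 3_100_000_000  # assumed human genome length
READ_LENGTH = 100              # read length used in the slide's example
BITS_PER_GB = 8e9

for coverage in (50, 100):
    n = coverage * GENOME_LENGTH // READ_LENGTH
    ordered_gb = ordered_position_bits(GENOME_LENGTH, n) / BITS_PER_GB
    unordered_gb = unordered_position_bits(GENOME_LENGTH, n) / BITS_PER_GB
    print(f"{coverage}x: ordered {ordered_gb:.2f} GB, unordered {unordered_gb:.2f} GB")
```

Under these assumptions the position cost at 50x drops from roughly 6.1 GB to roughly 0.5 GB when read order is discarded, which accounts for most of the gap between the two columns of the table.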

  12. Entropy of reads (general)
      In general, the entropy of reads with (∗) exact order preserved and (∗∗) only pairing preserved (ordering of read pairs discarded):
      $$H(\text{reads}) \lesssim \underbrace{H(\text{genome})}_{\text{Store genome}} + \underbrace{\begin{cases} \frac{n}{2} \log_2 m & (\ast) \\ \log_2 \binom{m + \frac{n}{2} - 1}{m - 1} & (\ast\ast) \end{cases}}_{\text{Store positions of read pairs in genome}} + \underbrace{\frac{n}{2} \bigl( H(\text{insert size}) + 1 \bigr)}_{\text{Store insert size \& orientation}} + \underbrace{n \, H(\text{noise})}_{\text{Store noisy bases}}$$
      Upper bound suggests compression scheme.

  13. Outline: Introduction, High-Throughput Sequencing, Entropy of reads, Methods, Results

  14. Read compression
      1. Find “genome”
         ◮ Reorder reads
         ◮ Find consensus
      2. Encode reads
      3. Compress streams

  15–20. Reorder reads (simplified)
      ◮ Index reads by specific substrings using hash tables
      ◮ For the current read, try to find an overlapping read within small Hamming distance
      ◮ Example (reads indexed by prefix):
          ACGATCGTACGTACGATCGTCAG
          No similar read with highlighted index found → shift
          ACGATCGTACGTACGATCGTCAG
            GATCGTACGTATGATGGTCAGTA
          Next read found!
      ◮ Repeat process with the new read
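The simplified procedure above can be sketched in Python as a greedy chain. This is an illustration under assumptions, not SPRING's implementation: it uses a single prefix index, an index length `k = 8` and a Hamming threshold `max_dist = 4` chosen for the example, and the names `hamming`, `find_next` and `reorder` are hypothetical. Reverse complements and multiple indices are not handled here.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_next(current, reads, index, used, k, max_dist):
    """Return (read index, shift) of an unused read overlapping `current`, or None."""
    for shift in range(len(current) - k + 1):
        key = current[shift:shift + k]  # substring that must equal the next read's prefix
        for j in index.get(key, ()):
            if j in used:
                continue
            overlap = min(len(current) - shift, len(reads[j]))
            if hamming(current[shift:shift + overlap], reads[j][:overlap]) <= max_dist:
                return j, shift
    return None


def reorder(reads, k=8, max_dist=4):
    """Greedily chain overlapping reads; returns the new order as original indices."""
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)  # index every read by its length-k prefix
    order, used = [], set()
    for start in range(len(reads)):
        if start in used:
            continue
        current = start
        while current is not None:
            used.add(current)
            order.append(current)
            match = find_next(reads[current], reads, index, used, k, max_dist)
            current = match[0] if match else None
    return order


reads = [
    "ACGATCGTACGTACGATCGTCAG",  # current read from the slide
    "TTTTTTTTTTTTTTTTTTTTTTT",  # unrelated read (made up for this example)
    "GATCGTACGTATGATGGTCAGTA",  # overlapping read from the slide
]
order = reorder(reads)
```

On the slide's two reads the match is found after shifting by two bases, where the 21-base overlap differs in two positions, so the overlapping read is placed directly after the current one.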

  21–22. Encode reads

      Reads (aligned)            noise   noise positions   noisepos (after delta encoding)
      ACTGCTGGCTGCTGCTAGC        GT      7,16              7,9
       CTCCTAGCTGCTGCCAGCC       C       3                 3
         GCTAGCTACTGCCAGCCTA     A       8                 8
         GCTCGCTACTGTCCGCCTA     CATC    4,8,12,14         4,4,4,2

      Majority:
      ACTGCTAGCTGCTGCCAGCCTA     seq (Reference Sequence)

      ◮ Read positions and insert sizes encoded based on the mode (order preserving or not)
      ◮ All streams compressed with BSC, a BWT-based compressor
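The noise and noisepos streams in this example can be reproduced with a short sketch. The function names are hypothetical, and the read offsets into the reference (0 and 3 below) are inferred by aligning the slide's reads against the reference sequence; this illustrates the idea rather than SPRING's actual stream format.

```python
def encode_read(reference: str, read: str, offset: int):
    """Return (noise, noisepos): mismatching read bases and delta-coded 1-based positions."""
    noise, noisepos, previous = [], [], 0
    for i, base in enumerate(read, start=1):
        if base != reference[offset + i - 1]:
            noise.append(base)
            noisepos.append(i - previous)  # distance from the previous noisy base
            previous = i
    return "".join(noise), noisepos


def decode_read(reference: str, offset: int, length: int, noise: str, noisepos):
    """Rebuild a read from the reference, its offset, and the two noise streams."""
    bases = list(reference[offset:offset + length])
    position = 0
    for base, delta in zip(noise, noisepos):
        position += delta
        bases[position - 1] = base
    return "".join(bases)


reference = "ACTGCTAGCTGCTGCCAGCCTA"  # seq (Reference Sequence) from the slide
first = encode_read(reference, "ACTGCTGGCTGCTGCTAGC", 0)
last = encode_read(reference, "GCTCGCTACTGTCCGCCTA", 3)
restored = decode_read(reference, 3, 19, *last)
```

Storing differences between noise positions keeps the numbers small, which helps the general-purpose compressor applied to the stream afterwards.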

  23–24. Quality value and read identifier compression
      ◮ If read order not preserved, sort quality values and read identifiers according to new read order
      ◮ Standard techniques used for compression
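Applying the new read order to the other two streams amounts to a permutation. A minimal sketch, with a hypothetical helper `apply_order` and made-up identifiers and quality strings:

```python
def apply_order(items, order):
    """Rearrange items so that position k holds items[order[k]]."""
    return [items[i] for i in order]


order = [0, 2, 1]  # hypothetical new read order: original indices in their new sequence
identifiers = ["read1", "read2", "read3"]
qualities = ["IIII", "####", "FFFF"]

reordered_identifiers = apply_order(identifiers, order)
reordered_qualities = apply_order(qualities, order)
```

After this step the k-th identifier and the k-th quality string again belong to the k-th read in the reordered read stream.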

  25–26. Modes
      ◮ Lossless (default)
      ◮ Recommended lossy
         ◮ Read order discarded (read pairing still preserved)
         ◮ Quality values quantized using Illumina 8-level binning
         ◮ Read identifiers discarded
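Quality quantization can be sketched as a lookup over score ranges. The bin boundaries below follow the 8-level table commonly published by Illumina; that SPRING uses exactly these values, the Phred+33 character offset, and the choice to leave scores below 2 unchanged are all assumptions of this sketch, and the names are hypothetical.

```python
# (lowest score, highest score, binned score), per the commonly published Illumina table.
ILLUMINA_8_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def bin_score(score: int) -> int:
    """Map a Phred quality score to the representative value of its bin."""
    for low, high, binned in ILLUMINA_8_BINS:
        if low <= score <= high:
            return binned
    return score  # scores below 2 are left unchanged (assumption)


def bin_quality_string(quality: str, offset: int = 33) -> str:
    """Quantize a FASTQ quality string, assuming Phred+33 encoding."""
    return "".join(chr(bin_score(ord(c) - offset) + offset) for c in quality)


binned = bin_quality_string("CCCF#2ADHH")  # start of the quality line on the FASTQ slide
```

Fewer distinct quality values make the quality stream more compressible, at the cost of losing the exact original scores.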

  27. Outline: Introduction, High-Throughput Sequencing, Entropy of reads, Methods, Results

  28–29. Results

      Organism        Cvg.   FASTQ     Gzip     FaStore   SPRING
      P. aeruginosa   50     768 MB    279 MB   145 MB    115 MB
      Metagenomic     -      19.3 GB   6.9 GB   3.6 GB    3.2 GB
      H. sapiens      28     227 GB    74 GB    36 GB     29 GB
      H. sapiens*     25     196 GB    36 GB    11 GB     7 GB
      H. sapiens*     100    788 GB    145 GB   34 GB     26 GB

      ◮ * sequenced with NovaSeq technology with only 4 quality levels (40 levels for others).
      ◮ Similar improvements in recommended lossy mode, with 20%-50% compression gains over lossless mode.

  30. Results - read compression
      Results for read compression of human NovaSeq datasets:

      Tool      Mode                 Coverage 25x   Coverage 100x
      SPRING    order preserving     3.0 GB         10.1 GB
      SPRING    pairing preserving   2.0 GB         5.7 GB
      FaStore   pairing preserving   6.1 GB         13.7 GB

  31–32. Conclusion
      ◮ SPRING: FASTQ compressor
      ◮ Compression improvements of 1.2x-1.8x on human data
      ◮ Practical computational requirements
      ◮ Several other features: random access, long read compression ...
      ◮ Github: https://github.com/shubhamchandak94/SPRING/
      ◮ Future work: integrate with MPEG-G standard for genomic information representation ( https://mpeg-g.org/ )

  33. Thank You!

  34. References
      ◮ S. Chandak, K. Tatwawadi, I. Ochoa, M. Hernaez and T. Weissman; SPRING: A next-generation compressor for FASTQ data, Submitted.
      ◮ S. Chandak, K. Tatwawadi, T. Weissman; Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis, Bioinformatics, Volume 34, Issue 4, 15 February 2018, Pages 558–567.
      ◮ Ł. Roguski, I. Ochoa, M. Hernaez, S. Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756.
