
SPRING: A next generation compressor for FASTQ data Shubham Chandak Stanford University Allerton Conference, 3rd October 2018

Joint work with
◮ Kedar Tatwawadi, Stanford University
◮ Idoia Ochoa, UIUC
◮ Mikel Hernaez, UIUC
◮ Tsachy Weissman, Stanford University

Outline
  Introduction
  High-Throughput Sequencing
  Entropy of reads
  Methods
  Results

High-Throughput Sequencing
[Figure: a genome (~3 billion bases) is split into fragments (~300-500 bases), from which reads (~100-150 bases) are sequenced]

FASTQ format
File 1
  @ERR174324.1 HSQ1009_86:1:1101:1192:2116/1                    (read identifier)
  ATTCNGTCACTTCTCACCAGGCCCCTCATTCAACACTGGGAATTAAAATTCGAC...     (read)
  +
  CCCF#2ADHHHHHJJJIJJJJIJJJJJJJJGIJJJJJJJJIJJJIJJJJJGIJJ...     (quality scores)
  ⋮
File 2
  @ERR174324.2 HSQ1009_86:1:1101:1192:2116/2
  CAGANAGAGACTCTGTCTCAAAAAAACAAACAAACAAACAAACAAAAAGTCTTA...
  +
  CCCF#2ADHFHHHJIJJJJJJJJJJJJJJJJJJIJJJJHIIJJJJJJJJIIIJJ...
  ⋮

Read order - unpaired
[Figure: reads 1-6 in their original order in the FASTQ file, mapped to a new (arbitrary) order]

Read order - paired
[Figure: read pairs 1-6 in their original order in the FASTQ files, mapped to a new order that preserves read pairing but orders the pairs arbitrarily]

Entropy of reads (ordered)
Simple case: genome of length $m$, $n$ noiseless unpaired reads.
\[ H(\text{ordered reads}) = H(\text{genome}) + H(\text{ordered reads} \mid \text{genome}) - H(\text{genome} \mid \text{ordered reads}) \]
For typical datasets, the last term is negligible:
\[ H(\text{ordered reads}) \approx \underbrace{H(\text{genome})}_{\text{store genome}} + \underbrace{n \log_2 m}_{\text{store positions of reads in genome}} \]

Entropy of reads (unordered)
\[ H(\text{unordered reads}) \approx \underbrace{H(\text{genome})}_{\text{store genome}} + \underbrace{\log_2 \binom{m+n-1}{m-1}}_{\text{store positions of reads in genome}} \]
◮ $\binom{m+n-1}{m-1}$ = number of ways to distribute $n$ indistinguishable balls into $m$ distinguishable boxes.
◮ Achievability: sort reads by genome position and entropy code the differences between consecutive read positions.
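The achievability argument can be illustrated with a minimal Python sketch. It only shows the sort-and-difference step; the entropy coder that would compress the gaps is left out, and the function names `position_gaps` and `positions_from_gaps` are hypothetical, not taken from SPRING.

```python
def position_gaps(positions):
    """Sort read positions and return the gaps between consecutive ones.

    The first gap is measured from position 0. Reads that map to the
    same position produce a gap of 0.
    """
    ordered = sorted(positions)
    return [cur - prev for prev, cur in zip([0] + ordered, ordered)]


def positions_from_gaps(gaps):
    """Invert position_gaps: recover the sorted positions by a running sum."""
    positions = []
    total = 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


gaps = position_gaps([40, 3, 17, 17])
recovered = positions_from_gaps(gaps)
```

The gaps are small non-negative integers when coverage is high, which is what makes them cheaper to entropy code than the raw positions. The original read order cannot be recovered from them, matching the unordered setting.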

Entropy of reads (example)
Example: for the human genome and read length 100,

  Coverage   Entropy of ordered reads   Entropy of unordered reads
  50x        6.7 GB                     1.1 GB
  100x       12.8 GB                    1.4 GB

Table 1: Coverage = average number of reads covering a base in the genome
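The two bounds can be evaluated numerically with a short Python sketch. The slide does not state the genome length or the value of H(genome) it used, so the defaults here (3 billion bases, 2 bits per base, 1 GB = 10^9 bytes) are assumptions; with them the results land in the same range as the table but do not reproduce it exactly.

```python
import math


def log2_binomial(a, b):
    """log2 of C(a, b), computed with log-gamma so huge arguments are fine."""
    nats = math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return nats / math.log(2)


def read_entropy_gb(genome_len, read_len, coverage, genome_bits_per_base=2.0):
    """Return (ordered, unordered) entropy estimates in GB.

    genome_bits_per_base = 2.0 is an assumed stand-in for H(genome) / m.
    """
    m = genome_len
    n = coverage * m // read_len              # number of reads
    h_genome = genome_bits_per_base * m       # bits to store the genome
    ordered = h_genome + n * math.log2(m)     # H(genome) + n log2 m
    unordered = h_genome + log2_binomial(m + n - 1, m - 1)
    return ordered / 8 / 1e9, unordered / 8 / 1e9


ordered_50x, unordered_50x = read_entropy_gb(3_000_000_000, 100, 50)
ordered_100x, unordered_100x = read_entropy_gb(3_000_000_000, 100, 100)
```

Doubling the coverage roughly doubles the ordered estimate but adds little to the unordered one, which is the gap the table highlights.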

Entropy of reads (general)
In general, entropy of reads with (*) exact order preserved and (**) only pairing preserved (ordering of read pairs discarded):
\[ H(\text{reads}) \lesssim \underbrace{H(\text{genome})}_{\text{store genome}} + \underbrace{\begin{cases} \frac{n}{2} \log_2 m & (*) \\ \log_2 \binom{m + \frac{n}{2} - 1}{m-1} & (**) \end{cases}}_{\text{store positions of read pairs in genome}} + \underbrace{\frac{n}{2} \left( H(\text{insert size}) + 1 \right)}_{\text{store insert size \& orientation}} + \underbrace{n \, H(\text{noise})}_{\text{store noisy bases}} \]
Upper bound suggests compression scheme

Read compression
1. Find "genome"
   ◮ Reorder reads
   ◮ Find consensus
2. Encode reads
3. Compress streams

Reorder reads (simplified)
◮ Index reads by specific substrings using hash tables
◮ For the current read, try to find an overlapping read within small Hamming distance
◮ Example (reads indexed by prefix):
    ACGATCGTACGTACGATCGTCAG
    No similar read with highlighted index found → shift
    ACGATCGTACGTACGATCGTCAG
      GATCGTACGTATGATGGTCAGTA
    Next read found!
◮ Repeat process with the new read
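The reordering loop can be sketched in Python as follows. This is a simplified illustration of the idea on the slide, not SPRING's implementation: it indexes reads by a single prefix of length `k`, ignores reverse complements, and the parameter values (`k`, `max_shift`, `max_mismatches`) are arbitrary assumptions.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions over the common length of a and b."""
    return sum(x != y for x, y in zip(a, b))


def find_next(current, reads, index, unused, k, max_shift, max_mismatches):
    """Look for an unused read whose prefix matches a shifted window of current."""
    for shift in range(max_shift + 1):
        key = current[shift:shift + k]
        if len(key) < k:
            break
        for j in index.get(key, ()):
            if j not in unused:
                continue
            overlap = current[shift:]
            if hamming(overlap, reads[j][:len(overlap)]) <= max_mismatches:
                return j, shift
    return None


def reorder_reads(reads, k=8, max_shift=8, max_mismatches=4):
    """Greedily chain overlapping reads; returns (read index, shift) pairs.

    shift is None for a read that starts a new chain.
    """
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)      # hash table keyed by read prefix
    unused = set(range(len(reads)))
    order = []
    while unused:
        cur = min(unused)              # no match found earlier: start a new chain
        unused.remove(cur)
        order.append((cur, None))
        while True:
            hit = find_next(reads[cur], reads, index, unused,
                            k, max_shift, max_mismatches)
            if hit is None:
                break
            cur, shift = hit
            unused.remove(cur)
            order.append((cur, shift))
    return order


example = [
    "ACGATCGTACGTACGATCGTCAG",
    "TTTTTTTTTTTTTTTTTTTTTTT",
    "GATCGTACGTATGATGGTCAGTA",
]
new_order = reorder_reads(example)
```

On the slide's example the second read is found at shift 2, with two mismatches in the 21-base overlap.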

Encode reads

  Read (aligned)            noise   noisepos    noisepos (delta encoded)
  ACTGCTGGCTGCTGCTAGC       GT      7,16        7,9
   CTCCTAGCTGCTGCCAGCC      C       3           3
     GCTAGCTACTGCCAGCCTA    A       8           8
     GCTCGCTACTGTCCGCCTA    CATC    4,8,12,14   4,4,4,2
  ACTGCTAGCTGCTGCCAGCCTA    seq (reference sequence, obtained by majority vote)

◮ Read positions and insert sizes encoded based on the mode (order preserving or not)
◮ All streams compressed with BSC, a BWT-based compressor
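The noise and noisepos streams from this slide can be reproduced with a small Python sketch. The function names and the explicit `offset` argument (where the read starts in the reference) are assumptions made for illustration; positions are 1-based within the read, as in the slide.

```python
def encode_read(read, reference, offset):
    """Encode a read as its mismatches against the reference.

    Returns (noise, noisepos_deltas): the mismatching bases and the
    delta-encoded 1-based positions of those bases within the read.
    """
    noise = []
    positions = []
    for i, base in enumerate(read):
        if base != reference[offset + i]:
            noise.append(base)
            positions.append(i + 1)
    deltas = [cur - prev for cur, prev in zip(positions, [0] + positions)]
    return "".join(noise), deltas


def decode_read(noise, deltas, reference, offset, length):
    """Invert encode_read: copy the reference window and re-apply the noise."""
    bases = list(reference[offset:offset + length])
    pos = 0
    for base, delta in zip(noise, deltas):
        pos += delta
        bases[pos - 1] = base
    return "".join(bases)


reference = "ACTGCTAGCTGCTGCCAGCCTA"
first = encode_read("ACTGCTGGCTGCTGCTAGC", reference, 0)
fourth = encode_read("GCTCGCTACTGTCCGCCTA", reference, 3)
```

Delta encoding keeps the position values small, which tends to help the downstream general-purpose compressor.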

Quality value and read identifier compression
◮ If read order not preserved, sort quality values and read identifiers according to new read order
◮ Standard techniques used for compression
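Keeping the streams aligned after reordering amounts to applying one permutation to each of them. A minimal Python sketch, assuming `new_order[k]` holds the original index of the read placed at position `k` (the helper name `apply_read_order` is hypothetical):

```python
def apply_read_order(new_order, *streams):
    """Permute each per-read stream so entry k belongs to read new_order[k]."""
    return [[stream[i] for i in new_order] for stream in streams]


identifiers = ["@r0", "@r1", "@r2"]
qualities = ["IIII", "JJJJ", "HHHH"]
new_order = [2, 0, 1]

sorted_ids, sorted_quals = apply_read_order(new_order, identifiers, qualities)
```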

Modes
◮ Lossless (default)
◮ Recommended lossy
   ◮ Read order discarded (read pairing still preserved)
   ◮ Quality values quantized using Illumina 8-level binning
   ◮ Read identifiers discarded
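As an illustration of the quantization step, here is a Python sketch of 8-level binning. The bin boundaries follow the table commonly published by Illumina (2-9 → 6, 10-19 → 15, 20-24 → 22, 25-29 → 27, 30-34 → 33, 35-39 → 37, 40 and above → 40); they are stated here from memory and should be checked against Illumina's documentation. Leaving scores below 2 unchanged and using the Phred+33 offset are further assumptions.

```python
# (low, high, representative) for each bin; scores of 40 and above map to 40.
ILLUMINA_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def illumina_8_bin(q):
    """Map one Phred quality score to its bin representative."""
    if q < 2:
        return q              # assumed: no-call scores are left unchanged
    for low, high, representative in ILLUMINA_BINS:
        if low <= q <= high:
            return representative
    return 40


def quantize_quality_string(qual, offset=33):
    """Quantize a FASTQ quality line, assuming Phred+33 encoding."""
    return "".join(chr(illumina_8_bin(ord(c) - offset) + offset) for c in qual)


binned = quantize_quality_string("CCCF#2ADHH")
```

Quantization reduces the alphabet of the quality stream, which is what lets the lossy mode compress it more tightly.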

Results

  Organism        Cvg.   FASTQ     Gzip     FaStore   SPRING
  P. aeruginosa   50     768 MB    279 MB   145 MB    115 MB
  Metagenomic     -      19.3 GB   6.9 GB   3.6 GB    3.2 GB
  H. sapiens      28     227 GB    74 GB    36 GB     29 GB
  H. sapiens*     25     196 GB    36 GB    11 GB     7 GB
  H. sapiens*     100    788 GB    145 GB   34 GB     26 GB

◮ * sequenced with NovaSeq technology with only 4 quality levels (40 levels for others).
◮ Similar improvements in recommended lossy mode with 20%-50% compression gains over lossless mode.
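The relative improvement over FaStore on the human datasets can be read off the table with a few lines of Python. The sizes are the rounded values shown on the slide, so the ratios are approximate.

```python
# (FaStore size, SPRING size) in GB, copied from the table above.
human_sizes_gb = {
    "H. sapiens 28x": (36, 29),
    "H. sapiens* 25x": (11, 7),
    "H. sapiens* 100x": (34, 26),
}

# Ratio > 1 means SPRING's output is smaller than FaStore's.
improvement = {
    name: round(fastore / spring, 2)
    for name, (fastore, spring) in human_sizes_gb.items()
}
```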

Results - read compression
Results for read compression of human NovaSeq datasets:

  Tool      Mode                 25x coverage   100x coverage
  SPRING    order preserving     3.0 GB         10.1 GB
  SPRING    pairing preserving   2.0 GB         5.7 GB
  FaStore   pairing preserving   6.1 GB         13.7 GB

Conclusion
◮ SPRING: FASTQ compressor
◮ Compression improvements of 1.2x-1.8x on human data
◮ Practical computational requirements
◮ Several other features: random access, long read compression ...
◮ Github: https://github.com/shubhamchandak94/SPRING/
◮ Future work: integrate with MPEG-G standard for genomic information representation (https://mpeg-g.org/)

Thank You!

References
◮ S. Chandak, K. Tatwawadi, I. Ochoa, M. Hernaez and T. Weissman; SPRING: A next-generation compressor for FASTQ data, Submitted.
◮ S. Chandak, K. Tatwawadi, T. Weissman; Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis, Bioinformatics, Volume 34, Issue 4, 15 February 2018, Pages 558–567
◮ L. Roguski, I. Ochoa, M. Hernaez, S. Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756
