Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes


  1. Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes Shubham Chandak Stanford University ICASSP 2020

  2. Team and funding
     • Team: Reyna Hulett, Peter Griffin, Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Matt Kubit
     • Funding:
       - SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
       - Beckman Center Innovative Technology Seed Grant
       - Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval
     • Tsachy Weissman, Mary Wootters, Hanlee Ji

  3. Motivation

  7. 200 Petabyte
     • 40,000 x 5 TByte HDDs: 40 tons, 10s of years
     • DNA: 1 gram, 1,000s of years, easy duplication

  8. DNA storage setup

  9. Building block: synthesis
     • Ability to "write/synthesize" artificial DNA (a sequence over {A,C,G,T})
     • Current ability: short ssDNA oligos (~150 nt) at scale
     • DNA synthesis is not perfect: usually ~1% insertion/deletion errors

  10. Building block: sequencing
      • Nanopore sequencing: portable, real time
      Image source: https://directorsblog.nih.gov/2018/02/06/sequencing-human-genome-with-pocket-sized-nanopore-device/

  11. Typical DNA Storage System
      • Writing: File → Outer code, segmentation + indexing → Inner code → Synthesis
      • Storage: duplication, permutation, loss, corruption
      • Reading: Sequencing + basecalling → Sequenced reads → Decoding → Reconstructed file

  13. Challenges
      • High basecall error rates for nanopore sequencing
        - 5-10% edit distance
        - Predominantly insertion and deletion errors
        - Lack of good error correction codes for this setting
      • Most previous works rely on consensus over multiple reads, which means a high reading cost
        - Sequence the input many times (~30-40x)
        - Cluster the reads by index, and perform "averaging" to reduce the error
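
To make the "5-10% edit distance" figure concrete, the sketch below computes the Levenshtein distance between a reference oligo and a basecalled read, counting insertions, deletions, and substitutions equally. The function names and the example sequences are illustrative assumptions, not taken from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(truth, read):
    """Edit distance normalized by the length of the true sequence."""
    return edit_distance(truth, read) / len(truth)


if __name__ == "__main__":
    truth = "ACGTACGT"
    read = "ACGACGTT"  # hypothetical read: one deletion and one insertion
    print(edit_distance(truth, read), error_rate(truth, read))
```

Because a single insertion or deletion shifts every later base, a position-by-position comparison overstates the error; edit distance is the natural measure for this channel.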

  14. Previous Works
      [Figure: previous works [2] and [3], with the annotation "We want to be here!"]
      [2] L. Organick et al., "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
      [3] R. Lopez et al., "DNA assembly for nanopore data storage readout," Nature Communications, vol. 10, no. 1, p. 2933, 2019.

  15. Methods

  16. Nanopore Physics

  19. Nanopore Sequencing Model
      • Nanopore sequencing channel (input: … ACGTACGTACGT ...)
        - Memory (inter-symbol interference)
        - Base skips
        - Fading
        - Random symbol duration
        - Noise
      • Very hard to model and analyze faithfully
      • Combine strengths of machine learning & coding theory!
      Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017
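
The channel effects listed on this slide can be illustrated with a toy simulator. This is a deliberately crude sketch, not the model from the cited paper: the k-mer level table, the skip probability, the geometric dwell time, the slow gain drift, and the Gaussian noise are all assumptions chosen only to show how each effect distorts the signal.

```python
import random


def simulate_nanopore(seq, k=3, p_skip=0.05, mean_dwell=8,
                      noise_sd=0.1, seed=0):
    """Toy raw-signal simulator (illustrative assumptions throughout).

    Memory / ISI:     the current level depends on the k-mer in the pore.
    Base skips:       a k-mer is dropped with probability p_skip.
    Random duration:  each k-mer lasts a geometric number of samples.
    Fading:           a slowly drifting multiplicative gain.
    Noise:            additive Gaussian noise on every sample.
    """
    rng = random.Random(seed)
    levels = {}   # hypothetical k-mer -> mean current level table
    signal = []
    gain = 1.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer not in levels:
            levels[kmer] = rng.uniform(-1.0, 1.0)
        if rng.random() < p_skip:
            continue  # base skip: this k-mer leaves no samples
        dwell = 1
        while rng.random() > 1.0 / mean_dwell:
            dwell += 1  # geometric dwell time with mean mean_dwell
        gain *= 1.0 + rng.gauss(0.0, 0.01)  # slow fading
        for _ in range(dwell):
            signal.append(gain * levels[kmer] + rng.gauss(0.0, noise_sd))
    return signal


if __name__ == "__main__":
    samples = simulate_nanopore("ACGTACGTACGT")
    print(len(samples), "samples for 12 bases")
```

Even in this simplified form, the number of samples per base is random and some bases leave no samples at all, which is why insertions and deletions dominate the basecalling errors.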

  23. Key idea
      • The Flappie basecaller (Oxford Nanopore) outputs probabilities
      • Basecalling (code constraints not used): Probabilities → AACGT
      • Decoding (code constraints used): Probabilities → ACGCGT

  24. Convolutional Codes as the Inner Code
      • Convolutional code parameters: r = 1/2 (rate), m = 6 (memory)
      • [Figure: state diagram snippet, edges labeled "incoming bit / output"]
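
A rate-1/2, memory-6 convolutional encoder can be sketched in a few lines. The slide does not give the generator polynomials, so the pair 171/133 (octal), a common textbook choice for these parameters, is assumed here; the function name and the zero-termination option are also illustrative.

```python
# Assumed generator taps (171 and 133 octal). Index 0 multiplies the
# incoming bit, index j multiplies the bit that arrived j steps earlier.
G1 = (1, 1, 1, 1, 0, 0, 1)
G2 = (1, 0, 1, 1, 0, 1, 1)
M = 6  # memory: number of previous input bits held in the state


def conv_encode(bits, terminate=True):
    """Encode a bit list at rate 1/2: two output bits per input bit.

    With terminate=True, M zero bits are appended so that the encoder
    returns to the all-zero state at the end of the block.
    """
    bits = list(bits) + ([0] * M if terminate else [])
    state = [0] * M  # state[0] is the most recent previous bit
    out = []
    for b in bits:
        window = [b] + state
        out.append(sum(g * w for g, w in zip(G1, window)) % 2)
        out.append(sum(g * w for g, w in zip(G2, window)) % 2)
        state = window[:M]  # shift the new bit into the register
    return out


if __name__ == "__main__":
    print(conv_encode([1, 0, 1, 1]))
```

Each edge of the state diagram corresponds to one loop iteration: the state is the register contents, the incoming bit selects the edge, and the two appended bits are the edge's output.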

  25. Basecaller-decoder integration
      • Combining NN-modeling + convolutional codes
      • Perform Viterbi decoding using the modified state diagram
      • Transition probabilities are based on NN-modeling
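
The sketch below shows the general shape of Viterbi decoding driven by per-base probabilities, under strong simplifying assumptions that the talk's method does not make: it takes exactly one probability vector per encoded base, so it ignores insertions, deletions, and variable dwell times, which the modified state diagram is meant to handle. The generator taps (171/133 octal), the bit-pair-to-base indexing, and all function names are assumptions.

```python
G1 = (1, 1, 1, 1, 0, 0, 1)  # assumed taps, 171 octal
G2 = (1, 0, 1, 1, 0, 1, 1)  # assumed taps, 133 octal
M = 6


def step(state, bit):
    """One trellis edge: returns (next_state, emitted base index 0..3).

    At rate 1/2 with 2 bits per base, each input bit emits one base.
    """
    window = (bit,) + state
    o1 = sum(g * w for g, w in zip(G1, window)) % 2
    o2 = sum(g * w for g, w in zip(G2, window)) % 2
    return window[:M], 2 * o1 + o2


def encode_to_bases(bits):
    """Zero-terminated encoding of bits into a list of base indices."""
    state, bases = (0,) * M, []
    for b in list(bits) + [0] * M:
        state, base = step(state, b)
        bases.append(base)
    return bases


def viterbi_decode(log_probs):
    """log_probs[t][k] is the log-probability of base k at position t.

    Returns the most likely message bits among paths that start and
    end in the all-zero state.
    """
    start = (0,) * M
    scores, paths = {start: 0.0}, {start: []}
    for lp in log_probs:
        new_scores, new_paths = {}, {}
        for state, score in scores.items():
            for bit in (0, 1):
                nxt, base = step(state, bit)
                cand = score + lp[base]  # branch metric from soft output
                if nxt not in new_scores or cand > new_scores[nxt]:
                    new_scores[nxt] = cand
                    new_paths[nxt] = paths[state] + [bit]
        scores, paths = new_scores, new_paths
    best = paths[start]
    return best[:len(best) - M]  # drop the M termination bits
```

The point of the integration is visible in the branch metric: the decoder never sees a hard basecall, only probabilities, so a base the network is unsure about costs little to override when the code constraints point elsewhere.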

  26. Overall Inner Code design
      (a) Inner code encoding: Segment #265 → Attach index and CRC (payload, 8-bit CRC, 12-bit index) → Convolutional encoding → Map to DNA (2 bits per base)
      (b) Inner code decoding: Convolutional list decoding → Select topmost list element with correct CRC (if any) → Segment #265
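
Two steps of this pipeline are easy to sketch in isolation: mapping bits to bases at 2 bits per base, and picking the first CRC-valid candidate from a decoder's output list. The specific bit-pair-to-base mapping and the CRC-8 polynomial (0x07) below are assumptions, since the slide does not specify them, and the helper names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bits_to_dna(bits):
    """Map an even-length bit list to a DNA string, 2 bits per base."""
    if len(bits) % 2:
        raise ValueError("need an even number of bits")
    return "".join(BASES[2 * bits[i] + bits[i + 1]]
                   for i in range(0, len(bits), 2))


def dna_to_bits(seq):
    """Inverse of bits_to_dna."""
    out = []
    for base in seq:
        v = BASES.index(base)
        out += [v >> 1, v & 1]
    return out


def crc8(bits, poly=0x07):
    """Bitwise CRC-8 over a bit list (polynomial is an assumption)."""
    reg = 0
    for b in bits:
        top = (reg >> 7) & 1
        reg = (reg << 1) & 0xFF
        if top ^ b:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(7, -1, -1)]


def attach_crc(bits):
    """Append the 8 CRC bits to the index + payload bits."""
    return list(bits) + crc8(bits)


def select_from_list(candidates):
    """Return the first candidate (best-ranked first) whose CRC checks,
    with the CRC bits stripped, or None if no candidate passes."""
    for cand in candidates:
        body, check = cand[:-8], cand[-8:]
        if crc8(body) == check:
            return body
    return None
```

The CRC is what makes list decoding useful: the decoder may rank a wrong candidate first, but a wrong candidate is unlikely to pass the check, so the correct one further down the list can still be recovered.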

  27. Experiments and results

  28. Experiments
      • Data: 11 KB of data: The Gettysburg Address, UN Declaration, "I Have a Dream" speech, poem collections, …
      • Final error correction code design:
        - Reed-Solomon outer code: 30% redundancy (default)
        - Pretrained model from the ONT Flappie basecaller
      • Synthesis: data synthesized using CustomArray synthesis, into oligos of length ~165
      • Experiments:
        - Rate of convolutional code: r = 1/2, 3/4, 5/6
        - Memory: m = 8, 11, 14
        - List size: 4, 8

  29. Results
      [Figure: writing cost (bases/bit) versus reading cost (bases/bit). Series: convolutional code with m=8, L=8; m=11, L=8; m=14, L=4, at rates r = 1/2, 3/4, and 5/6; previous works [6] and [22]. Annotation: "3x improvement in reading cost!"]
      [6] L. Organick et al., "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
      [22] R. Lopez et al., "DNA assembly for nanopore data storage readout," Nature Communications, vol. 10, no. 1, p. 2933, 2019.

  31. Conclusions and future work
      • Novel error-correction mechanism for nanopore sequencing based DNA storage
        - Use "soft-information" from raw signal to improve decoding
        - Use neural net in basecaller to distil information from "hard-to-model" raw signal
        - Use convolutional codes that align nicely with sequential nanopore model
      • Requires 3x fewer reads for decoding than previous works
      • Future work:
        - Optimization of convolutional code and CRC parameters
        - Finetuning of neural network model and use of improved basecallers
        - Application to other novel synthesis methodologies

  32. Thank You! Code and data available at https://github.com/shubhamchandak94/nanopore_dna_storage
