Overcoming high nanopore basecaller error rates for DNA storage via - PowerPoint PPT Presentation

Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes
Shubham Chandak, Stanford University
ICASSP 2020

Team and funding
• Team: Reyna Hulett, Peter Griffin, Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Matt Kubit, Tsachy Weissman, Mary Wootters, Hanlee Ji
• SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
• Beckman Center Innovative Technology Seed Grant
• Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval

Motivation

200 Petabyte
• 40,000 x 5 TByte HDDs: 40 tons, 10s of years
• DNA: 1 gram, 1,000s of years, easy duplication

DNA storage setup

Building block: synthesis
• Ability to “write/synthesize” artificial DNA (sequence of {A,C,G,T})
• Current ability: short ssDNA oligos (~150 nt) at scale
• DNA synthesis is not perfect: usually has ~1% insertion/deletion error

Building block: sequencing
• Nanopore sequencing: portable, real time
Source: https://directorsblog.nih.gov/2018/02/06/sequencing-human-genome-with-pocket-sized-nanopore-device/

Typical DNA storage system
• Writing: File → Segmentation → Outer code + indexing → Inner code → Synthesis
• Storage: duplication, permutation, loss, corruption
• Reading: Sequencing + basecalling → Sequenced reads → Decoding → Reconstructed file

Challenges
• High basecall error rates for nanopore sequencing
  • 5-10% edit distance
  • Predominantly insertion and deletion errors
  • Lack of good error correction codes for this setting
• Most previous works rely on consensus over multiple reads, which means a high reading cost
  • Sequence the input many times (~30-40x)
  • Cluster by index, and perform “averaging” to reduce the error
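
To make the "5-10% edit distance" figure concrete, here is a minimal sketch of how an error rate of that kind could be computed: the Levenshtein distance between a reference oligo and a basecalled read, divided by the reference length. The function names and the normalization by reference length are assumptions for illustration, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(reference: str, read: str) -> float:
    """Edit distance normalized by reference length (assumed convention)."""
    return edit_distance(reference, read) / len(reference)


if __name__ == "__main__":
    # One deleted base in a 10-base reference gives a 10% error rate.
    print(error_rate("ACGTACGTAC", "ACGTCGTAC"))
```

Because the distance counts insertions and deletions as well as substitutions, it suits a channel where indels dominate.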

Previous works
Figure: previous works [2], [3], with the target region annotated “We want to be here!”
[2] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[3] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

Methods

Nanopore Physics

Nanopore sequencing model
Input … ACGTACGTACGT … passes through the nanopore sequencing channel:
• Memory (inter-symbol interference)
• Base skips
• Fading
• Random symbol duration
• Noise
Very hard to model and analyze faithfully, so combine the strengths of machine learning and coding theory!
Source: W. Mao et al., “Models and Information-Theoretic Bounds for Nanopore Sequencing,” IEEE Trans. Inf. Theory, 2017.

Key idea

Key idea
• Use the probabilities produced by the Flappie basecaller (Oxford Nanopore)
• Basecalling: code constraints not used (example output: AACGT)
• Decoding: code constraints used (example output: ACGCGT)

Convolutional codes as the inner code
• Convolutional code parameters: r = 1/2 (rate), m = 6 (memory)
• Figure: state diagram snippet, edges labeled incoming bit / output
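
As an illustration of a rate-1/2, memory-6 convolutional code, the sketch below encodes a bit sequence with a shift register and two parity taps. The slides do not give the generator polynomials; the pair 171/133 (octal) used here is a widely used choice for this rate and memory and is an assumption, as are the function name and the zero-termination of the input.

```python
# Assumed generator polynomials (octal 171 and 133); the slides do not
# specify which generators were used.
GENERATORS = (0o171, 0o133)
MEMORY = 6  # m = 6: the encoder remembers the previous 6 input bits


def conv_encode(bits, generators=GENERATORS, memory=MEMORY):
    """Rate 1/len(generators) convolutional encoder, zero-terminated.

    Each input bit produces one output bit per generator, so with two
    generators the code rate is 1/2 (ignoring the termination tail).
    """
    state = 0  # previous `memory` input bits, most recent in the top bit
    out = []
    for b in list(bits) + [0] * memory:  # tail drives the state back to 0
        reg = (b << memory) | state      # current bit followed by history
        for g in generators:
            out.append(bin(reg & g).count("1") % 2)  # parity of tapped bits
        state = reg >> 1                 # shift: drop the oldest bit
    return out


if __name__ == "__main__":
    # A single 1 followed by the tail reads out the two generators.
    print(conv_encode([1]))
```

Each state of the diagram corresponds to one value of `state` (64 states for m = 6), and each edge to one `incoming bit / output` pair.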

Basecaller-decoder integration
• Combine NN modeling with convolutional codes
• Transition probabilities come from the NN model
• Perform Viterbi decoding using the modified state diagram
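
The following is a simplified sketch of Viterbi decoding driven by soft probabilities. It is not the decoder from the talk: it assumes one pair of per-bit probabilities per trellis step, perfectly aligned with the code (no insertions or deletions), whereas the slides describe a modified state diagram fed by the basecaller's neural network. The small memory-2 code with generators 7/5 (octal), the zero-termination, and all names are assumptions chosen to keep the example short.

```python
import math

GENERATORS = (0o7, 0o5)  # assumed toy rate-1/2 code, not the one in the talk
MEMORY = 2


def step(state, bit):
    """One trellis edge: returns (next_state, output_bits)."""
    reg = (bit << MEMORY) | state
    out = tuple(bin(reg & g).count("1") % 2 for g in GENERATORS)
    return reg >> 1, out


def encode(bits):
    """Zero-terminated encoding, used here only to build test inputs."""
    state, out = 0, []
    for b in list(bits) + [0] * MEMORY:
        state, o = step(state, b)
        out.extend(o)
    return out


def viterbi(log_probs, n_info):
    """log_probs[t][i][v] = log P(output bit i at step t equals v)."""
    n_states = 1 << MEMORY
    score = [0.0] + [-math.inf] * (n_states - 1)  # encoder starts in state 0
    back = []
    for t, lp in enumerate(log_probs):
        new = [-math.inf] * n_states
        ptr = [None] * n_states
        allowed = (0, 1) if t < n_info else (0,)  # tail bits are known zeros
        for s in range(n_states):
            if score[s] == -math.inf:
                continue  # state not reachable at this step
            for b in allowed:
                ns, out = step(s, b)
                cand = score[s] + sum(lp[i][v] for i, v in enumerate(out))
                if cand > new[ns]:
                    new[ns], ptr[ns] = cand, (s, b)
        score = new
        back.append(ptr)
    state, bits = 0, []  # zero-termination: the path ends in state 0
    for ptr in reversed(back):
        state, b = ptr[state]
        bits.append(b)
    bits.reverse()
    return bits[:n_info]


def soft_from_hard(received, p_err=0.1):
    """Stand-in for NN output: hard bits turned into log-probabilities."""
    good, bad = math.log(1 - p_err), math.log(p_err)
    pairs = [received[i:i + 2] for i in range(0, len(received), 2)]
    return [[(good, bad) if r == 0 else (bad, good) for r in pair]
            for pair in pairs]


if __name__ == "__main__":
    msg = [1, 0, 1, 1, 0, 0, 1, 0]
    noisy = encode(msg)
    noisy[3] ^= 1  # flip one coded bit
    print(viterbi(soft_from_hard(noisy), len(msg)) == msg)
```

In the integrated scheme, the role of `soft_from_hard` would be played by the basecaller's probabilities, so the code constraints are applied before any hard decision on the bases is made.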

Overall inner code design
(a) Inner code encoding
• Segment #265 → attach index and CRC (payload, 8-bit CRC, 12-bit index)
• Convolutional encoding
• Map to DNA (2 bits per base)
(b) Inner code decoding
• Convolutional list decoding
• Select topmost list element with correct CRC (if any) → Segment #265
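
The framing steps around the convolutional code can be sketched as follows: attach a 12-bit index and an 8-bit CRC, map bits to bases at 2 bits per base, and pick the first list-decoder candidate whose CRC checks. The CRC polynomial (0x07), the field order (index, payload, CRC), the bit-to-base mapping, and all function names are assumptions for illustration; the convolutional encoding step between framing and DNA mapping is left out here.

```python
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def crc8(bits, poly=0x07):
    """Bitwise CRC-8 over a list of bits (polynomial is an assumption)."""
    reg = 0
    for b in bits:
        top = ((reg >> 7) & 1) ^ b
        reg = (reg << 1) & 0xFF
        if top:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(7, -1, -1)]


def int_to_bits(value, width):
    """Big-endian bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def frame(payload_bits, index):
    """Assumed layout: 12-bit index, then payload, then 8-bit CRC."""
    body = int_to_bits(index, 12) + list(payload_bits)
    return body + crc8(body)


def crc_ok(frame_bits):
    return crc8(frame_bits[:-8]) == frame_bits[-8:]


def bits_to_dna(bits):
    """Map each pair of bits to one base (2 bits per base)."""
    assert len(bits) % 2 == 0
    return "".join(BASES[2 * bits[i] + bits[i + 1]]
                   for i in range(0, len(bits), 2))


def dna_to_bits(seq):
    out = []
    for base in seq:
        v = BASES.index(base)
        out += [v >> 1, v & 1]
    return out


def select_from_list(candidates):
    """Return the highest-ranked candidate with a valid CRC, else None."""
    for cand in candidates:
        if crc_ok(cand):
            return cand
    return None


if __name__ == "__main__":
    f = frame([1, 0, 1, 1, 0, 0, 1, 0], index=265)
    corrupted = list(f)
    corrupted[5] ^= 1
    print(bits_to_dna(f), select_from_list([corrupted, f]) == f)
```

The CRC lets the decoder reject wrong candidates from the list, and the index identifies which segment a decoded read belongs to.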

Experiments and results

Experiments
• Data: 11 KB of data: the Gettysburg Address, UN Declaration, “I Have a Dream” speech, poem collections, …
• Final error correction code design:
  • Reed-Solomon outer code: 30% redundancy (default)
• Pretrained model from the ONT Flappie basecaller
• Synthesis: data synthesized using CustomArray synthesis, into oligos of length ~165
• Experiments:
  - Rate of convolutional code: r = 1/2, 3/4, 5/6
  - Memory: m = 8, 11, 14
  - List size: 4, 8

Results
Figure: writing cost (bases/bit) vs. reading cost (bases/bit) for convolutional codes with r = 1/2, 3/4, 5/6. Series: m=8, L=8; m=11, L=8; m=14, L=4; previous works [6], [22]. Annotation: 3x improvement in reading cost!
[6] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[22] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

Conclusions and future work
• Novel error-correction mechanism for nanopore-sequencing-based DNA storage
  • Use “soft information” from the raw signal to improve decoding
  • Use the neural net in the basecaller to distil information from the “hard-to-model” raw signal
  • Use convolutional codes that align nicely with the sequential nanopore model
  • Requires 3x fewer reads for decoding than previous works
• Future work:
  • Optimization of convolutional code and CRC parameters
  • Fine-tuning of the neural network model and use of improved basecallers
  • Application to other novel synthesis methodologies

Thank You! Code and data available at https://github.com/shubhamchandak94/nanopore_dna_storage
