
Coding for Optimized Writing Rate in DNA Storage
Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Shuki Bruck, IEEE ISIT 2020


  1. Coding for Optimized Writing Rate in DNA Storage Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Shuki Bruck IEEE ISIT 2020

  2. DNA Storage • Information → DNA Synthesis (Writing) → Storage Medium (Multiple Strands of DNA) → DNA Sequencing (Reading) → Reconstruction • In this talk: DNA Synthesis (Writing)

  3. Current DNA Synthesis Systems • Slow • Expensive

  4. Terminator Free DNA Synthesis (H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. ) • Faster • Cheaper • Noisy

  5. Terminator Free DNA Synthesis Channel (Lee et al., 2019) • Previous Symbol: A • Current Write Symbol: C • Write Time: $t$ • Sequence: A C C C C • (Sticky Insertion) Noise • Distribution $D$

  6. Terminator Free DNA Synthesis Channel (Lee et al., 2019) • Previous Symbol: A • Current Write Symbol: C • Write Time: $t$ • Sequence: A C C C C • (Sticky Insertion) Noise • Distribution $D_{A \to C}$

  7. Terminator Free DNA Synthesis Channel (Lee et al., 2019) • Previous Symbol: A • Current Write Symbol: C • Write Time: $t$ • Sequence: A C C C C • (Sticky Insertion) Noise • Distribution $D_{A \to C}(t)$

  8. Terminator Free DNA Synthesis Channel (Lee et al., 2019) • Sequence to be synthesized: ACTAG • Start: A • Round 1: ACCC, run length drawn from $D_{A \to C}(t_1)$ • Round 2: ACCCTT, $D_{C \to T}(t_2)$ • Round 3: ACCCTTA, $D_{T \to A}(t_3)$ • Round 4: ACCCTTAGGGGG, $D_{A \to G}(t_4)$ • Length of run in each round is given by a distribution $D$ which depends on: previous symbol, current symbol, time of synthesis
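
The round-by-round behaviour on this slide can be sketched as a small simulation. This is only an illustration: the run-length model (one guaranteed symbol plus a binomial number of extra copies that grows with the round time) and the names `synthesize`, `growth_prob`, and `slots_per_unit` are assumptions, not the distribution measured by Lee et al., and the dependence on the symbol pair is ignored.

```python
import random

def synthesize(target, round_times, rng, growth_prob=0.5, slots_per_unit=4):
    """Simulate one noisy strand for `target`, e.g. "ACTAG".

    Each round appends a run of the next symbol. ASSUMPTION: the run length is
    1 + Binomial(slots_per_unit * t, growth_prob); the real distribution also
    depends on the previous and current symbols, which is ignored here.
    """
    strand = target[0]  # the initial symbol is taken as already present
    for cur, t in zip(target[1:], round_times):
        extra = sum(rng.random() < growth_prob for _ in range(slots_per_unit * t))
        strand += cur * (1 + extra)
    return strand

def collapse_runs(strand):
    """Keep one symbol per run, discarding the run lengths."""
    out = [strand[0]]
    for ch in strand[1:]:
        if ch != out[-1]:
            out.append(ch)
    return "".join(out)

rng = random.Random(0)
strand = synthesize("ACTAG", round_times=[1, 1, 1, 2], rng=rng)
print(strand)                 # one noisy strand with a run per round
print(collapse_runs(strand))  # the transitions survive the sticky insertions
```

Longer round times give longer runs on average, which is the handle the rest of the talk uses to carry information.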

  9. Approach (Lee et al., 2019) • ACCCTTAGGGGG → ACTAG: Forget Runs and Encode Information in Transitions • Rate: $\log_2 3$ (each transition has three possible next symbols) • Can we do better?
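
A minimal sketch of transition-only coding, assuming a hypothetical mapping in which each base-3 digit selects one of the three symbols that differ from the previous one (in alphabetical order). The function names and the digit ordering are illustrative choices, not taken from Lee et al.

```python
ALPHABET = "ACGT"

def encode_trits(trits, start="A"):
    """Map base-3 digits to a strand in which no symbol repeats."""
    strand = start
    for d in trits:
        choices = [s for s in ALPHABET if s != strand[-1]]  # three options
        strand += choices[d]
    return strand

def decode_trits(noisy_strand):
    """Collapse runs, then invert the transition mapping."""
    collapsed = [noisy_strand[0]]
    for ch in noisy_strand[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    trits = []
    for prev, cur in zip(collapsed, collapsed[1:]):
        choices = [s for s in ALPHABET if s != prev]
        trits.append(choices.index(cur))
    return trits

print(encode_trits([0, 2, 0, 1]))    # ACTAG
print(decode_trits("ACCCTTAGGGGG"))  # [0, 2, 0, 1]
```

Each round carries one of three values regardless of how long it lasts, which is where the $\log_2 3$ bits per round comes from.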

  10. Precision Resolution (PR) Framework (M. Schwartz and J. Bruck, “On the capacity of the precision-resolution system,” IEEE Trans. Inform. Theory, vol. 56, no. 3, pp. 1028–1037, 2010.) 0100010010010101 • Information is encoded in the lengths of runs of 0’s. • Clock frequency mismatch between Tx and Rx can result in erroneous measurement of run lengths. • The PR framework provides an optimal set of run lengths that can be recovered without any error.
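
To make "information encoded in the lengths of runs of 0's" concrete, here is a small sketch that reads the run lengths off the example string from this slide. The helper name `zero_run_lengths` is made up for illustration.

```python
def zero_run_lengths(bits):
    """Lengths of the runs of 0's that precede each 1."""
    runs, current = [], 0
    for b in bits:
        if b == "0":
            current += 1
        else:
            runs.append(current)
            current = 0
    return runs

print(zero_run_lengths("0100010010010101"))  # [1, 3, 2, 2, 1, 1]
```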

  11. Precision Resolution (PR) Framework • Assumptions: 1. Run Length noise is independent of the location of the run. 2. Noisy Run Lengths have a finite support. PR framework cannot be directly applied for the Terminator Free DNA Synthesis Channel. Why?

  12. Memory • Round 1: ACCC, $D_{A \to C}(t_1)$ • Round 2: ACCCTT, $D_{C \to T}(t_2)$ • Round 3: ACCCTTA, $D_{T \to A}(t_3)$ • Round 4: ACCCTTAGGGGG, $D_{A \to G}(t_4)$ • Distribution $D$ depends on the previous symbol

  13. Quantization Error • Distribution $D$ doesn’t have a finite support. • The quantizer may make an error in detecting the round duration. • We assume the probability of this error to be ≤ $\delta$.

  14. Multiple Copies • Multiple DNA strings can be synthesized for the same user information. • They can be used to improve the overall scheme.

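  15. Encode Information in round times • $t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$ • $t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$ • $t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$ • $t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$ • [Figure: graph $G$ on vertices A, C, G, T with edges labeled by the round times above; the figure is labeled $S(G)$]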

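  16. Convert $G$ to a simple graph $G'$ • Add auxiliary vertices: an edge of weight 3 between A and T becomes a path through $d_1$, $d_2$ with three edges of weight 1 • Perron-Frobenius Theory: $\operatorname{cap}(S(G)) = \log_2 \lambda(A_{G'})$ • $A_{G'}$: Adjacency Matrix of $G'$ • $\lambda(A_{G'})$: Maximum Eigenvalue of $A_{G'}$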
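
The conversion and the capacity formula on this slide can be sketched numerically. The round-time table below is read from the example slide of this talk (symbols A, C, G, T, allowed round times per transition); the helper names and the use of plain power iteration in place of a linear-algebra library are choices made for this illustration only.

```python
import math

# Allowed round times per transition, as read from the example slide.
ROUND_TIMES = {
    ("A", "C"): (1, 2), ("A", "G"): (1, 3), ("A", "T"): (1, 3),
    ("C", "A"): (1, 2), ("C", "G"): (1, 2), ("C", "T"): (1, 3),
    ("G", "A"): (1, 2), ("G", "C"): (1, 3), ("G", "T"): (1, 2),
    ("T", "A"): (1, 4), ("T", "C"): (1, 2), ("T", "G"): (1, 4),
}

def build_simple_graph(round_times):
    """Adjacency matrix of G': an edge of time t becomes a path of t unit edges."""
    names = ["A", "C", "G", "T"]
    edges = []
    for (u, v), times in round_times.items():
        for t in times:
            prev = u
            for step in range(1, t):  # t - 1 auxiliary vertices
                aux = f"{u}>{v}:{t}:{step}"
                names.append(aux)
                edges.append((prev, aux))
                prev = aux
            edges.append((prev, v))
    index = {name: i for i, name in enumerate(names)}
    adj = [[0] * len(names) for _ in names]
    for a, b in edges:
        adj[index[a]][index[b]] += 1
    return adj

def spectral_radius(adj, iterations=2000):
    """Largest eigenvalue by power iteration (G' here is primitive)."""
    n = len(adj)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        y = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(y)
        x = [v / lam for v in y]
    return lam

adj = build_simple_graph(ROUND_TIMES)
cap = math.log2(spectral_radius(adj))
print(f"cap(S(G)) is roughly {cap:.3f} bits per unit of synthesis time")
```

With only the time-1 edges the graph is the complete digraph on four vertices, whose largest eigenvalue is 3, so the value printed here can be compared against $\log_2 3 \approx 1.585$.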

  17. Framework Description • Maximal round time decoding error ($\delta > 0$) • Maximal round time ($M$) • Allowable round times for a given transition $b \to a$: $1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$

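  18. Example: $\ell = 2$, $M = 4$ • $t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$ • $t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$ • $t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$ • $t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$ • [Figure: graph $G$ on vertices A, C, G, T with edges labeled by the round times above; the figure is labeled $S(G)$]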

  19. Framework Description • Maximal round time decoding error ($\delta > 0$) • Maximal round time ($M$) • Allowable round times for a given transition $b \to a$: $1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$ • Number of copies ($N$) • Quantizing Function $\mathbb{Q}_{b \to a} : \mathbb{N}^N \to [\ell]$ • Receiver: $r_1, r_2, \ldots, r_N$ s.t. $r_j \sim D_{b \to a}(t_{b \to a}^{(i)})$ • $\mathbb{Q}(r_1, r_2, \ldots, r_N) = \hat{i}$ s.t. $\Pr[\hat{i} = i] \ge 1 - \delta$

  20. $t_{C \to G} = \{1, 2\}$ • Run lengths are drawn from $D_{C \to G}(1)$ or $D_{C \to G}(2)$ • Receiver: CGGG, CGGGGG, CGGGG, CGGGGGG, CGGGGG • Quantizing Function
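
One way such a quantizing function could look is a maximum-likelihood rule over the $N$ observed run lengths. The sketch below is not the quantizer design from the talk: it assumes, purely for illustration, that run lengths are i.i.d. Poisson with mean proportional to the round time, and the names `ml_quantizer` and `mean_per_unit_time` are hypothetical.

```python
import math

def ml_quantizer(run_lengths, round_times, mean_per_unit_time):
    """Return the 1-based index i of the most likely round time t^(i).

    ASSUMPTION: run lengths are i.i.d. Poisson with mean
    mean_per_unit_time * t. This is an illustrative model only.
    """
    n, total = len(run_lengths), sum(run_lengths)
    best_index, best_score = None, -math.inf
    for i, t in enumerate(round_times, start=1):
        mean = mean_per_unit_time * t
        # Poisson log-likelihood, dropping the term that does not depend on t.
        score = total * math.log(mean) - n * mean
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Run lengths of G in the five strands on this slide: 3, 5, 4, 6, 5.
print(ml_quantizer([3, 5, 4, 6, 5], round_times=(1, 2), mean_per_unit_time=3.0))
```

The decision depends on the assumed mean per unit time, so the printed index says nothing about which round time the slide's figure intends. Averaging over more copies narrows the spread of the statistic, which is how $N$ helps meet the $1 - \delta$ requirement.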

  21. $s$ rounds, where for each round there are $\ell$ possible round times • Encoder: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding (because of the $\delta$ error introduced by the channel) → Terminator Free DNA Synthesis Channel → $N$ copies • Decoder: Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $\hat{m} \in \{0,1\}^k$

  22. Theorem • Let $G'$ be the ordinary version of $G$. Further assume the $k$ user information bits are i.i.d. uniform random bits. Then for all large enough $k$, the user information bits may be encoded into a sequence using at most [bound garbled in extraction; recoverable terms, in order: $k$, $1$, $(1 + \alpha - 1 \log_{[\ldots]}(\ell))$, $C_{\delta,\ell}$, $\operatorname{cap}(S(G))$] synthesis time, and be decoded correctly with high probability. Here $\alpha$ is the sum of probabilities of non-auxiliary vertices in the stationary distribution of the max-entropic Markov chain and $C_{\delta,\ell} \triangleq 1 + \delta \log_\ell \frac{\delta}{\ell - 1} + (1 - \delta) \log_\ell (1 - \delta)$
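
The quantity $C_{\delta,\ell}$ defined in the theorem, $1 + \delta \log_\ell \frac{\delta}{\ell - 1} + (1 - \delta) \log_\ell (1 - \delta)$, has the form of the capacity of an $\ell$-ary symmetric channel with error probability $\delta$, measured in $\ell$-ary symbols. A small sketch for evaluating it follows; the function name is made up, and the parameters $\delta = 0.02$, $\ell = 2$ are the ones that appear on the example and experiment slides.

```python
import math

def symmetric_channel_factor(delta, ell):
    """C_{delta,ell} = 1 + delta*log_ell(delta/(ell-1)) + (1-delta)*log_ell(1-delta)."""
    total = 1.0
    if delta > 0:  # the term tends to 0 as delta -> 0
        total += delta * math.log(delta / (ell - 1), ell)
    if delta < 1:  # the term tends to 0 as delta -> 1
        total += (1 - delta) * math.log(1 - delta, ell)
    return total

print(round(symmetric_channel_factor(0.02, 2), 4))  # about 0.8586
```

For $\ell = 2$ this is one minus the binary entropy of $\delta$, so a small quantizer error costs a correspondingly small fraction of the rate.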

  23. Synthesis-time bound of the Theorem, shown over the pipeline • Encoder: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding → Terminator Free DNA Synthesis Channel → $N$ copies • Decoder: Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $\hat{m} \in \{0,1\}^k$

  24. Experimental Results: Binomial Run Lengths • $\delta = 0.02$, $N = 5$ • [Plot]

  25. Poisson Run Lengths • [Plots: $\lambda_{[\ldots]}$ for different values of $\delta$; Achievable Rates for different values of $N$]

  26. Conclusion • Method for encoding information in DNA sequences based on the PR framework and the terminator free DNA synthesis method. • Rate above $\log_2 3$ can be achieved. • Method accounts for quantizer error $\delta$. • Provided method for designing quantizers for Binomial and Poisson run lengths. • As we have multiple copies, alignment can be used to account for deletion of runs.

  27. Thank You. Questions sidjain@caltech.edu
