Robust Data Storage in DNA with Error-Correcting Codes
Robert Grass⋆ and Reinhard Heckel∗⋆
ETH Zurich⋆, IBM Research Zurich∗
September 28, 2015
Research performed at ETH Zurich
Thanks to Prof. W. Stark, D. Paunescu, M. Puddu
DNA
◮ DNA is a molecule storing genetic information of organisms
◮ We view DNA as a string of four different nucleotides
[Figure: a DNA strand drawn as a string over the nucleotides A, C, G, T]
2 / 15
Storing information in DNA
Binary data can be encoded as a DNA string: 01011010 → encode → ACGTACGT
◮ First message written on DNA in the 1990s
◮ Church et al. (Science 2012) and Goldman et al. (Nature 2013) stored about 1 Mb on DNA
However, previous approaches are not robust!
3 / 15
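A minimal sketch of how binary data could be mapped to nucleotides, assuming a fixed two-bits-per-nucleotide table (00→A, 01→C, 10→G, 11→T). The table and the function names `bits_to_dna` and `dna_to_bits` are hypothetical. The slide's ACGTACGT is only an illustration, and the scheme presented later in the talk uses codes over GF(47) rather than this direct mapping.

```python
# Hypothetical 2-bits-per-nucleotide table; not the mapping used in the talk.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def bits_to_dna(bits: str) -> str:
    """Map a bit string of even length to a nucleotide string."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Invert bits_to_dna."""
    return "".join(NT_TO_BITS[nt] for nt in dna)


encoded = bits_to_dna("01011010")
decoded = dna_to_bits(encoded)
print(encoded, decoded)
```

A direct mapping like this has no redundancy, so a single base error or a lost segment corrupts the data, which is the robustness problem the slide points to.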
Our contribution
◮ Making DNA data storage robust and cheaper by using error-correcting codes and storing DNA in synthetic fossils
◮ Information can be recovered from DNA stored at the Global Seed Vault (−18 °C) after over 1 million years
4 / 15
Motivation: Maximum storage times
[Figure: maximum storage times on a logarithmic axis from 1 to 100,000 years; labeled examples include Archimedes' “The Method” and DNA in ancient bone]
5 / 15
Motivation: Information density
[Figure: information density on a logarithmic axis from 0.01 to 1000 Gbit/mm³, with “Our work” marked]
6 / 15
DNA is not a disc: The DNA channel
◮ Can only read and write short DNA segments
◮ Segments cannot be spatially ordered
[Figure: 01011010 → encode → synthesize (ACATACGT, CATGTACA, GCTATGCC) → sequence (GCTATGCC, CATGTACA, ACATACGT, read back in arbitrary order) → decode → 01011010]
Error sources:
◮ Individual base errors: ‘CTACA...’ instead of ‘ATACG...’
◮ Loss of complete sequences
7 / 15
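The channel described on this slide can be sketched as a toy simulation: each segment may be lost entirely, each base may be substituted, and the surviving segments come back in arbitrary order. The function name `dna_channel` and the probabilities used here are illustrative assumptions, not measured values from the talk.

```python
import random

NUCLEOTIDES = "ACGT"


def dna_channel(segments, sub_prob=0.01, loss_prob=0.1, seed=0):
    """Toy channel: drop whole segments, substitute bases, shuffle the rest."""
    rng = random.Random(seed)
    received = []
    for seg in segments:
        if rng.random() < loss_prob:
            continue  # the complete sequence is lost
        bases = []
        for base in seg:
            if rng.random() < sub_prob:
                # Replace the base with one of the three other nucleotides.
                base = rng.choice([b for b in NUCLEOTIDES if b != base])
            bases.append(base)
        received.append("".join(bases))
    rng.shuffle(received)  # segments come back without spatial order
    return received


sent = ["ACATACGT", "CATGTACA", "GCTATGCC"]
received = dna_channel(sent, sub_prob=0.05, loss_prob=0.2, seed=1)
print(received)
```

With both probabilities set to zero the channel only permutes the segments, which is why an index is needed even in the noiseless case.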
Encoding and decoding scheme
[Figure: block diagram. Encoding: encode outer code → add unique index to each symbol → encode each column → DNA synthesis. Decoding: DNA sequencing → decode inner code → sort and remove indices → decode outer code]
8 / 15
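The indexing step of the scheme can be sketched as follows: each payload is tagged with a unique index before synthesis, and after sequencing the reads are sorted by index, with missing positions marked as erasures for the outer decoder. This is a simplified illustration: the helper names are hypothetical, the index is kept as a Python integer rather than being encoded in nucleotides, and the inner and outer codes are left out.

```python
import random


def add_indices(payloads):
    """Pair each payload with its position so the order can be restored later."""
    return [(i, p) for i, p in enumerate(payloads)]


def sort_and_remove_indices(received, total):
    """Restore the order; positions never received stay None (erasures)."""
    ordered = [None] * total
    for index, payload in received:
        if 0 <= index < total:
            ordered[index] = payload
    return ordered


payloads = ["ACATACGT", "CATGTACA", "GCTATGCC", "TGCATGCA"]
indexed = add_indices(payloads)

pool = indexed[:]                               # what sequencing returns
random.Random(0).shuffle(pool)                  # the order is lost
pool = [item for item in pool if item[0] != 2]  # one sequence is lost entirely

restored = sort_and_remove_indices(pool, total=len(payloads))
print(restored)
```

Because the position of the lost sequence is known after sorting, the outer code can treat it as an erasure rather than as an error at an unknown position.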
Encoding and decoding scheme
Inner code:
◮ Reed-Solomon code over GF(47) with n = 39, k = 33
◮ Corrects individual base errors
Outer code:
◮ Reed-Solomon code over the extension field GF(47^30) with N = 713, K = 594
◮ Recovers lost sequences (erasures)
◮ Corrects errors from the inner decoder
Why GF(47)?
◮ Allows avoiding runs of length > 3, such as ‘CTAGGGG’, which significantly increase reading errors
Information-theoretically close to optimal
9 / 15
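The parameters above can be checked with a short calculation. The sketch below uses the standard Reed-Solomon guarantee that an (n, k) code corrects up to ⌊(n − k)/2⌋ symbol errors or up to n − k erasures. The triplet rule (three nucleotides per GF(47) symbol, last two bases different) is an assumption chosen because it yields at least 47 triplets and keeps runs at length 3 or less; the slide does not state the mapping, and the choice of which 47 triplets to use is arbitrary here.

```python
from itertools import product

# Code parameters as stated on the slide.
Q = 47                   # field size of the inner code
n, k = 39, 33            # inner Reed-Solomon code over GF(47)
N, K = 713, 594          # outer Reed-Solomon code over GF(47^30)

# Standard Reed-Solomon guarantees (minimum distance n - k + 1).
inner_correctable_errors = (n - k) // 2   # symbol errors per sequence
outer_recoverable_erasures = N - K        # lost sequences, if no errors remain
overall_rate = (k / n) * (K / N)          # counts index symbols as payload

# Assumed mapping rule (not stated on the slide): use nucleotide triplets
# whose last two bases differ. There are 4 * 4 * 3 = 48 >= 47 of them.
triplets = ["".join(t) for t in product("ACGT", repeat=3) if t[1] != t[2]]
symbol_to_triplet = triplets[:Q]          # arbitrary choice of 47 triplets


def longest_run(seq: str) -> int:
    """Length of the longest run of identical consecutive characters."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


# A run of 4 spans at most two adjacent triplets, so checking pairs suffices.
worst_run = max(longest_run(a + b) for a in triplets for b in triplets)

print(inner_correctable_errors, outer_recoverable_erasures)
print(round(overall_rate, 3), len(triplets), worst_run)
```

Under this assumed rule, any concatenation of triplets has runs of at most three identical bases, which matches the constraint named on the slide.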
Protecting DNA from decay and environment
Dry storage of DNA:
◮ in bone
◮ in amber
◮ in silica
10 / 15
Protection through DNA encapsulation in silica
Paunescu, Fuhrer, Grass, Angew. Chem. Int. Ed. 2013.
Paunescu, Grass et al., Nat. Protoc. 2013.
11 / 15
Accelerated aging experiment
[Figure: experiment pipeline. Text from Archimedes, “The Method” (“Give me a place to stand and with a lever I will move the whole world...”) → encode → DNA sequences (ACATACGT, CATGTACA, GCTATGCC) → synthesis → encapsulation → storage at 70 °C → release → sequencing → decode → recovered text]
12 / 15
Errors in and loss of whole sequences
[Figure: bar chart comparing Original DNA, 1/2 week at 70 °C, and 1 week at 70 °C. Legend as extracted: initial error, error after inner decoding, error outer code, erasure, final error. Bar labels as extracted: 59, 68, 71, 8.9, 5.5, 4.5, 3, 2.8, 2.5, 0.5, 0.4, 0.3, 0, 0, 0]
◮ In all cases the information could be reconstructed perfectly
◮ 1 week at 70 °C = 2000 years in Zurich = 2 million years at Global Seed Vault (−18.8 °C)
13 / 15
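The time equivalences on this slide imply acceleration factors that can be worked out directly. The sketch below also back-calculates the activation energy these numbers would imply if DNA decay followed an Arrhenius law. The Arrhenius model is an assumption made here for illustration; the slide does not state which model was used, and the resulting value is a consistency check, not a reported result.

```python
import math

R = 8.314  # gas constant in J/(mol*K)
WEEKS_PER_YEAR = 365.25 / 7

# Equivalences stated on the slide: 1 week at 70 °C corresponds to ...
years_zurich = 2_000
years_seed_vault = 2_000_000

# Acceleration factor = equivalent storage time / actual time (1 week).
accel_zurich = years_zurich * WEEKS_PER_YEAR
accel_seed_vault = years_seed_vault * WEEKS_PER_YEAR

# Assumption: the decay rate follows k = A * exp(-Ea / (R * T)).
T_hot = 70.0 + 273.15    # accelerated aging temperature in K
T_cold = -18.8 + 273.15  # Global Seed Vault temperature in K

# k(T_hot) / k(T_cold) = exp(Ea / R * (1 / T_cold - 1 / T_hot)),
# so Ea follows from the logarithm of the acceleration factor.
implied_ea = R * math.log(accel_seed_vault) / (1 / T_cold - 1 / T_hot)

print(f"{accel_zurich:.3g} {accel_seed_vault:.3g} {implied_ea / 1000:.0f} kJ/mol")
```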
Errors in individual sequences
[Figure: error probability in % (axis ticks 0.5 and 1) for each substitution type A2C, A2G, A2T, C2A, C2G, C2T, G2A, G2C, G2T, T2A, T2C, T2G, shown for Original DNA, 1/2 week at 70 °C, and 1 week at 70 °C]
14 / 15
Conclusion
◮ Digital information can be stored robustly for thousands of years in DNA
◮ Only the combination of error correction and DNA encapsulation in silica enables long-term storage
Thank you!
15 / 15