
A DNA-Based Archival Storage System

James Bornholt*, Randolph Lopez*, Douglas M. Carmean†, Luis Ceze*, Georg Seelig*, Karin Strauss† (*University of Washington, †Microsoft Research)

Facebook cold storage facility: 1 exabyte (10^9 GB)


  1. Chunking data: break binary data into chunks stored in separate strands. Bits: 10100011 10010001 / 11100111 11000101 / 10010100 10111101. Base-4 digits: 2203 2101 / 3213 3011 / 2110 2331. Nucleotides: GGATGCAC / TGCTTACC / GCCAGTTC.

  3. Chunking data: addresses within the value are appended to each strand: GGATGCAC AAAA / TGCTTACC AAAC / GCCAGTTC AAAG.

  4. Chunking data: key identifiers (“primers”) and addresses within the value: ATGTT GGATGCAC AAAA CATCC / ATGTT TGCTTACC AAAC CATCC / ATGTT GCCAGTTC AAAG CATCC.
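
The chunking slides can be reproduced with a short sketch. It follows only what the slide example shows: two bits per nucleotide (0=A, 1=C, 2=G, 3=T), two bytes per strand, a four-nucleotide address, and the primer pair ATGTT / CATCC. The function names and parameters are illustrative assumptions, and the encoding used in the paper itself may differ from this simplified mapping.

```python
BASES = "ACGT"  # base-4 digit -> nucleotide, as read off the slide example


def bytes_to_bases(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits at a time (high bits first)."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def int_to_bases(value: int, width: int) -> str:
    """Write an integer as a fixed-width base-4 string of nucleotides."""
    digits = []
    for _ in range(width):
        digits.append(BASES[value % 4])
        value //= 4
    return "".join(reversed(digits))


def encode(data: bytes, chunk_size: int = 2, addr_width: int = 4,
           fwd: str = "ATGTT", rev: str = "CATCC") -> list[str]:
    """Split data into chunks; each strand is primer + payload + address + primer."""
    strands = []
    for i in range(0, len(data), chunk_size):
        payload = bytes_to_bases(data[i:i + chunk_size])
        address = int_to_bases(i // chunk_size, addr_width)
        strands.append(fwd + payload + address + rev)
    return strands


# The six bytes shown on the slide.
data = bytes([0b10100011, 0b10010001, 0b11100111,
              0b11000101, 0b10010100, 0b10111101])
strands = encode(data)
for strand in strands:
    print(strand)
```

The three printed strands match the ones on slide 4 when the spaces are removed.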

  5. Efficient reads. Strands: ATGTTGGATGCACAAAACATCC / ATGTTTGCTTACCAAACCATCC / ATGTTGCCAGTTCAAAGCATCC. Labels: key identifiers (“primers”); addresses within the value.

  6. Efficient reads. Strands: ATGTTGGATGCACAAAACATCC / ATGTTTGCTTACCAAACCATCC / ATTTCCATTCAAACATCCGGGG. Labels: key identifiers (“primers”); addresses within the value.

  7. Efficient reads: pool containing stored strands for all keys & values!

  8. Efficient reads: get(key) on the pool containing stored strands for all keys & values; cat.jpg.

  9. Random access. Labels: address; primers. Strands: ATGTTGGATGCACAAAACATCC / ATGTTTGCTTACCAAACCATCC / ATTTCCATTCAAACATCCGGGG.

  10. Random access: strands with 3 different primers.

  11. Random access: PCR. Selectively amplify strands based on their primer.

  13. Random access: sample. Almost all strands have desired primer.

  14. Random access: reads are destructive, so replenish when necessary.
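
The selective amplification step can be illustrated with a toy model. This is a sketch under strong assumptions, not a description of the lab protocol: a strand counts as matching if it starts and ends with the given primer strings, matching strands double every cycle, and non-matching strands are left unchanged. The primers ATGTT / CATCC and the function names are illustrative choices taken from the example strands.

```python
from collections import Counter


def has_primers(strand: str, fwd: str, rev: str) -> bool:
    """Toy matching rule: the strand begins with fwd and ends with rev."""
    return strand.startswith(fwd) and strand.endswith(rev)


def amplify(pool: Counter, fwd: str, rev: str, cycles: int = 10) -> Counter:
    """Idealized PCR: matching strands double each cycle, others stay as-is."""
    amplified = Counter()
    for strand, copies in pool.items():
        factor = 2 ** cycles if has_primers(strand, fwd, rev) else 1
        amplified[strand] = copies * factor
    return amplified


def fraction_with_primers(pool: Counter, fwd: str, rev: str) -> float:
    """Share of all copies in the pool that carry the desired primer pair."""
    wanted = sum(c for s, c in pool.items() if has_primers(s, fwd, rev))
    return wanted / sum(pool.values())


# One copy of each example strand; the third carries different primers.
pool = Counter({
    "ATGTTGGATGCACAAAACATCC": 1,
    "ATGTTTGCTTACCAAACCATCC": 1,
    "ATTTCCATTCAAACATCCGGGG": 1,
})
before = fraction_with_primers(pool, "ATGTT", "CATCC")
after = fraction_with_primers(amplify(pool, "ATGTT", "CATCC"), "ATGTT", "CATCC")
print(before, after)
```

In this model the desired strands go from two thirds of the pool to nearly all of it, which is the effect the slide summarizes as "almost all strands have desired primer".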

  15. Error correction: both synthesis and sequencing are error prone. Original: GGATGCA. Insertions: GGATAGCA. Deletions: GGATGA. Substitutions: GGATCCA. Error rates ~1% per nucleotide!
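
The three error types can be written as string edits. The sketch below reproduces the slide's examples from the original GGATGCA; the edit positions are inferred by comparing the strings and are an assumption. The final calculation assumes independent errors at exactly 1% per nucleotide on a 22-nucleotide strand (the length of the example strands), which is a simplification.

```python
def insert(strand: str, index: int, base: str) -> str:
    """Insert one nucleotide before position index."""
    return strand[:index] + base + strand[index:]


def delete(strand: str, index: int) -> str:
    """Drop the nucleotide at position index."""
    return strand[:index] + strand[index + 1:]


def substitute(strand: str, index: int, base: str) -> str:
    """Replace the nucleotide at position index."""
    return strand[:index] + base + strand[index + 1:]


original = "GGATGCA"
print(insert(original, 4, "A"))      # insertion example from the slide
print(delete(original, 5))           # deletion example from the slide
print(substitute(original, 4, "C"))  # substitution example from the slide

# Chance that a 22-nucleotide strand has no error at all, assuming
# independent errors at 1% per nucleotide (illustrative assumption).
p_error_free = (1 - 0.01) ** 22
print(round(p_error_free, 3))
```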

  16. Logical redundancy

  17. Logical redundancy. Strand fields: primer, data, address.

  20. Logical redundancy: XOR redundancy provides simple error correction.

  21. Logical redundancy: reserved address space to indicate redundancy data.
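
A minimal sketch of why XOR gives simple error correction, assuming the redundancy strand stores the XOR of two data payloads (the slides do not spell out the exact pairing, so this layout is an assumption). With any two of the three payloads, the third can be recomputed. The helper name `xor_bytes` is hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two payloads taken from the chunking example.
a = bytes([0b10100011, 0b10010001])
b = bytes([0b11100111, 0b11000101])

# Redundancy payload, stored as an extra strand (assumed layout).
parity = xor_bytes(a, b)

# If the strand holding `a` is lost, it can be rebuilt from the other two,
# and likewise for `b`.
recovered_a = xor_bytes(parity, b)
recovered_b = xor_bytes(parity, a)
print(recovered_a == a, recovered_b == b)
```

This handles a missing or unreadable strand; it does not by itself locate an error inside a strand that was read incorrectly.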

  22. Wet lab results

  33. The process. Sequences shown: catcatgg; catcatgc. Throughput: MBs/week.

  34. Decoding. Encoded and synthesized 3 files (151 kB).

  35. Photo: Tara Brown / UW

  37. Decoding. Selected and PCRed one file for random access (42 kB).

  38. Decoding. Sequenced and decoded the resulting amplified pool.

  39. Decoding. Recovered every bit despite errors in synthesis and sequencing.

  40. The importance of redundancy. Strand fields: primer, data, address.

  42. The importance of redundancy: if we ignore redundancy data, we cannot recover the file. [Histogram: frequency (0 to 75) vs. number of copies (0 to 7500).]

  43. The importance of redundancy: some strands are missing entirely.

  44. A DNA-based archival storage system. Diagram labels: Write, Store, Read; redundancy and density; efficient retrieval; wet lab experiments.

  45. A DNA-based archival storage system. Also in the paper: reliability-density trade-off; simulation of decay over time; error analysis; model of truncated strands.

  46. MBs/week GBs/second

  47. DNA productivity is growing. [Plot: productivity (10^2 to 10^10) vs. year (1970 to 2010); series: transistors on chip, reading DNA, writing DNA.] Source: Robert Carlson

  48. DNA technology is miniaturizing

  49. We’ve just barely scratched the surface. [Plot: accuracy (0% to 100%) vs. reads used (0.01% to 10%).]

  50. Our community has seen these challenges before: simulation; cache locality; latency-hiding optimizations; scheduling; error correction; spatial addressing; programming; circuit design with errors.
