
Coding for DNA Storage in Live Organisms (Moshe Schwartz)



  1. Coding for DNA Storage in Live Organisms
     Moshe Schwartz, Electrical & Computer Engineering, Ben-Gurion University, Israel

  2. Based on joint works with (alphabetically):
     • Jehoshua Bruck – Caltech
     • Ohad Elishco – Ben-Gurion University (now MIT)
     • Farzad Farnoud (Hassanzadeh) – University of Virginia
     • Siddharth Jain – Caltech
     • Yonatan Yehezkeally – Ben-Gurion University

  3. Science fiction, distant-future dream?

  4. No – it's just around the corner!

  5. DNA is a long string
     Genetic information is stored in DNA, which is a string of nucleotides: Adenine, Cytosine, Guanine, and Thymine.
     In E. coli bacteria, genetic information is stored in about $4 \cdot 10^6$ base pairs.
     In humans, genetic information is stored in over $3 \cdot 10^9$ base pairs.

  6. Why store information in DNA?
     DNA is dense! It stores information at the molecular level.
     DNA can potentially hold $250 \cdot 2^{50}$ bytes (250 petabytes) of information in 1 gram of DNA.
     If we were to use 8 TB hard drives to store the same amount, we would need 32000 hard drives, with a total weight of about 25 tons!
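
The drive count on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming binary prefixes (1 PB = $2^{50}$ bytes, 1 TB = $2^{40}$ bytes), as the slide's $250 \cdot 2^{50}$ figure suggests, and assuming metric tons. The per-drive weight is only what the slide's 25-ton total implies, not a measured value.

```python
# Assumption: binary prefixes, matching the slide's 250 * 2^50 bytes.
PETABYTE = 2**50
TERABYTE = 2**40

dna_bytes_per_gram = 250 * PETABYTE   # potential capacity of 1 gram of DNA
drive_bytes = 8 * TERABYTE            # one 8 TB hard drive

drives_needed = dna_bytes_per_gram // drive_bytes
print(drives_needed)                  # 32000

# Assumption: "25 tons" means 25,000 kg; this is the implied weight per drive.
implied_kg_per_drive = 25_000 / drives_needed
print(implied_kg_per_drive)
```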

  7. OK, but why in living organisms?
     • Reading from DNA is destructive, hence we need several copies. Living organisms replicate and solve this problem.
     • Data longevity is (potentially) better, due to replication of organisms.
     • The organism's outer shell provides extra protection.
     • Labeling organisms for biological studies.
     • Watermarking genetically modified organisms (GMOs).
     Main disadvantage: Mutations! We need error-correcting codes.

  8. Error-correcting codes – An age-old story
     An error-correcting code has two main components:
     1. An error ball: Its size and shape depend on the kind of errors the channel induces.
     2. A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding.

  9. What kinds of errors do we expect?
     [Figure: four error types acting on a string $uvw$: insertion, deletion, substitution ($v$ becomes $v'$), and duplication.]
     Which is the most common? Unknown yet, but…

  10. Repeated sequences are everywhere
      More than 50% of the human genome is repeated sequences! [1]
      Repetitions were shown to be connected with diseases such as cancer, myotonic dystrophy, Huntington's disease, and important phenomena such as chromosome fragility, expansion diseases, silencing genes, and rapid morphological variation.
      Repetitions are common in other species as well, and are claimed to be a major evolutionary force during vertebrate evolution. [1]
      [1] Lander et al., Nature 2001.

  11. Duplication processes may repeat
      ACTCA ⇒ ACTACTCA ⇒ ACTATACTCA ⇒ ACTATACACTCA
      "It is conceivable that a substantial portion of the unique genome, the part that is not known to contain repeated sequences, also has its origins in ancient repeated sequences that are no longer recognizable due to change over time." [2]
      [2] Lander et al., Nature 2001.

  12. Duplication processes may differ
      [Figure: four duplication types.
      • Tandem duplication: $uvw \Rightarrow uvvw$
      • End duplication: $uvw \Rightarrow uvwv$
      • Palindromic duplication: $uvw \Rightarrow uvv^{R}w$
      • Interspersed duplication: $uvwz \Rightarrow uvwvz$]

  13. A formal definition
      Definition. Let $\Sigma$ be a finite alphabet, $s \in \Sigma^*$ some string, and $\mathcal{T} \subseteq {\Sigma^*}^{\Sigma^*}$ a set of string-duplication rules. A string-duplication system, $S$, defined by the tuple $(\Sigma, s, \mathcal{T})$, is the reflexive transitive closure of $\mathcal{T}$ operating on $s$, namely, $S \subseteq \Sigma^*$ is the minimal set for which:
      1. $s \in S$.
      2. $s' \in S$ and $T \in \mathcal{T}$ imply $T(s') \in S$.
      We write $S = S(\Sigma, s, \mathcal{T})$.
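
The closure in this definition can be explored by breadth-first search, as long as the search is cut off at some maximum length. The sketch below assumes that no rule ever shortens a string (true for duplication rules), so every string of length at most `max_len` in $S$ is reachable through strings of length at most `max_len`. The names `closure`, `double_first`, and `double_last` are hypothetical; the two toy rules are just length-1 duplications picked for illustration.

```python
from collections import deque

def closure(seed, rules, max_len):
    """Strings of length <= max_len in the system generated from seed.

    rules is a finite list of functions str -> str standing in for the
    rule set; each rule either lengthens its input or returns it unchanged.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        x = queue.popleft()
        for rule in rules:
            y = rule(x)
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

# Hypothetical toy rules: duplicate the first or the last symbol.
def double_first(x):
    return x[:1] + x

def double_last(x):
    return x + x[-1:]

reachable = closure("ab", [double_first, double_last], max_len=4)
print(sorted(reachable, key=lambda t: (len(t), t)))
```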

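  14. End duplication – formally
      Definition (End Duplication).
      $$T^{\mathrm{end}}_{i,k}(x) = \begin{cases} uvwv & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise,} \end{cases} \qquad \mathcal{T}^{\mathrm{end}}_k = \left\{ T^{\mathrm{end}}_{i,k} \;\middle|\; i \geq 0 \right\}.$$
      The end-duplication system is defined as $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$.
      [Figure: $uvw \Rightarrow uvwv$, with $|v| = k$.]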
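
A minimal sketch of the rule $T^{\mathrm{end}}_{i,k}$ on Python strings. The function name `end_dup` is hypothetical, and treating $k < 1$ as the "otherwise" case is an assumption made here for convenience.

```python
def end_dup(x, i, k):
    """T^end_{i,k}: copy the k symbols starting at offset i to the end of x.

    If x cannot be written as uvw with |u| = i and |v| = k, x is returned
    unchanged (the "otherwise" branch of the definition).
    """
    if i < 0 or k < 1 or len(x) < i + k:
        return x
    return x + x[i:i + k]

print(end_dup("ACTCA", 1, 2))  # u = "A", v = "CT", w = "CA"  ->  "ACTCACT"
print(end_dup("AC", 1, 2))     # too short for i = 1, k = 2   ->  "AC"
```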

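  15. Tandem duplication – formally
      Definition (Tandem Duplication).
      $$T^{\mathrm{tan}}_{i,k}(x) = \begin{cases} uvvw & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise,} \end{cases} \qquad \mathcal{T}^{\mathrm{tan}}_k = \left\{ T^{\mathrm{tan}}_{i,k} \;\middle|\; i \geq 0 \right\}.$$
      The tandem-duplication system is defined as $S^{\mathrm{tan}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_k)$.
      [Figure: $uvw \Rightarrow uvvw$, with $|v| = k$.]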
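
A minimal sketch of $T^{\mathrm{tan}}_{i,k}$, applied to the ACTCA chain from the "Duplication processes may repeat" slide. The function name `tandem_dup` is hypothetical. Note that the chain mixes duplication lengths (3, then 2, then 2), whereas the system $S^{\mathrm{tan}}_k$ defined here fixes a single $k$.

```python
def tandem_dup(x, i, k):
    """T^tan_{i,k}: insert a copy of x[i:i+k] immediately after itself."""
    if i < 0 or k < 1 or len(x) < i + k:
        return x  # the "otherwise" branch of the definition
    return x[:i + k] + x[i:i + k] + x[i + k:]

s0 = "ACTCA"
s1 = tandem_dup(s0, 0, 3)  # duplicates "ACT"
s2 = tandem_dup(s1, 2, 2)  # duplicates "TA"
s3 = tandem_dup(s2, 5, 2)  # duplicates "AC"
print(s0, s1, s2, s3)
```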

  16. How expressive is a duplication system?
      Definition. The capacity of a string system $S \subseteq \Sigma^*$ is defined by
      $$\mathrm{cap}(S) = \limsup_{n \to \infty} \frac{\log_2 |S \cap \Sigma^n|}{n}.$$
      Definition. Let $S \subseteq \Sigma^*$ be a string system. We shall say $S$ is fully expressive if for every $v \in \Sigma^*$ there exist $u, w \in \Sigma^*$ such that $uvw \in S$.
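
To get a feel for the capacity definition, one can count $|S \cap \Sigma^n|$ by brute force for small $n$. The sketch below does this for the binary seed `01` with $k = 1$, under both duplication rules. The helper names are hypothetical, and a finite-$n$ ratio is only an illustration of the limsup, not a computation of it.

```python
from math import log2

def tandem_dup(x, i, k):
    return x if len(x) < i + k else x[:i + k] + x[i:i + k] + x[i + k:]

def end_dup(x, i, k):
    return x if len(x) < i + k else x + x[i:i + k]

def count_by_length(seed, dup, k, max_len):
    """Return {n: |S ∩ Σ^n|} for n <= max_len, by exhaustive search."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        nxt = []
        for x in frontier:
            for i in range(len(x) - k + 1):  # all offsets where |v| = k fits
                y = dup(x, i, k)
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    nxt.append(y)
        frontier = nxt
    counts = {}
    for x in seen:
        counts[len(x)] = counts.get(len(x), 0) + 1
    return counts

n = 12
tan = count_by_length("01", tandem_dup, 1, n)  # only strings 0^a 1^b
end = count_by_length("01", end_dup, 1, n)     # every string starting with 01
print(tan[n], log2(tan[n]) / n)
print(end[n], log2(end[n]) / n)
```

With $k = 1$, tandem duplication from `01` only produces strings of the form $0^a 1^b$, so the count grows linearly in $n$ and the ratio drifts toward 0. End duplication reaches every string that starts with `01`, so the ratio approaches $\log_2 |\Sigma| = 1$.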

  17. We are interested in:
      • How does the capacity depend on the choice of duplication rules?
      • How does the capacity depend on the choice of seed string?
      • Which systems are fully expressive?
      • What is the connection between capacity and full expressiveness?

  18. Some related previous work exists
      Tandem duplication was studied in the context of formal languages:
      • Martín-Vide and Paun, Acta Cybernetica (1999): Where are tandem-duplication languages located in the Chomsky hierarchy?
      • Dassow, Mitrana and Paun, Bull. of the EATCS (1999): Binary tandem-duplication languages are regular.
      • Ming-Wei, Bull. of the EATCS (2000): Non-binary tandem-duplication languages are irregular.

  19. More related previous work exists
      Tandem duplication was studied in an algorithmic context:
      • Main and Lorentz, J. Alg. (1984), Gusfield and Stoye, J. Comp. and Systems Sci. (2004): How to efficiently find tandem duplications in a string.
      • Matroud, Hendy, and Tuffley, Nucleic Acids Research (2011): How to efficiently find nested tandem duplications.
      • Elemento et al., Molecular Bio. and Evolution (2002), Lajoie et al., J. Comp. Biology (2007), Brejová et al., Phil. Trans. R. Soc. A (2014): How to reconstruct the derivation process of a tandem-duplicated string.

  20. End duplication has full capacity
      Assumption. The initial string $s$ contains every symbol of $\Sigma$ at least once.
      Theorem. For $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$, $|s| \geq k$,
      $$\mathrm{cap}(S^{\mathrm{end}}_k) = \log_2 |\Sigma|.$$

  21. End duplication has full capacity (Cont.)
      Proof. ($\leq$): We obviously have
      $$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \leq \limsup_{n \to \infty} \frac{\log_2 |\Sigma^n|}{n} = \log_2 |\Sigma|.$$

  22. End duplication has full capacity (Cont.)
      Proof. ($\geq$): We claim that starting with any string $s \in \Sigma^{\geq k}$, with each symbol appearing at least once, and any $w = w_1 w_2 \dots w_k \in \Sigma^k$, we can derive a string $y$ with $w$ as a suffix.
      Step I: Duplicate prefix. Assume $s = uv$, $|u| = k$, then $s = uv \Rightarrow uvu = s'$.
      Observation: Every symbol of $\Sigma$ appears at the beginning of some $k$-substring of $s'$ and at the end of some $k$-substring of $s'$.
      Step II: Force $w_1$ at the end.
      [Figure: a $k$-substring ending in $w_1$ is duplicated to the end of the string, $\cdots w_1 \cdots \Rightarrow \cdots w_1 \cdots w_1$.]

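  23. End duplication has full capacity (Cont.)
      Proof. Step III: Force $w_1 w_2$ at the end.
      [Figure, two duplications: first a $k$-substring beginning with $w_2$ is duplicated to the end, $w_2 \cdots w_1 \Rightarrow w_2 \cdots w_1 w_2 \cdots$; and then the $k$-substring ending in $w_1 w_2$ is duplicated to the end, $\cdots w_1 w_2 \cdots \Rightarrow \cdots w_1 w_2 \cdots w_1 w_2$.]
      Repeat Step III inductively to get $w_1 w_2 \dots w_k$ as a suffix.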

  24. End duplication has full capacity (Cont.)
      Proof. Step IV: Repeat previous steps to get every $k$-word from $\Sigma^k$ as a substring.
      Thus, after at most $2k|\Sigma|^k$ duplications we get a string $s''$ containing all possible $k$-substrings, $|s''| \leq 2k^2 |\Sigma|^k$. For any $n = |s''| + tk$ we can now create $|\Sigma|^{tk}$ distinct strings. Hence,
      $$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \geq \limsup_{t \to \infty} \frac{\log_2 \left( |\Sigma|^{tk} \right)}{|s''| + tk} \geq \log_2 |\Sigma|.$$
      Corollary. $S^{\mathrm{end}}_k$ systems are fully expressive.

  25. Tandem duplication behaves differently
      But first… Main tool – $\phi_k$-transform domain
      Definition. We assume WLOG that $\Sigma = \mathbb{Z}_q$. We define the transform $\phi_k : \mathbb{Z}_q^{\geq k} \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by
      $$\phi_k(x) = \left( \mathrm{Pref}_k(x),\ \mathrm{Suff}_{|x|-k}(x) - \mathrm{Pref}_{|x|-k}(x) \right),$$
      where $\mathrm{Pref}_i(x)$ and $\mathrm{Suff}_i(x)$ are, respectively, the $i$-prefix and $i$-suffix of $x$, as well as $\zeta_{i,k} : \mathbb{Z}_q^k \times \mathbb{Z}_q^* \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by
      $$\zeta_{i,k}(x, y) = \begin{cases} (x, u 0^k w) & \text{if } y = uw,\ |u| = i, \\ (x, y) & \text{otherwise.} \end{cases}$$
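
A minimal sketch of $\phi_k$ and $\zeta_{i,k}$, with strings over $\mathbb{Z}_q$ represented as tuples of integers. The subtraction is taken componentwise modulo $q$; the function names `phi` and `zeta` are hypothetical.

```python
def phi(x, k, q):
    """phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)) over Z_q."""
    if len(x) < k:
        raise ValueError("phi_k is only defined for strings of length >= k")
    m = len(x) - k
    # Suff_m(x) is x[k:], Pref_m(x) is x[:m]; subtract position by position.
    diff = tuple((x[k + j] - x[j]) % q for j in range(m))
    return tuple(x[:k]), diff

def zeta(pair, i, k):
    """zeta_{i,k}(x, y): insert k zeros into y after its first i symbols."""
    x, y = pair
    if i < 0 or len(y) < i:
        return pair  # the "otherwise" branch
    return x, y[:i] + (0,) * k + y[i:]

# A string that is already a tandem repeat of length k = 3 over Z_3:
print(phi((0, 1, 2, 0, 1, 2), 3, 3))
print(zeta(((0, 1), (1, 2)), 1, 2))
```

In the first call the second component comes out as all zeros, which hints at why the transform is convenient: a tandem repeat of length $k$ shows up as a run of $k$ zeros.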

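  26. Main tool – $\phi_k$-transform domain
      Lemma. The following diagram commutes:
      $$\begin{array}{ccc} \mathbb{Z}_q^{\geq k} & \xrightarrow{\ T^{\mathrm{tan}}_{i,k}\ } & \mathbb{Z}_q^{\geq k} \\ \downarrow \phi_k & & \downarrow \phi_k \\ \mathbb{Z}_q^k \times \mathbb{Z}_q^* & \xrightarrow{\ \zeta_{i,k}\ } & \mathbb{Z}_q^k \times \mathbb{Z}_q^* \end{array}$$
      i.e., for every string $x \in \mathbb{Z}_q^{\geq k}$, $\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x))$.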

  27. Main tool – $\phi_k$-transform domain
      Example. Assume $\Sigma = \mathbb{Z}_4$. Starting with $02123$ and letting $i = 1$ and $k = 2$ leads to
      $$\begin{array}{ccc} 02123 & \xrightarrow{\ T^{\mathrm{tan}}_{1,2}\ } & 021\underline{21}23 \\ \downarrow \phi_2 & & \downarrow \phi_2 \\ (02, 102) & \xrightarrow{\ \zeta_{1,2}\ } & (02, 1\underline{00}02) \end{array}$$
      where the inserted elements are underlined.
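
The example and the commuting-diagram lemma can be checked mechanically for small parameters. The sketch below reproduces the $\mathbb{Z}_4$ example and then tests $\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x))$ exhaustively over short strings. This is a sanity check on a finite range, not a proof, and all function names are hypothetical.

```python
from itertools import product

def tandem_dup(x, i, k):
    return x if len(x) < i + k else x[:i + k] + x[i:i + k] + x[i + k:]

def phi(x, k, q):
    diff = tuple((x[k + j] - x[j]) % q for j in range(len(x) - k))
    return tuple(x[:k]), diff

def zeta(pair, i, k):
    x, y = pair
    return pair if len(y) < i else (x, y[:i] + (0,) * k + y[i:])

def lemma_holds(q, k, max_len):
    """Check phi(T_tan(x)) == zeta(phi(x)) for all x in Z_q^n, k <= n <= max_len."""
    for n in range(k, max_len + 1):
        for x in product(range(q), repeat=n):
            for i in range(n + 1):  # also covers offsets hitting "otherwise"
                if phi(tandem_dup(x, i, k), k, q) != zeta(phi(x, k, q), i, k):
                    return False
    return True

x = (0, 2, 1, 2, 3)
print(tandem_dup(x, 1, 2))          # top row of the example
print(phi(x, 2, 4))                 # left column
print(zeta(phi(x, 2, 4), 1, 2))     # bottom row
print(lemma_holds(q=3, k=2, max_len=6))
```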
