


  1. Reconstruction of Strings from their Substrings Spectrum
     Sagi Marcovich, Eitan Yaakobi
     Technion – Israel Institute of Technology
     Full version: https://arxiv.org/abs/1912.11108
     Background picture from https://www.hpcwire.com/2019/09/20/dna-data-storage-innovation-reduces-write-times-boosts-density

  2. DNA-Based Storage
     Introduction | The Reconstruction Problem | Lossy Multispectrum | Erroneous Multispectrum

  3. DNA Storage System

  4. DNA Shotgun Sequencing
     • Accurate reading of DNA strands is limited to short lengths.
     • The information of the string is provided by a list of its substrings of fixed length $M$.
     • This list is the $M$-multispectrum.
     • The substrings are assembled to reconstruct the strand.

  5. Multispectrum Reconstruction: Previous Work
     • Several recent papers have studied this reconstruction problem.
     • They consider different reading setups and various error models.
     • Those include:
       1. R. Arratia, D. Martin, G. Reinert, and M. Waterman, “Poisson process approximation for sequence repeats, and sequencing by hybridization,” Journal of Computational Biology: a Journal of Computational Molecular Cell Biology, vol. 3, pp. 425–463, 1996.
       2. A. S. Motahari, G. Bresler, and D. Tse, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
       3. A. S. Motahari, K. Ramchandran, D. Tse, and N. Ma, “Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads,” in Proc. of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 2013, pp. 1640–1644.
       4. S. Ganguly, E. Mossel, and M. Racz, “Sequence assembly from corrupted shotgun reads,” in Proc. of the IEEE International Symposium on Information Theory, Barcelona, Spain, 2016, pp. 265–269.
       5. G. Bresler, M. Bresler, and D. Tse, “Optimal assembly for high throughput shotgun sequencing,” BMC Bioinformatics, vol. 14, 2013.
       6. I. Shomorony, T. Courtade, and D. Tse, “Do read errors matter for genome assembly?” in Proc. of the IEEE International Symposium on Information Theory, Hong Kong, 2015, pp. 919–923.
       7. I. Shomorony, G. Kamath, F. Xia, T. Courtade, and D. Tse, “Partial DNA assembly: A rate-distortion perspective,” in Proc. of the IEEE International Symposium on Information Theory, Barcelona, Spain, 2016, pp. 1799–1803.
       8. R. Gabrys and O. Milenkovic, “Unique reconstruction of coded sequences from multiset substring spectra,” in Proc. of the IEEE International Symposium on Information Theory, Vail, Colorado, USA, 2018, pp. 2540–2544.

  6. Basic Definitions
     • Frequently used: $x = (x_1, \dots, x_o) \in \Sigma^o$.
     • Substring: $x_{j,l} = (x_j, \dots, x_{j+l-1})$.
     • $l$-prefix: $x_{1,l}$; $l$-suffix: $x_{o-l+1,l}$.
     • The $M$-multispectrum of $x$ is the multiset $T_M(x) = \{x_{1,M}, x_{2,M}, \dots, x_{o-M+1,M}\}$.
     • $x$ is called $M$-reconstructible if it can be uniquely reconstructed from $T_M(x)$.
     • Example: $y = 0100000111011111$ has the 8-multispectrum $T_8(y) = \{01000001, 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111\}$, from which $y$ is reconstructed.
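
The definitions on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `multispectrum` is hypothetical, and the multiset is modeled with `collections.Counter`.

```python
from collections import Counter

def multispectrum(x: str, M: int) -> Counter:
    """Return the M-multispectrum of x: the multiset of all length-M substrings."""
    return Counter(x[j:j + M] for j in range(len(x) - M + 1))

y = "0100000111011111"
T8 = multispectrum(y, 8)

# A string of length 16 has 16 - 8 + 1 = 9 substrings of length 8.
print(sum(T8.values()))
print(sorted(T8.elements()))
```

A `Counter` is used rather than a `set` because a multispectrum is a multiset: a substring that occurs twice in the string is counted twice.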

  7. Problem Definition
     • In many cases the $M$-multispectrum cannot be read error-free.
     • Lossy multispectrum: some substrings are missing. In the example, 10000011, 11011111 and 00111011 are lost, leaving 11101111, 00001110, 01000001, 00000111, 01110111, 00011101.
     • Erroneous multispectrum: some substrings are read with errors. In the example, 11101111 is read as 11001111, 01000001 as 01010001, and 00011101 as 00011000.

  8. Lossy Multispectrum $C_{M,u}(x)$
     • A multiset $V$ is a $u$-losses $M$-multispectrum of $x$ if $V \subseteq T_M(x)$ and $|T_M(x) \setminus V| \le u$.
     • $C_{M,u}(x)$ consists of all the $u$-losses $M$-multispectra of $x$.
     • Maximal reconstructible substring $X_1(V)$ ($u$ losses):
     • Because of the losses, entries from the start or the end of $x$ can be absent.
     • $X_1(V)$ is the largest consecutive substring of $x$ contained in $V$, and $|X_1(V)| \ge o - u$.
     • $x$ is $(M, u)$-reconstructible if its maximal reconstructible substring $X_1(V)$ can be uniquely reconstructed from any $V \in C_{M,u}(x)$.
     • Example: for $y = 0100000111011111$, a 3-losses 8-multispectrum $V$ is reconstructed to $X_1(V) = 10000011101111$.
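
The membership condition for a $u$-losses $M$-multispectrum can be checked directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the choice of which substrings are lost (the first and the last one) is made up for the example, since the slide does not say which entries are missing.

```python
from collections import Counter

def multispectrum(x: str, M: int) -> Counter:
    """Multiset of all length-M substrings of x."""
    return Counter(x[j:j + M] for j in range(len(x) - M + 1))

def is_lossy_multispectrum(V, x: str, M: int, u: int) -> bool:
    """Check that V is contained in T_M(x) and that at most u substrings are lost."""
    T = multispectrum(x, M)
    V = Counter(V)
    contained = all(T[s] >= count for s, count in V.items())
    lost = sum((T - V).values())  # Counter subtraction keeps positive counts only
    return contained and lost <= u

y = "0100000111011111"
# Assumed losses for illustration: the first and the last 8-substring of y.
V = multispectrum(y, 8) - Counter(["01000001", "11011111"])

print(is_lossy_multispectrum(V, y, 8, 3))
print(is_lossy_multispectrum(V, y, 8, 1))
```

With these two losses the surviving substrings all lie inside positions 2 to 15 of $y$, which is the substring 10000011101111 shown on the slide.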

  9. $M$-Reconstructible and $M$-Substring Unique
     • A string $x \in \Sigma^o$ is called $M$-substring unique if for every $1 \le j < k \le o - M + 1$, $x_{j,M} \ne x_{k,M}$.
     • Theorem [1]. For $M \ge b \log o$ where $b > 1$, the asymptotic rate of the set of $M$-substring unique strings approaches 1.
     • Theorem [2]. If $y$ is $(M-1)$-substring unique then it is $M$-reconstructible.
     • The first $M$-substring satisfies that its $(M-1)$-prefix appears once in $T_M(y)$.
     • Similarly for the $(M-1)$-suffix of the last $M$-substring.
     • Every other $(M-1)$-prefix or suffix appears twice.
     [1] O. Elishco, R. Gabrys, M. Medard, and E. Yaakobi, “Repeat free codes,” in Proc. of the IEEE International Symposium on Information Theory, Paris, France, 2019, pp. 932–936.
     [2] E. Ukkonen, “Approximate string-matching with q-grams and maximal matches,” Theoretical Computer Science, vol. 92, no. 1, pp. 191–211, 1992.
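
The uniqueness property and the prefix/suffix counts behind the second theorem can be checked on the running example. This is a minimal sketch; `is_substring_unique` and `end_counts` are hypothetical helper names, not taken from the paper.

```python
from collections import Counter

def is_substring_unique(x: str, M: int) -> bool:
    """True if no length-M substring occurs at two different positions of x."""
    subs = [x[j:j + M] for j in range(len(x) - M + 1)]
    return len(set(subs)) == len(subs)

def end_counts(x: str, M: int) -> Counter:
    """Count how often each (M-1)-string occurs as a prefix or a suffix
    of an M-substring of x."""
    counts = Counter()
    for j in range(len(x) - M + 1):
        s = x[j:j + M]
        counts[s[:-1]] += 1  # (M-1)-prefix of this M-substring
        counts[s[1:]] += 1   # (M-1)-suffix of this M-substring
    return counts

y = "0100000111011111"
print(is_substring_unique(y, 7))
counts = end_counts(y, 8)
# The strings counted once mark the start and the end of y.
print([s for s, c in counts.items() if c == 1])
```

For a 7-substring unique string, each interior 7-substring is the suffix of one 8-substring and the prefix of the next, which is why only the two end pieces are counted once.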

  10. Basic Stitching
      • $y = 0100000111011111$ is 7-substring unique.
      • $T_8(y) = \{01000001, 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111\}$
      • Step 1 – Find a 7-prefix that appears once.
      • Reconstruction so far: (empty)
      • Remaining substrings: 01000001, 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111

  11. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 01000001$
      • Remaining substrings: 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111

  12. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 010000011$
      • Remaining substrings: 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111

  13. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 0100000111$
      • Remaining substrings: 00001110, 00011101, 00111011, 01110111, 11101111, 11011111

  14. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 01000001110$
      • Remaining substrings: 00011101, 00111011, 01110111, 11101111, 11011111

  15. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 010000011101$
      • Remaining substrings: 00111011, 01110111, 11101111, 11011111

  16. Basic Stitching
      • Step 2 – Find a 7-prefix that matches the current 7-suffix.
      • Reconstruction so far: $y = 0100000111011$
      • Remaining substrings: 01110111, 11101111, 11011111
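
The two stitching steps walked through on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stitch` is hypothetical, and it assumes an error-free $M$-multispectrum with $M \ge 2$ of a string that is $(M-1)$-substring unique. Step 1 is implemented as "find the substring whose $(M-1)$-prefix is not the $(M-1)$-suffix of any substring", which is the same as the prefix appearing once.

```python
def stitch(spectrum, M: int) -> str:
    """Rebuild a string from its error-free M-multispectrum.

    Assumes M >= 2 and that the string is (M-1)-substring unique,
    so every (M-1)-prefix identifies exactly one substring.
    """
    subs = list(spectrum)
    by_prefix = {s[:M - 1]: s for s in subs}
    if len(by_prefix) != len(subs):
        raise ValueError("(M-1)-prefixes are not distinct")
    suffixes = {s[1:] for s in subs}

    # Step 1: the first substring is the only one whose (M-1)-prefix
    # is not also the (M-1)-suffix of some substring.
    starts = [s for s in subs if s[:M - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting substring")
    y = starts[0]

    # Step 2: repeatedly find the substring whose (M-1)-prefix matches
    # the current (M-1)-suffix and append its last symbol.
    for _ in range(len(subs) - 1):
        y += by_prefix[y[-(M - 1):]][-1]
    return y

# The substrings of T_8(y), given in arbitrary order.
T8 = ["11101111", "00001110", "10000011", "11011111", "01000001",
      "00111011", "00000111", "01110111", "00011101"]
print(stitch(T8, 8))
```

The dictionary lookup stands in for the search over the remaining substrings on the slides; a missing or corrupted substring would surface here as a `KeyError` or `ValueError`, which this sketch does not try to recover from.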
