
Data Deduplication with Random Substitutions. Hao Lou, Farzad Farnoud



  1. Data Deduplication with Random Substitutions. Hao Lou, Farzad Farnoud. Electrical and Computer Engineering, University of Virginia. {haolou,farzad}@virginia.edu. June 8, 2020. Lou, Farnoud ISIT2020 1 / 21

  2. Data explosion. David Reinsel, John Gantz, and John Rydning. “The digitization of the world: from edge to core”. In: IDC White Paper (2018).

  4. Data deduplication. An efficient approach to data reduction: data deduplication. [Figure: deduplication system.] Chunk size: fixed-length or variable-length (content-defined).

  7. Data deduplication. Compared with traditional compression methods (LZ compression), deduplication eliminates chunk-level (8 KB; footnote 1) or file-level redundancy and uses hash-based fingerprints, with no byte-by-byte comparison. ⇒ Deduplication methods are more efficient for large-scale storage systems.
     Footnote 1: Athicha Muthitacharoen, Benjie Chen, and David Mazieres. “A low-bandwidth network file system”. In: Proceedings of the eighteenth ACM symposium on Operating systems principles. 2001, pp. 174–187.

  8. An information-theoretic point of view. Information source model: data stream. Introduction of deduplication algorithms. Performance analysis of deduplication algorithms.

  9. Existing work.
     Urs Niesen. “An information-theoretic analysis of deduplication”. In: IEEE Transactions on Information Theory 65.9 (2019), pp. 5688–5704.
     Rasmus Vestergaard, Qi Zhang, and Daniel E Lucani. “Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties”. In: arXiv preprint arXiv:1901.02720 (2019).
     Laura Conde-Canencia, Tyson Condie, and Lara Dolecek. “Data deduplication with edit errors”. In: 2018 IEEE Global Communications Conference (GLOBECOM). IEEE. 2018, pp. 1–6.

  14. Source model.
      $L_a \sim P_s$, a distribution over the integers; $X_a \sim \{0,1\}^{L_a}$.
      Source symbols, $a = 1, 2, \dots, A$: $X_1, X_2, \dots, X_A$.
      $X_{n_1}, X_{n_2}, \dots, X_{n_B}$ are drawn with replacement from $\{X_1, X_2, \dots, X_A\}$.
      Concatenation and substitution: $X_{n_1} X_{n_2} \cdots X_{n_B} \xrightarrow{\text{substitutions}} Y_1 Y_2 \cdots Y_B$.
      Data stream: $s = Y_1 Y_2 \cdots Y_B$, where $Y_1, Y_2, \dots, Y_B$ are the source blocks.

  16. Source model. Substitutions: each bit is flipped independently with probability $\delta \le 1/2$, where $\delta$ is a constant.
      Example: $A = 3$, $B = 4$.
      $X_1 = 000$, $X_2 = 0010$, $X_3 = 11$.
      $X_{n_1} = X_2$, $X_{n_2} = X_1$, $X_{n_3} = X_3$, $X_{n_4} = X_1$, so the concatenation is 0010 000 11 000.
      After substitutions: $Y_1 = 1010$, $Y_2 = 000$, $Y_3 = 10$, $Y_4 = 001$ ⇒ $s = 101000010001$.
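
The source model on the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `substitute` and `generate_stream` are hypothetical, the symbols $X_a$ are assumed to be uniform over $\{0,1\}^{L_a}$, the indices $n_1, \dots, n_B$ are assumed to be uniform over the $A$ symbols, and the length distribution $P_s$ is approximated by a uniform choice from a caller-supplied list.

```python
import random

def substitute(x, delta, rng):
    """Flip each bit of the binary string x independently with probability delta."""
    return "".join(b if rng.random() >= delta else "10"[int(b)] for b in x)

def generate_stream(A, B, lengths, delta, rng):
    """Draw A source symbols, pick B of them with replacement, apply substitutions."""
    # Source symbols X_1..X_A (assumption: uniform bits, length drawn from `lengths`).
    symbols = ["".join(rng.choice("01") for _ in range(rng.choice(lengths)))
               for _ in range(A)]
    # Indices n_1..n_B, drawn with replacement (assumption: uniformly).
    picks = [rng.randrange(A) for _ in range(B)]
    # Source blocks Y_1..Y_B are noisy copies of the picked symbols.
    blocks = [substitute(symbols[n], delta, rng) for n in picks]
    # The data stream s is the concatenation of the source blocks.
    return symbols, picks, blocks, "".join(blocks)

rng = random.Random(0)
symbols, picks, blocks, s = generate_stream(A=3, B=4, lengths=[2, 3, 4], delta=0.1, rng=rng)
```

Setting `delta=0.0` makes every block an exact copy of its source symbol, which is a convenient sanity check.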

  18. Source model. Assumptions:
      Length distribution $P_s$: mean $L$, with $P_s(L/2 \le L_a \le 2L) = 1$.
      Asymptotically, $A = o(B^{1-\epsilon})$ with $0 < \epsilon < 1$, and $L = B^{1/k}$ with $k > 1$.

  19. Entropy. The entropy $H(s)$ of the data stream satisfies $H(\delta) BL \le H(s) \le H(\delta) BL + o(BL)$ as $B \to \infty$, where $BL$ is the expected length of $s$.
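
To get a feel for the leading term $H(\delta) BL$, the following sketch evaluates it numerically. It assumes that $H(\delta)$ denotes the binary entropy function in bits; the function names are hypothetical and the $o(BL)$ term is ignored.

```python
import math

def binary_entropy(delta):
    """Binary entropy H(delta) in bits, with H(0) = H(1) = 0."""
    if delta in (0.0, 1.0):
        return 0.0
    return -delta * math.log2(delta) - (1 - delta) * math.log2(1 - delta)

def entropy_leading_term(delta, B, L):
    """Leading term H(delta) * B * L of the entropy bound, in bits."""
    return binary_entropy(delta) * B * L

# With delta = 1/2 every bit is fully random, so the term equals B * L.
full_noise = entropy_leading_term(0.5, B=4, L=3)
```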

  21. Deduplication scheme. Double fixed-length deduplication:
      [Figure: $s$ divided into segments $S_1, S_2, S_3, \dots, S_K$ of length $D$; each segment $S_k$ divided into chunks $Z^k_1, Z^k_2, \dots, Z^k_C$ of length $\ell$.]
      $s$ is parsed into segments of length $D$.
      Each segment $S_k$ is further parsed into chunks of length $\ell$.

  22. Deduplication scheme. Example: $D = 5$, $\ell = 3$.
      $s = 00001011010$ ⇒ segments 00001 | 01101 | 0 ⇒ chunks 000 | 01 | 011 | 01 | 0.
      Chunks: 000, 01, 011, 01, 0.
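
The two-level parsing in this example can be reproduced with a short Python sketch. The function name `double_fixed_length_chunks` is hypothetical; the only behavior assumed beyond the slide is that a trailing segment or chunk shorter than $D$ or $\ell$ is kept as is, which is what the example shows.

```python
def double_fixed_length_chunks(s, D, ell):
    """Split s into length-D segments, then split each segment into length-ell chunks."""
    chunks = []
    for i in range(0, len(s), D):
        segment = s[i:i + D]  # the last segment may be shorter than D
        # The last chunk of each segment may be shorter than ell.
        chunks.extend(segment[j:j + ell] for j in range(0, len(segment), ell))
    return chunks

chunks = double_fixed_length_chunks("00001011010", D=5, ell=3)
# chunks == ["000", "01", "011", "01", "0"]
```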

  26. Deduplication scheme. Double fixed-length deduplication:
      The length $|s|$ is encoded with a prefix-free code.
      Chunks are processed sequentially:
      First appearance: encoded as 1 followed by the chunk itself; the chunk is added to the dictionary.
      Repeated appearance: encoded as 0 followed by a pointer to the dictionary.

  31. Deduplication scheme. Example: $s = 00001011010$ ⇒ chunks 000, 01, 011, 01, 0.
      Compressed string: 0001011 + 1000 + 101 + 1011 + 001 + 10.
      0001011: Elias $\gamma$ code for $|s| = 11$.
      1000, 101, 1011, 10: first occurrences of the chunks 000, 01, 011, 0.
      001: second occurrence of 01.
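
The encoding in this example can be traced with the sketch below. The slides do not spell out the pointer format, so the code assumes a zero-based dictionary index written in $\lceil \log_2 (\text{dictionary size}) \rceil$ bits, a choice made here because it reproduces the pointer 01 in the example. The names `elias_gamma` and `dedup_encode` are hypothetical.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer n: (len - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def dedup_encode(s, chunks):
    """Encode |s| followed by each chunk as '1'+chunk (new) or '0'+pointer (repeat)."""
    dictionary = {}  # chunk -> index, in order of first appearance
    out = [elias_gamma(len(s))]
    for chunk in chunks:
        if chunk in dictionary:
            # Assumed pointer width: ceil(log2(current dictionary size)) bits.
            width = (len(dictionary) - 1).bit_length()
            pointer = format(dictionary[chunk], "b").zfill(width) if width else ""
            out.append("0" + pointer)
        else:
            dictionary[chunk] = len(dictionary)
            out.append("1" + chunk)
    return out

pieces = dedup_encode("00001011010", ["000", "01", "011", "01", "0"])
# pieces == ["0001011", "1000", "101", "1011", "001", "10"]
```

Returning the pieces as a list rather than one concatenated string keeps the correspondence with the example visible; `"".join(pieces)` gives the compressed string.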

  34. Performance analysis. Setting:
      Source model: $P_s(L_a = L) = 1$.
      First-level parsing length: pick $D = L$.
      [Figure: $s = Y_1 Y_2 Y_3 \cdots Y_B$, each block $Y_b$ of length $L$ divided into chunks $Z^b_1, Z^b_2, \dots, Z^b_C$ of length $\ell$.]
      $L_F(s)$: length of the compressed version of $s$.

  35. Performance analysis. Theorem: As $B \to \infty$, with optimal $\ell$, $\frac{E[L_F(s)]}{H(s)} \le \frac{2}{\epsilon}\left(1 + \frac{1}{k}\right)$, where $A = o(B^{1-\epsilon})$ and $L = B^{1/k}$.

  37. Deduplication scheme. Variable-length deduplication (footnote 2).
      Example: $s = 0100100011000$, $M = 2$: $s$ = 0100 | 100 | 01100 | 0; chunks: 0100, 100, 01100, 0.
      Footnote 2: Urs Niesen. “An information-theoretic analysis of deduplication”. In: IEEE Transactions on Information Theory 65.9 (2019), pp. 5688–5704.
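
The slide gives only the example, not the chunking rule. The sketch below assumes the rule that a chunk ends immediately after a run of $M$ consecutive zeros, which is inferred from the example and reproduces it; the function name `variable_length_chunks` is hypothetical.

```python
def variable_length_chunks(s, M):
    """Split s so that each chunk ends right after M consecutive zeros (assumed rule)."""
    chunks, start, zeros = [], 0, 0
    for i, bit in enumerate(s):
        zeros = zeros + 1 if bit == "0" else 0  # length of the current run of zeros
        if zeros == M:
            chunks.append(s[start:i + 1])  # close the chunk after the marker
            start, zeros = i + 1, 0
    if start < len(s):
        chunks.append(s[start:])  # trailing bits without a complete marker
    return chunks

chunks = variable_length_chunks("0100100011000", M=2)
# chunks == ["0100", "100", "01100", "0"]
```

Because boundaries depend on content rather than position, a substitution changes chunk boundaries only near the flipped bit.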

  38. Deduplication scheme. Variable-length deduplication:
      The length $|s|$ is encoded with a prefix-free code.
      Chunks are processed sequentially:
      First appearance: encoded as 1 followed by the chunk itself; the chunk is added to the dictionary.
      Repeated appearance: encoded as 0 followed by a pointer to the dictionary.
      $L_{VL}(s)$: length of the compressed string for $s$.
