
  1. RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication
  Fan Ni* (fan@netapp.com), ATG, NetApp
  Song Jiang (song.jiang@uta.edu), Cluster and Internet Computing Laboratory, UT Arlington
  * The work was done when he was a Ph.D. student at UT Arlington and Wayne State University.

  2. Data is Growing Rapidly
  [Figure: data-growth projection from storagenewsletter.com]
  § Much of this data needs to be stored for preservation and processing.
  § Efficient data storage and management has become a big challenge.

  3. The Opportunity: Data Duplication is Common
  § Sources of duplicate data:
  – The same files are stored in the cloud by multiple users.
  – Continuous updating of files generates multiple versions.
  – Use of checkpointing and repeated data archiving.
  § Significant data duplication has been observed for both backup and primary storage workloads.

  4. The Deduplication Technique can Help
  [Figure: two logical files, File1 and File2, with SHA1(File1) == SHA1(File2); when the duplication is detected (using fingerprinting), only one physical copy is stored.]
  § Benefits:
  – Storage space
  – I/O bandwidth
  – Network traffic
  § An important feature in commercial storage systems:
  – NetApp ONTAP system
  – Dell-EMC Data Domain system
  § Two critical issues:
  – How to deduplicate more data?
  – How to deduplicate faster?
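
A fingerprint index is the heart of this idea. Below is a minimal sketch, not any vendor's implementation; the class name and structure are invented for illustration, and only the SHA1-keyed lookup reflects the slide:

```python
import hashlib

class DedupStore:
    """Toy fingerprint-indexed store: one physical copy per unique content."""
    def __init__(self):
        self.index = {}    # SHA1 fingerprint -> physical chunk id
        self.chunks = []   # physical storage of unique chunks

    def put(self, data: bytes) -> int:
        fp = hashlib.sha1(data).digest()
        if fp in self.index:            # duplicate detected: store nothing new
            return self.index[fp]
        self.chunks.append(data)        # unique content: store the one copy
        self.index[fp] = len(self.chunks) - 1
        return self.index[fp]

store = DedupStore()
assert store.put(b"File1 bytes") == store.put(b"File1 bytes")  # one copy kept
```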

  5. Deduplicate at Smaller Chunks
  [Figure: a data stream is partitioned into chunks, each chunk is fingerprinted, and duplicate chunks are removed, for a higher deduplication ratio.]
  § Two potentially major sources of cost in deduplication:
  – Chunking
  – Fingerprinting
  § Can chunking be very fast?

  6. Fixed-Size Chunking (FSC)
  § FSC: partition files (or data streams) into equal, fixed-size chunks.
  – Very fast!
  § But the deduplication ratio can be significantly compromised.
  – The boundary-shift problem.
  [Figure: File A and File B both contain "HOWAREYOU?OK?REALLY?YES?NO" and chunk identically.]

  7. Fixed-Size Chunking (FSC), continued
  [Figure: File A is "HOWAREYOU?OK?REALLY?YES?NO"; File B is the same content with "H" prepended.]
  § A one-byte insertion shifts every later fixed-size boundary in File B, so none of its chunks match File A's: this is the boundary-shift problem (demonstrated in the sketch below).
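
A minimal demonstration of the boundary-shift problem, using the slide's example string and an assumed 4-byte chunk size:

```python
def fsc(data: bytes, size: int = 4):
    """Fixed-size chunking: split data into equal, fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a                        # a one-byte insertion at the front
print(set(fsc(a)) & set(fsc(b)))    # set(): every boundary shifted, no chunk
                                    # is shared, nothing deduplicates
```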

  8. Content-Defined Chunking (CDC)
  § CDC: determines chunk boundaries according to content (a predefined special marker).
  – Variable chunk size.
  – Addresses the boundary-shift problem.
  § Assume the special marker is '?':
  [Figure: File A is "HOWAREYOU?OK?REALLY?YES?NO"; File B is the same content with "H" prepended. Boundaries realign at the first '?', so the later chunks still match.]
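
Continuing the toy example, here is a sketch of marker-based CDC (real systems use a hashed rolling window, as the next slides explain, rather than a literal byte):

```python
def cdc_marker(data: bytes, marker: int = ord("?")):
    """Toy CDC: end a chunk right after each occurrence of the marker byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])     # tail chunk without a marker
    return chunks

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a
print(set(cdc_marker(a)) & set(cdc_marker(b)))
# {b'OK?', b'REALLY?', b'YES?', b'NO'} -- boundaries realign after the insertion
```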

  9. The Advantage of CDC
  [Figure: bar chart of deduplication ratio (0 to 40) for CDC vs. FSC across the real-world datasets.]
  § Real-world datasets include two weeks of Google news, Linux kernels, and various Docker images.
  § CDC's deduplication ratio is much higher than FSC's.
  § However, CDC can be very expensive.

  10. CDC can be Too Expensive!
  Assume the special marker is '?':
  [Figure: File A "HOWAREYOU?OK?REALLY?YES?NO" and File B with "H" prepended, chunked at each '?'.]
  § The marker for identifying chunk boundaries must be evenly spaced out, with a controllable distance in between.
  § In practice the marker is determined by applying a hash function to a window of bytes.
  – E.g., hash("YOU?") == pre-defined-value
  § The window rolls forward byte by byte, and the hash is computed continuously. (See the sketch below.)
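
Here is a minimal Gear-style rolling-hash chunker. The paper's prototype uses Rabin/Gear; the random table, mask, and size limits below are illustrative choices, not the paper's parameters:

```python
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
MASK = (1 << 13) - 1          # 13 zero bits -> expected chunk size ~ 8 KB

def gear_cdc(data: bytes, min_size: int = 2048, max_size: int = 65536):
    """Gear-style CDC: roll the hash one byte at a time; declare a boundary
    when the masked hash bits are all zero (hash(window) == value)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & (2**64 - 1)   # one shift + add per byte
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

This per-byte rolling is exactly the cost RapidCDC aims to skip: on a mostly duplicate stream, nearly every byte is hashed only to rediscover boundaries that were already found once before.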

  11. CDC Chunking Becomes a Bottleneck
  [Figures: breakdown of CPU time (chunking vs. fingerprinting) and of I/O time (busy vs. idle) for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04).]
  § Chunking takes > 60% of the CPU time.
  § I/O bandwidth is not fully utilized.
  § The bottleneck shifts from the disk to the CPU.

  14. Efforts on Acceleration of CDC Chunking
  § Make hashing faster.
  – Example functions: SimpleByte, Gear, and AE.
  – More likely to generate small chunks, increasing the size of metadata cached in memory for performance.
  § Use GPUs/multi-cores to parallelize the chunking process.
  – Extra hardware cost.
  – Substantial effort to deploy.
  – The speedup is bounded by the hardware parallelism.
  § Significant software/hardware efforts, but limited performance return.

  15. We Propose RapidCDC, which …
  § is still sequential and doesn't require additional cores/threads.
  § makes the hashing speed almost irrelevant.
  § accelerates CDC chunking, often by 10-30 times.
  § has the same deduplication ratio as regular CDC methods.
  § can be adopted in an existing CDC deduplication system by adding 100~200 LOC in a few functions.

  16.-23. The Path to the Breakthrough
  [Figure sequence: a file's unique chunks are already stored on disk. A new file arrives, and its first chunk's fingerprint matches a stored chunk; the match is confirmed (15KB). The chunks that follow (7KB, 10KB, 9KB, 20KB, 12KB, ...) are chunked and fingerprinted in turn, and each one's fingerprint also matches the corresponding stored chunk (16KB, 7KB, 20KB, ...).]
  § The observation: once one chunk's fingerprint matches, a fingerprint match on the next chunk almost always happens!

  24. Duplicate Locality
  § Duplicate locality: if two chunks are duplicates, their next chunks (in their respective files or data streams) are likely duplicates of each other.
  § Duplicate chunks tend to stay together (a measurement sketch follows).
  [Figure: for the Debian dataset, percentage of chunks (0-100%) vs. number of files (10 to 90); the curve for duplicate chunks that immediately follow another duplicate chunk closely tracks the curve for all duplicate chunks.]
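
One plausible way to compute the two curves in the figure; this is my reading of the metric, not the paper's measurement code:

```python
def locality_stats(files):
    """files: fingerprint sequences of a set of files, processed in order.
    Returns (% duplicate chunks, % duplicate chunks whose immediately
    preceding chunk in the same file was also a duplicate)."""
    seen = set()
    total = dup = dup_after_dup = 0
    for fps in files:
        prev_was_dup = False
        for f in fps:
            total += 1
            if f in seen:               # chunk seen before: a duplicate
                dup += 1
                if prev_was_dup:
                    dup_after_dup += 1
                prev_was_dup = True
            else:
                seen.add(f)
                prev_was_dup = False
    return 100.0 * dup / total, 100.0 * dup_after_dup / total
```

The closer the two percentages, the stronger the duplicate locality that RapidCDC exploits.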

  26. RapidCDC: Using the Next Chunk in History as a Hint
  § History recording: whenever a chunk is detected, its size is attached to its previous chunk's fingerprint.
  § Hint-assisted chunking: whenever a duplicate is detected, the recorded size of the chunk that followed it is used as a hint for the next chunk boundary.
  [Figure: File A's chunks A1-A4 are recorded as <FP1, s2>, <FP2, s3>, <FP3, s4>, <FP4, ...>. When FP(B1) == FP(A1), File B's next boundaries P1, P2, P3, ... are placed at offsets +s2, +s3, +s4 without rolling the window.]
  § Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found. (A sketch follows this slide.)
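
To make the mechanism concrete, here is a simplified sketch of hint-assisted chunking. It keeps one size hint per fingerprint (the paper maintains a size list) and validates hints with a full fingerprint test; slow_cut stands in for any regular CDC routine and is an assumed interface, not the paper's code:

```python
import hashlib

def fp(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def rapidcdc_sketch(data: bytes, next_size: dict, seen: set, slow_cut):
    """next_size: fingerprint -> size of the chunk that followed it before.
    seen: fingerprints of chunks already stored.
    slow_cut(data, pos): regular CDC; returns the end offset of the chunk
    starting at pos."""
    chunks, pos, prev = [], 0, None
    while pos < len(data):
        hint = next_size.get(prev)               # hint from duplicate history
        end = pos + hint if hint else None
        # Accept the hinted boundary only if the candidate chunk is a known
        # duplicate (an FPT-style validation; the paper offers cheaper tests).
        if end is None or end > len(data) or fp(data[pos:end]) not in seen:
            end = slow_cut(data, pos)            # fall back to regular CDC
        chunk = data[pos:end]
        f = fp(chunk)
        if prev is not None:
            next_size[prev] = len(chunk)         # history recording
        seen.add(f)
        chunks.append(chunk)
        prev, pos = f, end
    return chunks
```

When a hint is accepted, the boundary is found without rolling a hash over every byte of the chunk, which is where the order-of-magnitude chunking speedup comes from.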

  27. More Design Considerations
  § A chunk may have been followed by chunks of different sizes.
  – Maintain a size list per fingerprint.
  § Validation of hinted next-chunk boundaries:
  – Four alternative criteria with different efficiency and confidence:
  Ø FF (fast-forwarding only)
  Ø FF+RWT (rolling-window test)
  Ø FF+MT (marker test)
  Ø FF+RWT+FPT (fingerprint test)
  § Please refer to the paper for details.
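
To suggest how a cheap test compares with the fingerprint test, here is a sketch in terms of the GEAR, MASK, and fp definitions from the earlier sketches. These approximate the spirit of the tests; the exact definitions are in the paper:

```python
def rwt_ok(data: bytes, end: int) -> bool:
    """Rolling-window test: recompute the rolling hash only over the short
    window ending at the hinted boundary and check the boundary condition.
    With a 13-bit MASK and shift-by-1 Gear, only the last 13 bytes can
    affect the tested bits."""
    h = 0
    for byte in data[max(0, end - 13):end]:
        h = ((h << 1) + GEAR[byte]) & (2**64 - 1)
    return (h & MASK) == 0

def fpt_ok(data: bytes, pos: int, end: int, seen: set) -> bool:
    """Fingerprint test: most confident, but costs a full fingerprint
    computation over the candidate chunk."""
    return fp(data[pos:end]) in seen
```

FF alone trusts the hint outright, so it is the fastest and most aggressive; each added test trades a little speed for more confidence that the hinted boundary is a real chunk boundary.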

  28. Evaluation of RapidCDC
  § Prototype: based on a rolling-window CDC system.
  – Rabin/Gear as the rolling function for the rolling-window computation.
  – SHA1 to calculate fingerprints.
  § Three disks with different speeds are tested:
  – SATA hard disk: 138 MB/s / 150 MB/s sequential read/write.
  – SATA SSD: 520 MB/s / 550 MB/s sequential read/write.
  – NVMe SSD: 1.2 GB/s / 2.4 GB/s sequential read/write.

  29. Synthetic Datasets: Insert/Delete
  [Figures: chunking speedup and deduplication ratio vs. number of modifications (1000 to 20000) for FF, FF+MT, FF+RWT, FF+RWT+FPT, and regular CDC.]
  § Chunking speedup correlates with the deduplication ratio.
  § The deduplication ratio is barely affected (except under one very aggressive validation criterion).

  30. Real-world Datasets: Chunking Speed
  [Figures: chunking speedup (up to 33x faster) and deduplication ratio for FF, FF+MT, FF+RWT, FF+RWT+FPT, and regular CDC on Debian, Neo4j, Wordpress, and Nodejs.]
  § Chunking speedup approaches the deduplication ratio.
  § Negligible deduplication ratio reductions (if any).
