RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication
Fan Ni* (ATG, NetApp) fan@netapp.com
Song Jiang (UT Arlington) song.jiang@uta.edu
Cluster and Internet Computing Laboratory
* The work was done when he was a Ph.D. student at UT Arlington and Wayne State University.
Data is Growing Rapidly
[Chart: worldwide data growth, from storagenewsletter.com]
§ Much of this data needs to be stored for preservation and processing.
§ Efficient data storage and management has become a big challenge.
The Opportunity: Data Duplication is Common
§ Sources of duplicate data:
– The same files are stored by multiple users in the cloud.
– Continuous updating of files generates multiple versions.
– Use of checkpointing and repeated data archiving.
§ Significant data duplication has been observed for both backup and primary storage workloads.
The Deduplication Technique Can Help
When duplication is detected (using fingerprinting, e.g., SHA1(File1) == SHA1(File2)), only one physical copy is stored; both logical files reference it.
§ Benefits:
– Storage space
– I/O bandwidth
– Network traffic
§ An important feature in commercial storage systems:
– NetApp ONTAP system
– Dell-EMC Data Domain system
§ Two critical issues:
– How to deduplicate more data?
– How to deduplicate faster?
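To make the idea concrete, here is a minimal Python sketch (illustrative only, not code from ONTAP or Data Domain) of fingerprint-based deduplication: a SHA1-keyed index stores each unique piece of data once, and duplicates become logical references.

```python
import hashlib

class DedupStore:
    """Toy fingerprint index: one physical copy per unique content."""
    def __init__(self):
        self.index = {}                       # fingerprint -> physical data

    def put(self, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()   # fingerprint the content
        if fp not in self.index:              # store only if never seen before
            self.index[fp] = data
        return fp                             # logical reference to the data

store = DedupStore()
ref1 = store.put(b"the same file content")
ref2 = store.put(b"the same file content")    # duplicate detected
assert ref1 == ref2 and len(store.index) == 1  # only one physical copy kept
```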
Deduplicate at Smaller Chunks
Chunking and fingerprinting remove duplicate chunks for a higher deduplication ratio.
§ Two potentially major sources of cost in the deduplication:
– Chunking
– Fingerprinting
§ Can chunking be very fast?
Fixed-Size Chunking (FSC)
§ FSC: partition files (or data streams) into equal- and fixed-size chunks.
– Very fast!
§ But the deduplication ratio can be significantly compromised.
– The boundary-shift problem: inserting a single byte shifts every later chunk boundary.
File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HHOWAREYOU?OK?REALLY?YES?NO (one byte 'H' prepended; no fixed-size chunk of File B matches any chunk of File A)
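The boundary-shift problem is easy to reproduce. A minimal sketch (a hypothetical fsc_chunks helper with a toy 4-byte chunk size) shows that one inserted byte shifts every boundary, so no fixed-size chunk of File B matches File A:

```python
import hashlib

def fsc_chunks(data: bytes, size: int = 4):
    """Fixed-size chunking: cut every `size` bytes, content-agnostic."""
    return [data[i:i + size] for i in range(0, len(data), size)]

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a                                  # one byte inserted at the front

fps_a = {hashlib.sha1(c).hexdigest() for c in fsc_chunks(a)}
fps_b = {hashlib.sha1(c).hexdigest() for c in fsc_chunks(b)}
print(len(fps_a & fps_b))                     # 0: no duplicate chunks found
```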
Content-Defined Chunking (CDC)
§ CDC: determines chunk boundaries according to contents (a predefined special marker).
– Variable chunk size.
– Addresses the boundary-shift problem.
§ Assume the special marker is '?':
File A: HOWAREYOU? | OK? | REALLY? | YES? | NO
File B: HHOWAREYOU? | OK? | REALLY? | YES? | NO (only the first chunk changes; all following chunks still match)
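With content-defined boundaries, the same experiment deduplicates almost everything. Here is a toy sketch that cuts right after the literal marker '?' as on the slide (real CDC derives the marker from a rolling hash instead, as the following slides explain):

```python
def marker_chunks(data: bytes, marker: int = ord("?")):
    """Toy CDC: a chunk boundary falls right after every marker byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])  # boundary after the marker
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing bytes with no marker
    return chunks

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a
print(marker_chunks(a))  # [b'HOWAREYOU?', b'OK?', b'REALLY?', b'YES?', b'NO']
print(marker_chunks(b))  # only the first chunk differs; the rest still match
```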
The Advantage of CDC
[Bar chart: deduplication ratio (0-40) of CDC vs. FSC on the real-world datasets]
§ Real-world datasets include two weeks of Google News, Linux kernels, and various Docker images.
§ CDC's deduplication ratio is much higher than FSC's.
§ However, CDC can be very expensive.
CDC can be Too Expensive!
Assume the special marker is '?':
File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HHOWAREYOU?OK?REALLY?YES?NO
§ The marker for identifying chunk boundaries must be evenly spaced out, with a controllable distance in between.
§ In practice, the marker is determined by applying a hash function to a window of bytes.
– E.g., hash("YOU?") == pre-defined-value
§ The window rolls forward byte-by-byte, and the hashing is applied continuously.
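Below is a minimal sketch of such a rolling-hash chunker in the Gear style; the random GEAR table, the mask width, and the size limits are illustrative assumptions rather than the parameters used in the paper. The inner byte-by-byte loop is exactly the cost measured on the next slide.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one word per byte value

def gear_cdc(data: bytes, min_size=2048, max_size=65536, avg_bits=13):
    """Rolling-hash CDC: a boundary is declared where the window hash
    matches a predefined pattern (expected chunk size ~ 2**avg_bits)."""
    mask = (1 << avg_bits) - 1
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end                            # forced cut if no marker is found
        h = 0
        for i in range(start, end):          # the window rolls byte-by-byte
            h = ((h << 1) + GEAR[data[i]]) & ((1 << 64) - 1)
            if i + 1 - start >= min_size and (h & mask) == 0:
                cut = i + 1                  # content-defined boundary
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks
```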
CDC Chunking Becomes a Bottleneck
[Charts: breakdown of CPU time (chunking vs. fingerprinting) and of I/O time (busy vs. idle) for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04)]
§ Chunking time > 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.
Efforts on Accelerating CDC Chunking
§ Make hashing faster
– Example functions: SimpleByte, Gear, and AE
– More likely to generate small chunks, increasing the size of metadata cached in memory for performance.
§ Use GPU/multi-core to parallelize the chunking process
– Extra hardware cost
– Substantial efforts to deploy
– The speedup is bounded by hardware parallelism.
§ Significant software/hardware efforts, but limited performance return.
We Propose RapidCDC, which …
§ is still sequential and doesn't require additional cores/threads.
§ makes the hashing speed almost irrelevant.
§ accelerates the CDC chunking, often by 10-30 times.
§ has a deduplication ratio the same as regular CDC methods.
§ can be adopted in an existing CDC deduplication system by adding 100~200 LOC in a few functions.
The Path to the Breakthrough
[Animation over the unique chunks stored on disk: an incoming chunk's fingerprint matches a stored 15KB chunk and is confirmed; the chunks that follow it (7KB, 10KB, 9KB, 20KB, 12KB, 12KB on disk; 16KB, 7KB, 20KB in the incoming stream) then produce fingerprint matches one after another]
§ Once one chunk's fingerprint matches, a fingerprint match for the next chunk almost always happens!
Duplicate Locality
§ Duplicate locality: if two chunks are duplicates, their next chunks (in their respective files or data streams) are likely duplicates of each other.
§ Duplicate chunks tend to stay together.
[Chart (Debian dataset): as the number of files grows from 10 to 90, the percentage of duplicate chunks that immediately follow another duplicate chunk closely tracks the percentage of all duplicate chunks]
RapidCDC: Using the Next Chunk in History as a Hint
§ History recording: whenever a chunk is detected, its size is attached to its previous chunk's fingerprint.
§ Hint-assisted chunking: whenever a duplicate is detected, use the recorded chunk size as a hint for the next chunk boundary.
[Diagram: File A's records <FP1, s2>, <FP2, s3>, <FP3, s4>, <FP4, ...> over chunks A1..A4; when FP(B1) == FP(A1), File B's next boundaries P2, P3, P4 are reached by jumping +s2, +s3, +s4 from offset P1]
§ Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found.
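A simplified sketch of the hint-assisted loop, reusing gear_cdc from the earlier sketch and validating hints with a fingerprint test (the FPT criterion listed on the next slide); keeping a single recorded size per fingerprint, rather than the size list of the real design, is an assumption made for brevity.

```python
import hashlib

def fp(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def rapidcdc(data: bytes, next_size: dict, store: set):
    """next_size: fingerprint -> size of the chunk that followed it (the hint).
       store: fingerprints of all chunks seen so far."""
    chunks, start, last_fp = [], 0, None
    while start < len(data):
        hint = next_size.get(last_fp)
        if hint and start + hint <= len(data):
            cand = data[start:start + hint]
            cand_fp = fp(cand)
            if cand_fp in store:             # hinted chunk is a known duplicate:
                chunks.append(cand)          # accept the boundary and skip the
                last_fp = cand_fp            # rolling window entirely
                start += hint
                continue
        # No hint, or the hinted boundary failed validation:
        # fall back to regular rolling-window CDC for one chunk.
        cand = gear_cdc(data[start:])[0]     # re-chunks the tail; clear, not fast
        cand_fp = fp(cand)
        if last_fp is not None:
            next_size[last_fp] = len(cand)   # history recording
        store.add(cand_fp)
        chunks.append(cand)
        last_fp, start = cand_fp, start + len(cand)
    return chunks
```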
More Design Considerations …
§ A chunk may have been followed by chunks of different sizes.
– Maintain a size list.
§ Validation of hinted next chunk boundaries:
– Four alternative criteria with different efficiency and confidence:
Ø FF (fast-forwarding only)
Ø FF+RWT (rolling window test)
Ø FF+MT (marker test)
Ø FF+RWT+FPT (fingerprint test)
§ Please refer to the paper for details.
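As one example of the cheaper tests, here is a hedged sketch of the rolling window test (RWT) against the gear_cdc parameters assumed above: because each byte's contribution is shifted out of the 64-bit Gear state after 64 steps, re-hashing just the last 64 bytes before a hinted cut reproduces what a full rolling pass would compute there (for chunks of at least 64 bytes).

```python
def rwt(data: bytes, cut: int, avg_bits=13) -> bool:
    """FF+RWT: re-hash only the 64 bytes before the hinted boundary and
    check that the boundary condition (h & mask == 0) holds there."""
    h = 0
    for b in data[max(0, cut - 64):cut]:
        h = ((h << 1) + GEAR[b]) & ((1 << 64) - 1)
    return (h & ((1 << avg_bits) - 1)) == 0

# FF accepts a hint with no test at all; FF+MT checks only the marker bits
# at the boundary; FF+RWT+FPT additionally fingerprints the hinted chunk,
# trading more computation for higher confidence.
```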
Evaluation of RapidCDC
§ Prototype: based on a rolling-window-based CDC system.
– Using Rabin/Gear as the rolling function for rolling window computation.
– Using SHA1 to calculate fingerprints.
§ Three disks with different speeds were tested:
– SATA hard disk: 138 MB/s and 150 MB/s for sequential read/write.
– SATA SSD: 520 MB/s and 550 MB/s for sequential read/write.
– NVMe SSD: 1.2 GB/s and 2.4 GB/s for sequential read/write.
Synthetic Datasets: Insert/Delete
[Charts: chunking speedup and deduplication ratio for FF, FF+MT, FF+RWT, and FF+RWT+FPT vs. regular CDC, as the number of modifications grows from 1000 to 20000]
§ Chunking speedup correlates to the deduplication ratio.
§ The deduplication ratio is little affected (except for one very aggressive validation criterion).
Real-world Datasets: Chunking Speed
[Charts: chunking speedup (up to 33X faster) and deduplication ratio for FF, FF+MT, FF+RWT, and FF+RWT+FPT vs. regular CDC on Debian, Neo4j, Wordpress, and Nodejs]
§ Chunking speedup approaches the deduplication ratio.
§ Negligible deduplication ratio reductions (if any).