RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication
Fan Ni* (ATG, NetApp) fan@netapp.com
Song Jiang (UT Arlington) song.jiang@uta.edu
Cluster and Internet Computing Laboratory
* The work was done when he was a Ph.D. student at UT Arlington and Wayne State University.
Data is Growing Rapidly
[Chart: worldwide data growth, from storagenewsletter.com]
§ Much of this data needs to be stored for preservation and processing.
§ Efficient data storage and management has become a big challenge.
The Opportunity: Data Duplication is Common
§ Sources of duplicate data:
– The same files are stored by multiple users in the cloud.
– Continuous updating of files generates multiple versions.
– Use of checkpointing and repeated data archiving.
§ Significant data duplication has been observed for both backup and primary storage workloads.
The Deduplication Technique Can Help
When duplication is detected (using fingerprinting, e.g., SHA1(File1) == SHA1(File2)), only one physical copy is stored; both logical files reference it.
§ Benefits:
– Storage space
– I/O bandwidth
– Network traffic
§ An important feature in commercial storage systems:
– NetApp ONTAP system
– Dell-EMC Data Domain system
§ Two critical issues:
– How to deduplicate more data?
– How to deduplicate faster?
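To make the idea concrete, here is a minimal Python sketch (illustrative only, not code from ONTAP or Data Domain) of fingerprint-based deduplication: a SHA1-keyed index stores each unique piece of data once, and duplicates become logical references.

```python
import hashlib

class DedupStore:
    """Toy fingerprint index: one physical copy per unique content."""
    def __init__(self):
        self.index = {}                       # fingerprint -> physical data

    def put(self, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()   # fingerprint the content
        if fp not in self.index:              # store only if never seen before
            self.index[fp] = data
        return fp                             # logical reference to the data

store = DedupStore()
ref1 = store.put(b"the same file content")
ref2 = store.put(b"the same file content")    # duplicate detected
assert ref1 == ref2 and len(store.index) == 1  # only one physical copy kept
```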
Deduplicate at Smaller Chunks
Chunking and fingerprinting remove duplicate chunks for a higher deduplication ratio.
§ Two potentially major sources of cost in the deduplication:
– Chunking
– Fingerprinting
§ Can chunking be very fast?
Fixed-Size Chunking (FSC)
§ FSC: partition files (or data streams) into equal- and fixed-size chunks.
– Very fast!
§ But the deduplication ratio can be significantly compromised.
– The boundary-shift problem: inserting a single byte shifts every later chunk boundary.
File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HHOWAREYOU?OK?REALLY?YES?NO (one byte 'H' prepended; no fixed-size chunk of File B matches any chunk of File A)
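The boundary-shift problem is easy to reproduce. A minimal sketch (a hypothetical fsc_chunks helper with a toy 4-byte chunk size) shows that one inserted byte shifts every boundary, so no fixed-size chunk of File B matches File A:

```python
import hashlib

def fsc_chunks(data: bytes, size: int = 4):
    """Fixed-size chunking: cut every `size` bytes, content-agnostic."""
    return [data[i:i + size] for i in range(0, len(data), size)]

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a                                  # one byte inserted at the front

fps_a = {hashlib.sha1(c).hexdigest() for c in fsc_chunks(a)}
fps_b = {hashlib.sha1(c).hexdigest() for c in fsc_chunks(b)}
print(len(fps_a & fps_b))                     # 0: no duplicate chunks found
```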
Content-Defined Chunking (CDC)
§ CDC: determines chunk boundaries according to contents (a predefined special marker).
– Variable chunk size.
– Addresses the boundary-shift problem.
§ Assume the special marker is '?':
File A: HOWAREYOU? | OK? | REALLY? | YES? | NO
File B: HHOWAREYOU? | OK? | REALLY? | YES? | NO (only the first chunk changes; all following chunks still match)
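With content-defined boundaries, the same experiment deduplicates almost everything. Here is a toy sketch that cuts right after the literal marker '?' as on the slide (real CDC derives the marker from a rolling hash instead, as the following slides explain):

```python
def marker_chunks(data: bytes, marker: int = ord("?")):
    """Toy CDC: a chunk boundary falls right after every marker byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])  # boundary after the marker
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing bytes with no marker
    return chunks

a = b"HOWAREYOU?OK?REALLY?YES?NO"
b = b"H" + a
print(marker_chunks(a))  # [b'HOWAREYOU?', b'OK?', b'REALLY?', b'YES?', b'NO']
print(marker_chunks(b))  # only the first chunk differs; the rest still match
```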
The Advantage of CDC
[Bar chart: deduplication ratio (0-40) of CDC vs. FSC on the real-world datasets]
§ Real-world datasets include two weeks of Google News, Linux kernels, and various Docker images.
§ CDC's deduplication ratio is much higher than FSC's.
§ However, CDC can be very expensive.
CDC can be Too Expensive!
Assume the special marker is '?':
File A: HOWAREYOU?OK?REALLY?YES?NO
File B: HHOWAREYOU?OK?REALLY?YES?NO
§ The marker for identifying chunk boundaries must be evenly spaced out, with a controllable distance in between.
§ In practice, the marker is determined by applying a hash function to a window of bytes.
– E.g., hash("YOU?") == pre-defined-value
§ The window rolls forward byte-by-byte, and the hashing is applied continuously.
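Below is a minimal sketch of such a rolling-hash chunker in the Gear style; the random GEAR table, the mask width, and the size limits are illustrative assumptions rather than the parameters used in the paper. The inner byte-by-byte loop is exactly the cost measured on the next slide.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one word per byte value

def gear_cdc(data: bytes, min_size=2048, max_size=65536, avg_bits=13):
    """Rolling-hash CDC: a boundary is declared where the window hash
    matches a predefined pattern (expected chunk size ~ 2**avg_bits)."""
    mask = (1 << avg_bits) - 1
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end                            # forced cut if no marker is found
        h = 0
        for i in range(start, end):          # the window rolls byte-by-byte
            h = ((h << 1) + GEAR[data[i]]) & ((1 << 64) - 1)
            if i + 1 - start >= min_size and (h & mask) == 0:
                cut = i + 1                  # content-defined boundary
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks
```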
CDC Chunking Becomes a Bottleneck
[Charts: breakdown of CPU time (chunking vs. fingerprinting) and of I/O time (busy vs. idle) for Linux-tar (dr=4.06), Redis (dr=7.17), and Neo4j (dr=19.04)]
§ Chunking time > 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.
Efforts on Accelerating CDC Chunking
§ Make hashing faster
– Example functions: SimpleByte, Gear, and AE
– More likely to generate small chunks, increasing the size of metadata cached in memory for performance.
§ Use GPU/multi-core to parallelize the chunking process
– Extra hardware cost
– Substantial efforts to deploy
– The speedup is bounded by hardware parallelism.
§ Significant software/hardware efforts, but limited performance return.
We Propose RapidCDC, which …
§ is still sequential and doesn't require additional cores/threads.
§ makes the hashing speed almost irrelevant.
§ accelerates the CDC chunking, often by 10-30 times.
§ has a deduplication ratio the same as regular CDC methods.
§ can be adopted in an existing CDC deduplication system by adding 100~200 LOC in a few functions.
The Path to the Breakthrough
[Animation over the unique chunks stored on disk: an incoming chunk's fingerprint matches a stored 15KB chunk and is confirmed; the chunks that follow it (7KB, 10KB, 9KB, 20KB, 12KB, 12KB on disk; 16KB, 7KB, 20KB in the incoming stream) then produce fingerprint matches one after another]
§ Once one chunk's fingerprint matches, a fingerprint match for the next chunk almost always happens!
Duplicate Locality
§ Duplicate locality: if two chunks are duplicates, their next chunks (in their respective files or data streams) are likely duplicates of each other.
§ Duplicate chunks tend to stay together.
[Chart (Debian dataset): as the number of files grows from 10 to 90, the percentage of duplicate chunks that immediately follow another duplicate chunk closely tracks the percentage of all duplicate chunks]
RapidCDC: Using the Next Chunk in History as a Hint
§ History recording: whenever a chunk is detected, its size is attached to its previous chunk's fingerprint.
§ Hint-assisted chunking: whenever a duplicate is detected, use the recorded chunk size as a hint for the next chunk boundary.
[Diagram: File A's records <FP1, s2>, <FP2, s3>, <FP3, s4>, <FP4, ...> over chunks A1..A4; when FP(B1) == FP(A1), File B's next boundaries P2, P3, P4 are reached by jumping +s2, +s3, +s4 from offset P1]
§ Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found.
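A simplified sketch of the hint-assisted loop, reusing gear_cdc from the earlier sketch and validating hints with a fingerprint test (the FPT criterion listed on the next slide); keeping a single recorded size per fingerprint, rather than the size list of the real design, is an assumption made for brevity.

```python
import hashlib

def fp(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def rapidcdc(data: bytes, next_size: dict, store: set):
    """next_size: fingerprint -> size of the chunk that followed it (the hint).
       store: fingerprints of all chunks seen so far."""
    chunks, start, last_fp = [], 0, None
    while start < len(data):
        hint = next_size.get(last_fp)
        if hint and start + hint <= len(data):
            cand = data[start:start + hint]
            cand_fp = fp(cand)
            if cand_fp in store:             # hinted chunk is a known duplicate:
                chunks.append(cand)          # accept the boundary and skip the
                last_fp = cand_fp            # rolling window entirely
                start += hint
                continue
        # No hint, or the hinted boundary failed validation:
        # fall back to regular rolling-window CDC for one chunk.
        cand = gear_cdc(data[start:])[0]     # re-chunks the tail; clear, not fast
        cand_fp = fp(cand)
        if last_fp is not None:
            next_size[last_fp] = len(cand)   # history recording
        store.add(cand_fp)
        chunks.append(cand)
        last_fp, start = cand_fp, start + len(cand)
    return chunks
```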
More Design Considerations …
§ A chunk may have been followed by chunks of different sizes.
– Maintain a size list.
§ Validation of hinted next chunk boundaries:
– Four alternative criteria with different efficiency and confidence:
Ø FF (fast-forwarding only)
Ø FF+RWT (rolling window test)
Ø FF+MT (marker test)
Ø FF+RWT+FPT (fingerprint test)
§ Please refer to the paper for details.
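As one example of the cheaper tests, here is a hedged sketch of the rolling window test (RWT) against the gear_cdc parameters assumed above: because each byte's contribution is shifted out of the 64-bit Gear state after 64 steps, re-hashing just the last 64 bytes before a hinted cut reproduces what a full rolling pass would compute there (for chunks of at least 64 bytes).

```python
def rwt(data: bytes, cut: int, avg_bits=13) -> bool:
    """FF+RWT: re-hash only the 64 bytes before the hinted boundary and
    check that the boundary condition (h & mask == 0) holds there."""
    h = 0
    for b in data[max(0, cut - 64):cut]:
        h = ((h << 1) + GEAR[b]) & ((1 << 64) - 1)
    return (h & ((1 << avg_bits) - 1)) == 0

# FF accepts a hint with no test at all; FF+MT checks only the marker bits
# at the boundary; FF+RWT+FPT additionally fingerprints the hinted chunk,
# trading more computation for higher confidence.
```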
Evaluation of RapidCDC
§ Prototype: based on a rolling-window-based CDC system.
– Using Rabin/Gear as the rolling function for rolling window computation.
– Using SHA1 to calculate fingerprints.
§ Three disks with different speeds were tested:
– SATA hard disk: 138 MB/s and 150 MB/s for sequential read/write.
– SATA SSD: 520 MB/s and 550 MB/s for sequential read/write.
– NVMe SSD: 1.2 GB/s and 2.4 GB/s for sequential read/write.
Synthetic Datasets: Insert/Delete
[Charts: chunking speedup and deduplication ratio for FF, FF+MT, FF+RWT, and FF+RWT+FPT vs. regular CDC, as the number of modifications grows from 1000 to 20000]
§ Chunking speedup correlates to the deduplication ratio.
§ The deduplication ratio is little affected (except for one very aggressive validation criterion).
Real-world Datasets: Chunking Speed
[Charts: chunking speedup (up to 33X faster) and deduplication ratio for FF, FF+MT, FF+RWT, and FF+RWT+FPT vs. regular CDC on Debian, Neo4j, Wordpress, and Nodejs]
§ Chunking speedup approaches the deduplication ratio.
§ Negligible deduplication ratio reductions (if any).