iDedup: Latency-aware inline deduplication for primary workloads
Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti
Advanced Technology Group, NetApp
iDedup – overview/context

[Figure: clients reach primary storage over NFS/CIFS/iSCSI; primary storage is backed up to secondary storage over NDMP/other protocols. Dedupe is already exploited effectively on secondary storage, with 90+% savings.]

Primary storage:
• Performance & reliability are key features
• RPC-based protocols => latency sensitive
• Only offline dedupe techniques developed so far

iDedup:
• Inline/foreground dedupe technique for primary storage
• Minimal impact on latency-sensitive workloads
Dedupe techniques (offline vs inline)

Offline dedupe
• First copy on stable storage is not deduped
• Dedupe is a post-processing/background activity

Inline dedupe
• Dedupe before the first copy reaches stable storage
• Primary => latency should not be affected!
• Secondary => must dedupe at ingest rate (IOPS)!
Why inline dedupe for primary?

Provisioning/planning is easier
• Dedupe savings are seen right away
• Planning is simpler because capacity values are accurate

No post-processing activities
• No scheduling of background processes
• No interference => front-end workloads are not affected
• Key for storage system users with limited maintenance windows

Efficient use of resources
• Efficient use of I/O bandwidth (offline needs both reads and writes)
• The file system’s buffer cache is more efficient (holds deduped blocks)

Performance challenges have been the key obstacle
• Overheads (CPU + I/Os) on both reads and writes hurt latency
iDedup – key features

Minimizes inline dedupe performance overheads
• Leverages workload characteristics
• Eliminates almost all extra I/Os due to dedupe processing
• CPU overheads are minimal

Tunable tradeoff: dedupe savings vs performance
• Selective dedupe => some loss in dedupe capacity savings

iDedup can be combined with offline techniques
• Maintains the same on-disk data structures as a normal file system
• Offline dedupe can optionally be run as well
Related work

Primary, offline: NetApp ASIS, EMC Celerra, IBM StorageTank, zFS*
Primary, inline: iDedup
Secondary, offline: (no motivation for systems in this category)
Secondary, inline: EMC DDFS, EMC Cluster, DeepStore, NEC HydraStor, Venti, SiLo, Sparse Indexing, ChunkStash, Foundation, Symantec, EMC Centera
Outline

• Inline dedupe challenges
• Our approach
• Design/implementation details
• Evaluation results
• Summary
Inline dedupe – read path challenges

Inherently, dedupe causes disk-level fragmentation!
• Sequential reads turn random => more seeks => more latency
• RPC-based protocols (CIFS/NFS/iSCSI) are latency sensitive
• Fragmentation is a dataset/workload property

Primary workloads are typically read-intensive
• The read/write ratio is usually ~70/30
• Inline dedupe must not affect read performance!

[Figure: fragmentation turns a sequential read into random seeks]
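A toy model of this effect (an illustration, not from the paper): count a disk seek whenever two logically consecutive blocks are not physically adjacent. The block layouts below are made up to show how shared blocks scattered by dedupe inflate the seek count of a sequential read.

```python
def count_seeks(physical_blocks):
    """Seeks needed to read a logically sequential range of blocks,
    assuming every jump to a non-adjacent physical block costs one seek.
    A deliberately simple model of dedupe-induced fragmentation."""
    seeks = 1  # initial positioning
    for prev, cur in zip(physical_blocks, physical_blocks[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks

# Without dedupe the file is laid out contiguously: a single seek.
print(count_seeks([100, 101, 102, 103, 104, 105]))  # -> 1
# With block-level dedupe, some blocks point elsewhere on disk: three seeks.
print(count_seeks([100, 101, 730, 731, 104, 105]))  # -> 3
```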
Inline dedupe – write path challenges

Extra random I/Os in the write path due to the dedupe algorithm
• Dedupe metadata (fingerprint DB) lookups and updates
• Updating the reference counts of blocks on disk

CPU overheads in the critical write path
• Dedupe requires computing the fingerprint (hash) of each block
• The dedupe algorithm requires extra cycles

[Figure: write path pipeline – client write -> write logging -> de-stage phase: compute hash -> dedupe algorithm -> write allocation -> disk I/O; dedupe metadata accesses cause random seeks]
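The per-block cost breaks down roughly as below. This is a minimal Python sketch of a conventional (non-iDedup) inline write path; the SHA-256 fingerprint and the toy allocator/FPDB structures are illustrative assumptions, not the actual WAFL implementation.

```python
import hashlib

DISK = []  # toy stand-in for on-disk block storage

def allocate_block(block: bytes) -> int:
    """Toy allocator: store the block and return its disk block number."""
    DISK.append(block)
    return len(DISK) - 1

def fingerprint(block: bytes) -> bytes:
    # CPU overhead in the critical write path: one hash per block
    return hashlib.sha256(block).digest()

def write_block(block: bytes, fpdb: dict, refcounts: dict) -> int:
    fp = fingerprint(block)
    dbn = fpdb.get(fp)           # FPDB lookup: a random read if the FPDB lives on disk
    if dbn is not None:
        refcounts[dbn] += 1      # refcount update: another random metadata I/O
        return dbn               # duplicate: share the existing on-disk block
    dbn = allocate_block(block)  # unique block: normal write allocation
    fpdb[fp] = dbn               # FPDB insert: yet another metadata update
    refcounts[dbn] = 1
    return dbn
```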
Our Approach
iDedup – intuition

Is there a good tradeoff between capacity savings and latency performance?
iDedup – solution to read path issues

Insight 1: dedupe only sequences of disk blocks
• Solves fragmentation => seeks are amortized during reads
• Selective dedupe that leverages spatial locality
• Configurable minimum sequence length

[Figure: fragmentation with random seeks vs. sequences with amortized seeks]
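A minimal sketch of the sequence idea, under the assumption that duplicate hits are represented as (file block offset, disk block number) pairs; only runs that are contiguous both in the file and on disk, and at least `threshold` blocks long, are kept for deduplication.

```python
def duplicate_sequences(hits, threshold):
    """hits: (file_block_offset, disk_block_number) pairs for blocks whose
    fingerprints matched blocks already on disk. Returns only the runs that
    are contiguous in both the file and on disk and >= threshold long."""
    sequences, run = [], []
    for off, dbn in sorted(hits):
        if run and off == run[-1][0] + 1 and dbn == run[-1][1] + 1:
            run.append((off, dbn))          # extends the current sequence
        else:
            if len(run) >= threshold:
                sequences.append(run)       # keep a long-enough run
            run = [(off, dbn)]              # start a new candidate run
    if len(run) >= threshold:
        sequences.append(run)
    return sequences

# With threshold=3, only the 3-block run is deduped; the isolated hit is written normally.
print(duplicate_sequences([(10, 500), (11, 501), (12, 502), (20, 900)], threshold=3))
```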
iDedup – write path issues

How can we reduce dedupe metadata I/Os?
Flash? Read I/Os are cheap, but frequent updates are expensive.
iDedup – solution to write path issues

Insight 2: keep a smaller dedupe metadata table as an in-memory cache
• No extra I/Os
• Leverages temporal locality in duplication
• Some loss in dedupe (only a subset of blocks is considered)

[Figure: cached fingerprints capture blocks duplicated close together in time]
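A minimal sketch of such a bounded fingerprint table, assuming an LRU-style eviction policy (the eviction details here are an illustrative choice, not necessarily the exact policy used in the system). A miss simply means the block is treated as unique and written normally, so no extra I/O is ever issued.

```python
from collections import OrderedDict

class FingerprintCache:
    """In-memory fingerprint -> disk block number map with a fixed capacity."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def lookup(self, fp: bytes):
        dbn = self.entries.get(fp)
        if dbn is not None:
            self.entries.move_to_end(fp)       # refresh recency on a hit
        return dbn                             # None on a miss: block is treated as unique

    def insert(self, fp: bytes, dbn: int):
        self.entries[fp] = dbn
        self.entries.move_to_end(fp)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict the least recently used fingerprint
```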
iDedup – viability

Is this loss in dedupe savings OK?
• Both spatial and temporal localities are dataset/workload properties!
• => Viable for some important primary workloads

[Figure: dedup ratio for the original (full) dedupe vs. spatial locality only vs. spatial + temporal locality]
Design and Implementation
iDedup architecture

[Figure: I/Os (reads + writes) enter NVRAM as write log blocks; during de-stage the iDedup algorithm consults the in-memory dedupe metadata (FPDB) before blocks pass through the file system (WAFL) to disk.]
iDedup – two key tunable parameters

Minimum sequence length (threshold)
• Minimum number of sequential duplicate blocks on disk
• A dataset property => ideally set to the expected fragmentation
• Different from a larger block size – variable vs fixed
• Knob between performance (fragmentation) and dedupe

Dedupe metadata (fingerprint DB) cache size
• A property of the workload’s working set
• Increasing the cache size => decreases the buffer cache
• Knob between performance (cache hit ratio) and dedupe
iDedup algorithm

The iDedup algorithm works in four phases for every file:
• Phase 1 (per file): identify blocks for iDedup
  – Only full, pure data blocks are processed
  – Metadata blocks and special files are ignored
• Phase 2 (per file): sequence processing
  – Uses the dedupe metadata cache
  – Keeps track of multiple candidate sequences
• Phase 3 (per sequence): sequence pruning
  – Eliminate short sequences below the threshold
  – Pick among overlapping sequences via a heuristic
• Phase 4 (per sequence): deduplication of the sequence
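The four phases above, condensed into a self-contained Python sketch. The block representation, the dict-based fingerprint cache, the SHA-256 fingerprint, and the "keep longest runs" stand-in for the overlapping-sequence heuristic are all illustrative assumptions rather than the production implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size

def idedup_file(blocks, fp_cache, threshold):
    """Run the four iDedup phases for one file at de-stage time.

    blocks:    list of (file_block_offset, data) pairs for the file's dirty blocks
    fp_cache:  in-memory fingerprint -> disk block number map (a dict here)
    threshold: minimum sequence length
    Returns {file_block_offset: shared_disk_block} for blocks to deduplicate;
    every other block is written out normally.
    """
    # Phase 1 (per file): identify candidate blocks - only full, pure data blocks.
    candidates = [(off, data) for off, data in blocks if len(data) == BLOCK_SIZE]

    # Phase 2 (per file): sequence processing - probe the fingerprint cache and
    # grow runs that are contiguous both in the file and on disk.
    sequences, run = [], []
    for off, data in sorted(candidates, key=lambda b: b[0]):
        dbn = fp_cache.get(hashlib.sha256(data).digest())
        extends = (dbn is not None and run and
                   off == run[-1][0] + 1 and dbn == run[-1][1] + 1)
        if not extends:
            if run:
                sequences.append(run)
            run = []
        if dbn is not None:
            run.append((off, dbn))
    if run:
        sequences.append(run)

    # Phase 3 (per sequence): pruning - drop runs shorter than the threshold
    # (the real system also picks among overlapping sequences via a heuristic;
    # preferring longer runs is used here as a simple stand-in).
    kept = sorted((s for s in sequences if len(s) >= threshold), key=len, reverse=True)

    # Phase 4 (per sequence): deduplicate - map each block in a kept sequence
    # to the existing on-disk block instead of allocating a new one.
    return {off: dbn for seq in kept for off, dbn in seq}
```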
Evaluation
Evaluation setup

• NetApp FAS 3070, 8 GB RAM, 3-disk RAID-0
• Evaluated by replaying real-world CIFS traces
  – Corporate filer traces, NetApp data center (2007): read data 204 GB (69%), write data 93 GB
  – Engineering filer traces, NetApp data center (2007): read data 192 GB (67%), write data 92 GB
• Comparison points
  – Baseline: system with no iDedup
  – Threshold-1: system with full dedupe (1-block sequences)
• Dedupe metadata cache sizes: 0.25, 0.5 and 1 GB
Results: deduplication ratio vs threshold

[Figure: dedupe ratio (%) vs threshold (1, 2, 4, 8) for 0.25, 0.5 and 1 GB caches, Corp trace]

• Less than linear decrease in dedupe savings as the threshold grows => spatial locality in dedupe
• Ideal threshold = the biggest threshold with the least decrease in dedupe savings
  => Threshold-4, which retains ~60% of the maximum savings
Results: disk fragmentation (request sizes)

[Figure: CDF of block request sizes, Engg trace, 1 GB cache; mean request size in blocks – Baseline 15.8, Threshold-1 12.5, Threshold-2 14.8, Threshold-4 14.9, Threshold-8 15.4]

• Threshold-1 shows the most fragmentation, the baseline the least
• Fragmentation for the other thresholds falls between Baseline and Threshold-1 => tunable fragmentation
Results: CPU utilization

[Figure: CDF of CPU utilization samples, Corp trace, 1 GB cache; mean utilization – Baseline 13.2%, Threshold-1 15.0%, Threshold-4 16.6%, Threshold-8 17.1%]

• Larger variance (longer tail) compared to the baseline
• But the difference in mean utilization is less than 4%
Results: latency impact

[Figure: CDF of client response time, Corp trace, 1 GB cache; Baseline vs Threshold-1 vs Threshold-8]

• Latency impact appears only for longer response times (> 2 ms)
• Threshold-1 mean latency is affected by ~13% vs the baseline
• The difference between Threshold-8 and the baseline is < 4%!
Summary

• Inline dedupe has significant performance challenges
  – Reads: fragmentation; writes: CPU + extra I/Os
• iDedup creates a tradeoff between savings and performance
  – Leverages locality properties of duplication
  – Avoids fragmentation by deduping only sequences
  – Avoids extra I/Os by keeping dedupe metadata in memory
• Experiments with latency-sensitive primary workloads
  – Low CPU impact: < 5% on average
  – ~60% of maximum dedupe with ~4% impact on latency
• Future work: dynamic threshold, more traces
• Our traces are available for research purposes
Acknowledgements

• NetApp WAFL team: Blake Lewis, Ling Zheng, Craig Johnston, Subbu PVS, Praveen K, Sriram Venketaraman
• NetApp ATG team: Scott Dawkins, Jeff Heller
• Shepherd: John Bent
• Our intern (from UCSC): Stephanie Jones