HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for - PowerPoint PPT Presentation

HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

Huijun Wu (1,4), Chen Wang (2), Yinjin Fu (3), Sherif Sakr (1), Liming Zhu (1,2) and Kai Lu (4)

(1) The University of New South Wales
(2) Data61, CSIRO
(3) PLA University of Science and Technology
(4) National University of Defence Technology

Outline
- Background
- Motivations
- Hybrid Prioritized Deduplication
- Experiment Results
- Conclusion

18/05/2017

Background
- Primary storage deduplication
  - Saves storage capacity
  - Improves I/O efficiency
- The state of the art
  - Post-processing deduplication: performed during off-peak time
  - Inline deduplication: performed on the write path
[Figure: data blocks pass through a fingerprint lookup; only unique blocks are written]
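The inline write path described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `InlineDedupStore`, the use of SHA-1 as the fingerprint function, and the in-memory dictionaries are all assumptions made for illustration.

```python
import hashlib


class InlineDedupStore:
    """Toy block store that deduplicates on the write path (illustrative only)."""

    def __init__(self):
        self.fingerprints = {}  # fingerprint -> physical block id
        self.blocks = []        # physical blocks actually stored
        self.logical_map = {}   # logical block address -> physical block id

    def write(self, lba, data: bytes) -> bool:
        """Write one block; return True if it was detected as a duplicate."""
        fp = hashlib.sha1(data).hexdigest()  # fingerprint of the block content
        pid = self.fingerprints.get(fp)      # fingerprint lookup
        duplicate = pid is not None
        if not duplicate:
            # Only unique blocks reach the storage device.
            pid = len(self.blocks)
            self.blocks.append(data)
            self.fingerprints[fp] = pid
        self.logical_map[lba] = pid
        return duplicate

    def read(self, lba) -> bytes:
        """Return the content stored at a logical block address."""
        return self.blocks[self.logical_map[lba]]
```

Writing the same 4 KB content to two logical addresses stores a single physical block; both addresses map to it.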

Post-processing Deduplication
- Commodity products use post-processing deduplication [TOS’16]
  - Windows Server 2012 [ATC’12]
- Challenges remain for real-world systems
  - Off-peak periods may not be long enough
  - More storage capacity is required, since duplicates are written before they are removed
  - Duplicate writes shorten the lifespan of storage devices (e.g., SSDs)
  - It does not improve I/O performance and wastes I/O bandwidth
- Inline deduplication can help

Inline Deduplication
- Fingerprint lookup is the bottleneck
  - An on-disk fingerprint table introduces high latency
  - The fingerprint table is large and hard to fit in memory
  - Cache efficiency is critical
- State-of-the-art solutions and challenges
  - Exploit the temporal locality of workloads [FAST’12][IPDPS’14]
    - But temporal locality may not exist [TPDS’17]
  - In the cloud scenario, the locality of workloads from different VMs may differ considerably
  - Workloads may interfere with each other and reduce cache efficiency
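To make the lookup bottleneck concrete, here is a small sketch of an in-memory LRU fingerprint cache in front of a full fingerprint table. The names `CachedFingerprintIndex`, `disk_table` and `disk_lookups` are hypothetical; the "disk" table is simply a Python set standing in for slow on-disk storage, and the counter shows how many slow lookups the cache failed to absorb.

```python
from collections import OrderedDict


class CachedFingerprintIndex:
    """LRU fingerprint cache backed by a (simulated) on-disk table."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.cache = OrderedDict()  # in-memory LRU cache of recent fingerprints
        self.disk_table = set()     # stand-in for the full on-disk fingerprint table
        self.disk_lookups = 0       # each of these would cost a slow disk access

    def is_duplicate(self, fp) -> bool:
        """Return True if the fingerprint has been seen before."""
        if fp in self.cache:
            self.cache.move_to_end(fp)  # refresh recency on a cache hit
            return True
        # Cache miss: fall through to the slow table.
        self.disk_lookups += 1
        found = fp in self.disk_table
        self.disk_table.add(fp)
        self.cache[fp] = True
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return found
```

With a cache of two entries and the sequence a, b, a, c, b, only the second write of a is answered from memory; the second write of b has already been evicted and costs a disk lookup.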

Motivation
- Workloads with different temporal locality interfere with each other
- A toy example
  - 18 duplicate blocks in total, only 6 are identified
[Figure: timeline of time steps 1 to 20 showing the block fingerprints written by streams A, B and C and the contents of a shared fingerprint cache; the number of deduplicated blocks rises from 0 at the start to 6 at the end]
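The interference effect can be reproduced with a small simulation. The streams below are made up for illustration and are not the data from the slide's figure: stream A repeats each block shortly after writing it, stream B writes only unique blocks, and three B blocks arrive for every A block. The helper `lru_hits` is a hypothetical name.

```python
from collections import OrderedDict


def lru_hits(stream, capacity):
    """Count how many duplicates an LRU fingerprint cache of this size detects."""
    cache = OrderedDict()
    hits = 0
    for fp in stream:
        if fp in cache:
            cache.move_to_end(fp)
            hits += 1
        else:
            cache[fp] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits


# Stream A has good temporal locality: 4 of its 8 blocks are duplicates.
stream_a = ["a0", "a1", "a0", "a1", "a2", "a3", "a2", "a3"]
# Stream B contains no duplicates at all.
stream_b = [f"b{i}" for i in range(24)]

# Interleave the streams: every A block is followed by three B blocks.
merged = []
for i, fp in enumerate(stream_a):
    merged.append(fp)
    merged.extend(stream_b[3 * i:3 * i + 3])

# One shared cache of 4 entries versus 2 entries reserved for each stream.
shared_hits = lru_hits(merged, capacity=4)
partitioned_hits = lru_hits(stream_a, capacity=2) + lru_hits(stream_b, capacity=2)
```

In this constructed case the shared cache detects none of A's duplicates, because B's unique fingerprints evict A's entries before they are reused, while reserving half of the same total capacity for A detects all four.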

Motivation
- Temporal locality may be weak for some workloads
[Figure: histograms of the distance between duplicate blocks for the FIU-mail and Cloud-FTP traces]

Motivation
- Workloads with different temporal locality interfere with each other
- Using real-world I/O traces (LRU cache):
  - Number of duplicate blocks: FIU-mail > 4 × Cloud-FTP
  - Occupied cache size: FIU-mail < 0.8 × Cloud-FTP
- The cache resource allocation is unreasonable!

Hybrid Prioritized Deduplication
- Hybrid inline and post-processing deduplication
  - Neither post-processing nor inline deduplication works well on its own
  - Solution: combine inline and post-processing deduplication
    - Identify more duplicates through inline fingerprint caching
    - Use post-processing to achieve exact deduplication
  - Challenge: interference compromises the temporal locality of workloads and so reduces the efficiency of fingerprint caching
    - We differentiate workloads (data streams) to improve it

Hybrid Prioritized Deduplication
- Prioritize the cache allocation for inline deduplication
  - A data stream that contributes more to the deduplication ratio should get more cache resources
  - In the inline phase, a higher deduplication ratio comes from better temporal locality
- How to evaluate temporal locality?
  - It changes dynamically over time
  - Accurate estimation is critical for good cache allocation
  - Use the number of duplicate blocks in N consecutive data blocks (the estimation interval) as an indicator of temporal locality
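A rough sketch of the indicator and of a prioritized allocation follows. Two things here are assumptions rather than statements from the slides: the duplicates are counted exactly (HPDedup estimates this quantity instead of counting), and the cache is split in direct proportion to the indicator, whereas the slide only says that streams contributing more should receive more. The function names are hypothetical.

```python
def duplicates_in_interval(fingerprints):
    """Count duplicate blocks among the fingerprints of one estimation interval."""
    seen = set()
    dups = 0
    for fp in fingerprints:
        if fp in seen:
            dups += 1
        else:
            seen.add(fp)
    return dups


def allocate_cache(total_entries, interval_by_stream):
    """Split cache entries across streams in proportion to their duplicate counts.

    Proportional splitting is an assumption made for illustration only.
    """
    if not interval_by_stream:
        return {}
    scores = {s: duplicates_in_interval(fps) for s, fps in interval_by_stream.items()}
    total = sum(scores.values())
    if total == 0:
        # No stream shows any locality: fall back to an even split.
        share = total_entries // len(scores)
        return {s: share for s in scores}
    return {s: total_entries * score // total for s, score in scores.items()}
```

For example, if one stream shows two duplicates in its interval and another shows one, a cache of 90 entries would be split 60 to 30 under this rule.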

System architecture
[Figure: system architecture. The temporal locality of each stream is estimated and the cache is allocated accordingly; an on-disk fingerprint table serves post-processing deduplication.]

Evaluate the temporal locality
- Simple idea: count the distinct data block fingerprints of each stream
  - Introduces high memory overhead
  - The overhead may be comparable to the cache capacity
- Estimate rather than count
  - Get the number of distinct fingerprints from a small portion of samples
  - Essentially the same as a classical problem: ‘How many distinct elements exist in a set?’
    - Origin: estimating the number of species in an animal population from samples [Fisher, JSTOR’1940]
    - Sublinear estimator: Unseen Estimation Algorithm [NIPS’13]

Estimate the temporal locality
- Use the Unseen algorithm to estimate the LDSS
[Figure: the fingerprints f1 ... f18 arriving during estimation interval I are reservoir-sampled into a fingerprint sample buffer; the Unseen Estimation Algorithm takes the buffer and outputs the LDSS for interval I]
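The sampling stage of this pipeline can be sketched with the classic reservoir sampling algorithm (Algorithm R), which keeps a fixed-size uniform sample of a stream in one pass. The function name, the buffer size `k` and the optional `rng` parameter are assumptions for illustration; the Unseen estimator that consumes the sample is not implemented here.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the buffer with the first k items.
            sample.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

The buffer never holds more than `k` fingerprints regardless of how long the estimation interval is, which is what keeps the memory overhead small compared with counting every distinct fingerprint.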
