Reducing Replication Bandwidth for Distributed Document Databases
Lianghong Xu¹, Andy Pavlo¹, Sudipta Sengupta², Jin Li², Greg Ganger¹
¹Carnegie Mellon University  ²Microsoft Research
Document-oriented Databases

An update reads a recent document and writes back a similar one:

Before the update:
{
  "_id": "55ca4cf7bad4f75b8eb5c25c",
  "pageId": "46780",
  "revId": "41173",
  "timestamp": "2002-03-30T20:06:22",
  "sha1": "6i81h1zt22u1w4sfxoofyzmxd",
  "text": "The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta]] in two acts… just as predicting,…The fairy Queen, however, appears to … all live happily ever after."
}

After the update:
{
  "_id": "55ca4cf7bad4f75b8eb5c25d",
  "pageId": "46780",
  "revId": "128520",
  "timestamp": "2002-03-30T20:11:12",
  "sha1": "q08x58kbjmyljj4bow3e903uz",
  "text": "The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta]] in two acts… just as predicted, …The fairy Queen, on the other hand, is ''not'' happy, and appears to … all live happily ever after."
}
Replication Bandwidth

[Figure: the primary database appends each insert and update to its operation log (oplog); oplog entries, carrying nearly identical new document versions like the two shown on the previous slide, are shipped over the WAN to the secondary replicas.]

Goal: reduce WAN bandwidth for geo-replication.
Why Deduplication?
• Why not just compress?
  – Oplog batches are small and do not contain enough internal overlap
• Why not just use diff?
  – Needs application guidance to identify the source document
• Dedup finds and removes redundancy across the entire data corpus
Traditional Dedup: Ideal
[Figure: an incoming byte stream is split at chunk boundaries into chunks 1–5; the modified region falls entirely inside chunk 3. Chunks 1, 2, 4 and 5 are found as duplicates, so only chunk 3 is sent to the replicas.]
Traditional Dedup: Reality
[Figure: the same incoming stream, but the modified regions are scattered across chunks 1, 2, 3 and 5; only chunk 4 is found as a duplicate.] Almost the entire document must be sent.
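The chunk-matching failure above is easy to reproduce. The sketch below is not the paper's implementation; it uses deliberately naive fixed-size chunking to show how a single inserted byte shifts every later chunk boundary, so almost no chunk of the new version matches the old one:

```python
import hashlib

CHUNK = 64  # fixed-size chunking, as in a naive dedup scheme


def chunk_hashes(data: bytes) -> list[bytes]:
    """Split data into fixed-size chunks and hash each chunk."""
    return [hashlib.sha1(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def dup_ratio(new: bytes, old: bytes) -> float:
    """Fraction of new's chunks found in the index built from old."""
    index = set(chunk_hashes(old))
    hashes = chunk_hashes(new)
    return sum(h in index for h in hashes) / len(hashes)


old = bytes(range(256)) * 16          # a 4 KB "document"
new = old[:100] + b"!" + old[100:]    # one byte inserted near the front

# The insertion shifts every later chunk boundary, so although the two
# versions are nearly identical, almost no chunks dedup.
print(f"duplicate chunks: {dup_ratio(new, old):.0%}")
```

Content-defined (e.g., Rabin) chunking resynchronizes boundaries after an edit, but as the slide shows, scattered small edits still dirty most chunks at realistic chunk sizes.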
Similarity Dedup (sDedup)
[Figure: instead of looking for identical chunks, sDedup finds a similar source document and delta-encodes the incoming document against it.] Only the delta encoding is sent.
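The delta-encoding idea can be sketched with Python's standard difflib; a real system would use a compact binary encoding (e.g., xDelta-style), but the principle is the same: copy ranges from the source document and ship only the literal bytes that changed.

```python
import difflib


def delta_encode(source: str, target: str) -> list:
    """Encode target as edit ops against source: 'copy' ops reference
    source ranges, 'insert' ops carry the literal changed bytes."""
    sm = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reference source bytes
        else:
            ops.append(("insert", target[j1:j2]))  # ship literal bytes
    return ops


def delta_decode(source: str, ops: list) -> str:
    """Rebuild the target from the source document plus the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            _, i1, i2 = op
            out.append(source[i1:i2])
        else:
            out.append(op[1])
    return "".join(out)


old = "The Peer and the Peri is a comic operetta in two acts."
new = "The Peer and the Peri is a comic operetta in two short acts."
ops = delta_encode(old, new)
assert delta_decode(old, ops) == new
literal = sum(len(op[1]) for op in ops if op[0] == "insert")
print(f"literal bytes shipped: {literal} of {len(new)}")
```

Because the secondary already stores the source document, it only needs the small delta to reconstruct the new version.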
Compress vs. Dedup
[Figure: compression ratios on a 20GB sampled Wikipedia dataset; MongoDB v2.7, 4MB oplog batches.]
sDedup Integration
Primary node: client insertions and updates go to the database and the oplog. The oplog syncer hands unsynchronized oplog entries to the sDedup encoder, which looks up source documents in the database and a source document cache, then ships dedup'ed oplog entries to the secondary.
Secondary node: the sDedup decoder fetches source documents from its database to reconstruct the oplog entries, which are then replayed.
sDedup Encoding Steps
1. Identify similar documents
2. Select the best match
3. Delta compression
Identify Similar Documents
The target document is split with Rabin chunking, yielding chunk hashes (e.g., 32, 17, 25, 41, 12). Consistent sampling keeps the top-k hashes (here 41 and 32) as the document's similarity sketch. Each sketch feature is looked up in a feature index table that maps features to candidate documents, and each candidate's similarity score counts how many sketch features it shares with the target:
• Doc #1 (39, 32, 22, 15): score 1
• Doc #2 (32, 25, 38, 41, 12): score 2
• Doc #3 (32, 17, 38, 41, 12): score 2
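A minimal sketch of this step, with assumptions: SHA-1 over sliding text windows stands in for Rabin chunk fingerprints, and "keep the k largest hash values" is one simple form of consistent sampling; the function and variable names are illustrative, not the paper's.

```python
import hashlib
from collections import defaultdict


def features(text: str, k: int = 2, width: int = 8) -> list[int]:
    """Similarity sketch: hash sliding windows (a stand-in for Rabin
    chunk hashes) and keep the k largest values (consistent sampling)."""
    hs = {int.from_bytes(
              hashlib.sha1(text[i:i + width].encode()).digest()[:4], "big")
          for i in range(0, max(1, len(text) - width))}
    return sorted(hs, reverse=True)[:k]


index = defaultdict(set)  # feature -> documents containing that feature


def index_doc(doc_id: str, text: str) -> None:
    for f in features(text):
        index[f].add(doc_id)


def candidates(text: str) -> list[tuple[str, int]]:
    """Score candidates by the number of sketch features they share."""
    score = defaultdict(int)
    for f in features(text):
        for doc_id in index[f]:
            score[doc_id] += 1
    return sorted(score.items(), key=lambda kv: -kv[1])


index_doc("doc1", "the quick brown fox jumps over the lazy dog")
index_doc("doc2", "completely unrelated text about databases")
# A target identical to doc1 shares both sketch features with it.
print(candidates("the quick brown fox jumps over the lazy dog"))
```

Because only the top-k features are indexed, the feature index table stays small even for a large corpus.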
Select the Best Match
Initial ranking by similarity score:
  1. Doc #2 — score 2
  1. Doc #3 — score 2
  2. Doc #1 — score 1
Candidates resident in the source document cache are rewarded (+2). Final ranking:
  1. Doc #3 — cached, 2 + 2 = 4
  2. Doc #1 — cached, 1 + 2 = 3
  3. Doc #2 — not cached, 2
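The cache-aware re-ranking above can be sketched as follows (the +2 reward matches the slide's example; the function shape is an assumption, not the paper's code):

```python
CACHE_REWARD = 2  # bonus for candidates resident in the source document cache


def best_match(scores: dict[str, int], cache: set[str]) -> str:
    """Re-rank candidates: similarity score plus a reward if the document
    is already in the source document cache (avoiding a database fetch)."""
    ranked = sorted(
        scores.items(),
        key=lambda kv: kv[1] + (CACHE_REWARD if kv[0] in cache else 0),
        reverse=True)
    return ranked[0][0]


scores = {"doc1": 1, "doc2": 2, "doc3": 2}   # initial similarity scores
cache = {"doc1", "doc3"}                     # documents currently cached
print(best_match(scores, cache))             # doc3: 2 + 2 = 4 wins
```

Preferring cached candidates trades a slightly worse match for a much cheaper source lookup during delta encoding.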
Evaluation
• MongoDB setup (v2.7)
  – 1 primary node, 1 secondary node, 1 client
  – Node configuration: 4 cores, 8GB RAM, 100GB HDD storage
• Datasets
  – Wikipedia dump (20GB sampled out of ~12TB)
  – Additional datasets evaluated in the paper
Compression
[Bar chart: compression ratio of sDedup vs. trad-dedup at chunk sizes 4KB, 1KB, 256B and 64B on the 20GB sampled Wikipedia dataset; reported values include 38.9, 38.4, 26.3, 15.2, 9.9, 9.1, 4.6 and 2.3, with sDedup achieving much higher ratios than trad-dedup.]
Memory
[Bar chart: memory usage (MB) of sDedup vs. trad-dedup at chunk sizes 4KB, 1KB, 256B and 64B on the 20GB sampled Wikipedia dataset; reported values include 780.5, 272.5, 133.0, 80.2, 61.0, 57.3, 47.9 and 34.1, with sDedup the more memory-efficient of the two.]
Other Results (See Paper)
• Negligible client performance overhead
• Failure recovery is quick and easy
• Sharding does not hurt compression rate
• More datasets
  – Microsoft Exchange, Stack Exchange
Conclusion & Future Work
• sDedup: similarity-based deduplication for replicated document databases
  – Much greater data reduction than traditional dedup
  – Up to 38x compression ratio for Wikipedia
  – Resource-efficient design with negligible overhead
• Future work
  – More diverse datasets
  – Dedup for local database storage
  – Different similarity search schemes (e.g., super-fingerprints)