Efficient Locally Trackable Deduplication in Replicated Systems
João Barreto and Paulo Ferreira
Distributed Systems Group, INESC-ID/Technical University Lisbon, Portugal
www.gsd.inesc-id.pt
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Bandwidth remains scarce
Middleware 2009, 02/12/2009
Collaboration through data replication
[Figure: Site S and Site R, each holding versions of File A (v2 to v4), File B (v3, v4) and File C (v7 to v9); the two sets are labelled versions(S) and versions(R).]
• Distributed users share objects A, B and C
• At each moment: S stores versions(S) and R stores versions(R)
The bottleneck: synchronization
[Figure: Site S synchronizes to Site R; the sites hold versions(S) and versions(R) of Files A, B and C.]
1. Determine which versions to transfer
The bottleneck: synchronization
[Figure: Site S and Site R with versions(S) and versions(R).]
2. Transfer versions
Deduplication: exploiting redundancy
[Figure: transfer from Site S to Site R, annotated "+ References to chunks in versions(R)".]
How to determine which chunks are redundant?
Locally trackable vs untrackable redundancy
[Figure: Site S and Site R with versions(S) and versions(R).]
• Locally Trackable Redundancy: the chunk to transfer exists in some version that is in both versions(S) and versions(R)
• Locally Untrackable Redundancy: otherwise
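The definition above can be illustrated with a minimal sketch. It assumes fixed-size chunks, SHA-256 chunk hashes, and versions held in dictionaries keyed by a version identifier; `chunk_hashes` and `classify` are hypothetical names, not part of the presented system.

```python
from hashlib import sha256

def chunk_hashes(data: bytes, size: int = 4) -> set:
    """Hashes of the fixed-size chunks of one version (chunking scheme is an assumption)."""
    return {sha256(data[i:i + size]).hexdigest() for i in range(0, len(data), size)}

def classify(chunk: bytes, versions_s: dict, versions_r: dict, size: int = 4) -> str:
    """Classify a chunk that S is about to transfer to R."""
    h = sha256(chunk).hexdigest()
    common = versions_s.keys() & versions_r.keys()  # versions stored at both sites
    if any(h in chunk_hashes(versions_s[v], size) for v in common):
        return "locally trackable"
    # Only R's content can reveal this case, so S cannot detect it on its own.
    if any(h in chunk_hashes(d, size) for d in versions_r.values()):
        return "locally untrackable"
    return "not redundant"
```

The first test needs only data that S already stores, which is what makes that kind of redundancy "locally" trackable; the second test requires looking at R's content.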
Existing approaches: compare-by-hash
[Figure: Site S and Site R with versions(S) and versions(R).]
Existing approaches: advantages and shortcomings
Compare-by-hash
• Detects both locally trackable and untrackable redundancy
• Detects redundancy across any versions and/or objects
• Additional round-trip
• Limited precision:
– smaller chunks may not compensate for the hash-exchange and hash-lookup overheads
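A minimal sketch of a generic compare-by-hash exchange, to show where the additional round-trip comes from. It assumes fixed-size chunks and SHA-256 hashes, and models the receiver as a dictionary from chunk hash to chunk; the names `split`, `digest` and `compare_by_hash_sync` are hypothetical, and this is not the protocol of any specific tool.

```python
from hashlib import sha256

def split(data: bytes, size: int = 4) -> list:
    """Fixed-size chunking (an assumption made for brevity)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def digest(chunk: bytes) -> str:
    return sha256(chunk).hexdigest()

def compare_by_hash_sync(new_version: bytes, receiver_store: dict, size: int = 4):
    """Simulate S sending new_version to R; returns (rebuilt version, chunk bytes sent)."""
    chunks = split(new_version, size)
    hashes = [digest(c) for c in chunks]                   # message 1, S -> R: chunk hashes
    missing = {h for h in hashes if h not in receiver_store}  # message 2, R -> S: hashes R lacks
    payload = {h: c for h, c in zip(hashes, chunks) if h in missing}  # message 3, S -> R: data
    receiver_store.update(payload)
    rebuilt = b"".join(receiver_store[h] for h in hashes)
    return rebuilt, sum(len(c) for c in payload.values())
```

Messages 1 and 2 are the hash exchange: their cost grows as chunks get smaller, which is the precision limit noted above.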
Existing approaches: delta encoding
[Figure: Site S and Site R with versions(S) and versions(R).]
Calculate deltas from the most recent versions in versions(R) to each version to transfer, using local, high-precision algorithms.
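A minimal sketch of delta encoding between one base version and one target version. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for the high-precision differencing algorithms of tools such as xdelta; `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Encode target as copy ranges from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # bytes the receiver already has
        elif j2 > j1:                               # "replace" or "insert"
            ops.append(("insert", target[j1:j2]))  # literal bytes that must be sent
        # "delete": nothing to emit, the range is simply not copied
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target at the receiver from its copy of base."""
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)
```

The delta can be computed entirely at the sender, ahead of transfer time, but only against a single base version that the receiver is known to hold.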
Existing approaches: advantages and shortcomings
Compare-by-hash
• Detects both locally trackable and untrackable redundancy
• Detects redundancy across any versions and/or objects
• Additional round-trip
• Limited precision:
– smaller chunks may not compensate for the hash-exchange and hash-lookup overheads
Delta Encoding
• Only detects locally trackable redundancy
• Limited to pairs of versions
• High-precision local redundancy detection
• Redundancy detection can occur ahead of transfer time
• Simple protocol
Can we devise a solution that borrows the advantages from both approaches?
Our contribution: redFS
[Figure: Site S and Site R with versions(S) and versions(R); versions carry identifiers such as [S:3] to [S:6] and [R:3] to [R:5].]
0. Use local high-precision compare-by-hash algorithm to pre-compute local redundancy relations
2. Determine C = versions(S) ∩ versions(R)
3. For each chunk to transfer, if the chunk is also in some version in C, simply send a reference to that version
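The listed steps can be sketched as follows. This is an illustration under stated assumptions, not the redFS implementation: chunks are fixed-size, hashes are SHA-256, a reference is a (version id, chunk number) pair, and `build_index`, `encode` and `decode` are hypothetical names.

```python
from hashlib import sha256

CHUNK = 4  # assumption: fixed-size chunks

def chunks(data: bytes) -> list:
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

def build_index(versions: dict) -> dict:
    """Step 0, local and ahead of time: chunk hash -> [(version id, chunk number)]."""
    index = {}
    for vid, data in versions.items():
        for n, c in enumerate(chunks(data)):
            index.setdefault(sha256(c).hexdigest(), []).append((vid, n))
    return index

def encode(new_data: bytes, index: dict, common: set) -> list:
    """Step 3 at S: reference chunks found in a common version, send the rest as data."""
    msg = []
    for c in chunks(new_data):
        places = index.get(sha256(c).hexdigest(), [])
        hit = next(((vid, n) for vid, n in places if vid in common), None)
        msg.append(("ref", *hit) if hit else ("data", c))
    return msg

def decode(msg: list, versions_r: dict) -> bytes:
    """At R: resolve references against R's own copies of the common versions."""
    out = bytearray()
    for item in msg:
        out += chunks(versions_r[item[1]])[item[2]] if item[0] == "ref" else item[1]
    return bytes(out)
```

With `common = versions_s.keys() & versions_r.keys()` playing the role of C, no chunk hashes cross the network: S decides what to reference using only its local index and the set of version identifiers shared with R.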
What does redFS achieve?
Compare-by-hash
• Detects both locally trackable and untrackable redundancy
• Detects redundancy across any versions and/or objects
• Additional round-trip
• Limited precision:
– smaller chunks may not compensate for the hash-exchange and hash-lookup overheads
Delta Encoding
• Only detects locally trackable redundancy
• Limited to pairs of versions
• High-precision local redundancy detection
• Redundancy detection can occur ahead of transfer time
• Simple protocol
What does redFS achieve?
[Figure: design-space chart. One axis runs from "More complicated protocol, limited precision" to "Simpler protocol, able to detect finer-grained redundancy"; the other axis is "Detectable forms of redundancy". A "≈" mark appears near the "Any redundancy" and "Any locally trackable redundancy" labels.]
• Compare-by-hash: any redundancy; more complicated protocol, limited precision
• Delta Encoding: locally trackable redundancy across pairs of consecutive versions of the same object; simpler protocol, finer-grained redundancy
• redFS: any locally trackable redundancy; simpler protocol, finer-grained redundancy
Evaluation: methodology
• We evaluated different solutions from every approach:
– RedFS full implementation
– LBFS, rsync, TAPER (compare-by-hash)
– xdelta, svn (delta encoding)
• Two distributed sites, network with different bandwidths (3 Mbps to 100 Mbps)
• Real workloads
– Single-writer scenarios
– Multi-writer scenarios
Evaluation: single-writer multi-reader scenarios
[Figure: Site S (Writer) and Site R (Reader) both hold v0 of a set of files (e.g. gcc 3.3.1 source code); v1 of the set of files (e.g. gcc 3.4.1 source code) is transferred from S to R with deduplication.]
• All redundancy is locally trackable!
• Same methodology and workloads as in recent compare-by-hash papers (e.g. LBFS [SOSP'01], TAPER [FAST'05])
Evaluation: transferred volumes in single-writer multi-reader scenarios
[Chart: transferred volumes per workload and solution.]
• RedFS transfers fewer (or, in a few exceptions, comparable) bytes than all compare-by-hash solutions
• RedFS transfers less than delta encoding (except for particularly suited workloads such as this one)
Evaluation: performance in single-writer multi-reader scenarios
[Chart; legend as extracted: svn (delta encoding) rsync Plain LBFS redFS]
• Best performance with high bandwidth (due to protocol simplicity)
• Best performance with low bandwidth (due to high deduplication efficiency)
Evaluation: multi-writer scenario
Workloads from collaborative activity between groups of teachers and students during 1-semester courses.
[Figure: Site S (Writer) and Site R (Writer) both start from v0 of the working set; S produces vS and R produces vR of the working set, which are then exchanged with deduplication.]
• Locally untrackable redundancy can now occur!
Evaluation: multi-writer scenario
Q: How much locally untrackable redundancy?
A: Only 1% to 4% of all redundancy generated over more than 3 months was locally untrackable
• Advantages of redFS persist even in real scenarios where locally untrackable redundancy can occur (both in transferred volume and performance)