Data Management in Distributed Systems
Simon Schäffner
Advisor: Stefan Liebald
Technische Universität München, Fakultät für Informatik
Lehrstuhl für Netzarchitekturen und Netzdienste
Garching, 13 July 2018
Distributed Systems
“A collection of autonomous computing elements that appear to its users as a single coherent system” [1]
Geographical Dispersion?
Data Management in Distributed Systems
Input / Output for most tasks
More complex than reading from / writing to HDD
How is data organised?
What data is stored on which nodes?
Attributes of Data Management Strategies
Scalability
Performance
Consistency
Redundancy
Overhead
Attack Resistance
Comparison of Data Management Strategies
Peer-to-Peer File Sharing
Started out with Napster and Gnutella
Legally controversial usage
Also interesting for entirely legal uses
BitTorrent
[Figure: Web Server and .torrent file]
BitTorrent
.torrent metadata file (from the web server):
{
  announce: http://bttracker.debian.org:6969/announce,
  comment: "Debian CD from cdimage.debian.org",
  creation date: 1520682848,
  httpseeds: [
    https://cdimage.debian.org/cdimage/release/9.4.0//srv/cdbuilder.debian.org/dst/deb-cd/weekly-builds/amd64/iso-cd/debian-9.4.0-amd64-netinst.iso
  ],
  info: {
    length: 305135616,
    name: debian-9.4.0-amd64-netinst.iso,
    piece length: 262144,
    …
  }
}
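The `length` and `piece length` fields in the metadata determine how many pieces the file is split into. The following is a small arithmetic sketch using the two values from the slide; the function name `piece_layout` is made up for illustration and is not part of any BitTorrent library.

```python
def piece_layout(length: int, piece_length: int) -> tuple[int, int]:
    """Return (number of pieces, size of the last piece in bytes)."""
    # Ceiling division: a partially filled last piece still counts as a piece.
    num_pieces = -(-length // piece_length)
    last_piece_size = length - (num_pieces - 1) * piece_length
    return num_pieces, last_piece_size


# Values copied from the .torrent metadata of debian-9.4.0-amd64-netinst.iso.
num_pieces, last_piece_size = piece_layout(305135616, 262144)
print(num_pieces, last_piece_size)  # 1164 262144
```

For this particular file the size is an exact multiple of the piece length, so all 1164 pieces are full 256 KiB pieces.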
BitTorrent
[Figure: Tracker at http://bttracker.debian.org:6969/announce, Origin, and three nodes with the .torrent file]
BitTorrent: Scalability
[Figure; source: [4]]
BitTorrent: Attributes (1)
Scalability
• Scales well with large #nodes
Performance
• Good upload utilisation
Consistency
• Content does not change
• Checksums in metadata file
Redundancy
• Max. redundancy possible
Overhead
• Metadata file
• 1/1000 of traffic to tracker [5]
• Metainformation sent to other nodes
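The "checksums in metadata file" point can be sketched as follows. The original BitTorrent protocol stores one SHA-1 hash per piece in the metadata, so a downloader can check every piece independently of the peer it came from. The helper names (`split_pieces`, `piece_hashes`, `verify_piece`) and the tiny piece length are assumptions made for this example, not the API of a real client.

```python
import hashlib


def split_pieces(data: bytes, piece_length: int) -> list[bytes]:
    """Cut the payload into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Hashes the publisher would place in the metadata file."""
    return [hashlib.sha1(piece).digest() for piece in split_pieces(data, piece_length)]


def verify_piece(index: int, piece: bytes, expected_hashes: list[bytes]) -> bool:
    """Check a downloaded piece against the hash from the metadata file."""
    return hashlib.sha1(piece).digest() == expected_hashes[index]


# Hypothetical 10-byte "file" with a piece length of 4 bytes.
hashes = piece_hashes(b"abcdefghij", 4)
print(verify_piece(1, b"efgh", hashes))  # True: piece is intact
print(verify_piece(1, b"efgX", hashes))  # False: piece is corrupt, fetch it again
```

Because the content never changes after publication, a failed check can only mean a corrupt or forged piece, which is then discarded and requested again.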
BitTorrent: Attributes (2)
Attack Resistance
• Download of corrupt file very unlikely
• Poisoning attacks possible
  − Uploading large amount of fake files / malware
  − Flooding all peers with download requests
BitTorrent
Attribute          Rating           Notes
Data Update        no data update
Scalability        ++               Upload util. indep. of #nodes
Performance        +                Upload util. very good
Consistency        0                No data updates; corrupt data identified
Redundancy         +                High
Overhead           0                High, but aligns with goal
Attack Resistance  -                Poisoning attacks possible
Kademlia
Peer-to-peer distributed hash table (DHT)
Identifiers: 160-bit (both key and node IDs)
Keys stored on “close” nodes
d(x, y) = x ⊕ y
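A minimal sketch of the XOR metric and of what "close" means for key placement. IDs are treated as plain integers; deriving them with SHA-1 (which yields exactly 160 bits) is an assumption for this example, as are the function names and the toy 4-bit IDs.

```python
import hashlib


def make_id(data: bytes) -> int:
    """Derive a 160-bit identifier (assumption: SHA-1 of some input)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def xor_distance(x: int, y: int) -> int:
    """d(x, y) = x XOR y, interpreted as an unsigned integer."""
    return x ^ y


def closest_nodes(target: int, node_ids: list[int], k: int) -> list[int]:
    """Return the k node IDs with the smallest XOR distance to the target key."""
    return sorted(node_ids, key=lambda node_id: xor_distance(node_id, target))[:k]


# Toy example with 4-bit IDs so the distances are easy to check by hand.
nodes = [0b1000, 0b1011, 0b0010, 0b1110]
print(closest_nodes(0b1010, nodes, k=2))  # [11, 8]
```

In the toy example the distances to key 1010 are 2, 1, 8 and 4, so the nodes 1011 and 1000 would store the key. The metric is symmetric and d(x, x) = 0, which the integer XOR makes easy to verify.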
Kademlia
Attribute          Rating                Notes
Data Update        active republication
Scalability        +                     Storage scales lin. with #nodes; lookup time scales with O(log_2 n)
Performance        ++                    Parallel redundant requests, efficient caching
Consistency        +/0                   Deleted information max. 24h in system
Redundancy         +                     <key, value> pairs stored in >=k nodes
Overhead           -                     Optimized process republishing <key, value> pairs every hour
Attack Resistance  +                     Protection against node flooding
Content Delivery Networks
First idea: Proxies for caching static content
Internet now very dynamic, hit rates for proxies low (25-40%) [9]
Content Delivery Networks (CDNs) more elaborate
Content Delivery Networks
[Figure; source: [10]]
Akamai
>240,000 servers
>130 countries
Within >1,700 networks [10]
Handle flash crowds by allocating more servers to the sites that need them at the moment
Server nearby for low latency and low packet loss
Akamai: Tiered Distribution
[Figure; source: [11]]
Akamai: Attributes (1)
Scalability
• Large amount of data available at high speed within network
Performance
• Overlay network (speed improvements up to 30-50%)
• Sometimes Border Gateway Protocol is not optimal
• Increased reliability by offering alternate routes
• Reduced packet loss by sending packet through parallel routes
• Forward error correction techniques
• Transport protocol optimisations over TCP
• Application level optimisations (content compression, application logic on edge servers)
Consistency
• Standard techniques for caching (TTLs, versioned URLs)
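The two standard caching techniques named under Consistency can be sketched generically. This is not Akamai's implementation: the class `TTLCache`, the function `versioned_url`, and the `?v=` query parameter are all assumptions chosen for illustration.

```python
import hashlib
import time


class TTLCache:
    """Edge cache that serves an entry until its time-to-live has expired."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._entries = {}          # url -> (expires_at, body)

    def get(self, url: str, fetch_from_origin) -> bytes:
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: no round trip to the origin
        body = fetch_from_origin(url)
        self._entries[url] = (now + self.ttl, body)
        return body


def versioned_url(path: str, content: bytes) -> str:
    """Embed a content hash in the URL so changed content gets a new cache key."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"
```

With TTLs, stale content can be served for at most one TTL after a change. With versioned URLs, a changed object is published under a new URL, so caches never need to be invalidated and very long TTLs become safe.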
Akamai: Attributes (2)
Redundancy
• Dependent on customer’s needs
• Tiered distribution provides balance between redundancy and fast availability
Overhead
• Proprietary overlay network
• Claim to have improvements over TCP (reduced setup & teardown time per connection)
Attack Resistance
• No attackers within closed network
• Engineered for high failure rate of network and equipment
Akamai
Highly distributed nature fundamental to high performance
Entire communication within overlay network optimised, two small hops on either end should not matter
Attribute          Rating            Notes
Data Update        dep. on customer
Scalability        ++                >240,000 nodes
Performance        ++                Optimised transfer speed in network
Consistency        0                 Based on TTL/versioned URLs, but dep. on customer
Redundancy         ++                Tiered distribution allows for any degree
Overhead           0                 Improvements over TCP, dep. on customer
Attack Resistance  ++                Only attacks from outside possible, high recovery rate
Distributed Databases
Sharding: partition data over many servers
Redundancy
Online Transaction Processing (OLTP): each query may be handled by a different node
Online Analytical Processing (OLAP): network can work together on a single query
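One common way to realise sharding with redundancy is to hash the record key and place copies on consecutive shards. This is a sketch under that assumption; the slide does not prescribe a scheme, and `shard_for` and `replica_shards` are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    # Python's built-in hash() is randomised per process, so use SHA-256 instead.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replica_shards(key: str, num_shards: int, replicas: int) -> list[int]:
    """Primary shard plus the following shards, for redundancy."""
    first = shard_for(key, num_shards)
    count = min(replicas, num_shards)   # cannot have more copies than shards
    return [(first + i) % num_shards for i in range(count)]
```

A drawback of plain modulo placement is that changing `num_shards` moves most keys to a different shard; schemes such as consistent hashing reduce that movement.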
CouchDB
Document storage
NoSQL database
• Non relational
• not only SQL (SQL-like query languages supported)
• Usually worse consistency, better
Documents: number of fields and attachments
Documents are versioned
Use case: multiple offline clients, synchronise upon reconnection