15-441/641: Content Delivery and Peer-to-Peer (Spring 2019, Profs. Peter Steenkiste & Justine Sherry)


  1. 15-441/641: Content Delivery and Peer-to-Peer
     Spring 2019, Profs. Peter Steenkiste & Justine Sherry
     https://computer-networks.github.io/sp19/

     Problem: Scaling Content Delivery
     • Millions of clients → server and network meltdown

     Outline
     • Peer-to-peer
     • Overlays: naming, addressing, and routing
     • CDNs

     P2P Systems
     • Leverage the resources of client machines (peers): computation, storage, bandwidth

  2. P2P Definition
     "Distributed systems consisting of interconnected nodes able to self-organize into network topologies with the purpose of sharing resources such as content, CPU cycles, storage and bandwidth, capable of adapting to failures and accommodating transient populations of nodes while maintaining acceptable connectivity and performance, without requiring the intermediation or support of a global centralized server or authority."
     – A Survey of Peer-To-Peer Content Distribution Technologies, Androutsellis-Theotokis and Spinellis

     Why p2p?
     • Harness lots of spare capacity: 1 big fast server at $10k/month++ versus 1000s .. 1,000,000s of clients at $?? Capacity grows with the number of users!
     • Build very large-scale, self-managing systems
     • Same techniques are useful for companies, e.g., Akamai's 14,000+ nodes, Google's 100,000+ nodes. But: servers vs. arbitrary nodes, hard vs. soft state (backups vs. caches), … Also: security, fairness, freeloading, …
     • No single point of failure: some nodes go down – others take over; a government shuts down nodes – peers in other countries are available

     P2P Construction Key Idea: Network Overlay
     • A network overlay is a network that is layered on top of the Internet
     • Simplified picture: overlays use IP as their datalink layer
     • Overlays need the equivalent of all the functions IP networks need: naming and addressing, routing, bootstrapping, security, error recovery, etc.
     (Figure: a P2P overlay of clients and servers layered over ISPs such as Sprint, Verizon, AT&T, and CMU)
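The overlay idea above can be sketched in a few lines: overlay "links" are just pointers to other nodes' IP addresses, so the overlay reuses IP as its datalink layer. This is a minimal illustration only; the class, names, and the trivial routing rule are hypothetical, and real overlays do far more.

```python
# Minimal sketch of an overlay node (hypothetical names): the overlay-level
# name is separate from the underlying IP address, and overlay neighbors are
# reached directly over the Internet.
class OverlayNode:
    def __init__(self, overlay_id, ip):
        self.overlay_id = overlay_id   # overlay-level name
        self.ip = ip                   # underlying Internet address
        self.neighbors = {}            # overlay_id -> IP of overlay neighbor

    def add_neighbor(self, node):
        self.neighbors[node.overlay_id] = node.ip

    def next_hop(self, dest_id):
        """Trivial routing: direct neighbor or give up (real overlays do more)."""
        return self.neighbors.get(dest_id)

a, b = OverlayNode("A", "128.2.0.1"), OverlayNode("B", "192.168.1.5")
a.add_neighbor(b)
print(a.next_hop("B"))   # 192.168.1.5
```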

  3. Names, Addresses, and Routing
     The Internet:
     • Endpoint: host
     • Name: hierarchical domain name
     • Address: IP address
     • Routing: how to reach the host, e.g., BGP, …
     Content retrieval:
     • Endpoint: content
     • Name: identifies the content you are looking for, e.g., hash of file, key words
     • Address: the IP address of a node that has the content, plus the content name
     • Routing: how to find the data

     Common P2P Framework
     • A new peer Joins the overlay (nodes N1 .. N6)
     • Publish: Key="title", Value=MP3 data…
     • Search: a client issues Lookup("title")
     • Fetch: retrieve the content from a peer that has it

     Napster: Central Database
     • Join: contact the central server
     • Publish: insert(X, 123.2.21.23) ("I have X, Y, and Z!")
     • Search: search(A) asks "Where is file A?"; reply: File --> 123.2.0.18
     • Fetch: download directly from 123.2.0.18

     What is (was) out there?

                     Central      Flood       Super-node flood            Route
       Whole file    Napster      Gnutella                                Freenet
       Chunk based   BitTorrent               KaZaA (bytes, not chunks),  DHTs
                                              eDonkey 2000
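The Napster design above reduces to a single server-side index. As a minimal sketch (class and method names are hypothetical, not Napster's actual protocol), the server maps content names to the peers that hold them: search is one O(1) lookup, but the server carries O(N) state and is a single point of failure.

```python
# Napster-style central index: peers publish their file lists on join,
# clients ask the server where a file lives, then fetch peer-to-peer.
from collections import defaultdict

class CentralIndex:
    def __init__(self):
        self.index = defaultdict(set)   # content name -> set of peer addresses

    def publish(self, peer_addr, files):
        """Peer announces 'I have X, Y, and Z!' on join."""
        for name in files:
            self.index[name].add(peer_addr)

    def search(self, name):
        """'Where is file A?' -> addresses of peers to fetch from."""
        return sorted(self.index.get(name, set()))

server = CentralIndex()
server.publish("123.2.21.23", ["X", "Y", "Z"])
server.publish("123.2.0.18", ["A"])
print(server.search("A"))   # ['123.2.0.18']
```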

  4. Napster: Discussion
     • Pros: simple; search scope is O(1); controllable (pro or con?)
     • Cons: server maintains O(N) state; server does all the processing; single point of failure

     Gnutella: Flooding
     • Join: contact peers
     • Publish: no-op
     • Search: flood the query "Where is file A?" to neighbors, who forward it; a peer with the file replies "I have file A."
     • Fetch: direct p2p

     Gnutella: Discussion
     • Pros: fully decentralized; search cost is distributed; processing at each node permits powerful search semantics
     • Cons: search scope is O(N); search time is O(???); nodes leave often, so the network is unstable
     • TTL-limited search works well for haystacks: for scalability it does NOT search every node, so you may have to re-issue the query later

     KaZaA: Query Flooding
     • First released in 2001 and also very popular
     • Join: on startup, a client contacts a "supernode" ... and may at some point become one itself
     • Publish: send the list of files to the supernode
     • Search: send the query to the supernode; supernodes flood the query amongst themselves
     • Fetch: get the file directly from peer(s); can fetch simultaneously from multiple peers
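The Gnutella search above can be sketched as a TTL-limited flood over a toy peer graph (all names and structures below are illustrative, not Gnutella's wire protocol). Each peer forwards the query to its neighbors until the TTL runs out, which caps the search scope but means distant copies may be missed.

```python
# Toy Gnutella-style flooding: worst case O(N) messages, TTL limits reach.
def flood_search(peers, start, name, ttl):
    """peers: {peer_id: {'files': set, 'neighbors': [peer_id, ...]}}"""
    hits, seen = [], set()
    frontier = [(start, ttl)]
    while frontier:
        node, t = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        if name in peers[node]["files"]:
            hits.append(node)          # "I have file A." reply
        if t > 0:                      # TTL limits how far the flood spreads
            frontier.extend((n, t - 1) for n in peers[node]["neighbors"])
    return hits

peers = {
    "p1": {"files": set(),  "neighbors": ["p2", "p3"]},
    "p2": {"files": {"A"},  "neighbors": ["p1", "p4"]},
    "p3": {"files": set(),  "neighbors": ["p1"]},
    "p4": {"files": {"A"},  "neighbors": ["p2"]},
}
print(flood_search(peers, "p1", "A", ttl=1))   # ['p2']: p4 is out of TTL range
```

With `ttl=2` the flood also reaches p4, illustrating why a too-small TTL forces the query to be re-issued later.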

  5. KaZaA: Intelligent Query Flooding
     • Group of servers: "Super Nodes"
     • Gnutella-style flooding among the supernodes
     • Napster-style client-server model between clients and their supernode

     KaZaA: Discussion
     • Works better than Gnutella because of query consolidation
     • Several nodes may have the requested file ... how to tell? Must be able to distinguish identical files: the same filename is not necessarily the same file, so use a hash of the file
     • Can fetch bytes [0..1000] from A and [1001..2000] from B
     • Pros: tries to take node heterogeneity into account: bandwidth, computational resources, …
     • Cons: still no guarantees on search scope or time
     • Challenge: want stable superpeers (good prediction); they must also be capable platforms

     BitTorrent: Swarming
     • Started in 2001 to efficiently support flash crowds
     • Focus is on fetching, not searching
     • Comparison with earlier architectures: focus on fetching "few large files"; chunk-based downloading; anti-freeloading mechanisms

     BitTorrent: Publish/Join
     • Publish: run a tracker server
     • Search: find a tracker for the file out-of-band, e.g., via Google
     • Join: contact the central "tracker" server for a list of peers
     • Fetch: download chunks of the file from your peers; upload chunks you have to them
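The two KaZaA ideas above, identifying identical files by content hash rather than filename and splitting one download across peers by byte range, can be sketched briefly. The helper names and the choice of SHA-256 are illustrative (KaZaA used its own hash), but the technique is the same.

```python
# Distinguish identical files by hashing their bytes, and split a download
# into contiguous byte ranges served by different peers.
import hashlib

def file_id(data: bytes) -> str:
    """Same bytes -> same ID, regardless of filename."""
    return hashlib.sha256(data).hexdigest()

def byte_ranges(size: int, n_peers: int):
    """Split [0, size) into contiguous half-open ranges, one per peer."""
    step = -(-size // n_peers)   # ceiling division
    return [(lo, min(lo + step, size)) for lo in range(0, size, step)]

data = b"same content, two different filenames"
assert file_id(data) == file_id(data)   # identical files match by content
print(byte_ranges(2001, 2))             # [(0, 1001), (1001, 2001)]
```

The two ranges correspond to "fetch bytes [0..1000] from A, [1001..2000] from B" on the slide, written as half-open intervals.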

  6. BitTorrent: Summary
     • Pros: works reasonably well in practice; gives peers an incentive to share resources, which avoids freeloaders
     • Cons: Pareto efficiency is a relatively weak condition; a central tracker server is needed to bootstrap the swarm
     • (The tracker is a design choice, not a requirement, as you know from your projects; it could easily be combined with other approaches.)

     When is p2p Useful?
     • Works well for caching and "soft-state", read-only data: BitTorrent, KaZaA, etc., all use peers as caches for hot data
     • Difficult to extend to persistent data: nodes come and go, so you need to create multiple copies for availability and replicate more as nodes leave
     • Not appropriate for search-engine style searches: complex intersection queries ("the" + "who") yield billions of hits for each term alone; sophisticated ranking must compare many results before returning a subset to the user; needs massive compute power

     Outline
     • Peer-to-peer
     • Overlays, naming, …
     • CDNs
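BitTorrent's chunk-based fetching is commonly paired with rarest-first piece selection, which keeps every chunk alive in the swarm as peers come and go. This toy sketch (data structures are hypothetical, not the real wire protocol) picks the chunk held by the fewest neighbors among those we still need.

```python
# Rarest-first chunk selection: count how many neighbors hold each chunk,
# then request the least-replicated chunk we do not yet have.
from collections import Counter

def rarest_first(have, neighbors_have):
    """have: chunk ids this peer holds; neighbors_have: list of neighbor sets."""
    counts = Counter()
    for chunks in neighbors_have:
        counts.update(chunks)
    wanted = [(counts[c], c) for c in counts if c not in have]
    return min(wanted)[1] if wanted else None   # rarest available chunk

mine = {0}
neighbors = [{0, 1, 2}, {1, 2}, {2}]
print(rarest_first(mine, neighbors))   # 1: held by 2 peers, chunk 2 by 3
```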

  7. Content Delivery: Possible Bottlenecks
     • First-mile problem (server to its ISP), backbone problem, peering problem (between ISPs), last-mile problem (ISP to end-user host)

     Improving HTTP Performance: Caching with Forward Proxies
     • Cache documents close to clients → decrease latency
     • Typically done by ISPs or enterprises → reduce provider traffic load
     • CDNs proactively cache for the content providers (their clients)
     • Typically cache at different levels of the Internet hierarchy: last-mile ISPs for low latency; closer to the core for broader coverage

     What is the CDN?
     • Edge caches: work with ISPs and networks everywhere to install edge caches (edge = close to customers)
     • Content delivery: getting content to the edge caches; content can be objects, video, or entire web sites
     • Mapping: find the "closest" edge server for each user and deliver content from that server; network proximity is not the same as geographic proximity; the focus is on performance as observed by the user (quality)

     Potential Benefits
     • Very good scalability: near infinite if deployed properly; good economies at large scales; infrastructure is shared efficiently by customers (statistical multiplexing: hot sites use more resources)
     • Can reduce latency and give more predictable performance: through mapping to the closest server; avoids congestion and long latencies
     • Can be extremely reliable: very high degree of redundancy; can mitigate some DoS attacks
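The "mapping" step above boils down to ranking candidate edge servers by a network metric rather than by geography. A minimal sketch, with made-up server names and RTTs, might pick the server with the lowest measured round-trip time:

```python
# CDN mapping sketch: choose the edge server "closest" to the user by a
# network metric (RTT here), since network proximity, not geographic
# proximity, is what determines user-observed performance.
def closest_edge(rtts_ms):
    """rtts_ms: {server: measured RTT in ms} -> server with the lowest RTT."""
    return min(rtts_ms, key=rtts_ms.get)

measured = {"edge-nyc": 12.0, "edge-sfo": 80.0, "edge-lon": 95.0}
print(closest_edge(measured))   # edge-nyc
```

Real CDNs combine many signals (load, link cost, availability), but the selection principle is the same.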
