CS 754 Advanced Distributed Systems


  1. CS 754 Advanced Distributed Systems: Introduction to Data Centers

  2. Data Center Overview. Why a DC? Economy of scale (amortize capital and maintenance costs). Hierarchy: machines -> racks -> cluster.

  3. Design Metrics • Performance (requests per second) • Cost, capital and operational (requests per dollar) • Power (requests per Watt)

  4. DC Node Design Option 1: SMP (symmetric multiprocessor). A shared-memory multiprocessor: a set of CPUs, each with its own cache, sharing main memory over a single bus. + High performance per node - Expensive

  5. DC Node Design Option 2: Commodity nodes. Built from off-the-shelf components. + Equal performance to SMP at scale + Lower cost - Fails more often

  6. SMP vs Commodity. Execution time = CPU time + communication time. Assume a local access takes 100 ns and a remote access takes 100 µs. Communication time = #operations * [100 ns * (1/#nodes) + 100 µs * (1 - 1/#nodes)], where the first term covers local accesses and the second covers remote accesses.

  7. SMP vs Commodity (continued). With the same model, the remote-access term quickly dominates as the number of nodes grows, so a commodity cluster pays heavily for communication (see the sketch below).
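A minimal sketch of the communication-time model above, using the 100 ns / 100 µs figures from the slide; the function and variable names are our own:

```python
# Communication-time model from the slide: a 1/#nodes fraction of accesses
# stays local (100 ns), the rest goes to a remote node (100 us).

LOCAL_NS = 100          # assumed local access latency (nanoseconds)
REMOTE_NS = 100_000     # assumed remote access latency (nanoseconds)

def comm_time_ns(num_ops: int, num_nodes: int) -> float:
    local_fraction = 1.0 / num_nodes
    per_op = LOCAL_NS * local_fraction + REMOTE_NS * (1.0 - local_fraction)
    return num_ops * per_op

if __name__ == "__main__":
    for nodes in (1, 2, 8, 64):
        ms = comm_time_ns(1_000_000, nodes) / 1e6
        print(f"{nodes:3d} nodes: {ms:10.1f} ms for 1M memory operations")
```

With one node everything is local (100 ms for a million operations); already at two nodes the remote accesses push the same work above 50 seconds, which is the argument for keeping per-node performance high or keeping communication off the critical path.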

  8. DC Node Design Option 3: Wimpy nodes. Using low-end CPUs (e.g., ARM processors). + Lower cost + Lower energy - Hard to use efficiently

  9. DC Node Design: Wimpy design disadvantages • Amdahl's law bounds the speed-up: task execution time is T = (1-p)T + pT, where p is the fraction of the code that can run in parallel (0 ≤ p ≤ 1). After parallelization on s cores: T' = (1-p)T + (p/s)T. Speed-up = T/T' = 1/((1-p) + p/s); if s --> infinity, the speed-up approaches 1/(1-p). • Higher number of threads --> higher serialization/communication cost • Harder to program --> higher software cost • Higher networking cost • Lower utilization. For I/O-intensive workloads (e.g., Google's workloads), commodity machines are the better choice.
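A small sketch of the Amdahl's-law bound above; it shows why piling on wimpy cores helps little once the serial fraction matters (the p values are assumed examples):

```python
# Amdahl's law: speed-up on s cores = 1 / ((1 - p) + p / s),
# where p is the parallelizable fraction of the work.

def amdahl_speedup(p: float, s: int) -> float:
    return 1.0 / ((1.0 - p) + p / s)

if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(f"p={p:4}: 16 cores -> {amdahl_speedup(p, 16):6.2f}x, "
              f"limit as s -> inf: {1.0 / (1.0 - p):6.1f}x")
```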

  10. Storage Design. Design paradigms: • NAS: network-attached storage, a dedicated storage appliance • Distributed storage: aggregate the storage space of the nodes in the cluster. Design dimensions: • Reliability: replication or erasure coding (e.g., Reed-Solomon) • Cost: use cheap disks; they fail more often, but we replicate anyway • Consistency: varies depending on the application
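As a rough illustration of the reliability dimension, a sketch comparing the storage overhead of 3-way replication with a Reed-Solomon code; the RS(10, 4) parameters are an assumed example, not from the slides:

```python
# Raw bytes stored per user byte under the two reliability schemes above.
# RS(10, 4) is an assumed example configuration.

def replication_overhead(copies: int) -> float:
    # n-way replication stores n full copies of the data.
    return float(copies)

def rs_overhead(data_blocks: int, parity_blocks: int) -> float:
    # A Reed-Solomon (k data, m parity) code stores k + m blocks per k data blocks.
    return (data_blocks + parity_blocks) / data_blocks

print(f"3-way replication: {replication_overhead(3):.1f}x, survives 2 lost copies")
print(f"RS(10, 4):         {rs_overhead(10, 4):.1f}x, survives any 4 lost blocks")
```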

  11. Storage Design Option 1: Network-attached storage (NAS). A dedicated storage appliance. + Simpler deployment + Control and management (QoS) + Lower network overhead (replication is handled inside the appliance)

  12. Storage Design Option 2: Distributed storage: aggregate the storage space of the nodes in the cluster. Reduce cost by using cheap disks: they fail more often, but we will replicate anyway. + Lower cost + Higher availability + Higher performance + Higher data locality - Higher network overhead - Lower component reliability

  13. Storage Design: NAS vs. distributed storage (GFS). NAS: simpler deployment; control and management (QoS); lower network overhead (replication is handled inside the appliance). Distributed storage (GFS): lower cost; higher availability; higher performance; data locality (at different levels and technologies); but higher write network overhead.

  14. Network Design. Challenge: build a high-speed, scalable network at low cost. Optimization tricks: - Reduce core bandwidth: an oversubscription ratio of 5:1 is common - Use multiple networks (SAN, supercomputer example)

  15. DC Design Implications. Software using the DC needs to be aware of the storage hierarchy (Jeff Dean).

  16. Example: storage hierarchy numbers
      Data location        Latency   Throughput
      RAM                  100 ns    20 GBps
      Hard disk            10 ms     80 MBps
      Network, same rack   70 µs     128 MBps (1 Gbps)
      Network, across DC   500 µs    25 MBps (oversubscription ratio of 5:1)
      [Chart: latency and bandwidth for RAM and disk at the local node, rack, and DC levels]
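A small sketch that turns the table above into transfer-time estimates (latency + size / bandwidth), using the slide's numbers:

```python
# Back-of-the-envelope read times across the storage hierarchy,
# using the latency/throughput figures from the table above.

TIERS = {
    # name: (latency in seconds, throughput in bytes per second)
    "RAM":                (100e-9, 20e9),
    "hard disk":          (10e-3,  80e6),
    "network, same rack": (70e-6,  128e6),
    "network, across DC": (500e-6, 25e6),
}

def read_time(tier: str, size_bytes: float) -> float:
    latency, bandwidth = TIERS[tier]
    return latency + size_bytes / bandwidth

for tier in TIERS:
    print(f"read 1 MB from {tier:18s}: {read_time(tier, 1e6) * 1e3:7.2f} ms")
```

Reading a megabyte from local RAM costs well under a millisecond, while the same read over the oversubscribed DC network costs roughly 40 ms, slower than a local disk read, which is why locality-aware placement matters.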

  17. Example (Jeff Dean)

  18. Example (Jeff Dean)

  19. DC Design Implications • Software using the DC needs to be aware of the network and storage hierarchy • Software fault tolerance is necessary • Programming frameworks to hide complexity • Technology changes: much more memory; new disks (shingled, Kinetic, PCIeNV); SSD, NVM; SDN networks; programmable NICs and switches; faster networks

  20. Large Scale Services. Two categories: - Online, e.g., e-commerce, instant messaging • Low latency • Highly available • Mostly read operations - Offline, batch processing, e.g., data processing • Compute- and I/O-intensive • Throughput-centric

  21. Model

  22. Load Manager • DNS-based - May take hours to adapt - Not available to small clusters • Appliance or switch (L4) • Smart client (L7). Load-balancing techniques: • Round robin • Least number of connections • Response time • Source IP hash • SDN-based • Chained failover
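A minimal sketch of two of the load-balancing techniques above (round robin and least connections); the backend pool and its fields are our own illustration, not any specific load balancer's API:

```python
import itertools

# Hypothetical backend pool; names and fields are illustrative only.
backends = [{"name": f"web{i}", "active_conns": 0} for i in range(3)]
_rotation = itertools.cycle(backends)

def pick_round_robin() -> dict:
    # Round robin: hand requests to backends in a fixed rotation.
    return next(_rotation)

def pick_least_connections() -> dict:
    # Least connections: send the request to the currently least-loaded backend.
    return min(backends, key=lambda b: b["active_conns"])

# Example: once web0 is busy, least-connections steers traffic away from it.
backends[0]["active_conns"] = 10
print("round robin       ->", pick_round_robin()["name"])
print("least connections ->", pick_least_connections()["name"])
```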

  23. High Availability Metric (uptime): the percentage of time the system is available to answer client requests. |Fail|--Recover--|------------available------------|Fail|--Recover--|--- Uptime = (MTBF - MTTR) / MTBF. MTBF: mean time between failures. MTTR: mean time to repair.

  24. High Availability. Uptime = (MTBF - MTTR) / MTBF. Brewer's recommendation: do your best to improve MTBF, but focus on reducing MTTR. Why? • Improving MTBF needs weeks of testing • MTTR is easier to improve, and easier to debug and measure. Problem with uptime: not all seconds are equal (idle vs. peak time).
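A quick sketch of the uptime formula and of why shrinking MTTR pays off; the MTBF and MTTR hour figures are assumed examples:

```python
# Uptime = (MTBF - MTTR) / MTBF.  The hour figures below are assumed examples.

def uptime(mtbf_hours: float, mttr_hours: float) -> float:
    return (mtbf_hours - mttr_hours) / mtbf_hours

# Same failure rate (roughly one failure a month), different repair times:
print(f"MTTR 4 h  : {uptime(720, 4):.5f}")    # ~0.99444 (about 99.4% uptime)
print(f"MTTR 0.5 h: {uptime(720, 0.5):.5f}")  # ~0.99931 (about 99.9% uptime)
```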

  25. High Availability. Yield = queries completed / queries offered. Harvest = data available / complete data. DQ principle: data per query (D) x queries per second (Q) --> roughly constant; the underlying limit is data movement (seeks, I/O bandwidth, etc.). Good for: • Comparing systems • Deciding on upgrades • Measuring the effect of failures
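A sketch of the DQ bookkeeping: with the DQ product roughly fixed by the hardware, losing capacity forces a trade between harvest (D) and yield (Q). All figures are assumed examples:

```python
# DQ principle: data per query (D) * queries per second (Q) is roughly constant.
# The figures below are assumed examples.

D_FULL = 100.0    # data units touched per query at full harvest
Q_FULL = 1000.0   # queries per second at full yield
DQ = D_FULL * Q_FULL

# Losing 20% of the cluster removes ~20% of DQ; spend the loss on Q or on D:
dq_left = 0.8 * DQ
print(f"keep harvest (D={D_FULL:.0f}): Q = {dq_left / D_FULL:.0f} qps   -> yield drops to 80%")
print(f"keep yield (Q={Q_FULL:.0f}):   D = {dq_left / Q_FULL:.0f} units -> harvest drops to 80%")
```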

  26. Graceful Degradation: degrade service under overload instead of failing completely. Overload will happen: single-event bursts, a peak-to-average ratio of 6:1, failures. Techniques (sketched below): • Limit D (partial results) and maintain Q • Limit Q (admission control) and maintain D • QoS, cost-based admission • Priorities • Reduce data quality (freshness)
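A minimal sketch of the first two knobs (limit D vs. limit Q) as a request handler might apply them; the thresholds, shard names, and helper are hypothetical illustrations:

```python
# Graceful degradation sketch: under overload, either reduce D (serve partial
# results) or reduce Q (admission control).  Everything here is hypothetical.

CAPACITY_QPS = 1000
SHARDS = [f"shard{i}" for i in range(10)]

def search(query: str, shards: list) -> str:
    # Stand-in for real work: report which fraction of the data was searched.
    return f"{query!r}: searched {len(shards)}/{len(SHARDS)} shards"

def handle(query: str, current_qps: int) -> str:
    if current_qps <= CAPACITY_QPS:
        return search(query, SHARDS)        # full D, full Q
    if current_qps <= 2 * CAPACITY_QPS:
        return search(query, SHARDS[:5])    # limit D: partial results, keep Q
    return "overloaded: request rejected"   # limit Q: admission control, keep D

for qps in (800, 1500, 5000):
    print(qps, "qps ->", handle("cats", qps))
```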

  27. Evolution. Perfect software is hard, costly, and takes a long time. Aim for software that handles failures well (high MTBF, low MTTR, no cascading failures). Other bugs are less critical: memory leaks, slowness, etc. (try throwing more hardware at them). Reasoning: upgrades are controlled failures; do them off-peak. Strategies (all have the same DQ loss over time; see the sketch below): • Fast reboot of all cluster nodes: easier (jump between versions), risky (could be buggy), downtime • Rolling upgrade: 5% at a time; more complex (two versions run at the same time), slow • Big flip: jump from one version to the other half-a-cluster at a time. Rolling upgrade is the most popular.
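A back-of-the-envelope check of the claim that the three strategies lose the same DQ over time (fraction of capacity offline times time offline); the assumption here is that upgrading any 5% slice of the cluster takes one unit of time:

```python
# DQ lost during an upgrade = fraction of capacity offline * time offline.
# Assumption: upgrading any 5% slice of the cluster takes 1 time unit.

def dq_loss(capacity_offline: float, duration: float) -> float:
    return capacity_offline * duration

fast_reboot = dq_loss(1.00, 1)    # whole cluster down once, upgraded in parallel
rolling     = dq_loss(0.05, 20)   # 5% down at a time, for 20 steps
big_flip    = dq_loss(0.50, 2)    # half the cluster down, twice

print(fast_reboot, rolling, big_flip)   # 1.0 1.0 1.0: same total DQ loss
```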

  28. Replication vs. Partitioning. Replication --> higher harvest; partitioning --> higher yield. E.g., a two-node cluster where one node fails: Replication: 100% harvest, 50% yield (but replication needs more DQ for writes). Partitioning: 50% harvest, 100% yield. Same DQ value (lower by 50%). Since capacity is not the issue (capacity is cheap), use replication: better harvest, affects yield only under heavy load, easier to manage, scales, easier disaster recovery.
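A sketch of the two-node example above; the DQ product halves either way, and the difference is only whether the loss shows up as harvest or as yield:

```python
# Two-node cluster, one node fails (the example from the slide).

def replication(nodes_up: int, nodes_total: int) -> dict:
    # Every node holds all the data: harvest stays 100%, yield shrinks.
    return {"harvest": 1.0, "yield": nodes_up / nodes_total}

def partitioning(nodes_up: int, nodes_total: int) -> dict:
    # Each node holds a disjoint slice: yield stays 100%, harvest shrinks.
    return {"harvest": nodes_up / nodes_total, "yield": 1.0}

for name, r in (("replication", replication(1, 2)),
                ("partitioning", partitioning(1, 2))):
    print(f"{name:13s} harvest={r['harvest']:.0%} yield={r['yield']:.0%} "
          f"DQ={r['harvest'] * r['yield']:.0%}")
```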
