CS 754 Advanced Distributed Systems: Introduction to Data Centers
Data Center Overview
Why a data center? Economy of scale (amortize capital and maintenance costs).
Machines -> Racks -> Cluster
Design Metrics
• Performance (requests per second)
• Cost (capital and operational) (requests per dollar)
• Power (requests per Watt)
DC Node Design Option 1: SMP (Symmetric Multi-Processor)
Shared-memory multiprocessor: a set of CPUs, each with its own cache, sharing main memory over a single bus.
+ High performance per node
- Expensive
DC Node Design Option 2: Commodity nodes
Built from off-the-shelf components.
+ Performance comparable to SMP at scale
+ Lower cost
- Fails more often
SMP vs Commodity
Execution time = CPU time + communication time
Assume a local access takes 100 ns and a remote access takes 100 μs.
Communication time = #operations × [100 ns × (1/#nodes) + 100 μs × (1 − 1/#nodes)]
(the first term is the local-access cost, the second the remote-access cost)
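A minimal sketch of the model above, assuming the 100 ns local and 100 μs remote access times from the slide; the operation count and node counts are illustrative.

```python
LOCAL_NS = 100           # local (shared-memory) access: 100 ns
REMOTE_NS = 100_000      # remote access over the network: 100 μs

def comm_time_ns(num_ops, num_nodes):
    """Expected communication time: a 1/num_nodes fraction of accesses stays local."""
    local_frac = 1.0 / num_nodes
    return num_ops * (LOCAL_NS * local_frac + REMOTE_NS * (1.0 - local_frac))

if __name__ == "__main__":
    ops = 1_000_000
    for nodes in (1, 2, 8, 64):
        print(f"{nodes:>3} nodes: communication time ≈ {comm_time_ns(ops, nodes) / 1e6:10.1f} ms")
```

With one node everything stays local (the SMP-like case); as the node count grows, almost every access pays the remote cost, which is the point of the comparison.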
DC Node Design Option 3: Wimpy nodes
Using low-end CPUs (e.g., ARM processors).
+ Lower cost
+ Lower energy
- Hard to use efficiently
DC Node Design Wimpy design disadvantages
• Amdahl's law bound (see the sketch below): task execution time is T = (1−p)T + pT, where p is the fraction of the code that can run in parallel (0 ≤ p ≤ 1). After parallelization on s cores: T' = (1−p)T + (p/s)T. Speed-up = T/T' = 1/((1−p) + p/s). As s → ∞, the speed-up approaches 1/(1−p).
• Higher number of threads -> higher serialization/communication cost
• Harder to program -> higher software cost
• Higher networking cost
• Lower utilization
For I/O-intensive workloads (e.g., Google's workloads), commodity machines are the better choice.
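A short worked version of the Amdahl's-law bound above; the parallel fractions and core counts are illustrative.

```python
def amdahl_speedup(p, s):
    """Speed-up = T / T' = 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)

# Even with many wimpy cores, a small serial fraction caps the speed-up.
for p in (0.5, 0.9, 0.99):
    for s in (4, 64, 1024):
        print(f"p={p:<5} s={s:>5}: speed-up = {amdahl_speedup(p, s):6.2f}")
    print(f"p={p:<5} limit (s -> inf): {1.0 / (1.0 - p):6.2f}")
```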
Storage Design
Design paradigms:
• NAS: network-attached storage, a dedicated storage appliance
• Distributed storage: aggregate the storage space of the nodes in the cluster
Design dimensions:
• Reliability: replication or erasure coding (e.g., Reed-Solomon codes); see the overhead sketch below
• Cost: use cheap disks; they fail more often, but we replicate anyway
• Consistency: varies depending on the application
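A rough comparison of the storage overhead of the two reliability options above; 3-way replication and a Reed-Solomon RS(10, 4) layout are assumed example parameters, not values from the slides.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with n-way replication."""
    return float(copies)

def rs_overhead(data_blocks, parity_blocks):
    """Raw bytes per user byte with Reed-Solomon coding (k data + m parity blocks)."""
    return (data_blocks + parity_blocks) / data_blocks

print(f"3-way replication:        {replication_overhead(3):.1f}x raw storage")
print(f"RS(10, 4) erasure coding: {rs_overhead(10, 4):.1f}x raw storage")
```

Erasure coding stores far less raw data for comparable fault tolerance, but rebuilding a lost block touches many nodes, which is one reason hot data is often replicated while colder data is erasure-coded.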
Storage Design Option 1: Network-attached storage (NAS)
A dedicated storage appliance.
+ Simpler deployment
+ Control and management (QoS)
+ Lower network overhead (replication happens inside the appliance)
Storage Design Option 2: Distributed storage
Aggregate the storage space of the nodes in the cluster. Reduce cost by using cheap disks: they fail more often, but we replicate anyway.
+ Lower cost
+ Higher availability
+ Higher performance
+ Higher data locality
- Higher network overhead
- Lower component reliability
Storage Design: NAS vs. Distributed Storage (GFS)
NAS:
+ Simpler deployment
+ Control and management (QoS)
+ Lower network overhead (replication inside the appliance)
Distributed storage (GFS):
+ Lower cost
+ Higher availability
+ Higher performance
+ Data locality (at different levels and with different technologies)
- Higher write network overhead
Network Design
Challenge: build a high-speed, scalable network at low cost.
Optimization tricks:
- Reduce core bandwidth: a 5:1 oversubscription ratio is common (see the sketch below)
- Use multiple networks (e.g., a separate SAN, as in supercomputers)
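A small sketch of what a 5:1 core oversubscription ratio means for per-server bandwidth; the rack size and NIC speed are assumed example values.

```python
SERVERS_PER_RACK = 40     # assumed rack size
NIC_GBPS = 1.0            # assumed 1 Gbps server links
OVERSUBSCRIPTION = 5.0    # 5:1 ratio from the slide

edge_gbps = SERVERS_PER_RACK * NIC_GBPS        # bandwidth into the rack switch
core_gbps = edge_gbps / OVERSUBSCRIPTION       # uplink capacity toward the core
per_server = core_gbps / SERVERS_PER_RACK      # worst-case cross-rack share per server

print(f"Edge: {edge_gbps:.0f} Gbps, core uplink: {core_gbps:.0f} Gbps")
print(f"Per-server cross-rack bandwidth ≈ {per_server * 1000:.0f} Mbps")
```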
DC Design Implications
Software using the DC needs to be aware of the storage hierarchy. (Jeff Dean)
Example
Data location   Latency   Throughput
RAM             100 ns    20 GBps
Hard Disk       10 ms     80 MBps
Network - Rack  70 µs     128 MBps (1 Gbps)
Network - DC    500 µs    25 MBps (oversubscription ratio of 5:1)
(Figure: latency and bandwidth of RAM and disk at the local, rack, and DC levels)
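A back-of-the-envelope sketch derived from the table above: latency plus transfer time to read 1 MB from each level; the numbers are the ones in the table, so treat the results as rough illustrations.

```python
# (latency in seconds, throughput in bytes/second), taken from the table above
LEVELS = {
    "RAM":          (100e-9, 20e9),
    "Hard Disk":    (10e-3,  80e6),
    "Network-Rack": (70e-6,  128e6),
    "Network-DC":   (500e-6, 25e6),
}

SIZE = 1 << 20  # 1 MB
for name, (latency_s, bw_bps) in LEVELS.items():
    total_ms = (latency_s + SIZE / bw_bps) * 1e3
    print(f"Read 1 MB from {name:<12}: ≈ {total_ms:7.2f} ms")
```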
DC Design Implications
• Software using the DC needs to be aware of the network and storage hierarchy
• Software fault tolerance is necessary; programming frameworks hide this complexity
• Technology changes:
- Much more memory
- New disks: shingled, Kinetic, PCIe-attached
- SSDs, NVM
- SDN networks
- Programmable NICs and switches
- Faster networks
Large Scale Services
Two categories:
- Online, e.g., e-commerce, instant messaging
• Low latency
• Highly available
• Mostly read operations
- Offline: batch processing, e.g., data processing
• Compute- and I/O-intensive
• Throughput-centric
Model
Load Manager
• DNS-based
- May take hours to adapt
- Not available to small clusters
• Appliance or switch (L4)
• Smart client (L7)
Load balancing techniques (two are sketched below):
• Round robin
• Least number of connections
• Response time
• Source IP hash
• SDN-based
• Chained failover
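A minimal sketch of two of the techniques listed above, round robin and source-IP hashing; the backend addresses are made up for illustration.

```python
import hashlib
import itertools

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical backend servers

# Round robin: cycle through the backends in order.
_rr = itertools.cycle(BACKENDS)
def round_robin():
    return next(_rr)

# Source-IP hash: the same client always maps to the same backend (session affinity).
def source_ip_hash(client_ip):
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

for client in ("192.168.1.5", "192.168.1.6", "192.168.1.5"):
    print(f"round robin -> {round_robin()},  hash({client}) -> {source_ip_hash(client)}")
```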
High Availability
Metric (uptime): the percentage of time the system is available to answer client requests.
---|Fail|---Recover---|-------------- available --------------|Fail|---Recover---|---
Uptime = (MTBF − MTTR) / MTBF
MTBF: mean time between failures
MTTR: mean time to repair
High Availability
Uptime = (MTBF − MTTR) / MTBF
Brewer's recommendation: do your best to increase MTBF, but focus on reducing MTTR. Why?
• Demonstrating a high MTBF needs weeks of testing.
• MTTR is easier to improve: easier to debug and to measure.
Problem with uptime: not all seconds are equal (idle vs. peak time).
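A quick worked instance of the uptime formula above; the MTBF and MTTR values are made-up examples.

```python
def uptime(mtbf_hours, mttr_hours):
    """Uptime = (MTBF - MTTR) / MTBF."""
    return (mtbf_hours - mttr_hours) / mtbf_hours

# Halving MTTR buys the same availability as doubling MTBF, and is usually easier.
print(f"MTBF=1000h, MTTR=2h: {uptime(1000, 2):.3%}")   # 99.800%
print(f"MTBF=1000h, MTTR=1h: {uptime(1000, 1):.3%}")   # 99.900%
print(f"MTBF=2000h, MTTR=2h: {uptime(2000, 2):.3%}")   # 99.900%
```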
High Availability
Yield = queries completed / queries offered
Harvest = data available / complete data
DQ principle: data per query (D) × queries per second (Q) → constant.
The underlying limitation is data movement (seeks, I/O bandwidth, etc.).
Good for:
• Comparing systems
• Deciding on upgrades
• Measuring the effect of failures
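A small numeric illustration of yield and of the DQ trade-off above; the capacity and query numbers are invented for the example.

```python
DQ_CAPACITY = 1000.0   # assumed: (data per query) x (queries per second) the cluster can sustain

def max_q(harvest):
    """If each query only touches a `harvest` fraction of the data, Q can grow."""
    return DQ_CAPACITY / harvest

offered, completed = 1500, 1000
print(f"yield = {completed / offered:.2f}")              # 0.67: a third of the queries dropped
print(f"Q at 100% harvest: {max_q(1.0):.0f} queries/s")
print(f"Q at  50% harvest: {max_q(0.5):.0f} queries/s")  # same DQ, twice the queries
```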
Graceful Degradation
Degrade the service under overload instead of failing completely.
Overload will happen: single-event bursts, a peak-to-average load ratio of 6:1, failures.
Techniques (two are sketched below):
• Limit D (partial results) and maintain Q
• Limit Q (admission control) and maintain D
• QoS, cost-based admission
• Priorities
• Reduce data quality (freshness)
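A minimal sketch of the first two techniques above, under an assumed fixed capacity; the capacity value and load numbers are illustrative.

```python
import random

CAPACITY_QPS = 1000.0   # assumed sustainable queries/second at full data per query

def admit(current_load_qps):
    """Limit Q (admission control): shed excess queries so D stays at 100%."""
    if current_load_qps <= CAPACITY_QPS:
        return True
    return random.random() < CAPACITY_QPS / current_load_qps   # drop just enough load

def harvest_fraction(current_load_qps):
    """Limit D instead: shrink the fraction of data each query touches, keep all queries."""
    return min(1.0, CAPACITY_QPS / current_load_qps)

load = 1500.0
print(f"admission control keeps ~{CAPACITY_QPS / load:.0%} of queries at full harvest")
print(f"or serve every query at {harvest_fraction(load):.0%} harvest")
```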
Evolution
Perfect software is hard, costly, and takes a long time. Aim instead for software that handles failures well (high MTBF, low MTTR, no cascading failures). Other bugs are less critical: memory leaks, slowness, etc. (try throwing more hardware at them).
Reasoning: upgrades are controlled failures. Do them off-peak.
Strategies (all have the same DQ loss over time; see the sketch below):
• Fast reboot of all cluster nodes: easier (jump between versions), risky (the new version could be buggy), causes downtime
• Rolling upgrade, e.g., 5% of nodes at a time: more complex (two versions run at the same time), slow
• Big flip: jump from one version to the other half a cluster at a time
Rolling upgrade is the most popular.
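A back-of-the-envelope check of the claim that the three strategies cost the same DQ over time; the cluster size and per-batch upgrade time are assumed example values.

```python
NODES = 100
BATCH_UPGRADE_MIN = 10.0   # assumed: upgrading one batch of nodes takes 10 minutes

# Capacity-minutes lost = (fraction of cluster down) x (minutes down) x (number of windows)
fast_reboot = 1.00 * BATCH_UPGRADE_MIN * 1     # whole cluster down for one window
rolling     = 0.05 * BATCH_UPGRADE_MIN * 20    # 5% down at a time, 20 windows
big_flip    = 0.50 * BATCH_UPGRADE_MIN * 2     # half the cluster down, two windows

print(fast_reboot, rolling, big_flip)          # 10.0 10.0 10.0 -> same DQ loss over time
```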
Replication vs. Partitioning
Replication gives higher harvest; partitioning gives higher yield.
E.g., a two-node cluster where one node fails:
• Replication: 100% harvest, 50% yield (but replication needs more DQ for writes)
• Partitioning: 50% harvest, 100% yield
Either way, the DQ value is the same (lower by 50%).
Since capacity is cheap, use replication: better harvest, it affects yield only under heavy load, it is easier to manage, it scales, and disaster recovery is easier.