When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads
Jason Lowe-Power, Mark D. Hill, David A. Wood
4/3/2016, BPOE 7 @ ASPLOS 2016
Big Data == Big Memory
▪ Low latency → real-time
▪ What is the performance for 100 kW?
▪ What's the best performance for a 16 TB system?
▪ Can we execute complex queries in 10 ms?
Which is best?
▪ Lowest power!
▪ Highest capacity!
▪ Best performance!
▪ It depends
Big Memory Machines
▪ Dell PowerEdge R930
▪ Processors: 64 cores
▪ Memory capacity: 3 TB (3,072 GB)
▪ Memory bandwidth: 408 GB/s
DRAM (per socket)
▪ [Figure: amount accessible per second compared with amount accessible in 10 ms]
▪ Amount accessible in 10 ms: 1 GB
Processing 2x–10x faster than data supply
▪ [Figure: amount accessible per second and in 10 ms, compared with CPU processing in 10 ms and GPU processing in 10 ms]
3D Die-Stacking
▪ [Figure: DRAM (per socket), amount accessible per second and in 10 ms]
▪ Ratio of data supply to data processing ≈ 1
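The "amount accessible in 10 ms" comparison on these slides is bandwidth multiplied by the time window. Below is a minimal Python sketch of that arithmetic. It assumes the per-socket figures quoted on the talk's systems slide (102 GB/s and 256 GB for a traditional socket, 256 GB/s and 8 GB for a die-stacked one). The function names are illustrative and are not taken from the authors' model.

```python
def accessible_gb(bandwidth_gb_per_s: float, window_s: float) -> float:
    """Data that can be streamed from memory within the time window, in GB."""
    return bandwidth_gb_per_s * window_s


def accessible_fraction(bandwidth_gb_per_s: float, capacity_gb: float,
                        window_s: float) -> float:
    """Fraction of the memory's capacity reachable in the window (capped at 1)."""
    return min(1.0, accessible_gb(bandwidth_gb_per_s, window_s) / capacity_gb)


WINDOW_S = 0.010  # the 10 ms window used on the slides

# Per-socket values assumed from the talk's systems slide.
traditional = accessible_gb(102, WINDOW_S)
traditional_frac = accessible_fraction(102, 256, WINDOW_S)
die_stacked = accessible_gb(256, WINDOW_S)
die_stacked_frac = accessible_fraction(256, 8, WINDOW_S)

print(f"traditional: {traditional:.2f} GB ({traditional_frac:.1%} of capacity)")
print(f"die-stacked: {die_stacked:.2f} GB ({die_stacked_frac:.1%} of capacity)")
```

Under these assumptions a traditional socket can touch well under 1% of its DRAM in 10 ms, while a die-stacked socket can touch about a third of its much smaller capacity.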
Traditional Server, Big-Memory Server, Die-Stacked Server
▪ Annotations (compared to traditional): ↑ Higher bandwidth, ↑↑ Higher capacity
Outline: Model and Workload, Model results, Discussion
Evaluation
▪ Option 1: Build the hardware
▪ Option 2: Simulation
▪ Option 3: Analytical Model!
Model Example (for traditional server)
Provisioning: 10 ms response time
▪ Data to read: 16,384 GB × 0.20 = 3,276.8 GB
▪ Bandwidth: 3,276.8 GB ÷ 0.010 s = 327.680 TB/s
▪ Chips needed: 327.680 TB/s ÷ 102 GB/s/chip = 3213 chips = 800 blades
▪ Power: 458 kW
▪ Capacity: 800 TB
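The provisioning arithmetic in this example can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the authors' model. The 256 GB per-chip capacity comes from the talk's systems slide. The 4 chips per blade is an inference from "3213 chips = 800 blades". Power is left out because the slides give no per-blade power figure.

```python
import math


def provision_for_sla(corpus_gb, fraction, sla_s, chip_bw_gb_s,
                      chip_capacity_gb, chips_per_blade):
    """Size a cluster so one request's data can be read within the SLA."""
    data_gb = corpus_gb * fraction                  # data touched per request
    required_bw_gb_s = data_gb / sla_s              # aggregate bandwidth needed
    chips = math.ceil(required_bw_gb_s / chip_bw_gb_s)
    blades = math.ceil(chips / chips_per_blade)
    capacity_gb = chips * chip_capacity_gb          # memory bought as a side effect
    return {
        "data_gb": data_gb,
        "required_bw_gb_s": required_bw_gb_s,
        "chips": chips,
        "blades": blades,
        "capacity_gb": capacity_gb,
    }


# Traditional server, 10 ms SLA. chips_per_blade=4 and chip_capacity_gb=256
# are assumptions inferred from other slides, not stated on this one.
result = provision_for_sla(corpus_gb=16_384, fraction=0.20, sla_s=0.010,
                           chip_bw_gb_s=102, chip_capacity_gb=256,
                           chips_per_blade=4)
for key, value in result.items():
    print(f"{key}: {value}")
```

With these inputs the sketch yields 3213 chips, 804 blades, and 822,528 GB of capacity. The slide rounds these to 800 blades and 800 TB.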
Model details
▪ From the paper
▪ Online: research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
Workload Assumptions
▪ 16 TB data corpus
▪ Each request accesses 20% of data corpus (3.2 TB)
▪ One core can process 6 GB/s
▪ No communication between cores
(Image: https://xkcd.com/1339/)
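Because the workload assumes no communication between cores, throughput scales linearly with core count, and a request can finish no faster than the slower of memory supply and compute. The Python sketch below expresses that idea. Taking the minimum of the two rates is a simplifying assumption made for this sketch and may differ from the paper's model. The function names are illustrative.

```python
import math

CORPUS_GB = 16 * 1024          # 16 TB data corpus
FRACTION = 0.20                # each request accesses 20% of the corpus
CORE_RATE_GB_S = 6.0           # one core can process 6 GB/s
DATA_PER_REQUEST_GB = CORPUS_GB * FRACTION


def cores_needed(sla_s: float) -> int:
    """Cores required so compute alone can keep up with the SLA."""
    return math.ceil(DATA_PER_REQUEST_GB / sla_s / CORE_RATE_GB_S)


def response_time_s(memory_bw_gb_s: float, cores: int) -> float:
    """Response time when limited by the slower of memory and compute."""
    effective_rate = min(memory_bw_gb_s, cores * CORE_RATE_GB_S)
    return DATA_PER_REQUEST_GB / effective_rate


print(cores_needed(0.010))               # cores for a 10 ms SLA
print(response_time_s(6528.0, 1000))     # compute-bound: 1000 cores supply 6000 GB/s
print(response_time_s(6528.0, 2000))     # memory-bound: limited to 6528 GB/s
```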
Outline: Model and Workload, Model results, Discussion
Metrics
▪ Performance: response time (SLA)
▪ Power: major component of datacenter cost
▪ Data capacity: workload size
Performance Provisioning
▪ Goal: Design cluster to meet a service level agreement (SLA)
▪ [Figure: example request stages with latencies: Get matches 10 ms, Sort 50 ms, Ads 50 ms, . . . 50 ms; other values shown: 100 ms, 500 ms]
Performance Provisioning: 10 ms SLA
▪ Current systems require memory over-provisioning
▪ [Figure: power and capacity for each system; labels shown: 213 ✕, 50 ✕, 1 ✕]
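One way to see where memory over-provisioning comes from: a cluster must have enough chips for both the bandwidth the SLA demands and the capacity the corpus needs, and whichever count is larger decides how much memory is bought. The Python sketch below makes that explicit. The per-chip figures are assumed from the talk's systems slide, and the max-of-two-constraints rule is a simplification made for this sketch. It gives factors of roughly 50, 209, and 1. Those are close to the 213 ✕, 50 ✕, and 1 ✕ labels on this slide but do not match exactly.

```python
import math
from fractions import Fraction

CORPUS_GB = 16 * 1024
DATA_PER_REQUEST_GB = Fraction(CORPUS_GB) / 5     # 20% of the corpus, exact
SLA_S = Fraction(1, 100)                          # 10 ms

# (bandwidth GB/s, capacity GB) per chip; assumed from the systems slide.
SYSTEMS = {
    "traditional": (102, 256),
    "big memory": (196, 2048),
    "die-stacked": (256, 8),
}


def over_provisioning(bw_gb_s: int, capacity_gb: int) -> float:
    """Provisioned capacity divided by the capacity the corpus needs."""
    chips_for_bandwidth = math.ceil(DATA_PER_REQUEST_GB / SLA_S / bw_gb_s)
    chips_for_capacity = math.ceil(Fraction(CORPUS_GB, capacity_gb))
    chips = max(chips_for_bandwidth, chips_for_capacity)
    return chips * capacity_gb / CORPUS_GB


factors = {name: over_provisioning(*spec) for name, spec in SYSTEMS.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x capacity")
```

Exact fractions are used so that the ceiling is not disturbed by floating-point rounding when a division comes out to a whole number.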
Memory Over-Provisioning
▪ 50% Wasted
Performance Provisioning: 10 ms SLA
▪ Die-stacking: 2–5 ✕ less power
▪ [Figure: power and capacity]
Performance Provisioning: Power for relaxed SLAs
▪ Traditional needs less over-provisioned memory
Power Provisioning
▪ Goal: Design cluster to not exceed some power constraint
▪ Power ranges shown: 10–20 kW, 100 kW–1 MW, 10–100 MW
Power Provisioning: 1 MW power
▪ Response time: die-stacking is 3–5 ✕ faster
▪ Capacity: die-stacking has less capacity for the power budget
Data Capacity Provisioning
▪ Goal: Design cluster capacity for workload
▪ Search: Inverted Index
▪ Graph: Friends lists
▪ Database: Purchases
Data Capacity Provisioning: 16 TB Database
▪ Response time: die-stacking is 60–256 ✕ faster
▪ Power: die-stacking uses 25–50 ✕ more power
                 Traditional              Big Memory                 Die-Stacked
Performance      Best for SLA 60+ ms      Over-provisioned memory    2–5x less power for 10 ms SLA
Power            2x faster with 50 kW     3x memory capacity         3–4x faster with 1 MW
Data capacity    Somewhere between        2–50x less power           60–250x faster
Outline: Model and Workload, Model results, Discussion
Model deficiencies
▪ You chose the wrong number! See research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
▪ Communication between cores: this makes 2048 die-stacked systems worse
▪ How to move data between stacks?
▪ Compute energy or data energy?
▪ Cost?
In-Memory Big Data Workloads: Which is best?
▪ Today: It depends…
▪ Tomorrow: Die-stacked?
Questions ‽
▪ research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
▪ bit.ly/bpoe-interactive
▪ powerjg@cs.wisc.edu
Systems
                     Traditional   Big memory   Die-stacked
Bandwidth            102 GB/s      196 GB/s     256 GB/s
Capacity             256 GB        2 TB         8 GB
Blades (16 TB)       16            8            228
Cluster bandwidth    6.4 TB/s      1.5 TB/s     512 TB/s
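The cluster bandwidth row follows from the per-chip figures once each cluster is sized to hold the 16 TB corpus. The Python sketch below reproduces that calculation. It assumes the bandwidth and capacity values are per chip and that 1 TB = 1024 GB, which is the convention that matches the slide's totals. The mapping from chips to blades is not modeled, since the slides do not state chips per blade for every system.

```python
import math

CORPUS_GB = 16 * 1024  # 16 TB corpus

# (bandwidth GB/s, capacity GB) per chip, as read from the systems slide.
SYSTEMS = {
    "traditional": (102, 256),
    "big memory": (196, 2048),
    "die-stacked": (256, 8),
}


def size_for_capacity(bw_gb_s: int, capacity_gb: int):
    """Chips needed to hold the corpus and the aggregate bandwidth they provide."""
    chips = math.ceil(CORPUS_GB / capacity_gb)
    cluster_bw_gb_s = chips * bw_gb_s
    return chips, cluster_bw_gb_s


sized = {name: size_for_capacity(*spec) for name, spec in SYSTEMS.items()}
for name, (chips, bw) in sized.items():
    print(f"{name}: {chips} chips, {bw / 1024:.2f} TB/s cluster bandwidth")
```

The die-stacked result of 2048 chips matches the "2048 die-stacked systems" mentioned under model deficiencies.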
Power Breakdown
▪ Compute power dominates die-stacked
Decreased Compute Power
▪ [Figure: panels for 10 ms SLA, 100 kW power, 16 TB capacity]
Increased Memory Density
▪ [Figure: panels for 100 ms SLA, 100 kW power, 16 TB capacity]