When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads

  1. When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads Jason Lowe-Power || Mark D. Hill || David A. Wood 4/3/2016 BPOE 7 @ ASPLOS 2016

  2. Big Data == Big Memory. Low latency → Real-time. What is the performance for 16 TB system? Can we execute complex queries in 10 ms? What's the best performance for 100 kW?

  3. Lowest power! Highest capacity! Best performance! Which is best? It depends.

  4. Big Memory Machines: Dell PowerEdge R930. Memory capacity: 3 TB (3,072 GB). Memory bandwidth: 408 GB/s. Processors: 64 cores.

  5. DRAM (per socket). Figure labels: amount accessible per second; amount accessible in 10 ms: 1 GB.
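
The 1 GB figure can be checked with simple arithmetic. The sketch below assumes the 408 GB/s of the big-memory machine is split evenly across four sockets; the socket count and the helper name `gb_accessible` are assumptions for illustration and are not stated on the slides.

```python
# Back-of-the-envelope check: how much DRAM can one socket stream in a window?
MACHINE_BANDWIDTH_GB_S = 408.0   # from the Big Memory Machines slide
MACHINE_CAPACITY_GB = 3072.0     # 3 TB, from the same slide
SOCKETS = 4                      # assumption: socket count is not given on the slides


def gb_accessible(bandwidth_gb_s: float, window_s: float) -> float:
    """Data (GB) that can be read at `bandwidth_gb_s` within `window_s` seconds."""
    return bandwidth_gb_s * window_s


per_socket_bw = MACHINE_BANDWIDTH_GB_S / SOCKETS
per_socket_10ms = gb_accessible(per_socket_bw, 0.010)
full_scan_s = MACHINE_CAPACITY_GB / MACHINE_BANDWIDTH_GB_S

print(f"per-socket bandwidth: {per_socket_bw:.0f} GB/s")
print(f"accessible in 10 ms:  {per_socket_10ms:.2f} GB")
print(f"full 3 TB scan:       {full_scan_s:.1f} s")
```

Under these assumptions a socket reaches roughly 1 GB in 10 ms, while a scan of the whole 3 TB takes several seconds.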

  6. Processing 2x–10x faster than data supply. Figure labels: amount accessible per second; amount accessible in 10 ms; CPU processing in 10 ms; GPU processing in 10 ms.

  7. 3D Die-Stacking. DRAM (per socket). Figure labels: amount accessible per second; amount accessible in 10 ms. Data supply to data processing ≈ 1.

  8. Traditional Server; Big-Memory Server; Die-Stacked Server. ↑ Higher bandwidth, ↑↑ Higher capacity (compared to traditional).

  9. Model and Workload; Model results; Discussion

  10. Evaluation. Option 1: Build the hardware. Option 2: Simulation. Option 3: Analytical Model!

  11. Model Example (for traditional server). Provisioning: 10 ms response time. Data to read: 16,384 GB × 0.20 = 3,276.8 GB. Bandwidth: 3,276.8 GB ÷ 0.010 s = 327.680 TB/s. Chips needed: 327.680 TB/s ÷ 102 GB/s/chip = 3213 chips = 800 blades. Power: 458 kW. Capacity: 800 TB.
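
The arithmetic of this example can be replayed in a few lines. This is a minimal sketch and not the authors' model: the function name `provision_for_sla` is hypothetical, and the 256 GB per-chip capacity is an assumption read from the backup "Systems" slide. The power figure is not reproduced because the slides give no per-chip power.

```python
import math

CORPUS_GB = 16_384        # 16 TB data corpus
FRACTION_READ = 0.20      # each request reads 20% of the corpus
SLA_S = 0.010             # 10 ms response time
CHIP_BW_GB_S = 102        # traditional server, GB/s per chip
CHIP_CAPACITY_GB = 256    # assumption: per-chip capacity from the backup slide


def provision_for_sla(corpus_gb, fraction, sla_s, chip_bw_gb_s):
    """Return (data to read in GB, required bandwidth in GB/s, chips needed)."""
    data_gb = corpus_gb * fraction
    bandwidth_gb_s = data_gb / sla_s
    chips = math.ceil(bandwidth_gb_s / chip_bw_gb_s)
    return data_gb, bandwidth_gb_s, chips


data_gb, bandwidth_gb_s, chips = provision_for_sla(
    CORPUS_GB, FRACTION_READ, SLA_S, CHIP_BW_GB_S
)
# Capacity that comes along with those chips, and how far it exceeds the corpus.
capacity_tb = chips * CHIP_CAPACITY_GB / 1024
overprovision = capacity_tb / (CORPUS_GB / 1024)

print(f"data to read: {data_gb:.1f} GB")
print(f"bandwidth:    {bandwidth_gb_s / 1000:.2f} TB/s")
print(f"chips needed: {chips}")
print(f"capacity:     {capacity_tb:.0f} TB ({overprovision:.0f}x the corpus)")
```

With the assumed 256 GB per chip, the installed capacity comes out near the slide's 800 TB, which is about 50 times the 16 TB corpus.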

  12. Model details: from the paper. Online: research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/

  13. Workload Assumptions ▪ 16 TB data corpus ▪ Each request accesses 20% of data corpus (3.2 TB) ▪ One core can process 6 GB/s ▪ No communication between cores (image: https://xkcd.com/1339/)
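
These assumptions also fix how much compute a given SLA requires. The sketch below is illustrative only: `cores_needed` is a hypothetical helper, and it relies on the stated assumption that cores do not communicate, so per-core throughput simply adds up.

```python
import math

CORPUS_GB = 16_384       # 16 TB data corpus
FRACTION_READ = 0.20     # each request accesses 20% of the corpus
CORE_RATE_GB_S = 6       # one core can process 6 GB/s


def cores_needed(sla_s: float) -> int:
    """Cores required to process one request's data within `sla_s` seconds.

    Assumes perfect scaling (no communication between cores).
    """
    data_gb = CORPUS_GB * FRACTION_READ
    return math.ceil(data_gb / (sla_s * CORE_RATE_GB_S))


for sla_ms in (10, 100):
    print(f"{sla_ms:>4} ms SLA -> {cores_needed(sla_ms / 1000)} cores")
```

Relaxing the SLA by 10x cuts the core count by the same factor, since the model is linear in the response time.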

  14. Model and Workload; Model results; Discussion

  15. Metrics. Performance: response time (SLA). Power: major component of datacenter cost. Data capacity: workload size.

  16. Performance Provisioning. Goal: Design cluster to meet a service level agreement (SLA). Figure labels: Get matches 10 ms; Sort 50 ms; Ads 50 ms; . . . 50 ms; 100 ms; 500 ms.

  17. Performance Provisioning: 10 ms SLA. Chart panels: Power; Capacity. Current systems require memory over-provisioning (chart labels: 213✕, 50✕, 1✕).

  18. Memory Over-Provisioning: 50% Wasted

  19. Performance Provisioning: 10 ms SLA. Die-stacking: 2–5✕ less power. Chart panels: Power; Capacity.

  20. Performance Provisioning: Power for relaxed SLAs. Traditional needs less over-provisioned memory.

  21. Power Provisioning. Goal: Design cluster to not exceed some power constraint. Figure labels: 10–20 kW; 100 kW–1 MW; 10–100 MW.

  22. Power Provisioning: 1 MW power. Response time: Die-stacking: 3–5✕ faster. Capacity: Die-stacking: less capacity for power budget.

  23. Data Capacity Provisioning. Goal: Design cluster capacity for workload. Search: Inverted Index. Graph: Friends lists. Database: Purchases.

  24. Data Capacity Provisioning: 16 TB Database. Response time: Die-stacking: 60–256✕ faster. Power: Die-stacking: 25–50✕ more power.

  25. Summary (Traditional / Big Memory / Die-Stacked). Performance: best for SLA 60+ ms / over-provisioned memory / 2–5x less power for 10 ms SLA. Power: 2x faster with 50 kW / 3x memory capacity / 3–4x faster with 1 MW. Data capacity: somewhere between / 2–50x less power / 60–250x faster.

  26. Model and Workload; Model results; Discussion

  27. Model deficiencies. You chose the wrong number! See research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/. Communication between cores: this makes 2048 die-stacked systems worse. How to move data between stacks? Compute energy or data energy? Cost?

  28. In-Memory Big Data Workloads: Which is best? Today: It depends… Tomorrow: Die-stacked?

  29. Questions‽ research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/ bit.ly/bpoe-interactive powerjg@cs.wisc.edu

  30. Systems (Traditional / Big memory / Die-stacked). Bandwidth: 102 GB/s / 196 GB/s / 256 GB/s. Capacity: 256 GB / 2 TB / 8 GB. Blades (16 TB): 16 / 8 / 228. Cluster bandwidth: 6.4 TB/s / 1.5 TB/s / 512 TB/s.
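
The cluster bandwidth row follows from the per-chip numbers once the cluster is sized to hold the 16 TB corpus. The sketch below assumes the flattened figures pair up as traditional 102 GB/s and 256 GB, big memory 196 GB/s and 2 TB, die-stacked 256 GB/s and 8 GB per chip; that pairing and the helper `cluster_for_capacity` are assumptions, chosen because they reproduce the slide's 6.4, 1.5, and 512 TB/s.

```python
import math

CORPUS_GB = 16 * 1024  # 16 TB corpus, using 1 TB = 1024 GB as in the model example

# name -> (bandwidth per chip in GB/s, capacity per chip in GB); pairing is assumed
SYSTEMS = {
    "traditional": (102, 256),
    "big memory": (196, 2 * 1024),
    "die-stacked": (256, 8),
}


def cluster_for_capacity(chip_bw_gb_s, chip_capacity_gb, corpus_gb=CORPUS_GB):
    """Return (chips needed to hold the corpus, aggregate bandwidth in TB/s)."""
    chips = math.ceil(corpus_gb / chip_capacity_gb)
    return chips, chips * chip_bw_gb_s / 1024


results = {name: cluster_for_capacity(bw, cap) for name, (bw, cap) in SYSTEMS.items()}
for name, (chips, bw_tb_s) in results.items():
    print(f"{name:12s} {chips:5d} chips  {bw_tb_s:7.1f} TB/s")
```

The die-stacked cluster needs 2048 stacks to hold 16 TB, which is where its very large aggregate bandwidth comes from.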

  31. Power Breakdown: Compute power dominates die-stacked

  32. Decreased Compute Power. Chart panels: 10 ms SLA; 100 kW Power; 16 TB Capacity.

  33. Increased Memory Density. Chart panels: 100 ms SLA; 100 kW Power; 16 TB Capacity.
