When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads Jason Lowe-Power || Mark D. Hill || David A. Wood 4/3/2016 BPOE 7 @ ASPLOS 2016

Big Data == Big Memory
Low latency → Real-time
▪ Can we execute complex queries in 10 ms?
▪ What is the best performance for 100 kW, or for a 16 TB system?

Lowest power! Highest capacity! Best performance!
Which is best? It depends.

Big Memory Machines: Dell PowerEdge R930
▪ Memory capacity: 3 TB (3,072 GB)
▪ Memory bandwidth: 408 GB/s
▪ Processors: 64 cores

DRAM (per socket)
▪ Amount accessible per second
▪ Amount accessible in 10 ms: 1 GB
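The 1 GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes that the 408 GB/s quoted for the Dell PowerEdge R930 is split evenly over four sockets; the socket count is an assumption (it matches the 102 GB/s per chip used in the later model example), and the function name is illustrative only.

```python
def accessible_gb(bandwidth_gb_per_s: float, window_s: float) -> float:
    """Data (in GB) that can be streamed from memory within a time window."""
    return bandwidth_gb_per_s * window_s


MACHINE_BANDWIDTH_GB_S = 408.0  # from the "Big Memory Machines" slide
SOCKETS = 4                     # assumption: four-socket machine

per_socket_bw = MACHINE_BANDWIDTH_GB_S / SOCKETS  # GB/s per socket
per_second = accessible_gb(per_socket_bw, 1.0)    # window of one second
in_10_ms = accessible_gb(per_socket_bw, 0.010)    # window of 10 ms

print(f"per socket, per second: {per_second:.0f} GB")
print(f"per socket, in 10 ms:   {in_10_ms:.2f} GB")
```

Under these assumptions a socket can touch only about 1 GB of its DRAM within a 10 ms window, which is the gap the slide illustrates.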

Processing 2x–10x faster than data supply
▪ Amount accessible per second
▪ Amount accessible in 10 ms
▪ CPU processing in 10 ms
▪ GPU processing in 10 ms

3D Die-Stacking
DRAM (per socket)
▪ Amount accessible per second
▪ Amount accessible in 10 ms
Data supply to data processing ≈ 1

Traditional Server, Big-Memory Server, Die-Stacked Server
▪ ↑ Higher bandwidth
▪ ↑↑ Higher capacity
(compared to traditional)

▪ Model and Workload
▪ Model results
▪ Discussion

Evaluation
▪ Option 1: Build the hardware
▪ Option 2: Simulation
▪ Option 3: Analytical Model!

Model Example (for traditional server)
▪ Provisioning: 10 ms response time
▪ Data to read: 16,384 GB × 0.20 = 3,276.8 GB
▪ Bandwidth: 3,276.8 GB ÷ 0.010 s = 327.680 TB/s
▪ Chips needed: 327.680 TB/s ÷ 102 GB/s/chip = 3213 chips = 800 blades
▪ Power: 458 kW
▪ Capacity: 800 TB
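The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch of the calculation shown on the slide, not the authors' published model. The function name is hypothetical, the 256 GB per socket comes from the backup "Systems" slide, and the value of four sockets per blade is an assumption inferred from that slide. The slide rounds the blade count and capacity to 800.

```python
import math


def chips_for_sla(corpus_gb, fraction_read, sla_s, chip_bw_gb_s):
    """Return (data to read in GB, required GB/s, chips) for a response-time target."""
    data_gb = corpus_gb * fraction_read
    bandwidth_gb_s = data_gb / sla_s
    chips = math.ceil(bandwidth_gb_s / chip_bw_gb_s)
    return data_gb, bandwidth_gb_s, chips


# Values from the slide: 16,384 GB corpus, 20% read per request,
# 10 ms response time, 102 GB/s per traditional chip.
data_gb, bw_gb_s, chips = chips_for_sla(16_384, 0.20, 0.010, 102)

SOCKETS_PER_BLADE = 4  # assumption, inferred from the backup "Systems" slide
GB_PER_SOCKET = 256    # traditional server, per the backup "Systems" slide

blades = math.ceil(chips / SOCKETS_PER_BLADE)
capacity_tb = chips * GB_PER_SOCKET / 1024

print(f"data to read: {data_gb:.1f} GB")
print(f"bandwidth:    {bw_gb_s / 1000:.2f} TB/s")
print(f"chips:        {chips}")
print(f"blades:       {blades} (slide rounds to 800)")
print(f"capacity:     {capacity_tb:.0f} TB (slide rounds to 800 TB)")
```

The 458 kW power figure depends on per-component power numbers that are not given on the slide, so it is not reproduced here.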

Model details
▪ From the paper
▪ Online: research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/

Workload Assumptions
▪ 16 TB data corpus
▪ Each request accesses 20% of data corpus (3.2 TB)
▪ One core can process 6 GB/s
▪ No communication between cores
https://xkcd.com/1339/
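Taken together, these assumptions imply a minimum core count for any response-time target. The sketch below is a simple division based only on the four bullets above. It is not necessarily how the authors' model accounts for compute, and the function name is illustrative.

```python
import math

CORPUS_GB = 16 * 1024        # 16 TB data corpus
FRACTION_PER_REQUEST = 0.20  # each request touches 20% of the corpus
CORE_RATE_GB_S = 6.0         # one core processes 6 GB/s


def cores_needed(sla_s: float) -> int:
    """Cores required to scan one request's data within the SLA.

    With no communication between cores, the work divides evenly,
    so the count is data / (per-core rate * time budget), rounded up.
    """
    data_gb = CORPUS_GB * FRACTION_PER_REQUEST
    return math.ceil(data_gb / (CORE_RATE_GB_S * sla_s))


for sla_ms in (10, 100, 500):
    print(f"{sla_ms:>3} ms SLA: {cores_needed(sla_ms / 1000)} cores")
```

Because the cores are assumed independent, the required count scales inversely with the SLA: relaxing the target by 10x cuts the cores by about 10x.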

Metrics
▪ Performance: Response time (SLA)
▪ Power: Major component of datacenter cost
▪ Data capacity: Workload size

Performance Provisioning
Goal: Design cluster to meet a service level agreement (SLA)
Figure labels: Get matches 10 ms, Sort 50 ms, Ads 50 ms, 100 ms, . . ., 50 ms, 500 ms

Performance Provisioning: 10 ms SLA
Current systems require memory over provisioning
Figure panels: Power, Capacity. Figure labels: 213 ✕, 50 ✕, 1 ✕
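One way to see where the over-provisioning comes from: a cluster needs enough chips to supply the bandwidth and enough chips to hold the data, and it must buy whichever count is larger. The sketch below is an illustration under stated assumptions, not the authors' model. It uses 102 GB/s per chip from the model example and 256 GB per chip for the traditional server and 256 GB/s with 8 GB per stack for die-stacking, as read from the backup "Systems" slide. The function name is hypothetical.

```python
import math


def overprovision_factor(corpus_gb, fraction_read, sla_s, chip_bw_gb_s, chip_cap_gb):
    """Ratio of chips actually required to chips needed just to hold the corpus."""
    chips_for_bandwidth = math.ceil(corpus_gb * fraction_read / sla_s / chip_bw_gb_s)
    chips_for_capacity = math.ceil(corpus_gb / chip_cap_gb)
    chips = max(chips_for_bandwidth, chips_for_capacity)
    return chips / chips_for_capacity


CORPUS_GB = 16_384
FRACTION = 0.20

# Traditional server (assumed 102 GB/s and 256 GB per chip) at several SLAs.
for sla_ms in (10, 50, 100, 500):
    factor = overprovision_factor(CORPUS_GB, FRACTION, sla_ms / 1000, 102, 256)
    print(f"traditional, {sla_ms:>3} ms SLA: {factor:.1f}x capacity")

# Die-stacked (assumed 256 GB/s and 8 GB per stack) at the 10 ms SLA.
die_stacked = overprovision_factor(CORPUS_GB, FRACTION, 0.010, 256, 8)
print(f"die-stacked,  10 ms SLA: {die_stacked:.1f}x capacity")
```

Under this reading the traditional system lands near the 50 ✕ label at a 10 ms SLA, and the factor shrinks toward 1 as the SLA is relaxed, while the die-stacked system is capacity-bound rather than bandwidth-bound.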

Memory Over Provisioning
50% Wasted

Performance Provisioning: 10 ms SLA
Die-stacking: 2–5 ✕ less power
Figure panels: Power, Capacity

Performance Provisioning: Power for relaxed SLAs
Traditional needs less over provisioned memory

Power Provisioning
Goal: Design cluster to not exceed some power constraint
Figure labels: 10–20 kW, 100 kW–1 MW, 10–100 MW

Power Provisioning: 1 MW Power
▪ Response time: Die-stacking: 3–5 ✕ faster
▪ Capacity: Die-stacking: Less capacity for power budget

Data Capacity Provisioning
Goal: Design cluster capacity for workload
▪ Search: Inverted Index
▪ Graph: Friends lists
▪ Database: Purchases

Data Capacity Provisioning: 16 TB Database
▪ Response time: Die-stacking: 60–256 ✕ faster
▪ Power: Die-stacking: 25–50 ✕ more power

               Traditional             Big Memory                 Die-Stacked
Performance    Best for SLA 60+ ms     Over provisioned memory    2–5x less power for 10 ms SLA
Power          2x faster with 50 kW    3x memory capacity         3–4x faster with 1 MW
Data capacity  Somewhere between       2–50x less power           60–250x faster

Model deficiencies
▪ You chose the wrong number! See research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
▪ Communication between cores: This makes 2048 die-stacked systems worse
▪ How to move data between stacks?
▪ Compute energy or data energy?
▪ Cost?

In Memory Big Data Workloads
Which is best?
▪ Today: It depends…
▪ Tomorrow: Die-stacked?

Questions ‽
research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
bit.ly/bpoe-interactive
powerjg@cs.wisc.edu

Systems
                    Traditional   Big memory   Die-stacked
Bandwidth           102 GB/s      196 GB/s     256 GB/s
Capacity            256 GB        2 TB         8 GB
Blades (16 TB)      16            8            228
Cluster bandwidth   6.4 TB/s      1.5 TB/s     512 TB/s
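The cluster bandwidth row follows from the per-unit bandwidth and capacity once the cluster is sized to hold the 16 TB corpus. The sketch below assumes one reading of the flattened slide: traditional is 102 GB/s and 256 GB, big memory is 196 GB/s and 2 TB, die-stacked is 256 GB/s and 8 GB, with 1 TB taken as 1,024 GB. The names are illustrative, and the blade counts are left out because sockets per blade are not stated.

```python
import math

CORPUS_GB = 16 * 1024  # 16 TB corpus

# (bandwidth in GB/s, capacity in GB) per socket or stack, as read from the slide
SYSTEMS = {
    "traditional": (102, 256),
    "big memory": (196, 2 * 1024),
    "die-stacked": (256, 8),
}


def cluster_for_capacity(bw_gb_s: float, cap_gb: float):
    """Units needed to hold the corpus, and their aggregate bandwidth in TB/s."""
    units = math.ceil(CORPUS_GB / cap_gb)
    cluster_bw_tb_s = units * bw_gb_s / 1024
    return units, cluster_bw_tb_s


results = {name: cluster_for_capacity(bw, cap) for name, (bw, cap) in SYSTEMS.items()}

for name, (units, bw_tb_s) in results.items():
    print(f"{name:<12} {units:>5} units  {bw_tb_s:8.2f} TB/s")
```

With this reading the die-stacked cluster needs 2,048 stacks, the same count mentioned on the "Model deficiencies" slide, and the aggregate bandwidths round to the 6.4, 1.5, and 512 TB/s shown in the table.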

Power Breakdown
Compute power dominates die-stacked

Decreased Compute Power
Figure panels: 10 ms SLA, 100 kW Power, 16 TB Capacity

Increased Memory Density
Figure panels: 100 ms SLA, 100 kW Power, 16 TB Capacity
