An Effective Approach to Processing in DRAM
Jinho Lee, Kiyoung Choi, and Jung Ho Ahn
Seoul National University
Outline
• Introduction
• Our Approach
• Buffered Compare Architecture
• Evaluation
• Summary
Introduction – Memory Wall
• The number of cores in a chip is increasing
• The memory bandwidth is not increasing as fast -> the "memory wall" problem
• Emerging big data applications require even more bandwidth
• In reality, much of the bandwidth is being wasted!
Introduction – Table Scan
• Which items are made of wood?
• Which items are heavier than 5kg?

  Item#  Material  Weight
  A      Wood      10kg
  B      Metal     1.5kg
  C      Metal     7kg
  D      Stone     3kg
  E      Wood      2kg
  …
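As a concrete illustration (not from the slides), both queries above are linear scans over the table; a minimal host-side sketch in Python, with hypothetical row values matching the slide:

```python
# Hypothetical in-memory table from the slide; weights in kg.
table = [
    ("A", "Wood",  10.0),
    ("B", "Metal",  1.5),
    ("C", "Metal",  7.0),
    ("D", "Stone",  3.0),
    ("E", "Wood",   2.0),
]

def scan(table, predicate):
    """Linear table scan: every row crosses the memory bus just to be tested."""
    return [item for item, material, weight in table if predicate(material, weight)]

wooden = scan(table, lambda m, w: m == "Wood")  # items made of wood
heavy = scan(table, lambda m, w: w > 5.0)       # items heavier than 5kg
print(wooden)  # ['A', 'E']
print(heavy)   # ['A', 'C']
```

Note that the scan touches every row even though only the short match lists are needed, which is exactly the bandwidth waste the next slide points out.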
Introduction – Table Scan
[Figure: the host ① reads the table data (D0–D3) from DRAM, ② compares each item against the search key, and ③ keeps only the result]
• Data are read out of DRAM and the comparisons are done on the host
• We only need the result – the bandwidth is wasted!
Introduction – Table Scan
[Figure: the comparator sits inside the DRAM; the host sends the key and receives only the result]
• Do the compare within the memory
• Only two transfers (key in, result out) needed instead of many
• Parallelism with the wider internal bandwidth, and reduced off-chip traffic
• Essentially a PIM (processing-in-memory) approach
Introduction – PIM
• PIM research was active from the late 1990s to the early 2000s
– EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc.
– Multiple cores in DRAM
– Hard to integrate -> not successful
• Regaining interest due to
– Big data workloads
– Limited improvement of memory bandwidth
– 3D-stacked memory (HMC, HBM, etc.), which enables integration of cores
Introduction – PIM
• PIM with 3D-stacked memory
– PEI (PIM-enabled instructions) [J. Ahn et al., ISCA 2015]
– Tesseract [J. Ahn et al., ISCA 2015]
[Figure: block diagrams of PEI (out-of-order host cores with PCUs and a PIM directory beside the HMC controller) and Tesseract (in-order cores with list prefetchers and message queues attached to each DRAM vault)]
Our Approach – DRAM Architecture & Motivation
[Figure: DRAM chip hierarchy – a chip contains banks; a bank contains mats of 512 x 512 cells with local/global row decoders and wordlines; an activated row is sensed into the row buffer (local sense amps), then moves through the column decoder and the global sense amps (bank I/O) over the shared internal bus to the chip I/O and the off-chip link]
• A single chip comprises 8–16 banks
• When accessing data, a row in a bank is "activated" and stored in the row buffer
• A cache line (64B) is fetched in one burst
Our Approach – DRAM Architecture & Motivation
[Figure: one bank is activated while the other banks sit idle]
• Multiple banks are interleaved because activating a row takes a long time
• A single bank can fill up the bandwidth of the off-chip link
• Thus we have 8x–16x internal bandwidth, most of which is wasted
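The 8x–16x figure follows from simple arithmetic; a back-of-the-envelope sketch, assuming an illustrative x8 DDR4-2000 device (the device width is an assumption, not from the slides):

```python
# Each bank's row buffer can stream data fast enough to saturate the
# off-chip link, so aggregate internal bandwidth scales with bank count.
transfer_rate_mts = 2000  # DDR4-2000: 2000 MT/s (from the setup slide)
device_width_bits = 8     # x8 device (assumption)
banks_per_chip = 16       # 16 banks per chip (from the setup slide)

offchip_gbps = transfer_rate_mts * device_width_bits / 8 / 1000  # GB/s per chip
internal_gbps = offchip_gbps * banks_per_chip  # if every bank streams at once

print(offchip_gbps, internal_gbps)  # 2.0 GB/s off-chip vs 32.0 GB/s internal
```

With 16 banks the internal-to-external ratio is 16x; with 8 banks it would be 8x, matching the range on the slide.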
Our Approach – DRAM Architecture & Motivation
[Figure: compute logic attached to every bank]
• Compute inside each bank to utilize the excess internal bandwidth
Our Approach – What to Compute with PIM?
• We focus only on the "compare-n-op" pattern over a long range of data
[Figure: a comparator inside the DRAM checks the key against D0 … DN]
Our Approach – What to Compute with PIM?
• Compare-n-read
– Returns the match result for each item, e.g., (=, <, =, …, >)
Our Approach – What to Compute with PIM?
• Compare-n-select
– Returns the min/max among the items, e.g., Max: (D7)
Our Approach – What to Compute with PIM?
• Compare-n-increment
– Increments the values of matching items, e.g., key K2 matches, so V2++
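A functional model of the three compare-n-op primitives (software semantics only, not the hardware; the function names are hypothetical and just mirror the slides):

```python
def compare_n_read(data, key):
    """Compare-n-read: return a per-item match result instead of the items."""
    return ["=" if d == key else ("<" if d < key else ">") for d in data]

def compare_n_select(data):
    """Compare-n-select: return only the max (or min) among the items."""
    return max(data)

def compare_n_increment(kv, key):
    """Compare-n-increment: increment the values of items whose key matches."""
    for k in kv:
        if k == key:
            kv[k] += 1
    return kv

assert compare_n_read([3, 1, 3, 5], 3) == ["=", "<", "=", ">"]
assert compare_n_select([3, 1, 3, 5]) == 5
assert compare_n_increment({"K0": 0, "K2": 7}, "K2") == {"K0": 0, "K2": 8}
```

In buffered compare these loops run next to each bank's row buffer, so only the key and the small result cross the off-chip link.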
Buffered Compare Architecture
[Figure: per-bank logic added next to the bank I/O – a key buffer, an arithmetic unit, a result queue, and a command generator (CGEN)]
• Key buffer: holds a value written by the processor
• Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
• Result queue: stores compare results
• CGEN: repeats the bank-local commands
• The datapath is 64 bits wide
• 0.53% overhead in DRAM area
Buffered Compare Architecture – Programming Model

SW code:
  __kernel search(keys[], skey, d[]) {
      int id = get_global_id(0);
      if (keys[id] == skey)
          d[id] = 1;
  }

Instruction: BC_cmp_read(skey, keys, N)
DRAM command: CMP_RD(skey, addr, range)

• OpenCL-based programming model: the kernel is lowered to a single BC instruction, which the memory controller expands into DRAM commands
• Programmers need not be aware of DRAM parameters (page size, number of banks, …)
Evaluation – Setup
• McSimA+ simulator
• Processor
– 22nm, 16 OoO cores running at 3GHz
– 16KB private L1 caches
– 32MB S-NUCA L2
– Directory-based MESI coherence
• Memory
– 28nm, DDR4-2000
– 4 ranks per channel
– 16 banks per chip
– PAR-BS (parallelism-aware batch scheduling)
Evaluation – Setup
• Six workloads
– TSC: in-memory linear scan (column-store)
– TSR: in-memory linear scan (row-store)
– BT: B+ tree traversal (index scan)
– MAX: MAX aggregation
– SA: sequence assembly
– KV: key-value store
• BC was evaluated against the baseline and AMO (Active Memory Operation)
Evaluation – Speedup
• BC performs 3.62 times better than the baseline
Evaluation – Energy Reduction
[Figure: normalized energy (processor and memory) of Baseline, AMO, and BC across TSC, TSR, MAX, BT, KV, SA, and the geomean; one AMO bar reaches 6.05]
• Energy consumption reduced by 73.3% on average
– Proc: 77.2%
– Mem: 43.9%
Summary
• We proposed buffered compare, a processing-in-memory approach to utilizing the internal bandwidth of DRAM
– Minimal overhead in DRAM area
– Less invasive to existing DDR protocols
– 3.62X speedup and 73.3% energy reduction
• Limitations
– Does not utilize the cache
– Does not utilize the critical-word-first policy
– When using x4 devices, only up to 32 bits are supported for the operands