Near-Memory Processing: It's the SW and HW, stupid!
Boris Grot, www.inf.ed.ac.uk
DATE 2019
The End is Near Here! Where do we go from here?
An exponential is ending…
A 10–20% improvement in the performance of component X won't get you far:
– No new transistors
– Fixed power ceiling
Emerging technologies are either incremental (e.g., Intel's 3D XPoint memory) or cover niche areas (e.g., quantum)
The Way Forward: Vertical Integration
Software/hardware co-design for high efficiency and programmability
Is this always a good idea? No!
– Need high volume for cost-efficiency
– Need large perf/Watt gains to be worth the effort
This Talk
Vertical integration for in-memory data analytics
Data Analytics Takes Center Stage
User data grows exponentially
– Need to monetize data
In-memory data operators
– Poor locality
– Low computational requirement
– Highly parallel
Data movement
– High energy cost
– High BW requirement
Data movement bottlenecks data analytics
Cost of Moving Data
– Memory access: 640 pJ
– Fixed-point add: 0.1 pJ
A data access is much more expensive than an arithmetic operation
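The gap on this slide can be made concrete with a one-line calculation; the per-operation energy figures are the illustrative numbers from the talk, not measurements:

```python
# Illustrative energy figures from the slide (pJ per operation).
MEM_ACCESS_PJ = 640.0   # one DRAM access
FXP_ADD_PJ = 0.1        # one fixed-point add

# How many adds one memory access "costs" in energy terms.
ratio = MEM_ACCESS_PJ / FXP_ADD_PJ
print(ratio)  # 6400.0: one access costs thousands of adds
```

This 6400x gap is why data movement, not arithmetic, dominates the energy budget of low-compute operators.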
DRAM BW Bottleneck
– Internal BW (memory arrays to row buffer): 100s of GB/s
– Off-chip BW (DRAM to CPU): 24 GB/s
Internal DRAM BW presents a big opportunity
Logic inside DRAM? Not a Good Idea
Fabrication processes are not compatible
– DRAM is optimized for density
– Logic is irregular and wire-intensive
In-memory logic failed in the '90s
– DRAM is cost-sensitive
Must exploit DRAM in a non-disruptive manner
Near-Memory Processing (NMP)
3D logic/DRAM stack
– Exposes internal BW to processing elements
– But constrains the logic layer's area/power envelope
CPU-to-DRAM access: 640 pJ at 24 GB/s; near-memory access: 150 pJ at 128 GB/s
Exploit the bandwidth without data movement
How to Best Exploit DRAM BW?
DRAM internals are optimized for density
DRAM accesses must activate rows
– A single access activates KBs of data
– Activations dominate access latency & energy
Can't utilize internal BW with random accesses
– Need to maintain many open rows
– Complex bookkeeping logic
Need sequential access to utilize DRAM BW
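A back-of-envelope model shows why sequential streams are so much cheaper; the row and access sizes below are illustrative assumptions (a row activation pulls KBs into the row buffer, as the slide says), not figures from a specific DRAM part:

```python
# Assumed sizes, for illustration only.
ROW_SIZE = 8 * 1024   # bytes brought into the row buffer per activation
ACCESS = 64           # bytes consumed per request (one cache line)

def activations(requests, sequential):
    """Row activations needed to serve `requests` cache-line accesses."""
    if sequential:
        # Consecutive lines hit the already-open row: one activation
        # per row's worth of data.
        return max(1, requests * ACCESS // ROW_SIZE)
    # Random: pessimistically, each access lands in a closed row.
    return requests

print(activations(1024, sequential=True))   # 8 activations
print(activations(1024, sequential=False))  # 1024 activations
```

Under these assumptions a sequential stream pays roughly two orders of magnitude fewer activations, which is exactly the cost the slide says dominates latency and energy.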
NMP HW-Algorithm Co-Design
Algorithms: must have sequential access
– Even if we perform more work
Hardware: must leverage data parallelism
– On a tight area/power budget
HW-algorithm co-design is necessary to make the best use of NMP
Example Data Operator: Join
Iterates over a pair of tables to find matching keys
A major operation in data analytics
Q: SELECT ... FROM A, B WHERE A.Key = B.Key
(Diagram: tables A and B joined on matching keys to produce the result)
Baseline: CPU Hash Join
Best-performing algorithm in CPU-centric systems
Performed in two phases: Partition & Probe
1. Partition generates cache-sized partitions
2. Probe builds and queries cache-resident hash tables
Optimized for random accesses to cached data
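The two phases above can be sketched as follows; this is a minimal illustration of the partition-then-probe structure, with a hypothetical partition count (real implementations size partitions to fit the cache):

```python
def hash_join(table_a, table_b, n_parts=4):
    """Join two lists of (key, payload) tuples on key, hash-join style."""
    parts_a = [[] for _ in range(n_parts)]
    parts_b = [[] for _ in range(n_parts)]
    # Phase 1: Partition. Hash each tuple to a (cache-sized) partition.
    for t in table_a:
        parts_a[hash(t[0]) % n_parts].append(t)
    for t in table_b:
        parts_b[hash(t[0]) % n_parts].append(t)
    # Phase 2: Probe. Build a hash table per partition, then probe it.
    result = []
    for pa, pb in zip(parts_a, parts_b):
        ht = {}
        for k, v in pa:
            ht.setdefault(k, []).append(v)
        for k, v in pb:
            for va in ht.get(k, []):
                result.append((k, va, v))
    return result
```

Both the partitioning writes and the hash-table probes are random accesses, which is fine when everything fits in cache but, as the next slides show, wastes DRAM bandwidth near memory.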
NMP Hash Join
Goal: maximum memory-level parallelism (MLP)
– Limited by bookkeeping logic
– Poor row buffer utilization
Random accesses are inefficient and under-utilize internal BW
Eliminate Random Access?
Insight: use Sort Join
– Performs mostly sequential accesses
– But has higher algorithmic complexity
Trade algorithmic complexity for a desirable access pattern
– Hash join: O(n) random accesses
– Sort join: O(n log n) sequential accesses
Utilizing internal DRAM BW compensates for the increased cost
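The sort-join trade-off can be sketched as a sort followed by a purely sequential merge; this is an illustrative textbook version, not the talk's actual kernel:

```python
def sort_join(table_a, table_b):
    """Join two lists of (key, payload) tuples via sort + sequential merge."""
    a = sorted(table_a)            # O(n log n) work...
    b = sorted(table_b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):   # ...but the merge scans both
        ka, kb = a[i][0], b[j][0]      # tables strictly in order.
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Emit all B-side matches for this key, then advance A.
            j2 = j
            while j2 < len(b) and b[j2][0] == ka:
                out.append((ka, a[i][1], b[j2][1]))
                j2 += 1
            i += 1
    return out
```

Every memory touch in the merge loop moves forward through a sorted run, which is exactly the access pattern that keeps DRAM rows open.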
NMP Sort Join: Sequential Accesses
Drop OoO logic
– Reduces area/power of NMP
Add stream buffers
– Simple logic utilizes BW
– Good row buffer utilization
Sequential access moves the bottleneck to compute
NMP Sort Join: Compute
Use the area/power budget for SIMD
General-purpose SIMD keeps up with memory BW
Partitioning Phase
Partitioning basics:
– Each partition contains buckets of objects
– For a given object, the target bucket is determined using a hash
– The order of objects within each bucket is irrelevant → buckets are unordered
Insight: the order in which tuples are written into a bucket in the target partition is irrelevant
Partitioning phase: tuples are permutable
Partitioning Phase
Leverage the tuples' permutability property
Turn the partitioning phase's random accesses sequential
– Enables use of SIMD during partitioning
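One way the permutability insight can be exercised in software is to stage tuples in small per-partition buffers and flush each buffer as one sequential burst; this is an illustrative sketch, and the buffer size is a made-up parameter, not the talk's design:

```python
BUF = 4  # tuples staged per partition before a sequential flush (illustrative)

def partition(tuples, n_parts):
    """Hash-partition (key, payload) tuples with buffered, bursty writes."""
    buffers = [[] for _ in range(n_parts)]
    partitions = [[] for _ in range(n_parts)]
    for t in tuples:
        p = hash(t[0]) % n_parts
        buffers[p].append(t)
        if len(buffers[p]) == BUF:
            # Because bucket order is irrelevant, the whole buffer can be
            # written out as one sequential burst, in any order.
            partitions[p].extend(buffers[p])
            buffers[p].clear()
    for p, buf in enumerate(buffers):   # flush remainders
        partitions[p].extend(buf)
    return partitions
```

The per-tuple writes become wide, in-order appends, which is what makes the phase amenable to SIMD and to DRAM's sequential-access sweet spot.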
Mondrian
Algorithm + hardware co-design for near-memory processing of data analytics
NMP algorithms
– Use sequential memory accesses
– Avoid random memory accesses
– Target partitioning and compute phases
NMP hardware
– High memory parallelism using simple SIMD hardware
– Exploit sequential memory accesses
Methodology
Flexus cycle-accurate simulator [Wenisch'06]
Big data operators: Scan, Sort, Group By, Join
Simulated systems:
– CPU-centric: ARM Cortex-A57, 16 cores, 3-wide, 128-entry ROB @ 2 GHz
– NMP: mobile ARM core, 16 cores per stack, 3-wide, 48-entry ROB @ 1 GHz
– Mondrian: SIMD in-order, 16 cores per stack, 1024-bit SIMD @ 1 GHz
Memory subsystem:
– 4 HMC stacks
– 20 GB/s external BW
– 128 GB/s internal BW
Evaluation: Performance
(Chart: speedup over the CPU-centric baseline, log scale, for NMP and Mondrian across the Scan, Sort, Group By, and Join operators)
– NMP can't utilize memory BW with random accesses
– Mondrian achieves superior BW utilization
– Mondrian's BW utilization compensates for the extra log(n) work
Summary
End of technology scaling → must think vertical
– Software + hardware co-design
Big data analytics is a critical workload
– Large datasets, little locality → memory bottleneck!
Moving compute near memory improves performance
– But must conform to DRAM constraints
Mondrian is algorithm-hardware NMP for analytics
– Adapts algorithms/HW to DRAM constraints
– Sequential rather than random memory access
– Simple hardware to exploit memory bandwidth
Thank you! Questions?
inf.ed.ac.uk/bgrot
Mondrian Energy Efficiency
(Chart: efficiency improvement, performance/energy, log scale, for NMP-OoO and Mondrian across the Scan, Sort, Group By, and Join operators)