
Near-Memory Processing: It's the SW and HW, stupid! Boris Grot - PowerPoint PPT Presentation



  1. Near-Memory Processing: It’s the SW and HW, stupid! Boris Grot www.inf.ed.ac.uk DATE 2019

  2. The End is Near Here! Where do we go from here?

  3. An exponential is ending…
     A 10%, 20%, … improvement in the performance of component X won't get you far
     – No new transistors
     – Fixed power ceiling
     Emerging technologies are either incremental (e.g., Intel's 3D XPoint memory) or cover niche areas (e.g., quantum)

  4. The Way Forward: Vertical Integration
     Software/hardware co-design for high efficiency and programmability
     Is this always a good idea? No!
     – Need high volume for cost-efficiency
     – Need large perf/Watt gains to be worth the effort

  5. This Talk
     Vertical integration for in-memory data analytics

  6. Data Analytics Takes Center Stage
     User data grows exponentially
     – Need to monetize data
     In-memory data operators
     – Poor locality
     – Low computational requirement
     – Highly parallel

  7. Data Analytics Takes Center Stage
     User data grows exponentially
     – Need to monetize data
     In-memory data operators
     – Poor locality
     – Low computational requirement
     – Highly parallel
     Data movement
     – High energy cost
     – High BW requirement
     Data movement bottlenecks data analytics

  8. Cost of Moving Data
     Memory access: 640 pJ vs. fixed-point add: 0.1 pJ (a ~6,400× gap)
     [Figure: CPU–DRAM data path]
     A data access is far more expensive than an arithmetic operation

  9. DRAM BW Bottleneck
     [Figure: memory arrays and row buffers inside DRAM, connected to the CPU]
     100's of GB/s are available internally, but only 24 GB/s of off-chip BW reaches the CPU
     Internal DRAM BW presents a big opportunity

  10. Logic inside DRAM? Not a Good Idea
     Fabrication processes are not compatible
     – DRAM is optimized for density
     – Logic is irregular and wire-intensive
     In-memory logic failed in the '90s
     – DRAM is cost-sensitive
     Must exploit DRAM in a non-disruptive manner

  11. Near-Memory Processing (NMP)
     3D logic/DRAM stack
     – Exposes internal BW to processing elements
     – But constrains the logic layer's area/power envelope
     [Figure: CPU–DRAM path at 640 pJ / 24 GB/s vs. in-stack logic at 150 pJ / 128 GB/s]
     Exploit the bandwidth without data movement

  12. How to Best Exploit DRAM BW?
     DRAM internals are optimized for density
     DRAM accesses must activate rows
     – A single access activates KBs of data
     – Activations dominate access latency & energy
     Can't utilize internal BW with random access
     – Need to maintain many open rows
     – Complex bookkeeping logic
     Need sequential access to utilize DRAM BW
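The row-activation argument on this slide can be illustrated with a toy model: an access to a different row than the currently open one costs an activation, so a sequential stream amortizes each activation over many accesses while a random stream pays one per access. Row size and the access streams below are illustrative numbers, not figures from the talk.

```python
import random

ROW_BYTES = 2048  # one DRAM row buffer: KBs of data per activation

def count_activations(addresses):
    """Count row activations for an access stream (open-row policy)."""
    activations, open_row = 0, None
    for addr in addresses:
        row = addr // ROW_BYTES
        if row != open_row:      # row miss: must activate a new row
            activations += 1
            open_row = row
    return activations

# 2048 accesses at a 64 B stride: 32 consecutive hits per open row.
sequential = list(range(0, 64 * ROW_BYTES, 64))
rand = random.sample(range(0, 10**7), len(sequential))

print(count_activations(sequential))  # 64: one activation per row
print(count_activations(rand))        # close to one activation per access
```

With the same number of accesses, the sequential stream touches 64 rows once each, while the random stream activates a new row almost every access, which is why random access cannot utilize the internal bandwidth.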

  13. NMP HW-Algorithm Co-Design
     Algorithms: must use sequential access
     – Even if we perform more work
     Hardware: must leverage data parallelism
     – On a tight area/power budget
     HW-algorithm co-design is necessary to make best use of NMP

  14. Example Data Operator: Join
     Iterates over a pair of tables to find matching keys
     A major operation in data analytics
     Q: SELECT ... FROM A, B WHERE A.Key = B.Key
     [Figure: tables A and B joined on matching keys into a Result table]
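The join semantics behind the slide's SQL query can be sketched in a few lines of Python; the table contents below are illustrative, not taken from the slide's figure.

```python
# Minimal sketch of the equi-join from the slide's query:
#   SELECT ... FROM A, B WHERE A.Key = B.Key
# Tables are lists of (key, payload) tuples.

def nested_loop_join(table_a, table_b):
    """Naive O(n*m) join: emit one row per pair of matching keys."""
    result = []
    for key_a, payload_a in table_a:
        for key_b, payload_b in table_b:
            if key_a == key_b:
                result.append((key_a, payload_a, payload_b))
    return result

A = [("c", 1), ("a", 2), ("e", 3)]
B = [("a", 10), ("e", 20), ("b", 30)]
print(nested_loop_join(A, B))  # [('a', 2, 10), ('e', 3, 20)]
```

Real systems never run this naive nested loop; the next slides cover the hash- and sort-based algorithms that replace it.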

  15. Baseline: CPU Hash Join
     Best-performing algorithm in CPU-centric systems
     Performed in two phases: Partition & Probe
     1. Partition generates cache-sized partitions
     2. Probe builds and queries cache-resident hash tables
     [Figure: H(x) scattering tuples into partitions, then per-partition probing]
     Optimized for random accesses to cached data
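The two-phase structure described on the slide can be sketched as plain Python; the partition count and hash function are simplifications, not the tuned parameters a real CPU hash join would use.

```python
# Sketch of a partition + probe hash join. Phase 1 splits both tables
# into (cache-sized) partitions by key hash; phase 2 builds a hash
# table per partition of A and probes it with the matching partition of B.

def hash_join(table_a, table_b, num_partitions=4):
    # Phase 1: Partition.
    parts_a = [[] for _ in range(num_partitions)]
    parts_b = [[] for _ in range(num_partitions)]
    for row in table_a:
        parts_a[hash(row[0]) % num_partitions].append(row)
    for row in table_b:
        parts_b[hash(row[0]) % num_partitions].append(row)

    # Phase 2: Probe. Matching keys always land in the same partition.
    result = []
    for pa, pb in zip(parts_a, parts_b):
        ht = {}
        for key, payload in pa:
            ht.setdefault(key, []).append(payload)
        for key, payload_b in pb:
            for payload_a in ht.get(key, []):
                result.append((key, payload_a, payload_b))
    return result
```

The hash-table build and probe are exactly the random accesses the slide mentions: they are cheap while the partition fits in cache, which is what the NMP setting takes away.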

  16. NMP Hash Join
     [Figure: NMP logic hashes keys and issues requests to DRAM]
     Goal: maximum MLP (memory-level parallelism)
     • Limited by bookkeeping logic

  17. NMP Hash Join
     [Figure: hashed keys generate scattered DRAM addresses]
     Poor row buffer utilization

  18. NMP Hash Join
     [Figure: scattered requests each open a different DRAM row]
     Random accesses are inefficient and under-utilize internal BW

  19. Eliminate Random Access?
     Insight: use Sort Join
     – Performs mostly sequential accesses
     – But has higher algorithmic complexity
     Trade algorithmic complexity for a desirable access pattern:
     – Hash join: O(n) random accesses
     – Sort join: O(n log n) sequential accesses
     Utilizing internal DRAM BW compensates for the increased cost
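The trade the slide proposes can be made concrete with a sort-merge join sketch: pay O(n log n) to sort, then join with two cursors that only ever move forward, so every access is sequential. Table contents are illustrative.

```python
# Sketch of a sort join (sort-merge join). After sorting, the merge
# touches both tables strictly in order -- the sequential access
# pattern that DRAM row buffers reward.

def sort_join(table_a, table_b):
    a = sorted(table_a)  # O(n log n) extra work up front
    b = sorted(table_b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            # Emit all pairs sharing this key, then advance past it.
            key, j_start = a[i][0], j
            while i < len(a) and a[i][0] == key:
                j = j_start
                while j < len(b) and b[j][0] == key:
                    result.append((key, a[i][1], b[j][1]))
                    j += 1
                i += 1
    return result
```

On a CPU with caches, the hash join's O(n) random accesses win; the slide's point is that inside the stack, where sequential bandwidth is plentiful and random access is row-activation-bound, the extra log(n) work pays for itself.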

  20. NMP Sort Join: Sequential Accesses
     [Figure: stream buffers with base addresses feeding sorted runs in DRAM]
     Drop OoO logic
     • Reduces area/power of NMP
     Add stream buffers
     • Simple logic utilizes BW

  21. NMP Sort Join: Sequential Accesses
     [Figure: stream buffers loaded with base addresses &A and &B]

  22. NMP Sort Join: Sequential Accesses
     [Figure: stream buffers issue requests &A+0, &A+1, &B+0, &B+1]
     Good row buffer utilization

  23. NMP Sort Join: Sequential Accesses
     [Figure: responses for &A+0, &A+1, &B+0, &B+1 return to the stream buffers]

  24. NMP Sort Join: Sequential Accesses
     [Figure: stream buffers keep the sequential streams ahead of compute]
     Sequential access moves the bottleneck to compute

  25. NMP Sort Join: Compute
     [Figure: SIMD lanes consuming the stream buffers]
     Use the area/power budget for SIMD
     General-purpose SIMD keeps up with memory BW

  26. Partitioning Phase
     Partitioning basics:
     – Each partition contains buckets of objects
     – For a given object, the target bucket is determined using a hash
     – The order of objects within each bucket is irrelevant → buckets are unordered
     Insight: the order in which tuples are written into a bucket in the target partition is irrelevant
     Partitioning phase: tuples are permutable

  27. Partitioning Phase
     Leverage the tuples' permutability property
     Turn the partitioning phase's random accesses into sequential ones
     – Enables use of SIMD during partitioning
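One way to exploit permutability, sketched below as an assumption-level illustration rather than Mondrian's exact mechanism: because tuple order within a bucket is irrelevant, tuples can be staged in small per-bucket buffers and flushed to the partition as sequential bursts instead of scattered single writes.

```python
# Partitioning with per-bucket staging buffers. Each flush models one
# sequential burst write to the target bucket; the burst size and
# bucket count are illustrative.

def partition(tuples, num_buckets=4, burst=2):
    buckets = [[] for _ in range(num_buckets)]
    staging = [[] for _ in range(num_buckets)]
    for t in tuples:
        b = hash(t[0]) % num_buckets   # hash picks the target bucket
        staging[b].append(t)
        if len(staging[b]) == burst:   # buffer full:
            buckets[b].extend(staging[b])  # one sequential burst write
            staging[b].clear()
    for b in range(num_buckets):       # flush partially filled buffers
        buckets[b].extend(staging[b])
        staging[b].clear()
    return buckets
```

The result is the same multiset of tuples per bucket as scattered writes would produce, just written in a memory-friendly order; wide bursts are also what makes the writes amenable to SIMD.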

  28. Mondrian
     Algorithm + hardware co-design for near-memory processing of data analytics
     NMP algorithms
     – Use sequential memory accesses
     – Avoid random memory accesses
     – Target partitioning and compute phases
     NMP hardware
     – High memory parallelism using simple SIMD hardware
     – Exploits sequential memory accesses

  29. Methodology
     Flexus cycle-accurate simulator [Wenisch'06]
     Big data operators: Scan, Join, Group By, Sort
     Simulated systems:
     • CPU-centric: ARM Cortex-A57
       – 16 cores
       – 3-wide, 128-entry ROB @ 2 GHz
     • NMP: mobile ARM core
       – 16 cores per stack
       – 3-wide, 48-entry ROB @ 1 GHz
     • Mondrian: SIMD in-order
       – 16 cores per stack
       – 1024-bit SIMD @ 1 GHz
     Memory subsystem:
     • 4 HMC stacks
       – 20 GB/s external BW
       – 128 GB/s internal BW

  30. Evaluation: Performance
     [Chart: speedup over the CPU baseline (log scale) for NMP and Mondrian on Scan, Sort, Group By, and Join]

  31. Evaluation: Performance
     [Chart: speedup over the CPU baseline (log scale), Mondrian bars highlighted]
     Mondrian achieves superior BW utilization

  32. Evaluation: Performance
     [Chart: speedup over the CPU baseline (log scale), NMP bars highlighted]
     NMP can't utilize memory BW with random accesses

  33. Evaluation: Performance
     [Chart: speedup over the CPU baseline (log scale)]
     Mondrian's BW utilization compensates for the extra log(n) work

  34. Summary
     End of technology scaling → must think vertical
     – Software + hardware co-design
     Big data analytics is a critical workload
     – Large datasets, little locality → memory bottleneck!
     Moving compute near memory improves performance
     – But need to conform to DRAM constraints
     Mondrian is algorithm-hardware NMP for analytics
     – Adapts algorithms/HW to DRAM constraints
     – Sequential rather than random memory access
     – Simple hardware to exploit memory bandwidth

  35. Thank you! Questions? inf.ed.ac.uk/bgrot

  36. Mondrian Energy Efficiency
     [Chart: efficiency improvement (performance/energy, log scale) for NMP-OoO and Mondrian on Scan, Sort, Group By, and Join]
