CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading
Lifeng Nai*†, Ramyad Hadidi†, He Xiao†, Hyojong Kim†, Jaewoong Sim‡, Hyesoon Kim†
† Georgia Institute of Technology, * Google, ‡ Intel Labs
IPDPS-32 | May 2018
Disclaimer: This work does not relate to Google/Intel Labs.
2/24
Processing-in-memory (PIM) is regaining attention for energy-efficient computing
• Graph workloads: data-intensive, little data reuse
Basic concept: offload compute to memory
• Reduce the costly energy consumption of data movement
• Exploit the large internal memory bandwidth
[Figure: conventional data processing (data moves to the host) vs. processing-in-memory (an instruction, e.g., ADD, is sent to the data)]
3/24
PIM could increase memory temperature beyond the normal operating limit (85 °C)
• High bandwidth (hundreds of GB/s to TB/s) from 3D-stacked memory
• Less effective heat transfer compared to DIMMs
• PIM would make these thermal problems worse!
Consequences of a too-hot memory stack:
• Slower processing of memory requests
• Decreased overall system performance
CoolPIM keeps the memory "cool" to achieve better PIM performance
[Figure: a conventional memory rarely exceeds 85 °C, while a PIM-enabled stack can become too hot]
4/24
• Introduction
• Hybrid Memory Cube
  } Background
  } Thermal Measurements & Thermal Modeling of Future HMC
• CoolPIM
  } Software-Based Throttling
  } Hardware-Based Throttling
• Evaluation
• Conclusion
5/24
A Hybrid Memory Cube (HMC) from Micron
• Multiple 3D-stacked DRAM layers + one logic layer, connected with TSVs
• Vaults: equivalent to memory channels
• Full-duplex serial links between the host and the HMC
No PIM functionality in existing HMC products yet
[Figure: HMC structure — DRAM layers, logic layer, TSVs, vaults, and external serial links carrying packets]
6/24
Instruction-level PIM supported in future HMC (HMC 2.0)
• Perform Read-Modify-Write (RMW) operations atomically
• Similar to READ/WRITE packets; only the CMD field in the header differs
• No HMC 2.0 product yet!
HMC 2.0 PIM instructions by type:
• Arithmetic: Signed add
• Bitwise: Swap, bit write
• Boolean: AND/NAND/OR/NOR/XOR
• Comparison: CAS-equal/greater
Q: Can we offload all the PIM operations to HMC? What is the thermal impact of PIM in future HMC?
[Figure: a PIM-ADD(addr, imm) request packet (header, addr/imm payload, tail) is sent to the logic layer; the update is performed in the stack and an ACK is returned]
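To make the "just a different CMD in the header" point concrete, here is a minimal C++ sketch of what a host-side PIM-ADD request could look like. The field names, widths, and command encoding are illustrative assumptions, not the HMC 2.0 specification; the only point carried over from the slide is that a PIM request reuses the ordinary request-packet format with a different command code.

    #include <cstdint>

    // Illustrative sketch of an HMC request packet carrying a PIM-ADD.
    // Field names/widths are assumptions, not the HMC 2.0 spec.
    struct HmcRequestPacket {
        // Header: the command code selects READ, WRITE, or a PIM op such as signed add.
        uint8_t  cmd;     // e.g. CMD_READ, CMD_WRITE, CMD_PIM_ADD (hypothetical codes)
        uint64_t addr;    // target DRAM address inside the cube
        // Payload: for PIM-ADD, the immediate operand to add in place.
        uint64_t imm;
        // Tail: CRC and sequencing info (contents omitted in this sketch).
        uint32_t tail;
    };

    // A PIM-ADD request differs from a WRITE only in the command code; the cube
    // performs the read-modify-write atomically and returns an ACK.
    HmcRequestPacket make_pim_add(uint64_t addr, uint64_t imm) {
        const uint8_t CMD_PIM_ADD = 0x12;  // hypothetical encoding
        return HmcRequestPacket{CMD_PIM_ADD, addr, imm, /*tail=*/0};
    }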
7/24
Experiment Platform (Pico SC-6 Mini System)
• Intel Core i7 host + FPGA compute modules (AC-510)
  } AC-510: 4 GB HMC 1.1 + Kintex UltraScale FPGA
• Measure the temperature on the heat sink
• Control memory bandwidth via FPGA RTL
• Apply three different cooling methods
  } High-end active heat sink
  } Low-end active heat sink
  } Passive heat sink
Note: HMC 1.1 has no PIM functionality!
[Figure: the AC-510 module with HMC, FPGA, and bandwidth-control RTL]
8/24
[Thermal camera images: HMC and FPGA temperatures when idle, busy, and at thermal shutdown, under high-end active, low-end active, and passive heat sinks]
9/24
Thermal modeling for HMC 2.0 with commodity-server active cooling
• HMC 2.0 (w/o PIM) would reach 81 °C at full external bandwidth (320 GB/s)
  } We validated our thermal model against measurements on HMC 1.1
• HMC operating temperature range: 0 °C–105 °C
We need at least commodity-server cooling to benefit from PIM!
[Figure: peak DRAM temperature vs. data bandwidth (0–320 GB/s) under passive, low-end, commodity, and high-end cooling; better cooling and lower bandwidth keep the stack cooler]
10/24
PIM increases memory temperature due to the power consumption of the logic and DRAM layers
• In our modeling, the maximum PIM offloading rate is 6.5 PIM ops/ns
• A high offloading rate could force the memory to slow down in order to cool off
[Figure: peak DRAM temperature vs. PIM offloading rate (0–7 ops/ns); 0 °C–85 °C is the desirable range, 85 °C–95 °C and 95 °C–105 °C reduce memory performance, and beyond that the stack is too hot]
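A minimal lumped-parameter sketch of why a higher PIM offloading rate raises the stack temperature. This is not the paper's 3D-ICE model; the ambient temperature, static power, and per-op energy below are purely illustrative assumptions (only the 0.5 °C/W commodity-server figure comes from the backup slide).

    #include <cstdio>

    // Steady-state lumped estimate (illustrative only, not the 3D-ICE model):
    // T_peak ≈ T_ambient + R_thermal * P_total, where P_total grows with the PIM rate.
    int main() {
        const double t_ambient_c   = 45.0;  // assumed inlet/ambient temperature (°C)
        const double r_thermal_cpw = 0.5;   // commodity-server heat sink (°C/W)
        const double p_static_w    = 40.0;  // assumed baseline cube power (W)
        const double energy_per_op = 8.0;   // assumed extra energy per PIM op (nJ)

        for (double rate = 0.0; rate <= 7.0; rate += 1.0) {   // PIM ops per ns
            // rate [op/ns] * energy [nJ/op] = power [W]
            double p_total = p_static_w + rate * energy_per_op;
            double t_peak  = t_ambient_c + r_thermal_cpw * p_total;
            printf("PIM rate %.0f op/ns -> ~%.1f C peak DRAM temperature\n", rate, t_peak);
        }
        return 0;
    }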
11/24
PIM intensity needs to be controlled!!
• Higher bandwidth benefits → better performance
• Higher DRAM temperature → lower memory performance
[Figure: performance vs. PIM offloading rate — performance first rises with offloading, then drops once the stack overheats]
12/24 CoolPIM Controls PIM Intensity with Thermal Consideration
13/24
We propose two methods for GPU/HMC systems:
1) A SW mechanism with no hardware changes
2) A HW mechanism with changes to the GPU architecture
Both perform dynamic source throttling based on thermal warning messages from the HMC
• Thermal warning → lower PIM intensity → reduce the internal temperature of the HMC
[Figure: the HMC sends a thermal warning to the GPU; the SW method updates the number of PIM-enabled CUDA blocks in the GPU runtime, while the HW method updates the number of PIM-enabled warps in the GPU hardware, each adjusting the PIM offloading intensity]
14/24
The GPU runtime implements several components to control PIM intensity (see the sketch after this slide):
• PIM Token Pool (PTP)
  } Maximum number of thread blocks allowed to use PIM functionality
• Thread Block Manager
  } Checks the PTP and launches PIM code only if tokens are available
• Initialization
  } Estimates the initial PTP size from static analysis at compile time
[Figure: the HMC forwards a thermal warning over PCI-E; the GPU runtime's interrupt handler shrinks the PIM token pool, and the thread block manager consults the pool and the PIM offloading switch when launching blocks onto the SMs/vaults]
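A minimal host-side sketch of the software throttling loop described above. The class and function names (PimTokenPool, on_thermal_warning, the throttling step) are hypothetical; only the overall flow — thermal warning shrinks the token pool, fewer blocks may use PIM — is taken from the slide.

    #include <atomic>
    #include <algorithm>

    // Hypothetical sketch of CoolPIM's software-based source throttling.
    class PimTokenPool {
    public:
        explicit PimTokenPool(int initial_tokens) : tokens_(initial_tokens) {}

        // Called by the thread block manager before launching a block:
        // returns true if the block may run the PIM-enabled kernel version.
        bool try_acquire() {
            int cur = tokens_.load();
            while (cur > 0 && !tokens_.compare_exchange_weak(cur, cur - 1)) {}
            return cur > 0;
        }

        void release() { tokens_.fetch_add(1); }

        // Interrupt-handler path: a thermal warning from the HMC shrinks the pool,
        // lowering PIM intensity so the stack can cool down.
        void on_thermal_warning() {
            int cur = tokens_.load();
            tokens_.store(std::max(0, cur - kThrottleStep));
        }

    private:
        static constexpr int kThrottleStep = 1;  // assumed reduction per warning
        std::atomic<int> tokens_;
    };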
15/24
The GPU compiler generates PIM-enabled and non-PIM kernels at compile time
• Source-to-source translation
• IR-to-IR translation
Original PIM code:
    void cuda_kernel(arg_list) {
        for (int i = 0; i < end; i++) {
            uint addr = addrArray[i];
            PIM_Add(addr, 1);
        }
    }
Shadow non-PIM code:
    void cuda_kernel_np(arg_list) {
        for (int i = 0; i < end; i++) {
            uint addr = addrArray[i];
            atomicAdd(addr, 1);
        }
    }
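Below is a hedged, compilable restatement of the slide's two kernel versions plus a launch-time dispatch using a token from the pool sketched earlier. The PIM_Add macro, the pointer-based signatures, and the dispatch function are assumptions made so the sketch builds; in the real system the PIM version would emit an HMC PIM-ADD request rather than a GPU atomic.

    #include <cuda_runtime.h>

    // Placeholder so the sketch compiles; the real PIM version would offload to the HMC.
    #define PIM_Add(ptr, v) atomicAdd((ptr), (v))

    __global__ void cuda_kernel(unsigned int* addrArray_vals, int end) {
        for (int i = 0; i < end; i++)
            PIM_Add(&addrArray_vals[i], 1u);     // would become a PIM-ADD offload
    }

    __global__ void cuda_kernel_np(unsigned int* addrArray_vals, int end) {
        for (int i = 0; i < end; i++)
            atomicAdd(&addrArray_vals[i], 1u);   // regular GPU atomic, no offloading
    }

    // Hypothetical launch-time dispatch: run the PIM-enabled version only if a
    // PIM token was acquired; otherwise fall back to the shadow non-PIM version.
    void launch_kernel(bool pim_token_acquired, unsigned int* d_vals, int end) {
        if (pim_token_acquired)
            cuda_kernel<<<1, 1>>>(d_vals, end);
        else
            cuda_kernel_np<<<1, 1>>>(d_vals, end);
    }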
16/24
Hardware-based throttling: PIM Control Unit
• Controls the number of PIM-enabled warps
• Performs dynamic binary translation of PIM instructions into non-PIM atomics
• See the paper for details!
PIM-to-non-PIM instruction mapping:
• Arithmetic: Signed add → atomicAdd
• Bitwise: Swap, bit write → atomicExch
• Boolean: AND, OR → atomicAnd/atomicOr
• Comparison: CAS-equal/greater → atomicCAS/atomicMax
[Figure: the PIM control unit sits between the SMs and the HMC links; it uses the PIM offloading switch and thermal warnings to control the number of PIM-enabled warps]
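A small sketch of the translation table such a control unit conceptually implements, written as host-side C++ for clarity. The mapping mirrors the slide's table; everything else (the enum, the function) is an illustrative assumption — the actual unit rewrites warp instructions in hardware when PIM offloading must be throttled.

    #include <cstdio>

    // Conceptual PIM -> non-PIM fallback table (the slide's mapping), in software
    // for illustration only.
    enum class PimOp { SignedAdd, Swap, BitWrite, And, Or, CasEqual, CasGreater };

    const char* fallback_atomic(PimOp op) {
        switch (op) {
            case PimOp::SignedAdd:  return "atomicAdd";
            case PimOp::Swap:
            case PimOp::BitWrite:   return "atomicExch";
            case PimOp::And:        return "atomicAnd";
            case PimOp::Or:         return "atomicOr";
            case PimOp::CasEqual:   return "atomicCAS";
            case PimOp::CasGreater: return "atomicMax";
        }
        return "unknown";
    }

    int main() {
        printf("PIM signed add falls back to %s\n", fallback_atomic(PimOp::SignedAdd));
        return 0;
    }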
17/24 Evaluation
18/24
Thermal Evaluation
• Temperature measurement: real HMC 1.1 platform (FPGA with bandwidth-control RTL, thermal camera)
• Thermal modeling: HMC 2.0 using 3D-ICE, validated against the HMC 1.1 measurements
• Power & area: Synopsys synthesis (28nm/50nm CMOS)
Performance Evaluation
• MacSim (GPU timing) with VaultSim (HMC timing)
• GraphBIG benchmarks with the LDBC dataset
  } BFS, SSSP, PageRank, etc.
[Figure: evaluation flow — measurements and the HMC spec feed the thermal model; Verilog RTL feeds power/area synthesis; benchmarks feed the GPU/HMC timing simulation]
19/24
Speedup over baseline (non-offloading)
• Naïve/SW/HW: using a commodity-server active heat sink
• Ideal Thermal: with unlimited cooling
[Figure: per-benchmark speedup of naïve offloading, CoolPIM (SW), CoolPIM (HW), and ideal thermal over the non-offloading baseline]
On average, CoolPIM (SW/HW) improves performance by 1.21x/1.25x!
20/24
PIM Offloading Rate
• Naïve: 3–4 ops/ns → temperature goes beyond the normal operating region
• CoolPIM: 1.3 ops/ns → no memory performance slowdown
[Figure: peak DRAM temperature (75–100 °C) of naïve offloading, CoolPIM (SW), and CoolPIM (HW) across the bfs, sssp, dc, kcore, and pagerank workloads]
CoolPIM maintains the peak DRAM temperature within the normal operating range!
21/24 Conclusion
22/24
Observation: PIM integration requires careful thermal consideration
• Naïve PIM offloading may cause thermal issues and degrade overall system performance
CoolPIM: source throttling techniques to control PIM intensity
• Keeps the HMC "cool" to avoid thermally triggered memory performance degradation
Results: CoolPIM improves performance by 1.37x over naïve offloading
• and by 1.2x over non-offloading on average
23/24 Thank You
24/24 Backup
25/24
Cooling configurations (thermal resistance, cooling power*):
• Passive heat sink: 4.0 °C/W, 0
• Low-end active heat sink: 2.0 °C/W, 1x
• Commodity-server active heat sink: 0.5 °C/W, 104x
• High-end heat sink: 0.2 °C/W, 380x
* We assume the same plate-fin heat sink model for all configurations.
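To see why the thermal resistance matters, a simple steady-state estimate: the rise above ambient is roughly ΔT ≈ R_th × P. Assuming, purely for illustration, ~40 W of total cube power (a number not given on this slide), that is about 160 °C with the passive sink, 80 °C low-end, 20 °C commodity-server, and 8 °C high-end — only the latter two keep the stack within its operating range.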
26/24
Validate our thermal evaluation environment
• Model HMC 1.1 temperature and compare with measurements
[Figure: measured surface temperature, estimated die temperature, and modeled die temperature for the low-end and high-end heat sinks]
27/24
Simulation configuration:
• Host GPU: 16 PTX SMs, 32 threads/warp, 1.4 GHz; 16KB private L1D and 1MB 16-way L2 cache
• HMC: 8 GB cube, 1 logic die, 8 DRAM dies; 32 vaults, 512 DRAM banks; tCL=tRCD=tRP=13.75ns, tRAS=27.5ns; 4 links per package, 120 GB/s per link, 80 GB/s data bandwidth per link
• DRAM temperature phases: 0–85 °C, 85–95 °C, 95–105 °C; 20% DRAM frequency reduction in the high-temperature phases
28/24
Bandwidth consumption normalized to baseline (non-offloading)
[Figure: normalized bandwidth (0–1) of non-offloading, naïve offloading, CoolPIM (SW), and CoolPIM (HW) across the benchmarks]
29/24
[Figure: software throttling flow — the HMC issues a thermal warning to the GPU interrupt handler, which reduces the PIM token pool size; the CUDA block manager checks the token pool to select either the PIM or non-PIM code version when launching CUDA blocks onto the GPU SMs for offloading]