CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading
Lifeng Nai*†, Ramyad Hadidi†, He Xiao†, Hyojong Kim†, Jaewoong Sim‡, Hyesoon Kim†
† Georgia Institute of Technology, * Google, ‡ Intel Labs
IPDPS-32 | May 2018
Disclaimer: This work does not relate to Google/Intel Labs.
2/24
Processing-in-memory (PIM) is regaining attention for energy-efficient computing
• Graph workloads: data-intensive, little data reuse
Basic concept: offload compute to memory
• Reduce the costly energy consumption of data movement
• Exploit the large internal memory bandwidth
[Figure: conventional data processing (data moves to the host) vs. processing-in-memory (an instruction, e.g., ADD, is sent to the data)]
3/24
PIM could increase memory temperature beyond the normal operating limit (85 °C)
• High bandwidth (hundreds of GB/s to TB/s) from 3D-stacked memory
• Less effective heat transfer compared to DIMMs
• PIM would make these thermal problems worse!
Consequences of a too-hot memory stack:
• Slower processing of memory requests
• Decreased overall system performance
CoolPIM keeps the memory "cool" to achieve better PIM performance
[Figure: a conventional memory rarely exceeds 85 °C, while a PIM-enabled stack can become too hot]
4/24
• Introduction
• Hybrid Memory Cube
  } Background
  } Thermal Measurements & Thermal Modeling of Future HMC
• CoolPIM
  } Software-Based Throttling
  } Hardware-Based Throttling
• Evaluation
• Conclusion
5/24
A Hybrid Memory Cube (HMC) from Micron
• Multiple 3D-stacked DRAM layers + one logic layer, connected with TSVs
• Vaults: equivalent to memory channels
• Full-duplex serial links between the host and the HMC
No PIM functionality in existing HMC products yet
[Figure: HMC structure — DRAM layers, logic layer, TSVs, vaults, and external serial links carrying packets]
6/24
Instruction-level PIM supported in future HMC (HMC 2.0)
• Perform Read-Modify-Write (RMW) operations atomically
• Similar to READ/WRITE packets; only the CMD field in the header differs
• No HMC 2.0 product yet!
HMC 2.0 PIM instructions by type:
• Arithmetic: Signed add
• Bitwise: Swap, bit write
• Boolean: AND/NAND/OR/NOR/XOR
• Comparison: CAS-equal/greater
Q: Can we offload all the PIM operations to HMC? What is the thermal impact of PIM in future HMC?
[Figure: a PIM-ADD(addr, imm) request packet (header, addr/imm payload, tail) is sent to the logic layer; the update is performed in the stack and an ACK is returned]
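To make the "just a different CMD in the header" point concrete, here is a minimal C++ sketch of what a host-side PIM-ADD request could look like. The field names, widths, and command encoding are illustrative assumptions, not the HMC 2.0 specification; the only point carried over from the slide is that a PIM request reuses the ordinary request-packet format with a different command code.

    #include <cstdint>

    // Illustrative sketch of an HMC request packet carrying a PIM-ADD.
    // Field names/widths are assumptions, not the HMC 2.0 spec.
    struct HmcRequestPacket {
        // Header: the command code selects READ, WRITE, or a PIM op such as signed add.
        uint8_t  cmd;     // e.g. CMD_READ, CMD_WRITE, CMD_PIM_ADD (hypothetical codes)
        uint64_t addr;    // target DRAM address inside the cube
        // Payload: for PIM-ADD, the immediate operand to add in place.
        uint64_t imm;
        // Tail: CRC and sequencing info (contents omitted in this sketch).
        uint32_t tail;
    };

    // A PIM-ADD request differs from a WRITE only in the command code; the cube
    // performs the read-modify-write atomically and returns an ACK.
    HmcRequestPacket make_pim_add(uint64_t addr, uint64_t imm) {
        const uint8_t CMD_PIM_ADD = 0x12;  // hypothetical encoding
        return HmcRequestPacket{CMD_PIM_ADD, addr, imm, /*tail=*/0};
    }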
7/24
Experiment Platform (Pico SC-6 Mini System)
• Intel Core i7 host + FPGA compute modules (AC-510)
  } AC-510: 4 GB HMC 1.1 + Kintex UltraScale FPGA
• Measure the temperature on the heat sink
• Control memory bandwidth via FPGA RTL
• Apply three different cooling methods
  } High-end active heat sink
  } Low-end active heat sink
  } Passive heat sink
Note: HMC 1.1 has no PIM functionality!
[Figure: the AC-510 module with HMC, FPGA, and bandwidth-control RTL]
8/24
[Thermal camera images: HMC and FPGA temperatures when idle, busy, and at thermal shutdown, under high-end active, low-end active, and passive heat sinks]
9/24
Thermal modeling for HMC 2.0 with commodity-server active cooling
• HMC 2.0 (w/o PIM) would reach 81 °C at full external bandwidth (320 GB/s)
  } We validated our thermal model against measurements on HMC 1.1
• HMC operating temperature range: 0 °C–105 °C
We need at least commodity-server cooling to benefit from PIM!
[Figure: peak DRAM temperature vs. data bandwidth (0–320 GB/s) under passive, low-end, commodity, and high-end cooling; better cooling and lower bandwidth keep the stack cooler]
10/24
PIM increases memory temperature due to the power consumption of the logic and DRAM layers
• In our modeling, the maximum PIM offloading rate is 6.5 PIM ops/ns
• A high offloading rate could force the memory to slow down in order to cool off
[Figure: peak DRAM temperature vs. PIM offloading rate (0–7 ops/ns); 0 °C–85 °C is the desirable range, 85 °C–95 °C and 95 °C–105 °C reduce memory performance, and beyond that the stack is too hot]
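A minimal lumped-parameter sketch of why a higher PIM offloading rate raises the stack temperature. This is not the paper's 3D-ICE model; the ambient temperature, static power, and per-op energy below are purely illustrative assumptions (only the 0.5 °C/W commodity-server figure comes from the backup slide).

    #include <cstdio>

    // Steady-state lumped estimate (illustrative only, not the 3D-ICE model):
    // T_peak ≈ T_ambient + R_thermal * P_total, where P_total grows with the PIM rate.
    int main() {
        const double t_ambient_c   = 45.0;  // assumed inlet/ambient temperature (°C)
        const double r_thermal_cpw = 0.5;   // commodity-server heat sink (°C/W)
        const double p_static_w    = 40.0;  // assumed baseline cube power (W)
        const double energy_per_op = 8.0;   // assumed extra energy per PIM op (nJ)

        for (double rate = 0.0; rate <= 7.0; rate += 1.0) {   // PIM ops per ns
            // rate [op/ns] * energy [nJ/op] = power [W]
            double p_total = p_static_w + rate * energy_per_op;
            double t_peak  = t_ambient_c + r_thermal_cpw * p_total;
            printf("PIM rate %.0f op/ns -> ~%.1f C peak DRAM temperature\n", rate, t_peak);
        }
        return 0;
    }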
11/24
PIM intensity needs to be controlled!!
• Higher bandwidth benefits → better performance
• Higher DRAM temperature → lower memory performance
[Figure: performance vs. PIM offloading rate — performance first rises with offloading, then drops once the stack overheats]
12/24 CoolPIM Controls PIM Intensity with Thermal Consideration
13/24
We propose two methods for GPU/HMC systems:
1) A SW mechanism with no hardware changes
2) A HW mechanism with changes to the GPU architecture
Both perform dynamic source throttling based on thermal warning messages from the HMC
• Thermal warning → lower PIM intensity → reduce the internal temperature of the HMC
[Figure: the HMC sends a thermal warning to the GPU; the SW method updates the number of PIM-enabled CUDA blocks in the GPU runtime, while the HW method updates the number of PIM-enabled warps in the GPU hardware, each adjusting the PIM offloading intensity]
14/24
The GPU runtime implements several components to control PIM intensity (see the sketch after this slide):
• PIM Token Pool (PTP)
  } Maximum number of thread blocks allowed to use PIM functionality
• Thread Block Manager
  } Checks the PTP and launches PIM code only if tokens are available
• Initialization
  } Estimates the initial PTP size from static analysis at compile time
[Figure: the HMC forwards a thermal warning over PCI-E; the GPU runtime's interrupt handler shrinks the PIM token pool, and the thread block manager consults the pool and the PIM offloading switch when launching blocks onto the SMs/vaults]
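A minimal host-side sketch of the software throttling loop described above. The class and function names (PimTokenPool, on_thermal_warning, the throttling step) are hypothetical; only the overall flow — thermal warning shrinks the token pool, fewer blocks may use PIM — is taken from the slide.

    #include <atomic>
    #include <algorithm>

    // Hypothetical sketch of CoolPIM's software-based source throttling.
    class PimTokenPool {
    public:
        explicit PimTokenPool(int initial_tokens) : tokens_(initial_tokens) {}

        // Called by the thread block manager before launching a block:
        // returns true if the block may run the PIM-enabled kernel version.
        bool try_acquire() {
            int cur = tokens_.load();
            while (cur > 0 && !tokens_.compare_exchange_weak(cur, cur - 1)) {}
            return cur > 0;
        }

        void release() { tokens_.fetch_add(1); }

        // Interrupt-handler path: a thermal warning from the HMC shrinks the pool,
        // lowering PIM intensity so the stack can cool down.
        void on_thermal_warning() {
            int cur = tokens_.load();
            tokens_.store(std::max(0, cur - kThrottleStep));
        }

    private:
        static constexpr int kThrottleStep = 1;  // assumed reduction per warning
        std::atomic<int> tokens_;
    };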
15/24
The GPU compiler generates PIM-enabled and non-PIM kernels at compile time
• Source-to-source translation
• IR-to-IR translation
Original PIM code:
    void cuda_kernel(arg_list) {
        for (int i = 0; i < end; i++) {
            uint addr = addrArray[i];
            PIM_Add(addr, 1);
        }
    }
Shadow non-PIM code:
    void cuda_kernel_np(arg_list) {
        for (int i = 0; i < end; i++) {
            uint addr = addrArray[i];
            atomicAdd(addr, 1);
        }
    }
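Below is a hedged, compilable restatement of the slide's two kernel versions plus a launch-time dispatch using a token from the pool sketched earlier. The PIM_Add macro, the pointer-based signatures, and the dispatch function are assumptions made so the sketch builds; in the real system the PIM version would emit an HMC PIM-ADD request rather than a GPU atomic.

    #include <cuda_runtime.h>

    // Placeholder so the sketch compiles; the real PIM version would offload to the HMC.
    #define PIM_Add(ptr, v) atomicAdd((ptr), (v))

    __global__ void cuda_kernel(unsigned int* addrArray_vals, int end) {
        for (int i = 0; i < end; i++)
            PIM_Add(&addrArray_vals[i], 1u);     // would become a PIM-ADD offload
    }

    __global__ void cuda_kernel_np(unsigned int* addrArray_vals, int end) {
        for (int i = 0; i < end; i++)
            atomicAdd(&addrArray_vals[i], 1u);   // regular GPU atomic, no offloading
    }

    // Hypothetical launch-time dispatch: run the PIM-enabled version only if a
    // PIM token was acquired; otherwise fall back to the shadow non-PIM version.
    void launch_kernel(bool pim_token_acquired, unsigned int* d_vals, int end) {
        if (pim_token_acquired)
            cuda_kernel<<<1, 1>>>(d_vals, end);
        else
            cuda_kernel_np<<<1, 1>>>(d_vals, end);
    }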
16/24
Hardware-based throttling: PIM Control Unit
• Controls the number of PIM-enabled warps
• Performs dynamic binary translation of PIM instructions into non-PIM atomics
• See the paper for details!
PIM-to-non-PIM instruction mapping:
• Arithmetic: Signed add → atomicAdd
• Bitwise: Swap, bit write → atomicExch
• Boolean: AND, OR → atomicAnd/atomicOr
• Comparison: CAS-equal/greater → atomicCAS/atomicMax
[Figure: the PIM control unit sits between the SMs and the HMC links; it uses the PIM offloading switch and thermal warnings to control the number of PIM-enabled warps]
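A small sketch of the translation table such a control unit conceptually implements, written as host-side C++ for clarity. The mapping mirrors the slide's table; everything else (the enum, the function) is an illustrative assumption — the actual unit rewrites warp instructions in hardware when PIM offloading must be throttled.

    #include <cstdio>

    // Conceptual PIM -> non-PIM fallback table (the slide's mapping), in software
    // for illustration only.
    enum class PimOp { SignedAdd, Swap, BitWrite, And, Or, CasEqual, CasGreater };

    const char* fallback_atomic(PimOp op) {
        switch (op) {
            case PimOp::SignedAdd:  return "atomicAdd";
            case PimOp::Swap:
            case PimOp::BitWrite:   return "atomicExch";
            case PimOp::And:        return "atomicAnd";
            case PimOp::Or:         return "atomicOr";
            case PimOp::CasEqual:   return "atomicCAS";
            case PimOp::CasGreater: return "atomicMax";
        }
        return "unknown";
    }

    int main() {
        printf("PIM signed add falls back to %s\n", fallback_atomic(PimOp::SignedAdd));
        return 0;
    }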
17/24 Evaluation
18/24
Thermal Evaluation
• Temperature measurement: real HMC 1.1 platform (FPGA with bandwidth-control RTL, thermal camera)
• Thermal modeling: HMC 2.0 using 3D-ICE, validated against the HMC 1.1 measurements
• Power & area: Synopsys synthesis (28nm/50nm CMOS)
Performance Evaluation
• MacSim (GPU timing) with VaultSim (HMC timing)
• GraphBIG benchmarks with the LDBC dataset
  } BFS, SSSP, PageRank, etc.
[Figure: evaluation flow — measurements and the HMC spec feed the thermal model; Verilog RTL feeds power/area synthesis; benchmarks feed the GPU/HMC timing simulation]
19/24
Speedup over baseline (non-offloading)
• Naïve/SW/HW: using a commodity-server active heat sink
• Ideal Thermal: with unlimited cooling
[Figure: per-benchmark speedup of naïve offloading, CoolPIM (SW), CoolPIM (HW), and ideal thermal over the non-offloading baseline]
On average, CoolPIM (SW/HW) improves performance by 1.21x/1.25x!
20/24
PIM Offloading Rate
• Naïve: 3–4 ops/ns → temperature goes beyond the normal operating region
• CoolPIM: 1.3 ops/ns → no memory performance slowdown
[Figure: peak DRAM temperature (75–100 °C) of naïve offloading, CoolPIM (SW), and CoolPIM (HW) across the bfs, sssp, dc, kcore, and pagerank workloads]
CoolPIM maintains the peak DRAM temperature within the normal operating range!
21/24 Conclusion
22/24
Observation: PIM integration requires careful thermal consideration
• Naïve PIM offloading may cause thermal issues and degrade overall system performance
CoolPIM: source throttling techniques to control PIM intensity
• Keeps the HMC "cool" to avoid thermally triggered memory performance degradation
Results: CoolPIM improves performance by 1.37x over naïve offloading
• and by 1.2x over non-offloading on average
23/24 Thank You
24/24 Backup
25/24
Cooling configurations (thermal resistance, cooling power*):
• Passive heat sink: 4.0 °C/W, 0
• Low-end active heat sink: 2.0 °C/W, 1x
• Commodity-server active heat sink: 0.5 °C/W, 104x
• High-end heat sink: 0.2 °C/W, 380x
* We assume the same plate-fin heat sink model for all configurations.
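To see why the thermal resistance matters, a simple steady-state estimate: the rise above ambient is roughly ΔT ≈ R_th × P. Assuming, purely for illustration, ~40 W of total cube power (a number not given on this slide), that is about 160 °C with the passive sink, 80 °C low-end, 20 °C commodity-server, and 8 °C high-end — only the latter two keep the stack within its operating range.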
26/24
Validate our thermal evaluation environment
• Model HMC 1.1 temperature and compare with measurements
[Figure: measured surface temperature, estimated die temperature, and modeled die temperature for the low-end and high-end heat sinks]
27/24
Simulation configuration:
• Host GPU: 16 PTX SMs, 32 threads/warp, 1.4 GHz; 16KB private L1D and 1MB 16-way L2 cache
• HMC: 8 GB cube, 1 logic die, 8 DRAM dies; 32 vaults, 512 DRAM banks; tCL=tRCD=tRP=13.75ns, tRAS=27.5ns; 4 links per package, 120 GB/s per link, 80 GB/s data bandwidth per link
• DRAM temperature phases: 0–85 °C, 85–95 °C, 95–105 °C; 20% DRAM frequency reduction in the high-temperature phases
28/24
Bandwidth consumption normalized to baseline (non-offloading)
[Figure: normalized bandwidth (0–1) of non-offloading, naïve offloading, CoolPIM (SW), and CoolPIM (HW) across the benchmarks]
29/24
[Figure: software throttling flow — the HMC issues a thermal warning to the GPU interrupt handler, which reduces the PIM token pool size; the CUDA block manager checks the token pool to select either the PIM or non-PIM code version when launching CUDA blocks onto the GPU SMs for offloading]