Microscope on Memory: MPSoC-Enabled Computer Memory System Assessments
FCCM 2018
Abhishek Kumar Jain, Scott Lloyd, Maya Gokhale
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
May 1, 2018
LLNL-PRES-750335. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Introduction
▪ Recent advances in memory technology and packaging
— High-bandwidth memories: HBM, HMC
— Non-volatile memory: 3D XPoint
— Potential for logic and compute functions co-located with the memory
— Brought attention to computer memory system design and evaluation
[Images: HMC, HBM, and 3D XPoint packages. Credits: Hongshin Jun et al., IMW 2017 (Creative Commons Attribution); Micron Technology]
Introduction
▪ Emerging memories exhibit a wide range of bandwidths, latencies, and capacities
— Challenge for computer architects to navigate the design space
— Challenge for application developers to assess performance implications

Memory/Storage Hierarchy (capacity, latency):
— SRAM: 8 MB, 10 ns
— Near DRAM: 128 MB, 45 ns
— DDR DRAM: 8 GB, 70 ns
— Far DRAM: 8 GB, 100 ns
— NVM: 64 GB, 200 ns
— SSD: 128 GB, 50 us
— HDD: 6 TB, 10 ms
Introduction
▪ Need for system-level exploration of the design space
— Combinations of memory technology
— Various memory hierarchies
— Potential benefit of near-memory accelerators
— Prototype architectural ideas in detail
▪ Need to quantitatively evaluate the performance impact on applications, beyond an isolated function
— Accelerator communication overhead
— Cache management overhead
— Operating system overhead
— Byte-addressable vs. block-addressable
— Scratchpad vs. cache
— Cache size relative to working data set size
— Latency impact
Background
▪ M. Butts, J. Batcheller, and J. Varghese, "An efficient logic emulation system," Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors, Cambridge, MA, 1992, pp. 138-141.
— The Realizer System
▪ "Virtex-7 2000T FPGA for ASIC Prototyping & Emulation," https://www.xilinx.com/video/fpga/virtex-7-2000t-asic-prototyping-emulation.html
— Prototype ARM A9 processor subsystem (dual-core, with caches) mapped into a single Virtex-7 2000T FPGA
▪ Our approach uses the native hard IP cores and cache hierarchy and focuses on external memory
LiME (Logic in Memory Emulator)
[Photo: ZCU102 development board with Xilinx Zynq UltraScale+ MPSoC device]
LiME (Logic in Memory Emulator): Implementation
▪ Use the embedded CPU and cache hierarchy in the Zynq MPSoC to save FPGA logic and development time
▪ Route memory traffic through hardware IP blocks deployed in programmable logic
▪ Emulate the latencies of a wide range of memories by using programmable delay units in the loopback path
▪ Capture time-stamped memory transactions using the trace subsystem
[Block diagram: Zynq UltraScale+ MPSoC. Processing System (PS): host subsystem with four ARM cores and their L1 caches, shared L2 cache, coherent interconnect, main switch, and DDR memory controller. Programmable Logic (PL): memory subsystem with AXI peripheral interconnect, program DRAM, accelerator, monitor, and delay units on the HPM0/HPM1-to-HP0,1/HP2,3 loopback paths; trace subsystem with AXI Performance Monitor (APM), trace capture device, and trace DRAM.]
Open source: http://bitbucket.org/perma/emulator_st/
LiME (Logic in Memory Emulator): Implementation
LLNL hardware IP blocks:
— AXI Delay
— AXI Shim
— AXI Trace Capture Device
LiME uses only 13% of the device resources.
Emulation Method: Clock Domains
▪ ARM cores are slowed to run at a frequency similar to the programmable logic
▪ A scaling factor of 20x is applied to the entire system
▪ Other scaling factors can be used depending on the target peak bandwidth to memory
▪ CPU peak bandwidth is limited to 44 GB/s
[Figure: actual vs. emulated (at 20x) clock domains on the Zynq UltraScale+. Actual: accelerator 62.5 MHz, APU 137.5 MHz, DDR 950/475 MHz, PL program DRAM 4.8 GB/s, host 2.2 GB/s, DDR 15.2 GB/s. Emulated: accelerator 1.25 GHz, APU 2.75 GHz, DDR 19/9.5 GHz, PL program DRAM 96 GB/s, host 44 GB/s, DDR 304 GB/s.]
Emulation Method: Scaling by 20 Example

Component                       Actual           Emulated
Memory Bandwidth (PL)           4.8 GB/s         96 GB/s
Memory Latency (PL)             230 ns           12 ns (too low)
Memory Latency (PL) w/ delay    230 ns           12 + 88 = 100 ns
CPU Frequency                   137.5 MHz        2.75 GHz
CPU Bandwidth                   2.2 GB/s         44 GB/s
Accelerator Frequency           62.5 MHz         1.25 GHz
Accelerator Bandwidth           up to 4.8 GB/s   up to 96 GB/s

Delay is programmable over a wide range: 0 to 174 us in 0.16 ns increments.
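The arithmetic behind the "w/ delay" row fits in a few lines. Below is a minimal sketch, with invented function and constant names (this is not the LiME driver interface), that converts a target emulated latency into a count of 0.16 ns delay increments using the 230 ns actual PL latency and the 20x scale factor from the table:

```c
/*
 * Sketch only: compute the delay-unit setting needed to emulate a target
 * memory latency under the scaling-by-20 example.  Names are invented here
 * for illustration.
 */
#include <stdio.h>
#include <math.h>

#define SCALE           20          /* time-scaling factor                 */
#define ACTUAL_LAT_NS   230.0       /* measured PL DRAM latency (actual)   */
#define DELAY_STEP_NS   0.16        /* delay-unit increment (emulated ns)  */

/* Emulated delay increments required to reach target_ns of latency. */
static unsigned delay_setting(double target_ns)
{
    double baseline_ns = ACTUAL_LAT_NS / SCALE;    /* ~12 ns, "too low"    */
    double extra_ns    = target_ns - baseline_ns;  /* delay unit must add  */
    if (extra_ns < 0.0)
        extra_ns = 0.0;             /* cannot go below the baseline        */
    return (unsigned)round(extra_ns / DELAY_STEP_NS);
}

int main(void)
{
    /* Example from the table: emulate a 100 ns memory. */
    printf("100 ns target -> %u delay increments\n", delay_setting(100.0));
    /* A slower storage-class memory, e.g. 1 us. */
    printf("1 us target   -> %u delay increments\n", delay_setting(1000.0));
    return 0;
}
```

For the 100 ns example this comes out to roughly 550 increments, matching the ≈88 ns of added delay shown in the table.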
Emulation Method: Address Space
[Figure: original vs. modified address maps. The modified map makes the AXI peripheral ranges contiguous and routes the loopback path through the PL.]
Emulation Method: Delay & Loopback
▪ Address ranges R1 (1 MB) and R2 (4 GB) are intended to have different access latencies (e.g. SRAM, DRAM); both originate in the window at 0x04_0000_0000
▪ AXI shims shift and separate the address ranges (R1, R2) for easier routing — R1 shim: map width 20 bits, map in 0x04000, map out 0x08000; R2 shim: map width 8 bits, map in 0x04, map out 0x18 (see the sketch below)
▪ A standard AXI interconnect (SmartConnect) routes requests through different delay units
▪ Delay units have separate programmable delays for read and write access
[Block diagram: requests leave the PS on M_AXI_HPM0 into the PL at 0x04_0000_0000 (R1, R2), pass through the two shims (R1 → 0x08_0000_0000, R2 → 0x18_0010_0000), are routed by AXI SmartConnect through separate AXI Delay units, and return to the DDR memory controller through S_AXI_HP0/S_AXI_HP1. Address width: 40 bits in the PL, 36 bits at the HP ports; data width: 128 bits.]
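The shim behavior implied by the diagram can be modeled in a few lines of C. This is only an illustrative model (the actual RTL is in the LiME repository): a shim compares the top map-width bits of the 40-bit address against its map-in value and substitutes map-out on a match, which reproduces the R1/R2 addresses shown above.

```c
/*
 * Illustrative software model of the AXI shim address remapping.
 * A shim matches the top `map_width` bits of a 40-bit address against
 * `map_in` and, on a match, substitutes `map_out`.
 */
#include <stdint.h>
#include <stdio.h>

#define ADDR_WIDTH 40

static uint64_t shim(uint64_t addr, int map_width,
                     uint64_t map_in, uint64_t map_out)
{
    int shift = ADDR_WIDTH - map_width;          /* bits below the map field */
    uint64_t low_mask = ((uint64_t)1 << shift) - 1;
    if ((addr >> shift) == map_in)
        return (map_out << shift) | (addr & low_mask);
    return addr;                                 /* outside the range: pass through */
}

int main(void)
{
    uint64_t r1 = 0x0400000000ULL;   /* R1: 1 MB range */
    uint64_t r2 = 0x0400100000ULL;   /* R2: 4 GB range */

    /* Shim 1: 20-bit map, 0x04000 -> 0x08000 (moves R1 only). */
    r1 = shim(r1, 20, 0x04000, 0x08000);
    r2 = shim(r2, 20, 0x04000, 0x08000);

    /* Shim 2: 8-bit map, 0x04 -> 0x18 (moves R2 only). */
    r1 = shim(r1, 8, 0x04, 0x18);
    r2 = shim(r2, 8, 0x04, 0x18);

    printf("R1 -> 0x%010llx\n", (unsigned long long)r1);  /* 0x0800000000 */
    printf("R2 -> 0x%010llx\n", (unsigned long long)r2);  /* 0x1800100000 */
    return 0;
}
```

Running the model moves R1 to 0x08_0000_0000 and R2 to 0x18_0010_0000, the separated addresses the SmartConnect uses to steer each range to its own delay unit.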
Emulation Method: Macro Insertion
▪ Insert macros at the start and end of the region of interest (ROI); a usage sketch follows below
▪ CLOCKS_EMULATE / CLOCKS_NORMAL
— Modify the clock frequencies and configure the delay units
▪ TRACE_START / TRACE_STOP
— Trigger the hardware to start/stop recording memory events in trace DRAM
▪ STATS_START / STATS_STOP
— Trigger the hardware to start/stop the performance monitor counters
▪ TRACE_CAP
— Save the captured trace from trace DRAM to the SD card
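A sketch of how the macros might wrap a region of interest in application code. Only the macro names come from the slide; the header name, the triad kernel, and the statement-style invocation are assumptions made for illustration.

```c
/*
 * Illustrative instrumentation of a region of interest with the LiME
 * macros.  "lime.h" and triad() are placeholders.
 */
#include "lime.h"          /* assumed header exporting the macros below */

void triad(double *a, const double *b, const double *c, double s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

void run_roi(double *a, double *b, double *c, long n)
{
    CLOCKS_EMULATE;        /* slow the clocks, configure the delay units   */
    STATS_START;           /* enable the performance monitor counters      */
    TRACE_START;           /* begin recording memory events to trace DRAM  */

    triad(a, b, c, 3.0, n);   /* region of interest */

    TRACE_STOP;            /* stop recording memory events                 */
    STATS_STOP;            /* freeze the performance monitor counters      */
    TRACE_CAP;             /* save the trace from trace DRAM to SD card    */
    CLOCKS_NORMAL;         /* restore normal clock frequencies             */
}
```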
Memory Trace Capture
▪ LiME writes the raw trace to trace.bin; parser.c converts it to trace.csv
▪ In the trace, CPU = 0 and Accelerator = 1
▪ Each timestamp count represents 0.16 ns
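A small post-processing sketch in the spirit of working with parser.c's output. The CSV column order (count, master ID, address) is assumed purely for illustration; only the 0.16 ns-per-count scale and the CPU = 0 / Accelerator = 1 encoding come from the slide.

```c
/*
 * Sketch: convert trace timestamp counts to nanoseconds and label the
 * issuing master.  The assumed CSV layout is "count,id,address"; the
 * authoritative layout is whatever parser.c emits.
 */
#include <stdio.h>

#define NS_PER_COUNT 0.16   /* each count represents 0.16 ns (emulated) */

int main(void)
{
    unsigned long long count, addr;
    int id;
    /* Usage: ./trace_ns < trace.csv */
    while (scanf("%llu,%d,%llx", &count, &id, &addr) == 3) {
        printf("%.2f ns  %-11s  0x%llx\n",
               count * NS_PER_COUNT,
               id == 0 ? "CPU" : "Accelerator",  /* CPU = 0, Accelerator = 1 */
               addr);
    }
    return 0;
}
```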
Use Cases: Bandwidth Analysis from Trace
[Figure: bandwidth analysis derived from the captured memory trace]
Use Cases: Access Pattern Analysis from Trace
[Figure: STREAM benchmark access pattern, address vs. time (presented as an animated build over several slides)]
Use Cases: Evaluation of Future Storage Class Memory
[Left figure] DGEMM execution time on a 64-bit processor at varying latencies and varying cache-to-memory ratios. The cache can hide memory latency for a working-set size up to twice the size of the cache.
[Right figure] SpMV execution time on a 64-bit processor at varying latencies with a cache-to-memory ratio of 1:112. Latency has a direct impact; the application will need a high level of concurrency and greater throughput to offset the loss in performance.
Use Cases: Evaluation of Near-Memory Acceleration Engines
▪ Multiple memory channels
▪ Up to 16 concurrent memory requests
▪ Data Rearrangement Engines (DREs) are located in the memory subsystem
▪ A scratchpad is used to communicate parameters and results between CPU and accelerator (see the sketch below)
[Block diagram: a processor with two CPU cores, private caches, and a shared cache connects over a switch and links to a memory subsystem containing four memory channels, each paired with a DRE; each DRE comprises a load-store unit, a control unit, and a scratchpad.]
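To make the scratchpad-based handshake concrete, here is a purely hypothetical host-side fragment, meant to be read rather than run against real hardware. Every structure field and function name is invented; the slide only establishes that parameters and results pass between CPU and accelerator through the DRE scratchpad.

```c
/*
 * Hypothetical host-side view of CPU <-> DRE communication through the
 * scratchpad.  All names and the register layout are invented for
 * illustration: parameters go in, a doorbell starts the engine, and the
 * rearranged results come back through the same scratchpad.
 */
#include <stdint.h>

struct dre_scratchpad {                 /* memory-mapped scratchpad (assumed layout) */
    volatile uint64_t src_base;         /* base address of the sparse source data    */
    volatile uint64_t index_base;       /* base address of the index (gather) list   */
    volatile uint64_t count;            /* number of elements to gather              */
    volatile uint64_t start;            /* doorbell: write 1 to launch the DRE       */
    volatile uint64_t done;             /* status: DRE sets 1 when finished          */
    volatile uint64_t result[];         /* gathered (rearranged) data                */
};

/* Issue a gather to one DRE and spin until it completes. */
static const volatile uint64_t *
dre_gather(struct dre_scratchpad *sp, uint64_t src, uint64_t idx, uint64_t n)
{
    sp->src_base   = src;
    sp->index_base = idx;
    sp->count      = n;
    sp->done       = 0;
    sp->start      = 1;                 /* ring the doorbell                         */
    while (!sp->done)
        ;                               /* poll; a real driver might sleep or yield  */
    return sp->result;                  /* results are read back from the scratchpad */
}
```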
Use Cases: Evaluation of Near-Memory Acceleration Engines
The results demonstrate that substantial speedup can be gained with a DRE due to the higher number of in-flight requests issued by the near-memory accelerator.
Use Cases: Comparing Performance Across CPUs
▪ 32-bit ARM A9 (out-of-order, 11-stage pipeline) on Zynq-7000
— L1 cache: two separate 32 KB caches (4-way set-associative) for instruction and data
— L2 cache: shared 512 KB (8-way set-associative)
— Cache line size: 32 bytes
▪ 64-bit ARM A53 (in-order, 8-stage pipeline) on Zynq UltraScale+
— L1 cache: two separate 32 KB caches (4-way set-associative) for instruction and data
— L2 cache: shared 1 MB (16-way set-associative)
— Cache line size: 64 bytes
Benchmark observations:
— Bandwidth-dominated STREAM-triad runs significantly faster on the 64-bit processor with wider data paths.
— Random Access is mostly dependent on memory latency, with little difference from CPU architecture.
— Image Difference requires some computation, giving the 64-bit core an advantage.