Microscope on Memory: MPSoC-Enabled Computer Memory System Assessments
FCCM 2018
Abhishek Kumar Jain, Scott Lloyd, Maya Gokhale
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
May 1, 2018
LLNL-PRES-750335. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Introduction
▪ Recent advances in memory technology and packaging
— High-bandwidth memories: HBM, HMC
— Non-volatile memory: 3D XPoint
— Potential for logic and compute functions co-located with the memory
— Brought attention to computer memory system design and evaluation
[Images: HMC, HBM, and 3D XPoint packages. Credits: Hongshin Jun et al., IMW 2017 (Creative Commons Attribution); Micron Technology]
Introduction
▪ Emerging memories exhibit a wide range of bandwidths, latencies, and capacities
— Challenge for computer architects to navigate the design space
— Challenge for application developers to assess performance implications

Memory/Storage Hierarchy (capacity, latency):
— SRAM: 8 MB, 10 ns
— Near DRAM: 128 MB, 45 ns
— DDR DRAM: 8 GB, 70 ns
— Far DRAM: 8 GB, 100 ns
— NVM: 64 GB, 200 ns
— SSD: 128 GB, 50 us
— HDD: 6 TB, 10 ms
Introduction
▪ Need for system-level exploration of the design space
— Combinations of memory technology
— Various memory hierarchies
— Potential benefit of near-memory accelerators
— Prototype architectural ideas in detail
▪ Need to quantitatively evaluate the performance impact on applications, beyond an isolated function
— Accelerator communication overhead
— Cache management overhead
— Operating system overhead
— Byte-addressable vs. block-addressable
— Scratchpad vs. cache
— Cache size relative to working data set size
— Latency impact
Background
▪ M. Butts, J. Batcheller, and J. Varghese, "An efficient logic emulation system," Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors, Cambridge, MA, 1992, pp. 138-141.
— The Realizer System
▪ "Virtex-7 2000T FPGA for ASIC Prototyping & Emulation," https://www.xilinx.com/video/fpga/virtex-7-2000t-asic-prototyping-emulation.html
— Prototype ARM A9 processor subsystem (dual-core, with caches) mapped into a single Virtex-7 2000T FPGA
▪ Our approach uses the native hard IP cores and cache hierarchy and focuses on external memory
LiME (Logic in Memory Emulator)
[Photo: ZCU102 development board with Xilinx Zynq UltraScale+ MPSoC device]
LiME (Logic in Memory Emulator): Implementation
▪ Use the embedded CPU and cache hierarchy in the Zynq MPSoC to save FPGA logic and development time
▪ Route memory traffic through hardware IP blocks deployed in programmable logic
▪ Emulate the latencies of a wide range of memories by using programmable delay units in the loopback path
▪ Capture time-stamped memory transactions using the trace subsystem
[Block diagram: Zynq UltraScale+ MPSoC. Processing System (PS): host subsystem with four ARM cores and their L1 caches, shared L2 cache, coherent interconnect, main switch, and DDR memory controller. Programmable Logic (PL): memory subsystem with AXI peripheral interconnect, program DRAM, accelerator, monitor, and delay units on the HPM0/HPM1-to-HP0,1/HP2,3 loopback paths; trace subsystem with AXI Performance Monitor (APM), trace capture device, and trace DRAM.]
Open source: http://bitbucket.org/perma/emulator_st/
LiME (Logic in Memory Emulator): Implementation
LLNL hardware IP blocks:
— AXI Delay
— AXI Shim
— AXI Trace Capture Device
LiME uses only 13% of the device resources.
Emulation Method: Clock Domains
▪ ARM cores are slowed to run at a frequency similar to the programmable logic
▪ A scaling factor of 20x is applied to the entire system
▪ Other scaling factors can be used depending on the target peak bandwidth to memory
▪ CPU peak bandwidth is limited to 44 GB/s
[Figure: actual vs. emulated (at 20x) clock domains on the Zynq UltraScale+. Actual: accelerator 62.5 MHz, APU 137.5 MHz, DDR 950/475 MHz, PL program DRAM 4.8 GB/s, host 2.2 GB/s, DDR 15.2 GB/s. Emulated: accelerator 1.25 GHz, APU 2.75 GHz, DDR 19/9.5 GHz, PL program DRAM 96 GB/s, host 44 GB/s, DDR 304 GB/s.]
Emulation Method: Scaling by 20 Example

Component                       Actual           Emulated
Memory Bandwidth (PL)           4.8 GB/s         96 GB/s
Memory Latency (PL)             230 ns           12 ns (too low)
Memory Latency (PL) w/ delay    230 ns           12 + 88 = 100 ns
CPU Frequency                   137.5 MHz        2.75 GHz
CPU Bandwidth                   2.2 GB/s         44 GB/s
Accelerator Frequency           62.5 MHz         1.25 GHz
Accelerator Bandwidth           up to 4.8 GB/s   up to 96 GB/s

Delay is programmable over a wide range: 0 to 174 us in 0.16 ns increments.
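The arithmetic behind the "w/ delay" row fits in a few lines. Below is a minimal sketch, with invented function and constant names (this is not the LiME driver interface), that converts a target emulated latency into a count of 0.16 ns delay increments using the 230 ns actual PL latency and the 20x scale factor from the table:

```c
/*
 * Sketch only: compute the delay-unit setting needed to emulate a target
 * memory latency under the scaling-by-20 example.  Names are invented here
 * for illustration.
 */
#include <stdio.h>
#include <math.h>

#define SCALE           20          /* time-scaling factor                 */
#define ACTUAL_LAT_NS   230.0       /* measured PL DRAM latency (actual)   */
#define DELAY_STEP_NS   0.16        /* delay-unit increment (emulated ns)  */

/* Emulated delay increments required to reach target_ns of latency. */
static unsigned delay_setting(double target_ns)
{
    double baseline_ns = ACTUAL_LAT_NS / SCALE;    /* ~12 ns, "too low"    */
    double extra_ns    = target_ns - baseline_ns;  /* delay unit must add  */
    if (extra_ns < 0.0)
        extra_ns = 0.0;             /* cannot go below the baseline        */
    return (unsigned)round(extra_ns / DELAY_STEP_NS);
}

int main(void)
{
    /* Example from the table: emulate a 100 ns memory. */
    printf("100 ns target -> %u delay increments\n", delay_setting(100.0));
    /* A slower storage-class memory, e.g. 1 us. */
    printf("1 us target   -> %u delay increments\n", delay_setting(1000.0));
    return 0;
}
```

For the 100 ns example this comes out to roughly 550 increments, matching the ≈88 ns of added delay shown in the table.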
Emulation Method: Address Space
[Figure: original vs. modified address maps. The modified map makes the AXI peripheral ranges contiguous and routes the loopback path through the PL.]
Emulation Method: Delay & Loopback
▪ Address ranges R1 (1 MB) and R2 (4 GB) are intended to have different access latencies (e.g. SRAM, DRAM); both originate in the window at 0x04_0000_0000
▪ AXI shims shift and separate the address ranges (R1, R2) for easier routing — R1 shim: map width 20 bits, map in 0x04000, map out 0x08000; R2 shim: map width 8 bits, map in 0x04, map out 0x18 (see the sketch below)
▪ A standard AXI interconnect (SmartConnect) routes requests through different delay units
▪ Delay units have separate programmable delays for read and write access
[Block diagram: requests leave the PS on M_AXI_HPM0 into the PL at 0x04_0000_0000 (R1, R2), pass through the two shims (R1 → 0x08_0000_0000, R2 → 0x18_0010_0000), are routed by AXI SmartConnect through separate AXI Delay units, and return to the DDR memory controller through S_AXI_HP0/S_AXI_HP1. Address width: 40 bits in the PL, 36 bits at the HP ports; data width: 128 bits.]
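The shim behavior implied by the diagram can be modeled in a few lines of C. This is only an illustrative model (the actual RTL is in the LiME repository): a shim compares the top map-width bits of the 40-bit address against its map-in value and substitutes map-out on a match, which reproduces the R1/R2 addresses shown above.

```c
/*
 * Illustrative software model of the AXI shim address remapping.
 * A shim matches the top `map_width` bits of a 40-bit address against
 * `map_in` and, on a match, substitutes `map_out`.
 */
#include <stdint.h>
#include <stdio.h>

#define ADDR_WIDTH 40

static uint64_t shim(uint64_t addr, int map_width,
                     uint64_t map_in, uint64_t map_out)
{
    int shift = ADDR_WIDTH - map_width;          /* bits below the map field */
    uint64_t low_mask = ((uint64_t)1 << shift) - 1;
    if ((addr >> shift) == map_in)
        return (map_out << shift) | (addr & low_mask);
    return addr;                                 /* outside the range: pass through */
}

int main(void)
{
    uint64_t r1 = 0x0400000000ULL;   /* R1: 1 MB range */
    uint64_t r2 = 0x0400100000ULL;   /* R2: 4 GB range */

    /* Shim 1: 20-bit map, 0x04000 -> 0x08000 (moves R1 only). */
    r1 = shim(r1, 20, 0x04000, 0x08000);
    r2 = shim(r2, 20, 0x04000, 0x08000);

    /* Shim 2: 8-bit map, 0x04 -> 0x18 (moves R2 only). */
    r1 = shim(r1, 8, 0x04, 0x18);
    r2 = shim(r2, 8, 0x04, 0x18);

    printf("R1 -> 0x%010llx\n", (unsigned long long)r1);  /* 0x0800000000 */
    printf("R2 -> 0x%010llx\n", (unsigned long long)r2);  /* 0x1800100000 */
    return 0;
}
```

Running the model moves R1 to 0x08_0000_0000 and R2 to 0x18_0010_0000, the separated addresses the SmartConnect uses to steer each range to its own delay unit.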
Emulation Method: Macro Insertion
▪ Insert macros at the start and end of the region of interest (ROI); a usage sketch follows below
▪ CLOCKS_EMULATE / CLOCKS_NORMAL
— Modify the clock frequencies and configure the delay units
▪ TRACE_START / TRACE_STOP
— Trigger the hardware to start/stop recording memory events in trace DRAM
▪ STATS_START / STATS_STOP
— Trigger the hardware to start/stop the performance monitor counters
▪ TRACE_CAP
— Save the captured trace from trace DRAM to the SD card
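A sketch of how the macros might wrap a region of interest in application code. Only the macro names come from the slide; the header name, the triad kernel, and the statement-style invocation are assumptions made for illustration.

```c
/*
 * Illustrative instrumentation of a region of interest with the LiME
 * macros.  "lime.h" and triad() are placeholders.
 */
#include "lime.h"          /* assumed header exporting the macros below */

void triad(double *a, const double *b, const double *c, double s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

void run_roi(double *a, double *b, double *c, long n)
{
    CLOCKS_EMULATE;        /* slow the clocks, configure the delay units   */
    STATS_START;           /* enable the performance monitor counters      */
    TRACE_START;           /* begin recording memory events to trace DRAM  */

    triad(a, b, c, 3.0, n);   /* region of interest */

    TRACE_STOP;            /* stop recording memory events                 */
    STATS_STOP;            /* freeze the performance monitor counters      */
    TRACE_CAP;             /* save the trace from trace DRAM to SD card    */
    CLOCKS_NORMAL;         /* restore normal clock frequencies             */
}
```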
Memory Trace Capture
▪ LiME writes the raw trace to trace.bin; parser.c converts it to trace.csv
▪ In the trace, CPU = 0 and Accelerator = 1
▪ Each timestamp count represents 0.16 ns
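A small post-processing sketch in the spirit of working with parser.c's output. The CSV column order (count, master ID, address) is assumed purely for illustration; only the 0.16 ns-per-count scale and the CPU = 0 / Accelerator = 1 encoding come from the slide.

```c
/*
 * Sketch: convert trace timestamp counts to nanoseconds and label the
 * issuing master.  The assumed CSV layout is "count,id,address"; the
 * authoritative layout is whatever parser.c emits.
 */
#include <stdio.h>

#define NS_PER_COUNT 0.16   /* each count represents 0.16 ns (emulated) */

int main(void)
{
    unsigned long long count, addr;
    int id;
    /* Usage: ./trace_ns < trace.csv */
    while (scanf("%llu,%d,%llx", &count, &id, &addr) == 3) {
        printf("%.2f ns  %-11s  0x%llx\n",
               count * NS_PER_COUNT,
               id == 0 ? "CPU" : "Accelerator",  /* CPU = 0, Accelerator = 1 */
               addr);
    }
    return 0;
}
```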
Use Cases: Bandwidth Analysis from Trace
[Figure: bandwidth analysis derived from the captured memory trace]
Use Cases: Access Pattern Analysis from Trace
[Figure: STREAM benchmark access pattern, address vs. time (presented as an animated build over several slides)]
Use Cases: Evaluation of Future Storage Class Memory
[Left figure] DGEMM execution time on a 64-bit processor at varying latencies and varying cache-to-memory ratios. The cache can hide memory latency for a working-set size up to twice the size of the cache.
[Right figure] SpMV execution time on a 64-bit processor at varying latencies with a cache-to-memory ratio of 1:112. Latency has a direct impact; the application will need a high level of concurrency and greater throughput to offset the loss in performance.
Use Cases: Evaluation of Near-Memory Acceleration Engines
▪ Multiple memory channels
▪ Up to 16 concurrent memory requests
▪ Data Rearrangement Engines (DREs) are located in the memory subsystem
▪ A scratchpad is used to communicate parameters and results between CPU and accelerator (see the sketch below)
[Block diagram: a processor with two CPU cores, private caches, and a shared cache connects over a switch and links to a memory subsystem containing four memory channels, each paired with a DRE; each DRE comprises a load-store unit, a control unit, and a scratchpad.]
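To make the scratchpad-based handshake concrete, here is a purely hypothetical host-side fragment, meant to be read rather than run against real hardware. Every structure field and function name is invented; the slide only establishes that parameters and results pass between CPU and accelerator through the DRE scratchpad.

```c
/*
 * Hypothetical host-side view of CPU <-> DRE communication through the
 * scratchpad.  All names and the register layout are invented for
 * illustration: parameters go in, a doorbell starts the engine, and the
 * rearranged results come back through the same scratchpad.
 */
#include <stdint.h>

struct dre_scratchpad {                 /* memory-mapped scratchpad (assumed layout) */
    volatile uint64_t src_base;         /* base address of the sparse source data    */
    volatile uint64_t index_base;       /* base address of the index (gather) list   */
    volatile uint64_t count;            /* number of elements to gather              */
    volatile uint64_t start;            /* doorbell: write 1 to launch the DRE       */
    volatile uint64_t done;             /* status: DRE sets 1 when finished          */
    volatile uint64_t result[];         /* gathered (rearranged) data                */
};

/* Issue a gather to one DRE and spin until it completes. */
static const volatile uint64_t *
dre_gather(struct dre_scratchpad *sp, uint64_t src, uint64_t idx, uint64_t n)
{
    sp->src_base   = src;
    sp->index_base = idx;
    sp->count      = n;
    sp->done       = 0;
    sp->start      = 1;                 /* ring the doorbell                         */
    while (!sp->done)
        ;                               /* poll; a real driver might sleep or yield  */
    return sp->result;                  /* results are read back from the scratchpad */
}
```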
Use Cases: Evaluation of Near-Memory Acceleration Engines
The results demonstrate that substantial speedup can be gained with a DRE due to the higher number of in-flight requests issued by the near-memory accelerator.
Use Cases: Comparing Performance Across CPUs
▪ 32-bit ARM A9 (out-of-order, 11-stage pipeline) on Zynq-7000
— L1 cache: two separate 32 KB caches (4-way set-associative) for instruction and data
— L2 cache: shared 512 KB (8-way set-associative)
— Cache line size: 32 bytes
▪ 64-bit ARM A53 (in-order, 8-stage pipeline) on Zynq UltraScale+
— L1 cache: two separate 32 KB caches (4-way set-associative) for instruction and data
— L2 cache: shared 1 MB (16-way set-associative)
— Cache line size: 64 bytes
Benchmark observations:
— Bandwidth-dominated STREAM-triad runs significantly faster on the 64-bit processor with wider data paths.
— Random Access is mostly dependent on memory latency, with little difference from CPU architecture.
— Image Difference requires some computation, giving the 64-bit core an advantage.