Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube (HMC)
Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, and Hyesoon Kim

Introduction to HMC
- Hybrid Memory Cube (HMC) vs. High-Bandwidth Memory (HBM)
  - HMC: serial, packet-based interface
  - HBM: wide bus, standard DRAM protocol
- Found in high-end GPUs and Intel’s Knights Landing
Illustration credits: AMD and Micron
ISPASS 2018

Why is HMC Interesting?
- Serialized, high-speed link addresses pin limitation issues with DRAM and HBM
- Abstracted packet interface provides opportunities for novel memories and addressing schemes
  - Can be used with DRAM, PCM, STT-RAM, NVM, etc.
- Memory controller sits on top of a “routing” layer
  - Allows for more interesting connections between processors and memory elements
This study addresses the impacts of the network on chip (NoC) for architects and application developers.
Illustration credits: Micron

This Study’s Contributions
We examine the NoC of the HMC using an FPGA-based prototype to answer the following:
1) How does the NoC behave under low- and high-load conditions?
2) Can we relate QoS concepts to 3D-stacked memories?
3) How does the NoC affect latency within the HMC?
4) What potential bottlenecks are there, and how can we avoid them?
[Figure: test setup. Host (software, driver, configs/memory trace) connected over PCIe 3.0 x16 to the EX700 PCIe board carrying the AC-510 (FPGA and HMC); the HMC logic layer contains the NoC and the vaults.]

Hybrid Memory Cube (HMC)
HMC 1.1 (Gen2): 4 GB size
- 16 banks per vault
- Total number of banks = 256
- Size of each bank = 16 MB
[Figure: HMC structure, labeling bank, partition, TSV, vault, vault controller, DRAM layer, and logic layer.]

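The bank figures on this slide follow from simple arithmetic. The sketch below is only a sanity check: it assumes 16 vaults (the largest vault count that appears in the experiment plots) and binary units (1 GB = 1024 MB), and the variable names are illustrative.

```python
# Sanity check of the HMC 1.1 (4 GB) geometry quoted on the slide.
# Assumption: 16 vaults, taken from the "16 vaults" curves in the
# experiment plots; binary units (1 GB = 1024 MB).
CAPACITY_MB = 4 * 1024
NUM_VAULTS = 16
BANKS_PER_VAULT = 16

total_banks = NUM_VAULTS * BANKS_PER_VAULT
bank_size_mb = CAPACITY_MB // total_banks

print(f"total banks: {total_banks}")       # matches the slide's 256
print(f"bank size:   {bank_size_mb} MB")   # matches the slide's 16 MB
```
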
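HMC Memory Addressing
- Closed-page policy; page size = 256 B
- Low-order-interleaving address mapping policy
- 34-bit address field
[Figure: 34-bit address field with field boundaries marked at bits 33, 32, 15, 11, 9, 7, 4, and 0; fields labeled Ignored, Block Address, Vault ID in a Quadrant, Quadrant ID, and Bank ID; the span of a 4K OS page is marked over the low-order bits.]
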
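To make low-order interleaving concrete, here is a small decoding sketch. The exact bit assignment is an assumption chosen to be consistent with the boundaries in the figure (bits 3:0 ignored, 6:4 block address, 8:7 vault ID in a quadrant, 10:9 quadrant ID, 14:11 bank ID); the helper names `bits`, `decode`, and `vault_id` are hypothetical.

```python
# Assumed field layout, consistent with the boundaries 4, 7, 9, 11, 15:
#   [3:0] ignored, [6:4] block address, [8:7] vault ID in a quadrant,
#   [10:9] quadrant ID, [14:11] bank ID.

def bits(value, lo, hi):
    """Return bits lo..hi (inclusive) of value as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(addr):
    """Split a byte address into the assumed HMC address fields."""
    return {
        "block": bits(addr, 4, 6),
        "vault_in_quadrant": bits(addr, 7, 8),
        "quadrant": bits(addr, 9, 10),
        "bank": bits(addr, 11, 14),
    }

def vault_id(addr):
    """Global vault number (0..15) under the assumed layout."""
    fields = decode(addr)
    return fields["quadrant"] * 4 + fields["vault_in_quadrant"]

# Walk one 4 KB OS page in 128 B steps and record where accesses land.
page = range(0, 4096, 128)
vaults = {vault_id(a) for a in page}
banks = {decode(a)["bank"] for a in page}
print(len(vaults), sorted(banks))
```

Under this assumed layout, a sequential walk over a single 4 KB page touches every vault and two banks per vault, which is the effect low-order interleaving is meant to achieve.
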
HMC Communication I
- Follows a serialized packet-switched protocol
- Packets are partitioned into 16-byte flits
- Each transfer incurs 1 flit of overhead

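The one-flit overhead weighs more heavily on small requests. The sketch below estimates flits and payload efficiency for the data-carrying packet of each request size used in the experiments; it assumes the overhead is exactly one 16-byte flit per packet, as stated above, and the function names are illustrative.

```python
FLIT_BYTES = 16      # flit size from the slide
OVERHEAD_FLITS = 1   # per-transfer overhead from the slide

def flits_for(payload_bytes):
    """Flits in the data-carrying packet: payload flits plus overhead."""
    data_flits = -(-payload_bytes // FLIT_BYTES)  # ceiling division
    return data_flits + OVERHEAD_FLITS

def efficiency(payload_bytes):
    """Fraction of transferred bytes that are payload."""
    return payload_bytes / (flits_for(payload_bytes) * FLIT_BYTES)

for size in (16, 32, 64, 128):
    print(f"{size:>3} B: {flits_for(size)} flits, "
          f"efficiency {efficiency(size):.3f}")
```

Under these assumptions a 16 B request spends half of its link traffic on overhead, while a 128 B request spends one flit in nine.
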
HMC Communication II
[Figure: flow control and request/response packet exchange.]

Our HMC Test Infrastructure
- Micron’s AC-510 module contains a Xilinx Kintex FPGA and an HMC 1.1 4 GB part
- 2 half-width links for a total of 60 GB/s of bandwidth
- Host software communicates over PCIe with FPGA-based queues
[Figure: test setup. Host software and driver connect over PCIe 3.0 x16 to the EX700 PCIe board carrying the AC-510; the HMC logic layer contains the NoC and the vaults.]

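The 60 GB/s figure can be reproduced from the link parameters. This sketch assumes a half-width link has 8 lanes running at 15 Gbps each (the "2x 15Gbps 8x links" label in the methodology figures) and that the quoted total counts both directions; the second point is an assumption, not something the slide states.

```python
LINKS = 2             # half-width links on the AC-510
LANES_PER_LINK = 8    # assumed: half-width means 8 lanes
GBPS_PER_LANE = 15    # per-lane signaling rate in Gb/s
DIRECTIONS = 2        # assumed: total counts both directions

# Convert gigabits to gigabytes by dividing by 8.
per_direction_gbs = LINKS * LANES_PER_LINK * GBPS_PER_LANE / 8
total_gbs = per_direction_gbs * DIRECTIONS

print(f"per direction: {per_direction_gbs} GB/s")
print(f"bidirectional: {total_gbs} GB/s")
```
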
Methodology (GUPS)
[Figure: GUPS test path. Host software (GUPS, Pico API, PCIe driver) connects over PCIe 3.0 x16 to the EX700 PCIe board, whose PCIe switch connects over PCIe 3.0 x8 to the AC-510 FPGA running the GUPS firmware. The firmware has 9 ports, each with an address generator, monitoring, arbitration, a read tag pool, a write request FIFO, and a data generator. Also labeled: HMC controller, AXI-4, monitoring, and transceivers driving 2 links of 8 lanes at 15 Gbps into the HMC NoC and vaults.]

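The experiments label their curves by the number of banks or vaults being accessed. One way an address generator could confine random traffic to such a subset is to overwrite the vault and bank fields of a random address. The sketch below illustrates that idea only; it is not the firmware's actual logic, the field positions (vault bits 10:7, bank bits 14:11) are assumed, and `confined_address` is a hypothetical name.

```python
import random

# Assumed field positions (not confirmed by the slides).
VAULT_SHIFT, VAULT_BITS = 7, 4
BANK_SHIFT, BANK_BITS = 11, 4
ADDR_BITS = 32  # 4 GB of addressable memory

def confined_address(rng, num_vaults=16, num_banks=16, align=128):
    """Random address restricted to the first num_vaults vaults
    and the first num_banks banks, aligned to `align` bytes."""
    addr = rng.getrandbits(ADDR_BITS)
    addr &= ~(align - 1)                                  # align request
    addr &= ~(((1 << VAULT_BITS) - 1) << VAULT_SHIFT)     # clear vault
    addr &= ~(((1 << BANK_BITS) - 1) << BANK_SHIFT)       # clear bank
    addr |= rng.randrange(num_vaults) << VAULT_SHIFT      # pick vault
    addr |= rng.randrange(num_banks) << BANK_SHIFT        # pick bank
    return addr

rng = random.Random(0)
sample = [confined_address(rng, num_vaults=2, num_banks=1) for _ in range(5)]
print([hex(a) for a in sample])
```
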
Methodology (multi-port stream)
[Figure: multi-port stream test path. Host software (multi-port stream, Pico API, PCIe driver, memory traces) connects over PCIe 3.0 x16 to the EX700 PCIe board, whose PCIe switch connects over PCIe 3.0 x8 to the AC-510 FPGA running the multi-port stream firmware. The firmware has 9 ports; labeled blocks include command FIFO, read address FIFO, read data FIFO, write data FIFO, glue logic, address generators, monitoring, arbitration, read tag pools, write request FIFOs, and data generators. Also labeled: HMC controller, AXI-4, and transceivers driving 2 links of 8 lanes at 15 Gbps into the HMC NoC and vaults.]

Experiments
[1] High-Contention Latency Analysis (GUPS design)
[2] Low-Contention Latency Analysis (multi-port stream)
[3] Quality of Service Analysis (multi-port)
[4] High-Contention Latency Histograms per Vault (multi-port)
[5] Requested and Response Bandwidth Analysis (GUPS)

[1] Read-only Latency vs. Bandwidth
[Figure: latency (μs, 0 to 30) vs. bandwidth (GB/s, 0 to 24) for request sizes of 16 B, 32 B, 64 B, and 128 B; curves labeled 1, 2, 4, and 8 banks and 1, 2, 4, 8, and 16 vaults.]

[2] Average Latency vs. Requests
[Figure: average latency (μs, 0.6 to 2.2) vs. number of read requests (1 to 55) for request sizes of 16 B, 32 B, 64 B, and 128 B.]

[2] Average Latency vs. Requests II
[Figure: average latency (μs, 0.5 to 4.0) vs. number of read requests (1 to 300) for request sizes of 16 B, 32 B, 64 B, and 128 B; the trend is annotated “Linear Increment”.]

[3] QoS for 4 Vaults
[Figure: two panels of maximum latency (μs, 0 to 7) vs. vault number for request sizes of 16 B, 32 B, 64 B, and 128 B.]

[4] Latency vs. Request Size
[Figure: four per-vault latency histograms, one per request size (16 B, 32 B, 64 B, 128 B); axes are latency (ns) and vault number, with a color scale running from 0 to at most 0.4. Across the panels the latency axes span roughly 1.6 to 4.3 μs.]

[4] Latency vs. Request Size (continued)
[Figure: per-vault latency histograms with vault number on the horizontal axis and latency (ns) on the vertical axis, one panel per request size (16 B, 32 B, 64 B, 128 B); color scales run from 0 to at most 0.15.]

[4] Average Latency – 4 Vaults
[Figure: average latency (μs, 0 to 5) and latency standard deviation (ns, 0 to 200) vs. request size (16 B, 32 B, 64 B, 128 B).]

[5] GUPS – Bandwidth vs. Active Ports
[Figure: bandwidth (GB/s, 0 to 24) vs. number of active ports (1 to 9; ∝ request bandwidth) for (a) 16 B and (b) 32 B requests; curves labeled 1, 2, 4, and 8 banks and 1, 2, 4, 8, and 16 vaults.]

[5] GUPS – Bandwidth vs. Active Ports II
[Figure: bandwidth (GB/s, 0 to 24) vs. number of active ports (1 to 9; ∝ request bandwidth) for (c) 64 B and (d) 128 B requests; curves labeled 1, 2, 4, and 8 banks and 1, 2, 4, 8, and 16 vaults.]

[6] GUPS – Outstanding Requests
[Figure: number of outstanding requests (0 to 600) vs. average request size (16, 32, 64, and 128 bytes) for 2 banks and 4 banks.]

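Outstanding requests, latency, and bandwidth are tied together by Little's law (average requests in flight = request rate × average latency). The sketch below applies that general relation with made-up inputs; the numbers are hypothetical and are not measurements from this study, and 1 GB is taken as 10^9 bytes.

```python
def outstanding_requests(bandwidth_gb_s, latency_us, request_bytes):
    """Little's law: requests in flight = request rate * latency.

    bandwidth_gb_s: sustained payload bandwidth in GB/s (1 GB = 1e9 B)
    latency_us:     average request latency in microseconds
    request_bytes:  payload size per request in bytes
    """
    requests_per_us = bandwidth_gb_s * 1e9 / request_bytes / 1e6
    return requests_per_us * latency_us

# Hypothetical operating point: 10 GB/s of 128 B reads at 2 us latency.
print(outstanding_requests(10, 2, 128))
```

Read the other way, a fixed pool of outstanding requests caps the achievable bandwidth once latency starts to rise, which is one way to reason about queuing effects.
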
Takeaways
- Choosing between large and small requests lets applications tune for bandwidth or latency more flexibly than with conventional DRAM
- Vault- and bank-level parallelism are key to achieving higher bandwidth
- Vault latencies are more correlated with access patterns and traffic than with physical vault location
- Queuing delays will continue to be a concern with NoCs in the HMC
  - Address them via host-side queuing/scheduling or by distributing accesses across vaults (data structures or compiler passes)
- The HMC’s NoC complicates QoS due to variability
  - However, trade-offs in packet size and “private” vaults can improve QoS

Questions?
Thanks to Micron for helping to support our HMC testbed!