Application-Transparent Near-Memory Processing Architecture with Memory Channel Network
Mohammad Alian¹, Seung Won Min¹, Hadi Asgharimoghaddam¹, Ashutosh Dhar¹, Dong Kai Wang¹, Thomas Roewer², Adam McPadden², Oliver O'Halloran², Deming Chen¹, Jinjun Xiong², Daehoon Kim¹, Wen-mei Hwu¹, and Nam Sung Kim¹,³
¹University of Illinois Urbana-Champaign, ²IBM Research and Systems, ³Samsung Electronics
2 Executive Summary
• Processing In Memory (PIM), Near Memory Processing (NMP), …
  ✓ EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, …
• Question: why haven’t they been commercialized yet?
  ✓ Demand changes in application code and/or memory subsystem of host processor
• Solution: memory module based NMP + Memory Channel Network (MCN)
  ✓ Recognize NMP memory modules as distributed computing nodes over Ethernet → no change in application code or memory subsystem of host processors
  ✓ Seamlessly integrate NMP w/ distributed computing frameworks for better scalability
• Feasibility & Performance:
  ✓ Demonstrate the feasibility w/ an IBM POWER8 + experimental memory module
  ✓ Improve the performance and processing bandwidth by 43% and 4×, respectively
[Figure: block diagrams of prior designs labeled IRAM’97, SmartMemories’00, and ISCA’15]
5 Overview of MCN-based NMP
• Buffered DIMM w/ a low-power but powerful AP* in a buffer device
  ✓ An MCN processor runs its own lightweight OS including the minimum network stack
• Special driver faking memory channels as Ethernet connections
*AP: Application Processor
[Figure: a host CPU with two memory controllers (MC-0, MC-1). Each global channel connects DDR4 DIMMs and MCN DIMMs. Each MCN DIMM holds DRAM devices and an MCN processor (MCN PROC) running its own OS, attached through local channels. The host is labeled a regular node and each MCN DIMM an MCN node; the whole is labeled MCN distributed computing.]
Outline: Overview, Architecture, Driver, Optimizations, Proof of Concept, Evaluation, Conclusion
8 Higher Processing BW* w/ Commodity DRAM
• Conventional memory system
  ✓ More DIMMs → larger capacity but the same bandwidth
*BW: bandwidth
[Figure: one memory controller driving a global/shared channel to two DDR4 DIMMs, each with DRAM devices behind a data buffer]
9 Higher Processing BW* w/ Commodity DRAM
• Conventional memory system w/ near memory processing DIMMs
  ✓ An MCN processor w/ local DRAM devices through private channels → scaling aggregate processing memory bandwidth w/ # of MCN DIMMs
*BW: bandwidth
[Figure: two memory controllers, each driving a global/shared channel. DDR4 DIMMs place DRAM devices behind a data buffer; MCN DIMMs place DRAM devices behind an MCN processor (MCN PROC) connected through local/private channels.]
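The scaling argument on the two slides above can be put into numbers with a minimal sketch. The per-channel figure is an assumption chosen for illustration (the peak rate of one 64-bit DDR4-2400 channel) and does not come from the slides; the function names are hypothetical as well.

```python
# Assumed for illustration only: peak bandwidth of one 64-bit DDR4-2400 channel
# (2400 MT/s * 8 bytes = 19.2 GB/s). The slides do not state a channel speed.
CHANNEL_BW_GBPS = 19.2


def conventional_bw(num_dimms: int, channel_bw: float = CHANNEL_BW_GBPS) -> float:
    """All DIMMs sit on one shared channel, so adding DIMMs adds capacity only."""
    return channel_bw if num_dimms > 0 else 0.0


def mcn_aggregate_bw(num_mcn_dimms: int, local_bw: float = CHANNEL_BW_GBPS) -> float:
    """Each MCN processor reaches its own DRAM over a private channel,
    so the aggregate processing bandwidth grows with the number of MCN DIMMs."""
    return num_mcn_dimms * local_bw
```

Under these assumptions, `conventional_bw(4)` equals `conventional_bw(1)`, while `mcn_aggregate_bw(4)` is four times `mcn_aggregate_bw(1)`. The sketch ignores traffic on the global channel and any contention inside a DIMM.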
10 MCN DIMM Architecture
• IBM Centaur DIMM: 80 DDR DRAM chips + buffer chip w/ a tall form factor
• Buffer Device: ~20W TDP & ~10 mm × 10 mm
• Near-Memory Processor: Snapdragon AP w/ 4 A57 ARM cores + 2MB LLC, GPU, 2 MCs, etc. @ ~5W & ~8×8 mm² (1.8W & ~2×2 mm²)
[Figure: MCN processor (MCN PROC) block diagram: cores 0 to 3 and an MC for local DRAM on an LLC / interconnect; a DDR interface to the global DDR channel with control logic, IRQ, and a dual-port SRAM holding TX and RX buffers]
12 MCN DIMM Architecture: Interface Logic
• MCN Interface: serving as a fake network interface card (NIC)
• MCN buffer layout (dual-port SRAM):
    byte offset:     0        4        8        12 ... 63
    Tx control (64 Bytes): Tx-head  Tx-tail  Tx-poll  reserved
    Rx control (64 Bytes): Rx-head  Rx-tail  Rx-poll  reserved
    Tx circular buffer: 48KB
    Rx circular buffer: 48KB
• Mapped to a range of physical memory space directly accessed by MC like normal DRAM
[Figure: MCN processor (MCN PROC) block diagram with the MCN interface highlighted: DDR interface to the global DDR channel, control logic, IRQ, and dual-port SRAM with TX and RX buffers, next to cores 0 to 3, the LLC / interconnect, and the MC for local DRAM]
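The buffer layout on this slide can be modeled in a few lines to make the offsets concrete. This is a sketch under stated assumptions, not the authors' implementation: the slide gives the field offsets (0, 4, 8, reserved to 63) and the 48KB ring sizes, but the ordering of the four regions, the little-endian 32-bit fields, the length-prefixed message format, and the meaning of the poll flag (nonzero while data is pending) are all assumptions made here.

```python
import struct

CTRL_LINE = 64           # each control line is 64 bytes (from the slide)
RING_SIZE = 48 * 1024    # each circular buffer is 48KB (from the slide)

# Assumed region order: Tx control, Rx control, Tx ring, Rx ring.
TX_CTRL, RX_CTRL = 0, CTRL_LINE
TX_RING = 2 * CTRL_LINE
RX_RING = TX_RING + RING_SIZE
SRAM_SIZE = RX_RING + RING_SIZE

# Field offsets inside a control line (from the slide); bytes 12..63 are reserved.
HEAD, TAIL, POLL = 0, 4, 8


def read_u32(sram, off):
    return struct.unpack_from("<I", sram, off)[0]   # little-endian is an assumption


def write_u32(sram, off, value):
    struct.pack_into("<I", sram, off, value)


def ring_push(sram, ctrl, ring, payload):
    """Append a length-prefixed message at tail; return False if it does not fit."""
    head = read_u32(sram, ctrl + HEAD)
    tail = read_u32(sram, ctrl + TAIL)
    used = (tail - head) % RING_SIZE
    record = struct.pack("<I", len(payload)) + payload
    if len(record) > RING_SIZE - 1 - used:   # one byte stays free: full != empty
        return False
    for i, b in enumerate(record):
        sram[ring + (tail + i) % RING_SIZE] = b
    write_u32(sram, ctrl + TAIL, (tail + len(record)) % RING_SIZE)
    write_u32(sram, ctrl + POLL, 1)          # assumed meaning: data is pending
    return True


def ring_pop(sram, ctrl, ring):
    """Remove and return the oldest message, or None if the ring is empty."""
    head = read_u32(sram, ctrl + HEAD)
    tail = read_u32(sram, ctrl + TAIL)
    if head == tail:
        return None
    raw = bytes(sram[ring + (head + i) % RING_SIZE] for i in range(4))
    (length,) = struct.unpack("<I", raw)
    body = bytes(sram[ring + (head + 4 + i) % RING_SIZE] for i in range(length))
    head = (head + 4 + length) % RING_SIZE
    write_u32(sram, ctrl + HEAD, head)
    if head == tail:
        write_u32(sram, ctrl + POLL, 0)      # nothing left to read
    return body
```

A `bytearray(SRAM_SIZE)` stands in for the memory-mapped SRAM: calling `ring_push(sram, TX_CTRL, TX_RING, b"hello")` and then `ring_pop(sram, TX_CTRL, TX_RING)` returns the same bytes. Because the region is accessed by the MC like normal DRAM, plain loads and stores are the only primitives this model needs.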
15 MCN Driver
[Figure: host software and hardware stack.
  user space: application
  kernel space: Linux network stack; MCN driver (forwarding engine, polling agent, memcpy); NIC driver
  hardware: host CPU w/ MC-0 and MC-1; each memory channel connects a DDR4 DIMM and an MCN DIMM whose SRAM holds TX and RX buffers
  legend: regular access goes to DDR memory, MCN access goes to MCN SRAM]
16 MCN Packet Routing
• Host → MCN
  1. A packet is passed from the host network stack and the packet goes to the corresponding MCN DIMM or NIC
  2. If the packet needs to be sent to an MCN DIMM, the forwarding engine checks the MAC of the packet stored in main memory
  3. If the MAC matches w/ that of an MCN DIMM, the packet is copied to the SRAM buffer of the MCN DIMM
[Figure: the MCN driver stack (application, Linux network stack, MCN driver w/ forwarding engine, polling agent, and memcpy, NIC driver) above a host CPU w/ MC-0 and MC-1. A packet (Header | Data) in DDR memory, annotated w/ IP: X.X.X.X and MAC: AA.AA.AA.AA.AA.AA, is shown as Header|Data/2 on each of the two memory channels on its way to the MCN DIMM SRAM.]
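The three routing steps above amount to a lookup on the destination MAC followed by a copy. The sketch below models that decision only; the class and attribute names are hypothetical, a Python list stands in for each DIMM's SRAM buffer, and the only external fact relied on is that an Ethernet frame starts with its 6-byte destination MAC.

```python
from dataclasses import dataclass, field


@dataclass
class McnDimm:
    """Hypothetical stand-in for one MCN DIMM as seen by the host driver."""
    mac: bytes
    sram_rx: list = field(default_factory=list)   # models the DIMM's SRAM buffer


@dataclass
class ForwardingEngine:
    """Hypothetical model of the driver's forwarding engine."""
    dimms: dict                                    # destination MAC -> McnDimm
    nic_queue: list = field(default_factory=list)  # frames handed to the real NIC

    def forward(self, frame: bytes) -> str:
        dst_mac = frame[0:6]          # step 2: check the MAC of the packet
        dimm = self.dimms.get(dst_mac)
        if dimm is not None:          # step 3: MAC matches an MCN DIMM
            dimm.sram_rx.append(bytes(frame))   # models the memcpy into MCN SRAM
            return "mcn"
        self.nic_queue.append(bytes(frame))     # otherwise the packet goes to the NIC
        return "nic"
```

For example, with one DIMM registered under MAC `aa:aa:aa:aa:aa:aa`, a frame addressed to that MAC lands in the DIMM's `sram_rx` and `forward` returns `"mcn"`; any other destination is queued for the NIC. A real driver would copy into the SRAM region over the memory channel rather than append to a list.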