Building manycore processor-to-DRAM networks using monolithic silicon photonics
Ajay Joshi†, Christopher Batten†, Vladimir Stojanović†, Krste Asanović‡
† MIT, 77 Massachusetts Ave, Cambridge MA 02139
‡ UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720
{joshi, cbatten, vlada}@mit.edu, krste@eecs.berkeley.edu
High Performance Embedded Computing (HPEC) Workshop, 23-25 September 2008

Manycore systems design space MIT/UCB

Manycore system bandwidth requirements

Manycore systems – bandwidth, pin count and power scaling
Assumptions: 1 Byte/Flop, 8 Flops/core @ 5 GHz
[Figure labels: Server & HPC, Mobile Client]
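The bandwidth assumption on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below reads "8 Flops/core @ 5GHz" as 8 floating-point operations per core per cycle, which is an interpretation and not stated on the slide; the 256-core count is borrowed from the later evaluation slides purely for illustration, and the function name is hypothetical.

```python
def required_memory_bandwidth_gbytes(cores, flops_per_cycle, clock_ghz, bytes_per_flop):
    """Aggregate memory bandwidth in GB/s needed to sustain peak compute.

    Assumes every flop needs `bytes_per_flop` bytes from memory
    (the slide's 1 Byte/Flop rule of thumb).
    """
    gflops_per_core = flops_per_cycle * clock_ghz  # GFlop/s per core
    return cores * gflops_per_core * bytes_per_flop


# Slide parameters: 1 Byte/Flop, 8 Flops/core (per cycle, assumed) at 5 GHz.
per_core = required_memory_bandwidth_gbytes(1, 8, 5.0, 1.0)
# Illustrative core count only; not tied to the 5 GHz figure on the slide.
chip_256 = required_memory_bandwidth_gbytes(256, 8, 5.0, 1.0)

print(f"per core: {per_core} GB/s, 256 cores: {chip_256} GB/s")
```

Under these assumptions each core needs 40 GB/s, so the aggregate requirement grows linearly with core count and quickly reaches the multi-TB/s range.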

Interconnect bottlenecks
[Figure: manycore system with CPU cores, on-chip interconnect network, caches, off-chip interconnect network, and DRAM DIMMs]
• Bottlenecks due to energy and bandwidth density limitations
• Need to jointly optimize on-chip and off-chip interconnect network

Outline
• Motivation
• Monolithic silicon photonic technology
• Processor-memory network architecture exploration
• Manycore system using silicon photonics
• Conclusion

Unified on-chip/off-chip photonic link
• Supports dense wavelength-division multiplexing that improves bandwidth density
• Uses monolithic integration that reduces energy consumption
• Utilizes the standard bulk CMOS flow

Optical link components
65 nm bulk CMOS chip designed to test various optical devices

Silicon photonics area and energy advantage

Metric                                               Energy (pJ/b)   Bandwidth density (Gb/s/μm)
Global on-chip photonic link                         0.25            160-320
Global on-chip optimally repeated electrical link    1               5
Off-chip photonic link (50 μm coupler pitch)         0.25            13-26
Off-chip electrical SERDES (100 μm pitch)            5               0.1
On-chip/off-chip seamless photonic link              0.25
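To make the table concrete, the following sketch converts its energy and bandwidth-density figures into link power and chip-edge width for a given off-chip bandwidth. The 1 Tb/s target is a hypothetical value chosen for illustration, the function names are made up, and the density unit is taken to be Gb/s per μm of edge.

```python
def link_power_watts(energy_pj_per_bit, bandwidth_gbps):
    """Power of a link: (pJ/bit) * (Gb/s) = 1e-12 J/b * 1e9 b/s = 1e-3 W."""
    return energy_pj_per_bit * bandwidth_gbps * 1e-3


def edge_width_um(bandwidth_gbps, density_gbps_per_um):
    """Width of chip edge (in um) needed to carry the bandwidth."""
    return bandwidth_gbps / density_gbps_per_um


TARGET_GBPS = 1000.0  # hypothetical 1 Tb/s off-chip bandwidth target

# Off-chip rows of the table: (energy in pJ/b, bandwidth density in Gb/s/um).
# For the photonic link the lower bound of the 13-26 range is used.
off_chip = {
    "electrical SERDES": (5.0, 0.1),
    "photonic link": (0.25, 13.0),
}

for name, (energy, density) in off_chip.items():
    power = link_power_watts(energy, TARGET_GBPS)
    width = edge_width_um(TARGET_GBPS, density)
    print(f"{name}: {power:.2f} W, {width:.0f} um of edge")
```

With the table's numbers, the photonic link uses 20x less energy per bit than the electrical SERDES and needs over 100x less edge width for the same bandwidth, even at the low end of its density range.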

Outline
• Motivation
• Monolithic silicon photonic technology
• Processor-memory network architecture exploration
  • Baseline electrical mesh topology
  • Electrical mesh with optical global crossbar topology
• Manycore system using silicon photonics
• Conclusion

Baseline electrical system architecture
[Figures: mesh physical view; mesh logical view. C = core, DM = DRAM module]
• Access point per DM distributed across the chip
• Two on-chip electrical mesh networks
  • Request path – core → access point → DRAM module
  • Response path – DRAM module → access point → core

Interconnect network design methodology
• Ideal throughput and zero load latency used as design metrics
• Energy constrained approach is adopted
• Energy components in a network
  • Mesh energy (E_m) (router-to-router links (RRL), routers)
  • IO energy (E_io) (logic-to-memory links (LML))
[Flowchart: total energy budget and flit width → calculate on-chip RRL energy and on-chip router energy → calculate total mesh energy → calculate energy budget for LML → calculate LML width → calculate mesh throughput, I/O throughput, and zero load latency]
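The flowchart steps can be sketched as a small calculation. This is a minimal sketch, not the authors' model: it assumes every mesh channel and router moves one flit per cycle, and all per-bit energies, channel counts, and router counts below are hypothetical placeholders. Only the 8 nJ/cycle budget appears in the slides.

```python
def size_network(total_budget_pj, flit_bits, n_channels, n_routers,
                 rrl_pj_per_bit, router_pj_per_bit, lml_pj_per_bit, n_lml):
    """Split a per-cycle energy budget between the mesh and the I/O links.

    Returns (mesh energy E_m, I/O energy E_io, width of each LML in bits).
    Assumes all channels and routers are fully utilized every cycle.
    """
    e_rrl = n_channels * flit_bits * rrl_pj_per_bit      # router-to-router links
    e_router = n_routers * flit_bits * router_pj_per_bit  # routers
    e_mesh = e_rrl + e_router                             # E_m
    e_io = total_budget_pj - e_mesh                       # E_io, left for LMLs
    if e_io <= 0:
        raise ValueError("mesh alone exceeds the energy budget")
    # Each LML bit costs lml_pj_per_bit per cycle; split evenly over n_lml links.
    lml_bits = int(e_io // (n_lml * lml_pj_per_bit))
    return e_mesh, e_io, lml_bits


# Hypothetical inputs; 8000 pJ/cycle corresponds to the 8 nJ/cyc budget.
e_mesh, e_io, lml_bits = size_network(
    total_budget_pj=8000.0, flit_bits=16, n_channels=100, n_routers=50,
    rrl_pj_per_bit=1.0, router_pj_per_bit=2.0, lml_pj_per_bit=5.0, n_lml=16)

print(f"E_m = {e_mesh} pJ, E_io = {e_io} pJ, LML width = {lml_bits} bits")
```

Widening the flit raises the mesh energy and shrinks what is left for the logic-to-memory links, which is the trade-off the methodology explores.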

Network throughput and zero load latency
[Plots for OPF:1, OPF:2, OPF:4, with on-chip serialization and off-chip serialization regions marked; 22nm tech, 256 cores @ 2.5 GHz, 8 nJ/cyc energy budget]
• System throughput limited by on-chip mesh or I/O links
• On-chip mesh could be over-provisioned to overcome mesh bottleneck
• Zero load latency limited by data serialization
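The two limits named on this slide can be illustrated with a simple model: ideal throughput is the smaller of the mesh and I/O capacities, and zero load latency adds serialization delay on narrow channels to the hop latency. This is a simplified sketch; the packet size, widths, hop count, and per-hop delay are hypothetical values, not numbers from the talk.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def ideal_throughput(mesh_bits_per_cycle, io_bits_per_cycle):
    """System throughput is capped by whichever stage is narrower."""
    return min(mesh_bits_per_cycle, io_bits_per_cycle)


def zero_load_latency_cycles(packet_bits, flit_bits, lml_bits, hops, cycles_per_hop):
    """Latency of one packet in an otherwise empty network.

    Serialization: a packet wider than the channel takes several cycles
    to cross it, both on-chip (flit width) and off-chip (LML width).
    """
    on_chip_serialization = ceil_div(packet_bits, flit_bits)
    off_chip_serialization = ceil_div(packet_bits, lml_bits)
    return hops * cycles_per_hop + on_chip_serialization + off_chip_serialization


# Hypothetical example: 512-bit packet, 64-bit flits, 16-bit LML, 10 hops.
latency = zero_load_latency_cycles(512, 64, 16, hops=10, cycles_per_hop=2)
throughput = ideal_throughput(mesh_bits_per_cycle=2048, io_bits_per_cycle=256)
print(latency, throughput)
```

In this example the narrow off-chip link contributes 32 of the 60 cycles, which is the kind of serialization penalty the slide refers to.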

Outline
• Motivation
• Monolithic silicon photonic technology
• Processor-memory network architecture exploration
  • Baseline electrical mesh topology
  • Electrical mesh with optical global crossbar topology
• Manycore system using silicon photonics
• Conclusion

Optical system architecture
[Figures: mesh physical view; mesh logical view. C = core, DM = DRAM module]
• Off-chip electrical links replaced with optical links
• Electrical to optical conversion at access point
• Wavelengths in each optical link distributed across various core-DRAM module pairs

Network throughput and zero load latency
• Reduced I/O cost improves system bandwidth
• Reduction in latency due to lower serialization latency
• On-chip network is the new bottleneck

Optical multi-group system architecture
[Figure legend: Ci = core in group i, DM = DRAM module, S = global crossbar switch]
• Break the single on-chip electrical mesh into several groups
  • Each group has its own smaller mesh
  • Each group still has one AP for each DM
  • More APs → each AP is narrower (uses fewer λs)
• Use optical network as a very efficient global crossbar
• Need a crossbar switch at the memory for arbitration
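The effect of grouping on mesh size and access-point width can be sketched numerically. The code below assumes square meshes, a fixed total wavelength count shared evenly by all access points, and uniform core-to-core traffic as a stand-in for average mesh distance; the 1024-wavelength total is a hypothetical figure, while 256 cores and 16 DRAM modules match the simulated system.

```python
import math
from itertools import product


def mean_hops(side):
    """Average Manhattan distance between two uniformly random nodes
    of a side x side mesh, computed by brute force."""
    nodes = list(product(range(side), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / len(nodes) ** 2


def group_config(cores, dram_modules, groups, total_wavelengths):
    """Summarize one grouping choice: one AP per DM in every group."""
    side = math.isqrt(cores // groups)        # mesh side within a group
    access_points = groups * dram_modules
    return {
        "mesh_side": side,
        "access_points": access_points,
        "wavelengths_per_ap": total_wavelengths / access_points,
        "mean_hops": mean_hops(side),
    }


for g in (1, 4, 16):
    print(g, group_config(cores=256, dram_modules=16, groups=g,
                          total_wavelengths=1024))
```

Going from 1 to 16 groups shrinks each mesh from 16x16 to 4x4 and cuts the average hop count by more than 4x, while each access point gets proportionally fewer wavelengths.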

Network throughput vs zero load latency
• Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
• Grouping and silicon photonics provide a 10x-15x throughput improvement
• Grouping reduces zero load latency (ZLL) in the photonic range, but increases ZLL in the electrical range

Simulation results
[Two plots: 256 cores, 16 DM, uniform random traffic]
• Grouping → 2x improvement in bandwidth at comparable latency
• Overprovisioning → 2x-3x improvement in bandwidth for small group count at comparable latency
  • Minimal improvement for large group count

Simulation results
[Two plots: 256 cores, 16 DM, uniform random traffic]
• Replacing off-chip electrical with photonics (Eg1x4 → Og1x4) → 2x improvement in bandwidth at comparable latency
• Using opto-electrical global crossbar (Eg4x2 → Og16x1) → 8x-10x improvement in bandwidth at comparable latency

Simplified 16-core system design

Full 256-core system design

Conclusion
• On-chip network design and memory bandwidth will limit manycore system performance
• Unified on-chip/off-chip photonic link is proposed to solve this problem
• Grouping with optical global crossbar improves system throughput
• For an energy-constrained approach, photonics provide 8-10x improvement in throughput at comparable latency

Backup

MIT Eos1 65 nm test chip
• Texas Instruments standard 65 nm bulk CMOS process
• First ever photonic chip in sub-100nm CMOS
• Automated photonic device layout
• Monolithic integration with electrical modulator drivers

[Figure labels: two-ring filter, vertical coupler grating, digital driver, ring modulator, one-ring filter, photodetector, paperclips, waveguide crossings, M-Z test structures, 4 ring filter banks]

Optical waveguide
[Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]
• Waveguide made of polysilicon
• Silicon substrate under waveguide etched away to provide optical cladding
• 64 wavelengths per waveguide in opposite directions

Modulators and filters
Double-ring resonant filter
• 2nd-order ring filters used
• Rings tuned using sizing and heating
Resonant racetrack modulator
• Modulator is tuned using charge injection
• Sub-100 fJ/bit energy cost for the modulator driver

Photodetectors
• Embedded SiGe used to create photodetectors
• Monolithic integration enables good optical coupling
• Sub-100 fJ/bit energy cost required for the receiver
