Photonic Many-Core Architecture Study
Nadya Bliss (1), Krste Asanović (2), Keren Bergman (3), Luca Carloni (3), Jeremy Kepner (1), Sanjeev Mohindra (1), Vladimir Stojanović (4)
(1) MIT Lincoln Laboratory, (2) University of California Berkeley, (3) Columbia University, (4) MIT Research Laboratory of Electronics
September 23rd, 2008
PM: Jagdeep Shah
This work is sponsored by DARPA under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory HPEC2008 1 NTBliss 9/29/2008
Outline
• Introduction
• Logical Architecture Abstraction
• Modeling and Mapping
• Experiments and Results
• Summary
Emerging Device Trends
• 3D Fabrication: reduced path length for accesses across the memory hierarchy
• Feature Size Reduction, 1970s to 2008: Intel 4004 (10 microns), Sun Sparc (0.8 microns), AMD Athlon (0.18 microns), STI Cell (65 nm), Intel Core 2 (45 nm)
• Photonic Interconnects
[Figure label: Intel 80486DX2, die 12x6.75mm]
Emerging device technologies create a large parameter space of possible future architectures.
Benefits of Photonic Interconnects
TO MEMORY
ELECTRONICS
• Communication to memory banks is chip power and pin/wire density limited
• Poor scaling of on-chip mem controllers with cores
• At most 3-6 Tb/sec in the next few years
OPTICS
• Use optical network as an efficient global crossbar
• Better scaling with N groups
• Expected performance - 40-80 Tb/sec
CORE-TO-CORE
ELECTRONICS
• Buffer, receive and re-transmit at every switch
• Power dissipation grows with data rate
OPTICS
• Modulate/receive data once per communication
• Scalable, low power switch fabric
• Balanced communication and computation
[Figure: chains of TX/RX blocks contrasting the electronic and optical core-to-core paths]
Photonics can provide high bandwidth, low latency communication while meeting power requirements of embedded systems.
System Level View
-Photonic Many-core Architecture Network: PhotoMAN-
Selecting a system level architecture allows the parameter space to be narrowed while meeting requirements of DoD applications.
• Manycore processor chip
– 64-256 cores (in 22nm node)
• Off-chip memory
– a set of DRAM chips
– minimum capacity - 128 GB (at 22nm)
• Evaluate interaction of the photonic network and memory hierarchy
• Board power limit 500 W
– Consistent with power constraints of medium-sized UAV RQ-7 Shadow
To evaluate the architecture, develop:
1. Expressive logical abstraction
2. Modeling and mapping framework
Outline
• Introduction
• Logical Architecture Abstraction
• Modeling and Mapping
• Experiments and Results
• Summary
Logical Abstraction
-Kuck* Memory Hierarchy-
2-LEVEL HIERARCHY EXAMPLE
[Figure: 2-level hierarchy diagram with elements P_0, M_0, N_0.5, SMN_1, SM_1, N_1.5, SMN_2, SM_2]
Legend:
• P - processor
• N - inter-processor network
• M - memory
• SM - shared memory
• SMN - shared memory network
Subscript indicates hierarchy level. An x.5 subscript for N indicates indirect memory access.
The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy.
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
PhotoMAN Logical Representation
-MIT/UCB 1 Group Memory Configuration-
[Figure: three views of the 1-group configuration: Detailed, System-Level, High-Level]
Legend:
• AP - access point
• APG - access point group
The Kuck notation is suitable for both high-level and detailed physical descriptions of the architecture, such as groups and access points.
PhotoMAN Logical Representation
-MIT/UCB 4 Group Memory Configuration-
[Figure: logical diagram with P_0 and M_0 (indices 0...255), N_0.5, SMN_1 (0...3), AP_1 (0...15 per group), APG (0...3), APN_1, XS_2 (0...15), XSG, SM_2 (0...15)]
• SM 0...15 are DRAM memory banks, 8GB each
• XS to SM connections are 1-to-1
• APN connections are 1-to-Number of Groups
• Number of access points per group is equal to number of memory banks
• SMN 0...3 is an electrical mesh connecting only processors within the group
• N_0.5 is a single electrical mesh
• Logical view of the 16 (N) group configuration is similar
Legend:
• APN - access point network
• XS - cross bar
• XSG - cross bar group
While the Kuck representation is flexible, the PhotoMAN study is focused on 1, 4, and 16 group memory configurations.
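The sizing rules on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, and it assumes the "access points per group equals number of memory banks" rule stated for the 4-group case also applies to the 1- and 16-group cases.

```python
# Values taken from the slide; names are hypothetical and for illustration only.
NUM_CORES = 256   # processors P_0 are indexed 0...255
NUM_BANKS = 16    # DRAM banks SM 0...15
BANK_GB = 8       # each bank holds 8 GB


def group_config(num_groups):
    """Derive per-group sizes for a configuration with `num_groups` groups."""
    if NUM_CORES % num_groups != 0:
        raise ValueError("cores must divide evenly among groups")
    return {
        "groups": num_groups,
        "cores_per_group": NUM_CORES // num_groups,
        # Assumption: one access point per memory bank in every group.
        "access_points_per_group": NUM_BANKS,
        "total_access_points": num_groups * NUM_BANKS,
        "total_memory_gb": NUM_BANKS * BANK_GB,
    }


if __name__ == "__main__":
    for groups in (1, 4, 16):  # the configurations the study focuses on
        print(group_config(groups))
```

Under these assumptions the 16 banks of 8 GB give 128 GB in total, which matches the minimum off-chip capacity quoted on the system-level slide.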
Outline
• Introduction
• Logical Architecture Abstraction
• Modeling and Mapping
• Experiments and Results
• Summary
pMapper: Modeling and Mapping
• Machine description together with an abstraction layer is used to generate a performance model
• Application specification (MATLAB) is used to generate a signal flow graph
• Maps (distribution specifications) are generated for the application
• Results can be used to predict application performance and architecture parameters
pMapper performs
• application to architecture mapping
• application on architecture simulation
[Figure labels: APPLICATION, SIGNAL FLOW GRAPH]
PhotoMAN Machine Description
Given a hardware model H and a program parse tree T, pMapper finds maps M that minimize execution latency:
[equation not preserved in this transcript]
[Figure callout: Focus of the PhotoMAN study]
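The objective stated above can be written compactly as an optimization over candidate maps. This is only one possible rendering: the symbol M* and the function name "latency" are assumptions made for illustration and are not taken from the slide.

```latex
M^{*} = \operatorname*{arg\,min}_{M} \; \mathrm{latency}(T, H, M)
```

Here T is the program parse tree, H is the hardware model, and M ranges over the maps (distribution specifications) that pMapper considers.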
Memory Hierarchy Formulation
-MIT/UCB 1 Group Memory Configuration-
[Figure panels: PHYSICAL VIEW; CORE-TO-CORE NETWORK, N_0.5; SHARED MEMORY NETWORK, SMN_1; ACCESS POINTS; AP-to-SM]
• Bandwidth and latency matrices have the same pattern of non-zeros
• Topology for N_0.5 and SMN_1 is the same for the 1-Group configuration
• Diagonal entries encode
– R_N - bandwidth to local store
– R_Mon - whether P_i is an access point
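The matrix encoding described above can be sketched for a tiny mesh. Everything specific here is an assumption for illustration: the 2x2 mesh size, the numeric values, the choice of which cores are access points, and the helper names. The reading of R_Mon as a 0/1 connectivity matrix is also an assumption.

```python
def mesh_matrix(rows, cols, link_value, diagonal):
    """Build an n-by-n matrix (n = rows * cols) for a 2D mesh of cores.

    Entry [i][j] is `link_value` when cores i and j are mesh neighbours and
    0 otherwise; entry [i][i] is taken from `diagonal`.
    """
    n = rows * cols
    if len(diagonal) != n:
        raise ValueError("diagonal must have one entry per core")
    m = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            m[i][i] = diagonal[i]
            if c + 1 < cols:  # link to the east neighbour
                m[i][i + 1] = m[i + 1][i] = link_value
            if r + 1 < rows:  # link to the south neighbour
                m[i][i + cols] = m[i + cols][i] = link_value
    return m


def off_diagonal_pattern(m):
    """Return True/False per entry: is this an off-diagonal non-zero?"""
    return [[i != j and v != 0 for j, v in enumerate(row)]
            for i, row in enumerate(m)]


if __name__ == "__main__":
    # R_N: link bandwidth off the diagonal, local-store bandwidth on it.
    r_n = mesh_matrix(2, 2, link_value=10.0, diagonal=[40.0] * 4)
    # R_Mon: 1 where a link exists; diagonal flags cores 0 and 3 as access points.
    r_mon = mesh_matrix(2, 2, link_value=1, diagonal=[1, 0, 0, 1])
    # Same topology for N_0.5 and SMN_1 in the 1-group case.
    print(off_diagonal_pattern(r_n) == off_diagonal_pattern(r_mon))
```

Because both matrices come from the same mesh builder, their off-diagonal non-zero patterns agree by construction, which mirrors the slide's statement that the two networks share a topology in the 1-group configuration.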
Memory Hierarchy Formulation
-MIT/UCB N_G Group Memory Configuration-
[Figure panels: PHYSICAL VIEW; SHARED MEMORY NETWORK, SMN_1; ACCESS POINTS; AP-XS-MEMORY NETWORK; AP-XS BANDWIDTH; XS-MEMORY BANDWIDTH]
• Core-to-core network is not shown and is the same as in the 1 group case
• While memory access requires one additional transfer, the topology is represented with a single matrix - R_AXSon
Outline
• Introduction
• Logical Architecture Abstraction
• Modeling and Mapping
• Experiments and Results
• Summary
Maps
[Figure: example maps over processors P0, P1, P2, P3 (1D BLOCK, 1D CYCLIC, 2D BLOCK, 2D CYCLIC, 1D HIERARCHICAL, ...), arranged along an axis labeled INCREASING PROGRAMMING COMPLEXITY]
• High programmability is a desirable architecture characteristic
• Complexity of the mapping chosen to optimize performance (minimize execution time) provides insight into programmability of hardware
• The higher the complexity of the mapping, the lower the programmability
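The block and cyclic maps named on this slide can be made concrete with the standard ownership formulas for those distributions. The function names below are hypothetical and this is a minimal sketch of the general technique, not pMapper's implementation; the hierarchical map is not covered.

```python
def block_owner(i, n, p):
    """1D block map: n elements split into contiguous chunks over p processors."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk


def cyclic_owner(i, p):
    """1D cyclic map: element i goes to processor i mod p (round-robin)."""
    return i % p


def block_2d_owner(i, j, shape, grid):
    """2D block map on a grid[0]-by-grid[1] processor grid, row-major ranks."""
    row = block_owner(i, shape[0], grid[0])
    col = block_owner(j, shape[1], grid[1])
    return row * grid[1] + col


def cyclic_2d_owner(i, j, grid):
    """2D cyclic map on a grid[0]-by-grid[1] processor grid, row-major ranks."""
    return cyclic_owner(i, grid[0]) * grid[1] + cyclic_owner(j, grid[1])


if __name__ == "__main__":
    n, p = 8, 4  # 8 elements over processors P0...P3
    print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 1, 1, 2, 2, 3, 3]
    print([cyclic_owner(i, p) for i in range(n)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```

A block map keeps neighbouring elements on the same processor, while a cyclic map interleaves them, so which one minimizes execution time depends on the communication pattern of the application.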