
Implementing Low-Diameter OCN for Manycore Processors Using A Tiled Physical Design Methodology


  1. Implementing Low-Diameter On-Chip Networks for Manycore Processors Using a Tiled Physical Design Methodology
     Yanghui Ou, Shady Agwa, Christopher Batten
     Computer Systems Laboratory, Cornell University

  2. Real Manycore Implementations Use Simple Mesh OCNs
     - Epiphany-V: 1024 cores, 32x32 mesh (Adapteva, Inc)
     - Celerity: 496 cores, 16x31 mesh (University of Washington, University of Michigan, Cornell University, UC San Diego)
     - KiloCore: 1000 cores, 32x32 mesh (UC Davis)
     Motivation • Topologies • Analytical Modeling • Physical Design • PyOCN Framework (Page 1 of 23)

  3. Plenty of Novel OCN Topologies Proposed in the Academic Area
     - Flattened Butterfly (Kim+, MICRO’07)
     - Concentrated Mesh, Fat-Tree (Balfour+, ICS’06)
     - Multi-drop Express Channels (Grot+, HPCA’06/ISCA’11)
     - Clos Network (Kao+, TCAS’11)
     - Slim NoC (Besta+, ASPLOS’18)
     - Asymmetric High-Radix (Abeyratne+, HPCA’13)
     - SMART NoC (Chen+, HPCA’13)

  4. Gap Between Principle and Practice
     - Why do manycore processor implementations with 500-1000 cores continue to use simple high-diameter on-chip networks?
     - Manycores require simple, low-area routers
     - Manycores use standard-cell-based design
     - Manycores use a tiled physical design methodology with three key constraints:
       1. Design is based on tiling a homogeneous hard macro across the chip
       2. All chip top-level routing between hard macros must use short wires to neighboring macros
       3. Timing closure for the hard macro must imply timing closure at the chip level
     - Figure: Hard Macros in Celerity

  5. Outline: Motivation • Manycore OCN Topologies • Manycore OCN Analytical Modeling • Manycore OCN Physical Design • PyOCN Framework

  6. Target Chip: 16x16 Manycore
     - 16x16 manycore at 1GHz using 14nm technology
     - 3mm x 3mm, 185µm x 185µm per core
     - Per-core area roughly corresponds to an in-order RV32IMAF processor with 4KB data cache and 4KB instruction cache
     - Figure: manycore chip, 3mm x 3mm

     Two reference designs:
       Per-core Area   24,250µm²   103,500µm²
       Process         16nm        16nm
       Frequency       ~1GHz       500MHz
       ISA             RV32IM      RV64G
       Issue Width     Single      Dual
       L1 Memory       8KB         64KB

     Component          Area (µm²)
       RV32IMAF-IO      15983
       4KB data cache   9407
       4KB inst. cache  9347
       Total            34737
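As a quick sanity check on the target-chip numbers, the sketch below recomputes the component total and compares it with the 185µm x 185µm tile. The variable names are illustrative and do not come from the slides.

```python
# Component areas in square micrometres, as listed on the slide.
component_area_um2 = {
    "RV32IMAF-IO": 15983,
    "4KB data cache": 9407,
    "4KB inst. cache": 9347,
}

TILE_SIDE_UM = 185    # per-core tile edge from the slide
CORES_PER_SIDE = 16   # 16x16 manycore

total_component_area = sum(component_area_um2.values())
tile_area = TILE_SIDE_UM ** 2
chip_side_mm = CORES_PER_SIDE * TILE_SIDE_UM / 1000

print(f"component total: {total_component_area} um^2")
print(f"tile area:       {tile_area} um^2")
print(f"tiled chip edge: {chip_side_mm:.2f} mm")
```

The component total (34737µm²) is within about 1.5% of the tile area (34225µm²), and sixteen tiles span 2.96mm, which is consistent with the "roughly corresponds" wording and the 3mm x 3mm chip.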

  7. Ruche Channels to Reduce the OCN Diameter
     - Figure (16x16 manycore): no ruche channels; ruche factor of 2; ruche factor of 3
     - Directly skips one or more routers
     - Reduces network diameter
     - Increases the number of bisection channels
     - Increases router radix
     - Concurrently proposed with T. Jung et al., Ruche Networks: Wire-Maximal No-Fuss NoCs
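The effect of ruche channels on diameter can be illustrated with a small sketch. It assumes ruche channels are added in both dimensions, that routing is minimal and never overshoots the destination, and that a ruche factor of 0 means no ruche channels (mirroring names such as mesh-c1r0). The function names are hypothetical.

```python
def max_hops_1d(num_routers, ruche):
    """Worst-case hop count along one dimension of `num_routers` routers.

    Assumption: a packet takes as many ruche hops (length `ruche`) as fit
    without overshooting, then finishes with single-hop near channels.
    """
    def hops(distance):
        if ruche <= 1:                 # no ruche channels
            return distance
        return distance // ruche + distance % ruche

    return max(hops(d) for d in range(num_routers))


def mesh_diameter(routers_x, routers_y, ruche=0):
    """Diameter in hops of a 2D mesh with the same ruche factor in x and y."""
    return max_hops_1d(routers_x, ruche) + max_hops_1d(routers_y, ruche)


for r in (0, 2, 3):
    print(f"ruche factor {r}: diameter {mesh_diameter(16, 16, r)} hops")
```

Under these assumptions a 16x16 mesh goes from 30 hops with no ruche channels to 16 hops with a ruche factor of 2 and 12 hops with a ruche factor of 3.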

  8. Concentration to Reduce the OCN Diameter
     - Figure (16x16 manycore): no concentration; concentration factor of four; concentration factor of eight
     - Groups multiple cores together to share one router
     - Reduces network diameter
     - Reduces the number of routers
     - Reduces the number of bisection channels
     - Increases router radix
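A minimal sketch of how concentration changes router count, diameter, and radix follows. It assumes a plain mesh with no ruche channels, one router port per attached core, and core groupings of 2x2 (factor of four) and 4x2 (factor of eight); the grouping shapes and the function name are assumptions, not taken from the slides.

```python
def concentrated_mesh(cores_x, cores_y, group_x, group_y):
    """Return (routers, diameter, radix) when each router serves a
    group_x by group_y block of cores.

    Radix is for an interior router: 4 mesh ports plus one port per core.
    """
    routers_x = cores_x // group_x
    routers_y = cores_y // group_y
    routers = routers_x * routers_y
    diameter = (routers_x - 1) + (routers_y - 1)
    radix = 4 + group_x * group_y
    return routers, diameter, radix


for label, (gx, gy) in {"c1": (1, 1), "c4": (2, 2), "c8": (4, 2)}.items():
    routers, diameter, radix = concentrated_mesh(16, 16, gx, gy)
    print(f"{label}: {routers} routers, diameter {diameter}, radix {radix}")
```

With these assumptions the 16x16 manycore drops from 256 routers (diameter 30, radix 5) to 64 routers (diameter 14, radix 8) at a concentration factor of four, and to 32 routers (diameter 10, radix 12) at a factor of eight.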

  9. Tiled Physical Design: No Ruche Channels
     - Only has near channel in both dimensions
     - Pins are aligned to ensure short global routing
     - Figures: mesh-c1r0 tiled physical design; mesh-c1r0 hard macro in 1D (label: Near Channel)

  10. Tiled Physical Design: Ruche Factor of Two
      - Near channel, far channel, and one feedthrough channel in one dimension
      - Short cross-over routing between feedthrough channel and far channel
      - Figures: mesh-c1r2 tiled physical design; mesh-c1r2 hard macro in 1D (labels: Feedthrough Channel, Far Channel, Near Channel)

  11. Tiled Physical Design: Ruche Factor of Three
      - Near channel, far channel, and two feedthrough channels in one dimension
      - Short cross-over routing between feedthrough channels and far channel
      - Figures: mesh-c1r3 tiled physical design; mesh-c1r3 hard macro in 1D (labels: Feedthrough Channel, Far Channel, Near Channel)

  12. Tiled Physical Design: Folded Torus
      - Only far channel and feedthrough channel in one dimension
      - Short cross-over routing between feedthrough channels and far channel
      - Short wrap-around routing at the edge
      - Figures: torus-c4r0 tiled physical design; torus-c1r0 hard macro in 1D (labels: Feedthrough Channel, Far Channel)

  13. Outline: Motivation • Manycore OCN Topologies • Manycore OCN Analytical Modeling • Manycore OCN Physical Design • PyOCN Framework

  14. Analytical Modeling Methodology
      - Model the latency, area, and bandwidth analytically before doing physical design to narrow down our focus
      - Router area model and channel latency model are constructed based on physical results and floorplans
      - Zero-load latency is calculated analytically: $T_0 = H_r t_r + H_c t_c + L/b$ (router latency, channel latency, and serialization latency)
      - Observation:
        • Router area does not scale quadratically as radix increases
        • A packet can travel a very long distance on the channel in one cycle
      - Figure panels: No Concentration; Concentration factor of four; Concentration factor of eight
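Zero-load latency is commonly modeled as the sum of router latency, channel latency, and serialization latency. The sketch below follows that common model; the parameter names and the example values (one cycle per router and per channel, a 256-bit message on 64-bit channels) are assumptions chosen for illustration, not numbers from the slides.

```python
import math


def zero_load_latency(router_hops, router_delay,
                      channel_hops, channel_delay,
                      message_bits, channel_bits):
    """Zero-load latency in cycles.

    router_hops * router_delay    : time spent in routers
    channel_hops * channel_delay  : time spent on channels
    ceil(message_bits / channel_bits) : serialization latency
    """
    serialization = math.ceil(message_bits / channel_bits)
    return (router_hops * router_delay
            + channel_hops * channel_delay
            + serialization)


# Hypothetical corner-to-corner path in a 16x16 mesh: 31 routers, 30 channels.
latency = zero_load_latency(31, 1, 30, 1, message_bits=256, channel_bits=64)
print(latency)  # 31 + 30 + 4 = 65 cycles
```

Halving the hop counts or doubling the channel width both lower the result, which is why topology and channel width have to be evaluated together.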

  15. Analytical Modeling Results
      - Figure panels: Latency vs Area (256b message, under 4Kb/cycle BW constraint); Latency vs Bandwidth (256b message, under 10% area constraint)
      - Moderate ruche factor improves bandwidth and/or reduces area
      - Moderate concentration reduces latency at similar bandwidth and area
      - Increasing ruche factor does not necessarily improve latency, as it may lead to narrower channels, which increases serialization latency
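The serialization effect can be sketched under a fixed bisection bandwidth budget. This example assumes that 4Kb/cycle means 4096 bits per cycle counted over both directions, that a ruche factor of r adds r channels per row and per direction across the bisection cut, and that channels are unconcentrated (16 rows). All names are hypothetical.

```python
import math


def channel_width_bits(rows, ruche, bisection_bits_per_cycle):
    """Widest channel that fits the bisection budget.

    Assumption: each row contributes 1 near channel plus `ruche` ruche
    channels across the cut, in each of the two directions.
    """
    channels = 2 * rows * (1 + ruche)
    return bisection_bits_per_cycle // channels


def serialization_cycles(message_bits, width_bits):
    """Cycles needed to push one message through a channel of this width."""
    return math.ceil(message_bits / width_bits)


for r in (0, 2, 3):
    width = channel_width_bits(rows=16, ruche=r, bisection_bits_per_cycle=4096)
    cycles = serialization_cycles(256, width)
    print(f"ruche factor {r}: {width}-bit channels, {cycles} serialization cycles")
```

Under these assumptions the channel width falls from 128 bits to 42 and then 32 bits as the ruche factor grows, so a 256-bit message needs 2, 7, and 8 serialization cycles respectively. Whether the hop savings outweigh this depends on the per-hop delays.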

  16. Outline: Motivation • Manycore OCN Topologies • Manycore OCN Analytical Modeling • Manycore OCN Physical Design • PyOCN Framework

  17. Hard Macro Design Methodology
      - Map global timing constraints to local timing constraints
      - Use three metal layers for local horizontal routing (M2, M4, M6), three layers for vertical routing (M3, M5, M7)
      - Connect “dummy cores” to the injection and ejection ports of the router to prevent the ASIC toolflow from optimizing away any logic
      - Use routing and placement blockages to prevent the ASIC toolflow from using the routing resources reserved for the real cores
      - Figures: mesh-c1r2 global constraints; mesh-c1r2 local constraints (labels: Feedthrough Channel, Far Channel, Near Channel)

  18. Example Hard Macros
      - Macros shown: mesh-c1r0-b32, mesh-c1r0-b64, mesh-c1r0-b128, mesh-c1r0q0-b32, torus-c1r0-b32, mesh-c4r0-b128, mesh-c4r2-b64
      - Figure labels: No Concentration & Ruche Channels; Concentration Factor of Four; No Ruche Channels; Ruche Factor of Two
      - Dimension labels: 185µm, 375µm

  19. Composing Hard Macros at Chip Top-Level
      - Figures: torus-c1r0-b32 Close Up; mesh-c4r2-b64 Close Up; torus-c1r0-b32 Full Chip; mesh-c4r2-b64 Full Chip
      - Routing labels: Straight Across Routing; Wrap-Around Routing; Cross-Over Routing; Global Clock & Reset Routing
      - Dimension labels: 275µm, 3140µm, 230µm, 3100µm
      1. Design is based on tiling a homogeneous hard macro across the chip
      2. All chip top-level routing between hard macros must use short wires to neighboring macros
      3. Timing closure for the hard macro must imply timing closure at the chip level

  20. Macro-Level Results for Promising Topologies
      - Figure panels: Latency vs Area (64b message); Latency vs Area (256b message); Bandwidth vs Area
      - Increasing bandwidth leads to an increase in area for all topologies
      - Increasing concentration and ruche factor leads to lower latency and lower area

  21. Outline: Motivation • Manycore OCN Topologies • Manycore OCN Analytical Modeling • Manycore OCN Physical Design • PyOCN Framework
