Slide 1: Optimized Core-links for Low-latency NoCs
Ryuta Kawano†, Seiichi Tade†, Ikki Fujiwara††, Hiroki Matsutani†, Hideharu Amano†, Michihiro Koibuchi††
† Keio University, †† National Institute of Informatics
blackbus@am.ics.keio.ac.jp
Slide 2: Contents
• Conventional NoCs
  • Small-world networks
  • Difficulty in applying them on chips
• How do we reduce path hops of NoCs?
  • Adding multiple links between a core and routers
  • Optimization method for picking core-links
• Evaluations
  • Zero-load latencies
  • Costs
  • Full-system simulation
• Conclusions
Slide 3: Increasing # of Cores on NoCs
[Chart: number of PEs per chip (caches not included) vs. year, 2002-2012. Ranges from 2-4 core chip multi-processors (Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64) through 8-64 PE chips (Sparc T1/T2/T3, STI Cell BE, MIT RAW, UT TRIPS (OPN), ClearSpeed CSX600, TILERA TILE64, Intel SCC) up to 128-256 PE accelerators and GPUs (picoChip PC102, Intel 80-core, Xeon Phi, TILE Gx100, GeForce 8800 / GTX280 / GTX480). GPUs integrate many simple PEs.]
Slide 4: Intel Teraflops Chip: 80-tile 2D Mesh
[Same PE-count chart as slide 3, highlighting the Teraflops chip (HPC) [3].]
[3] http://techresearch.intel.com/spaw2/uploads/images/Teraflop-Chip.jpg
Slide 5: Intel Teraflops Chip: 80-tile 2D Mesh (cont.)
[Same PE-count chart as slide 3.]
Conventional topologies (e.g. 2D Mesh) have a large # of hops when many cores are integrated.
Slide 6: Contents (repeated agenda slide)
Slide 7: Small-world Topology for off-Chip Networks
Reduction of # of hops using small-world effects [Koibuchi et al. 2012].
[Figure: router topologies — a ring + random links vs. a ring + non-random links]
Slide 8: Small-world Topology on Chips
2D Mesh + inter-router additional links [Ogras et al. 2006].
[Figure: 3x3 mesh of cores and routers, tiles numbered 1-9, with additional inter-router links]
Need to use custom routing:
• to reduce path hops
• to avoid deadlocks
Slide 9: Contents (repeated agenda slide)
Slide 10: Multiple Core-links to Reduce Path Hops
Idea: a router topology (here, a Mesh) + multiple links between a core and multiple routers.
[Figure: conventional NoC vs. multiple core-link NoC, showing cores and routers]
Slide 11: Our Idea: Reduction of First and Last 1-hop Latencies with Shortcut Core-links
• Using the shortest path between source and destination cores
• Achieving fewer hops by small-world effects
• Maintaining the regularity of the router topology
[Figure: multiple core-links between cores and routers]
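The shortcut effect above can be sketched as follows. This is an illustrative model, not the authors' code: it assumes each core lists the routers its core-links attach to, and that `router_hops` is a precomputed router-to-router hop matrix.

```python
def core_to_core_hops(src_routers, dst_routers, router_hops):
    """Hops from a source core to a destination core when each core may
    inject/eject at any of its attached routers (its core-links).
    Adds 2 for the first (core -> router) and last (router -> core) hops."""
    return 2 + min(router_hops[s][d]
                   for s in src_routers
                   for d in dst_routers)
```

With a single core-link per core this reduces to the conventional path length; extra links let a packet pick the injection/ejection routers that minimize the router hops in between.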
Slide 13: Optimization Method for Picking Core-links
• Problem: longer core-links lower the operating frequency
• Solution: optimization using a GA (Genetic Algorithm)
Definition of an individual: a list indexed by source core, where each gene holds the destination router of that core's additional core-link.
[Figure: an example individual of (source core -> destination router) pairs and its corresponding topology]
This provides the best tradeoff between link length and # of hops.
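A minimal sketch of this individual encoding and a mutation step, assuming one additional core-link per core and the 20 % mutation rate listed in the backup slides; the function names are hypothetical, not from the talk.

```python
import random

def random_individual(num_cores, num_routers, rng):
    # Gene i = id of the router that core i's additional core-link attaches to
    return [rng.randrange(num_routers) for _ in range(num_cores)]

def mutate(individual, num_routers, rng, p_mut=0.2):
    # With probability p_mut per gene, re-attach the core-link to a random router
    return [rng.randrange(num_routers) if rng.random() < p_mut else gene
            for gene in individual]
```

In a full GA these operators would be combined with crossover and tournament selection (size 3 in the talk) under the fitness function defined in the backup slides.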
Slide 14: Contents (repeated agenda slide)
Slide 15: Zero-load Latency
• 8 × 8 Mesh router topology + optimized core-links
• Max. core-link length: 4 tiles
[Chart: zero-load latency in cycles (max. and ave., roughly 10-70) vs. # of links per core (1-4), compared against the conventional 8 × 8 Mesh]
Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %.
Slide 16: Costs (8 × 8 Mesh)
Evaluated with two links per core, 65 nm CMOS process, 1.2 V supply voltage, and a wire capacitance load of 0.20 pJ/mm (from ITRS 2007).
• Wire density overhead on each tile: Max. 5.40 links, Ave. 3.06 links, SD 1.58 links
• Energy consumption: [Chart: energy consumption in pJ/flit (roughly 400-700) vs. # of links per core (1-4)]; increase by 3.0 % at minimum
• Router area (Fujitsu 65 nm process): 7.71 mm² with 1 link per core, 9.86 mm² with 2 links per core (increase by 27.8 %)
Slide 17: Parameters of Full-System Simulation
• GEM5 [Binkert et al. 2011] is used as the full-system simulator
• Using 7 applications from the OpenMP implementation of the NAS Parallel Benchmarks

Parameters of simulation:
• Processor: x86 (64-bit)
• L1 cache: 32 KB (line: 64 B), latency 1 cycle
• L2 cache: 256 KB (assoc: 8), latency 6 cycles
• Memory: 2 GB, latency 160 cycles

Chip configuration:
• # of CPUs / L1 caches: 8
• # of L2 caches: 48
• # of directory controllers: 8

Parameters of router:
• Network switching: wormhole
• Packet length: 1 or 5 flits; flit length: 128 bits
• # of VCs: 3; size of VC: 4 flits
• Router latency: 3 cycles; link latency: 1 or 2 cycles
• Router topology: 8 × 8 Mesh; max. link length: 4 tiles
• Routing: XY routing between routers; shortest path selected between router and core
Slide 18: Full-system Simulation Results
8 × 8 Mesh router topology; max. core-link length: 4 tiles.
[Chart: normalized execution time (roughly 0.85-1.05) of the 8 × 8 Mesh with 1, 2, and 4 links per core, over the benchmarks MG, SP, BT, UA, CG, LU, IS]
Reduction of application execution time by up to 10.1 %.
Slide 19: Conclusions
• Idea: multiple links between a core and routers on NoCs
• Design: GA optimization for picking core-links
• Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %
• Reduction of application execution time by up to 10.1 %
[Figure: conventional NoC vs. our optimized core-link NoC]
Slide 20: Contents (repeated agenda slide)
Slide 21: Backup Slides
Slide 22: Definition of Fitness Function f_k
• l_{i,j}: length of the j-th core-link of the i-th core (0 ≤ i < N, 0 ≤ j < x)
• h_{p,q}: # of hops between the p-th and q-th cores (0 ≤ p < N, 0 ≤ q < N)
• I: # of links longer than the given maximum link length
• Supplemental parameters: α = β = 1000

f_k = α Σ_{i=0}^{N-1} Σ_{j=0}^{x-1} (β·I + l_{i,j})   (if I > 0)    (1)
f_k = max(h) + mean(h)                                 (otherwise)   (2)

Function (1) reduces link length; function (2) reduces # of hops.

GA parameters:
• # of individuals: 100; # of generations: 20000
• Probability of crossover: 1 %; mutation: 20 %
• Tournament size: 3
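The two-case fitness function translates directly into code. A sketch under the assumption that `link_lengths[i][j]` holds l_{i,j} in tiles and `hops[p][q]` holds h_{p,q} (lower fitness is better):

```python
def fitness(link_lengths, hops, max_link_len, alpha=1000.0, beta=1000.0):
    # I: number of core-links exceeding the maximum allowed length
    num_over = sum(1 for row in link_lengths for l in row if l > max_link_len)
    if num_over > 0:
        # Eq. (1): large penalty so the GA first drives out over-long links
        return alpha * sum(beta * num_over + l
                           for row in link_lengths for l in row)
    # Eq. (2): feasible individuals compete on worst-case + average hop count
    flat = [h for row in hops for h in row]
    return max(flat) + sum(flat) / len(flat)
```

The large α and β make any infeasible individual score far worse than every feasible one, so the search naturally moves from shortening links (1) to minimizing hops (2).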