Slide 1: Optimized Core-links for Low-latency NoCs
Ryuta Kawano†, Seiichi Tade†, Ikki Fujiwara††, Hiroki Matsutani†, Hideharu Amano†, Michihiro Koibuchi††
† Keio University, †† National Institute of Informatics
blackbus@am.ics.keio.ac.jp
Slide 2: Contents
• Conventional NoCs
  • Small-world networks
  • Difficulty in applying them on chips
• How do we reduce path hops of NoCs?
  • Adding multiple links between a core and routers
  • Optimization method for picking core-links
• Evaluations
  • Zero-load latencies
  • Costs
  • Full-system simulation
• Conclusions
Slide 3: Increasing # of Cores on NoCs
[Chart: number of PEs per chip (caches not included) vs. year, 2002-2012. Ranges from 2-4 core chip multi-processors (Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64) through 8-64 PE chips (Sparc T1/T2/T3, STI Cell BE, MIT RAW, UT TRIPS (OPN), ClearSpeed CSX600, TILERA TILE64, Intel SCC) up to 128-256 PE accelerators and GPUs (picoChip PC102, Intel 80-core, Xeon Phi, TILE Gx100, GeForce 8800 / GTX280 / GTX480). GPUs integrate many simple PEs.]
Slide 4: Intel Teraflops Chip: 80-tile 2D Mesh
[Same PE-count chart as slide 3, highlighting the Teraflops chip (HPC) [3].]
[3] http://techresearch.intel.com/spaw2/uploads/images/Teraflop-Chip.jpg
Slide 5: Intel Teraflops Chip: 80-tile 2D Mesh (cont.)
[Same PE-count chart as slide 3.]
Conventional topologies (e.g. 2D Mesh) have a large # of hops when many cores are integrated.
Slide 6: Contents (repeated agenda slide)
Slide 7: Small-world Topology for off-Chip Networks
Reduction of # of hops using small-world effects [Koibuchi et al. 2012].
[Figure: router topologies — a ring + random links vs. a ring + non-random links]
Slide 8: Small-world Topology on Chips
2D Mesh + inter-router additional links [Ogras et al. 2006].
[Figure: 3x3 mesh of cores and routers, tiles numbered 1-9, with additional inter-router links]
Need to use custom routing:
• to reduce path hops
• to avoid deadlocks
Slide 9: Contents (repeated agenda slide)
Slide 10: Multiple Core-links to Reduce Path Hops
Idea: a router topology (here, a Mesh) + multiple links between a core and multiple routers.
[Figure: conventional NoC vs. multiple core-link NoC, showing cores and routers]
Slide 11: Our Idea: Reduction of First and Last 1-hop Latencies with Shortcut Core-links
• Using the shortest path between source and destination cores
• Achieving fewer hops by small-world effects
• Maintaining the regularity of the router topology
[Figure: multiple core-links between cores and routers]
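The shortcut effect above can be sketched as follows. This is an illustrative model, not the authors' code: it assumes each core lists the routers its core-links attach to, and that `router_hops` is a precomputed router-to-router hop matrix.

```python
def core_to_core_hops(src_routers, dst_routers, router_hops):
    """Hops from a source core to a destination core when each core may
    inject/eject at any of its attached routers (its core-links).
    Adds 2 for the first (core -> router) and last (router -> core) hops."""
    return 2 + min(router_hops[s][d]
                   for s in src_routers
                   for d in dst_routers)
```

With a single core-link per core this reduces to the conventional path length; extra links let a packet pick the injection/ejection routers that minimize the router hops in between.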
Slide 13: Optimization Method for Picking Core-links
• Problem: longer core-links lower the operating frequency
• Solution: optimization using a GA (Genetic Algorithm)
Definition of an individual: a list indexed by source core, where each gene holds the destination router of that core's additional core-link.
[Figure: an example individual of (source core -> destination router) pairs and its corresponding topology]
This provides the best tradeoff between link length and # of hops.
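A minimal sketch of this individual encoding and a mutation step, assuming one additional core-link per core and the 20 % mutation rate listed in the backup slides; the function names are hypothetical, not from the talk.

```python
import random

def random_individual(num_cores, num_routers, rng):
    # Gene i = id of the router that core i's additional core-link attaches to
    return [rng.randrange(num_routers) for _ in range(num_cores)]

def mutate(individual, num_routers, rng, p_mut=0.2):
    # With probability p_mut per gene, re-attach the core-link to a random router
    return [rng.randrange(num_routers) if rng.random() < p_mut else gene
            for gene in individual]
```

In a full GA these operators would be combined with crossover and tournament selection (size 3 in the talk) under the fitness function defined in the backup slides.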
Slide 14: Contents (repeated agenda slide)
Slide 15: Zero-load Latency
• 8 × 8 Mesh router topology + optimized core-links
• Max. core-link length: 4 tiles
[Chart: zero-load latency in cycles (max. and ave., roughly 10-70) vs. # of links per core (1-4), compared against the conventional 8 × 8 Mesh]
Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %.
Slide 16: Costs (8 × 8 Mesh)
Evaluated with two links per core, 65 nm CMOS process, 1.2 V supply voltage, and a wire capacitance load of 0.20 pJ/mm (from ITRS 2007).
• Wire density overhead on each tile: Max. 5.40 links, Ave. 3.06 links, SD 1.58 links
• Energy consumption: [Chart: energy consumption in pJ/flit (roughly 400-700) vs. # of links per core (1-4)]; increase by 3.0 % at minimum
• Router area (Fujitsu 65 nm process): 7.71 mm² with 1 link per core, 9.86 mm² with 2 links per core (increase by 27.8 %)
Slide 17: Parameters of Full-System Simulation
• GEM5 [Binkert et al. 2011] is used as the full-system simulator
• Using 7 applications from the OpenMP implementation of the NAS Parallel Benchmarks

Parameters of simulation:
• Processor: x86 (64-bit)
• L1 cache: 32 KB (line: 64 B), latency 1 cycle
• L2 cache: 256 KB (assoc: 8), latency 6 cycles
• Memory: 2 GB, latency 160 cycles

Chip configuration:
• # of CPUs / L1 caches: 8
• # of L2 caches: 48
• # of directory controllers: 8

Parameters of router:
• Network switching: wormhole
• Packet length: 1 or 5 flits; flit length: 128 bits
• # of VCs: 3; size of VC: 4 flits
• Router latency: 3 cycles; link latency: 1 or 2 cycles
• Router topology: 8 × 8 Mesh; max. link length: 4 tiles
• Routing: XY routing between routers; shortest path selected between router and core
Slide 18: Full-system Simulation Results
8 × 8 Mesh router topology; max. core-link length: 4 tiles.
[Chart: normalized execution time (roughly 0.85-1.05) of the 8 × 8 Mesh with 1, 2, and 4 links per core, over the benchmarks MG, SP, BT, UA, CG, LU, IS]
Reduction of application execution time by up to 10.1 %.
Slide 19: Conclusions
• Idea: multiple links between a core and routers on NoCs
• Design: GA optimization for picking core-links
• Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %
• Reduction of application execution time by up to 10.1 %
[Figure: conventional NoC vs. our optimized core-link NoC]
Slide 20: Contents (repeated agenda slide)
Slide 21: Backup Slides
Slide 22: Definition of Fitness Function f_k
• l_{i,j}: length of the j-th core-link of the i-th core (0 ≤ i < N, 0 ≤ j < x)
• h_{p,q}: # of hops between the p-th and q-th cores (0 ≤ p < N, 0 ≤ q < N)
• I: # of links longer than the given maximum link length
• Supplemental parameters: α = β = 1000

f_k = α Σ_{i=0}^{N-1} Σ_{j=0}^{x-1} (β·I + l_{i,j})   (if I > 0)    (1)
f_k = max(h) + mean(h)                                 (otherwise)   (2)

Function (1) reduces link length; function (2) reduces # of hops.

GA parameters:
• # of individuals: 100; # of generations: 20000
• Probability of crossover: 1 %; mutation: 20 %
• Tournament size: 3
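The two-case fitness function translates directly into code. A sketch under the assumption that `link_lengths[i][j]` holds l_{i,j} in tiles and `hops[p][q]` holds h_{p,q} (lower fitness is better):

```python
def fitness(link_lengths, hops, max_link_len, alpha=1000.0, beta=1000.0):
    # I: number of core-links exceeding the maximum allowed length
    num_over = sum(1 for row in link_lengths for l in row if l > max_link_len)
    if num_over > 0:
        # Eq. (1): large penalty so the GA first drives out over-long links
        return alpha * sum(beta * num_over + l
                           for row in link_lengths for l in row)
    # Eq. (2): feasible individuals compete on worst-case + average hop count
    flat = [h for row in hops for h in row]
    return max(flat) + sum(flat) / len(flat)
```

The large α and β make any infeasible individual score far worse than every feasible one, so the search naturally moves from shortening links (1) to minimizing hops (2).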