Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale Technologies
Daniele Ludovici, Francisco Gilabert, Maria Gomez, Georgi Gaydadjiev, Davide Bertozzi
MPSoC Research Group @ University of Ferrara
Outline • Motivation & Goal • Topologies under test • Physical Modeling Framework • 64-tile topologies: P&R results • What happens when the link needs to be pipelined? • System-level exploration • Conclusions
MPSoC Technology
The execution of many multimedia and signal processing functions has historically been accelerated by means of specialized processing engines.
With the advent of MPSoC technology, the performance of hardware accelerators is becoming accessible by combining multiple programmable processor tiles within a multicore system.
MPSoC: efficient computation can be achieved while only marginally impacting programmability and/or configurability.
MPSoC Architecture: The New Landscape
• Hierarchical system: a programmable accelerator, i.e., a tile-based subsystem of homogeneous processing units
• Top level of the hierarchy: DSPs, I/O units, hardware accelerators
System complexity is more a matter of instantiation and connectivity capability than of architecture development.
The Physical Gap
Connectivity patterns for large-scale systems are well known from off-chip networking. With nanoscale silicon technologies, however, there is a growing gap between the pre- and post-layout properties of topology connectivity patterns.
Layout Effects
• Regularity broken by asymmetric tile size or heterogeneous tiles!
• Latency in injection links? Latency in express links?
• Over-the-cell routing? Can automatic routing tools handle this effectively?
• Which switch operating frequency?
• How is routing congestion at each metal layer impacted?
Pencil-and-paper floorplanning considerations may be misleading.
Therefore, topology comparison with layout awareness is a must!
GOAL
IDEA: many regular topologies feature better abstract properties than a 2D mesh.
GOAL: quantify to what extent such properties are impacted by the degradation effects of physical synthesis on nanoscale silicon.
GOAL
IDEA: typically, accurate physical modeling of interconnection networks is limited to small-scale systems.
GOAL: we propose a NoC physical characterization methodology enabling layout-aware analysis of large-scale systems while pruning time and memory requirements.
GOAL
IDEA: long links will most probably require link pipelining.
GOAL: we capture the impact of link pipelining on topology area and performance, assessing whether and to what extent the theoretical benefits are preserved.
GOAL
IDEA: system-level power management is achieved by structuring the MPSoC into voltage and frequency islands.
GOAL: we consider the IP core/network speed decoupling typical of GALS systems in the topology evaluation framework.
Topologies under test
• 8-ary 2-mesh => 2D mesh
• 4-ary 3-mesh
• 4-ary 2-mesh
• 2-ary 6-mesh, plus other concentrated variants
• 8-Cmesh
Topology exploration flow
• Topology specification => topology generation => RTL (SystemC/Verilog)
• Transactional simulator + OCP traffic generator => simulation => VCD trace
• Physical synthesis: floorplan, placement, clock tree synthesis, power grid, routing, post-routing optimization => netlist + parasitic extraction
• Prime Time (SDF timing) and Prime Time PX => power estimation
Characterization Methodology Challenge: layout aware physical modeling of large scale NoC topologies
Post-layout Results
• The highest switch radix determines the maximum frequency (post-synthesis).
• The longest link determines the highest achievable frequency (post-layout).
• The critical path is determined by the switch-to-switch link in a NoC topology!
• Most of the topologies are not competitive with the 2D mesh because of their long links, and some are even unusable!
Area for 64 tiles, no pipelining
• The final area footprint depends on the number of switches, the maximum switch radix and, consequently, the final synthesis frequency.
E.g., the 2-ary 6-mesh has a slower final frequency w.r.t. the 8-ary 2-mesh, but all of its switches have radix 8 vs. 4, 5, 6 => the 8-ary 2-mesh has a 10% area saving.
4-ary 2-mesh: short links (3 mm) => small performance drop. Few switches (16): 20% saving.
Key take-away: these topologies are not more area efficient than the 2D mesh, and due to their slowdown their area footprint can be overly optimized...
...never forget the target frequency when considering area footprint!
Link Pipelining
Does this mean that multi-dimensional and concentrated-mesh topologies are totally unusable?
• Link pipelining breaks long timing paths at a cheaper cost than switches: flip-flops are inserted on the switch-to-switch links.
Link Pipelining
The xpipes architecture uses stall/go flow control.
• Each pipeline stage needs 2 buffer slots for valid stall/go flow control: one slot for normal flit propagation, plus a backup slot to compensate for the propagation delay of the backpressure (stall) signals.
• Control logic in each stage drives the buffer enables (en1, en2) and the output select (sel).
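The stall/go behavior above can be sketched as a small cycle-level model. This is a behavioral illustration under my own simplifications, not the xpipes RTL:

```python
# Cycle-level sketch of a stall/go pipeline stage with a 2-slot buffer.
class StallGoStage:
    def __init__(self):
        self.slots = []  # main slot + backup slot (at most 2 flits)

    def cycle(self, flit_in, stall_downstream):
        """One clock edge. Returns (flit_out, stall_upstream).
        A flit may still arrive while stalled: the upstream sees our
        stall one cycle late, and the backup slot absorbs that flit."""
        flit_out = None
        if not stall_downstream and self.slots:
            flit_out = self.slots.pop(0)          # forward the oldest flit
        if flit_in is not None:
            assert len(self.slots) < 2, "backup slot overflow"
            self.slots.append(flit_in)
        # Propagate backpressure; the caller models the one-cycle delay by
        # applying this value to the upstream stage on the next cycle.
        stall_upstream = stall_downstream and len(self.slots) > 0
        return flit_out, stall_upstream
```

Driving the stage with a stall in the middle of a three-flit stream shows that no flit is lost: the backup slot holds the flit that was already in flight when the stall arrived.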
Not just a bunch of flip-flops
[Chart: area overhead of a flow control stage vs. a simple barrier of flip-flops]
• A flow control stage features a considerable area overhead with respect to a simple barrier of flip-flops.
• What are the implications for topology area figures? How are the area ratios between topologies impacted by link pipelining?
Area for 64 tiles, with pipelining
• Insertion criterion: from the third link dimension onwards.
• Each topology has a different number of links and a different number of required pipeline stages => this depends on the maximum achievable frequency.
Key take-away: each topology pays a different price to restore the maximum achievable frequency dictated by its elementary switch block.
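The relation between link delay, target frequency and required stages can be illustrated with a simple linear wire-delay model. The function and parameter names are made up for illustration; the actual figures in the slides come from parasitic extraction:

```python
import math

def pipeline_stages(link_length_mm, delay_ns_per_mm, f_target_ghz):
    """Flip-flop stages needed so that each wire segment fits in one
    clock period (illustrative linear-delay model)."""
    t_clk = 1.0 / f_target_ghz                  # clock period in ns
    link_delay = link_length_mm * delay_ns_per_mm
    segments = math.ceil(link_delay / t_clk)    # segments per link
    return max(segments - 1, 0)                 # stages = segments - 1
```

A short 2 mm link needs no stage at 1 GHz under these assumed numbers, while a 6 mm express link needs one: longer links and higher target frequencies multiply the stage count, and hence the area price each topology pays.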
Area overhead of pipeline stage insertion
• Area before and after pipeline stage insertion.
• The cell area increment in all cases comes from a twofold contribution:
  • the pipeline stage insertion itself
  • the restored higher frequency allowed by such insertion
• The possibility to restore a high frequency radically changes the area overhead of the elementary switch block as well.
Link Pipelining
Multi-dimensional and C-mesh topologies become competitive again!
• However... flow control stages add latency to the link crossing...
• ...and the performance gained from a better frequency could be cancelled out by the increased overall link latency.
Therefore, in order to quantify this, we performed a system-level exploration.
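The frequency-vs-latency trade-off can be made concrete with a textbook zero-load latency estimate. The numbers below are hypothetical, chosen only to show the mechanism, not the paper's measurements:

```python
def zero_load_latency_ns(hops, f_ghz, stages_per_link=0, switch_cycles=1):
    """Header latency at zero load: every hop pays the switch traversal
    plus the flow-control stages on the outgoing link (textbook model)."""
    cycles = hops * (switch_cycles + stages_per_link)
    return cycles / f_ghz

# Hypothetical comparison: a low-diameter topology running fast thanks to
# link pipelining vs. a 2D mesh with more hops but unpipelined links.
fast = zero_load_latency_ns(hops=6, f_ghz=2.0, stages_per_link=1)    # 6.0 ns
mesh = zero_load_latency_ns(hops=14, f_ghz=1.0, stages_per_link=0)   # 14.0 ns
```

Whether the extra cycles per link eat up the frequency gain depends on the actual hop counts and achievable frequencies, which is precisely what the system-level exploration quantifies.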
System-level Exploration
• TLM simulator, cycle accurate with the xpipes architecture
• Back-annotation of physical parameters from the layout synthesis, such as real link latency, core and switch operating frequency, etc.
• The target system implements dual-clock FIFOs in order to model a scenario where every core can run at its own frequency (previously it was ratio-based)
=> System-level exploration with layout awareness
Performance of 64-tile systems with uniform random traffic Theoretical => • Neglecting layout implications, 2-ary 6-mesh is the best solution • Several topologies outperform the 8-ary 2-mesh
Performance of 64-tile systems with uniform random traffic Layout aware no pipelining => • 8-ary 2-mesh is the best solution • the poor matching of several topologies with silicon technology completely offsets their better theoretical properties
Performance of 64-tile systems with uniform random traffic
Layout aware, with link pipelining =>
• When the impact of wiring complexity on the critical path is alleviated by link pipelining techniques:
• the 2-ary 6-mesh, 2-ary 5-mesh and 4-ary 3-mesh outperform the 2D mesh
• BUT their performance comes at an area cost! Is the performance boost proportionate to the area overhead?
Area Efficiency
• Area efficiency metric: throughput/area
• The performance improvements achieved by complex topologies are NOT cost-effective, in that the area overhead is disproportionate to the performance boost
• Only a traffic pattern favoring low hop count (perfect shuffle) achieves better area efficiency
Summing up
• A comprehensive analysis framework to assess k-ary n-mesh and C-mesh topologies at different levels of abstraction: from system level to layout level
• An accurate physical modeling methodology to characterize topologies from an area and timing viewpoint while pruning implementation time and memory requirements
• Without link pipelining => forget about these topologies
• With link pipelining: the area ratio w.r.t. the 2D mesh is inverted (due to synthesis)
• Even with increased link latency, some k-ary n-mesh and C-mesh topologies preserve their performance benefits...
...but this comes at a disproportionate area cost!
Questions? daniele.ludovici@unife.it