networks on chip evolution or revolution
play

Networks on chip: Evolution or Revolution? Luca Benini - PDF document

Networks on chip: Evolution or Revolution? Luca Benini lbenini@deis.unibo.it DEIS-Universita di Bologna MPSOC 2004 The evolution of SoC platforms MIPS TriMedia SDRAM General-purpose Scalable VLIW MIPS CPU MMI TriMedia CPU


  1. Networks on chip: Evolution or Revolution? Luca Benini lbenini@deis.unibo.it DEIS-Universita’ di Bologna MPSOC 2004 The evolution of SoC platforms MIPS ™ TriMedia ™ SDRAM General-purpose Scalable VLIW MIPS CPU MMI TriMedia CPU Scalable RISC Media Processor: D$ D$ Processor • 100 to 300+ MHz PRxxxx TM-xxxx I$ I$ • 50 to 300+ MHz • 32-bit or 64-bit • 32-bit or 64-bit DEVICE IP BLOCK DEVICE IP BLOCK DVP MEMORY BUS Nexperia ™ DEVICE IP BLOCK DEVICE IP BLOCK Library of Device System Buses . . IP Blocks • 32-128 bit PI BUS PI BUS . . • Image coprocessors . . • DSPs DEVICE IP BLOCK DEVICE IP BLOCK • UART • 1394 • USB DVP SYSTEM SILICON … � 2 Cores : Philips’ Nexperia PNX8850 SoC platform for High-end digital video (2001) L. Benini MPSOC 2004 2

  2. Running forward… Four 350/400 MHz StarCore � SC140 DSP extended cores 16 ALUs: 5600/6400 MMACS � 1436 KB of internal SRAM & � multi-level memory hierarchy � Internal DMA controller supports 16 TDM unidirectional channels, Two internal coprocesssors � (TCOP and VCOP) to provide special-purpose processing capability in parallel with the core processors � 6 Cores : Motorola’s MSC8126 SoC platform for 3G base stations (late 2003) L. Benini MPSOC 2004 3 What’s happening in SoCs? � Technology: no slow-down in sight! � Faster and smaller transistors � … but slower wires, lower voltage, more noise! � Design complexity: from 2 to 10 to 100 cores! � Design reuse is essential � …but differentiation/innovation is key for winning on the market! � Performance and power: GOPS for MWs! � Performance requirements keep going up � …but power budgets don’t! L. Benini MPSOC 2004 4

  3. …and on-chip communication? � Starting point: the “on chip bus” � Advances in protocols � Advances in topologies � Revolutionary approaches � Networks on chip � Things are moving FAST � …but it’s evolution or revolution? L. Benini MPSOC 2004 5 Outline � Introduction and motivation � On-chip networking � The HW-SW interface L. Benini MPSOC 2004 6

  4. On-chip bus Architecture � Many alternatives � Large semiconductor firms (e.g. IBM Coreconnect, STMicro STBus) � Core vendors (e.g. ARM AMBA) � Interconnect IP vendors (e.g. SiliconBackplane) � Same topology, different protocols L. Benini MPSOC 2004 7 AMBA bus Master port Slave port CPU EU IO System- Peripheral Bridge AMBA High-speed bus CPU Bus EU Mem Mem APB : Simplified processor for AHB : high-speed high-bandwidth general purpose peripherals multi-master bus L. Benini MPSOC 2004 8

  5. AHB Bus architecture Dedicated wires Different wires NO Bidirectional wires L. Benini MPSOC 2004 9 AMBA basic transfer Pipelining increases Bus bandwidth For a write For a read L. Benini MPSOC 2004 10

  6. Bus arbitraton Shared address bus ARBITER HBREQ_M1 HBREQ_M2 Dedicated wires HBREQ_M3 Arbitration Protocol is defined, but Arbitration Policy is not L. Benini MPSOC 2004 11 The price for arbitration Time for handshaking Time for arbitration Wait state L. Benini MPSOC 2004 12

  7. Burst transfers � Burst transfers amortize arbitration cost � Grant bus control for a number of cycles � Help with DMA and block transfers � Help hiding arbitration latency � Requires safeguards against starvation � Split and error L. Benini MPSOC 2004 13 Critical analysis: bottlenecks � Protocol � Lacks parallelism � In order completion � No multiple outstanding transactions: cannot hide slave wait states � High arbitration overhead (on single-transfers) � Bus-centric vs. transaction-centric � Initiators and targets are exposed to bus architecture (e.g. arbiter) � Topology � Scalability limitation of shared bus solution! L. Benini MPSOC 2004 14

  8. STBUS � On-chip interconnect solution by ST � Level 1-3: increasing complexity (and performance) � Features � Higher parallelism: 2 channels (M-S and S-M) � Multiple outstanding transactions with out-of order completion � Supports deep pipelining � Supports Packets (request and response) for multiple data transfers � Support for protection, caches, locking � Deployed in a number of large-scale SoCs in STM L. Benini MPSOC 2004 15 STBUS Protocol (Type 3) Request channel Transaction level Transaction Resp Packet Req Packet Packet level Initiator Target Cell level Response channel Signal level Initiator port Target port L. Benini MPSOC 2004 16

  9. STBUS bottlenecks � Protocol is not fully transaction-centric � Cannot connect initiator to target (e.g. initiator does not have control flow on the response channel) � Packets are atomic on the interconnect � Cannot initiate nor receive multiple packets at the same time � Large data transfers may starve other initiators L. Benini MPSOC 2004 17 AMBA AXI � Latest (2003) evolution of AMBA � Advanced eXtensible Interface � Features � Fully transaction centric: can connect M to S with nothing in between � Higher parallelism: multiple channels � Supports bus-based power management � Support for protection, caches, locking � Deployment: ?? L. Benini MPSOC 2004 18

  10. Multi-channel M-S interface Channel hanshaking Address Channel Write channel DATA VALID Master Slave READY Read channel 4 parallel channels are Write response ch. available! L. Benini MPSOC 2004 19 Multiple outstanding transactions � A transaction implies activity on multiple channels � E.g Read uses the Address and Read channel � Channels are fully decoupled in time � Each transaction is labeled when it is started (Address channel) � Labels, not signals, are used to track transaction opening and closing � Out of order completion is supported (tracking logic in master), but master can request in order delivery � Burst support � Single-address burst transactions (multiple data channel slots) � Bursts are not atomic! � Atomicity is tricky � Exclusive access better than locked access L. Benini MPSOC 2004 20

  11. Scalability: Execution Time � Highly parallel benchmark (no slave bottlenecks) 110% 180% 170% 100% 160% 150% 90% 140% Relative execution time Relative execution time 80% 130% 120% 70% 110% 100% 60% 2 Cores 2 Cores 90% 4 Cores 4 Cores 50% 6 Cores 80% 6 Cores 70% 8 Cores 8 Cores 40% 60% 30% 50% 40% 20% 30% 20% 10% 10% 0% 0% AHB AXI STBus STBus (B) AHB AXI STBus STBus (B) � 256 B cache (high � 1 kB cache (low bus bus traffic) traffic) L. Benini MPSOC 2004 21 Scalability: Protocol Efficiency 100% 100% 90% 90% Interconnect usage efficiency 80% 80% 70% 70% Interconnect busy 60% 60% 50% 2 Cores 50% 2 Cores 4 Cores 4 Cores 40% 6 Cores 40% 6 Cores 8 Cores 8 Cores 30% 30% 20% 20% 10% 10% 0% 0% AHB AXI STBus STBus (B) AHB AXI STBus STBus (B) � Increasing contention: AXI, STBus show 80%+ efficiency, AHB < 50% L. Benini MPSOC 2004 22

  12. Scalability: latency 14 Latency for access completion (cycles) 13 12 11 10 9 8 7 6 STBus (B) write avg STBus (B) write min 5 STBus (B) read avg 4 STBus (B) read min AXI write avg 3 AXI write min 2 AXI read avg AXI read min 1 0 2 Cores 4 Cores 6 Cores 8 Cores � STBus management has less arbitration latency overhead, especially noticeable in low-contention conditions L. Benini MPSOC 2004 23 Topology � Single shared bus is clearly non-scalable � Evolutionary path � “Patch” bus topology � Two approaches B � Clustering & Bridging � Multi-layer/Multibus M M L. Benini MPSOC 2004 24

  13. Clustering and bridging Heterogeneous architectures with asymmetric traffic � � Cost for going across a bridge is HIGH Bus clusters for bandwidth & latency reasons � � Example: EASY SoCs for WLAN T I I T L. Benini MPSOC 2004 25 AMBA Multi-layer AHB � Enables parallel access paths between multiple masters and slaves � Fully compatible with AHB wrappers AHB1 Interconnect Slave1 Master1 Matrix Slave1 AHB2 Master2 Slave1 Slave Port L. Benini MPSOC 2004 26

  14. Multi-Layer AHB implementation � The matrix is made of slave ports � No explicit arbitration of slaves � Variable latency in case of destination conflicts Slave1 Decode Master1 Mux Crossbar arbitration Mux Master2 Slave4 Decode L. Benini MPSOC 2004 27 STBUS Crossbar & Partial CB FC PC L. Benini MPSOC 2004 28

  15. Topology speedup (AMBA AHB) 7000000 6000000 � Independent tasks (matrix multiply) 5000000 � With & without semaphore 4000000 Shared synchronization Bridging � 8 processors (small cache) 3000000 MultiLayer 2000000 1000000 0 Semaphore No semaphore L. Benini MPSOC 2004 29 Crossbar: critical analysis � No bandwidth reduction � Scales poorly � N 2 area and delay � A lot of wires and a lot of gates in a bus- based crossbar � E.g. Area_cell_4x4/Area_cell_bus ~2 for STbus � No locality � Does not scale beyond 10x10! L. Benini MPSOC 2004 30

  16. NoCs network interface switch � More radical solutions in the long term link DSP CPU � Nostrum � HiNoC � Linkoeping SoCBUS � SPIN � Star-connected on-chip network � Aethereal Memory � Proteo Memory � Xpipes CPU � … (at least 15 groups) L. Benini MPSOC 2004 31 NOCs vs. Busses � Packet-based STBUS and AXI � No distinction address/data, only packets (but of many types) � Complete separation between end-to-end transactions and data delivery protocols � Distributed vs. centralized � No global control bottleneck � Better link with placement and routing � Bandwidth scalability, of course! L. Benini MPSOC 2004 32

Recommend


More recommend