Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos


  1. Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos (Trends in the Use and Design of Interconnection Networks in Parallel Computers), April 14, 2016, Universidad Complutense de Madrid. Ramón Beivide, Universidad de Cantabria. Outline: 1. Introduction 2. Network Basis 3. System networks 4. On-chip networks (NoCs) 5. Some current research 2

  2. 1. Intro: MareNostrum 3 1. Intro: MareNostrum: BSC, Infiniband FDR10 non-blocking Folded Clos (up to 40 racks). [Diagram: Mellanox 648-port Infiniband FDR core switches (latency 0.7 μs, bandwidth 40 Gb/s); 560 FDR10 links; 36-port FDR10 leaf switches with 2 or 3 links to each core switch.] 40 iDataPlex racks / 3360 dx360 M4 nodes 4

  3. 1. Intro: Infiniband core switches 5 1. Intro: Cost dominated by (optical) wires 6

  4. 1. Intro: Blades 7 1. Intro: Blades 8

  5. 1. Intro: Multicore E5-2670 Xeon Processor 9 1. Intro: A row of servers in a Google DataCenter, 2012. 10

  6. 3. WSCs Array: Enrackable boards or blades + rack router. Figure 1.1: Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7’ rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right). 11 3. WSC Hierarchy 12

  7. 1. Intro: Cray Cascade (XC30, XC40) 13 1. Intro: Cray Cascade (XC30, XC40) 14

  8. 1. Intro: An Architectural Model. [Diagram: end nodes with CPUs, L/S units, ATUs and memories M1…Mn, attached through S/R interfaces to the Interconnection Network.] 15 1. Intro: What we need for one ExaFlop/s. Networks are pervasive and critical components in supercomputers, datacenters, servers and mobile computers. Complexity is moving from system networks towards on-chip networks: fewer nodes, but more complex. 16

  9. Outline 1. Introduction 2. Network Basis (Crossbars & Routers; Direct vs Indirect Networks) 3. System networks 4. On-chip networks (NoCs) 5. Some current research 17 2. Network Basis: all networks are based on crossbar switches • Switch complexity increases quadratically with the number of crossbar input/output ports N, i.e., grows as O(N²) • Has the property of being non-blocking (supports all N! I/O permutations) • Bidirectional for exploiting communication locality • Minimize latency & maximize throughput. [Diagram: 8×8 crossbar.] 18
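
The two crossbar properties above lend themselves to a quick numerical check. The Python sketch below is my own illustration (not code from the talk): it models an N×N crossbar as a full matrix of crosspoints, so the cost count grows as N², and it verifies that any of the N! input-to-output permutations can be set up without two inputs contending for the same output, which is the non-blocking property.

```python
# Minimal crossbar model: one crosspoint per (input, output) pair.
# Illustrates O(N^2) switch cost and the non-blocking property
# (any of the N! permutations is routable without output conflicts).
# Hypothetical helper functions, written for this note only.

from itertools import permutations


def crosspoint_count(n: int) -> int:
    """Cost of an n x n crossbar grows quadratically with the port count."""
    return n * n


def routes_without_blocking(perm: tuple) -> bool:
    """Each input drives its own row and each requested output its own column,
    so the only possible conflict is two inputs asking for the same output."""
    requested_outputs = list(perm)
    return len(set(requested_outputs)) == len(requested_outputs)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(f"{n:>2}-port crossbar: {crosspoint_count(n)} crosspoints")

    # Every one of the 8! = 40320 permutations of an 8-port crossbar is routable.
    assert all(routes_without_blocking(p) for p in permutations(range(8)))
    print("All 8! permutations routed without blocking.")
```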

  10. 2. Blocking vs. Non-blocking • Reduction in cost comes at the price of performance – Some networks have the property of being blocking (not all N! permutations are routable) – Contention is more likely to occur on network links › Paths from different sources to different destinations share one or more links. [Diagrams: a non-blocking topology and a blocking topology for 8 nodes.] 19 2. Switch or Router Microarchitecture: pipelined switch microarchitecture with five stages, IB (Input Buffering), RC (Route Computation), SA (Switch Arbitration), ST (Switch Traversal) and OB (Output Buffering). The packet header traverses IB-RC-SA-ST-OB; each payload fragment traverses IB-IB-IB-ST-OB. Matching the throughput of the internal switch datapath to the external link bandwidth is the goal. [Diagram: input/output buffers, link control, DEMUX/MUX, routing control unit, arbitration unit and crossbar.] 20
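
As a concrete reading of the pipelining table on this slide, the sketch below (my own, assuming one flit advances one stage per cycle and that payload flits inherit the header's route and arbitration) prints which stage each flit of a packet occupies in each cycle. Once the pipeline fills, one flit reaches the output buffer per cycle, which is how the internal datapath keeps up with the external link bandwidth.

```python
# Sketch of the 5-stage router pipeline from the slide:
# IB (Input Buffering), RC (Route Computation), SA (Switch Arbitration),
# ST (Switch Traversal), OB (Output Buffering).
# The header flit uses all five stages; payload flits reuse the header's
# route and arbitration decisions and only need IB, ST and OB.

HEADER_STAGES = ["IB", "RC", "SA", "ST", "OB"]
PAYLOAD_STAGES = ["IB", "IB", "IB", "ST", "OB"]  # as drawn in the slide's table


def pipeline_timeline(num_payload_flits: int):
    """Return (flit name, per-cycle stage occupancy) rows.

    Flit i enters the pipeline one cycle after flit i-1, so one flit
    leaves on the link every cycle once the pipeline is full.
    """
    rows = [("header", 0, HEADER_STAGES)]
    for i in range(num_payload_flits):
        rows.append((f"payload {i + 1}", i + 1, PAYLOAD_STAGES))

    total_cycles = len(rows) - 1 + len(HEADER_STAGES)
    table = []
    for name, start, stages in rows:
        cells = ["  "] * total_cycles
        for offset, stage in enumerate(stages):
            cells[start + offset] = stage
        table.append((name, cells))
    return table


if __name__ == "__main__":
    for name, cells in pipeline_timeline(num_payload_flits=3):
        print(f"{name:>10}: " + " ".join(cells))
```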

  11. 2. Network Organization: Indirect (Centralized) and Direct (Distributed) Networks. [Diagram: end nodes and switches.] 21 2. Previous Myrinet core switches (Indirect, Centralized) 22

  12. 2. IBM BG/Q (Direct, Distributed) 23 2. Network Organization • As crossbars do not scale, they need to be interconnected to serve an increasing number of endpoints • Direct (Distributed) vs Indirect (Centralized) Networks • Concentration can be used to reduce network costs – “c” end nodes connect to each switch – Allows larger systems to be built from fewer switches and links – Requires larger switch degree (see the sizing sketch below). [Diagrams: a 32-node system with 8-port switches, and a 64-node system with 8-port switches and c = 4.] 24
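
As a rough sizing aid, here is my own back-of-the-envelope sketch of the trade-off behind concentration; the 8-port figures are only loosely inspired by the slide's two example systems and are not a reconstruction of them. Attaching c end nodes to each switch divides the switch count by roughly c, but leaves fewer ports per switch for inter-switch links.

```python
# Concentration trade-off: fewer switches, but fewer ports left for the network.
# Illustrative only; the slide's example topologies are not reproduced here.

import math


def concentrated_switch_count(num_nodes: int, switch_ports: int, c: int):
    """Switches needed and ports left for inter-switch links when
    c end nodes are concentrated on each switch of the given radix."""
    if c >= switch_ports:
        raise ValueError("concentration must leave ports for the network")
    switches = math.ceil(num_nodes / c)
    network_ports_per_switch = switch_ports - c
    return switches, network_ports_per_switch


if __name__ == "__main__":
    for n, c in ((32, 1), (64, 4)):
        s, free = concentrated_switch_count(n, switch_ports=8, c=c)
        print(f"{n} nodes, c={c}: {s} switches, {free} ports each left for the network")
```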

  13. Outline 1. Introduction 2. Network Basis 3. System networks (Folded Clos, Tori, Dragonflies) 4. On-chip networks (NoCs) 5. Some current research 25 3. MareNostrum: BSC, Infiniband FDR10 non-blocking Folded Clos (up to 40 racks). [Same diagram as slide 4: Mellanox 648-port Infiniband FDR core switches (latency 0.7 μs, bandwidth 40 Gb/s); 560 FDR10 links; 36-port FDR10 leaf switches with 2 or 3 links to each core switch; 40 iDataPlex racks / 3360 dx360 M4 nodes.] 26

  14. 3. Network Topology: Centralized Switched (Indirect) Networks. [Diagram: 16-port crossbar network.] 27 3. Network Topology: Centralized Switched (Indirect) Networks. [Diagram: 16-port, 3-stage Clos network.] 28

  15. 3. Network Topology: Centralized Switched (Indirect) Networks. [Diagram: 16-port, 5-stage Clos network.] 29 3. Network Topology: Centralized Switched (Indirect) Networks. [Diagram: 16-port, 7-stage Clos network = Benes topology.] 30
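
The progression from 3 to 5 to 7 stages can be checked with a couple of lines. The sketch below assumes the diagrams are built from 2x2 switching elements, which I am inferring from the figures rather than from the text: a Benes network on N = 2^k ports then has 2k - 1 stages of N/2 switches each, which yields the 7-stage, 16-port case shown on this slide.

```python
# Benes network dimensions, assuming 2x2 building-block switches.
# For N = 2^k ports: 2k - 1 stages of N/2 switches each.

import math


def benes_dimensions(num_ports: int):
    """Stage count, per-stage switch count and total switch count of a
    Benes network built from 2x2 switches."""
    k = int(math.log2(num_ports))
    if 2 ** k != num_ports:
        raise ValueError("port count must be a power of two")
    stages = 2 * k - 1
    switches_per_stage = num_ports // 2
    return stages, switches_per_stage, stages * switches_per_stage


if __name__ == "__main__":
    stages, per_stage, total = benes_dimensions(16)
    print(f"16-port Benes: {stages} stages x {per_stage} switches = {total} 2x2 switches")
```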

  16. 3. Network Topology: Centralized Switched (Indirect) Networks • Bidirectional MINs • Increase modularity • Reduce hop count, d • Folded Clos network – Nodes at tree leaves – Switches at tree vertices – Total link bandwidth is constant across all tree levels, with full bisection bandwidth. Folded Clos = Folded Benes ≠ Fat tree network!!! [Diagram: 16-node folded Clos with its network bisection marked.] 31 3. Other DIRECT System Network Topologies: Distributed Switched (Direct) Networks. [Diagrams: 2D mesh or grid of 16 nodes; 2D torus of 16 nodes; hypercube of 16 nodes (16 = 2⁴, so n = 4).] Network bisection ≤ full bisection bandwidth! 32
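
To make the closing remark concrete, the sketch below uses standard textbook formulas (not taken from the slides) to compare the three 16-node direct topologies on diameter and bisection width. The 4x4 mesh's bisection of 4 links falls short of the full-bisection value of N/2 = 8 links, which the 4x4 torus and the 4-dimensional hypercube do reach.

```python
# Diameter (worst-case hops) and bisection width (links cut when the network
# is split in half) of the 16-node direct topologies shown on the slide.
# Standard formulas for k x k meshes/tori (k even) and n-cubes.

def mesh_2d(k):
    """k x k mesh."""
    return {"diameter": 2 * (k - 1), "bisection_links": k}


def torus_2d(k):
    """k x k torus, k even."""
    return {"diameter": 2 * (k // 2), "bisection_links": 2 * k}


def hypercube(n):
    """n-dimensional hypercube with 2^n nodes."""
    return {"diameter": n, "bisection_links": 2 ** (n - 1)}


if __name__ == "__main__":
    print("4x4 mesh:      ", mesh_2d(4))
    print("4x4 torus:     ", torus_2d(4))
    print("hypercube n=4: ", hypercube(4))
```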

  17. 3. IBM BlueGene/L/P Network: prismatic 32x32x64 torus (a mixed-radix network); BlueGene/P reaches 32x32x72 in its maximum configuration. Mixed-radix prismatic tori are also used by Cray. 33 3. IBM BG/Q 34
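
The following sketch is plain torus arithmetic, not code from the talk, and the 40x40x40 shape is only a hypothetical comparison point, not a real machine. It computes node count, diameter and bisection width of a mixed-radix torus such as the 32x32x64 configuration above, showing how the longest dimension dominates both metrics in a prismatic torus.

```python
# Node count, diameter and bisection width of a 3D torus with arbitrary radices.
# Diameter: wraparound halves the distance in each dimension.
# Bisection: cut perpendicular to the longest dimension, 2 links per node
# in that cross-section because of the wraparound.

from math import prod


def torus_stats(radices):
    nodes = prod(radices)
    diameter = sum(k // 2 for k in radices)
    longest = max(radices)
    bisection = 2 * nodes // longest
    return nodes, diameter, bisection


if __name__ == "__main__":
    for shape in ((32, 32, 64), (40, 40, 40)):  # prismatic vs hypothetical cubic
        n, d, b = torus_stats(shape)
        print(f"{shape}: {n} nodes, diameter {d}, bisection {b} links")
```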

  18. 3. IBM BG/Q 35 3. BG Network Routing: Adaptive Bubble Routing (ATC-UC Research Group). [Diagram: X, Y and Z wires.] 36

  19. 3. Fujitsu Tofu Network 37 3. More Recent Network Topologies: Distributed Switched (Direct) Networks • Fully-connected network: all nodes are directly connected to all other nodes using bidirectional dedicated links. [Diagram: fully-connected network of 8 nodes.] 38

  20. 3. IBM PERCS 39 3. IBM PERCS 40

  21. 3. IBM PERCS 41 3. Dragonfly Interconnection Network: organized as groups of routers. Parameters: • a: routers per group • p: nodes per router • h: global links per router • Well-balanced dragonfly [1]: a = 2p = 2h. Intra-group: local links forming a complete graph among the routers of a group. Inter-group: global links forming a complete graph among groups. (A sizing sketch follows below.)
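
Using the parameter definitions above and in [1], the sketch below (mine, not the author's) computes the size of a maximal well-balanced dragonfly for a given number of global links per router h: with a = 2h routers per group, p = h nodes per router, and one global link between every pair of groups, there are g = a*h + 1 groups. For h = 8 this already gives over 16,000 end nodes, which is why the topology scales so far with modest router radix.

```python
# Maximal well-balanced dragonfly size for a given h (global links per router):
# a = 2h routers per group, p = h nodes per router, g = a*h + 1 groups.

def dragonfly_size(h: int):
    """Routers, groups and end nodes of a maximal well-balanced dragonfly."""
    a = 2 * h            # routers per group
    p = h                # nodes per router
    g = a * h + 1        # groups: complete graph over the global links
    routers = a * g
    nodes = p * routers
    return {"a": a, "p": p, "h": h, "groups": g, "routers": routers, "nodes": nodes}


if __name__ == "__main__":
    for h in (4, 8, 16):
        print(dragonfly_size(h))
```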

  22. 3. Dragonfly Interconnection Network: Minimal routing • Longest path of 3 hops: local - global - local • Good performance under uniform (UN) traffic. Adversarial traffic [1] • ADV+N: nodes in group i send traffic to group i+N • Saturates the global link between the two groups. [Diagram: source group i, destination group i+N, saturated global link.] 3. Dragonfly Interconnection Network: Valiant routing [2] • Randomly selects an intermediate group to misroute packets • Avoids the saturated channel • Longest path of 5 hops: local - global - local - global - local. [Diagram: source node, intermediate group, destination node.] [1] J. Kim, W. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable dragonfly topology,” ISCA ’08. [2] L. Valiant, “A scheme for fast parallel communication,” SIAM Journal on Computing, vol. 11, p. 350, 1982.
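
The contrast between minimal and Valiant routing under ADV+N traffic can be shown with a tiny group-level simulation. This is my own sketch: it models one global link per group pair and ignores the local hops, which is a simplification of the real topology. Under minimal routing, all flows from group i cross the single global link to group i+N; Valiant routing spreads the same flows over randomly chosen intermediate groups.

```python
# Group-level load of global links under ADV+N traffic,
# comparing minimal routing with Valiant (random intermediate group) routing.

import random
from collections import Counter


def adversarial_load(num_groups: int, flows_per_group: int, valiant: bool, shift: int = 1):
    """Traffic per inter-group (global) link; one link per unordered group pair."""
    load = Counter()
    for src in range(num_groups):
        dst = (src + shift) % num_groups
        for _ in range(flows_per_group):
            if valiant:
                mid = random.choice([g for g in range(num_groups) if g not in (src, dst)])
                load[frozenset((src, mid))] += 1   # src group -> intermediate group
                load[frozenset((mid, dst))] += 1   # intermediate group -> dst group
            else:
                load[frozenset((src, dst))] += 1   # minimal: straight to the target group
    return load


if __name__ == "__main__":
    random.seed(0)
    minimal = adversarial_load(num_groups=9, flows_per_group=100, valiant=False)
    valiant = adversarial_load(num_groups=9, flows_per_group=100, valiant=True)
    # Minimal: every one of a group's 100 flows shares a single global link.
    print("minimal routing, max global-link load:", max(minimal.values()))
    # Valiant: the same flows are spread over many intermediate groups.
    print("valiant routing, max global-link load:", max(valiant.values()))
```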

  23. 3. Cray Cascade, electrical supernode 45 3. Cray Cascade, system and routing 46
