Optimized Routing for Large-Scale InfiniBand Networks
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine
Open Systems Lab, Indiana University

Effect of Network Congestion
CHiC Supercomputer:
• 566 nodes, full bisection IB fat-tree
• effective bisection bandwidth: 0.699
[Figure: congestion factor (axis 0 to 3); microbenchmarks (NetPIPE, IMB ping pong, Netgauge one_one) marked "Lower Bound!", compared with "Reality?"]

Full Bisection Bandwidth != Full Bandwidth
• expensive topologies do not guarantee high bandwidth
• deterministic oblivious routing cannot reach full bandwidth!
  • see Valiant's lower bound
• random routing is asymptotically optimal but loses locality
• but deterministic routing has many advantages
  • completely distributed
  • very simple implementation
• InfiniBand routing:
  • deterministic oblivious, destination-based
  • linear forwarding table (LFT) at each switch
  • LID mask control (LMC) enables multiple addresses per port
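To make the destination-based LFT lookup concrete, here is a minimal Python sketch. The data layout (`lfts`, `links`), the function names, and the toy two-switch fabric are assumptions made for illustration, not OpenSM structures. The only InfiniBand fact used is that a port with LMC value k answers to 2^k consecutive LIDs starting at its base LID, so different LIDs of the same port can be routed over different links.

```python
def lids_for_port(base_lid, lmc):
    """All LIDs that address one port: 2**lmc consecutive values from base_lid."""
    return list(range(base_lid, base_lid + (1 << lmc)))


def trace_route(lfts, links, first_switch, dlid, max_hops=64):
    """Follow the per-switch forwarding tables until an endpoint is reached.

    lfts  : {switch: {destination LID: output port}}   (hypothetical layout)
    links : {(switch, port): neighbour node}           (hypothetical layout)
    """
    node = first_switch
    path = [node]
    for _ in range(max_hops):
        port = lfts[node][dlid]        # lookup depends only on the destination LID
        node = links[(node, port)]
        path.append(node)
        if node not in lfts:           # endpoints have no forwarding table
            return path
    raise RuntimeError("forwarding loop or path longer than max_hops")


# Toy fabric: hostA on s1, hostB on s2, two parallel links (ports 2 and 3).
links = {
    ("s1", 1): "hostA", ("s1", 2): "m1", ("s1", 3): "m2",
    ("m1", 1): "s1", ("m1", 2): "s2",
    ("m2", 1): "s1", ("m2", 2): "s2",
    ("s2", 1): "hostB", ("s2", 2): "m1", ("s2", 3): "m2",
}
hostB_lids = lids_for_port(base_lid=4, lmc=1)   # hostB answers to LIDs 4 and 5
lfts = {
    "s1": {4: 2, 5: 3},   # LID 4 goes via m1, LID 5 via m2
    "m1": {4: 2, 5: 2},
    "m2": {4: 2, 5: 2},
    "s2": {4: 1, 5: 1},
}
```

With LMC = 1 the sender can choose between two deterministic paths to the same port simply by picking one of its two LIDs.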

InfiniBand Routing Continued
• offline route computation (OpenSM)
• different routing algorithms:
  • MINHOP (finds minimal paths, balances the number of routes locally at each switch)
  • UPDN (uses Up*/Down* turn control; limits path choice, but routes contain no credit loops)
  • FTREE (fat-tree optimized routing, no credit loops)
  • DOR (dimension-order routing for k-ary n-cubes, might generate credit loops)
  • LASH (uses DOR and breaks credit loops with virtual lanes)
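As an illustration of one entry in the list above, dimension-order routing on a k-ary n-cube can be sketched in a few lines of Python. This is a generic textbook version under assumed conventions (nodes are coordinate tuples, each dimension is a ring of size k, ties between the two ring directions go to the increasing direction); it is not OpenSM's DOR code, and `dor_path` is a hypothetical name.

```python
def dor_path(src, dst, k):
    """Dimension-order route between two nodes of a k-ary n-cube (torus).

    Dimensions are corrected one after another, lowest index first,
    each time walking the shorter way around the ring of size k.
    """
    cur = list(src)
    path = [tuple(cur)]
    for dim in range(len(cur)):
        forward = (dst[dim] - cur[dim]) % k          # hops in the +1 direction
        step = 1 if forward <= k - forward else -1   # pick the shorter direction
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path
```

For example, in a 4-ary 2-cube `dor_path((0, 0), (2, 1), 4)` first fixes the x coordinate and then the y coordinate. Because every route turns from lower to higher dimensions in the same order, routes on a torus can still close a cycle around a ring, which is why the slide notes that DOR might generate credit loops.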

Why do Credits Loop?
• IB uses credit-based point-to-point flow control
  • egress messages are sent only if a receive buffer is available
• very similar to deadlocks in wormhole-routed systems
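The condition described above can be checked mechanically. The sketch below, in Python, builds a dependency graph between directed links (a route that holds one link while waiting for credits on the next creates a dependency) and reports whether that graph has a cycle. The route representation (lists of node names) and the function names are assumptions for illustration; virtual lanes are not modelled.

```python
def channel_dependencies(routes):
    """Map each directed link to the set of links that routes request right after it."""
    deps = {}
    for path in routes:
        channels = list(zip(path, path[1:]))
        for channel in channels:
            deps.setdefault(channel, set())
        for held, wanted in zip(channels, channels[1:]):
            deps[held].add(wanted)
    return deps


def has_credit_loop(routes):
    """True if the link dependency graph induced by the routes contains a cycle."""
    deps = channel_dependencies(routes)
    WHITE, GREY, BLACK = 0, 1, 2           # unvisited, on the DFS stack, finished
    color = {c: WHITE for c in deps}
    for root in deps:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(deps[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:     # back edge: dependency cycle found
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(deps[nxt])))
                    break
            else:                          # all successors explored
                color[node] = BLACK
                stack.pop()
    return False


# Four switches in a ring, every route travels two hops clockwise.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
# Routes up and down a line of three nodes never form a cycle.
line_routes = [[0, 1, 2], [2, 1, 0]]
```

Here `has_credit_loop(ring_routes)` is true because each link waits for the next one around the ring, while the line example has no cycle.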

How to deal with Credit Loops?
• prevent (Up*/Down*, turn-based routing)
• resolve (LASH, use VLs to break cycles)
• ignore (MINHOP, DOR; not as bad as it sounds: might deadlock, but can be "resolved" with packet timeouts)
  • discouraged by IB spec

Some Theoretical Background
• model network as $G = (V_P \cup V_C, E)$
• path $r(u,v)$ is a path between $u, v \in V_P$
• routing $R$ consists of $P(P-1)$ paths
• edge load $l(e)$ = number of paths on $e \in E$
• edge forwarding index $\pi(G,R) = \max_{e \in E} l(e)$
• $\pi(G,R)$ is a trivial upper bound to congestion!
• goal is to find $R$ that minimizes $\pi(G,R)$
  • shown to be NP-hard in the general case
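The definitions above translate directly into code. The following Python sketch computes the edge load and the edge forwarding index for a routing given as a list of node paths. Treating each link direction as a separate edge is an assumption made here, and the single-switch example is a toy, not one of the measured systems.

```python
from collections import Counter
from itertools import permutations


def edge_loads(routes):
    """l(e): number of routes that traverse each directed edge e."""
    load = Counter()
    for path in routes:
        for edge in zip(path, path[1:]):
            load[edge] += 1
    return load


def edge_forwarding_index(routes):
    """pi(G, R): the maximum edge load over all edges used by the routing."""
    return max(edge_loads(routes).values(), default=0)


# Toy routing: P = 3 endpoints on one switch, so R has P(P-1) = 6 paths.
endpoints = ["a", "b", "c"]
star_routes = [[u, "sw", v] for u, v in permutations(endpoints, 2)]
```

In this toy case every endpoint link carries P - 1 = 2 routes in each direction, so the forwarding index is 2 even though no two simultaneous flows of a permutation would ever share a link, which illustrates why the index is only an upper bound on congestion.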

Two heuristics based on SSSP
• we propose two heuristics:
  • P-SSSP
  • $P^2$-SSSP
• P-SSSP starts an SSSP (single-source shortest path) run at each node
  • finds paths with minimal edge load $l(e)$
  • updates routing tables in reverse
    • essentially SDSP (single-destination shortest path)
  • updates $l(e)$ between runs
• let's discuss an example …

P-SSSP Routing (1/3)
Step 1: source node 0

P-SSSP Routing (2/3)
Step 2: source node 1

P-SSSP Routing (3/3)
Step 3: source node 2
Result: $\pi(G,R) = 2$
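The P-SSSP steps can be sketched in Python as follows. This is one possible reading of the slides, not the authors' implementation. It assumes that the graph is a connected adjacency dictionary with string node names, that an edge costs 1 plus its current load (so an unloaded network falls back to hop count), that endpoints never forward traffic, and that endpoints are processed in sorted order. All function and variable names are hypothetical.

```python
import heapq


def sssp_to(dest, adj, load, endpoints):
    """Dijkstra rooted at dest; returns {node: next hop towards dest}."""
    dist = {dest: 0}
    next_hop = {}
    heap = [(0, dest)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u in endpoints and u != dest:
            continue                        # endpoints do not forward traffic
        for v in adj[u]:
            nd = d + 1 + load[(v, u)]       # traffic flows v -> u towards dest
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                next_hop[v] = u
                heapq.heappush(heap, (nd, v))
    return next_hop


def p_sssp(adj, endpoints):
    """One SSSP run per endpoint; edge loads are updated between runs."""
    endpoints = set(endpoints)
    load = {(u, v): 0 for u in adj for v in adj[u]}
    tables = {}                             # tables[node][dest] = next hop
    for dest in sorted(endpoints):
        next_hop = sssp_to(dest, adj, load, endpoints)
        for node, hop in next_hop.items():
            tables.setdefault(node, {})[dest] = hop
        for src in endpoints - {dest}:      # account for every route into dest
            u = src
            while u != dest:
                load[(u, next_hop[u])] += 1
                u = next_hop[u]
    return tables, load


# Toy network: endpoints a, b on switch s1 and c, d on s2, joined via m1 and m2.
adj = {
    "a": ["s1"], "b": ["s1"], "c": ["s2"], "d": ["s2"],
    "s1": ["a", "b", "m1", "m2"], "s2": ["c", "d", "m1", "m2"],
    "m1": ["s1", "s2"], "m2": ["s1", "s2"],
}
tables, load = p_sssp(adj, ["a", "b", "c", "d"])
```

Because the tree is rooted at the destination, each run fills exactly one forwarding-table column per switch, which matches destination-based routing. In the toy network the load updates steer the routes towards c and d over different middle switches.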

$P^2$-SSSP
• simply run a single SSSP for each route
• better (but more expensive) heuristic, lower $\pi(G,R)$
[Figure: example routing with $\pi(G,R) = 1$]
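A literal reading of "a single SSSP for each route" is sketched below in Python: one shortest-path search per ordered endpoint pair, with the edge loads updated after every single route. The slide does not say how the per-destination forwarding tables are kept consistent, so this sketch does not model that constraint. The cost of 1 plus load, the sorted pair order, and all names are assumptions.

```python
import heapq
from itertools import permutations


def cheapest_path(src, dst, adj, load, endpoints):
    """Dijkstra from src to dst with edge cost 1 + current load."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        if u in endpoints and u != src:
            continue                         # endpoints do not forward traffic
        for v in adj[u]:
            nd = d + 1 + load[(u, v)]
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:                   # walk predecessors back to src
        path.append(prev[path[-1]])
    return path[::-1]


def p2_sssp(adj, endpoints):
    """One SSSP per route; loads are updated after each individual route."""
    endpoints = set(endpoints)
    load = {(u, v): 0 for u in adj for v in adj[u]}
    routes = {}
    for src, dst in permutations(sorted(endpoints), 2):
        path = cheapest_path(src, dst, adj, load, endpoints)
        for edge in zip(path, path[1:]):
            load[edge] += 1
        routes[(src, dst)] = path
    return routes, load


# Toy network: endpoints a, b on switch s1 and c, d on s2, joined via m1 and m2.
adj2 = {
    "a": ["s1"], "b": ["s1"], "c": ["s2"], "d": ["s2"],
    "s1": ["a", "b", "m1", "m2"], "s2": ["c", "d", "m1", "m2"],
    "m1": ["s1", "s2"], "m2": ["s1", "s2"],
}
routes2, load2 = p2_sssp(adj2, ["a", "b", "c", "d"])
```

The extra cost is visible in the loop structure: P(P-1) searches instead of P, which is the price for reacting to the load after every route rather than after every destination.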

How to Assess a Routing?
• edge forwarding index is a trivial upper bound
• ability to route permutations is more important
  • bisect the P nodes into two equally-sized partitions
  • choose exactly one random partner for each node
  • $\Theta(P!/(P/2)!)$ combinations!
• our simulation approach:
  • pick N (=5000) random bisections/matchings
  • compute average bandwidth
  • shown to be rather precise (Cluster'08)
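The sampling approach can be sketched in Python as below. This is a simplified model and not the ORCS simulator. It assumes a routing given as a dictionary from ordered endpoint pairs to node paths, one flow per matched pair, and a per-flow bandwidth of 1 divided by the largest number of flows sharing any directed link on its path. All names are hypothetical.

```python
import random
from collections import Counter


def random_bisection_matching(endpoints, rng):
    """Shuffle the endpoints, cut them in half, and pair the halves up."""
    nodes = list(endpoints)
    rng.shuffle(nodes)
    half = len(nodes) // 2
    return list(zip(nodes[:half], nodes[half:2 * half]))


def effective_bisection_bandwidth(routes, endpoints, samples=5000, seed=0):
    """Average relative bandwidth over random bisection patterns."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        pairs = random_bisection_matching(endpoints, rng)
        paths = [routes[pair] for pair in pairs]
        # Number of simultaneous flows on every directed link.
        load = Counter(e for path in paths for e in zip(path, path[1:]))
        # Each flow is limited by its most congested link.
        bandwidths = [1.0 / max(load[e] for e in zip(path, path[1:]))
                      for path in paths]
        total += sum(bandwidths) / len(bandwidths)
    return total / samples


# Toy case 1: four endpoints on one switch, no link is ever shared.
hosts = ["a", "b", "c", "d"]
star = {(u, v): [u, "sw", v] for u in hosts for v in hosts if u != v}

# Toy case 2: a, b on s1 and c, d on s2 with a single inter-switch link.
side = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
thin = {
    (u, v): [u, side[u], v] if side[u] == side[v] else [u, side[u], side[v], v]
    for u in hosts for v in hosts if u != v
}
```

Shuffling and splitting yields both the random bisection and the random partner assignment in one step. On the single switch every sample reaches full bandwidth, while the thin two-switch network drops below 1 whenever both flows cross the inter-switch link in the same direction.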

Comparison to Real Systems
• ibdiagnet, ibnetdiscover, and ibsim
• we extracted topology and routing from:
  • Thunderbird (SNL), 4390 LIDs (thanks to: Adam Moody & Ira Weiny)
  • Ranger (TACC), 4080 LIDs (thanks to: Christopher Maestas)
  • Atlas (LLNL), 1142 LIDs (thanks to: Len Wisniewsky)
  • Deimos (TUD), 724 LIDs (thanks to: Guido Juckeland and Michael Kluge)
  • Odin (IU), 128 LIDs

Real-world Results
[Figures: Real-World Runtime; Real-World Bandwidth]

Some more Topologies
[Figures: fat-tree topologies; k-ary 2,3-cube topologies (torus), switches filled with endpoints]

Even more Topologies
[Figures: 2-ary n-cube topologies (hypercube), switches filled with endpoints; random topologies (12 nodes per switch)]

Simulations are good, but still Simulations
• we implemented our routing with OpenSM's file method
• tested it on the Deimos and Odin clusters (needs exclusive admin access to the whole machine; many thanks to Guido Juckeland)
• Odin is a standard fat-tree; Deimos' topology: [figure not reproduced]

Benchmark Results Odin
[Figures: Simulation; Benchmark (Netgauge pattern eBB)]
• simulation predicts 5% improvement
• benchmark shows 18% improvement!

Benchmark Results Deimos
[Figures: Simulation; Benchmark (Netgauge pattern eBB)]
• simulation predicts 23% improvement
• benchmark shows 40% improvement!

Summing up and Future Work!
• we proposed two new routing heuristics for deterministic oblivious routing (IB)
• simulation shows an increase in effective bisection bandwidth over standard OpenSM routing
  • e.g., Odin 5%, Deimos 23%, Atlas 15%, Thunderbird 6%
• benchmarks show even higher improvements
  • Odin 18%, Deimos 40%
• credit loops remain, but the solution is obvious (LASH-like VL principle)

Reproduce our Results!
• talk to us!
• play with our ORCS simulator
  • http://www.unixer.de/ORCS
• benchmark your cluster (and talk to us)
  • Netgauge pattern "ebb"
  • http://www.unixer.de/research/netgauge
• ask questions – now!

Backup Slides

Credit Loops Continued …
[Figures: Source Network and Routes; Buffer Dependency Graph]

Lower $\pi(G,R)$ and lower bandwidth!?
• Yes!
• $\pi(G,R)$ is just an upper bound
• example: no worries, I will not explain it here (refer to article for details)
