I/O Congestion Avoidance via Routing and Object Placement
David Dillow, Galen Shipman, Sarp Oral, and Zhe Zhang
Motivation
● Goal: 240 GB/s routed
● Direct-attached vs. center-wide storage
● Limited allocations
  ● INCITE allocations average 27 million hours
  ● Users prefer to spend their time computing
● Performance issues at scale
Spider Resources
● 48 DDN 9900 couplets
  ● 13,440 1 TB SATA hard drives
  ● DDR InfiniBand connectivity
● 192 Dell PowerEdge 1950 OSS nodes
  ● 16 GB memory
  ● 2x quad-core Xeon @ 2.3 GHz
● 4 Cisco 7024D 288-port DDR IB switches
● 48 Flextronics 24-port DDR IB switches
Wiring up SION
[Figure: SION wiring diagram; link bundles of 8, 32, 64, and 96 links between stages]
Direct-attached Traffic Flow
[Figure: traffic path from Client through the Fabric to OSS and Storage]
Direct-attached Raw I/O Baseline
[figure]
Direct-attached Lustre Baseline
[figure]
Writer Skew
[figure]
SeaStar Bandwidth
[figure]
Link Oversubscription
● Each link can sustain ~3100 MB/s (unidirectional)
● Each OST can contribute 180 MB/s with a balanced load presented to the DDN 9900
  ● 260 MB/s when accessed individually
● Therefore, each link can support 17 client-OST pairs at saturation
  ● only 11 client-OST pairs at 260 MB/s
(A quick arithmetic check of these limits is sketched below.)
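To make the arithmetic explicit, here is a minimal sketch that reproduces the 17- and 11-pair limits from the bandwidth figures above; the constant names and the pairs_per_link helper are illustrative, not part of the original configuration.

    # Back-of-the-envelope check of the per-link oversubscription limits.
    LINK_BW_MBS = 3100         # sustained unidirectional link bandwidth
    OST_BW_BALANCED_MBS = 180  # per-OST bandwidth with a balanced DDN 9900 load
    OST_BW_SOLO_MBS = 260      # per-OST bandwidth when accessed individually

    def pairs_per_link(per_pair_mbs):
        """How many client-OST pairs one link can carry before it saturates."""
        return LINK_BW_MBS // per_pair_mbs

    print(pairs_per_link(OST_BW_BALANCED_MBS))  # -> 17 pairs at 180 MB/s
    print(pairs_per_link(OST_BW_SOLO_MBS))      # -> 11 pairs at 260 MB/s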
Link Oversubscription
● 70% of tests had more than one link with 18 client-OST pairs
● 42% had more than 34 pairs
● 21% had more than 60 pairs
● 3% had more than 70 pairs
But that's only part of the issue.
Imbalanced Sharing
[figure]
Placing I/O in the Torus
● We want to minimize link congestion
  ● prefer no more than 11 client-OST pairs per link
● Easiest method is to place active clients topologically close to their servers
● Use hop count as our metric
Placing I/O in the Torus
● For each OST to be used:
  ● calculate the hop count to its OSS from each candidate client
  ● pick the client with the lowest count
  ● remove that client from further consideration
(A greedy sketch of this placement loop follows.)
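The loop above can be written as a short greedy routine. This is only a sketch under assumed data structures: the 3D torus hop_count() metric, the coordinate tables, and the (ost, oss) pairing are illustrative stand-ins for the machine's real topology data.

    def hop_count(a, b, dims):
        """Minimal hops between two nodes of a 3D torus with wraparound links."""
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    def place_clients(osts, oss_coord, clients, client_coord, dims):
        """Greedy placement: give each OST the closest still-unused client."""
        available = set(clients)
        placement = {}
        for ost, oss in osts:            # (ost_id, oss_id) pairs, in stripe order
            best = min(available,
                       key=lambda c: hop_count(client_coord[c], oss_coord[oss], dims))
            placement[ost] = best        # this client will do the I/O for this OST
            available.remove(best)       # don't reuse it for another OST
        return placement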
Placing I/O in the Torus
[Figure: placed traffic path from Client through the Fabric to OSS and Storage]
Placement Results
[figure]
Improved Writer Skew
[figure]
Does it work in a smaller space?
LNET Routing
● Allows us to separate storage from the compute platform
● Very simple in nature
  ● list of routers for each remote LNET
  ● routers can have different weights
● 1024-character maximum for the routes module option
  ● use lctl add_route for larger configs (see the sketch below)
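As a sketch of how a large routing table might be fed through lctl add_route rather than squeezed into the 1024-character routes option: the NIDs, network names, and hop counts below are purely illustrative, and the exact lctl syntax should be checked against the Lustre version in use.

    # Hypothetical gateway NIDs for the routers, as seen from the compute clients.
    router_nids = ["%d@ptl0" % node for node in range(1, 197)]   # 196 routers

    def route_commands(remote_net="o2ib0", hops=1):
        """Yield one route-add command per gateway toward the storage LNET."""
        for gw in router_nids:
            yield "lctl --net %s add_route %s %d" % (remote_net, gw, hops)

    for cmd in route_commands():
        print(cmd)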
Simple LNET Routing
● 196 routers in the torus
● A client uses each router in a weight class in a round-robin manner
  ● 8 back-to-back messages to a single destination will use 8 different routers
● Congestion in both the torus and the InfiniBand fabric
● No opportunity to improve placement to control congestion
Simple LNET Routing
[Figure: routed traffic path from Client through the Fabric to Router and Storage]
InfiniBand Congestion
[figure]
Improved Routing Configurations
● Aim to eliminate InfiniBand congestion
● Aim to reduce torus congestion
● Provide the ability for an application to determine which router will be used for a particular OST
  ● given the OST-to-OSS mapping
  ● given the client-to-router mappings
Nearest Neighbor
● 32 sets of routers
  ● one set for each leaf module
  ● 6 OSS servers in each set
  ● 6 routers in each set
● Each client chooses the nearest router to talk to the OSSes in a set (see the sketch below)
● Variable performance
  ● by job size
  ● by job location
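A minimal sketch of the nearest-neighbor choice, reusing the same torus hop-count metric as the placement sketch; the router-set representation and coordinate tables are illustrative assumptions.

    def hop_count(a, b, dims):
        """Minimal hops between two nodes of a 3D torus with wraparound links."""
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    def nearest_router(client_xyz, router_set, router_coord, dims):
        """Pick the router in this set that is topologically closest to the client."""
        return min(router_set,
                   key=lambda r: hop_count(client_xyz, router_coord[r], dims))

All OSSes in a set are then reached through that one router, which is why performance varies with where a job happens to land in the torus.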
Nearest Neighbor
[Figure: Clients, Routers (Group A and B), and Storage (Group A and B) across the fabric, with clients labeled 1 and 2 mapped to routers in their group]
Round Robin
● Again, 32 sets of routers
● Ordered list of routers for each set
● Client chooses router (nid % 6) within the set (see the sketch below)
● Throws I/O traffic around the torus
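A sketch of the modulo selection; the NID and router-list representations here are illustrative.

    def round_robin_router(client_nid, router_set):
        """Index the set's ordered router list by the client's NID modulo its size."""
        return router_set[client_nid % len(router_set)]

    # Example with a hypothetical ordered set of six router NIDs:
    routers_for_set = [9000, 9001, 9002, 9003, 9004, 9005]
    print(round_robin_router(1234, routers_for_set))   # -> 9004 (1234 % 6 == 4)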
Round Robin
[Figure: Clients, Routers (Group A and B), and Storage (Group A and B) across the fabric, with clients labeled 1 and 2 spread across routers in each group]
Projection
● 192 LNET networks
  ● one for each OSS
● One primary router for each LNET
  ● add higher weights for backup routers
● Clients experience variable latency to OSSes based on location
● Placement calculations similar to the direct-attached case (see the sketch below)
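Because each OSS sits on its own LNET with a known primary router, a client can predict which router any OST's traffic will cross, so the hop-count placement from the direct-attached case carries over with router coordinates standing in for OSS coordinates. A sketch under assumed mappings and coordinate tables:

    def hop_count(a, b, dims):
        """Minimal hops between two nodes of a 3D torus with wraparound links."""
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    def ost_cost(client_xyz, ost, ost_to_oss, oss_to_router, router_coord, dims):
        """Torus distance from a client to the primary router serving an OST."""
        router = oss_to_router[ost_to_oss[ost]]   # OST -> OSS -> its LNET's primary router
        return hop_count(client_xyz, router_coord[router], dims)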
Projection
[Figure: traffic path from Client through the Fabric to Router and Storage]
Routed Placement Results (IOR)
[figure]
Conclusions
● Goals exceeded: 244 GB/s on routed storage
● "Projected" configuration now in production
● Working with library developers to bring this capability to users
Questions?
● Contact info:
  David Dillow
  865-241-6602
  dillowda@ornl.gov

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.

Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.