Towards An Application Objective-Aware Network Interface
Sangeetha Abdu Jyothi, Sayed Hadi Hashemi, Roy Campbell, Brighten Godfrey
HotCloud’20
Evolution of Application Network Interface (ANI)
ANI: Metrics
• Packet: Delay, jitter
• Flow: Flow Completion Time
• Coflow: Coflow Completion Time
[Figure label: Network Fabric]
What is the ultimate goal of an ANI?
• Translating application requirements to actionable network requirements
Are current ANIs sufficient?
Understanding an Application’s Objective
• Applications have complex interdependencies between computation and communication
• Prioritizing flows based on computations in the succeeding stage is critical
• Current abstractions fail to capture the application objective effectively
[Figure: example DAG with nodes A, B, C, flows f1, f2 and computations c1, c2, c3; Network and Compute timelines for a Coflow-Optimized schedule and a Performance-Optimized schedule]
An Example Application: Distributed Deep Learning
• Gigabytes of data are transferred in each iteration, which lasts milliseconds (e.g., VGG-16 sends ~1GB of data every 200ms)
• Parameters are consumed in a particular order
• Parameter updates sent from the PS to the workers in the best order can accelerate training
[Figure labels: Parameter Server, Worker, Worker, Worker. Sample TensorFlow Model: One Iteration, with Input Data, op1 to op4, op1’ to op4’, Read A to Read D, Update A to Update D]
Other Applications
• User-facing partition-aggregation workloads (remote dependency resolution at a Web proxy)
• Graph processing systems
• Iterative analytics with deadlines (e.g., Naiad)
and so on …
[Figure labels: Client, Proxy, Req 1 … Req n; Gather, Scatter, Update]
Towards A Novel Application Network Interface
• Computation is completely represented by a DAG. What is the network equivalent?
• The goal is to capture an application’s network objective
• CadentFlow:
  • A set of flows with metrics AND
  • An application-level objective
• Metrics may be priority, deadline, weight, etc.
$CF = \{(f_1, T_1), (f_2, T_2), \ldots, (f_n, T_n), \Gamma\}$ where $T_i = (t_{i1}, m_{i1}), (t_{i2}, m_{i2}), \ldots$
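To make the CadentFlow tuple concrete, here is a minimal Python sketch of it as a data structure. The class and field names (`FlowSpec`, `CadentFlow`, `metric`) are hypothetical and not from the talk, and reading each pair in T_i as (metric type, metric value) is an assumption based on the bullet that metrics may be priority, deadline, weight, etc.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FlowSpec:
    """One (f_i, T_i) entry: a flow plus its list of metric pairs."""
    flow_id: str
    # Assumption: each pair is (metric type t_ij, metric value m_ij).
    metrics: List[Tuple[str, float]]


@dataclass
class CadentFlow:
    """A set of flows with metrics AND an application-level objective."""
    flows: List[FlowSpec]
    objective: str  # stands in for Gamma; a free-form label in this sketch

    def metric(self, flow_id: str, metric_type: str) -> Optional[float]:
        """Return the value of one metric for one flow, or None if absent."""
        for flow in self.flows:
            if flow.flow_id == flow_id:
                for kind, value in flow.metrics:
                    if kind == metric_type:
                        return value
        return None


# Hypothetical instance with priority metrics on two flows.
cf = CadentFlow(
    flows=[
        FlowSpec("read_A", [("priority", 1)]),
        FlowSpec("read_B", [("priority", 2)]),
    ],
    objective="minimize completion time subject to priorities",
)
```

Keeping the objective separate from the per-flow metrics mirrors the two-part definition on the slide: the metrics describe individual flows, while the objective describes what the application wants from the set as a whole.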
Defining CCT flexibility ratio
• When computation is the bottleneck, CadentFlow with deadlines provides flexibility for delaying some flows without affecting application performance
• In the example, the best Coflow Completion Time (CCT) is 1s, but up to 1.5s is tolerable without any impact
• CCT flexibility ratio = Max tolerable CCT / Min CCT
[Figure: the example DAG (A, B, C; f1, f2; c1, c2, c3) with two Performance-Optimized timelines, one where c1 takes 0.5s and one where c1 takes 1s]
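The ratio can be checked with the numbers from the example (best CCT of 1s, tolerable CCT of up to 1.5s). The function name and the input validation below are illustrative choices, not part of the talk.

```python
def cct_flexibility_ratio(min_cct: float, max_tolerable_cct: float) -> float:
    """CCT flexibility ratio = max tolerable CCT / min CCT."""
    if min_cct <= 0:
        raise ValueError("min_cct must be positive")
    if max_tolerable_cct < min_cct:
        raise ValueError("max tolerable CCT cannot be below the min CCT")
    return max_tolerable_cct / min_cct


# Values from the slide's example: best CCT 1s, up to 1.5s tolerable.
ratio = cct_flexibility_ratio(min_cct=1.0, max_tolerable_cct=1.5)
print(ratio)  # 1.5
```

A ratio of 1 means there is no slack: any delay in the coflow delays the application. Larger values mean the network can stretch the coflow without hurting application performance.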
Distributed DNN Training CadentFlow
• Priority-based
  • Assign priorities based on DAG structure
  • Objective: Minimize completion time subject to priorities
• Deadline-based
  • Assign deadlines based on per-op computation time
  • Objective: Minimize $\max_i (\mathrm{endTime}_i - \mathrm{deadline}_i)$, where $\mathrm{endTime}_i - \mathrm{deadline}_i$ is the delay of flow $i$
[Figure: Sample TensorFlow Model: One Iteration, with Input Data, op1 to op4, op1’ to op4’, Read A to Read D, Update A to Update D. Priority annotations: p1 at Read A, p2 at Read B and Read C, p3 at Read D. Deadline annotations: d=0ms, t=3ms at op1/Read A; d=3ms each, t=5ms and t=4ms at op2/Read B and op3/Read C; d=12ms, t=2ms at op4/Read D]
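The deadline-based objective can be evaluated for a candidate schedule as follows. This is a sketch only: the deadlines use the d values from the figure (their assignment to Read A through Read D is a reading of the flattened figure text), while the two candidate schedules and all function names are made up for illustration.

```python
from typing import Dict


def max_lateness(end_times: Dict[str, float], deadlines: Dict[str, float]) -> float:
    """Return max_i (endTime_i - deadline_i) over all flows with a deadline."""
    return max(end_times[flow] - deadlines[flow] for flow in deadlines)


# Deadlines in ms, taken from the figure annotations (mapping is assumed).
deadlines_ms = {"read_A": 0, "read_B": 3, "read_C": 3, "read_D": 12}

# Two hypothetical transfer schedules (flow end times in ms).
schedule_1 = {"read_A": 1, "read_B": 3, "read_C": 5, "read_D": 8}
schedule_2 = {"read_A": 1, "read_B": 4, "read_C": 4, "read_D": 8}

# The deadline-based objective prefers the schedule with the smaller value.
candidates = [schedule_1, schedule_2]
best = min(candidates, key=lambda s: max_lateness(s, deadlines_ms))

print(max_lateness(schedule_1, deadlines_ms))  # 2
print(max_lateness(schedule_2, deadlines_ms))  # 1
```

Here `schedule_2` wins even though it finishes Read B later than `schedule_1`, because the objective only penalizes the worst delay past a deadline, not the delay of any single flow.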
Quantifying benefits achievable with a better network abstraction
• Representative application: distributed deep learning
• Methodology
  • Trace distributed deep learning workloads to obtain dependencies and computation/communication times
  • Simulate various network control schemes
    1. TCP (max-min fairness across flows sharing a link)
    2. Minimum Allocation for Desired Duration (MADD) [Coflow control in Varys]
    3. CadentFlow-optimized scheme
[Figure: Sample TensorFlow Model: One Iteration]
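The first two schemes can be contrasted on a single shared link. The sketch below is a simplification, not the simulator used in the talk: it assumes one link, flows that all start at time 0, max-min fairness reducing to an equal split among active flows, and MADD giving each flow the rate size / Γ, where Γ is the time the link needs to carry all the data. Function names and flow sizes are hypothetical.

```python
from typing import Dict


def fair_share_finish_times(sizes: Dict[str, float], capacity: float) -> Dict[str, float]:
    """Finish times when active flows split one link's capacity equally."""
    finish = {}
    now = 0.0
    sent = 0.0  # amount every still-active flow has sent so far
    active = len(sizes)
    # Flows finish in order of size; each step advances to the next finish.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        now += (size - sent) * active / capacity
        sent = size
        finish[name] = now
        active -= 1
    return finish


def madd_finish_times(sizes: Dict[str, float], capacity: float) -> Dict[str, float]:
    """Finish times under MADD on one link: all flows end together at gamma."""
    gamma = sum(sizes.values()) / capacity
    # Each flow gets rate size / gamma, so it finishes exactly at gamma.
    return {name: gamma for name in sizes}


sizes = {"small": 1.0, "large": 3.0}  # hypothetical flow sizes
fair = fair_share_finish_times(sizes, capacity=1.0)
madd = madd_finish_times(sizes, capacity=1.0)

print(fair)  # {'small': 2.0, 'large': 4.0}
print(madd)  # {'small': 4.0, 'large': 4.0}
```

In this toy case both schemes finish the whole coflow at the same time, but MADD holds the small flow back until the end. If a computation is waiting on the small flow, that delay is visible to the application even though the coflow completion time is unchanged.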
Performance Improvement
• Up to 25% improvement in iteration time with CadentFlow
• Coflow optimization may delay completion time because smaller parameters are delayed
• When gain in iteration time is lower, CCT flexibility ratio is higher
[Figure: bar charts comparing Coflow-optimized and CadentFlow-optimized schemes, for 8 workers, 8 PS and for 16 workers, 16 PS. Models: AlexNet-v2, CifarNet, Inception-v1, Inception-v3, MobileNet-v2, ResNet-v1-50, ResNet-v1-152, ResNet-v1-200, ResNet-v2-101, ResNet-v2-152, VGG-19. Axes: Iteration time (relative to TCP) and CCT flexibility ratio (max feasible CCT / min CCT)]