Distributed Learning over Unreliable Networks



  1. Distributed Learning over Unreliable Networks. Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu. Presenter: Chen Yu

  2. AllReduce SGD. Update rule, with the sum over the $n$ workers computed by AllReduce: $x_{t+1} = x_t - \frac{1}{n} \sum_{i=1}^{n} \nabla F(x_t; \xi_t^{(i)})$. [Diagram: servers and Workers 1 to 3.]
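
The update on slide 2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the presenters' code: the quadratic toy loss, the helper names (`allreduce_mean`, `stochastic_grad`, `allreduce_sgd`), and the step size `lr` are all hypothetical. The slide's formula shows no step size, so `lr = 1` corresponds to the update exactly as written. The AllReduce is modeled as a perfect average, that is, no messages are lost.

```python
import random


def allreduce_mean(vectors):
    """Average equal-length vectors, as an ideal (lossless) AllReduce would."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def stochastic_grad(x, target, rng, noise=0.1):
    """Noisy gradient of the hypothetical toy loss 0.5 * ||x - target||^2."""
    return [xi - ti + rng.gauss(0.0, noise) for xi, ti in zip(x, target)]


def allreduce_sgd(x0, targets, steps, lr, seed=0):
    """Run SGD where every step averages one gradient per worker.

    targets[i] plays the role of worker i's local data (an assumption
    made only for this sketch).
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Each worker computes a stochastic gradient at the shared model x_t.
        grads = [stochastic_grad(x, t, rng) for t in targets]
        # AllReduce: every worker ends up with the same averaged gradient.
        g = allreduce_mean(grads)
        # x_{t+1} = x_t - lr * (1/n) * sum_i grad_i
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

With two workers whose targets are `[1.0, 1.0]` and `[3.0, 3.0]`, the iterate settles near their mean, since the averaged gradient is the gradient of the averaged loss.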

  3. Unreliable Network. Sharing gradients won't work. [Plot: training loss (0 to 2.5) versus epoch (0 to 160) for 95% arrival, 99% arrival, and the baseline. Diagram: servers and Workers 1 to 3.]

  4. Reliable Parameter Server (RPS). High level: share models.
     Local partition: $v_t^{(i)} = \big( (v_t^{(i,1)})^\top, (v_t^{(i,2)})^\top, \cdots, (v_t^{(i,n)})^\top \big)^\top$, where $v_t^{(i)} = x_t^{(i)} - \gamma g_t^{(i)}$.
     Robust averaging: $\tilde{v}_t^{(j)} = \frac{1}{|\mathcal{N}_t^{(j)}|} \sum_{i \in \mathcal{N}_t^{(j)}} v_t^{(i,j)}$.
     Model update: $x_{t+1}^{(i,j)} = \begin{cases} \tilde{v}_t^{(j)}, & j \in \tilde{\mathcal{N}}_t^{(i)} \\ v_t^{(i,j)}, & j \notin \tilde{\mathcal{N}}_t^{(i)} \end{cases}$
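
The three RPS steps on slide 4 (local step and partition, robust averaging, model update) can be sketched as a single simulated round. This is a minimal sketch under assumptions, not the authors' implementation: the function name `rps_round`, the flat-list model representation, and the loss model (each block transfer is dropped independently with probability `drop_prob`) are hypothetical. It also assumes that worker `j` is the one averaging block `j`, that a worker never loses its own block, and that the model length is divisible by the number of workers.

```python
import random


def rps_round(models, grads, lr, drop_prob, rng):
    """Simulate one RPS-style round over a lossy network.

    models[i] and grads[i] are flat lists of equal length for worker i.
    Returns the new list of per-worker models.
    """
    n = len(models)
    dim = len(models[0])
    assert dim % n == 0, "sketch assumes the model splits evenly into n blocks"
    size = dim // n

    def block(vec, j):
        """Return block j of a flat vector."""
        return vec[j * size:(j + 1) * size]

    # Local step: v_i = x_i - lr * g_i
    v = [[x - lr * g for x, g in zip(models[i], grads[i])] for i in range(n)]

    # Robust averaging: worker j averages block j over the senders whose
    # message arrived (its own block always counts, by assumption).
    averaged = []
    for j in range(n):
        arrived = [i for i in range(n) if i == j or rng.random() >= drop_prob]
        cols = zip(*(block(v[i], j) for i in arrived))
        averaged.append([sum(c) / len(arrived) for c in cols])

    # Model update: use the averaged block if it arrived back at worker i,
    # otherwise keep the worker's own locally updated block.
    new_models = []
    for i in range(n):
        x_new = []
        for j in range(n):
            got_it = i == j or rng.random() >= drop_prob
            x_new.extend(averaged[j] if got_it else block(v[i], j))
        new_models.append(x_new)
    return new_models
```

With `drop_prob = 0.0` every worker ends the round with the same averaged model; with `drop_prob = 1.0` each worker simply keeps its own local update. In between, the workers' models drift apart only on the blocks that were lost.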

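  5. Convergence Rate.
     Assumptions:
     - $f(x)$ is non-convex, with $L$-Lipschitz gradient.
     - Bounded data variance: $\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2, \ \forall i, \forall x$.
     - Bounded dataset difference: $\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2, \ \forall i, \forall x$.
     $T$: total iterations. $p$: package dropping rate.
     Result: $\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(x_t)\|^2 \lesssim \frac{(\sigma + \zeta)\big(1 + \sqrt{p(1-p)}\big)}{(1-p)\sqrt{nT}} + \frac{1}{T}$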
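
To get a feel for how the drop rate $p$ enters the rate on slide 5, the right-hand side can be evaluated numerically. The sketch below assumes the bound reads $(\sigma + \zeta)(1 + \sqrt{p(1-p)}) / ((1-p)\sqrt{nT}) + 1/T$, with the square roots taken as an assumption about the slide's formula and all hidden constants set to 1. The function name `rps_bound` is hypothetical, and the returned number is only meaningful for comparing settings, not as an actual guarantee.

```python
import math


def rps_bound(sigma, zeta, p, n, T):
    """Evaluate the assumed right-hand side of the slide-5 rate.

    sigma, zeta: the variance and dataset-difference bounds from the slide.
    p: drop rate in [0, 1); n: number of workers; T: total iterations.
    Hidden constants are taken to be 1 (an assumption of this sketch).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if n <= 0 or T <= 0:
        raise ValueError("n and T must be positive")
    leading = (sigma + zeta) * (1.0 + math.sqrt(p * (1.0 - p)))
    leading /= (1.0 - p) * math.sqrt(n * T)
    return leading + 1.0 / T


if __name__ == "__main__":
    # Compare a reliable network with increasingly lossy ones.
    for p in (0.0, 0.01, 0.05, 0.2, 0.4):
        print(f"p={p:.2f}  bound={rps_bound(1.0, 1.0, p, 16, 10_000):.5f}")
```

Under this reading, setting `p = 0` leaves the leading term at $(\sigma + \zeta)/\sqrt{nT}$, and a small drop rate only inflates it by a modest constant factor.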

  6. Experiments. 16 NVIDIA TITAN Xp GPUs, ResNet-110 on CIFAR-10. [Two plots of training loss (0 to 2.5) versus epoch (0 to 160), titled "RPS is Robust" and "Standard SGD is Vulnerable". One legend lists 95% arrival, 99% arrival, and the baseline; the other lists 60%, 80%, 90%, 95%, and 99% arrival and the baseline.]

  7. Thanks. Welcome to Pacific Ballroom #97 to see the poster for more details.
