Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication (ICML 2019)


  1. ICML 2019
     Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
     Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
     EPFL, Switzerland, mlo.epfl.ch
     June 11, 2019

  2. Decentralized Stochastic Optimization
     $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
     (Figure: devices holding local objectives f_i(x), f_j(x), connected by communication links.)
     Each device has oracle access to stochastic gradients $g_i(x)$ with $\mathbb{E}\, g_i(x) = \nabla f_i(x)$ and $\mathrm{Var}[g_i] \le \sigma_i^2$.
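To make the oracle model concrete, here is a minimal numpy sketch (our own illustration, not from the talk) that assumes simple quadratic local objectives f_i(x) = ½‖x − b_i‖² with additive Gaussian gradient noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 9, 5                       # number of devices, problem dimension
b = rng.normal(size=(n, d))       # device i holds f_i(x) = 0.5 * ||x - b_i||^2

def stochastic_grad(i, x, sigma=0.1):
    """Unbiased oracle g_i(x): E[g_i(x)] = grad f_i(x) = x - b[i], Var[g_i] <= sigma^2 * d."""
    return (x - b[i]) + sigma * rng.normal(size=d)

# The global objective f(x) = (1/n) * sum_i f_i(x) is minimized at the mean of the b_i.
x_star = b.mean(axis=0)
```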

  3. Decentralized Stochastic Optimization
     Applications: servers, mobile devices, sensors, hospitals, ...
     Advantages:
     • no central coordinator
     • local communication vs. all-reduce
     • data stays distributed (storage & privacy aspects)
     This work: the bandwidth-restricted setting where communication is a bottleneck.

  4. Data Compression for Efficient Communication
     Communication compression: compress models / model updates before sending them over the network.
     This work: arbitrary compressors, supporting the main state-of-the-art techniques!
     General compressor $Q : \mathbb{R}^d \to \mathbb{R}^d$ (can be biased!):
     $\mathbb{E}_Q \| x - Q(x) \|^2 \le (1 - \delta) \| x \|^2 \quad \forall x \in \mathbb{R}^d$
     Examples: quantization, rounding, sign, top-k, rank-k.
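As one concrete instance (our own sketch, not code from the talk), top-k sparsification satisfies the contraction property, here even deterministically, with δ = k/d:

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest (a biased compressor)."""
    q = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    q[idx] = v[idx]
    return q

# Check ||x - Q(x)||^2 <= (1 - k/d) * ||x||^2 on a random vector (d = 100, k = 10).
x = np.random.default_rng(1).normal(size=100)
q = top_k(x, k=10)
assert np.sum((x - q) ** 2) <= (1 - 10 / 100) * np.sum(x ** 2)
```

Sign, rounding, quantization, and rank-k compression satisfy the same inequality, each with its own δ.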

  5. Main Contribution: CHOCO-SGD
     We propose CHOCO-SGD: a decentralized SGD algorithm with communication compression.
     Main result: CHOCO-SGD converges at the rate
     $f(\bar{x}_T) - f^\star = O\!\left( \frac{\bar{\sigma}^2}{\mu n T} + \frac{1}{\mu^2 \delta^2 \rho^4 T^2} \right)$
     (first term: linear speedup, matches the centralized baseline; second term: higher-order, accounts for topology and compression)
     for $\mu$-strongly convex $f$, averaged variance $\bar{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2$, and spectral gap $\rho > 0$ of the topology.
     • first scheme with linear speedup for arbitrary compressors
     • improves over the previous approach [Tang et al., NeurIPS 18]
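The structure of the update (a local SGD step, followed by exchanging compressed differences to public copies x̂_i and a gossip step on those copies) can be sketched in a small simulation; the quadratic objectives, ring topology, and step sizes η, γ below are untuned illustrative choices of ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 9, 5, 2000
eta, gamma = 0.05, 0.2            # untuned; the paper picks gamma based on delta and rho

b = rng.normal(size=(n, d))       # f_i(x) = 0.5 * ||x - b_i||^2, global optimum at mean(b)

# Symmetric, doubly stochastic mixing matrix for a ring of n nodes.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

def top_k(v, k=1):
    q = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    q[idx] = v[idx]
    return q

x = np.zeros((n, d))              # private iterates x_i
x_hat = np.zeros((n, d))          # public copies, kept identical on all neighbors

for t in range(T):
    g = (x - b) + 0.1 * rng.normal(size=(n, d))   # stochastic gradients g_i(x_i)
    x -= eta * g                                  # local SGD step
    for i in range(n):
        x_hat[i] += top_k(x[i] - x_hat[i])        # only the compressed difference is sent
    x += gamma * (W @ x_hat - x_hat)              # gossip step on the public copies

print(np.linalg.norm(x - b.mean(axis=0), axis=1).max())  # all nodes end up near the optimum
```

Only the compressed vectors q_i = Q(x_i − x̂_i) travel over the links; every neighbor can maintain x̂_i by accumulating them, which is what makes the scheme bandwidth-friendly.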

  6. Key Technique: CHOCO-Gossip
     We propose CHOCO-Gossip: a new algorithm with communication compression for the average consensus problem: compute
     $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
     Idea: classic gossip averaging [Xiao & Boyd, 04] + compression with error feedback [Stich et al., NeurIPS 18]
     • linear convergence for arbitrary compressors
     • all previous gossip schemes with compression did not converge linearly (or not at all) for arbitrary compressors
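A minimal sketch of the consensus-only recursion (the CHOCO-SGD sketch above without the gradient step), again under our illustrative ring topology, top-1 compression, and untuned step size γ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma = 9, 5, 0.2

# Symmetric, doubly stochastic mixing matrix for a ring of n nodes.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

def top_k(v, k=1):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    q = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    q[idx] = v[idx]
    return q

x = rng.normal(size=(n, d))       # private values; goal: every node learns their average
target = x.mean(axis=0)
x_hat = np.zeros((n, d))          # public copies, synchronized via compressed messages

for t in range(1000):
    for i in range(n):
        x_hat[i] += top_k(x[i] - x_hat[i])        # transmit only compressed differences
    x = x + gamma * (W @ x_hat - x_hat)           # modified gossip step

print(np.linalg.norm(x - target, axis=1).max())   # consensus error shrinks toward 0
```

The gossip step preserves the average of the x_i exactly (W is doubly stochastic), while the error-feedback updates drive each public copy x̂_i toward x_i, which together give the linear convergence claimed above.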

  7. Experimental Results
     Example: quantization to 4 bits.
     (Plots: convergence measured against epochs and against transmitted data.)
     Logistic regression on the epsilon dataset, ring topology with n = 9 nodes.
