Genuinely Distributed Byzantine Machine Learning
El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, Sébastien Rouault
Swiss Federal Institute of Technology (EPFL)
first.last@epfl.ch
August 6, 2020
The Big Picture
Machine learning (ML) tackles critical tasks...
...so ML should be made robust
Literature: robust when training the model
4y ago
Genuinely distributed, Byzantine ML
Machine learning (ML)
[Figure: a model with ~1 to 100 millions of parameters. Over successive frames its two output labels read "Krust ZrOm", "Brust GOrm", "Bost GOat", then "Boat Goat".]
Stochastic Gradient Descent (SGD)
[Figure: a model with ~1 to 100 millions of potentiometers and an example gradient estimate (4.2, -0.5, -1.0, 0.8, -5.7, 0.3).]
Training loop:
1. Estimate gradient
2. Turn potentiometers following the gradient
3. Loop back to step 1.
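The training loop above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the quadratic toy loss, the `TARGET` vector, the learning rate and the noise level are all assumptions chosen so the example runs on its own.

```python
import random

def sgd(grad_estimate, params, lr=0.1, steps=200):
    """Repeat: estimate the gradient, then move the parameters against it."""
    params = list(params)
    for _ in range(steps):                       # 3. loop back to step 1
        grad = grad_estimate(params)             # 1. estimate gradient
        # 2. "turn the potentiometers": step against the gradient to lower the loss
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

# Hypothetical toy problem: loss(p) = sum((p_i - t_i)^2), gradient observed with noise.
TARGET = [1.0, -2.0, 0.5]
rng = random.Random(0)

def noisy_gradient(params):
    return [2 * (p - t) + rng.gauss(0, 0.01) for p, t in zip(params, TARGET)]

trained = sgd(noisy_gradient, [0.0, 0.0, 0.0])
print(trained)  # close to TARGET
```

A real model has millions of parameters and estimates its gradient from a batch of data, but the loop has the same three steps.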
Distributed SGD
[Figure: a parameter server holding the model (~1 to 100 millions of parameters), connected to workers over a network. Each worker computes its own gradient estimate, e.g. (4.2, -0.5, -1.0, 0.8, -5.7, 0.4), (4.1, -0.5, -1.0, 0.7, -5.7, 0.3), (4.3, -0.5, -0.9, 0.7, -5.7, 0.3), and sends it to the parameter server.]
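One round of the parameter-server scheme can be sketched as follows, assuming the server aggregates by plain averaging. The three gradients are the ones shown on the slide; the function names, the zero starting parameters and the learning rate are hypothetical.

```python
def average(gradients):
    """Coordinate-wise mean of the gradients received from the workers."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]

def server_step(params, worker_gradients, lr=0.1):
    """Parameter server: aggregate the workers' gradients, then update the model."""
    aggregated = average(worker_gradients)
    return [p - lr * g for p, g in zip(params, aggregated)]

# Gradient estimates sent by three workers (values from the slide).
worker_gradients = [
    [4.2, -0.5, -1.0, 0.8, -5.7, 0.4],
    [4.1, -0.5, -1.0, 0.7, -5.7, 0.3],
    [4.3, -0.5, -0.9, 0.7, -5.7, 0.3],
]

params = [0.0] * 6                      # hypothetical starting model
new_params = server_step(params, worker_gradients)
print(average(worker_gradients))
print(new_params)
```

Because the honest estimates are close to each other, their average is a good gradient estimate too.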
Distributed, Byzantine SGD
[Figure: a parameter server holding the model (~1 to 100 millions of parameters), connected to workers over a network. Some workers are Byzantine and send arbitrary gradients such as (-537, -752, 349, 412, 824, -153), alongside honest gradients such as (4.2, -0.5, -1.0, 0.8, -5.7, 0.4) and (4.1, -0.5, -1.0, 0.7, -5.7, 0.3).]
Byzantine-resilient SGD
[Figure: the server receives (4.2, -0.5, -1.0, 0.8, -5.7, 0.4), (4.1, -0.5, -1.0, 0.7, -5.7, 0.3) and the Byzantine (-537, -752, 349, 412, 824, -153). Average ≈ (-537, -752, 349, 412, 824, -153). MDA ≈ (4.1, -0.5, -1.0, 0.7, -5.7, 0.3).]
Alternatives to Average: MDA, Median, Krum, Bulyan, GeoMed
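The contrast on this slide can be reproduced with a short sketch. It compares plain averaging with two of the listed rules: a coordinate-wise median, and MDA, taken here to mean "average the n - f gradients whose largest pairwise distance is smallest". The function names and the choice f = 1 are assumptions for illustration; the input vectors are the ones on the slide.

```python
from itertools import combinations
from math import dist
from statistics import median

def average(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def coordinate_median(vectors):
    """Median taken independently on each coordinate."""
    return [median(column) for column in zip(*vectors)]

def mda(vectors, f):
    """Minimum-diameter averaging: average the subset of n - f vectors
    whose diameter (largest pairwise Euclidean distance) is smallest."""
    def diameter(subset):
        return max((dist(a, b) for a, b in combinations(subset, 2)), default=0.0)
    best = min(combinations(vectors, len(vectors) - f), key=diameter)
    return average(best)

received = [
    [4.2, -0.5, -1.0, 0.8, -5.7, 0.4],       # honest
    [4.1, -0.5, -1.0, 0.7, -5.7, 0.3],       # honest
    [-537, -752, 349, 412, 824, -153],       # Byzantine
]

avg_out = average(received)             # dragged far away by the Byzantine vector
med_out = coordinate_median(received)   # stays among the honest values
mda_out = mda(received, f=1)            # averages the two honest vectors only
print(avg_out, med_out, mda_out, sep="\n")
```

A single Byzantine vector is enough to move the average arbitrarily far, while both robust rules return a vector close to the honest gradients.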
Problem: single point of failure
Solution: Byzantine consensus
Nope: asynchronous network
Key problem: divergence
[Figure: animation over nodes labeled A, B, C, D and 1, 2, 3.]
The goal
Can we keep the models (~1 to 100 millions of parameters each) "close" to each other, despite network asynchrony and Byzantine behaviors?
Key approach
Can we bring the models (~1 to 100 millions of parameters each) back closer to each other, despite network asynchrony and Byzantine behaviors?
Key approach: +1 round
[Figure: nodes labeled A, B, C, D and 1, 2, 3.]
Key approach: toy example
[Figure: four nodes 1, 2, 3, 4, each holding a 1-parameter model, "& one". The spread of their models is the diameter; later frames show a reduced diameter.]
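The idea of reducing the diameter can be illustrated with a small sketch. It assumes four honest nodes with 1-parameter models, one Byzantine node that may send a different value to each honest node, and a median as the contraction rule. The median rule, the values and the attack are assumptions made for this example; the slide does not name the rule, and network asynchrony is not modeled here.

```python
from statistics import median

def diameter(values):
    """Spread of the honest 1-parameter models."""
    return max(values) - min(values)

def contraction_round(honest, byzantine_msg):
    """Each honest node replaces its model by the median of all honest models
    plus the value the Byzantine node chose to send to that node."""
    return [median(honest + [byzantine_msg(i)]) for i in range(len(honest))]

honest = [1.0, 2.0, 3.0, 4.0]           # hypothetical models of nodes 1 to 4

def attack(i):
    """Byzantine node tries to pull the nodes apart: low to some, high to others."""
    return -100.0 if i < 2 else 100.0

after = contraction_round(honest, attack)
print(diameter(honest), "->", diameter(after))
print(after)
```

In this instance the honest models stay within their original range and end up closer to each other, even though the Byzantine node sends conflicting extreme values.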
Key approach: last remark
[Figure: the toy example with nodes 1, 2, 3, 4, annotated "×2".]