

  1. MOCHA: Federated Multi-Task Learning · NIPS '17
 Virginia Smith · Stanford / CMU
 Chao-Kai Chiang · USC · Maziar Sanjabi · USC · Ameet Talwalkar · CMU

  2. MACHINE LEARNING WORKFLOW
 data & problem → machine learning model → optimization algorithm
 $\min_{w} \sum_{i=1}^{n} \ell(w, x_i) + g(w)$
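To make the objective concrete, here is a minimal sketch in Python that evaluates it, assuming logistic loss for $\ell$ and an L2 regularizer for $g$ (the slide leaves both abstract):

```python
import numpy as np

def erm_objective(w, X, y, lam):
    """Regularized empirical risk: sum_i loss(w, x_i) + g(w).

    Assumed instantiation (the slide keeps l and g abstract):
    logistic loss and g(w) = (lam / 2) * ||w||^2.
    """
    margins = y * (X @ w)                    # y_i * w^T x_i
    losses = np.log1p(np.exp(-margins))      # logistic loss per example
    return losses.sum() + 0.5 * lam * (w @ w)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
print(erm_objective(np.zeros(5), X, y, lam=0.1))
```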

  3. MACHINE LEARNING WORKFLOW (IN PRACTICE)
 data & problem → machine learning model → optimization algorithm → systems setting
 $\min_{w} \sum_{i=1}^{n} \ell(w, x_i) + g(w)$

  4. how can we perform fast distributed optimization?

  5. BEYOND THE DATACENTER
 Massively Distributed · Node Heterogeneity · Unbalanced · Non-IID · Underlying Structure

  6. BEYOND THE DATACENTER
 Systems Challenges: Massively Distributed · Node Heterogeneity
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure

  7. MACHINE LEARNING WORKFLOW (IN PRACTICE)
 data & problem → machine learning model → optimization algorithm → systems setting
 $\min_{w} \sum_{i=1}^{n} \ell(w, x_i) + g(w)$

  8. MACHINE LEARNING WORKFLOW (IN PRACTICE)
 data & problem → systems setting → machine learning model → optimization algorithm
 $\min_{w} \sum_{i=1}^{n} \ell(w, x_i) + g(w)$

  9. OUTLINE
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
 Systems Challenges: Massively Distributed · Node Heterogeneity

  10. OUTLINE
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
 Systems Challenges: Massively Distributed · Node Heterogeneity

  11. A GLOBAL APPROACH: a single shared model W across all devices [MMRHA, AISTATS '16]

  12. A LOCAL APPROACH
 [Diagram: separate models W_1, ..., W_12, one per device]

  13. OUR APPROACH: PERSONALIZED MODELS
 [Diagram: per-device models W_1, ..., W_12]

  14. OUR APPROACH: PERSONALIZED MODELS
 [Diagram: per-device models W_1, ..., W_12, with relationships among them]

  15. MULTI-TASK LEARNING
 $\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t, x_t^i) + R(W, \Omega)$
 losses $\ell_t$ · models $W$ · regularizer $R$ capturing task relationships $\Omega$
 Structures R can capture: all tasks related · outlier tasks · clusters / groups · asymmetric relationships [ZCY, SDM 2012]
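As a concrete instance of the objective above (the loss and regularizer choices here are assumptions; the slide keeps both abstract), a common regularizer is $R(W, \Omega) = \lambda \, \mathrm{tr}(W \Omega W^T)$ with an $m \times m$ task-relationship matrix $\Omega$:

```python
import numpy as np

def hinge_risk(w, X, y):
    """Sum of hinge losses for one task: sum_i max(0, 1 - y_i * w^T x_i)."""
    return np.maximum(0.0, 1.0 - y * (X @ w)).sum()

def mtl_objective(W, Omega, data, lam):
    """Multi-task objective: sum_t sum_i loss_t(w_t, x_t^i) + R(W, Omega).

    W is d x m (column w_t per task), Omega is m x m, and the assumed
    regularizer R(W, Omega) = lam * tr(W Omega W^T) encodes how strongly
    tasks are coupled; different Omega structures capture the variants
    on the slide (clusters, outliers, asymmetric relationships).
    """
    risk = sum(hinge_risk(W[:, t], X_t, y_t)
               for t, (X_t, y_t) in enumerate(data))
    return risk + lam * np.trace(W @ Omega @ W.T)
```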

  16. FEDERATED DATASETS: Google Glass · Human Activity · Vehicle Sensor · Land Mine

  17. PREDICTION ERROR (average across tasks; standard error in parentheses)

                     Global          Local           MTL
   Human Activity     2.23 (0.30)     1.34 (0.21)     0.46 (0.11)
   Google Glass       5.34 (0.26)     4.92 (0.26)     2.02 (0.15)
   Land Mine         27.72 (1.08)    23.43 (0.77)    20.09 (1.04)
   Vehicle Sensor    13.4  (0.26)     7.81 (0.13)     6.59 (0.21)

  18. OUTLINE
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
 Systems Challenges: Massively Distributed · Node Heterogeneity

  19. OUTLINE
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
 Systems Challenges: Massively Distributed · Node Heterogeneity

  20. GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING
 $\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + R(W, \Omega)$
 Solve for $W$ and $\Omega$ in an alternating fashion: $\Omega$ can be updated centrally, while $W$ needs to be solved in the federated setting.
 Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

  21. GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING
 Idea: modify a communication-efficient method built for the data center setting to handle:
 ✔ multi-task learning
 ✔ stragglers
 ✔ fault tolerance
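A runnable sketch of the alternating scheme, with assumed stand-ins for what the slide leaves open: squared loss, $R(W, \Omega) = \lambda \, \mathrm{tr}(W \Omega^{-1} W^T)$, a centralized gradient step for the W step (in the paper this is the federated MOCHA solver), and the known closed-form central update $\Omega = (W^T W)^{1/2} / \mathrm{tr}\big((W^T W)^{1/2}\big)$ for this trace regularizer:

```python
import numpy as np

def psd_sqrt(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def alternating_mtl(data, lam=0.1, T=50, lr=0.01):
    """Alternate the W step and the central Omega step for
    min_{W,Omega} sum_t sum_i loss + lam * tr(W Omega^{-1} W^T).

    The W step below is centralized gradient descent on squared loss,
    a stand-in for the federated solver; lr must be small relative to
    the task Gram matrices.
    """
    d, m = data[0][0].shape[1], len(data)
    W = np.zeros((d, m))
    Omega_inv = m * np.eye(m)
    for _ in range(T):
        # W step (federated in MOCHA; centralized here for brevity)
        grad = np.stack([X.T @ (X @ W[:, t] - y)
                         for t, (X, y) in enumerate(data)], axis=1)
        grad += 2.0 * lam * W @ Omega_inv
        W -= lr * grad
        # Omega step: cheap closed-form update, runs centrally
        S = psd_sqrt(W.T @ W) + 1e-3 * np.eye(m)
        Omega_inv = np.linalg.inv(S / np.trace(S))
    return W
```

The design point the slide makes is visible in the structure: the Omega update touches only the small m x m matrix on the server, while all the expensive, data-dependent work sits in the W step.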

  22. COCOA: COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION
 Key idea: control communication. CoCoA sits on a spectrum between mini-batch methods (communicate after every small batch) and one-shot communication (communicate only once).

  23. COCOA: PRIMAL-DUAL FRAMEWORK
 Primal: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^T x_i) + \lambda g(w)$
 Dual: $\max_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} -\ell^*(-\alpha_i) - \lambda g^*(X \alpha)$
 The dual decomposes into data-local subproblems $\tilde{g}^*(X_{[k]}, \alpha_{[k]})$, $k = 1, \dots, K$, whose sum bounds the dual objective; each round, machine $k$ computes an update and sets $\alpha_{[k]}^{(t+1)} \leftarrow \alpha_{[k]}^{(t)} + \Delta\alpha_{[k]}$.

  24. COCOA: PRIMAL-DUAL FRAMEWORK
 Primal: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^T x_i) + \lambda g(w)$
 Dual: $\max_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} -\ell^*(-\alpha_i) - \lambda g^*(X \alpha)$
 Challenge #1: extend this setup to multi-task learning.
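For intuition, here is the primal-dual pair instantiated with squared loss $\ell_i(z) = \frac{1}{2}(z - y_i)^2$ (so $\ell_i^*(u) = \frac{u^2}{2} + u y_i$) and $g(w) = \frac{1}{2}\|w\|^2$; these concrete choices are assumptions, but they make the duality gap, which CoCoA-style methods use as an optimality certificate, directly computable:

```python
import numpy as np

def primal(w, X, y, lam):
    n = len(y)
    return np.sum(0.5 * (X @ w - y) ** 2) / n + 0.5 * lam * (w @ w)

def dual(alpha, X, y, lam):
    n = len(y)
    v = X.T @ alpha / (lam * n)          # primal-dual map w(alpha)
    return -np.sum(0.5 * alpha ** 2 - alpha * y) / n - 0.5 * lam * (v @ v)

# weak duality: dual(alpha) <= primal(w) for any w and alpha, so the gap
# primal(w(alpha)) - dual(alpha) certifies how suboptimal alpha is
rng = np.random.default_rng(0)
X, y, lam = rng.normal(size=(50, 4)), rng.normal(size=50), 0.1
alpha = rng.normal(size=50)
w = X.T @ alpha / (lam * len(y))
print(primal(w, X, y, lam) - dual(alpha, X, y, lam) >= 0)   # True
```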

  25. COCOA: COMMUNICATION PARAMETER
 Main assumption: each subproblem is solved to accuracy $\Theta \in [0, 1)$.
 $\Theta$ trades off the amount of local computation vs. communication: $\Theta \to 0$ solves each subproblem exactly; larger $\Theta$ solves it more inexactly.

  26. COCOA: COMMUNICATION PARAMETER
 Main assumption: each subproblem is solved to accuracy $\Theta \in [0, 1)$.
 Challenge #2: make communication more flexible.
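A sketch of what the accuracy parameter means operationally; the stopping rule below mirrors CoCoA's condition that the subproblem's remaining suboptimality be at most a $\Theta$ fraction of the starting suboptimality (knowing the subproblem optimum `opt_value` is assumed here purely to make the rule explicit):

```python
def local_solve(sub_obj, step, alpha0, theta, opt_value, max_iters=100_000):
    """Run a local solver until the subproblem is solved to accuracy theta:

        sub_obj(alpha) - opt  <=  theta * (sub_obj(alpha0) - opt)

    theta -> 0: solve exactly (much local computation, few communication
    rounds); theta -> 1: solve very inexactly (cheap local steps, more
    rounds). `step` is one pass of any local solver, e.g. SDCA.
    """
    alpha = alpha0.copy()
    gap0 = sub_obj(alpha0) - opt_value
    for _ in range(max_iters):
        if sub_obj(alpha) - opt_value <= theta * gap0:
            break
        alpha = step(alpha)
    return alpha
```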

  27. MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION
 $\min_{W, \Omega} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + R(W, \Omega)$
 Solve for $W$ and $\Omega$ in an alternating fashion; modify CoCoA to solve the $W$ step in the federated setting through its dual:
 $\min_{\alpha} \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t^*(-\alpha_t^i) + R^*(X \alpha)$
 with a local quadratic subproblem per device $t$:
 $\min_{\Delta\alpha_t} \sum_{i=1}^{n_t} \ell_t^*\!\big(-(\alpha_t^i + \Delta\alpha_t^i)\big) + \langle w_t(\alpha), X_t \Delta\alpha_t \rangle + \frac{\sigma'}{2} \| X_t \Delta\alpha_t \|_{M_t}^2$
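With squared loss (assumed here, so that $\ell_t^*(-a) = \frac{a^2}{2} - a y_i$) and $M_t = I$, the per-device subproblem above is a quadratic in $\Delta\alpha_t$ with a closed-form minimizer; a sketch with examples stored as rows of `X_t`:

```python
import numpy as np

def local_subproblem_exact(X_t, y_t, alpha_t, w_t, sigma_p):
    """Exact minimizer of the MOCHA subproblem for squared loss, M_t = I:

      min_d  sum_i l_t*(-(alpha_i + d_i)) + <w_t, X_t^T d> + (sigma'/2) ||X_t^T d||^2

    Setting the gradient to zero gives the linear system
      (I + sigma' X_t X_t^T) d = y_t - alpha_t - X_t w_t.
    """
    n_t = len(y_t)
    A = np.eye(n_t) + sigma_p * (X_t @ X_t.T)
    return np.linalg.solve(A, y_t - alpha_t - X_t @ w_t)
```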

  28. MOCHA: PER-DEVICE, PER-ITERATION APPROXIMATIONS
 New assumption: each subproblem is solved to accuracy $\theta_t^h \in [0, 1]$, per device $t$ and per iteration $h$.
 Stragglers from statistical heterogeneity: difficulty of solving the subproblem; size of the local dataset.
 Stragglers from systems heterogeneity: hardware (CPU, memory); network connection (3G, LTE, ...); power (battery level).
 Fault tolerance: devices going offline ($\theta_t^h = 1$).
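A small sketch of what per-device, per-iteration accuracies might look like in simulation; the sampling choices are illustrative assumptions, not the paper's model:

```python
import numpy as np

def draw_theta(rng, m, p_drop=0.05):
    """Simulate accuracies theta_t^h for m devices in round h.

    Slower or busier devices make less local progress (theta closer
    to 1); a device that goes offline this round contributes nothing,
    which is encoded exactly as theta_t^h = 1.
    """
    theta = rng.uniform(0.0, 0.9, size=m)   # systems/statistical variation
    theta[rng.random(m) < p_drop] = 1.0     # fault tolerance: dropped devices
    return theta
```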

  29. CONVERGENCE
 New assumption: each subproblem is solved to accuracy $\theta_t^h$, and assume $P[\theta_t^h := 1] < 1$.
 Theorem 1. Let $\ell_t$ be $L$-smooth. Then MOCHA converges at a linear rate, reaching $\epsilon$ accuracy in
 $T \geq \frac{1}{(1 - \bar{\Theta})} \left( \frac{L}{\mu} + n \right) \log \frac{n}{\epsilon}$ rounds.
 Theorem 2. Let $\ell_t$ be $(1/\mu)$-Lipschitz. Then MOCHA converges at a sublinear $1/\epsilon$ rate, reaching $\epsilon$ accuracy in
 $T \geq \tilde{c} + \frac{8 L^2 n^2}{(1 - \bar{\Theta}) \, \epsilon}$ rounds.

  30. MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

 Algorithm 1: MOCHA: Federated Multi-Task Learning Framework
  1: Input: data X_t stored on devices t = 1, ..., m
  2: Initialize α^(0) := 0, v^(0) := 0
  3: for iterations i = 0, 1, ... do
  4:   for iterations h = 0, 1, ..., H_i do
  5:     for devices t ∈ {1, 2, ..., m} in parallel do
  6:       call local solver, returning a θ_t^h-approximate solution Δα_t
  7:       update local variables α_t ← α_t + Δα_t
  8:     reduce: v ← v + Σ_t X_t Δα_t
  9:   update Ω centrally using w(v) := ∇R*(v)
 10: compute w(v) := ∇R*(v)
 11: return W := [w_1, ..., w_m]
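A compact, runnable sketch of the W step of Algorithm 1, under the same assumed instantiation as in the earlier sketches: squared loss, $R(W, \Omega) = \frac{\lambda}{2}\mathrm{tr}(W \Omega^{-1} W^T)$ so that $R^*(V) = \frac{1}{2\lambda}\mathrm{tr}(V \Omega V^T)$ and $w(v) := \nabla R^*(V) = V \Omega / \lambda$, with each device solving its quadratic subproblem exactly ($\theta_t^h = 0$):

```python
import numpy as np

def mocha_w_step(data, Omega, lam=1.0, sigma_p=1.0, rounds=30):
    """Sketch of Algorithm 1's inner W loop (the Omega step is omitted).

    data: list of (X_t, y_t) per device, examples as rows of X_t.
    V holds v_t = X_t^T alpha_t in column t; the primal map is
    W = grad R*(V) = V @ Omega / lam for the assumed regularizer.
    """
    m, d = len(data), data[0][0].shape[1]
    alphas = [np.zeros(len(y)) for _, y in data]
    V = np.zeros((d, m))
    for _ in range(rounds):
        W = V @ Omega / lam                       # w(v) := grad R*(v)
        deltas = np.zeros((d, m))
        for t, (X, y) in enumerate(data):         # in parallel on devices
            # exact local subproblem solve (theta_t^h = 0), squared loss:
            A = np.eye(len(y)) + sigma_p * (X @ X.T)
            d_t = np.linalg.solve(A, y - alphas[t] - X @ W[:, t])
            alphas[t] += d_t                      # local variable update
            deltas[:, t] = X.T @ d_t
        V += deltas                               # reduce: v <- v + X_t d_t
    return V @ Omega / lam                        # W := [w_1, ..., w_m]

# toy usage: three devices, Omega = I (tasks treated as unrelated)
rng = np.random.default_rng(0)
data = [(rng.normal(size=(40, 5)), rng.normal(size=40)) for _ in range(3)]
print(mocha_w_step(data, Omega=np.eye(3)).shape)  # (5, 3)
```

Lowering any device's per-round accuracy (solving its subproblem only approximately, or skipping it entirely when it drops offline) changes only line 6 of the algorithm, which is exactly how MOCHA absorbs stragglers and faults.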

  31. STATISTICAL HETEROGENEITY
 [Plots: Human Activity, primal sub-optimality vs. estimated time on WiFi, LTE, and 3G networks; MOCHA vs. CoCoA, Mb-SDCA, Mb-SGD]
 MOCHA is robust to statistical heterogeneity; MOCHA and CoCoA perform particularly well in high-communication settings.

  32. SYSTEMS HETEROGENEITY
 [Plots: Vehicle Sensor, primal sub-optimality vs. estimated time under low and high systems heterogeneity; same methods]
 MOCHA significantly outperforms all competitors, by two orders of magnitude.

  33. FAULT TOLERANCE
 [Plots: Google Glass, primal sub-optimality vs. estimated time for the W step alone and for the full method]
 MOCHA is robust to dropped nodes.

  34. OUTLINE
 Statistical Challenges: Unbalanced · Non-IID · Underlying Structure
 Systems Challenges: Massively Distributed · Node Heterogeneity

  35. WWW.SYSML.CC
 Virginia Smith · Stanford / CMU
 Code & papers: cs.berkeley.edu/~vsmith
