  1. Federated Learning
     Min Du, Postdoc, UC Berkeley

  2. Outline
     • Preliminary: deep learning and SGD
     • Federated learning: FedSGD and FedAvg
     • Related research in federated learning
     • Open problems

  3. Outline (repeated as a section divider; same items as slide 2)

  4. The goal of deep learning
     • Find a function which produces a desired output given a particular input; w is the set of parameters contained by the function.

     Example task           | Given input                    | Desired output
     Image classification   | (an image)                     | 8
     Next-word prediction   | "Looking forward to your ?"    | reply
     Playing Go             | (current board position)       | Next move

  5. Finding the function: model training
     • Given one input-output pair (x_0, y_0), the goal of deep learning model training is to find a set of parameters w that maximizes the probability of outputting y_0 given x_0.
     • Example: given an input image x_0 of a handwritten 5, maximize p(5 | x_0, w).

  6. Finding the function: model training
     • Given a training dataset containing n input-output pairs (x_i, y_i), i ∈ [1, n], the goal of deep learning model training is to find a set of parameters w such that, averaged over the dataset, the probability p(y_i | x_i, w) is maximized.

  7. Finding the function: model training
     • Given a training dataset containing n input-output pairs (x_i, y_i), i ∈ [1, n], the goal of deep learning model training is to find a set of parameters w such that, averaged over the dataset, the probability p(y_i | x_i, w) is maximized.
     • That is,
          maximize over w:   (1/n) Σ_{i=1}^n p(y_i | x_i, w)
       which, as a maximum-likelihood objective, is equivalent to
          minimize over w:   (1/n) Σ_{i=1}^n −log p(y_i | x_i, w)
     • This gives a basic component of the loss function for a sample (x_i, y_i): ℓ(x_i, y_i, w) = −log p(y_i | x_i, w). Let f_i(w) = ℓ(x_i, y_i, w) denote the per-sample loss.
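A minimal sketch of the per-sample loss f_i(w) = ℓ(x_i, y_i, w) as a negative log-likelihood, assuming (purely for illustration, not from the talk) a linear softmax classifier in NumPy:

```python
import numpy as np

def per_sample_loss(w, x_i, y_i):
    """f_i(w) = -log p(y_i | x_i, w) for an illustrative linear softmax model.

    w:   weight matrix of shape (num_classes, num_features) -- hypothetical parameterization
    x_i: feature vector of shape (num_features,)
    y_i: integer class label
    """
    logits = w @ x_i
    logits = logits - logits.max()                      # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log softmax
    return -log_probs[y_i]                              # negative log-likelihood
```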

  8. Deep learning model training
     For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:
          min_{w ∈ ℝ^d} f(w),   where f(w) ≝ (1/n) Σ_{i=1}^n f_i(w)
     and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
     • No closed-form solution: in a typical deep learning model, w may contain millions of parameters.
     • Non-convex: multiple local minima exist.
     [Figure: loss f(w) plotted against w.]

  9. Solution: Gradient Descent
     • Start from a randomly initialized weight vector w.
     • Compute the gradient ∇f(w) and update
          w_{t+1} = w_t − η ∇f(w_t)    (gradient descent)
       where the learning rate η controls the step size.
     • How to stop? When the update is small enough, i.e. the procedure has converged:
          ‖w_{t+1} − w_t‖ ≤ ε   or   ‖∇f(w_t)‖ ≤ ε
       (at a local minimum, ∇f(w) is close to 0).
     • Problem: the number of training samples n is usually large, so computing ∇f over all of them at every step makes convergence slow.
     [Figure: loss f(w) versus w, with gradient-descent steps moving toward a local minimum.]

  10. Solution: Stochastic Gradient Descent (SGD)
     • At each step of gradient descent, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples (x_b, y_b):
          w_{t+1} ← w_t − η ∇f(w_t; x_b, y_b)
     • Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.
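A minimal sketch of the mini-batch SGD loop from slides 9 and 10; `grad_fn`, the dataset arrays, and the stopping threshold are illustrative assumptions rather than the talk's code:

```python
import numpy as np

def sgd(grad_fn, w, X, Y, lr=0.1, batch_size=32, max_steps=10_000, eps=1e-6):
    """Mini-batch SGD: w_{t+1} <- w_t - lr * grad f(w_t; x_b, y_b)."""
    n = len(X)
    for _ in range(max_steps):
        batch = np.random.choice(n, size=batch_size, replace=False)  # random mini-batch (x_b, y_b)
        g = grad_fn(w, X[batch], Y[batch])                           # gradient on the mini-batch only
        w_new = w - lr * g
        if np.linalg.norm(w_new - w) <= eps:                         # stopping rule from slide 9
            return w_new
        w = w_new
    return w
```

Setting `batch_size = n` recovers plain (full-batch) gradient descent, which is the slow case the previous slide warns about.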

  11. Outline (section divider: moving on to "Federated learning: FedSGD and FedAvg")

  12. The importance of data for ML
     "The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data."
     - Edd Wilder-James, Harvard Business Review

  13. "Data is the New Oil"

  14. Companies such as Google and Apple want to train ML models on users' private data, for example:
     • Image classification: e.g. to predict which photos are most likely to be viewed multiple times in the future.
     • Language models: e.g. voice recognition, next-word prediction, and auto-reply in Gmail.
     Private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.

  15. Instead of uploading the raw data, each user trains a model locally and uploads the model; the central server (Google, Apple, ...) performs model aggregation.
     • Addressing privacy: model parameters will never contain more information than the raw training data.
     • Addressing network overhead: the size of the model is generally smaller than the size of the raw training data.

  16. Federated optimization: characteristics (major challenges)
     • Non-IID: the data generated by each user are quite different.
     • Unbalanced: some users produce significantly more data than others.
     • Massively distributed: # mobile device owners >> avg # training samples on each device.
     • Limited communication: unstable mobile network connections.
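To make the non-IID and unbalanced setting concrete, here is a small sketch of how such a client partition is commonly simulated in experiments: sort the data by label and deal out a few shards per client, in the spirit of the paper's pathological non-IID split (the exact shard counts here are illustrative assumptions):

```python
import numpy as np

def non_iid_partition(labels, num_clients=100, shards_per_client=2, seed=0):
    """Assign sample indices to clients so most clients see only a few classes."""
    order = np.argsort(labels)                                   # sample indices sorted by label
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = np.random.default_rng(seed).permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    ]
```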

  17. A new paradigm: Federated Learning
     A synchronous update scheme that proceeds in rounds of communication.
     McMahan, H. Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS, 2017.

  18. Federated learning: overview (round i)
     The central server holds the global model M(i) and sends it to the participating clients; each client trains M(i) on its local data to produce gradient updates for M(i). Deployed by Google, Apple, etc.
     [Diagram: central server with global model M(i); clients with local data computing gradient updates for M(i).]

  19. Federated learning: overview (round i, continued)
     Each client sends its updates of M(i) back to the central server, which performs model aggregation to produce the next global model M(i+1).
     [Diagram: clients upload updates of M(i); the central server aggregates them into M(i+1).]

  20. Federated learning: overview (round i+1)
     The central server distributes the new global model M(i+1) to the clients, and the process continues with round i+1.
     [Diagram: central server broadcasts M(i+1) to clients with local data.]

  21. Federated learning: detail
     For efficiency, at the beginning of each round, a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.

  22. Federated learning: detail
     • Recall traditional deep learning model training:
       ○ For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is
            min_{w ∈ ℝ^d} f(w),   where f(w) ≝ (1/n) Σ_{i=1}^n f_i(w)
         and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
       ○ Deep learning optimization relies on SGD and its variants, applied through mini-batches:
            w_{t+1} ← w_t − η ∇f(w_t; x_b, y_b)

  23. Federated learning: detail
     • In federated learning:
       ○ Suppose the n training samples are distributed over K clients, where P_k is the set of indices of the data points on client k, and n_k = |P_k|.
       ○ The training objective is still min_{w ∈ ℝ^d} f(w), but now
            f(w) = Σ_{k=1}^K (n_k / n) F_k(w),   where F_k(w) ≝ (1/n_k) Σ_{i ∈ P_k} f_i(w)
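A small sanity-check sketch of the decomposition above: the weighted sum of per-client objectives F_k(w) equals the plain average loss over all n samples (the per-sample losses and index sets are assumed, illustrative inputs):

```python
import numpy as np

def global_objective(per_sample_losses, client_index_sets):
    """f(w) = sum_k (n_k / n) * F_k(w), with F_k(w) the average loss on client k."""
    n = len(per_sample_losses)
    f = 0.0
    for P_k in client_index_sets:                       # P_k: indices of client k's samples
        n_k = len(P_k)
        F_k = np.mean([per_sample_losses[i] for i in P_k])
        f += (n_k / n) * F_k
    return f  # equals np.mean(per_sample_losses) when the P_k partition all n indices
```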

  24. A baseline: FederatedSGD (FedSGD)
     • A randomly selected client that has n_k training samples in federated learning ≈ a randomly selected sample in traditional deep learning.
     • Federated SGD (FedSGD): a single step of gradient descent is done per round.
     • Recall that in federated learning a C-fraction of clients is selected at each round:
       ○ C = 1: full-batch (non-stochastic) gradient descent
       ○ C < 1: stochastic gradient descent (SGD)

  25. A baseline: FederatedSGD (FedSGD)
     Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C = 1.
     • In round t:
       ○ The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.
         ■ Approach 1: each client k submits g_k; the central server aggregates the gradients to generate a new model:
              w_{t+1} ← w_t − η ∇f(w_t) = w_t − η Σ_{k=1}^K (n_k / n) g_k     (recall f(w) = Σ_{k=1}^K (n_k / n) F_k(w))
         ■ Approach 2: each client k computes w_{t+1}^k ← w_t − η g_k; the central server performs the aggregation:
              w_{t+1} ← Σ_{k=1}^K (n_k / n) w_{t+1}^k
           Repeating the local update multiple times before aggregating ⟹ FederatedAveraging (FedAvg).
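A sketch of one FedSGD round showing why the two approaches above are equivalent for a single gradient step (the gradient and size lists are assumed inputs; in practice they come from the clients):

```python
import numpy as np

def fedsgd_round(w_t, client_grads, client_sizes, lr=0.1):
    """One FedSGD round; client_grads[k] = grad F_k(w_t), client_sizes[k] = n_k."""
    n = sum(client_sizes)

    # Approach 1: the server aggregates gradients, then takes one step.
    g = sum((n_k / n) * g_k for g_k, n_k in zip(client_grads, client_sizes))
    w_approach1 = w_t - lr * g

    # Approach 2: each client takes one local step, the server averages the resulting models.
    local_models = [w_t - lr * g_k for g_k in client_grads]
    w_approach2 = sum((n_k / n) * w_k for w_k, n_k in zip(local_models, client_sizes))

    assert np.allclose(w_approach1, w_approach2)  # identical after a single local step
    return w_approach1
```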

  26. Federated learning: dealing with limited communication
     • Increase computation:
       ○ Select more clients for training between each communication round.
       ○ Increase computation on each client.

  27. Federated learning: FederatedAveraging (FedAvg)
     Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C.
     • In round t:
       ○ The central server broadcasts the current model w_t to each selected client.
       ○ Approach 2, run multiple times locally: each client k trains for E local epochs of mini-batch SGD on its own data, starting from w_t and repeatedly updating w_{t+1}^k ← w_{t+1}^k − η g_k with g_k computed on a local mini-batch.
       ○ The central server then performs the aggregation: w_{t+1} ← Σ_{k=1}^K (n_k / n) w_{t+1}^k
       ○ If B is the local mini-batch size, the number of local updates on client k in each round is u_k = E · n_k / B.
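A minimal end-to-end sketch of one FedAvg round as described above. The in-memory `client_data` list, `grad_fn`, and hyperparameter values are illustrative assumptions; a real deployment communicates with devices over the network:

```python
import random
import numpy as np

def fedavg_round(w_t, client_data, grad_fn, C=0.1, E=5, B=32, lr=0.1):
    """One FedAvg round. client_data[k] = (X_k, Y_k); grad_fn(w, X, Y) -> gradient."""
    K = len(client_data)
    m = max(int(C * K), 1)
    selected = random.sample(range(K), m)                 # random C-fraction of the K clients

    local_models, sizes = [], []
    for k in selected:
        X_k, Y_k = client_data[k]
        w = w_t.copy()
        for _ in range(E):                                # E local epochs
            order = np.random.permutation(len(X_k))
            for start in range(0, len(X_k), B):           # mini-batches of size B
                batch = order[start:start + B]
                w = w - lr * grad_fn(w, X_k[batch], Y_k[batch])
        local_models.append(w)
        sizes.append(len(X_k))                            # n_k

    n = sum(sizes)                                        # weight each local model by n_k / n
    return sum((n_k / n) * w_k for w_k, n_k in zip(local_models, sizes))
```

With E = 1 and B equal to the full local dataset size, each client takes a single full-batch step and this reduces to the FedSGD round above; in general each selected client performs u_k = E · n_k / B local updates per round.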

  28. Federated learning: FederatedAveraging (FedAvg)
     Model initialization: two choices:
     • On the central server (all clients start each round from a shared model).
     • On each client (independent initialization).
     Shared initialization works better in practice.
     [Figure: the loss on the full MNIST training set for models generated by mixing two models as θ·w + (1 − θ)·w′.]
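A small sketch of the interpolation experiment behind the figure: evaluate the full-training-set loss of the mixed model θ·w + (1 − θ)·w′ for θ between 0 and 1, where `loss_fn` is an assumed helper that scores a parameter vector on the full MNIST training set:

```python
import numpy as np

def interpolation_curve(w, w_prime, loss_fn, num_points=11):
    """Loss of theta*w + (1 - theta)*w_prime for theta in [0, 1]."""
    thetas = np.linspace(0.0, 1.0, num_points)
    return [(float(t), loss_fn(t * w + (1.0 - t) * w_prime)) for t in thetas]
```

In the paper's experiment, this curve stays low when w and w′ are trained from a shared initialization and degrades for independently initialized models, which motivates starting every client from the same global model in each round.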
