Federated Learning
Min Du, Postdoc, UC Berkeley
Outline
• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems
The goal of deep learning
• Find a function which produces a desired output given a particular input; w is the set of parameters contained by the function.
• Example tasks (given input → desired output):
  – Image classification: an image → "8"
  – Next-word prediction: "Looking forward to your ?" → "reply"
  – Playing Go: the current board position → the next move
Finding the function: model training
• Given one input-output pair (x_1, y_1), the goal of deep learning model training is to find a set of parameters w that maximizes the probability of outputting y_1 given x_1.
• Example: given an input image x_1 of the digit 5, maximize P(5 | x_1, w).
Finding the function: model training
• Given a training dataset containing n input-output pairs (x_i, y_i), i ∈ [1, n], the goal of deep learning model training is to find a set of parameters w such that the average of P(y_i | x_i, w) is maximized.
Finding the function: model training
• That is,
    maximize_w (1/n) Σ_{i=1}^n P(y_i | x_i, w)
  which is equivalent to
    minimize_w (1/n) Σ_{i=1}^n −log P(y_i | x_i, w)
• This gives a basic component of the loss function for a sample (x_i, y_i): ℓ(x_i, y_i, w) = −log P(y_i | x_i, w).
• Let f_i(w) = ℓ(x_i, y_i, w) denote the loss on sample i.
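As a concrete illustration of the per-sample loss above, here is a minimal sketch assuming a hypothetical 4-class classifier; the predicted probabilities are made up for the example:

```python
import numpy as np

# Sketch of the per-sample loss f_i(w) = l(x_i, y_i, w) = -log P(y_i | x_i, w).
probs = np.array([0.05, 0.10, 0.70, 0.15])  # hypothetical predicted P(class | x_i, w)
y_i = 2                                     # true label of this sample
loss_i = -np.log(probs[y_i])                # negative log-likelihood, about 0.357
print(loss_i)
```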
Deep learning model training
• For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:
    min_{w ∈ R^d} f(w),  where f(w) := (1/n) Σ_{i=1}^n f_i(w)
  and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
• No closed-form solution: in a typical deep learning model, w may contain millions of parameters.
• Non-convex: multiple local minima exist. (Figure: a non-convex loss curve f(w) plotted against w.)
Solution: Gradient Descent
• Start from a randomly initialized weight w; repeatedly compute the gradient ∇f(w) and update
    w_{t+1} = w_t − η ∇f(w_t)   (gradient descent)
  where the learning rate η controls the step size.
• At a local minimum, ∇f(w) is close to 0.
• How to stop? When the update is small enough, i.e. the iterates converge:
    ||w_{t+1} − w_t|| ≤ ε  or  ||∇f(w_t)|| ≤ ε
• Problem: usually the number of training samples n is large, so every full-gradient step is expensive → slow convergence.
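A minimal gradient-descent sketch in Python/NumPy, using a toy quadratic loss chosen only for illustration; the stopping test mirrors the two criteria above:

```python
import numpy as np

def f(w):            # toy loss (an assumption for illustration, not a real network)
    return np.sum((w - 3.0) ** 2)

def grad_f(w):       # its gradient
    return 2.0 * (w - 3.0)

w = np.random.randn(5)       # randomly initialized weight
eta, eps = 0.1, 1e-6         # learning rate and stopping threshold
while True:
    w_next = w - eta * grad_f(w)   # w_{t+1} = w_t - eta * grad f(w_t)
    converged = (np.linalg.norm(w_next - w) <= eps
                 or np.linalg.norm(grad_f(w)) <= eps)
    w = w_next
    if converged:
        break
```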
Solution: Stochastic Gradient Descent (SGD)
• At each step, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples (x_j, y_j):
    w_{t+1} ← w_t − η ∇ℓ(w_t; x_j, y_j)
• Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.
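The same idea with mini-batches: a sketch of SGD for a hypothetical linear-regression problem (the data and model are stand-ins, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B, eta = 1000, 10, 32, 0.05            # samples, dimensions, batch size, learning rate
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                   # synthetic targets

w = np.zeros(d)
for step in range(500):
    idx = rng.choice(n, size=B, replace=False)   # random mini-batch (x_j, y_j)
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / B               # gradient of the mean squared loss on the batch
    w -= eta * grad                              # w_{t+1} <- w_t - eta * grad l(w_t; x_j, y_j)
```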
Outline
• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems
The importance of data for ML
"The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data."
– Edd Wilder-James, Harvard Business Review
"Data is the New Oil"
Companies such as Google and Apple train ML models on user data, for example:
• image classification: e.g. to predict which photos are most likely to be viewed multiple times in the future;
• language models: e.g. voice recognition, next-word prediction, and auto-reply in Gmail.
The private data involved: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.
Instead of uploading the raw data, each device trains a model locally and uploads the model; the central server performs model aggregation.
• Addressing privacy: model parameters will never contain more information than the raw training data.
• Addressing network overhead: the size of the model is generally smaller than the size of the raw training data.
Federated optimization
Characteristics (major challenges):
• Non-IID: the data generated by each user are quite different.
• Unbalanced: some users produce significantly more data than others.
• Massively distributed: # mobile device owners >> avg # training samples on each device.
• Limited communication: unstable mobile network connections.
A new paradigm – Federated Learning: a synchronous update scheme that proceeds in rounds of communication.
McMahan, H. Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.
Federated learning – overview
In round i: the central server holds the global model M(i) and sends it to participating client devices; each client trains M(i) on its local data and computes gradient updates for M(i).
Deployed by Google, Apple, etc.
Federated learning – overview
The clients send their updates of M(i) back to the central server, which aggregates them into a new global model M(i+1).
Federated learning – overview
In round i+1, the central server distributes the new global model M(i+1) to the clients, and the process continues.
Federated learning – detail
For efficiency, at the beginning of each round, a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.
Federated learning – detail
Recall traditional deep learning model training:
• For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:
    min_{w ∈ R^d} f(w),  where f(w) := (1/n) Σ_{i=1}^n f_i(w)
  and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
• Deep learning optimization relies on SGD and its variants, applied through mini-batches:
    w_{t+1} ← w_t − η ∇ℓ(w_t; x_j, y_j)
Federated learning – detail
In federated learning:
• Suppose the n training samples are distributed across K clients, where P_k is the set of indices of data points on client k, and n_k = |P_k|.
• The training objective becomes:
    min_{w ∈ R^d} f(w),  where f(w) = Σ_{k=1}^K (n_k / n) F_k(w)  and  F_k(w) = (1/n_k) Σ_{i ∈ P_k} f_i(w)
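A quick numerical check of the decomposition above, using made-up per-sample losses and deliberately unbalanced client sizes (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
client_losses = [rng.random(size) for size in (50, 200, 10)]     # f_i(w) on each client, unbalanced

n = sum(len(l) for l in client_losses)
f_central = np.mean(np.concatenate(client_losses))               # (1/n) * sum_i f_i(w)
f_federated = sum(len(l) / n * l.mean() for l in client_losses)  # sum_k (n_k/n) * F_k(w)
assert np.isclose(f_central, f_federated)
```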
A baseline – FederatedSGD (FedSGD)
• A randomly selected client holding n_k training samples in federated learning ↔ a randomly selected sample in traditional deep learning.
• FederatedSGD (FedSGD): a single step of gradient descent is done per round.
• Recall that in federated learning, a C-fraction of clients is selected at each round:
  – C = 1: full-batch (non-stochastic) gradient descent
  – C < 1: stochastic gradient descent (SGD)
A baseline – FederatedSGD (FedSGD)
Notation: learning rate η; total #samples n; total #clients K; #samples on client k: n_k; client fraction C = 1.
In round t:
• The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.
• Approach 1: each client k submits g_k; the central server aggregates the gradients to generate a new model:
    w_{t+1} ← w_t − η ∇f(w_t) = w_t − η Σ_{k=1}^K (n_k / n) g_k     (recall f(w) = Σ_{k=1}^K (n_k / n) F_k(w))
• Approach 2: each client k computes w_{t+1}^k ← w_t − η g_k; the central server performs the aggregation:
    w_{t+1} ← Σ_{k=1}^K (n_k / n) w_{t+1}^k
• Repeating the local update of Approach 2 multiple times before aggregation ⇒ FederatedAveraging (FedAvg).
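A sketch of one FedSGD round with toy quadratic client objectives (an assumption for illustration), showing that Approach 1 and Approach 2 produce the same w_{t+1} when each client takes a single local step:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, eta = 5, 3, 0.1
n_k = rng.integers(10, 100, size=K)         # samples per client (unbalanced)
n = n_k.sum()
targets = rng.normal(size=(K, d))           # client k's objective: F_k(w) = ||w - targets[k]||^2

def grad_F(k, w):                           # gradient of client k's local objective
    return 2.0 * (w - targets[k])

w_t = rng.normal(size=d)
g = [grad_F(k, w_t) for k in range(K)]      # each client computes g_k = grad F_k(w_t)

# Approach 1: clients send gradients; the server applies the weighted-average gradient.
w_next_1 = w_t - eta * sum(n_k[k] / n * g[k] for k in range(K))

# Approach 2: clients take one local step; the server averages the resulting models.
w_next_2 = sum(n_k[k] / n * (w_t - eta * g[k]) for k in range(K))

assert np.allclose(w_next_1, w_next_2)      # identical for a single local step
```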
Federated learning – dealing with limited communication
Increase computation per round:
• Select more clients for training between each communication round.
• Increase computation on each client.
Federated learning – FederatedAveraging (FedAvg)
Notation: learning rate η; total #samples n; total #clients K; #samples on client k: n_k; client fraction C.
In round t:
• The central server broadcasts the current model w_t to each selected client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.
• Approach 2, applied repeatedly: each client k runs the update w_{t+1}^k ← w_t − η g_k for E local epochs, recomputing the gradient on its local mini-batches.
• The central server performs the aggregation: w_{t+1} ← Σ_{k=1}^K (n_k / n) w_{t+1}^k.
• With local mini-batch size B, the number of local updates on client k in each round is u_k = E · n_k / B.
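A structural sketch of FedAvg with hypothetical linear-regression clients; the client fraction C, local epochs E, and mini-batch size B follow the notation above, but the data, model, and round count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, C, E, B, eta = 10, 5, 0.5, 3, 8, 0.05
w_true = rng.normal(size=d)
client_data = []
for k in range(K):
    n_k = int(rng.integers(20, 200))           # unbalanced client sizes
    X = rng.normal(size=(n_k, d))
    client_data.append((X, X @ w_true))        # (features, targets) on client k

def local_train(w, X, y):                      # E epochs of mini-batch SGD on one client
    w = w.copy()
    for _ in range(E):
        for start in range(0, len(X), B):
            Xb, yb = X[start:start + B], y[start:start + B]
            w -= eta * Xb.T @ (Xb @ w - yb) / len(Xb)
    return w

w_global = np.zeros(d)
for rnd in range(20):                          # communication rounds
    selected = rng.choice(K, size=max(1, int(C * K)), replace=False)
    n_sel = sum(len(client_data[k][0]) for k in selected)
    # weighted average of the locally trained models (weights n_k over the selected clients,
    # a simplification of the n_k / n weighting above)
    w_global = sum(len(client_data[k][0]) / n_sel *
                   local_train(w_global, *client_data[k]) for k in selected)
```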
Federated learning – FederatedAveraging (FedAvg)
Model initialization – two choices:
• on the central server (a shared initialization sent to all clients), or
• on each client (independent initializations).
Shared initialization works better in practice. (Figure: the loss on the full MNIST training set for models generated by averaging two models as θw + (1 − θ)w′, under shared vs. independent initialization.)
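A minimal sketch of the interpolation the figure refers to, blending two parameter vectors as θw + (1 − θ)w′ and evaluating a stand-in loss at each θ (the weights and the loss function are hypothetical):

```python
import numpy as np

def loss(w):                        # stand-in for the loss on the full training set
    return float(np.mean((w - 1.0) ** 2))

rng = np.random.default_rng(0)
w, w_prime = rng.normal(size=100), rng.normal(size=100)    # two locally trained models
for theta in np.linspace(0.0, 1.0, 11):
    blended = theta * w + (1.0 - theta) * w_prime          # theta*w + (1-theta)*w'
    print(f"theta={theta:.1f}  loss={loss(blended):.4f}")
```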