privacy-preserving decentralized learning of personalized models and collaboration graphs
Aurélien Bellet (Inria)
Includes work with: M. Tommasi, P. Vanhaesebrouck (University of Lille & Inria), R. Guerraoui, M. Taziki (EPFL), V. Zantedeschi (University of Saint-Etienne)
Workshop on Optimization for Machine Learning, Centre International de Rencontres Mathématiques, Marseille, March 10, 2020
connected devices: pervasive or invasive?
• Connected devices are spreading rapidly and collect increasingly personal data
• Ex: browsing logs, health, speech, accelerometer, geolocation...
• Opportunity to provide personalized services, but also a potential threat to privacy
• A first step to try and reconcile the two: keep and process data on the user's device
• Training on the edge: train ML models on data from many devices
training on the edge: challenges
• How to scale to a large number of devices
• How to deal with imbalanced and non-i.i.d. local datasets
• How to provide formal privacy guarantees
• ...
federated vs fully decentralized training
Standard federated learning
• Coordination by a central server
• Single point of failure; the server may become a bottleneck
Fully decentralized learning
• Device-to-device communication in a sparse network graph
• Naturally scales to many devices
See [Kairouz et al., 2019] for a detailed overview of federated/decentralized ML
global model vs personalized models
Global model
• One-size-fits-all: the same model makes predictions for all users
• A large model may be needed to capture the specificities of each user
Personalized models
• One model per device
• Each model should be trained on data from that user and from similar users
• Smaller models may be sufficient
our approach
We propose to learn personalized models in a fully decentralized setting:
• Learn "who to communicate with" by inferring a graph of similarities between users
• Collaboratively learn personalized models over this graph
• Jointly optimize the models and the graph, in an alternating fashion
problem formulation
users and local datasets
• A set of n users (devices) with common feature space X and label space Y
• User i has a local dataset S_i = {(x_i^j, y_i^j)}_{j=1}^{m_i} drawn from a personal distribution, and wants to learn a model θ_i ∈ R^p which generalizes well to future local data
• Let ℓ : R^p × X × Y → R be a loss function, differentiable in its first argument
• In isolation, user i can learn a model by minimizing a local objective L_i(θ; S_i), e.g.,
  L_i(θ; S_i) = (1/m_i) Σ_{j=1}^{m_i} ℓ(θ; x_i^j, y_i^j) + λ_i ∥θ∥², with λ_i ≥ 0
• This will generalize poorly when local data is scarce → need to collaborate
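As a concrete illustration, the local objective L_i above fits in a few lines of numpy; here a logistic loss stands in for the generic ℓ, with labels in {−1, +1} (the loss choice and all names are illustrative, not the talk's):

```python
import numpy as np

def local_objective(theta, X, y, lam):
    """L_i(theta; S_i): average loss over the local dataset
    plus L2 regularization with lambda_i = lam >= 0."""
    margins = y * (X @ theta)              # y_j in {-1, +1}
    losses = np.log1p(np.exp(-margins))    # logistic loss per data point
    return losses.mean() + lam * theta @ theta

# A user with only m_i = 5 points can minimize this alone,
# but the resulting model will generalize poorly.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # m_i = 5 points, p = 3
y = np.sign(rng.normal(size=5))
print(local_objective(np.zeros(3), X, y, lam=0.1))  # log(2): all margins are 0
```

The last line illustrates the motivation for collaboration: with scarce local data the gap between this training objective and future local error is large.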
decentralized setting
• Asynchronous time model: each user becomes active at random times, asynchronously and in parallel (we use a global counter t to denote the t-th activation)
• Communication model: all users can exchange messages, but we want to restrict communication to pairs of most similar users
• We model this by a collaboration graph: a sparse weighted graph with edge weights w_ij ≥ 0 reflecting the similarity between the learning tasks of users i and j
joint optimization problem
• Learn personalized models Θ ∈ R^{n×p} and graph weights w ∈ R_{≥0}^{n(n−1)/2} as solutions to
  min_{Θ ∈ R^{n×p}, w ∈ R_{≥0}^{n(n−1)/2}} J(Θ, w) = Σ_{i=1}^n d_i c_i L_i(θ_i; S_i) + µ Σ_{i<j} w_ij ∥θ_i − θ_j∥² + λ g(w)
• c_i ∈ (0, 1] ∝ m_i: "confidence" of user i; d_i = Σ_{j≠i} w_ij: degree of i
• Trade-off between accurate models on local data and smooth models over the graph
• Term g(w): avoid a trivial collaboration graph, encourage sparsity
• Flexible relationships: the hyperparameter µ ≥ 0 interpolates between learning purely local models and a shared model per connected component
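The objective J(Θ, w) can be sketched directly in numpy; here w is stored as a symmetric (n, n) matrix W with zero diagonal, and the halved double sum below recovers the sum over pairs i < j (a sketch with illustrative names, not the talk's code):

```python
import numpy as np

def joint_objective(Theta, W, local_losses, c, mu, lam, g):
    """J(Theta, w): confidence- and degree-weighted local losses,
    plus the graph-smoothness term, plus the regularizer g(w).
    d_i = W[i].sum() since W is symmetric with zero diagonal."""
    n = Theta.shape[0]
    d = W.sum(axis=1)
    fit = sum(d[i] * c[i] * local_losses[i](Theta[i]) for i in range(n))
    diffs = Theta[:, None, :] - Theta[None, :, :]        # theta_i - theta_j
    smooth = 0.5 * (W * (diffs ** 2).sum(axis=2)).sum()  # sum_{i<j} w_ij ||.||^2
    return fit + mu * smooth + lam * g(W)
```

When all models agree the smoothness term vanishes, which is exactly the behavior µ trades off against local accuracy.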
outline of the proposed algorithm
We design an alternating optimization procedure over Θ and w:
1. A decentralized algorithm to learn the models given the graph
2. A decentralized algorithm to learn the graph given the models
learning models given the graph
properties of the objective function
• For fixed graph weights w, denote f(Θ) := J(Θ, w)
• Assume the local loss L_i has an L_i^loc-Lipschitz continuous gradient
• Then ∇f is L_i-Lipschitz w.r.t. block θ_i, with L_i = d_i(µ + c_i L_i^loc)
• Can also assume that L_i is σ^loc-strongly convex, with σ^loc > 0
• Then f is σ-strongly convex, with σ ≥ min_{1≤i≤n} [d_i c_i σ^loc] > 0
decentralized algorithm
• Denote the neighborhood of user i by N_i = {j : w_ij > 0}
• Initialize models Θ(0) ∈ R^{n×p}
• At step t ≥ 0, a random user i becomes active:
  1. user i updates its model based on its local dataset S_i and the information received from its neighbors:
     θ_i(t+1) = θ_i(t) − (1/(µ + c_i L_i^loc)) [ c_i ∇L_i(θ_i(t); S_i) − µ ( (1/d_i) Σ_{j∈N_i} w_ij θ_j(t) − θ_i(t) ) ]
  2. user i sends its updated model θ_i(t+1) to its neighborhood N_i
• This is an instance of block coordinate descent!
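The per-activation update (as reconstructed above) is short enough to sketch; the function names and data layout are illustrative assumptions, with W again a symmetric weight matrix:

```python
import numpy as np

def peer_update(i, Theta, W, grad_Li, c, L_loc, mu):
    """One asynchronous step: active user i takes a gradient step
    on its local loss, pulled toward the w_ij-weighted average of
    its neighbors' models. Returns the new theta_i, which user i
    then sends to its neighborhood N_i."""
    d_i = W[i].sum()
    neighbor_avg = (W[i] @ Theta) / d_i    # (1/d_i) sum_j w_ij theta_j(t)
    step = 1.0 / (mu + c[i] * L_loc[i])
    return Theta[i] - step * (c[i] * grad_Li(Theta[i])
                              - mu * (neighbor_avg - Theta[i]))
```

Note the fixed-point behavior expected of block coordinate descent: if all models already agree and the local gradient is zero, the update leaves θ_i unchanged.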
convergence rate
Proposition ([Bellet et al., 2018])
For any T > 0, let (Θ(t))_{t=1}^T be the sequence of iterates generated by the algorithm running for T iterations from an initial point Θ(0). When the local losses L_i are strongly convex, we have:
  E[f(Θ(T)) − f⋆] ≤ (1 − σ/(nL_max))^T (f(Θ(0)) − f⋆),
where L_max = max_i L_i and σ are the smoothness and strong convexity parameters.
• Constant number of per-user updates → optimality gap roughly constant in n
• Makes the algorithm naturally scalable to many users
what about privacy?
• In some applications, data may be sensitive and users may not want to reveal it
• In our algorithms, users never communicate their local data, but they exchange sequences of models computed from that data
• Consider an adversary observing all the information sent over the network (but not the internal memory of users)
• Goal: formally quantify how much information is leaked about the local datasets
differential privacy
ϵ-Differential Privacy [Dwork, 2006]
Let M be a randomized mechanism taking a dataset as input, and let ϵ > 0. We say that M is ϵ-differentially private if for all datasets S, S′ differing in a single data point and for all sets of possible outputs O ⊆ range(M), we have:
  Pr(M(S) ∈ O) ≤ e^ϵ Pr(M(S′) ∈ O).
• Output of M is almost the same regardless of whether a particular data point was used
• Information-theoretic (no computational assumptions)
• Robust to any background knowledge the adversary may have
• Composition property: the combined output of two ϵ-DP mechanisms run on the same dataset is 2ϵ-DP
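The classical way to achieve ϵ-DP for a numeric query is the Laplace mechanism, sketched below (the example query and names are illustrative):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng):
    """Standard eps-DP Laplace mechanism: release the query answer
    plus Laplace noise of scale sensitivity / eps. By the
    composition property, T such releases on the same dataset
    together satisfy (T * eps)-DP."""
    scale = sensitivity / eps
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# Example: the mean of m values in [0, 1] has L1-sensitivity 1/m,
# since changing one point moves the mean by at most 1/m.
rng = np.random.default_rng(0)
data = rng.uniform(size=100)
noisy_mean = laplace_mechanism(data.mean(), 1 / len(data), eps=0.5, rng=rng)
```

Smaller ϵ means stronger privacy but a larger noise scale; this is the knob behind the privacy/utility trade-off discussed later in the talk.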
differentially private algorithm
1. Replace the update of the algorithm by
   θ̃_i(t+1) = θ̃_i(t) − (1/(µ + c_i L_i^loc)) [ c_i ( ∇L_i(θ̃_i(t); S_i) + η_i ) − µ ( (1/d_i) Σ_{j∈N_i} w_ij θ̃_j(t) − θ̃_i(t) ) ]
   where η_i ∼ Laplace(0, s_i)^p ∈ R^p
2. User i then broadcasts the noisy iterate θ̃_i(t+1) to its neighbors
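The private update only differs from the non-private one by the Laplace perturbation of the local gradient; a sketch (same illustrative layout as before, names are assumptions):

```python
import numpy as np

def private_peer_update(i, Theta, W, grad_Li, c, L_loc, mu, s_i, rng):
    """DP variant of the model update: user i draws
    eta_i ~ Laplace(0, s_i)^p and perturbs its local gradient
    before the step, so the iterate it broadcasts is a noisy
    function of its dataset S_i."""
    p = Theta.shape[1]
    eta = rng.laplace(0.0, s_i, size=p)    # eta_i ~ Laplace(0, s_i)^p
    d_i = W[i].sum()
    neighbor_avg = (W[i] @ Theta) / d_i
    step = 1.0 / (mu + c[i] * L_loc[i])
    return Theta[i] - step * (c[i] * (grad_Li(Theta[i]) + eta)
                              - mu * (neighbor_avg - Theta[i]))
```

Only the broadcast iterate changes; neighbors run exactly the same aggregation as before, which is why the privacy analysis can be done per user.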
privacy guarantee
Theorem ([Bellet et al., 2018])
Let i ∈ [n] and assume:
• ℓ(·; x, y) is L_0-Lipschitz w.r.t. the L_1-norm for all (x, y) ∈ X × Y
• user i wakes up T_i times and uses noise scale s_i = 2L_0/(m_i ϵ_i)
• the mechanism M_i(S_i) releases the sequence of user i's models
Then, for any Θ̃(0) independent of S_i, M_i(S_i) is ϵ̄_i-DP with ϵ̄_i = T_i ϵ_i.
• Follows from a sensitivity analysis of the update
• Can be improved by strong composition [Kairouz et al., 2015] (under relaxed DP)
privacy/utility trade-off
Theorem ([Bellet et al., 2018])
For any T > 0, let (Θ̃(t))_{t=1}^T be the sequence of iterates generated by T iterations. For σ-strongly convex f, we have:
  E[f(Θ̃(T)) − f⋆] ≤ (1 − σ/(nL_max))^T (f(Θ̃(0)) − f⋆) + (1/(nL_min)) Σ_{t=0}^{T−1} (1 − σ/(nL_max))^t Σ_{i=1}^n [d_i c_i s_i(t)]²,
where L_min = min_{1≤i≤n} L_i.
• T rules a trade-off between the optimization error and the noise error
• Users with less data add more noise, but their contribution to the error is smaller
• A good (differentially private) warm start can help a lot
• See the paper for details on the warm start strategy and how to scale the noise across iterations
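To see how T rules the trade-off, one can evaluate the two terms of the bound numerically; the sketch below assumes, purely for illustration, a constant per-iteration noise contribution noise_per_iter = Σ_i [d_i c_i s_i]²:

```python
import numpy as np

def bound_terms(T, init_gap, sigma, n, L_max, L_min, noise_per_iter):
    """Evaluate the two terms of the utility bound: geometric decay
    of the initial optimality gap, and the accumulated noise term
    (assumed constant across iterations for simplicity)."""
    rho = 1 - sigma / (n * L_max)
    opt_err = rho ** T * init_gap
    noise_err = noise_per_iter / (n * L_min) * sum(rho ** t for t in range(T))
    return opt_err, noise_err
```

Running it for increasing T shows the first term shrinking geometrically while the second grows, so for a fixed per-step privacy budget there is a sweet spot in the total number of iterations.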
extension: personalized l1-adaboost
• Consider a set of base models H = {h_k : X → R}_{k=1}^K (e.g., pre-trained on proxy data)
• Find personalized ensembles α_1, ..., α_n ∈ R^K as solutions to:
  min_{∥θ_1∥_1 ≤ β, ..., ∥θ_n∥_1 ≤ β, w ∈ R_{≥0}^{n(n−1)/2}} Σ_{i=1}^n d_i c_i log( Σ_{j=1}^{m_i} exp(−(A_i θ_i)_j) ) + µ Σ_{i<j} w_ij ∥θ_i − θ_j∥² + λ g(w)
• A_i ∈ R^{m_i×K}: margins of the base models on each data point of user i
• Use block coordinate Frank-Wolfe → communication cost logarithmic in K
• More details in [Zantedeschi et al., 2020]
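The reason Frank-Wolfe gives logarithmic communication in K is that a step over the L1 ball moves toward a single vertex, so only one coordinate index (log2(K) bits) needs to be sent. A minimal sketch of such a step, with illustrative names:

```python
import numpy as np

def l1_frank_wolfe_step(theta, grad, beta, t):
    """One Frank-Wolfe step over the L1 ball {||theta||_1 <= beta}:
    the linear subproblem is minimized at a single vertex, namely
    +/- beta on the coordinate with the largest |gradient|, so each
    update touches (and communicates) only one base model index."""
    k = int(np.argmax(np.abs(grad)))
    vertex = np.zeros_like(theta)
    vertex[k] = -beta * np.sign(grad[k])   # vertex of the L1 ball
    gamma = 2.0 / (t + 2.0)                # standard FW step size
    return (1.0 - gamma) * theta + gamma * vertex
```

Every iterate stays inside the L1 ball by construction (a convex combination of feasible points), so the sparsity constraint never needs a projection step.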
learning the graph given models