Distributed Learning for Cooperative Inference. César A. Uribe. Collaboration with: Alex Olshevsky and Angelia Nedić. LCCC - Focus Period on Large-Scale and Distributed Optimization, June 5th, 2017.
Distributed (consensus-based) Learning (optimization) for Cooperative Inference (statistical estimation)
The three components of estimation

Data: X ∼ P* is a random variable with sample space (X, 𝒳); P* is unknown.

Model:
◮ 𝒫, a collection of probability measures P : 𝒳 → [0, 1].
◮ Parametrized by Θ: there is an injective map Θ → 𝒫, θ ↦ P_θ.
◮ Dominated: there exists λ such that P_θ ≪ λ, with density p_θ = dP_θ/dλ.

(Point) Estimator: a map P̂ : X → 𝒫, the best guess P̂ ∈ 𝒫 for P* based on X, e.g. the maximum-likelihood estimator

θ̂(X) = arg sup_{θ∈Θ} p_θ(X).
Bayesian Methods

The parameter is a random variable ϑ taking values in (Θ, 𝒯). There is a probability measure Π : 𝓕 → [0, 1] on X × Θ, with 𝓕 = σ(𝒳 × 𝒯).

Model: the distribution of X conditioned on ϑ, Π_{X|ϑ}.
Prior: the marginal of Π on ϑ, Π : 𝒯 → [0, 1].
Posterior: the distribution Π_{ϑ|X} : 𝒯 × X → [0, 1]. In particular,

Π(ϑ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ).
From the posterior one can construct the MAP or MMSE estimators:

θ̂_MAP(X) = argmax_{θ∈Θ} Π(θ | X),
θ̂_MMSE(X) = ∫_Θ θ dΠ(θ | X).
The Belief Notation

We are interested in computing posterior distributions. Thus, let us define the belief density on a hypothesis θ ∈ Θ at time k as

dμ_k(θ) = dΠ(θ | X_1, …, X_k)
        ∝ ∏_{i=1}^k p_θ(X_i) dΠ(θ)
        = p_θ(X_k) dμ_{k−1}(θ).

This defines an iterative algorithm:

dμ_{k+1}(θ) ∝ p_θ(x_{k+1}) dμ_k(θ).

We say that we learn a parameter θ* if

lim_{k→∞} μ_k(θ*) = 1 a.s. (usually).

We hope that P_{θ*} is the closest to P* (in a sense defined later).
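The recursive update dμ_{k+1}(θ) ∝ p_θ(x_{k+1}) dμ_k(θ) can be sketched for a finite hypothesis set. The Bernoulli hypotheses and the toy data below are illustrative assumptions, not part of the slides:

```python
def bayes_update(belief, likelihoods, x):
    """One step of d_mu_{k+1}(theta) ∝ p_theta(x_{k+1}) d_mu_k(theta)."""
    post = [b * lik(x) for b, lik in zip(belief, likelihoods)]
    z = sum(post)                      # normalization constant
    return [p / z for p in post]

# Hypotheses: Bernoulli(q) for q in {0.2, 0.5, 0.8}; the data favor q = 0.8.
qs = [0.2, 0.5, 0.8]
liks = [lambda x, q=q: q if x == 1 else 1 - q for q in qs]

belief = [1 / 3] * 3                   # uniform prior d_mu_0
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
for x in data:
    belief = bayes_update(belief, liks, x)

print(belief)  # mass concentrates on the hypothesis q = 0.8
```

Because the update only multiplies by the likelihood and renormalizes, the order of the observations does not matter, matching the batch product form above.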
Example: Estimating the Mean of a Gaussian Model

Data: we receive samples x_1, …, x_k, where X_k ∼ N(θ*, σ²); σ² is known and we want to estimate θ*.
Model: the collection of all normal distributions with variance σ², i.e. P_θ = N(θ, σ²).
Prior: the standard normal distribution, dμ_0(θ) = N(0, 1).
Posterior:

dμ_k(θ) ∝ dμ_0(θ) ∏_{t=1}^k p_θ(x_t) = N( (∑_{t=1}^k x_t)/(σ² + k), σ²/(σ² + k) ).
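A quick numerical check of this conjugate closed form: run the one-sample-at-a-time precision-weighted update and compare it against the batch formula N(∑x_t/(σ²+k), σ²/(σ²+k)). The sample size and parameter values are illustrative:

```python
import random

random.seed(0)
theta_star, sigma2, k = 2.0, 1.5, 200
xs = [random.gauss(theta_star, sigma2 ** 0.5) for _ in range(k)]

# Sequential conjugate update on (precision tau, mean m) of the belief.
tau, m = 1.0, 0.0                     # standard-normal prior N(0, 1)
for x in xs:
    m = (tau * m + x / sigma2) / (tau + 1 / sigma2)
    tau = tau + 1 / sigma2

# Closed form from the slide: N( sum(x)/(sigma^2 + k), sigma^2/(sigma^2 + k) )
mean_k = sum(xs) / (sigma2 + k)
var_k = sigma2 / (sigma2 + k)
print(m, 1 / tau)                     # matches (mean_k, var_k)
```

The sequential and batch forms agree because the posterior precision grows by 1/σ² per sample, so after k samples the variance is σ²/(σ²+k) and the mean is the shrunk sample sum.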
[Figure: evolution of the belief density dμ_k(·) over Θ for k = 0, 1, …, 6; starting from a prior centered near θ_0, the belief concentrates around θ*.]
Geometric Interpretation for Finite Hypotheses

[Figure: trajectory of the beliefs dμ_0, dμ_1, … in the (mean, variance) plane, moving toward θ*.]
Bayes’ Theorem Belongs to Stochastic Approximations
Consider the following optimization problem:

min_{θ∈Θ} F(θ) = D_KL(P ‖ P_θ).   (1)

We can rewrite Eq. (1) over the simplex of distributions Δ_Θ:

min_{θ∈Θ} D_KL(P ‖ P_θ) = min_{π∈Δ_Θ} E_π[ D_KL(P ‖ P_θ) ], θ ∼ π
                        = min_{π∈Δ_Θ} E_π E_P[ −log (dP_θ/dP) ].

Moreover,

argmin_{θ∈Θ} D_KL(P ‖ P_θ) = argmin_{π∈Δ_Θ} E_π E_P[ −log p_θ(X) ], θ ∼ π, X ∼ P
                           = argmin_{π∈Δ_Θ} E_P E_π[ −log p_θ(X) ], θ ∼ π, X ∼ P.
Consider a stochastic optimization problem

min_{x∈Z} E[ F(x, Ξ) ].

Stochastic mirror descent constructs a sequence {x_k} as

x_{k+1} = argmin_{x∈Z} { ⟨∇F(x_k, ξ_k), x⟩ + (1/α_k) D_w(x, x_k) },

where D_w is the Bregman divergence induced by a distance-generating function w. Recall our original problem:

min_{π∈Δ_Θ} E_P E_π[ −log p_θ(X) ], θ ∼ π, X ∼ P.   (2)

For Eq. (2), stochastic mirror descent generates a sequence of densities {dμ_k} as follows:

dμ_{k+1} = argmin_{π∈Δ_Θ} { ⟨−log p_θ(x_{k+1}), π⟩ + (1/α_k) D_w(π, dμ_k) }, θ ∼ π.   (3)
Choose w(x) = x log x; the corresponding Bregman distance is the Kullback-Leibler (KL) divergence D_KL. Additionally selecting α_k = 1 gives

dμ_{k+1} = argmin_{π∈Δ_Θ} { ⟨−log p_θ(x_{k+1}), π⟩ + D_KL(π ‖ dμ_k) }, θ ∼ π,

whose solution, for each θ ∈ Θ, is

dμ_{k+1}(θ) ∝ p_θ(x_{k+1}) dμ_k(θ)   (the Bayesian posterior).
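The claim that the entropic mirror-descent step solves to the Bayes update can be checked numerically for two hypotheses: minimize ⟨−log p_θ(x), π⟩ + D_KL(π ‖ μ) over the 2-simplex by grid search and compare with μ_{k+1}(θ) ∝ μ_k(θ)p_θ(x). The likelihood values and prior are toy numbers:

```python
import math

# Two hypotheses with likelihood values p1(x), p2(x) at the observed x.
p1, p2 = 0.7, 0.2
mu = [0.4, 0.6]                       # current belief d_mu_k

def objective(t):
    """Mirror-descent objective at pi = (t, 1 - t) with alpha_k = 1."""
    loss = -t * math.log(p1) - (1 - t) * math.log(p2)
    kl = t * math.log(t / mu[0]) + (1 - t) * math.log((1 - t) / mu[1])
    return loss + kl

# Minimize numerically over the interior of the simplex.
grid = [i / 10000 for i in range(1, 10000)]
t_star = min(grid, key=objective)

# Bayes posterior: mu_{k+1}(theta_1) ∝ mu_k(theta_1) p1(x)
bayes = mu[0] * p1 / (mu[0] * p1 + mu[1] * p2)
print(t_star, bayes)   # the two coincide up to the grid resolution
```

This is exactly the exponentiated-gradient / multiplicative-weights step: the KL regularizer turns the linearized loss into a multiplicative reweighting of the current belief.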
Distributed Inference Setup
◮ n agents: V = {1, 2, …, n}.
◮ Agent i observes X^i_k : Ω → X^i, with X^i_k ∼ P^i.
◮ Agent i has a model of P^i: 𝒫^i = {P^i_θ | θ ∈ Θ}.
◮ Agent i holds a local belief density dμ^i_k(θ).
◮ Agents share beliefs over the network (connected, fixed, undirected).
◮ a_ij ∈ (0, 1) is the weight agent i gives to agent j's information, with ∑_j a_ij = 1.

The agents want to collectively solve the optimization problem

min_{θ∈Θ} F(θ) ≜ D_KL(P ‖ P_θ) = ∑_{i=1}^n D_KL(P^i ‖ P^i_θ).   (4)

Consensus learning: μ^i_∞(θ*) = 1 for all i.
Our approach

Include the beliefs of the other agents in the regularization term: distributed stochastic entropic mirror descent,

dμ^i_{k+1} = argmin_{π∈Δ_Θ} { ∑_{j=1}^n a_ij D_KL(π ‖ dμ^j_k) − E_π[ log p^i_θ(x^i_{k+1}) ] },

whose solution is

dμ^i_{k+1}(θ) ∝ p^i_θ(x^i_{k+1}) ∏_{j=1}^n ( dμ^j_k(θ) )^{a_ij}.   (5)

Q1. Does (5) achieve consensus learning?
Q2. If so, at what rate does this happen?
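A minimal simulation of the update (5) on a finite hypothesis set, not the talk's experimental setup: the ring network, uniform weights, and Bernoulli likelihoods are illustrative assumptions. Each agent geometrically averages its neighbors' beliefs (in log space) and multiplies by its local likelihood:

```python
import math, random

random.seed(1)
n = 4                                  # agents on a ring
A = [[0.0] * n for _ in range(n)]      # uniform weights on self + two neighbors
for i in range(n):
    for j in (i, (i - 1) % n, (i + 1) % n):
        A[i][j] = 1 / 3

# Finite hypotheses: Bernoulli parameters; the true parameter is 0.8.
thetas = [0.2, 0.5, 0.8]
def loglik(theta, x):
    return math.log(theta if x == 1 else 1 - theta)

beliefs = [[1 / 3] * len(thetas) for _ in range(n)]
for k in range(200):
    xs = [1 if random.random() < 0.8 else 0 for _ in range(n)]
    new = []
    for i in range(n):
        # Log-linear consensus plus local Bayesian update, Eq. (5).
        logb = [sum(A[i][j] * math.log(beliefs[j][t]) for j in range(n))
                + loglik(thetas[t], xs[i]) for t in range(len(thetas))]
        mx = max(logb)                 # normalize in log space for stability
        w = [math.exp(v - mx) for v in logb]
        z = sum(w)
        new.append([v / z for v in w])
    beliefs = new

print([b[2] for b in beliefs])  # each agent's belief on theta* = 0.8
```

After a few hundred observations every agent's belief concentrates on θ* = 0.8, illustrating the consensus-learning behavior that Q1 asks about.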
A finite set Θ

There is extensive literature for finite parameter sets Θ:
◮ static or time-varying networks;
◮ directed or undirected networks;
◮ consistency proofs for the algorithms;
◮ asymptotic and non-asymptotic convergence rates.

Shahrampour, Rahimian, Jadbabaie, Lalitha, Sarwate, Javidi, Su, Vaidya, Qipeng, Bandyopadhyay, Sahu, Kar, Sayed, Chazelle, Olshevsky, Nedić, U.
Geometric Interpretation for Finite Hypotheses

[Figure: the candidate distributions P_θ1, P_θ2, P_θ3 and the true distribution P in the space of probability measures.]
Distributed Source Localization

[Figure: (a) network of agents and the source on the plane, x- and y-positions in [−10, 10]; (b) a 3 × 3 grid of hypothesis distributions θ_1, …, θ_9 covering the search area.]
Our results for three different problems

1. Time-varying undirected graphs (Nedić, Olshevsky, U., to appear in TAC):
◮ A_k is doubly stochastic, with [A_k]_ij > 0 if (i, j) ∈ E_k.

2. Time-varying directed graphs (Nedić, Olshevsky, U., ACC 2016):
◮ [A_k]_ij = 1/d^j_k if j ∈ N^i_k, and 0 otherwise,
where d^i_k is the out-degree of node i at time k and N^i_k is the set of in-neighbors of node i.

3. Acceleration on static graphs (Nedić, Olshevsky, U., to appear in TAC):
◮ A_ij = 1/max{d_i, d_j} if (i, j) ∈ E, and 0 if (i, j) ∉ E, with d_i the degree of node i, and

Ā = (1/2) I + (1/2) A.
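The static-graph weights of item 3 can be sketched concretely. The slide defines A only on edges; putting the leftover mass on the diagonal so each row sums to one is the standard Metropolis convention and is an assumption here, as is the small path graph:

```python
# Metropolis-style weights A_ij = 1/max{d_i, d_j} on edges of a path graph,
# lazified as A_bar = I/2 + A/2 (item 3 on the slide).
n = 5
edges = [(i, i + 1) for i in range(n - 1)]       # path graph on n nodes
deg = [0] * n
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    w = 1 / max(deg[i], deg[j])
    A[i][j] = A[j][i] = w
for i in range(n):
    A[i][i] = 1 - sum(A[i])          # leftover mass on the diagonal (assumption)

A_bar = [[(1.0 if i == j else 0.0) / 2 + A[i][j] / 2 for j in range(n)]
         for i in range(n)]

# A_bar is symmetric and doubly stochastic: rows and columns sum to one.
print([round(sum(row), 10) for row in A_bar])
print([round(sum(A_bar[i][j] for i in range(n)), 10) for j in range(n)])
```

Symmetry plus stochastic rows gives double stochasticity for free, which is the property the consensus analysis of these algorithms relies on.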
Time-varying undirected:
μ^i_{k+1}(θ) ∝ ∏_{j=1}^n μ^j_k(θ)^{[A_k]_ij} · p^i_θ(x^i_{k+1})

Fixed undirected (accelerated):
μ^i_{k+1}(θ) ∝ [ ∏_{j=1}^n μ^j_k(θ)^{(1+σ)Ā_ij} · p^i_θ(x^i_{k+1}) ] / ∏_{j=1}^n ( μ^j_{k−1}(θ) p^j_θ(x^j_k) )^{σ Ā_ij}

Time-varying directed (push-sum):
y^i_{k+1} = ∑_{j∈N^i_k} y^j_k / d^j_k,
μ^i_{k+1}(θ) ∝ [ ∏_{j∈N^i_k} μ^j_k(θ)^{y^j_k / d^j_k} · p^i_θ(x^i_{k+1}) ]^{1/y^i_{k+1}}
General form of the theorems

Under appropriate assumptions, for a group of agents following algorithm X there is a time N(n, λ, ρ) such that, with probability 1 − ρ, for all k ≥ N(n, λ, ρ) and all θ ∉ Θ*,

μ^i_k(θ) ≤ exp(−kγ_2 + γ_1) for all i = 1, …, n.
After a time N(n, λ, ρ), with probability 1 − ρ, for all k ≥ N(n, λ, ρ) and all θ ∉ Θ*,

μ^i_{k+1}(θ) ≤ exp(−kγ_2 + γ_1) for all i = 1, …, n.

Graph                     | N            | γ_1              | γ_2  | δ
Time-varying undirected   | O(log 1/ρ)   | O((n²/η) log n)  | O(1) |
 + Metropolis             | O(log 1/ρ)   | O(n² log n)      | O(1) |
Time-varying directed     | O(log 1/ρ)   | O(n^n log n)     | O(1) | δ ≥ 1/n^n
 + regular                | O(log 1/ρ)   | O(n³ log n)      | O(1) |
Fixed undirected          | O(log 1/ρ)   | O(n log n)       | O(1) |
Figure: empirical mean, over 50 Monte Carlo runs, of the number of iterations required for μ^i_k(θ) < ε for all agents and all θ ∉ Θ*, as a function of the number of nodes: (a) path graph, (b) circle graph, (c) grid graph. All agents but one have all their hypotheses observationally equivalent. Dotted line: the algorithm proposed by Jadbabaie et al.; dashed line: no acceleration; solid line: acceleration.
A particularly bad graph