

  1. An asymptotic analysis of nonparametric divide-and-conquer methods Botond Szabó and Harry van Zanten van Dantzig seminar, Delft, 06. 04. 2017.

  2. Table of contents 1 Motivation 2 Distributed methods: examples and counterexamples Kernel density estimation Gaussian white noise model Data-driven distributed methods 3 Distributed methods: fundamental limits Communication constraints Data-driven methods with limited communication 4 Summary, ongoing work

  3. Distributed methods

  4. Applications • Volunteer computing (NASA, CERN, SETI,... projects) • Massive multiplayer online games (peer network) • Aircraft control systems • Meteorology, Astronomy • Medical data from different hospitals

  5. Distributed setting

  6–7. Distributed setting II Interested in high-dimensional and nonparametric models.
  • Methods have tuning, regularity, sparsity and bandwidth hyperparameters that must be adjusted for the optimal bias-variance trade-off. How does this work in distributed settings?
  • Several approaches in the literature (Consensus MC, WASP, Fast-KRR, Distributed GP, ...)
  • Limited theoretical underpinning
  • No unified framework to compare methods
  • Statistical models for illustration: • kernel density estimation, • Gaussian white noise model, • random design nonparametric regression.

  8. Kernel density estimation I
  • Model: observe $X_1, \dots, X_n$ iid $\sim f_0$ with $f_0 \in H^\beta(L)$.
  • Distributed setting: distribute the data randomly over $m$ machines.
  • Method:
  • Local machines: kernel density estimation on each machine's $n/m$ observations,
  $$\hat f_h^{(i)}(x) = \frac{1}{h\, n/m} \sum_{j=1}^{n/m} K\Big(\frac{x - X_j^{(i)}}{h}\Big).$$
  • Central machine: average the local estimators,
  $$\hat f_h(x) = \frac{1}{m} \sum_{i=1}^{m} \hat f_h^{(i)}(x).$$
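The split-estimate-average scheme above can be sketched as follows; the Gaussian kernel, the equal-sized split and the helper names are illustrative assumptions:

```python
import numpy as np

def local_kde(data, h, x):
    """Kernel density estimate (Gaussian kernel) of one machine's data at points x."""
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (h * len(data))

def distributed_kde(data, m, h, x):
    """Split the data over m machines, estimate locally, average centrally."""
    chunks = np.array_split(data, m)
    return np.mean([local_kde(chunk, h, x) for chunk in chunks], axis=0)

rng = np.random.default_rng(0)
n, m = 10_000, 20
sample = rng.normal(size=n)      # f_0: standard normal density
x = np.linspace(-3.0, 3.0, 61)
h = n ** (-1 / 5)                # bandwidth at the global rate for beta = 2
est = distributed_kde(sample, m, h, x)
```

Note that with a common bandwidth and equal split sizes the averaged estimator coincides with the pooled kernel estimator; the difficulty on the following slides is that a bandwidth tuned locally sees only $n/m$ observations.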

  9–12. Kernel density estimation II Problem: the choice of the bandwidth parameter $h$:
  • Local bias-variance trade-off: $|f_0(x) - E_{f_0} \hat f_h^{(i)}(x)| \lesssim h^\beta$ and $\mathrm{Var}_{f_0}\, \hat f_h^{(i)}(x) \asymp \frac{m}{hn}$; optimal bandwidth $h = (n/m)^{-1/(1+2\beta)}$.
  • Global bias-variance trade-off: $|f_0(x) - E_{f_0} \hat f_h(x)| \lesssim h^\beta$ and $\mathrm{Var}_{f_0}\, \hat f_h(x) \asymp \frac{1}{hn}$; optimal bandwidth $h = n^{-1/(1+2\beta)}$.
  • The local bias-variance trade-off results in too large a bias for $\hat f_h$: oversmoothing.
  • In practice $\beta$ is unknown: distributed data-driven methods?
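Plugging numbers into the two bandwidth rules makes the oversmoothing factor explicit; the values of n, m and beta below are arbitrary choices for illustration:

```python
n, m, beta = 10_000, 100, 2.0

h_local = (n / m) ** (-1 / (1 + 2 * beta))   # tuned for the n/m local observations
h_global = n ** (-1 / (1 + 2 * beta))        # tuned for all n observations

# The locally tuned bandwidth is larger by the factor m^(1/(1+2*beta)),
# so the squared bias h^(2*beta) of the averaged estimator is inflated
# by m^(2*beta/(1+2*beta)) relative to the globally tuned one.
ratio = h_local / h_global                   # equals m ** (1 / (1 + 2*beta))
```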

  13–14. Gaussian white noise model Single observer:
  $$dY_t = f_0(t)\, dt + \frac{1}{\sqrt{n}}\, dW_t, \quad t \in [0, 1].$$
  Distributed case: $m$ observers,
  $$dY_t^{(i)} = f_0(t)\, dt + \sqrt{\frac{m}{n}}\, dW_t^{(i)}, \quad t \in [0, 1],\ i \in \{1, \dots, m\},$$
  where the $W_t^{(i)}$ are independent Brownian motions. Assumption: $f_0 \in S^\beta(L)$ for some $\beta > 0$.

  15. Distributed Bayesian approach
  • Endow $f_0$ in each local problem with a GP prior of the form
  $$f \mid \alpha \sim \sum_{j=1}^{\infty} j^{-1/2-\alpha} Z_j \varphi_j,$$
  where the $Z_j$ are iid $N(0, 1)$ and $(\varphi_j)_j$ is the Fourier basis.
  • Compute the posterior (or a modification of it) locally.
  • Aggregate the local posteriors into a global one.
  • Can we get optimal recovery and reliable uncertainty quantification?
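A draw from this series prior can be sketched by truncating the sum; the sine basis and the truncation level J are illustrative assumptions:

```python
import numpy as np

def draw_prior(alpha, J=500, t=None, rng=None):
    """Draw f = sum_j j^(-1/2 - alpha) Z_j phi_j, truncated at J terms,
    with phi_j(t) = sqrt(2) sin(pi j t) on [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    if t is None:
        t = np.linspace(0.0, 1.0, 200)
    j = np.arange(1, J + 1)
    Z = rng.standard_normal(J)                          # iid N(0, 1) coefficients
    phi = np.sqrt(2) * np.sin(np.pi * np.outer(j, t))   # phi_j evaluated on the grid
    return (j ** (-0.5 - alpha) * Z) @ phi
```

Larger alpha damps the high-frequency coefficients faster, giving smoother draws.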

  16. Benchmark: Non-distributed setting I
  • One server: $m = 1$.
  • Squared bias (of the posterior mean): $\|f_0 - E \hat f_\alpha\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\alpha}}$.
  • Variance and posterior spread: $\mathrm{Var}(\hat f_\alpha) \asymp \sigma^2_{\cdot \mid Y} \asymp n^{-\frac{2\alpha}{1+2\alpha}}$.
  • Optimal bias-variance trade-off at $\alpha = \beta$.

  17. Benchmark: Non-distributed setting II [Figure: posterior from non-distributed data; f(t) plotted against t ∈ [0, 1].]

  18. Distributed naive method
  • We have $m$ local machines, with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Take $\alpha = \beta$.
  • Local posteriors:
  $$\Pi_\beta^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, d\Pi_\beta(f)}{\int p_f(Y^{(i)})\, d\Pi_\beta(f)}.$$
  • Aggregate the local posteriors by averaging draws taken from them.
  Result: sub-optimal contraction and misleading uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim (n/m)^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{\cdot \mid Y} \asymp m^{-\frac{1}{1+2\beta}}\, n^{-\frac{2\beta}{1+2\beta}}.$$
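In its sequence-space form the distributed white noise model reads $Y_j^{(i)} = f_j + \sqrt{m/n}\, Z_j^{(i)}$ with prior $f_j \sim N(0, j^{-1-2\alpha})$, so each local posterior is Gaussian in closed form and the naive averaging needs no sampling. A minimal sketch under that (assumed) formulation, with the hypothetical helper `naive_aggregate`:

```python
import numpy as np

def naive_aggregate(Y, n, alpha):
    """Y: (m, J) array of local noisy coefficients Y_j^(i).
    Returns the mean and variance of the average of m local posteriors."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)   # prior variances j^(-1-2*alpha)
    tau2 = m / n                                      # local noise variance
    shrink = lam / (lam + tau2)                       # conjugate shrinkage weights
    local_means = shrink * Y                          # (m, J) local posterior means
    local_var = shrink * tau2                         # local posterior variances
    # averaging draws from m independent local posteriors:
    return local_means.mean(axis=0), local_var / m
```

The shrinkage $\lambda_j / (\lambda_j + m/n)$ is tuned to the local noise level $m/n$, so high-frequency coefficients are shrunk too aggressively (oversmoothing), while the averaging divides the spread by $m$.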

  19. Distributed naive method II [Figure: posterior from the naive distributed method; f(t) plotted against t ∈ [0, 1].]

  20–21. The likelihood approach
  • Again $m$ local machines with data $(Y^{(1)}, \dots, Y^{(m)})$; take $\alpha = \beta$.
  • Modify the local likelihood in each machine:
  $$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})^m\, d\Pi(f)}{\int p_f(Y^{(i)})^m\, d\Pi(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them.
  Result: optimal posterior contraction, but bad uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp n^{-\frac{2\beta}{1+2\beta}}, \qquad \sigma^2_{\cdot \mid Y} \asymp m^{-1}\, n^{-\frac{2\beta}{1+2\beta}}.$$

  22. The likelihood approach II [Figure: posterior from the likelihood distributed method; f(t) plotted against t ∈ [0, 1].]

  23–24. The prior rescaling approach
  • Again $m$ local machines with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Modify the local prior in each machine:
  $$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}{\int p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them.
  Result: optimal posterior contraction and uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{\cdot \mid Y} \asymp n^{-\frac{2\beta}{1+2\beta}}.$$
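Because the white noise model is conjugate in sequence space ($Y_j^{(i)} = f_j + \sqrt{m/n}\, Z_j^{(i)}$ with prior $f_j \sim N(0, j^{-1-2\alpha})$), the three aggregation rules reduce to different shrinkage weights and can be compared in closed form. A sketch under those assumptions; the helper name `aggregate` is hypothetical:

```python
import numpy as np

def aggregate(Y, n, alpha, method="rescaled"):
    """Closed-form Gaussian aggregation for (m, J) local coefficients Y."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)
    tau2 = m / n                       # local noise variance
    if method == "naive":
        prior, noise = lam, tau2
    elif method == "likelihood":       # raise the likelihood to the power m
        prior, noise = lam, tau2 / m
    elif method == "rescaled":         # raise the prior density to the power 1/m
        prior, noise = m * lam, tau2
    shrink = prior / (prior + noise)
    mean = (shrink * Y).mean(axis=0)   # average of the local posterior means
    var = shrink * noise / m           # variance of the averaged draws
    return mean, var
```

The "likelihood" and "rescaled" rules share the shrinkage $\lambda_j/(\lambda_j + 1/n)$, hence the same contraction rate; but the rescaled rule keeps the aggregated spread at the non-distributed level $\lambda_j (1/n)/(\lambda_j + 1/n)$, while the tempered-likelihood spread is $m$ times smaller.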

  25. The prior rescaling approach II [Figure: posterior from the rescaled distributed method; f(t) plotted against t ∈ [0, 1].]

  26. Other approaches

  Method                            Posterior contraction rate     Coverage
  naive, average                    sub-optimal                    no
  naive, Wasserstein                sub-optimal                    yes
  likelihood, average               minimax                        no
  likelihood, Wasserstein (WASP)    minimax                        yes
  scaling, average (consensus MC)   minimax                        yes
  scaling, Wasserstein              minimax                        yes
  undersmoothing                    minimax (on a range of β, m)   yes (on a range of β, m)
  PoE                               sub-optimal                    no
  gPoE                              sub-optimal                    yes
  BCM                               minimax                        yes
  rBCM                              sub-optimal                    yes

  27–28. Data-driven methods Note: all methods above use knowledge of the true regularity parameter $\beta$, which in practice is usually not available. Solution: a data-driven choice of the regularity (tuning) hyperparameter.
  Benchmark: in the non-distributed case ($m = 1$):
  • Hierarchical Bayes: endow $\alpha$ with a hyperprior.
  • Empirical Bayes: estimate $\alpha$ from the data (marginal maximum likelihood estimator).
  • Adaptive minimax posterior contraction rate.
  • Coverage of credible sets (under a polished-tail/self-similarity assumption, using blow-up factors).

  29. Empirical Bayes [Figure: empirical Bayes posterior; f(t) plotted against t ∈ [0, 1].]

  30. Marginal likelihood [Figure: marginal log-likelihood plotted against alpha ∈ [0, 10].]

  31. Data-driven distributed methods Proposed methods:
  • Naive EB: local MMLE
  $$\hat\alpha^{(i)} = \arg\max_\alpha \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Interactive EB, Deisenroth and Ng (2015):
  $$\hat\alpha = \arg\max_\alpha \sum_{i=1}^{m} \log \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Other EB: Lepskii's method $\tilde\alpha$, or cross-validation (in the context of ridge regression: Zhang, Duchi and Wainwright (2015)).
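In the conjugate sequence-space formulation the marginal likelihood is explicit (marginally $Y_j^{(i)} \sim N(0, j^{-1-2\alpha} + m/n)$), so both EB variants can be sketched as a grid search; the function names below are hypothetical:

```python
import numpy as np

def log_marginal(alpha, Y_i, tau2):
    """Log marginal likelihood of one machine's coefficients Y_i under the
    prior f_j ~ N(0, j^(-1-2*alpha)); marginally Y_j ~ N(0, lam_j + tau2)."""
    lam = np.arange(1, len(Y_i) + 1) ** (-1.0 - 2 * alpha)
    v = lam + tau2
    return -0.5 * np.sum(np.log(v) + Y_i ** 2 / v)

def naive_eb(Y_i, tau2, grid):
    """Naive EB: each machine maximises its own marginal likelihood."""
    return grid[np.argmax([log_marginal(a, Y_i, tau2) for a in grid])]

def interactive_eb(Y, tau2, grid):
    """Interactive EB (Deisenroth and Ng, 2015, style): maximise the sum
    of the local log marginal likelihoods over all m machines."""
    totals = [sum(log_marginal(a, Y[i], tau2) for i in range(len(Y)))
              for a in grid]
    return grid[np.argmax(totals)]
```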
