

  1. An asymptotic analysis of nonparametric divide-and-conquer methods Botond Szabó and Harry van Zanten van Dantzig seminar, Delft, 06. 04. 2017.

  2. Table of contents 1 Motivation 2 Distributed methods: examples and counterexamples Kernel density estimation Gaussian white noise model Data-driven distributed methods 3 Distributed methods: fundamental limits Communication constraints Data-driven methods with limited communication 4 Summary, ongoing work

  3. Distributed methods

  4. Applications • Volunteer computing (NASA, CERN, SETI,... projects) • Massive multiplayer online games (peer network) • Aircraft control systems • Meteorology, Astronomy • Medical data from different hospitals

  5. Distributed setting

  6–7. Distributed setting II Interested in high-dimensional and nonparametric models.
  • Methods have tuning, regularity, sparsity and bandwidth hyperparameters that must be adjusted for the optimal bias-variance trade-off. How does this work in distributed settings?
  • Several approaches in the literature (Consensus MC, WASP, Fast-KRR, Distributed GP, ...)
  • Limited theoretical underpinning
  • No unified framework to compare methods
  • Statistical models for illustration: • kernel density estimation, • Gaussian white noise model, • random design nonparametric regression.

  8. Kernel density estimation I
  • Model: observe $X_1, \dots, X_n$ iid $\sim f_0$ with $f_0 \in H^\beta(L)$.
  • Distributed setting: distribute the data randomly over $m$ machines.
  • Method:
  • Local machines: kernel density estimation on each machine's $n/m$ observations,
  $$\hat f_h^{(i)}(x) = \frac{1}{h\, n/m} \sum_{j=1}^{n/m} K\Big(\frac{x - X_j^{(i)}}{h}\Big).$$
  • Central machine: average the local estimators,
  $$\hat f_h(x) = \frac{1}{m} \sum_{i=1}^{m} \hat f_h^{(i)}(x).$$
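The split-estimate-average scheme above can be sketched as follows; the Gaussian kernel, the equal-sized split and the helper names are illustrative assumptions:

```python
import numpy as np

def local_kde(data, h, x):
    """Kernel density estimate (Gaussian kernel) of one machine's data at points x."""
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (h * len(data))

def distributed_kde(data, m, h, x):
    """Split the data over m machines, estimate locally, average centrally."""
    chunks = np.array_split(data, m)
    return np.mean([local_kde(chunk, h, x) for chunk in chunks], axis=0)

rng = np.random.default_rng(0)
n, m = 10_000, 20
sample = rng.normal(size=n)      # f_0: standard normal density
x = np.linspace(-3.0, 3.0, 61)
h = n ** (-1 / 5)                # bandwidth at the global rate for beta = 2
est = distributed_kde(sample, m, h, x)
```

Note that with a common bandwidth and equal split sizes the averaged estimator coincides with the pooled kernel estimator; the difficulty on the following slides is that a bandwidth tuned locally sees only $n/m$ observations.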

  9–12. Kernel density estimation II Problem: the choice of the bandwidth parameter $h$:
  • Local bias-variance trade-off: $|f_0(x) - E_{f_0} \hat f_h^{(i)}(x)| \lesssim h^\beta$ and $\mathrm{Var}_{f_0}\, \hat f_h^{(i)}(x) \asymp \frac{m}{hn}$; optimal bandwidth $h = (n/m)^{-1/(1+2\beta)}$.
  • Global bias-variance trade-off: $|f_0(x) - E_{f_0} \hat f_h(x)| \lesssim h^\beta$ and $\mathrm{Var}_{f_0}\, \hat f_h(x) \asymp \frac{1}{hn}$; optimal bandwidth $h = n^{-1/(1+2\beta)}$.
  • The local bias-variance trade-off results in too large a bias for $\hat f_h$: oversmoothing.
  • In practice $\beta$ is unknown: distributed data-driven methods?
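Plugging numbers into the two bandwidth rules makes the oversmoothing factor explicit; the values of n, m and beta below are arbitrary choices for illustration:

```python
n, m, beta = 10_000, 100, 2.0

h_local = (n / m) ** (-1 / (1 + 2 * beta))   # tuned for the n/m local observations
h_global = n ** (-1 / (1 + 2 * beta))        # tuned for all n observations

# The locally tuned bandwidth is larger by the factor m^(1/(1+2*beta)),
# so the squared bias h^(2*beta) of the averaged estimator is inflated
# by m^(2*beta/(1+2*beta)) relative to the globally tuned one.
ratio = h_local / h_global                   # equals m ** (1 / (1 + 2*beta))
```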

  13–14. Gaussian white noise model Single observer:
  $$dY_t = f_0(t)\, dt + \frac{1}{\sqrt{n}}\, dW_t, \quad t \in [0, 1].$$
  Distributed case: $m$ observers,
  $$dY_t^{(i)} = f_0(t)\, dt + \sqrt{\frac{m}{n}}\, dW_t^{(i)}, \quad t \in [0, 1],\ i \in \{1, \dots, m\},$$
  where the $W_t^{(i)}$ are independent Brownian motions. Assumption: $f_0 \in S^\beta(L)$ for some $\beta > 0$.

  15. Distributed Bayesian approach
  • Endow $f_0$ in each local problem with a GP prior of the form
  $$f \mid \alpha \sim \sum_{j=1}^{\infty} j^{-1/2-\alpha} Z_j \varphi_j,$$
  where the $Z_j$ are iid $N(0, 1)$ and $(\varphi_j)_j$ is the Fourier basis.
  • Compute the posterior (or a modification of it) locally.
  • Aggregate the local posteriors into a global one.
  • Can we get optimal recovery and reliable uncertainty quantification?
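A draw from this series prior can be sketched by truncating the sum; the sine basis and the truncation level J are illustrative assumptions:

```python
import numpy as np

def draw_prior(alpha, J=500, t=None, rng=None):
    """Draw f = sum_j j^(-1/2 - alpha) Z_j phi_j, truncated at J terms,
    with phi_j(t) = sqrt(2) sin(pi j t) on [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    if t is None:
        t = np.linspace(0.0, 1.0, 200)
    j = np.arange(1, J + 1)
    Z = rng.standard_normal(J)                          # iid N(0, 1) coefficients
    phi = np.sqrt(2) * np.sin(np.pi * np.outer(j, t))   # phi_j evaluated on the grid
    return (j ** (-0.5 - alpha) * Z) @ phi
```

Larger alpha damps the high-frequency coefficients faster, giving smoother draws.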

  16. Benchmark: Non-distributed setting I
  • One server: $m = 1$.
  • Squared bias (of the posterior mean): $\|f_0 - E \hat f_\alpha\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\alpha}}$.
  • Variance and posterior spread: $\mathrm{Var}(\hat f_\alpha) \asymp \sigma^2_{\cdot \mid Y} \asymp n^{-\frac{2\alpha}{1+2\alpha}}$.
  • Optimal bias-variance trade-off at $\alpha = \beta$.

  17. Benchmark: Non-distributed setting II [Figure: posterior from non-distributed data; f(t) plotted against t ∈ [0, 1].]

  18. Distributed naive method
  • We have $m$ local machines, with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Take $\alpha = \beta$.
  • Local posteriors:
  $$\Pi_\beta^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, d\Pi_\beta(f)}{\int p_f(Y^{(i)})\, d\Pi_\beta(f)}.$$
  • Aggregate the local posteriors by averaging draws taken from them.
  Result: sub-optimal contraction and misleading uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim (n/m)^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{\cdot \mid Y} \asymp m^{-\frac{1}{1+2\beta}}\, n^{-\frac{2\beta}{1+2\beta}}.$$
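In its sequence-space form the distributed white noise model reads $Y_j^{(i)} = f_j + \sqrt{m/n}\, Z_j^{(i)}$ with prior $f_j \sim N(0, j^{-1-2\alpha})$, so each local posterior is Gaussian in closed form and the naive averaging needs no sampling. A minimal sketch under that (assumed) formulation, with the hypothetical helper `naive_aggregate`:

```python
import numpy as np

def naive_aggregate(Y, n, alpha):
    """Y: (m, J) array of local noisy coefficients Y_j^(i).
    Returns the mean and variance of the average of m local posteriors."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)   # prior variances j^(-1-2*alpha)
    tau2 = m / n                                      # local noise variance
    shrink = lam / (lam + tau2)                       # conjugate shrinkage weights
    local_means = shrink * Y                          # (m, J) local posterior means
    local_var = shrink * tau2                         # local posterior variances
    # averaging draws from m independent local posteriors:
    return local_means.mean(axis=0), local_var / m
```

The shrinkage $\lambda_j / (\lambda_j + m/n)$ is tuned to the local noise level $m/n$, so high-frequency coefficients are shrunk too aggressively (oversmoothing), while the averaging divides the spread by $m$.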

  19. Distributed naive method II [Figure: posterior from the naive distributed method; f(t) plotted against t ∈ [0, 1].]

  20–21. The likelihood approach
  • Again $m$ local machines with data $(Y^{(1)}, \dots, Y^{(m)})$; take $\alpha = \beta$.
  • Modify the local likelihood in each machine:
  $$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})^m\, d\Pi(f)}{\int p_f(Y^{(i)})^m\, d\Pi(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them.
  Result: optimal posterior contraction, but bad uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp n^{-\frac{2\beta}{1+2\beta}}, \qquad \sigma^2_{\cdot \mid Y} \asymp m^{-1}\, n^{-\frac{2\beta}{1+2\beta}}.$$

  22. The likelihood approach II [Figure: posterior from the likelihood distributed method; f(t) plotted against t ∈ [0, 1].]

  23–24. The prior rescaling approach
  • Again $m$ local machines with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Modify the local prior in each machine:
  $$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}{\int p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them.
  Result: optimal posterior contraction and uncertainty quantification:
  $$\|f_0 - E \hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{\cdot \mid Y} \asymp n^{-\frac{2\beta}{1+2\beta}}.$$
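Because the white noise model is conjugate in sequence space ($Y_j^{(i)} = f_j + \sqrt{m/n}\, Z_j^{(i)}$ with prior $f_j \sim N(0, j^{-1-2\alpha})$), the three aggregation rules reduce to different shrinkage weights and can be compared in closed form. A sketch under those assumptions; the helper name `aggregate` is hypothetical:

```python
import numpy as np

def aggregate(Y, n, alpha, method="rescaled"):
    """Closed-form Gaussian aggregation for (m, J) local coefficients Y."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)
    tau2 = m / n                       # local noise variance
    if method == "naive":
        prior, noise = lam, tau2
    elif method == "likelihood":       # raise the likelihood to the power m
        prior, noise = lam, tau2 / m
    elif method == "rescaled":         # raise the prior density to the power 1/m
        prior, noise = m * lam, tau2
    shrink = prior / (prior + noise)
    mean = (shrink * Y).mean(axis=0)   # average of the local posterior means
    var = shrink * noise / m           # variance of the averaged draws
    return mean, var
```

The "likelihood" and "rescaled" rules share the shrinkage $\lambda_j/(\lambda_j + 1/n)$, hence the same contraction rate; but the rescaled rule keeps the aggregated spread at the non-distributed level $\lambda_j (1/n)/(\lambda_j + 1/n)$, while the tempered-likelihood spread is $m$ times smaller.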

  25. The prior rescaling approach II [Figure: posterior from the rescaled distributed method; f(t) plotted against t ∈ [0, 1].]

  26. Other approaches

  Method                            Posterior contraction rate     Coverage
  naive, average                    sub-optimal                    no
  naive, Wasserstein                sub-optimal                    yes
  likelihood, average               minimax                        no
  likelihood, Wasserstein (WASP)    minimax                        yes
  scaling, average (consensus MC)   minimax                        yes
  scaling, Wasserstein              minimax                        yes
  undersmoothing                    minimax (on a range of β, m)   yes (on a range of β, m)
  PoE                               sub-optimal                    no
  gPoE                              sub-optimal                    yes
  BCM                               minimax                        yes
  rBCM                              sub-optimal                    yes

  27–28. Data-driven methods Note: all methods above use knowledge of the true regularity parameter $\beta$, which in practice is usually not available. Solution: a data-driven choice of the regularity (tuning) hyperparameter.
  Benchmark: in the non-distributed case ($m = 1$):
  • Hierarchical Bayes: endow $\alpha$ with a hyperprior.
  • Empirical Bayes: estimate $\alpha$ from the data (marginal maximum likelihood estimator).
  • Adaptive minimax posterior contraction rate.
  • Coverage of credible sets (under a polished-tail/self-similarity assumption, using blow-up factors).

  29. Empirical Bayes [Figure: empirical Bayes posterior; f(t) plotted against t ∈ [0, 1].]

  30. Marginal likelihood [Figure: marginal log-likelihood plotted against alpha ∈ [0, 10].]

  31. Data-driven distributed methods Proposed methods:
  • Naive EB: local MMLE
  $$\hat\alpha^{(i)} = \arg\max_\alpha \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Interactive EB, Deisenroth and Ng (2015):
  $$\hat\alpha = \arg\max_\alpha \sum_{i=1}^{m} \log \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Other EB: Lepskii's method $\tilde\alpha$, or cross-validation (in the context of ridge regression: Zhang, Duchi and Wainwright (2015)).
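In the conjugate sequence-space formulation the marginal likelihood is explicit (marginally $Y_j^{(i)} \sim N(0, j^{-1-2\alpha} + m/n)$), so both EB variants can be sketched as a grid search; the function names below are hypothetical:

```python
import numpy as np

def log_marginal(alpha, Y_i, tau2):
    """Log marginal likelihood of one machine's coefficients Y_i under the
    prior f_j ~ N(0, j^(-1-2*alpha)); marginally Y_j ~ N(0, lam_j + tau2)."""
    lam = np.arange(1, len(Y_i) + 1) ** (-1.0 - 2 * alpha)
    v = lam + tau2
    return -0.5 * np.sum(np.log(v) + Y_i ** 2 / v)

def naive_eb(Y_i, tau2, grid):
    """Naive EB: each machine maximises its own marginal likelihood."""
    return grid[np.argmax([log_marginal(a, Y_i, tau2) for a in grid])]

def interactive_eb(Y, tau2, grid):
    """Interactive EB (Deisenroth and Ng, 2015, style): maximise the sum
    of the local log marginal likelihoods over all m machines."""
    totals = [sum(log_marginal(a, Y[i], tau2) for i in range(len(Y)))
              for a in grid]
    return grid[np.argmax(totals)]
```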
