


  1. A Sequential Split-Conquer-Combine Approach for Gaussian Process Modeling in Computer Experiments. Chengrui Li, Department of Statistics and Biostatistics, Rutgers University. Joint work with Ying Hung and Min-ge Xie. 2017 QPRC, June 13, 2017.

  2. Outline • Introduction • A Unified Framework with Theoretical Support • Simulation Study • Real Data Example • Summary

  3. Introduction

  4. Motivating example: Data center thermal management • A data center is an integrated facility housing multiple server units, providing application services or management for data processing. • Goal: design a data center with an efficient heat-removal mechanism. • Computational Fluid Dynamics (CFD) simulation (n = 26820, p = 9). Figure 1: Heat map for IBM T. J. Watson Data Center.

  5. Gaussian process model • Gaussian process (GP) model: y = Xβ + Z(x), where • y: n × 1 vector of observations (e.g., room temperatures) • X: n × p design matrix • β: p × 1 vector of unknown parameters • Z(x): a GP with mean 0 and covariance σ²Σ(θ), where Σ(θ) is the n × n correlation matrix with correlation parameters θ, whose ij-th element is defined by the power exponential function corr(Z(x_i), Z(x_j)) = exp(−θ⊤|x_i − x_j|) = exp(−∑_{k=1}^{p} θ_k |x_{ik} − x_{jk}|). • Remark: σ is assumed known for simplicity in this talk.
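To make the correlation structure concrete, here is a minimal sketch in Python/numpy of the n × n matrix Σ(θ) built from the power exponential function above (the helper name corr_matrix is ours, not from the talk):

```python
import numpy as np

def corr_matrix(X, theta):
    """Correlation matrix Sigma(theta) with
    Sigma[i, j] = exp(-sum_k theta_k * |x_ik - x_jk|)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])        # (n, n, p) pairwise |x_ik - x_jk|
    return np.exp(-np.tensordot(diff, theta, axes=([2], [0])))

# Tiny example: n = 5 points in p = 2 dimensions
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))
Sigma = corr_matrix(X, theta=np.array([1.0, 2.0]))      # diagonal entries are all 1
```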

  6. Estimation and prediction • Likelihood inference:
     l(β, θ, σ) = −(1/(2σ²)) (y − Xβ)⊤ Σ⁻¹(θ) (y − Xβ) − (1/2) log|Σ(θ)| − (n/2) log(σ²).
     So β̂|θ = argmax_β { l(β | θ, σ²) } = (X⊤ Σ⁻¹(θ) X)⁻¹ X⊤ Σ⁻¹(θ) y, and θ̂ = argmax_θ { l(θ | β̂, σ²) }.
     • GP prediction: the prediction y₀ at a new point x₀, given parameters (β, θ), follows a normal distribution with mean p₀(β, θ) and variance m₀(β, θ), where
     p₀(β, θ) = x₀⊤β + γ(θ)⊤ Σ⁻¹(θ)(y − Xβ),
     m₀(β, θ) = σ²(1 − γ(θ)⊤ Σ⁻¹(θ) γ(θ)),
     and γ(θ) is an n × 1 vector whose i-th element equals φ(‖x_i − x₀‖; θ).
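A compact numerical sketch of these two formulas, assuming numpy (the function names gls_beta and gp_predict are ours): the GLS estimator β̂|θ and the plug-in predictive mean and variance. The linear solves against Σ below are exactly the O(n³) bottleneck discussed next.

```python
import numpy as np

def gls_beta(X, y, Sigma):
    """beta_hat | theta = (X' Sigma^-1 X)^-1 X' Sigma^-1 y."""
    Si_X = np.linalg.solve(Sigma, X)                # Sigma^{-1} X
    Si_y = np.linalg.solve(Sigma, y)                # Sigma^{-1} y
    return np.linalg.solve(X.T @ Si_X, X.T @ Si_y)

def gp_predict(x0, X, y, Sigma, gamma, beta, sigma2):
    """Plug-in predictive mean p0 and variance m0 at a new point x0;
    gamma[i] = phi(||x_i - x0||; theta)."""
    p0 = x0 @ beta + gamma @ np.linalg.solve(Sigma, y - X @ beta)
    m0 = sigma2 * (1.0 - gamma @ np.linalg.solve(Sigma, gamma))
    return p0, m0
```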

  7. Two challenges in GP modeling • Computational issue: estimation and prediction involve Σ⁻¹ and |Σ|, with cost of order O(n³); not feasible when n is large. • Uncertainty quantification of the GP predictor: the plug-in predictive distribution is widely used, but it underestimates the uncertainty.

  8. Existing methods • For the computational issue: change the model to one that is computationally convenient (Rue and Held 2005; Cressie and Johannesson 2008), or approximate the likelihood function (Stein et al. 2004; Furrer et al. 2006; Fuentes 2007; Kaufman et al. 2008). These methods do not focus on uncertainty quantification and can introduce additional uncertainty. • For uncertainty quantification of the GP predictor: the Bayesian predictive distribution, or a bootstrap approach (Luna and Young 2003); both require intensive computation.

  9. Can both problems be solved by a unified framework? • Yes!

  10. A Unified Framework

  11. Introduction to confidence distribution (CD) • Statistical inference (parameter estimation) may take the form of a point estimate, an interval estimate, or a distribution estimate. • Example: X₁, …, Xₙ i.i.d. from N(µ, 1). Point estimate: x̄ₙ = (1/n) ∑_{i=1}^{n} x_i. Interval estimate: (x̄ₙ − 1.96/√n, x̄ₙ + 1.96/√n). Distribution estimate: N(x̄ₙ, 1/n). • The idea of the CD approach is to use a sample-dependent distribution (or density) function to estimate the parameter of interest. • Wide range of examples: bootstrap distributions, (normalized) likelihood functions, p-value functions, fiducial distributions, some informative priors and Bayesian posteriors, among others (Xie and Singh 2013).
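A quick illustration of the three kinds of estimate for the normal-mean example, assuming numpy and scipy (a hypothetical script, not from the talk). The CD N(x̄ₙ, 1/n) carries the point estimate (its median) and every confidence interval (its quantiles):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(loc=2.0, scale=1.0, size=n)                   # X_i ~ N(mu, 1), true mu = 2

xbar = x.mean()                                              # point estimate
ci = (xbar - 1.96 / np.sqrt(n), xbar + 1.96 / np.sqrt(n))    # 95% interval estimate
cd = stats.norm(loc=xbar, scale=1 / np.sqrt(n))              # distribution estimate (CD)

# The CD reproduces both: its median is xbar, its central 95% region is ci.
print(cd.median(), cd.ppf([0.025, 0.975]), ci)
```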

  12. Overview: Sequential Split-Conquer-Combine. Figure 2: Sequential Split-Conquer-Combine Approach. The data are split into blocks D₁, D₂, …, D_m; at step a the block is sequentially updated to D*_a and an estimator θ̂_a is computed from it; finally the estimators θ̂₁, θ̂₂, …, θ̂_m are combined into θ̂_c.

  13. Ingredients • Split the entire dataset into (correlated) subsets, based on a compactly supported correlation assumption in one dimension • Perform a sequential update to create independent subsets, and estimate on each updated subset • Combine the estimators • Quantify prediction uncertainty

  14. Split • Split the entire dataset into subsets y = {y_a}, a = 1, …, m. Denote the size of y_a by n_a, so that ∑ n_a = n. • Assumption: compactly supported correlation, so that (after index sorting according to the X₁ values) the n × n matrix Σ_t is block tridiagonal:

           ⎡ Σ11   Σ12    O     ···    O    ⎤
           ⎢ Σ21   Σ22   Σ23    ···    O    ⎥
     Σ_t = ⎢  ⋮     ⋱     ⋱      ⋱     ⋮    ⎥
           ⎢  O    ···   Σ(m−1)(m−1) Σ(m−1)m ⎥
           ⎣  O     O    ···   Σm(m−1) Σmm  ⎦
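A sketch of the split step under these assumptions, in numpy (split_blocks is our name): sort by the first input coordinate, then cut into m consecutive blocks, so that only neighboring blocks remain correlated.

```python
import numpy as np

def split_blocks(X, y, m):
    """Sort observations by the first input coordinate and split them
    into m consecutive blocks y_1, ..., y_m (and matching rows of X)."""
    order = np.argsort(X[:, 0])                 # index sorting by the X_1 values
    X_blocks = np.array_split(X[order], m)
    y_blocks = np.array_split(y[order], m)
    return X_blocks, y_blocks
```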

  15. Sequentially update data • Transform y to y* by sequentially updating: y*_a = y_a − L_{a(a−1)} y*_{a−1}, where L_{(a+1)a} = Σ_{t,(a+1)a} D_a⁻¹ and D_a = Σ_{aa} − L_{a(a−1)} D_{a−1} L_{a(a−1)}⊤. • The sequential updates are computationally efficient. • The updated blocks y*_a are independent.
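A minimal sketch of this recursion, assuming numpy (sequential_update is our name; the same transformation must also be applied to the design matrix to obtain the C_a blocks used on the next slide):

```python
import numpy as np

def sequential_update(y_blocks, diag_blocks, sub_blocks):
    """Decorrelate blocks via y*_a = y_a - L_{a(a-1)} y*_{a-1}, where
    L_{a(a-1)} = Sigma_{a(a-1)} D_{a-1}^{-1} and
    D_a = Sigma_{aa} - L_{a(a-1)} D_{a-1} L_{a(a-1)}'.

    diag_blocks[a] = Sigma_{aa}; sub_blocks[a] = Sigma_{a(a-1)} for a >= 1
    (sub_blocks[0] is unused and may be None).
    Returns the independent blocks y*_a and their covariances D_a."""
    y_star = [y_blocks[0]]
    D = [diag_blocks[0]]
    for a in range(1, len(y_blocks)):
        L = sub_blocks[a] @ np.linalg.inv(D[a - 1])
        y_star.append(y_blocks[a] - L @ y_star[a - 1])
        D.append(diag_blocks[a] - L @ D[a - 1] @ L.T)
    return y_star, D
```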

  16. Estimation from each subset • Given θ: the MLE from the a-th subset is β̂_a = argmax_β l_t^{(a)}(β | θ) = (C_a⊤ D_a⁻¹ C_a)⁻¹ C_a⊤ D_a⁻¹ y*_a, and an individual CD for the a-th updated subset is N_p(β̂_a, Cov(β̂_a)) (cf. Xie and Singh 2013). • Given β: the MLE from the a-th subset is θ̂_a = argmax_θ l_t^{(a)}(θ | β), and an individual CD for the a-th updated subset is N(θ̂_a, Cov(θ̂_a)). • Significant computational reduction, because each D_a is much smaller than the original covariance matrix.
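A sketch of the per-block computation for β given θ, in numpy (block_beta is our name; we take C_a to be the a-th sequentially updated block of the design matrix). It also returns W_a = C_a⊤ D_a⁻¹ C_a, the weight used in the combining step on the next slide.

```python
import numpy as np

def block_beta(C_a, D_a, y_star_a):
    """Per-block GLS estimate beta_a_hat = (C_a' D_a^-1 C_a)^-1 C_a' D_a^-1 y*_a.
    With sigma^2 known, the individual CD is N_p(beta_a_hat, sigma^2 * inv(W_a))."""
    Di_C = np.linalg.solve(D_a, C_a)            # D_a^{-1} C_a
    W_a = C_a.T @ Di_C                          # information matrix C_a' D_a^{-1} C_a
    beta_a = np.linalg.solve(W_a, Di_C.T @ y_star_a)
    return beta_a, W_a
```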

  17. CD combining • Following Singh, Xie and Strawderman (2005), Liu, Liu and Xie (2014), and Yang et al. (2014), a combined CD is N_p(β̂_c, S_c), where β̂_c = (∑ W_a)⁻¹ (∑ W_a β̂_a) with W_a = C_a⊤ D_a⁻¹ C_a, and S_c = Cov(β̂_c). • A similar framework can be applied to all the parameters (β, θ). • Theorem 1: Under some regularity assumptions, when τ > O_p(n^{1/2}) and n → ∞, the SSCC estimator λ̂_c = (β̂_c, θ̂_c) is asymptotically as efficient as the MLE λ̂_mle = (β̂_mle, θ̂_mle).
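A sketch of this inverse-variance combining rule in numpy (combine_cd is our name), using the per-block outputs of block_beta above; here S_c is written as σ²(∑ W_a)⁻¹, which equals Cov(β̂_c) when σ² is known, as assumed in this talk.

```python
import numpy as np

def combine_cd(betas, Ws, sigma2=1.0):
    """Combined CD N_p(beta_c, S_c):
    beta_c = (sum_a W_a)^-1 sum_a W_a beta_a_hat, S_c = sigma2 * (sum_a W_a)^-1."""
    W_sum = sum(Ws)
    beta_c = np.linalg.solve(W_sum, sum(W @ b for W, b in zip(Ws, betas)))
    S_c = sigma2 * np.linalg.inv(W_sum)
    return beta_c, S_c
```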

  18. GP predictive distribution • The GP prediction y₀ at a new point x₀, given parameters (β, θ), follows a normal distribution with mean p₀(β, θ) and variance m₀(β, θ), where
     p₀(β, θ) = x₀⊤β + γ(θ)⊤ Σ⁻¹(θ)(y − Xβ),
     m₀(β, θ) = σ²(1 − γ(θ)⊤ Σ⁻¹(θ) γ(θ)),
     and γ(θ) is an n × 1 vector whose i-th element equals φ(‖x_i − x₀‖; θ).
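One way to see how parameter uncertainty enters this predictive distribution is the Monte Carlo sketch below (hypothetical, assuming numpy; not the talk's method): draw β from its combined CD N_p(β̂_c, S_c) and then draw y₀ from N(p₀(β, θ̂), m₀(β, θ̂)), with θ held fixed at its estimate for simplicity.

```python
import numpy as np

def predictive_draws(x0, X, y, Sigma, gamma, beta_c, S_c, sigma2,
                     n_draws=1000, seed=0):
    """Monte Carlo draws of y0 that propagate the uncertainty in beta via its
    combined CD N_p(beta_c, S_c); theta is held fixed at its estimate."""
    rng = np.random.default_rng(seed)
    Si_gamma = np.linalg.solve(Sigma, gamma)
    m0 = sigma2 * (1.0 - gamma @ Si_gamma)              # does not depend on beta
    betas = rng.multivariate_normal(beta_c, S_c, size=n_draws)
    p0 = betas @ x0 + (y - betas @ X.T) @ Si_gamma      # p0 for every beta draw
    return rng.normal(p0, np.sqrt(m0))
```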
