
Asymptotic Bayesian Generalization Error when Training and Test Distributions are Different



  1. Asymptotic Bayesian Generalization Error when Training and Test Distributions are Different
Keisuke Yamazaki 1), Motoaki Kawanabe 2), Sumio Watanabe 1), Masashi Sugiyama 1), Klaus-Robert Müller 2), 3)
1) Tokyo Institute of Technology, 2) Fraunhofer FIRST, IDA, 3) Technical University of Berlin

  2. Summary of This Talk
• Our target situation is non-regular models under the covariate shift.

                    regular               non-regular
  standard          statistics            algebraic geometry
  covariate shift   importance weight     (our target)

Non-regular models are a class of practical parametric models such as Gaussian mixtures, neural networks, hidden Markov models, etc. The covariate shift is the setting where the training and test input distributions are different.

  3. Summary of Our Theoretical Results
• Analytic expression of the generalization error in large sample cases: small order terms, which can be ignored in the absence of covariate shift, play an important role. These small order terms are difficult to analyze in practice.
• Upper bound of the generalization error in small sample cases: our bound is computable for any sample size, and the worst-case generalization error is elucidated.

  4. Contents
1. Explanation of the table (regular / non-regular × standard / covariate shift)
2. Our results
First, I'll explain the table, then show our results.

  5. Contents
1. Explanation of the table (regular / non-regular × standard / covariate shift)
2. Our results

  6. Regression Problem
• Training phase: learn the input-output relation r(y|x) from training samples.
The input density q(x) generates the training input points, the true function produces the training output values, and the model is fitted to the training samples.
[Figure: input density q(x), training data, and the fitted model]

  7. Regression Problem
• Test phase: predict test output values at given test input points.
The input density q(x) again generates the test input data, and the model is used for estimating the test output values. We evaluate the test error (performance).
[Figure: input density q(x) and test data]

  8. Input Distribution in the Standard Setting
• The training and test input distributions are the same: a single density q(x) generates both the training and the test input data.
[Figure: common input density q(x), training and test data]

  9. Input Distribution in Practical Situations
• The training and test input distributions are ...
[Figure: training input density q(x)]

  10. Input Distribution in Practical Situations
• The training and test input distributions are different! This situation is called the covariate shift.
Examples: bioinformatics [Baldi et al., 1998], econometrics [Heckman, 1979], brain-computer interfaces [Wolpaw et al., 2002], etc.
[Figure: training and test input densities]

  11. Input Distribution in Practical Situations
• Under the covariate shift, the input density changes from q_0(x) (training) to q_1(x) (test), while the conditional distribution r(y|x) doesn't change.

  12. Input Distribution in Practical Situations
• The training and test input distributions are different (covariate shift).
Due to the change of the data region, the performance also changes, and a standard technique does NOT work.
[Figure: model fitted on the training region fails on the test region]
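To make the setting concrete, here is a minimal sketch (not from the paper) in which the conditional r(y|x) is fixed while the input density shifts from a training density q0 to a test density q1; the Gaussian input densities, the sine true function, and the linear fit are all illustrative assumptions:

```python
# Minimal covariate-shift sketch: r(y|x) is fixed, only the input density
# changes from q0 (training) to q1 (test). Densities and true function
# below are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(x)                      # r(y|x): y = sin(x) + noise

n = 100
x_train = rng.normal(loc=-1.0, scale=0.5, size=n)   # q0(x): training input density
x_test  = rng.normal(loc=+1.0, scale=0.5, size=n)   # q1(x): test input density (shifted)
y_train = true_function(x_train) + 0.1 * rng.normal(size=n)
y_test  = true_function(x_test)  + 0.1 * rng.normal(size=n)

# Fit a 1st-order polynomial to the training data (ordinary least squares).
coef = np.polyfit(x_train, y_train, deg=1)

mse_train = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
mse_test  = np.mean((np.polyval(coef, x_test)  - y_test) ** 2)
print(f"training MSE: {mse_train:.3f}, test MSE under covariate shift: {mse_test:.3f}")
```

The model fitted where q0 puts its mass performs noticeably worse on the region where q1 puts its mass, which is the failure mode the slide describes.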

  13. Contents
1. Explanation of the table (regular / non-regular × standard / covariate shift)
2. Our results

  14. Classes of Learning Models
• Non-/semi-parametric models: SVM, etc.
• Parametric models
  - Regular: polynomial regression, linear model, etc.
  - Non-regular: neural network, Gaussian mixture, hidden Markov model, Bayesian network, stochastic CFG, etc.
Non-regular models have a hierarchical structure or hidden variables. It is important to analyze non-regular models.

  15. Our Learning Method is Bayesian
Spectrum of estimators, from frequentists' to Bayesian: maximum likelihood, MAP, Bayes.
• Bayesian learning [parametric]: Bayesian learning constructs the predictive distribution as the average of models.

  16. Contents
1. Parametric Bayesian framework (regular / non-regular × standard / covariate shift)
2. Our results
Here, the interest is the generalization performance in each setting. Before looking at each case, let us define how to measure the generalization performance.

  17. How to Measure Generalization Performance
• Kullback divergence (or log-loss):
$D(p_1 \| p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)} \, dx$
It shows the distance between densities:
$p_1(x) = p_2(x) \Leftrightarrow D(p_1 \| p_2) = 0$, and $p_1(x) \neq p_2(x) \Leftrightarrow D(p_1 \| p_2) > 0$.
We use D(true function || predictive distribution), the Kullback divergence from the true distribution to the predictive distribution.
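As a small numeric illustration (not from the slides), the divergence can be approximated on a grid; the two Gaussian densities below are assumptions chosen so that the closed-form value is known:

```python
# Numeric sketch of the Kullback divergence between two densities,
# evaluated on a grid with the trapezoidal rule.
import numpy as np

x = np.linspace(-10, 10, 10001)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p1 = gaussian(x, 0.0, 1.0)   # assumed density p1
p2 = gaussian(x, 1.0, 1.0)   # assumed density p2

# D(p1 || p2) = \int p1(x) log[p1(x)/p2(x)] dx
kl = np.trapz(p1 * np.log(p1 / p2), x)
print(f"D(p1||p2) ≈ {kl:.4f}  (closed form for these Gaussians: 0.5)")
```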

  18. Expected Kullback Divergence Is Our Generalization Error
$G_0(n) = E_{X^n, Y^n}\left[ \int r(y|x) \, q_0(x) \log \frac{r(y|x)}{p(y|x, X^n, Y^n)} \, dx \, dy \right]$
We take the expectation over all training samples (X^n, Y^n). The generalization error is a function of the training sample size n.

  19. What Do We Want to Know?
• Learning curve: the generalization error as a function of the sample size. When the sample size n is sufficiently large,
$G_0(n) = B_0 + \frac{S_0}{n} + o\!\left(\frac{1}{n}\right)$
B_0 is the bias, the limit value of the error: B_0 > 0 when r(y|x) ≠ p(y|x, w), and B_0 = 0 when r(y|x) = p(y|x, ŵ) for some ŵ.
S_0 is the speed of convergence.
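If empirical generalization errors are available for several sample sizes, the constants B and S of this form can be estimated by least squares; in this sketch the error values are synthetic placeholders, not results from the paper:

```python
# Fitting the learning-curve form G(n) ≈ B + S/n by least squares.
import numpy as np

n_values = np.array([100, 200, 400, 800, 1600])
g_values = np.array([0.062, 0.036, 0.023, 0.016, 0.013])  # assumed measurements

# Regress G(n) on [1/n, 1]: the slope estimates S, the intercept estimates B.
A = np.column_stack([1.0 / n_values, np.ones_like(n_values, dtype=float)])
(S_hat, B_hat), *_ = np.linalg.lstsq(A, g_values, rcond=None)
print(f"estimated speed S ≈ {S_hat:.2f}, estimated bias B ≈ {B_hat:.4f}")
```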

  20. Contents
1. Parametric Bayesian framework (regular / non-regular × standard / covariate shift)
2. Our results
Now, we take a careful look at each case separately.

  21. Regular Models in the Standard Input Distribution
• In statistics, the analysis has a long history, and the learning curve is well studied:
$G_0(n) = B_0 + \frac{S_0}{n} + o\!\left(\frac{1}{n}\right)$
B_0: the distance from the true function to the optimal model.
S_0: (dimension of the parameter space) / 2.

  22. Contents
1. Parametric Bayesian framework (table so far: regular × standard = statistics)
2. Our results

  23. Regular Models under Covariate Shift
Importance weight: $IW(x) = \frac{q_1(x)}{q_0(x)}$
$\int q_0(x) \times IW(x) \times \mathrm{Loss}(x) \, dx = \int q_1(x) \, \mathrm{Loss}(x) \, dx$
• The importance weight improves the generalization error [Shimodaira, 2000]:
$G_0(n) = B_0 + \frac{S_0}{n} + o\!\left(\frac{1}{n}\right)$
B_0: the distance to the optimal model following the test data.
S_0: the original speed plus a factor from the importance weight.
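As a minimal Monte Carlo sketch (not from the paper) of the identity above, the expected loss under the test density q1 can be estimated from samples drawn under the training density q0 by reweighting with IW(x) = q1(x)/q0(x); the Gaussian densities and the loss function are illustrative assumptions:

```python
# Importance weighting: E_{q1}[Loss] estimated from q0-samples via IW = q1/q0.
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def loss(x):  # pointwise loss of some fixed predictor (an assumption)
    return (np.sin(x) - 0.5 * x) ** 2

x0 = rng.normal(0.0, 1.0, size=200000)                      # samples from q0 = N(0, 1)
iw = normal_pdf(x0, 0.5, 1.0) / normal_pdf(x0, 0.0, 1.0)    # IW(x) = q1(x)/q0(x), q1 = N(0.5, 1)

est_q0       = np.mean(loss(x0))          # E_{q0}[Loss]
est_q1_by_iw = np.mean(iw * loss(x0))     # E_{q0}[IW * Loss] ≈ E_{q1}[Loss]

x1 = rng.normal(0.5, 1.0, size=200000)    # direct check with samples from q1
print(est_q0, est_q1_by_iw, np.mean(loss(x1)))
```

The reweighted estimate agrees with the direct estimate under q1, while the unweighted training-distribution average does not.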

  24. Contents
1. Parametric Bayesian framework (table so far: regular × standard = statistics; regular × covariate shift = importance weight)
2. Our results

  25. Non-Regular Models without Covariate Shift
Stochastic complexity: the average of the negative log marginal likelihood,
$U_0(n) = E_{X^n, Y^n}\left[ -\log \int \prod_{i=1}^{n} p(Y_i | X_i, w) \, \varphi(w) \, dw \right]$
where p(y|x, w) is the model, φ(w) is a prior, and n is the training data size. The marginal likelihood is used for model selection or for the optimization of the prior.
An asymptotic form of the stochastic complexity is
$U_0(n) = a_0 n + b_0 \log n + o(\log n)$
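As a small numerical sketch (not from the paper) of the quantity inside the expectation, the negative log marginal likelihood can be computed by integrating over a parameter grid for a toy one-parameter model; the model p(y|x, w) = N(y; wx, 1), the prior φ(w) = N(0, 1), and the data are illustrative assumptions:

```python
# Negative log marginal likelihood -log \int prod_i p(y_i|x_i,w) phi(w) dw
# for a toy 1-parameter Gaussian regression model, via a parameter grid.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)           # data generated with w = 0.7 (assumed)

w_grid = np.linspace(-5, 5, 2001)           # grid for the integral over w

def log_gaussian(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# log-likelihood of the whole sample for each w on the grid
loglik = np.array([log_gaussian(y, w * x, 1.0).sum() for w in w_grid])
log_prior = log_gaussian(w_grid, 0.0, 1.0)

# log \int exp(loglik + log_prior) dw via log-sum-exp plus trapezoidal rule
m = (loglik + log_prior).max()
log_marginal = m + np.log(np.trapz(np.exp(loglik + log_prior - m), w_grid))
print("negative log marginal likelihood for this sample:", -log_marginal)
```

The stochastic complexity U_0(n) is the average of this quantity over training samples; the sketch computes it for a single sample only.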

  26. Analysis of the Generalization Error in the Absence of Covariate Shift
According to the definition,
$G_0(n) = U_0(n+1) - U_0(n)$
With
$U_0(n+1) = a_0 (n+1) + b_0 \log(n+1) + o(\log n)$ and $U_0(n) = a_0 n + b_0 \log n + o(\log n)$,
very simple subtraction gives the generalization error
$G_0(n) = a_0 + \frac{b_0}{n} + o\!\left(\frac{1}{n}\right)$,
which includes the regular cases.
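As a reading aid (not on the slide), the subtraction can be expanded term by term; the lower-order remainder is treated more carefully in the paper:

```latex
\begin{aligned}
G_0(n) &= U_0(n+1) - U_0(n) \\
       &= a_0\bigl((n+1) - n\bigr) + b_0\bigl(\log(n+1) - \log n\bigr) + \text{(remainder)} \\
       &= a_0 + b_0 \log\!\Bigl(1 + \tfrac{1}{n}\Bigr) + \text{(remainder)} \\
       &= a_0 + \frac{b_0}{n} + O\!\Bigl(\frac{1}{n^{2}}\Bigr) + \text{(remainder)}.
\end{aligned}
```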

  27. Contents
1. Parametric Bayesian framework (table so far: regular × standard = statistics; regular × covariate shift = importance weight; non-regular × standard = stochastic complexity)
2. Our results
   A) Large sample cases
   B) Finite sample cases
The analysis for non-regular models under the covariate shift is still open!

  28. Kullback Divergence w.r.t. Test Distribution
$G_i(n) = E_{X^n, Y^n}\left[ \int r(y|x) \, q_i(x) \log \frac{r(y|x)}{p(y|x, X^n, Y^n)} \, dx \, dy \right]$
G_0(n): the standard case, with test inputs drawn from q_0(x).
G_1(n): the covariate shift case, with test inputs drawn from q_1(x).
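As a minimal Monte Carlo sketch (not from the paper), the integral over the test input density q_i can be approximated by sampling test inputs from q_0 or q_1 for a single fixed predictive distribution; the outer expectation over training samples is omitted, and the conditional distributions and input densities below are illustrative assumptions:

```python
# Monte Carlo sketch of \int r(y|x) q_i(x) log[r(y|x)/p_hat(y|x)] dx dy
# under the standard (i=0) and shifted (i=1) input densities, for one
# fixed predictive distribution p_hat (no average over training sets).
import numpy as np

rng = np.random.default_rng(0)

def kl_conditional(x, mu_true, mu_hat, sigma=0.1):
    # KL between N(mu_true(x), sigma^2) and N(mu_hat(x), sigma^2),
    # i.e. \int r(y|x) log[r(y|x)/p_hat(y|x)] dy in closed form.
    return (mu_true(x) - mu_hat(x)) ** 2 / (2 * sigma ** 2)

mu_true = np.sin                       # assumed r(y|x) = N(sin x, 0.1^2)
mu_hat  = lambda x: 0.5 * x            # some fixed predictive mean (assumed)

x0 = rng.normal(-1.0, 0.5, size=100000)   # q_0: standard test inputs
x1 = rng.normal(+1.0, 0.5, size=100000)   # q_1: shifted test inputs

print("error under q0 ≈", np.mean(kl_conditional(x0, mu_true, mu_hat)))
print("error under q1 ≈", np.mean(kl_conditional(x1, mu_true, mu_hat)))
```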

  29. Stochastic Complexity under Covariate Shift
We define the shifted stochastic complexity:
$U_i(n+1) = E_{X^n, Y^n} \, E^{i}_{X_{n+1}, Y_{n+1}}\left[ -\log \int \prod_{j=1}^{n+1} p(Y_j | X_j, w) \, \varphi(w) \, dw \right]$
The expectation of the test data, the (n+1)-th sample, is taken w.r.t. the test distribution q_i.
The previous definition:
$U_0(n) = E_{X^n, Y^n}\left[ -\log \int \prod_{j=1}^{n} p(Y_j | X_j, w) \, \varphi(w) \, dw \right]$
where n is the training data size.

  30. Following the previous study, Assumption: an asymptotic form of the shifted stochastic complexity is
$U_i(n) \cong a_i n + b_i \log n + c_i + \frac{d_i}{n} + \cdots$
The previous assumption:
$U_0(n) = a_0 n + b_0 \log n + o(\log n)$
where n is the training data size.
