Importance-Weighted Cross-Validation for Covariate Shift

Masashi Sugiyama (1), Benjamin Blankertz (2), Matthias Krauledat (2,3), Guido Dornhege (2), Klaus-Robert Müller (3,2)

(1) Tokyo Institute of Technology, Tokyo, Japan
(2) Fraunhofer FIRST.IDA, Berlin, Germany
(3) Technical University Berlin, Berlin, Germany
Common Assumption in Supervised Learning
- Goal: from given training samples, predict the outputs of unseen test samples.
- To do so, we always assume that training and test samples are drawn from the same distribution.
- Is this assumption really true?
Not Always True!
- Fewer women in face datasets than in reality (e.g., the Yale Face Database B).
- More criticisms in survey sampling than in reality.
- We tend to collect easy-to-gather samples for training.
- The sample generation mechanism varies over time (e.g., brain activity data).
Covariate Shift
- However, there is no chance of generalization if training and test samples have nothing in common.
- Covariate shift:
  - The input distribution changes.
  - The functional relation remains unchanged.
Examples of Covariate Shift
- (Weak) extrapolation: predict output values outside the training region.
[Figure: training samples vs. test samples]
Examples (cont.)
- Possible applications:
  - Non-stationarity compensation in brain-computer interfaces
  - Online system adaptation in robot motor control
  - Correcting sample selection bias in survey sampling
  - Active learning (experimental design): Sugiyama (JMLR 2006)
Covariate Shift
- To illustrate the effect of covariate shift, let's focus on linear extrapolation.
[Figure: training samples, test samples, true function, learned function]
Ordinary Least Squares (OLS)
- If the model is correct: OLS minimizes the bias asymptotically.
- If the model is misspecified: OLS does not minimize the bias even asymptotically.
- We do not have the correct model in practice, so we need to reduce the bias!
Law of Large Numbers
- The sample average converges to the population mean: (1/n) Σ_{i=1..n} f(x_i) → E_{x~p(x)}[f(x)] as n → ∞.
- We want to estimate the expectation over test input points (x ~ p_test) using only training input points (x_i ~ p_train).
Key Trick: Importance-Weighted Average
- Importance: the ratio of test and training input densities, w(x) = p_test(x) / p_train(x).
- Importance-weighted average: (1/n) Σ_{i=1..n} w(x_i) f(x_i) → E_{x~p_test(x)}[f(x)] (cf. importance sampling).
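As a minimal illustration (not from the talk), the sketch below draws training inputs from a known Gaussian p_train, computes the importance w(x) = p_test(x)/p_train(x) from two assumed-known densities, and checks that the importance-weighted average of f(x) tracks the expectation under p_test while the plain average does not. The function f and the toy densities are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used as the (assumed known) input densities."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_tr, s_tr = 0.0, 1.0   # training inputs ~ N(0, 1)
mu_te, s_te = 1.0, 0.5   # test inputs     ~ N(1, 0.5^2)

f = np.sin               # any function whose expectation under p_test we want

x_tr = rng.normal(mu_tr, s_tr, size=100_000)
w = gauss_pdf(x_tr, mu_te, s_te) / gauss_pdf(x_tr, mu_tr, s_tr)  # importance weights

plain_avg = f(x_tr).mean()      # converges to E_train[f(x)]  (wrong target)
iw_avg = (w * f(x_tr)).mean()   # converges to E_test[f(x)]   (right target)

reference = f(rng.normal(mu_te, s_te, size=100_000)).mean()     # direct Monte Carlo check
print(f"plain: {plain_avg:.3f}  weighted: {iw_avg:.3f}  reference: {reference:.3f}")
```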
Importance-Weighted LS (IWLS)
- The importance w(x) = p_test(x)/p_train(x) is assumed known and strictly positive.
- IWLS minimizes the importance-weighted squared error: θ̂ = argmin_θ Σ_{i=1..n} w(x_i) (f_θ(x_i) − y_i)².
- Even for misspecified models, IWLS minimizes the bias asymptotically.
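A rough sketch of IWLS for a linear-in-parameters model, assuming the importance weights are given: each squared residual is weighted by w(x_i), which reduces to ordinary least squares when all weights are one. The function name and signature are ours, not the authors'.

```python
import numpy as np

def iwls_fit(X, y, w):
    """Importance-weighted least squares for a linear model y ≈ X @ theta.

    X : (n, d) design matrix (include a column of ones for an intercept)
    y : (n,)  outputs
    w : (n,)  importance weights w(x_i) = p_test(x_i) / p_train(x_i)

    Minimizes sum_i w_i * (X_i @ theta - y_i)^2, which is equivalent to
    ordinary least squares on rows rescaled by sqrt(w_i).
    """
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return theta

# Ordinary least squares is the special case of uniform weights:
#   theta_ols = iwls_fit(X, y, np.ones(len(y)))
```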
Importance-Weighted LS (cont.)
- However, the variance of IWLS is larger than that of OLS (cf. BLUE).
- We want to reduce the variance.
- We reduce the variance by adding a small bias to IWLS (e.g., by changing the weights or by regularization).
Adaptive IWLS (Shimodaira, 2000)
- Weight each training sample by w(x_i)^λ with a flattening parameter λ ∈ [0, 1]:
  - λ = 0 (OLS): large bias, small variance.
  - λ = 1 (IWLS): small bias, large variance.
  - Intermediate λ: trades off bias and variance.
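Under the flattening interpretation above (weights raised to the power λ), adaptive IWLS is a one-line wrapper around the hypothetical iwls_fit helper from the previous sketch:

```python
import numpy as np

def adaptive_iwls_fit(X, y, w, lam):
    """Adaptive IWLS: weight each training sample by w_i ** lam.

    lam = 0 recovers OLS  (large bias, small variance);
    lam = 1 recovers IWLS (small bias, large variance);
    intermediate lam trades the two off. lam is chosen by model selection.
    """
    return iwls_fit(X, y, w ** lam)  # iwls_fit as sketched above
```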
Model Selection
- We want to determine λ so that the generalization error (bias + variance) is minimized.
- However, the generalization error is inaccessible.
- We use a generalization error estimator instead.
Cross-Validation
- A standard method for generalization error estimation:
  - Divide the training samples into k groups.
  - Train a learning machine with k − 1 groups.
  - Validate the trained machine on the remaining group.
  - Repeat this for all combinations and output the mean validation error.
[Figure: Group 1, Group 2, ..., Group k−1, Group k; training vs. validation split]
CV under Covariate Shift
- CV is almost unbiased without covariate shift.
- However, it is heavily biased under covariate shift.
[Figure: true generalization error vs. ordinary cross-validation score]
Goal of This Talk
- We propose a better generalization error estimator under covariate shift!
Importance-Weighted CV (IWCV)
- When validating the classifier in the CV process, we also importance-weight the validation error.
[Figure: Set 1, Set 2, ..., Set k−1, Set k; training vs. testing split]
- IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
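A minimal k-fold IWCV sketch under several assumptions: the importance weights of all training samples are known, `fit` trains any learner and returns a predictor, and `loss` is an arbitrary pointwise loss (squared loss for regression, 0/1 loss for classification). This only illustrates the weighting idea; it is not the authors' implementation.

```python
import numpy as np

def iwcv_score(X, y, w, fit, loss, k=10, seed=0):
    """k-fold importance-weighted cross-validation.

    X, y : training inputs and outputs
    w    : importance weights w(x_i) = p_test(x_i) / p_train(x_i)
    fit  : function (X_tr, y_tr, w_tr) -> predict, with predict(X) -> y_hat
    loss : pointwise loss, e.g. lambda y_hat, y: (y_hat - y) ** 2

    Returns the importance-weighted mean validation loss, an (almost)
    unbiased estimate of the generalization error under the test distribution.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = fit(X[tr], y[tr], w[tr])
        # Ordinary CV would average the raw loss; IWCV importance-weights it.
        scores.append(np.mean(w[va] * loss(predict(X[va]), y[va])))
    return float(np.mean(scores))
```

Model selection then simply keeps the candidate (e.g., the flattening parameter λ) with the smallest IWCV score; a usage sketch is given below.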
Example of IWCV

Obtained generalization error, mean (std.):
  Ordinary CV: 0.356 (0.086)
  IWCV:        0.077 (0.020)

- IWCV is nicely unbiased.
- Model selection by IWCV outperforms CV!
[Figure: true generalization error, ordinary CV score, and IWCV score]
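For concreteness, model selection by IWCV on toy data might look as follows, reusing the hypothetical helpers gauss_pdf, adaptive_iwls_fit, and iwcv_score from the earlier sketches; the toy densities, target function, and candidate grid are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D (weak) extrapolation data, purely illustrative: the test input
# distribution sits to the right of the training one.
x = rng.normal(1.0, 1.0, size=200)
X = np.c_[np.ones_like(x), x]                          # linear model with intercept
y = np.sinc(x) + rng.normal(scale=0.1, size=x.size)
w = gauss_pdf(x, 2.0, 0.5) / gauss_pdf(x, 1.0, 1.0)    # assumed-known densities

sq_loss = lambda y_hat, y_true: (y_hat - y_true) ** 2

def make_learner(lam):
    """Wrap adaptive IWLS with flattening lam in the form iwcv_score expects."""
    def fit(X_tr, y_tr, w_tr):
        theta = adaptive_iwls_fit(X_tr, y_tr, w_tr, lam)
        return lambda X_new: X_new @ theta
    return fit

candidates = np.linspace(0.0, 1.0, 11)
scores = [iwcv_score(X, y, w, make_learner(lam), sq_loss) for lam in candidates]
lam_best = candidates[int(np.argmin(scores))]
print("flattening parameter chosen by IWCV:", lam_best)
```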
Relation to Existing Methods
- IWAIC (Shimodaira, JSPI 2000)
- IWSIC (Sugiyama & Müller, Stat. & Deci. 2005)

                     IWAIC        IWSIC          IWCV
Unbiasedness         Asymptotic   Finite sample  Asymptotic & finite sample
Loss                 Smooth       Squared        Arbitrary
Model                Regular      Linear         Arbitrary
Parameter learning   Smooth       Linear         Arbitrary
Computation          Fast         Fast           Slow

IWCV is the first method that is applicable to classification with covariate shift!
Application: Brain-Computer Interface (BCI)
- Brain activity in different mental states is transformed into control signals.
Non-Stationarity in EEG Features
- Different mental conditions (attention, sleepiness, etc.) between training and test phases may change the EEG signals.
[Figures: features extracted from brain activity during training and test phases; bandpower differences between training and test phases]
Adaptive Importance-Weighted Linear Discriminant Analysis (AIWLDA)
- Standard classification method in BCI: LDA (after appropriate feature extraction).
- We use its adaptive importance-weighted variant, AIWLDA, with flattening parameter λ:
  - λ = 0: ordinary LDA (standard method).
  - λ = 1: IWLDA (consistent).
  - λ is tuned by the proposed IWCV.
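One plausible way to realize AIWLDA is to let each sample contribute to its class mean and to the pooled within-class covariance with weight w(x_i)^λ, as sketched below for binary labels. This weighting scheme and the equal-prior threshold are our assumptions, not necessarily how the authors implemented AIWLDA.

```python
import numpy as np

def aiwlda_fit(X, y, w, lam=1.0):
    """Adaptive importance-weighted LDA for binary labels y in {0, 1}.

    Each sample contributes to its class mean and to the pooled within-class
    covariance with weight w_i ** lam; lam = 0 gives ordinary LDA.
    """
    v = w ** lam
    means, cov = [], np.zeros((X.shape[1], X.shape[1]))
    for c in (0, 1):
        Xc, vc = X[y == c], v[y == c]
        mu = (vc[:, None] * Xc).sum(axis=0) / vc.sum()
        means.append(mu)
        D = Xc - mu
        cov += (vc[:, None] * D).T @ D
    cov /= v.sum()
    beta = np.linalg.solve(cov, means[1] - means[0])     # LDA projection direction
    b = -0.5 * beta @ (means[0] + means[1])              # threshold (equal class priors assumed)
    return lambda X_new: (X_new @ beta + b > 0).astype(int)
```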
BCI Results

Subject  Trial  Ordinary LDA  AIWLDA + 10-fold IWCV  KL divergence (train → test inputs)
   1       1       9.3 %            10.0 %                 0.76
   1       2       8.8 %             8.8 %                 1.11
   1       3       4.3 %             4.3 %                 0.69
   2       1      40.0 %            40.0 %                 0.97
   2       2      39.3 %            38.7 %                 1.05
   2       3      25.5 %            25.5 %                 0.43
   3       1      36.9 %            34.4 %                 2.63
   3       2      21.3 %            19.3 %                 2.88
   3       3      22.5 %            17.5 %                 1.25
   4       1      21.3 %            21.3 %                 9.23
   4       2       2.4 %             2.4 %                 5.58
   4       3       6.4 %             6.4 %                 1.83
   5       1      21.3 %            21.3 %                 0.79
   5       2      15.3 %            14.0 %                 2.01

- The proposed method outperforms the existing one in 5 cases!
- When the KL divergence is large, IWCV is better.
- When the KL divergence is small, there is no difference.
- Non-stationarity in EEG could be successfully modeled by covariate shift!
Conclusions
- Covariate shift: the input distribution varies but the functional relation remains unchanged.
- The importance weight plays a central role in compensating for covariate shift.
- IW cross-validation: unbiased and general.
- IWCV improves the performance of BCI.
- Class-prior change: a variant of IWCV works.
- Latent distribution shift: Storkey & Sugiyama (to be presented at NIPS 2006).