PRLab TUDelft NL
LEARNING UNDER COVARIATE SHIFT
Domain Adaptation, Transfer Learning, Data Shift, Concept Drift…
Marco Loog, Pattern Recognition Laboratory, Delft University of Technology
Covariate Shift Assumption
• Covariate shift, via the posterior or via the label function: P(Y|X) = Q(Y|X), or ℓ(X|P) = ℓ(X|Q) = ℓ(X)
• Equivalent to the missing-at-random assumption, with S indicating selection into the training sample: P(S=1|X,Y) = P(S=1|X)
• Standard setting: P(S=1|X,Y) = P(S=1)
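The assumption can be made concrete in a few lines: the labeling mechanism P(Y|X) is shared, while the marginals over X differ. The Gaussian marginals and sine labeling function below are illustrative choices of mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared labeling mechanism: P(Y|X) = Q(Y|X) under covariate shift.
def label(x, rng):
    return np.sin(x) + 0.1 * rng.standard_normal(x.shape)

# Training marginal Q(X) and test marginal P(X) differ: that is the shift.
x_train = rng.normal(loc=-1.0, scale=0.5, size=200)  # Q(X)
x_test = rng.normal(loc=1.0, scale=0.5, size=200)    # P(X)

y_train = label(x_train, rng)
y_test = label(x_test, rng)
```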
Graphically Speaking
• Covariate shift: P(S=1|X,Y) = P(S=1|X)
• So a change of class priors is not covariate shift: P(S=1|X,Y) = P(S=1|Y)
The Canonical Example
• How much does it help, really, when the hypotheses considered are very nonparametric?
Importance Weighting: Basic Idea
• Expected risk on test data: ∫∫ L(x,y|θ) P(x,y) dx dy
• Rewrite, using P(y|x) = Q(y|x) under covariate shift: ∫∫ L(x,y|θ) [P(x)/Q(x)] Q(x,y) dx dy
• Empirical loss [on training data]: ∑ᵢ L(xᵢ,yᵢ|θ) P(xᵢ)/Q(xᵢ)
• Importance weights: P(xᵢ)/Q(xᵢ)
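A minimal sketch of the weighted empirical loss. The two Gaussian marginals are an assumed toy setup (mine, not the slide's), chosen so that the ratio P(x)/Q(x) is available in closed form; the labels are noiseless and linear with true θ = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy marginals: training x ~ Q = N(-1, 0.5^2), test x ~ P = N(+1, 0.5^2).
def q_pdf(x):
    return np.exp(-0.5 * ((x + 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def p_pdf(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

x = rng.normal(-1.0, 0.5, size=1000)  # training sample from Q
y = x                                  # noiseless linear labels, true theta = 1
w = p_pdf(x) / q_pdf(x)                # importance weights P(x_i)/Q(x_i)

def weighted_risk(theta):
    # importance-weighted empirical squared loss: mean of w_i * L(x_i, y_i | theta)
    return np.mean(w * (y - theta * x) ** 2)
```

With noiseless labels, the weighted loss is exactly zero at the true θ and positive elsewhere.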
Estimation of Importance Weights: E.g.
• Estimate P(x) and Q(x) [normal distributions, Parzen densities, whatever] and compute the weights as w = P/Q
• Sugiyama suggests estimating the weights directly
• Find w such that KL(P‖wQ) is minimal [KLIEP]; Q and P are modelled by Parzen densities
• More well-founded suggestions have been given by Huang, Smola, Cortes, Mohri, Mansour, et al.
• Yet another approach is based on a very simple [Laplace-smoothed] nearest neighbor estimate
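A sketch of the first, plug-in route mentioned above: fit a normal distribution to the training and the test sample separately and take the ratio of the fitted densities as the weight. The data-generating distributions are again my assumed toy Gaussians.

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.normal(-1.0, 0.5, 500)  # sample from Q
x_test = rng.normal(1.0, 0.5, 500)    # sample from P

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Plug-in estimate: fit a normal to each sample, then w(x) = P_hat(x) / Q_hat(x).
mu_q, s_q = x_train.mean(), x_train.std()
mu_p, s_p = x_test.mean(), x_test.std()
w = gauss_pdf(x_train, mu_p, s_p) / gauss_pdf(x_train, mu_q, s_q)
```

Training points lying closer to the bulk of the test distribution receive the larger weights.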
Again! A Shameless Plug…
• But only a short one this time…
• Nearest neighbor weighting [NNeW]
• The idea…
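As I read the NNeW idea [Loog, 2012]: each training point is weighted by the [Laplace-smoothed] number of test points for which it is the nearest training neighbor. The sketch below follows that reading; the exact smoothing and normalization in the paper may differ, and the data are my toy Gaussians again.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.normal(-1.0, 0.5, size=(100, 1))  # sample from Q
x_test = rng.normal(1.0, 0.5, size=(200, 1))    # sample from P

# For every test point, find its nearest training point; the Laplace-smoothed
# count of test points "claimed" by each training point acts as its weight.
d = np.abs(x_test - x_train.T)              # (n_test, n_train) distance matrix
nn = d.argmin(axis=1)                       # index of nearest training point
counts = np.bincount(nn, minlength=len(x_train))
w = (counts + 1) / (counts + 1).sum()       # Laplace-smoothed, normalized
```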
“Optimal” Weights
[figure: training density Q and test density P over x]
• Linear regression example
• Find the coefficient θ that relates y to x via y = θx + ε
• Optimal θ = 1
• Squared loss
• Assume one knows the true P(X) and Q(X)
• For this particular weighting, the solution can be found by means of weighted regression
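A sketch of the weighted regression solution for this example. The Gaussian choices for P and Q are my assumption (the slide's figure is not reproduced here); for the squared loss and the model y = θx, the weighted least-squares solution is θ = (∑ wᵢxᵢyᵢ)/(∑ wᵢxᵢ²).

```python
import numpy as np

rng = np.random.default_rng(4)

# Training inputs from Q(X); the true relation y = theta*x + noise has theta = 1.
x = rng.normal(-1.0, 0.5, 200)
y = x + 0.1 * rng.standard_normal(200)

def gauss_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# "Optimal" weights: the true ratio P(x)/Q(x), assuming P = N(+1, 0.5^2), Q = N(-1, 0.5^2).
w = gauss_pdf(x, 1.0, 0.5) / gauss_pdf(x, -1.0, 0.5)

# Closed-form (weighted) least squares for the single-coefficient model y = theta*x.
theta_weighted = (w * x * y).sum() / (w * x * x).sum()
theta_plain = (x * y).sum() / (x * x).sum()
```

Since the linear model is correctly specified here, both estimators target θ = 1; the weighted one typically has the higher variance, because a few large weights dominate the sums.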
Learning Curve for “Optimal” Weights
• Using the true weights P/Q, what behavior do we expect for increasing sample sizes?
• Consider the relative improvement: MSE(Q)/MSE(P)
• 1 training sample?
• Many [say ∞] training samples?
• And in between?
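One way to explore the question empirically is a small simulation, again in my assumed toy setup (Gaussian marginals N(∓1, 0.5²), correctly specified linear model). Here MSE(Q) denotes the test error of the unweighted fit and MSE(P) that of the fit with the true weights P/Q.

```python
import numpy as np

rng = np.random.default_rng(5)

def mse_ratio(n, trials=200):
    """Average test MSE of the unweighted fit divided by that of the weighted fit."""
    r_q, r_p = [], []
    for _ in range(trials):
        x = rng.normal(-1.0, 0.5, n)                # training sample from Q
        y = x + 0.1 * rng.standard_normal(n)
        w = np.exp(8.0 * x)                         # closed-form P/Q for these Gaussians
        xt = rng.normal(1.0, 0.5, 500)              # test sample from P
        yt = xt + 0.1 * rng.standard_normal(500)
        th_q = (x * y).sum() / (x * x).sum()        # unweighted least squares
        th_p = (w * x * y).sum() / (w * x * x).sum()  # weighted least squares
        r_q.append(np.mean((yt - th_q * xt) ** 2))
        r_p.append(np.mean((yt - th_p * xt) ** 2))
    return np.mean(r_q) / np.mean(r_p)
```

Evaluating `mse_ratio` over a grid of sample sizes `n` traces out the learning curve the slide asks about.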
As a Side Remark
• Can we solve semi-supervised learning by importance weighting?
• [Earlier references to Sokolovska and Kawakita]
[Further] Questions, Remarks, etc.
• What problems can be modelled as covariate shift?
• What if P(S=1|X,Y) cannot be simplified?
• Bickel et al. take Sugiyama et al. a step further, and discrepancy minimization makes yet another step
• The weighted version can deteriorate even if the “true” weights are used
• Correction by weighting may have hardly any influence when nonparametric hypotheses are considered
• When to use weighting in the first place?
References
- Ben-David, Blitzer, Crammer, Kulesza, Pereira, Vaughan, “A theory of learning from different domains,” Machine Learning, 2010
- Ben-David, Lu, Pál, “Impossibility theorems for domain adaptation,” AISTATS, 2010
- Ben-David, Urner, “On the hardness of domain adaptation and the utility of unlabeled target samples,” ALT, 2012
- Bickel, Brückner, Scheffer, “Discriminative learning under covariate shift,” JMLR, 2009
- Cortes, Mohri, “Domain adaptation and sample bias correction theory and algorithm for regression,” Theoretical Computer Science, 2014
- Daumé III, “Frustratingly easy domain adaptation,” ACL, 2009
- Dinh, Duin, Piqueras-Salazar, Loog, “FIDOS: A generalized Fisher based feature extraction method for domain shift,” Pattern Recognition, 2013
- Gama, Žliobaitė, Bifet, Pechenizkiy, Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys, 2014
- Jiang, “A literature survey on domain adaptation of statistical classifiers,” 2008
- Loog, “Nearest neighbor-based importance weighting,” MLSP, 2012
- Lu, Behbood, Hao, Zuo, Xue, Zhang, “Transfer learning using computational intelligence: A survey,” Knowledge-Based Systems, 2015
- Mansour, Mohri, Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” COLT, 2009
- Margolis, “A literature review of domain adaptation with unlabeled data,” University of Washington, TR 35, 2010
- Pan, Tsang, Kwok, Yang, “Domain adaptation via transfer component analysis,” IEEE TNN, 2011
- Pan, Yang, “A survey on transfer learning,” IEEE TKDE, 2010
- Quiñonero-Candela, Sugiyama, Schwaighofer, Lawrence, “Dataset shift in machine learning,” The MIT Press, 2009
- Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Stat. Plan. Inference, 2000
- Sugiyama, Krauledat, Müller, “Covariate shift adaptation by importance weighted cross validation,” JMLR, 2007
- Torrey, Shavlik, “Transfer learning,” Handbook of Research on ML Applications and Trends, 2009