Unsupervised Domain Adaptation Based on Source-guided Discrepancy
23rd Sep.
Han Bao (The University of Tokyo / RIKEN AIP)
Research interests
■ Weak supervision: how to learn without labels
  Classification from Pairwise Similarity and Unlabeled Data [BNS18] (ICML2018)
■ Domain adaptation: how to learn when training ≠ test (today's topic)
  Unsupervised Domain Adaptation Based on Source-guided Discrepancy [KCBHSS19] (AAAI2019)
■ Reinforcement learning with low-cost data
  Imitation Learning from Imperfect Demonstration [WCBTS19] (ICML2019)
■ Learning theory: how to handle performance metrics (e.g., class imbalance) for supervised learning under real-world constraints
  [Bao & Sugiyama 19] (in submission)
Inference in the Real World
■ Prediction of presidential elections (https://www.270towin.com/2016_Election/)
▶ hard to obtain real answers: cf. social desirability bias [Brownback & Novotny 2018]
▶ respondents tend to answer in the way "others desire"
▶ unexpected result in the 2016 US presidential election
Brownback, A., & Novotny, A. (2018). Social desirability bias and polling errors in the 2016 presidential election. Journal of Behavioral and Experimental Economics, 74, 38-56.
Inference in the Real World
■ Integration of hospital databases: data distributions may differ! [Wachinger & Reuter 2016]
▶ CAD (Computer-Aided Diagnosis) is becoming prevalent
▶ each hospital has only a limited amount of data
▶ want to unify data across hospitals as much as possible
Wachinger, C., Reuter, M., & Alzheimer's Disease Neuroimaging Initiative. (2016). Domain adaptation for Alzheimer's disease diagnostics. NeuroImage, 139, 470-479.
What's transfer learning?
Many terminologies: transfer learning, covariate shift adaptation, domain adaptation, multi-task learning, etc.
■ Usual machine learning: training data and test data are drawn from the same distribution
■ Transfer learning: the training distribution and the test distribution differ
Unsupervised Domain Adaptation
■ Input
▶ training labeled data (source, abundant): $\{(x_i, y_i)\}_{i=1}^{n_S} \sim p_S$
▶ test unlabeled data (target, scarce; no access to target labels): $\{x'_j\}_{j=1}^{n_T} \sim p_T$
■ Goal
▶ obtain a predictor that performs well on test data: $\mathop{\mathrm{argmin}}_g \mathrm{Err}_T(g)$, where $\mathrm{Err}_T(g) = \mathbb{E}_T[\ell(Y, g(X))]$
▶ Q. How to estimate the target risk?
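The following is a minimal sketch of this setup (synthetic data and a scikit-learn classifier; the toy data and model choice are illustrative assumptions, not part of the talk): the empirical source risk is computable, but the target risk is not, because target labels are never observed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# hypothetical source domain: labeled, abundant
X_src = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)

# hypothetical target domain: unlabeled, shifted distribution
X_tgt = rng.normal(loc=1.0, scale=1.5, size=(200, 2))
# target labels exist in reality but are never observed (unsupervised DA)

g = LogisticRegression().fit(X_src, y_src)

# the empirical source risk (0-1 loss) is computable...
err_S = np.mean(g.predict(X_src) != y_src)
print("empirical source risk:", err_S)

# ...but Err_T(g) cannot be computed directly: no target labels.
# The rest of the talk asks how to bound it using only X_tgt.
```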
Outline
■ Introduction: Transfer Learning
■ History/Comparison of Existing Approaches
■ Proposed Method
■ Experiments and Future Work
Potential Solutions
■ Importance Weighting: make the two distributions similar
■ Representation Learning: map both domains into shared representations
Potential Solutions
■ Importance Weighting (make the distributions similar): $\min_{q} D(q, p_T)$ subject to $\mathrm{supp}(q) \subseteq \mathrm{supp}(p_S)$
■ Representation Learning (map into shared representations): $\min_{\varphi} D(\varphi(p_S), \varphi(p_T))$
It is important to measure the closeness of distributions!
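To make the importance-weighting idea concrete, here is a minimal sketch (an illustrative assumption, not the paper's method) that estimates the density ratio with a logistic-regression domain classifier and uses it to reweight the source loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_src, X_tgt):
    """Estimate w(x) = p_T(x) / p_S(x) with a probabilistic domain classifier.

    Standard trick: train a classifier to separate source (label 0) from
    target (label 1); P(T|x) / P(S|x), corrected for sample sizes, gives
    the density ratio."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    clf = LogisticRegression().fit(X, d)
    p_t = clf.predict_proba(X_src)[:, 1]
    ratio = p_t / np.clip(1.0 - p_t, 1e-6, None)
    return ratio * len(X_src) / len(X_tgt)  # prior correction

# usage: the weighted source risk approximates the target risk
# w = importance_weights(X_src, X_tgt)
# weighted_err = np.average(g.predict(X_src) != y_src, weights=w)
```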
Divergences
[Figure: a map of divergence families]
■ Integral Probability Metrics (IPM): e.g., Wasserstein distance, MMD, Kernel Stein Discrepancy, Energy distance, Cramér distance, TV
■ f-divergences: e.g., KL, χ²-divergence, Hellinger, Jensen-Shannon, TV
■ others: Rényi divergence, Tsallis divergence, β-divergence, γ-divergence
Divergences
[Figure] Integral Probability Metrics (IPM) are built from the difference p − q; f-divergences are built from the ratio p/q.
What is a good measure?
■ Notation: $\mathrm{Err}_S(g) = \mathbb{E}_{p_S}[\ell(g(X), f_S(X))]$, i.e., the expectation over the source marginal $p_S$ of the loss between $g$ and the source labeling function $f_S$ (parallel notation for the target domain as well)
■ Postulate: classification risks should be close if the distance between the distributions is small:
$\mathrm{Err}_T(g) - \mathrm{Err}_S(g) \le D(p_T, p_S) + C$
■ IPM could be a more suitable family!
▶ IPM: $D_\Gamma(p, q) = \sup_{\gamma \in \Gamma} \left| \mathbb{E}_p[\gamma] - \mathbb{E}_q[\gamma] \right|$, where $\Gamma$ is a real-valued function class (e.g., 1-Lipschitz functions for the Wasserstein distance)
▶ an IPM is a difference of expectations, exactly the form of the risk gap $\mathbb{E}_T[\ell(g)] - \mathbb{E}_S[\ell(g)]$, whereas an f-divergence depends on the ratio p/q
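As a concrete IPM instance, here is a minimal sketch of the (biased, V-statistic) empirical MMD with a Gaussian RBF kernel; the kernel, bandwidth, and estimator are illustrative choices, not taken from the talk.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel between rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD: the IPM whose function
    class Gamma is the unit ball of the kernel's RKHS."""
    return (rbf_gram(X, X, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

# e.g., mmd2(X_src, X_tgt) grows as the source and target samples drift apart
```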
Simple Approach: Total Variation [Kifer+ VLDB2004]
■ Total Variation: $D_{\mathrm{TV}}(p, q) = 2 \sup_{A\,\text{measurable}} |p(A) - q(A)|$, where $p, q$ are distributions over $\mathcal{X}$
■ classification risk bound:
$\mathrm{Err}_T(g) - \mathrm{Err}_S(g) \le D_{\mathrm{TV}}(p_S, p_T) + \min\{\mathbb{E}_S[|f_S - f_T|], \mathbb{E}_T[|f_S - f_T|]\}$
■ Problems
▶ TV is overly pessimistic: we can construct close distributions with arbitrarily large TV (e.g., two distributions with disjoint supports already attain the maximal value)
▶ TV is hard to estimate from a finite sample
Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (pp. 180-191). VLDB Endowment.
First Attempt: $\mathcal{H}$-divergence [Kifer+ VLDB2004; Blitzer+ NeurIPS2008]

Definition ($\mathcal{H}$-divergence)
$D_{\mathcal{H}}(p, q) = 2 \sup_{g \in \mathcal{H}} |p(g(X) = 1) - q(g(X) = 1)|$, where $\mathcal{H} \subset \{\pm 1\}^{\mathcal{X}}$
▶ $D_{\mathcal{H}}(p, q) \le D_{\mathrm{TV}}(p, q)$ by definition ⇒ could be less pessimistic
▶ the empirical estimator can be computed by ERM in $\mathcal{H}$:
$\hat{D}_{\mathcal{H}}(p_S, p_T) = 2 \sup_{g \in \mathcal{H}} \left| \frac{1}{n_S} \sum_{x \in S} \mathbf{1}\{g(x) = 1\} - \frac{1}{n_T} \sum_{x \in T} \mathbf{1}\{g(x) = 1\} \right|$

Lemma (finite-sample convergence)
Let $d = \mathrm{VCdim}(\mathcal{H})$. Then, with probability at least $1 - \delta$,
$D_{\mathcal{H}}(p_S, p_T) \le \hat{D}_{\mathcal{H}}(p_S, p_T) + \tilde{O}_p\!\left(\sqrt{\frac{d}{\min\{n_S, n_T\}}}\right)$ (proof omitted)

Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (pp. 180-191). VLDB Endowment.
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2008). Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems (pp. 129-136).
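A common way to approximate this ERM in practice is to train a source-vs-target domain classifier and plug its error into $2(1 - 2\hat{\varepsilon})$ (the "proxy A-distance" trick). The sketch below does this with logistic regression standing in for $\mathcal{H}$; both choices are illustrative assumptions, and the formula presumes roughly balanced sample sizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def empirical_h_divergence(X_src, X_tgt):
    """Approximate the empirical H-divergence via a domain classifier.

    The sup over g in H corresponds to the hypothesis that best separates
    source from target, so with roughly balanced samples
    d_H ~= 2 * (1 - 2 * err_domain)."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    clf = LogisticRegression().fit(X, d)   # logistic regression stands in for H
    err = np.mean(clf.predict(X) != d)     # training error as a cheap surrogate
    return 2.0 * (1.0 - 2.0 * err)

# domains nearly identical -> classifier near chance -> divergence near 0
# domains well separated   -> err near 0             -> divergence near 2
```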
First Attempt: $\mathcal{H}\Delta\mathcal{H}$-divergence [Kifer+ VLDB2004; Blitzer+ NeurIPS2008]

Definition (symmetric difference hypothesis class $\mathcal{H}\Delta\mathcal{H}$)
$g \in \mathcal{H}\Delta\mathcal{H} \iff g = h \oplus h'$ for some $h, h' \in \mathcal{H}$ ($\oplus$: XOR)

Theorem (domain adaptation bound)
Let $d = \mathrm{VCdim}(\mathcal{H})$. Then, with probability at least $1 - \delta$, for any $g$,
$\mathrm{Err}_T(g) \le \mathrm{Err}_S(g) + \frac{1}{2} \hat{D}_{\mathcal{H}\Delta\mathcal{H}}(p_S, p_T) + \tilde{O}_p\!\left(\sqrt{\frac{d}{\min\{n_S, n_T\}}}\right) + \lambda$,
where $\lambda = \min_{h \in \mathcal{H}} \left[\mathrm{Err}_S(h) + \mathrm{Err}_T(h)\right]$ (risk of the joint minimizer)

Issues
▶ $\hat{D}_{\mathcal{H}\Delta\mathcal{H}}$ is intractable ($\mathcal{H}\Delta\mathcal{H}$ cannot be accessed directly), even though $\hat{D}_{\mathcal{H}}$ is tractable
▶ $\lambda$ is intrinsically impossible to estimate (∵ $\mathrm{Err}_T$ is unavailable); it has to be assumed small
Extension: discrepancy measure [Mansour+ COLT2009]

Definition (discrepancy)
$D_{\mathrm{disc},\ell}(p, q) = \sup_{g, g' \in \mathcal{H}} \left| \mathrm{Err}_p(g, g') - \mathrm{Err}_q(g, g') \right|$, where $\mathrm{Err}_p(g, g') = \int \ell(g(X), g'(X)) \, dp$
▶ the 0-1 loss is generalized to an arbitrary loss $\ell$
▶ intuition: seek potential labelings that maximize the difference of losses between the two domains

Lemma (finite-sample convergence)
Let $\hat{D}_{\mathrm{disc},\ell}$ be the empirical estimator of $D_{\mathrm{disc},\ell}$. Assume $\ell$ is Lipschitz continuous and the Rademacher averages of $\mathcal{H}$ on $p_S$ (resp. $p_T$) are bounded by $O_p(n_S^{-1/2})$ (resp. $O_p(n_T^{-1/2})$). Then, with probability at least $1 - \delta$,
$D_{\mathrm{disc},\ell}(p_S, p_T) \le \hat{D}_{\mathrm{disc},\ell}(p_S, p_T) + O_p\!\left(\sqrt{\frac{1}{\min\{n_S, n_T\}}}\right)$

Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Conference on Learning Theory (COLT).
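The joint supremum over pairs $(g, g')$ is what makes the discrepancy hard to compute in general; for a tiny finite hypothesis class it can simply be brute-forced. The sketch below does this for the 0-1 loss, with 1-D threshold classifiers as a purely illustrative $\mathcal{H}$.

```python
import numpy as np
from itertools import product

def disc_01(X_src, X_tgt, hypotheses):
    """Empirical discrepancy for the 0-1 loss over a finite class:
    sup over pairs (g, g') of |Err_S(g, g') - Err_T(g, g')|."""
    best = 0.0
    for g, g_prime in product(hypotheses, repeat=2):
        err_s = np.mean(g(X_src) != g_prime(X_src))
        err_t = np.mean(g(X_tgt) != g_prime(X_tgt))
        best = max(best, abs(err_s - err_t))
    return best

# illustrative finite H: 1-D threshold classifiers on the first feature
hypotheses = [(lambda x, t=t: (x[:, 0] > t).astype(int))
              for t in np.linspace(-2.0, 2.0, 9)]
# disc_01(X_src, X_tgt, hypotheses)
```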
Extension: discrepancy measure [Mansour+ COLT2009]

Theorem (domain adaptation bound)
Assume the loss $\ell$ is symmetric and the Rademacher averages of $\mathcal{H}$ on $p_S$ (resp. $p_T$) are bounded by $O_p(n_S^{-1/2})$ (resp. $O_p(n_T^{-1/2})$). Then, with probability at least $1 - \delta$, for any $g$,
$\mathrm{Err}_T(g, f_T) - \mathrm{Err}^*_T \le \mathrm{Err}_S(g, g^*_S) + \hat{D}_{\mathrm{disc},01}(p_S, p_T) + O_p\!\left(\sqrt{\frac{1}{\min\{n_S, n_T\}}}\right) + \lambda$,
where $\mathrm{Err}^*_T = \mathrm{Err}_T(g^*_T, f_T)$, $\lambda = \mathrm{Err}_T(g^*_S, g^*_T)$, and $g^*_S$, $g^*_T$ are the in-class risk minimizers on the source and target, respectively

Issues
▶ $\hat{D}_{\mathrm{disc},\ell}$ is generally intractable because it needs a joint sup over $g$ and $g'$ (tractable only in simple cases)
▶ $\lambda$ is intrinsically impossible to estimate; it has to be assumed small

Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Conference on Learning Theory (COLT).
Comparison of Existing Measures
■ Total Variation [KBG04][BBCP06]: overly pessimistic, hard to estimate
■ $\mathcal{H}\Delta\mathcal{H}$-divergence [KBG04][BBCP06]: gives a DA bound, but $D_{\mathcal{H}\Delta\mathcal{H}}$ is intractable
■ discrepancy [MMR09]: gives a DA bound, but generally intractable
■ ???
Q. Can we construct a tractable/tighter measure?
Outline
■ Introduction: Transfer Learning
■ History/Comparison of Existing Approaches
■ Proposed Method
■ Experiments and Future Work
Proposed: Source-guided Discrepancy
Idea: a supremum with respect to a single variable should be tractable, so fix one of the two functions

Definition (Source-guided Discrepancy)
$D_{\mathrm{sd},\ell}(p, q) = \sup_{g \in \mathcal{H}} \left| \mathrm{Err}_p(g, g^*_S) - \mathrm{Err}_q(g, g^*_S) \right|$, where $g^*_S = \mathop{\mathrm{argmin}}_{g \in \mathcal{H}} \mathrm{Err}_S(g)$ (source risk minimizer) and $\mathrm{Err}_p(g, g') = \int \ell(g(X), g'(X)) \, dp$
▶ $D_{\mathrm{sd},\ell}(p, q) \le D_{\mathrm{disc},\ell}(p, q)$ by definition (S-disc is finer)
cf. discrepancy: $D_{\mathrm{disc},\ell}(p, q) = \sup_{g, g' \in \mathcal{H}} \left| \mathrm{Err}_p(g, g') - \mathrm{Err}_q(g, g') \right|$
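Below is a minimal sketch of how the empirical S-disc could be computed for the 0-1 loss over a small finite class (reusing the toy threshold classifiers from the earlier sketch); the brute-force enumeration is an illustrative assumption, not the paper's algorithm, and with richer classes the inner sup would instead be approximated by training a model to maximize the gap.

```python
import numpy as np

def s_disc_01(X_src, y_src, X_tgt, hypotheses):
    """Empirical source-guided discrepancy for the 0-1 loss over a finite
    class: fix g*_S (source empirical risk minimizer), then take the sup
    over a single g, so no joint sup is needed (unlike the discrepancy)."""
    # g*_S: hypothesis with the smallest empirical source risk
    g_star_S = min(hypotheses, key=lambda g: np.mean(g(X_src) != y_src))
    best = 0.0
    for g in hypotheses:
        err_s = np.mean(g(X_src) != g_star_S(X_src))
        err_t = np.mean(g(X_tgt) != g_star_S(X_tgt))
        best = max(best, abs(err_s - err_t))
    return best

# usage: s_disc_01(X_src, y_src, X_tgt, hypotheses)
```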