Unsupervised Domain Adaptation Based on Source-guided Discrepancy
23rd Sep.
Han Bao (The University of Tokyo / RIKEN AIP)
Research interests
■ Weak supervision: how to learn without labels
  Classification from Pairwise Similarity and Unlabeled Data [BNS18] (ICML2018)
■ Domain adaptation: how to learn when training ≠ test (today's topic)
  Unsupervised Domain Adaptation Based on Source-guided Discrepancy [KCBHSS19] (AAAI2019)
■ Reinforcement learning with low-cost data
  Imitation Learning from Imperfect Demonstration [WCBTS19] (ICML2019)
■ Learning theory: how to handle performance metrics (e.g., class imbalance) for supervised learning under real-world constraints
  [Bao & Sugiyama 19] (in submission)
Inference in the Real World
■ Prediction of presidential elections (https://www.270towin.com/2016_Election/)
▶ hard to obtain real answers: cf. social desirability bias [Brownback & Novotny 2018]
▶ respondents tend to answer in the way "others desire"
▶ unexpected result in the 2016 US presidential election
Brownback, A., & Novotny, A. (2018). Social desirability bias and polling errors in the 2016 presidential election. Journal of Behavioral and Experimental Economics, 74, 38-56.
Inference in the Real World
■ Integration of hospital databases: data distributions may differ! [Wachinger & Reuter 2016]
▶ CAD (Computer-Aided Diagnosis) is becoming prevalent
▶ each hospital has only a limited amount of data
▶ want to unify data across hospitals as much as possible
Wachinger, C., Reuter, M., & Alzheimer's Disease Neuroimaging Initiative. (2016). Domain adaptation for Alzheimer's disease diagnostics. NeuroImage, 139, 470-479.
What's transfer learning?
Many terminologies: transfer learning, covariate shift adaptation, domain adaptation, multi-task learning, etc.
■ Usual machine learning: training data and test data are drawn from the same distribution
■ Transfer learning: the training distribution and the test distribution differ
Unsupervised Domain Adaptation
■ Input
▶ training labeled data (source, abundant): $\{(x_i, y_i)\}_{i=1}^{n_S} \sim p_S$
▶ test unlabeled data (target, scarce; no access to target labels): $\{x'_j\}_{j=1}^{n_T} \sim p_T$
■ Goal
▶ obtain a predictor that performs well on test data: $\mathop{\mathrm{argmin}}_g \mathrm{Err}_T(g)$, where $\mathrm{Err}_T(g) = \mathbb{E}_T[\ell(Y, g(X))]$
▶ Q. How to estimate the target risk?
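The following is a minimal sketch of this setup (synthetic data and a scikit-learn classifier; the toy data and model choice are illustrative assumptions, not part of the talk): the empirical source risk is computable, but the target risk is not, because target labels are never observed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# hypothetical source domain: labeled, abundant
X_src = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)

# hypothetical target domain: unlabeled, shifted distribution
X_tgt = rng.normal(loc=1.0, scale=1.5, size=(200, 2))
# target labels exist in reality but are never observed (unsupervised DA)

g = LogisticRegression().fit(X_src, y_src)

# the empirical source risk (0-1 loss) is computable...
err_S = np.mean(g.predict(X_src) != y_src)
print("empirical source risk:", err_S)

# ...but Err_T(g) cannot be computed directly: no target labels.
# The rest of the talk asks how to bound it using only X_tgt.
```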
Outline
■ Introduction: Transfer Learning
■ History/Comparison of Existing Approaches
■ Proposed Method
■ Experiments and Future Work
Potential Solutions
■ Importance Weighting: make the two distributions similar
■ Representation Learning: map both domains into shared representations
Potential Solutions
■ Importance Weighting (make the distributions similar): $\min_{q} D(q, p_T)$ subject to $\mathrm{supp}(q) \subseteq \mathrm{supp}(p_S)$
■ Representation Learning (map into shared representations): $\min_{\varphi} D(\varphi(p_S), \varphi(p_T))$
It is important to measure the closeness of distributions!
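To make the importance-weighting idea concrete, here is a minimal sketch (an illustrative assumption, not the paper's method) that estimates the density ratio with a logistic-regression domain classifier and uses it to reweight the source loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_src, X_tgt):
    """Estimate w(x) = p_T(x) / p_S(x) with a probabilistic domain classifier.

    Standard trick: train a classifier to separate source (label 0) from
    target (label 1); P(T|x) / P(S|x), corrected for sample sizes, gives
    the density ratio."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    clf = LogisticRegression().fit(X, d)
    p_t = clf.predict_proba(X_src)[:, 1]
    ratio = p_t / np.clip(1.0 - p_t, 1e-6, None)
    return ratio * len(X_src) / len(X_tgt)  # prior correction

# usage: the weighted source risk approximates the target risk
# w = importance_weights(X_src, X_tgt)
# weighted_err = np.average(g.predict(X_src) != y_src, weights=w)
```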
Divergences
[Figure: a map of divergence families]
■ Integral Probability Metrics (IPM): e.g., Wasserstein distance, MMD, Kernel Stein Discrepancy, Energy distance, Cramér distance, TV
■ f-divergences: e.g., KL, χ²-divergence, Hellinger, Jensen-Shannon, TV
■ others: Rényi divergence, Tsallis divergence, β-divergence, γ-divergence
Divergences
[Figure] Integral Probability Metrics (IPM) are built from the difference p − q; f-divergences are built from the ratio p/q.
What is a good measure?
■ Notation: $\mathrm{Err}_S(g) = \mathbb{E}_{p_S}[\ell(g(X), f_S(X))]$, i.e., the expectation over the source marginal $p_S$ of the loss between $g$ and the source labeling function $f_S$ (parallel notation for the target domain as well)
■ Postulate: classification risks should be close if the distance between the distributions is small:
$\mathrm{Err}_T(g) - \mathrm{Err}_S(g) \le D(p_T, p_S) + C$
■ IPM could be a more suitable family!
▶ IPM: $D_\Gamma(p, q) = \sup_{\gamma \in \Gamma} \left| \mathbb{E}_p[\gamma] - \mathbb{E}_q[\gamma] \right|$, where $\Gamma$ is a real-valued function class (e.g., 1-Lipschitz functions for the Wasserstein distance)
▶ an IPM is a difference of expectations, exactly the form of the risk gap $\mathbb{E}_T[\ell(g)] - \mathbb{E}_S[\ell(g)]$, whereas an f-divergence depends on the ratio p/q
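As a concrete IPM instance, here is a minimal sketch of the (biased, V-statistic) empirical MMD with a Gaussian RBF kernel; the kernel, bandwidth, and estimator are illustrative choices, not taken from the talk.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel between rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD: the IPM whose function
    class Gamma is the unit ball of the kernel's RKHS."""
    return (rbf_gram(X, X, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

# e.g., mmd2(X_src, X_tgt) grows as the source and target samples drift apart
```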
Simple Approach: Total Variation [Kifer+ VLDB2004]
■ Total Variation: $D_{\mathrm{TV}}(p, q) = 2 \sup_{A\,\text{measurable}} |p(A) - q(A)|$, where $p, q$ are distributions over $\mathcal{X}$
■ classification risk bound:
$\mathrm{Err}_T(g) - \mathrm{Err}_S(g) \le D_{\mathrm{TV}}(p_S, p_T) + \min\{\mathbb{E}_S[|f_S - f_T|], \mathbb{E}_T[|f_S - f_T|]\}$
■ Problems
▶ TV is overly pessimistic: we can construct close distributions with arbitrarily large TV (e.g., two distributions with disjoint supports already attain the maximal value)
▶ TV is hard to estimate from a finite sample
Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (pp. 180-191). VLDB Endowment.
First Attempt: $\mathcal{H}$-divergence [Kifer+ VLDB2004; Blitzer+ NeurIPS2008]

Definition ($\mathcal{H}$-divergence)
$D_{\mathcal{H}}(p, q) = 2 \sup_{g \in \mathcal{H}} |p(g(X) = 1) - q(g(X) = 1)|$, where $\mathcal{H} \subset \{\pm 1\}^{\mathcal{X}}$
▶ $D_{\mathcal{H}}(p, q) \le D_{\mathrm{TV}}(p, q)$ by definition ⇒ could be less pessimistic
▶ the empirical estimator can be computed by ERM in $\mathcal{H}$:
$\hat{D}_{\mathcal{H}}(p_S, p_T) = 2 \sup_{g \in \mathcal{H}} \left| \frac{1}{n_S} \sum_{x \in S} \mathbf{1}\{g(x) = 1\} - \frac{1}{n_T} \sum_{x \in T} \mathbf{1}\{g(x) = 1\} \right|$

Lemma (finite-sample convergence)
Let $d = \mathrm{VCdim}(\mathcal{H})$. Then, with probability at least $1 - \delta$,
$D_{\mathcal{H}}(p_S, p_T) \le \hat{D}_{\mathcal{H}}(p_S, p_T) + \tilde{O}_p\!\left(\sqrt{\frac{d}{\min\{n_S, n_T\}}}\right)$ (proof omitted)

Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (pp. 180-191). VLDB Endowment.
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2008). Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems (pp. 129-136).
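A common way to approximate this ERM in practice is to train a source-vs-target domain classifier and plug its error into $2(1 - 2\hat{\varepsilon})$ (the "proxy A-distance" trick). The sketch below does this with logistic regression standing in for $\mathcal{H}$; both choices are illustrative assumptions, and the formula presumes roughly balanced sample sizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def empirical_h_divergence(X_src, X_tgt):
    """Approximate the empirical H-divergence via a domain classifier.

    The sup over g in H corresponds to the hypothesis that best separates
    source from target, so with roughly balanced samples
    d_H ~= 2 * (1 - 2 * err_domain)."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    clf = LogisticRegression().fit(X, d)   # logistic regression stands in for H
    err = np.mean(clf.predict(X) != d)     # training error as a cheap surrogate
    return 2.0 * (1.0 - 2.0 * err)

# domains nearly identical -> classifier near chance -> divergence near 0
# domains well separated   -> err near 0             -> divergence near 2
```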
First Attempt: $\mathcal{H}\Delta\mathcal{H}$-divergence [Kifer+ VLDB2004; Blitzer+ NeurIPS2008]

Definition (symmetric difference hypothesis class $\mathcal{H}\Delta\mathcal{H}$)
$g \in \mathcal{H}\Delta\mathcal{H} \iff g = h \oplus h'$ for some $h, h' \in \mathcal{H}$ ($\oplus$: XOR)

Theorem (domain adaptation bound)
Let $d = \mathrm{VCdim}(\mathcal{H})$. Then, with probability at least $1 - \delta$, for any $g$,
$\mathrm{Err}_T(g) \le \mathrm{Err}_S(g) + \frac{1}{2} \hat{D}_{\mathcal{H}\Delta\mathcal{H}}(p_S, p_T) + \tilde{O}_p\!\left(\sqrt{\frac{d}{\min\{n_S, n_T\}}}\right) + \lambda$,
where $\lambda = \min_{h \in \mathcal{H}} \left[\mathrm{Err}_S(h) + \mathrm{Err}_T(h)\right]$ (risk of the joint minimizer)

Issues
▶ $\hat{D}_{\mathcal{H}\Delta\mathcal{H}}$ is intractable ($\mathcal{H}\Delta\mathcal{H}$ cannot be accessed directly), even though $\hat{D}_{\mathcal{H}}$ is tractable
▶ $\lambda$ is intrinsically impossible to estimate (∵ $\mathrm{Err}_T$ is unavailable); it has to be assumed small
Extension: discrepancy measure [Mansour+ COLT2009]

Definition (discrepancy)
$D_{\mathrm{disc},\ell}(p, q) = \sup_{g, g' \in \mathcal{H}} \left| \mathrm{Err}_p(g, g') - \mathrm{Err}_q(g, g') \right|$, where $\mathrm{Err}_p(g, g') = \int \ell(g(X), g'(X)) \, dp$
▶ the 0-1 loss is generalized to an arbitrary loss $\ell$
▶ intuition: seek potential labelings that maximize the difference of losses between the two domains

Lemma (finite-sample convergence)
Let $\hat{D}_{\mathrm{disc},\ell}$ be the empirical estimator of $D_{\mathrm{disc},\ell}$. Assume $\ell$ is Lipschitz continuous and the Rademacher averages of $\mathcal{H}$ on $p_S$ (resp. $p_T$) are bounded by $O_p(n_S^{-1/2})$ (resp. $O_p(n_T^{-1/2})$). Then, with probability at least $1 - \delta$,
$D_{\mathrm{disc},\ell}(p_S, p_T) \le \hat{D}_{\mathrm{disc},\ell}(p_S, p_T) + O_p\!\left(\sqrt{\frac{1}{\min\{n_S, n_T\}}}\right)$

Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Conference on Learning Theory (COLT).
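The joint supremum over pairs $(g, g')$ is what makes the discrepancy hard to compute in general; for a tiny finite hypothesis class it can simply be brute-forced. The sketch below does this for the 0-1 loss, with 1-D threshold classifiers as a purely illustrative $\mathcal{H}$.

```python
import numpy as np
from itertools import product

def disc_01(X_src, X_tgt, hypotheses):
    """Empirical discrepancy for the 0-1 loss over a finite class:
    sup over pairs (g, g') of |Err_S(g, g') - Err_T(g, g')|."""
    best = 0.0
    for g, g_prime in product(hypotheses, repeat=2):
        err_s = np.mean(g(X_src) != g_prime(X_src))
        err_t = np.mean(g(X_tgt) != g_prime(X_tgt))
        best = max(best, abs(err_s - err_t))
    return best

# illustrative finite H: 1-D threshold classifiers on the first feature
hypotheses = [(lambda x, t=t: (x[:, 0] > t).astype(int))
              for t in np.linspace(-2.0, 2.0, 9)]
# disc_01(X_src, X_tgt, hypotheses)
```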
Extension: discrepancy measure [Mansour+ COLT2009]

Theorem (domain adaptation bound)
Assume the loss $\ell$ is symmetric and the Rademacher averages of $\mathcal{H}$ on $p_S$ (resp. $p_T$) are bounded by $O_p(n_S^{-1/2})$ (resp. $O_p(n_T^{-1/2})$). Then, with probability at least $1 - \delta$, for any $g$,
$\mathrm{Err}_T(g, f_T) - \mathrm{Err}^*_T \le \mathrm{Err}_S(g, g^*_S) + \hat{D}_{\mathrm{disc},01}(p_S, p_T) + O_p\!\left(\sqrt{\frac{1}{\min\{n_S, n_T\}}}\right) + \lambda$,
where $\mathrm{Err}^*_T = \mathrm{Err}_T(g^*_T, f_T)$, $\lambda = \mathrm{Err}_T(g^*_S, g^*_T)$, and $g^*_S$, $g^*_T$ are the in-class risk minimizers on the source and target, respectively

Issues
▶ $\hat{D}_{\mathrm{disc},\ell}$ is generally intractable because it needs a joint sup over $g$ and $g'$ (tractable only in simple cases)
▶ $\lambda$ is intrinsically impossible to estimate; it has to be assumed small

Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Conference on Learning Theory (COLT).
Comparison of Existing Measures
■ Total Variation [KBG04][BBCP06]: overly pessimistic, hard to estimate
■ $\mathcal{H}\Delta\mathcal{H}$-divergence [KBG04][BBCP06]: gives a DA bound, but $D_{\mathcal{H}\Delta\mathcal{H}}$ is intractable
■ discrepancy [MMR09]: gives a DA bound, but generally intractable
■ ???
Q. Can we construct a tractable/tighter measure?
Outline
■ Introduction: Transfer Learning
■ History/Comparison of Existing Approaches
■ Proposed Method
■ Experiments and Future Work
Proposed: Source-guided Discrepancy
Idea: a supremum with respect to a single variable should be tractable, so fix one of the two functions

Definition (Source-guided Discrepancy)
$D_{\mathrm{sd},\ell}(p, q) = \sup_{g \in \mathcal{H}} \left| \mathrm{Err}_p(g, g^*_S) - \mathrm{Err}_q(g, g^*_S) \right|$, where $g^*_S = \mathop{\mathrm{argmin}}_{g \in \mathcal{H}} \mathrm{Err}_S(g)$ (source risk minimizer) and $\mathrm{Err}_p(g, g') = \int \ell(g(X), g'(X)) \, dp$
▶ $D_{\mathrm{sd},\ell}(p, q) \le D_{\mathrm{disc},\ell}(p, q)$ by definition (S-disc is finer)
cf. discrepancy: $D_{\mathrm{disc},\ell}(p, q) = \sup_{g, g' \in \mathcal{H}} \left| \mathrm{Err}_p(g, g') - \mathrm{Err}_q(g, g') \right|$
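Below is a minimal sketch of how the empirical S-disc could be computed for the 0-1 loss over a small finite class (reusing the toy threshold classifiers from the earlier sketch); the brute-force enumeration is an illustrative assumption, not the paper's algorithm, and with richer classes the inner sup would instead be approximated by training a model to maximize the gap.

```python
import numpy as np

def s_disc_01(X_src, y_src, X_tgt, hypotheses):
    """Empirical source-guided discrepancy for the 0-1 loss over a finite
    class: fix g*_S (source empirical risk minimizer), then take the sup
    over a single g, so no joint sup is needed (unlike the discrepancy)."""
    # g*_S: hypothesis with the smallest empirical source risk
    g_star_S = min(hypotheses, key=lambda g: np.mean(g(X_src) != y_src))
    best = 0.0
    for g in hypotheses:
        err_s = np.mean(g(X_src) != g_star_S(X_src))
        err_t = np.mean(g(X_tgt) != g_star_S(X_tgt))
        best = max(best, abs(err_s - err_t))
    return best

# usage: s_disc_01(X_src, y_src, X_tgt, hypotheses)
```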