A Two-Stage Approach to Domain Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign

What is domain adaptation?
Example: named entity recognition (persons, locations, organizations, etc.)

Standard NER setting (supervised learning):
  train: New York Times (labeled)
  test:  New York Times (unlabeled)
  -> classifier achieves 85.5%

Non-standard (realistic) NER setting:
  train: New York Times (labeled; labeled Reuters data not available)
  test:  Reuters (unlabeled)
  -> classifier achieves 64.1%
Domain difference -> performance drop

NER:
  ideal setting:     train on NYT, test on NYT     -> 85.5%
  realistic setting: train on NYT, test on Reuters -> 64.1%

Another NER example (gene name recognition):
  ideal setting:     train on mouse, test on mouse -> 54.1%
  realistic setting: train on fly,   test on mouse -> 28.1%
Other examples
- Spam filtering: public email collection -> personal inboxes
- Sentiment analysis of product reviews: digital cameras -> cell phones; movies -> books

Can we do better than standard supervised learning?

Domain adaptation: designing learning methods that are aware of the difference between the training and test domains.

How do we solve the problem in general?
Observation 1: domain-specific features

Fly gene names: wingless, daughterless, eyeless, apexless, ...
- they describe a phenotype
- they follow fly gene nomenclature
- so the feature "-less" is weighted high

Is this feature still useful for other organisms (e.g., mouse genes such as CD38, PABPC5, ...)? No!
Observation 2: generalizable features

(Example sentences from the fly and mouse literature, each containing a gene
name in the pattern "... X is expressed in ...".)

The feature "X be expressed" is useful across organisms.
General idea: two-stage approach

(Diagram: source domain and target domain over a shared feature space, split
into domain-specific features and generalizable features.)

Goal

(Diagram: the same source/target feature space, showing the feature coverage
the final classifier should achieve.)
Regular classification

(Diagram: the classifier uses source-domain features only.)

Stage 1. Generalization: emphasize generalizable features in the trained model.

(Diagram: the classifier concentrates on the generalizable features shared by
the source and target domains.)
Stage 2. Adaptation: pick up domain-specific features for the target domain.

(Diagram: the classifier extends from the generalizable features into the
target domain's specific features.)

Regular semi-supervised learning

(Diagram: the classifier uses source-domain features plus unlabeled
target-domain data, without an explicit generalization stage.)
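To make the two stages concrete, here is a minimal, self-contained Python sketch using scikit-learn. The restriction to a precomputed set of generalizable feature indices and the 0.9 confidence threshold are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage(X_src, y_src, X_tgt, gen_idx, threshold=0.9):
    # Stage 1 (generalization): train on labeled source data restricted to
    # the generalizable features, so source-specific cues carry no weight.
    stage1 = LogisticRegression(max_iter=1000).fit(X_src[:, gen_idx], y_src)

    # Pseudo-label the target instances the Stage-1 model is confident about.
    proba = stage1.predict_proba(X_tgt[:, gen_idx])
    confident = proba.max(axis=1) >= threshold
    pseudo_y = stage1.classes_[proba.argmax(axis=1)][confident]

    # Stage 2 (adaptation): retrain on the pseudo-labeled target data with
    # the full feature set, letting target-specific features pick up weight.
    stage2 = LogisticRegression(max_iter=1000).fit(X_tgt[confident], pseudo_y)
    return stage2
```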
Comparison with related work
- We explicitly model generalizable features; previous work models them only
  implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we do need multiple source
  (training) domains; some prior work requires labeled target data
  [Daumé III 2007].
- We add a second stage of adaptation that uses semi-supervised learning;
  previous work does not incorporate semi-supervised learning
  [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].

Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

With a binary feature vector x (features such as "-less" and "X be expressed",
e.g., from "... and wingless are expressed in ...") and one weight vector w_y
per class y:

  p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)

Learning a logistic regression classifier:

  ŵ = argmin_w [ λ ||w||² − (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

- regularization term λ ||w||²: penalizes large weights, controls model complexity
- second term: the average log likelihood of the training data
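A small numpy sketch of the model and objective on this slide; the shapes and toy numbers are illustrative.

```python
import numpy as np

def predict_proba(W, x):
    """p(y | x; w) = exp(w_y^T x) / sum_{y'} exp(w_{y'}^T x)."""
    scores = W @ x                  # one score per class
    scores = scores - scores.max()  # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def objective(W, X, y, lam):
    """lambda * ||w||^2 minus the average log-likelihood of the data."""
    reg = lam * np.sum(W ** 2)      # penalizes large weights
    ll = sum(np.log(predict_proba(W, xi)[yi]) for xi, yi in zip(X, y))
    return reg - ll / len(y)

# Toy check: 2 classes, 3 binary features (e.g., "-less", "X be expressed", ...)
W = np.array([[0.2, -0.3,  0.4],
              [4.5,  3.0, -0.9]])
x = np.array([1, 0, 1])
print(predict_proba(W, x))          # class probabilities for this instance
```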
Generalizable features in weight vectors

Train one classifier per source domain D_1, ..., D_K, giving weight vectors
w^1, w^2, ..., w^K:
- domain-specific features get inconsistent weights across domains
  (e.g., 0.2 / 3.2 / 0.1);
- generalizable features get consistently large weights across domains
  (e.g., 5 / 4.5 / 4.2 and 3.0 / 3.5 / 3.2).

We want to decompose w into a shared part with h non-zero entries (one per
generalizable feature) plus a domain-specific part, e.g.

  (0.2, 4.5, 5, -0.3, 3.0, ..., 2.1, -0.9, 0.4)^T
    = (0, 0, 4.6, 0, 3.2, ..., 0, 0, 0)^T                   (generalizable part)
    + (0.2, 4.5, 0.4, -0.3, -0.2, ..., 2.1, -0.9, 0.4)^T    (domain-specific part)
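The cross-domain consistency intuition can be made concrete with a small numpy sketch. The scoring rule here (mean weight magnitude penalized by cross-domain standard deviation) is an illustrative stand-in, not the paper's exact criterion.

```python
import numpy as np

# Rows = features, columns = the weight vectors w^1, w^2, w^3 learned on
# three source domains (toy values from the slide).
W = np.array([
    [0.2, 3.2, 0.1],   # inconsistent across domains -> domain-specific
    [5.0, 4.5, 4.2],   # consistently large -> generalizable
    [3.0, 3.5, 3.2],   # consistently large -> generalizable
    [2.1, 0.1, 1.7],   # inconsistent across domains -> domain-specific
])

# Score each feature: large average magnitude, low cross-domain variance.
score = np.abs(W).mean(axis=1) - W.std(axis=1)
h = 2
generalizable = np.argsort(-score)[:h]
print(generalizable)   # -> indices 1 and 2, the two consistent rows
```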
Feature selection matrix A

A is an h × F matrix of 0/1 entries in which each row selects one of the h
generalizable features, so z = A x maps the full feature vector x to the
reduced vector z of generalizable features.

Decomposition of w

  w^T x = v^T z + u^T x = (A^T v + u)^T x

where v holds the weights of the h generalizable features (in the reduced
space) and u holds the weights of the domain-specific features.
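A small numpy illustration of z = A x and the identity w^T x = v^T z + u^T x; the concrete numbers follow the decomposition example above but are otherwise arbitrary.

```python
import numpy as np

F, h = 5, 2                          # total features, generalizable features

A = np.zeros((h, F))                 # each row of A selects one
A[0, 2] = 1                          # generalizable feature: here,
A[1, 4] = 1                          # features 2 and 4

x = np.array([1., 0., 1., 1., 1.])   # full binary feature vector
z = A @ x                            # reduced generalizable-feature vector

v = np.array([4.6, 3.2])             # weights on generalizable features
u = np.array([0.2, 4.5, 0.4, -0.3, -0.2])  # weights on domain-specific features

w = A.T @ v + u                      # combined weight vector
assert np.isclose(w @ x, v @ z + u @ x)
print(w)                             # -> [0.2 4.5 5. -0.3 3.], as on the slide
```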