Strong Baselines for Neural Semi-supervised Learning under Domain Shift

Sebastian Ruder, Barbara Plank

Learning under Domain Shift: State-of-the-art domain adaptation approaches


  3. Multi-task tri-training
     1. Train one model with 3 objective functions.
     2. Use predictions on unlabeled data for the third if two agree.
     3. Restrict final layers to use different representations.
     4. Train the third objective function only on pseudo-labeled data to bridge domain shift.
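The agreement rule in step 2 can be illustrated with a minimal sketch. The function name `pseudo_labels_for` and the list-of-lists layout of the predictions are assumptions made for this example, not names from the slides; the sketch only shows the "use a prediction for the third if the other two agree" selection, not the training loop.

```python
from typing import List, Optional, Sequence


def pseudo_labels_for(
    target: int, predictions: Sequence[Sequence[int]]
) -> List[Optional[int]]:
    """Select pseudo-labels for one of three models (hypothetical helper).

    predictions[m][n] is the label that model m predicts for unlabeled
    example n. An example receives a pseudo-label for model `target` only
    if the two other models predict the same label; otherwise None.
    """
    if len(predictions) != 3:
        raise ValueError("tri-training uses exactly three models")
    j, k = [m for m in range(3) if m != target]
    return [
        pj if pj == pk else None
        for pj, pk in zip(predictions[j], predictions[k])
    ]


# Toy predictions of three models on three unlabeled examples.
toy_predictions = [
    [1, 0, 1],  # model 0
    [1, 1, 0],  # model 1
    [0, 1, 0],  # model 2
]
labels_for_third = pseudo_labels_for(2, toy_predictions)
```

In this toy case only the first example is passed on to the third model, because models 0 and 1 agree on it and disagree on the other two.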

  15. Multi-task tri-training
      [Figure: a BiLSTM tagger (Plank et al., 2016) over word (w_1, w_2, w_3) and character (char) inputs, topped by three output layers m_1, m_2, m_3.]
      Orthogonality constraint (Bousmalis et al., 2016): $\mathcal{L}_{orth} = \lVert W_{m_1}^{\top} W_{m_2} \rVert_F^2$
      Loss: $\mathcal{L}(\theta) = -\sum_{i} \sum_{1,\ldots,n} \log P_{m_i}(y \mid \vec{h}) + \gamma \mathcal{L}_{orth}$
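As a rough illustration of the loss on this slide, the sketch below computes the orthogonality penalty (the squared Frobenius norm of the product of the transposed m_1 weights with the m_2 weights) and adds it, scaled by gamma, to the summed negative log-likelihood of the output layers. It uses plain Python lists instead of a deep learning framework; the function names, the matrix layout (rows are hidden dimensions, columns are outputs), and the toy values are all assumptions for this example.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def orth_penalty(w_m1: Matrix, w_m2: Matrix) -> float:
    """Squared Frobenius norm of w_m1^T w_m2 (both have d rows)."""
    d = len(w_m1)
    if len(w_m2) != d:
        raise ValueError("both matrices need the same number of rows")
    total = 0.0
    for a in range(len(w_m1[0])):
        for b in range(len(w_m2[0])):
            # Entry (a, b) of w_m1^T w_m2 is the dot product of two columns.
            dot = sum(w_m1[r][a] * w_m2[r][b] for r in range(d))
            total += dot * dot
    return total


def mt_tri_loss(
    gold_probs: Sequence[Sequence[float]],
    w_m1: Matrix,
    w_m2: Matrix,
    gamma: float,
) -> float:
    """Negative log-likelihood over all output layers plus gamma * L_orth.

    gold_probs[i][n] is the probability that output layer m_i assigns to
    the gold label of example n (assumed to be computed elsewhere).
    """
    nll = -sum(math.log(p) for layer in gold_probs for p in layer)
    return nll + gamma * orth_penalty(w_m1, w_m2)


# Orthogonal columns give a zero penalty; identical ones do not.
zero_penalty = orth_penalty([[1.0], [0.0]], [[0.0], [1.0]])
full_penalty = orth_penalty([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The penalty shrinks toward zero as the columns of the two weight matrices become orthogonal, which is one way to read the earlier instruction to restrict the final layers to use different representations.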

  19. Data & Tasks
      Two tasks:
      Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2006)
      POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
      Domains: [not preserved in this transcript]

  24. Sentiment Analysis Results
      [Bar chart: accuracy (axis from 75 to 82), averaged over 4 target domains, for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri. * Result from Saito et al. (2017).]
      ‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.

  29. POS Tagging Results: trained on 10% labeled data (WSJ)
      [Bar chart: accuracy (axis from 88.7 to 89.8), averaged over 5 target domains, for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.]
      ‣ Tri-training with disagreement works best with little data.

  33. POS Tagging Results: trained on full labeled data (WSJ)
      [Bar chart: accuracy (axis from 89 to 92), averaged over 5 target domains, for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri. * Result from Schnabel & Schütze (2014).]
      ‣ Tri-training works best in the full data setting.

  38. POS Tagging Analysis: accuracy on out-of-vocabulary (OOV) tokens
      [Chart: % OOV tokens (axis from 0 to 11) and accuracy on OOV tokens (axis from 50 to 80) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs), for Src, Tri, and MT-Tri.]
      ‣ Classic tri-training works best on OOV tokens.
      ‣ MT-Tri does worse than the source-only baseline on OOV tokens.
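For readers who want to reproduce this kind of analysis, here is a minimal sketch of how the OOV rate and the accuracy on OOV tokens could be computed. It assumes that a token counts as OOV when it does not occur in the training vocabulary; the slides do not spell out the exact definition, and the function names and toy data are hypothetical.

```python
from typing import Optional, Sequence, Set


def oov_rate(tokens: Sequence[str], train_vocab: Set[str]) -> float:
    """Fraction of tokens that do not occur in the training vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in train_vocab for t in tokens) / len(tokens)


def oov_accuracy(
    tokens: Sequence[str],
    gold: Sequence[str],
    pred: Sequence[str],
    train_vocab: Set[str],
) -> Optional[float]:
    """Tagging accuracy restricted to OOV tokens (None if there are none)."""
    pairs = [
        (g, p)
        for t, g, p in zip(tokens, gold, pred)
        if t not in train_vocab
    ]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy target-domain sentence; "lol" and "omg" are unseen in training.
toy_vocab = {"the", "cat"}
toy_tokens = ["the", "lol", "cat", "omg"]
toy_gold = ["DET", "UH", "NOUN", "UH"]
toy_pred = ["DET", "UH", "NOUN", "NOUN"]
toy_oov_acc = oov_accuracy(toy_tokens, toy_gold, toy_pred, toy_vocab)
```

Running the same function on the predictions of each system (source-only, tri-training, MT-Tri) for each target domain would give the per-domain comparison that the chart summarizes.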

  39. POS Tagging Analysis: POS accuracy per binned log frequency
      [Chart: accuracy delta vs. src-only baseline (axis from -0.005 to 0.018) against binned frequency (bins 0 to 14), for MT-Tri and Tri.]
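A per-frequency-bin comparison like this one could be computed along the following lines. The binning rule used here, the integer part of log2(frequency + 1) with training-set counts, is an assumption; the slide does not state the log base or how unseen tokens are handled, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Mapping, Sequence


def accuracy_by_log_freq_bin(
    tokens: Sequence[str],
    gold: Sequence[str],
    pred: Sequence[str],
    train_counts: Mapping[str, int],
) -> Dict[int, float]:
    """Accuracy per bin, where bin = int(log2(training frequency + 1))."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for tok, g, p in zip(tokens, gold, pred):
        # Unseen tokens have frequency 0 and therefore land in bin 0.
        bin_id = int(math.log2(train_counts.get(tok, 0) + 1))
        total[bin_id] += 1
        correct[bin_id] += int(g == p)
    return {b: correct[b] / total[b] for b in sorted(total)}


def accuracy_delta(
    model_acc: Mapping[int, float], baseline_acc: Mapping[int, float]
) -> Dict[int, float]:
    """Per-bin accuracy difference between a model and the baseline."""
    return {
        b: model_acc[b] - baseline_acc[b]
        for b in sorted(model_acc)
        if b in baseline_acc
    }


toy_counts = {"the": 7, "cat": 1}
toy_bins = accuracy_by_log_freq_bin(
    ["the", "cat", "lol", "the"],
    ["DET", "NOUN", "UH", "DET"],
    ["DET", "VERB", "UH", "DET"],
    toy_counts,
)
```

Calling `accuracy_delta` with the binned accuracies of a tri-training variant and of the source-only baseline gives one delta per bin, which is the quantity plotted on the vertical axis.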
