
Semisupervised Learning, Transfer Learning, and the Future at a Glance
Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


1. Semisupervised GAN

The generator and discriminator do not need to play a zero-sum game. In the semisupervised GAN [11], the discriminator learns one extra class, "fake," in addition to the $K$ real classes.

Softmax output units $a^{(L)} = \hat{\rho} \in \mathbb{R}^{K+1}$ for $P(y \mid \mathbf{x}, \Theta) \sim \mathrm{Categorical}(\rho)$.

Cost function ($L$ labeled, $M$ fake, $N - L$ unlabeled points):

$$\arg\min_{\Theta_{\mathrm{gen}}} \max_{\Theta_{\mathrm{dis}}} \sum_{n=1}^{L} \sum_{j=1}^{K} \mathbb{1}(y^{(n)} = j) \log \hat{\rho}_j^{(n)} + \sum_{m=1}^{M} \log \hat{\rho}_{K+1}^{(m)} + \sum_{n=L+1}^{N} \log\bigl(1 - \hat{\rho}_{K+1}^{(n)}\bigr)$$

- Real, labeled points should be classified correctly
- Generated points should be identified as fake
- Real, unlabeled points can be in any class except $K + 1$
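As a concrete illustration, here is a minimal sketch of the discriminator's side of this cost in PyTorch, assuming a discriminator `dis` that outputs $K + 1$ logits with index $K$ reserved for the "fake" class (the module and all names are hypothetical, not from [11]):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(dis, x_lab, y_lab, x_unl, x_fake, K):
    """Negated discriminator objective of the semisupervised GAN cost
    above, to be minimized. `dis` outputs K+1 logits; index K is "fake"."""
    # Term 1: real, labeled points should be classified correctly.
    loss_lab = F.cross_entropy(dis(x_lab), y_lab)

    # rho_hat_{K+1}: probability each input is assigned to the "fake" class.
    p_fake_gen = F.softmax(dis(x_fake), dim=1)[:, K]
    p_fake_unl = F.softmax(dis(x_unl), dim=1)[:, K]

    # Term 2: generated points should be identified as fake.
    loss_gen = -torch.log(p_fake_gen + 1e-8).mean()
    # Term 3: real, unlabeled points can be in any class except K+1.
    loss_unl = -torch.log(1 - p_fake_unl + 1e-8).mean()

    return loss_lab + loss_gen + loss_unl
```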

2. Performance

State-of-the-art classification performance given:
- 100 labeled points (out of 60K) in MNIST
- 4K labeled points (out of 50K) in CIFAR-10

With generators: (results figure omitted)

3. Outline

1. Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
2. Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
3. The Future at a Glance

(Next: Semisupervised Clustering)

4. Clustering

Clustering is an ill-posed problem. E.g., how should we cluster the following images into two groups? (image grid omitted)

5. Semisupervised Clustering

Different users may have different answers: user-perceived clusters ≠ clusters learned from the data.

Semisupervised clustering: ask the user for some side information to better uncover the user's perspective. In what form?

6. Point-Level Supervision

Side info: must-links and/or cannot-links.

Constrained K-means [13]: assign points to clusters without violating the constraints.
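A minimal sketch of the constrained assignment step in the spirit of [13], assuming the constraints arrive as lists of index pairs (simplified: the original algorithm declares failure when no feasible cluster exists, whereas this sketch leaves the point unassigned):

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """True iff putting point i into cluster c breaks a constraint,
    given the partial assignment in `labels` (-1 = unassigned)."""
    for a, b in must:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] not in (-1, c):
            return True        # must-link partner sits in another cluster
    for a, b in cannot:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] == c:
            return True        # cannot-link partner already sits in c
    return False

def constrained_assign(X, centers, must, cannot):
    """One assignment pass: each point greedily takes its nearest
    feasible center."""
    labels = -np.ones(len(X), dtype=int)
    for i, x in enumerate(X):
        for c in np.argsort(((centers - x) ** 2).sum(axis=1)):
            if not violates(i, int(c), labels, must, cannot):
                labels[i] = int(c)
                break
    return labels
```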

7. Sampling Bias

The sampling of pairwise constraints matters: in many applications, the sampling cannot be uniform.

E.g., suppose we want to cluster products on an e-commerce website and use the click-streams provided by a user to derive must-links implicitly. The user is unlikely to click products uniformly; instead, he or she may, e.g., click only the lowest-priced products.

8. Feature-Level Supervision I

Side info: perception vectors $\{\mathbf{p}^{(n)} \in \mathbb{R}^B\}_{n=1}^{N}$, e.g., bag-of-word vectors of the "reasons" (text) behind must-/cannot-links, where $B$ is the vocabulary size and $\mathbf{p}^{(n)} \neq \mathbf{0}$ if point $n$ is covered by a must-/cannot-link.

9. Feature-Level Supervision II

How do we get perception vectors when clustering products on an e-commerce website? Use the click-streams provided by the user as must-links, and the query that triggers the clicks as the perception vector.

How do we learn from the perception vectors?

10. Perception-Embedding Clustering

Perception-embedding clustering [4]: map every $\mathbf{x}^{(n)} \in \mathbb{R}^D$ to a dense $\mathbf{f}^{(n)} \in \mathbb{R}^B$ and cluster based on the $\mathbf{f}^{(n)}$'s.

Cost function for the mapping:

$$\arg\min_{F, W, \mathbf{b}} \; \|XW + \mathbf{1}_N \mathbf{b}^\top - F\|^2 + \lambda \|S(F - P)\|^2,$$

where $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times B}$, $\mathbf{b} \in \mathbb{R}^B$, $S \in \mathbb{R}^{N \times N}$, and $F, P \in \mathbb{R}^{N \times B}$.

The embedding (parametrized by $W$ and $\mathbf{b}$) applies to all points, thereby avoiding sampling bias.
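A minimal sketch of one way to minimize this cost by alternating least squares, under the assumption that $S$ is a diagonal 0/1 matrix selecting the points that carry perception vectors (the optimizer actually used in [4] may differ):

```python
import numpy as np

def perception_embedding(X, P, s, lam=1.0, n_iters=50):
    """Alternating least squares for
        min_{F,W,b} ||X W + 1 b^T - F||^2 + lam * ||S (F - P)||^2,
    assuming S = diag(s) with s[n] = 1 iff point n has a perception vector.
    """
    N = X.shape[0]
    Xc = np.hstack([X, np.ones((N, 1))])    # fold the bias b into W
    F = P.astype(float).copy()
    for _ in range(n_iters):
        # With F fixed, (W, b) is an ordinary least-squares problem.
        Wb, *_ = np.linalg.lstsq(Xc, F, rcond=None)
        G = Xc @ Wb                         # current fit X W + 1 b^T
        # With (W, b) fixed, each row of F has a closed form.
        w = lam * s[:, None]                # per-point supervision weight
        F = (G + w * P) / (1.0 + w)
    return F, Wb[:-1], Wb[-1]               # embeddings, W, b
```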

11. Outline (next: Transfer Learning)

12. Transfer Learning

In practice, we may not have enough data/supervision in $\mathbb{X}$ to generalize well in a task.
- Semisupervised learning: learn from unlabeled data
- Transfer learning: learn from data in other domains

Define the source and target tasks over $\mathbb{X}^{(\text{source})}$ and $\mathbb{X}^{(\text{target})}$. Goal: use $\mathbb{X}^{(\text{source})}$ to get better results in the target task (or vice versa). How? By learning the "correlations" between $\mathbb{X}^{(\text{source})}$ and $\mathbb{X}^{(\text{target})}$.

13. Branches [10] (taxonomy figure omitted)

14. Few-, One-, and Zero-Shot Learning

How much data do we need in $\mathbb{X}^{(\text{target})}$ to allow knowledge transfer?
- Not much: transfer learning
- Very little: few-shot learning
- Only one example: one-shot learning
- None: zero-shot learning (how is that possible?)

15. Outline (next: Multitask Learning & Weight Initialization)

16. Multitask Learning

Jointly learn the source and target models; both $\mathbb{X}^{(\text{source})}$ and $\mathbb{X}^{(\text{target})}$ have labels. The models share weights that capture the correlation between the data/tasks. Which layers should be shared in deep NNs?
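As an illustration, a minimal hard-parameter-sharing sketch in PyTorch: a shared trunk feeds two task-specific heads (all sizes and names here are hypothetical):

```python
import torch.nn as nn

class MultitaskNet(nn.Module):
    """Shared trunk learns features common to both tasks;
    each task keeps its own output head."""
    def __init__(self, in_dim=784, hidden=256, k_src=10, k_tgt=5):
        super().__init__()
        self.shared = nn.Sequential(            # weights shared across tasks
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.src_head = nn.Linear(hidden, k_src)  # source-task classifier
        self.tgt_head = nn.Linear(hidden, k_tgt)  # target-task classifier

    def forward(self, x, task="src"):
        h = self.shared(x)
        return self.src_head(h) if task == "src" else self.tgt_head(h)
```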

17. Weight Sharing

Which layers to share is application dependent, e.g.:
- Shallow layers in image object recognition, to share filters/feature detectors
- Deep layers in speech transcription, to share the word map

18. Weight Initialization

A simpler way to transfer knowledge is to initialize the weights of the target model with those of the source model. This is very common in deep learning:
- Training a CNN over ImageNet [5] may take a week
- Many pre-trained NNs are available on the Internet, e.g., Model Zoo

It is a regularization technique rather than an optimization technique [3]. Which weights to borrow also depends on the application.
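For example, a minimal sketch with torchvision, assuming a hypothetical 5-class target task:

```python
import torch.nn as nn
from torchvision import models

# Initialize from ImageNet-pretrained weights, then swap in a new head
# sized for the (hypothetical) 5-class target task.
net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
net.fc = nn.Linear(net.fc.in_features, 5)   # randomly initialized head
```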

19. Fine-Tuning I

In addition to borrowing weights, we may update (fine-tune) those weights when training the target model. Results from two CNNs (A and B) over ImageNet [14]: (figure omitted)

20. Fine-Tuning II

Caution: fine-tuning does not always help! Fine-tune or not?
- Large $\mathbb{X}^{(\text{target})}$, similar $\mathbb{X}^{(\text{source})}$: Yes
- Large $\mathbb{X}^{(\text{target})}$, different $\mathbb{X}^{(\text{source})}$: Yes (often still beneficial in practice)
- Small $\mathbb{X}^{(\text{target})}$, similar $\mathbb{X}^{(\text{source})}$: No (to avoid overfitting)
- Small $\mathbb{X}^{(\text{target})}$, different $\mathbb{X}^{(\text{source})}$: No; instead, prepend/append a simple model (e.g., a linear SVM) on features from the pre-trained network
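In the "No" cases, one common recipe is to freeze the borrowed weights and train only a new head; a minimal sketch, continuing the hypothetical torchvision example above:

```python
import torch.nn as nn
from torchvision import models

net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in net.parameters():                  # freeze all borrowed weights
    p.requires_grad = False
net.fc = nn.Linear(net.fc.in_features, 5)   # only this new head will train
```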

21. Outline (next: Domain Adaptation)

22. Domain Adaptation (illustration omitted)

23. Domain Adversarial Networks

Goal: learn domain-invariant features that help the source model adapt to the target task. Recipe: a domain classifier plus a gradient reversal layer [7].
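A minimal sketch of a gradient reversal layer in PyTorch in the spirit of [7] (module names and sizes are illustrative):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lam on the
    backward pass, so the feature extractor is trained to *confuse* the
    domain classifier while the classifier tries to tell domains apart."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage: shared features feed the label head directly and, through the
# reversal, a domain classifier (both heads here are toy stand-ins).
feat = nn.Linear(10, 8)(torch.randn(4, 10))
domain_logits = nn.Linear(8, 2)(GradReverse.apply(feat, 1.0))
```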

24. Outline (next: Zero-Shot Learning)

25. Zero-Shot Learning

Zero-shot learning: transfer learning with $\mathbb{X}^{(\text{source})}$ but an empty $\mathbb{X}^{(\text{target})}$. How is that possible?

26. Label Representations

Side information: semantic representations $\Psi(y)$ of the labels, e.g., "has paws," "has stripes," or "is black" for an "animal" class. Assume that labels in different domains share the same semantic space. The embedding function $\Psi$ can be learned jointly with the model (e.g., in Google Neural Machine Translation) or separately (e.g., in [1]).

27. Why Does Zero-Shot Learning Work?

In task A, a model uses labeled pairs $(\mathbf{x}^{(i)}, y^{(i)})$ to learn the map between the spaces of $\Phi(\mathbf{x})$ and $\Psi(y)$. In task B (with zero shots), the model predicts the label of a point $\mathbf{x}'$ by:
1. obtaining $\Phi(\mathbf{x}')$, then
2. following the map to find $\Psi(y')$.
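As a toy illustration of step 2, a nearest-neighbor lookup in the semantic (attribute) space; the attribute vectors and label names below are invented for the example:

```python
import numpy as np

def zero_shot_predict(phi_x, psi):
    """Return the label whose semantic embedding Psi(y) is nearest to
    the mapped input (cosine similarity); `psi` maps label -> vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(psi, key=lambda y: cos(phi_x, psi[y]))

# Toy attribute space: [has_paws, has_stripes, is_black]
psi = {"zebra": np.array([0., 1., 0.]),
       "cat":   np.array([1., 0., 1.])}
print(zero_shot_predict(np.array([0.1, 0.9, 0.2]), psi))  # -> "zebra"
```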

28. Outline (next: Unsupervised Transfer Learning)
