Generative Adversarial Network and its Applications to Human Language Processing
Hung-yi Lee (李宏毅)
Full version of the tutorial
Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Natural Language Processing
Part III: Applications to Speech Processing
All Kinds of GANs … https://github.com/hindupuravinash/the-gan-zoo
GAN, ACGAN, BGAN, CGAN, DCGAN, EBGAN, fGAN, GoGAN, ……
It is a wise choice to attend this tutorial.
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, "Variational Approaches for Auto-Encoding Generative Adversarial Networks", arXiv, 2017
Generative Adversarial Network (GAN)
• Anime face generation as example: the Generator takes a vector and produces a high-dimensional image; the Discriminator takes an image and outputs a score. A larger score means the image is real; a smaller score means it is fake.
Algorithm
• Initialize generator G and discriminator D.
• In each training iteration:
Step 1: Fix generator G, and update discriminator D. Sample real objects from the database (target score 1) and feed randomly sampled vectors through the fixed G to obtain generated objects (target score 0). The discriminator learns to assign high scores to real objects and low scores to generated objects.
Algorithm
• Initialize generator G and discriminator D.
• In each training iteration:
Step 2: Fix discriminator D, and update generator G. The generator learns to "fool" the discriminator: stack G and D into one large network, treat the generator's output as a hidden layer, fix D's parameters, and update G by backpropagation so that the discriminator's output score becomes large.
Algorithm (both steps together)
• Initialize generator G and discriminator D.
• In each training iteration:
D learning: sample some real objects (target score 1), generate some fake objects with the fixed G (target score 0), and update D.
G learning: sample vectors, pass them through G and the fixed D, and update G so that D's score on the generated images increases.
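The two-step loop above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the tutorial's code: the toy fully-connected networks, the 100-dimensional input vector, the flattened 784-pixel images, and the stand-in data are all assumptions.

    import torch
    import torch.nn as nn

    # Toy networks for illustration; real GANs use convolutional architectures.
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    dataloader = [torch.randn(32, 784) for _ in range(100)]  # stand-in for real images

    for real in dataloader:
        b = real.size(0)
        # Step 1: fix G, update D -- real objects get score 1, generated get 0.
        fake = G(torch.randn(b, 100)).detach()  # detach keeps G fixed in this step
        loss_D = bce(D(real), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        # Step 2: fix D, update G -- G learns to make D output a large score.
        loss_G = bce(D(G(torch.randn(b, 100))), torch.ones(b, 1))
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()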
Faces generated by the machine. The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
Generation vs. Conditional Generation
• Generation: the NN generator takes a vector sampled from a specific range and produces an image.
• Conditional generation: the NN generator additionally takes a condition such as "Girl with red hair and red eyes" or "Girl with yellow ribbon" and produces a matching image.
Conditional GAN [Scott Reed, et al., ICML, 2016]
• Requires paired data: images annotated with conditions such as "blue eyes", "red hair", "short hair".
• Generator: x = G(c, z), where c is the condition (e.g. "red hair") and z is sampled from a normal distribution.
• Discriminator: D takes both c and x and outputs a scalar that jointly evaluates whether x is realistic and whether c and x are matched (this works better than judging realism alone). A true text-image pair such as (red hair, image of a red-haired girl) gets score 1; a mismatched pair such as (blue hair, image of a red-haired girl) gets 0, and so does (red hair, unrealistic generated image).
Conditional GAN [Scott Reed, et al., ICML, 2016]
• x = G(c, z) with text conditions c such as "red hair, green eyes" or "blue hair, red eyes", trained on paired data ("blue eyes", "red hair", "short hair", …). The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
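A minimal sketch of the matching-aware discriminator described above, in the same PyTorch style. The network shape, the dimensions, and the helper names are hypothetical; only the three kinds of training pairs come from the slide.

    import torch
    import torch.nn as nn

    # Hypothetical conditional discriminator: scores a (condition, image) pair.
    class CondD(nn.Module):
        def __init__(self, cond_dim=128, img_dim=784):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(cond_dim + img_dim, 256), nn.ReLU(),
                nn.Linear(256, 1), nn.Sigmoid())
        def forward(self, c, x):
            return self.net(torch.cat([c, x], dim=-1))

    bce = nn.BCELoss()

    def cond_d_loss(D, c, x_real, x_fake, c_wrong):
        ones = torch.ones(x_real.size(0), 1)
        zeros = torch.zeros(x_real.size(0), 1)
        return (bce(D(c, x_real), ones)            # (matched condition, real image) -> 1
                + bce(D(c, x_fake), zeros)         # (condition, generated image)    -> 0
                + bce(D(c_wrong, x_real), zeros))  # (wrong condition, real image)   -> 0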
Conditional GAN – sound-to-image
• G takes a sound c (e.g. "a dog barking sound") and generates an image. Training data collection: videos, which naturally pair audio tracks with frames.
Conditional GAN – audio-to-image
• As the input audio becomes louder, the generated scene changes accordingly. The images are generated by Chia-Hung Wan and Shun-Po Chuang. Demo: https://wjohn1483.github.io/audio_to_scene/index.html
Conditional GAN – image-to-label
• A multi-label image classifier can be viewed as a conditional generator: the input image is the condition, and the set of labels is the generated output.
Conditional GAN – image-to-label (F1 scores) [Tsai, et al., submitted to ICASSP 2019]

    Model        MS-COCO   NUS-WIDE
    VGG-16         56.0      33.9
    + GAN          60.4      41.2
    Inception      62.4      53.5
    + GAN          63.8      55.8
    Resnet-101     62.8      53.1
    + GAN          64.0      55.4
    Resnet-152     63.3      52.1
    + GAN          63.9      54.1
    Att-RNN        62.1      54.7
    RLSD           62.0      46.9

• The classifiers can have different architectures; in the "+ GAN" rows they are trained as conditional GANs.
• Trained this way, every classifier improves, and the conditional GAN approach outperforms Att-RNN and RLSD, models specifically designed for multi-label classification.
Conditional GAN – speech recognition
• "Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model", https://arxiv.org/abs/1811.00787
Unsupervised Conditional GAN
• G transforms an object in domain X (the condition) into an object in domain Y without paired data, e.g. style transfer from photos to Vincent van Gogh's paintings, where the two collections are not paired.
Unsupervised Conditional Generation
• Approach 1: Direct Transformation. Learn G_X→Y that maps domain X directly to domain Y; suitable for texture or color changes.
• Approach 2: Projection to Common Space. An encoder of domain X extracts a latent representation (e.g. face attributes), and a decoder of domain Y generates from it; suitable for larger changes, where only the semantics are kept.
Direct Transformation
• G_X→Y takes an image from domain X; its output should become similar to domain Y. A discriminator D_Y outputs a scalar indicating whether its input image belongs to domain Y or not.
Direct Transformation
• Problem: G_X→Y can fool D_Y while ignoring its input, producing a domain-Y image unrelated to the input. Not what we want!
Direct Transformation with cycle consistency [Jun-Yan Zhu, et al., ICCV, 2017]
• Add a second generator G_Y→X that maps the output back to domain X; the reconstruction must be as close as possible to the original input. A generator that ignored its input might still fool D_Y, but its output would lack the information needed for reconstruction.
Cycle GAN
• Train both directions simultaneously: X → G_X→Y → Y → G_Y→X → X, with the reconstruction as close as possible to the input, and symmetrically Y → G_Y→X → X → G_X→Y → Y. D_X outputs a scalar for "belongs to domain X or not", and D_Y for "belongs to domain Y or not".
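A minimal sketch of the generator-side Cycle GAN objective, assuming G_XY, G_YX, D_X, D_Y are networks defined elsewhere (hypothetical names) and lam weights the cycle term.

    import torch
    import torch.nn as nn

    l1, bce = nn.L1Loss(), nn.BCELoss()

    def cycle_gan_g_loss(G_XY, G_YX, D_X, D_Y, x, y, lam=10.0):
        fake_y, fake_x = G_XY(x), G_YX(y)
        ones = torch.ones(x.size(0), 1)
        # Adversarial terms: translated images should look like the target domain.
        adv = bce(D_Y(fake_y), ones) + bce(D_X(fake_x), ones)
        # Cycle consistency: X -> Y -> X and Y -> X -> Y reconstructions
        # should stay as close as possible to the originals.
        cyc = l1(G_YX(fake_y), x) + l1(G_XY(fake_x), y)
        return adv + lam * cyc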
Unsupervised Conditional Generation (recap)
• Approach 1, Direct Transformation (G_X→Y), handles texture or color changes. Approach 2, Projection to Common Space, encodes a domain-X object into a latent attribute representation and decodes it into domain Y, allowing larger changes where only the semantics are kept.
Projection to Common Space – target
• EN_X encodes a domain-X image into a latent face-attribute representation, and DE_X decodes it back into a domain-X image; EN_Y and DE_Y do the same for domain Y. Cross combinations such as EN_X followed by DE_Y transform an image across domains.
Projection to Common Space – training
• Train EN_X and DE_X as an auto-encoder on domain X, and EN_Y and DE_Y on domain Y, minimizing reconstruction error.
Projection to Common Space – training
• Add a discriminator D_X for domain X and D_Y for domain Y so that, besides minimizing reconstruction error, the decoded images look realistic.
• Problem: because the two auto-encoders are trained separately, images with the same attributes may not project to the same position in the latent space.
Projection to Common Space – training
• Solution: a domain discriminator predicts whether a latent vector comes from EN_X or EN_Y, and EN_X and EN_Y learn to fool it. The domain discriminator forces the outputs of EN_X and EN_Y to have the same distribution, so the two domains share the latent space. [Guillaume Lample, et al., NIPS, 2017]
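A minimal sketch of the domain-discriminator idea, assuming encoders EN_X and EN_Y (defined elsewhere) that output 64-dimensional latent vectors; all names and sizes here are illustrative.

    import torch
    import torch.nn as nn

    # Domain discriminator: guesses which encoder a latent vector came from.
    D_dom = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
    bce = nn.BCELoss()

    def domain_d_loss(EN_X, EN_Y, x, y):
        zx, zy = EN_X(x).detach(), EN_Y(y).detach()  # encoders fixed in this step
        return (bce(D_dom(zx), torch.ones(zx.size(0), 1))
                + bce(D_dom(zy), torch.zeros(zy.size(0), 1)))

    def encoder_adv_loss(EN_X, EN_Y, x, y):
        # Encoders are updated with flipped labels, pushing the two latent
        # distributions to become indistinguishable.
        return (bce(D_dom(EN_X(x)), torch.zeros(x.size(0), 1))
                + bce(D_dom(EN_Y(y)), torch.ones(y.size(0), 1)))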
Projection to Common Space – training
• Alternative: share parameters between EN_X and EN_Y, and between DE_X and DE_Y, so the two auto-encoders are tied together. Couple GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]
Projection to Common Space – training
• Cycle consistency: pass a domain-X image through EN_X → DE_Y to obtain a domain-Y image, then through EN_Y → DE_X; the final reconstruction should match the original image. Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Projection to Common Space – training
• Semantic consistency: encode an image with EN_X, decode it with DE_Y, then re-encode the result with EN_Y; the two latent codes should map to the same point in the latent space (consistency is enforced in latent space rather than pixel space). Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
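The semantic-consistency term is short enough to sketch directly; EN_X, DE_Y, EN_Y are the hypothetical modules from the sketches above.

    import torch.nn.functional as F

    def semantic_consistency_loss(EN_X, DE_Y, EN_Y, x):
        # Encode, decode into domain Y, re-encode; compare the two latent codes.
        z = EN_X(x)
        z_back = EN_Y(DE_Y(z))
        return F.l1_loss(z_back, z)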
Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Natural Language Processing
Part III: Applications to Speech Processing
Unsupervised Conditional Generation
• Image style transfer: photos ↔ Vincent van Gogh's paintings, not paired.
• Text style transfer: positive sentences ("It is good.", "It's a good day.", "I love you.") ↔ negative sentences ("It is bad.", "It's a bad day.", "I don't love you."), also not paired.
Cycle GAN (recap)
• X → G_X→Y → Y → G_Y→X → X with the reconstruction as close as possible to the input, and symmetrically for Y; D_X and D_Y judge whether a sample belongs to domain X or domain Y.
Cycle GAN for text style transfer
• Negative → positive: "It is bad." → G_X→Y → "It is good." → G_Y→X → "It is bad." (as close as possible to the input); D_Y judges "positive sentence?".
• Positive → negative: "I love you." → G_Y→X → "I hate you." → G_X→Y → "I love you."; D_X judges "negative sentence?".
Discrete Issue
• The generator G_X→Y is now a seq2seq model with discrete output (words). Stacking G and the fixed discriminator D_Y into one large network and updating G by backpropagation no longer works: gradients cannot flow back through the discrete word-selection step.
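One common workaround, sketched below, is the straight-through Gumbel-softmax relaxation; this illustrates the general technique, not necessarily the fix the tutorial adopts, and the vocabulary size and stand-in loss are placeholders.

    import torch
    import torch.nn.functional as F

    def differentiable_word(logits, tau=1.0):
        # Forward pass emits a one-hot word choice; gradients flow through
        # the underlying soft distribution, so the discriminator's signal
        # can reach the seq2seq generator despite the discrete output.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

    logits = torch.randn(4, 10000, requires_grad=True)  # batch of 4, vocab of 10000
    words = differentiable_word(logits)
    loss = words.sum()        # stand-in for the discriminator's score
    loss.backward()           # gradients reach `logits`
    print(logits.grad.shape)  # torch.Size([4, 10000])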