CoT: Cooperative Training for Generative Modeling of Discrete Data


  1. CoT: Cooperative Training for Generative Modeling of Discrete Data https://github.com/desire2020/CoT Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, Weinan Zhang, and Yong Yu Shanghai Jiao Tong University

  2. Autoregressive Models • Autoregressive models factorize the joint distribution sequentially to build a fully tractable density function: • q_θ(x_0, x_1, …, x_{n-1}) = q_θ(x_0) · q_θ(x_1 | x_{[0:1)}) · q_θ(x_2 | x_{[0:2)}) · q_θ(x_3 | x_{[0:3)}) · … · q_θ(x_{n-1} | x_{[0:n-1)})
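The factorization above can be sketched in a few lines. This is a toy illustration, not the authors' code: the bigram table, `START` token, and function names are invented, and the conditional here depends only on the previous token, though the same chain-rule sum works for any conditional.

```python
import math

# Toy autoregressive model over a 3-token vocabulary {0, 1, 2}.
START = 0
BIGRAM = {
    0: [0.2, 0.5, 0.3],
    1: [0.1, 0.6, 0.3],
    2: [0.4, 0.4, 0.2],
}

def cond_prob(prefix, token):
    """q(token | prefix): next-token probability given the prefix."""
    prev = prefix[-1] if prefix else START
    return BIGRAM[prev][token]

def log_joint(seq):
    """log q(x_0, ..., x_{n-1}) = sum_t log q(x_t | x_{[0:t)})."""
    return sum(math.log(cond_prob(seq[:t], seq[t])) for t in range(len(seq)))

p = math.exp(log_joint([1, 2, 0]))  # q(1|start) * q(2|1) * q(0|2) = 0.5*0.3*0.4
```

Because the density decomposes into per-step conditionals, both exact likelihood evaluation and left-to-right sampling are tractable.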

  3. Teacher Forcing and Exposure Bias • For each sequence in the training set, maximize the estimated likelihood in the log scale. [Diagram: the model is unrolled from an initial state; at every step it is forced with the real observation from the training sequence ("I have …") and only emits an estimate p(x|s) for the next token.]
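Teacher forcing in miniature: at every step the model is conditioned on the ground-truth prefix from the data, never on its own samples. A hedged sketch, where `model(prefix)` is a hypothetical callable returning next-token probabilities (not an interface from the authors' repo):

```python
import math

def teacher_forcing_nll(model, batch):
    """Average negative log-likelihood under teacher forcing:
    every conditional is evaluated on the *real* prefix from the data."""
    total, count = 0.0, 0
    for seq in batch:
        for t in range(len(seq)):
            probs = model(seq[:t])          # forced: real prefix, never a sample
            total -= math.log(probs[seq[t]])
            count += 1
    return total / count

uniform = lambda prefix: [0.25, 0.25, 0.25, 0.25]   # toy 4-token model
loss = teacher_forcing_nll(uniform, [[0, 1, 2], [3, 3]])
```

Minimizing this loss over the training set is exactly maximum-likelihood estimation of the factorized density from the previous slide.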

  4. Teacher Forcing and Exposure Bias • When the model is used to generate a random sample: [Diagram: the model is unrolled from an initial state; each step draws a stochastic sample from p(x|s) and feeds it back as the next observation, producing a self-generated sequence ("Billie Jean …").]
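The inference-time loop differs from the training loop in one crucial place: the prefix is now the model's own output. A minimal free-running decoder (names and the fair-coin toy model are invented for illustration):

```python
import random

def sample_sequence(model, length, rng):
    """Free-running decoding: each step conditions on the model's own
    previous samples, so an early mistake shifts the prefix distribution
    for every later step -- the root of exposure bias."""
    seq = []
    for _ in range(length):
        probs = model(seq)                  # self-conditioned prefix
        r, cum = rng.random(), 0.0
        for token, p in enumerate(probs):
            cum += p
            if r < cum:
                seq.append(token)
                break
    return seq

model = lambda prefix: [0.5, 0.5]           # toy 2-token model
out = sample_sequence(model, 5, random.Random(0))
```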

  5. Teacher Forcing and Exposure Bias • Exposure Bias [Ranzato et al., 2015]: • The intermediate process is inconsistent between the training stage and the inference stage. • The resulting distribution shift accumulates along the timeline. [Diagram: during training, teacher forcing conditions the model's p(x|s) on a real prefix; during inference, random sampling conditions it on a generated prefix.]

  6. Exposure Bias and Kullback-Leibler Divergence • Exposure Bias can also be regarded as a consequence of optimizing via minimization of the Kullback-Leibler divergence, denoted KL(P||Q) for distributions P, Q.

  7. Kullback-Leibler Divergence, Symmetry of Divergences • For arbitrary P, Q, KL(P||Q) does not necessarily equal KL(Q||P). • Smoothing and symmetrizing KL yields the Jensen-Shannon Divergence: JSD(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), • where M = 0.5 * (P + Q)
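Both the asymmetry of KL and the symmetry of JSD are easy to verify numerically on small categorical distributions. A self-contained sketch (function names are mine, not from the slides):

```python
import math

def kl(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i); note the asymmetry in roles."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrize KL against the mixture
    M = 0.5 * (P + Q)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P, Q = [0.9, 0.1], [0.5, 0.5]
# kl(P, Q) != kl(Q, P), but jsd(P, Q) == jsd(Q, P),
# and JSD is bounded by log(2) in nats.
```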

  8. GAN, SeqGAN and Language GANs • Ian Goodfellow proposed the Generative Adversarial Network [Goodfellow et al., 2014]. • Ideally, GAN minimizes the JSD between the data and model distributions. • It cannot be directly applied to discrete sequence generation, since sampling discrete tokens is non-differentiable. • SeqGAN uses the REINFORCE gradient estimator to resolve this.

  9. Problems of SeqGAN • Not trivially able to work from scratch; SeqGAN's work-around is pre-training via teacher forcing. • Trades diversity for quality (mode collapse), according to previous reports [Lu et al., 2018; Caccia et al., 2018].

  10. Problems of SeqGAN • The training signal is too sparse: the discriminator returns only a single scalar for the whole sampled sequence. [Diagram: the generator is unrolled with stochastic self-samples ("Billie Jean …"); the discriminator feeds back a single-point signal for the complete sample.]

  11. Cooperative Training: Back to Formula! • Reconsider the algorithm from the view of estimating & minimizing the JSD: JSD(P||G) = 0.5 * KL(P||M) + 0.5 * KL(G||M), • where M = 0.5 * (P + G) • Instead of using a discriminator to achieve this, use another sequence model, called the "Mediator", to approximate the mixture density M.
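Why does a maximum-likelihood model recover the mixture M? Fitting a density on an equal mix of real and generated samples converges (in the large-sample limit) to 0.5 * (P + G). A count-based sketch on a toy 2-token vocabulary, purely to illustrate the idea; the data and helper are invented:

```python
from collections import Counter

def fit_mle(samples, vocab):
    """MLE for a categorical density: normalized empirical counts."""
    counts = Counter(samples)
    return [counts[v] / len(samples) for v in vocab]

vocab = [0, 1]
real = [0] * 8 + [1] * 2        # empirical P = [0.8, 0.2]
fake = [0] * 2 + [1] * 8        # empirical G = [0.2, 0.8]

# Training the "mediator" on a balanced pool of real + generated samples
# yields exactly the mixture M = 0.5 * (P + G).
mediator = fit_mle(real + fake, vocab)
```

In CoT the mediator is itself an autoregressive sequence model trained by teacher forcing on this balanced pool, so the same argument applies per time step.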

  12. Cooperative Training: More Information from Mediator • Key Idea: the mediator provides a DISTRIBUTION-level signal at each time step. [Diagram: the generator is unrolled from an initial state over a sample ("Billie Jean …"); at every step its next-token distribution G(x|s) receives a signal from the mediator's corresponding distribution M(x|s).]

  13. Cooperative Training: Factorizing the Cumulative Gradient Through Time, Final Objectives • Generator Gradient: ∇_θ J_g(θ) = (1/2) E_{s_{<t} ~ G_θ} [ Σ_t ∇_θ π_g(s_{<t})^T ( log π_m(s_{<t}) − log π_g(s_{<t}) ) ] • where π_g(x | s_{<t}) = G_θ(x | s_{<t}) and π_m(x | s_{<t}) = M_φ(x | s_{<t}) are the per-step next-token distributions of the generator and the mediator. • Mediator Objective: maximum likelihood on the balanced mixture of real and generated samples, J_m(φ) = (1/2) ( E_{s~P}[ log M_φ(s) ] + E_{s~G_θ}[ log M_φ(s) ] )
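The per-step generator term above is just the negative KL between the generator's and the mediator's next-token distributions at one prefix, which is what makes the signal distribution-level rather than a single scalar reward. A numeric sketch of that one term (names invented; this is not the repo's training code):

```python
import math

def cot_step_objective(pi_g, pi_m):
    """One step of the generator objective at a fixed prefix:
    sum_x pi_g(x) * (log pi_m(x) - log pi_g(x)) = -KL(pi_g || pi_m).
    It is <= 0 and equals 0 exactly when pi_g matches the mediator."""
    return sum(g * (math.log(m) - math.log(g))
               for g, m in zip(pi_g, pi_m) if g > 0)

obj = cot_step_objective([0.7, 0.3], [0.5, 0.5])   # negative: pi_g != pi_m
```

Gradient ascent on this term pulls the generator's per-step distribution toward the mediator's estimate of the mixture, at every position in the sequence.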

  14. Experiment: Synthetic Turing Test

  15. Experiment: Real World Data [Figures: quality test on the EMNLP2017 WMT News section; diversity test on the EMNLP2017 WMT News section, showing reasonable diversity.]

  16. Conclusion (Poster #44) • Key Ideas: • Replace the min-max game of GANs with a max-max game, while still focusing on minimization of the JSD. • Use a distribution-level signal from the introduced mediator at each step. • Advantages: • Works from scratch, without MLE pre-training. • Delivers a performance gain free of the quality-diversity trade-off, while remaining computationally cheap.
