CoT: Cooperative Training for Generative Modeling of Discrete Data
https://github.com/desire2020/CoT
Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, Weinan Zhang, and Yong Yu
Shanghai Jiao Tong University
Autoregressive Models
• Autoregressive models factorize the joint distribution sequentially to build a fully tractable density function:
• p_θ(y_0, y_1, …, y_{n−1}) = p_θ(y_0) · p_θ(y_1 | s_[0:1)) · p_θ(y_2 | s_[0:2)) · p_θ(y_3 | s_[0:3)) · … · p_θ(y_{n−1} | s_[0:n−1))
• where s_[0:t) denotes the prefix (y_0, …, y_{t−1})
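As a concrete illustration of the factorization above, the chain rule can be sketched in a few lines of numpy; the toy logits and the vocabulary size are made up for the example and stand in for a real sequence model:

```python
import numpy as np
from itertools import product

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sequence_log_prob(step_logits, tokens):
    """Chain rule: log p(y_0,...,y_{n-1}) = sum_t log p(y_t | s_[0:t)).
    step_logits[t] stands in for the model's output at step t,
    already conditioned on the prefix s_[0:t)."""
    log_probs = np.log(softmax(step_logits))
    return float(sum(log_probs[t, y] for t, y in enumerate(tokens)))

# Toy setting: vocabulary of 4 tokens, sequence length 3.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
lp = sequence_log_prob(logits, [1, 3, 0])

# The factorization defines a proper density: the probabilities of
# all 4^3 possible sequences sum to one.
total = sum(np.exp(sequence_log_prob(logits, list(seq)))
            for seq in product(range(4), repeat=3))
```

Full tractability is exactly this property: any sequence's likelihood is an explicit product of per-step conditionals.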
Teacher Forcing and Exposure Bias
• For each sequence in the training set, maximize the estimated likelihood on the log scale.
• [Figure: the model is unrolled from the initial state; at every step it is fed the forced ground-truth observation and outputs an estimate p(x|s); example prefix: "I have"]
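The teacher-forcing loss described above can be sketched as follows; `model_step`, a function mapping a prefix of token ids to logits, is a hypothetical stand-in for a real recurrent generator:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def teacher_forcing_nll(model_step, sequence, init_state):
    """Average negative log-likelihood under teacher forcing: at every
    step the model conditions on the *observed* token, never on its
    own sample."""
    state, nll = list(init_state), 0.0
    for y in sequence:
        probs = softmax(model_step(state))
        nll -= np.log(probs[y])
        state.append(y)              # force the ground-truth token
    return float(nll / len(sequence))

# A uniform toy model over 4 tokens gives NLL = log(4) at every step.
uniform_step = lambda prefix: np.zeros(4)
nll = teacher_forcing_nll(uniform_step, [2, 0, 3], [])
```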
Teacher Forcing and Exposure Bias
• When used to generate a random sample:
• [Figure: at inference the model is unrolled from the initial state; each step draws a stochastic sample from p(x|s) and feeds it back as its own next input; example output: "Billie Jean"]
Teacher Forcing and Exposure Bias
• Exposure Bias [Ranzato et al., 2015]:
• The intermediate states the model conditions on during training (real prefixes) and during inference (its own generated prefixes) are inconsistent.
• The resulting distribution shift accumulates along the timeline.
• [Figure: teacher forcing conditions the model on a real training prefix; random sampling conditions it on a generated inference prefix]
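The inference-time counterpart of teacher forcing is free-running sampling, where the model is fed its own stochastic samples. A minimal sketch, using the same hypothetical `model_step` interface as a stand-in for a real generator:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def free_running_sample(model_step, init_state, length, rng):
    """Free-running generation: each step conditions on the model's own
    stochastic samples, so an early sampling error shifts every later
    conditional -- the root of exposure bias."""
    state, out = list(init_state), []
    for _ in range(length):
        probs = softmax(model_step(state))
        y = int(rng.choice(len(probs), p=probs))
        out.append(y)
        state.append(y)              # self-feed the sampled token
    return out

sample = free_running_sample(lambda prefix: np.zeros(4), [], 5,
                             np.random.default_rng(0))
```

Comparing this loop with the teacher-forcing loop makes the mismatch concrete: the only difference is which token is appended to the state, yet that difference compounds over time.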
Exposure Bias and Kullback-Leibler Divergence
• Exposure bias can also be regarded as a consequence of optimizing via minimization of the Kullback-Leibler divergence, denoted KL(P||Q) for distributions P and Q.
Kullback-Leibler Divergence, Symmetry of Divergences
• For any P, Q, KL(P||Q) does not necessarily equal KL(Q||P).
• Smoothing and symmetrizing KL yields the Jensen-Shannon Divergence:
• JSD(P||Q) = 0.5 · KL(P||M) + 0.5 · KL(Q||M)
• where M = 0.5 · (P + Q)
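The asymmetry of KL and the symmetry of JSD are easy to check numerically for a pair of toy discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """JSD(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M) with M = 0.5*(P+Q)."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.9, 0.1], [0.5, 0.5]
kl_pq, kl_qp = kl(p, q), kl(q, p)   # differ: KL is asymmetric
j = jsd(p, q)                       # symmetric, bounded by log(2)
```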
GAN, SeqGAN and Language GANs
• Ian Goodfellow proposed the Generative Adversarial Network [2014].
• Ideally, GAN minimizes the JSD.
• It can't be directly applied to discrete sequence generation, since sampling discrete tokens is non-differentiable.
• SeqGAN uses the REINFORCE gradient estimator to resolve this.
Problems of SeqGAN
• Not trivially able to work from scratch.
• SeqGAN's work-around: pre-training via teacher forcing.
• Trades diversity for quality (mode collapse), according to previous reports [Lu et al., 2018; Caccia et al., 2018].
Problems of SeqGAN
• The training signal is too sparse: the discriminator returns only a single scalar score for the whole generated sequence.
• [Figure: the generator is unrolled with stochastic self-fed samples ("Billie Jean"); the discriminator sends back a single-point signal for the complete sequence]
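The REINFORCE work-around, and why its signal is sparse, can be sketched with a tabular toy policy; the (T, V) logit table and the scalar reward are hypothetical stand-ins for a real generator and discriminator:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_grad(theta, sampled_seq, reward):
    """REINFORCE for a discrete sequence: the same scalar reward
    (e.g. one discriminator score for the *whole* sequence) multiplies
    grad log p_theta(y_t | prefix) at every step -- a single-point,
    high-variance signal. theta is a (T, V) logit table standing in
    for a real autoregressive generator."""
    grad = np.zeros_like(theta)
    for t, y in enumerate(sampled_seq):
        g = -softmax(theta[t])       # d log-softmax / d logits ...
        g[y] += 1.0                  # ... is one-hot minus probs
        grad[t] = reward * g
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
g1 = reinforce_grad(theta, [1, 0, 2], 1.0)
g2 = reinforce_grad(theta, [1, 0, 2], 2.0)
```

Note how the single scalar `reward` only rescales the whole gradient: every step receives the same credit regardless of which token actually helped or hurt.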
Cooperative Training: Back to the Formula!
• Reconsider the algorithm as estimating and minimizing the JSD:
• JSD(P||G) = 0.5 · KL(P||M) + 0.5 · KL(G||M), where M = 0.5 · (P + G)
• Instead of using a discriminator to achieve this, use another sequence model, called the "Mediator", to approximate the mixture density M.
Cooperative Training: More Information from the Mediator
• Key Idea: the mediator provides a DISTRIBUTION-level signal at each time step.
• [Figure: the generator is unrolled ("Billie Jean"); at every step the mediator's conditional M(x|s) is compared against the generator's G(x|s), yielding a per-step signal]
Cooperative Training: Factorizing the Cumulative Gradient Through Time, Final Objectives
• Generator gradient:
• ∇_θ J_g(θ) = E_{s_t ~ G_θ} [ ∇_θ π_g(s_t)^T (log π_m(s_t) − log π_g(s_t)) ]
• where π_g(s_t) = G_θ(y_t | s_t), π_m(s_t) = M_φ(y_t | s_t)
• Mediator objective:
• J_m(φ) = E_{s ~ 0.5·(P + G_θ)} [ −log M_φ(s) ], i.e. maximum likelihood on the balanced mixture of real and generated sequences.
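Under the objectives above, the per-step losses reduce to a few lines. This is a sketch for a single state s_t with made-up logits, not the authors' implementation:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def generator_step_loss(g_logits, m_logits):
    """Negative of pi_g(s_t)^T (log pi_m(s_t) - log pi_g(s_t)),
    i.e. KL(pi_g || pi_m): a full-distribution signal from the
    mediator at every step, not one end-of-sequence scalar."""
    pi_g, pi_m = softmax(g_logits), softmax(m_logits)
    return float(-np.dot(pi_g, np.log(pi_m) - np.log(pi_g)))

def mediator_step_nll(m_logits, token):
    """Mediator loss per step: plain negative log-likelihood; trained
    on the balanced mixture of real and generated sequences, the
    mediator approximates M = 0.5 * (P + G)."""
    return float(-np.log(softmax(m_logits)[token]))

rng = np.random.default_rng(0)
g, m = rng.normal(size=5), rng.normal(size=5)
loss = generator_step_loss(g, m)        # a KL term, so >= 0
loss_self = generator_step_loss(g, g)   # zero when G matches M
```

Because the per-step loss is a KL divergence, it vanishes exactly when the generator's conditional matches the mediator's, which is the fixed point the max-max game drives toward.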
Experiment: Synthetic Turing Test
Experiment: Real World Data
• Quality test on the EMNLP2017 WMT News section
• Diversity test on the EMNLP2017 WMT News section
Poster #44
Conclusion
• Key Ideas:
• Use a max-max game to replace the min-max game of GANs, while still focusing on minimization of the JSD.
• Use a distribution-level signal from the introduced mediator at each step.
• Advantages:
• Works from scratch.
• Trade-off-invariant performance gain while still being computationally cheap.