arxiv 1511 06349v4 cs lg 12 may 2016

arXiv:1511.06349v4 [cs.LG] 12 May 2016



  1. Generating Sentences from a Continuous Space

Samuel R. Bowman∗, NLP Group and Dept. of Linguistics, Stanford University, sbowman@stanford.edu
Luke Vilnis∗, CICS, University of Massachusetts Amherst, luke@cs.umass.edu
Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz & Samy Bengio, Google Brain, {vinyals, adai, rafalj, bengio}@google.com

∗ First two authors contributed equally. Work was done when all authors were at Google, Inc.

Abstract

The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model’s latent sentence space, and present negative results on the use of the model in language modeling.

i went to the store to buy some groceries .
i store to buy some groceries .
i were to buy any groceries .
horses are to buy any groceries .
horses are to buy any animal .
horses the favorite any animal .
horses the favorite favorite animal .
horses are my favorite animal .

Table 1: Sentences produced by greedily decoding from points between two sentence encodings with a conventional autoencoder. The intermediate sentences are not plausible English.

1 Introduction

Recurrent neural network language models (RNNLMs, Mikolov et al., 2011) represent the state of the art in unsupervised generative modeling for natural language sentences. In supervised settings, RNNLM decoders conditioned on task-specific features are the state of the art in tasks like machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015; Mao et al., 2015; Donahue et al., 2015). The RNNLM generates sentences word-by-word based on an evolving distributed state representation, which makes it a probabilistic model with no significant independence assumptions, and makes it capable of modeling complex distributions over sequences, including those with long-term dependencies. However, by breaking the model structure down into a series of next-step predictions, the RNNLM does not expose an interpretable representation of global features like topic or of high-level syntactic properties.

We propose an extension of the RNNLM that is designed to explicitly capture such global features in a continuous latent variable. Naively, maximum likelihood learning in such a model presents an intractable inference problem. Drawing inspiration from recent successes in modeling images (Gregor et al., 2015), handwriting, and natural speech (Chung et al., 2015), our model circumvents these difficulties using the architecture of a variational autoencoder and takes advantage of recent advances in variational inference (Kingma and Welling, 2015; Rezende et al., 2014) that introduce a practical training technique for powerful neural network generative models with latent variables.

Our contributions are as follows: We propose a variational autoencoder architecture for text and discuss some of the obstacles to training it as well as our proposed solutions. We find that on a standard language modeling evaluation where a global variable is not explicitly needed, this model yields similar performance to existing RNNLMs. We also evaluate our model using a larger corpus on the task of imputing missing words. For this task, we introduce a novel evaluation strategy using an

  2. adversarial classifier, sidestepping the issue of intractable likelihood computations by drawing inspiration from work on non-parametric two-sample tests and adversarial training. In this setting, our model’s global latent variable allows it to do well where simpler models fail. We finally introduce several qualitative techniques for analyzing the ability of our model to learn high-level features of sentences. We find that they can produce diverse, coherent sentences through purely deterministic decoding and that they can interpolate smoothly between sentences.

2 Background

2.1 Unsupervised sentence encoding

A standard RNN language model predicts each word of a sentence conditioned on the previous word and an evolving hidden state. While effective, it does not learn a vector representation of the full sentence. In order to incorporate a continuous latent sentence representation, we first need a method to map between sentences and distributed representations that can be trained in an unsupervised setting. While no strong generative model is available for this problem, three non-generative techniques have shown promise: sequence autoencoders, skip-thought, and paragraph vector.

Sequence autoencoders have seen some success in pre-training sequence models for supervised downstream tasks (Dai and Le, 2015) and in generating complete documents (Li et al., 2015a). An autoencoder consists of an encoder function $\varphi_{enc}$ and a probabilistic decoder model $p(x \mid \vec{z} = \varphi_{enc}(x))$, and maximizes the likelihood of an example $x$ conditioned on $\vec{z}$, the learned code for $x$. In the case of a sequence autoencoder, both encoder and decoder are RNNs and examples are token sequences.

Standard autoencoders are not effective at extracting global semantic features. In Table 1, we present the results of computing a path or homotopy between the encodings for two sentences and decoding each intermediate code. The intermediate sentences are generally ungrammatical and do not transition smoothly from one to the other. This suggests that these models do not generally learn a smooth, interpretable feature system for sentence encoding. In addition, since these models do not incorporate a prior over $\vec{z}$, they cannot be used to assign probabilities to sentences or to sample novel sentences.

Two other models have shown promise in learning sentence encodings, but cannot be used in a generative setting: Skip-thought models (Kiros et al., 2015) are unsupervised learning models that take the same model structure as a sequence autoencoder, but generate text conditioned on a neighboring sentence from the target text, instead of on the target sentence itself. Finally, paragraph vector models (Le and Mikolov, 2014) are non-recurrent sentence representation models. In a paragraph vector model, the encoding of a sentence is obtained by performing gradient-based inference on a prospective encoding vector with the goal of using it to predict the words in the sentence.

2.2 The variational autoencoder

The variational autoencoder (VAE, Kingma and Welling, 2015; Rezende et al., 2014) is a generative model that is based on a regularized version of the standard autoencoder. This model imposes a prior distribution on the hidden codes $\vec{z}$ which enforces a regular geometry over codes and makes it possible to draw proper samples from the model using ancestral sampling.

The VAE modifies the autoencoder architecture by replacing the deterministic function $\varphi_{enc}$ with a learned posterior recognition model, $q(\vec{z} \mid x)$. This model parametrizes an approximate posterior distribution over $\vec{z}$ (usually a diagonal Gaussian) with a neural network conditioned on $x$. Intuitively, the VAE learns codes not as single points, but as soft ellipsoidal regions in latent space, forcing the codes to fill the space rather than memorizing the training data as isolated codes.

If the VAE were trained with a standard autoencoder’s reconstruction objective, it would learn to encode its inputs deterministically by making the variances in $q(\vec{z} \mid x)$ vanishingly small (Raiko et al., 2015). Instead, the VAE uses an objective which encourages the model to keep its posterior distributions close to a prior $p(\vec{z})$, generally a standard Gaussian ($\mu = \vec{0}$, $\sigma = \vec{1}$). Additionally, this objective is a valid lower bound on the true log likelihood of the data, making the VAE a generative model. This objective takes the following form:

$$\mathcal{L}(\theta; x) = -\mathrm{KL}(q_\theta(\vec{z} \mid x) \,\|\, p(\vec{z})) + \mathbb{E}_{q_\theta(\vec{z} \mid x)}[\log p_\theta(x \mid \vec{z})] \le \log p(x). \tag{1}$$

This forces the model to be able to decode plausible sentences from every point in the latent space that has a reasonable probability under the prior.

In the experiments presented below using VAE models, we use diagonal Gaussians for the prior and posterior distributions $p(\vec{z})$ and $q(\vec{z} \mid x)$, using the Gaussian reparameterization trick of Kingma and Welling (2015). We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from $q(\vec{z} \mid x)$, but compute the KL divergence term of the cost function in closed form, again following Kingma and Welling (2015).
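The estimation procedure described in the last paragraph can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior parametrized by a mean and a log-variance and a standard Gaussian prior, as the text states. The function names and the toy quadratic log-likelihood are hypothetical stand-ins and not part of the paper; in the actual model, the reconstruction term would be computed by the RNN decoder.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def reparameterized_sample(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def elbo_single_sample(mu, log_var, log_likelihood, rng):
    """One-sample estimate of Eq. (1): log p(x | z) - KL(q || p), z ~ q."""
    z = reparameterized_sample(mu, log_var, rng)
    return log_likelihood(z) - kl_to_standard_normal(mu, log_var)


def toy_log_likelihood(z):
    """Hypothetical stand-in for log p(x | z); not the paper's RNN decoder."""
    return -0.5 * sum((zi - 1.0) ** 2 for zi in z)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [-1.0, -0.5]
    print("KL term:", kl_to_standard_normal(mu, log_var))
    print("ELBO estimate:",
          elbo_single_sample(mu, log_var, toy_log_likelihood, rng))
```

In this sketch the KL term is zero when the posterior equals the prior and grows as the posterior variance shrinks toward zero, which is the pressure against deterministic encodings that the text describes.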
