arXiv:1610.04211v2 [cs.CL] 17 Nov 2016

Gated End-to-End Memory Networks

Fei Liu∗, The University of Melbourne, Victoria, Australia, fliu3@student.unimelb.edu.au
Julien Perez, Xerox Research Centre Europe, Grenoble, France, julien.perez@xrce.xerox.com


∗ Fei Liu's work was done as an intern at Xerox Research Centre Europe.

Abstract

Machine reading using differentiable reasoning models has recently shown remarkable progress. In this context, End-to-End trainable Memory Networks (MemN2N) have demonstrated promising performance on simple natural language based reasoning tasks such as factual reasoning and basic deduction. However, other tasks, namely multi-fact question-answering, positional reasoning or dialog related tasks, remain challenging, particularly due to the necessity of more complex interactions between the memory and controller modules composing this family of models. In this paper, we introduce a novel end-to-end memory access regulation mechanism inspired by the current progress on the connection short-cutting principle in the field of computer vision. Concretely, we develop a Gated End-to-End trainable Memory Network architecture (GMemN2N). From the machine learning perspective, this new capability is learned in an end-to-end fashion without the use of any additional supervision signal which is, as far as our knowledge goes, the first of its kind. Our experiments show significant improvements on the most challenging tasks in the 20 bAbI dataset, without the use of any domain knowledge. Then, we show improvements on the Dialog bAbI tasks including the real human-bot conversation-based Dialog State Tracking Challenge (DSTC-2) dataset. On these two datasets, our model sets the new state of the art.

1 Introduction

Deeper Neural Network models are more difficult to train and recurrence tends to complicate this optimization problem (Srivastava et al., 2015b). While Deep Neural Network architectures have shown superior performance in numerous areas, such as image, speech recognition and more recently text, the complexity of optimizing such large and non-convex parameter sets remains a challenge. Indeed, the so-called vanishing/exploding gradient problem has been mainly addressed using: 1. algorithmic responses, e.g., normalized initialization strategies (LeCun et al., 1998; Glorot and Bengio, 2010); 2. architectural ones, e.g., intermediate normalization layers which facilitate the convergence of networks composed of tens of hidden layers (He et al., 2015; Saxe et al., 2014).

Another problem of memory-enhanced neural models is the necessity of regulating memory access at the controller level. Memory access operations can be supervised (Kumar et al., 2016) and the number of times they are performed tends to be fixed a priori (Sukhbaatar et al., 2015), a design choice which tends to be based on the presumed degree of difficulty of the task in question. Inspired by the recent success of object recognition in the field of computer vision (Srivastava et al., 2015a; Srivastava et al., 2015b), we investigate the use of a gating mechanism in the context of End-to-End Memory Networks (MemN2N) (Sukhbaatar et al., 2015) in order to regulate the access to the memory blocks in a differentiable fashion. The formulation is realized by gated connections between the memory access layers and the controller stack of a MemN2N. As a result, the model is able to dynamically determine how and when to skip its memory-based reasoning process.

Roadmap: Section 2 reviews state-of-the-art Memory Network models, connection short-cutting in neural networks and memory dynamics. In Section 3, we propose a differentiable gating mechanism in MemN2N. Sections 4 and 5 present a set of experiments on the 20 bAbI reasoning tasks and the Dialog bAbI dataset. We report new state-of-the-art results on several of the most challenging tasks of the set, namely positional reasoning, 3-argument relation and the DSTC-2 task, while maintaining equally competitive results on the rest.

2 Related Work

This section starts with an introduction of the primary elements of MemN2N. Then, we review two key elements relevant to this work, namely shortcut connections in neural networks and memory dynamics in such models.

2.1 End-to-End Memory Networks

The MemN2N architecture, introduced by Sukhbaatar et al. (2015), consists of two main components: supporting memories and final answer prediction. Supporting memories are in turn comprised of a set of input and output memory representations with memory cells. The input and output memory cells, denoted by $m_i$ and $c_i$, are obtained by transforming the input context $x_1, \ldots, x_n$ (or stories) using two embedding matrices $A$ and $C$ (both of size $d \times |V|$, where $d$ is the embedding size and $|V|$ the vocabulary size) such that $m_i = A\Phi(x_i)$ and $c_i = C\Phi(x_i)$, where $\Phi(\cdot)$ is a function that maps the input into a bag of dimension $|V|$. Similarly, the question $q$ is encoded using another embedding matrix $B \in \mathbb{R}^{d \times |V|}$, resulting in a question embedding $u = B\Phi(q)$. The input memories $\{m_i\}$, together with the embedding of the question $u$, are utilized to determine the relevance of each of the stories in the context, yielding a vector of attention weights

$$p_i = \mathrm{softmax}(u^\top m_i) \tag{1}$$

where $\mathrm{softmax}(a_i) = e^{a_i} / \sum_{j \in [1,n]} e^{a_j}$. Subsequently, the response $o$ from the output memory is constructed by the weighted sum:

$$o = \sum_i p_i c_i \tag{2}$$

For more difficult tasks requiring multiple supporting memories, the model can be extended to include more than one set of input/output memories by stacking a number of memory layers. In this setting, each memory layer is named a hop and the $(k+1)$-th hop takes as input the output of the $k$-th hop:

$$u^{k+1} = o^k + u^k \tag{3}$$

Lastly, the final step, the prediction of the answer to the question $q$, is performed by

$$\hat{a} = \mathrm{softmax}(W(o^K + u^K)) \tag{4}$$

where $\hat{a}$ is the predicted answer distribution, $W \in \mathbb{R}^{|V| \times d}$ is a parameter matrix for the model to learn and $K$ the total number of hops.

2.2 Shortcut Connections

Shortcut connections have been studied from both the theoretical and practical point of view in the general context of neural network architectures (Bishop, 1995; Ripley, 2007). More recently, Residual Networks (He et al., 2016) and Highway Networks (Srivastava et al., 2015a; Srivastava et al., 2015b) have been almost simultaneously proposed. While the former utilizes a residual calculus, the latter formulates a differentiable gateway mechanism, as proposed in Long Short-Term Memory networks, in order to cope with long-term dependency issues in the dataset in an end-to-end trainable manner. These two mechanisms were proposed as a structural solution to the so-called vanishing gradient problem by allowing the model to shortcut its layered transformation structure when necessary.

2.3 Memory Dynamics

The necessity of dynamically regulating the interaction between the so-called controller and the memory blocks of a Memory Network model has been studied in (Kumar et al., 2016; Xiong et al., 2016). In these works, the number of exchanges between the controller stack and the memory module of the network is either monitored in a hard supervised manner in the former or fixed a priori in the latter.

In this paper, we propose an end-to-end supervised model, with an automatically learned gating mechanism, to perform dynamic regulation of memory interaction. The next section presents the formulation of this new Gated End-to-End Memory Networks (GMemN2N). This contribution can be placed in parallel to the recent transition from
