Investigating Relational Recurrent Neural Networks with Variable Length Memory Pointer
Mahtab Ahmed and Robert E. Mercer
Department of Computer Science, University of Western Ontario, London, ON, Canada
Introduction
• Memory-based neural networks can retain information longer while modelling temporal data.
• Encode a Relational Memory Core (RMC) as the cell state inside an LSTM cell.
• Uses standard multi-head self-attention.
• Uses a variable length memory pointer.
• Evaluate on four different tasks.
• State of the art on one of them; on par with the state of the art on the other three.
Standard LSTM
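For reference, the standard LSTM cell that the model modifies follows the usual formulation (x_t is the input, h_{t-1} the previous hidden state, c_t the cell state, \sigma the sigmoid, \odot the element-wise product):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

The candidate cell state \tilde{c}_t is the quantity that the relational memory replaces on the following slides.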
The model: Fixed Length Memory Pointer
• Apply multi-head self-attention to the memory and create a weighted version, 𝑁.
• Add a residual connection.
• Apply a Layer-Normalization block on top of 𝑁.
• Maintain separate versions of the mean and variance projection matrices.
A minimal sketch of this block follows.
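The sketch below assumes a PyTorch-style implementation; the class name AttentionOverMemory, the head count, and the tensor layout are illustrative and not taken from the paper.

import torch
import torch.nn as nn

class AttentionOverMemory(nn.Module):
    """Multi-head self-attention over the memory, followed by a residual
    connection and Layer-Normalization, as on this slide."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (batch, mem_slots, d_model)
        weighted, _ = self.attn(memory, memory, memory)   # weighted version of the memory
        return self.norm(memory + weighted)               # residual connection + LayerNorm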
The model: Fixed Length Memory Pointer (contd.)
• n non-linear projections of h_t are applied, followed by a residual connection, with f = ReLU and h_t = 𝑁.
• The resultant tensor 𝑌 (of shape 2 × b × d) is split on the cardinal dimension to extract the memory.
• The LSTM's candidate cell state is redefined in terms of this extracted memory.
• y_t is replaced with the projected input (= Xy_t) in all of the LSTM equations.
A sketch of these steps appears below.
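The remaining steps can be sketched as follows (again a PyTorch-style illustration; the slide's formula for the candidate cell state appears only as an image, so the exact way the memory enters c_t below, and the single projection in place of the paper's n projections, are assumptions):

import torch
import torch.nn.functional as F

def project_split_update(N, W1, W2, i_t, f_t, o_t, c_prev):
    """N: (2, b, d) output of the attention + residual + LayerNorm block.
    i_t, f_t, o_t are the LSTM gates, assumed to be computed from the
    projected input X y_t and h_{t-1} as in the standard LSTM equations."""
    Y = N + F.relu(N @ W1) @ W2                      # non-linear projection (f = ReLU) + residual
    memory = Y[0]                                    # split on the cardinal (first) dimension
    c_t = f_t * c_prev + i_t * torch.tanh(memory)    # the memory drives the candidate cell state
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t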
Variable Length Memory Pointer
• Share W across all time steps.
• Apply all of the steps as before.
• For Layer-Normalization, maintain just one version of the mean and variance projection matrices.
• The memory is still on the cardinal dimension.
• Rather than attending to everything seen before, track a fixed window of words (n-grams).
• This mimics the behaviour of a convolution kernel (see the sketch below).
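A sketch of the windowed pointer; the window size and tensor layout are assumptions for illustration.

import torch

def windowed_memory(hidden_states: torch.Tensor, t: int, window: int) -> torch.Tensor:
    """hidden_states: (seq_len, batch, d_model); t is the current time step.
    Only the last `window` positions feed the attention block, giving an
    n-gram-style receptive field that slides over time like a convolution kernel."""
    start = max(0, t - window + 1)
    return hidden_states[start:t + 1]

Attention is then computed over this slice instead of the full history, so the pointer length (the window size) becomes a tunable hyperparameter.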
Model Architecture
[Figure: the block is unrolled over time steps (three are shown); within each step the input flows through a Linear Projection, Multi-Head Attention, Layer-Normalization, a Non-Linear Projection, another Layer-Normalization, and finally the LSTM equations.]
Sentence Pair Modelling
[Figure: the left and right sentences are mapped to word representations, encoded into sentence representations, combined (⊕), and passed to a classifier that outputs the classes, following InferSent (https://arxiv.org/abs/1705.02364).]
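A sketch of the pair classifier on top of the two sentence representations; the feature combination [u; v; |u − v|; u ⊙ v] follows the InferSent paper, and whether this work uses exactly that combination is an assumption.

import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Combine the left and right sentence representations (the ⊕ in the figure)
    and map them to class logits."""
    def __init__(self, d_sent: int, n_classes: int, d_hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * d_sent, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_classes))

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(features)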
Hyperparameters
We tried a range of values for each hyperparameter; the ones that worked for us are bold-faced.
[Table: hyperparameter settings.]
Experimental Results
Models marked with † are the ones that we implemented.
[Table: results on the evaluation tasks.]
Attention Visualization
[Figure: attention weights over the memory slots as each new word is read, for the sentence pair "He also worked in the Virginia attorney general's office." and "Before that he held various posts in Virginia, including deputy attorney general."]
Conclusion
• Extend the classical RMC with a variable length memory pointer.
• Uses a non-local context to compute an enhanced memory.
• Design a sentence pair modelling architecture.
• Evaluate on four different tasks.
• On par performance on most of the tasks and the best performance on one of them.
• The shifting of attention across the memory is readily interpretable.
• The memory pointer length does not follow a uniform pattern across all datasets.
Thank you