Understanding Hidden Memories of Recurrent Neural Networks. Yao Ming, Shaozu Cao, Ruixiang Zhang, Zhen Li, Yuanzhe Chen, Yangqiu Song, Huamin Qu. THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
What is a Recurrent Neural Network?
Introduction: What is a Recurrent Neural Network (RNN)? A deep learning model used for Machine Translation, Speech Recognition, Language Modeling, … [Figure: a vanilla RNN cell with input x(t), hidden state h(t) (tanh activation), and output y(t).]
Introduction: What is a Recurrent Neural Network (RNN)? A vanilla RNN takes an input x^(t) and updates its hidden state h^(t-1) using: h^(t) = tanh(V h^(t-1) + W x^(t)). [Figure: a vanilla RNN cell, and a 2-layer RNN unrolled over an input sequence x^(1)…x^(4) with two stacked hidden states h_1, h_2 and outputs y^(1)…y^(4).]
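As a concrete reference, here is a minimal NumPy sketch of this update rule, assuming toy dimensions and randomly initialized weights; the names rnn_step, run_rnn, W, and V are illustrative and not taken from the RNNVis code or the paper's trained models.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, V):
    """One step of a vanilla RNN cell: h^(t) = tanh(V h^(t-1) + W x^(t))."""
    return np.tanh(V @ h_prev + W @ x_t)

def run_rnn(inputs, W, V, h0=None):
    """Unroll the cell over a sequence and return all hidden states."""
    h = np.zeros(V.shape[0]) if h0 is None else h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W, V)
        states.append(h)
    return np.stack(states)

# Toy dimensions: 8-dimensional inputs, 16 hidden units, a sequence of 5 steps.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 8))
V = rng.normal(scale=0.1, size=(16, 16))
xs = rng.normal(size=(5, 8))
hs = run_rnn(xs, W, V)   # shape (5, 16): one hidden state per step
```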
What has the RNN learned from data? [Figure: the RNN as a black box between input and output.]
Motivation: What has the RNN learned from data? A. Map the value of a single hidden unit onto the data (Karpathy A. et al., 2015). Example: a unit sensitive to position in a line. Many more units have no clear meaning.
Motivation: What has the RNN learned from data? B. Matrix plots (Li J. et al., 2016). Each column represents the value of the hidden state vector when the model reads an input word. Scalability is the issue: Machine Translation uses 4 layers with 1000 units/layer (Sutskever I. et al., 2014); Language Modeling uses 2 layers with 1500 units/layer (Zaremba et al., 2015).
Our Solution - RNNVis
Our Solution: Explaining individual hidden units; Bi-graph and co-clustering; Sequence evaluation
Solution: Explaining an individual hidden unit using its most salient words. How do we define salient? The model's response to a word x at step t is the update of the hidden state, Δh^(t) = (Δh_1^(t), …, Δh_n^(t)). A larger |Δh_i^(t)| implies that the word x is more salient to unit i. Since Δh_i^(t) can vary given the same word x, we use the expectation E[Δh_i | x^(t) = x], which can be estimated by running the model on the dataset and taking the mean.
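A hedged sketch of this estimation, assuming a vanilla RNN cell and a word-embedding lookup (the corpus, embed, and weight names are made up for illustration): run the model over the corpus, record Δh^(t) at every word occurrence, and average per word.

```python
from collections import defaultdict
import numpy as np

def expected_response(corpus, embed, W, V):
    """Estimate E[Δh | x = w]: the average hidden-state update per word."""
    n_hidden = V.shape[0]
    sums = defaultdict(lambda: np.zeros(n_hidden))
    counts = defaultdict(int)
    for sentence in corpus:
        h = np.zeros(n_hidden)
        for word in sentence:
            h_new = np.tanh(V @ h + W @ embed[word])   # one vanilla RNN step
            sums[word] += h_new - h                    # accumulate the response Δh^(t)
            counts[word] += 1
            h = h_new
    return {w: sums[w] / counts[w] for w in sums}      # word -> estimated E[Δh | x = w]

# Toy usage with random weights and a three-word vocabulary.
rng = np.random.default_rng(0)
W, V = rng.normal(scale=0.1, size=(16, 8)), rng.normal(scale=0.1, size=(16, 16))
embed = {w: rng.normal(size=8) for w in ("he", "she", "by")}
corpus = [["he", "by", "she"], ["she", "he"]]
resp = expected_response(corpus, embed, W, V)          # word -> 16-dim mean response
```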
Solution: Explaining an individual hidden unit using its most salient words. [Figure: top 4 positive/negative salient words of unit #36 in an RNN (GRU) trained on Yelp review data; bands show the 25%-75% and 9%-91% ranges of the response.]
Solution: Explaining an individual hidden unit using its most salient words. [Figure: distribution of the model's response given the word "he" (mean, 25%-75% and 9%-91% ranges), with units reordered according to the mean; highly responsive hidden units stand out. An LSTM with 600 units.]
Solution: Explaining an individual hidden unit using its most salient words. Investigating one unit/word at a time… Problem: too much user burden! Solution: an overview for easier exploration.
Solution: Explaining individual hidden units; Bi-graph and co-clustering; Sequence evaluation
Solution: Bi-graph Formulation. [Figure: bipartite graph linking hidden units to words such as "he", "she", "by", "can", "may".]
Solution: Co-clustering. Algorithm: spectral co-clustering (Dhillon I. S., 2001). [Figure: the hidden units and the words of the bipartite graph are co-clustered.]
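A minimal sketch of this step, assuming scikit-learn's SpectralCoclustering implementation of Dhillon's (2001) algorithm and a stand-in word-by-unit response matrix R; using the absolute expected responses as non-negative edge weights is an assumption for illustration, not necessarily the paper's exact weighting.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# R[i, j]: expected response of hidden unit j to word i (random stand-in data here).
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 64))                 # 200 words x 64 hidden units

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(np.abs(R) + 1e-12)                   # edge weights must be non-negative

word_clusters = model.row_labels_              # cluster id for each word
unit_clusters = model.column_labels_           # cluster id for each hidden unit
```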
Solution: Co-clustering – Edge Aggregation. Edges between a word cluster and a hidden-unit cluster are aggregated: color encodes the sign of the average edge weight, width encodes its magnitude. [Figure: aggregated bipartite graph between hidden-unit clusters and word clusters.]
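A short sketch of how such an aggregation could be computed from the signed response matrix and the co-cluster labels; the function name and the random stand-in data are illustrative assumptions.

```python
import numpy as np

def aggregate_edges(R, word_clusters, unit_clusters, n_clusters):
    """Mean signed edge weight between each word cluster and each unit cluster."""
    agg = np.zeros((n_clusters, n_clusters))
    for wc in range(n_clusters):
        for uc in range(n_clusters):
            block = R[word_clusters == wc][:, unit_clusters == uc]
            agg[wc, uc] = block.mean() if block.size else 0.0
    return agg   # sign -> edge color, magnitude -> edge width

# Toy usage with random responses and random cluster assignments.
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 64))
word_clusters = rng.integers(0, 5, size=200)
unit_clusters = rng.integers(0, 5, size=64)
edges = aggregate_edges(R, word_clusters, unit_clusters, n_clusters=5)
```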
Solution: Co-clustering – Visualization. [Figure: layout of hidden-unit clusters and word clusters linked by aggregated edges.]
Solution: Co-clustering – Visualization. Color: each unit's salience to the selected word. Hidden-unit clusters are rendered as memory chips; word clusters are rendered as word clouds.
Solution: Explaining individual hidden units; Bi-graph and co-clustering; Sequence evaluation
Solution: Glyph design for evaluating sentences. Each glyph summarizes the dynamics of hidden-unit clusters when reading a word; each bar represents the average magnitude of the values in a hidden-unit cluster. [Glyph legend: current value; increased value; decreased value; the ratio of preserved value; update towards positive; update towards negative; more positive value preserved; more negative value preserved.]
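The exact definitions behind each glyph element are in the paper; the sketch below shows one plausible way to compute per-cluster summary statistics for a single reading step, and the specific formulas (e.g., the preserved ratio) are assumptions for illustration rather than the paper's formulation.

```python
import numpy as np

def glyph_stats(h_prev, h_curr, unit_clusters, n_clusters):
    """Per-cluster summaries of one reading step (illustrative definitions)."""
    stats = []
    for c in range(n_clusters):
        prev, curr = h_prev[unit_clusters == c], h_curr[unit_clusters == c]
        delta = curr - prev
        same_sign = np.sign(prev) == np.sign(curr)
        preserved = np.minimum(np.abs(prev), np.abs(curr)) * same_sign
        stats.append({
            "current": np.abs(curr).mean(),                          # bar length
            "update_pos": delta[delta > 0].sum() / max(curr.size, 1),
            "update_neg": delta[delta < 0].sum() / max(curr.size, 1),
            "preserved_ratio": preserved.sum() / (np.abs(prev).sum() + 1e-12),
        })
    return stats

# Toy usage: 64 hidden units in 5 clusters, before/after reading one word.
rng = np.random.default_rng(0)
h_prev, h_curr = rng.normal(size=64), rng.normal(size=64)
unit_clusters = rng.integers(0, 5, size=64)
stats = glyph_stats(h_prev, h_curr, unit_clusters, n_clusters=5)
```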
Case Studies: How do RNNs handle sentiment? The language of Shakespeare.
Case Study – Sentiment Analysis: Each unit has two sides. Single-layer GRU with 50 hidden units (cells), trained on Yelp review data.
Case Study – Sentiment Analysis: RNNs can learn to handle context. Single-layer GRU with 50 hidden units (cells), trained on Yelp review data. Sentence A: "I love the food, though the staff is not helpful." Sentence B: "The staff is not helpful, though I love the food." [Figure: glyph sequences for sentences A and B, showing updates towards positive and negative sentiment.]
Case Study – Sentiment Analysis: Clues for the problem. Single-layer GRU with 50 hidden units (cells), trained on Yelp review data. Problem: the data is not evenly sampled.
Case Study – Sentiment Analysis: A visual indicator of performance. Single-layer GRUs with 50 hidden units (cells), trained on Yelp review data. Balanced dataset: test accuracy 91.9%; unbalanced dataset: test accuracy 88.6%.
Case Studies: How do RNNs handle sentiment? The language of Shakespeare.
Case Study – Language Modeling: The language of Shakespeare – a mixture of the old and the new.
Discussion & Future Work • Clustering: the quality of co-clustering? Interactive clustering? • Glyph-based sentence visualization: scalability? • Text data: how about speech data? • RNN models: more advanced RNN-based models, such as attention models?
Thank you! Contact: Yao Ming, ymingaa@connect.ust.hk Page: www.myaooo.com/rnnvis Code: www.github.com/myaooo/rnnvis
Technical Details: Explaining individual hidden units – Decomposition. The output of an RNN at step t is typically a probability distribution: p(y_t = j) = softmax(U h^(t))_j = exp(u_jᵀ h^(t)) / Σ_k exp(u_kᵀ h^(t)), j = 1, 2, …, N, where U = [u_1, …, u_N]ᵀ is the output projection matrix. The numerator of p(y_t = j) can be decomposed as: exp(u_jᵀ h^(t)) = exp(u_jᵀ Σ_{τ=1}^{t} (h^(τ) − h^(τ−1))) = Π_{τ=1}^{t} exp(u_jᵀ Δh^(τ)). Here exp(u_jᵀ Δh^(τ)) is the multiplicative contribution of the input word x^(τ), and the update of the hidden state Δh^(τ) can be regarded as the model's response to x^(τ).
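A self-contained numerical check of this decomposition, assuming a vanilla RNN cell, random toy weights, and h^(0) = 0 (so that h^(t) is exactly the sum of its updates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, T = 8, 16, 5
W = rng.normal(scale=0.1, size=(n_hidden, n_in))
V = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
xs = rng.normal(size=(T, n_in))

# Run a vanilla RNN from h^(0) = 0 and record the per-step updates Δh^(τ).
h, deltas = np.zeros(n_hidden), []
for x_t in xs:
    h_new = np.tanh(V @ h + W @ x_t)
    deltas.append(h_new - h)
    h = h_new
deltas = np.stack(deltas)

# exp(u_j · h^(T)) equals the product of per-step contributions exp(u_j · Δh^(τ)).
u_j = rng.normal(size=n_hidden)        # one row of the output projection matrix U
lhs = np.exp(u_j @ h)
rhs = np.exp(deltas @ u_j).prod()
assert np.isclose(lhs, rhs)
```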
Evaluation: Expert Interview. Procedure: (1) show a tutorial video, (2) explore the tool, (3) compare two models, (4) answer questions, (5) finish a survey.
Challenges: What are the challenges? 1. The complexity of the model • Machine Translation: 4-layer LSTMs, 1000 units/layer (Sutskever I. et al., 2014) • Language Modeling: 2-layer LSTMs, 650 or 1500 units/layer (Zaremba et al., 2015). 2. The complexity of the hidden memory • Semantic information is distributed across the hidden states of an RNN. 3. The complexity of the data • Patterns in sequential data such as text are difficult to analyze and interpret.
Other Findings: Comparing LSTMs and vanilla RNNs. Left (A-C): co-cluster visualization of the last layer of an RNN. Right (D-F): visualization of the cell states of the last layer of an LSTM. Bottom (G-H): the two models' responses to the same word "offer".
Contribution • A visual technique for understanding what RNNs have learned. • A visual analytics (VA) tool that reveals the hidden dynamics of a trained RNN. • Interesting findings with RNN models.