Compositionality in Semantic Vector Spaces
CS224U: Natural Language Understanding, Feb. 28, 2012
Richard Socher
Joint work with Chris Manning, Andrew Ng, Jeffrey Pennington, Eric Huang, and Cliff Lin
More information and code at www.socher.org
Word Vector Space Models
Each word is associated with an n-dimensional vector.
[Figure: a 2D vector space in which Germany and France lie close together, Monday and Tuesday lie close together, and the phrases "the country of my birth" and "the place where I was born" are also plotted.]
But how can we represent the meaning of longer phrases? By mapping them into the same vector space!
How should we map phrases into a vector space?
Use the principle of compositionality! The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.
[Figure: the phrases "the country of my birth" and "the place where I was born" mapped into the same 2D space as Germany, France, Monday, and Tuesday; a binary tree over "the country of my birth" with a vector at every node.]
The algorithm jointly learns compositional vector representations (and the tree structure).
Outline
Goal: Algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks.
1. Introduction
2. Word Vectors and Recursive Neural Networks
3. Recursive Autoencoders for Sentiment Analysis
4. Paraphrase Detection
[Diagram: a recursive unit mapping children c1 and c2 to a parent p and a score s.]
Distributional Word Representations
[Figure: one-hot vectors for France and Monday (a single 1 in an otherwise all-zero vector) contrasted with dense 2D vectors in which France lies near Germany and Monday lies near Tuesday.]
Algorithms for Finding Word Vector Representations
There are many well-known algorithms that use co-occurrence statistics to compute a distributional representation for words:
• Brown et al. (1992), Turney et al. (2003), and many others
• LSA (Landauer & Dumais, 1997)
• Latent Dirichlet Allocation (LDA; Blei et al., 2003)
Recent development: neural language models.
• Bengio et al. (2003) introduced a language model that predicts a word given the previous words and also learns vector representations.
• Collobert & Weston (2008) and Maas et al. (2011) from the last lecture
Distributional Word Representations
Recent development: neural language models (Collobert & Weston, 2008; Turian et al., 2010).
Vectorial Sentence Meaning - Step 1: Parsing
[Figure: a parse tree (S, NP, VP, AdjP) over "The movie was not really exciting." with a word vector at every leaf.]
Vectorial Sentence Meaning - Step 2: Vectors at Each Node
[Figure: the same parse tree with a vector at every internal node (NP, AdjP, VP, S) as well as at every word.]
Recursive Neural Networks for Structure Prediction
Basic computational unit: the Recursive Neural Network.
Inputs: two candidate children's representations.
Outputs:
1. The semantic representation if the two nodes are merged.
2. A label that carries some information about this node.
[Figure: a neural network merging the vectors for "not" and "really exciting" into a parent vector with a label.]
Recursive Neural Network Definition
p = sigmoid(W [c1; c2] + b),
where the sigmoid is applied element-wise. A softmax classifier on top of p gives a distribution over a set of labels.
[Figure: the network merging children c1 and c2 into the parent p and its label.]
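A minimal numpy sketch of this unit, under illustrative assumptions: 2-dimensional vectors, a 2-way label set, and random initialization. The names W, b, and W_label mirror the matrices on the slides; everything else is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_labels = 2, 2                                   # vector size and label count (illustrative)

W = rng.normal(scale=0.1, size=(n, 2 * n))           # composition matrix (slides: W)
b = np.zeros(n)                                      # composition bias (slides: b)
W_label = rng.normal(scale=0.1, size=(n_labels, n))  # classifier matrix (slides: W^(label))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compose(c1, c2):
    """Merge two children: p = sigmoid(W [c1; c2] + b), plus a label distribution for the node."""
    p = sigmoid(W @ np.concatenate([c1, c2]) + b)
    label_dist = softmax(W_label @ p)
    return p, label_dist

# Example: merge made-up vectors for "not" and "really exciting".
p, label_dist = compose(np.array([9.0, 1.0]), np.array([4.0, 3.0]))
print(p, label_dist)
```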
Recursive Neural Network Definition
Related work:
• Previous RNN work (Goller & Küchler, 1996; Costa et al., 2003) assumed a fixed tree structure, used one-hot vectors, and had no softmax classifiers.
• Jordan Pollack (1990): recursive auto-associative memories (RAAMs).
• Hinton (1990) and Bottou (2011): related ideas about recursive models.
Goal: Predict Pos/Neg Sentiment of the Full Sentence
[Figure: a complete tree over "The movie was not really exciting." with a vector at every node and a sentiment score of 0.3 at the root.]
Predicting Sentiment with RNNs
[Figure: each word vector in "The movie was not really exciting." receives its own sentiment score (0.5, 0.5, 0.5, 0.3, 0.5, 0.7).]
Predicting Sentiment with RNNs
p = sigmoid(W [c1; c2] + b)
[Figure: adjacent word vectors are merged pairwise by the network; each merged node gets a vector and a sentiment score (e.g. 0.5 and 0.9).]
Predicting Sentiment with RNNs
[Figure: the network keeps merging phrase vectors up the tree; the new node gets a vector and a sentiment score of 0.3.]
Predicting Sentiment with RNNs
[Figure: the partially built tree over "The movie was not really exciting." with vectors at the merged nodes.]
Predicting Sentiment with RNNs
[Figure: merging continues up to the root, which receives a sentence-level vector and sentiment score (0.3).]
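Continuing the numpy sketch above (it reuses compose, softmax, W_label, and rng), a bottom-up pass over a fixed binary tree; the nested-tuple tree encoding and the word vectors are made up for illustration.

```python
def encode(tree):
    """Return the vector for a (sub)tree by recursively merging its children with the same W."""
    if isinstance(tree, np.ndarray):                 # leaf: a word vector
        return tree
    left, right = tree
    p, _ = compose(encode(left), encode(right))
    return p

# "The movie was not really exciting." with made-up word vectors and a hand-chosen tree:
# ((The movie) (was (not (really exciting)))) -- illustrative only.
w = {t: rng.normal(size=2) for t in "The movie was not really exciting".split()}
tree = ((w["The"], w["movie"]),
        (w["was"], (w["not"], (w["really"], w["exciting"]))))

root = encode(tree)
sentiment = softmax(W_label @ root)                  # sentence-level sentiment distribution
print(sentiment)
```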
Outline
Goal: Algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks.
1. Introduction
2. Word Vectors and Recursive Neural Networks
3. Recursive Autoencoders for Sentiment Analysis [Socher et al., EMNLP 2011]
4. Paraphrase Detection
Sentiment Detection and Bag-of-Words Models
• Sentiment detection is crucial to business intelligence, stock trading, …
Sentiment Detection and Bag-of-Words Models
• Sentiment detection is crucial to business intelligence, stock trading, …
• Most methods start with a bag of words + linguistic features/processing/lexica.
• But such methods (including tf-idf) can't distinguish:
  + white blood cells destroying an infection
  - an infection destroying white blood cells
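A quick illustration of the point: the bag-of-words representations of the two phrases are identical, so any classifier built only on word counts cannot tell them apart (a plain-Python check, not from the slides).

```python
from collections import Counter

positive = "white blood cells destroying an infection"
negative = "an infection destroying white blood cells"

# Identical multisets of words, hence identical bag-of-words (and tf-idf) vectors.
print(Counter(positive.split()) == Counter(negative.split()))  # True
```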
Single-Scale Experiments: Movies
Example reviews:
• "Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor."
• "A film of ideas and wry comic mayhem."
Recursive Autoencoders
• Main idea: a phrase vector is good if it keeps as much information as possible about its children.
[Figure: the same neural-network unit mapping children c1 and c2 to a parent vector and a label.]
Recursive Autoencoders
• Similar to the RNN but with two differences.
(1) A reconstruction error keeps as much information as possible about the children:
p = sigmoid(W^(1) [c1; c2] + b)
[Figure: the encoder W^(1) computes the parent p; a decoder W^(2) reconstructs [c1; c2] from p (reconstruction error), and a softmax classifier W^(label) predicts the label.]
Recursive Autoencoders
• Reconstruction error details: the decoder W^(2) tries to reproduce the children from the parent, [c1'; c2'] = W^(2) p + b^(2), and the reconstruction error measures the distance || [c1; c2] - [c1'; c2'] ||^2 between the children and their reconstruction.
[Figure: encoder W^(1), decoder W^(2), and softmax classifier W^(label).]
Recursive Autoencoders
• Reconstruction error at every node.
• Important detail: normalization of each parent vector (otherwise the model can shrink the hidden vectors to make reconstruction artificially easy).
[Figure: a three-word example x1 x2 x3 with p1 = f(W [x2; x3] + b) and p2 = f(W [x1; p1] + b).]
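A minimal numpy sketch of one autoencoder node with the normalization step. The names W^(1) and W^(2) follow the slides; taking f to be tanh, normalizing the parent to unit length, and using squared Euclidean distance for the reconstruction error are assumptions based on the accompanying paper, not statements on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2                                           # vector size (illustrative)

W1 = rng.normal(scale=0.1, size=(n, 2 * n))     # encoder  (slides: W^(1))
b1 = np.zeros(n)
W2 = rng.normal(scale=0.1, size=(2 * n, n))     # decoder  (slides: W^(2))
b2 = np.zeros(2 * n)

def rae_node(c1, c2):
    """Encode two children into a normalized parent and measure how well they can be reconstructed."""
    children = np.concatenate([c1, c2])
    p = np.tanh(W1 @ children + b1)             # f = tanh here (assumption)
    p = p / np.linalg.norm(p)                   # normalization: keep the parent at unit length
    reconstruction = W2 @ p + b2                # try to reproduce [c1; c2] from p
    e_rec = np.sum((children - reconstruction) ** 2)  # assumed squared-error reconstruction loss
    return p, e_rec

p1, e1 = rae_node(np.array([1.0, 5.0]), np.array([1.0, 3.0]))
print(p1, e1)
```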
Recursive Autoencoders
• Similar to the RNN but with two differences.
(2) The tree structure is determined by the reconstruction error:
  – does not require a parser
  – gives task-dependent trees
[Figure: every adjacent pair of word vectors in "The movie was not really exciting." is fed through the network; each candidate parent has a reconstruction error, and the lowest-error pair is merged first.]
Recursive Autoencoders
[Figure: after the first merge, the remaining adjacent pairs (now including the new phrase node) are scored again and the lowest-error pair is merged next.]
Recursive Autoencoders
[Figure: the greedy merging continues, building the tree bottom-up from the pairs with the lowest reconstruction error.]
Recursive Autoencoders
[Figure: the final greedily constructed tree over "The movie was not really exciting." with a vector at every node.]
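A sketch of the greedy construction just illustrated, reusing rae_node (and rng, np) from the sketch above: score every adjacent pair, merge the pair with the lowest reconstruction error, and repeat. The list bookkeeping and the random word vectors are illustrative.

```python
def greedy_rae_tree(word_vectors):
    """Greedily build a binary tree by always merging the adjacent pair
    whose reconstruction error is lowest."""
    nodes = [(v, i) for i, v in enumerate(word_vectors)]   # (vector, tree of word indices)
    while len(nodes) > 1:
        errors = [rae_node(nodes[i][0], nodes[i + 1][0])[1] for i in range(len(nodes) - 1)]
        i = int(np.argmin(errors))                          # best adjacent pair
        p, _ = rae_node(nodes[i][0], nodes[i + 1][0])
        nodes[i:i + 2] = [(p, (nodes[i][1], nodes[i + 1][1]))]
    return nodes[0]                                         # (root vector, tree)

sentence = [rng.normal(size=2) for _ in range(6)]           # "The movie was not really exciting."
root_vec, tree = greedy_rae_tree(sentence)
print(tree)
```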
RAE Training
• Minimize the error over each entire sentence x and its label t (plus regularization).
• The error of a sentence is the sum of the errors at all nodes of its tree:
  E(x, t) = Σ_{s in tree(x)} E(s, t)
RAE Training
• The error at each node is a weighted combination of the reconstruction error and the cross-entropy (distribution likelihood) error from the softmax classifier:
  E(s, t) = α · E_rec(s) + (1 − α) · E_cE(s, t)
[Figure: encoder W^(1), decoder W^(2) producing the reconstruction error, and classifier W^(label) producing the cross-entropy error.]
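A sketch of that per-node objective, reusing rae_node, softmax, and W_label from the earlier sketches. The slides only say "weighted combination"; the form alpha * E_rec + (1 - alpha) * E_cE and the value alpha = 0.2 are assumptions for illustration.

```python
def node_error(c1, c2, target, alpha=0.2):
    """Weighted combination of reconstruction error and softmax cross-entropy at one node."""
    p, e_rec = rae_node(c1, c2)                    # reconstruction part
    probs = softmax(W_label @ p)                   # classifier part (slides: W^(label))
    e_ce = -np.sum(target * np.log(probs))         # cross-entropy against the target distribution
    return alpha * e_rec + (1.0 - alpha) * e_ce

# Example: a node in a sentence labelled with the first of two classes (target = [1, 0]).
err = node_error(np.array([1.0, 5.0]), np.array([1.0, 3.0]), target=np.array([1.0, 0.0]))
print(err)
```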
Details for Training RNNs
• Minimize the error by taking gradient steps computed from matrix derivatives.
• A more efficient implementation uses the backpropagation algorithm.
• Since the derivatives are computed over a tree structure, this is called backpropagation through structure (Goller & Küchler, 1996).
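A standalone numpy sketch of backpropagation through structure for the composition part only (no classifier or reconstruction terms): the error arriving at a node is split and passed down to its two children, and every node adds its contribution to the shared dW and db. The tanh nonlinearity and the toy loss at the root are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2
W = rng.normal(scale=0.1, size=(n, 2 * n))
b = np.zeros(n)
dW, db = np.zeros_like(W), np.zeros_like(b)

def forward(tree):
    """Bottom-up pass; returns a cache of the vectors needed for the backward pass."""
    if isinstance(tree, np.ndarray):
        return {"vec": tree, "leaf": True}
    left, right = forward(tree[0]), forward(tree[1])
    h = np.concatenate([left["vec"], right["vec"]])
    p = np.tanh(W @ h + b)
    return {"vec": p, "h": h, "left": left, "right": right, "leaf": False}

def backward(node, delta):
    """Propagate dLoss/dVector down the tree, accumulating dW and db at every node."""
    global dW, db
    if node["leaf"]:
        return                                    # (word-vector gradients omitted here)
    g = delta * (1.0 - node["vec"] ** 2)          # tanh' = 1 - tanh^2
    dW += np.outer(g, node["h"])
    db += g
    dh = W.T @ g                                  # gradient w.r.t. the concatenated children
    backward(node["left"], dh[:n])
    backward(node["right"], dh[n:])

# Toy example: loss = sum of the root vector, so its gradient is a vector of ones.
leaves = [rng.normal(size=n) for _ in range(3)]
cache = forward(((leaves[0], leaves[1]), leaves[2]))
backward(cache, np.ones(n))
print(dW, db)
```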