Social Media & Text Analysis, Lecture 9: Deep Learning for NLP. CSE 5539-0010, Ohio State University. Instructor: Alan Ritter. Website: socialmedia-class.org. Many slides are adapted from Richard Socher, Greg Durrett, Chris Dyer, Dan Jurafsky, Chris Manning.
A Neuron • If you know logistic regression, then you already understand a basic neural network neuron! • A single neuron is a computational unit with n (here 3) inputs and 1 output, and parameters W, b. • Inputs feed through an activation function to produce the output; the bias unit corresponds to the intercept term.
A Neuron is essentially a binary logistic regression unit: h_{w,b}(x) = f(w^T x + b), where f(z) = 1 / (1 + e^{-z}). • w, b are the parameters of this neuron, i.e., of this logistic regression model. • b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term.
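A minimal NumPy sketch of this single neuron; the input, weight, and bias values are made up purely for illustration:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # h_{w,b}(x) = f(w^T x + b): a binary logistic regression unit
        return sigmoid(np.dot(w, x) + b)

    x = np.array([1.0, 0.5, -2.0])   # hypothetical 3-input example
    w = np.array([0.3, -0.1, 0.8])
    b = 0.1
    print(neuron(x, w, b))           # a probability between 0 and 1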
A Neural Network = running several logistic regressions at the same time. If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
A Neural Network = running several logistic regressions at the same time … which we can feed into another logistic regression function. It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A Neural Network = running several logistic regressions at the same time. Before we know it, we have a multilayer neural network….
f: Activation Function
We have
a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.
In matrix notation:
z = Wx + b
a = f(z)
where f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
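A hedged sketch of one such layer in NumPy, applying f element-wise exactly as in the matrix notation above (the sizes and random initialization are arbitrary, for illustration only):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b, f=sigmoid):
        # z = Wx + b, a = f(z), with f applied element-wise
        z = W @ x + b
        return f(z)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)          # 3 inputs
    W = rng.normal(size=(3, 3))     # W_ij: weight from input j to hidden unit i
    b = rng.normal(size=3)
    a = layer(x, W, b)              # a = [a_1, a_2, a_3]
    print(a)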
Activation Function: logistic ("sigmoid") and tanh. tanh is just a rescaled and shifted sigmoid: tanh(z) = 2 logistic(2z) − 1
Activation Function: hard tanh, softsign, rectified linear (ReLU). softsign(z) = z / (1 + |z|); rect(z) = max(z, 0); hard tanh is similar to tanh but computationally cheaper, and it saturates hard. • Glorot and Bengio (AISTATS 2011) discuss softsign and the rectifier.
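A quick NumPy sketch of these activation functions, including a numerical check of the tanh identity from the previous slide; the softsign definition below is the standard z / (1 + |z|) form:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softsign(z):
        return z / (1.0 + np.abs(z))

    def rect(z):                     # rectified linear (ReLU)
        return np.maximum(z, 0.0)

    def hard_tanh(z):                # cheap, hard-saturating approximation of tanh
        return np.clip(z, -1.0, 1.0)

    z = np.linspace(-3, 3, 7)
    # tanh is a rescaled and shifted sigmoid: tanh(z) = 2*logistic(2z) - 1
    print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))   # True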
Non-linearity • Logistic (Softmax) Regression only gives linear decision boundaries
Non-linearity • Neural networks can learn much more complex functions and nonlinear decision boundaries!
Non-linearity (Input → Hidden Layer → Output)
z = g(V g(Wx + b) + c), where g(Wx + b) is the output of the first layer.
With no nonlinearity: z = VWx + Vb + c, which is equivalent to z = Ux + d (a single linear map).
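A small numerical check of this collapse: without the nonlinearity g, stacking two linear layers is exactly one linear layer with U = VW and d = Vb + c (random matrices chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
    V, c = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)

    two_linear_layers = V @ (W @ x + b) + c
    U, d = V @ W, V @ b + c
    one_linear_layer = U @ x + d
    print(np.allclose(two_linear_layers, one_linear_layer))   # True: no extra expressive power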
What about Word2vec (Skip-gram and CBOW)?
So, what about Word2vec (Skip-gram and CBOW)? It is not deep learning, but a "shallow" neural network. It is, in fact, a log-linear model (softmax regression). That is why it is fast to train over larger datasets, which yields better embeddings.
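One way to see the "log-linear (softmax regression)" point: in the skip-gram model, the probability of a context word given a center word is just a dot product of embeddings pushed through a softmax. A toy sketch with a made-up vocabulary size and random vectors:

    import numpy as np

    rng = np.random.default_rng(2)
    V, d = 5, 4                        # toy vocabulary size and embedding dimension
    W_in = rng.normal(size=(V, d))     # "input" (center-word) embeddings
    W_out = rng.normal(size=(V, d))    # "output" (context-word) embeddings

    center = 3
    scores = W_out @ W_in[center]                       # one dot product per vocabulary word
    p_context = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    print(p_context)                                     # P(context word | center word)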
Learning Neural Networks (Input → Hidden Layer → Output). By the chain rule, the change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input). ‣ Computing these looks like running this network in reverse (backpropagation) ‣ I've omitted some details about how we get the gradients
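A minimal backpropagation sketch for a one-hidden-layer network with sigmoid activations and squared-error loss; the data and shapes are made up, and this spells out the chain-rule bookkeeping the slide omits:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(3)
    x, y = rng.normal(size=3), np.array([1.0])       # one training example
    W, b = rng.normal(size=(4, 3)), np.zeros(4)      # input -> hidden
    V, c = rng.normal(size=(1, 4)), np.zeros(1)      # hidden -> output

    # Forward pass
    h = sigmoid(W @ x + b)            # hidden activations
    out = sigmoid(V @ h + c)          # network output
    loss = 0.5 * np.sum((out - y) ** 2)

    # Backward pass: run the network in reverse
    d_out = (out - y) * out * (1 - out)     # loss gradient w.r.t. output pre-activation
    dV, dc = np.outer(d_out, h), d_out      # gradients for the output layer
    d_h = (V.T @ d_out) * h * (1 - h)       # loss gradient w.r.t. hidden pre-activation
    dW, db = np.outer(d_h, x), d_h          # gradients for the hidden layer
    print(loss, dW.shape, dV.shape)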
Strategy for Successful NNs • Select a network structure appropriate for the problem - Structure: single words, fixed windows, sentence-based, document-level; bag of words, recursive vs. recurrent, CNN, … - Nonlinearity • Check for implementation bugs with gradient checks (see the sketch below) • Parameter initialization • Optimization tricks • You should be able to get close to 100% accuracy/precision/recall/etc. on the training data • Tune the number of iterations on dev data
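For the gradient-check item, a sketch of a numerical gradient check: compare an analytic gradient against a central-difference estimate. The toy loss here is a stand-in for any differentiable loss:

    import numpy as np

    def loss(w):                       # stand-in for any differentiable loss
        return np.sum(w ** 2) + np.sin(w[0])

    def analytic_grad(w):              # the gradient we want to verify
        g = 2 * w
        g[0] += np.cos(w[0])
        return g

    def numerical_grad(f, w, eps=1e-5):
        g = np.zeros_like(w)
        for i in range(w.size):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[i] += eps
            w_minus[i] -= eps
            g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # central difference
        return g

    w = np.array([0.5, -1.2, 2.0])
    print(np.allclose(analytic_grad(w), numerical_grad(loss, w), atol=1e-6))   # True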
Neural Machine Translation: Neural MT went from a fringe research activity in 2014 to the widely adopted, leading way to do MT in 2016. Amazing!
Neural Machine Translation: Progress in Machine Translation [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]: chart of BLEU scores for phrase-based SMT, syntax-based SMT, and neural MT from 2013 to 2016. From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
What is Neural MT (NMT)? Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network.* *But sometimes we compromise this goal a little.
The three big wins of Neural MT
1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output
2. Distributed representations share strength: better exploitation of word and phrase similarities
3. Better exploitation of context: NMT can use a much bigger context, both the source and the partial target text, to translate more accurately
Neural encoder-decoder architectures: Input text → Encoder → (vector representation) → Decoder → Translated text
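A very compressed sketch of the encoder-decoder idea (not a real NMT system): a toy encoder reduces the source token embeddings to a single vector, and a toy decoder emits output token ids conditioned on it. Every component here is a hypothetical placeholder with random weights:

    import numpy as np

    rng = np.random.default_rng(4)
    d = 6                                              # toy hidden size

    def encode(source_embeddings, W_enc):
        # A (very) simplified recurrence; keep only the final state
        h = np.zeros(d)
        for e in source_embeddings:
            h = np.tanh(W_enc @ np.concatenate([e, h]))
        return h                                       # fixed-dimensional summary of the input

    def decode(h, W_dec, W_vocab, max_len=5):
        tokens = []
        for _ in range(max_len):
            h = np.tanh(W_dec @ h)
            tokens.append(int(np.argmax(W_vocab @ h)))   # greedy choice of the next token id
        return tokens

    source = [rng.normal(size=d) for _ in range(3)]      # 3 source "words"
    W_enc = rng.normal(size=(d, 2 * d))
    W_dec, W_vocab = rng.normal(size=(d, d)), rng.normal(size=(10, d))
    print(decode(encode(source, W_enc), W_dec, W_vocab))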
Neural MT: The Bronze Age [Allen 1987, IEEE 1st ICNN]. 3310 En-Es pairs constructed from 31 English and 40 Spanish words, max 10/11-word sentences; 33 pairs used as the test set. Example: "The grandfather offered the little girl a book" → "El abuelo le ofreció un libro a la niña pequeña". Binary encoding of words: 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Average WER: 1.3 words.
Modern Sequence Models for NMT [Sutskever et al. 2014, Bahdanau et al. 2014, et seq.], following [Jordan 1986] and more closely [Elman 1990]. Figure: a deep recurrent neural network. The source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" is fed in word by word and the sentence meaning is built up in a hidden vector; the translation "The protests escalated over the weekend <EOS>" is then generated, feeding in the last generated word at each step.
Long Short-Term Memory Networks (LSTM). Source: Colah's Blog
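A sketch of a single LSTM cell step following the standard gate equations (input, forget, and output gates via sigmoids, candidate update via tanh); the sizes and random weights are illustrative only:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W stacks the four gate weight matrices; the input is [x; h_prev]
        z = W @ np.concatenate([x, h_prev]) + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        g = np.tanh(g)                                 # candidate cell update
        c = f * c_prev + i * g                         # new cell state
        h = o * np.tanh(c)                             # new hidden state
        return h, c

    rng = np.random.default_rng(5)
    n_in, n_hid = 3, 4
    W = rng.normal(size=(4 * n_hid, n_in + n_hid))
    b = np.zeros(4 * n_hid)
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for x in rng.normal(size=(6, n_in)):               # run 6 time steps
        h, c = lstm_step(x, h, c, W, b)
    print(h)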
Data-Driven Conversation • Twitter: ~500 million public SMS-style conversations per month • Goal: learn conversational agents directly from massive volumes of data.
[Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model. Input: Who wants to come over for dinner tomorrow? Output, generated phrase by phrase: "Yum !" | "I want to" | "be there" | "tomorrow !"
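In the standard noisy channel formulation, the response r is chosen as r* = argmax_r P(input | r) · P(r): P(input | r) is a translation model scoring how well the response accounts for the input, and P(r) is a language model scoring the fluency of the response; decoding builds up the output phrase by phrase, as in the bracketed example above.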
Neural Conversation
Vanilla seq2seq & long sentences: the entire source sentence "Je suis étudiant" is compressed into a single vector before the decoder produces "I am a student". Problem: fixed-dimensional representations; a single vector must summarize the whole source sentence, however long it is.
Attention Mechanism (it started in computer vision!) [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012] • Solution: treat the pool of encoder source states as a random access memory • Retrieve from it as needed while decoding.
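A sketch of the basic dot-product attention retrieval step over a pool of source states; the source states and decoder query below are random placeholders:

    import numpy as np

    def attend(query, source_states):
        # One score (dot product) per source state: the "random access memory" lookup
        scores = source_states @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax over source positions
        return weights @ source_states              # weighted sum = retrieved context vector

    rng = np.random.default_rng(6)
    source_states = rng.normal(size=(4, 5))         # 4 source positions, dimension 5
    query = rng.normal(size=5)                      # current decoder state
    context = attend(query, source_states)
    print(context.shape)                            # (5,)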