Social Media & Text Analysis, Lecture 9: Deep Learning for NLP (CSE 5539-0010, Ohio State University)


  1. Social Media & Text Analysis lecture 9 - Deep Learning for NLP CSE 5539-0010 Ohio State University Instructor: Alan Ritter Website: socialmedia-class.org Many slides are adapted from Richard Socher, Greg Durrett, Chris Dyer, Dan Jurafsky, Chris Manning

  2. A Neuron • If you know Logistic Regression, then you already understand a basic neural network neuron! • A single neuron is a computational unit with n (here 3) inputs and 1 output, and parameters W, b • Inputs → activation function → output; the bias unit corresponds to the intercept term Alan Ritter ◦ socialmedia-class.org

  3. A Neuron is essentially a binary logistic regression unit: h_w,b(x) = f(w^T x + b), where f(z) = 1 / (1 + e^−z) • w, b are the parameters of this neuron, i.e., of this logistic regression model • b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term Alan Ritter ◦ socialmedia-class.org
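
A minimal sketch of the equivalence on this slide, assuming NumPy; the weights w, b and input x below are made up for illustration, not taken from the lecture:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron: a binary logistic regression unit h_w,b(x) = f(w^T x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only: 3 inputs, 1 output.
x = np.array([1.0, 0.5, -2.0])
w = np.array([0.3, -0.6, 0.1])
b = 0.05
print(neuron(x, w, b))  # a probability-like output in (0, 1)
```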

  4. A Neural Network = running several logistic regressions at the same time If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs … Alan Ritter ◦ socialmedia-class.org

  5. A Neural Network = running several logistic regressions at the same time … which we can feed into another logistic regression function It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc. Alan Ritter ◦ socialmedia-class.org

  6. A Neural Network = running several logistic regressions at the same time Before we know it, we have a multilayer neural network…. Alan Ritter ◦ socialmedia-class.org

  7. f: Activation Function We have a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1), a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2), etc. In matrix notation: z = Wx + b, a = f(z), where f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)] Alan Ritter ◦ socialmedia-class.org
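
A short sketch of the matrix form z = Wx + b, a = f(z), again assuming NumPy, with the sigmoid standing in for f and arbitrary example shapes (3 inputs, 3 hidden units):

```python
import numpy as np

def f(z):
    # element-wise activation; the sigmoid is used as the running example
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))     # one row of weights per hidden unit (illustrative values)
b = rng.normal(size=3)
x = np.array([1.0, -0.5, 2.0])

z = W @ x + b    # z = Wx + b
a = f(z)         # a = f(z), applied element-wise
print(a)         # the vector [a_1, a_2, a_3]
```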

  8. Activation Function logistic ("sigmoid") and tanh: tanh is just a rescaled and shifted sigmoid, tanh(z) = 2 logistic(2z) − 1 Alan Ritter ◦ socialmedia-class.org

  9. Activation Function hard tanh, softsign, rectified linear (ReLU): softsign(z) = z / (1 + |z|), rect(z) = max(z, 0) • hard tanh is similar to tanh but computationally cheaper, and saturates hard • Glorot and Bengio, AISTATS 2011 discuss softsign and the rectifier Alan Ritter ◦ socialmedia-class.org
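
A hedged NumPy sketch of the activation functions from the last two slides, including a numerical check of the tanh identity; the [-1, 1] clipping range for hard tanh is the usual convention and is assumed here, not stated on the slide:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)       # linear on [-1, 1], saturates hard outside

def softsign(z):
    return z / (1.0 + np.abs(z))       # softsign(z) = z / (1 + |z|)

def relu(z):
    return np.maximum(z, 0.0)          # rect(z) = max(z, 0)

z = np.linspace(-3, 3, 7)
# tanh is a rescaled, shifted sigmoid: tanh(z) = 2*logistic(2z) - 1
print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))  # True
```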

  10. Non-linearity • Logistic (Softmax) Regression only gives linear decision boundaries Alan Ritter ◦ socialmedia-class.org

  11. Non-linearity • Neural networks can learn much more complex functions and nonlinear decision boundaries! Alan Ritter ◦ socialmedia-class.org

  12. Non-linearity (Input → Hidden Layer → Output) z = g(V g(Wx + b) + c), where g(Wx + b) is the output of the first layer. With no nonlinearity: z = VWx + Vb + c, which is equivalent to z = Ux + d Alan Ritter ◦ socialmedia-class.org
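
A quick numerical check of this claim, assuming NumPy and random example matrices: with the nonlinearity g removed, the two-layer map VWx + Vb + c is exactly one affine map Ux + d.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
V, c = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_linear_layers = V @ (W @ x + b) + c            # no nonlinearity g
U, d = V @ W, V @ b + c                            # the equivalent single layer
print(np.allclose(two_linear_layers, U @ x + d))   # True: still a linear model
```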

  13. What about Word2vec (Skip-gram and CBOW)? Alan Ritter ◦ socialmedia-class.org

  14. So, what about Word2vec (Skip-gram and CBOW)? It is not deep learning, but a "shallow" neural network. It is, in fact, a log-linear model (softmax regression). So it is faster to train over larger datasets, yielding better embeddings. Alan Ritter ◦ socialmedia-class.org
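
To make the "log-linear model (softmax regression)" point concrete, here is a hedged NumPy sketch of the skip-gram probability p(o | c) = softmax(u_o^T v_c), in the full-softmax formulation rather than negative sampling; the toy vocabulary size, embedding dimension, and random embedding matrices are illustrative, not word2vec's actual training code:

```python
import numpy as np

V, d = 1000, 50                            # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)
Vin = rng.normal(scale=0.1, size=(V, d))   # "input" (center-word) vectors
Uout = rng.normal(scale=0.1, size=(V, d))  # "output" (context-word) vectors

def skipgram_probs(center_id):
    """p(o | c) = softmax(Uout @ v_c): a log-linear model, no hidden nonlinearity."""
    scores = Uout @ Vin[center_id]
    scores -= scores.max()                 # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = skipgram_probs(center_id=42)
print(p.shape, p.sum())                    # (1000,) 1.0
```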

  15. Learning Neural Networks (Input → Hidden Layer → Output) By the chain rule, the change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input) ‣ Computing these looks like running this network in reverse (backpropagation) ‣ I've omitted some details about how we get the gradients Alan Ritter ◦ socialmedia-class.org
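
One concrete instantiation of the chain-rule picture above, as a hedged NumPy sketch; the single hidden layer, sigmoid activations, and squared-error loss are assumptions filling in details the slide omits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x, y = rng.normal(size=3), 1.0                 # one training example (illustrative)
W, b = rng.normal(size=(4, 3)), np.zeros(4)    # input -> hidden
v, c = rng.normal(size=4), 0.0                 # hidden -> output

# Forward pass
h = sigmoid(W @ x + b)           # hidden activations
o = sigmoid(v @ h + c)           # output
loss = 0.5 * (o - y) ** 2

# Backward pass: the chain rule, applied layer by layer in reverse
d_o = (o - y) * o * (1 - o)      # dloss / d(pre-activation of output)
d_v, d_c = d_o * h, d_o          # gradients for the output layer
d_h = d_o * v                    # change in output w.r.t. hidden
d_z = d_h * h * (1 - h)          # ... times change in hidden w.r.t. its pre-activation
d_W, d_b = np.outer(d_z, x), d_z # gradients for the first layer
print(d_W.shape, d_v.shape)      # (4, 3) (4,)
```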

  16. Strategy for Successful NNs • Select network structure appropriate for problem - Structure: Single words, fixed windows, sentence based, document level; bag of words, recursive vs. recurrent, CNN, … - Nonlinearity • Check for implementation bugs with gradient checks • Parameter initialization • Optimization tricks • Should get close to 100% accuracy/precision/recall/etc… on training data • Tune number of iterations on dev data Alan Ritter ◦ socialmedia-class.org

  17. Neural Machine Translation Neural MT went from a fringe research activity in 2014 to the widely-adopted leading way to do MT in 2016. Amazing! Alan Ritter ◦ socialmedia-class.org

  18. Neural Machine Translation Progress in Machine Translation [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]: chart comparing Phrase-based SMT, Syntax-based SMT, and Neural MT (BLEU, 2013–2016). From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf] Alan Ritter ◦ socialmedia-class.org

  19. What is Neural MT (NMT)? Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network* (*but sometimes we compromise this goal a little) Alan Ritter ◦ socialmedia-class.org

  20. The three big wins of Neural MT 1. End-to-end training All parameters are simultaneously optimized to minimize a loss function on the network’s output 2. Distributed representations share strength Better exploitation of word and phrase similarities 3. Better exploitation of context NMT can use a much bigger context – both source and partial target text – to translate more accurately Alan Ritter ◦ socialmedia-class.org

  21. Neural encoder-decoder architectures (figure: Input text → Encoder → fixed-dimensional vector → Decoder → Translated text) Alan Ritter ◦ socialmedia-class.org
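
A heavily simplified sketch of the encoder-decoder idea, assuming NumPy, a plain tanh recurrent cell, and tiny made-up dimensions; real NMT systems use LSTMs/GRUs, learned embeddings, and trained parameters, none of which appear here:

```python
import numpy as np

d_h, d_x, V = 8, 4, 10             # hidden size, input size, target vocab (all illustrative)
rng = np.random.default_rng(3)
W_enc = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
W_dec = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
W_out = rng.normal(scale=0.1, size=(V, d_h))
E_tgt = rng.normal(scale=0.1, size=(V, d_x))     # stand-in target-side embeddings

def rnn_step(W, h, x):
    return np.tanh(W @ np.concatenate([h, x]))

# Encoder: read the source and compress it into one fixed-dimensional vector
source = [rng.normal(size=d_x) for _ in range(5)]   # stand-in source word vectors
h = np.zeros(d_h)
for x in source:
    h = rnn_step(W_enc, h, x)

# Decoder: generate target words one at a time, feeding in the last word produced
y = 0                                # assume id 0 plays the <EOS>/start-token role
for _ in range(3):
    h = rnn_step(W_dec, h, E_tgt[y])
    scores = W_out @ h
    y = int(np.argmax(scores))       # greedy choice of the next target word
    print(y)
```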

  22. Neural MT: The Bronze Age [Allen 1987, IEEE 1st ICNN] 3310 En-Es pairs constructed on 31 En, 40 Es words, max 10/11 word sentences; 33 used as test set. Example: "The grandfather offered the little girl a book" → "El abuelo le ofreció un libro a la niña pequeña". Binary encoding of words – 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Ave WER: 1.3 words Alan Ritter ◦ socialmedia-class.org

  23. Alan Ritter ◦ socialmedia-class.org

  24. Modern Sequence Models for NMT [Sutskever et al. 2014, Bahdanau et al. 2014, et seq.], following [Jordan 1986] and more closely [Elman 1990] (figure: a deep recurrent neural network; the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" is fed in, the sentence meaning is built up, and the translation "The protests escalated over the weekend <EOS>" is generated, feeding in the last word at each step) Alan Ritter ◦ socialmedia-class.org

  25. Long Short-Term Memory Networks (LSTM) Source: Colah’s Blog Alan Ritter ◦ socialmedia-class.org
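
For reference, a hedged NumPy sketch of one LSTM step following the standard gate equations described in the cited blog post; the sizes and random weights are illustrative, and real implementations typically fuse the four gates into a single matrix multiply:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_h, d_x = 6, 4
rng = np.random.default_rng(4)
# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}; x_t]
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(d_h) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)              # forget gate
    i = sigmoid(Wi @ z + bi)              # input gate
    o = sigmoid(Wo @ z + bo)              # output gate
    c_tilde = np.tanh(Wc @ z + bc)        # candidate cell state
    c = f * c_prev + i * c_tilde          # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(h, c, rng.normal(size=d_x))
print(h.shape, c.shape)                   # (6,) (6,)
```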

  26. Data-Driven Conversation • Twitter: ~ 500 Million Public SMS-Style Conversations per Month • Goal: Learn conversational agents directly from massive volumes of data.

  27. Data-Driven Conversation • Twitter: ~ 500 Million Public SMS-Style Conversations per Month • Goal: Learn conversational agents directly from massive volumes of data.

  28. [Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model Input: Who wants to come over for dinner tomorrow?

  29. [Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model Input: Who wants to come over for dinner tomorrow? Output: Yum ! I

  30. [Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model Input: Who wants to come over for dinner tomorrow? Output: Yum ! I want to

  31. [Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model Input: Who wants to come over for dinner tomorrow? Output: Yum ! I want to be there

  32. [Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model Input: Who wants to come over for dinner tomorrow? Output: Yum ! I want to be there tomorrow !

  33. Neural Conversation Alan Ritter ◦ socialmedia-class.org

  34. Vanilla seq2seq & long sentences (figure: the encoder reads "I am a student", the decoder emits "Je suis étudiant") Problem: the whole source sentence must be squeezed into a single fixed-dimensional representation Alan Ritter ◦ socialmedia-class.org

  35. Attention Mechanism: started in computer vision! [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, de Freitas, 2012] (figure: a pool of source states for "I am a student" while decoding "Je suis étudiant") • Solution: random access memory over the pool of source states • Retrieve as needed. Alan Ritter ◦ socialmedia-class.org
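
A minimal sketch of "retrieve as needed", assuming NumPy and simple dot-product scoring over the pool of source states; Bahdanau-style attention actually uses an additive MLP score, which is not shown here:

```python
import numpy as np

rng = np.random.default_rng(5)
enc_states = rng.normal(size=(5, 8))      # pool of source states, one 8-dim vector per word
dec_state = rng.normal(size=8)            # current decoder hidden state

scores = enc_states @ dec_state           # dot-product relevance of each source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over source positions
context = weights @ enc_states            # weighted sum: the memory read for this step

print(weights.round(2), context.shape)    # attention distribution, (8,)
```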
