Social Media & Text Analysis, Lecture 9: Deep Learning for NLP. CSE 5539-0010, Ohio State University. Instructor: Alan Ritter. Website: socialmedia-class.org. Many slides are adapted from Richard Socher, Greg Durrett, Chris Dyer, Dan Jurafsky, Chris Manning.
A Neuron • If you know logistic regression, then you already understand a basic neural network neuron! • A single neuron is a computational unit with n (here 3) inputs and 1 output, and parameters W, b. • Inputs feed through an activation function to produce the output; the bias unit corresponds to the intercept term.
A Neuron is essentially a binary logistic regression unit: h_{w,b}(x) = f(w^T x + b), where f(z) = 1 / (1 + e^{-z}). • w, b are the parameters of this neuron, i.e., of this logistic regression model. • b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term.
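A minimal NumPy sketch of this single neuron; the input, weight, and bias values are made up purely for illustration:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # h_{w,b}(x) = f(w^T x + b): a binary logistic regression unit
        return sigmoid(np.dot(w, x) + b)

    x = np.array([1.0, 0.5, -2.0])   # hypothetical 3-input example
    w = np.array([0.3, -0.1, 0.8])
    b = 0.1
    print(neuron(x, w, b))           # a probability between 0 and 1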
A Neural Network = running several logistic regressions at the same time. If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
A Neural Network = running several logistic regressions at the same time … which we can feed into another logistic regression function. It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A Neural Network = running several logistic regressions at the same time. Before we know it, we have a multilayer neural network….
f: Activation Function
We have
a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.
In matrix notation:
z = Wx + b
a = f(z)
where f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
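A hedged sketch of one such layer in NumPy, applying f element-wise exactly as in the matrix notation above (the sizes and random initialization are arbitrary, for illustration only):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b, f=sigmoid):
        # z = Wx + b, a = f(z), with f applied element-wise
        z = W @ x + b
        return f(z)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)          # 3 inputs
    W = rng.normal(size=(3, 3))     # W_ij: weight from input j to hidden unit i
    b = rng.normal(size=3)
    a = layer(x, W, b)              # a = [a_1, a_2, a_3]
    print(a)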
Activation Function: logistic ("sigmoid") and tanh. tanh is just a rescaled and shifted sigmoid: tanh(z) = 2 logistic(2z) − 1
Activation Function: hard tanh, softsign, rectified linear (ReLU). softsign(z) = z / (1 + |z|); rect(z) = max(z, 0); hard tanh is similar to tanh but computationally cheaper, and it saturates hard. • Glorot and Bengio (AISTATS 2011) discuss softsign and the rectifier.
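A quick NumPy sketch of these activation functions, including a numerical check of the tanh identity from the previous slide; the softsign definition below is the standard z / (1 + |z|) form:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softsign(z):
        return z / (1.0 + np.abs(z))

    def rect(z):                     # rectified linear (ReLU)
        return np.maximum(z, 0.0)

    def hard_tanh(z):                # cheap, hard-saturating approximation of tanh
        return np.clip(z, -1.0, 1.0)

    z = np.linspace(-3, 3, 7)
    # tanh is a rescaled and shifted sigmoid: tanh(z) = 2*logistic(2z) - 1
    print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))   # True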
Non-linearity • Logistic (Softmax) Regression only gives linear decision boundaries
Non-linearity • Neural networks can learn much more complex functions and nonlinear decision boundaries!
Non-linearity (Input → Hidden Layer → Output)
z = g(V g(Wx + b) + c), where g(Wx + b) is the output of the first layer.
With no nonlinearity: z = VWx + Vb + c, which is equivalent to z = Ux + d (a single linear map).
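A small numerical check of this collapse: without the nonlinearity g, stacking two linear layers is exactly one linear layer with U = VW and d = Vb + c (random matrices chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
    V, c = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)

    two_linear_layers = V @ (W @ x + b) + c
    U, d = V @ W, V @ b + c
    one_linear_layer = U @ x + d
    print(np.allclose(two_linear_layers, one_linear_layer))   # True: no extra expressive power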
What about Word2vec (Skip-gram and CBOW)?
So, what about Word2vec (Skip-gram and CBOW)? It is not deep learning, but a "shallow" neural network. It is, in fact, a log-linear model (softmax regression). That is why it is fast to train over larger datasets, which yields better embeddings.
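One way to see the "log-linear (softmax regression)" point: in the skip-gram model, the probability of a context word given a center word is just a dot product of embeddings pushed through a softmax. A toy sketch with a made-up vocabulary size and random vectors:

    import numpy as np

    rng = np.random.default_rng(2)
    V, d = 5, 4                        # toy vocabulary size and embedding dimension
    W_in = rng.normal(size=(V, d))     # "input" (center-word) embeddings
    W_out = rng.normal(size=(V, d))    # "output" (context-word) embeddings

    center = 3
    scores = W_out @ W_in[center]                       # one dot product per vocabulary word
    p_context = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    print(p_context)                                     # P(context word | center word)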
Learning Neural Networks (Input → Hidden Layer → Output). By the chain rule, the change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input). ‣ Computing these looks like running this network in reverse (backpropagation) ‣ I've omitted some details about how we get the gradients
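A minimal backpropagation sketch for a one-hidden-layer network with sigmoid activations and squared-error loss; the data and shapes are made up, and this spells out the chain-rule bookkeeping the slide omits:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(3)
    x, y = rng.normal(size=3), np.array([1.0])       # one training example
    W, b = rng.normal(size=(4, 3)), np.zeros(4)      # input -> hidden
    V, c = rng.normal(size=(1, 4)), np.zeros(1)      # hidden -> output

    # Forward pass
    h = sigmoid(W @ x + b)            # hidden activations
    out = sigmoid(V @ h + c)          # network output
    loss = 0.5 * np.sum((out - y) ** 2)

    # Backward pass: run the network in reverse
    d_out = (out - y) * out * (1 - out)     # loss gradient w.r.t. output pre-activation
    dV, dc = np.outer(d_out, h), d_out      # gradients for the output layer
    d_h = (V.T @ d_out) * h * (1 - h)       # loss gradient w.r.t. hidden pre-activation
    dW, db = np.outer(d_h, x), d_h          # gradients for the hidden layer
    print(loss, dW.shape, dV.shape)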
Strategy for Successful NNs • Select a network structure appropriate for the problem - Structure: single words, fixed windows, sentence-based, document-level; bag of words, recursive vs. recurrent, CNN, … - Nonlinearity • Check for implementation bugs with gradient checks (see the sketch below) • Parameter initialization • Optimization tricks • You should be able to get close to 100% accuracy/precision/recall/etc. on the training data • Tune the number of iterations on dev data
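For the gradient-check item, a sketch of a numerical gradient check: compare an analytic gradient against a central-difference estimate. The toy loss here is a stand-in for any differentiable loss:

    import numpy as np

    def loss(w):                       # stand-in for any differentiable loss
        return np.sum(w ** 2) + np.sin(w[0])

    def analytic_grad(w):              # the gradient we want to verify
        g = 2 * w
        g[0] += np.cos(w[0])
        return g

    def numerical_grad(f, w, eps=1e-5):
        g = np.zeros_like(w)
        for i in range(w.size):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[i] += eps
            w_minus[i] -= eps
            g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # central difference
        return g

    w = np.array([0.5, -1.2, 2.0])
    print(np.allclose(analytic_grad(w), numerical_grad(loss, w), atol=1e-6))   # True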
Neural Machine Translation: Neural MT went from a fringe research activity in 2014 to the widely adopted, leading way to do MT in 2016. Amazing!
Neural Machine Translation: Progress in Machine Translation [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]: chart of BLEU scores for phrase-based SMT, syntax-based SMT, and neural MT from 2013 to 2016. From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
What is Neural MT (NMT)? Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network.* *But sometimes we compromise this goal a little.
The three big wins of Neural MT
1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output
2. Distributed representations share strength: better exploitation of word and phrase similarities
3. Better exploitation of context: NMT can use a much bigger context, both the source and the partial target text, to translate more accurately
Neural encoder-decoder architectures: Input text → Encoder → (vector representation) → Decoder → Translated text
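A very compressed sketch of the encoder-decoder idea (not a real NMT system): a toy encoder reduces the source token embeddings to a single vector, and a toy decoder emits output token ids conditioned on it. Every component here is a hypothetical placeholder with random weights:

    import numpy as np

    rng = np.random.default_rng(4)
    d = 6                                              # toy hidden size

    def encode(source_embeddings, W_enc):
        # A (very) simplified recurrence; keep only the final state
        h = np.zeros(d)
        for e in source_embeddings:
            h = np.tanh(W_enc @ np.concatenate([e, h]))
        return h                                       # fixed-dimensional summary of the input

    def decode(h, W_dec, W_vocab, max_len=5):
        tokens = []
        for _ in range(max_len):
            h = np.tanh(W_dec @ h)
            tokens.append(int(np.argmax(W_vocab @ h)))   # greedy choice of the next token id
        return tokens

    source = [rng.normal(size=d) for _ in range(3)]      # 3 source "words"
    W_enc = rng.normal(size=(d, 2 * d))
    W_dec, W_vocab = rng.normal(size=(d, d)), rng.normal(size=(10, d))
    print(decode(encode(source, W_enc), W_dec, W_vocab))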
Neural MT: The Bronze Age [Allen 1987, IEEE 1st ICNN]. 3310 En-Es pairs constructed from 31 English and 40 Spanish words, max 10/11-word sentences; 33 pairs used as the test set. Example: "The grandfather offered the little girl a book" → "El abuelo le ofreció un libro a la niña pequeña". Binary encoding of words: 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Average WER: 1.3 words.
Modern Sequence Models for NMT [Sutskever et al. 2014, Bahdanau et al. 2014, et seq.], following [Jordan 1986] and more closely [Elman 1990]. Figure: a deep recurrent neural network. The source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" is fed in word by word and the sentence meaning is built up in a hidden vector; the translation "The protests escalated over the weekend <EOS>" is then generated, feeding in the last generated word at each step.
Long Short-Term Memory Networks (LSTM). Source: Colah's Blog
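A sketch of a single LSTM cell step following the standard gate equations (input, forget, and output gates via sigmoids, candidate update via tanh); the sizes and random weights are illustrative only:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W stacks the four gate weight matrices; the input is [x; h_prev]
        z = W @ np.concatenate([x, h_prev]) + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        g = np.tanh(g)                                 # candidate cell update
        c = f * c_prev + i * g                         # new cell state
        h = o * np.tanh(c)                             # new hidden state
        return h, c

    rng = np.random.default_rng(5)
    n_in, n_hid = 3, 4
    W = rng.normal(size=(4 * n_hid, n_in + n_hid))
    b = np.zeros(4 * n_hid)
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for x in rng.normal(size=(6, n_in)):               # run 6 time steps
        h, c = lstm_step(x, h, c, W, b)
    print(h)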
Data-Driven Conversation • Twitter: ~500 million public SMS-style conversations per month • Goal: learn conversational agents directly from massive volumes of data.
[Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model. Input: Who wants to come over for dinner tomorrow? Output, generated phrase by phrase: "Yum !" | "I want to" | "be there" | "tomorrow !"
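In the standard noisy channel formulation, the response r is chosen as r* = argmax_r P(input | r) · P(r): P(input | r) is a translation model scoring how well the response accounts for the input, and P(r) is a language model scoring the fluency of the response; decoding builds up the output phrase by phrase, as in the bracketed example above.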
Neural Conversation
Vanilla seq2seq & long sentences: the entire source sentence "Je suis étudiant" is compressed into a single vector before the decoder produces "I am a student". Problem: fixed-dimensional representations; a single vector must summarize the whole source sentence, however long it is.
Attention Mechanism (it started in computer vision!) [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012] • Solution: treat the pool of encoder source states as a random access memory • Retrieve from it as needed while decoding.
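A sketch of the basic dot-product attention retrieval step over a pool of source states; the source states and decoder query below are random placeholders:

    import numpy as np

    def attend(query, source_states):
        # One score (dot product) per source state: the "random access memory" lookup
        scores = source_states @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax over source positions
        return weights @ source_states              # weighted sum = retrieved context vector

    rng = np.random.default_rng(6)
    source_states = rng.normal(size=(4, 5))         # 4 source positions, dimension 5
    query = rng.normal(size=5)                      # current decoder state
    context = attend(query, source_states)
    print(context.shape)                            # (5,)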