Algorithms for NLP CS 11-711 · Fall 2020 Lecture 2: Linear text classification Emma Strubell
Let’s try this again… Emma (she/her), Yulia (she/her), Bob (he/him), Sanket (he/him), Han (he/him), Jiateng (he/him)
Outline ■ Basic representations of text data for classification ■ Four linear classifiers ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine; SVM) ■ Logistic regression
Text classification: problem definition ■ Given a text w = (w_1, w_2, …, w_T) ∈ V* ■ Choose a label y ∈ Y ■ For example: ■ Sentiment analysis: Y = {positive, negative, neutral} ■ Toxic comment classification: Y = {toxic, non-toxic} ■ Language identification: Y = {Mandarin, English, Spanish, …} ■ Example: w = The drinks were strong but the fish tacos were bland (w_1, …, w_10), y = negative
How to represent text for classification? One choice of R: bag-of-words ■ Sequence length T can be different for every sentence/document ■ The bag-of-words is a fixed-length vector of word counts: for w = The drinks were strong but the fish tacos were bland, x has x_the = 2, x_were = 2, x_drinks = 1, x_strong = 1, x_but = 1, x_fish = 1, x_tacos = 1, x_bland = 1, and 0 for every other vocabulary word ■ Length of x is equal to the size of the vocabulary, V ■ For each x there may be many possible w (the representation ignores word order)
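A minimal sketch of this representation; the tiny vocabulary and the word-to-index mapping here are illustrative, not from the slides:

    from collections import Counter

    def bag_of_words(tokens, vocab):
        # vocab maps word -> index; returns a length-|V| list of counts
        counts = Counter(tokens)
        return [counts[word] for word in vocab]

    vocab = {"the": 0, "drinks": 1, "were": 2, "strong": 3,
             "but": 4, "fish": 5, "tacos": 6, "bland": 7}
    w = "the drinks were strong but the fish tacos were bland".split()
    x = bag_of_words(w, vocab)
    # x == [2, 1, 2, 1, 1, 1, 1, 1]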
Linear classification on bag-of-words ■ Let ψ(x, y) score the compatibility of bag-of-words x and label y. Then: ŷ = argmax_y ψ(x, y) ■ In a linear classifier this scoring function has the simple form: ψ(x, y) = θ · f(x, y) = Σ_j θ_j × f_j(x, y), where θ is a vector of weights and f is a feature function
Feature functions ■ In classification, the feature function is usually a simple combination of x and y, such as: f_j(x, y) = x_fantastic if y = positive, 0 otherwise — or, for another feature: f_j(x, y) = x_bland if y = negative, 0 otherwise ■ If we have K labels, f(x, y) is a column vector of length K × V that places a copy of the word counts x in the block for label y and zeros everywhere else (a sketch of such a function follows below):
f(x, y = 1) = [x_1, x_2, …, x_V, 0, …, 0]^T (the counts, then (K − 1) × V zeros)
f(x, y = 2) = [0, …, 0, x_1, x_2, …, x_V, 0, …, 0]^T (V zeros, then the counts, then (K − 2) × V zeros)
…
f(x, y = K) = [0, …, 0, x_1, x_2, …, x_V]^T ((K − 1) × V zeros, then the counts)
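One way the feature_function used in the code on the next slides could build this vector — a sketch only, assuming x is a length-V count vector and labels are integers 0 … K−1; the exact signature is not given in the slides:

    def feature_function(x, y, vocab_size):
        # Returns a sparse dict {feature index: count}: the counts in x are
        # shifted into the block of indices belonging to label y.
        features = {}
        for j, count in enumerate(x):
            if count:
                features[y * vocab_size + j] = count
        return features

    # feature_function([2, 1, 0], y=1, vocab_size=3) == {3: 2, 4: 1}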
Linear classification in Python ■ x is the bag-of-words count vector (length V) and θ is the weight vector, e.g. θ = [-0.16, -1.66, -1.55, 0.23, 0.17, -3.43, 0.18, -2.08, -1.46, 0.13, 1.47, -0.06, 1.84, …, 0.36] (length K × V)

    def compute_score(x, y, weights):
        # Accumulate theta . f(x, y) one active feature at a time
        total = 0
        for feature, count in feature_function(x, y).items():
            total += weights[feature] * count
        return total
Linear classification in Python ■ The same score, vectorized with NumPy, with e.g. θ = [-1.13, -0.37, 0.97, …, -0.26] (length K × V); here feature_function(x, y) is assumed to return the dense length-(K × V) feature vector rather than a sparse dict

    import numpy as np

    def compute_score(x, y, weights):
        # theta . f(x, y) as a single dot product over the dense feature vector
        return np.dot(weights, feature_function(x, y))
Ok, but how to obtain θ? ■ The learning problem is to find the right weights θ ■ The rest of this lecture will cover four supervised learning algorithms: ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine) ■ Logistic regression ■ All these methods assume a labeled dataset of N examples: {(x^(i), y^(i))}, i = 1, …, N
Probabilistic classification ■ Naïve Bayes is a probabilistic classifier. It takes the following strategy: ■ Define a probability model p(x, y) ■ Estimate the parameters of the probability model by maximum likelihood, i.e. by maximizing the likelihood of the dataset ■ Set the scoring function equal to the log-probability: ψ(x, y) = log p(x, y) = log p(y | x) + C, where C is constant in y. This ensures that: ŷ = argmax_y p(y | x)
A probability model for text classification ■ First, assume each instance (each (x, y) pair) is independent of the others: p(x^(1:N), y^(1:N)) = ∏_{i=1}^{N} p(x^(i), y^(i)) ■ Apply the chain rule of probability: p(x, y) = p(x | y) × p(y) ■ Define the parametric form of each probability: p(y) = Categorical(μ), p(x | y) = Multinomial(φ_y, T) ■ The multinomial is a distribution over vectors of counts ■ The parameters μ and φ are vectors of probabilities
The multinomial distribution ■ Suppose the word bland has probability φ_j. What is the probability that this word appears 3 times? ■ Each word’s probability is exponentiated by its count: Multinomial(x; φ, T) = [ (Σ_{j=1}^{V} x_j)! / ∏_{j=1}^{V} x_j! ] × ∏_{j=1}^{V} φ_j^{x_j} ■ The coefficient is the count of the number of possible orderings of x. Crucially, it does not depend on the frequency parameter φ
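A small sketch of this density in log space; the function name and the use of math.lgamma for the factorials are my own choices, not from the slides:

    from math import lgamma, log

    def multinomial_log_pmf(x, phi):
        # x: vector of counts, phi: vector of probabilities (same length)
        n = sum(x)
        # log of the coefficient (sum_j x_j)! / prod_j x_j!
        log_coef = lgamma(n + 1) - sum(lgamma(x_j + 1) for x_j in x)
        # log of prod_j phi_j^{x_j}; terms with x_j = 0 contribute nothing
        return log_coef + sum(x_j * log(p_j) for x_j, p_j in zip(x, phi) if x_j > 0)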
Naïve Bayes text classification ■ Naïve Bayes can be formulated in our linear classification framework by setting θ equal to the log parameters: θ = [log φ_{y1,w1}, log φ_{y1,w2}, …, log φ_{y1,wV}, log μ_{y1}, log φ_{y2,w1}, …, log φ_{y2,wV}, log μ_{y2}, …, log φ_{yK,wV}, log μ_{yK}], a vector of length K × (V + 1) ■ Then ψ(x, y) = θ · f(x, y) = log p(x | y) + log p(y), where f(x, y) is extended to include an “offset” 1 for each possible label after the word counts
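A minimal sketch of this scoring, assuming the log parameters are stored as a (K, V) array log_phi and a length-K array log_mu rather than one flat θ (my storage choice, not the slides’); the multinomial coefficient is dropped since it is constant in y:

    import numpy as np

    def naive_bayes_score(x, y, log_phi, log_mu):
        # theta_y . [x; 1] = sum_j x_j log phi_{y,j} + log mu_y
        return x @ log_phi[y] + log_mu[y]

    def predict(x, log_phi, log_mu):
        # argmax_y of the score, computed for all labels at once
        return int(np.argmax(log_phi @ x + log_mu))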
Estimating Naïve Bayes ■ In relative frequency estimation, the parameters are set to empirical frequencies: φ̂_{y,j} = count(y, j) / Σ_{j′=1}^{V} count(y, j′) = ( Σ_{i: y^(i)=y} x_j^(i) ) / ( Σ_{j′=1}^{V} Σ_{i: y^(i)=y} x_{j′}^(i) ), and μ̂_y = count(y) / Σ_{y′} count(y′) ■ This turns out to be identical to the maximum likelihood estimate (yay): φ̂, μ̂ = argmax_{φ,μ} ∏_{i=1}^{N} p(x^(i), y^(i)) = argmax_{φ,μ} Σ_{i=1}^{N} log p(x^(i), y^(i))
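A sketch of relative frequency estimation, assuming the training data comes as a count matrix X of shape (N, V) and an array Y of integer labels (array and variable names are my own):

    import numpy as np

    def estimate_naive_bayes(X, Y, num_labels):
        # word_counts[y, j] = count(y, j): total count of word j in documents labeled y
        word_counts = np.zeros((num_labels, X.shape[1]))
        label_counts = np.zeros(num_labels)
        for x_i, y_i in zip(X, Y):
            word_counts[y_i] += x_i
            label_counts[y_i] += 1
        phi = word_counts / word_counts.sum(axis=1, keepdims=True)  # relative frequencies
        mu = label_counts / label_counts.sum()
        return phi, mu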
Smoothing, bias, variance ■ To deal with low counts, it can be helpful to smooth probabilities: φ̂_{y,j} = (α + count(y, j)) / (Vα + Σ_{j′=1}^{V} count(y, j′)) ■ Smoothing introduces bias, moving the parameters away from their maximum-likelihood estimates ■ But, it reduces variance, the extent to which the parameters depend on the idiosyncrasies of a finite dataset ■ The smoothing term α is a hyperparameter that must be tuned on a development set
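The same estimate with add-α smoothing, as a one-function sketch over the word_counts array from the previous snippet (a hypothetical helper, not from the slides):

    def estimate_phi_smoothed(word_counts, alpha):
        # (alpha + count(y, j)) / (V * alpha + sum_j' count(y, j'))
        V = word_counts.shape[1]
        return (word_counts + alpha) / (V * alpha + word_counts.sum(axis=1, keepdims=True))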