Algorithms for NLP CS 11-711 · Fall 2020 Lecture 2: Linear text classification Emma Strubell
Let’s try this again… Emma (she/her), Yulia (she/her), Bob (he/him), Sanket (he/him), Han (he/him), Jiateng (he/him)
Outline ■ Basic representations of text data for classification ■ Four linear classifiers ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine; SVM) ■ Logistic regression
Text classification: problem definition ■ Given a text w = (w_1, w_2, …, w_T) ∈ V* ■ Choose a label y ∈ Y ■ For example: ■ Sentiment analysis: Y = {positive, negative, neutral} ■ Toxic comment classification: Y = {toxic, non-toxic} ■ Language identification: Y = {Mandarin, English, Spanish, …} ■ Example: w = The drinks were strong but the fish tacos were bland (w_1, …, w_10), y = negative
How to represent text for classification? One choice of R: bag-of-words ■ Sequence length T can be different for every sentence/document ■ The bag-of-words is a fixed-length vector of word counts: for w = The drinks were strong but the fish tacos were bland, x has x_the = 2, x_were = 2, x_drinks = 1, x_strong = 1, x_but = 1, x_fish = 1, x_tacos = 1, x_bland = 1, and 0 for every other vocabulary word ■ Length of x is equal to the size of the vocabulary, V ■ For each x there may be many possible w (the representation ignores word order)
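A minimal sketch of this representation; the tiny vocabulary and the word-to-index mapping here are illustrative, not from the slides:

    from collections import Counter

    def bag_of_words(tokens, vocab):
        # vocab maps word -> index; returns a length-|V| list of counts
        counts = Counter(tokens)
        return [counts[word] for word in vocab]

    vocab = {"the": 0, "drinks": 1, "were": 2, "strong": 3,
             "but": 4, "fish": 5, "tacos": 6, "bland": 7}
    w = "the drinks were strong but the fish tacos were bland".split()
    x = bag_of_words(w, vocab)
    # x == [2, 1, 2, 1, 1, 1, 1, 1]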
Linear classification on bag-of-words ■ Let ψ(x, y) score the compatibility of bag-of-words x and label y. Then: ŷ = argmax_y ψ(x, y) ■ In a linear classifier this scoring function has the simple form: ψ(x, y) = θ · f(x, y) = Σ_j θ_j × f_j(x, y), where θ is a vector of weights and f is a feature function
Feature functions ■ In classification, the feature function is usually a simple combination of x and y, such as: f_j(x, y) = x_fantastic if y = positive, 0 otherwise — or, for another feature: f_j(x, y) = x_bland if y = negative, 0 otherwise ■ If we have K labels, f(x, y) is a column vector of length K × V that places a copy of the word counts x in the block for label y and zeros everywhere else (a sketch of such a function follows below):
f(x, y = 1) = [x_1, x_2, …, x_V, 0, …, 0]^T (the counts, then (K − 1) × V zeros)
f(x, y = 2) = [0, …, 0, x_1, x_2, …, x_V, 0, …, 0]^T (V zeros, then the counts, then (K − 2) × V zeros)
…
f(x, y = K) = [0, …, 0, x_1, x_2, …, x_V]^T ((K − 1) × V zeros, then the counts)
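One way the feature_function used in the code on the next slides could build this vector — a sketch only, assuming x is a length-V count vector and labels are integers 0 … K−1; the exact signature is not given in the slides:

    def feature_function(x, y, vocab_size):
        # Returns a sparse dict {feature index: count}: the counts in x are
        # shifted into the block of indices belonging to label y.
        features = {}
        for j, count in enumerate(x):
            if count:
                features[y * vocab_size + j] = count
        return features

    # feature_function([2, 1, 0], y=1, vocab_size=3) == {3: 2, 4: 1}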
Linear classification in Python ■ x is the bag-of-words count vector (length V) and θ is the weight vector, e.g. θ = [-0.16, -1.66, -1.55, 0.23, 0.17, -3.43, 0.18, -2.08, -1.46, 0.13, 1.47, -0.06, 1.84, …, 0.36] (length K × V)

    def compute_score(x, y, weights):
        # Accumulate theta . f(x, y) one active feature at a time
        total = 0
        for feature, count in feature_function(x, y).items():
            total += weights[feature] * count
        return total
Linear classification in Python ■ The same score, vectorized with NumPy, with e.g. θ = [-1.13, -0.37, 0.97, …, -0.26] (length K × V); here feature_function(x, y) is assumed to return the dense length-(K × V) feature vector rather than a sparse dict

    import numpy as np

    def compute_score(x, y, weights):
        # theta . f(x, y) as a single dot product over the dense feature vector
        return np.dot(weights, feature_function(x, y))
Ok, but how to obtain θ? ■ The learning problem is to find the right weights θ ■ The rest of this lecture will cover four supervised learning algorithms: ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine) ■ Logistic regression ■ All these methods assume a labeled dataset of N examples: {(x^(i), y^(i))}, i = 1, …, N
Probabilistic classification ■ Naïve Bayes is a probabilistic classifier. It takes the following strategy: ■ Define a probability model p(x, y) ■ Estimate the parameters of the probability model by maximum likelihood, i.e. by maximizing the likelihood of the dataset ■ Set the scoring function equal to the log-probability: ψ(x, y) = log p(x, y) = log p(y | x) + C, where C is constant in y. This ensures that: ŷ = argmax_y p(y | x)
A probability model for text classification ■ First, assume each instance (each (x, y) pair) is independent of the others: p(x^(1:N), y^(1:N)) = ∏_{i=1}^{N} p(x^(i), y^(i)) ■ Apply the chain rule of probability: p(x, y) = p(x | y) × p(y) ■ Define the parametric form of each probability: p(y) = Categorical(μ), p(x | y) = Multinomial(φ_y, T) ■ The multinomial is a distribution over vectors of counts ■ The parameters μ and φ are vectors of probabilities
The multinomial distribution ■ Suppose the word bland has probability φ_j. What is the probability that this word appears 3 times? ■ Each word’s probability is exponentiated by its count: Multinomial(x; φ, T) = [ (Σ_{j=1}^{V} x_j)! / ∏_{j=1}^{V} x_j! ] × ∏_{j=1}^{V} φ_j^{x_j} ■ The coefficient is the count of the number of possible orderings of x. Crucially, it does not depend on the frequency parameter φ
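A small sketch of this density in log space; the function name and the use of math.lgamma for the factorials are my own choices, not from the slides:

    from math import lgamma, log

    def multinomial_log_pmf(x, phi):
        # x: vector of counts, phi: vector of probabilities (same length)
        n = sum(x)
        # log of the coefficient (sum_j x_j)! / prod_j x_j!
        log_coef = lgamma(n + 1) - sum(lgamma(x_j + 1) for x_j in x)
        # log of prod_j phi_j^{x_j}; terms with x_j = 0 contribute nothing
        return log_coef + sum(x_j * log(p_j) for x_j, p_j in zip(x, phi) if x_j > 0)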
Naïve Bayes text classification ■ Naïve Bayes can be formulated in our linear classification framework by setting θ equal to the log parameters: θ = [log φ_{y1,w1}, log φ_{y1,w2}, …, log φ_{y1,wV}, log μ_{y1}, log φ_{y2,w1}, …, log φ_{y2,wV}, log μ_{y2}, …, log φ_{yK,wV}, log μ_{yK}], a vector of length K × (V + 1) ■ Then ψ(x, y) = θ · f(x, y) = log p(x | y) + log p(y), where f(x, y) is extended to include an “offset” 1 for each possible label after the word counts
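A minimal sketch of this scoring, assuming the log parameters are stored as a (K, V) array log_phi and a length-K array log_mu rather than one flat θ (my storage choice, not the slides’); the multinomial coefficient is dropped since it is constant in y:

    import numpy as np

    def naive_bayes_score(x, y, log_phi, log_mu):
        # theta_y . [x; 1] = sum_j x_j log phi_{y,j} + log mu_y
        return x @ log_phi[y] + log_mu[y]

    def predict(x, log_phi, log_mu):
        # argmax_y of the score, computed for all labels at once
        return int(np.argmax(log_phi @ x + log_mu))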
Estimating Naïve Bayes ■ In relative frequency estimation, the parameters are set to empirical frequencies: φ̂_{y,j} = count(y, j) / Σ_{j′=1}^{V} count(y, j′) = ( Σ_{i: y^(i)=y} x_j^(i) ) / ( Σ_{j′=1}^{V} Σ_{i: y^(i)=y} x_{j′}^(i) ), and μ̂_y = count(y) / Σ_{y′} count(y′) ■ This turns out to be identical to the maximum likelihood estimate (yay): φ̂, μ̂ = argmax_{φ,μ} ∏_{i=1}^{N} p(x^(i), y^(i)) = argmax_{φ,μ} Σ_{i=1}^{N} log p(x^(i), y^(i))
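A sketch of relative frequency estimation, assuming the training data comes as a count matrix X of shape (N, V) and an array Y of integer labels (array and variable names are my own):

    import numpy as np

    def estimate_naive_bayes(X, Y, num_labels):
        # word_counts[y, j] = count(y, j): total count of word j in documents labeled y
        word_counts = np.zeros((num_labels, X.shape[1]))
        label_counts = np.zeros(num_labels)
        for x_i, y_i in zip(X, Y):
            word_counts[y_i] += x_i
            label_counts[y_i] += 1
        phi = word_counts / word_counts.sum(axis=1, keepdims=True)  # relative frequencies
        mu = label_counts / label_counts.sum()
        return phi, mu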
Smoothing, bias, variance ■ To deal with low counts, it can be helpful to smooth probabilities: φ̂_{y,j} = (α + count(y, j)) / (Vα + Σ_{j′=1}^{V} count(y, j′)) ■ Smoothing introduces bias, moving the parameters away from their maximum-likelihood estimates ■ But, it reduces variance, the extent to which the parameters depend on the idiosyncrasies of a finite dataset ■ The smoothing term α is a hyperparameter that must be tuned on a development set
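The same estimate with add-α smoothing, as a one-function sketch over the word_counts array from the previous snippet (a hypothetical helper, not from the slides):

    def estimate_phi_smoothed(word_counts, alpha):
        # (alpha + count(y, j)) / (V * alpha + sum_j' count(y, j'))
        V = word_counts.shape[1]
        return (word_counts + alpha) / (V * alpha + word_counts.sum(axis=1, keepdims=True))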