Natural Language Processing (CSE 517): Text Classification (II)
Noah Smith
© 2016 University of Washington
nasmith@cs.washington.edu
February 1, 2016
Quick Review: Text Classification

Input: a piece of text $x \in \mathcal{V}^\dagger$, usually a document (r.v. $X$).
Output: a label from a finite set $\mathcal{L}$ (r.v. $L$).

Standard line of attack:
1. Human experts label some data.
2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier $\mathrm{classify} : \mathcal{V}^\dagger \rightarrow \mathcal{L}$.
3. Apply classify to as much data as you want!

We covered naïve Bayes, reviewed multinomial logistic regression, and, briefly, the perceptron.
Multinomial Logistic Regression as "Log Loss"

$$p(L = \ell \mid \boldsymbol{x}) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell)}{\sum_{\ell' \in \mathcal{L}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell')}$$

MLE can be rewritten as a minimization problem:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \Bigl( \underbrace{\log \sum_{\ell' \in \mathcal{L}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, \ell')}_{\text{fear}} - \underbrace{\mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, \ell_i)}_{\text{hope}} \Bigr)$$

Recall from lecture 3:
◮ Be wise and regularize!
◮ Solve with batch or stochastic gradient methods.
◮ $w_j$ has an interpretation.
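Not from the slides: a minimal NumPy sketch of this loss for a single example, assuming a feature function phi(x, label) that returns a vector. It computes the loss ("fear" minus "hope") and its gradient, with the log-sum-exp evaluated in a numerically stable way.

```python
import numpy as np

def log_loss_and_grad(w, phi, x, gold, labels):
    """Log loss for one example and its gradient with respect to w.

    phi(x, label) -> feature vector (np.ndarray); gold is the correct label.
    """
    feats = {l: phi(x, l) for l in labels}
    scores = np.array([w @ feats[l] for l in labels])
    m = scores.max()                                  # shift for numerical stability
    log_z = m + np.log(np.exp(scores - m).sum())      # "fear": log-sum-exp over labels
    loss = log_z - w @ feats[gold]                    # minus "hope": the gold score
    # Gradient: expected feature vector under the model minus the gold features.
    probs = np.exp(scores - log_z)
    expected = sum(p * feats[l] for p, l in zip(probs, labels))
    return loss, expected - feats[gold]
```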
Log Loss and Hinge Loss for $(\boldsymbol{x}, \ell)$

$$\text{log loss: } \log \Bigl( \sum_{\ell' \in \mathcal{L}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell') \Bigr) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell)$$

$$\text{hinge loss: } \Bigl( \max_{\ell' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell') \Bigr) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell)$$

In the binary case, where "score" is the linear score of the correct label:

[Figure: loss as a function of score. In purple is the hinge loss, in blue is the log loss; in red is the zero-one loss (error).]
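Not from the slides: a small sketch of the two binary-case curves, assuming the competing label's score is fixed at zero so that "score" is the margin of the correct label over the incorrect one; under that assumption the log loss reduces to log(1 + exp(−score)) and the hinge loss to max(0, −score).

```python
import numpy as np

def binary_log_loss(score):
    # log(exp(0) + exp(-score)) = log(1 + exp(-score))
    return np.log1p(np.exp(-score))

def binary_hinge_loss(score):
    # max(0, -score): zero once the correct label scores at least as high
    return np.maximum(0.0, -score)

for s in np.linspace(-4, 4, 9):
    print(f"score={s:+.1f}  log={binary_log_loss(s):.3f}  hinge={binary_hinge_loss(s):.3f}")
```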
Minimizing Hinge Loss: Perceptron

$$\min_{\mathbf{w}} \sum_{i=1}^{n} \Bigl( \max_{\ell' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, \ell') - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, \ell_i) \Bigr)$$

Stochastic subgradient descent on the above is called the perceptron algorithm.
◮ For $t \in \{1, \ldots, T\}$:
  ◮ Pick $i_t$ uniformly at random from $\{1, \ldots, n\}$.
  ◮ $\hat{\ell}_{i_t} \leftarrow \operatorname*{argmax}_{\ell \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, \ell)$
  ◮ $\mathbf{w} \leftarrow \mathbf{w} - \alpha \bigl( \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, \hat{\ell}_{i_t}) - \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, \ell_{i_t}) \bigr)$
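A minimal sketch of this loop (not from the slides; the data, label set, and feature function phi are assumed, and names are illustrative):

```python
import random
import numpy as np

def perceptron(data, labels, phi, dim, T=10000, alpha=1.0, seed=0):
    """Stochastic subgradient descent on the hinge loss (the perceptron).

    data: list of (x_i, gold_label_i) pairs; labels: the label set L;
    phi(x, label) -> np.ndarray of length dim.
    """
    rng = random.Random(seed)
    w = np.zeros(dim)
    for _ in range(T):
        x, gold = rng.choice(data)                       # pick i_t uniformly at random
        pred = max(labels, key=lambda l: w @ phi(x, l))  # argmax_l w · phi(x, l)
        if pred != gold:                                 # a valid subgradient is 0 when correct
            w -= alpha * (phi(x, pred) - phi(x, gold))
    return w
```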
Error Costs

Suppose that not all mistakes are equally bad, e.g., false positives vs. false negatives in spam detection. Let $\mathrm{cost}(\ell, \ell')$ quantify the "badness" of substituting $\ell'$ for the correct label $\ell$.

Intuition: estimate the scoring function so that

$$\mathrm{score}(\ell_i) - \mathrm{score}(\hat{\ell}) \propto \mathrm{cost}(\ell_i, \hat{\ell})$$
General Hinge Loss for $(\boldsymbol{x}, \ell)$

$$\Bigl( \max_{\ell' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell') + \mathrm{cost}(\ell, \ell') \Bigr) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell)$$

In the binary case, with $\mathrm{cost}(-1, 1) = 1$:

[Figure: the curve $-x + \max(x, 1)$. In blue is the general hinge loss; in red is the "zero-one" loss (error).]
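Not from the slides: a sketch of the cost-augmented hinge loss for one example, assuming a feature function phi and a user-supplied cost function with cost(ℓ, ℓ) = 0.

```python
import numpy as np

def cost_augmented_hinge(w, phi, x, gold, labels, cost):
    """General (cost-augmented) hinge loss for a single example.

    cost(gold, pred) -> nonnegative number; cost(gold, gold) should be 0,
    so the loss is never negative and is zero only when the gold label
    beats every alternative by at least that alternative's cost.
    """
    gold_score = w @ phi(x, gold)
    return max(w @ phi(x, l) + cost(gold, l) for l in labels) - gold_score
```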
Support Vector Machines

A different motivation for the generalized hinge:

$$\hat{\mathbf{w}} = \sum_{i=1}^{n} \sum_{\ell \in \mathcal{L}} \alpha_{i,\ell} \, \boldsymbol{\phi}(\boldsymbol{x}_i, \ell)$$

where only a small number of the $\alpha_{i,\ell}$ are nonzero. Those $\boldsymbol{\phi}(\boldsymbol{x}_i, \ell)$ are called "support vectors" because they "support" the decision boundary.

$$\hat{\mathbf{w}} \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell') = \sum_{(i,\ell) \in \mathcal{S}} \alpha_{i,\ell} \, \boldsymbol{\phi}(\boldsymbol{x}_i, \ell) \cdot \boldsymbol{\phi}(\boldsymbol{x}, \ell')$$

See Crammer and Singer (2001) for the multiclass version.

Really good tool: SVM^light, http://svmlight.joachims.org
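Not from the slides: a sketch of scoring with the dual (support-vector) form above; the support list and phi are assumed, and the explicit dot product is exactly the spot a kernel could replace.

```python
import numpy as np

def svm_score(support, phi, x, label):
    """Compute w_hat · phi(x, label) from the nonzero dual weights.

    support: list of (alpha, x_i, ell) triples with alpha != 0;
    phi(x, label) -> np.ndarray.
    """
    f = phi(x, label)
    # Each term is alpha_{i,ell} * ( phi(x_i, ell) · phi(x, label) ).
    return sum(alpha * (phi(x_i, ell) @ f) for alpha, x_i, ell in support)
```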
Support Vector Machines: Remarks

◮ Regularization is critical; squared $\ell_2$ is most common, and often used in (yet another) motivation around the idea of "maximizing margin" around the hyperplane separator.
◮ Often, instead of linear models that explicitly calculate $\mathbf{w} \cdot \boldsymbol{\phi}$, these methods are "kernelized" and rearrange all calculations to involve inner products between $\boldsymbol{\phi}$ vectors.
◮ Examples (sketched in code below):
  $$K_{\text{linear}}(\mathbf{v}, \mathbf{w}) = \mathbf{v} \cdot \mathbf{w}$$
  $$K_{\text{polynomial}}(\mathbf{v}, \mathbf{w}) = (\mathbf{v} \cdot \mathbf{w} + 1)^p$$
  $$K_{\text{Gaussian}}(\mathbf{v}, \mathbf{w}) = \exp\left( -\frac{\|\mathbf{v} - \mathbf{w}\|_2^2}{2\sigma^2} \right)$$
◮ Linear kernels are most common in NLP.
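A minimal NumPy rendering of those three kernels (not from the slides; p and σ are hyperparameters you would choose):

```python
import numpy as np

def k_linear(v, w):
    return v @ w

def k_polynomial(v, w, p=2):
    return (v @ w + 1.0) ** p

def k_gaussian(v, w, sigma=1.0):
    # exp(-||v - w||_2^2 / (2 sigma^2))
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))
```

Plugging any of these in for the explicit dot product in the dual scoring sketch above would give a kernelized score.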
General Remarks

◮ Text classification: many problems, all solved with supervised learners.
◮ Lexicon features can provide problem-specific guidance.
◮ Naïve Bayes, log-linear, and SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.
◮ You should have a basic understanding of the tradeoffs in choosing among them.
◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.
◮ Lots of papers about neural networks, but with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).
Readings and Reminders

◮ Jurafsky and Martin (2015); Collins (2011).
◮ Submit a suggestion for an exam question by Friday at 5pm.
References

Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm, 2011. URL http://www.cs.columbia.edu/~mcollins/em.pdf.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.

Daniel Jurafsky and James H. Martin. Classification: Naive Bayes, logistic regression, sentiment (draft chapter), 2015. URL https://web.stanford.edu/~jurafsky/slp3/7.pdf.

Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015. URL http://www.aclweb.org/anthology/D/D15/D15-1251.pdf.