Natural Language Processing (CSE 517): Text Classification (II)

  1. Natural Language Processing (CSE 517): Text Classification (II). Noah Smith, © 2016 University of Washington, nasmith@cs.washington.edu. February 1, 2016.

  2. Quick Review: Text Classification
     Input: a piece of text x ∈ V†, usually a document (r.v. X)
     Output: a label from a finite set L (r.v. L)
     Standard line of attack:
     1. Human experts label some data.
     2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier classify : V† → L.
     3. Apply classify to as much data as you want!
     We covered naïve Bayes, reviewed multinomial logistic regression, and, briefly, the perceptron.
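
As a concrete illustration, here is a minimal sketch of that line of attack in Python with scikit-learn. This is not code from the course; the toy documents, labels, and variable names are invented.

    # A minimal sketch of the standard line of attack; toy data invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # 1. Human experts label some data.
    train_texts = ["great movie, loved it", "terrible plot, awful acting",
                   "wonderful and moving", "boring and predictable"]
    train_labels = ["pos", "neg", "pos", "neg"]

    # 2. Feed the data to a supervised learner that constructs classify : V† → L.
    vectorizer = CountVectorizer()            # bag-of-words features
    X_train = vectorizer.fit_transform(train_texts)
    classifier = LogisticRegression()         # multinomial logistic regression
    classifier.fit(X_train, train_labels)

    # 3. Apply classify to as much data as you want!
    X_new = vectorizer.transform(["dull but strangely moving"])
    print(classifier.predict(X_new))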

  3. Multinomial Logistic Regression as “Log Loss”
     p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′))
     MLE can be rewritten as a minimization problem:
     ŵ = argmin_w Σ_{i=1..n} [ log Σ_{ℓ′ ∈ L} exp(w · φ(x_i, ℓ′)) − w · φ(x_i, ℓ_i) ],
     where the log-sum-exp term is “fear” and the subtracted correct-label score is “hope.”
     Recall from lecture 3:
     ◮ Be wise and regularize!
     ◮ Solve with batch or stochastic gradient methods.
     ◮ w_j has an interpretation.
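
A small NumPy sketch of these two formulas (not from the slides); it assumes the feature vectors φ(x, ℓ) have already been computed for every label and are passed in as a dict.

    import numpy as np

    def log_loss(w, phi_by_label, correct_label):
        """Per-example log loss; phi_by_label maps each label ℓ to φ(x, ℓ)."""
        scores = {l: np.dot(w, f) for l, f in phi_by_label.items()}
        log_z = np.logaddexp.reduce(list(scores.values()))  # log Σ_{ℓ′} exp(w · φ(x, ℓ′)), the "fear" term
        return log_z - scores[correct_label]                # minus w · φ(x, ℓ), the "hope" term

    def prob(w, phi_by_label, label):
        """p(L = ℓ | x), computed in log space for numerical stability."""
        scores = {l: np.dot(w, f) for l, f in phi_by_label.items()}
        log_z = np.logaddexp.reduce(list(scores.values()))
        return np.exp(scores[label] - log_z)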

  4. Log Loss and Hinge Loss for (x, ℓ)
     log loss: log Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′)) − w · φ(x, ℓ)
     hinge loss: max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)
     [Figure: both losses in the binary case, plotted against the linear score of the correct label. In purple is the hinge loss, in blue is the log loss, in red is the zero-one loss.]
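
In the binary case shown in the figure, if the incorrect label's linear score is held at zero (an assumption made here for illustration) and “score” denotes w · φ(x, ℓ) for the correct label, the two curves reduce to the functions below. A sketch, not course code.

    import numpy as np

    def log_loss_binary(score):
        # log(exp(score) + exp(0)) − score = log(1 + exp(−score))
        return np.log1p(np.exp(-score))

    def hinge_loss_binary(score):
        # max(score, 0) − score = max(0, −score)
        return np.maximum(0.0, -score)

    for s in [-2.0, 0.0, 2.0]:
        print(s, log_loss_binary(s), hinge_loss_binary(s))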

  5. Minimizing Hinge Loss: Perceptron
     min_w Σ_{i=1..n} [ max_{ℓ′ ∈ L} w · φ(x_i, ℓ′) − w · φ(x_i, ℓ_i) ]
     Stochastic subgradient descent on the above is called the perceptron algorithm.
     ◮ For t ∈ {1, . . . , T}:
       ◮ Pick i_t uniformly at random from {1, . . . , n}.
       ◮ ℓ̂_{i_t} ← argmax_{ℓ ∈ L} w · φ(x_{i_t}, ℓ)
       ◮ w ← w − α (φ(x_{i_t}, ℓ̂_{i_t}) − φ(x_{i_t}, ℓ_{i_t}))
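
A sketch of this loop in Python (not the course's code); it assumes a feature function phi(x, label) returning a NumPy array, a list of labeled examples, and a fixed number of updates T.

    import random
    import numpy as np

    def perceptron(examples, labels, phi, dim, T=10000, alpha=1.0):
        """examples: list of (x, correct_label) pairs; labels: the label set L."""
        w = np.zeros(dim)
        for t in range(T):
            x_i, l_i = random.choice(examples)                         # pick i_t uniformly at random
            l_hat = max(labels, key=lambda l: np.dot(w, phi(x_i, l)))  # argmax_{ℓ} w · φ(x, ℓ)
            if l_hat != l_i:                                           # if correct, the subgradient below is zero
                w = w - alpha * (phi(x_i, l_hat) - phi(x_i, l_i))      # w ← w − α(φ(x, ℓ̂) − φ(x, ℓ))
        return w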

  6. Error Costs
     Suppose that not all mistakes are equally bad, e.g., false positives vs. false negatives in spam detection.
     Let cost(ℓ, ℓ′) quantify the “badness” of substituting ℓ′ for the correct label ℓ.
     Intuition: estimate the scoring function so that score(ℓ_i) − score(ℓ̂) ∝ cost(ℓ_i, ℓ̂).
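
For instance, a spam filter's cost function might look like the toy table below. The numbers are invented; the point is only that the cost is asymmetric.

    # Invented, asymmetric costs for spam detection: cost(ℓ, ℓ′) ≠ cost(ℓ′, ℓ).
    COST = {
        ("ham", "ham"):   0.0,
        ("spam", "spam"): 0.0,
        ("spam", "ham"):  1.0,    # false negative: spam slips into the inbox
        ("ham", "spam"):  10.0,   # false positive: real mail buried in the spam folder
    }

    def cost(correct_label, predicted_label):
        return COST[(correct_label, predicted_label)]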

  7. General Hinge Loss for (x, ℓ)
     max_{ℓ′ ∈ L} [ w · φ(x, ℓ′) + cost(ℓ, ℓ′) ] − w · φ(x, ℓ)
     In the binary case, with cost(−1, 1) = 1:
     [Figure: −x + max(x, 1) plotted as a function of x. In blue is the general hinge loss; in red is the “zero-one” loss (error).]
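
A sketch of this loss for a single example (not from the slides), reusing hypothetical phi and cost functions like those sketched above.

    import numpy as np

    def general_hinge_loss(w, x, correct_label, labels, phi, cost):
        # max_{ℓ′} [ w · φ(x, ℓ′) + cost(ℓ, ℓ′) ]: cost-augmented decoding
        augmented = max(np.dot(w, phi(x, l)) + cost(correct_label, l) for l in labels)
        # minus the score of the correct label
        return augmented - np.dot(w, phi(x, correct_label))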

  8. Support Vector Machines
     A different motivation for the generalized hinge:
     ŵ = Σ_{i=1..n} Σ_{ℓ ∈ L} α_{i,ℓ} · φ(x_i, ℓ),
     where only a small number of the α_{i,ℓ} are nonzero.
     Those φ(x_i, ℓ) are called “support vectors” because they “support” the decision boundary.
     ŵ · φ(x, ℓ′) = Σ_{(i,ℓ) ∈ S} α_{i,ℓ} · φ(x_i, ℓ) · φ(x, ℓ′)
     See Crammer and Singer (2001) for the multiclass version.
     Really good tool: SVMlight, http://svmlight.joachims.org
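
A sketch of scoring in this dual form (hypothetical names: `support` stands for the pairs in S, each carrying its nonzero α and the stored feature vector φ(x_i, ℓ)).

    import numpy as np

    def score_dual(support, phi_x_label):
        """support: list of (alpha, phi_i) pairs for (i, ℓ) ∈ S; phi_x_label: φ(x, ℓ′)."""
        # ŵ · φ(x, ℓ′) = Σ_{(i,ℓ)∈S} α_{i,ℓ} · (φ(x_i, ℓ) · φ(x, ℓ′))
        return sum(alpha * np.dot(phi_i, phi_x_label) for alpha, phi_i in support)

In a kernelized SVM, the inner product φ(x_i, ℓ) · φ(x, ℓ′) would simply be replaced by a kernel evaluation (next slide).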

  9. Support Vector Machines: Remarks
     ◮ Regularization is critical; squared ℓ2 is most common, and often used in (yet another) motivation around the idea of “maximizing margin” around the hyperplane separator.
     ◮ Often, instead of linear models that explicitly calculate w · φ, these methods are “kernelized” and rearrange all calculations to involve inner products between φ vectors.
     ◮ Examples:
       K_linear(v, w) = v · w
       K_polynomial(v, w) = (v · w + 1)^p
       K_Gaussian(v, w) = exp(−‖v − w‖₂² / (2σ²))
     ◮ Linear kernels are most common in NLP.
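
The three kernels above, written out as NumPy one-liners (a sketch; v and w are dense feature vectors, and p and σ are hyperparameters chosen here only as defaults).

    import numpy as np

    def k_linear(v, w):
        return np.dot(v, w)

    def k_polynomial(v, w, p=2):
        return (np.dot(v, w) + 1.0) ** p

    def k_gaussian(v, w, sigma=1.0):
        # exp(−‖v − w‖₂² / (2σ²))
        return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))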

  10. General Remarks
      ◮ Text classification: many problems, all solved with supervised learners.
      ◮ Lexicon features can provide problem-specific guidance.
      ◮ Naïve Bayes, log-linear, and SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.
      ◮ You should have a basic understanding of the tradeoffs in choosing among them.
      ◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.
      ◮ Lots of papers about neural networks, but with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).

  11. Readings and Reminders
      ◮ Jurafsky and Martin (2015); Collins (2011)
      ◮ Submit a suggestion for an exam question by Friday at 5pm.

  12. References I
      Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm, 2011. URL http://www.cs.columbia.edu/~mcollins/em.pdf
      Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.
      Daniel Jurafsky and James H. Martin. Classification: Naive Bayes, logistic regression, sentiment (draft chapter), 2015. URL https://web.stanford.edu/~jurafsky/slp3/7.pdf
      Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015. URL http://www.aclweb.org/anthology/D/D15/D15-1251.pdf
