Statistical NLP, Spring 2011
Lecture 11: Classification
Dan Klein – UC Berkeley


  1. Classification
     - Automatically make a decision about inputs
       - Example: document → category
       - Example: image of digit → digit
       - Example: image of object → object type
       - Example: query + webpages → best match
       - Example: symptoms → diagnosis
       - …
     - Three main ideas:
       - Representation as feature vectors / kernel functions
       - Scoring by linear functions
       - Learning by optimization

     Example: Text Classification
     - We want to classify documents into semantic categories, e.g. the candidate set SPORTS, POLITICS, OTHER:
       - "… win the election …" → POLITICS
       - "… win the game …" → SPORTS
       - "… see a movie …" → OTHER
     - Classically, we do this on the basis of counts of words in the document, but other information sources are relevant:
       - Document length
       - Document's source
       - Document layout
       - Document sender
       - …

     Some Definitions
     - Inputs: the objects to be classified, e.g. the document "… win the election …"
     - Candidate set: the possible outputs, e.g. SPORTS, POLITICS, OTHER
     - Candidates: individual outputs y
     - True outputs: the correct label for each input, e.g. POLITICS
     - Feature vectors: indicator features conjoining an output with properties of the input, e.g. POLITICS ∧ "election", SPORTS ∧ "win", POLITICS ∧ "win"
     - Remember: if y contains x, we also write f(y) for f(x, y)

     Feature Vectors
     - Example: web page ranking (not actually classification): candidate pages for a query such as x_i = "Apple Computers"

     Block Feature Vectors
     - Sometimes we think of the input as having features (e.g. "win", "election") which are multiplied by outputs to form the candidates' feature vectors
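
To make the block feature vector idea concrete, here is a minimal Python sketch (my own illustration, not from the slides) of a joint feature function f(x, y) that conjoins each word of a document with a candidate label; the label set and feature names mirror the toy example above.

```python
from collections import Counter

def features(doc_words, label):
    """Block feature vector: conjoin each input feature (a word) with the label."""
    return Counter((label, w) for w in doc_words)

doc = "win the election".split()
labels = ["POLITICS", "SPORTS", "OTHER"]

# Each candidate label y yields its own block of active features f(x, y).
for y in labels:
    print(y, dict(features(doc, y)))
# e.g. the POLITICS block contains ('POLITICS', 'win'), ('POLITICS', 'election'), ...
```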

  2. Linear Models: Scoring
     - In a linear model, each feature gets a weight w
     - We score hypotheses by multiplying features and weights: score(x, y; w) = w · f(x, y)
     - We've said nothing about where the weights come from!

     Linear Models: Decision Rule
     - The linear decision rule: predict the highest-scoring candidate, argmax_y w · f(x, y) (code sketch below)

     Binary Classification
     - Important special case: binary classification
     - Classes are y = +1 / -1, e.g. +1 = SPAM, -1 = HAM
     - Example weights: BIAS: -3, "free": 4, "money": 2
     - The decision boundary is a hyperplane (a line in the slide's two-feature plot over "free" and "money" counts)

     Multiclass Decision Rule
     - If there are more than two classes:
       - Highest score wins
       - Boundaries are more complex
       - Harder to visualize
     - There are other ways: e.g. reconcile pairwise decisions

     Learning Classifier Weights
     - Two broad approaches to learning weights:
     - Generative: work with a probabilistic model of the data
       - Advantages: learning weights is easy, smoothing is well understood, backed by an understanding of modeling
     - Discriminative: set weights based on some error-related criterion
       - Advantages: error-driven; often the weights which are good for classification aren't the ones which best describe the data
     - We'll mainly talk about the latter for now

     Linear Models: Naïve Bayes
     - (Multinomial) Naïve Bayes is a linear model whose weights are (log) local conditional probabilities (sketch below)
     - Graphically: a class variable y with children d_1, d_2, ..., d_n (the words of the document)
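
A small sketch of linear scoring and the argmax decision rule (my own illustration; the weight values are made up):

```python
from collections import Counter

def features(doc_words, label):
    """Joint feature vector f(x, y): conjoin each word with the candidate label."""
    return Counter((label, w) for w in doc_words)

def score(weights, doc_words, label):
    """Linear score: w . f(x, y)."""
    return sum(weights.get(feat, 0.0) * count
               for feat, count in features(doc_words, label).items())

def predict(weights, doc_words, labels):
    """Linear decision rule: return the highest-scoring candidate label."""
    return max(labels, key=lambda y: score(weights, doc_words, y))

# Hypothetical weights, just for illustration.
weights = {("POLITICS", "election"): 2.0, ("SPORTS", "win"): 1.0,
           ("SPORTS", "game"): 2.0, ("POLITICS", "win"): 0.5}

print(predict(weights, "win the election".split(), ["POLITICS", "SPORTS", "OTHER"]))
# -> POLITICS
```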

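To see why (multinomial) Naïve Bayes is a linear model, here is a tiny sketch (not from the slides): if the weights are set to the log prior and log conditional probabilities, the linear score w · f(x, y) equals the joint log-probability log P(y, d_1 … d_n). The probabilities below are hypothetical.

```python
import math

# Hypothetical Naive Bayes parameters for a tiny three-word vocabulary.
prior = {"POLITICS": 0.5, "SPORTS": 0.5}
cond = {("POLITICS", "election"): 0.4, ("POLITICS", "win"): 0.3, ("POLITICS", "game"): 0.3,
        ("SPORTS", "election"): 0.1, ("SPORTS", "win"): 0.4, ("SPORTS", "game"): 0.5}

# Naive Bayes as a linear model: the weights are (log) local conditional probabilities.
weights = {("BIAS", y): math.log(p) for y, p in prior.items()}
weights.update({k: math.log(p) for k, p in cond.items()})

def nb_score(doc_words, y):
    """w . f(x, y) = log P(y) + sum_i log P(d_i | y) = log P(y, d_1..d_n)."""
    return weights[("BIAS", y)] + sum(weights[(y, w)] for w in doc_words)

doc = ["win", "election"]
for y in prior:
    direct = math.log(prior[y]) + sum(math.log(cond[(y, w)]) for w in doc)
    print(y, round(nb_score(doc, y), 4), round(direct, 4))   # the two scores match
```
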
  3. Example: Sensors
     - Two sensors report on the weather, which is either raining (r) or sunny (s); the true joint distribution over the two readings and the weather is:
       P(+,+,r) = 3/8   P(-,-,r) = 1/8
       P(+,+,s) = 1/8   P(-,-,s) = 3/8
     - NB FACTORS: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
     - PREDICTIONS: P(r,+,+) = (1/2)(3/4)(3/4), P(s,+,+) = (1/2)(1/4)(1/4), so P(r|+,+) = 9/10 and P(s|+,+) = 1/10
     - Note that the true joint gives P(r|+,+) = 3/4: Naïve Bayes is overconfident because it treats the correlated sensors as independent

     Example: Stoplights
     - A pair of stoplights is either working (w) or broken (b); when broken, both lights are stuck on red
     - NB FACTORS: P(b) = 1/7, P(w) = 6/7, P(r|w) = 1/2, P(g|w) = 1/2, P(r|b) = 1, P(g|b) = 0

     Example: Stoplights (continued)
     - What does the model say when both lights are red?
       - P(b,r,r) = (1/7)(1)(1) = 4/28
       - P(w,r,r) = (6/7)(1/2)(1/2) = 6/28
       - P(w|r,r) = 6/10!
     - We'll guess that (r,r) indicates the lights are working!
     - Imagine P(b) were boosted higher, to 1/2:
       - P(b,r,r) = (1/2)(1)(1) = 4/8
       - P(w,r,r) = (1/2)(1/2)(1/2) = 1/8
       - P(w|r,r) = 1/5!
     - Changing the parameters bought accuracy at the expense of data likelihood

     How to Pick Weights?
     - Goal: choose the "best" vector w given training data
       - For now, "best" means best for classification
     - The ideal: the weights with the greatest test set accuracy / F1 / whatever
       - But we don't have the test set
       - We must compute weights from the training set
     - Maybe we want the weights which give the best training set accuracy?
       - A hard, discontinuous optimization problem
       - May not (does not) generalize to the test set
       - Easy to overfit
     - Though, min-error training for MT does exactly this

     Minimize Training Error?
     - A loss function declares how costly each mistake is
       - E.g. 0 loss for the correct label, 1 loss for a wrong label
       - Mistakes can be weighted differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)
     - We could, in principle, minimize training loss
     - This is a hard, discontinuous optimization problem

     Linear Models: Perceptron
     - The perceptron algorithm
       - Iteratively processes the training set, reacting to training errors
       - Can be thought of as trying to drive down training error
     - The (online) perceptron algorithm (a code sketch follows below):
       - Start with zero weights w
       - Visit training instances one by one, and try to classify each
       - If correct: no change!
       - If wrong: adjust weights
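
For reference, the Naïve Bayes numbers in the sensors example at the top of this page can be checked directly; this is just the arithmetic behind the slide, written out.

```latex
% True posterior from the joint distribution (the two sensors are perfectly correlated):
P(r \mid +,+) \;=\; \frac{P(+,+,r)}{P(+,+,r)+P(+,+,s)} \;=\; \frac{3/8}{3/8+1/8} \;=\; \frac{3}{4}

% Naive Bayes treats the sensors as independent given the weather, so it is overconfident:
P_{\mathrm{NB}}(r \mid +,+) \;=\; \frac{P(r)\,P(+\mid r)^2}{P(r)\,P(+\mid r)^2 + P(s)\,P(+\mid s)^2}
\;=\; \frac{(1/2)(3/4)^2}{(1/2)(3/4)^2 + (1/2)(1/4)^2} \;=\; \frac{9}{10}
```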

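Here is a minimal multiclass perceptron sketch (my own illustration, not the course code), using the same hypothetical joint feature function as in the earlier snippets: start from zero weights and, on each mistake, add the true candidate's features and subtract the guessed candidate's.

```python
from collections import Counter

def features(doc_words, label):
    """Joint features f(x, y): conjoin each word with the candidate label."""
    return Counter((label, w) for w in doc_words)

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def perceptron(data, labels, epochs=10):
    """Online perceptron: start at zero; on each mistake, move the weights
    toward the true label's features and away from the predicted label's."""
    weights = Counter()
    for _ in range(epochs):
        for doc_words, y_true in data:
            y_hat = max(labels, key=lambda y: score(weights, features(doc_words, y)))
            if y_hat != y_true:                        # wrong: adjust weights
                weights.update(features(doc_words, y_true))
                weights.subtract(features(doc_words, y_hat))
    return weights

# Hypothetical training set, just for illustration.
data = [("win the election".split(), "POLITICS"),
        ("win the game".split(), "SPORTS"),
        ("see a movie".split(), "OTHER")]
w = perceptron(data, ["POLITICS", "SPORTS", "OTHER"])
print(w[("POLITICS", "election")], w[("SPORTS", "game")])   # two of the learned weights
```
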
  4. Example: "Best" Web Page
     - (Figure: ranking candidate web pages for the query x_i = "Apple Computers")

     Examples: Perceptron
     - Separable case (figure)
     - Non-separable case (figure)

     Perceptrons and Separability
     - A data set is separable if some parameters classify it perfectly
     - Convergence: if the training data are separable, the perceptron will separate them (binary case)
     - Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability

     Issues with Perceptrons
     - The perceptron's "goal" is just to separate the training data:
       1. This may be an entire feasible space
       2. Or it may be impossible

     Problems with Perceptrons
     - Overtraining: test / held-out accuracy usually rises, then falls
       - Overtraining isn't quite as bad as overfitting, but it is similar
     - Regularization: if the data isn't separable, the weights often thrash around
       - Averaging weight vectors over time can help (the averaged perceptron; sketch below)
       - [Freund & Schapire 99, Collins 02]
     - Mediocre generalization: the perceptron finds a "barely" separating solution
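
The weight-averaging fix mentioned under "Problems with Perceptrons" can be sketched as follows (a toy illustration, not the referenced papers' exact formulation): return the average of the weight vectors seen over time rather than the final one.

```python
from collections import Counter

def averaged_perceptron(data, epochs=10):
    """Binary averaged perceptron over sparse feature dicts.
    data: list of (features, y) pairs with y in {+1, -1}."""
    w, totals, steps = Counter(), Counter(), 0
    for _ in range(epochs):
        for feats, y in data:
            steps += 1
            pred = 1 if sum(w.get(f, 0.0) * v for f, v in feats.items()) > 0 else -1
            if pred != y:                       # mistake: move weights toward y * f(x)
                for f, v in feats.items():
                    w[f] += y * v
            for f, v in w.items():              # accumulate for the running average
                totals[f] += v
    return {f: totals[f] / steps for f in totals}

# Hypothetical spam data using BIAS, "free", "money" features like the slides' example.
data = [({"BIAS": 1, "free": 1, "money": 1}, +1),
        ({"BIAS": 1, "meeting": 1}, -1),
        ({"BIAS": 1, "free": 1}, +1),
        ({"BIAS": 1, "report": 1, "money": 1}, -1)]
print(averaged_perceptron(data))
```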

  5. Objective Functions
     - What do we want from our weights?
       - Depends!
       - So far: minimize (training) errors
       - This is the "zero-one loss"
         - Discontinuous; minimizing it is NP-complete
         - Not really what we want anyway
       - Maximum entropy and SVMs have other objectives related to the zero-one loss

     Linear Separators
     - Which of these linear separators is optimal? (figure: several separating lines for the same data)

     Classification Margin (Binary)
     - The distance of x_i to the separator is its margin, m_i
     - Examples closest to the hyperplane are support vectors
     - The margin γ of the separator is the minimum m_i

     Classification Margin
     - For each example x_i and possible mistaken candidate y, we avoid that mistake by a margin m_i(y) (with zero-one loss)
     - The margin γ of the entire separator is the minimum m_i(y)
     - It is also the largest γ for which the margin constraints hold (see the formulation below)

     Maximum Margin
     - Separable SVMs: find the max-margin w (figure: maximum-margin separator with its support vectors)
     - Can stick this into Matlab and (slowly) get an SVM
     - Won't work (well) if non-separable

     Why Max Margin?
     - Why do this? Various arguments:
       - The solution depends only on the boundary cases, or support vectors (but remember how this diagram is broken!)
       - The solution is robust to movement of the support vectors
       - Sparse solutions (features not in support vectors get zero weight)
       - Generalization bound arguments
       - Works well in practice for many problems
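
The margin constraints and max-margin objective were figures on the original slides; the following is the standard separable formulation, reconstructed here for reference (the notation is mine and may differ in detail from the slides).

```latex
% Margin constraints: every example's true candidate outscores every other candidate by at least gamma
\forall i,\ \forall y \ne y_i^*:\quad
    w^\top f(x_i, y_i^*) \;\ge\; w^\top f(x_i, y) + \gamma

% Separable max-margin problem: find the unit-norm weights with the largest such gamma
\max_{\gamma,\;\|w\|=1}\ \gamma
    \quad\text{s.t.}\quad
    w^\top f(x_i, y_i^*) - w^\top f(x_i, y) \;\ge\; \gamma \quad \forall i,\ y \ne y_i^*
```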

  6. Max Margin / Small Norm
     - Reformulation: find the smallest w which separates the data (remember the margin constraints above?)
     - γ scales linearly in w, so if ||w|| isn't constrained, we can take any separating w and scale up our margin
     - Instead of fixing the scale of w, we can fix γ = 1

     Soft Margin Classification
     - What if the training set is not linearly separable?
     - Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples, resulting in a soft-margin classifier
     - Note: other choices of how to penalize the slacks exist!

     Maximum Margin (Non-Separable)
     - Non-separable SVMs:
       - Add slack to the constraints
       - Make the objective pay (linearly) for slack (see the sketch after this page)
     - C is called the capacity of the SVM – the smoothing knob
     - Learning:
       - Can still stick this into Matlab if you want
       - Constrained optimization is hard; better methods exist!
       - We'll come back to this later

     Linear Models: Maximum Entropy
     - Maximum entropy (logistic regression)
     - Use the scores as probabilities: exponentiate to make them positive, then normalize (sketch below)
     - Maximize the (log) conditional likelihood of the training data

     Maximum Entropy II
     - Motivation for maximum entropy:
       - Connection to the maximum entropy principle (sort of)
       - We might want to do a good job of being uncertain on noisy cases…
       - … in practice, though, the posteriors are pretty peaked
     - Regularization (smoothing)
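
A minimal sketch of the maximum entropy / logistic regression recipe described above (my own illustration, not the course code): exponentiate the linear scores, normalize them into P(y | x), and follow the gradient of the log conditional likelihood; regularization is omitted here. The feature function and toy data repeat the earlier hypothetical examples.

```python
import math
from collections import Counter

def features(doc_words, label):
    """Joint features f(x, y): conjoin each word with the candidate label."""
    return Counter((label, w) for w in doc_words)

def probs(weights, doc_words, labels):
    """Softmax over linear scores: P(y | x) proportional to exp(w . f(x, y))."""
    scores = {y: sum(weights.get(f, 0.0) * v
                     for f, v in features(doc_words, y).items()) for y in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def train_maxent(data, labels, lr=0.5, epochs=50):
    """Stochastic gradient ascent on the log conditional likelihood:
    gradient = observed features - expected features under the model."""
    w = Counter()
    for _ in range(epochs):
        for doc_words, y_true in data:
            p = probs(w, doc_words, labels)
            for f, v in features(doc_words, y_true).items():    # observed counts
                w[f] += lr * v
            for y in labels:                                     # expected counts
                for f, v in features(doc_words, y).items():
                    w[f] -= lr * p[y] * v
    return w

data = [("win the election".split(), "POLITICS"),
        ("win the game".split(), "SPORTS"),
        ("see a movie".split(), "OTHER")]
w = train_maxent(data, ["POLITICS", "SPORTS", "OTHER"])
print(probs(w, "win the election".split(), ["POLITICS", "SPORTS", "OTHER"]))
```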

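The slides treat the non-separable SVM as a constrained problem ("stick this into Matlab"); as an alternative illustration, here is a sketch of the standard equivalent unconstrained form, in which the linear slack penalty becomes a hinge loss, minimized by plain (slow) batch subgradient descent. The toy data, C, and step size are made up.

```python
def train_soft_margin_svm(data, dim, C=1.0, lr=0.01, epochs=500):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i)),
    where the hinge term plays the role of the slack xi_i."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                                        # gradient of the 0.5*||w||^2 term
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:  # hinge active: this point has slack
                for j in range(dim):
                    grad[j] -= C * y * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy 2D points with a bias feature appended as a third coordinate; one noisy example.
data = [([2.0, 2.0, 1.0], +1), ([1.5, 2.5, 1.0], +1),
        ([0.5, 0.5, 1.0], -1), ([1.0, 0.2, 1.0], -1),
        ([2.0, 1.8, 1.0], -1)]                                # the "difficult or noisy" example
print(train_soft_margin_svm(data, dim=3))
```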