Algorithms for NLP: Classification. Sachin Kumar. Slides: Dan Klein (UC Berkeley), Taylor Berg-Kirkpatrick, Yulia Tsvetkov (CMU)
Classification Image → Digit
Classification Document → Category
Classification Query + Web Pages → Best Match “Apple Computers”
Classification Sentence → Parse Tree (x = “The screen was a sea of red”, y = its parse tree)
Classification Sentence → Translation
Classification ▪ Three main ideas ▪ Representation as feature vectors ▪ Scoring by linear functions ▪ Learning (the scoring functions) by optimization
Some Definitions ▪ INPUTS: close the ____ ▪ CANDIDATE SET: {table, door, …} ▪ CANDIDATES: table ▪ TRUE OUTPUT: door ▪ FEATURE VECTORS: “close” in x ∧ y=“door”; x_{-1}=“the” ∧ y=“door”; y occurs in x; x_{-1}=“the” ∧ y=“table”
Features
Feature Vectors ▪ Example: web page ranking x i = “Apple Computers”
Block Feature Vectors ▪ Example: x = “… win the election …”, with word-indicator features like “win” and “election” ▪ The feature vector f(x, y) repeats these indicators in a separate block for each candidate class y
Non-Block Feature Vectors ▪ Example: a parse tree’s features may be the productions present in the tree (e.g. S → NP VP, NP → N N, VP → V N) ▪ Different candidates will often share features
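As a concrete illustration of the block-feature idea, a minimal sketch in Python (the word∧label feature scheme and the label names are illustrative, not from the slides):

```python
from collections import Counter

def block_features(x_words, y):
    """f(x, y): conjoin each word of the input with the candidate label y.
    Each label gets its own 'block' of word features."""
    feats = Counter()
    for word in x_words:
        feats[(word, y)] += 1
    return feats

# The same word features fire, but in a different block per candidate y:
f_pol = block_features(["win", "the", "election"], "POLITICS")
f_spo = block_features(["win", "the", "election"], "SPORTS")
```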
Linear Models
Linear Models: Scoring ▪ In a linear model, each feature gets a weight w ▪ We score hypotheses by multiplying features and weights: score(x, y; w) = w · f(x, y) = Σ_k w_k f_k(x, y)
Linear Models: Decision Rule ▪ The linear decision rule: predict(x; w) = argmax_y w · f(x, y)
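A minimal sketch of the scoring and decision rule with sparse dict features (the feature scheme and weight values are made up for illustration):

```python
def features(x_words, y):
    # Hypothetical block features: conjoin each word with the label.
    return {(w, y): 1.0 for w in set(x_words)}

def score(w, feats):
    """w . f(x, y) as a sparse dot product."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def predict(w, x_words, labels):
    """Linear decision rule: argmax over candidate labels."""
    return max(labels, key=lambda y: score(w, features(x_words, y)))

w = {("win", "POLITICS"): 2.0, ("election", "POLITICS"): 1.5,
     ("goal", "SPORTS"): 2.0}
```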
Binary Classification ▪ Important special case: binary classification ▪ Classes are y = +1 / -1 ▪ Decision boundary is a hyperplane
Multiclass Decision Rule ▪ If more than two classes: Highest score wins
Learning
Learning Classifier Weights ▪ Two broad approaches to learning weights ▪ Generative: work with a probabilistic model of the data, weights are (log) local conditional probabilities ▪ Advantages: learning weights is easy, smoothing is well-understood, backed by understanding of modeling ▪ Discriminative: set weights based on some error-related criterion ▪ Advantages: error-driven, often weights which are good for classification aren’t the ones which best describe the data ▪ We’ll mainly talk about the latter for now
How to pick weights? ▪ Goal: choose “best” vector w given training data ▪ For now, we mean “best for classification” ▪ The ideal: the weights which have greatest test set accuracy / F1 / whatever ▪ But, don’t have the test set ▪ Must compute weights from training set ▪ Maybe we want weights which give best training set accuracy?
Minimize Training Error? ▪ A loss function declares how costly each mistake is ▪ E.g. 0 loss for correct label, 1 loss for wrong label ▪ Can weight mistakes differently (e.g. false positives worse than false negatives or Hamming distance over structured labels) ▪ We could, in principle, minimize training loss: ▪ This is a hard, discontinuous optimization problem
Linear Models: Perceptron ▪ The perceptron algorithm ▪ Iteratively processes the training set, reacting to training errors ▪ Can be thought of as trying to drive down training error ▪ The (online) perceptron algorithm: ▪ Start with zero weights w ▪ Visit training instances one by one ▪ Try to classify ▪ If correct, no change! ▪ If wrong: adjust weights
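The steps above can be sketched in Python (multiclass version with sparse features; the feature function passed in is a stand-in):

```python
def perceptron(train, labels, features, epochs=5):
    """Online perceptron: start at zero weights, update only on mistakes."""
    w = {}
    def score(x, y):
        return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())
    for _ in range(epochs):
        for x, y_true in train:
            y_hat = max(labels, key=lambda y: score(x, y))  # try to classify
            if y_hat != y_true:                             # if wrong: adjust
                for k, v in features(x, y_true).items():
                    w[k] = w.get(k, 0.0) + v                # toward the truth
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v                # away from the guess
    return w
```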
Example: “Best” Web Page x i = “Apple Computers”
Examples: Perceptron ▪ Separable Case
Examples: Perceptron ▪ Non-Separable Case
Problems with Perceptron ▪ Perceptron “Goal”: Separate the training data
Objective Functions ▪ What do we want from our weights? ▪ So far: minimize (training) errors: or ▪ This is the “zero-one loss” ▪ Discontinuous, minimizing is NP-complete ▪ Maximum entropy and SVMs have other objectives related to zero-one loss
Margin
Linear Separators ▪ Which of these linear separators is optimal?
Classification Margin (Binary) ▪ Distance of x i to separator is its margin, m i ▪ Examples closest to the hyperplane are support vectors ▪ Margin γ of the separator is the minimum m i
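The geometric margin can be computed directly; a small sketch for the binary case (plain-list vectors, no external libraries):

```python
import math

def geometric_margin(w, b, X, Y):
    """gamma: the minimum over examples of the signed distance
    y_i * (w . x_i + b) / ||w|| to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(X, Y))
```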
Classification Margin ▪ For each example x i and possible mistaken candidate y , we avoid that mistake by a margin m i (y) (with zero-one loss) ▪ Margin γ of the entire separator is the minimum m i (y) ▪ It is also the largest γ for which the following constraints hold
Maximum Margin ▪ Separable SVMs: find the max-margin w ▪ Can stick this into Matlab and (slowly) get an SVM ▪ Won’t work (well) if non-separable
Max Margin / Small Norm ▪ Reformulation: find the smallest w which separates the data (remember this condition?) ▪ The margin γ scales linearly in w: if ||w|| isn’t constrained, we can take any separating w and scale up our margin arbitrarily ▪ Instead of fixing the scale of w, we can fix γ = 1
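Written out (using the slides’ feature-based scoring, with y_i^* the true output), the rescaling argument is:

```latex
% Max-margin form (scale of w fixed by ||w|| = 1):
\max_{\gamma,\ \|w\|=1} \ \gamma
\quad \text{s.t.} \quad
w^\top f(x_i, y_i^*) \ \ge\ w^\top f(x_i, y) + \gamma
\quad \forall i,\ \forall y \ne y_i^*

% Fix the functional margin at 1 instead, and minimize the norm:
\min_{w} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
w^\top f(x_i, y_i^*) \ \ge\ w^\top f(x_i, y) + 1
\quad \forall i,\ \forall y \ne y_i^*
```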
Gamma to w
Soft Margin Classification ▪ What if the training set is not linearly separable? ▪ Slack variables ξ i can be added to allow misclassification of difficult or noisy examples, resulting in a soft margin classifier ξ i ξ i
Maximum Margin (note: there exist other choices of how to penalize slacks!) ▪ Non-separable SVMs ▪ Add slack to the constraints ▪ Make objective pay (linearly) for slack ▪ C is called the capacity of the SVM: the smoothing knob ▪ Learning: can still stick this into Matlab if you want, but constrained optimization is hard; there are better methods!
Hinge Loss ▪ We have a constrained minimization ▪ … but we can solve for ξ i ▪ Giving
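A sketch of that substitution (the standard multiclass hinge form, same notation as the max-margin constraints above):

```latex
% At the optimum, each slack variable is tight:
\xi_i = \max\Bigl(0,\ 1 + \max_{y \ne y_i^*} w^\top f(x_i, y) - w^\top f(x_i, y_i^*)\Bigr)

% Substituting back gives the unconstrained hinge-loss objective:
\min_w \ \tfrac{1}{2}\|w\|^2
  + C \sum_i \max\Bigl(0,\ 1 + \max_{y \ne y_i^*} w^\top f(x_i, y) - w^\top f(x_i, y_i^*)\Bigr)
```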
Why Max Margin? ▪ Why do this? Various arguments: ▪ Solution depends only on the boundary cases, or support vectors ▪ Solution robust to movement of support vectors ▪ Sparse solutions (features not in support vectors get zero weight) ▪ Generalization bound arguments ▪ Works well in practice for many problems
Likelihood
Linear Models: Maximum Entropy ▪ Maximum entropy (logistic regression) ▪ Use the scores as probabilities: exponentiate to make positive, then normalize ▪ Maximize the (log) conditional likelihood of training data
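A minimal sketch of the exponentiate-and-normalize step (sparse dict features as before; subtracting the max score is a standard numerical-stability trick, not something the slides discuss):

```python
import math

def probs(w, x, labels, features):
    """Maxent / logistic regression: P(y|x) proportional to exp(w . f(x, y))."""
    scores = {y: sum(w.get(k, 0.0) * v for k, v in features(x, y).items())
              for y in labels}
    m = max(scores.values())                    # subtract max for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}   # make positive
    z = sum(exps.values())                      # normalizer
    return {y: e / z for y, e in exps.items()}
```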
Maximum Entropy II ▪ Motivation for maximum entropy: ▪ Connection to maximum entropy principle (sort of) ▪ Might want to do a good job of being uncertain on noisy cases … ▪ … in practice, though, posteriors are pretty peaked ▪ Regularization (smoothing)
Maximum Entropy
Loss Comparison
Log-Loss ▪ If we view maxent as a minimization problem (minimizing the negative log conditional likelihood): ▪ This minimizes the “log loss” on each example ▪ One view: log loss is an upper bound on zero-one loss
Remember SVMs: Hinge Loss ▪ Consider the per-instance objective (the plot is really only right in the binary case) ▪ This is called the “hinge loss” ▪ Unlike maxent / log loss, you stop gaining objective once the true label wins by enough ▪ You can start from here and derive the SVM objective ▪ Can solve directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al., 2007)
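A sketch of one sub-gradient step on the per-instance hinge objective (Pegasos additionally shrinks w toward zero for the L2 regularizer on every step; that part is omitted here):

```python
def hinge_subgradient_step(w, feats_by_label, y_true, lr=0.1):
    """One sub-gradient step on
    max_y [w . f(x,y) + 1{y != y_true}] - w . f(x, y_true)."""
    def dot(feats):
        return sum(w.get(k, 0.0) * v for k, v in feats.items())
    # Loss-augmented decoding: every wrong label gets a +1 bonus.
    y_hat = max(feats_by_label,
                key=lambda y: dot(feats_by_label[y]) + (y != y_true))
    if y_hat != y_true:  # hinge active: sub-gradient is f(x,y_hat) - f(x,y_true)
        for k, v in feats_by_label[y_true].items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in feats_by_label[y_hat].items():
            w[k] = w.get(k, 0.0) - lr * v
    return w
```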
Max vs “Soft-Max” Margin ▪ SVMs: You can make this zero ▪ Maxent: … but not this one ▪ Very similar! Both try to make the true score better than a function of the other scores ▪ The SVM tries to beat the augmented runner-up ▪ The Maxent classifier tries to beat the “soft-max”
Loss Functions: Comparison ▪ Zero-One Loss ▪ Hinge ▪ Log
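The three losses side by side, as functions of the binary margin m = y · (w · x) (a small sketch; the base of the log and the exact scaling are conventions):

```python
import math

def zero_one(m):
    # 0 if correctly classified (positive margin), 1 otherwise.
    return 0.0 if m > 0 else 1.0

def hinge(m):
    # Zero once the margin exceeds 1; linear penalty below that.
    return max(0.0, 1.0 - m)

def log_loss(m):
    # Binary logistic (maxent) loss, in nats; smooth, never exactly zero.
    return math.log(1.0 + math.exp(-m))
```

Hinge upper-bounds zero-one loss everywhere; log loss does too when measured in bits rather than nats.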
Separators: Comparison
Structure
Handwriting recognition: x = handwritten word image, y = “brace” ▪ Sequential structure [Slides: Taskar and Klein 05]
CFG Parsing: x = “The screen was a sea of red”, y = its parse tree ▪ Recursive structure
Bilingual Word Alignment: x = sentence pair (“What is the anticipated cost of collecting fees under the new proposal?” / “En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?”), y = the word alignment between them ▪ Combinatorial structure
Definitions INPUTS CANDIDATE SET CANDIDATES TRUE OUTPUTS FEATURE VECTORS
Structured Models ▪ Assumption: score is a sum of local “part” scores, maximized over the space of feasible outputs ▪ Parts = nodes, edges, productions
CFG Parsing: features count productions, e.g. #(NP → DT NN), #(PP → IN NP), #(NN → ‘sea’)
Bilingual word alignment ▪ Features on a candidate alignment link (j, k): ▪ association ▪ position ▪ orthography (figure: alignment grid for the English/French sentence pair)
Efficient Decoding ▪ Common case: you have a black box which computes at least approximately, and you want to learn w ▪ Easiest option is the structured perceptron [Collins 01] ▪ Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A* … ) ▪ Prediction is structured, learning update is not
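A minimal sketch of the structured perceptron around such a black box (`decode` and the feature function are placeholders for the combinatorial search and the part-factored features):

```python
def structured_perceptron(train, decode, features, epochs=5):
    """Collins (2001) structured perceptron.
    decode(w, x) is the black box returning the (approximate)
    argmax over structured outputs y of w . f(x, y)."""
    w = {}
    for _ in range(epochs):
        for x, y_true in train:
            y_hat = decode(w, x)
            if y_hat != y_true:
                # Standard perceptron update on the structured features.
                for k, v in features(x, y_true).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
    return w
```

The learning update is the ordinary perceptron step; all the structure lives inside `decode` (dynamic programming, matchings, ILPs, A*, …).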