Structured Perceptron/ Margin Methods Graham Neubig Site - PowerPoint PPT Presentation

CS11-747 Neural Networks for NLP Structured Perceptron/ Margin Methods Graham Neubig Site https://phontron.com/class/nn4nlp2020/

Types of Prediction
• Two classes (binary classification): "I hate this movie" → positive / negative
• Multiple classes (multi-class classification): "I hate this movie" → very good / good / neutral / bad / very bad
• Exponential/infinite labels (structured prediction): "I hate this movie" → PRP VBP DT NN; "I hate this movie" → "kono eiga ga kirai"

Many Varieties of Structured Prediction!
• Models:
  • RNN-based decoders (covered already)
  • Convolution/self attentional decoders (covered already)
  • CRFs w/ local factors
• Training algorithms:
  • Maximum likelihood w/ teacher forcing
  • Sequence level likelihood (covered today)
  • Structured perceptron, structured large margin (covered today)
  • Reinforcement learning/minimum risk training
  • Sampling corruptions of data

Reminder: Globally Normalized Models
• Locally normalized models: each decision made by the model has a probability that adds to one

$$P(Y \mid X) = \prod_{j=1}^{|Y|} \frac{e^{S(y_j \mid X, y_1, \ldots, y_{j-1})}}{\sum_{\tilde{y}_j \in V} e^{S(\tilde{y}_j \mid X, y_1, \ldots, y_{j-1})}}$$

• Globally normalized models (a.k.a. energy-based models): each sentence has a score, which is not normalized over a particular decision

$$P(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}$$
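The two normalization schemes above can be compared on a toy problem. This is a minimal sketch, not the lecture's code: the vocabulary `VOCAB`, the scoring functions `local_score` and `global_score`, and the restriction to sequences of fixed length `LENGTH` (so that V* becomes finite and enumerable) are all assumptions made for illustration.

```python
import itertools
import math

VOCAB = ["a", "b"]  # hypothetical toy label vocabulary
LENGTH = 2          # assumption: fixed-length outputs, so the candidate set is finite


def local_score(y, prefix):
    """Hypothetical step score S(y_j | X, y_1..y_{j-1}); the input X is ignored."""
    return 1.0 if (prefix and prefix[-1] == y) else 0.0


def locally_normalized_prob(seq):
    """Product over steps of a softmax taken over the vocabulary at each step."""
    prob = 1.0
    for j, y in enumerate(seq):
        prefix = seq[:j]
        z_j = sum(math.exp(local_score(v, prefix)) for v in VOCAB)
        prob *= math.exp(local_score(y, prefix)) / z_j
    return prob


def global_score(seq):
    """Hypothetical sentence-level score S(X, Y): here, the sum of step scores."""
    return sum(local_score(y, seq[:j]) for j, y in enumerate(seq))


def globally_normalized_prob(seq):
    """A single softmax over every candidate sequence (the partition function)."""
    candidates = itertools.product(VOCAB, repeat=LENGTH)
    z = sum(math.exp(global_score(c)) for c in candidates)
    return math.exp(global_score(seq)) / z
```

Both functions define a valid distribution over the candidate sequences. The difference is where the normalizer sits: once per decision in the local model, once per whole sequence in the global model.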

Globally Normalized Likelihood

Difficulties Training Globally Normalized Models
• Partition function problematic

$$P(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}$$

• Two options for calculating partition function
  • Structure model to allow enumeration via dynamic programming, e.g. linear chain CRF, CFG
  • Estimate partition function through sub-sampling hypothesis space
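The dynamic-programming option can be sketched for a linear chain CRF with the forward algorithm. This is a hedged illustration under stated assumptions: the score tables `EMIT` and `TRANS` are made-up numbers, and the model is assumed to decompose into per-position emission scores plus tag-to-tag transition scores. A brute-force enumeration is included only to check the result on a tiny example.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition_forward(emit, trans):
    """Forward algorithm: emit[t][k] scores tag k at position t,
    trans[i][k] scores tag i followed by tag k."""
    num_tags = len(trans)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            emit[t][k]
            + log_sum_exp([alpha[i] + trans[i][k] for i in range(num_tags)])
            for k in range(num_tags)
        ]
    return log_sum_exp(alpha)


def log_partition_brute_force(emit, trans):
    """Enumerate every tag sequence; exponential in length, for checking only."""
    num_tags = len(trans)
    scores = []
    for tags in itertools.product(range(num_tags), repeat=len(emit)):
        s = sum(emit[t][k] for t, k in enumerate(tags))
        s += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
        scores.append(s)
    return log_sum_exp(scores)


# Hypothetical scores for a 3-word sentence and 2 tags.
EMIT = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9]]
TRANS = [[0.2, -0.1], [0.0, 0.6]]
```

The forward pass costs on the order of (sentence length) × (number of tags)², while enumeration grows exponentially with sentence length. That gap is why the model structure has to allow this kind of factorization.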

Two Methods for Approximation
• Sampling:
  • Sample k samples according to the probability distribution
  • + Unbiased estimator: as k gets large will approach true distribution
  • - High variance: what if we get low-probability samples?
• Beam search:
  • Search for k best hypotheses
  • - Biased estimator: may result in systematic differences from true distribution
  • + Lower variance: more likely to get high-probability outputs
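The bias/variance trade-off above can be seen on a toy partition function. This sketch makes two simplifying assumptions that differ from the slide. Samples are drawn uniformly and reweighted, rather than drawn from the model distribution. An exact top-k over a fully enumerated list stands in for beam search. The scores in `SCORES` are invented for illustration.

```python
import math
import random

# Hypothetical scores S(X, Y) for a tiny, fully enumerable hypothesis space.
SCORES = [2.0, 1.5, 0.3, 0.0, -0.5, -1.0, -2.0, -3.0]
TRUE_Z = sum(math.exp(s) for s in SCORES)


def sampled_partition(scores, k, rng):
    """Unbiased but noisy: draw k hypotheses uniformly, rescale the mean weight."""
    draws = [rng.choice(scores) for _ in range(k)]
    return len(scores) * sum(math.exp(s) for s in draws) / k


def top_k_partition(scores, k):
    """Biased but stable: sum over only the k best hypotheses, never above Z."""
    best = sorted(scores, reverse=True)[:k]
    return sum(math.exp(s) for s in best)


rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = [sampled_partition(SCORES, 3, rng) for _ in range(2000)]
mean_sampled = sum(sampled) / len(sampled)
top_k = top_k_partition(SCORES, 3)
```

Individual sampled estimates scatter widely around `TRUE_Z`, but their average approaches it. The top-k estimate is the same on every run and systematically misses the mass of the hypotheses it dropped.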

Un-normalized Models: Structured Perceptron
