Introduction to Classification and Sequence Labeling
Grzegorz Chrupała
Spoken Language Systems, Saarland University
Annual IRTG Meeting 2009
Outline
1 Preliminaries
2 Bayesian Decision Theory
  ◮ Discriminant functions and decision surfaces
3 Parametric models and parameter estimation
4 Non-parametric techniques
  ◮ K-Nearest neighbors classifier
5 Linear models
  ◮ Perceptron
  ◮ Large margin and kernel methods
  ◮ Logistic regression (Maxent)
6 Sequence labeling and structure prediction
  ◮ Maximum Entropy Markov Models
  ◮ Sequence perceptron
  ◮ Conditional Random Fields
Notes on notation
We learn functions from examples of inputs and outputs.
The inputs are usually denoted as x ∈ X; the outputs are y ∈ Y.
The most common and well-studied scenario is classification: we learn to map arbitrary objects into a small number of fixed classes.
Our arbitrary input objects have to be represented somehow: we have to extract the features of the object which are useful for determining which output it should map to.
Feature function
Objects are represented using the feature function, also known as the representation function.
The most commonly used representation is a d-dimensional vector of real values:

  Φ : X → R^d                        (1)
  Φ(x) = (f_1, f_2, …, f_d)^T        (2)

We will often simplify notation by assuming that input objects are already vectors in d-dimensional real space.
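To make this concrete, here is a minimal sketch (not from the slides) of a feature function in Python; the particular features, and the function name phi, are invented for illustration.

```python
import numpy as np

def phi(text: str) -> np.ndarray:
    """Hypothetical feature function Phi: X -> R^d for a text input.
    Each component f_i is one hand-picked real-valued feature."""
    tokens = text.lower().split()
    return np.array([
        len(tokens),                                        # f_1: number of tokens
        sum(len(t) for t in tokens) / max(len(tokens), 1),  # f_2: average token length
        float(any(t.isdigit() for t in tokens)),            # f_3: contains a number?
    ])

x = phi("The meeting starts at 9")
print(x.shape)  # (3,) -- a vector in R^3
```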
Basic vector notation and operations
In this tutorial vectors are written in boldface: x. An alternative notation is x⃗ (an arrow over the symbol).
Subscripts and italics are used to refer to vector components: x_i.
Subscripts on boldface symbols, or more commonly superscripts in brackets, are used to index whole vectors: x_i or x^(i).
The dot (inner) product can be written:
◮ x · z
◮ ⟨x, z⟩
◮ x^T z
Dot product and matrix multiplication
The notation x^T treats the vector as a one-column matrix and transposes it into a one-row matrix. This matrix can then be multiplied with a one-column matrix, giving a scalar.
Dot product:

  x · z = Σ_{i=1}^{d} x_i z_i

Matrix multiplication:

  (AB)_{i,j} = Σ_{k=1}^{n} A_{i,k} B_{k,j}

Example:

  (x_1 x_2 x_3) (z_1, z_2, z_3)^T = x_1 z_1 + x_2 z_2 + x_3 z_3
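A quick check of these definitions in NumPy (the vectors are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.5, 2.0])

# Dot product as an explicit sum over components
dot_explicit = sum(x[i] * z[i] for i in range(len(x)))

# Same thing via matrix multiplication: treat x as a 1x3 row and z as a 3x1 column
dot_matmul = x.reshape(1, -1) @ z.reshape(-1, 1)   # 1x1 matrix

print(dot_explicit)          # 11.0
print(dot_matmul.item())     # 11.0
print(np.dot(x, z))          # 11.0, NumPy's built-in dot product
```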
Supervised learning
In supervised learning we try to learn a function h : X → Y, where:
Binary classification: Y = {−1, +1}
Multiclass classification: Y = {1, …, K} (finite set of labels)
Regression: Y = R
Sequence labeling: h : W^n → L^n
Structure learning: inputs and outputs are structures such as trees or graphs
We will often initially focus on binary classification, and then generalize.
Bayesian Decision Theory
Prior, conditional, posterior
A priori probability (prior): P(Y_i)
Class-conditional:
◮ density p(x | Y_i) (continuous feature x)
◮ probability P(x | Y_i) (discrete feature x)
Joint probability: p(Y_i, x) = P(Y_i | x) p(x) = p(x | Y_i) P(Y_i)
Posterior via Bayes formula:

  P(Y_i | x) = p(x | Y_i) P(Y_i) / p(x)

where p(x | Y_i) is the likelihood of Y_i with respect to x.
In general we work with feature vectors x rather than single features x.
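A small numeric sketch of the Bayes formula for two classes and a single observed discrete feature; all probability values here are invented:

```python
# Hypothetical two-class example with a single discrete feature x.
priors = {"Y1": 0.7, "Y2": 0.3}            # P(Y_i)
likelihoods = {"Y1": 0.2, "Y2": 0.6}       # P(x | Y_i) for the observed x

# Evidence p(x) = sum_i P(x | Y_i) P(Y_i)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(Y_i | x) = P(x | Y_i) P(Y_i) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # {'Y1': 0.4375, 'Y2': 0.5625}
```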
Loss function and risk for classification
Let {Y_1, …, Y_c} be the classes and {α_1, …, α_a} be the decisions.
Then λ(α_i | Y_j) describes the loss associated with decision α_i when the target class is Y_j.
Expected loss (or risk) associated with α_i:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | Y_j) P(Y_j | x)

A decision function α maps feature vectors x to decisions α_1, …, α_a.
The overall risk is then given by:

  R = ∫ R(α(x) | x) p(x) dx
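A sketch of the conditional risk computation with an invented asymmetric loss matrix, reusing the posteriors from the previous sketch; it shows that with unequal losses the lowest-risk decision need not be the most probable class:

```python
import numpy as np

# Hypothetical loss matrix lambda(alpha_i | Y_j): rows = decisions, columns = true classes.
# Here deciding Y2 when the truth is Y1 costs 5, while the reverse mistake costs only 1.
loss = np.array([[0.0, 1.0],     # decide alpha_1 (class Y1)
                 [5.0, 0.0]])    # decide alpha_2 (class Y2)

posteriors = np.array([0.4375, 0.5625])   # P(Y_j | x) from the previous sketch

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | Y_j) P(Y_j | x)
risks = loss @ posteriors
print(risks)               # [0.5625 2.1875]
print(risks.argmin())      # 0 -> despite the lower posterior, deciding Y1 minimizes risk
```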
Zero-one loss for classification
The zero-one loss function assigns loss 1 when a mistake is made and loss 0 otherwise.
If decision α_i means deciding that the output class is Y_i, then the zero-one loss is:

  λ(α_i | Y_j) = 0 if i = j, and 1 if i ≠ j

The risk under zero-one loss is the average probability of error:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | Y_j) P(Y_j | x)   (3)
             = Σ_{j ≠ i} P(Y_j | x)                  (4)
             = 1 − P(Y_i | x)                        (5)
Bayes decision rule
Under the Bayes decision rule we choose the action which minimizes the conditional risk. Under zero-one loss this means choosing the class Y_i which maximizes P(Y_i | x):

  Decide Y_i if P(Y_i | x) > P(Y_j | x) for all j ≠ i   (6)

Thus the Bayes decision rule gives minimum error rate classification.
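Continuing the made-up numbers, a tiny check that under zero-one loss minimizing the conditional risk is the same as choosing the class with the largest posterior:

```python
import numpy as np

posteriors = np.array([0.4375, 0.5625])          # P(Y_1 | x), P(Y_2 | x)

# Zero-one loss: 0 on the diagonal, 1 elsewhere
zero_one = np.ones((2, 2)) - np.eye(2)

risks = zero_one @ posteriors                    # R(alpha_i | x) = 1 - P(Y_i | x)
print(risks)                                     # [0.5625 0.4375]

assert risks.argmin() == posteriors.argmax()     # Bayes rule: pick the max-posterior class
```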
Discriminant functions
For classification among c classes, we have a set of c functions {g_1, …, g_c}.
For each class Y_i, g_i : X → R.
The classifier chooses the class index for x by solving:

  y = argmax_i g_i(x)

That is, it chooses the category corresponding to the largest discriminant.
For the Bayes classifier under the minimum error rate decision, g_i(x) = P(Y_i | x).
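A minimal sketch of a discriminant-based classifier: one scoring function per class and an argmax over them. The particular discriminants below are invented:

```python
# Hypothetical discriminants g_i: each scores a 2-D input x = (x1, x2)
discriminants = [
    lambda x: -((x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2),   # g_1: prefers points near (1, 1)
    lambda x: -((x[0] + 1.0) ** 2 + (x[1] + 1.0) ** 2),   # g_2: prefers points near (-1, -1)
]

def classify(x):
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])   # argmax_i g_i(x)

print(classify((0.8, 1.2)))    # 0 -> class Y_1
print(classify((-2.0, -0.5)))  # 1 -> class Y_2
```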
Choice of discriminant function
The choice of the set of discriminant functions is not unique.
We can replace every g_i with f ∘ g_i, where f is a monotonically increasing function, without affecting the decision.
Examples:
◮ g_i(x) = P(Y_i | x) = p(x | Y_i) P(Y_i) / Σ_j p(x | Y_j) P(Y_j)
◮ g_i(x) = p(x | Y_i) P(Y_i)
◮ g_i(x) = ln p(x | Y_i) + ln P(Y_i)
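A quick numeric check (with made-up scores) that applying a monotonically increasing function such as ln to every discriminant leaves the argmax, and hence the decision, unchanged:

```python
import math

scores = [0.14, 0.18, 0.03]                   # e.g. g_i(x) = p(x | Y_i) P(Y_i)
log_scores = [math.log(s) for s in scores]    # g_i(x) = ln p(x | Y_i) + ln P(Y_i)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

assert argmax(scores) == argmax(log_scores)   # same decision either way: class index 1
```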
Dichotomizer
A dichotomizer is a binary classifier.
It is traditionally treated in a special way, using a single discriminant g(x) ≡ g_1(x) − g_2(x), with the corresponding decision rule:

  y = Y_1 if g(x) > 0, and Y_2 otherwise

Commonly used dichotomizing discriminants under the minimum error rate criterion:
◮ g(x) = P(Y_1 | x) − P(Y_2 | x)
◮ g(x) = ln [p(x | Y_1) / p(x | Y_2)] + ln [P(Y_1) / P(Y_2)]
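A sketch of a dichotomizer built from the log-likelihood-ratio discriminant; the Gaussian class-conditional densities and the priors are invented for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

prior = {"Y1": 0.5, "Y2": 0.5}

def g(x):
    # g(x) = ln p(x|Y1)/p(x|Y2) + ln P(Y1)/P(Y2)
    return (math.log(gauss_pdf(x, mu=-1.0, sigma=1.0) / gauss_pdf(x, mu=1.0, sigma=1.0))
            + math.log(prior["Y1"] / prior["Y2"]))

def dichotomize(x):
    return "Y1" if g(x) > 0 else "Y2"

print(dichotomize(-0.5))  # Y1
print(dichotomize(2.3))   # Y2
```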
Decision regions and decision boundaries
A decision rule divides the feature space into decision regions R_1, …, R_c.
If g_i(x) > g_j(x) for all j ≠ i, then x is in R_i (i.e. belongs to class Y_i).
Regions are separated by decision boundaries: surfaces in feature space where the discriminants are tied.
[Figure: class-conditional densities p(x | Y_i) for Y_1 and Y_2 over a one-dimensional feature x; the x-axis is divided into decision regions R_1, R_2, R_1.]
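To make the figure concrete, a small sketch (with invented Gaussian class-conditionals, one broad and one narrow) that labels points of a 1-D grid with their decision region; the boundaries fall where the discriminants tie, giving the R_1, R_2, R_1 pattern of the figure:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-4, 4, 81)

# Hypothetical class-conditionals: Y1 is broad, Y2 is narrow, equal priors
g1 = gauss(xs, mu=0.0, sigma=2.0) * 0.5     # g_1(x) = p(x|Y1) P(Y1)
g2 = gauss(xs, mu=1.0, sigma=0.5) * 0.5     # g_2(x) = p(x|Y2) P(Y2)

regions = np.where(g1 > g2, "R1", "R2")     # decision region of each grid point

# Decision boundaries: points where the winning discriminant changes
boundaries = xs[1:][regions[1:] != regions[:-1]]
print(boundaries)    # two boundary points -> regions R1, R2, R1 along the x-axis
```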
Parametric models and parameter estimation
Parameter estimation
If we know the priors P(Y_i) and class-conditional densities p(x | Y_i), the optimal classification is obtained using the Bayes decision rule.
In practice, those probabilities are almost never available, so we need to estimate them from training data:
◮ Priors are easy to estimate for typical classification problems.
◮ However, for class-conditional densities, training data is typically sparse!
If we know (or assume) the general model structure, estimating the model parameters is more feasible.
For example, we assume that p(x | Y_i) is a normal density with mean µ_i and covariance matrix Σ_i.
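A sketch of this parametric approach on randomly generated toy data: the maximum-likelihood estimates of the priors, class means µ_i, and covariance matrices Σ_i are just empirical frequencies, means, and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: two classes with 2-D Gaussian features (parameters invented)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X2 = rng.multivariate_normal([2, 2], [[0.5, 0.0], [0.0, 0.5]], size=50)

n1, n2 = len(X1), len(X2)

# Priors from class frequencies
prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Class-conditional Gaussian parameters via maximum likelihood
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
sigma1 = np.cov(X1, rowvar=False, bias=True)   # MLE uses 1/n, hence bias=True
sigma2 = np.cov(X2, rowvar=False, bias=True)

print(prior1, prior2)   # ~0.67, ~0.33
print(mu1, mu2)         # estimates near [0, 0] and [2, 2]
```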