Introduction to Classification and Sequence Labeling
Grzegorz Chrupała
Spoken Language Systems, Saarland University
Annual IRTG Meeting 2009
Outline
1 Preliminaries
2 Bayesian Decision Theory
  ◮ Discriminant functions and decision surfaces
3 Parametric models and parameter estimation
4 Non-parametric techniques
  ◮ K-Nearest neighbors classifier
5 Linear models
  ◮ Perceptron
  ◮ Large margin and kernel methods
  ◮ Logistic regression (Maxent)
6 Sequence labeling and structure prediction
  ◮ Maximum Entropy Markov Models
  ◮ Sequence perceptron
  ◮ Conditional Random Fields
Notes on notation
We learn functions from examples of inputs and outputs.
The inputs are usually denoted as x ∈ X; the outputs are y ∈ Y.
The most common and well-studied scenario is classification: we learn to map arbitrary objects into a small number of fixed classes.
Our arbitrary input objects have to be represented somehow: we have to extract the features of the object which are useful for determining which output it should map to.
Feature function
Objects are represented using the feature function, also known as the representation function.
The most commonly used representation is a d-dimensional vector of real values:

  Φ : X → R^d                        (1)
  Φ(x) = (f_1, f_2, …, f_d)^T        (2)

We will often simplify notation by assuming that input objects are already vectors in d-dimensional real space.
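To make this concrete, here is a minimal sketch (not from the slides) of a feature function in Python; the particular features, and the function name phi, are invented for illustration.

```python
import numpy as np

def phi(text: str) -> np.ndarray:
    """Hypothetical feature function Phi: X -> R^d for a text input.
    Each component f_i is one hand-picked real-valued feature."""
    tokens = text.lower().split()
    return np.array([
        len(tokens),                                        # f_1: number of tokens
        sum(len(t) for t in tokens) / max(len(tokens), 1),  # f_2: average token length
        float(any(t.isdigit() for t in tokens)),            # f_3: contains a number?
    ])

x = phi("The meeting starts at 9")
print(x.shape)  # (3,) -- a vector in R^3
```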
Basic vector notation and operations
In this tutorial vectors are written in boldface: x. An alternative notation is x⃗ (an arrow over the symbol).
Subscripts and italics are used to refer to vector components: x_i.
Subscripts on boldface symbols, or more commonly superscripts in brackets, are used to index whole vectors: x_i or x^(i).
The dot (inner) product can be written:
◮ x · z
◮ ⟨x, z⟩
◮ x^T z
Dot product and matrix multiplication
The notation x^T treats the vector as a one-column matrix and transposes it into a one-row matrix. This matrix can then be multiplied with a one-column matrix, giving a scalar.
Dot product:

  x · z = Σ_{i=1}^{d} x_i z_i

Matrix multiplication:

  (AB)_{i,j} = Σ_{k=1}^{n} A_{i,k} B_{k,j}

Example:

  (x_1 x_2 x_3) (z_1, z_2, z_3)^T = x_1 z_1 + x_2 z_2 + x_3 z_3
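A quick check of these definitions in NumPy (the vectors are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.5, 2.0])

# Dot product as an explicit sum over components
dot_explicit = sum(x[i] * z[i] for i in range(len(x)))

# Same thing via matrix multiplication: treat x as a 1x3 row and z as a 3x1 column
dot_matmul = x.reshape(1, -1) @ z.reshape(-1, 1)   # 1x1 matrix

print(dot_explicit)          # 11.0
print(dot_matmul.item())     # 11.0
print(np.dot(x, z))          # 11.0, NumPy's built-in dot product
```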
Supervised learning
In supervised learning we try to learn a function h : X → Y, where:
Binary classification: Y = {−1, +1}
Multiclass classification: Y = {1, …, K} (finite set of labels)
Regression: Y = R
Sequence labeling: h : W^n → L^n
Structure learning: inputs and outputs are structures such as trees or graphs
We will often initially focus on binary classification, and then generalize.
Bayesian Decision Theory
Prior, conditional, posterior
A priori probability (prior): P(Y_i)
Class-conditional:
◮ density p(x | Y_i) (continuous feature x)
◮ probability P(x | Y_i) (discrete feature x)
Joint probability: p(Y_i, x) = P(Y_i | x) p(x) = p(x | Y_i) P(Y_i)
Posterior via Bayes formula:

  P(Y_i | x) = p(x | Y_i) P(Y_i) / p(x)

where p(x | Y_i) is the likelihood of Y_i with respect to x.
In general we work with feature vectors x rather than single features x.
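A small numeric sketch of the Bayes formula for two classes and a single observed discrete feature; all probability values here are invented:

```python
# Hypothetical two-class example with a single discrete feature x.
priors = {"Y1": 0.7, "Y2": 0.3}            # P(Y_i)
likelihoods = {"Y1": 0.2, "Y2": 0.6}       # P(x | Y_i) for the observed x

# Evidence p(x) = sum_i P(x | Y_i) P(Y_i)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(Y_i | x) = P(x | Y_i) P(Y_i) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # {'Y1': 0.4375, 'Y2': 0.5625}
```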
Loss function and risk for classification
Let {Y_1, …, Y_c} be the classes and {α_1, …, α_a} be the decisions.
Then λ(α_i | Y_j) describes the loss associated with decision α_i when the target class is Y_j.
Expected loss (or risk) associated with α_i:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | Y_j) P(Y_j | x)

A decision function α maps feature vectors x to decisions α_1, …, α_a.
The overall risk is then given by:

  R = ∫ R(α(x) | x) p(x) dx
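A sketch of the conditional risk computation with an invented asymmetric loss matrix, reusing the posteriors from the previous sketch; it shows that with unequal losses the lowest-risk decision need not be the most probable class:

```python
import numpy as np

# Hypothetical loss matrix lambda(alpha_i | Y_j): rows = decisions, columns = true classes.
# Here deciding Y2 when the truth is Y1 costs 5, while the reverse mistake costs only 1.
loss = np.array([[0.0, 1.0],     # decide alpha_1 (class Y1)
                 [5.0, 0.0]])    # decide alpha_2 (class Y2)

posteriors = np.array([0.4375, 0.5625])   # P(Y_j | x) from the previous sketch

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | Y_j) P(Y_j | x)
risks = loss @ posteriors
print(risks)               # [0.5625 2.1875]
print(risks.argmin())      # 0 -> despite the lower posterior, deciding Y1 minimizes risk
```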
Zero-one loss for classification
The zero-one loss function assigns loss 1 when a mistake is made and loss 0 otherwise.
If decision α_i means deciding that the output class is Y_i, then the zero-one loss is:

  λ(α_i | Y_j) = 0 if i = j, and 1 if i ≠ j

The risk under zero-one loss is the average probability of error:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | Y_j) P(Y_j | x)   (3)
             = Σ_{j ≠ i} P(Y_j | x)                  (4)
             = 1 − P(Y_i | x)                        (5)
Bayes decision rule
Under the Bayes decision rule we choose the action which minimizes the conditional risk. Under zero-one loss this means choosing the class Y_i which maximizes P(Y_i | x):

  Decide Y_i if P(Y_i | x) > P(Y_j | x) for all j ≠ i   (6)

Thus the Bayes decision rule gives minimum error rate classification.
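Continuing the made-up numbers, a tiny check that under zero-one loss minimizing the conditional risk is the same as choosing the class with the largest posterior:

```python
import numpy as np

posteriors = np.array([0.4375, 0.5625])          # P(Y_1 | x), P(Y_2 | x)

# Zero-one loss: 0 on the diagonal, 1 elsewhere
zero_one = np.ones((2, 2)) - np.eye(2)

risks = zero_one @ posteriors                    # R(alpha_i | x) = 1 - P(Y_i | x)
print(risks)                                     # [0.5625 0.4375]

assert risks.argmin() == posteriors.argmax()     # Bayes rule: pick the max-posterior class
```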
Discriminant functions
For classification among c classes, we have a set of c functions {g_1, …, g_c}.
For each class Y_i, g_i : X → R.
The classifier chooses the class index for x by solving:

  y = argmax_i g_i(x)

That is, it chooses the category corresponding to the largest discriminant.
For the Bayes classifier under the minimum error rate decision, g_i(x) = P(Y_i | x).
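A minimal sketch of a discriminant-based classifier: one scoring function per class and an argmax over them. The particular discriminants below are invented:

```python
# Hypothetical discriminants g_i: each scores a 2-D input x = (x1, x2)
discriminants = [
    lambda x: -((x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2),   # g_1: prefers points near (1, 1)
    lambda x: -((x[0] + 1.0) ** 2 + (x[1] + 1.0) ** 2),   # g_2: prefers points near (-1, -1)
]

def classify(x):
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])   # argmax_i g_i(x)

print(classify((0.8, 1.2)))    # 0 -> class Y_1
print(classify((-2.0, -0.5)))  # 1 -> class Y_2
```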
Choice of discriminant function
The choice of the set of discriminant functions is not unique.
We can replace every g_i with f ∘ g_i, where f is a monotonically increasing function, without affecting the decision.
Examples:
◮ g_i(x) = P(Y_i | x) = p(x | Y_i) P(Y_i) / Σ_j p(x | Y_j) P(Y_j)
◮ g_i(x) = p(x | Y_i) P(Y_i)
◮ g_i(x) = ln p(x | Y_i) + ln P(Y_i)
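A quick numeric check (with made-up scores) that applying a monotonically increasing function such as ln to every discriminant leaves the argmax, and hence the decision, unchanged:

```python
import math

scores = [0.14, 0.18, 0.03]                   # e.g. g_i(x) = p(x | Y_i) P(Y_i)
log_scores = [math.log(s) for s in scores]    # g_i(x) = ln p(x | Y_i) + ln P(Y_i)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

assert argmax(scores) == argmax(log_scores)   # same decision either way: class index 1
```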
Dichotomizer
A dichotomizer is a binary classifier.
It is traditionally treated in a special way, using a single discriminant g(x) ≡ g_1(x) − g_2(x), with the corresponding decision rule:

  y = Y_1 if g(x) > 0, and Y_2 otherwise

Commonly used dichotomizing discriminants under the minimum error rate criterion:
◮ g(x) = P(Y_1 | x) − P(Y_2 | x)
◮ g(x) = ln [p(x | Y_1) / p(x | Y_2)] + ln [P(Y_1) / P(Y_2)]
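A sketch of a dichotomizer built from the log-likelihood-ratio discriminant; the Gaussian class-conditional densities and the priors are invented for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

prior = {"Y1": 0.5, "Y2": 0.5}

def g(x):
    # g(x) = ln p(x|Y1)/p(x|Y2) + ln P(Y1)/P(Y2)
    return (math.log(gauss_pdf(x, mu=-1.0, sigma=1.0) / gauss_pdf(x, mu=1.0, sigma=1.0))
            + math.log(prior["Y1"] / prior["Y2"]))

def dichotomize(x):
    return "Y1" if g(x) > 0 else "Y2"

print(dichotomize(-0.5))  # Y1
print(dichotomize(2.3))   # Y2
```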
Decision regions and decision boundaries
A decision rule divides the feature space into decision regions R_1, …, R_c.
If g_i(x) > g_j(x) for all j ≠ i, then x is in R_i (i.e. belongs to class Y_i).
Regions are separated by decision boundaries: surfaces in feature space where the discriminants are tied.
[Figure: class-conditional densities p(x | Y_i) for Y_1 and Y_2 over a one-dimensional feature x; the x-axis is divided into decision regions R_1, R_2, R_1.]
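To make the figure concrete, a small sketch (with invented Gaussian class-conditionals, one broad and one narrow) that labels points of a 1-D grid with their decision region; the boundaries fall where the discriminants tie, giving the R_1, R_2, R_1 pattern of the figure:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-4, 4, 81)

# Hypothetical class-conditionals: Y1 is broad, Y2 is narrow, equal priors
g1 = gauss(xs, mu=0.0, sigma=2.0) * 0.5     # g_1(x) = p(x|Y1) P(Y1)
g2 = gauss(xs, mu=1.0, sigma=0.5) * 0.5     # g_2(x) = p(x|Y2) P(Y2)

regions = np.where(g1 > g2, "R1", "R2")     # decision region of each grid point

# Decision boundaries: points where the winning discriminant changes
boundaries = xs[1:][regions[1:] != regions[:-1]]
print(boundaries)    # two boundary points -> regions R1, R2, R1 along the x-axis
```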
Parametric models and parameter estimation
Parameter estimation
If we know the priors P(Y_i) and class-conditional densities p(x | Y_i), the optimal classification is obtained using the Bayes decision rule.
In practice, those probabilities are almost never available, so we need to estimate them from training data:
◮ Priors are easy to estimate for typical classification problems.
◮ However, for class-conditional densities, training data is typically sparse!
If we know (or assume) the general model structure, estimating the model parameters is more feasible.
For example, we assume that p(x | Y_i) is a normal density with mean µ_i and covariance matrix Σ_i.
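A sketch of this parametric approach on randomly generated toy data: the maximum-likelihood estimates of the priors, class means µ_i, and covariance matrices Σ_i are just empirical frequencies, means, and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: two classes with 2-D Gaussian features (parameters invented)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X2 = rng.multivariate_normal([2, 2], [[0.5, 0.0], [0.0, 0.5]], size=50)

n1, n2 = len(X1), len(X2)

# Priors from class frequencies
prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Class-conditional Gaussian parameters via maximum likelihood
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
sigma1 = np.cov(X1, rowvar=False, bias=True)   # MLE uses 1/n, hence bias=True
sigma2 = np.cov(X2, rowvar=False, bias=True)

print(prior1, prior2)   # ~0.67, ~0.33
print(mu1, mu2)         # estimates near [0, 0] and [2, 2]
```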