Outline
1. Preliminaries
2. Bayesian Decision Theory
   ◮ Minimum error rate classification
   ◮ Discriminant functions and decision surfaces
3. Parametric models and parameter estimation
4. Non-parametric techniques
   ◮ K-Nearest neighbors classifier
5. Decision trees
6. Linear models
   ◮ Perceptron
   ◮ Large margin and kernel methods
   ◮ Logistic regression (Maxent)
7. Sequence labeling and structure prediction
   ◮ Maximum Entropy Markov Models
   ◮ Sequence perceptron
   ◮ Conditional Random Fields
Parameter estimation

If we know the priors P(Y_i) and the class-conditional densities p(x | Y_i), the optimal classification is obtained using the Bayes decision rule
In practice, those probabilities are almost never available
Thus, we need to estimate them from training data
◮ Priors are easy to estimate for typical classification problems
◮ However, for class-conditional densities, training data is typically sparse!
If we know (or assume) the general model structure, estimating the model parameters is more feasible
For example, we assume that p(x | Y_i) is a normal density with mean µ_i and covariance matrix Σ_i
The normal density – univariate

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]   (7)

Completely specified by two parameters, the mean µ and the variance σ²:

\mu \equiv E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx

\sigma^2 \equiv E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 p(x)\, dx

Notation: p(x) ∼ N(µ, σ²)
Multivariate normal density

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\right]   (8)

where µ is the mean vector and Σ is the covariance matrix:

\Sigma \equiv E[(\mathbf{x}-\boldsymbol\mu)(\mathbf{x}-\boldsymbol\mu)^T] = \int (\mathbf{x}-\boldsymbol\mu)(\mathbf{x}-\boldsymbol\mu)^T p(\mathbf{x})\, d\mathbf{x}
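As a small aside (not from the slides), Eq. (8) can be evaluated directly with NumPy; the function name and example values below are my own.

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density, Eq. (8)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

# Standard bivariate normal evaluated at the origin: should equal 1/(2*pi)
print(mvn_density(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```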
Maximum likelihood estimation

The density p(x) conditioned on class Y_i has a parametric form, and depends on θ
The data set D consists of n instances x_1, ..., x_n
We assume the instances are chosen independently, thus the likelihood of θ with respect to the training instances is:

p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)

In maximum likelihood estimation we will find the θ which maximizes this likelihood
Problem – MLE for the mean of the univariate normal density

Let θ = µ
Maximizing the log likelihood is equivalent to maximizing the likelihood (but tends to be analytically easier)

l(\theta) \equiv \ln p(\mathcal{D} \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

Our solution is the argument which maximizes this function:

\hat\theta = \operatorname{argmax}_{\theta}\, l(\theta) = \operatorname{argmax}_{\theta} \sum_{k=1}^{n} \ln p(x_k \mid \theta)
Normal density – log likelihood

p(x_k \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_k-\theta}{\sigma}\right)^2\right]

\ln p(x_k \mid \theta) = \ln\left\{\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_k-\theta}{\sigma}\right)^2\right]\right\}   (9)
 = -\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_k-\theta)^2   (10)

We will find the maximum of the log-likelihood function by finding the point where the first derivative equals 0:

\frac{d}{d\theta} \sum_{k=1}^{n} \ln p(x_k \mid \theta) = 0
Substituting the log likelihood we get

\frac{d}{d\theta} \sum_{k=1}^{n} \left[-\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_k-\theta)^2\right] = 0   (11)

\sum_{k=1}^{n} \frac{d}{d\theta}\left[-\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_k-\theta)^2\right] = 0   (12)

\sum_{k=1}^{n} \left[0 + \frac{1}{\sigma^2}(x_k-\theta)\right] = 0   (13)

\frac{1}{\sigma^2}\sum_{k=1}^{n} (x_k-\theta) = 0   (14)

\sum_{k=1}^{n} (x_k-\theta) = 0   (15)
Finally we rearrange to get θ

\sum_{k=1}^{n} (x_k-\theta) = 0   (16)

-n\theta + \sum_{k=1}^{n} x_k = 0   (17)

-n\theta = -\sum_{k=1}^{n} x_k   (18)

\theta = \frac{1}{n}\sum_{k=1}^{n} x_k   (19)

Which is the ... mean of the training examples
MLE for variance and for multivariate Gaussians

Using a similar approach, we can derive the MLE estimate for the variance of the univariate normal density:

\hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k-\hat\mu)^2

For the multivariate case, the MLE estimates are as follows:

\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k

\hat\Sigma = \frac{1}{n}\sum_{k=1}^{n} (\mathbf{x}_k-\hat{\boldsymbol\mu})(\mathbf{x}_k-\hat{\boldsymbol\mu})^T
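The following sketch (my own, not from the tutorial) checks these MLE formulas on synthetic data: the sample mean and the 1/n-normalized sample covariance recover the generating parameters.

```python
import numpy as np

# MLE for a multivariate Gaussian: sample mean and (biased, 1/n) sample covariance.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=5000)

mu_hat = X.mean(axis=0)                  # (1/n) sum_k x_k
diff = X - mu_hat
sigma_hat = (diff.T @ diff) / len(X)     # (1/n) sum_k (x_k - mu)(x_k - mu)^T

print(mu_hat)      # close to [1, -2]
print(sigma_hat)   # close to [[2, 0.3], [0.3, 1]]
```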
Exercise – Bayes classifier

In this exercise the goal is to use the Bayes classifier to distinguish between examples of two different species of iris. We will use the parameters derived from the training examples using ML estimates.
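One possible way to attack the exercise is sketched below; the code and helper names are mine (not the tutorial's reference solution), and it assumes scikit-learn's bundled iris data is available.

```python
import numpy as np
from sklearn.datasets import load_iris  # assumed available; any two-class data works

# Fit a Gaussian p(x | Y_i) per class by MLE and classify with the Bayes rule.
X, y = load_iris(return_X_y=True)
mask = y < 2                           # keep two species only
X, y = X[mask], y[mask]

classes = np.unique(y)
priors, means, covs = {}, {}, {}
for c in classes:
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)
    means[c] = Xc.mean(axis=0)
    covs[c] = np.cov(Xc, rowvar=False, bias=True)   # 1/n MLE covariance

def log_gaussian(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(sigma))
                   + diff @ np.linalg.inv(sigma) @ diff)

def predict(x):
    # Bayes decision rule: argmax_i p(x | Y_i) P(Y_i)
    return max(classes, key=lambda c: log_gaussian(x, means[c], covs[c]) + np.log(priors[c]))

preds = np.array([predict(x) for x in X])
print("training accuracy:", (preds == y).mean())
```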
Non-parametric techniques

In many (most) cases assuming the examples come from a parametric distribution is not valid
Non-parametric techniques don't make the assumption that the form of the distribution is known
◮ Density estimation – Parzen windows
◮ Use training examples to derive decision functions directly: K-nearest neighbors, decision trees
◮ Assume a known form for discriminant functions, and estimate their parameters from training data (e.g. linear models)
KNN classifier

K-Nearest neighbors idea: when classifying a new example, find the k nearest training examples and assign the majority label
Also known as:
◮ Memory-based learning
◮ Instance or exemplar based learning
◮ Similarity-based methods
◮ Case-based reasoning
Distance metrics in feature space

Euclidean distance or L_2 norm in d-dimensional space:

D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}

L_1 norm (Manhattan or taxicab distance):

L_1(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} |x_i - x'_i|

L_\infty or maximum norm:

L_\infty(\mathbf{x}, \mathbf{x}') = \max_{i=1}^{d} |x_i - x'_i|

In general, the L_k norm:

L_k(\mathbf{x}, \mathbf{x}') = \left(\sum_{i=1}^{d} |x_i - x'_i|^k\right)^{1/k}
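For concreteness, here is a small sketch (mine) of these norms in NumPy; the function name `minkowski` and the example vectors are my own.

```python
import numpy as np

def minkowski(x, xp, k):
    """General L_k norm; k=1 gives Manhattan, k=2 Euclidean."""
    return np.sum(np.abs(x - xp) ** k) ** (1.0 / k)

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, xp, 1))            # L1: 3.0
print(minkowski(x, xp, 2))            # L2: ~2.236
print(np.max(np.abs(x - xp)))         # L_infinity: 2.0
```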
Hamming distance

The Hamming distance is used to compare strings of symbolic attributes
Equivalent to the L_1 norm for binary strings
Defines the distance between two instances to be the sum of per-feature distances
For symbolic features the per-feature distance is 0 for an exact match and 1 for a mismatch:

\mathrm{Hamming}(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} \delta(x_i, x'_i)   (20)

\delta(x_i, x'_i) = \begin{cases} 0 & \text{if } x_i = x'_i \\ 1 & \text{if } x_i \neq x'_i \end{cases}   (21)
IB1 algorithm

For a vector with a mixture of symbolic and numeric values, the above definition of per-feature distance is used for symbolic features, while for numeric ones we can use the scaled absolute difference:

\delta(x_i, x'_i) = \frac{|x_i - x'_i|}{\max_i - \min_i}   (22)
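A sketch (my own) of this mixed symbolic/numeric per-feature distance; the `numeric_ranges` argument and the example values are assumptions for illustration.

```python
# IB1-style distance: overlap (0/1) for symbolic values, scaled absolute
# difference for numeric ones, as in Eqs. (21)-(22).
def ib1_distance(x, xp, numeric_ranges):
    """numeric_ranges maps a feature index to (min_i, max_i) seen in training."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, xp)):
        if i in numeric_ranges:
            lo, hi = numeric_ranges[i]
            total += abs(a - b) / (hi - lo) if hi > lo else 0.0
        else:
            total += 0.0 if a == b else 1.0
    return total

# Feature 2 is numeric with training range [0, 10]; the rest are symbolic.
print(ib1_distance(("NN", "the", 3.0), ("VB", "the", 7.0), {2: (0.0, 10.0)}))  # 1.4
```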
Nearest-neighbor Voronoi tessellation
IB1 with feature weighting

The per-feature distance is multiplied by the weight of the feature for which it is computed:

D_w(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} w_i\, \delta(x_i, x'_i)   (23)

where w_i is the weight of the i-th feature.
[Daelemans and van den Bosch, 2005] describe two entropy-based methods and a χ²-based method to find a good weight vector w.
Information gain

A measure of how much knowing the value of a certain feature for an example decreases our uncertainty about its class, i.e. the difference in class entropy with and without information about the feature value:

w_i = H(Y) - \sum_{v \in V_i} P(v)\, H(Y \mid v)   (24)

where
w_i is the weight of the i-th feature
Y is the set of class labels
V_i is the set of possible values for the i-th feature
P(v) is the probability of value v
class entropy: H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y)
H(Y | v) is the conditional class entropy given that the feature value = v
Numeric values need to be temporarily discretized for this to work
Gain ratio

IG assigns excessive weight to features with a large number of values. To remedy this bias, information gain can be normalized by the entropy of the feature values, which gives the gain ratio:

w_i = \frac{H(Y) - \sum_{v \in V_i} P(v)\, H(Y \mid v)}{H(V_i)}   (25)

For a feature with a unique value for each instance in the training set, the entropy of the feature values in the denominator will be maximally high, and will thus give it a low weight.
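A small sketch (mine) of how the information-gain and gain-ratio weights of Eqs. (24)-(25) could be computed for a single symbolic feature; the toy feature values and labels are invented.

```python
import math
from collections import Counter

def entropy(items):
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def ig_and_gain_ratio(values, labels):
    h_y = entropy(labels)                       # H(Y)
    n = len(values)
    cond = 0.0
    for v, c in Counter(values).items():        # sum_v P(v) H(Y | v)
        subset = [y for x, y in zip(values, labels) if x == v]
        cond += (c / n) * entropy(subset)
    ig = h_y - cond                             # Eq. (24)
    h_v = entropy(values)                       # H(V_i)
    return ig, (ig / h_v if h_v > 0 else 0.0)   # Eq. (25)

values = ["TO", "TO", "DT", "DT", "NN"]
labels = ["VB", "VB", "NN", "NN", "NN"]
print(ig_and_gain_ratio(values, labels))
```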
χ²

The χ² statistic for a problem with k classes and m values for feature F:

\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}   (26)

where
O_{ij} is the observed number of instances with the i-th class label and the j-th value of feature F
E_{ij} is the expected number of such instances in case the null hypothesis is true:

E_{ij} = \frac{n_{\cdot j}\, n_{i\cdot}}{n_{\cdot\cdot}}

n_{ij} is the frequency count of instances with the i-th class label and the j-th value of feature F
◮ n_{\cdot j} = \sum_{i=1}^{k} n_{ij}
◮ n_{i\cdot} = \sum_{j=1}^{m} n_{ij}
◮ n_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij}
χ² example

Consider a spam detection task: your features are words present/absent in email messages
Compute χ² for each word to use as weightings for a KNN classifier
The statistic can be computed from a contingency table. E.g. these are (fake) counts of rock-hard in 2000 messages:

         rock-hard   ¬ rock-hard
  ham        4          996
  spam     100          900

We need to sum (E_{ij} - O_{ij})^2 / E_{ij} for the four cells in the table:

\frac{(52-4)^2}{52} + \frac{(948-996)^2}{948} + \frac{(52-100)^2}{52} + \frac{(948-900)^2}{948} = 93.4761
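The same number can be reproduced with a few lines of NumPy; this sketch (mine) just applies Eq. (26) to the table above.

```python
import numpy as np

observed = np.array([[4, 996],      # ham:  rock-hard, not rock-hard
                     [100, 900]])   # spam: rock-hard, not rock-hard

row_totals = observed.sum(axis=1, keepdims=True)      # n_i.
col_totals = observed.sum(axis=0, keepdims=True)      # n_.j
expected = row_totals * col_totals / observed.sum()   # n_i. n_.j / n_..

chi2 = ((expected - observed) ** 2 / expected).sum()
print(round(chi2, 4))   # 93.4761
```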
Problem – IG from contingency table

Use the contingency table in the previous example to compute:
Information gain
Gain ratio
for the feature rock-hard with respect to the classes spam and ham.
Distance-weighted class voting

So far all the instances in the neighborhood are weighted equally for computing the majority class
We may want to treat the votes from very close neighbors as more important than votes from more distant ones
A variety of distance weighting schemes have been proposed to implement this idea; see [Daelemans and van den Bosch, 2005] for details and discussion.
KNN – summary

Non-parametric: makes no assumptions about the probability distribution the examples come from
Does not assume data is linearly separable
Derives the decision rule directly from training data
"Lazy learning":
◮ During learning little "work" is done by the algorithm: the training instances are simply stored in memory in some efficient manner.
◮ During prediction the test instance is compared to the training instances, the neighborhood is calculated, and the majority label assigned
No information discarded: "exceptional" and low-frequency training instances are available for prediction
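A minimal k-NN classifier sketch (my own code), using Euclidean distance and unweighted majority voting; the toy data is invented.

```python
import numpy as np
from collections import Counter

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Lazy learning": just store the training data.
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return Counter(self.y[nearest]).most_common(1)[0][0]

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
print(KNN(k=3).fit(X, y).predict([4.5, 5.0]))   # 'b'
```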
Decision trees

"Nonmetric method"
No numerical vectors of features needed
Just attribute-value lists
Ask a question about an attribute to partition the set of objects
Resulting classification tree easy to interpret!
Linear models

With linear classifiers we assume the form of the discriminant function to be known, and use training data to estimate its parameters
No assumptions about the underlying probability distribution – in this limited sense they are non-parametric
Learning a linear classifier is formulated as minimizing a criterion function
Linear discriminant function

A discriminant linear in the components of x has the following form:

g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b   (27)

g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + b   (28)

Here w is the weight vector and b is the bias, or threshold weight
This function is a weighted sum of the components of x (shifted by the bias)
For a binary classification problem, the decision function becomes:

f(\mathbf{x}; \mathbf{w}, b) = \begin{cases} Y_1 & \text{if } g(\mathbf{x}) > 0 \\ Y_2 & \text{otherwise} \end{cases}
Linear decision boundary

A hyperplane is a generalization of a straight line to > 2 dimensions
A hyperplane contains all the points in a d-dimensional space satisfying the following equation:

a_1 x_1 + a_2 x_2 + \dots + a_d x_d + b = 0

By identifying the components of w with the coefficients a_i, we can see how the weight vector and the bias define a linear decision surface in d dimensions
Such a classifier relies on the examples being linearly separable
Normal vector

Geometrically, the weight vector w is a normal vector of the separating hyperplane
A normal vector of a surface is any vector which is perpendicular to it
Bias

The orientation (or slope) of the hyperplane is determined by w, while the location (intercept) is determined by the bias
It is common to simplify notation by including the bias in the weight vector, i.e. b = w_0
We need to add an additional component to the feature vector x; this component is always x_0 = 1
The discriminant function is then simply the dot product between the weight vector and the feature vector:

g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}   (29)

g(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i   (30)
Separating hyperplanes in 2 dimensions

[Figure: scatter plot of two classes in the (x, y) plane with three candidate separating lines: y = −1x − 0.5, y = −3x + 1, y = 69x + 1]
Perceptron training

How do we find a set of weights that separate our classes?
Perceptron: a simple mistake-driven online algorithm
◮ Start with a zero weight vector and process each training example in turn.
◮ If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
◮ If weights stop changing, stop
If the examples are linearly separable, then this algorithm is guaranteed to converge on a solution vector
Fixed increment online perceptron algorithm

Binary classification, with classes +1 and −1
Decision function y' = sign(w · x + b)

Perceptron(x_{1:N}, y_{1:N}, I):
1: w ← 0
2: b ← 0
3: for i = 1 ... I do
4:   for n = 1 ... N do
5:     if y^(n) (w · x^(n) + b) ≤ 0 then
6:       w ← w + y^(n) x^(n)
7:       b ← b + y^(n)
8: return (w, b)
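A direct Python transcription (mine) of the pseudocode above; labels are assumed to be +1/−1 and the toy data is invented.

```python
import numpy as np

def perceptron(X, y, I=10):
    """Fixed-increment online perceptron; I is the number of passes."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(I):
        for xn, yn in zip(X, y):
            if yn * (w @ xn + b) <= 0:      # mistake (or zero margin): update
                w += yn * xn
                b += yn
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))   # predictions should match y
```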
Weight averaging

Although the algorithm is guaranteed to converge, the solution is not unique!
Sensitive to the order in which examples are processed
Separating the training sample does not equal good accuracy on unseen data
Empirically, better generalization performance with weight averaging
◮ A method of avoiding overfitting
◮ As the final weight vector, use the mean of all the weight vector values for each step of the algorithm
Efficient averaged perceptron algorithm

Perceptron(x_{1:N}, y_{1:N}, I):
 1: w ← 0; w_a ← 0
 2: b ← 0; b_a ← 0
 3: c ← 1
 4: for i = 1 ... I do
 5:   for n = 1 ... N do
 6:     if y^(n) (w · x^(n) + b) ≤ 0 then
 7:       w ← w + y^(n) x^(n); b ← b + y^(n)
 8:       w_a ← w_a + c y^(n) x^(n); b_a ← b_a + c y^(n)
 9:     c ← c + 1
10: return (w − w_a/c, b − b_a/c)
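A Python sketch (mine) mirroring the pseudocode above; note that the counter c is incremented for every example presentation, not only on mistakes.

```python
import numpy as np

def averaged_perceptron(X, y, I=10):
    """Efficient averaged perceptron: final weights are w - w_a / c."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w, wa = np.zeros(X.shape[1]), np.zeros(X.shape[1])
    b, ba = 0.0, 0.0
    c = 1
    for _ in range(I):
        for xn, yn in zip(X, y):
            if yn * (w @ xn + b) <= 0:
                w += yn * xn;       b += yn
                wa += c * yn * xn;  ba += c * yn
            c += 1
    return w - wa / c, b - ba / c

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = averaged_perceptron(X, y)
print(np.sign(X @ w + b))   # should match y
```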
Problem: Average perceptron

Weight averaging: show that the above algorithm performs weight averaging.
Hints:
In the standard perceptron algorithm, the final weight vector (and bias) is the sum of the updates at each step.
In the average perceptron, the final weight vector should be the mean of the sum of partial sums of updates at each step.
Solution

Let's formalize.
Basic perceptron: final weights are the sum of the updates at each step:

\mathbf{w} = \sum_{i=1}^{n} f(\mathbf{x}^{(i)})   (31)

Naive weight averaging: final weights are the mean of the sum of partial sums:

\mathbf{w} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(\mathbf{x}^{(j)})   (32)

Efficient weight averaging:

\mathbf{w} = \sum_{i=1}^{n} f(\mathbf{x}^{(i)}) - \sum_{i=1}^{n} i\, f(\mathbf{x}^{(i)}) / n   (33)
Show that equations (32) and (33) are equivalent.
Note that we can rewrite the sum of partial sums by multiplying the update at each step by a factor indicating in how many of the partial sums it appears:

\mathbf{w} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(\mathbf{x}^{(j)})   (34)
 = \frac{1}{n} \sum_{i=1}^{n} (n - i)\, f(\mathbf{x}^{(i)})   (35)
 = \frac{1}{n} \sum_{i=1}^{n} \left( n f(\mathbf{x}^{(i)}) - i f(\mathbf{x}^{(i)}) \right)   (36)
 = \frac{1}{n} \left( n \sum_{i=1}^{n} f(\mathbf{x}^{(i)}) - \sum_{i=1}^{n} i f(\mathbf{x}^{(i)}) \right)   (37)
 = \sum_{i=1}^{n} f(\mathbf{x}^{(i)}) - \sum_{i=1}^{n} i f(\mathbf{x}^{(i)}) / n   (38)
Margins

Intuitively, not all solution vectors (and corresponding hyperplanes) are equally good
It makes sense for the decision boundary to be as far away from the training instances as possible
◮ this improves the chance that if the position of the data points is slightly perturbed, the decision boundary will still be correct.
Results from Statistical Learning Theory confirm these intuitions: maintaining large margins leads to small generalization error [Vapnik, 1995]
Functional margin

The functional margin of an instance (x, y) with respect to some hyperplane (w, b) is defined to be

\gamma = y (\mathbf{w} \cdot \mathbf{x} + b)   (39)

A large margin version of the Perceptron update: if y(w · x + b) ≤ θ then update, where θ is the threshold or margin parameter
So an update is made not only on incorrectly classified examples, but also on those classified with insufficient confidence.
Max-margin algorithms maximize the minimum margin between the decision boundary and the training set (cf. SVM)
Perceptron – dual formulation

We noticed earlier that the weight vector ends up being a linear combination of the training examples. Instantiating the update function f(x^(i)):

\mathbf{w} = \sum_{i=1}^{n} \alpha_i y^{(i)} \mathbf{x}^{(i)}

where α_i counts how many times the i-th example was misclassified (0 if it never triggered an update).
The discriminant function then becomes:

g(\mathbf{x}) = \left( \sum_{i=1}^{n} \alpha_i y^{(i)} \mathbf{x}^{(i)} \right) \cdot \mathbf{x} + b   (40)
 = \sum_{i=1}^{n} \alpha_i y^{(i)}\, \mathbf{x}^{(i)} \cdot \mathbf{x} + b   (41)
Dual Perceptron training

DualPerceptron(x_{1:N}, y_{1:N}, I):
1: α ← 0
2: b ← 0
3: for j = 1 ... I do
4:   for k = 1 ... N do
5:     if y^(k) ( Σ_{i=1}^{N} α_i y^(i) x^(i) · x^(k) + b ) ≤ 0 then
6:       α_k ← α_k + 1
7:       b ← b + y^(k)
8: return (α, b)
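A sketch (mine) of the dual perceptron; the `kernel` argument is my own addition to show where a kernel function (next slides) would be plugged in for the dot product.

```python
import numpy as np

def dual_perceptron(X, y, I=10, kernel=lambda a, b: a @ b):
    """Dual perceptron: alpha_k counts the updates triggered by example k."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    N = len(X)
    alpha, b = np.zeros(N), 0.0
    K = np.array([[kernel(X[i], X[k]) for k in range(N)] for i in range(N)])
    for _ in range(I):
        for k in range(N):
            if y[k] * (np.sum(alpha * y * K[:, k]) + b) <= 0:
                alpha[k] += 1
                b += y[k]
    return alpha, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha, b = dual_perceptron(X, y)
w = (alpha * y) @ X                 # recover the primal weight vector
print(np.sign(X @ w + b))           # should match y
```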
Kernels

Note that in the dual formulation there is no explicit weight vector: the training algorithm and the classification are expressed in terms of dot products between training examples and the test example
We can generalize such dual algorithms to use kernel functions
◮ A kernel function can be thought of as a dot product in some transformed feature space

K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})

where the map φ projects the vectors in the original feature space onto the transformed feature space
◮ It can also be thought of as a similarity function in the input object space
Kernel – example

Consider the following kernel function K : R^d × R^d → R:

K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2   (42)
 = \left( \sum_{i=1}^{d} x_i z_i \right) \left( \sum_{j=1}^{d} x_j z_j \right)   (43)
 = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j z_i z_j   (44)
 = \sum_{i,j=1}^{d} (x_i x_j)(z_i z_j)   (45)
Kernel vs feature map

The feature map φ corresponding to K for d = 2 dimensions:

\phi\!\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) = \begin{pmatrix} x_1 x_1 \\ x_1 x_2 \\ x_2 x_1 \\ x_2 x_2 \end{pmatrix}   (46)

Computing the feature map φ explicitly needs O(d²) time
Computing K is linear O(d) in the number of dimensions
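A quick numerical check (my own) that the quadratic kernel of Eq. (42) equals the dot product of the explicit feature maps of Eq. (46).

```python
import numpy as np

def quad_kernel(x, z):
    return (x @ z) ** 2

def phi(x):
    return np.outer(x, x).ravel()   # all pairwise products x_i * x_j

rng = np.random.default_rng(1)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(quad_kernel(x, z), phi(x) @ phi(z)))   # True
```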
Why does it matter?

If you think of features as binary indicators, then the quadratic kernel above creates feature conjunctions
E.g. in NER, if x_1 indicates that the word is capitalized and x_2 indicates that the previous token is a sentence boundary, then with the quadratic kernel we efficiently compute the feature that both conditions hold.
Geometric intuition: mapping points to a higher dimensional space makes them easier to separate with a linear boundary
Separability in 2D and 3D

Figure: Two-dimensional classification example, non-separable in two dimensions, becomes separable when mapped to 3 dimensions by (x_1, x_2) ↦ (x_1^2, 2 x_1 x_2, x_2^2)
Support Vector Machines

The two key ideas, large margin and the "kernel trick", come together in Support Vector Machines
Margin: a decision boundary which is as far away from the training instances as possible improves the chance that if the position of the data points is slightly perturbed, the decision boundary will still be correct.
Results from Statistical Learning Theory confirm these intuitions: maintaining large margins leads to small generalization error [Vapnik, 1995].
A perceptron algorithm finds any hyperplane which separates the classes: SVM finds the one that additionally has the maximum margin
Quadratic optimization formulation

The functional margin can be made larger just by rescaling the weights by a constant
Hence we can fix the functional margin to be 1 and minimize the norm of the weight vector
This is equivalent to maximizing the geometric margin
For linearly separable training instances ((x_1, y_1), ..., (x_n, y_n)), find the hyperplane (w, b) that solves the optimization problem:

\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \ \ \forall i \in 1..n   (47)

This hyperplane separates the examples with geometric margin 2 / ||w||
Support vectors

SVM finds a separating hyperplane with the largest margin to the nearest instances
This has the effect that the decision boundary is fully determined by a small subset of the training examples (the nearest ones on both sides)
Those instances are the support vectors
Separating hyperplane and support vectors

[Figure: maximum-margin separating hyperplane in 2D, with the support vectors lying on the margin]
Soft margin

The SVM with soft margin works by relaxing the requirement that all data points lie outside the margin
For each offending instance there is a "slack variable" ξ_i which measures how much it would have to move to obey the margin constraint:

\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \ \ \xi_i \geq 0 \ \ \forall i \in 1..n   (48)

where ξ_i = max(0, 1 − y_i(w · x_i + b))
The hyper-parameter C trades off minimizing the norm of the weight vector versus classifying correctly as many examples as possible.
As the value of C tends towards infinity, the soft-margin SVM approximates the hard-margin version.
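A small sketch (mine) showing how the slack variables and the soft-margin objective of Eq. (48) are computed for a fixed (w, b); it does not solve the optimization problem itself.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Objective of Eq. (48): 0.5||w||^2 + C * sum of slacks xi_i."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)      # slack variables
    return 0.5 * w @ w + C * xi.sum(), xi

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.2, 0.1]])
y = np.array([1, 1, -1, -1, -1])             # last point sits inside the margin
w, b = np.array([0.5, 0.5]), 0.0
obj, xi = soft_margin_objective(w, b, X, y)
print(obj, xi)                               # only the offending point gets xi > 0
```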
Dual form

The dual formulation is in terms of the support vectors, where SV is the set of their indices:

f(\mathbf{x}, \boldsymbol\alpha^*, b^*) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i^* (\mathbf{x}_i \cdot \mathbf{x}) + b^* \right)   (49)

The weights in this decision function are the Lagrange multipliers α*.

\max_{\boldsymbol\alpha} \ W(\boldsymbol\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \ \ \alpha_i \geq 0 \ \ \forall i \in 1..n   (50)

The Lagrange multipliers together with the support vectors determine (w, b):

\mathbf{w} = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i   (51)

b = y_k - \mathbf{w} \cdot \mathbf{x}_k \quad \text{for any } k \text{ such that } \alpha_k \neq 0   (52)
Multiclass classification with SVM

SVM is essentially a binary classifier. A common method to perform multiclass classification with SVM is to train multiple binary classifiers and combine their predictions to form the final prediction. This can be done in two ways:
◮ One-vs-rest (also known as one-vs-all): train |Y| binary classifiers and choose the class for which the margin is the largest.
◮ One-vs-one: train |Y|(|Y| − 1)/2 pairwise binary classifiers, and choose the class selected by the majority of them.
An alternative method is to make the weight vector and the feature function Φ depend on the output y, and learn a single classifier which will predict the class with the highest score:

y = \operatorname{argmax}_{y' \in Y} \ \mathbf{w} \cdot \Phi(\mathbf{x}, y') + b   (53)
Linear Regression

Training data: observations paired with outcomes (y ∈ R)
Observations have features (predictors, typically also real numbers)
The model is a regression line y = ax + b which best fits the observations
◮ a is the slope
◮ b is the intercept
◮ This model has two parameters (or weights)
◮ One feature = x
◮ Example:
  ⋆ x = number of vague adjectives in property descriptions
  ⋆ y = amount the house sold for over the asking price
Multiple linear regression

More generally: y = w_0 + \sum_{i=1}^{N} w_i f_i, where
◮ y = outcome
◮ w_0 = intercept
◮ f_1..f_N = feature vector and w_1..w_N = weight vector
We can fold w_0 into the weight vector by adding a special feature f_0 which is always 1; then the equation is equivalent to the dot product: y = w · f
Learning linear regression

Minimize the sum of squared errors over the training set of M examples:

\mathrm{cost}(W) = \sum_{j=1}^{M} \left( y_{\mathrm{pred}}^{(j)} - y_{\mathrm{obs}}^{(j)} \right)^2

where

y_{\mathrm{pred}}^{(j)} = \sum_{i=0}^{N} w_i f_i^{(j)}

A closed-form formula for choosing the best set of weights W is given by:

W = (X^T X)^{-1} X^T \vec{y}

where the matrix X contains the training example features, and \vec{y} is the vector of outcomes.
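A sketch (mine) of the closed-form solution on invented data; a column of ones plays the role of the special f_0 = 1 intercept feature.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.uniform(0, 10, size=50)                     # one feature
y = 3.0 * f + 2.0 + rng.normal(0, 0.5, size=50)     # true slope 3, intercept 2

X = np.column_stack([np.ones_like(f), f])           # [f_0 = 1, f_1]
W = np.linalg.inv(X.T @ X) @ X.T @ y                # W = (X^T X)^{-1} X^T y
print(W)                                            # approximately [2, 3]
```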
Logistic regression

In logistic regression we use the linear model to do classification, i.e. assign probabilities to class labels
For binary classification, predict p(y = true | x). But the predictions of the linear regression model are ∈ R, whereas p(y = true | x) ∈ [0, 1]
Instead, predict the logit function of the probability:

\ln\left( \frac{p(y=\mathrm{true} \mid x)}{1 - p(y=\mathrm{true} \mid x)} \right) = \mathbf{w} \cdot \mathbf{f}   (54)

\frac{p(y=\mathrm{true} \mid x)}{1 - p(y=\mathrm{true} \mid x)} = e^{\mathbf{w} \cdot \mathbf{f}}   (55)

Solving for p(y = true | x) we obtain:

p(y=\mathrm{true} \mid x) = \frac{e^{\mathbf{w} \cdot \mathbf{f}}}{1 + e^{\mathbf{w} \cdot \mathbf{f}}}   (56)
 = \frac{\exp\left( \sum_{i=0}^{N} w_i f_i \right)}{1 + \exp\left( \sum_{i=0}^{N} w_i f_i \right)}   (57)
Logistic regression – classification

Example x belongs to class true if:

\frac{p(y=\mathrm{true} \mid x)}{1 - p(y=\mathrm{true} \mid x)} > 1   (58)

e^{\mathbf{w} \cdot \mathbf{f}} > 1   (59)

\mathbf{w} \cdot \mathbf{f} > 0   (60)

\sum_{i=0}^{N} w_i f_i > 0   (61)

The equation \sum_{i=0}^{N} w_i f_i = 0 defines a hyperplane in N-dimensional space, with points above this hyperplane belonging to class true
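A sketch (mine) of prediction with some fixed weight vector; the weights and features below are invented for illustration.

```python
import numpy as np

def p_true(w, f):
    # Equivalent form of Eq. (57): the logistic sigmoid of w . f
    return 1.0 / (1.0 + np.exp(-(w @ f)))

w = np.array([-1.0, 2.0, 0.5])                 # w_0 is the bias weight
f = np.array([1.0, 0.8, 0.3])                  # f_0 = 1
print(p_true(w, f))                            # ~0.68
print("true" if w @ f > 0 else "false")        # same decision via Eq. (60)
```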
Logistic regression – learning

Conditional likelihood estimation: choose the weights which make the probability of the observed values y the highest, given the observations x
For a training set with M examples:

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \prod_{i=1}^{M} P(y^{(i)} \mid x^{(i)})   (62)

A problem in convex optimization, solvable with e.g.:
◮ L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno method)
◮ gradient ascent
◮ conjugate gradient
◮ iterative scaling algorithms
Maximum Entropy model

Logistic regression with more than two classes = multinomial logistic regression
Also known as Maximum Entropy (MaxEnt)
The MaxEnt equation generalizes (57) above:

p(c \mid x) = \frac{\exp\left( \sum_{i=0}^{N} w_{ci} f_i \right)}{\sum_{c' \in C} \exp\left( \sum_{i=0}^{N} w_{c'i} f_i \right)}   (63)

The denominator is the normalization factor, usually called Z, used to make the score into a proper probability distribution:

p(c \mid x) = \frac{1}{Z} \exp \sum_{i=0}^{N} w_{ci} f_i
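A sketch (mine) of Eq. (63): per-class scores w_c · f are normalized into a distribution over classes; the weight matrix and features below are invented.

```python
import numpy as np

def maxent_probs(W, f):
    """W has one row of weights per class; f is the feature vector."""
    scores = W @ f
    scores -= scores.max()                 # for numerical stability only
    expd = np.exp(scores)
    return expd / expd.sum()               # divide by the normalizer Z

W = np.array([[0.5, -1.0, 0.2],            # class JJ
              [1.5,  2.0, 0.1],            # class VB
              [0.3,  0.4, 0.0]])           # class NN
f = np.array([1.0, 1.0, 0.0])              # binary indicator features, f_0 = 1
print(maxent_probs(W, f))                  # sums to 1, highest for VB
```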
MaxEnt features

In MaxEnt modeling, normally binary features are used
Features depend on classes: f_i(c, x) ∈ {0, 1}
These are indicator features
Example x: Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow
Example features:

f_1(c, x) = 1 if word_i = race & c = NN, 0 otherwise
f_2(c, x) = 1 if t_{i-1} = TO & c = VB, 0 otherwise
f_3(c, x) = 1 if suffix(word_i) = ing & c = VBG, 0 otherwise
Binarizing features

Example x: Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow

Vector of symbolic features of x:

  word_i   suf   tag_{i-1}   is-lower(w_i)
  race     ace   TO          TRUE

Class-dependent indicator features of x:

        word_i=race   suf=ing   suf=ace   tag_{i-1}=TO   tag_{i-1}=DT   is-lower(w_i)=TRUE   ...
  JJ    0             0         0         0              0              0
  VB    1             0         1         1              0              1
  NN    0             0         0         0              0              0
  ...
Entia non sunt multiplicanda praeter necessitatem
("Entities must not be multiplied beyond necessity" – Occam's razor)
Maximum Entropy principle

Jaynes, 1957
Entropy

Out of all possible models, choose the simplest one consistent with the data (Occam's razor)
Entropy of the distribution of a discrete random variable X:

H(X) = -\sum_{x} P(X = x) \log_2 P(X = x)

The uniform distribution has the highest entropy
Finding the maximum entropy distribution in the set C of possible distributions:

p^* = \operatorname{argmax}_{p \in C} H(p)

Berger et al. (1996) showed that solving this optimization problem is equivalent to finding the multinomial logistic regression model whose weights maximize the likelihood of the training data.
Maxent principle – simple example

Find a Maximum Entropy probability distribution p(a, b) where a ∈ {x, y} and b ∈ {0, 1}
The only thing we know is the following constraint: p(x, 0) + p(y, 0) = 0.6

  p(a,b)   b=0   b=1
  a=x       ?     ?
  a=y       ?     ?
  total    0.6         (grand total: 1)

The maximum entropy solution spreads the probability mass as uniformly as the constraint allows:

  p(a,b)   b=0   b=1
  a=x      0.3   0.2
  a=y      0.3   0.2
  total    0.6         (grand total: 1)
Constraints

The constraints imposed on the probability model are encoded in the features: the expected value of each one of the I indicator features f_i under the model p should be equal to the expected value under the empirical distribution p̃ obtained from the training data:

\forall i \in I, \quad E_p[f_i] = E_{\tilde p}[f_i]   (64)

The expected value under the empirical distribution is given by:

E_{\tilde p}[f_i] = \sum_x \sum_y \tilde p(x, y) f_i(x, y) = \frac{1}{N} \sum_{j=1}^{N} f_i(x_j, y_j)   (65)

The expected value according to the model p is:

E_p[f_i] = \sum_x \sum_y p(x, y) f_i(x, y)   (66)
Approximation

However, this requires summing over all possible object–class label pairs, which is in general not possible. Therefore the following standard approximation is used:

E_p[f_i] = \sum_x \sum_y \tilde p(x)\, p(y \mid x) f_i(x, y) = \frac{1}{N} \sum_{j=1}^{N} \sum_y p(y \mid x_j) f_i(x_j, y)   (67)

where p̃(x) is the relative frequency of object x in the training data
This has the advantage that p̃(x) for unseen events is 0.
The term p(y | x) is calculated according to Equation 63.
Regularization

Although Maximum Entropy models are maximally uniform subject to the constraints, they can still overfit (too many features)
Regularization relaxes the constraints and results in models with smaller weights which may perform better on new data.
Instead of solving the optimization in Equation 62, here in log form:

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{i=1}^{M} \log p_{\mathbf{w}}(y^{(i)} \mid x^{(i)})   (68)

we solve instead the following modified problem:

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{i=1}^{M} \log p_{\mathbf{w}}(y^{(i)} \mid x^{(i)}) + \alpha R(\mathbf{w})   (69)

where R is the regularizer used to penalize large weights [Jurafsky and Martin, 2008b].
Gaussian smoothing

We can use a regularizer which assumes that weight values have a Gaussian distribution centered on 0 and with variance σ². By multiplying each weight by a Gaussian prior we will maximize the following equation:

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{i=1}^{M} \log p_{\mathbf{w}}(y^{(i)} \mid x^{(i)}) - \sum_{j=0}^{d} \frac{w_j^2}{2\sigma_j^2}   (70)

where σ_j² are the variances of the Gaussians of the feature weights.
This modification corresponds to using maximum a posteriori rather than maximum likelihood model estimation
It is common to constrain all the weights to have the same global variance, which gives a single tunable algorithm parameter (its value can be found on held-out data)