CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Lecture 9: Dual and Kernel
Prof. Julia Hockenmaier (juliahmr@illinois.edu)
Linear classifiers so far…
What we've seen so far is not the whole story:
– We've assumed that the data are linearly separable
– We've ignored the fact that the perceptron just finds some decision boundary, but not necessarily an optimal decision boundary
Data are not linearly separable
– Noise / outliers
– Target function is not linear in X
Dual representation of linear classifiers
Dual representation
Recall the Perceptron update rule: if x_m is misclassified, add y_m·x_m to w:
    if y_m·f(x_m) = y_m·(w·x_m) < 0:  w := w + y_m·x_m
Dual representation: write w as a weighted sum of the training items:
    w = ∑_n α_n y_n x_n
α_n: how often was x_n misclassified?
    f(x) = w·x = ∑_n α_n y_n (x_n·x)
Dual representation
Primal Perceptron update rule: if x_m is misclassified, add y_m·x_m to w:
    if y_m·f(x_m) = y_m·(w·x_m) < 0:  w := w + y_m·x_m
Dual Perceptron update rule: if x_m is misclassified, add 1 to α_m:
    if y_m·∑_d α_d y_d (x_d·x_m) < 0:  α_m := α_m + 1
Dual representation
Classifying x in the primal:
    f(x) = w·x
    w = feature weights (to be learned)
    w·x = dot product between w and x
Classifying x in the dual:
    f(x) = ∑_n α_n y_n (x_n·x)
    α_n = weight of the n-th training example (to be learned)
    x_n·x = dot product between x_n and x
The dual representation is advantageous when #training examples ≪ #features (requires fewer parameters to learn).
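To make the dual update rule concrete, here is a minimal sketch of a dual perceptron in Python (my own illustration, not code from the lecture; the names `dual_perceptron` and `predict` and the epoch count are assumptions, and a tie at 0 is counted as a mistake so training can start from α = 0):

```python
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """Train a perceptron in the dual: alpha[n] counts how often example n was misclassified."""
    n = X.shape[0]
    alpha = np.zeros(n)
    G = X @ X.T                                   # Gram matrix: G[d, m] = x_d . x_m
    for _ in range(epochs):
        for m in range(n):
            # f(x_m) = sum_d alpha_d y_d (x_d . x_m); <= 0 so the all-zero start gets updated
            if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                alpha[m] += 1
    return alpha

def predict(alpha, X, y, x_new):
    """f(x) = sum_n alpha_n y_n (x_n . x), i.e. w . x with w = sum_n alpha_n y_n x_n."""
    return np.sign(np.sum(alpha * y * (X @ x_new)))
```

Note that training and prediction only ever touch the data through dot products x_d·x_m, which is exactly what the kernel trick below exploits.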
Kernels
Making data linearly separable
[Figure: data plotted in the original feature space, axes x_1 and x_2]
f(x) = 1 iff x_1² + x_2² ≤ 1
Making data linearly separable
[Figure: the same data plotted in the transformed feature space, axes x_1·x_1 and x_2·x_2]
Transform the data: x = (x_1, x_2) → x′ = (x_1², x_2²)
f(x′) = 1 iff x′_1 + x′_2 ≤ 1
Making data linearly separable
[Figure: 1-D data on the x_1 axis, and the same data lifted into the (x_1, x_1²) plane]
These data aren't linearly separable in the x_1 space.
But adding a second dimension with x_2 = x_1² makes them linearly separable in 〈x_1, x_2〉.
Making data linearly separable
It is common for data not to be linearly separable in the original feature space.
We can often introduce new features to make the data linearly separable in the new space (a small sketch follows below):
– transform the original features (e.g. x → x²)
– include transformed features in addition to the original features
– capture interactions between features (e.g. x_3 = x_1·x_2)
But this may blow up the number of features.
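As an illustration of such a feature expansion (my own sketch, not code from the lecture; the helper name `expand_features` is made up), the function below maps 2-D inputs to the original features plus squares and an interaction term:

```python
import numpy as np

def expand_features(X):
    """Map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2): originals, squares, and an interaction."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Points inside the unit circle (label +1) become linearly separable after this expansion,
# since x1^2 + x2^2 <= 1 is a linear constraint in the new coordinates.
X = np.array([[0.1, 0.2], [1.5, -1.0], [-0.3, 0.4], [1.2, 1.3]])
print(expand_features(X).shape)   # (4, 5): the feature count grows quickly with the degree
```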
Making data linearly separable
We need to introduce a lot of new features to learn the target function.
Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn w.
The dual representation is not affected.
The kernel trick
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j):
    K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: we don't need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
Quadratic kernel
Original features: x = (a, b)
Transformed features: φ(x) = (a², b², √2·ab)
Dot product in the transformed space:
    φ(x_1)·φ(x_2) = a_1²a_2² + b_1²b_2² + 2·a_1b_1a_2b_2 = (x_1·x_2)²
Kernel: K(x_1, x_2) = (x_1·x_2)² = φ(x_1)·φ(x_2)
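As a quick sanity check (my own example, not from the slides), the snippet below verifies numerically that the quadratic kernel equals the dot product under the feature map φ(x) = (a², b², √2·ab):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel: (a, b) -> (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a**2, b**2, np.sqrt(2) * a * b])

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever forming phi(x) or phi(z)."""
    return np.dot(x, z) ** 2

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x1), phi(x2)), quadratic_kernel(x1, x2))   # both equal 1.0
```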
Polynomial kernels
Polynomial kernel of degree p:
– Basic form: K(x_i, x_j) = (x_i·x_j)^p
– Standard form (captures all lower-order terms): K(x_i, x_j) = (x_i·x_j + 1)^p
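Both forms are one-line functions; a possible sketch (the function names are mine, not the lecture's):

```python
import numpy as np

def poly_kernel_basic(x, z, p):
    """Basic polynomial kernel: only degree-p terms."""
    return np.dot(x, z) ** p

def poly_kernel_standard(x, z, p):
    """Standard polynomial kernel: the +1 brings in all lower-order terms and a constant."""
    return (np.dot(x, z) + 1) ** p
```

For p = 2 the standard form corresponds to a feature map containing the squared features, the pairwise products, the (scaled) original features, and a constant term.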
From dual to kernel perceptron
Dual Perceptron:
    f(x) = ∑_d α_d y_d (x_d·x)
Update: if x_m is misclassified, add 1 to α_m:
    if y_m·∑_d α_d y_d (x_d·x_m) < 0:  α_m := α_m + 1
Kernel Perceptron:
    f(x) = ∑_d α_d y_d φ(x_d)·φ(x) = ∑_d α_d y_d K(x_d, x)
Update: if x_m is misclassified, add 1 to α_m:
    if y_m·∑_d α_d y_d K(x_d, x_m) < 0:  α_m := α_m + 1
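Putting the pieces together, here is a minimal kernel perceptron sketch (my own illustration; the function and variable names are assumptions, and ties at 0 are counted as mistakes so training can start from α = 0):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Train a kernel perceptron: alpha[d] counts how often x_d triggered an update."""
    n = X.shape[0]
    alpha = np.zeros(n)
    # Precompute the Gram matrix K[d, m] = K(x_d, x_m)
    K = np.array([[kernel(X[d], X[m]) for m in range(n)] for d in range(n)])
    for _ in range(epochs):
        for m in range(n):
            if y[m] * np.sum(alpha * y * K[:, m]) <= 0:   # misclassified (ties count as mistakes)
                alpha[m] += 1
    return alpha

def kernel_predict(alpha, X, y, kernel, x_new):
    """f(x) = sum_d alpha_d y_d K(x_d, x); w in the high-dimensional space is never formed."""
    return np.sign(sum(alpha[d] * y[d] * kernel(X[d], x_new) for d in range(len(alpha))))

# e.g. the circular data from earlier becomes learnable with the quadratic kernel:
quadratic = lambda x, z: np.dot(x, z) ** 2
```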
Maximum margin classifiers
Hard vs. soft margins
Dealing with outliers: Slack variables ξ_i
ξ_i measures by how much example (x_i, y_i) fails to achieve margin δ.
Soft margins
Hard margin (primal):
    min_w ½ w·w
    subject to y_i (w·x_i) ≥ 1 for all i = 1, …, n
Soft margin (primal):
    min_w ½ w·w + C ∑_{i=1}^n ξ_i
    subject to y_i (w·x_i) ≥ (1 − ξ_i) and ξ_i ≥ 0 for all i = 1, …, n
Minimize training error while maximizing the margin:
– ∑_i ξ_i is an upper bound on the number of training errors
– C controls the tradeoff between margin and training error
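The slide states the constrained problem; one standard way to turn it into code (my own sketch, not from the lecture) is to fold the slack variables into a hinge loss, since at the optimum ξ_i = max(0, 1 − y_i(w·x_i)), and minimize by subgradient descent:

```python
import numpy as np

def soft_margin_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w . x_i)) by batch subgradient descent.
    The hinge term equals the optimal slack xi_i, so this is the soft-margin objective
    with the constraints folded into the loss (no bias term, matching the slide)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)                 # y_i * (w . x_i) for every example
        violated = margins < 1                # examples with nonzero slack xi_i
        # Subgradient: w from the regularizer, -C * y_i * x_i for each violated constraint
        grad = w - C * (y[violated][:, None] * X[violated]).sum(axis=0)
        w -= lr * grad
    return w
```

In practice one would use a dedicated solver instead (for example scikit-learn's SVC, whose C parameter plays the same role).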