CS446 Introduction to Machine Learning (Fall 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Lecture 8: Dual and Kernels
Prof. Julia Hockenmaier
juliahmr@illinois.edu
Admin
Reminder: Homework Late Policy
Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don’t accept assignments more than two days after their due date. Let us know if there are any special circumstances (family, health, etc.).
Convergence checks
What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. τ = 10⁻³)
– Compute Δw = w_old − w_new, the difference between w_old and w_new
– w has converged when ‖Δw‖ < τ
Convergence checks
How often do I check for convergence?
Batch learning:
w_old = w before seeing the current batch
w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.
Convergence checks
How often do I check for convergence?
Online learning:
– Problem: A single example may only lead to very small changes in w
– Solution: Only check for convergence after every k examples (or every k updates; either works)
w_old = w after n·k examples/updates
w_new = w after (n+1)·k examples/updates
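A minimal sketch of this convergence check in Python/numpy (not from the slides): the names has_converged and train_online and the parameters k and tau are illustrative, and the perceptron-style update treats a score of 0 as a mistake so that learning can get started from w = 0.

```python
import numpy as np

def has_converged(w_old, w_new, tau=1e-3):
    """w has converged when the norm of the difference drops below the threshold tau."""
    return np.linalg.norm(w_old - w_new) < tau

def train_online(X, y, k=100, tau=1e-3, max_epochs=100):
    """Online perceptron that checks for convergence only every k examples/updates."""
    w = np.zeros(X.shape[1])
    w_old = w.copy()
    seen = 0
    for _ in range(max_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (a score of 0 counts as a mistake here)
                w = w + y_i * x_i
            seen += 1
            if seen % k == 0:               # compare snapshots taken k examples apart
                if has_converged(w_old, w, tau):
                    return w
                w_old = w.copy()
    return w
```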
Back to linear classifiers…
Linear classifiers so far…
What we’ve seen so far is not the whole story:
– We’ve assumed that the data are linearly separable
– We’ve ignored the fact that the perceptron just finds some decision boundary, but not necessarily an optimal decision boundary
Data are not linearly separable
– Noise / outliers
– The target function is not linear in X
Today’s key concepts
Kernel trick: dealing with target functions that are not linear in the original features (i.e. data that are not linearly separable). This requires us to move to the dual representation.
Dual representation of linear classifiers
Dual representation
Recall the Perceptron update rule: If x_m is misclassified, add y_m·x_m to w:
if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m
Dual representation: Write w as a weighted sum of training items:
w = ∑_n α_n y_n x_n   (α_n: how often was x_n misclassified?)
f(x) = w·x = ∑_n α_n y_n (x_n·x)
Dual representation
Primal Perceptron update rule: If x_m is misclassified, add y_m·x_m to w:
if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m
Dual Perceptron update rule: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1
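A minimal sketch of the dual update in Python/numpy (illustrative, not from the slides; it assumes labels y ∈ {−1, +1}, and treats a score of 0 as a mistake so that the all-zero initialization of α can change):

```python
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """Dual perceptron: alpha[m] counts how often training item m was misclassified."""
    n = X.shape[0]
    alpha = np.zeros(n)
    G = X @ X.T                                   # G[d, m] = x_d . x_m
    for _ in range(epochs):
        for m in range(n):
            # f(x_m) = sum_d alpha_d y_d (x_d . x_m); a score of 0 counts as a mistake
            if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                alpha[m] += 1
    return alpha

def dual_predict(alpha, X_train, y_train, x):
    """Classify a new item x using only dot products with the training items."""
    return np.sign(np.sum(alpha * y_train * (X_train @ x)))
```

Note that both training and prediction only ever touch the training items through dot products x_d·x — this is exactly what the kernel trick will exploit later.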
Dual representation
Classifying x in the primal: f(x) = w·x
w = feature weights (to be learned)
w·x = dot product between w and x
Classifying x in the dual: f(x) = ∑_n α_n y_n (x_n·x)
α_n = weight of the n-th training example (to be learned)
x_n·x = dot product between x_n and x
The dual representation is advantageous when #training examples ≪ #features (it requires fewer parameters to learn)
Kernels
Making data linearly separable
[Figure: data plotted in the original feature space, axes x1 and x2]
f(x) = 1 iff x1² + x2² ≤ 1
Making data linearly separable
[Figure: the same data in the transformed feature space, axes x1·x1 and x2·x2]
Transform the data: x = (x1, x2) ⇒ x′ = (x1², x2²)
f(x′) = 1 iff x′1 + x′2 ≤ 1
Making data linearly separable
[Figure: 1-d data on the x1 axis, and the same data lifted into the (x1, x1²) plane]
These data aren’t linearly separable in the x1 space.
But adding a second dimension with x2 = x1² makes them linearly separable in 〈x1, x2〉.
Making data linearly separable
It is common for data not to be linearly separable in the original feature space. We can often introduce new features to make the data linearly separable in the new space:
– transform the original features (e.g. x → x²)
– include transformed features in addition to the original features
– capture interactions between features (e.g. x3 = x1·x2)
But this may blow up the number of features.
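As a concrete illustration of the first bullet, a small numpy sketch on made-up data for the circle concept from the earlier slides (the data, seed, and variable names are hypothetical):

```python
import numpy as np

# Hypothetical toy data: the label is +1 iff x1^2 + x2^2 <= 1 (the circle concept),
# which no line in the original (x1, x2) space separates.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1, 1, -1)

# Explicit transformation x = (x1, x2) -> x' = (x1^2, x2^2): in the new space the
# concept becomes the half-plane x1' + x2' <= 1, which a linear classifier can represent.
X_prime = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])
```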
Making data linearly separable
We need to introduce a lot of new features to learn the target function.
Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn w.
The dual representation is not affected.
The kernel trick
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: We don’t need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
Quadratic kernel
Original features: x = (a, b)
Transformed features: φ(x) = (a², b², √2·ab)
Dot product in transformed space (with x1 = (a1, b1), x2 = (a2, b2)):
φ(x1)·φ(x2) = a1²a2² + b1²b2² + 2·a1b1a2b2 = (x1·x2)²
Kernel: K(x1, x2) = (x1·x2)² = φ(x1)·φ(x2)
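A quick numeric sanity check of this identity (the helper names phi and quadratic_kernel are made up for illustration):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-d inputs: (a, b) -> (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a ** 2, b ** 2, np.sqrt(2) * a * b])

def quadratic_kernel(x1, x2):
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])
# Both routes give the same number: (x1 . x2)^2 == phi(x1) . phi(x2)
assert np.isclose(quadratic_kernel(x1, x2), np.dot(phi(x1), phi(x2)))
```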
Polynomial kernels
Polynomial kernel of degree p:
– Basic form: K(x_i, x_j) = (x_i·x_j)^p
– Standard form (captures all lower-order terms): K(x_i, x_j) = (x_i·x_j + 1)^p
From dual to kernel perceptron
Dual Perceptron:
f(x_m) = ∑_d α_d y_d (x_d·x_m)
Update: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1
Kernel Perceptron:
f(x_m) = ∑_d α_d y_d φ(x_d)·φ(x_m) = ∑_d α_d y_d K(x_d, x_m)
Update: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d K(x_d, x_m) < 0: α_m := α_m + 1
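A sketch of the kernel perceptron in Python/numpy, mirroring the dual perceptron sketch above (illustrative; it assumes labels in {−1, +1}, treats a score of 0 as a mistake, and precomputes the full kernel matrix):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Kernel perceptron: the dual perceptron with x_d . x_m replaced by K(x_d, x_m)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[kernel(X[d], X[m]) for m in range(n)] for d in range(n)])
    for _ in range(epochs):
        for m in range(n):
            if y[m] * np.sum(alpha * y * K[:, m]) <= 0:   # a score of 0 counts as a mistake
                alpha[m] += 1
    return alpha

def kernel_predict(alpha, X_train, y_train, kernel, x):
    """Classify a new item x using only kernel evaluations against the training items."""
    scores = np.array([kernel(x_d, x) for x_d in X_train])
    return np.sign(np.sum(alpha * y_train * scores))

# e.g. the quadratic kernel from the earlier slide:
quadratic = lambda u, v: np.dot(u, v) ** 2
```

Because only the values K(x_d, x_m) are needed, the feature map φ is never computed explicitly.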
Primal and dual representation
Linear classifier (primal representation): w defines weights of the features of x
f(x) = w·x
Linear classifier (dual representation): Rewrite w as a (weighted) sum of training items:
w = ∑_n α_n y_n x_n
f(x) = w·x = ∑_n α_n y_n (x_n·x)
The kernel trick
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: We don’t need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
The kernel matrix
The kernel matrix of a data set D = {x1, …, xn}, defined by a kernel function k(x, z) = φ(x)·φ(z), is the n×n matrix K with K_ij = k(x_i, x_j).
You’ll also find the term ‘Gram matrix’ used:
– The Gram matrix of a set of n vectors S = {x1, …, xn} is the n×n matrix G with G_ij = x_i·x_j
– The kernel matrix is the Gram matrix of {φ(x1), …, φ(xn)}
Properties of the kernel matrix K
K is symmetric: K_ij = k(x_i, x_j) = φ(x_i)·φ(x_j) = k(x_j, x_i) = K_ji
K is positive semi-definite (∀ vectors v: vᵀKv ≥ 0). Proof:
vᵀKv = ∑_{i=1..D} ∑_{j=1..D} v_i v_j K_ij
     = ∑_{i=1..D} ∑_{j=1..D} v_i v_j 〈φ(x_i), φ(x_j)〉
     = ∑_{k=1..N} ∑_{i=1..D} ∑_{j=1..D} (v_i φ_k(x_i))·(v_j φ_k(x_j))
     = ∑_{k=1..N} ( ∑_{i=1..D} v_i φ_k(x_i) )² ≥ 0
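A small numeric check of both properties on random data, using the quadratic kernel with a constant term; the helper kernel_matrix is illustrative, not from the slides:

```python
import numpy as np

def kernel_matrix(X, kernel):
    """Kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = kernel_matrix(X, lambda u, v: (np.dot(u, v) + 1) ** 2)

# Symmetric, and all eigenvalues are >= 0 up to numerical noise,
# i.e. the matrix is positive semi-definite.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```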
Quadratic kernel (1)
K(x, z) = (x·z)²
This corresponds to a feature space which contains only terms of degree 2 (products of two features); for x = (x1, x2) in R², these are x1x1, x1x2, x2x2.
For x = (x1, x2), z = (z1, z2):
K(x, z) = (x·z)² = x1²z1² + 2·x1z1x2z2 + x2²z2² = φ(x)·φ(z)
Hence φ(x) = (x1², √2·x1x2, x2²)
Quadratic kernel (2)
K(x, z) = (x·z + c)²
This corresponds to a feature space which contains constants, linear terms (the original features), as well as terms of degree 2 (products of two features); for x = (x1, x2) in R²: x1, x2, x1x1, x1x2, x2x2.
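One way to see this for x = (x1, x2) and z = (z1, z2) is to expand the square (a worked derivation, not from the slides):

```latex
(\mathbf{x}\cdot\mathbf{z} + c)^2
  = (x_1 z_1 + x_2 z_2 + c)^2
  = \underbrace{x_1^2 z_1^2 + 2\,x_1 z_1 x_2 z_2 + x_2^2 z_2^2}_{\text{degree-2 terms}}
  \;+\; \underbrace{2c\,x_1 z_1 + 2c\,x_2 z_2}_{\text{linear terms}}
  \;+\; \underbrace{c^2}_{\text{constant}}
```

so the implicit feature map is φ(x) = (x1², x2², √2·x1x2, √(2c)·x1, √(2c)·x2, c), and the inner product with the corresponding φ(z) reproduces exactly this expansion.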
Polynomial kernels
– Linear kernel: k(x, z) = x·z
– Polynomial kernel of degree d (only dth-order interactions): k(x, z) = (x·z)^d
– Polynomial kernel up to degree d (all interactions of order d or lower): k(x, z) = (x·z + c)^d with c > 0
Constructing new kernels from one existing kernel k(x, x′)
You can construct new kernels k′(x, x′) from k(x, x′) by:
– Multiplying k(x, x′) by a constant c: k′(x, x′) = c·k(x, x′)
– Multiplying k(x, x′) by a function f applied to x and x′: k′(x, x′) = f(x) k(x, x′) f(x′)
– Applying a polynomial (with non-negative coefficients) to k(x, x′): k′(x, x′) = P(k(x, x′)), with P(z) = ∑_i a_i z^i and a_i ≥ 0
– Exponentiating k(x, x′): k′(x, x′) = exp(k(x, x′))
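A sketch of these four constructions in Python, using the linear kernel as the base k; the constant c, the function f, and the polynomial coefficients are arbitrary choices for illustration:

```python
import numpy as np

# Base kernel: the linear kernel k(x, x') = x . x'
k = lambda x, xp: np.dot(x, xp)

c = 3.0                                                     # scaling constant (non-negative)
f = lambda x: np.exp(-np.dot(x, x))                         # any real-valued function of x

scaled      = lambda x, xp: c * k(x, xp)                    # c * k(x, x')
reweighted  = lambda x, xp: f(x) * k(x, xp) * f(xp)         # f(x) k(x, x') f(x')
polynomial  = lambda x, xp: 2.0 * k(x, xp) ** 2 + k(x, xp)  # P(k(x, x')) with non-negative coefficients
exponential = lambda x, xp: np.exp(k(x, xp))                # exp(k(x, x'))
```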