Lecture 8: Dual and Kernels

Prof. Julia Hockenmaier



  1. CS446 Introduction to Machine Learning (Fall 2015), University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/cs446. Lecture 8: Dual and Kernels. Prof. Julia Hockenmaier, juliahmr@illinois.edu

  2. Admin

  3. Reminder: Homework Late Policy. Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don’t accept assignments more than two days after their due date. Let us know if there are any special circumstances (family, health, etc.).

  4. Convergence checks. What does it mean for w to have converged? – Define a convergence threshold τ (e.g. 10⁻³). – Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new. – w has converged when ‖Δw‖ < τ.
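A minimal sketch of this check in Python (NumPy assumed; the function name and the default threshold are illustrative):

    import numpy as np

    def has_converged(w_old, w_new, tau=1e-3):
        """Return True once the weight change is smaller than the threshold tau."""
        delta_w = w_old - w_new                 # Δw = w_old − w_new
        return np.linalg.norm(delta_w) < tau    # converged when ‖Δw‖ < τ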

  5. Convergence checks. How often do I check for convergence? Batch learning: w_old = w before seeing the current batch; w_new = w after seeing the current batch. Assuming your batch is large enough, this works well.

  6. Convergence checks. How often do I check for convergence? Online learning: – Problem: a single example may only lead to very small changes in w. – Solution: only check for convergence after every k examples (or updates; it doesn’t matter which). w_old = w after n·k examples/updates; w_new = w after (n+1)·k examples/updates.
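One way to wire this schedule into an online training loop, as a sketch (NumPy assumed; `update` stands in for whatever per-example update rule is used, e.g. the perceptron update, and k is a tuning choice):

    import numpy as np

    def train_online(examples, w, update, k=100, tau=1e-3):
        """Online learning that only tests for convergence every k examples."""
        w_old = w.copy()
        for t, (x, y) in enumerate(examples, start=1):
            w = update(w, x, y)                 # per-example update, e.g. perceptron
            if t % k == 0:                      # compare w after n·k vs. (n+1)·k examples
                if np.linalg.norm(w_old - w) < tau:
                    break
                w_old = w.copy()
        return w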

  7. Back to linear classifiers…

  8. Linear classifiers so far… What we’ve seen so far is not the whole story: – We’ve assumed that the data are linearly separable. – We’ve ignored the fact that the perceptron just finds some decision boundary, not necessarily an optimal one.

  9. Data are not linearly separable: – Noise / outliers. – The target function is not linear in X.

  10. Today’s key concepts. Kernel trick: dealing with target functions that are not linearly separable. This requires us to move to the dual representation.

  11. Dual representation of linear classifiers

  12. Dual representation. Recall the Perceptron update rule: if x_m is misclassified, add y_m·x_m to w, i.e. if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m. Dual representation: write w as a weighted sum of training items, w = ∑_n α_n y_n x_n, where α_n counts how often x_n was misclassified. Then f(x) = w·x = ∑_n α_n y_n (x_n·x).
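As a quick illustration of this equivalence, a sketch (NumPy assumed; X holds one training item x_n per row, y the labels in {−1, +1}, and alpha the mistake counts):

    import numpy as np

    def primal_score(w, x):
        return w @ x                            # f(x) = w·x

    def dual_score(alpha, y, X, x):
        # f(x) = Σ_n α_n y_n (x_n·x)
        return np.sum(alpha * y * (X @ x))

    # The two scores agree whenever w is the dual expansion w = Σ_n α_n y_n x_n,
    # which in this vectorized form is: w = (alpha * y) @ X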

  13. Dual representation. Primal Perceptron update rule: if x_m is misclassified, add y_m·x_m to w, i.e. if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m. Dual Perceptron update rule: if x_m is misclassified, add 1 to α_m, i.e. if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1.
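A sketch of the resulting training loop (NumPy assumed; the function name is hypothetical). It includes the label y_d inside the sum, matching the definition of w on the previous slide, and uses ≤ 0 in the mistake check so that learning can start from α = 0:

    import numpy as np

    def dual_perceptron(X, y, epochs=10):
        """Dual perceptron: X has one training item per row, y in {-1, +1}."""
        n = X.shape[0]
        alpha = np.zeros(n)                     # α_m = number of mistakes on x_m
        G = X @ X.T                             # all dot products x_d · x_m
        for _ in range(epochs):
            for m in range(n):
                # x_m is misclassified if y_m · Σ_d α_d y_d (x_d·x_m) ≤ 0
                if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                    alpha[m] += 1               # α_m := α_m + 1
        return alpha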

  14. Dual representation. Classifying x in the primal: f(x) = w·x, where w is the vector of feature weights (to be learned) and w·x is the dot product between w and x. Classifying x in the dual: f(x) = ∑_n α_n y_n (x_n·x), where α_n is the weight of the n-th training example (to be learned) and x_n·x is the dot product between x_n and x. The dual representation is advantageous when #training examples ≪ #features (it requires fewer parameters to learn).

  15. Kernels

  16. Making data linearly separable. (Figure: original feature space, x_2 plotted against x_1 over [−2, 2].) f(x) = 1 iff x_1² + x_2² ≤ 1.

  17. Making data linearly separable. (Figure: transformed feature space, x_2² plotted against x_1² over [0, 2].) Transform the data: x = (x_1, x_2) ⇒ x' = (x_1², x_2²). Then f(x') = 1 iff x'_1 + x'_2 ≤ 1.
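A small sketch of this transformation (NumPy assumed; the example points are made up):

    import numpy as np

    def transform(X):
        """Map each row x = (x1, x2) to x' = (x1², x2²)."""
        return X ** 2

    # Inside the unit circle iff x1² + x2² ≤ 1, i.e. x'1 + x'2 ≤ 1 (linear in x'):
    X = np.array([[0.5, 0.5], [1.5, 0.0], [0.2, -0.9]])
    labels = (transform(X).sum(axis=1) <= 1).astype(int)   # gives [1, 0, 1]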

  18. Making data linearly separable. These data aren’t linearly separable in the one-dimensional x_1 space, but adding a second dimension x_2 = x_1² makes them linearly separable in ⟨x_1, x_2⟩.

  19. Making data linearly separable. It is common for data not to be linearly separable in the original feature space. We can often introduce new features to make the data linearly separable in the new space: – transform the original features (e.g. x → x²), – include transformed features in addition to the original features, – capture interactions between features (e.g. x_3 = x_1·x_2). But this may blow up the number of features.
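For a two-dimensional input, an explicit expansion of this kind might look as follows (a sketch; the helper is hypothetical). With d original features there are already O(d²) such terms, which is the blow-up mentioned above:

    import numpy as np

    def expand_features(x):
        """Original features, their squares, and the pairwise interaction for x = (x1, x2)."""
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, x1 * x2])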

  20. Making data linearly separable. We need to introduce a lot of new features to learn the target function. Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn w. The dual representation is not affected.

  21. The kernel trick. – Define a feature function φ(x) which maps items x into a higher-dimensional space. – The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j). – Dual representation: we don’t need to learn w in this higher-dimensional space; it is sufficient to evaluate K(x_i, x_j).

  22. Quadratic kernel. Original features: x = (a, b). Transformed features: φ(x) = (a², b², √2·ab). Dot product in the transformed space: φ(x_1)·φ(x_2) = a_1²a_2² + b_1²b_2² + 2·a_1b_1a_2b_2 = (x_1·x_2)². Kernel: K(x_1, x_2) = (x_1·x_2)² = φ(x_1)·φ(x_2).
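A quick numerical check of this identity (NumPy assumed; the test points are arbitrary):

    import numpy as np

    def phi(x):
        a, b = x
        return np.array([a**2, b**2, np.sqrt(2) * a * b])   # φ(x) = (a², b², √2·ab)

    x1 = np.array([1.0, 2.0])
    x2 = np.array([3.0, -1.0])
    assert np.isclose(phi(x1) @ phi(x2), (x1 @ x2) ** 2)    # K(x1, x2) = (x1·x2)²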

  23. Polynomial kernels. Polynomial kernel of degree p: – Basic form: K(x_i, x_j) = (x_i·x_j)^p. – Standard form (captures all lower-order terms as well): K(x_i, x_j) = (x_i·x_j + 1)^p.
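The two forms written out as functions, as a sketch (the function names are hypothetical; x_i and x_j are expected to be NumPy arrays):

    def poly_kernel_basic(xi, xj, p):
        """K(xi, xj) = (xi·xj)^p: only degree-p interactions."""
        return (xi @ xj) ** p

    def poly_kernel(xi, xj, p):
        """K(xi, xj) = (xi·xj + 1)^p: all terms up to degree p."""
        return (xi @ xj + 1) ** p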

  24. From dual to kernel perceptron. Dual Perceptron: f(x_m) = ∑_d α_d y_d (x_d·x_m). Update: if x_m is misclassified, add 1 to α_m, i.e. if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1. Kernel Perceptron: f(x_m) = ∑_d α_d y_d φ(x_d)·φ(x_m) = ∑_d α_d y_d K(x_d, x_m). Update: if x_m is misclassified, add 1 to α_m, i.e. if y_m·∑_d α_d y_d K(x_d, x_m) < 0: α_m := α_m + 1.
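A sketch of the kernel perceptron that follows directly from this (NumPy assumed; K is any kernel function, e.g. the quadratic kernel above, and the labels y_d are included as on the previous slides):

    import numpy as np

    def kernel_perceptron(X, y, K, epochs=10):
        """Kernel perceptron: dot products replaced by kernel evaluations K(x_d, x_m)."""
        n = X.shape[0]
        alpha = np.zeros(n)
        Gram = np.array([[K(X[d], X[m]) for m in range(n)] for d in range(n)])
        for _ in range(epochs):
            for m in range(n):
                if y[m] * np.sum(alpha * y * Gram[:, m]) <= 0:   # misclassified
                    alpha[m] += 1
        return alpha

    def predict(x, X, y, alpha, K):
        # f(x) = Σ_d α_d y_d K(x_d, x)
        return np.sign(sum(alpha[d] * y[d] * K(X[d], x) for d in range(len(alpha))))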

  25. Primal and dual representation. Linear classifier (primal representation): w defines the weights of the features of x; f(x) = w·x. Linear classifier (dual representation): rewrite w as a (weighted) sum of training items, w = ∑_n α_n y_n x_n, so f(x) = w·x = ∑_n α_n y_n (x_n·x).

  26. The kernel trick. – Define a feature function φ(x) which maps items x into a higher-dimensional space. – The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j). – Dual representation: we don’t need to learn w in this higher-dimensional space; it is sufficient to evaluate K(x_i, x_j).

  27. The kernel matrix. The kernel matrix of a data set D = {x_1, …, x_n}, defined by a kernel function k(x, z) = φ(x)·φ(z), is the n × n matrix K with K_ij = k(x_i, x_j). You’ll also find the term ‘Gram matrix’ used: – The Gram matrix of a set of n vectors S = {x_1, …, x_n} is the n × n matrix G with G_ij = x_i·x_j. – The kernel matrix is the Gram matrix of {φ(x_1), …, φ(x_n)}.
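Computing the kernel matrix for a data set X and a kernel function k, as a sketch (NumPy assumed; the function name is hypothetical):

    import numpy as np

    def kernel_matrix(X, k):
        """n × n matrix K with K_ij = k(x_i, x_j); the rows of X are the items x_i."""
        n = X.shape[0]
        return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])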

  28. Properties of the kernel matrix K. K is symmetric: K_ij = k(x_i, x_j) = φ(x_i)·φ(x_j) = k(x_j, x_i) = K_ji. K is positive semi-definite (v^T K v ≥ 0 for all vectors v). Proof: v^T K v = ∑_{i=1}^D ∑_{j=1}^D v_i v_j K_ij = ∑_{i=1}^D ∑_{j=1}^D v_i v_j φ(x_i)·φ(x_j) = ∑_{i=1}^D ∑_{j=1}^D ∑_{k=1}^N v_i v_j φ_k(x_i) φ_k(x_j) = ∑_{k=1}^N (∑_{i=1}^D v_i φ_k(x_i)) (∑_{j=1}^D v_j φ_k(x_j)) = ∑_{k=1}^N (∑_{i=1}^D v_i φ_k(x_i))² ≥ 0.
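Both properties can also be checked numerically, up to floating-point error. A self-contained sketch using the quadratic kernel (NumPy assumed; the random data are only for illustration):

    import numpy as np

    quad = lambda a, b: (a @ b) ** 2                        # quadratic kernel
    X = np.random.randn(20, 2)
    K = np.array([[quad(xi, xj) for xj in X] for xi in X])  # kernel matrix

    assert np.allclose(K, K.T)                              # symmetric: K_ij = K_ji
    assert np.all(np.linalg.eigvalsh(K) >= -1e-9)           # v^T K v ≥ 0 for all v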

  29. Quadratic kernel (1). K(x, z) = (x·z)². This corresponds to a feature space which contains only terms of degree 2 (products of two features); for x = (x_1, x_2) in R², these are x_1x_1, x_1x_2, x_2x_2. For x = (x_1, x_2) and z = (z_1, z_2): K(x, z) = (x·z)² = x_1²z_1² + 2x_1z_1x_2z_2 + x_2²z_2² = φ(x)·φ(z). Hence φ(x) = (x_1², √2·x_1x_2, x_2²).

  30. Quadratic kernel (2). K(x, z) = (x·z + c)². This corresponds to a feature space which contains constants, linear terms (the original features), as well as terms of degree 2 (products of two features); for x = (x_1, x_2) in R²: x_1, x_2, x_1x_1, x_1x_2, x_2x_2.

  31. Polynomial kernels. – Linear kernel: k(x, z) = x·z. – Polynomial kernel of degree d (only d-th-order interactions): k(x, z) = (x·z)^d. – Polynomial kernel up to degree d (all interactions of order d or lower): k(x, z) = (x·z + c)^d with c > 0.

  32. Constructing new kernels from one existing kernel k(x, x'). You can construct new kernels k'(x, x') from k(x, x') by: – Multiplying k(x, x') by a constant c: k'(x, x') = c·k(x, x'). – Multiplying k(x, x') by a function f applied to x and x': k'(x, x') = f(x) k(x, x') f(x'). – Applying a polynomial (with non-negative coefficients) to k(x, x'): k'(x, x') = P(k(x, x')), with P(z) = ∑_i a_i z^i and a_i ≥ 0. – Exponentiating k(x, x'): k'(x, x') = exp(k(x, x')).
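These closure rules can be written as small kernel combinators, as a sketch (the function names are hypothetical; k is any kernel function and f any real-valued function):

    import numpy as np

    def scale_kernel(k, c):
        """k'(x, x') = c·k(x, x') with c > 0."""
        return lambda x, xp: c * k(x, xp)

    def multiply_by_function(k, f):
        """k'(x, x') = f(x)·k(x, x')·f(x')."""
        return lambda x, xp: f(x) * k(x, xp) * f(xp)

    def poly_of_kernel(k, coeffs):
        """k'(x, x') = Σ_i a_i k(x, x')^i with a_i ≥ 0 (coeffs = [a_0, a_1, ...])."""
        return lambda x, xp: sum(a * k(x, xp) ** i for i, a in enumerate(coeffs))

    def exp_kernel(k):
        """k'(x, x') = exp(k(x, x'))."""
        return lambda x, xp: np.exp(k(x, xp))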
