CS446 Introduction to Machine Learning (Fall 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Lecture 8: Dual and Kernels
Prof. Julia Hockenmaier
juliahmr@illinois.edu
Admin
Reminder: Homework Late Policy
Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don’t accept assignments more than two days after their due date. Let us know if there are any special circumstances (family, health, etc.).
Convergence checks
What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. τ = 10⁻³)
– Compute Δw = w_old − w_new, the difference between w_old and w_new
– w has converged when ‖Δw‖ < τ
Convergence checks
How often do I check for convergence?
Batch learning:
w_old = w before seeing the current batch
w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.
Convergence checks
How often do I check for convergence?
Online learning:
– Problem: A single example may only lead to very small changes in w
– Solution: Only check for convergence after every k examples (or every k updates; either works)
w_old = w after n·k examples/updates
w_new = w after (n+1)·k examples/updates
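A minimal sketch of this convergence check in Python/numpy (not from the slides): the names has_converged and train_online and the parameters k and tau are illustrative, and the perceptron-style update treats a score of 0 as a mistake so that learning can get started from w = 0.

```python
import numpy as np

def has_converged(w_old, w_new, tau=1e-3):
    """w has converged when the norm of the difference drops below the threshold tau."""
    return np.linalg.norm(w_old - w_new) < tau

def train_online(X, y, k=100, tau=1e-3, max_epochs=100):
    """Online perceptron that checks for convergence only every k examples/updates."""
    w = np.zeros(X.shape[1])
    w_old = w.copy()
    seen = 0
    for _ in range(max_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (a score of 0 counts as a mistake here)
                w = w + y_i * x_i
            seen += 1
            if seen % k == 0:               # compare snapshots taken k examples apart
                if has_converged(w_old, w, tau):
                    return w
                w_old = w.copy()
    return w
```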
Back to linear classifiers…
Linear classifiers so far…
What we’ve seen so far is not the whole story:
– We’ve assumed that the data are linearly separable
– We’ve ignored the fact that the perceptron just finds some decision boundary, but not necessarily an optimal decision boundary
Data are not linearly separable
– Noise / outliers
– The target function is not linear in X
Today’s key concepts
Kernel trick: dealing with target functions that are not linear in the original features (i.e. data that are not linearly separable). This requires us to move to the dual representation.
Dual representation of linear classifiers
Dual representation
Recall the Perceptron update rule: If x_m is misclassified, add y_m·x_m to w:
if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m
Dual representation: Write w as a weighted sum of training items:
w = ∑_n α_n y_n x_n   (α_n: how often was x_n misclassified?)
f(x) = w·x = ∑_n α_n y_n (x_n·x)
Dual representation
Primal Perceptron update rule: If x_m is misclassified, add y_m·x_m to w:
if y_m·f(x_m) = y_m·(w·x_m) < 0: w := w + y_m·x_m
Dual Perceptron update rule: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1
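A minimal sketch of the dual update in Python/numpy (illustrative, not from the slides; it assumes labels y ∈ {−1, +1}, and treats a score of 0 as a mistake so that the all-zero initialization of α can change):

```python
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """Dual perceptron: alpha[m] counts how often training item m was misclassified."""
    n = X.shape[0]
    alpha = np.zeros(n)
    G = X @ X.T                                   # G[d, m] = x_d . x_m
    for _ in range(epochs):
        for m in range(n):
            # f(x_m) = sum_d alpha_d y_d (x_d . x_m); a score of 0 counts as a mistake
            if y[m] * np.sum(alpha * y * G[:, m]) <= 0:
                alpha[m] += 1
    return alpha

def dual_predict(alpha, X_train, y_train, x):
    """Classify a new item x using only dot products with the training items."""
    return np.sign(np.sum(alpha * y_train * (X_train @ x)))
```

Note that both training and prediction only ever touch the training items through dot products x_d·x — this is exactly what the kernel trick will exploit later.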
Dual representation
Classifying x in the primal: f(x) = w·x
w = feature weights (to be learned)
w·x = dot product between w and x
Classifying x in the dual: f(x) = ∑_n α_n y_n (x_n·x)
α_n = weight of the n-th training example (to be learned)
x_n·x = dot product between x_n and x
The dual representation is advantageous when #training examples ≪ #features (it requires fewer parameters to learn)
Kernels
Making data linearly separable
[Figure: data plotted in the original feature space, axes x1 and x2]
f(x) = 1 iff x1² + x2² ≤ 1
Making data linearly separable
[Figure: the same data in the transformed feature space, axes x1·x1 and x2·x2]
Transform the data: x = (x1, x2) ⇒ x′ = (x1², x2²)
f(x′) = 1 iff x′1 + x′2 ≤ 1
Making data linearly separable
[Figure: 1-d data on the x1 axis, and the same data lifted into the (x1, x1²) plane]
These data aren’t linearly separable in the x1 space.
But adding a second dimension with x2 = x1² makes them linearly separable in 〈x1, x2〉.
Making data linearly separable
It is common for data not to be linearly separable in the original feature space. We can often introduce new features to make the data linearly separable in the new space:
– transform the original features (e.g. x → x²)
– include transformed features in addition to the original features
– capture interactions between features (e.g. x3 = x1·x2)
But this may blow up the number of features.
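As a concrete illustration of the first bullet, a small numpy sketch on made-up data for the circle concept from the earlier slides (the data, seed, and variable names are hypothetical):

```python
import numpy as np

# Hypothetical toy data: the label is +1 iff x1^2 + x2^2 <= 1 (the circle concept),
# which no line in the original (x1, x2) space separates.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1, 1, -1)

# Explicit transformation x = (x1, x2) -> x' = (x1^2, x2^2): in the new space the
# concept becomes the half-plane x1' + x2' <= 1, which a linear classifier can represent.
X_prime = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])
```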
Making data linearly separable
We need to introduce a lot of new features to learn the target function.
Problem for the primal representation: w now has a lot of elements, and we might not have enough data to learn w.
The dual representation is not affected.
The kernel trick
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: We don’t need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
Quadratic kernel
Original features: x = (a, b)
Transformed features: φ(x) = (a², b², √2·ab)
Dot product in transformed space (with x1 = (a1, b1), x2 = (a2, b2)):
φ(x1)·φ(x2) = a1²a2² + b1²b2² + 2·a1b1a2b2 = (x1·x2)²
Kernel: K(x1, x2) = (x1·x2)² = φ(x1)·φ(x2)
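A quick numeric sanity check of this identity (the helper names phi and quadratic_kernel are made up for illustration):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-d inputs: (a, b) -> (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a ** 2, b ** 2, np.sqrt(2) * a * b])

def quadratic_kernel(x1, x2):
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])
# Both routes give the same number: (x1 . x2)^2 == phi(x1) . phi(x2)
assert np.isclose(quadratic_kernel(x1, x2), np.dot(phi(x1), phi(x2)))
```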
Polynomial kernels
Polynomial kernel of degree p:
– Basic form: K(x_i, x_j) = (x_i·x_j)^p
– Standard form (captures all lower-order terms): K(x_i, x_j) = (x_i·x_j + 1)^p
From dual to kernel perceptron
Dual Perceptron:
f(x_m) = ∑_d α_d y_d (x_d·x_m)
Update: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d (x_d·x_m) < 0: α_m := α_m + 1
Kernel Perceptron:
f(x_m) = ∑_d α_d y_d φ(x_d)·φ(x_m) = ∑_d α_d y_d K(x_d, x_m)
Update: If x_m is misclassified, add 1 to α_m:
if y_m·∑_d α_d y_d K(x_d, x_m) < 0: α_m := α_m + 1
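A sketch of the kernel perceptron in Python/numpy, mirroring the dual perceptron sketch above (illustrative; it assumes labels in {−1, +1}, treats a score of 0 as a mistake, and precomputes the full kernel matrix):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Kernel perceptron: the dual perceptron with x_d . x_m replaced by K(x_d, x_m)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[kernel(X[d], X[m]) for m in range(n)] for d in range(n)])
    for _ in range(epochs):
        for m in range(n):
            if y[m] * np.sum(alpha * y * K[:, m]) <= 0:   # a score of 0 counts as a mistake
                alpha[m] += 1
    return alpha

def kernel_predict(alpha, X_train, y_train, kernel, x):
    """Classify a new item x using only kernel evaluations against the training items."""
    scores = np.array([kernel(x_d, x) for x_d in X_train])
    return np.sign(np.sum(alpha * y_train * scores))

# e.g. the quadratic kernel from the earlier slide:
quadratic = lambda u, v: np.dot(u, v) ** 2
```

Because only the values K(x_d, x_m) are needed, the feature map φ is never computed explicitly.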
Primal and dual representation
Linear classifier (primal representation): w defines weights of the features of x
f(x) = w·x
Linear classifier (dual representation): Rewrite w as a (weighted) sum of training items:
w = ∑_n α_n y_n x_n
f(x) = w·x = ∑_n α_n y_n (x_n·x)
The kernel trick
– Define a feature function φ(x) which maps items x into a higher-dimensional space.
– The kernel function K(x_i, x_j) computes the inner product between φ(x_i) and φ(x_j): K(x_i, x_j) = φ(x_i)·φ(x_j)
– Dual representation: We don’t need to learn w in this higher-dimensional space. It is sufficient to evaluate K(x_i, x_j).
The kernel matrix
The kernel matrix of a data set D = {x1, …, xn}, defined by a kernel function k(x, z) = φ(x)·φ(z), is the n×n matrix K with K_ij = k(x_i, x_j).
You’ll also find the term ‘Gram matrix’ used:
– The Gram matrix of a set of n vectors S = {x1, …, xn} is the n×n matrix G with G_ij = x_i·x_j
– The kernel matrix is the Gram matrix of {φ(x1), …, φ(xn)}
Properties of the kernel matrix K
K is symmetric: K_ij = k(x_i, x_j) = φ(x_i)·φ(x_j) = k(x_j, x_i) = K_ji
K is positive semi-definite (∀ vectors v: vᵀKv ≥ 0). Proof:
vᵀKv = ∑_{i=1..D} ∑_{j=1..D} v_i v_j K_ij
     = ∑_{i=1..D} ∑_{j=1..D} v_i v_j 〈φ(x_i), φ(x_j)〉
     = ∑_{k=1..N} ∑_{i=1..D} ∑_{j=1..D} (v_i φ_k(x_i))·(v_j φ_k(x_j))
     = ∑_{k=1..N} ( ∑_{i=1..D} v_i φ_k(x_i) )² ≥ 0
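A small numeric check of both properties on random data, using the quadratic kernel with a constant term; the helper kernel_matrix is illustrative, not from the slides:

```python
import numpy as np

def kernel_matrix(X, kernel):
    """Kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = kernel_matrix(X, lambda u, v: (np.dot(u, v) + 1) ** 2)

# Symmetric, and all eigenvalues are >= 0 up to numerical noise,
# i.e. the matrix is positive semi-definite.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```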
Quadratic kernel (1)
K(x, z) = (x·z)²
This corresponds to a feature space which contains only terms of degree 2 (products of two features); for x = (x1, x2) in R², these are x1x1, x1x2, x2x2.
For x = (x1, x2), z = (z1, z2):
K(x, z) = (x·z)² = x1²z1² + 2·x1z1x2z2 + x2²z2² = φ(x)·φ(z)
Hence φ(x) = (x1², √2·x1x2, x2²)
Quadratic kernel (2)
K(x, z) = (x·z + c)²
This corresponds to a feature space which contains constants, linear terms (the original features), as well as terms of degree 2 (products of two features); for x = (x1, x2) in R²: x1, x2, x1x1, x1x2, x2x2.
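One way to see this for x = (x1, x2) and z = (z1, z2) is to expand the square (a worked derivation, not from the slides):

```latex
(\mathbf{x}\cdot\mathbf{z} + c)^2
  = (x_1 z_1 + x_2 z_2 + c)^2
  = \underbrace{x_1^2 z_1^2 + 2\,x_1 z_1 x_2 z_2 + x_2^2 z_2^2}_{\text{degree-2 terms}}
  \;+\; \underbrace{2c\,x_1 z_1 + 2c\,x_2 z_2}_{\text{linear terms}}
  \;+\; \underbrace{c^2}_{\text{constant}}
```

so the implicit feature map is φ(x) = (x1², x2², √2·x1x2, √(2c)·x1, √(2c)·x2, c), and the inner product with the corresponding φ(z) reproduces exactly this expansion.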
Polynomial kernels
– Linear kernel: k(x, z) = x·z
– Polynomial kernel of degree d (only dth-order interactions): k(x, z) = (x·z)^d
– Polynomial kernel up to degree d (all interactions of order d or lower): k(x, z) = (x·z + c)^d with c > 0
Constructing new kernels from one existing kernel k(x, x′)
You can construct new kernels k′(x, x′) from k(x, x′) by:
– Multiplying k(x, x′) by a constant c: k′(x, x′) = c·k(x, x′)
– Multiplying k(x, x′) by a function f applied to x and x′: k′(x, x′) = f(x) k(x, x′) f(x′)
– Applying a polynomial (with non-negative coefficients) to k(x, x′): k′(x, x′) = P(k(x, x′)), with P(z) = ∑_i a_i z^i and a_i ≥ 0
– Exponentiating k(x, x′): k′(x, x′) = exp(k(x, x′))
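A sketch of these four constructions in Python, using the linear kernel as the base k; the constant c, the function f, and the polynomial coefficients are arbitrary choices for illustration:

```python
import numpy as np

# Base kernel: the linear kernel k(x, x') = x . x'
k = lambda x, xp: np.dot(x, xp)

c = 3.0                                                     # scaling constant (non-negative)
f = lambda x: np.exp(-np.dot(x, x))                         # any real-valued function of x

scaled      = lambda x, xp: c * k(x, xp)                    # c * k(x, x')
reweighted  = lambda x, xp: f(x) * k(x, xp) * f(xp)         # f(x) k(x, x') f(x')
polynomial  = lambda x, xp: 2.0 * k(x, xp) ** 2 + k(x, xp)  # P(k(x, x')) with non-negative coefficients
exponential = lambda x, xp: np.exp(k(x, xp))                # exp(k(x, x'))
```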