CSE446: Kernels and Kernelized Perceptron Winter 2015 - PowerPoint PPT Presentation

CSE446: Kernels and Kernelized Perceptron
Winter 2015
Luke Zettlemoyer
Slides adapted from Carlos Guestrin

What if the data is not linearly separable?
Use features of features of features of features…

$\phi(x) = (x_1, \ldots, x_n, x_1 x_2, x_1 x_3, \ldots, e^{x_1}, \ldots)^\top$

Feature space can get really large really quickly!

Non-linear features: 1D input
• Datasets that are linearly separable with some noise work out great:
[Figure: 1D data on the x axis, origin at 0]
• But what are we going to do if the dataset is just too hard?
[Figure: 1D data on the x axis, origin at 0]
• How about… mapping data to a higher-dimensional space:
[Figure: the same data plotted with axes x and x^2]

Feature spaces
• General idea: map to higher dimensional space, $x \to \phi(x)$
– if $x$ is in $R^n$, then $\phi(x)$ is in $R^m$ for $m > n$
– Can now learn feature weights $w$ in $R^m$ and predict: $y = \mathrm{sign}(w \cdot \phi(x))$
– A linear function in the higher dimensional space will be non-linear in the original space
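A minimal sketch of this idea for a 1D input. The feature map $x \to (x, x^2)$, the toy data, the hand-picked weights, and the convention sign(0) = -1 are all assumptions made for illustration; none of them come from the slides.

```python
def phi(x):
    """Hypothetical feature map for a 1D input: x -> (x, x^2)."""
    return (x, x * x)

def sign(z):
    # Assumed convention: sign(0) = -1
    return 1 if z > 0 else -1

def predict(w, x):
    """Linear classifier in feature space: sign(w . phi(x))."""
    f = phi(x)
    return sign(w[0] * f[0] + w[1] * f[1])

# Toy 1D data: positives lie far from the origin, negatives near it,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -0.5, 0.5, 2.0, 3.0]
ys = [1, 1, -1, -1, 1, 1]

# Shifted feature map so that the rule "x^2 > 1.5" needs no bias term:
# phi_shifted(x) = (x, x^2 - 1.5). Hand-picked weights w = (0, 1).
def phi_shifted(x):
    return (x, x * x - 1.5)

def predict_shifted(w, x):
    f = phi_shifted(x)
    return sign(w[0] * f[0] + w[1] * f[1])

w = (0.0, 1.0)
preds = [predict_shifted(w, x) for x in xs]
print(preds)
```

The classifier is linear in the feature vector but thresholds $x^2$, so it is non-linear as a function of the original input $x$.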

Higher order polynomials
[Figure: number of monomial terms (vertical axis) versus number of input dimensions (horizontal axis), with curves for d = 2, d = 3, d = 4]
m – input features
d – degree of polynomial
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
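The "about 1.6 billion" figure can be checked with a short calculation. This sketch assumes the count refers to distinct monomials of degree exactly d in m variables, which is the binomial coefficient C(m + d - 1, d); the function name is hypothetical.

```python
from math import comb

def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))  # 1609344100, about 1.6 billion
```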

Efficient dot-product of polynomials
Polynomials of degree exactly d

d = 1:
$\phi(u) \cdot \phi(v) = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2 = u \cdot v$

d = 2:
$\phi(u) \cdot \phi(v) = \begin{pmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{pmatrix} \cdot \begin{pmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{pmatrix} = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2$

For any d (we will skip proof):
$K(u, v) = \phi(u) \cdot \phi(v) = (u \cdot v)^d$

• Cool! Taking a dot product and raising it to the power d gives the same result as mapping into the high dimensional space and then taking the dot product
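The identity can be checked numerically. In this sketch, `phi_exact` is a hypothetical helper that enumerates all ordered degree-d products of coordinates (so $u_1 u_2$ and $u_2 u_1$ are separate features, as in the d = 2 case above); the test vectors are arbitrary.

```python
from itertools import product
from math import prod, isclose

def phi_exact(x, d):
    """All ordered degree-d products of coordinates: len(x)**d features."""
    return [prod(combo) for combo in product(x, repeat=d)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

u = [1.0, 2.0, 3.0]
v = [0.5, -1.0, 2.0]
d = 3

explicit = dot(phi_exact(u, d), phi_exact(v, d))  # dot product over 27 features
implicit = dot(u, v) ** d                         # one 3-term dot product, then a power

print(len(phi_exact(u, d)), explicit, implicit)
```

The explicit route touches $n^d$ features, while the kernel route costs one dot product in the original n dimensions.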

The "Kernel Trick"
• A kernel function defines a dot product in some feature space:
$K(u, v) = \phi(u) \cdot \phi(v)$
• Example: 2-dimensional vectors $u = [u_1, u_2]$ and $v = [v_1, v_2]$; let $K(u, v) = (1 + u \cdot v)^2$.
Need to show that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$:

$K(u, v) = (1 + u \cdot v)^2$
$= 1 + u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2 + 2 u_1 v_1 + 2 u_2 v_2$
$= [1, u_1^2, \sqrt{2} u_1 u_2, u_2^2, \sqrt{2} u_1, \sqrt{2} u_2] \cdot [1, v_1^2, \sqrt{2} v_1 v_2, v_2^2, \sqrt{2} v_1, \sqrt{2} v_2]$
$= \phi(u) \cdot \phi(v)$, where $\phi(x) = [1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2]$

• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each $\phi(x)$ explicitly).
• But, it isn't obvious yet how we will incorporate it into actual learning algorithms…
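A quick numerical check of this example. The function names and the test vectors are illustrative choices; the feature map is the six-dimensional $\phi$ given above.

```python
from math import sqrt, isclose

def phi(x):
    """Explicit feature map for K(u, v) = (1 + u.v)^2 on 2D inputs."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def K(u, v):
    """Kernel computed directly in the original 2D space."""
    return (1.0 + u[0] * v[0] + u[1] * v[1]) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

u = [1.0, 2.0]
v = [3.0, -1.0]

lhs = K(u, v)              # (1 + 1)^2 = 4
rhs = dot(phi(u), phi(v))  # same value, via 6 explicit features
print(lhs, rhs)
```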

"Kernel trick" for the Perceptron!
• Never compute features explicitly!!!
– Compute dot products in closed form: $K(u, v) = \phi(u) \cdot \phi(v)$

• Standard Perceptron:
• set $w_i = 0$ for each feature i
• For t = 1..T, i = 1..n:
– $y = \mathrm{sign}(w \cdot \phi(x_i))$
– if $y \ne y_i$
• $w = w + y_i \phi(x_i)$

• Kernelized Perceptron:
• set $a_i = 0$ for each example i
• For t = 1..T, i = 1..n:
– $y = \mathrm{sign}\left(\left(\sum_k a_k \phi(x_k)\right) \cdot \phi(x_i)\right) = \mathrm{sign}\left(\sum_k a_k K(x_k, x_i)\right)$
– if $y \ne y_i$
• $a_i \mathrel{+}= y_i$

• At all times during learning: $w = \sum_k a_k \phi(x_k)$
Exactly the same computations, but can use K(u,v) to avoid enumerating the features!!!
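The kernelized update can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the toy dataset, the early exit when a pass makes no mistakes, and the convention sign(0) = -1 (as in the worked example in these slides) are assumptions.

```python
def sign(z):
    # Assumed convention: sign(0) = -1
    return 1 if z > 0 else -1

def kernel_perceptron(X, y, K, T=10):
    """Return per-example coefficients a, so that w = sum_k a[k] * phi(X[k])."""
    n = len(X)
    a = [0] * n
    for _ in range(T):
        mistakes = 0
        for i in range(n):
            # Score uses only kernel evaluations, never phi explicitly.
            score = sum(a[k] * K(X[k], X[i]) for k in range(n))
            if sign(score) != y[i]:
                a[i] += y[i]
                mistakes += 1
        if mistakes == 0:  # added convenience: stop once a full pass is clean
            break
    return a

def predict(a, X, K, x):
    """Classify a new point x from the stored examples and coefficients."""
    return sign(sum(a[k] * K(X[k], x) for k in range(len(X))))

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Toy linearly separable data (hypothetical).
X = [[2, 1], [1, 3], [-1, -2], [-2, -1]]
y = [1, 1, -1, -1]
a = kernel_perceptron(X, y, linear_kernel)
print(a)
```

Note that prediction needs the training examples with non-zero $a_k$, since $w$ is never formed explicitly.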

• set $a_i = 0$ for each example i
• For t = 1..T, i = 1..n:
– $y = \mathrm{sign}\left(\sum_k a_k K(x_k, x_i)\right)$
– if $y \ne y_i$
• $a_i \mathrel{+}= y_i$

Data (one example per row; columns are the two input coordinates and the label):
x1   x2   y
 1    1    1
-1    1   -1
-1   -1    1
 1   -1   -1

Kernel: $K(u, v) = (u \cdot v)^2$

Kernel matrix K (rows and columns indexed by the four examples $x_1, \ldots, x_4$):
       x_1  x_2  x_3  x_4
x_1     4    0    4    0
x_2     0    4    0    4
x_3     4    0    4    0
x_4     0    4    0    4

e.g., $K(x_1, x_2) = K([1,1], [-1,1]) = (1 \times (-1) + 1 \times 1)^2 = 0$

Initial:
• a = [a_1, a_2, a_3, a_4] = [0, 0, 0, 0]
t=1, i=1
• $\sum_k a_k K(x_k, x_1)$ = 0×4 + 0×0 + 0×4 + 0×0 = 0, sign(0) = -1
• $a_1 \mathrel{+}= y_1$ → $a_1 \mathrel{+}= 1$, new a = [1, 0, 0, 0]
t=1, i=2
• $\sum_k a_k K(x_k, x_2)$ = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
t=1, i=3
• $\sum_k a_k K(x_k, x_3)$ = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
t=1, i=4
• $\sum_k a_k K(x_k, x_4)$ = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
t=2, i=1
• $\sum_k a_k K(x_k, x_1)$ = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
…
Converged!!!
• $y = \sum_k a_k K(x_k, x)$
$= 1 \times K(x_1, x) + 0 \times K(x_2, x) + 0 \times K(x_3, x) + 0 \times K(x_4, x)$
$= K(x_1, x)$
$= K([1,1], x)$ (because the first example is $x_1 = [1,1]$)
$= (x_1 + x_2)^2$ (because $K(u, v) = (u \cdot v)^2$; here $x_1$, $x_2$ denote the two coordinates of the new point $x$)
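The trace above can be reproduced in a few lines. This sketch uses the four data points, the kernel $K(u, v) = (u \cdot v)^2$, and the convention sign(0) = -1 from the example; variable names are illustrative.

```python
# The four training examples and labels from the worked example.
X = [[1, 1], [-1, 1], [-1, -1], [1, -1]]
y = [1, -1, 1, -1]

def K(u, v):
    """Quadratic kernel K(u, v) = (u . v)^2."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def sign(z):
    # Convention used in the example: sign(0) = -1
    return 1 if z > 0 else -1

# Precompute the kernel (Gram) matrix once.
gram = [[K(xi, xj) for xj in X] for xi in X]

a = [0, 0, 0, 0]
for t in range(1, 3):          # two passes are enough here
    for i in range(4):
        score = sum(a[k] * gram[k][i] for k in range(4))
        if sign(score) != y[i]:
            a[i] += y[i]
        print(f"t={t}, i={i + 1}: score={score}, a={a}")

print(gram)
```

Only the first step makes a mistake, so the coefficients settle at a = [1, 0, 0, 0] and the second pass changes nothing.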

Common kernels
• Polynomials of degree exactly d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
• And many others: very active area of research!
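The kernel formulas did not survive in the slide text, so the sketch below uses the standard textbook forms. The parameter names (`sigma`, `eta`, `nu`) and their defaults are assumptions; the slides may have used a different parameterization.

```python
from math import exp, tanh, isclose

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_exact(u, v, d):
    """Polynomial kernel of degree exactly d: (u . v)^d."""
    return dot(u, v) ** d

def poly_up_to(u, v, d):
    """Polynomial kernel of degree up to d: (1 + u . v)^d."""
    return (1 + dot(u, v)) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: tanh(eta * u . v + nu)."""
    return tanh(eta * dot(u, v) + nu)

u, v = [1.0, 2.0], [3.0, -1.0]
print(poly_exact(u, v, 2), poly_up_to(u, v, 2), gaussian(u, v), sigmoid(u, v))
```

Any of these can be passed wherever a kernel function K(u, v) is expected. The sigmoid form is not positive semi-definite for every choice of parameters, so it should be used with some care.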

Overfitting?
• Huge feature space with kernels, what about overfitting???
– Often robust to overfitting, e.g. if you don't make too many Perceptron updates
– SVMs (which we will see next) will have a clearer story for avoiding overfitting
– But everything overfits sometimes!!!
• Can control by:
– Choosing a better kernel
– Varying parameters of the kernel (width of Gaussian, etc.)

Kernels in logistic regression

$P(Y = 0 \mid X = x, w, w_0) = \frac{1}{1 + \exp(w_0 + w \cdot x)}$

• Define weights in terms of data points:

$w = \sum_j \alpha_j \phi(x_j)$

$P(Y = 0 \mid X = x, w, w_0) = \frac{1}{1 + \exp\left(w_0 + \sum_j \alpha_j \phi(x_j) \cdot \phi(x)\right)} = \frac{1}{1 + \exp\left(w_0 + \sum_j \alpha_j K(x_j, x)\right)}$

• Derive gradient descent rule on $\alpha_j$, $w_0$
• Similar tricks for all linear models: SVMs, etc.
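A minimal sketch of the kernelized prediction formula above. It covers only the forward computation of $P(Y = 0 \mid x)$; the gradient rule for $\alpha_j$ and $w_0$ is not shown. The function name, the linear kernel, and the example values are hypothetical.

```python
from math import exp, isclose

def p_y0(x, alpha, w0, X_train, K):
    """P(Y = 0 | x) = 1 / (1 + exp(w0 + sum_j alpha_j K(x_j, x)))."""
    s = w0 + sum(a_j * K(x_j, x) for a_j, x_j in zip(alpha, X_train))
    return 1.0 / (1.0 + exp(s))

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical stored examples and coefficients.
X_train = [[1.0, 0.0], [0.0, 1.0]]
alpha = [2.0, 0.0]
w0 = 0.0

p = p_y0([1.0, 0.0], alpha, w0, X_train, linear_kernel)
print(p)  # equals 1 / (1 + e^2)
```

For very large positive scores `exp` can overflow, so a production version would need a numerically stable form of the logistic function.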

What you need to know
• The kernel trick
• Derive polynomial kernel
• Common kernels
• Kernelized perceptron
