Kernels + Support Vector Machines (SVMs)

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Kernels + Support Vector Machines (SVMs)
Matt Gormley
Lecture 12
February 27, 2016
SVM Readings: Murphy 14.5; Bishop 7.1; HTF 12 - 12.38; Mitchell --

Reminders
• Homework 4: Perceptron / Kernels / SVM (9 days for HW4)
– Release: Wed, Feb. 22
– Due: Fri, Mar. 03 at 11:59pm
• Midterm Exam (Evening Exam)
– Tue, Mar. 07 at 7:00pm – 9:30pm
– See Piazza for details about location
• Grading

Outline
• Kernels (Last Lecture)
– Kernel Perceptron
– Kernel as a dot product
– Gram matrix
– Examples: Polynomial, RBF
• Support Vector Machine (SVM) (This Lecture)
– Background: Constrained Optimization, Linearly Separable, Margin
– SVM Primal (Linearly Separable Case)
– SVM Primal (Non-linearly Separable Case)
– SVM Dual

KERNELS

Kernels: Motivation
Most real-world problems exhibit data that is not linearly separable.
Example: pixel representation for Facial Recognition
Q: When your data is not linearly separable, how can you still use a linear classifier?
A: Preprocess the data to produce nonlinear features

Kernels: Motivation
• Motivation #1: Inefficient Features
– Non-linearly separable data requires a high-dimensional representation
– Might be prohibitively expensive to compute or store
• Motivation #2: Memory-based Methods
– k-Nearest Neighbors (KNN) for facial recognition allows a distance metric between images, with no need to worry about the linearity restriction at all

Kernels
Whiteboard
– Kernel Perceptron
– Kernel as a dot product
– Gram matrix
– Examples: RBF kernel, string kernel

Kernel Methods
• Key idea:
1. Rewrite the algorithm so that we only work with dot products $x^T z$ of feature vectors
2. Replace the dot products $x^T z$ with a kernel function $k(x, z)$
• The kernel $k(x, z)$ can be any legal definition of a dot product: $k(x, z) = \phi(x)^T \phi(z)$ for any function $\phi: \mathcal{X} \to \mathbb{R}^D$. So we only compute the φ dot product implicitly.
• This “kernel trick” can be applied to many algorithms:
– classification: perceptron, SVM, …
– regression: ridge regression, …
– clustering: k-means, …
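The two-step recipe above can be sketched for the perceptron. This is a minimal illustration, not the lecture's whiteboard derivation: the function names, the bias-free update rule, the choice of the kernel $(x^T z + 1)^2$, and the XOR toy data are all assumptions made for the example.

```python
def linear_kernel(x, z):
    """Plain dot product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def quadratic_kernel(x, z):
    """(x^T z + 1)^2, an assumed example kernel with an implicit quadratic feature map."""
    return (linear_kernel(x, z) + 1.0) ** 2


def train_kernel_perceptron(X, y, kernel, epochs=200):
    """Return mistake counts alpha; the weight vector is never formed explicitly."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            # The score uses only kernel evaluations between training points.
            score = sum(a * y_j * kernel(x_j, x_i)
                        for a, x_j, y_j in zip(alpha, X, y))
            if y_i * score <= 0:  # mistake: remember this example once more
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: converged
            break
    return alpha


def predict(x, X, y, alpha, kernel):
    """Sign of the kernelized score for a new point x."""
    score = sum(a * y_j * kernel(x_j, x) for a, x_j, y_j in zip(alpha, X, y))
    return 1 if score > 0 else -1


# XOR-style toy data: not linearly separable in the original 2-D space.
X_xor = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y_xor = [-1, -1, 1, 1]
alpha_xor = train_kernel_perceptron(X_xor, y_xor, quadratic_kernel)
predictions_xor = [predict(x, X_xor, y_xor, alpha_xor, quadratic_kernel) for x in X_xor]
```

Swapping `quadratic_kernel` for `linear_kernel` recovers the ordinary perceptron on the raw inputs, which is the sense in which only the dot product changes.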

Kernel Methods
Q: These are just non-linear features, right?
A: Yes, but…
Q: Can’t we just compute the feature transformation φ explicitly?
A: That depends...
Q: So, why all the hype about the kernel trick?
A: Because the explicit features might either be prohibitively expensive to compute or infinite length vectors

Example: Polynomial Kernel
For n=2, d=2, the kernel $K(x, z) = (x \cdot z)^d$ corresponds to
$\phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto \Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$
$\phi(x) \cdot \phi(z) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) = (x_1 z_1 + x_2 z_2)^2 = (x \cdot z)^2 = K(x, z)$
[Figure: X and O data points in the original space (axes $x_1$, $x_2$) and in Φ-space (axes $z_1$, $z_3$)]
Slide from Nina Balcan
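The identity on this slide can be checked numerically. The sketch below assumes the degree-2 map $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$; the helper names and the sample points are chosen only for illustration.

```python
import math


def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 for 2-D inputs, computed in the original space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map R^2 -> R^3 for the degree-2 polynomial kernel."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x_demo, z_demo = (1.0, 2.0), (3.0, -1.0)
implicit_value = poly2_kernel(x_demo, z_demo)      # never builds phi
explicit_value = dot(phi(x_demo), phi(z_demo))     # builds phi, then dots
```

Both values agree up to floating-point rounding, which is the point of the slide: the kernel evaluates the feature-space dot product without constructing the features.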

Example: Polynomial Kernel
Feature space can grow really large, really quickly.
Crucial to think of φ as implicit, not explicit!
• Polynomial kernel of degree $d$: $k(x, z) = (x^\top z)^d = \phi(x) \cdot \phi(z)$
– Features include $x_1^d$, $x_1 x_2 \cdots x_d$, $x_1^2 x_2 \cdots x_{d-1}$
– Total number of such features is $\binom{d + n - 1}{d} = \frac{(d + n - 1)!}{d!\,(n - 1)!}$
– For $d = 6$, $n = 100$, there are 1.6 billion terms
$k(x, z) = (x^\top z)^d = \phi(x) \cdot \phi(z)$ takes only $O(n)$ computation!
Slide from Nina Balcan
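The feature count quoted on this slide can be reproduced directly. This is a small sketch under the assumption that "features" means distinct degree-$d$ monomials in $n$ variables; the function names are illustrative.

```python
import math
from itertools import combinations_with_replacement


def num_monomials(n, d):
    """Number of distinct degree-d monomials in n variables: C(d + n - 1, d)."""
    return math.comb(d + n - 1, d)


def enumerate_monomials(n, d):
    """List each monomial as a sorted tuple of variable indices (small n, d only)."""
    return list(combinations_with_replacement(range(n), d))


count_slide_example = num_monomials(100, 6)   # the d = 6, n = 100 case
small_case = enumerate_monomials(3, 2)        # brute-force check on a tiny case
```

For the slide's case the count is 1,609,344,100, while evaluating $(x^\top z)^6$ directly needs only one 100-dimensional dot product and a power.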

Kernel Examples
Side Note: The feature space might not be unique!
Explicit representation #1:
$\phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto \Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$
$\phi(x) \cdot \phi(z) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) = (x_1 z_1 + x_2 z_2)^2 = (x \cdot z)^2 = K(x, z)$
Explicit representation #2:
$\phi: \mathbb{R}^2 \to \mathbb{R}^4$, $(x_1, x_2) \mapsto \Phi(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1)$
$\phi(x) \cdot \phi(z) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1) \cdot (z_1^2, z_2^2, z_1 z_2, z_2 z_1) = (x \cdot z)^2 = K(x, z)$
These two different feature representations correspond to the same kernel function!
Slide from Nina Balcan
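The non-uniqueness claim can be illustrated by comparing both explicit maps on a small grid of integer points. The names `phi_r3` and `phi_r4` and the grid itself are assumptions for this sketch.

```python
import math
from itertools import product


def phi_r3(x):
    """Representation #1: R^2 -> R^3 with a sqrt(2)-scaled cross term."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])


def phi_r4(x):
    """Representation #2: R^2 -> R^4 with the cross term listed twice."""
    return (x[0] ** 2, x[1] ** 2, x[0] * x[1], x[1] * x[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Largest disagreement between the two feature-space dot products on the grid.
grid = [(float(a), float(b)) for a, b in product(range(-2, 3), repeat=2)]
max_gap = max(abs(dot(phi_r3(x), phi_r3(z)) - dot(phi_r4(x), phi_r4(z)))
              for x in grid for z in grid)
```

`max_gap` stays at rounding-error level, so a learner that only sees kernel values cannot distinguish the two feature spaces.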

Kernel Examples
Name | Kernel Function (implicit dot product) | Feature Space (explicit dot product)
Linear | $k(x, z) = x^T z$ | Same as original input space
Polynomial (v1) | $k(x, z) = (x^T z)^d$ | All polynomials of degree d
Polynomial (v2) | $k(x, z) = (x^T z + 1)^d$ | All polynomials up to degree d
Gaussian | $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$ | Infinite dimensional space
Hyperbolic Tangent (Sigmoid) Kernel | $k(x, z) = \tanh(\alpha\, x^T z + c)$ | (With SVM, this is equivalent to a 2-layer neural network)
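As a rough sketch, the kernels named in this table can be written as plain functions using their commonly cited textbook forms. The parameter names (`d`, `sigma`, `alpha`, `c`) and default values are assumptions for illustration and may not match the slide's notation.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """x^T z: feature space is the original input space."""
    return dot(x, z)


def polynomial_kernel_v1(x, z, d=2):
    """(x^T z)^d: all monomials of degree exactly d."""
    return dot(x, z) ** d


def polynomial_kernel_v2(x, z, d=2):
    """(x^T z + 1)^d: all monomials up to degree d."""
    return (dot(x, z) + 1.0) ** d


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)): infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, z, alpha=1.0, c=0.0):
    """tanh(alpha * x^T z + c): the hyperbolic tangent kernel."""
    return math.tanh(alpha * dot(x, z) + c)
```

For example, `polynomial_kernel_v2((1.0, 2.0), (3.0, 4.0))` evaluates $(11 + 1)^2$ without ever listing the quadratic features.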

RBF Kernel Example
[Figure: RBF kernel]

RBF Kernel Example: KNN vs. SVM
[Figure: RBF kernel]

Example: String Kernel
Setup:
– Input instances x are strings of characters (e.g. $x^{(3)}$ = [‘s’, ‘a’, ‘t’], $x^{(7)}$ = [‘c’, ‘a’, ‘t’])
– Want indicator features for the presence / absence of each possible substring up to length K
Questions:
1. What is the best runtime of a single Standard Perceptron update?
2. What is the best runtime of a single Kernel Perceptron update?
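One way to make the setup concrete, without claiming to give the intended answers to the two runtime questions: with indicator features, the dot product of two feature vectors equals the number of distinct substrings (up to length K) that both strings contain. The function names and the 26-letter alphabet below are assumptions for this sketch.

```python
def substrings_up_to(s, K):
    """Set of distinct substrings of s with length between 1 and K."""
    return {s[i:i + k] for k in range(1, K + 1) for i in range(len(s) - k + 1)}


def string_kernel(s, t, K):
    """Dot product of the two indicator vectors, computed without building them."""
    return len(substrings_up_to(s, K) & substrings_up_to(t, K))


def explicit_dimension(alphabet_size, K):
    """Length of the explicit indicator vector: one entry per possible substring."""
    return sum(alphabet_size ** k for k in range(1, K + 1))


shared_sat_cat = string_kernel("sat", "cat", K=3)    # shared: 'a', 't', 'at'
dim_lowercase = explicit_dimension(26, 3)            # 26 + 26^2 + 26^3
```

The kernel only touches substrings that actually occur in the two inputs, whereas the explicit vector has one coordinate for every possible substring, which grows exponentially in K.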

Kernels: Discussion
• If all computations involving instances are in terms of inner products then:
– Conceptually, work in a very high-dimensional space, and the algorithm's performance depends only on linear separability in that extended space.
– Computationally, only need to modify the algorithm by replacing each $x \cdot z$ with a $K(x, z)$.
How to choose a kernel:
• Kernels often encode domain knowledge (e.g., string kernels)
• Use Cross-Validation to choose the parameters, e.g., $\sigma$ for Gaussian Kernel $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
• Learn a good kernel; e.g., [Lanckriet-Cristianini-Bartlett-El Ghaoui-Jordan ’04]
Slide from Nina Balcan
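To see why σ is worth tuning, the sketch below builds the Gram matrix of a Gaussian kernel for a narrow and a wide bandwidth. It assumes the form $\exp(-\|x - z\|^2 / (2\sigma^2))$; the sample points and the two σ values are arbitrary illustrations, and no cross-validation is performed here.

```python
import math


def gaussian_kernel(x, z, sigma):
    """exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(X, kernel):
    """Matrix of all pairwise kernel values K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


X_demo = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
G_narrow = gram_matrix(X_demo, lambda x, z: gaussian_kernel(x, z, sigma=0.5))
G_wide = gram_matrix(X_demo, lambda x, z: gaussian_kernel(x, z, sigma=5.0))
```

With the narrow bandwidth, distant points have similarity near zero (behaviour close to nearest-neighbour matching); with the wide one, all points look similar. A kernelized algorithm only ever consumes entries of such a matrix.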

SUPPORT VECTOR MACHINE (SVM)

SVM: Optimization Background
Whiteboard
– Constrained Optimization
– Linear programming
– Quadratic programming
– Example: 2D quadratic function with linear constraints

Quadratic Program
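For reference, a quadratic program is commonly written in the standard form below: a quadratic objective with linear constraints. The symbols $Q$, $c$, $A$, and $b$ are notational assumptions and may differ from the ones used in the lecture.

```latex
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & \tfrac{1}{2}\, x^\top Q\, x + c^\top x \\
\text{s.t.} \quad & A x \le b
\end{aligned}
```

When $Q$ is positive semidefinite the problem is convex, which is the case for the SVM objectives discussed in this lecture.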

SVM
Whiteboard
– SVM Primal (Linearly Separable Case)
– SVM Primal (Non-linearly Separable Case)

SVM QP

Support Vector Machines (SVMs)
Primal form
Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$;
Find $\operatorname{argmin}_{w,\, \xi_1, \ldots, \xi_m} \; \|w\|^2 + C \sum_i \xi_i$ s.t.:
• For all i, $y_i\, (w \cdot x_i) \ge 1 - \xi_i$
• For all i, $\xi_i \ge 0$
Which is equivalent to:
Lagrangian Dual
Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$;
Find $\operatorname{argmin}_{\alpha} \; \frac{1}{2} \sum_i \sum_j y_i y_j \alpha_i \alpha_j \, (x_i \cdot x_j) - \sum_i \alpha_i$ s.t.:
• For all i, $0 \le \alpha_i \le C$
• $\sum_i y_i \alpha_i = 0$
Can be kernelized!!!
Slide from Nina Balcan
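The dual only touches the data through $x_i \cdot x_j$, which is why it can be kernelized. The sketch below evaluates the dual objective and its constraints for a given $\alpha$ and scores a new point with $\sum_i \alpha_i y_i K(x_i, x)$. It does not solve the QP: the function names, the bias-free decision rule, and the two-point toy problem with a hand-picked feasible $\alpha$ are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y, kernel=dot):
    """(1/2) sum_ij y_i y_j a_i a_j K(x_i, x_j) - sum_i a_i."""
    m = len(X)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
               for i in range(m) for j in range(m))
    return 0.5 * quad - sum(alpha)


def is_dual_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C for all i and sum_i y_i alpha_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return in_box and balanced


def decision_value(x, alpha, X, y, kernel=dot):
    """Kernelized score sum_i alpha_i y_i K(x_i, x); its sign is the predicted label."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, X, y))


# Two-point toy problem with a hand-picked feasible alpha (assumed, not solved for).
X_toy = [(1.0, 0.0), (-1.0, 0.0)]
y_toy = [1, -1]
alpha_toy = [0.5, 0.5]
objective_toy = dual_objective(alpha_toy, X_toy, y_toy)
```

Passing a different `kernel` argument (any function of two inputs returning a valid kernel value) changes nothing else in the code, mirroring the slide's remark that the dual can be kernelized.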
