  1. CptS 570 – Machine Learning, School of EECS, Washington State University

  2.  Also known as the support vector machine (SVM)  Discriminant-based method ◦ Learns class boundaries directly  The support vector consists of the examples closest to the boundary  A kernel computes the similarity between examples ◦ Maps the instance space to a higher-dimensional space where (hopefully) linear models suffice  Choosing the right kernel is crucial  Kernel machines are among the best-performing learners

  3.  Likely to underfit using only hyperplanes  But we can map the data to a nonlinear space and use hyperplanes there ◦ $\Phi : \mathbb{R}^d \rightarrow F$ ◦ $x \mapsto \Phi(x)$

  4.  Training set $X = \{x^t, r^t\}$, where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$  Find $w$ and $w_0$ such that $w^T x^t + w_0 \ge +1$ for $r^t = +1$ and $w^T x^t + w_0 \le -1$ for $r^t = -1$, which can be rewritten as $r^t (w^T x^t + w_0) \ge +1$  Note we want $\ge +1$, not $\ge 0$  We want instances some distance from the hyperplane
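As a concrete check of the combined constraint $r^t (w^T x^t + w_0) \ge +1$, here is a minimal NumPy sketch; the toy data and the hand-picked hyperplane are illustrative, not from the slides.

```python
import numpy as np

# Toy 2-D data: class +1 in the upper right, class -1 in the lower left.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1, +1, -1, -1])

# A candidate hyperplane w^T x + w0 = 0, chosen by eye for this toy set.
w = np.array([1.0, 1.0])
w0 = 0.0

# The combined constraint r^t (w^T x^t + w0) >= +1 must hold for every t.
margins = r * (X @ w + w0)
print(margins)               # [4. 6. 4. 4.]
print(np.all(margins >= 1))  # True: every instance is at least 1/||w|| away
```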

  5.  Distance from instance $x^t$ to the hyperplane $w^T x + w_0 = 0$ is $|w^T x^t + w_0| / \lVert w \rVert$, or $r^t (w^T x^t + w_0) / \lVert w \rVert$  The distance from the hyperplane to the closest instances is the margin
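A short sketch of this distance formula on the same illustrative toy data as above: the margin is the minimum of $r^t (w^T x^t + w_0) / \lVert w \rVert$ over the training set.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1, +1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0

# Signed distance of each instance to the hyperplane: r^t (w^T x^t + w0) / ||w||.
dist = r * (X @ w + w0) / np.linalg.norm(w)
print(dist)        # all positive, so the hyperplane separates the classes
print(dist.min())  # the margin: distance to the closest instance(s)
```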

  6.  The optimal separating hyperplane is the one maximizing the margin  We want to choose $w$ maximizing $\rho$ such that $r^t (w^T x^t + w_0) / \lVert w \rVert \ge \rho, \forall t$  Infinite number of solutions by scaling $w$, so we fix $\rho \lVert w \rVert = 1$  Thus, we choose the solution minimizing $\lVert w \rVert$: $$\min \frac{1}{2} \lVert w \rVert^2 \text{ subject to } r^t (w^T x^t + w_0) \ge +1, \forall t$$

  7. $$\min \frac{1}{2} \lVert w \rVert^2 \text{ subject to } r^t (w^T x^t + w_0) \ge +1, \forall t$$  Quadratic optimization problem with complexity polynomial in $d$ (#features)  The kernel will eventually map the $d$-dimensional space to a higher-dimensional space  We prefer a complexity not based on #dimensions

  8.  Convert the optimization problem to depend on the number of training examples $N$ (not $d$) ◦ Still polynomial in $N$  But the optimization will depend only on the closest examples (the support vector) ◦ Typically ≪ $N$

  9.  Rewrite the quadratic optimization problem using Lagrange multipliers $\alpha^t$, $1 \le t \le N$: $$L_p = \frac{1}{2} \lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right] = \frac{1}{2} \lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$  Minimize $L_p$

  10.  Equivalently, we can maximize $L_p$ subject to the constraints: $$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t \qquad \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$$  Plugging these into $L_p$ …

  11. $$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t = -\frac{1}{2} w^T w + \sum_t \alpha^t = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \forall t$  Maximize $L_d$ with respect to the $\alpha^t$ only  Complexity $O(N^3)$
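The slides leave the optimizer unspecified; one hedged way to solve this dual numerically is SciPy's general-purpose SLSQP solver, sketched below on the same toy data as before (a dedicated quadratic-programming solver would be the usual choice in practice).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(r)

# Precompute the N x N matrix with entries r^t r^s (x^t)^T x^s.
Q = (r[:, None] * r[None, :]) * (X @ X.T)

def neg_Ld(alpha):
    # Maximizing L_d = sum(alpha) - 0.5 alpha^T Q alpha is the same as
    # minimizing its negation.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ r},)  # sum_t alpha^t r^t = 0
bnds = [(0, None)] * N                            # alpha^t >= 0
res = minimize(neg_Ld, np.zeros(N), bounds=bnds, constraints=cons, method='SLSQP')
alpha = res.x
print(alpha.round(4))  # the nonzero entries mark the support vectors
```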

  12.  Most $\alpha^t = 0$ ◦ I.e., $r^t (w^T x^t + w_0) > 1$ (these $x^t$ lie outside the margin)  Support vectors: the $x^t$ with $\alpha^t > 0$ ◦ I.e., $r^t (w^T x^t + w_0) = 1$ (these $x^t$ lie on the margin)  $w = \sum_t \alpha^t r^t x^t$  $w_0 = r^t - w^T x^t$ for any support vector $x^t$ ◦ Typically averaged over all support vectors  The resulting discriminant is called the support vector machine (SVM)
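Continuing the dual-solver sketch from slide 11, recovering $w$ and $w_0$ from the $\alpha^t$ looks like this; the 1e-6 threshold for treating an $\alpha^t$ as nonzero is an assumption.

```python
import numpy as np

# Assumes alpha, X, r from the SLSQP sketch above.
sv = alpha > 1e-6                # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]  # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)  # w0 = r^t - w^T x^t, averaged over the SVs
print(w, w0)
```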

  13. [Figure: maximum-margin hyperplane; the circled points (O) are the support vectors lying on the margin]

  14.  Data not linearly separable  Find the hyperplane with the least error  Define slack variables $\xi^t \ge 0$ storing the deviation from the margin: $r^t (w^T x^t + w_0) \ge 1 - \xi^t$

  15.  (a) Correctly classified example far from the margin ($\xi^t = 0$)  (b) Correctly classified example on the margin ($\xi^t = 0$)  (c) Correctly classified example, but inside the margin ($0 < \xi^t < 1$)  (d) Incorrectly classified example ($\xi^t \ge 1$)  Soft error $= \sum_t \xi^t$

  16. [Figure: soft-margin hyperplane; the circled points (O) are the support vectors]

  17.  Lagrangian with slack variables: $$L_p = \frac{1}{2} \lVert w \rVert^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (w^T x^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$  $C$ is the penalty factor  The $\mu^t \ge 0$ are a new set of Lagrange multipliers  We want to minimize $L_p$

  18.  Minimize $L_p$ by setting its derivatives to zero: $$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t \qquad \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0 \qquad \frac{\partial L_p}{\partial \xi^t} = 0 \Rightarrow C - \alpha^t - \mu^t = 0$$  Plugging these into $L_p$ yields the dual $L_d$  Maximize $L_d$ with respect to the $\alpha^t$

  19. $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  Quadratic optimization problem  Support vectors have $\alpha^t > 0$ ◦ Examples on the margin: $\alpha^t < C$ ◦ Examples inside the margin or misclassified: $\alpha^t = C$

  20. Same dual as above: $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  $C$ is a regularization parameter ◦ High $C$ → high penalty for non-separable examples (overfitting) ◦ Low $C$ → less penalty (underfitting) ◦ Determine using a validation set ($C = 1$ is typical)
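To see the effect of $C$ empirically, here is a small sketch using scikit-learn's SVC, whose C parameter plays the same role as the penalty factor above; the overlapping-blobs data set is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping 2-D blobs, so no hyperplane separates them exactly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(+1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Sweep the penalty factor C: large C tolerates few margin violations.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_.sum())  # higher C typically means fewer support vectors
```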

  21.  To use the previous approaches, the data must be nearly linearly separable  If not, perhaps a transformation $\phi(x)$ will help  The $\phi(x)$ are basis functions

  22.  Transform the $d$-dimensional $x$-space to a $k$-dimensional $z$-space using basis functions $\phi(x)$  $z = \phi(x)$, where $z_j = \phi_j(x)$, $j = 1, \ldots, k$ $$g(x) = w^T z = \sum_{j=1}^{k} w_j \phi_j(x)$$  Instead of $w_0$, assume $z_1 = \phi_1(x) \equiv 1$

  23. $$L_p = \frac{1}{2} \lVert w \rVert^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t w^T \phi(x^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$ $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \phi(x^t)^T \phi(x^s) + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  Replace the inner product of basis functions $\phi(x^t)^T \phi(x^s)$ with a kernel function $K(x^t, x^s)$: $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s K(x^t, x^s) + \sum_t \alpha^t$$

  24.  The kernel $K(x^t, x^s)$ computes the $z$-space product $\phi(x^t)^T \phi(x^s)$ in the $x$-space: $$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \phi(x^t)$$ $$g(x) = w^T \phi(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x) = \sum_t \alpha^t r^t K(x^t, x)$$  The matrix of kernel values $K$, where $K_{ts} = K(x^t, x^s)$, is called the Gram matrix  $K$ should be symmetric and positive semidefinite
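These two properties are easy to check numerically; a minimal sketch using the degree-2 polynomial kernel defined on the next slide, on illustrative toy data:

```python
import numpy as np

# Gram matrix for the polynomial kernel K(x, y) = (x^T y + 1)^2 on toy data.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = (X @ X.T + 1.0) ** 2  # K_ts = K(x^t, x^s)

# A valid kernel's Gram matrix is symmetric positive semidefinite.
print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0 (up to rounding)
```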

  25.  Polynomial kernel of degree $q$: $K(x^t, x) = (x^T x^t + 1)^q$  If $q = 1$, then we use the original features  For example, when $q = 2$ and $d = 2$: $$K(x, y) = (x^T y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$ which corresponds to the basis functions $$\phi(x) = \left[ 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, \sqrt{2}\, x_1 x_2, x_1^2, x_2^2 \right]^T$$
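The identity $K(x, y) = \phi(x)^T \phi(y)$ can be verified numerically; a minimal sketch with the basis functions above (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel when d = 2.
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = (x @ y + 1.0) ** 2  # kernel computed in the original x-space
rhs = phi(x) @ phi(y)     # inner product computed in the z-space
print(lhs, rhs)           # both 4.0: the kernel never builds phi explicitly
```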

  26.  Polynomial kernel of degree 2 [Figure: decision boundary and margin; the circled points (O) are the support vectors]

  27.  Radial basis functions (Gaussian kernel): $$K(x^t, x) = \exp\left( -\frac{\lVert x - x^t \rVert^2}{2 s^2} \right)$$  $x^t$ is the center  $s$ is the radius  A larger $s$ implies smoother boundaries
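A minimal sketch of the Gaussian kernel showing how the radius $s$ controls the rate of decay (the points and radii are illustrative):

```python
import numpy as np

def gaussian_kernel(x, center, s):
    # K(x^t, x) = exp(-||x - x^t||^2 / (2 s^2)), with center x^t and radius s.
    return np.exp(-np.sum((x - center) ** 2) / (2 * s ** 2))

x, xt = np.array([1.0, 0.0]), np.array([0.0, 0.0])
for s in (0.5, 1.0, 2.0):
    # Larger s decays more slowly, giving smoother decision boundaries.
    print(s, gaussian_kernel(x, xt, s))
```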

  28. [Figure]

  29.  Sigmoidal functions: $K(x^t, x) = \tanh(2 x^T x^t + 1)$

  30.  The kernel $K(x, y)$ increases with the similarity between $x$ and $y$  Prior knowledge can be included in the kernel function  E.g., the training examples are documents ◦ $K(D_1, D_2)$ = # shared words  E.g., the training examples are strings (e.g., DNA) ◦ $K(S_1, S_2) = 1\,/\,$edit distance between $S_1$ and $S_2$ ◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform $S_1$ into $S_2$
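A sketch of such a string kernel, with classic dynamic-programming edit distance; the +1 in the denominator is an assumption added here to avoid dividing by zero for identical strings, a case the slide's formula leaves open.

```python
def edit_distance(s1, s2):
    # Levenshtein distance: minimum number of insertions, deletions,
    # and substitutions needed to transform s1 into s2.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def string_kernel(s1, s2):
    # Similarity as on the slide, with an assumed +1 to handle distance 0.
    return 1.0 / (1.0 + edit_distance(s1, s2))

print(string_kernel("GATTACA", "GACTATA"))
```

Note that similarities defined this way are not guaranteed to produce a positive semidefinite Gram matrix, which slide 24 lists as a requirement.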

  31.  E.g., the training examples are nodes in a graph (e.g., a social network)  $K(N_1, N_2) = 1\,/\,$length of the shortest path connecting the nodes  $K(N_1, N_2)$ = # paths connecting the nodes  Diffusion kernel
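A sketch of the shortest-path variant using breadth-first search; the toy graph and the convention that a node has similarity 1 with itself are assumptions.

```python
from collections import deque

def shortest_path_length(adj, n1, n2):
    # Breadth-first search over an adjacency-list graph.
    if n1 == n2:
        return 0
    seen, queue = {n1}, deque([(n1, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj[node]:
            if nb == n2:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return float('inf')  # disconnected nodes: similarity 0 below

def node_kernel(adj, n1, n2):
    # Similarity as on the slide: 1 / shortest-path length between the nodes.
    return 1.0 if n1 == n2 else 1.0 / shortest_path_length(adj, n1, n2)

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
print(node_kernel(adj, 'a', 'd'))  # path a-b-c-d has length 3, so 1/3
```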
