  1. CptS 570 – Machine Learning, School of EECS, Washington State University

  2.  Also known as the support vector machine (SVM)  Discriminant-based method ◦ Learns class boundaries directly  The support vector consists of the examples closest to the boundary  A kernel computes the similarity between examples ◦ Maps the instance space to a higher-dimensional space where (hopefully) linear models suffice  Choosing the right kernel is crucial  Kernel machines are among the best-performing learners

  3.  Likely to underfit using only hyperplanes  But we can map the data to a nonlinear space and use hyperplanes there ◦ $\Phi : \mathbb{R}^d \rightarrow F$ ◦ $x \mapsto \Phi(x)$

  4.  Training set $X = \{x^t, r^t\}$, where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$  Find $w$ and $w_0$ such that $w^T x^t + w_0 \ge +1$ for $r^t = +1$ and $w^T x^t + w_0 \le -1$ for $r^t = -1$, which can be rewritten as $r^t (w^T x^t + w_0) \ge +1$  Note we want $\ge +1$, not $\ge 0$  We want instances some distance from the hyperplane
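As a concrete check of the combined constraint $r^t (w^T x^t + w_0) \ge +1$, here is a minimal NumPy sketch; the toy data and the hand-picked hyperplane are illustrative, not from the slides.

```python
import numpy as np

# Toy 2-D data: class +1 in the upper right, class -1 in the lower left.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1, +1, -1, -1])

# A candidate hyperplane w^T x + w0 = 0, chosen by eye for this toy set.
w = np.array([1.0, 1.0])
w0 = 0.0

# The combined constraint r^t (w^T x^t + w0) >= +1 must hold for every t.
margins = r * (X @ w + w0)
print(margins)               # [4. 6. 4. 4.]
print(np.all(margins >= 1))  # True: every instance is at least 1/||w|| away
```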

  5.  Distance from instance $x^t$ to the hyperplane $w^T x + w_0 = 0$ is $|w^T x^t + w_0| / \lVert w \rVert$, or $r^t (w^T x^t + w_0) / \lVert w \rVert$  The distance from the hyperplane to the closest instances is the margin
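A short sketch of this distance formula on the same illustrative toy data as above: the margin is the minimum of $r^t (w^T x^t + w_0) / \lVert w \rVert$ over the training set.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1, +1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0

# Signed distance of each instance to the hyperplane: r^t (w^T x^t + w0) / ||w||.
dist = r * (X @ w + w0) / np.linalg.norm(w)
print(dist)        # all positive, so the hyperplane separates the classes
print(dist.min())  # the margin: distance to the closest instance(s)
```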

  6.  The optimal separating hyperplane is the one maximizing the margin  We want to choose $w$ maximizing $\rho$ such that $r^t (w^T x^t + w_0) / \lVert w \rVert \ge \rho, \forall t$  Infinite number of solutions by scaling $w$, so we fix $\rho \lVert w \rVert = 1$  Thus, we choose the solution minimizing $\lVert w \rVert$: $$\min \frac{1}{2} \lVert w \rVert^2 \text{ subject to } r^t (w^T x^t + w_0) \ge +1, \forall t$$

  7. $$\min \frac{1}{2} \lVert w \rVert^2 \text{ subject to } r^t (w^T x^t + w_0) \ge +1, \forall t$$  Quadratic optimization problem with complexity polynomial in $d$ (#features)  The kernel will eventually map the $d$-dimensional space to a higher-dimensional space  We prefer a complexity not based on #dimensions

  8.  Convert the optimization problem to depend on the number of training examples $N$ (not $d$) ◦ Still polynomial in $N$  But the optimization will depend only on the closest examples (the support vector) ◦ Typically ≪ $N$

  9.  Rewrite the quadratic optimization problem using Lagrange multipliers $\alpha^t$, $1 \le t \le N$: $$L_p = \frac{1}{2} \lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right] = \frac{1}{2} \lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$  Minimize $L_p$

  10.  Equivalently, we can maximize $L_p$ subject to the constraints: $$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t \qquad \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$$  Plugging these into $L_p$ …

  11. $$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t = -\frac{1}{2} w^T w + \sum_t \alpha^t = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \forall t$  Maximize $L_d$ with respect to the $\alpha^t$ only  Complexity $O(N^3)$
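The slides leave the optimizer unspecified; one hedged way to solve this dual numerically is SciPy's general-purpose SLSQP solver, sketched below on the same toy data as before (a dedicated quadratic-programming solver would be the usual choice in practice).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(r)

# Precompute the N x N matrix with entries r^t r^s (x^t)^T x^s.
Q = (r[:, None] * r[None, :]) * (X @ X.T)

def neg_Ld(alpha):
    # Maximizing L_d = sum(alpha) - 0.5 alpha^T Q alpha is the same as
    # minimizing its negation.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ r},)  # sum_t alpha^t r^t = 0
bnds = [(0, None)] * N                            # alpha^t >= 0
res = minimize(neg_Ld, np.zeros(N), bounds=bnds, constraints=cons, method='SLSQP')
alpha = res.x
print(alpha.round(4))  # the nonzero entries mark the support vectors
```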

  12.  Most $\alpha^t = 0$ ◦ I.e., $r^t (w^T x^t + w_0) > 1$ (these $x^t$ lie outside the margin)  Support vectors: the $x^t$ with $\alpha^t > 0$ ◦ I.e., $r^t (w^T x^t + w_0) = 1$ (these $x^t$ lie on the margin)  $w = \sum_t \alpha^t r^t x^t$  $w_0 = r^t - w^T x^t$ for any support vector $x^t$ ◦ Typically averaged over all support vectors  The resulting discriminant is called the support vector machine (SVM)
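Continuing the dual-solver sketch from slide 11, recovering $w$ and $w_0$ from the $\alpha^t$ looks like this; the 1e-6 threshold for treating an $\alpha^t$ as nonzero is an assumption.

```python
import numpy as np

# Assumes alpha, X, r from the SLSQP sketch above.
sv = alpha > 1e-6                # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]  # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)  # w0 = r^t - w^T x^t, averaged over the SVs
print(w, w0)
```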

  13. [Figure: maximum-margin hyperplane; the circled points (O) are the support vectors lying on the margin]

  14.  Data not linearly separable  Find the hyperplane with the least error  Define slack variables $\xi^t \ge 0$ storing the deviation from the margin: $r^t (w^T x^t + w_0) \ge 1 - \xi^t$

  15.  (a) Correctly classified example far from the margin ($\xi^t = 0$)  (b) Correctly classified example on the margin ($\xi^t = 0$)  (c) Correctly classified example, but inside the margin ($0 < \xi^t < 1$)  (d) Incorrectly classified example ($\xi^t \ge 1$)  Soft error $= \sum_t \xi^t$

  16. [Figure: soft-margin hyperplane; the circled points (O) are the support vectors]

  17.  Lagrangian with slack variables: $$L_p = \frac{1}{2} \lVert w \rVert^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (w^T x^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$  $C$ is the penalty factor  The $\mu^t \ge 0$ are a new set of Lagrange multipliers  We want to minimize $L_p$

  18.  Minimize $L_p$ by setting its derivatives to zero: $$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t \qquad \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0 \qquad \frac{\partial L_p}{\partial \xi^t} = 0 \Rightarrow C - \alpha^t - \mu^t = 0$$  Plugging these into $L_p$ yields the dual $L_d$  Maximize $L_d$ with respect to the $\alpha^t$

  19. $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  Quadratic optimization problem  Support vectors have $\alpha^t > 0$ ◦ Examples on the margin: $\alpha^t < C$ ◦ Examples inside the margin or misclassified: $\alpha^t = C$

  20. Same dual as above: $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  $C$ is a regularization parameter ◦ High $C$ → high penalty for non-separable examples (overfitting) ◦ Low $C$ → less penalty (underfitting) ◦ Determine using a validation set ($C = 1$ is typical)
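To see the effect of $C$ empirically, here is a small sketch using scikit-learn's SVC, whose C parameter plays the same role as the penalty factor above; the overlapping-blobs data set is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping 2-D blobs, so no hyperplane separates them exactly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(+1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Sweep the penalty factor C: large C tolerates few margin violations.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_.sum())  # higher C typically means fewer support vectors
```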

  21.  To use the previous approaches, the data must be nearly linearly separable  If not, perhaps a transformation $\phi(x)$ will help  The $\phi(x)$ are basis functions

  22.  Transform the $d$-dimensional $x$-space to a $k$-dimensional $z$-space using basis functions $\phi(x)$  $z = \phi(x)$, where $z_j = \phi_j(x)$, $j = 1, \ldots, k$ $$g(x) = w^T z = \sum_{j=1}^{k} w_j \phi_j(x)$$  Instead of $w_0$, assume $z_1 = \phi_1(x) \equiv 1$

  23. $$L_p = \frac{1}{2} \lVert w \rVert^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t w^T \phi(x^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$ $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \phi(x^t)^T \phi(x^s) + \sum_t \alpha^t$$ subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \forall t$  Replace the inner product of basis functions $\phi(x^t)^T \phi(x^s)$ with a kernel function $K(x^t, x^s)$: $$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s K(x^t, x^s) + \sum_t \alpha^t$$

  24.  The kernel $K(x^t, x^s)$ computes the $z$-space product $\phi(x^t)^T \phi(x^s)$ in the $x$-space: $$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \phi(x^t)$$ $$g(x) = w^T \phi(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x) = \sum_t \alpha^t r^t K(x^t, x)$$  The matrix of kernel values $K$, where $K_{ts} = K(x^t, x^s)$, is called the Gram matrix  $K$ should be symmetric and positive semidefinite
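These two properties are easy to check numerically; a minimal sketch using the degree-2 polynomial kernel defined on the next slide, on illustrative toy data:

```python
import numpy as np

# Gram matrix for the polynomial kernel K(x, y) = (x^T y + 1)^2 on toy data.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = (X @ X.T + 1.0) ** 2  # K_ts = K(x^t, x^s)

# A valid kernel's Gram matrix is symmetric positive semidefinite.
print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0 (up to rounding)
```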

  25.  Polynomial kernel of degree $q$: $K(x^t, x) = (x^T x^t + 1)^q$  If $q = 1$, then we use the original features  For example, when $q = 2$ and $d = 2$: $$K(x, y) = (x^T y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$ which corresponds to the basis functions $$\phi(x) = \left[ 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, \sqrt{2}\, x_1 x_2, x_1^2, x_2^2 \right]^T$$
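The identity $K(x, y) = \phi(x)^T \phi(y)$ can be verified numerically; a minimal sketch with the basis functions above (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel when d = 2.
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = (x @ y + 1.0) ** 2  # kernel computed in the original x-space
rhs = phi(x) @ phi(y)     # inner product computed in the z-space
print(lhs, rhs)           # both 4.0: the kernel never builds phi explicitly
```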

  26.  Polynomial kernel of degree 2 [Figure: decision boundary and margin; the circled points (O) are the support vectors]

  27.  Radial basis functions (Gaussian kernel): $$K(x^t, x) = \exp\left( -\frac{\lVert x - x^t \rVert^2}{2 s^2} \right)$$  $x^t$ is the center  $s$ is the radius  A larger $s$ implies smoother boundaries
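A minimal sketch of the Gaussian kernel showing how the radius $s$ controls the rate of decay (the points and radii are illustrative):

```python
import numpy as np

def gaussian_kernel(x, center, s):
    # K(x^t, x) = exp(-||x - x^t||^2 / (2 s^2)), with center x^t and radius s.
    return np.exp(-np.sum((x - center) ** 2) / (2 * s ** 2))

x, xt = np.array([1.0, 0.0]), np.array([0.0, 0.0])
for s in (0.5, 1.0, 2.0):
    # Larger s decays more slowly, giving smoother decision boundaries.
    print(s, gaussian_kernel(x, xt, s))
```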

  28. [Figure]

  29.  Sigmoidal functions: $K(x^t, x) = \tanh(2 x^T x^t + 1)$

  30.  The kernel $K(x, y)$ increases with the similarity between $x$ and $y$  Prior knowledge can be included in the kernel function  E.g., the training examples are documents ◦ $K(D_1, D_2)$ = # shared words  E.g., the training examples are strings (e.g., DNA) ◦ $K(S_1, S_2) = 1\,/\,$edit distance between $S_1$ and $S_2$ ◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform $S_1$ into $S_2$
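A sketch of such a string kernel, with classic dynamic-programming edit distance; the +1 in the denominator is an assumption added here to avoid dividing by zero for identical strings, a case the slide's formula leaves open.

```python
def edit_distance(s1, s2):
    # Levenshtein distance: minimum number of insertions, deletions,
    # and substitutions needed to transform s1 into s2.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def string_kernel(s1, s2):
    # Similarity as on the slide, with an assumed +1 to handle distance 0.
    return 1.0 / (1.0 + edit_distance(s1, s2))

print(string_kernel("GATTACA", "GACTATA"))
```

Note that similarities defined this way are not guaranteed to produce a positive semidefinite Gram matrix, which slide 24 lists as a requirement.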

  31.  E.g., the training examples are nodes in a graph (e.g., a social network)  $K(N_1, N_2) = 1\,/\,$length of the shortest path connecting the nodes  $K(N_1, N_2)$ = # paths connecting the nodes  Diffusion kernel
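A sketch of the shortest-path variant using breadth-first search; the toy graph and the convention that a node has similarity 1 with itself are assumptions.

```python
from collections import deque

def shortest_path_length(adj, n1, n2):
    # Breadth-first search over an adjacency-list graph.
    if n1 == n2:
        return 0
    seen, queue = {n1}, deque([(n1, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj[node]:
            if nb == n2:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return float('inf')  # disconnected nodes: similarity 0 below

def node_kernel(adj, n1, n2):
    # Similarity as on the slide: 1 / shortest-path length between the nodes.
    return 1.0 if n1 == n2 else 1.0 / shortest_path_length(adj, n1, n2)

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
print(node_kernel(adj, 'a', 'd'))  # path a-b-c-d has length 3, so 1/3
```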
