Kernel-based Methods and Support Vector Machines
Larry Holder
CptS 570 – Machine Learning
School of Electrical Engineering and Computer Science
Washington State University
References
- Müller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, 12(2):181-201, 2001.
Learning Problem
- Estimate a function f : R^N → {-1, +1} using n training data (x_i, y_i) sampled from P(x, y)
- Want f minimizing the expected error (risk) R[f]:

  R[f] = ∫ loss(f(x), y) dP(x, y)

- P(x, y) is unknown, so compute the empirical risk R_emp[f]:

  R_emp[f] = (1/n) Σ_{i=1}^n loss(f(x_i), y_i)
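As a concrete illustration of R_emp[f], here is a minimal Python sketch (not from the lecture) that computes the empirical risk under 0/1 loss; the classifier and toy data are made up for the example.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/n) * sum of losses over the training sample (0/1 loss here)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)          # fraction of misclassified examples

# Toy example: a fixed linear classifier f(x) = sign(w.x + b), labels in {-1, +1}
w, b = np.array([1.0, -2.0]), 0.5
f = lambda x: 1 if np.dot(w, x) + b >= 0 else -1

X = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, -1])
print(empirical_risk(f, X, y))
```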
Overfit
- Using R_emp[f] to estimate R[f] for small n may lead to overfitting
Overfit
- Can restrict the class F of functions f
  - I.e., restrict the VC dimension h of F
  - Model selection
- Find F such that the learned f ∈ F minimizes the resulting upper bound on R[f]
- With probability 1 − δ and n > h:

  R[f] ≤ R_emp[f] + √( ( h (ln(2n/h) + 1) − ln(δ/4) ) / n )
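A small sketch (illustrative, not from the lecture) of the confidence (uncertainty) term from the bound above, showing how it grows with the VC dimension h; the values of n and δ are arbitrary choices.

```python
import numpy as np

def vc_confidence(h, n, delta):
    """Uncertainty term of the bound: sqrt((h*(ln(2n/h)+1) - ln(delta/4)) / n), for n > h."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

# The bound R[f] <= R_emp[f] + vc_confidence(h, n, delta) holds with probability 1 - delta.
for h in (10, 100, 1000):
    print(h, vc_confidence(h, n=10000, delta=0.05))
```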
Overfit
- Tradeoff between the empirical risk R_emp[f] and the uncertainty in the estimate of R[f]

[Figure: expected risk, empirical risk, and uncertainty (confidence term) plotted against the complexity of F]
Margins
- Consider a training sample separable by the hyperplane f(x) = (w · x) + b
- The margin is the minimal distance of a sample to the decision surface
- We can bound the VC dimension of the set of hyperplanes by bounding the margin

[Figure: separating hyperplane with normal vector w and margin]
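A minimal sketch (illustrative, not from the lecture) that computes the margin of a sample with respect to a given hyperplane (w, b) as the minimal distance |w · x_i + b| / ||w||.

```python
import numpy as np

def margin(w, b, X):
    """Minimal geometric distance of the samples in X to the hyperplane w.x + b = 0."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Toy hyperplane and sample points
w, b = np.array([2.0, 1.0]), -1.0
X = np.array([[1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])
print(margin(w, b, X))
```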
Nonlinear Algorithms
- Likely to underfit using only hyperplanes
- But we can map the data to a nonlinear feature space and use hyperplanes there:

  Φ : R^N → F
  x ↦ Φ(x)
Curse of Dimensionality
- Difficulty of learning increases with the dimensionality of the problem
  - I.e., harder to learn with more features
- But difficulty also depends on the complexity of the learning algorithm and the VC dimension of the hypothesis class
  - Hyperplanes are easy to learn
- Still, mapping to extremely high-dimensional spaces makes even hyperplane learning difficult
Kernel Functions
- For some feature spaces F and mappings Φ there is a "trick" for efficiently computing scalar products
- Kernel functions compute scalar products in F without mapping the data to F, or even knowing Φ
Kernel Functions
- Example kernel k:

  Φ : R^2 → R^3
  (x_1, x_2) ↦ (z_1, z_2, z_3) = (x_1^2, √2 x_1 x_2, x_2^2)

  (Φ(x) · Φ(y)) = (x_1^2, √2 x_1 x_2, x_2^2)(y_1^2, √2 y_1 y_2, y_2^2)^T
                = ((x_1, x_2)(y_1, y_2)^T)^2
                = (x · y)^2
                = k(x, y)
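The identity above can be checked numerically; this small sketch compares the explicit degree-2 feature map Φ with the kernel k(x, y) = (x · y)^2 on arbitrary points.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map R^2 -> R^3 from the slide."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """Degree-2 polynomial kernel: same scalar product, without mapping to R^3."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)), k(x, y))   # both equal (1*3 + 2*(-1))^2 = 1.0
```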
Kernel Functions

  Gaussian RBF:            k(x, y) = exp(−||x − y||^2 / c)
  Polynomial:              k(x, y) = ((x · y) + θ)^d
  Sigmoidal:               k(x, y) = tanh(κ(x · y) + θ)
  Inverse multiquadratic:  k(x, y) = 1 / √(||x − y||^2 + c^2)
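The four kernels above transcribe directly into code; this sketch uses illustrative default parameters (c, d, θ, κ are free choices, not values from the lecture).

```python
import numpy as np

def gaussian_rbf(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def polynomial(x, y, d=3, theta=1.0):
    return (np.dot(x, y) + theta) ** d

def sigmoidal(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

def inverse_multiquadratic(x, y, c=1.0):
    return 1.0 / np.sqrt(np.linalg.norm(x - y) ** 2 + c ** 2)
```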
Support Vector Machines
- Supervised learning:

  y_i ((w · x_i) + b) ≥ 1,  i = 1, …, n

- Mapping to a nonlinear space:

  y_i ((w · Φ(x_i)) + b) ≥ 1,  i = 1, …, n    (Eq. 8)

- Minimize (subject to Eq. 8):

  min_{w,b} (1/2) ||w||^2
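As a small illustration (not from the lecture), this sketch evaluates the primal objective (1/2)||w||^2 and checks whether a candidate (w, b) satisfies the hard-margin constraints on toy data.

```python
import numpy as np

def primal_objective(w):
    return 0.5 * np.dot(w, w)

def satisfies_constraints(w, b, X, y):
    """Check y_i * ((w . x_i) + b) >= 1 for every training point."""
    return np.all(y * (X @ w + b) >= 1)

X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1, -1])
w, b = np.array([0.5, 0.5]), 0.0
print(primal_objective(w), satisfies_constraints(w, b, X, y))
```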
Support Vector Machines
- Problem: w resides in F, where computation is difficult
- Solution: remove the dependency on w
  - Introduce Lagrange multipliers α_i ≥ 0, i = 1, …, n
    - One for each constraint in Eq. 8
  - And use a kernel function
Support Vector Machines

  L(w, b, α) = (1/2) ||w||^2 − Σ_{i=1}^n α_i ( y_i ((w · Φ(x_i)) + b) − 1 )

  ∂L/∂b = 0  →  Σ_{i=1}^n α_i y_i = 0

  ∂L/∂w = 0  →  w = Σ_{i=1}^n α_i y_i Φ(x_i)

Substituting the last two equations into the first and replacing (Φ(x_i) · Φ(x_j)) with the kernel function k(x_i, x_j) …
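For a linear kernel, the condition w = Σ_i α_i y_i x_i can be checked on a trained SVM. A hedged sketch assuming scikit-learn is available, whose dual_coef_ attribute stores y_i α_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Small, linearly separable toy problem (illustrative data)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 2.0], [5.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, so this recovers w:
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print(w_from_alphas, clf.coef_)     # the two should match
print(np.sum(clf.dual_coef_))       # constraint sum_i alpha_i y_i = 0 -> approximately 0
```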
Support Vector Machines

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j k(x_i, x_j)

  Subject to:
    α_i ≥ 0,  i = 1, …, n
    Σ_{i=1}^n α_i y_i = 0

This is a quadratic optimization problem.
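This dual can be handed to any quadratic programming routine. Dedicated SVM solvers (e.g., SMO) are used in practice, but as an illustrative sketch the problem can be solved with SciPy's general-purpose SLSQP optimizer on toy data (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

k = lambda a, b: np.dot(a, b)                        # linear kernel for the toy data
K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def neg_dual(alpha):                                 # minimize the negative of the dual
    return 0.5 * alpha @ (np.outer(y, y) * K) @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda alpha: alpha @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                           # alpha_i >= 0 (hard margin)

result = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints)
print(result.x)                                      # optimal Lagrange multipliers
```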
Support Vector Machines
- Once we have α, we have w and can perform classification:

  f(x) = sgn( Σ_{i=1}^n α_i y_i (Φ(x_i) · Φ(x)) + b )
       = sgn( Σ_{i=1}^n α_i y_i k(x_i, x) + b ),  where

  b = (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^n α_j y_j k(x_i, x_j) )
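A minimal sketch of the resulting classifier, given multipliers alpha and bias b obtained from the dual (e.g., by the solver sketched earlier); the function name and arguments are illustrative.

```python
import numpy as np

def svm_classify(x, alpha, b, X_train, y_train, kernel):
    """f(x) = sgn( sum_i alpha_i * y_i * k(x_i, x) + b )"""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```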
SVMs with Noise
- Until now, we have assumed the problem is linearly separable in some space
- But if noise is present, this may be a bad assumption
- Solution: introduce noise terms (slack variables ξ_i) into the classification constraints:

  y_i ((w · x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, n
SVMs with Noise
- Now we want to minimize:

  min_{w,b,ξ}  (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i

- where C > 0 determines the tradeoff between empirical error and hypothesis complexity
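The effect of C can be observed with an off-the-shelf soft-margin SVM. This hedged sketch assumes scikit-learn is available and uses synthetic overlapping classes, so some slack is unavoidable:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping classes: no hyperplane separates them without slack
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C fits the training data more closely and typically uses fewer support vectors
    print(C, clf.score(X, y), len(clf.support_))
```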
SVMs with Noise

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j k(x_i, x_j)

  Subject to:
    0 ≤ α_i ≤ C,  i = 1, …, n
    Σ_{i=1}^n α_i y_i = 0

where C limits the size of the Lagrange multipliers α_i
Sparsity
- Note that many training examples will be outside the margin
- Therefore, their optimal α_i = 0

  α_i = 0      ⇒  y_i f(x_i) ≥ 1 and ξ_i = 0
  0 < α_i < C  ⇒  y_i f(x_i) = 1 and ξ_i = 0
  α_i = C      ⇒  y_i f(x_i) ≤ 1 and ξ_i ≥ 0

- This reduces the optimization problem from n variables down to the number of examples on or inside the margin
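Sparsity is easy to observe on a fitted soft-margin SVM: only the examples with α_i > 0 (the support vectors) enter the final expansion. A hedged sketch, again assuming scikit-learn is available and using synthetic data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=1)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Only these examples have alpha_i > 0; all others drop out of the expansion.
print("support vectors:", len(clf.support_), "of", len(X), "training examples")

# Examples at the box constraint alpha_i = C are on the wrong side of the margin (xi_i may be > 0).
at_bound = np.isclose(np.abs(clf.dual_coef_).ravel(), clf.C)
print("alpha_i = C for", at_bound.sum(), "support vectors")
```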
Kernel Methods
- Fisher's linear discriminant
  - Find a linear projection of the feature space such that the classes are well separated
  - "Well separated" defined as a large difference in the class means and a small variance along the discriminant
  - Can be solved using kernel methods to find nonlinear discriminants
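For reference, the linear Fisher discriminant direction is w ∝ S_W^{-1}(m_1 − m_2), where S_W is the within-class scatter; the kernelized version generalizes this criterion. A minimal sketch on synthetic data (illustrative, not from the lecture):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Linear Fisher discriminant: w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class scatter
    return np.linalg.solve(S_w, m1 - m2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class 1 sample
X2 = rng.normal(loc=[3, 1], scale=1.0, size=(100, 2))   # class 2 sample
print(fisher_direction(X1, X2))
```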
Applications
- Optical pattern and object recognition
  - Invariant SVM achieved the best error rate (0.6%) on the USPS handwritten digit recognition problem
  - Better than humans (2.5%)
- Text categorization
- Time-series prediction
Applications
- Gene expression profile analysis
- DNA and protein analysis
  - An SVM method (13% error) for classifying DNA translation initiation sites outperforms the best neural network (15% error)
  - Virtual SVMs, incorporating prior biological knowledge, reached an 11-12% error rate
Kernel Methods for Unsupervised Learning
- Principal Components Analysis (PCA) is used in unsupervised learning
- PCA is a linear method
- Kernel-based PCA can extract nonlinear components using standard kernel techniques
- Applied to USPS data for noise reduction, it showed a factor-of-8 performance improvement over the linear PCA method
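A hedged sketch of kernel PCA denoising using scikit-learn's KernelPCA (assumed available); scikit-learn's digits dataset stands in for USPS, and the kernel, gamma, and component count are illustrative choices, not the settings from the paper.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.datasets import load_digits

X = load_digits().data / 16.0                       # small stand-in for the USPS digits
X_noisy = X + np.random.default_rng(0).normal(scale=0.2, size=X.shape)

kpca = KernelPCA(n_components=32, kernel="rbf", gamma=0.1,
                 fit_inverse_transform=True)        # needed to map back to input space
X_denoised = kpca.inverse_transform(kpca.fit_transform(X_noisy))

# Compare reconstruction error before and after projecting onto the nonlinear components
print(np.mean((X - X_noisy) ** 2), np.mean((X - X_denoised) ** 2))
```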
Summary
- (+) Kernel-based methods let linear learning algorithms operate in nonlinear feature spaces without the cost of an explicit mapping
- (+) Support vector machines ignore all but the most differentiating training data (those on or inside the margin)
- (+) Kernel-based methods, and SVMs in particular, are among the best-performing classifiers on many learning problems
- (-) Choosing an appropriate kernel can be difficult
- (-) The high dimensionality of the original learning problem can still be a computational bottleneck