Support vector machines (SVMs)
Lecture 6
David Sontag, New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Pegasos vs. Perceptron
Pegasos Algorithm
Initialize: w_1 = 0, t = 0
For iter = 1, 2, …, 20
  For j = 1, 2, …, |data|
    t = t + 1
    η_t = 1/(t λ)
    If y_j (w_t · x_j) < 1:
      w_{t+1} = (1 − η_t λ) w_t + η_t y_j x_j
    Else:
      w_{t+1} = (1 − η_t λ) w_t
Output: w_{t+1}
Pegasos vs. Perceptron
Perceptron Algorithm (the same loop with η_t = 1, no regularization λ, and the margin threshold replaced by 0)
Initialize: w_1 = 0, t = 0
For iter = 1, 2, …, 20
  For j = 1, 2, …, |data|
    t = t + 1
    If y_j (w_t · x_j) ≤ 0:
      w_{t+1} = w_t + y_j x_j
    Else:
      w_{t+1} = w_t
Output: w_{t+1}
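A minimal Python/NumPy sketch of the two algorithms above, written to mirror the pseudocode (the arrays X and y, the regularization parameter lam, and the fixed 20-epoch loop are assumed inputs/choices; an illustration, not a reference implementation):

import numpy as np

def pegasos(X, y, lam=0.1, n_epochs=20):
    """Pegasos: stochastic sub-gradient descent on the regularized hinge loss.
    X is an (m, d) array of examples, y an (m,) array of labels in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(m):
            t += 1
            eta = 1.0 / (t * lam)               # decaying step size eta_t = 1/(t*lambda)
            if y[j] * np.dot(w, X[j]) < 1:      # margin violated: shrink w and add the example
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                               # margin satisfied: only shrink w
                w = (1 - eta * lam) * w
    return w

def perceptron(X, y, n_epochs=20):
    """Perceptron: the same loop with eta_t = 1, no shrinkage, and a mistake test at 0."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for j in range(m):
            if y[j] * np.dot(w, X[j]) <= 0:     # mistake (or on the boundary): add the example
                w = w + y[j] * X[j]
    return w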
Much faster than previous methods
• 3 datasets (provided by Joachims)
  – Reuters CCAT (800K examples, 47k features)
  – Physics ArXiv (62k examples, 100k features)
  – Covertype (581k examples, 54 features)
Training time (in seconds):
                  Pegasos   SVM-Perf   SVM-Light
  Reuters               2         77      20,075
  Covertype             6         85      25,514
  Astro-Physics         2          5          80
Running time guarantee: error decomposition [Shalev-Shwartz, Srebro '08]
(Note: w_0 is redefined in this context (see below) – it does not refer to the initial weight vector.)
Prediction error of the output w decomposes as
  err(w) = err(w_0) + [err(w*) − err(w_0)] + [err(w) − err(w*)]
• Approximation error, err(w_0):
  – Best error achievable by a large-margin predictor
  – Error of the population minimizer w_0 = argmin E[f(w)] = argmin λ|w|² + E_{x,y}[loss(⟨w, x⟩; y)]
• Estimation error, err(w*) − err(w_0):
  – Extra error due to replacing E[loss] with the empirical loss; w* = argmin f_n(w)
• Optimization error, err(w) − err(w*):
  – Extra error due to only optimizing to within finite precision
Running time guarantee [Shalev-Shwartz, Srebro '08]
Pegasos: after T = Õ(1/(λε)) updates, with probability 1 − δ,
  err(w_T) < err(w_0) + ε
(Same error decomposition as on the previous slide: approximation, estimation, and optimization error.)
Extending to multi-class classification
One versus all classification
Learn 3 classifiers:
• − vs {o,+}, weights w−
• + vs {o,−}, weights w+
• o vs {+,−}, weights wo
Predict label using: ŷ = argmax_y w_y · x
[Figure: the three one-vs-all decision boundaries with weight vectors w+, w−, wo]
Any problems? Could we learn this (1-D) dataset?
[Figure: a 1-D dataset with points at −1, 0, and 1]
Multi-class SVM
Simultaneously learn 3 sets of weights:
• How do we guarantee the correct labels?
• Need new constraints! The "score" of the correct class must be better than the "score" of the wrong classes:
  w_{y_j} · x_j > w_{y'} · x_j   for all y' ≠ y_j
[Figure: the three class regions with weight vectors w+, w−, wo]
Multi-class SVM
As for the binary SVM, we introduce slack variables and maximize margin:
  minimize   (1/2) Σ_y ‖w_y‖² + C Σ_j ξ_j
  subject to w_{y_j} · x_j ≥ w_{y'} · x_j + 1 − ξ_j   for all y' ≠ y_j,   ξ_j ≥ 0
To predict, we use: ŷ = argmax_y w_y · x
Now can we learn it?
[Figure: the same 1-D dataset with points at −1, 0, and 1]
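As an illustration of the prediction rule and the new constraints, a small NumPy sketch (the per-class weight matrix W is an assumed input, e.g. produced by any multi-class training procedure; the function names are hypothetical):

import numpy as np

def predict_multiclass(W, x):
    """Predict with one weight vector per class (rows of W): y_hat = argmax_y  w_y . x."""
    return int(np.argmax(W @ x))

def multiclass_slack(W, x, y):
    """Slack xi needed for example (x, y): how far the correct class's score falls short
    of beating every wrong class's score by a margin of 1."""
    scores = W @ x
    violations = 1.0 - (scores[y] - np.delete(scores, y))  # 1 - (correct score - wrong score)
    return max(0.0, float(np.max(violations)))             # xi = max(0, worst violation)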
How to deal with imbalanced data?
• In many practical applications we may have imbalanced data sets
• We may want errors to be equally distributed between the positive and negative classes
• A slight modification to the SVM objective does the trick: class-specific weighting of the slack variables, e.g.
  minimize (1/2)‖w‖² + C_+ Σ_{j : y_j = +1} ξ_j + C_− Σ_{j : y_j = −1} ξ_j
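For example, scikit-learn's SVC exposes exactly this kind of class-specific slack weighting through its class_weight argument; a sketch on a small synthetic imbalanced dataset (the 10:1 weighting is an illustrative choice):

import numpy as np
from sklearn.svm import SVC

# Tiny synthetic imbalanced dataset: 5 positives, 50 negatives (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(5, 2)), rng.normal(-2.0, 1.0, size=(50, 2))])
y = np.array([+1] * 5 + [-1] * 50)

# Errors on the rare positive class cost 10x more; class_weight="balanced" would
# instead weight each class inversely to its frequency.
clf = SVC(kernel="linear", C=1.0, class_weight={+1: 10.0, -1: 1.0})
clf.fit(X, y)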
What if the data is not linearly separable?
Use features of features of features of features….
  φ(x) = [ x^(1), …, x^(n), x^(1) x^(2), x^(1) x^(3), …, e^(x^(1)), … ]
Feature space can get really large really quickly!
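A quick way to see the blow-up: count the features produced by an explicit polynomial expansion, here with scikit-learn's PolynomialFeatures (the input dimension n = 100 and the degrees are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.zeros((1, 100))                        # a single example with n = 100 raw features
for degree in (2, 3, 4):
    phi = PolynomialFeatures(degree=degree).fit_transform(x)
    # Number of monomials of degree <= d in n variables grows combinatorially, roughly n^d.
    print(degree, phi.shape[1])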
Key idea #3: the kernel trick
• High dimensional feature spaces at no extra cost!
• After every update (of Pegasos), the weight vector can be written in the form:
  w = Σ_i α_i y_i x_i
• As a result, prediction can be performed with:
  ŷ = sign( w · φ(x) )
    = sign( ( Σ_i α_i y_i φ(x_i) ) · φ(x) )
    = sign( Σ_i α_i y_i ( φ(x_i) · φ(x) ) )
    = sign( Σ_i α_i y_i K(x_i, x) ),   where K(x, x′) = φ(x) · φ(x′)
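A minimal sketch of this dual-form prediction (the coefficients alphas, the training set, and the kernel function are assumed to be given; any valid kernel can be plugged in):

import numpy as np

def predict_kernel(alphas, y_train, X_train, x, kernel):
    """Dual-form prediction: sign( sum_i alpha_i y_i K(x_i, x) )."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_train, X_train))
    return np.sign(score)

# Example kernels:
linear = lambda u, v: float(np.dot(u, v))
poly3 = lambda u, v: float(np.dot(u, v)) ** 3                                  # degree exactly 3
rbf = lambda u, v, s=1.0: float(np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2)))  # Gaussian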
Common kernels • Polynomials of degree exactly d • Polynomials of degree up to d • Gaussian kernels • Sigmoid • And many others: very active area of research!
Polynomial kernel
d = 1:  φ(u) · φ(v) = [u_1, u_2] · [v_1, v_2] = u_1 v_1 + u_2 v_2 = u · v
d = 2:  φ(u) · φ(v) = [u_1², u_1 u_2, u_2 u_1, u_2²] · [v_1², v_1 v_2, v_2 v_1, v_2²]
                    = u_1² v_1² + 2 u_1 v_1 u_2 v_2 + u_2² v_2²
                    = (u_1 v_1 + u_2 v_2)²
                    = (u · v)²
For any d (we will skip the proof):  φ(u) · φ(v) = (u · v)^d
These are polynomials of degree exactly d.
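The d = 2 identity is easy to check numerically; a small sketch comparing the explicit feature map against the kernel on 2-D inputs (the particular values are arbitrary):

import numpy as np

def phi2(u):
    """Explicit degree-2 feature map for 2-D input: [u1^2, u1*u2, u2*u1, u2^2]."""
    return np.array([u[0] ** 2, u[0] * u[1], u[1] * u[0], u[1] ** 2])

u, v = np.array([1.5, -0.7]), np.array([0.3, 2.0])
lhs = phi2(u) @ phi2(v)        # dot product in the expanded feature space
rhs = (u @ v) ** 2             # kernel evaluation K(u, v) = (u.v)^2
assert np.isclose(lhs, rhs)    # identical up to floating-point error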
Quadratic kernel [Tommi Jaakkola]
Gaussian kernel
  K(u, v) = exp( −‖u − v‖² / (2σ²) )
[Figure: level sets of the learned function, i.e. points x with Σ_i α_i y_i K(x_i, x) = r for some r; support vectors highlighted]
[Cynthia Rudin] [mblondel.org]
Kernel algebra
Q: How would you prove that the "Gaussian kernel" is a valid kernel?
A: Expand the Euclidean norm as follows:
  K(u, v) = exp( −‖u − v‖² / (2σ²) )
          = exp( −‖u‖² / (2σ²) ) · exp( u · v / σ² ) · exp( −‖v‖² / (2σ²) )
Then, apply (e) from above. To see that the middle factor exp( u · v / σ² ) is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c).
The feature mapping is infinite dimensional!
[Justin Domke]
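An empirical counterpart to the argument above: the Gram matrix of a Gaussian kernel on any point set should be positive semi-definite. A sketch with randomly drawn points and an illustrative bandwidth:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 arbitrary points in R^3
sigma = 1.0                                   # illustrative bandwidth

# Gaussian kernel Gram matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(K)               # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-8)                 # all eigenvalues are (numerically) non-negative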
Dual SVM interpretation: Sparsity
[Figure: separating hyperplane w · x + b = 0 with margin hyperplanes w · x + b = +1 and w · x + b = −1]
Final solution tends to be sparse
• α_j = 0 for most j
• don't need to store these points to compute w or make predictions
Non-support vectors:
• α_j = 0
• moving them will not change w
Support vectors:
• α_j > 0
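This sparsity is easy to observe in practice: after fitting, scikit-learn's SVC stores only the support vectors. A sketch on synthetic data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, size=(100, 2)), rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.array([+1] * 100 + [-1] * 100)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# Only the points with alpha_j > 0 are kept; typically far fewer than the 200 training points.
print(len(clf.support_), "support vectors out of", len(X))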
Overfitting? • Huge feature space with kernels: should we worry about overfitting? – SVM objective seeks a solution with large margin • Theory says that large margin leads to good generalization (we will see this in a couple of lectures) – But everything overfits sometimes!!! – Can control by: • Setting C • Choosing a better Kernel • Varying parameters of the Kernel (width of Gaussian, etc.)
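In practice, C and the kernel parameters are tuned by cross-validation; a sketch using scikit-learn's GridSearchCV over C and the RBF width gamma (the grid values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Search over the regularization strength C and the Gaussian kernel width (gamma ~ 1/width^2).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)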