Lecture 17: Multi-class SVMs, Kernels
Aykut Erdem, December 2016, Hacettepe University
Administrative
• We will have a make-up lecture on Saturday, December 17, 2016 (I will check the date today).
• Project progress reports are due today!
Last time… Support Vector Machines
Decision regions: $\langle w, x \rangle + b \geq 1$ and $\langle w, x \rangle + b \leq -1$
Linear function: $f(x) = \langle w, x \rangle + b$
(slide by Alex Smola)
Last time… Support Vector Machines
Margin hyperplanes: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$
Optimization problem: $\max_{w,b} \frac{1}{\|w\|}$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$
(slide by Alex Smola)
Last time… Support Vector Machines
Margin hyperplanes: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$
Optimization problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$
(slide by Alex Smola)
Last time… Support Vector Machines
Primal: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$
Solution: $w = \sum_i y_i \alpha_i x_i$
Dual: $\max_{\alpha} -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
(slide by Alex Smola)
Last time… Large Margin Classifier
Margin hyperplanes: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$
Support vectors are the points with $\alpha_i > 0$.
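As a small sketch (not from the slides), the support vectors, i.e. the training points with $\alpha_i > 0$, can be inspected directly with scikit-learn's SVC; the toy data here is made up for illustration, and a very large C approximates the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data, two linearly separable classes (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard-margin SVM
clf.fit(X, y)

print("support vectors (alpha_i > 0):", clf.support_vectors_)
print("y_i * alpha_i:", clf.dual_coef_)         # signed dual coefficients
print("w =", clf.coef_, "b =", clf.intercept_)  # w = sum_i y_i alpha_i x_i
```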
Last time… Soft-margin Classifier
$\langle w, x \rangle + b \geq 1$ vs. $\langle w, x \rangle + b \leq -1$: for this data an error-free separator is impossible.
Theorem (Minsky & Papert): Finding the minimum-error separating hyperplane is NP-hard.
(slide by Alex Smola)
Last time… Adding Slack Variables
Slack $\xi_i \geq 0$: require $\langle w, x \rangle + b \geq 1 - \xi$ (positive class) and $\langle w, x \rangle + b \leq -1 + \xi$ (negative class).
Minimize the amount of slack; this is a convex optimization problem.
(slide by Alex Smola)
Last time… Adding Slack Variables
• For $0 < \xi \leq 1$, the point lies inside the margin but is still correctly classified.
• For $\xi > 1$, the point is misclassified.
Slack $\xi_i \geq 0$: $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$.
Minimize the amount of slack; this is a convex optimization problem.
(adopted from Andrew Zisserman)
Last time… Adding Slack Variables
• Hard-margin problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1$
• With slack variables: $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
The problem is always feasible. Proof: $w = 0$, $b = 0$, $\xi_i = 1$ is feasible (and also yields an upper bound on the objective).
(slide by Alex Smola)
Soft-margin classifier
• Optimisation problem: $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
C is a regularization parameter:
• small C allows constraints to be easily ignored → large margin
• large C makes constraints hard to ignore → narrow margin
• C = ∞ enforces all constraints: hard margin
(adopted from Andrew Zisserman)
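A rough illustration of the role of C (my own sketch, not part of the slides): fit the soft-margin SVM for several values of C on synthetic data and watch the geometric margin $2 / \|w\|$ shrink as C grows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two slightly overlapping Gaussian blobs (synthetic data).
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    # Small C -> wide margin; large C -> narrow margin (towards hard margin).
    print(f"C={C:7.2f}  margin = {2.0 / np.linalg.norm(w):.3f}  "
          f"#support vectors = {len(clf.support_)}")
```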
Demo time…
This week
• Multi-class classification
• Introduction to kernels
Multi-class classification (example figures; slides by Eric Xing)
One versus all classification
• Learn 3 classifiers:
  – "−" vs. {o, +}, weights $w_-$
  – "+" vs. {o, −}, weights $w_+$
  – "o" vs. {+, −}, weights $w_o$
• Predict the label using the highest-scoring classifier: $\hat{y} = \arg\max_y \, \langle w_y, x \rangle + b_y$
• Any problems?
• Could we learn this dataset?
(slide by Eric Xing)
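A minimal sketch of the one-versus-all decision rule (the weights and data below are hypothetical, purely to show the shapes): each class has its own weight vector and bias, and the predicted label is the class whose classifier scores highest.

```python
import numpy as np

def ova_predict(X, W, b):
    """One-vs-all prediction: pick the class with the largest score.

    X: (n_samples, n_features), W: (n_classes, n_features), b: (n_classes,)
    """
    scores = X @ W.T + b          # (n_samples, n_classes)
    return np.argmax(scores, axis=1)

# Hypothetical weights for 3 classes in 2-D.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
X = np.array([[2.0, 0.1], [0.1, 2.0], [-1.5, -1.5]])
print(ova_predict(X, W, b))  # -> [0 1 2]
```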
Multi-class SVM
• Simultaneously learn 3 sets of weights: $w_+$, $w_-$, $w_o$
• How do we guarantee the correct labels?
• Need new constraints! The "score" of the correct class must be better than the "score" of the wrong classes: $\langle w_{y_i}, x_i \rangle + b_{y_i} > \langle w_{y'}, x_i \rangle + b_{y'}$ for all $y' \neq y_i$
(slide by Eric Xing)
Multi-class SVM
• As for the binary SVM, we introduce slack variables and maximize the margin:
$\min_{w, b, \xi} \; \tfrac{1}{2} \sum_y \|w_y\|^2 + C \sum_i \xi_i$ subject to $\langle w_{y_i}, x_i \rangle + b_{y_i} \geq \langle w_{y'}, x_i \rangle + b_{y'} + 1 - \xi_i$ for all $y' \neq y_i$, and $\xi_i \geq 0$
• To predict, we use: $\hat{y} = \arg\max_y \, \langle w_y, x \rangle + b_y$
• Now can we learn it?
(slide by Eric Xing)
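The slack variables above can be folded into a multi-class hinge loss. The sketch below is a standard Crammer-Singer-style version written by me (the exact form on the slide is assumed): the slack for example i is the amount by which the best wrong-class score plus the margin exceeds the correct-class score.

```python
import numpy as np

def multiclass_svm_objective(W, b, X, y, C=1.0):
    """0.5 * sum_y ||w_y||^2 + C * sum_i xi_i, where
    xi_i = max(0, 1 + max_{y' != y_i} score_{y'}(x_i) - score_{y_i}(x_i))."""
    scores = X @ W.T + b                       # (n_samples, n_classes)
    correct = scores[np.arange(len(y)), y]     # score of the true class
    scores_wrong = scores.copy()
    scores_wrong[np.arange(len(y)), y] = -np.inf
    slacks = np.maximum(0.0, 1.0 + scores_wrong.max(axis=1) - correct)
    return 0.5 * np.sum(W ** 2) + C * np.sum(slacks)
```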
Kernels (slides by Alex Smola)
Non-linear features
• Regression: we got nonlinear functions by preprocessing
• Perceptron:
  – Map the data into a feature space $x \to \phi(x)$
  – Solve the problem in this space
  – In the code, replace every query $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$
• Feature Perceptron: the solution lies in the span of the $\phi(x_i)$
(slide by Alex Smola)
Non-linear features
• Separating surfaces are circles, hyperbolae, parabolae
(slide by Alex Smola)
Solving XOR: map $(x_1, x_2) \to (x_1, x_2, x_1 x_2)$
• XOR is not linearly separable in the original space
• Mapping into 3 dimensions makes it easily solvable
(slide by Alex Smola)
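A quick numpy check of the XOR trick (my own illustration, not from the slides): after adding the product feature $x_1 x_2$, a single linear threshold separates the two classes.

```python
import numpy as np

# XOR data with labels in {-1, +1}: label = +1 iff the two signs agree.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])

# Feature map (x1, x2) -> (x1, x2, x1*x2).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the lifted space the hyperplane w = (0, 0, 1), b = 0 separates perfectly.
w, b = np.array([0.0, 0.0, 1.0]), 0.0
print(np.sign(Phi @ w + b) == y)   # -> [ True  True  True  True]
```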
Linear Separation with Quadratic Kernels (figure; slide by Alex Smola)
Quadratic Features
Quadratic features in $\mathbb{R}^2$: $\Phi(x) := \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right)$
Dot product: $\langle \Phi(x), \Phi(x') \rangle = \left\langle \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right), \left( x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2 \right) \right\rangle = \langle x, x' \rangle^2$
Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.
(slide by Alex Smola)
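A short numerical check (my own sketch) that the explicit quadratic feature map and the squared dot product agree, as the slide claims.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map in R^2: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi(x) @ phi(xp)        # dot product in feature space
rhs = (x @ xp) ** 2           # kernel evaluated in the input space
print(np.isclose(lhs, rhs))   # -> True
```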
Computational Efficiency
Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions lead to roughly $5 \cdot 10^5$ numbers. For higher-order polynomial features it is much worse.
Solution: Don't compute the features; try to compute dot products implicitly. For some features this works . . .
Definition: A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$.
If $k(x, x')$ is much cheaper to compute than $\Phi(x)$ . . .
(slide by Alex Smola)
Recap: The Perceptron
  initialize $w = 0$ and $b = 0$
  repeat
    if $y_i [\langle w, x_i \rangle + b] \leq 0$ then
      $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
    end if
  until all classified correctly
• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} y_i x_i$
• Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$
(slide by Alex Smola)
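A minimal numpy sketch of the perceptron exactly as written above; the epoch cap is my own addition so the loop always terminates on non-separable data.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron; y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi  # update: w <- w + y_i x_i, b <- b + y_i
                mistakes += 1
        if mistakes == 0:                   # all classified correctly
            break
    return w, b
```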
Recap: The Perceptron on Features
  initialize $w = 0$, $b = 0$
  repeat
    pick $(x_i, y_i)$ from the data
    if $y_i (w \cdot \Phi(x_i) + b) \leq 0$ then
      $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
  until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$
• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} y_i \phi(x_i)$
• Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$
(slide by Alex Smola)
The Kernel Perceptron
  initialize $f = 0$
  repeat
    pick $(x_i, y_i)$ from the data
    if $y_i f(x_i) \leq 0$ then
      $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
  until $y_i f(x_i) > 0$ for all $i$
• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} y_i \phi(x_i)$
• Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$
(slide by Alex Smola)
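The same algorithm in kernelized form, as a sketch of my own: instead of a weight vector we keep one coefficient per training point, and the classifier is $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$. The counting of mistake-driven updates in alpha is an implementation choice, not something prescribed by the slide.

```python
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """Kernel perceptron; k(x, z) is any kernel function, y in {-1, +1}."""
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i]) + b   # f(x_i)
            if y[i] * f_i <= 0:
                alpha[i] += 1.0                     # f <- f + y_i k(x_i, .)
                b += y[i]                           # ... + y_i
                mistakes += 1
        if mistakes == 0:
            break
    # Resulting classifier: f(x) = sum_i alpha_i y_i k(x_i, x) + b
    return alpha, b

# Example usage with the quadratic kernel from earlier:
# alpha, b = kernel_perceptron(X, y, lambda u, v: (u @ v) ** 2)
```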
Processing Pipeline
• Original data
• Data in feature space (implicit)
• Solve in feature space using kernels
(slide by Alex Smola)
Polynomial Kernels
Idea: We want to extend $k(x, x') = \langle x, x' \rangle^2$ to $k(x, x') = (\langle x, x' \rangle + c)^d$ where $c > 0$ and $d \in \mathbb{N}$. Prove that such a kernel corresponds to a dot product.
Proof strategy: Simple and straightforward. Compute the explicit sum given by the kernel, i.e.
$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i \, c^{d-i}$
The individual terms $\langle x, x' \rangle^i$ are dot products for some $\Phi_i(x)$.
(slide by Alex Smola)
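An illustrative check of the proof strategy (my own code): evaluating the polynomial kernel directly agrees with its binomial expansion, where each term $\langle x, x' \rangle^i$ is itself a dot product of degree-$i$ monomial features.

```python
import numpy as np
from math import comb

def poly_kernel(x, xp, c=1.0, d=3):
    """Polynomial kernel k(x, x') = (<x, x'> + c)^d with c > 0, d in N."""
    return (x @ xp + c) ** d

def poly_kernel_expanded(x, xp, c=1.0, d=3):
    """Same kernel via the binomial expansion sum_i C(d, i) <x, x'>^i c^(d-i)."""
    s = x @ xp
    return sum(comb(d, i) * s ** i * c ** (d - i) for i in range(d + 1))

rng = np.random.RandomState(1)
x, xp = rng.randn(4), rng.randn(4)
print(np.isclose(poly_kernel(x, xp), poly_kernel_expanded(x, xp)))  # -> True
```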
Kernel Conditions
Computability: We have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products themselves).
"Nice and Useful" Functions: The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.
Symmetry: Obviously $k(x, x') = k(x', x)$ due to the symmetry of the dot product $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.
Dot Product in Feature Space: Is there always a $\Phi$ such that $k$ really is a dot product?
(slide by Alex Smola)
Mercer's Theorem
The Theorem: For any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies
$\int_{\mathcal{X} \times \mathcal{X}} k(x, x') f(x) f(x')\, dx\, dx' \geq 0$ for all $f \in L_2(\mathcal{X})$,
there exist $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \geq 0$ with
$k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$ for all $x, x' \in \mathcal{X}$.
Interpretation: The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have $\sum_i \sum_j k(x_i, x_j) \alpha_i \alpha_j \geq 0$.
(slide by Alex Smola)
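On a finite sample, the positivity condition above reduces to the kernel matrix being positive semidefinite. A quick sketch of my own for checking this numerically via eigenvalues; the example kernels are illustrative choices.

```python
import numpy as np

def is_psd_kernel_matrix(X, k, tol=1e-10):
    """Check sum_{i,j} alpha_i alpha_j k(x_i, x_j) >= 0 for all alpha,
    i.e. that the kernel matrix has no significantly negative eigenvalues."""
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)     # K is symmetric, so eigvalsh applies
    return eigvals.min() >= -tol

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
# A valid polynomial kernel passes the test:
print(is_psd_kernel_matrix(X, lambda u, v: (u @ v + 1.0) ** 2))      # -> True
# A symmetric function that is not a kernel has negative eigenvalues:
print(is_psd_kernel_matrix(X, lambda u, v: -np.linalg.norm(u - v)))  # -> False
```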
Properties
Distance in Feature Space: the distance between points in feature space is
$d(x, x')^2 := \|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle - 2 \langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = k(x, x) + k(x', x') - 2 k(x, x')$
Kernel Matrix: To compare observations we compute dot products, so we study the matrix $K$ given by $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, where the $x_i$ are the training patterns.
Similarity Measure: The entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.
(slide by Alex Smola)
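A short sketch of the two quantities on this slide: the kernel (Gram) matrix and the induced feature-space distance. The Gaussian RBF kernel is used here purely as an example; any valid kernel works the same way.

```python
import numpy as np

def rbf(u, v, gamma=0.5):
    """Gaussian RBF kernel, used here only as an example kernel."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_matrix(X, k):
    """Gram matrix K_ij = k(x_i, x_j) over the training patterns."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def feature_distance(x, xp, k):
    """d(x, x') = sqrt(k(x, x) + k(x', x') - 2 k(x, x'))."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

rng = np.random.RandomState(0)
X = rng.randn(5, 2)
K = kernel_matrix(X, rbf)
print(K.shape, feature_distance(X[0], X[1], rbf))
```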