Photo by Arthur Gretton: CMU Machine Learning Protestors at G20
BBM406 Fundamentals of Machine Learning
Lecture 15: Support Vector Machines
Aykut Erdem // Hacettepe University // Fall 2019
Announcement
• Midterm exam on Dec 6, 2019 (moved from Nov 29) at 09.00 in rooms D3 & D4
• No class next Wednesday! Extra office hour.
• No class on Friday! Make-up class on Dec 2 (Monday), 15:00-17:00
• No change in the due date of your Assg 3!
2
Last time… AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 3
Last time… Understanding ConvNets
http://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 4
Last time… Data Augmentation
Random mix/combinations of: translation, rotation, stretching, shearing, lens distortions, …
5
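A minimal sketch of these random augmentations using torchvision.transforms; the library choice and the specific parameter values are assumptions for illustration, not from the slide:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=15,            # random rotation
                   translate=(0.1, 0.1),  # random translation (fraction of width/height)
                   scale=(0.8, 1.2),      # random stretching
                   shear=10),             # random shearing
    T.RandomHorizontalFlip(),             # another common augmentation
    T.ToTensor(),
])
# augmented = augment(pil_image)          # re-sampled independently every time it is called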
Last time… Transfer Learning with Convolutional Networks
1. Train on Imagenet
2. Small dataset: feature extractor (freeze the original layers, train only the new top layer)
3. Medium dataset: finetuning (more data = retrain more of the network, or all of it)
tip: use only ~1/10th of the original learning rate in finetuning the top layer, and ~1/100th on intermediate layers
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 6
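A rough PyTorch sketch of options 2 and 3 above; the framework, model, and layer names are assumptions, and the exact torchvision API may differ across versions:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)        # step 1: weights pretrained on ImageNet

# Option 2 (small dataset): use the net as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False                 # freeze everything ...
model.fc = nn.Linear(model.fc.in_features, 10)  # ... except a freshly initialized top layer

# Option 3 (medium dataset): additionally unfreeze some intermediate layers
# and finetune them with a much smaller learning rate (~1/10th to ~1/100th).
for param in model.layer4.parameters():
    param.requires_grad = True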
Today • Support Vector Machines - Large Margin Separation - Optimization Problem - Support Vectors 7
Recap: Binary Classification Problem
• Training data: sample drawn i.i.d. from set X ⊆ R^N according to some distribution D, S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {-1, +1})^m
• Problem: find hypothesis h : X → {-1, +1} in H (classifier) with small generalization error R_D(h)
• Linear classification:
- Hypotheses based on hyperplanes.
- Linear separation in high-dimensional space.
slide by Mehryar Mohri 8
Example: Spam
• Imagine 3 features (spam is "positive" class):
1. free (number of occurrences of "free")
2. money (occurrences of "money")
3. BIAS (intercept, always has value 1)
Example email x = "free money"; feature vector f(x): BIAS: 1, free: 1, money: 1, …; weight vector w: BIAS: -3, free: 4, money: 2, …
w · f(x) > 0 ➞ SPAM!!!
slide by David Sontag 9
Binary Decision Rule
• In the space of feature vectors
- Examples are points
- Any weight vector is a hyperplane
- One side corresponds to Y = +1 (SPAM)
- Other corresponds to Y = -1 (HAM)
[Figure: decision boundary in the (free, money) plane for w = (BIAS: -3, free: 4, money: 2)]
slide by David Sontag 10
The perceptron algorithm
• Start with weight vector w = 0
• For each training instance (x_i, y*_i):
- Classify with current weights: y = sign(w · f(x_i))
- If correct (i.e., y = y*_i), no change!
- If wrong: update w = w + y*_i f(x_i)
slide by David Sontag 11
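A minimal numpy sketch of this update rule; the toy spam data reuses the (BIAS, free, money) features from the earlier example and is made up for illustration:

import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron on feature vectors X (m x d) with labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # start with the zero weight vector
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:      # misclassified (or on the boundary)
                w = w + y_i * x_i              # update: w <- w + y*_i f(x_i)
    return w

# Toy spam data with features (BIAS, free, money).
X = np.array([[1.0, 1.0, 1.0],                 # "free money"            -> spam (+1)
              [1.0, 0.0, 0.0]])                # mail with neither word  -> ham  (-1)
y = np.array([+1, -1])
print(perceptron(X, y))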
Properties of the perceptron algorithm
• Separability: some parameters get the training set perfectly correct
• Convergence: if the training data is linearly separable, the perceptron will eventually converge
slide by David Sontag 12
Problems with the perceptron algorithm
• Noise: if the data isn't linearly separable, no guarantees of convergence or accuracy
• Frequently the training data is linearly separable! Why?
- When the number of features is much larger than the number of data points, there is lots of flexibility
- As a result, the perceptron can significantly overfit the data
• Averaged perceptron is an algorithmic modification that helps with both issues
- Averages the weight vectors across all iterations
slide by David Sontag 13
Linear Separators • Which of these linear separators is optimal? slide by David Sontag 14
Support Vector Machines 15
Linear Separator Ham Spam slide by Alex Smola 16
Large Margin Classifier Ham Spam slide by Alex Smola 17
Review: Normal to a plane
w/‖w‖ is the unit vector in the direction of w, normal to the plane w · x + b = 0.
x̄_j is the projection of x_j onto the plane, so x_j = x̄_j + λ w/‖w‖,
where λ is the length of the vector x_j - x̄_j, i.e., ‖x_j - x̄_j‖ = λ.
slide by David Sontag 18
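A quick numpy check of this decomposition, with made-up numbers for w, b, and x_j:

import numpy as np

w = np.array([3.0, 4.0])       # normal vector (made-up)
b = -5.0
x_j = np.array([4.0, 7.0])     # an arbitrary point

lam = (np.dot(w, x_j) + b) / np.linalg.norm(w)    # signed distance to the plane
x_bar = x_j - lam * w / np.linalg.norm(w)         # projection of x_j onto the plane

print(np.dot(w, x_bar) + b)                       # ~0: x_bar lies on the plane
print(np.linalg.norm(x_j - x_bar), abs(lam))      # equal: ||x_j - x_bar|| = |lambda|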
Scale invariance
Any other ways of writing the same dividing line?
• w · x + b = 0
• 2w · x + 2b = 0
• 1000w · x + 1000b = 0
• …
slide by David Sontag 19
Scale invariance
During learning, we set the scale by asking that, for all t,
for y_t = +1, w · x_t + b ≥ 1, and for y_t = -1, w · x_t + b ≤ -1.
That is, we want to satisfy all of the linear constraints
y_t (w · x_t + b) ≥ 1 ∀t
slide by David Sontag 20
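A small numeric illustration, reusing the made-up (free, money) spam weights from earlier, of why the scale has to be fixed by a constraint rather than by the decision rule itself:

import numpy as np

w, b = np.array([4.0, 2.0]), -3.0    # the (free, money) spam weights from earlier
x = np.array([1.0, 1.0])             # "free money"

# Rescaling w and b leaves the decision rule unchanged:
print(np.sign(w @ x + b), np.sign((1000 * w) @ x + 1000 * b))   # same sign

# So the scale is fixed by requiring y_t (w . x_t + b) >= 1 for every
# training point, with equality for the points closest to the boundary.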
Large Margin Classifier
⟨w, x⟩ + b ≥ 1 on one side, ⟨w, x⟩ + b ≤ -1 on the other
linear function f(x) = ⟨w, x⟩ + b
slide by Alex Smola 21
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
margin = ⟨x_+ - x_-, w/‖w‖⟩ = (1/‖w‖) [[⟨x_+, w⟩ + b] - [⟨x_-, w⟩ + b]] = 2/‖w‖
slide by Alex Smola 22
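A numeric sanity check of the margin formula, with a made-up canonical hyperplane and two points lying exactly on the margins:

import numpy as np

w, b = np.array([2.0, 0.0]), -3.0          # made-up canonical hyperplane
x_plus  = np.array([2.0, 1.0])             # a point with <w, x> + b = +1
x_minus = np.array([1.0, 5.0])             # a point with <w, x> + b = -1

margin = np.dot(x_plus - x_minus, w / np.linalg.norm(w))
print(margin, 2 / np.linalg.norm(w))       # both 1.0: margin = 2 / ||w||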
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
optimization problem: maximize_{w,b} 1/‖w‖ subject to y_i [⟨x_i, w⟩ + b] ≥ 1
slide by Alex Smola 23
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
slide by Alex Smola
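One way to solve this hard-margin problem in practice is an off-the-shelf soft-margin solver with a very large C, which approximates the hard-margin case; a sketch with scikit-learn on made-up toy data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 4.0],          # class -1
              [3.0, 1.0], [4.0, 2.0]])         # class +1
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)              # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))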
Convex Programs for Dummies
• Primal optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0
• Lagrange function: L(x, α) = f(x) + Σ_i α_i c_i(x)
• First order optimality conditions in x: ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0
• Solve for x and plug it back into L: maximize_α L(x(α), α)  (keep explicit constraints)
slide by Alex Smola
Dual Problem
• Primal optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
• Lagrange function (one multiplier α_i per constraint): L(w, b, α) = (1/2)‖w‖² - Σ_i α_i [y_i [⟨x_i, w⟩ + b] - 1]
• Optimality in w, b is at a saddle point with α
• Derivatives in w, b need to vanish
slide by Alex Smola
Dual Problem
• Lagrange function: L(w, b, α) = (1/2)‖w‖² - Σ_i α_i [y_i [⟨x_i, w⟩ + b] - 1]
• Derivatives in w, b need to vanish:
∂_w L(w, b, α) = w - Σ_i α_i y_i x_i = 0
∂_b L(w, b, α) = Σ_i α_i y_i = 0
• Plugging terms back into L yields the dual:
maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
subject to Σ_i α_i y_i = 0 and α_i ≥ 0
slide by Alex Smola
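A sketch of solving this dual directly with a generic solver (scipy's SLSQP here; real SVM packages use specialized QP/SMO solvers), on made-up toy data:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
m = len(y)
K = X @ X.T                                    # Gram matrix of inner products <x_i, x_j>

def neg_dual(a):                               # minimize the negative of the dual objective
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0

alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                          # a point with alpha_i > 0
b = y[sv] - w @ X[sv]                          # from y_i (<w, x_i> + b) = 1
print(alpha, w, b)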
Support Vector Machines
primal: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with solution w = Σ_i y_i α_i x_i
dual: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i subject to Σ_i α_i y_i = 0 and α_i ≥ 0
slide by Alex Smola
Support Vectors
minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨w, x_i⟩ + b] ≥ 1, with w = Σ_i y_i α_i x_i
Karush-Kuhn-Tucker optimality condition: α_i [y_i [⟨w, x_i⟩ + b] - 1] = 0
so either α_i = 0, or α_i > 0 ⇒ y_i [⟨w, x_i⟩ + b] = 1
slide by Alex Smola
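With scikit-learn's SVC the points with α_i > 0 are exposed directly; a sketch on the same made-up toy data as before, checking that each support vector sits exactly on the margin:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin

print(clf.support_)          # indices i with alpha_i > 0
print(clf.support_vectors_)  # the corresponding x_i
print(clf.dual_coef_)        # y_i * alpha_i for the support vectors

# KKT: every support vector lies exactly on the margin, y_i (<w, x_i> + b) = 1.
w, b = clf.coef_[0], clf.intercept_[0]
print(y[clf.support_] * (clf.support_vectors_ @ w + b))   # ~1 for each support vector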
Properties
w = Σ_i y_i α_i x_i
• Weight vector w is a weighted linear combination of instances
• Only points on the margin matter (ignore the rest and get the same solution)
• Only inner products matter
- Quadratic program
- We can replace the inner product by a kernel
• Keeps instances away from the margin
slide by Alex Smola
Example slide by Alex Smola
Example slide by Alex Smola
Why Large Margins?
• Maximum robustness relative to uncertainty
• Symmetry breaking
• Independent of correctly classified instances
• Easy to find for easy problems
[Figure: margin ρ separating the + and o points]
slide by Alex Smola
Watch: Patrick Winston, Support Vector Machines https://www.youtube.com/watch?v=_PwhiWxHK8o 34
Next Lecture: Soft Margin Classification, Multi-class SVMs 35