Photo by Arthur Gretton: CMU Machine Learning Protestors at G20
BBM406 Fundamentals of Machine Learning
Lecture 15: Support Vector Machines
Aykut Erdem // Hacettepe University // Fall 2019
Announcement
• Midterm exam on Dec 6, 2019 (moved from Nov 29) at 09.00 in rooms D3 & D4
• No class next Wednesday! Extra office hour.
• No class on Friday! Make-up class on Dec 2 (Monday), 15:00-17:00
• No change in the due date of your Assg 3!
2
Last time… AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 3
Last time… Understanding ConvNets
http://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 4
Last time… Data Augmentation
Random mix/combinations of: translation, rotation, stretching, shearing, lens distortions, …
5
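A minimal sketch of these random augmentations using torchvision.transforms; the library choice and the specific parameter values are assumptions for illustration, not from the slide:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=15,            # random rotation
                   translate=(0.1, 0.1),  # random translation (fraction of width/height)
                   scale=(0.8, 1.2),      # random stretching
                   shear=10),             # random shearing
    T.RandomHorizontalFlip(),             # another common augmentation
    T.ToTensor(),
])
# augmented = augment(pil_image)          # re-sampled independently every time it is called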
Last time… Transfer Learning with Convolutional Networks
1. Train on Imagenet
2. Small dataset: feature extractor (freeze the original layers, train only the new top layer)
3. Medium dataset: finetuning (more data = retrain more of the network, or all of it)
tip: use only ~1/10th of the original learning rate in finetuning the top layer, and ~1/100th on intermediate layers
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson 6
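A rough PyTorch sketch of options 2 and 3 above; the framework, model, and layer names are assumptions, and the exact torchvision API may differ across versions:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)        # step 1: weights pretrained on ImageNet

# Option 2 (small dataset): use the net as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False                 # freeze everything ...
model.fc = nn.Linear(model.fc.in_features, 10)  # ... except a freshly initialized top layer

# Option 3 (medium dataset): additionally unfreeze some intermediate layers
# and finetune them with a much smaller learning rate (~1/10th to ~1/100th).
for param in model.layer4.parameters():
    param.requires_grad = True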
Today • Support Vector Machines - Large Margin Separation - Optimization Problem - Support Vectors 7
Recap: Binary Classification Problem
• Training data: sample drawn i.i.d. from set X ⊆ R^N according to some distribution D, S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {-1, +1})^m
• Problem: find hypothesis h : X → {-1, +1} in H (classifier) with small generalization error R_D(h)
• Linear classification:
- Hypotheses based on hyperplanes.
- Linear separation in high-dimensional space.
slide by Mehryar Mohri 8
Example: Spam
• Imagine 3 features (spam is "positive" class):
1. free (number of occurrences of "free")
2. money (occurrences of "money")
3. BIAS (intercept, always has value 1)
Example email x = "free money"; feature vector f(x): BIAS: 1, free: 1, money: 1, …; weight vector w: BIAS: -3, free: 4, money: 2, …
w · f(x) > 0 ➞ SPAM!!!
slide by David Sontag 9
Binary Decision Rule
• In the space of feature vectors
- Examples are points
- Any weight vector is a hyperplane
- One side corresponds to Y = +1 (SPAM)
- Other corresponds to Y = -1 (HAM)
[Figure: decision boundary in the (free, money) plane for w = (BIAS: -3, free: 4, money: 2)]
slide by David Sontag 10
The perceptron algorithm
• Start with weight vector w = 0
• For each training instance (x_i, y*_i):
- Classify with current weights: y = sign(w · f(x_i))
- If correct (i.e., y = y*_i), no change!
- If wrong: update w = w + y*_i f(x_i)
slide by David Sontag 11
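A minimal numpy sketch of this update rule; the toy spam data reuses the (BIAS, free, money) features from the earlier example and is made up for illustration:

import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron on feature vectors X (m x d) with labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # start with the zero weight vector
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:      # misclassified (or on the boundary)
                w = w + y_i * x_i              # update: w <- w + y*_i f(x_i)
    return w

# Toy spam data with features (BIAS, free, money).
X = np.array([[1.0, 1.0, 1.0],                 # "free money"            -> spam (+1)
              [1.0, 0.0, 0.0]])                # mail with neither word  -> ham  (-1)
y = np.array([+1, -1])
print(perceptron(X, y))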
Properties of the perceptron algorithm
• Separability: some parameters get the training set perfectly correct
• Convergence: if the training data is linearly separable, the perceptron will eventually converge
slide by David Sontag 12
Problems with the perceptron algorithm
• Noise: if the data isn't linearly separable, no guarantees of convergence or accuracy
• Frequently the training data is linearly separable! Why?
- When the number of features is much larger than the number of data points, there is lots of flexibility
- As a result, the perceptron can significantly overfit the data
• Averaged perceptron is an algorithmic modification that helps with both issues
- Averages the weight vectors across all iterations
slide by David Sontag 13
Linear Separators • Which of these linear separators is optimal? slide by David Sontag 14
Support Vector Machines 15
Linear Separator Ham Spam slide by Alex Smola 16
Large Margin Classifier Ham Spam slide by Alex Smola 17
Review: Normal to a plane
w/‖w‖ is the unit vector in the direction of w, normal to the plane w · x + b = 0.
x̄_j is the projection of x_j onto the plane, so x_j = x̄_j + λ w/‖w‖,
where λ is the length of the vector x_j - x̄_j, i.e., ‖x_j - x̄_j‖ = λ.
slide by David Sontag 18
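A quick numpy check of this decomposition, with made-up numbers for w, b, and x_j:

import numpy as np

w = np.array([3.0, 4.0])       # normal vector (made-up)
b = -5.0
x_j = np.array([4.0, 7.0])     # an arbitrary point

lam = (np.dot(w, x_j) + b) / np.linalg.norm(w)    # signed distance to the plane
x_bar = x_j - lam * w / np.linalg.norm(w)         # projection of x_j onto the plane

print(np.dot(w, x_bar) + b)                       # ~0: x_bar lies on the plane
print(np.linalg.norm(x_j - x_bar), abs(lam))      # equal: ||x_j - x_bar|| = |lambda|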
Scale invariance
Any other ways of writing the same dividing line?
• w · x + b = 0
• 2w · x + 2b = 0
• 1000w · x + 1000b = 0
• …
slide by David Sontag 19
Scale invariance
During learning, we set the scale by asking that, for all t,
for y_t = +1, w · x_t + b ≥ 1, and for y_t = -1, w · x_t + b ≤ -1.
That is, we want to satisfy all of the linear constraints
y_t (w · x_t + b) ≥ 1 ∀t
slide by David Sontag 20
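A small numeric illustration, reusing the made-up (free, money) spam weights from earlier, of why the scale has to be fixed by a constraint rather than by the decision rule itself:

import numpy as np

w, b = np.array([4.0, 2.0]), -3.0    # the (free, money) spam weights from earlier
x = np.array([1.0, 1.0])             # "free money"

# Rescaling w and b leaves the decision rule unchanged:
print(np.sign(w @ x + b), np.sign((1000 * w) @ x + 1000 * b))   # same sign

# So the scale is fixed by requiring y_t (w . x_t + b) >= 1 for every
# training point, with equality for the points closest to the boundary.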
Large Margin Classifier
⟨w, x⟩ + b ≥ 1 on one side, ⟨w, x⟩ + b ≤ -1 on the other
linear function f(x) = ⟨w, x⟩ + b
slide by Alex Smola 21
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
margin = ⟨x_+ - x_-, w/‖w‖⟩ = (1/‖w‖) [[⟨x_+, w⟩ + b] - [⟨x_-, w⟩ + b]] = 2/‖w‖
slide by Alex Smola 22
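A numeric sanity check of the margin formula, with a made-up canonical hyperplane and two points lying exactly on the margins:

import numpy as np

w, b = np.array([2.0, 0.0]), -3.0          # made-up canonical hyperplane
x_plus  = np.array([2.0, 1.0])             # a point with <w, x> + b = +1
x_minus = np.array([1.0, 5.0])             # a point with <w, x> + b = -1

margin = np.dot(x_plus - x_minus, w / np.linalg.norm(w))
print(margin, 2 / np.linalg.norm(w))       # both 1.0: margin = 2 / ||w||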
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
optimization problem: maximize_{w,b} 1/‖w‖ subject to y_i [⟨x_i, w⟩ + b] ≥ 1
slide by Alex Smola 23
Large Margin Classifier
⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = -1 (the two margin hyperplanes)
optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
slide by Alex Smola
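One way to solve this hard-margin problem in practice is an off-the-shelf soft-margin solver with a very large C, which approximates the hard-margin case; a sketch with scikit-learn on made-up toy data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 4.0],          # class -1
              [3.0, 1.0], [4.0, 2.0]])         # class +1
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)              # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))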
Convex Programs for Dummies
• Primal optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0
• Lagrange function: L(x, α) = f(x) + Σ_i α_i c_i(x)
• First order optimality conditions in x: ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0
• Solve for x and plug it back into L: maximize_α L(x(α), α)  (keep explicit constraints)
slide by Alex Smola
Dual Problem
• Primal optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
• Lagrange function (one multiplier α_i per constraint): L(w, b, α) = (1/2)‖w‖² - Σ_i α_i [y_i [⟨x_i, w⟩ + b] - 1]
• Optimality in w, b is at a saddle point with α
• Derivatives in w, b need to vanish
slide by Alex Smola
Dual Problem
• Lagrange function: L(w, b, α) = (1/2)‖w‖² - Σ_i α_i [y_i [⟨x_i, w⟩ + b] - 1]
• Derivatives in w, b need to vanish:
∂_w L(w, b, α) = w - Σ_i α_i y_i x_i = 0
∂_b L(w, b, α) = Σ_i α_i y_i = 0
• Plugging terms back into L yields the dual:
maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
subject to Σ_i α_i y_i = 0 and α_i ≥ 0
slide by Alex Smola
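A sketch of solving this dual directly with a generic solver (scipy's SLSQP here; real SVM packages use specialized QP/SMO solvers), on made-up toy data:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
m = len(y)
K = X @ X.T                                    # Gram matrix of inner products <x_i, x_j>

def neg_dual(a):                               # minimize the negative of the dual objective
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0

alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                          # a point with alpha_i > 0
b = y[sv] - w @ X[sv]                          # from y_i (<w, x_i> + b) = 1
print(alpha, w, b)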
Support Vector Machines
primal: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with solution w = Σ_i y_i α_i x_i
dual: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i subject to Σ_i α_i y_i = 0 and α_i ≥ 0
slide by Alex Smola
Support Vectors
minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨w, x_i⟩ + b] ≥ 1, with w = Σ_i y_i α_i x_i
Karush-Kuhn-Tucker optimality condition: α_i [y_i [⟨w, x_i⟩ + b] - 1] = 0
so either α_i = 0, or α_i > 0 ⇒ y_i [⟨w, x_i⟩ + b] = 1
slide by Alex Smola
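With scikit-learn's SVC the points with α_i > 0 are exposed directly; a sketch on the same made-up toy data as before, checking that each support vector sits exactly on the margin:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin

print(clf.support_)          # indices i with alpha_i > 0
print(clf.support_vectors_)  # the corresponding x_i
print(clf.dual_coef_)        # y_i * alpha_i for the support vectors

# KKT: every support vector lies exactly on the margin, y_i (<w, x_i> + b) = 1.
w, b = clf.coef_[0], clf.intercept_[0]
print(y[clf.support_] * (clf.support_vectors_ @ w + b))   # ~1 for each support vector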
Properties
w = Σ_i y_i α_i x_i
• Weight vector w is a weighted linear combination of instances
• Only points on the margin matter (ignore the rest and get the same solution)
• Only inner products matter
- Quadratic program
- We can replace the inner product by a kernel
• Keeps instances away from the margin
slide by Alex Smola
Example slide by Alex Smola
Example slide by Alex Smola
Why Large Margins?
• Maximum robustness relative to uncertainty
• Symmetry breaking
• Independent of correctly classified instances
• Easy to find for easy problems
[Figure: margin ρ separating the + and o points]
slide by Alex Smola
Watch: Patrick Winston, Support Vector Machines https://www.youtube.com/watch?v=_PwhiWxHK8o 34
Next Lecture: Soft Margin Classification, Multi-class SVMs 35