MACHINE LEARNING
Classifiers: Support Vector Machine

What is Classification?
[Figure: detecting facial attributes (Children, Female Adult); image: Sony (Make Believe); He & Zhang, Pattern Recognition, 2011.]
The training set must be as unambiguous as possible. This is not easy, especially as members of different classes may share similar attributes. Learning implies generalization: which features of each member of a class make the class most distinguishable from the other classes?
[Example classes: Children, Female Adult, Male Adult.]
Whenever possible, the classes should be balanced. With a garbage model (male adult versus anything that is neither a female adult nor a child), the classes can no longer be balanced!
There is a plethora of classifiers. In this class, we will see only SVM and Boosting for mixtures of classifiers. Each classifier type has its pros and cons: there is no single optimal classifier.
Brief history: SVM was invented by Vladimir Vapnik. It started with the invention of statistical learning theory (Vapnik, 1979). The current form of SVM was presented in Boser, Guyon and Vapnik (1992) and in Cortes and Vapnik (1995). Textbooks: an easy introduction to SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola; a good survey of the theory behind SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.
The success of SVM is mainly due to its strong theoretical grounding and its good performance in practice. SVM has been applied to numerous classification problems (vision, text, handwriting, etc.).
[Figure: examples rated 'good', 'OK', 'bad'.]
How would you classify this data? (Points labeled +1 and -1; candidate separating hyperplane parameterized by (w, b).)
Any of these would be fine... but which is best?
Linear SVM: choose the separating hyperplane with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
We need to determine a measure of the margin, and then to maximize this measure.
Definition: the separating hyperplane is the set $\{x : \langle w, x\rangle + b = 0\}$. The margin boundaries on either side of the hyperplane satisfy $\{x : \langle w, x\rangle + b = +1\}$ (class with label $y = +1$) and $\{x : \langle w, x\rangle + b = -1\}$ (class with label $y = -1$).
Points on either side of the separating plane have positive and negative values of $\langle w, x\rangle + b$, respectively (points further from the hyperplane have larger absolute values, e.g. $\langle w, x\rangle + b = \pm 2, \pm 3$).
Take $x'$ on the hyperplane, i.e. $\langle w, x'\rangle + b = 0$, and $x$ on the margin boundary, i.e. $\langle w, x\rangle + b = 1$. The projection of $x - x'$ onto the unit vector $w/\|w\|$ is
$\left\langle \tfrac{w}{\|w\|},\, x - x' \right\rangle = \frac{\langle w, x\rangle - \langle w, x'\rangle}{\|w\|} = \frac{(\langle w, x\rangle + b) - (\langle w, x'\rangle + b)}{\|w\|} = \frac{1}{\|w\|}.$
The margin between the two classes is hence at least $2/\|w\|$.
Equivalently, take two points $x_1$ and $x_2$ on either side of the margin: $\langle w, x_1\rangle + b = +1$ (class with label $y = +1$) and $\langle w, x_2\rangle + b = -1$ (class with label $y = -1$). Then $\langle w, x_1 - x_2\rangle = 2$, so the distance between the two margin boundaries along $w$ is $\frac{\langle w, x_1 - x_2\rangle}{\|w\|} = \frac{2}{\|w\|}$: the margin between the two classes is at least $2/\|w\|$.
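To make the margin formula concrete, here is a minimal numeric sketch in Python (the vector w, the offset b and the two points are made up for illustration): it checks that points on the two margin boundaries lie at signed distance +1/||w|| and -1/||w|| from the hyperplane, so the margin is 2/||w||.

```python
import numpy as np

# Hyperplane <w, x> + b = 0 with margin boundaries <w, x> + b = +/-1 (made-up values).
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -2.0

def signed_distance(x):
    """Signed distance from x to the hyperplane <w, x> + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

x_plus = np.array([1.0, 0.0])     # satisfies <w, x> + b = +1
x_minus = np.array([-1.0, 1.0])   # satisfies <w, x> + b = -1

print(signed_distance(x_plus))    #  1/||w|| =  0.2
print(signed_distance(x_minus))   # -1/||w|| = -0.2
print(2.0 / np.linalg.norm(w))    # margin 2/||w|| = 0.4
```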
Maximizing the margin $2/\|w\|$ is equivalent to solving the primal optimization problem:
$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(\langle w, x_i\rangle + b\right) \ge 1, \qquad i = 1, 2, \dots, M.$
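As an illustration of this primal problem, the following sketch solves it directly with CVXPY on a made-up, linearly separable toy dataset (this is only for clarity; dedicated SVM solvers are used in practice).

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable data (made up for illustration); labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize 1/2 ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```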
Rephrase the constrained minimization problem in terms of Lagrange multipliers $\alpha_i$, $i = 1, \dots, M$ ($M$ = number of data points), one for each inequality constraint, to obtain the Lagrangian:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y_i\left(\langle w, x_i\rangle + b\right) - 1 \right], \qquad \alpha_i \ge 0.$
(Minimization of a convex function under linear constraints through Lagrange multipliers gives the optimal solution.)
The solution of this problem is found by maximizing over $\alpha$ and minimizing over $w$ and $b$:
$\max_{\alpha \ge 0}\ \min_{w,\, b}\ L(w, b, \alpha) = \max_{\alpha \ge 0}\ \min_{w,\, b} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y_i\left(\langle w, x_i\rangle + b\right) - 1 \right] \right\}.$
Requiring that the gradient of $L$ with respect to $w$ vanishes gives
$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{M} \alpha_i\, y_i\, x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{M} \alpha_i\, y_i\, x_i.$
The vector defining the hyperplane is thus determined by the training points. Note that while $w$ is unique (minimization of a convex function), the $\alpha_i$ are not unique.
Requiring that the gradient of $L$ with respect to $b$ vanishes gives
$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{M} \alpha_i\, y_i = 0 \ \Rightarrow\ \sum_{i=1}^{M} \alpha_i\, y_i = 0.$
This requires at least one datapoint in each class.
These stationarity conditions hold together with the original constraints of the primal problem:
$y_i\left(\langle w, x_i\rangle + b\right) \ge 1, \qquad i = 1, \dots, M.$
Complete optimization problem, Karush-Kuhn-Tucker conditions:
$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{M} \alpha_i\, y_i\, x_i = 0$ (stationarity)
$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{M} \alpha_i\, y_i = 0$ (stationarity)
$y_i\left(\langle w, x_i\rangle + b\right) \ge 1, \ i = 1, \dots, M$ (primal feasibility)
$\alpha_i \ge 0, \ i = 1, \dots, M$ (dual feasibility)
$\alpha_i \left[ y_i\left(\langle w, x_i\rangle + b\right) - 1 \right] = 0, \ i = 1, \dots, M$ (complementarity)
For points that are correctly classified strictly outside the margin, i.e. $y_i\left(\langle w, x_i\rangle + b\right) > 1$, the complementarity condition forces $\alpha_i = 0$.
The $\alpha_i$, $i = 1, \dots, M$, determine the solutions to the constraints. All pairs of data points $(x_i, y_i)$ for which $\alpha_i > 0$ are the support vectors. All pairs of data points $(x_i, y_i)$ for which $\alpha_i = 0$ are "irrelevant" when computing the margin.
Consider 3 cases for $y_i\left(w^T x_i + b\right)$:
$y_i\left(w^T x_i + b\right) = 1$: the point lies exactly on a margin boundary $\{x : \langle w, x\rangle + b = \pm 1\}$;
$y_i\left(w^T x_i + b\right) > 1$: the point lies outside the margin (correctly classified, since $w^T x_i + b > 0$ for $y_i = +1$ and $w^T x_i + b < 0$ for $y_i = -1$);
$y_i\left(w^T x_i + b\right) < 1$: the point lies inside the margin and does not satisfy the constraint.
From $\frac{\partial L(w, b, \alpha)}{\partial w} = 0$ we obtain the normal of the hyperplane:
$w = \sum_{i=1}^{M} \alpha_i\, y_i\, x_i.$
Use $y_i\left(\langle w, x_i\rangle + b\right) = 1$ for any support vector (any $i$ with $\alpha_i > 0$) to compute $b$.
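In practice one rarely computes the $\alpha_i$ by hand. Below is a minimal sketch (toy data made up for illustration) using scikit-learn's SVC with a linear kernel: the attribute dual_coef_ stores alpha_i * y_i for the support vectors, so w can be reconstructed exactly as in the formula above, and b is returned as intercept_.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration); labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i alpha_i y_i x_i
b = clf.intercept_

print("support vectors:\n", clf.support_vectors_)
print("w =", w.ravel(), " (compare clf.coef_ =", clf.coef_.ravel(), ")")
print("b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```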
When the data are not linearly separable, introduce slack variables $\xi_i \ge 0$, one per datapoint, that measure by how much each point violates the margin. The constraints can again be expressed through a compact notation:
$y_i\left(w^T x_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, M.$
[Figure: points crossing the margin, with slack values $\xi_1, \xi_2, \xi_3$.]
The objective becomes:
$\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{M} \xi_j, \quad \text{subject to} \quad y_j\left(w^T x_j + b\right) \ge 1 - \xi_j, \ \ \xi_j \ge 0.$
$C > 0$ weights the influence of the penalty term.
The hyperplane has the same form of solution as in the separable case: $w = \sum_{j=1}^{M} \alpha_j\, y_j\, x_j.$
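A minimal sketch of the soft-margin behaviour on overlapping toy data (everything made up for illustration): as C grows, margin violations are penalized more heavily, the margin shrinks, and fewer training errors are tolerated.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (made up for illustration); labels in {-1, +1}.
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    print(f"C={C:7.2f}  margin={2.0 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}  "
          f"training errors={(clf.predict(X) != y).sum()}")
```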
[Figure: separating hyperplane $w^T x + b = 0$ with margin boundaries $w^T x + b = \pm 1$; support vectors highlighted.]
The decision boundary is determined only by the support vectors:
$w = \sum_{i=1}^{M} \alpha_i\, y_i\, x_i,$
with $\alpha_i = 0$ for non-support vectors and $\alpha_i > 0$ for support vectors.
What if the points in the input space cannot be separated by a linear hyperplane?
As usual, observe that the decision function of the linear SVM computes an inner product across pairs of observations:
$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, y_i\, \langle x_i, x\rangle + b \right),$
and the dual problem depends on the data only through the inner products $\langle x_i, x_j\rangle$.
Map the data into a feature space through a nonlinear transformation $\phi(x)$. The normal of the hyperplane becomes
$w = \sum_{i=1}^{M} \alpha_i\, y_i\, \phi(x_i);$
$w$ lives in feature space now!
$w = \sum_{i=1}^{M} \alpha_i\, y_i\, \phi(x_i)$
Use the kernel trick: exploit the fact that the decision function depends only on dot products in feature space,
$k(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle.$
The optimization problem in feature space becomes:
$\max_{\alpha}\ \sum_{i=1}^{M} \alpha_i - \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} \alpha_i\, \alpha_j\, y_i\, y_j\, k(x_i, x_j), \quad \text{subject to} \quad \alpha_i \ge 0, \ \ \sum_{i=1}^{M} \alpha_i\, y_i = 0.$
The decision function in feature space is computed as follows:
$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, y_i\, k(x_i, x) + b \right).$
Both depend on the data only through the kernel.
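A minimal sketch of the kernel trick with scikit-learn (toy data and parameter values made up for illustration): the decision function of an RBF-kernel SVC is re-computed by hand from its support vectors, the dual coefficients alpha_i * y_i, the intercept b and the kernel matrix, exactly as in the formula above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
# Toy data that is not linearly separable: one class inside a ring of the other (made up).
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# f(x) = sum_i alpha_i y_i k(x_i, x) + b, summed over the support vectors only
K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)      # k(x, x_i)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(f_manual, clf.decision_function(X)))    # True
```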
The offset $b$ is computed in feature space from any support vector $x_j$: $b = y_j - \sum_{i=1}^{M} \alpha_i\, y_i\, k(x_i, x_j)$.
[Figure: hyperplane and support vectors; color gradient = distance to the hyperplane.]
[Figure: hyperplane, support vectors, and the margin.]
$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, y_i\, k(x_i, x) + b \right)$
Examples of kernels: the RBF (Gaussian) kernel $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$ and the polynomial kernel $k(x, x') = \left( \langle x, x'\rangle + d \right)^p$.
The kernel has several open parameters (hyperparameters) that need to be determined before running SVM: the kernel width $\sigma$, the order of the polynomial $p$, and the offset $d$ (usually $d = 1$).
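Since the kernel hyperparameters and C are open, they are typically chosen by cross-validation. A minimal sketch with scikit-learn's GridSearchCV (toy data and grid values made up for illustration); note that scikit-learn parameterizes the RBF kernel through gamma rather than the width sigma directly.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

# gamma plays the role of the kernel width; the grid values are illustrative only.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```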
$\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{M} \xi_j$
$C$, which determines the cost associated with incorrectly classifying datapoints, is an open parameter of the optimization.
RBF kernel width=0.20; C=1000; several misclassified datapoints
RBF kernel width=0.20; C=2000; fewer misclassified datapoints
RBF kernel width=0.001; C=1000; 113 support vectors out of 345 datapoints in total
RBF kernel width=0.008; C=1000; 64 support vectors out of 345 datapoints in total
RBF kernel width=0.02; C=1000; 33 support vectors out of 345 datapoints in total
Several combinations of the support vectors yield the same optimum.
Determining $C$ in $\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{M} \xi_j$ may be difficult in practice.
$\nu$-SVM is an alternative formulation that automatically optimizes the tradeoff between model complexity (the largest margin) and the penalty on the error:
$\min_{w,\, b,\, \xi,\, \rho}\ \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{M}\sum_{i=1}^{M} \xi_i, \quad \text{subject to} \quad y_i\left(\langle w, x_i\rangle + b\right) \ge \rho - \xi_i, \ \ \xi_i \ge 0, \ \ \rho \ge 0.$
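scikit-learn exposes this formulation as NuSVC, where nu is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. A minimal sketch (toy data and values made up for illustration) that mirrors the plots below, where both the number of support vectors and the error grow with nu:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(3)
# Two overlapping toy classes (made up for illustration).
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(100, 2)),
               rng.normal(loc=[2.0, 2.0], size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for nu in [0.05, 0.2, 0.5]:
    clf = NuSVC(nu=nu, kernel="rbf", gamma=0.5).fit(X, y)
    err = (clf.predict(X) != y).mean()
    print(f"nu={nu:.2f}  support vectors={len(clf.support_)}  training error={err:.2f}")
```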
ν-SVM with ν = 0.001, RBF kernel width 0.1
Increase in the number of support vectors with ν = 0.2
Increase in the number of support vectors with ν = 0.9
Increase in the error with ν = 0.2
Increase in the error with ν = 0.9
SVM has a wide range of applications for all types of data (vision, text, handwriting, etc.). It is very powerful for large-scale classification: optimized solvers exist for the training stage, and recall is rapid. One issue is that the computation at recall grows linearly with the number of support vectors, and the algorithm is not very sparse in support vectors. Another issue is that it can predict only two classes; for multi-class classification, one needs to run several two-class classifiers.
[Example: three classes, Children, Female Adult, Male Adult.]
One-versus-all: construct a set of $K$ binary classifiers $f^1, \dots, f^K$ (e.g. $f^1, f^2, f^3$ for $K = 3$), each trained to separate one class from the rest. Compute the class label in a winner-take-all approach:
$j^* = \arg\max_{j = 1, \dots, K} \ \sum_{i=1}^{M} \alpha_i^j\, y_i\, k(x_i, x) + b^j.$
It is sufficient to compute only $K - 1$ classifiers for $K$ classes, but computing the $K$-th classifier may provide tighter bounds on the $K$-th class.
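A minimal sketch of the winner-take-all scheme with scikit-learn (toy three-class data made up for illustration): one binary SVM is trained per class against the rest, and the predicted label is the class whose classifier function is largest.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(4)
# Three toy classes (made up): blobs around three different centers.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# One binary SVM per class (class k versus the rest), winner-take-all on the decision values.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=10.0)).fit(X, y)
scores = ovr.decision_function(X)        # shape (n_samples, K)
pred = scores.argmax(axis=1)             # winner-take-all over the K classifier functions
print((pred == ovr.predict(X)).all())    # consistent with the built-in prediction
```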
Drawbacks of combining multiple binary classifiers:
A point may be claimed by several classes or by none (one can use a garbage model or a threshold on the minimum of the associated classifier functions, but it is difficult to compare the scales of the different classifier functions).
The classes may be unbalanced (some classes have many more examples than others); one can play with the C penalty to give a relative influence as a function of the number of patterns, e.g. C = 10*M. An alternative is to compute all the classes as part of a single optimization (J. Weston and C. Watkins. Multi-class support vector machines, 1998).
Even though SVM usually results in a relatively small number of support vectors compared to the total number of datapoints, nothing ensures that we obtain a sparse solution. SVM also requires finding hyperparameters (C, kernel parameters) and a special form for the basis function (the kernel must satisfy the Mercer conditions). The Relevance Vector Machine (RVM) relaxes these two assumptions by taking a Bayesian approach.
Start from the solution of SVM:
$y(x) = f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, k(x, x_i) + b \right), \qquad \alpha = (\alpha_1, \dots, \alpha_M)^T.$
Rewrite the solution of SVM as a linear combination over $M$ basis functions; a sparse solution has the majority of the entries of $\alpha$ equal to zero.
Model the distribution of the class label with a Bernoulli distribution, and replace the sign function with the continuous but steep sigmoid function $g(z) = 1/(1 + e^{-z})$:
$p(y \mid x; \alpha) = g\!\left( \sum_{i=1}^{M} \alpha_i\, k(x, x_i) + b \right)^{y} \left[ 1 - g\!\left( \sum_{i=1}^{M} \alpha_i\, k(x, x_i) + b \right) \right]^{1 - y}.$
Doing maximum likelihood would lead to overfitting, as we have more parameters than datapoints. Instead, approximate the distribution of the $\alpha$ with a probability density function, which reduces the number of parameters to estimate. The optimal $\alpha$ cannot be computed in closed form; one must perform an iterative method similar to expectation maximization (see Tipping 2001, supplementary material).
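To make the probabilistic model concrete, here is a minimal numpy sketch (kernel, weights and inputs are all made up for illustration) of the sigmoid-Bernoulli likelihood that replaces the sign function; it is only the forward model, not Tipping's iterative estimation of the alpha.

```python
import numpy as np

def rbf(x, xs, width=0.1):
    """Illustrative RBF basis functions k(x, x_i) for scalar inputs."""
    return np.exp(-((x - xs) ** 2) / (2.0 * width ** 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training inputs, a sparse weight vector alpha (mostly zeros) and offset b.
x_train = np.array([0.0, 0.5, 1.0, 1.5])
alpha = np.array([0.0, 2.0, 0.0, -1.5])
b = 0.1

def p_y1(x):
    """p(y=1 | x) = g( sum_i alpha_i k(x, x_i) + b )."""
    return sigmoid(alpha @ rbf(x, x_train) + b)

def bernoulli_likelihood(y, x):
    """p(y | x) for a label y in {0, 1}."""
    p = p_y1(x)
    return p ** y * (1.0 - p) ** (1 - y)

print(p_y1(0.6), bernoulli_likelihood(1, 0.6))
```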
This parameter determines how good the fit is: the smaller its value, the more closely the true parameters are fitted.
RVM with l = 0.04, kernel width = 0.01; 477 datapoints, 19 support vectors
SVM with C = 1000, kernel width = 0.01; 477 datapoints, 51 support vectors
Pros:
Provides a probabilistic estimate of class membership (uncertainty on the prediction of the class label).
Cons:
(…precision)
Training is iterative and does not scale well (memory and computation per iteration grow with the square and the cube of the number of basis functions, i.e. with the number of datapoints).