

SLIDE 1

MACHINE LEARNING
Classifiers: Support Vector Machine

SLIDE 2

What is Classification?

[Figure: detecting facial attributes in an image (Sony "Make Believe" advertisement); classes: Children, Female Adult. He & Zhang, Pattern Recognition, 2011]

The training set must be as unambiguous as possible. This is not easy, especially since members of different classes may share similar attributes. Learning implies generalization: identifying which features of each class member make that class most distinguishable from the other classes.

SLIDE 3

Multi-Class Classification

[Figure: example classes: Children, Female Adult, Male Adult]

Whenever possible, the classes should be balanced. Garbage model: male adult versus anything that is neither a female adult nor a child. With such a garbage class, the classes can no longer be balanced!

SLIDE 4

Classifiers

There is a plethora of classifiers, e.g.:

  • Neural networks (feed-forward with backpropagation, multi-layer perceptron)
  • Decision trees (C4.5, random forest)
  • Kernel methods (support vector machine, Gaussian process classifier)
  • Mixtures of linear classifiers (boosting)

In this class, we will cover only SVM and, as a mixture of classifiers, Boosting. Each classifier type has its pros and cons:

  • Complex models: embed non-linearity, but require heavy computation
  • Simple models: often need a high number of models, hence high memory usage
  • Number of hyperparameters: when high, extensive cross-validation is needed to determine the optimal classifier
  • Some classifiers come with guarantees of a globally optimal solution; others have only local optimality guarantees
SLIDE 5

Support Vector Machine

Brief history: SVM was invented by Vladimir Vapnik. It started with the development of statistical learning theory (Vapnik, 1979). The current form of SVM was presented in Boser, Guyon and Vapnik (1992) and in Cortes and Vapnik (1995). Textbooks: an easy introduction to SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola. A good survey of the theory behind SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.

SLIDE 6

Support Vector Machine

The success of SVM is mainly due to:

  • Its ease of use (lots of software available, good documentation)
  • Excellent performance on a variety of datasets
  • Good solvers making optimization (learning phase) very quick
  • Very fast at retrieval time – does not hinder practical applications

It has been applied to numerous classification problems:

  • Computer vision (face detection, object recognition, feature categorization, etc.)

  • Bioinformatics (categorization of gene expression, of microarray data)
  • WWW (categorization of websites)
  • Production (quality control, detection of defects)
  • Robotics (categorization of sensor readings)
  • Finance (bankruptcy prediction)
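
Given the emphasis above on ease of use and available software, here is a minimal sketch of training and evaluating an SVM classifier with scikit-learn; the library, the toy dataset, and the parameter values are illustrative assumptions, not part of the original slides:

    # Minimal SVM classification sketch with scikit-learn (illustrative only).
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy two-class dataset standing in for any of the applications above.
    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SVC(kernel="rbf", C=1.0)      # C and the kernel are hyperparameters
    clf.fit(X_train, y_train)           # learning phase (solves the dual problem)
    print("test accuracy:", clf.score(X_test, y_test))
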
SLIDE 7

Optimal Linear Classification

  • Which choice is better?
  • How could we formulate this problem?

[Figure: three candidate separating lines, labeled 'good', 'OK', and 'bad']

SLIDE 8

Linear Classifiers

[Diagram: input x is fed to the classifier f with parameters (w, b), producing the estimate y_est; data points are labeled +1 and -1]

f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)

How would you classify this data?
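
As a concrete illustration of this decision function, a small NumPy sketch; the weight vector, offset, and example points are made-up values, not from the slides:

    import numpy as np

    def f(x, w, b):
        # Linear decision function: the sign of the score <w, x> + b.
        return np.sign(np.dot(w, x) + b)

    w = np.array([2.0, -1.0])   # illustrative weights
    b = -0.5                    # illustrative offset
    print(f(np.array([1.0, 0.5]), w, b))   # lands on the +1 side
    print(f(np.array([-1.0, 2.0]), w, b))  # lands on the -1 side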

SLIDE 9

Linear Classifiers

[Diagram: another candidate separating line for the same labeled data; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

How would you classify this data?

SLIDE 10

Linear Classifiers

[Diagram: a different candidate separating line; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

How would you classify this data?

SLIDE 11

Linear Classifiers

[Diagram: yet another candidate separating line; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

How would you classify this data?

SLIDE 12

Linear Classifiers

[Diagram: several valid separating lines; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

Any of these would be fine... but which is best?

SLIDE 13

Classifier Margin

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

[Diagram: separating line with its margin band; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

SLIDE 14

Classifier Margin

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

[Diagram: the maximum margin separating line; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

SLIDE 15

Classifier Margin

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Support vectors are those datapoints that the margin pushes up against.

[Diagram: maximum margin line with the support vectors highlighted; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

SLIDE 16

Classifier Margin

[Diagram: labeled data with a candidate separating line; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

We need to determine a measure of the margin.

SLIDE 17

Classifier Margin

[Diagram: labeled data with a candidate separating line; f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)]

We need to determine a measure of the margin, and then maximize this measure.

SLIDE 18

Determining the Optimal Separating Hyperplane

[Figure: separating hyperplane \{x : \langle w, x \rangle + b = 0\} with the margin hyperplanes \{x : \langle w, x \rangle + b = +1\} and \{x : \langle w, x \rangle + b = -1\}]

Definition: the points on the margin on either side of the hyperplane satisfy |\langle w, x \rangle + b| = 1.

SLIDE 19

Determining the Optimal Separating Hyperplane

[Figure: level sets \langle w, x \rangle + b = \pm 2 and \langle w, x \rangle + b = \pm 3; the class with label y = -1 lies on one side of the separating plane, the class with label y = +1 on the other]

Points on either side of the separating plane have negative and positive values of \langle w, x \rangle + b, respectively.

Decision function: f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)

SLIDE 20

Determining the Optimal Separating Hyperplane

What is the distance from a point x to the hyperplane \langle w, x \rangle + b = 0?

[Figure: point x and its perpendicular distance to the hyperplane]

SLIDE 21

Determining the Optimal Separating Hyperplane

Let x' be a point on the hyperplane, i.e. \langle w, x' \rangle + b = 0, and let w / \|w\| be the unit normal vector. Projecting x - x' onto w gives

\left\langle \frac{w}{\|w\|}, \, x - x' \right\rangle = \frac{\langle w, x \rangle - \langle w, x' \rangle}{\|w\|} = \frac{\langle w, x \rangle + b}{\|w\|}

Distance of x to the hyperplane: \frac{|\langle w, x \rangle + b|}{\|w\|}

[Figure: point x, its projection x' onto the hyperplane, and the vector x' - x along the normal]

The margin between the two classes is at least 2/\|w\|.
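
To make the distance formula concrete, a small NumPy sketch; w, b, and the query point are arbitrary illustrative values:

    import numpy as np

    def distance_to_hyperplane(x, w, b):
        # |<w, x> + b| / ||w||, as derived above.
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    w = np.array([2.0, -1.0])
    b = -0.5
    x = np.array([1.0, 0.5])
    print("distance to hyperplane:", distance_to_hyperplane(x, w, b))
    print("margin width 2/||w||:  ", 2.0 / np.linalg.norm(w))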

SLIDE 22

Determining the Optimal Separating Hyperplane

Take two points x_1 and x_2 on either side of the margin, one in each class:

\langle w, x_1 \rangle + b = +1, \qquad \langle w, x_2 \rangle + b = -1

Subtracting and projecting onto the unit normal gives \left\langle \frac{w}{\|w\|}, \, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}.

[Figure: class with label y = +1 and class with label y = -1, with points x_1 and x_2 on the margin]

The margin between the two classes is at least 2/\|w\|.

SLIDE 23

Determining the Optimal Separating Hyperplane

The separation between the two classes is measured by \frac{2}{\|w\|}. Maximizing this quantity is equivalent to minimizing \frac{\|w\|}{2}; better still is to minimize the convex form \frac{\|w\|^2}{2}.

SLIDE 24

Determining the Optimal Separating Hyperplane

Finding the optimal separating hyperplane turns out to be an optimization problem of the following form:

\min_{w, b} \; \frac{1}{2}\|w\|^2
subject to \; y_i (\langle w, x_i \rangle + b) \ge 1, \quad i = 1, 2, ..., M
(equivalently, \langle w, x_i \rangle + b \ge +1 when y_i = +1 and \langle w, x_i \rangle + b \le -1 when y_i = -1)

  • N+1 parameters (N: dimension of the data)
  • M constraints (M: number of datapoints)
  • It is called the primal problem.
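
A direct (if naive) way to see this primal problem as code is to hand it to a generic convex solver. The sketch below assumes the cvxpy package and a small separable toy dataset; it is not how dedicated SVM solvers work in practice:

    import numpy as np
    import cvxpy as cp

    # Toy separable data: M points in N dimensions with labels +/-1.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    N = X.shape[1]
    w = cp.Variable(N)
    b = cp.Variable()

    # Primal: minimize (1/2)||w||^2 s.t. y_i(<w, x_i> + b) >= 1.
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    print("w =", w.value, "b =", b.value)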

SLIDE 25

Determining the Optimal Separating Hyperplane

Rephrase the constrained minimization problem in terms of the Lagrange multipliers \alpha_i, i = 1, ..., M (M: number of datapoints), one for each of the inequality constraints; this gives the Lagrangian, from which the dual problem is obtained:

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right], \quad \text{with } \alpha_i \ge 0

(Minimization of a convex function under linear constraints through the Lagrangian yields the globally optimal solution.)

SLIDE 26

Determining the Optimal Separating Hyperplane

The solution of this problem is found by maximizing over \alpha and minimizing over w and b:

\max_{\alpha} \; \min_{w, b} \; L(w, b, \alpha), \quad \text{where} \quad L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]

SLIDE 27

Determining the Optimal Separating Hyperplane

Requiring that the gradient of L with respect to w vanishes gives:

\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{M} \alpha_i y_i x_i

The vector defining the hyperplane is thus determined by the training points. Note that while w is unique (minimization of a convex function), the \alpha_i are not unique.

SLIDE 28

Determining the Optimal Separating Hyperplane

Requiring that the gradient of L with respect to b vanishes gives:

\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \alpha_i y_i = 0

This requires at least one datapoint in each class.

SLIDE 29

Determining the Optimal Separating Hyperplane

Differentiating L with respect to the multipliers \alpha_i recovers the original constraints:

\frac{\partial L(w, b, \alpha)}{\partial \alpha_i} = -\left[ y_i (\langle w, x_i \rangle + b) - 1 \right] \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) \ge 1

SLIDE 30

Determining the Optimal Separating Hyperplane

Complete optimization problem: the Karush-Kuhn-Tucker (KKT) conditions are

\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{M} \alpha_i y_i x_i
\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \alpha_i y_i = 0
y_i (\langle w, x_i \rangle + b) \ge 1, \quad i = 1, ..., M \quad (primal feasibility)
\alpha_i \ge 0, \quad i = 1, ..., M \quad (dual feasibility)
\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0, \quad i = 1, ..., M \quad (complementarity conditions)

SLIDE 31

Determining the Optimal Separating Hyperplane

(The complete KKT conditions from the previous slide still apply.)

The feasibility conditions from the KKT system require \alpha_i \ge 0 for all points x_i, together with the primal constraints being satisfied:

y_i (\langle w, x_i \rangle + b) \ge 1

i.e. the points are classified correctly.
SLIDE 32

Determining the Optimal Separating Hyperplane

(The complete KKT conditions from the previous slides still apply.)

The \alpha_i, i = 1, ..., M, determine the solution of the constrained problem:

  • all pairs of datapoints (x_i, y_i) for which \alpha_i > 0 are the support vectors;
  • all pairs of datapoints (x_i, y_i) for which \alpha_i = 0 are "irrelevant" when computing the margin.

SLIDE 33

Determining the Optimal Separating Hyperplane

Consider three cases for a datapoint x_i:

  • y_i (\langle w, x_i \rangle + b) = 1: the point lies exactly on the margin;
  • y_i (\langle w, x_i \rangle + b) > 1: the point lies outside the margin;
  • y_i (\langle w, x_i \rangle + b) < 1: the point lies inside the margin and does not satisfy the constraint.

[Figure: separating hyperplane \{x : \langle w, x \rangle + b = 0\} with the margin hyperplanes \{x : \langle w, x \rangle + b = \pm 1\}]

SLIDE 34

Determining the Optimal Separating Hyperplane

The decision function is then expressed in terms of the support vectors:

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \langle x_i, x \rangle + b \right), \quad \text{with} \quad w = \sum_{i=1}^{M} \alpha_i y_i x_i

Use y_i (\langle w, x_i \rangle + b) = 1 for any support vector x_i to compute b.

To determine how good the hyperplane is, use cross-validation to get an estimate of the error on the testing set.
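
One way to see these quantities in practice is to read them back from a fitted linear SVM in scikit-learn. The sketch below assumes a small separable toy dataset and uses the fact that dual_coef_ stores the products \alpha_i y_i for the support vectors:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard-margin case
    clf.fit(X, y)

    # dual_coef_[0] holds alpha_i * y_i for the support vectors.
    w = clf.dual_coef_[0] @ clf.support_vectors_
    print("w from support vectors:", w)
    print("w from sklearn:        ", clf.coef_[0])
    print("b:", clf.intercept_[0])
    print("support vector indices:", clf.support_)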

SLIDE 35

Non-Separable Data Sets

[Figure: overlapping classes labeled +1 and -1 that cannot be separated without errors]

This is going to be a problem! What should we do? Idea: introduce some slack in the constraints.

SLIDE 36

Support Vector Machine for non-separable datasets

The constraints are relaxed by introducing slack variables \xi_i \ge 0, i = 1, ..., M:

\langle w, x_i \rangle + b \ge +1 - \xi_i \quad \text{if } y_i = +1
\langle w, x_i \rangle + b \le -1 + \xi_i \quad \text{if } y_i = -1

This can again be expressed in compact notation:

y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i

[Figure: non-separable data labeled +1 and -1, with slack variables \xi_1, \xi_2, \xi_3 marking the margin violations]

SLIDE 37

Support Vector Machine for non-separable datasets

The objective function adds a penalty for too-large slack variables:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j

This finds a trade-off between maximizing the margin and minimizing the classification errors. C > 0 weights the influence of the penalty term.

[Figure: non-separable data with slack variables \xi_1, \xi_2, \xi_3]
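
The slack formulation above is equivalent to penalizing the hinge loss max(0, 1 - y_i(\langle w, x_i \rangle + b)). A rough NumPy subgradient-descent sketch of that equivalent form follows; the step size, iteration count, and data are illustrative assumptions, and real SVM solvers use dedicated quadratic-programming methods instead:

    import numpy as np

    def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
        # Minimize 0.5*||w||^2 + (C/M) * sum_i max(0, 1 - y_i(<w, x_i> + b))
        # by plain (sub)gradient descent. Illustrative only.
        M, N = X.shape
        w, b = np.zeros(N), 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            viol = margins < 1                      # points inside the margin or misclassified
            grad_w = w - (C / M) * (y[viol, None] * X[viol]).sum(axis=0)
            grad_b = -(C / M) * y[viol].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, -0.5]])
    y = np.array([1, 1, -1, -1, 1])                 # the last point violates the margin
    print(soft_margin_svm(X, y, C=10.0))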

SLIDE 38

Support Vector Machine for non-separable datasets

This gives a constrained optimization problem of the form:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j
subject to \; y_j (\langle w, x_j \rangle + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, ..., M

The hyperplane has the same form of solution as in the separable case:

w = \sum_{j=1}^{M} \alpha_j y_j x_j

SLIDE 39

Support Vector Machine for non-separable datasets

[Figure: data labeled +1 and -1 with the hyperplanes \langle w, x \rangle + b = \pm 1, the normal vector w, and the support vectors highlighted]

The decision boundary is determined only by the support vectors:

w = \sum_{i=1}^{M} \alpha_i y_i x_i, \qquad \alpha_i : \; y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i

\alpha_i = 0 for the non-support vectors; \alpha_i \neq 0 for the support vectors.

SLIDE 40

Non-Linear Classification

What if the points in the input space cannot be separated by a linear hyperplane?

SLIDE 41

The Kernel Trick for SVM

As usual, observe that the decision function of the linear SVM computes inner products \langle x_i, x_j \rangle across pairs of observations:

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \langle x_i, x \rangle + b \right)

SLIDE 42

Non-Linear Classification with Support Vector Machines

The decision function in linear classification with SVM was given by:

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \langle x_i, x \rangle + b \right)

Replace the linear plane \langle w, x \rangle + b in the original space by a linear plane in feature space: map x \mapsto \phi(x) and use \langle w, \phi(x) \rangle + b. The decision function becomes:

f(x) = \mathrm{sgn}(\langle w, \phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \right)

w now lives in feature space!

SLIDE 43

Non-Linear Classification with Support Vector Machines

Replace the linear plane \langle w, x \rangle + b in the original space by a linear plane in feature space, \langle w, \phi(x) \rangle + b. The decision function becomes:

f(x) = \mathrm{sgn}(\langle w, \phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \right)

Use the kernel trick by exploiting the fact that the decision function depends only on dot products in feature space:

k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle

SLIDE 44

Non-Linear Classification with Support Vector Machines

With the kernel substituted, the decision function becomes:

f(x) = \mathrm{sgn}(\langle w, \phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \, k(x_i, x) + b \right)

Use the kernel trick by exploiting the fact that the decision function depends only on dot products in feature space:

k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle

SLIDE 45

Non-Linear Classification with Support Vector Machines

The optimization problem in feature space becomes:

\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)
subject to \; \alpha_i \ge 0 \text{ for all } i = 1, ..., M \quad \text{and} \quad \sum_{i=1}^{M} \alpha_i y_i = 0

The decision function in feature space is computed as:

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \, k(x_i, x) + b \right)

The kernel appears in both the optimization problem and the decision function.
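
As an illustration of the kernelized decision function, a small NumPy sketch; the Gaussian kernel width, the \alpha_i, y_i, b values, and the support vectors are made-up placeholders (in practice they come from the solver):

    import numpy as np

    def rbf_kernel(x1, x2, width=0.5):
        # Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2*width^2)).
        return np.exp(-np.sum((x1 - x2) ** 2) / (2 * width ** 2))

    def decision(x, support_vectors, alphas, labels, b, width=0.5):
        # f(x) = sgn( sum_i alpha_i * y_i * k(x_i, x) + b )
        s = sum(a * y * rbf_kernel(sv, x, width)
                for a, y, sv in zip(alphas, labels, support_vectors))
        return np.sign(s + b)

    # Placeholder values standing in for a solver's output.
    support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
    alphas = np.array([0.7, 0.7])
    labels = np.array([1, -1])
    print(decision(np.array([0.8, 1.2]), support_vectors, alphas, labels, b=0.0))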

SLIDE 46

Non-Linear Classification with Support Vector Machines

The offset b can be estimated from the KKT conditions. Best is to compute the expectation over the constraints, and hence an estimate of b, through regression:

b = \frac{1}{M} \sum_{j=1}^{M} \left( y_j - \sum_{i=1}^{M} \alpha_i y_i \, k(x_i, x_j) \right)

SLIDE 47

How to read out the result of SVM

[Figure: learned hyperplane and support vectors; the color gradient shows the distance to the hyperplane]

SLIDE 48

How to read out the result of SVM

[Figure: learned hyperplane, support vectors, and the margin]

SLIDE 49

The hyperparameters of SVM

SVM decision function: f(x) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \, k(x_i, x) + b \right)

The kernel has several open parameters (hyperparameters) that need to be determined before running SVM:

Gaussian kernel: k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), with kernel width \sigma.

Inhomogeneous polynomial kernel: k(x, x') = \left( \langle x, x' \rangle + d \right)^p, with p the order of the polynomial and offset d (usually d = 1).
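
A small Python sketch of these two kernels; the width and degree values are illustrative:

    import numpy as np

    def gaussian_kernel(x, x2, sigma=0.2):
        # k(x, x') = exp(-||x - x'||^2 / (2*sigma^2)); sigma is the kernel width.
        return np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

    def polynomial_kernel(x, x2, p=3, d=1.0):
        # Inhomogeneous polynomial kernel k(x, x') = (<x, x'> + d)^p.
        return (np.dot(x, x2) + d) ** p

    a, b = np.array([1.0, 0.0]), np.array([0.5, 0.5])
    print(gaussian_kernel(a, b), polynomial_kernel(a, b))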

SLIDE 50

The hyperparameters of SVM

Recall the optimization under constraints:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j
subject to \; y_j (\langle w, x_j \rangle + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, ..., M

C, which determines the cost associated with incorrectly classified datapoints, is an open parameter of the objective function.
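
Since both C and the kernel width must be set before training, a common recipe is cross-validated grid search. A sketch with scikit-learn follows; the parameter grid and dataset are illustrative assumptions:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Cross-validate over the penalty C and the RBF kernel parameter gamma.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": [1, 10, 100, 1000],
                                    "gamma": [0.01, 0.1, 1.0]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)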

SLIDE 51

Non-Linear Support Vector Machines: Examples

SLIDE 52

Effect of the penalty factor C

RBF kernel width=0.20; C=1000; several misclassified datapoints

SLIDE 53

RBF kernel width = 0.20; C = 2000; fewer misclassified datapoints

Effect of the penalty factor C

SLIDE 54

RBF kernel width = 0.001; C = 1000; 113 support vectors out of 345 datapoints in total

Effect of the width of Gaussian kernel

SLIDE 55

RBF kernel width = 0.008; C = 1000; 64 support vectors out of 345 datapoints in total

Effect of the width of Gaussian kernel

SLIDE 56

RBF kernel width = 0.02; C = 1000; 33 support vectors out of 345 datapoints in total

Effect of the width of Gaussian kernel

SLIDE 57

Different optimization runs end up with different solutions

Several combinations of the support vectors yield the same optimum.

SLIDE 58

Different optimization runs end up with different solutions

Several combinations of the support vectors yield the same optimum.

SLIDE 59

Support Vector Machine for non-separable datasets

The original objective function:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j

Determining C may be difficult in practice.

SLIDE 60

Support Vector Machine for non-separable datasets

ν-SVM is an alternative formulation that automatically optimizes the trade-off between model complexity (the largest margin) and the penalty on the error:

\min_{w, \xi, \rho} \; \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{M} \sum_{i=1}^{M} \xi_i
subject to \; y_i (\langle w, x_i \rangle + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0.

SLIDE 61

Support Vector Machine for non-separable datasets

ν is an upper bound on the fraction of margin errors (i.e. the fraction of datapoints misclassified or lying inside the margin) and a lower bound on the fraction of support vectors.
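
scikit-learn exposes this formulation as NuSVC; a minimal sketch follows, with the dataset and parameter values as illustrative assumptions:

    from sklearn.datasets import make_moons
    from sklearn.svm import NuSVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # nu in (0, 1] replaces C: it bounds the fraction of margin errors
    # from above and the fraction of support vectors from below.
    clf = NuSVC(nu=0.2, kernel="rbf", gamma=1.0)
    clf.fit(X, y)
    print("number of support vectors:", clf.support_vectors_.shape[0])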

SLIDE 62

Support Vector Machine for non-separable datasets

ν-SVM with ν = 0.001, RBF kernel width 0.1

SLIDE 63

Support Vector Machine for non-separable datasets

Increase in the number of support vectors with ν = 0.2

SLIDE 64

Support Vector Machine for non-separable datasets

Increase in the number of support vectors with ν = 0.9

SLIDE 65

Support Vector Machine for non-separable datasets

Increase in the error with ν = 0.2

SLIDE 66

Increase in the error with ν = 0.9

Support Vector Machine for non-separable datasets

SLIDE 67

Summary: When to use SVM?

SVMs have a wide range of applications for all types of data (vision, text, handwriting, etc.). SVM is very powerful for large-scale classification: optimized solvers exist for the training stage, and it is rapid during recall. One issue is that the computation at recall grows linearly with the number of support vectors, and the algorithm is not very sparse in support vectors. Another issue is that it can predict only two classes; for multi-class classification, one needs to run several two-class classifiers.

SLIDE 68

Multi-Class SVM

[Figure: example classes: Children, Female Adult, Male Adult]

Construct a set of K binary classifiers f^1, ..., f^K, each trained to separate one class from the rest.

Compute the class label in a winner-take-all manner:

j^* = \arg\max_{j = 1, ..., K} \; \sum_{i=1}^{M} \alpha_i^j y_i^j \, k(x_i, x) + b^j

It is sufficient to compute only K - 1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
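
A one-versus-rest sketch with scikit-learn, which wraps exactly this winner-take-all scheme; the three-class toy dataset is an illustrative assumption:

    from sklearn.datasets import make_blobs
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # Three classes standing in for Children / Female Adult / Male Adult.
    X, y = make_blobs(n_samples=300, centers=3, random_state=0)

    # One binary SVM per class; prediction takes the class whose
    # decision function gives the largest value (winner-take-all).
    ovr = OneVsRestClassifier(SVC(kernel="rbf", C=10.0))
    ovr.fit(X, y)
    print(ovr.predict(X[:5]), ovr.decision_function(X[:5]).shape)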

SLIDE 69

Multi-Class SVM

SLIDE 70

Multi-Class SVM

SLIDE 71

Multi-Class SVM

Drawbacks of combining multiple binary classifiers:

  • How to reject an instance if it belongs to none of the classes (garbage model or a threshold on the minimum of the associated classifier functions; but it is difficult to compare the scales of the different classifier functions)
  • Asymmetric classification (some classes have many more positive examples than others); one can adjust the C penalty to give a relative influence as a function of the number of patterns, e.g. C = 10*M
  • An alternative is to compute all the classes as part of a single optimization function; performance, however, seems comparable to one-against-all optimization (Franc & Hlavac, Multi-class support vector machine, 2002; J. Weston and C. Watkins, Multi-class support vector machines, 1998)

SLIDE 72

Relevance Vector Machine

Even though SVM usually results in a relatively small number of support vectors compared to the total number of datapoints, nothing ensures that a sparse solution is obtained. SVM also requires finding hyperparameters (C, …) and imposes a special form on the basis functions (the kernel must satisfy the Mercer conditions). RVM relaxes these two assumptions by taking a Bayesian approach.

SLIDE 73

Relevance Vector Machine

Start from the solution of SVM:

y(x) = f(x) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i \, k(x, x_i) + b \right)

Rewrite this solution as a linear combination over M basis functions \phi_i(x) = k(x, x_i), with weight vector \alpha = (\alpha_1, ..., \alpha_M)^T (plus the offset b). A sparse solution has the majority of the entries of \alpha equal to zero.
SLIDE 74

Relevance Vector Machine

Model the distribution of the class label with a Bernoulli distribution, replacing the sign function with the continuous but steep sigmoid g(z) = 1 / (1 + e^{-z}):

p(y = 1 \mid x; \alpha) = g\left( \sum_{i=1}^{M} \alpha_i \, k(x, x_i) + b \right)

Doing maximum likelihood would lead to overfitting, as we have more parameters than datapoints. Instead, approximate the distribution of the \alpha_i with a probability density function, which reduces the number of parameters to estimate. The optimal \alpha cannot be computed in closed form; one must use an iterative method similar to expectation-maximization (see Tipping, 2001, and the supplementary material).
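
A tiny NumPy sketch of this probabilistic read-out; the kernel width, weights, offset, and "relevance vectors" are placeholder values, since fitting the \alpha_i requires the iterative procedure referenced above:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def rvm_probability(x, centers, alphas, b, width=0.1):
        # p(y = 1 | x) = g( sum_i alpha_i * k(x, x_i) + b ) with a Gaussian kernel.
        k = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * width ** 2))
        return sigmoid(np.dot(alphas, k) + b)

    centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # placeholder relevance vectors
    alphas = np.array([2.0, -1.5])                 # placeholder weights
    print(rvm_probability(np.array([0.2, 0.1]), centers, alphas, b=0.0))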

SLIDE 75

Relevance Vector Machine

This parameter, in the range [0, 1], determines how good the fit is: the smaller the value, the more closely the true parameters are fitted.

SLIDE 76

Relevance Vector Machine

SLIDE 77

Relevance Vector Machine

RVM with l = 0.04, kernel width = 0.01; 477 datapoints, 19 support vectors

SLIDE 78

Relevance Vector Machine

SVM with C = 1000, kernel width = 0.01; 477 datapoints, 51 support vectors

SLIDE 79

Pros and Cons of RVM

Pros:

  • Yields sparser solutions (fewer support vectors)
  • Outputs estimates of the posterior probability of class membership (uncertainty on the prediction of the class label)

Cons:

  • Iterative method (needs a stopping criterion and an arbitrary value for the precision)
  • Very intensive computationally at training (memory and the number of iterations grow with the square and cube of the number of basis functions, i.e. with the number of datapoints)

SLIDE 80

Other classifiers
