Human-Oriented Robotics
Supervised Learning, Part 2/3
Kai Arras, Social Robotics Lab, University of Freiburg
Non-Probabilistic Discriminant Functions

• So far, we have considered probabilistic classifiers that compute a posterior probability distribution $p(w\,|\,x)$ over the world state, for example, a discrete distribution over different class labels
• We can also learn the discriminant function $y = f(x)$ directly (even more "directly" than a probabilistic discriminant classifier). For instance, in a two-class problem, $f(\cdot)$ might be binary-valued such that $f(x) = 1$ represents class $C_1$ and $f(x) = 0$ represents class $C_2$
• Inference and decision stages are combined
• Choosing a model for $f(\cdot)$ and using training data to learn $y = f(x)$ corresponds to learning the decision boundary directly
• This is unlike probabilistic classifiers, where the decision boundary followed indirectly from our choices for the involved models
• Let us consider linear discriminant functions $y = f(x)$. This choice implies the assumption that our data are linearly separable
• Let us again consider a binary classification problem, $y \in \{-1, +1\}$
• The representation of a linear function is
  $y = f(x) = w^T x + b$
  where $w$ is the normal to the hyperplane (sometimes called weight vector) and $b$ is called bias
• The hyperplane itself is described by $w^T x + b = 0$
• The perpendicular distance from the plane to the origin is $\frac{b}{\|w\|}$
• (Notice the change in notation: in this section, we adopt the standard notation $w$ to denote the normal to the hyperplane, not the world state)
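As a small illustration (not part of the slides), a minimal NumPy sketch of such a linear discriminant; the values of $w$ and $b$ below are arbitrary and only assumed for the example:

```python
import numpy as np

def linear_discriminant(x, w, b):
    """Evaluate f(x) = w^T x + b for a single point or a batch of points."""
    return x @ w + b

def predict(x, w, b):
    """Assign class +1 if f(x) >= 0, otherwise class -1."""
    return np.where(linear_discriminant(x, w, b) >= 0, 1, -1)

# Example hyperplane in 2D: normal w and bias b (illustrative values only)
w = np.array([1.0, 2.0])
b = -1.0
X = np.array([[2.0, 1.0], [-1.0, 0.0]])
print(predict(X, w, b))   # -> [ 1 -1]
```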
• The figure shows the geometry of $y = f(x) = w^T x + b$ in two dimensions: the decision boundary $f(x) = 0$ separates the region $\mathcal{R}_1$ where $f(x) > 0$ from the region $\mathcal{R}_2$ where $f(x) < 0$
• Consider two points $x_A$, $x_B$ that both lie on the plane
  $f(x_A) = w^T x_A + b = 0$
  $f(x_B) = w^T x_B + b = 0$
  and thus
  $w^T x_A + b = w^T x_B + b$
  $w^T (x_A - x_B) = 0$
• Thus, vector $w$ is orthogonal to every vector lying within the hyperplane, and so determines the orientation of the plane
• Consider a point $x$ and its orthogonal projection $x_\perp$ onto the plane $f(x) = 0$. Then
  $x = x_\perp + r\,\frac{w}{\|w\|}$
• Let us solve for $r$, the signed perpendicular distance from $x$ to the plane. Multiplying both sides by $w^T$ and adding $b$,
  $w^T x + b = w^T x_\perp + b + r\,\frac{w^T w}{\|w\|}$
  $f(x) = r\,\frac{\|w\|^2}{\|w\|} = r\,\|w\|$
  $r = \frac{f(x)}{\|w\|}$
• Note that distance $r$ is signed
• For $x = 0$, the perpendicular distance from the plane to the origin is $\frac{b}{\|w\|}$
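As a quick numerical check (illustrative, not from the slides; the hyperplane and point are assumed values), the signed distance $r = f(x)/\|w\|$ can be computed and verified against the orthogonal projection:

```python
import numpy as np

w = np.array([1.0, 2.0])   # illustrative hyperplane normal
b = -1.0                   # illustrative bias
x = np.array([2.0, 1.0])

f_x = w @ x + b
r = f_x / np.linalg.norm(w)            # signed perpendicular distance of x from the plane

# Reconstruct the projection x_perp = x - r * w/||w|| and verify it lies on the plane
x_perp = x - r * w / np.linalg.norm(w)
print(r)                               # positive: x lies on the side that w points to
print(np.isclose(w @ x_perp + b, 0))   # True: x_perp satisfies w^T x + b = 0
```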
• This can also be seen from the definition of the dot product
  $w^T x = \|w\|\,\|x\|\cos\theta$
• It follows that
  $f(x) = w^T x + b = \|w\|\,\|x\|\cos\theta + b$
  $\frac{f(x)}{\|w\|} = \|x\|\cos\theta + \frac{b}{\|w\|}$
• Therefore
  $f(x) > 0$ if $\|x\|\cos\theta > -\frac{b}{\|w\|}$
  $f(x) = 0$ if $\|x\|\cos\theta = -\frac{b}{\|w\|}$
  $f(x) < 0$ if $\|x\|\cos\theta < -\frac{b}{\|w\|}$
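A small sketch (illustrative values, not from the slides) confirming that comparing the projection $\|x\|\cos\theta$ against $-b/\|w\|$ reproduces the sign of $f(x)$:

```python
import numpy as np

w = np.array([1.0, 2.0])   # illustrative hyperplane normal
b = -1.0                   # illustrative bias
norm_w = np.linalg.norm(w)

def projection_onto_w(x):
    """||x|| cos(theta): length of the projection of x onto the direction of w."""
    return (w @ x) / norm_w

# The decision boundary corresponds to ||x|| cos(theta) = -b / ||w||
threshold = -b / norm_w

for x in [np.array([2.0, 1.0]), np.array([1.0, 0.0]), np.array([-1.0, 0.0])]:
    proj = projection_onto_w(x)
    f_x = w @ x + b
    # the sign of (proj - threshold) matches the sign of f(x)
    print(np.sign(proj - threshold), np.sign(f_x))
```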
• Consider a linearly separable classification problem with two classes and outputs $y \in \{-1, +1\}$
• The training set consists of labeled examples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ with labels such as $y_1 = +1$, $y_2 = -1$, $y_3 = +1$, $\ldots$, $y_N = -1$ (see figure)
• How to separate the classes?
• Consider a linearly separable classification problem with two classes and outputs $y \in \{-1, +1\}$
• There is an infinite number of decision boundaries that perfectly separate the classes in the training set
• Which one to choose?
• The one with the smallest generalization error!
• This is what Support Vector Machines (SVMs) do. Their approach to minimizing the generalization error is to maximize the margin
Support Vector Machines: Margin and Support Vectors

• The margin is defined as the perpendicular distance between the decision boundary and the closest data points
• The closest data points are called support vectors
• The aim of Support Vector Machines is to orientate a hyperplane in such a way that it is as far as possible from the support vectors of both classes
• This amounts to the estimation of the normal vector $w$ and the bias $b$
• We have seen that $w$ determines the orientation of the hyperplane and the ratio $\frac{b}{\|w\|}$ its position from the origin
• Thus, in addition to the direction of $w$ and the value of $b$, there is one more degree of freedom, namely the magnitude $\|w\|$ of the normal vector
• We can thus define $\|w\|$ in a way that, without loss of generality, $|f(x)| = |y| = 1$ holds for support vectors
• We then define two planes $H_1$, $H_2$ through the support vectors. They are described by
  $H_1: \; w^T x + b = +1$
  $H_2: \; w^T x + b = -1$
• Our training data $(x_i, y_i)$ for all $i$ can thus be described by
  $w^T x_i + b \geq +1$ for $y_i = +1$
  $w^T x_i + b \leq -1$ for $y_i = -1$
  which can be combined to
  $y_i (w^T x_i + b) - 1 \geq 0$
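As a small illustration (the toy data and the candidate hyperplane below are assumptions, not from the slides), the combined constraint can be checked for every training point:

```python
import numpy as np

# Toy linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([+1, +1, -1, -1])

# Candidate hyperplane (w, b), chosen by hand for illustration
w = np.array([0.5, 0.5])
b = 0.0

# The margin constraints y_i (w^T x_i + b) - 1 >= 0 must hold for every training point
constraints = y * (X @ w + b) - 1
print(constraints)               # [1. 2. 0. 1.]
print(np.all(constraints >= 0))  # True: (w, b) separates the data with functional margin >= 1
```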
• Let us look at this expression
  $y_i (w^T x_i + b) - 1 \geq 0$
• It is a set of $N$ constraints on $w$ and $b$ to be satisfied during the learning phase
• However, the constraints alone do not maximize the margin (the figure shows a hyperplane that satisfies all constraints but does not maximize the margin)
• From our choice of $\|w\|$ it follows that the margin is
  $\frac{1}{\|w\|}$
• Thus, maximizing the margin is equivalent to minimizing $\|w\|$
Support Vector Machines: Learning

• SVM learning consists in minimizing $\|w\|$ subject to the constraints
  $y_i (w^T x_i + b) - 1 \geq 0$
• Instead of minimizing $\|w\|$ we can also minimize $\frac{1}{2}\|w\|^2$, which leads to the formulation
  $\arg\min_{w,\,b} \;\; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) - 1 \geq 0$
• This is a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints
• In order to solve this constrained optimization problem, we will need to introduce Lagrange multipliers
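To make the optimization concrete, here is a sketch that feeds exactly this primal problem to a generic QP solver. It assumes the cvxpy package and the toy data from the earlier snippet; this is one possible way to solve it numerically, not part of the lecture material:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w = cp.Variable(2)   # hyperplane normal
b = cp.Variable()    # bias

# Primal hard-margin SVM: minimize 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

w_opt, b_opt = w.value, b.value
print(w_opt, b_opt)                             # roughly [0.33, 0.33] and -0.33 for this data
print("margin =", 1.0 / np.linalg.norm(w_opt))  # the maximized margin 1/||w||
```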
Support Vector Machines: Lagrange Multipliers

• The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints
• Consider, for instance, the constrained optimization problem
  maximize $f(x, y)$ subject to $g(x, y) = c$
• Let us visualize the contours of $f$ given by $f(x, y) = d$ for various values of $d$, and the contour of $g$ given by $g(x, y) = c$ (Source: [6])
• Following the contour line $g = c$, we want to find the point on it with the largest value of $f$. At that point, $f$ is stationary as we move along $g = c$ (Source: [6])
• In general, the contour line $g = c$ will cross/intersect the contour lines of $f$. This is equivalent to saying that the value of $f$ varies while moving along $g = c$
• Only where the line $g = c$ meets a contour line of $f$ tangentially, that is, the lines touch but do not cross, does the value of $f$ not change along $g = c$ (Source: [6])
• Contour lines touch when their tangent vectors are parallel. This is the same as saying that the gradients are parallel, because the gradient is always perpendicular to the contour (Source: [6])
• This can be formally expressed as
  $\nabla_{x,y} f = -\lambda\, \nabla_{x,y} g$
  with
  $\nabla_{x,y} f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right), \quad \nabla_{x,y} g = \left( \frac{\partial g}{\partial x}, \frac{\partial g}{\partial y} \right)$
• In general,
  $\nabla_{x} f(x) = -\lambda\, \nabla_{x} g(x)$
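As a worked example (not from the slides; the choice of $f$ and $g$ is purely illustrative), the stationarity condition $\nabla f = -\lambda \nabla g$ together with the constraint can be solved symbolically, e.g. for maximizing $f(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 = 1$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

f = x + y                 # illustrative objective
g = x**2 + y**2 - 1       # illustrative constraint g(x, y) = c, rewritten as g - c = 0

# Stationarity: grad f + lambda * grad g = 0  (i.e. grad f = -lambda * grad g)
stationarity = [sp.diff(f, v) + lam * sp.diff(g, v) for v in (x, y)]

# Solve the stationarity conditions together with the constraint
solutions = sp.solve(stationarity + [g], [x, y, lam], dict=True)
print(solutions)
# Two candidate points: (sqrt(2)/2, sqrt(2)/2) is the constrained maximum,
# (-sqrt(2)/2, -sqrt(2)/2) is the constrained minimum
```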