learning to compare using operator-valued large-margin classifiers
  1. learning to compare using operator-valued large-margin classifiers
  Andreas Maurer

  2. a binary classification task for pairs

  X = input space, embedded in a Hilbert space H by a suitable kernel: X ⊆ H and diam(X) ≤ 1.

  μ = a probability measure on X² × {−1, 1}, the pair oracle: μ(x, x′, r) is the probability to encounter the two inputs x, x′ ∈ X being
  - homonymous (same label) for r = 1 and
  - heteronymous (different labels) for r = −1.

  A pair classifier is a function on X² to predict the third argument of μ.

  S = ((x_1, x′_1, r_1), ..., (x_m, x′_m, r_m)) ∈ (X² × {−1, 1})^m is a training sample, generated in m independent, identical trials of μ, i.e. S ∼ μ^m.

  Goal: Use S to find a pair classifier with low error probability.
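
To make the setup concrete, here is a minimal numpy sketch of how such a pair sample might be simulated from an ordinary labeled dataset. The function name and the uniform pairing scheme are illustrative assumptions; the slides only require S ∼ μ^m.

    import numpy as np

    def make_pair_sample(X, y, m, seed=None):
        # Simulate m independent draws from a pair oracle built on (X, y):
        # r = +1 for a homonymous pair (same label), r = -1 otherwise.
        rng = np.random.default_rng(seed)
        S = []
        for _ in range(m):
            i, j = rng.integers(0, len(X), size=2)
            r = 1 if y[i] == y[j] else -1
            S.append((X[i], X[j], r))
        return S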

  3. pair classifiers induced by linear transformations

  We will select our classifiers from the hypothesis space
    { f_T : (x, x′) ↦ sgn(1 − ‖Tx − Tx′‖²) : T ∈ L(H) }.

  A choice of T ∈ L(H) then implies a choice of
  - the pair classifier f_T,
  - the pseudo-metric d_T (a Mahalanobis distance), d_T²(x, x′) = ‖Tx − Tx′‖² = ⟨T*T(x − x′), x − x′⟩, and
  - the positive semidefinite kernel κ_T(x, x′) = ⟨T*Tx, x′⟩.

  The risk of the operator T is the error probability of the classifier f_T:
    R(T) = Pr_{(x,x′,r)∼μ}{ f_T(x, x′) ≠ r } = Pr_{(x,x′,r)∼μ}{ r(1 − ‖Tx − Tx′‖²) ≤ 0 }.
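
A finite-dimensional sketch of the three objects induced by a choice of T, with T represented as a plain matrix; the function names are illustrative, not from the slides.

    import numpy as np

    def pair_classifier(T, x, xp):
        # f_T(x, x') = sgn(1 - ||Tx - Tx'||^2)
        return np.sign(1.0 - np.sum((T @ x - T @ xp) ** 2))

    def mahalanobis_sq(T, x, xp):
        # d_T^2(x, x') = <T*T(x - x'), x - x'> = ||T(x - x')||^2
        return float(np.sum((T @ (x - xp)) ** 2))

    def induced_kernel(T, x, xp):
        # kappa_T(x, x') = <T*T x, x'>
        return float((T @ x) @ (T @ xp))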

  4. estimation and generalization

  Let f : ℝ → ℝ with f ≥ 1_{(−∞, 0]} and Lipschitz constant L.

  For a training sample S = ((x_1, x′_1, r_1), ..., (x_m, x′_m, r_m)) define the empirical risk estimate
    R̂_f(T, S) = (1/m) Σ_{i=1}^m f( r_i (1 − ‖T(x_i − x′_i)‖²) ).

  Theorem: For every δ > 0, with probability greater than 1 − δ in a sample S ∼ μ^m, for all T ∈ L(H) with ‖T*T‖₂ ≥ 1:
    R(T) ≤ R̂_f(T, S) + ( 8L ‖T*T‖₂ + √( ln(2‖T*T‖₂/δ) ) ) / √m,
  where ‖A‖₂ = Tr(A*A)^{1/2} is the Hilbert-Schmidt (or Frobenius) norm of A.
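
A direct transcription of the empirical risk estimate into numpy, assuming S is a list of (x, x′, r) triples and f is the surrogate loss; the function name is an assumption.

    import numpy as np

    def empirical_risk(T, S, f):
        # R^_f(T, S) = (1/m) sum_i f(r_i * (1 - ||T(x_i - x'_i)||^2))
        return float(np.mean([f(r * (1.0 - np.sum((T @ (x - xp)) ** 2)))
                              for x, xp, r in S]))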

  5. regularized objectives

  The theorem suggests minimizing the regularized objective
    Λ_{f,λ}(T) := (1/m) Σ_{i=1}^m f( r_i (1 − ‖T(x_i − x′_i)‖²) ) + (λ/√m) ‖T*T‖₂.

  Since ‖T*T‖₂ ≤ ‖T‖₂², we can also use ‖T‖₂² as a stronger regularizer (computationally more efficient, but slightly inferior in experiments).

  For f we take the hinge loss f_γ with margin γ; f_γ has Lipschitz constant 1/γ and is convex.

  Since ‖T(x − x′)‖² = ⟨T*T(x − x′), x − x′⟩ is linear in T*T, the objective Λ_{f_γ,λ}(T) is a convex function of T*T.
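
A sketch of the hinge loss f_γ and the regularized objective just defined; the capped form of the hinge (constant 1 for t ≤ 0) matches the requirement f ≥ 1_{(−∞, 0]}, and the function names are assumptions.

    import numpy as np

    def hinge(t, gamma):
        # f_gamma: 1 for t <= 0, 0 for t >= gamma, linear in between;
        # convex, Lipschitz constant 1/gamma, and f_gamma >= 1_(-inf, 0].
        return float(np.clip(1.0 - t / gamma, 0.0, 1.0))

    def objective(T, S, gamma, lam):
        # Lambda_{f_gamma, lam}(T) = empirical hinge risk
        #                            + (lam / sqrt(m)) * ||T*T||_2
        m = len(S)
        risk = np.mean([hinge(r * (1.0 - np.sum((T @ (x - xp)) ** 2)), gamma)
                        for x, xp, r in S])
        return float(risk + lam / np.sqrt(m) * np.linalg.norm(T.T @ T, "fro"))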

  6. optimization problem

  Find T ∈ L(H) to minimize
    Λ_{f_γ,λ}(T) = Φ(T*T) = (1/m) Σ_{i=1}^m f_γ( r_i (1 − ‖T(x_i − x′_i)‖²) ) + (λ/√m) ‖T*T‖₂.

  Λ_{f_γ,λ} is not convex in T, but Φ is convex in T*T.

  First possibility: Solve the convex optimization problem for Φ on the set of positive semidefinite operators by alternating projections (as in Xing et al.), then take the square-root operator to get T.

  Second possibility (my choice): Do gradient descent of Λ_{f_γ,λ} in T. There are no problems with local minima: if T is a stable local minimizer of Λ_{f_γ,λ}, then T*T is a stable local minimizer of Φ, and since Φ is convex, any local minimizer of Φ is a global one.
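
For the first possibility, a small sketch of the final square-root step in finite dimensions: given the PSD matrix M = T*T returned by the convex solver, a T with T*T = M can be recovered from the eigendecomposition. The solver itself is not shown, and the function name is an assumption.

    import numpy as np

    def sqrt_operator(M):
        # Given the PSD matrix M = T*T found by the convex solver,
        # return the symmetric square root T, so that T.T @ T == M.
        w, V = np.linalg.eigh(M)
        w = np.clip(w, 0.0, None)        # guard against tiny negative eigenvalues
        return (V * np.sqrt(w)) @ V.T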

  7. algorithm

  Given sample S, regularization parameter λ, margin γ, learning rate η:
    initialize λ₀ = λ/√m (where m = |S|)
    initialize T = (v_1, ..., v_d) (where the v_i are row vectors)
    repeat
      compute ‖T*T‖₂ = ( Σ_{ij} ⟨v_i, v_j⟩² )^{1/2}
      for i = 1, ..., d compute w_i = 2‖T*T‖₂⁻¹ Σ_j ⟨v_i, v_j⟩ v_j
      fetch (x, x′, r) from sample S
      for i = 1, ..., d compute a_i = ⟨v_i, x − x′⟩
      compute b = Σ_{i=1}^d a_i²
      if r(1 − b) < γ
        then for i := 1, ..., d do v_i ← v_i − η( r a_i (x − x′) + λ₀ w_i )
        else for i := 1, ..., d do v_i ← v_i − η λ₀ w_i
    until convergence
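
A numpy sketch of this stochastic gradient algorithm, under stated assumptions: T is stored as a d × dim matrix whose rows are v_1, ..., v_d, a fixed iteration count stands in for "until convergence", and constant factors of the exact hinge subgradient (such as 2/γ) are absorbed into the learning rate η.

    import numpy as np

    def train(S, lam, gamma, eta, d, dim, n_iter=1_000_000, seed=None):
        rng = np.random.default_rng(seed)
        m = len(S)
        lam0 = lam / np.sqrt(m)
        V = 0.01 * rng.standard_normal((d, dim))   # rows are v_1, ..., v_d
        for _ in range(n_iter):
            G = V @ V.T                            # Gram matrix <v_i, v_j>
            hs = np.sqrt(np.sum(G ** 2))           # ||T*T||_2
            W = (2.0 / hs) * (G @ V)               # w_i = 2 ||T*T||_2^-1 sum_j <v_i, v_j> v_j
            x, xp, r = S[rng.integers(m)]          # fetch (x, x', r) from S
            delta = x - xp
            a = V @ delta                          # a_i = <v_i, x - x'>
            b = np.sum(a ** 2)
            if r * (1.0 - b) < gamma:              # margin violated: loss + regularizer step
                V -= eta * (r * np.outer(a, delta) + lam0 * W)
            else:                                  # regularizer step only
                V -= eta * lam0 * W
        return V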

  8. experiments

  Experiments with invariant character recognition, spatial rotations (COIL100) and face recognition (ATT):
  1. training T from one task/group of tasks,
  2. training nearest-neighbour test classifiers with a single example per class on a test task, using both the input metric and the metric induced by T,
  3. recording the error rates of the test classifiers.

  The pixel vectors x are embedded in the space H with the Gaussian RBF kernel
    κ(x_1, x_2) = (1/2) exp( −‖ x_1/‖x_1‖ − x_2/‖x_2‖ ‖² / (4σ) ).
  The parameters σ = 1 and λ = 0.05 are used throughout.
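
The kernel and the induced input-space distance, transcribed into numpy. Since κ(x, x) = 1/2 for every x, the squared distance in H is 1 − 2κ(x_1, x_2) ≤ 1, consistent with the assumption diam(X) ≤ 1; the function names are illustrative.

    import numpy as np

    def kernel(x1, x2, sigma=1.0):
        # kappa(x1, x2) = (1/2) exp(-|| x1/||x1|| - x2/||x2|| ||^2 / (4 sigma))
        u = x1 / np.linalg.norm(x1)
        v = x2 / np.linalg.norm(x2)
        return 0.5 * np.exp(-np.sum((u - v) ** 2) / (4.0 * sigma))

    def input_distance_sq(x1, x2, sigma=1.0):
        # ||phi(x1) - phi(x2)||^2 = kappa(x1,x1) + kappa(x2,x2) - 2 kappa(x1,x2)
        #                         = 1 - 2 kappa(x1, x2)   (since kappa(x, x) = 1/2)
        return 1.0 - 2.0 * kernel(x1, x2, sigma)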

  9. rotation- and scale-invariant character recognition

  [Images] Typical patterns used to train the preprocessor (4000 examples from 20 classes).
  [Images] Nine digits used to train a single-nearest-neighbour classifier.
  [Images] Some digits used to test the classifier.

  10. results for rotation/scale-invariant OCR

  ROC area (input metric):    0.539
  ROC area (metric from T):   0.982
  1-NN error (input metric):  0.822
  1-NN error (metric from T): 0.093
  σ:                          1
  γ:                          4
  λ:                          0.005
  Sample size:                4000
  Iterations:                 1000k
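
The slide does not spell out how the ROC area is computed; one plausible reading, sketched below, scores each test pair by the classifier margin 1 − ‖Tx − Tx′‖² and measures how often homonymous pairs outrank heteronymous ones (the standard AUC). The function name and inputs are assumptions.

    import numpy as np

    def roc_area(scores, r):
        # Fraction of (homonymous, heteronymous) pair combinations in which
        # the homonymous pair gets the higher score; ties count one half.
        pos = scores[r == 1]
        neg = scores[r == -1]
        diff = pos[:, None] - neg[None, :]
        return float((diff > 0).mean() + 0.5 * (diff == 0).mean())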

  11. norms and singular-value spectrum of T

  ‖T‖₁ = 61.5,  ‖T‖₂ = 27.7,  ‖T‖_∞ = 17.3
  [Plot: singular values of T, indices 1 to 20]
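
All three norms follow from the singular values shown in the plot; a short numpy sketch (the function name is illustrative):

    import numpy as np

    def operator_norms(T):
        # ||T||_1 (trace norm), ||T||_2 (Hilbert-Schmidt), ||T||_inf (operator norm)
        s = np.linalg.svd(T, compute_uv=False)
        return s.sum(), np.sqrt(np.sum(s ** 2)), s.max()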

  12. Thank you!
