learning to compare using operator-valued large-margin classifiers
  1. learning to compare using operator-valued large-margin classifiers
  Andreas Maurer

  2. a binary classification task for pairs

  X = input space, embedded in a Hilbert space H by a suitable kernel: X ⊆ H and diam(X) ≤ 1.

  μ = a probability measure on X² × {−1, 1}, the pair oracle: μ(x, x′, r) is the probability to encounter the two inputs x, x′ ∈ X being
  - homonymous (same label) for r = 1 and
  - heteronymous (different labels) for r = −1.

  A pair classifier is a function on X² to predict the third argument of μ.

  S = ((x_1, x′_1, r_1), ..., (x_m, x′_m, r_m)) ∈ (X² × {−1, 1})^m is a training sample, generated in m independent, identical trials of μ, i.e. S ∼ μ^m.

  Goal: Use S to find a pair classifier with low error probability.
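
To make the setup concrete, here is a minimal numpy sketch of how such a pair sample might be simulated from an ordinary labeled dataset. The function name and the uniform pairing scheme are illustrative assumptions; the slides only require S ∼ μ^m.

    import numpy as np

    def make_pair_sample(X, y, m, seed=None):
        # Simulate m independent draws from a pair oracle built on (X, y):
        # r = +1 for a homonymous pair (same label), r = -1 otherwise.
        rng = np.random.default_rng(seed)
        S = []
        for _ in range(m):
            i, j = rng.integers(0, len(X), size=2)
            r = 1 if y[i] == y[j] else -1
            S.append((X[i], X[j], r))
        return S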

  3. pair classifiers induced by linear transformations

  We will select our classifiers from the hypothesis space
    { f_T : (x, x′) ↦ sgn(1 − ‖Tx − Tx′‖²) : T ∈ L(H) }.

  A choice of T ∈ L(H) then implies a choice of
  - the pair classifier f_T,
  - the pseudo-metric d_T (a Mahalanobis distance), d_T²(x, x′) = ‖Tx − Tx′‖² = ⟨T*T(x − x′), x − x′⟩, and
  - the positive semidefinite kernel κ_T(x, x′) = ⟨T*Tx, x′⟩.

  The risk of the operator T is the error probability of the classifier f_T:
    R(T) = Pr_{(x,x′,r)∼μ}{ f_T(x, x′) ≠ r } = Pr_{(x,x′,r)∼μ}{ r(1 − ‖Tx − Tx′‖²) ≤ 0 }.
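
A finite-dimensional sketch of the three objects induced by a choice of T, with T represented as a plain matrix; the function names are illustrative, not from the slides.

    import numpy as np

    def pair_classifier(T, x, xp):
        # f_T(x, x') = sgn(1 - ||Tx - Tx'||^2)
        return np.sign(1.0 - np.sum((T @ x - T @ xp) ** 2))

    def mahalanobis_sq(T, x, xp):
        # d_T^2(x, x') = <T*T(x - x'), x - x'> = ||T(x - x')||^2
        return float(np.sum((T @ (x - xp)) ** 2))

    def induced_kernel(T, x, xp):
        # kappa_T(x, x') = <T*T x, x'>
        return float((T @ x) @ (T @ xp))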

  4. estimation and generalization

  Let f : ℝ → ℝ with f ≥ 1_{(−∞, 0]} and Lipschitz constant L.

  For a training sample S = ((x_1, x′_1, r_1), ..., (x_m, x′_m, r_m)) define the empirical risk estimate
    R̂_f(T, S) = (1/m) Σ_{i=1}^m f( r_i (1 − ‖T(x_i − x′_i)‖²) ).

  Theorem: For every δ > 0, with probability greater than 1 − δ in a sample S ∼ μ^m, for all T ∈ L(H) with ‖T*T‖₂ ≥ 1:
    R(T) ≤ R̂_f(T, S) + ( 8L ‖T*T‖₂ + √( ln(2‖T*T‖₂/δ) ) ) / √m,
  where ‖A‖₂ = Tr(A*A)^{1/2} is the Hilbert-Schmidt (or Frobenius) norm of A.
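
A direct transcription of the empirical risk estimate into numpy, assuming S is a list of (x, x′, r) triples and f is the surrogate loss; the function name is an assumption.

    import numpy as np

    def empirical_risk(T, S, f):
        # R^_f(T, S) = (1/m) sum_i f(r_i * (1 - ||T(x_i - x'_i)||^2))
        return float(np.mean([f(r * (1.0 - np.sum((T @ (x - xp)) ** 2)))
                              for x, xp, r in S]))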

  5. regularized objectives

  The theorem suggests minimizing the regularized objective
    Λ_{f,λ}(T) := (1/m) Σ_{i=1}^m f( r_i (1 − ‖T(x_i − x′_i)‖²) ) + (λ/√m) ‖T*T‖₂.

  Since ‖T*T‖₂ ≤ ‖T‖₂², we can also use ‖T‖₂² as a stronger regularizer (computationally more efficient, but slightly inferior in experiments).

  For f we take the hinge loss f_γ with margin γ; f_γ has Lipschitz constant 1/γ and is convex.

  Since ‖T(x − x′)‖² = ⟨T*T(x − x′), x − x′⟩ is linear in T*T, the objective Λ_{f_γ,λ}(T) is a convex function of T*T.
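
A sketch of the hinge loss f_γ and the regularized objective just defined; the capped form of the hinge (constant 1 for t ≤ 0) matches the requirement f ≥ 1_{(−∞, 0]}, and the function names are assumptions.

    import numpy as np

    def hinge(t, gamma):
        # f_gamma: 1 for t <= 0, 0 for t >= gamma, linear in between;
        # convex, Lipschitz constant 1/gamma, and f_gamma >= 1_(-inf, 0].
        return float(np.clip(1.0 - t / gamma, 0.0, 1.0))

    def objective(T, S, gamma, lam):
        # Lambda_{f_gamma, lam}(T) = empirical hinge risk
        #                            + (lam / sqrt(m)) * ||T*T||_2
        m = len(S)
        risk = np.mean([hinge(r * (1.0 - np.sum((T @ (x - xp)) ** 2)), gamma)
                        for x, xp, r in S])
        return float(risk + lam / np.sqrt(m) * np.linalg.norm(T.T @ T, "fro"))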

  6. optimization problem

  Find T ∈ L(H) to minimize
    Λ_{f_γ,λ}(T) = Φ(T*T) = (1/m) Σ_{i=1}^m f_γ( r_i (1 − ‖T(x_i − x′_i)‖²) ) + (λ/√m) ‖T*T‖₂.

  Λ_{f_γ,λ} is not convex in T, but Φ is convex in T*T.

  First possibility: Solve the convex optimization problem for Φ on the set of positive semidefinite operators by alternating projections (as in Xing et al.), then take the square-root operator to get T.

  Second possibility (my choice): Do gradient descent of Λ_{f_γ,λ} in T. There are no problems with local minima: if T is a stable local minimizer of Λ_{f_γ,λ}, then T*T is a stable local minimizer of Φ, and since Φ is convex, any local minimizer of Φ is a global one.
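
For the first possibility, a small sketch of the final square-root step in finite dimensions: given the PSD matrix M = T*T returned by the convex solver, a T with T*T = M can be recovered from the eigendecomposition. The solver itself is not shown, and the function name is an assumption.

    import numpy as np

    def sqrt_operator(M):
        # Given the PSD matrix M = T*T found by the convex solver,
        # return the symmetric square root T, so that T.T @ T == M.
        w, V = np.linalg.eigh(M)
        w = np.clip(w, 0.0, None)        # guard against tiny negative eigenvalues
        return (V * np.sqrt(w)) @ V.T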

  7. algorithm

  Given sample S, regularization parameter λ, margin γ, learning rate η:
    initialize λ₀ = λ/√m (where m = |S|)
    initialize T = (v_1, ..., v_d) (where the v_i are row vectors)
    repeat
      compute ‖T*T‖₂ = ( Σ_{ij} ⟨v_i, v_j⟩² )^{1/2}
      for i = 1, ..., d compute w_i = 2‖T*T‖₂⁻¹ Σ_j ⟨v_i, v_j⟩ v_j
      fetch (x, x′, r) from sample S
      for i = 1, ..., d compute a_i = ⟨v_i, x − x′⟩
      compute b = Σ_{i=1}^d a_i²
      if r(1 − b) < γ
        then for i := 1, ..., d do v_i ← v_i − η( r a_i (x − x′) + λ₀ w_i )
        else for i := 1, ..., d do v_i ← v_i − η λ₀ w_i
    until convergence
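
A numpy sketch of this stochastic gradient algorithm, under stated assumptions: T is stored as a d × dim matrix whose rows are v_1, ..., v_d, a fixed iteration count stands in for "until convergence", and constant factors of the exact hinge subgradient (such as 2/γ) are absorbed into the learning rate η.

    import numpy as np

    def train(S, lam, gamma, eta, d, dim, n_iter=1_000_000, seed=None):
        rng = np.random.default_rng(seed)
        m = len(S)
        lam0 = lam / np.sqrt(m)
        V = 0.01 * rng.standard_normal((d, dim))   # rows are v_1, ..., v_d
        for _ in range(n_iter):
            G = V @ V.T                            # Gram matrix <v_i, v_j>
            hs = np.sqrt(np.sum(G ** 2))           # ||T*T||_2
            W = (2.0 / hs) * (G @ V)               # w_i = 2 ||T*T||_2^-1 sum_j <v_i, v_j> v_j
            x, xp, r = S[rng.integers(m)]          # fetch (x, x', r) from S
            delta = x - xp
            a = V @ delta                          # a_i = <v_i, x - x'>
            b = np.sum(a ** 2)
            if r * (1.0 - b) < gamma:              # margin violated: loss + regularizer step
                V -= eta * (r * np.outer(a, delta) + lam0 * W)
            else:                                  # regularizer step only
                V -= eta * lam0 * W
        return V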

  8. experiments

  Experiments with invariant character recognition, spatial rotations (COIL100) and face recognition (ATT):
  1. training T from one task/group of tasks,
  2. training nearest-neighbour test classifiers with a single example per class on a test task, using both the input metric and the metric induced by T,
  3. recording the error rates of the test classifiers.

  The pixel vectors x are embedded in the space H with the Gaussian RBF kernel
    κ(x_1, x_2) = (1/2) exp( −‖ x_1/‖x_1‖ − x_2/‖x_2‖ ‖² / (4σ) ).
  The parameters σ = 1 and λ = 0.05 are used throughout.
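
The kernel and the induced input-space distance, transcribed into numpy. Since κ(x, x) = 1/2 for every x, the squared distance in H is 1 − 2κ(x_1, x_2) ≤ 1, consistent with the assumption diam(X) ≤ 1; the function names are illustrative.

    import numpy as np

    def kernel(x1, x2, sigma=1.0):
        # kappa(x1, x2) = (1/2) exp(-|| x1/||x1|| - x2/||x2|| ||^2 / (4 sigma))
        u = x1 / np.linalg.norm(x1)
        v = x2 / np.linalg.norm(x2)
        return 0.5 * np.exp(-np.sum((u - v) ** 2) / (4.0 * sigma))

    def input_distance_sq(x1, x2, sigma=1.0):
        # ||phi(x1) - phi(x2)||^2 = kappa(x1,x1) + kappa(x2,x2) - 2 kappa(x1,x2)
        #                         = 1 - 2 kappa(x1, x2)   (since kappa(x, x) = 1/2)
        return 1.0 - 2.0 * kernel(x1, x2, sigma)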

  9. rotation- and scale-invariant character recognition

  [Images] Typical patterns used to train the preprocessor (4000 examples from 20 classes).
  [Images] Nine digits used to train a single-nearest-neighbour classifier.
  [Images] Some digits used to test the classifier.

  10. results for rotation/scale-invariant OCR

  ROC area (input metric):    0.539
  ROC area (metric from T):   0.982
  1-NN error (input metric):  0.822
  1-NN error (metric from T): 0.093
  σ:                          1
  γ:                          4
  λ:                          0.005
  Sample size:                4000
  Iterations:                 1000k
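
The slide does not spell out how the ROC area is computed; one plausible reading, sketched below, scores each test pair by the classifier margin 1 − ‖Tx − Tx′‖² and measures how often homonymous pairs outrank heteronymous ones (the standard AUC). The function name and inputs are assumptions.

    import numpy as np

    def roc_area(scores, r):
        # Fraction of (homonymous, heteronymous) pair combinations in which
        # the homonymous pair gets the higher score; ties count one half.
        pos = scores[r == 1]
        neg = scores[r == -1]
        diff = pos[:, None] - neg[None, :]
        return float((diff > 0).mean() + 0.5 * (diff == 0).mean())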

  11. norms and singular-value spectrum of T

  ‖T‖₁ = 61.5,  ‖T‖₂ = 27.7,  ‖T‖_∞ = 17.3
  [Plot: singular values of T, indices 1 to 20]
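
All three norms follow from the singular values shown in the plot; a short numpy sketch (the function name is illustrative):

    import numpy as np

    def operator_norms(T):
        # ||T||_1 (trace norm), ||T||_2 (Hilbert-Schmidt), ||T||_inf (operator norm)
        s = np.linalg.svd(T, compute_uv=False)
        return s.sum(), np.sqrt(np.sum(s ** 2)), s.max()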

  12. Thank you!
