

  1. Distance Metric Learning: Beyond 0/1 Loss. Praveen Krishnan, CVIT, IIIT Hyderabad. June 14, 2017

  2. Outline ◮ Distances and Similarities ◮ Distance Metric Learning ◮ Mahalanobis Distances ◮ Metric Learning Formulation ◮ Mahalanobis metric for clustering ◮ Large Margin Nearest Neighbor ◮ Distance Metric Learning using CNNs ◮ Siamese Network ◮ Contrastive loss function ◮ Applications ◮ Triplet Network ◮ Triplet Loss ◮ Applications ◮ Mining Triplets ◮ Adaptive Density Distribution ◮ Magnet loss

  3. Distances and Similarities. Distance Functions: the concept of a distance function d(·,·) is inherent to any pattern recognition problem, e.g. clustering (k-means), classification (kNN, SVM), etc. Typical choices: ◮ Minkowski distance: $L_p(P, Q) = \left( \sum_i |P_i - Q_i|^p \right)^{1/p}$ ◮ Cosine: $L(P, Q) = \frac{P^T Q}{\|P\|\,\|Q\|}$ ◮ Earth Mover's distance: uses an optimization algorithm ◮ Edit distance: uses dynamic programming between sequences ◮ KL divergence: $KL(P \| Q) = \sum_i P_i \log \frac{P_i}{Q_i}$ (not symmetric!) ◮ many more, depending on the type of problem.
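
Below is a minimal NumPy sketch (added for illustration, not part of the original slides) of three of these distances; the function names and the toy vectors P, Q are my own.

```python
import numpy as np

def minkowski(p_vec, q_vec, p=2):
    """Minkowski distance L_p(P, Q) = (sum_i |P_i - Q_i|^p)^(1/p)."""
    return np.sum(np.abs(p_vec - q_vec) ** p) ** (1.0 / p)

def cosine_similarity(p_vec, q_vec):
    """Cosine similarity P^T Q / (|P| |Q|)."""
    return p_vec @ q_vec / (np.linalg.norm(p_vec) * np.linalg.norm(q_vec))

def kl_divergence(p_dist, q_dist):
    """KL(P || Q) = sum_i P_i log(P_i / Q_i); asymmetric, assumes P, Q > 0."""
    return np.sum(p_dist * np.log(p_dist / q_dist))

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.2, 0.3, 0.5])
print(minkowski(P, Q, p=2), cosine_similarity(P, Q), kl_divergence(P, Q))
```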

  4. Distances and Similarities. Choosing the right distance function? Image Credit: Brian Kulis, ECCV’10 Tutorial on Distance Functions and Metric Learning.

  5. Metric Learning. Distance Metric Learning: learn a function that maps input patterns into a target space such that the simple (Euclidean) distance in the target space approximates the “semantic” distance in the input space. Figure 1: Hadsell et al., CVPR’06

  6. Metric Learning. Many applications. Figure 2: A subset of applications using metric learning. ◮ Scaling to a large number of categories [Schroff et al., 2015] ◮ Fine-grained classification [Rippel et al., 2015] ◮ Visualization of high-dimensional data [van der Maaten and Hinton, 2008] ◮ Ranking and retrieval [Wang et al., CVPR’14]

  7. Properties of a Metric. What defines a metric? 1. Non-negativity: D(P, Q) ≥ 0 2. Identity of indiscernibles: D(P, Q) = 0 iff P = Q 3. Symmetry: D(P, Q) = D(Q, P) 4. Triangle inequality: D(P, Q) ≤ D(P, K) + D(K, Q). Pseudo/semi-metric: obtained when the second property is not required strictly, i.e. “iff” is relaxed to “if”.

  8. Metric learning as learning transformations ◮ Feature weighting: learn weightings over the features, then use a standard distance (e.g., Euclidean) after re-weighting ◮ Full linear transformation: in addition to scaling features, also rotates the data; for transformations to r < d dimensions, this is linear dimensionality reduction ◮ Non-linear transformation: neural nets, or kernelization of linear transformations. Slide Credit: Brian Kulis, ECCV’10 Tutorial on Distance Functions and Metric Learning.

  9. Supervised Metric Learning. Main focus of this talk. ◮ Constraints or labels are given to the algorithm, e.g. a set of similarity and dissimilarity constraints ◮ Recent popular methods use a CNN architecture for the non-linear transformation. Before getting into deep architectures, let us explore some basic and classical works.

  10. Mahalanobis Distances ◮ Assume the data is represented as N vectors of length d: $X = [x_1, x_2, \cdots, x_N]$ ◮ Squared Euclidean distance: $d(x_1, x_2) = \|x_1 - x_2\|_2^2 = (x_1 - x_2)^T (x_1 - x_2)$ (1) ◮ Let $\Sigma = \sum_i (x_i - \mu)(x_i - \mu)^T$ be the (unnormalized) covariance of the data ◮ The original Mahalanobis distance is given as: $d_M(x_1, x_2) = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)$ (2)
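
As a concrete illustration (my own sketch, not from the slides), the classical Mahalanobis distance of Eq. (2) can be computed with NumPy from the sample covariance of a toy data matrix X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # N = 100 samples of dimension d = 3

sigma = np.cov(X, rowvar=False)      # d x d sample covariance matrix
sigma_inv = np.linalg.inv(sigma)

def mahalanobis(x1, x2, sigma_inv):
    """d_M(x1, x2) = (x1 - x2)^T Sigma^{-1} (x1 - x2), as in Eq. (2)."""
    diff = x1 - x2
    return diff @ sigma_inv @ diff

print(mahalanobis(X[0], X[1], sigma_inv))
```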

  11. Mahalanobis Distances. Equivalent to applying a whitening transform. Image Credit: Brian Kulis, ECCV’10 Tutorial on Distance Functions and Metric Learning.

  12. Mahalanobis Distances. Mahalanobis distances for metric learning: in general, the distance can be parameterized by a d × d positive semi-definite matrix A: $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2)$ (3). Metric learning as linear transformation: this derives a family of metrics over X by computing Euclidean distances after performing a linear transformation $x' = L^T x$, where $A = L L^T$ (Cholesky decomposition), so that $d_A(x_1, x_2) = \|L^T (x_1 - x_2)\|_2^2$ (4)
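
A quick NumPy check (illustrative; the random PSD matrix A and the data are my own assumptions) that the parameterized metric of Eq. (3) is exactly squared Euclidean distance after the linear map of Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.normal(size=(d, d))
A = B @ B.T                          # random positive (semi-)definite matrix
L = np.linalg.cholesky(A)            # A = L L^T

x1, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x1 - x2

d_A = diff @ A @ diff                # (x1 - x2)^T A (x1 - x2), Eq. (3)
d_euclid = np.sum((L.T @ diff) ** 2) # ||L^T (x1 - x2)||_2^2, Eq. (4)
print(np.isclose(d_A, d_euclid))     # True
```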

  13. Mahalanobis Distances. Why is A positive semi-definite (PSD)?

  14. Mahalanobis Distances. Why is A positive semi-definite (PSD)? ◮ If A is not PSD, then $d_A$ could be negative. ◮ Suppose $v = x_1 - x_2$ is an eigenvector corresponding to a negative eigenvalue $\lambda$ of A. Then $d_A(x_1, x_2) = (x_1 - x_2)^T A (x_1 - x_2) = v^T A v = \lambda\, v^T v < 0$ (5)
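
A tiny numeric illustration of this point (my own example): with an indefinite A, a pair of points separated along the negative-eigenvalue direction gets a negative "distance".

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, -2.0]])          # indefinite: eigenvalues 1 and -2

eigvals, eigvecs = np.linalg.eigh(A)
v = eigvecs[:, np.argmin(eigvals)]   # eigenvector of the negative eigenvalue

x1, x2 = v, np.zeros(2)              # so x1 - x2 = v
d_A = (x1 - x2) @ A @ (x1 - x2)
print(d_A)                           # -2.0 < 0, so d_A is not a valid metric
```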

  15. Metric Learning Formulation. Two main components: ◮ a set of constraints on the distance ◮ a regularizer on the distance / objective function. Constrained case: $\min_A r(A)$ s.t. $c_i(A) \le 0$ for $1 \le i \le C$ and $A \succeq 0$ (6). Here r is the regularizer; a popular choice is $\|A\|_F^2$. The constraint $A \succeq 0$ enforces positive semi-definiteness. Unconstrained case: $\min_{A \succeq 0}\ r(A) + \lambda \sum_{i=1}^{C} c_i(A)$ (7)

  16. Metric Learning Formulation: Defining Constraints. Similarity / dissimilarity constraints: given a set S of pairs $(x_i, x_j)$ of points that should be similar, and a set D of pairs of points that should be dissimilar, require $d_A(x_i, x_j) \le l\ \forall (i, j) \in S$ and $d_A(x_i, x_j) \ge u\ \forall (i, j) \in D$ (8). Popular in verification problems.

  17. Metric Learning Formulation: Defining Constraints (contd.). Relative distance constraints: given a triplet $(x_i, x_j, x_k)$ such that the distance between $x_i$ and $x_j$ should be smaller than the distance between $x_i$ and $x_k$, require $d_A(x_i, x_j) \le d_A(x_i, x_k) - m$ (9). Here m is the margin. Popular in ranking problems.
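
A hedged sketch (my own illustration, not from the slides) of how these two constraint types are typically turned into hinge penalties c_i(A) for the unconstrained objective in Eq. (7); the thresholds l, u and the margin m are placeholder values:

```python
import numpy as np

def d_A(x, y, A):
    """Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y), Eq. (3)."""
    diff = x - y
    return diff @ A @ diff

def pair_penalty(x_i, x_j, A, similar, l=1.0, u=2.0):
    """Hinge penalty for the similarity/dissimilarity constraints of Eq. (8)."""
    d = d_A(x_i, x_j, A)
    return max(0.0, d - l) if similar else max(0.0, u - d)

def triplet_penalty(x_i, x_j, x_k, A, m=1.0):
    """Hinge penalty for the relative-distance constraint of Eq. (9)."""
    return max(0.0, d_A(x_i, x_j, A) - d_A(x_i, x_k, A) + m)
```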

  18. Mahalanobis metric for clustering. Key components: ◮ a convex objective function for distance metric learning ◮ similar in spirit to linear discriminant analysis. $\max_A \sum_{(x_i, x_j) \in D} d_A(x_i, x_j)$ s.t. $c(A) = \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) \le 1$ and $A \succeq 0$ (10) ◮ Here D is a set of dissimilar pairs and S is a set of similar pairs ◮ The objective tries to maximize the sum of distances between dissimilar pairs ◮ The constraint keeps the sum of distances between similar pairs small. Xing et al., NIPS'02
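
For intuition, here is a heavily simplified projected-gradient sketch of Eq. (10) (my own approximation; Xing et al. use a more careful iterative projection algorithm, and the learning rate, iteration count, and rescaling step below are placeholders):

```python
import numpy as np

def learn_clustering_metric(X, S, D, n_iters=200, lr=0.01):
    """X: (N, d) data; S, D: lists of (i, j) index pairs (similar / dissimilar)."""
    A = np.eye(X.shape[1])
    for _ in range(n_iters):
        # Gradient ascent on the objective: sum of outer products over dissimilar pairs.
        grad = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in D)
        A = A + lr * grad
        # Project onto the PSD cone by clipping negative eigenvalues.
        w, V = np.linalg.eigh(A)
        A = (V * np.clip(w, 0.0, None)) @ V.T
        # Rescale so the similar-pair constraint sum_{S} d_A <= 1 is satisfied.
        c = sum((X[i] - X[j]) @ A @ (X[i] - X[j]) for i, j in S)
        if c > 1.0:
            A = A / c
    return A
```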

  19. Large Margin Nearest Neighbor. Key components: learns a Mahalanobis distance metric using ◮ a convex loss function ◮ margin maximization ◮ constraints imposed for accurate kNN classification ◮ a local notion of distance rather than a global similarity. Intuition: ◮ each training input $x_i$ should share the same label $y_i$ as its k nearest neighbors, and ◮ training inputs with different labels should be widely separated. Weinberger et al., JMLR'09

  20. Large Margin Nearest Neighbor. Target neighbors: use prior knowledge or compute the k nearest neighbors using Euclidean distance; the target neighbors do not change during training. Impostors: differently labeled inputs that invade the perimeter plus a unit margin: $\|L^T(x_i - x_l)\|^2 \le \|L^T(x_i - x_j)\|^2 + 1$ (11). Here $x_i$ and $x_j$ have label $y_i$, and $x_l$ is an impostor with label $y_l \ne y_i$. Weinberger et al., JMLR'09

  21. Large Margin Nearest Neighbor. Loss function: $\varepsilon_{pull}(L) = \sum_{j \rightsquigarrow i} \|L^T(x_i - x_j)\|^2$ and $\varepsilon_{push}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \left[ 1 + \|L^T(x_i - x_j)\|^2 - \|L^T(x_i - x_l)\|^2 \right]_+$ (12), where $j \rightsquigarrow i$ means $x_j$ is a target neighbor of $x_i$ and $[z]_+ = \max(0, z)$ denotes the standard hinge loss. $\varepsilon(L) = (1 - \mu)\, \varepsilon_{pull}(L) + \mu\, \varepsilon_{push}(L)$ (13). Here $(x_i, x_j, x_l)$ forms a triplet sample. The above loss function is non-convex in L; the original paper discusses a convex reformulation using semi-definite programming. Weinberger et al., JMLR'09
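
A direct (and deliberately naive, with a full scan over candidate impostors per pair) NumPy transcription of Eqs. (12)-(13); the data layout and the target_neighbors structure are my own assumptions:

```python
import numpy as np

def lmnn_loss(L, X, y, target_neighbors, mu=0.5):
    """LMNN loss of Eqs. (12)-(13), written directly in terms of L (non-convex).

    X: (N, d) data, y: (N,) labels,
    target_neighbors: dict mapping i -> indices j of its k target neighbors.
    """
    def sq_dist(a, b):
        z = L.T @ (X[a] - X[b])
        return z @ z

    pull, push = 0.0, 0.0
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            d_ij = sq_dist(i, j)
            pull += d_ij                                          # eps_pull
            for l in range(len(X)):
                if y[l] != y[i]:                                  # (1 - y_il) term
                    push += max(0.0, 1.0 + d_ij - sq_dist(i, l))  # eps_push
    return (1.0 - mu) * pull + mu * push                          # Eq. (13)
```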

  22. Distance Metric Learning using CNNs

  23. Distance Metric Learning using CNNs. Siamese Network: Siamese is an informal term for conjoined or fused. ◮ Contains two or more identical sub-networks with a shared set of parameters and weights ◮ Popularly used for similarity learning tasks such as verification and ranking. Figure 3: Signature verification. Bromley et al., NIPS'93

  24. Siamese Architecture. Given a family of functions $G_W(X)$ parameterized by W, find W such that the similarity metric $D_W(X_1, X_2)$ is small for similar pairs and large for dissimilar pairs: $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$ (14). Chopra et al., CVPR'05 and Hadsell et al., CVPR'06
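
A minimal PyTorch sketch of the idea (the layer sizes and architecture are placeholders I chose, not taken from the papers): a single embedding network G_W is applied to both inputs with shared weights, and D_W is the Euclidean distance between the two embeddings.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Two identical branches G_W sharing weights; D_W = ||G_W(X1) - G_W(X2)||."""
    def __init__(self, in_dim=784, embed_dim=64):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x1, x2):
        z1, z2 = self.embed(x1), self.embed(x2)   # same parameters W for both inputs
        return torch.norm(z1 - z2, p=2, dim=1)    # D_W(X1, X2), Eq. (14)
```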

  25. Contrastive Loss Function. Let $X_1, X_2 \in I$ be a pair of input vectors and Y the binary label, where Y = 0 means the pair is similar and Y = 1 means dissimilar. We define a parameterized distance function $D_W$ as $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|_2$ (15). The contrastive loss function is given as $L(W, Y, X_1, X_2) = (1 - Y)\,\tfrac{1}{2} D_W^2 + Y\,\tfrac{1}{2} \{\max(0, m - D_W)\}^2$ (16). Here m > 0 is the margin, which enforces a minimum separation between dissimilar pairs.
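
A hedged PyTorch sketch of Eq. (16), usable on top of the SiameseNet sketch above; the label convention (Y = 0 similar, Y = 1 dissimilar) follows the slide, and the default margin is a placeholder:

```python
import torch

def contrastive_loss(d_w, y, margin=1.0):
    """Contrastive loss of Eq. (16); d_w = D_W(X1, X2) per pair, y in {0, 1}."""
    pos = (1 - y) * 0.5 * d_w.pow(2)                           # pull similar pairs together
    neg = y * 0.5 * torch.clamp(margin - d_w, min=0.0).pow(2)  # push dissimilar pairs past m
    return (pos + neg).mean()

# Usage (illustrative): loss = contrastive_loss(model(x1, x2), y.float())
```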

  26. Contrastive loss function. Spring model analogy: F = −KX. Attraction (similar pairs): $\frac{\partial L_S}{\partial W} = D_W \frac{\partial D_W}{\partial W}$. Repulsion (dissimilar pairs): $\frac{\partial L_D}{\partial W} = -(m - D_W) \frac{\partial D_W}{\partial W}$. The repulsive force vanishes when $D_W \ge m$.

  27. Dimensionality Reduction

  28. Face Verification. Discriminative Deep Metric Learning ◮ Face verification in the wild ◮ Learns a distance threshold that separates positive from negative face pairs. Hu et al., CVPR'14
