Information-Theoretic Metric Learning
Jason V. Davis, Brian Kulis, Suvrit Sra, and Inderjit Dhillon
The University of Texas at Austin
December 9, 2006
Presenter: Jason V. Davis
Introduction
◮ Problem: Learn a Mahalanobis distance function subject to linear constraints
◮ Information-theoretic viewpoint
  ◮ Bijection between Gaussian distributions and Mahalanobis distances
  ◮ Natural entropy-based objective
◮ Connections with kernel learning
◮ Fast and simple methods
  ◮ Based on Bregman's method for convex optimization
  ◮ No eigenvalue computations are needed!
Learning a Mahalanobis Distance
◮ Given n points {x_1, ..., x_n} in ℝ^d
◮ Given inequality constraints relating pairs of points
  ◮ Similarity constraints: d_A(x_i, x_j) ≤ u
  ◮ Dissimilarity constraints: d_A(x_i, x_j) ≥ ℓ
◮ Problem: Learn a Mahalanobis distance that satisfies these constraints:
    d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)
◮ Applications
  ◮ k-means clustering
  ◮ Nearest neighbor searches
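For concreteness, here is a minimal NumPy sketch (an illustration added to this transcript, not from the slides) of the squared Mahalanobis distance d_A and the two constraint types; the thresholds u and ℓ below are arbitrary placeholder values.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, A):
    """Squared Mahalanobis distance d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ A @ diff)

# With A = I, d_A reduces to the squared Euclidean distance.
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
A = np.eye(3)
print(np.isclose(mahalanobis_sq(x_i, x_j, A), np.sum((x_i - x_j) ** 2)))  # True

# Constraint checks for illustrative thresholds u and ell (values are arbitrary here).
u, ell = 1.0, 5.0
similarity_ok    = mahalanobis_sq(x_i, x_j, A) <= u    # similarity constraint
dissimilarity_ok = mahalanobis_sq(x_i, x_j, A) >= ell  # dissimilarity constraint
```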
Mahalanobis Distance and the Multivariate Gaussian
◮ Problem: How to choose the 'best' Mahalanobis distance from the feasible set?
◮ Solution: Regularize by choosing the one that is 'closest' to the Euclidean distance
◮ Bijection between the multivariate Gaussian and the Mahalanobis distance:
    p(x; m, A) = (1/Z) exp(−(1/2)(x − m)^T A (x − m))
◮ Allows for comparison of two Mahalanobis distances
  ◮ Differential relative entropy between the associated Gaussians:
    KL(p(x; m_1, A_1) ‖ p(x; m_2, A_2)) = ∫ p(x; m_1, A_1) log [ p(x; m_1, A_1) / p(x; m_2, A_2) ] dx
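When the two Gaussians share a mean, this divergence has the standard closed form. The sketch below (an illustration, not from the slides) evaluates it with A_1 and A_2 taken as precision matrices, matching the parameterization above; this is convenient for checking the objective numerically.

```python
import numpy as np

def kl_same_mean_gaussians(A1, A2):
    """KL( N(m, A1^{-1}) || N(m, A2^{-1}) ) for precision matrices A1, A2 with a shared mean.

    Standard Gaussian KL formula rewritten in terms of precisions:
        KL = 0.5 * ( tr(A2 A1^{-1}) - logdet(A2 A1^{-1}) - d )
    """
    d = A1.shape[0]
    M = A2 @ np.linalg.inv(A1)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - logdet - d)

# Example: divergence to the Gaussian with identity precision (the Euclidean baseline).
A = np.diag([2.0, 0.5, 1.0])
print(kl_same_mean_gaussians(A, np.eye(3)))  # 0 only when A == I
```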
Problem Formulation
Goal: Minimize the differential relative entropy subject to pairwise inequality constraints:

    min   KL(p(x; m, A) ‖ p(x; m, I))
    s.t.  d_A(x_i, x_j) ≤ u   for (i, j) ∈ S,
          d_A(x_i, x_j) ≥ ℓ   for (i, j) ∈ D,
          A ≻ 0
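In practice the constraint sets are often built from labeled data: same-class pairs go into S, different-class pairs into D, and u and ℓ are picked from the empirical distribution of pairwise distances. The sketch below shows one such construction; the percentile choice is a common heuristic and an assumption here, not something stated on these slides.

```python
import numpy as np
from itertools import combinations

def build_constraints(X, y, num_pairs=50, seed=0):
    """Sample (i, j) pairs: same label -> similarity set S, different label -> dissimilarity set D.

    X has shape (n, d) with one point per row; y holds the class labels.
    """
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(X)), 2))
    idx = rng.choice(len(pairs), size=min(num_pairs, len(pairs)), replace=False)
    S = [pairs[k] for k in idx if y[pairs[k][0]] == y[pairs[k][1]]]
    D = [pairs[k] for k in idx if y[pairs[k][0]] != y[pairs[k][1]]]

    # Heuristic thresholds: u and ell as low/high percentiles of squared Euclidean distances.
    dists = [np.sum((X[i] - X[j]) ** 2) for i, j in pairs]
    u, ell = np.percentile(dists, 5), np.percentile(dists, 95)
    return S, D, u, ell
```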
Overview: Optimizing the Model
◮ Show an equivalence between our problem and a low-rank kernel learning problem [Kulis, 2006]
  ◮ Yields closed-form solutions for computing the problem objective
  ◮ Shows that the problem is convex
◮ Use this equivalence to solve our problem efficiently (a simplified projection sketch follows below)
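The slides credit Bregman's method for the efficient, eigendecomposition-free optimization. As a rough illustration of the flavor of such an update (assumptions: hard constraints, no slack or dual-variable bookkeeping, so this is not the authors' full algorithm), projecting A onto a single constraint v^T A v = b under the LogDet divergence has a closed-form rank-one update A ← A + β A v v^T A:

```python
import numpy as np

def logdet_projection(A, v, b):
    """Bregman projection of A onto {A' : v^T A' v = b} under the LogDet divergence.

    Rank-one closed form: A' = A + beta * A v v^T A with beta = (b - p) / p^2, p = v^T A v.
    No eigendecomposition is required, and A' stays positive definite whenever b > 0.
    """
    Av = A @ v
    p = float(v @ Av)
    beta = (b - p) / (p * p)
    return A + beta * np.outer(Av, Av)

def cyclic_projections(A, S, D, X, u, ell, n_sweeps=20):
    """Simplified cyclic sweep: project only when a constraint is violated.

    A full implementation of Bregman's method for inequality constraints also
    maintains dual (correction) variables; they are omitted here for brevity.
    X has shape (d, n) with the points x_1, ..., x_n as columns.
    """
    for _ in range(n_sweeps):
        for (i, j) in S:
            v = X[:, i] - X[:, j]
            if v @ A @ v > u:                 # similarity constraint violated
                A = logdet_projection(A, v, u)
        for (i, j) in D:
            v = X[:, i] - X[:, j]
            if v @ A @ v < ell:               # dissimilarity constraint violated
                A = logdet_projection(A, v, ell)
    return A
```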
Low-Rank Kernel Learning
◮ Given X = [x_1 x_2 ... x_n], x_i ∈ ℝ^d, define K_0 = X^T X
◮ Constraints: similarity (S) or dissimilarity (D) between pairs of points
◮ Objective: Learn K that minimizes the divergence to K_0:

    min   D_Burg(K, K_0)
    s.t.  K_ii + K_jj − 2 K_ij ≤ u   for (i, j) ∈ S,
          K_ii + K_jj − 2 K_ij ≥ ℓ   for (i, j) ∈ D,
          K ⪰ 0

◮ D_Burg is the Burg divergence:
    D_Burg(K, K_0) = tr(K K_0^{-1}) − log det(K K_0^{-1}) − n
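Below is a direct transcription of D_Burg for full-rank positive definite inputs (an added illustration). Note that for the low-rank kernel K_0 = X^T X with n > d, the divergence is taken over the range space of K_0; this simple sketch does not handle that case.

```python
import numpy as np

def burg_divergence(K, K0):
    """Burg (LogDet) divergence D_Burg(K, K0) = tr(K K0^{-1}) - logdet(K K0^{-1}) - n.

    Assumes K and K0 are symmetric positive definite (full-rank case only).
    """
    n = K.shape[0]
    M = K @ np.linalg.inv(K0)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n

# D_Burg(K, K) = 0, and the divergence is asymmetric in its arguments.
K0 = np.array([[2.0, 0.5], [0.5, 1.0]])
K  = np.array([[1.5, 0.2], [0.2, 1.2]])
print(burg_divergence(K0, K0))                          # ~0.0
print(burg_divergence(K, K0), burg_divergence(K0, K))   # generally different
```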
Equivalence to Kernel Learning [Kulis, 2006]
Let K be the optimal solution to the low-rank kernel learning problem.
◮ Then K has the same range space as K_0
◮ K = X^T W^T W X

Theorem: Let K = X^T W^T W X be an optimal solution to the low-rank kernel learning problem.
◮ Then A = W^T W is an optimal solution to the corresponding metric learning problem
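The correspondence between the two constraint sets is easy to sanity-check numerically: with A = W^T W and K = X^T A X, the kernel-space quantity K_ii + K_jj − 2 K_ij equals the Mahalanobis distance d_A(x_i, x_j). The sketch below verifies this for random data (an illustration, not part of the original proof).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 10
X = rng.normal(size=(d, n))           # columns are the points x_1, ..., x_n
W = rng.normal(size=(d, d))
A = W.T @ W                           # Mahalanobis matrix
K = X.T @ A @ X                       # corresponding kernel matrix

i, j = 2, 7
diff = X[:, i] - X[:, j]
d_A = diff @ A @ diff                 # Mahalanobis distance d_A(x_i, x_j)
k_dist = K[i, i] + K[j, j] - 2 * K[i, j]
print(np.isclose(d_A, k_dist))        # True
```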
Proof Sketch
Lemma 1: D_Burg(K, K_0) = 2 KL(p(x; m, A) ‖ p(x; m, I)) + c
◮ Establishes that the objectives of the two problems agree up to an additive constant and scaling
◮ Builds on a recent connection relating the relative entropy between Gaussians and the Burg divergence [Davis, 2006]