Similarity Learning for Provably Accurate Sparse Linear Classification (ICML 2012)
Aurélien Bellet, Amaury Habrard, Marc Sebban
Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Etienne
Alicante, September 2012
Introduction: Supervised Classification, Similarity Learning
Similarity functions in classification

Common approach in supervised classification: learn to classify objects using a pairwise similarity (or distance) function. Successful examples: k-Nearest Neighbors (k-NN), Support Vector Machines (SVM).

Best way to get a "good" similarity function for a specific task: learn it from data!
Similarity learning

Overview: learn a similarity function $K(x, x')$ inducing a new instance space in which the performance of a given algorithm is improved.

Very popular approach: learn a positive semi-definite (PSD) matrix $M \in \mathbb{R}^{d \times d}$ that parameterizes a (squared) Mahalanobis distance

$d_M^2(x, x') = (x - x')^\top M (x - x')$

according to local constraints.
Mahalanobis distance learning

Existing methods typically use two types of constraints (derived from the labels):
- equivalence constraints ($x$ and $x'$ are similar/dissimilar), or
- relative constraints ($x$ is more similar to $x'$ than to $x''$).

Goal: find the $M$ that best satisfies the constraints. $d_M$ is then plugged into a k-NN classifier (or a clustering algorithm) and is expected to improve results w.r.t. the Euclidean distance.
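As a concrete illustration (ours, not from the slides), here is a minimal NumPy sketch of the squared Mahalanobis distance; the function name and test points are our own:

```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d_M^2(x, x') = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# With M = I, this reduces to the squared Euclidean distance.
x, x_prime = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(mahalanobis_sq(x, x_prime, np.eye(2)))  # 2.0
```

Metric learning methods then optimize M (keeping it PSD) so that this distance satisfies the constraints above.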
Motivation of our work

Limitations of Mahalanobis distance learning:
- Must enforce $M \succeq 0$ (costly).
- No theoretical link between the learned metric and the error of the classifier.
- $d_M$ is learned from local constraints. This works well in practice with k-NN (based on a local neighborhood), but is not really appropriate for global classifiers.

Goal of our work: learn a non-PSD similarity function, designed to improve global linear classifiers, with theoretical guarantees on the classifier error. This builds on the theory of (ε, γ, τ)-good similarity functions.
(ε, γ, τ)-Good Similarity Functions
Definition

The theory of Balcan et al. (2006, 2008) bridges the gap between the properties of a similarity function and its performance in linear classification. They proposed the following definition.

Definition (Balcan et al., 2008). A similarity function $K \in [-1, 1]$ is an (ε, γ, τ)-good similarity function for a learning problem $P$ if there exists an indicator function $R(x)$ defining a set of "reasonable points" such that the following conditions hold:
1. A $1 - \epsilon$ probability mass of examples $(x, \ell)$ satisfy $E_{(x', \ell') \sim P}\left[\ell \ell' K(x, x') \mid R(x')\right] \geq \gamma$,
2. $\Pr_{x'}[R(x')] \geq \tau$,
with $\epsilon, \gamma, \tau \in [0, 1]$.
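To make the definition concrete, here is a hypothetical sketch (ours) that estimates ε and τ on a finite sample, given a similarity function and a candidate set of reasonable points:

```python
import numpy as np

def empirical_goodness(K, X, y, reasonable, gamma):
    """Estimate (epsilon, tau) for K on a sample, for a fixed gamma.

    K          : similarity function with values in [-1, 1]
    X, y       : examples and their labels in {-1, +1}
    reasonable : boolean mask selecting the candidate "reasonable points"
    """
    R = np.where(reasonable)[0]
    tau = len(R) / len(X)
    # Empirical version of E[l l' K(x, x') | R(x')] for each example.
    margins = np.array([np.mean([yi * y[j] * K(xi, X[j]) for j in R])
                        for xi, yi in zip(X, y)])
    epsilon = float(np.mean(margins < gamma))  # fraction of margin violations
    return epsilon, tau
```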
Intuition behind the definition

[Figure: eight 2D points A-H, split into a positive and a negative class; A, C and G are the reasonable points.]

$K(x, x') = -\|x - x'\|_2$ is good with $\epsilon = 0$, $\gamma = 0.03$, $\tau = 3/8$:
$\forall (x, \ell_x): \frac{\ell_x}{3}\left(K(x, A) + K(x, C) - K(x, G)\right) \geq 0.03$
Intuition behind the definition (cont.)

[Figure: the same eight points and reasonable points A, C, G.]

$K(x, x') = -\|x - x'\|_2$ is good with $\epsilon = 1/8$, $\gamma = 0.12$, $\tau = 3/8$. With the example $(E, -1)$:
$\frac{-1}{3}\left(K(E, A) + K(E, C) - K(E, G)\right) < 0.12$

i.e., E is the one example (out of 8) that violates the margin $\gamma = 0.12$.
Implications for learning

Strategy: each example $x$ is mapped to the space of its similarity scores with the reasonable points, here $\phi(x) = (K(x, A), K(x, C), K(x, G))$.

[Figure: the data projected into this similarity-score space, with axes K(x, A), K(x, C) and K(x, G).]
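A minimal sketch (ours) of this mapping step, assuming K and the list of reasonable points are given:

```python
import numpy as np

def similarity_map(K, X, landmarks):
    """Map each example to its vector of similarities to the reasonable points."""
    return np.array([[K(x, r) for r in landmarks] for x in X])
```

A standard linear classifier can then be trained on similarity_map(K, X, landmarks).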
Implications for learning (cont.)

Theorem (Balcan et al., 2008). If $K$ is (ε, γ, τ)-good, there exists a linear separator $\alpha$ in the above-defined projection space that has error close to $\epsilon$ at margin $\gamma$.
Hinge loss definition

Hinge-loss version of the definition.

Definition (Balcan et al., 2008). A similarity function $K$ is an (ε, γ, τ)-good similarity function in hinge loss for a learning problem $P$ if there exists a (random) indicator function $R(x)$ defining a (probabilistic) set of "reasonable points" such that the following conditions hold:
1. $E_{(x, \ell) \sim P}\left[[1 - \ell g(x)/\gamma]_+\right] \leq \epsilon$, where $g(x) = E_{(x', \ell') \sim P}[\ell' K(x, x') \mid R(x')]$ and $[1 - c]_+ = \max(1 - c, 0)$ is the hinge loss,
2. $\Pr_{x'}[R(x')] \geq \tau$.

Expectation over the amount of margin violation ⇒ easier to optimize.
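The hinge-loss variant of the empirical check sketched earlier is a short modification (again, our own illustration, not the authors' code):

```python
import numpy as np

def empirical_hinge_goodness(K, X, y, reasonable, gamma):
    """Average hinge loss on the margins instead of a 0/1 count of violations."""
    y = np.asarray(y)
    R = np.where(reasonable)[0]
    # g(x): empirical E[l' K(x, x') | R(x')] for each example x.
    g = np.array([np.mean([y[j] * K(xi, X[j]) for j in R]) for xi in X])
    return float(np.mean(np.maximum(1 - y * g / gamma, 0.0)))  # empirical epsilon
```

Replacing the 0/1 margin test by the hinge loss gives a convex surrogate, which is what makes it easier to optimize.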
Learning rule

Learning the separator $\alpha$ with a linear program:

$\min_{\alpha} \sum_{i=1}^{d_l} \left[1 - \sum_{j=1}^{d_u} \alpha_j \ell_i K(x_i, x'_j)\right]_+ + \lambda \|\alpha\|_1$

Advantage: sparsity. Thanks to the $L_1$-regularization, $\alpha$ will have some zero coordinates (depending on $\lambda$). This makes prediction much faster than (for instance) k-NN.
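Since the hinge loss and the $L_1$ norm are both piecewise linear, this objective can indeed be cast as a linear program. Below is a minimal sketch (ours, not the authors' implementation) using SciPy's LP solver, with the standard split $\alpha = \alpha^+ - \alpha^-$ and one slack variable per hinge term:

```python
import numpy as np
from scipy.optimize import linprog

def learn_separator(K, X, y, landmarks, lam):
    """Solve min_alpha sum_i [1 - sum_j alpha_j y_i K(x_i, x'_j)]_+ + lam * ||alpha||_1
    as a linear program, with alpha = alpha_pos - alpha_neg and slacks s.
    """
    d_l, d_u = len(X), len(landmarks)
    # S[i, j] = y_i * K(x_i, x'_j)
    S = np.array([[yi * K(xi, r) for r in landmarks] for xi, yi in zip(X, y)])
    # Variables z = [alpha_pos (d_u), alpha_neg (d_u), s (d_l)], all >= 0.
    c = np.concatenate([lam * np.ones(2 * d_u), np.ones(d_l)])
    # Hinge constraint s_i >= 1 - S_i . alpha, i.e. -S alpha_pos + S alpha_neg - s <= -1.
    A_ub = np.hstack([-S, S, -np.eye(d_l)])
    b_ub = -np.ones(d_l)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:d_u] - res.x[d_u:2 * d_u]  # alpha; many coordinates end at 0
```

The $L_1$ term $\lambda(\alpha^+ + \alpha^-)$ in the objective is what drives many coordinates of $\alpha$ to exactly zero.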
L1-norm and sparsity

Why does an $L_1$-norm constraint/regularization induce sparsity? Geometric interpretation: the $L_1$ ball is a diamond with corners on the coordinate axes, so the level sets of the loss typically first touch it at a corner, where some coordinates are exactly zero; the $L_2$ ball is round and has no such corners.

[Figure: loss contours meeting the $L_2$ constraint ball vs. the $L_1$ constraint ball.]
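A quick empirical illustration of this effect (ours, using scikit-learn): on the same data, $L_1$ regularization zeroes out coefficients while $L_2$ merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)  # only 2 relevant features

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty
print((lasso.coef_ == 0).sum())  # many exact zeros
print((ridge.coef_ == 0).sum())  # typically none
```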
Learning Good Similarity Functions for Linear Classification
Form of similarity function

We propose to optimize a bilinear similarity $K_A$:

$K_A(x, x') = x^\top A x'$

parameterized by the matrix $A \in \mathbb{R}^{d \times d}$ (not constrained to be PSD nor symmetric).

$K_A$ is efficiently computable for sparse inputs. To ensure $K_A \in [-1, 1]$, we assume the inputs are normalized such that $\|x\|_2 \leq 1$, and we require $\|A\|_F \leq 1$.
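A small sketch (ours) of this similarity and of the norm condition; the projection helper is our own addition, not part of the paper:

```python
import numpy as np

def bilinear_similarity(A):
    """Return K_A(x, x') = x^T A x'.

    If ||x||_2 <= 1, ||x'||_2 <= 1 and ||A||_F <= 1, then |K_A(x, x')| <= 1,
    since the spectral norm of A is bounded by its Frobenius norm.
    """
    return lambda x, x_prime: float(x @ A @ x_prime)

def project_frobenius(A):
    """Project A onto the unit Frobenius-norm ball to enforce ||A||_F <= 1."""
    n = np.linalg.norm(A, "fro")
    return A if n <= 1.0 else A / n
```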