Factorization of the Label Conditional Distribution for Multi-Label Classification
ECML PKDD 2015 International Workshop on Big Multi-Target Prediction

Maxime Gasse, Alex Aussem, Haytham Elghazel
LIRIS Laboratory, UMR 5205 CNRS, University of Lyon 1, France

September 11, 2015
Outline

◮ Multi-label classification
  ◮ Unified probabilistic framework
  ◮ Hamming loss vs subset 0/1 loss
◮ Factorization of the joint conditional distribution of the labels
  ◮ Irreducible label factors
  ◮ The ILF-Compo algorithm
◮ Experimental results
  ◮ Toy problem
  ◮ Benchmark data sets

This work was recently presented at ICML (Gasse, Aussem, and Elghazel 2015).
Unified probabilistic framework

Find a mapping $h$ from a space of features $\mathcal{X}$ to a space of labels $\mathcal{Y}$:

$x \in \mathbb{R}^d$, $y \in \{0, 1\}^c$, $h : \mathcal{X} \to \mathcal{Y}$

The risk-minimizing model $h^\star$ with respect to a loss function $L$ is defined over $p(X, Y)$ as

$h^\star = \arg\min_h \mathbb{E}_{X,Y}[L(Y, h(X))]$

The point-wise best prediction requires only $p(Y \mid X)$:

$h^\star(x) = \arg\min_y \mathbb{E}_{Y \mid x}[L(Y, y)]$

The current trend is to exploit label dependence to improve MLC... but under which loss function?
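As a concrete illustration (mine, not from the talk), a minimal brute-force sketch of the point-wise risk minimizer: it assumes $p(Y \mid x)$ is available as an explicit table over label vectors, which is only feasible for small $c$.

```python
import itertools
import numpy as np

def bayes_optimal(p_y_given_x, loss, c):
    """Point-wise risk minimizer: enumerate all 2^c candidate label
    vectors y and return the one minimizing E_{Y|x}[L(Y, y)].

    p_y_given_x: dict mapping each label vector (a tuple of 0/1)
    to its conditional probability p(y | x)."""
    candidates = itertools.product([0, 1], repeat=c)

    def risk(y):
        # expected loss of predicting y, under the given p(Y | x)
        return sum(p * loss(np.array(y_true), np.array(y))
                   for y_true, p in p_y_given_x.items())

    return min(candidates, key=risk)
```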
Hamming loss vs subset 0/1 loss

Hamming loss:
$L_H(y, h(x)) = \frac{1}{c} \sum_{i=1}^{c} \mathbf{1}(y_i \neq h_i(x))$

Subset 0/1 loss:
$L_S(y, h(x)) = \mathbf{1}(y \neq h(x))$

BR (Binary Relevance) is optimal for the Hamming loss, with $c$ parameters:
$h^\star_H(x) = \arg\max_y \prod_{i=1}^{c} p(y_i \mid x)$

LP (Label Powerset) is optimal for the subset 0/1 loss, with $2^c$ parameters:
$h^\star_S(x) = \arg\max_y p(y \mid x)$

$p(Y \mid x)$ is much harder to estimate than the marginals $p(Y_i \mid x)$... can we use the label dependencies to better model $p(Y \mid x)$?
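To make the two risk minimizers concrete, a hedged sketch (same assumed table representation of $p(Y \mid x)$ as above): BR thresholds the per-label marginals, LP returns the joint mode.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # L_H: fraction of individual labels predicted incorrectly
    return np.mean(y_true != y_pred)

def subset_loss(y_true, y_pred):
    # L_S: 1 unless the whole label vector is exactly right
    return float(not np.array_equal(y_true, y_pred))

def br_predict(p_y_given_x, c):
    """Hamming-loss optimal, BR-style: threshold each marginal p(Y_i = 1 | x)."""
    marginals = [sum(p for y, p in p_y_given_x.items() if y[i] == 1)
                 for i in range(c)]
    return tuple(int(m > 0.5) for m in marginals)

def lp_predict(p_y_given_x):
    """Subset 0/1-loss optimal, LP-style: mode of the joint p(Y | x)."""
    return max(p_y_given_x, key=p_y_given_x.get)
```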
Hamming loss vs subset 0/1 loss

A quick example: who is in the picture?

Jean  René  p(J, R | x)
 0     0       0.02
 0     1       0.10
 1     0       0.13
 1     1       0.75

HLoss optimal: J = 1, R = 1 (88%, 85%)
SLoss optimal: J = 1, R = 1 (75%)

With a less peaked joint distribution, the two criteria disagree:

Jean  René  p(J, R | x)
 0     0       0.02
 0     1       0.46
 1     0       0.44
 1     1       0.08

HLoss optimal: J = 1, R = 1 (52%, 54%)
SLoss optimal: J = 0, R = 1 (46%)
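Running the sketch above on the second table reproduces the disagreement: the Hamming-optimal prediction is (1, 1) while the subset-optimal one is (0, 1).

```python
p = {(0, 0): 0.02, (0, 1): 0.46, (1, 0): 0.44, (1, 1): 0.08}
print(br_predict(p, c=2))  # (1, 1): the marginals are 0.52 and 0.54
print(lp_predict(p))       # (0, 1): the joint mode, with probability 0.46
```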
Factorization of the joint conditional distribution

Depending on the dependency structure between the labels and the features, the problem of modeling the joint conditional distribution may actually be decomposed into a product of label factors:

$p(Y \mid X) = \prod_{Y_{LF} \in \mathcal{P}_Y} p(Y_{LF} \mid X)$

$\arg\max_y p(y \mid x) = \prod_{Y_{LF} \in \mathcal{P}_Y} \arg\max_{y_{LF}} p(y_{LF} \mid x)$

with $\mathcal{P}_Y$ a partition of $Y$.

Definition. We say that $Y_{LF} \subseteq Y$ is a label factor iff $Y_{LF} \perp\!\!\!\perp Y \setminus Y_{LF} \mid X$. Additionally, $Y_{LF}$ is said to be irreducible iff none of its non-empty proper subsets is a label factor.

We seek the factorization into the (unique) irreducible label factors, ILF.
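A sketch of how the factorization pays off at prediction time: once a partition $\mathcal{P}_Y$ into label factors is known, the joint mode is assembled from the per-factor modes, so the maximization runs over each factor's label vectors instead of all $2^c$ candidates. `factor_models` is an assumed mapping from each factor to its own conditional distribution table, with hypothetical numbers.

```python
def factored_predict(factor_models):
    """factor_models: dict mapping a tuple of label indices (one label
    factor) to a table p(Y_LF | x) over that factor's label vectors.
    Returns the full argmax of p(Y | x), assembled factor by factor."""
    c = sum(len(factor) for factor in factor_models)
    y = [None] * c
    for factor, table in factor_models.items():
        best = max(table, key=table.get)  # per-factor mode
        for idx, value in zip(factor, best):
            y[idx] = value
    return tuple(y)

# e.g. labels {0, 2} form one factor and {1} another (hypothetical tables)
factor_models = {
    (0, 2): {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.1},
    (1,):   {(0,): 0.3, (1,): 0.7},
}
print(factored_predict(factor_models))  # (1, 1, 0)
```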
Graphical characterization

Theorem. Let $G$ be an undirected graph whose nodes correspond to the random variables in $Y$, and in which two nodes $Y_i$ and $Y_j$ are adjacent iff $\exists Z \subseteq Y \setminus \{Y_i, Y_j\}$ such that $\{Y_i\} \not\perp\!\!\!\perp \{Y_j\} \mid X \cup Z$. Then two labels $Y_i$ and $Y_j$ belong to the same irreducible label factor iff a path exists between $Y_i$ and $Y_j$ in $G$.

This requires $O(c^2 2^c)$ pairwise tests of conditional independence to characterize the irreducible label factors: one test per pair $(Y_i, Y_j)$ and per conditioning set $Z$.

Much easier if we assume the Composition property.
The Composition property

The dependency of a whole implies the dependency of some part:

$X \not\perp\!\!\!\perp Y \cup W \mid Z \;\Rightarrow\; X \not\perp\!\!\!\perp Y \mid Z \,\lor\, X \not\perp\!\!\!\perp W \mid Z$

Weak assumption: several existing methods and algorithms already assume the Composition property (e.g. forward feature selection).

Typical counter-example: the exclusive OR relationship,

$A = B \oplus C \;\Rightarrow\; \{A\} \not\perp\!\!\!\perp \{B, C\} \,\land\, \{A\} \perp\!\!\!\perp \{B\} \,\land\, \{A\} \perp\!\!\!\perp \{C\}$
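A numerical check of the XOR counter-example (my own illustration): with $B$ and $C$ uniform and $A = B \oplus C$, $A$ is marginally independent of $B$ (and, symmetrically, of $C$), yet a deterministic function of the pair, so Composition fails.

```python
import itertools

# Uniform distribution over (B, C); A is their exclusive OR.
joint = {}
for b, c in itertools.product([0, 1], repeat=2):
    a = b ^ c
    joint[(a, b, c)] = 0.25

def marginal(joint, keep):
    """Sum out all coordinates except those in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_ab = marginal(joint, (0, 1))
p_a, p_b = marginal(joint, (0,)), marginal(joint, (1,))
# A is independent of B: every pairwise cell factorizes ...
print(all(abs(p_ab[(a, b)] - p_a[(a,)] * p_b[(b,)]) < 1e-12
          for (a, b) in p_ab))            # True
# ... yet A is determined by (B, C): p(A=1 | B=0, C=1) = 1
print(joint.get((1, 0, 1), 0.0) / 0.25)   # 1.0
```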
Graphical characterization - assuming Composition

Theorem. Suppose $p$ supports the Composition property. Let $G$ be an undirected graph whose nodes correspond to the random variables in $Y$, and in which two nodes $Y_i$ and $Y_j$ are adjacent iff $\{Y_i\} \not\perp\!\!\!\perp \{Y_j\} \mid X$. Then two labels $Y_i$ and $Y_j$ belong to the same irreducible label factor iff a path exists between $Y_i$ and $Y_j$ in $G$.

Only $O(c^2)$ pairwise tests are needed. Moreover,

Theorem. Suppose $p$ supports the Composition property, and consider $M_i$ an arbitrary Markov blanket of $Y_i$ in $X$. Then $\{Y_i\} \not\perp\!\!\!\perp \{Y_j\} \mid X$ holds iff $\{Y_i\} \not\perp\!\!\!\perp \{Y_j\} \mid M_i$.
ILF-Compo algorithm

Generic procedure (a schematic sketch in code follows the list):
◮ For each label $Y_i$, compute a Markov boundary $M_i$ in $X$.
◮ For each pair of labels $(Y_i, Y_j)$, test $\{Y_i\} \not\perp\!\!\!\perp \{Y_j\} \mid M_i$ to build $G$.
◮ Extract the partition $\mathsf{ILF} = \{Y_{LF_1}, \ldots, Y_{LF_m}\}$ from $G$ (its connected components).
◮ Decompose the multi-label problem into a series of independent multi-class problems, one per label factor.
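A schematic sketch of the generic procedure, with the statistical machinery abstracted away: `markov_boundary` and `dependent` are placeholders for any Markov-boundary learner and conditional independence test (this is my own rendering, not the authors' reference implementation), and the connected components of $G$ are extracted with a small union-find.

```python
from itertools import combinations

def ilf_compo(c, markov_boundary, dependent):
    """Schematic ILF-Compo, following the four steps on the slide.

    c: number of labels.
    markov_boundary(i): returns M_i, a Markov boundary of Y_i in X
        (any Markov-boundary learner can be plugged in here).
    dependent(i, j, M_i): a conditional independence test deciding
        whether {Y_i} and {Y_j} are dependent given M_i.
    Returns the partition ILF as a list of label-index sets."""
    parent = list(range(c))                # union-find over the labels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    boundaries = [markov_boundary(i) for i in range(c)]
    for i, j in combinations(range(c), 2):
        # edge Y_i -- Y_j in G iff {Y_i} is dependent on {Y_j} given M_i
        if dependent(i, j, boundaries[i]):
            parent[find(i)] = find(j)      # merge the two components

    factors = {}
    for i in range(c):
        factors.setdefault(find(i), set()).add(i)
    # each factor can then be solved as one multi-class (LP-style) problem
    return list(factors.values())
```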