Theoretical results

• Estimation of the threshold on a validation set is statistically consistent, with provable regret bounds. [2]

[2] N. Nagarajan, S. Koyejo, P. Ravikumar, and I. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS 27, pages 2744–2752, 2014.
H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS, 2014.
Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS 27, pages 2123–2131, 2014.
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, 2015.
Online F-measure optimization

• Online update of the threshold by exploiting that F(τ*) = 2τ*. [3]
• Converges to the optimal threshold.
• Requires storing only a small constant number of auxiliary variables.
• Can either be applied on a validation set or run simultaneously with training of the class probability model.
• For large validation sets, one pass over the data should yield an accurate estimate of the threshold.

[3] Róbert Busa-Fekete, Balázs Szörényi, Krzysztof Dembczyński, and Eyke Hüllermeier. Online F-measure optimization. In NIPS 29, 2015.
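The validation-set variant above can be sketched as follows. This is a minimal NumPy illustration, not code from the cited paper: sorting by the estimated probabilities lets every candidate threshold be evaluated in one pass, since predicting positive for the top-k examples gives F = 2·TP / (P + k).

```python
import numpy as np

def tune_threshold(eta, y):
    """Pick the threshold on a validation set that maximizes the F-measure.

    eta : estimated probabilities P(y=1|x) on the validation set
    y   : binary labels (0/1)
    """
    order = np.argsort(-eta)          # descending by score
    y_sorted = y[order]
    P = y.sum()                       # number of true positives in the set
    tp = np.cumsum(y_sorted)          # TP if the top-k examples are predicted positive
    k = np.arange(1, len(y) + 1)
    f = 2.0 * tp / (P + k)            # F-measure for each candidate cut
    best = int(np.argmax(f))
    return eta[order][best], f[best]  # threshold and the F-measure it attains

# toy example (illustrative data)
tau, f1 = tune_threshold(np.array([0.9, 0.8, 0.4, 0.3, 0.1]),
                         np.array([1, 1, 0, 1, 0]))
```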
Online F-measure Maximization

• In each round t:
  ◮ Example x_t is observed,
  ◮ Model g is applied to x_t to get η̂(x_t) = P̂(y_t = 1 | x_t),
  ◮ Prediction ŷ_t is computed as ŷ_t = ⟦η̂(x_t) ≥ τ_{t−1}⟧,
  ◮ Label y_t is revealed,
  ◮ Threshold τ_t is computed as τ_t = F_t/2 = a_t/b_t, with a_t = a_{t−1} + y_t ŷ_t and b_t = b_{t−1} + y_t + ŷ_t (a_0 and b_0 → prior).

(Figure: a timeline showing, for rounds t = 1, 2, ..., the sequence x_t, η̂(x_t), ŷ_t, y_t, τ_t produced in each round.)
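The round-by-round update above can be sketched as an online loop. A minimal sketch: `eta_hat` stands in for the class probability model g, and the toy stream is illustrative.

```python
def ofo(stream, eta_hat, a=1.0, b=2.0):
    """Online F-measure optimization: threshold update per round.

    stream  : iterable of (x, y) pairs with y in {0, 1}
    eta_hat : callable returning the estimated P(y=1|x)
    a, b    : prior counts (a_0, b_0), giving the initial threshold a/b
    """
    tau = a / b
    for x, y in stream:
        yhat = 1 if eta_hat(x) >= tau else 0  # predict with the current threshold
        a += y * yhat                         # a_t = a_{t-1} + y_t * yhat_t
        b += y + yhat                         # b_t = b_{t-1} + y_t + yhat_t
        tau = a / b                           # tau_t = F_t / 2
    return tau

# toy run: the "model" is simply the identity on precomputed scores
tau = ofo([(0.8, 1), (0.3, 0), (0.6, 1), (0.2, 1)], lambda x: x)
```

Only the two scalars a and b need to be stored, which is the constant-memory property claimed on the previous slide.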
Beyond binary problems

• All the above approaches work well.
• Computational issues can almost be ignored in binary problems.
• But do they scale to X-MLC?
Macro-averaging of the F-measure

• m labels.
• Test set of size n, {(x_i, y_i)}_{i=1}^n.
• The true label vector: y_i = (y_{i1}, ..., y_{im}).
• The predicted label vector: ŷ_i = (ŷ_{i1}, ..., ŷ_{im}).
• The macro F-measure:

  F^M = (1/m) Σ_{j=1}^m F(y_{·j}, ŷ_{·j}) = (1/m) Σ_{j=1}^m [ 2 Σ_{i=1}^n y_{ij} ŷ_{ij} / ( Σ_{i=1}^n y_{ij} + Σ_{i=1}^n ŷ_{ij} ) ].

(Figure: two n × m matrices side by side, the true labels y_{ij} and the predicted labels ŷ_{ij}; each column j contributes one binary F-measure to the average.)
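The formula above translates directly into a computation over the two label matrices. A minimal NumPy sketch; treating labels with no true and no predicted positives as contributing F = 0 is an assumption (other conventions exist).

```python
import numpy as np

def macro_f(Y, Yhat):
    """Macro-averaged F-measure over m labels.

    Y, Yhat : (n, m) binary matrices of true and predicted labels.
    """
    tp = (Y * Yhat).sum(axis=0)                   # per-label true positives
    denom = Y.sum(axis=0) + Yhat.sum(axis=0)      # per-label positives + predicted positives
    # 2*TP / denom per label, with 0 where the denominator is 0
    f = np.divide(2.0 * tp, denom,
                  out=np.zeros_like(tp, dtype=float), where=denom > 0)
    return f.mean()
```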
Macro-averaging of the F-measure

• Can be solved by reduction to m independent binary problems of F-measure maximization. [4]
• Can we use the above threshold tuning methods?
• A naive adaptation of them can be costly!
  ◮ We need CPEs for all labels and examples in the validation set.
  ◮ For m > 10^5 and n > 10^5, we need at least 10^10 predictions to be computed and potentially stored.
• Solution:
  ◮ To compute the F-measure we only need the true positive labels (y_{ij} = 1) and the predicted positive labels (ŷ_{ij} = 1).
  ◮ Therefore, to reduce the complexity we need to deliver sparse probability estimates (SPEs).

[4] Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit S. Dhillon. Consistent multilabel classification. In NIPS 29, December 2015.
Outline

1 Extreme multi-label classification
2 The F-measure
3 Efficient sparse probability estimators
4 Experimental results
5 Summary
Efficient sparse probability estimators

• Sparse probability estimates (SPEs): CPEs of the top labels, or CPEs exceeding a given threshold.
• We need multi-label classifiers that efficiently deliver SPEs: efficient sparse probability estimators.
• Two examples: FastXML [5] and PLT [6].

[5] Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, pages 263–272. ACM, 2014.
[6] Kalina Jasinska and Krzysztof Dembczyński. Consistent label tree classifiers for extreme multi-label classification. In The ICML Workshop on Extreme Classification, 2015.
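The two pruning rules that define an SPE can be sketched as follows. A minimal illustration only: real estimators such as label trees avoid ever materializing the dense distribution, whereas this sketch assumes the per-example CPEs are already available.

```python
def sparse_estimates(cpe_items, k=None, eps=None):
    """Turn conditional probability estimates into sparse ones (SPEs).

    cpe_items : iterable of (label, probability) pairs for one example
    k         : keep only the k most probable labels (top-k variant)
    eps       : keep only labels with probability >= eps (threshold variant)
    """
    items = list(cpe_items)
    if eps is not None:
        items = [(l, p) for l, p in items if p >= eps]   # threshold pruning
    if k is not None:
        items = sorted(items, key=lambda lp: -lp[1])[:k]  # top-k pruning
    return dict(items)
```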
FastXML

• Based on standard decision trees. [7]
• Uses an ensemble of trees to improve predictive performance.
• Sparse linear classifiers trained to maximize nDCG in internal nodes.
• Empirical label distributions in leaves.
• Very efficient training procedure.

(Figure: a decision tree with linear splits w_k · x ≥ 0 in the internal nodes and sparse empirical label distributions in the leaves, e.g. η_34(x) = 0.8, η_45(x) = 0.45.)

[7] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
FastXML

• Most importantly: FastXML delivers SPEs.
  ◮ Each leaf node covers only a small part of the feature space, so its empirical label distribution is sparse.