Extreme Classification
A New Paradigm for Ranking & Recommendation

Manik Varma
Microsoft Research
Classification
• Binary: pick one of two labels
• Multi-class: pick one of Label 1 … Label L
• Multi-label: pick all that apply from Label 1 … Label L
Extreme Multi-label Learning
• Learning with millions of labels
• Example: predict the set of monetizable Bing queries that might lead to a click on this ad:
  geico auto insurance
  geico car insurance
  geico insurance
  www geico com
  care geicos
  geico com
  need cheap auto insurance wisconsin
  cheap car insurance quotes
  cheap auto insurance florida
  all state car insurance coupon code
MLRF: Multi-label Random Forests [Agrawal, Gupta, Prabhu & Varma, WWW 2013]
Research Problems
• Defining millions of labels
• Obtaining good quality training data
• Training using limited resources
• Log time and log space prediction
• Obtaining discriminative features at scale
• Performance evaluation
• Dealing with tail labels and label correlations
• Dealing with missing and noisy labels
• Statistical guarantees
• Applications
Extreme Multi-label Learning – People
• Which people are present in this selfie?
Extreme Multi-label Learning – Wikipedia
• Labels: Living people, American computer scientists, Formal methods people, Carnegie Mellon University faculty, Massachusetts Institute of Technology alumni, Academic journal editors, Women in technology, Women computer scientists
Reformulating ML Problems
• Ranking or recommending millions of items
[Diagram: items mapped onto labels, Label 1, Label 2, …, Label 1M]
FastXML
A Fast, Accurate & Stable Tree-classifier for eXtreme Multi-label Learning

Yashoteja Prabhu (IIT Delhi)
Manik Varma (Microsoft Research)
FastXML
• Logarithmic time prediction in milliseconds
• Ensemble of balanced tree classifiers
• Accuracy gains of up to 25% over competing methods
• Nodes partitioned using nDCG
• Up to 1000x faster training than the state-of-the-art
• Alternating minimization based optimization
• Proof of convergence to a stationary point
Extreme Multi-Label Learning
• Problem formulation: $f : \mathcal{X} \to 2^{\mathcal{Y}}$, where $\mathcal{X}$ is the space of users and $\mathcal{Y}$ is the set of items
Extreme Multi-Label Learning
• Problem formulation: $f$ maps each data point to the set of its relevant labels
[Illustration: $f$ applied to an example point]
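A minimal sketch of this formulation (assuming NumPy; the per-label scores are a hypothetical upstream model's output): the predictor maps a feature vector to a subset of the label indices, i.e. an element of $2^{\mathcal{Y}}$.

```python
import numpy as np

def predict_label_set(scores: np.ndarray, threshold: float = 0.5) -> set:
    """f : X -> 2^Y, realized here by thresholding per-label scores."""
    return set(np.flatnonzero(scores >= threshold))

# e.g. with L = 5 labels:
print(predict_label_set(np.array([0.9, 0.1, 0.7, 0.2, 0.6])))  # {0, 2, 4}
```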
Tree Based Extreme Classification
• Prediction in logarithmic time
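As a sketch of why prediction is logarithmic (the structure and names below are illustrative, not the paper's code): each balanced tree routes a test point through $O(\log n)$ internal separators to a leaf, and the leaf label distributions are averaged across the ensemble.

```python
import numpy as np

class Node:
    def __init__(self, w=None, left=None, right=None, label_dist=None):
        self.w = w                    # linear separator at an internal node
        self.left, self.right = left, right
        self.label_dist = label_dist  # label distribution stored at a leaf

def predict(trees, x, k=5):
    """Average the leaf distributions reached in each tree; return the top-k labels."""
    total = None
    for root in trees:
        node = root
        while node.label_dist is None:            # O(depth) = O(log n) per tree
            node = node.left if node.w @ x < 0 else node.right
        total = node.label_dist if total is None else total + node.label_dist
    return np.argsort(-total)[:k]
```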
FastXML Architecture
[Figure: the FastXML ensemble of balanced trees]
Learning to Partition a Node
[Figure: training data at a node, to be split into two child nodes]
Learning to Partition a Node

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 - D \sum_{j \in \text{Users}} \text{nDCG}(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x})$$

The learnt separator $\mathbf{x}$ splits $\mathcal{X}$, the space of users, into the half-spaces $\mathbf{x}^\top \mathbf{y} < 0$ and $\mathbf{x}^\top \mathbf{y} > 0$.
Optimizing nDCG
• nDCG is hard to optimize
• nDCG is non-convex and non-smooth
• Large input variations → no change in nDCG
• Small input variations → large changes in nDCG

$$\text{nDCG} \propto \text{like}(j, \mathbf{s}_1) + \sum_{m=2}^{M} \frac{\text{like}(j, \mathbf{s}_m)}{\log(m+1)}$$

$$\text{like}(j, \mathbf{s}_m) = \begin{cases} 1 & \text{if user } j \text{ likes the item with rank } \mathbf{s}_m \\ 0 & \text{otherwise} \end{cases}$$
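A small sketch of this gain (assuming NumPy; with base-2 logarithms the rank-1 discount is exactly 1, matching the standalone like(j, s_1) term above):

```python
import numpy as np

def dcg_at_k(likes, k):
    """sum over ranks m = 1..k of likes[m-1] / log2(m + 1); rank 1 has discount 1."""
    likes = np.asarray(likes, dtype=float)[:k]
    return float(np.sum(likes / np.log2(np.arange(2, likes.size + 2))))

def ndcg_at_k(likes, k):
    """Normalize by the ideal (sorted) ordering so nDCG lies in [0, 1]."""
    ideal = dcg_at_k(np.sort(likes)[::-1], k)
    return dcg_at_k(likes, k) / ideal if ideal > 0 else 0.0

# likes[m-1] = 1 if the user likes the item placed at rank m
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ~0.92
```

The non-smoothness is visible here: permuting items within equal relevance leaves nDCG unchanged, while swapping a liked and unliked item across ranks changes it discontinuously.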
Optimizing nDCG

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 - D \sum_{j \in \text{Users}} \text{nDCG}(\mathbf{y}_j, \mathbf{z}_j, \mathbf{x})$$
Optimizing nDCG – Reformulation

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}^{\pm}}\ \|\mathbf{x}\|_1 + \sum_j D_\varepsilon(\varepsilon_j)\,\log\!\left(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\right) - D_s \sum_j \text{nDCG}\!\left(\mathbf{s}^{\varepsilon_j}, \mathbf{z}_j\right)$$

where $\varepsilon_j \in \{-1, +1\}$ assigns user $j$ to one of the two child nodes and $\mathbf{s}^{\pm}$ are the item rankings for the two children.
Optimizing nDCG – Initialization
• Initialize the user assignments at random:

$$\varepsilon_j \sim \text{Bernoulli}(0.5)\,, \ \forall j$$
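A one-line sketch of this initialization (NumPy; n_users is a placeholder for the number of users reaching the node):

```python
import numpy as np

n_users = 1000                                        # users reaching this node (placeholder)
eps = np.where(np.random.rand(n_users) < 0.5, -1, 1)  # epsilon_j = ±1 with probability 1/2
```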
Optimizing nDCG – Initialization
• Initialize the rankings given the user assignments:

$$\mathbf{s}^{\pm*} = \text{rank}\!\left(\sum_{j:\, \varepsilon_j = \pm 1} \mathbf{z}_j\right)$$
Optimizing nDCG – Repartitioning Users
• Reassign each user to the side with the lower cost:

$$\varepsilon_j^{*} = \text{sign}\!\left(w_j^{-} - w_j^{+}\right)$$

$$w_j^{\pm} = D_\varepsilon(\pm 1)\,\log\!\left(1 + e^{\mp \mathbf{x}^\top \mathbf{y}_j}\right) - D_s\, \text{nDCG}\!\left(\mathbf{s}^{\pm}, \mathbf{z}_j\right)$$
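A sketch of this update in the slide's notation (assuming NumPy; array names are illustrative, and $D_\varepsilon(\pm 1)$ is taken as a single constant for simplicity):

```python
import numpy as np

def repartition_users(x, Y, ndcg_pos, ndcg_neg, d_eps=1.0, d_s=1.0):
    """epsilon_j* = sign(w_j^- - w_j^+) for every user j.

    x        : (d,)   current separator
    Y        : (n, d) user feature vectors y_j
    ndcg_pos : (n,)   nDCG(s^+, z_j) for each user
    ndcg_neg : (n,)   nDCG(s^-, z_j) for each user
    """
    margin = Y @ x                                          # x^T y_j for every user
    w_plus = d_eps * np.log1p(np.exp(-margin)) - d_s * ndcg_pos
    w_minus = d_eps * np.log1p(np.exp(margin)) - d_s * ndcg_neg
    eps = np.sign(w_minus - w_plus)
    eps[eps == 0] = 1                                       # break exact ties
    return eps
```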
Optimizing nDCG – Reranking Items
• Rerank the items on each side given the updated user assignments:

$$\mathbf{s}^{\pm*} = \text{rank}\!\left(\sum_{j:\, \varepsilon_j = \pm 1} \mathbf{z}_j\right)$$
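A sketch of the reranking step (NumPy; Z stacks the user label vectors z_j row-wise): each side ranks items by the aggregated label counts of its users.

```python
import numpy as np

def rerank_items(Z, eps):
    """s^{±*} = rank of the summed z_j over users with epsilon_j = ±1."""
    agg_pos = Z[eps > 0].sum(axis=0)   # aggregated label vector of the + side
    agg_neg = Z[eps < 0].sum(axis=0)
    s_pos = np.argsort(-agg_pos)       # most-liked items ranked first
    s_neg = np.argsort(-agg_neg)
    return s_pos, s_neg
```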
Optimizing nDCG
• Alternate the repartitioning and reranking updates, then learn the separator:

$$\min_{\mathbf{x},\,\boldsymbol{\varepsilon},\,\mathbf{s}^{\pm}}\ \|\mathbf{x}\|_1 + \sum_j D_\varepsilon(\varepsilon_j)\,\log\!\left(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\right) - D_s \sum_j \text{nDCG}\!\left(\mathbf{s}^{\varepsilon_j}, \mathbf{z}_j\right)$$
Optimizing nDCG – Hyperplane Separator
• With the assignments $\boldsymbol{\varepsilon}$ fixed, the nDCG terms are constant, so the separator $\mathbf{x}$ is found by $\ell_1$-regularized logistic regression:

$$\min_{\mathbf{x}}\ \|\mathbf{x}\|_1 + \sum_j D_\varepsilon(\varepsilon_j)\,\log\!\left(1 + e^{-\varepsilon_j \mathbf{x}^\top \mathbf{y}_j}\right)$$
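Since this subproblem is a standard $\ell_1$-regularized logistic regression, here is an illustrative sketch via scikit-learn (the paper uses its own solver; this is only a stand-in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_separator(Y, eps, C=1.0):
    """Approximately solve min_x ||x||_1 + sum_j log(1 + exp(-eps_j x^T y_j))."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(Y, eps)               # features: user vectors y_j; targets: sides ±1
    return clf.coef_.ravel()      # sparse x; zero entries prune features at the node

# usage: x = fit_separator(Y, eps); route users by the sign of Y @ x
```

The $\ell_1$ penalty is what keeps each node's separator sparse, which in turn keeps prediction fast.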
Data Set Statistics

Small data sets
  Data Set     # Training Points   # Test Points   # Dimensions   # Labels
  Delicious           12,920            3,185              500         983
  MediaMill           30,993           12,914              120         101
  RCV1-X             781,265           23,149           47,236       2,456
  BibTeX               4,880            2,515            1,836         159

Large data sets (all counts in millions)
  Data Set     # Training Points   # Test Points   # Dimensions   # Labels
  WikiLSHTC            1.89             0.47             1.62        0.33
  Ads-430K             1.12             0.50             0.088       0.43
  Ads-1M               3.92             1.56             0.16        1.08
  Ads-9M              70.46            22.63             2.08        8.84
Results on Small Data Sets
[Bar charts: Precision@1, @3 and @5 (P1, P3, P5) on Delicious, MediaMill, BibTeX and RCV1-X, comparing FastXML, MLRF, LPSR, 1-vs-All, LEML and CS]
Large Data Sets – WikiLSHTC

Dataset statistics
  Training Points    1,892,600
  Features           1,617,899 (sparse)
  Labels               325,056
  Test Points          472,835

[Bar charts: Precision@1/3/5, test time (ms) and training time (hr) for FastXML, LPSR-NB and LEML; reported training times range from 0.33 to 243.00 hours]
Large Data Sets – Ads
[Bar charts: Precision@1/3/5 and test time (ms) on Ads-430K (FastXML, LPSR-NB, LEML), Ads-1M (FastXML, LPSR-NB) and Ads-9M (FastXML only)]
Training Times in Hours Versus Cores
[Bar charts: training time in hours on WikiLSHTC, Ads-430K and Ads-1M as the number of cores increases through 1, 2, 4, 8 and 16]
Conclusions
• Extreme classification
  – Tackles applications with millions of labels
  – A new paradigm for recommendation
• FastXML
  – Significantly higher prediction accuracy
  – Can be trained on a single desktop
• Publications and code
  – WWW 2013, KDD 2014 and NIPS 2015 papers
  – Code and data available from my website
Unbiased Performance Evaluation

Himanshu Jain (IIT Delhi)
Yashoteja Prabhu (IIT Delhi)
Manik Varma (Microsoft Research)
Traditional Loss/Gain Functions
• Hamming loss
• Subset 0/1 loss
• Precision
• Recall
• F-score
• Jaccard distance
(see the sketch of these metrics below)
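A quick sketch of these on multi-label indicator matrices (using scikit-learn; note jaccard_score is a similarity, so the Jaccard distance is one minus it):

```python
import numpy as np
from sklearn.metrics import (hamming_loss, zero_one_loss, precision_score,
                             recall_score, f1_score, jaccard_score)

Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1]])

print(hamming_loss(Y_true, Y_pred))                          # fraction of wrong label slots
print(zero_one_loss(Y_true, Y_pred))                         # subset 0/1: any error counts fully
print(precision_score(Y_true, Y_pred, average="samples"))
print(recall_score(Y_true, Y_pred, average="samples"))
print(f1_score(Y_true, Y_pred, average="samples"))
print(1 - jaccard_score(Y_true, Y_pred, average="samples"))  # Jaccard distance
```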