Efficiency/Effectiveness Trade-offs in Learning to Rank
Tutorial @ ECML PKDD 2018
http://learningtorank.isti.cnr.it/

Claudio Lucchese, Ca' Foscari University of Venice, Venice, Italy
Franco Maria Nardini, HPC Lab, ISTI-CNR, Pisa, Italy
The Ranking Problem

Ranking is at the core of several IR tasks:
• Document ranking in Web search
• Ads ranking in Web advertising
• Query suggestion & completion
• Product recommendation
• Song recommendation
• …
The Ranking Problem

Definition: given a query q and a set of objects/documents D, rank D so as to maximize a measure Q of users' satisfaction.

Goal #1: Effectiveness
• Maximize Q!
• But how do we measure Q?

Goal #2: Efficiency
• Make sure the ranking process is feasible and not too expensive
• At Bing, "every 100msec improves revenue by 0.6%. Every millisecond counts." [KDF+13]

[KDF+13] Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1168-1176). ACM.
Agenda

1. Introduction to Learning to Rank (LtR)
   • Background, algorithms, sources of cost in LtR, multi-stage ranking
2. Dealing with the Efficiency/Effectiveness trade-off
   • Feature selection, enhanced learning, approximate scoring, fast scoring
3. Hands-on I
   • Software, data and publicly available tools
   • Traversing regression forests, SoA tools and analysis
4. Hands-on II
   • Training models, pruning strategies, efficient scoring

At the end of the day you'll be able to train a high-quality ranking model, and to exploit SoA tools and techniques to reduce its computational cost by up to 18x!
Document Representations and Ranking Functions

Document representations:
• A document is a multi-set of words
• A document may have fields, it can be split into zones, and it can be enriched with external text data (e.g., anchors)
• Additional information may be useful, such as in-links, out-links, PageRank, # clicks, social links, etc.
• Public LtR datasets expose hundreds of signals

Ranking functions:
• Term weighting [SJ72], Vector Space Model [SB88]
• BM25 [JWR00], BM25F [RZT04]
• Language Modeling [PC98]
• Linear combination of features [MC07]

How to combine hundreds of signals?

[SJ72] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[JWR00] K. Sparck Jones, Steve Walker, and Stephen E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Information Processing & Management, 36(6):809–840, 2000.
[RZT04] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 42–49. ACM, 2004.
[PC98] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281. ACM, 1998.
[MC07] Donald Metzler and W. Bruce Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3):257–274, 2007.
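As a concrete sketch of what these ranking functions compute, here is a minimal Python illustration (not from the slides) of BM25 next to a linear combination of features in the spirit of [MC07]; the parameter values, feature choices, and weights are purely illustrative.

```python
import math

def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 score contribution of a single query term.

    tf: term frequency in the document; df: document frequency of the term;
    doc_len / avg_doc_len: document length normalization; num_docs: collection size.
    k1 and b are the usual free parameters (common default values shown).
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def linear_score(features, weights):
    """Linear combination of ranking features [MC07]: s(q, d) = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, features))

# A document's score can mix text signals with query-independent ones
# (PageRank, # clicks, ...); these values are made up for the example.
features = [bm25_term(3, 120, 250, 300.0, 10_000), 0.42, 2.3]  # [BM25, PageRank, log(#clicks)]
weights = [0.7, 0.2, 0.1]
print(linear_score(features, weights))
```

Learning to rank replaces the hand-tuned weights of such a combination with a model trained from labeled data.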
Ranking as a Supervised Learning Task

[Figure: a training instance is a query q with candidate documents d_1, d_2, d_3, …, d_i and their relevance labels y_1, y_2, y_3, …, y_i; a machine learning algorithm (neural net, SVM, decision tree) minimizes a loss function to produce a ranking model.]
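Public LtR datasets (e.g., the LETOR collections) serialize exactly this kind of training instance in an SVMlight-style text format: one line per query-document pair, carrying the relevance label, the query identifier, and the feature vector. An illustrative excerpt with made-up feature values:

```
2 qid:10 1:0.71 2:0.18 3:0.04 ... 136:0.92   # label y=2 for a document of query 10
0 qid:10 1:0.12 2:0.33 3:0.51 ... 136:0.07   # label y=0 for another document of query 10
```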
Ranking as a Supervised Learning Task

[Figure, left: at training time, queries with labeled documents (q; d_1, …, d_i; y_1, …, y_i) feed a machine learning algorithm (neural net, SVM, decision tree) that minimizes a loss function and outputs a ranking model. Right: at run time, the ranking model scores the candidate documents of an unseen query (s_1, …, s_i); sorting by score yields the top-k results (e.g., d_3, d_4, d_7, d_9, d_6, d_8, d_2, …).]
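The run-time side of the figure amounts to one model invocation per candidate document followed by a (partial) sort. A minimal sketch, assuming a hypothetical `model` object whose `predict` takes a list of feature vectors and returns one score per document:

```python
import heapq

def rank_top_k(model, candidates, k=10):
    """Score each candidate document of a query and return the k best doc ids.

    candidates: list of (doc_id, feature_vector) pairs for one query.
    model: any object with predict(list_of_feature_vectors) -> list_of_scores.
    """
    scores = model.predict([features for _, features in candidates])
    scored = zip(scores, (doc_id for doc_id, _ in candidates))
    # heapq.nlargest does a partial sort: O(n log k) instead of O(n log n).
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]
```

At web scale this per-document scoring cost is exactly where the efficiency side of the trade-off shows up.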
Relevance Labels Generation

• Explicit feedback
  • Thousands of search quality raters
  • Absolute vs. relative judgments [CBCD08]
• Implicit feedback
  • Clicks/query chains [JGP+05, Joa02, RJ05]
  • Unbiased learning-to-rank [JSS17]
• Minimizing annotation cost
  • Active learning [LCZ+10]
  • Deep versus shallow labelling [YR09]

Query/Document Representation

Useful signals:
• Link analysis [H+00]
• Term proximity [RS03]
• Query classification [BSD10]
• Query intent mining [JLN16, LOP+13]
• Finding entities in documents [MW08] and in queries [BOM15]
• Document recency [DZK+10]
• Distributed representations of words and their compositionality [MSC+13]
• Convolutional neural networks [SHG+14]
• …
Evaluation Measures for Ranking

Top 10 retrieved documents:

Rank  Document  Binary label  Graded label
1     d_3       ✓             ☆☆☆☆ (y_3 = 4)
2     d_4       ✗             y_4 = 0
3     d_7       ✓             ☆ (y_7 = 1)
4     d_9       ✗             y_9 = 0
5     d_6       ✗             y_6 = 0
6     d_8       ✗             y_8 = 0
7     d_2       ✓             ☆☆☆ (y_2 = 3)
8     d_5       ✗             y_5 = 0
9     d_1       ✗             y_1 = 0
10    d_10      ✗             y_10 = 0

Precision@10 accounts for binary labels: P@10 = 3/10

Accounting for graded labels: Q@10 = 4 + 1 + 3
Accounting for labels and ranks: Q@10 = 4/1 + 1/3 + 3/7
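A short sketch of these three computations, with the label values taken from the table above:

```python
graded = [4, 0, 1, 0, 0, 0, 3, 0, 0, 0]  # y at ranks 1..10, from the table

# Precision@10: fraction of relevant (y > 0) documents in the top 10.
p_at_10 = sum(1 for y in graded if y > 0) / 10
assert p_at_10 == 3 / 10

# Labels only: sum of gains, ignoring positions.
q_labels = sum(graded)                    # 4 + 1 + 3 = 8

# Labels and ranks: discount each gain by its rank, as on the slide.
q_ranked = sum(y / r for r, y in enumerate(graded, start=1))  # 4/1 + 1/3 + 3/7
print(p_at_10, q_labels, q_ranked)
```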
Evaluation Measures for Ranking

Many measures have the form:

    Q@k = \sum_{r=1}^{k} Gain(d_r) \cdot Discount(r)

• (N)DCG [JK00]: Gain(d) = 2^y - 1, Discount(r) = 1 / \log(r + 1)
• RBP [MZ08]: Gain(d) = I(y), Discount(r) = (1 - p) p^{r-1}
• ERR [CMZG09]: Gain(d_i) = R_i \prod_{j=1}^{i-1} (1 - R_j), with R_i = (2^{y_i} - 1) / 2^{y_{max}}, Discount(r) = 1/r

Do they match user satisfaction?
• ERR correlates better with user satisfaction (clicks and editorials) [CMZG09]
• Results interleaving can be used to compare two rankings [CJRY12]
• "major revisions of the web search rankers [Bing] … The differences between these rankers involve changes of over half a percentage point, in absolute terms, of NDCG"

[JK00] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41–48. ACM, 2000.
[MZ08] Alistair Moffat and Justin Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems (TOIS), 27(1):2, 2008.
[CMZG09] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 621–630. ACM, 2009.
[CJRY12] Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):6, 2012.
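A minimal sketch of the three measures above (not the tutorial's reference code); the persistence parameter `p` for RBP and the maximum grade `y_max` for ERR would be set for your collection, and the DCG discount uses base-2 logarithm as is common in practice:

```python
import math

def dcg_at_k(labels, k):
    """DCG@k with Gain = 2^y - 1 and Discount = 1 / log2(r + 1)."""
    return sum((2 ** y - 1) / math.log2(r + 1)
               for r, y in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    """NDCG@k: DCG normalized by the DCG of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

def rbp(labels, p=0.8):
    """Rank-biased precision: Gain = I(y > 0), Discount = (1 - p) p^(r - 1)."""
    return sum((1 - p) * p ** (r - 1) * (1 if y > 0 else 0)
               for r, y in enumerate(labels, start=1))

def err(labels, y_max=4):
    """Expected reciprocal rank: the user stops at rank i with probability R_i."""
    score, p_reach = 0.0, 1.0            # p_reach = prod_{j<i} (1 - R_j)
    for r, y in enumerate(labels, start=1):
        R = (2 ** y - 1) / 2 ** y_max
        score += p_reach * R / r
        p_reach *= 1 - R
    return score

labels = [4, 0, 1, 0, 0, 0, 3, 0, 0, 0]  # graded labels from the previous slide
print(ndcg_at_k(labels, 10), rbp(labels), err(labels))
```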
Is it an easy or difficult task?

[Figure: NDCG@k and the document scores of d_0, …, d_3 plotted as functions of the model parameters; the rank-based measure changes only where the score curves cross.]

Gradient descent cannot be applied directly:
• Rank-based measures (NDCG, ERR, MAP, …) depend on the sorted order of the documents
• The gradient with respect to a document score is either 0 (the sorted order did not change) or undefined (at a discontinuity)

[Figure: a smooth proxy quality function tracking the stepwise rank-based measure.]

Solution: we need a proxy loss function
• It should be differentiable
• and it should behave similarly to the original cost function
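A tiny numeric illustration (not from the slides) of why the raw gradient is useless: sweep one document's score while the others stay fixed, and NDCG@k is piecewise constant, jumping only where the sorted order changes. It reuses `ndcg_at_k` from the sketch above.

```python
# Requires ndcg_at_k from the previous sketch.
import numpy as np

fixed_scores = [2.0, 1.0, 0.5]        # scores of three competing documents
labels_by_doc = [0, 0, 0, 3]          # the swept document is the only relevant one

for s in np.linspace(0.0, 3.0, 7):    # sweep the relevant document's score
    order = np.argsort([-x for x in fixed_scores + [s]])
    ranked_labels = [labels_by_doc[i] for i in order]
    # NDCG only jumps near the crossing points s = 0.5, 1.0, 2.0
    # and is flat everywhere in between: zero gradient almost everywhere.
    print(f"score={s:.1f}  NDCG@4={ndcg_at_k(ranked_labels, 4):.3f}")
```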
Point-Wise Algorithms

Each document is considered independently of the others:
• No information about the other candidates for the same query is used at training time
• A different cost function is optimized in place of the rank-based measure
• Several approaches: regression, multi-class classification, ordinal regression, … [Liu11]

Among the regression-based approaches: Gradient Boosting Regression Trees [Fri01]
• The Sum of Squared Errors (SSE) is minimized

[Figure: training instances (d_i, y_i) feeding a GBRT training algorithm with SSE loss.]

[Liu11] Tie-Yan Liu. Learning to rank for information retrieval, 2011. Springer.
[Fri01] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
Gradient Boosting Regression Trees

[Figure: weak learners t_1, t_2, t_3 fitted in sequence to the residual errors y - F(d); each f_i(d) reduces the remaining error on the predicted document score.]

Iterative algorithm:

    F(d) = \sum_i f_i(d)    (predicted document score)

Each f_i is regarded as a step in the best optimization direction, i.e., a steepest descent step:

    f_i(d) = -\rho_i g_i(d),    g_i(d) = \left[ \frac{\partial L(y, f(d))}{\partial f(d)} \right]_{f = \sum_{j<i} f_j}

with the step size \rho_i found by line search.

Given L = SSE/2, the pseudo-response is:

    -g_i(d) = -\frac{\partial [\frac{1}{2} \sum (y - f(d))^2]}{\partial f(d)} = y - f(d)

The gradient g_i is approximated by a regression tree t_i.
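A minimal from-scratch sketch of this loop, under the assumption that scikit-learn's DecisionTreeRegressor serves as the weak learner; a fixed shrinkage constant stands in for the line-search step \rho_i, as is common in practice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_trees=100, max_depth=4, shrinkage=0.1):
    """Gradient boosting with L = SSE/2: each tree fits the residual y - F(d),
    which is exactly the pseudo-response -g_i(d) derived on the slide."""
    trees, F = [], np.zeros(len(y))
    for _ in range(n_trees):
        residual = y - F                       # pseudo-response -g_i(d) = y - f(d)
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += shrinkage * t.predict(X)          # fixed shrinkage instead of line search
        trees.append(t)
    return trees

def predict_gbrt(trees, X, shrinkage=0.1):
    """F(d) = sum_i f_i(d), summed over all weak learners."""
    return shrinkage * sum(t.predict(X) for t in trees)
```

The resulting additive forest is the model whose traversal cost the later efficiency techniques (pruning, fast scoring) aim to reduce.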