Quality-biased Ranking for Queries with Commercial Intent

Alexander Shishkin, Polina Zhinalieva, Kirill Nikolaev
{sisoid, bondy, kvn}@yandex-team.ru
Yandex LLC

WebQuality Workshop 2013
Topical Relevance Scale

Vital — the most likely search target
Useful — an authoritative source of information
Highly relevant — provides substantial information
Slightly relevant — provides minimal information
Irrelevant — does not appear to be of any use

Query: "WebQuality 2013"

URL                                         Rating
www.dl.kuis.kyoto-u.ac.jp/webquality2013/   Vital
www.quality2013.eu/                         Irrelevant
wcqi.asq.org/                               Irrelevant
quality.unze.ba/                            Irrelevant
The Main Problems of Commercial Ranking

Query: "IPhone 5 wholesale"

URL                          Rating
wholesaleiphone5.net         Highly relevant
wholesaleiphone5sale.com     Highly relevant
iphone5wholesale.com         Highly relevant
wholesaleiphone5cool.com     Highly relevant
appleiphone5wholesale.com    Highly relevant

Top positions are saturated with over-optimized sites, so any rearrangement
of SE results makes no sense in terms of relevance metrics.
Are Commercial Sites Really Identical?

[Screenshots comparing two commercial sites: best-tyres.ru vs. tyreservice.ru]
Over-optimized Document Features

[Examples of over-optimization: text features and link features]
SEO Ecosystem

[Diagram: a feedback loop in which over-optimized sites in the top-10 SE
results prompt webmasters to further optimize for search factors]
Ecosystem of Commercial Ranking

[Diagram: introducing new features that capture site quality leads webmasters
to optimize quality-correlated factors, which improves the quality of the
search engine's results]
The Main Steps in Our Approach

◮ Step 1: introduce new relevance labels
◮ Step 2: create new ranking features
◮ Step 3: modify the ranking function
◮ ??????
◮ PROFIT
Components of the Document Quality Score

◮ Assortment for a given query
◮ Design quality
◮ Trustworthiness of the site
◮ Quality of service
◮ Usability features of the site
Illustration of Assortment

[Screenshots illustrating the Assortment component]
Illustration of Usability Features

[Screenshots illustrating the Usability component]
Aggregation of Quality Components into a Single Score

Commercial relevance:

    R_c(q, d, s) = V(q, d) · (D(s) + T(s) + S(s) + U(s)),

where q is the search query, d is the document, s is the whole site,
V(q, d) is the Assortment score, D(s) is design quality, T(s) is
trustworthiness, S(s) is quality of service, and U(s) is usability.
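As a plain illustration, the aggregation is a one-line function; the
component scores are assumed here to be numeric values produced by
assessors (the slides do not specify their scale):

    def commercial_relevance(v_qd, d_s, t_s, s_s, u_s):
        """Aggregate the quality components into a single score:
        R_c(q, d, s) = V(q, d) * (D(s) + T(s) + S(s) + U(s)).

        v_qd: assortment score for the (query, document) pair
        d_s, t_s, s_s, u_s: per-site design, trust, service, usability scores
        """
        return v_qd * (d_s + t_s + s_s + u_s)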
Features for Measuring Site Quality

A few examples:

◮ Detailed contact information
◮ Absence of advertising
◮ Number of different product items
◮ Availability of shipping service
◮ Price discounts
◮ ...
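Such signals could be collected into a per-site feature vector; the sketch
below is purely illustrative, and the feature names are hypothetical rather
than the actual features used in the production system:

    # Hypothetical per-site quality features (names are illustrative only).
    site_features = {
        "has_contact_phone": True,      # detailed contact information
        "has_postal_address": True,
        "ad_block_count": 0,            # absence of advertising
        "product_item_count": 1240,     # number of different product items
        "offers_shipping": True,        # availability of shipping service
        "has_price_discounts": False,
    }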
Challenges of Commercial Ranking

◮ Assessing commercial relevance is 6 times more time-consuming than
  assessing topical relevance
◮ Only highly relevant documents are evaluated
◮ The new labels cover no more than 5% of the dataset
◮ All topical relevance labels should still be used

Solution: extrapolate the commercial relevance score to the entire dataset
using machine learning (see the sketch below).
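The slides do not name the learning algorithm; the sketch below stands in
with scikit-learn's gradient boosting regressor, trained on the small
labeled subset and applied to the rest of the dataset:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Features and assessor-assigned R_c scores for the ~5% labeled subset
    # (random placeholders here; real features would include the quality
    # signals listed above).
    X_labeled = np.random.rand(500, 20)
    y_labeled = np.random.rand(500)

    model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
    model.fit(X_labeled, y_labeled)

    # Extrapolate the commercial relevance estimate to the full dataset.
    X_all = np.random.rand(10000, 20)
    r_c_est = model.predict(X_all)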
Learning to Rank with New Relevance Labels

Unified relevance:

    R_u(q, d, s) = R_t(q, d) + α · R_c^est(q, d, s),

where R_t(q, d) is the topical relevance score, R_c^est(q, d, s) is the
estimate of the commercial relevance score, and α is a weighting coefficient.

And now we use a standard machine learning algorithm...
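As a sketch, the blend is a single weighted sum; α is tuned in practice,
and the default of 0.5 below is an arbitrary placeholder, not a value from
the slides:

    def unified_relevance(r_topical, r_commercial_est, alpha=0.5):
        """Blend topical relevance with the estimated commercial relevance:
        R_u = R_t + alpha * R_c_est."""
        return r_topical + alpha * r_commercial_est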
New Metrics for the Method Evaluation

Offline DCG-like metrics:

    Goodness(q) = Σ_{i=1..10} R_c(q, d_i, s_i) / log2(i + 1),

    Badness(q)  = Σ_{i=1..10} 𝟙[R_c(q, d_i, s_i) ≤ th] / log2(i + 1),

where th is the threshold for the minimal acceptable site quality and 𝟙[·]
is the indicator function.
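Computed over a ranked top-10 list of commercial relevance scores, the two
metrics are a direct transcription of the formulas above:

    import math

    def goodness(r_c_top10):
        """DCG-style sum of commercial relevance over the top-10 results."""
        return sum(r / math.log2(i + 2) for i, r in enumerate(r_c_top10))

    def badness(r_c_top10, th):
        """Position-discounted count of low-quality results (R_c <= th)."""
        return sum((r <= th) / math.log2(i + 2)
                   for i, r in enumerate(r_c_top10))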
Changes in New Metrics

[Plots of metric changes after deployment: the Badness metric decreased by
70%, the Goodness metric increased by 30%]
Changes in Online Metrics

A/B experiment:

◮ 7% increase in the Long Clicks per Session metric;
◮ 5% decrease in the Abandonment Rate metric.

Interleaving experiment:

◮ users chose the new ranking results 1% more often than results from the
  default ranking system.
The End

Questions?