Sharif University of Technology 23/2/1397
Scale Effects in Web Search
Soroush Ebadian, Parand Alizadeh
Under the supervision of Prof. Fazli
Social and Economic Networks, Spring 1397
Contents
• Overview on Problem Space
• Data Description
• Direct Effects of Scale
• Indirect Effects of Scale
• Discussion & Conclusion
Analysis of Web Search Markets
• Two different worlds:
• Ranking is based on algorithmic innovation and fixed document features
• Learning from historical queries is critical to ranking quality
Little is known about which one we live in.
Analysis of Web Search Markets (cont.)
• Learning tends to slow down with each additional data point
Fig. 1: A learning curve averaged over many trials
Is the required scale one that any viable entrant can easily achieve?
Authors of Paper
• Microsoft AI & Research: 4 of 5 authors
• HomeAway Inc.: 1 of 5 authors
Data Description
• Two search engines, under the same restrictions
• More than 6 months
• Based on click-through rate (CTR)
              # impressions    # clicks
Provider 1    > 200 billion    > 100 billion
Provider 2    > 300 billion    > 150 billion
Table 1. Summary statistics
Benchmark & Target Data
• Raw log retention time is legally limited
• Benchmark data: first 3 months
• Target data: next 9 months
• Each observation is a pair <H(q,d), CTR(q,d)>
• H(q,d): historical measure for query q before day d
• CTR(q,d): CTR of query q on day d
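The pair construction on this slide can be illustrated with a minimal Python sketch. It assumes, as an illustration only, that the "historical measure" H(q,d) is the cumulative number of occurrences of the query before day d, seeded with the count from the benchmark period. The function name `history_ctr_pairs` and its parameters are hypothetical.

```python
from typing import List, Tuple


def history_ctr_pairs(
    daily: List[Tuple[int, int]], benchmark_occurrences: int = 0
) -> List[Tuple[int, float]]:
    """Build <H(q,d), CTR(q,d)> pairs for a single query q.

    daily[d] = (impressions, clicks) on target day d.
    benchmark_occurrences = occurrences seen during the benchmark period
    (assumed here to seed the history).
    """
    pairs = []
    history = benchmark_occurrences  # H(q, d): occurrences strictly before day d
    for impressions, clicks in daily:
        if impressions > 0:  # CTR is undefined on days without impressions
            pairs.append((history, clicks / impressions))
        history += impressions
    return pairs
```

For example, `history_ctr_pairs([(10, 5), (0, 0), (20, 5)], benchmark_occurrences=100)` yields one pair per day that has impressions, with the history growing from day to day.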
CTR & Historical Occurrences: Positive Correlation
• Generated 270 pairs, grouped into buckets by H(q,d)
[Figure: CTR (0 to 1) for Provider 1 and Provider 2 across H(q,d) buckets 0-10, 10-100, 100-1k, 10k-100k, 100k-1m, 1m-10m, 10m-100m]
Fig. 2. CTR shows a positive correlation with the number of historical occurrences.
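A minimal Python sketch of the bucketing step, assuming power-of-ten buckets like those on the figure's axis and a plain unweighted mean of CTR within each bucket. Both helper names are hypothetical, and the averaging rule is an assumption rather than the authors' stated procedure.

```python
from collections import defaultdict


def bucket_index(h: int) -> int:
    """Return k such that 10**k <= h < 10**(k+1); values below 10 map to bucket 0."""
    index = 0
    while h >= 10 ** (index + 1):
        index += 1
    return index


def mean_ctr_by_bucket(pairs):
    """Average CTR over all (H, CTR) pairs that fall in the same bucket."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [sum of CTR, count]
    for h, ctr in pairs:
        entry = totals[bucket_index(h)]
        entry[0] += ctr
        entry[1] += 1
    return {bucket: s / n for bucket, (s, n) in sorted(totals.items())}
```

Bucket 0 corresponds to "0-10", bucket 1 to "10-100", and so on.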
Regression Analysis
$\mathrm{CTR} = -0.0530 + 0.3287\,\sqrt{\log x}$, where $x$ is the number of historical examples
Intercept: −0.0530 [−0.085, −0.021]; slope: 0.3287 [0.315, 0.343] (intervals in brackets)
Fig. 3. Provider 1, relationship between CTR and number of historical examples.
Regression Analysis (cont.)
$\mathrm{CTR} = -0.3871 + 0.4792\,\sqrt{\log x}$, where $x$ is the number of historical examples
Intercept: −0.3871 [−0.486, −0.288]; slope: 0.4792 [0.438, 0.520] (intervals in brackets)
Fig. 4. Provider 2, relationship between CTR and number of historical examples.
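To get a feel for the two fitted curves, the point estimates can be evaluated directly. The sketch below is illustrative only. The slides do not state the base of the logarithm, so base 10 is an assumption here, exposed as the `log_base` parameter. The names `COEFFICIENTS` and `predicted_ctr` are hypothetical.

```python
import math

# Point estimates (intercept, slope) copied from the two regression slides.
COEFFICIENTS = {
    "provider_1": (-0.0530, 0.3287),
    "provider_2": (-0.3871, 0.4792),
}


def predicted_ctr(provider: str, x: float, log_base: float = 10.0) -> float:
    """Evaluate CTR = intercept + slope * sqrt(log(x)) for x >= 1.

    log_base=10 is an assumption; the slides only write log(x).
    """
    intercept, slope = COEFFICIENTS[provider]
    return intercept + slope * math.sqrt(math.log(x, log_base))
```

Because the curve grows with the square root of a logarithm, each additional order of magnitude of historical examples adds less predicted CTR than the previous one.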
Scale Effect Analysis on New Queries
• Popular queries may be easier to satisfy
• Compare queries of the same “query difficulty”, selected by two criteria:
• (1) the query has fewer than 200 clicks in the three-month benchmark
• (2) the query's total number of clicks is between 1000 and 2000 (in a year)
• Provider 1: 8000 queries
• Provider 2: 10000 queries
Scale Effect Analysis on New Queries (cont.)
• CTR(q, c): CTR of q over the period from its (c+1)-th to its (c+100)-th click
• c ∈ {100, 200, ..., 900}
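One possible reading of CTR(q, c), sketched in Python: order the impressions of q in time, start counting right after the impression that received the c-th click, stop at the impression that received the (c+100)-th click, and divide the number of clicks by the number of impressions in that span. This interpretation, the function name `ctr_after_clicks`, and the `window` parameter are assumptions for illustration.

```python
from typing import Optional, Sequence


def ctr_after_clicks(
    clicked: Sequence[bool], c: int, window: int = 100
) -> Optional[float]:
    """CTR over the impressions spanning clicks c+1 .. c+window.

    clicked[i] is True if the i-th impression (in time order) was clicked.
    Returns None if the query never reaches c + window clicks.
    """
    clicks_seen = 0
    start = 0 if c == 0 else None  # index of the first impression in the span
    for i, was_clicked in enumerate(clicked):
        if not was_clicked:
            continue
        clicks_seen += 1
        if start is None and clicks_seen == c:
            start = i + 1  # span begins right after the c-th click
        elif clicks_seen == c + window:
            return window / (i + 1 - start)
    return None
```

Calling this for c = 100, 200, ..., 900 on one query gives nine CTR values that can be compared as the query accumulates history.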
Scale Effect Exists in Both
Fig. 5. Provider 1, relationship between CTR and number of historical examples for new queries only.
Scale Effect Exists in Both (cont.)
Fig. 6. Provider 2, relationship between CTR and number of historical examples for new queries only.
Constructing Bipartite Knowledge Graph
• G = <Q, D, E>
• Q = queries, D = documents
• e_ij = click count between q_i and d_j
• Represent each query as a bag of words; this reduces the number of queries by 7%
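A toy Python sketch of the graph construction. It assumes the click log is a sequence of (query, document) pairs and that a bag of words is formed by lowercasing and splitting on whitespace, so that word order no longer distinguishes queries. The names `query_key` and `build_click_graph`, and the tokenization rule itself, are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def query_key(query: str) -> Tuple[str, ...]:
    """Bag-of-words key: the sorted words, so 'a b' and 'b a' coincide."""
    return tuple(sorted(query.lower().split()))


def build_click_graph(click_log: Iterable[Tuple[str, str]]) -> Counter:
    """Map each edge (query_key, document) to its click count e_ij."""
    edges = Counter()
    for query, document in click_log:
        edges[(query_key(query), document)] += 1
    return edges
```

Merging queries that share a bag of words is what shrinks Q: "cheap flights" and "flights cheap" become a single node.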
Summary of Query-Document Graph
• Cardinality of Q: 4.82 billion
• Cardinality of D: 3.26 billion
• Cardinality of E: 11.6 billion
• Total clicks: > 100 billion
Clustering Documents
• Construct a similarity matrix of the documents using cosine similarity
• Convert similarity weights to 0 or 1 using a threshold
• Construct the document similarity graph
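The three steps above can be sketched in Python on a toy scale. The slides do not say which vectors the cosine similarity is computed on, so the sketch assumes each document is represented by its click counts per query. The names `cosine` and `similarity_edges` are hypothetical, and the all-pairs loop is only workable for small examples, not for billions of documents.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(
    doc_vectors: Dict[str, Dict[str, float]], threshold: float
) -> List[Tuple[str, str]]:
    """Keep a 0/1 edge between two documents when similarity >= threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(doc_vectors), 2)
        if cosine(doc_vectors[a], doc_vectors[b]) >= threshold
    ]
```

The returned edge list is the document similarity graph in its simplest form: a pair is present if and only if its thresholded weight is 1.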
Clustering Documents (cont.)
• Find the connected components of the document similarity graph
• Each connected component is an intent cluster
• Construct the query/intent-cluster graph
• E_ij = fraction of clicks from q_i to cluster j
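A small Python sketch of the query/intent-cluster edge weights, assuming the click graph is a mapping from (query, document) to click count and that a document-to-cluster assignment is already available. Documents without a cluster are skipped and excluded from the denominator, which is an assumption of this sketch. The name `query_cluster_weights` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def query_cluster_weights(
    edges: Dict[Tuple[Hashable, Hashable], int],
    doc_cluster: Dict[Hashable, Hashable],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """E_ij: fraction of the clicks of query i that land in cluster j."""
    clicks = defaultdict(lambda: defaultdict(int))
    for (query, doc), count in edges.items():
        if doc in doc_cluster:  # ignore documents that belong to no cluster
            clicks[query][doc_cluster[doc]] += count
    weights = {}
    for query, per_cluster in clicks.items():
        total = sum(per_cluster.values())
        weights[query] = {cluster: n / total for cluster, n in per_cluster.items()}
    return weights
```

By construction the weights of each query sum to 1 across its clusters.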
Algorithm 1. Find Connected Components
1. Every document pair is a separate cluster
2. Identify link nodes between pairs and merge
3. Repeat 2 until convergence
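Algorithm 1 can be followed almost literally in a few lines of Python. This is a single-machine sketch for small inputs, assuming the input is the list of similar document pairs. The function name `connected_components` is hypothetical, and documents that appear in no pair are not returned.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def connected_components(
    pairs: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Merge clusters that share a node until nothing changes."""
    clusters = [set(pair) for pair in pairs]  # step 1: one cluster per pair
    merged = True
    while merged:  # step 3: repeat until convergence
        merged = False
        result: List[Set[Hashable]] = []
        for cluster in clusters:
            for existing in result:
                if existing & cluster:  # step 2: a shared (link) node
                    existing |= cluster
                    merged = True
                    break
            else:
                result.append(set(cluster))
        clusters = result
    return clusters
```

Each pass can only reduce the number of clusters, so the loop terminates. A union-find structure would compute the same components more efficiently on large graphs.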
Evaluation of Clusters
• Form a 100-query test set and get all clusters
• Have auditors score each edge as 0 or 1
• Choose the threshold among 0.7, 0.8, 0.9 and 0.95
Evaluation of Clusters (cont.)
• Precision: fraction of pairs judged to be relevant
• Weighted Precision: precision with a Markov weight applied to each pair
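Both precision measures reduce to simple averages once the auditor judgements are available. The Python sketch below treats the per-pair weights as given inputs, since the slides do not describe how the Markov weights are computed. The function names are hypothetical.

```python
from typing import Sequence


def precision(judgements: Sequence[int]) -> float:
    """Fraction of pairs judged relevant (each judgement is 0 or 1)."""
    return sum(judgements) / len(judgements)


def weighted_precision(
    judgements: Sequence[int], weights: Sequence[float]
) -> float:
    """Weight-normalized share of relevant pairs; weights are supplied by the caller."""
    weighted_hits = sum(j * w for j, w in zip(judgements, weights))
    return weighted_hits / sum(weights)
```

With equal weights the two measures coincide. They diverge when heavily weighted pairs are judged differently from lightly weighted ones.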
Evaluation of Clusters (cont.)
• Pseudo Recall: 1 for threshold 0.7; otherwise, the fraction of pairs each method recovers
• Weighted Recall: pseudo recall with a Markov weight applied to each pair
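One way to read the pseudo-recall definition is as a ratio against the most permissive threshold: the pairs recovered at 0.7 serve as the reference set, and every other threshold is scored by how many pairs it recovers relative to that set. The Python sketch below implements this reading. Which pairs count as "recovered" is an assumption left to the caller, and the function name `pseudo_recall` is hypothetical.

```python
from typing import Dict, Hashable, Set


def pseudo_recall(
    recovered: Dict[float, Set[Hashable]], baseline: float = 0.7
) -> Dict[float, float]:
    """Score each threshold by its recovered-pair count relative to the baseline."""
    reference = len(recovered[baseline])
    return {
        threshold: len(pairs) / reference
        for threshold, pairs in recovered.items()
    }
```

By construction the baseline threshold scores exactly 1, matching the first row of the definition.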
Evaluation of Clusters (cont.)
Threshold    Precision    W. Precision    Pseudo recall    W. Recall
0.7          0.69         0.79            1                1
0.8          0.7          0.84            0.76             1.054
0.9          0.68         0.83            0.45             1.04
0.95         0.66         0.83            0.26             1.03
Table 2. Precision and Recall by threshold
Fig. 7. CDF of the number of intent clusters with an edge to the submitted query.
Fig. 8. CDF of the number of queries per intent cluster.
Impact on CTR
Discussion & Conclusion
• It is unclear whether an increase in scale makes the search problem easier or harder
• Building a search engine is one of the most complicated engineering tasks ever attempted
Thanks for your attention!