Random Sampling from a Search Engine's Index
Ziv Bar-Yossef, Maxim Gurevich
Department of Electrical Engineering, Technion

Search Engine Samplers
[Diagram: a search engine holds an index D of indexed documents from the Web. A sampler sends queries through the public interface, receives the top k results for each query, and outputs a random document x ∈ D.]

Motivation
• Useful tool for search engine evaluation:
  • Freshness: fraction of up-to-date pages in the index
  • Topical bias: identification of overrepresented/underrepresented topics
  • Spam: fraction of spam pages in the index
  • Security: fraction of pages in the index infected by viruses/worms/trojans
  • Relative size: number of documents indexed compared with other search engines

Size Wars
August 2005: "We index 20 billion documents."
September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?

Related Work
• Random sampling from a search engine's index [BharatBroder98, CheneyPerry05, GulliSignorni05]
• Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
• Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
• Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

Our Contributions
• A pool-based sampler (focus of this talk)
  • Guaranteed to produce near-uniform samples
• A random walk sampler
  • After sufficiently many steps, guaranteed to produce near-uniform samples
  • Does not need an explicit lexicon/pool at all!

Search Engines as Hypergraphs
[Figure: a hypergraph in which the queries "news", "google", "maps" and "bbc" are hyperedges over the documents www.cnn.com, news.google.com, www.google.com, news.bbc.co.uk, www.foxnews.com, www.mapquest.com, maps.google.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk and maps.yahoot.com.]
• results(q) = { documents returned on query q }
• queries(x) = { queries that return x as a result }
• P = query pool = a set of queries
• Query pool hypergraph:
  • Vertices: indexed documents
  • Hyperedges: { results(q) | q ∈ P }

Query Cardinalities and Document Degrees
[Figure: the same hypergraph, with the queries "news", "google", "maps" and "bbc" as hyperedges over the documents www.cnn.com, news.google.com, www.google.com, news.bbc.co.uk, www.foxnews.com, www.mapquest.com, maps.google.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk and maps.yahoot.com.]
• Query cardinality: card(q) = |results(q)|
• Document degree: deg(x) = |queries(x)|
• Examples:
  • card("news") = 4, card("bbc") = 3
  • deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
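The two definitions can be sketched in a few lines of Python. The edge assignment below is an assumption: the transcript does not say which documents each query returns, so the RESULTS table is a hypothetical one chosen only to agree with the four example values on the slide.

```python
from collections import defaultdict

# Hypothetical results(q) table. The real figure's edges are not recoverable
# from the transcript; this assignment merely matches the slide's examples.
RESULTS = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}


def build_queries_index(results):
    """Invert results(q) into queries(x): the set of queries returning x."""
    queries = defaultdict(set)
    for q, docs in results.items():
        for x in docs:
            queries[x].add(q)
    return dict(queries)


QUERIES = build_queries_index(RESULTS)


def card(q):
    """Query cardinality: card(q) = |results(q)|."""
    return len(RESULTS[q])


def deg(x):
    """Document degree: deg(x) = |queries(x)|."""
    return len(QUERIES[x])
```

Note that summing card(q) over all queries and summing deg(x) over all documents count the same (query, document) pairs, so the two totals are equal.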

The Pool-Based Sampler: Preprocessing Step
[Diagram: a large corpus C is processed into a query pool P = {q1, q2, …}.]
• Example: P = all 3-word phrases that occur in C
  • If "to be or not to be" occurs in C, P contains:
    "to be or", "be or not", "or not to", "not to be"
• Choose P that "covers" most documents in D
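A minimal sketch of the 3-word-phrase example, assuming whitespace tokenization (the slides do not specify how the corpus is tokenized). The function name phrase_pool is hypothetical.

```python
def phrase_pool(corpus_texts, n=3):
    """Collect every run of n consecutive words in the corpus as a query."""
    pool = set()
    for text in corpus_texts:
        words = text.split()  # assumption: plain whitespace tokenization
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool


# The slide's example sentence yields exactly four 3-word phrases.
pool = phrase_pool(["to be or not to be"])
```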

Monte Carlo Simulation
• We don't know how to generate uniform samples from D directly
• How can we use biased samples to generate uniform samples?
• Samples with weights that represent their bias can be used to simulate uniform samples
Monte Carlo simulation methods: Rejection Sampling, Importance Sampling, Metropolis-Hastings, Maximum-Degree

Document Degree Distribution
• We are able to generate biased samples from the "document degree distribution": p(x) = deg(x) / Σ_{x'} deg(x')
• Advantage: can compute weights representing the bias of p: w_p(x) = 1 / deg(x)

Rejection Sampling [von Neumann]
  accept := false
  while (not accept)
    generate a sample x from p
    toss a coin whose heads probability is w_p(x)
    if coin comes up heads, accept := true
  return x
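The loop above translates almost directly into Python. This is a sketch under stated assumptions: the trial distribution p is a toy one over three hypothetical items whose probabilities are proportional to made-up degrees, and the weight is 1/degree, so accepted samples should come out close to uniform.

```python
import random


def rejection_sample(draw_from_p, weight, rng):
    """Draw from p until a coin with heads probability weight(x) comes up heads."""
    while True:
        x = draw_from_p(rng)
        if rng.random() < weight(x):
            return x


# Toy setup (hypothetical): items are drawn with probability proportional
# to these degrees, mimicking a degree-biased trial distribution.
DEGREE = {"a": 1, "b": 2, "c": 3}


def draw_by_degree(rng):
    """Sample an item with probability deg(x) / sum of all degrees."""
    items = list(DEGREE)
    return rng.choices(items, weights=[DEGREE[i] for i in items], k=1)[0]


def inverse_degree(x):
    """Weight that cancels the degree bias; always in (0, 1]."""
    return 1.0 / DEGREE[x]
```

Calling rejection_sample(draw_by_degree, inverse_degree, random.Random(0)) repeatedly gives each item with probability proportional to deg(x) · 1/deg(x), that is, uniformly. In this toy case about half of the trial samples are rejected.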

Pool-Based Sampler
• Degree distribution: p(x) = deg(x) / Σ_{x'} deg(x')
[Diagram: the degree distribution sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …. It outputs documents sampled from the degree distribution with corresponding weights, (x1, 1/deg(x1)), (x2, 1/deg(x2)), …, which rejection sampling turns into a uniform sample x.]

Sampling documents by degree
[Figure: the hypergraph of the queries "google", "news", "maps" and "bbc" over the example documents.]
• Select a random q ∈ P
• Select a random x ∈ results(q)
• Documents with high degree are more likely to be sampled
• If we sample q uniformly, we "oversample" documents that belong to narrow queries
• We need to sample q proportionally to its cardinality
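The effect of the two query-sampling choices can be checked exactly on a small case. The sketch below uses a hypothetical two-query pool and exact fractions: it computes the document distribution obtained by picking q with a given probability and then x uniformly from results(q).

```python
from fractions import Fraction

# Hypothetical pool: q1 is a broad query, q2 a narrow one; "d" is in both.
RESULTS = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["d", "e"],
}


def doc_distribution(results, query_prob):
    """Distribution of x when q ~ query_prob and x is uniform in results(q)."""
    dist = {}
    for q, docs in results.items():
        for x in docs:
            share = query_prob[q] * Fraction(1, len(docs))
            dist[x] = dist.get(x, Fraction(0)) + share
    return dist


total_card = sum(len(docs) for docs in RESULTS.values())

# Choice 1: q uniform over the pool.
uniform_q = {q: Fraction(1, len(RESULTS)) for q in RESULTS}
# Choice 2: q proportional to card(q).
by_card_q = {q: Fraction(len(docs), total_card) for q, docs in RESULTS.items()}

p_uniform = doc_distribution(RESULTS, uniform_q)
p_by_card = doc_distribution(RESULTS, by_card_q)
```

With uniform queries, "e" (degree 1, narrow query) gets probability 1/4 while "a" (also degree 1, broad query) gets only 1/8. With cardinality-proportional queries, the 1/card(q) factor cancels and every document gets deg(x) / Σ deg(x'): 1/6 for the degree-1 documents and 1/3 for "d".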

Sampling queries by cardinality
• Sampling queries from the pool uniformly: easy
• Sampling queries from the pool by cardinality: hard
  • Requires knowing cardinalities of all queries in the search engine
• Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  • Sample queries uniformly from P
  • Compute a "cardinality weight" for each sample: w(q) = card(q) / k
  • Obtain queries sampled by their cardinality
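A sketch of this step, assuming the weight card(q)/k shown on the cardinality distribution sampler diagram and a hypothetical in-memory stand-in for the search engine. Overflowing queries (card(q) > k) are simply skipped here, which is one of several possible treatments, and queries with no results are rejected automatically because their weight is 0.

```python
import random

K = 4  # assumed result limit k of the (hypothetical) search engine

# Hypothetical stand-in for the search engine: query -> returned documents.
RESULTS = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["d", "e"],
    "q3": ["f"],
}


def card(q):
    """card(q) = |results(q)|; a real sampler would learn this by issuing q."""
    return len(RESULTS[q])


def sample_query_by_cardinality(pool, k, rng):
    """Rejection sampling: uniform q from the pool, accepted w.p. card(q)/k."""
    while True:
        q = rng.choice(pool)
        c = card(q)
        if c > k:
            continue  # assumption: overflowing queries are skipped
        if rng.random() < c / k:
            return q


POOL = list(RESULTS)
```

Each accepted query q appears with probability proportional to card(q), so in this toy pool q1, q2 and q3 are returned with probabilities 4/7, 2/7 and 1/7. Only the cardinalities of the queries actually drawn are needed, not those of the whole pool.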

Dealing with Overflowing Queries
• Problem: some queries may overflow (card(q) > k)
  • Bias towards highly ranked documents
• Solutions:
  • Select a pool P in which overflowing queries are rare (e.g., phrase queries)
  • Skip overflowing queries
  • Adapt rejection sampling to deal with approximate weights
Theorem: Samples of the PB sampler are at most β-away from uniform (β = overflow probability of P).

Bias towards Long Documents
[Figure: percent of documents from sample (0% to 60%) in each of the 10 deciles of documents ordered by size, for the Pool Based, Random Walk and Bharat-Broder samplers.]

Relative Sizes of Google, MSN and Yahoo!
Google = 1
Yahoo! = 1.28
MSN Search = 0.73

Conclusions
• Two new search engine samplers
  • Pool-based sampler
  • Random walk sampler
• Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
• Samplers show little or no bias in experiments.

Thank You

Top-Level Domains in Google, MSN and Yahoo!
[Figure: percent of documents from sample (0% to 60%) by top-level domain name, for Google, MSN and Yahoo!.]

Query Cardinality Distribution
• results(q) = { documents returned on query q }
• card(q) = |results(q)|
• Cardinality distribution: p(q) = card(q) / Σ_{q' ∈ P} card(q')
Unrealistic assumptions:
• Can sample queries from the cardinality distribution
  • In practice, we don't know card(q) a priori for all q ∈ P
• ∀ q ∈ P, 1 ≤ card(q) ≤ k
  • In practice, some queries underflow (card(q) = 0) or overflow (card(q) > k)

Degree Distribution Sampler
[Diagram: the cardinality distribution sampler outputs a query q sampled from the cardinality distribution. q is sent to the search engine, which returns results(q). A document x is then sampled uniformly from results(q); x is a document sampled from the degree distribution.]

Cardinality Distribution Sampler
[Diagram: the uniform query sampler draws uniform samples q1, q2, … from P and sends them to the search engine, which returns card(q1), card(q2), …. The weighted samples (q1, card(q1)/k), (q2, card(q2)/k), … pass through rejection sampling, which outputs a query q sampled from the cardinality distribution.]

Complete Pool-Based Sampler
[Diagram: the full pipeline. The uniform query sampler produces uniform query samples; using the pairs (q, card(q)), … obtained from the search engine, rejection sampling yields queries sampled from the cardinality distribution. Using the pairs (q, results(q)), …, the degree distribution sampler yields documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …. A second rejection sampling step outputs a uniform document sample x.]

A random walk sampler
• Define a graph G over the indexed documents
  • (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
• Run a random walk on G
  • Limit distribution = degree distribution
  • Use MCMC methods to make the limit distribution uniform:
    • Metropolis-Hastings
    • Maximum-Degree
• Does not need a preprocessing step
• Less efficient than the pool-based sampler
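The claim about limit distributions can be verified exactly on a small graph. The sketch below is an illustration under assumptions, not the authors' implementation: GRAPH is a hypothetical undirected graph without self-loops, "degree" here means the number of neighbors in G, and the Metropolis-Hastings variant shown proposes a uniform neighbor y of x and accepts with probability min(1, deg(x)/deg(y)).

```python
from fractions import Fraction

# Hypothetical undirected document graph (adjacency lists, no self-loops).
GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}


def simple_walk_matrix(graph):
    """Plain random walk: move to a uniformly chosen neighbor."""
    return {x: {y: Fraction(nbrs.count(y), len(nbrs)) for y in graph}
            for x, nbrs in graph.items()}


def mh_transition_matrix(graph):
    """Metropolis-Hastings walk whose target distribution is uniform."""
    P = {x: {y: Fraction(0) for y in graph} for x in graph}
    for x, nbrs in graph.items():
        for y in nbrs:
            propose = Fraction(1, len(nbrs))
            accept = min(Fraction(1), Fraction(len(nbrs), len(graph[y])))
            P[x][y] = propose * accept
        # Rejected proposals keep the walk at x.
        P[x][x] = 1 - sum(P[x].values())
    return P


def step(dist, P):
    """Apply one step of the chain P to the distribution dist."""
    return {y: sum(dist[x] * P[x][y] for x in P) for y in P}


total_degree = sum(len(nbrs) for nbrs in GRAPH.values())
degree_dist = {x: Fraction(len(nbrs), total_degree) for x, nbrs in GRAPH.items()}
uniform = {x: Fraction(1, len(GRAPH)) for x in GRAPH}

P_simple = simple_walk_matrix(GRAPH)
P_mh = mh_transition_matrix(GRAPH)
```

Here step(degree_dist, P_simple) returns degree_dist unchanged and step(uniform, P_mh) returns uniform unchanged, so each distribution is stationary for its chain. Convergence to the stationary distribution additionally needs the graph to be connected and the walk to be aperiodic, which the rejection self-loops provide in the Metropolis-Hastings case.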
