
Optimizing Result Prefetching in Web Search Engines with Segmented Indices
Ronny Lempel, Shlomo Moran
Technion, Israel Institute of Technology

Web Search Engines
• Index billions of documents.
• Make use of distributed storage.
• Their main task: answer users’ queries, millions per day.
• Issue addressed: optimizing the prefetching policy for answering Active Queries.
VLDB 2002

Talk Outline
1. Defining Active Queries and Active Search Sessions.
2. Modeling the cost of serving Active Queries and Active Search Sessions.
3. Optimizing the cost of an Active Search Session.
4. Possible implementations.

Part 1: Active Queries and Active Search Sessions

Query execution: Parties involved (in typical large search engines)
• User.
• Query Integrator (QI): controls the execution.
• Cache: stores results for fast QI access.
• Segmented index: documents are assigned to segments randomly, uniformly and independently (local inverted index organization).

Search Session: User’s point of view
• Submit an initial query to the Query Integrator (QI), then wait …
• … eventually receive back a result page (usually the top 10 documents for the query).

Search Session (cont.)
• … possibly submit a follow-up query to the Query Integrator (QI), requesting additional results …
• Receive the following result page, and so on until no further follow-up query is asked.

Search Session: QI’s point of view
Initial query received … Are the results stored in the cache?
Case 1. The 1st result page (top 10 results) is cached: return it to the user.
Case 2. The 1st result page is not cached. This is an Active Query, which needs more work:

QI’s work on Active Query
• The QI asks the Segmented Index for the top n results (n to be determined).

QI’s work on Active Query (cont.)
• Each of the m segments returns its top z results (z = z(m,n), to be discussed later).

QI’s work on Active Query (cont.)
• The QI selects the top n (out of mz) results, for preparing r = n/10 result page(s).
• Returns the 1st page (10 results) to the user.
• Stores the other r-1 pages (if r > 1) in the cache.
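
The merge step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: the function name `serve_active_query` and the representation of each segment's reply as a list of `(score, doc_id)` pairs sorted by descending score are assumptions made for the example.

```python
import heapq
from itertools import islice

PAGE_SIZE = 10  # results per page, as in the slides


def serve_active_query(segment_results, n):
    """Merge per-segment top-z lists and split the top n into result pages.

    segment_results: list of m lists of (score, doc_id) pairs, each list
    sorted by descending score (a hypothetical reply format).
    Returns (first_page, pages_to_cache).
    """
    # heapq.merge expects inputs sorted ascending by key, so negate the score.
    merged = heapq.merge(*segment_results, key=lambda hit: -hit[0])
    top_n = list(islice(merged, n))
    pages = [top_n[i:i + PAGE_SIZE] for i in range(0, len(top_n), PAGE_SIZE)]
    first_page = pages[0] if pages else []
    # The first page goes to the user; the remaining r-1 pages go to the cache.
    return first_page, pages[1:]
```

The lazy k-way merge stops after n results, so the QI never sorts all mz candidates.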

Active Search Session
• A (suffix of a) Search Session which starts with an Active Query.
• Goal: optimize the QI’s policy for serving a “typical” Active Search Session.

Part 2: Modeling the cost of an Active Search Session (of a “typical” user)

Cost of Active Search Session depends on:
A. Architectural constraints and query-specific parameters (not controlled by the QI):
1. C: the number of relevant documents held in each segment (could be in the millions).
2. m: the number of index segments.
3. Other properties of the search engine.

Cost of Active Search Session depends on:
B. Implementation-dependent parameters (controlled by the QI):
1. n = 10r: the number of results the QI prepares.
2. z = z_q(n,m): the number of results each of the m segments returns, so that with probability q these mz results contain the top n results.

The QI needs to decide n: The Prefetching Dilemma
1. Preparing just the single required result page is cheap, but follow-up queries will require additional Active Query executions.
2. Prefetching several result pages is expensive, but saves Active Query executions should the user request the prefetched pages while they are still cached.
Need to model user behaviour.

Search engine users: some statistics (based on 3 published analyses of search engine query logs)
A typical user views relatively few result pages:
– First pages (containing the top-10 results of queries) account for at least 58% of all result pages viewed by users.
– Pages in ranks 1-3 (containing the top-30 results) account for at least 88% of all page views.

Search Engine users: Views of Result Pages 1-6
Analysis of an AltaVista query log containing ~7.55 million result page views from September, 2001: about 4.8 million views (63.5%) are for the first page of results (the top-10 results).
[Figure: Views of Pages 2-6. x-axis: Result Page Number (2 to 6); y-axis: Views (0 to 1,000,000).]

Search Engine users: Views of Result Pages 6-20
[Figure: Views of Pages 6-20. x-axis: Result Page Number (6 to 20); y-axis: Views (0 to 140,000).]

Modeling Search Sessions
• Result pages are always viewed in their natural order.
• The number of result pages that are viewed per session is a geometric random variable with parameter p.
• The probability that a user will view precisely result pages 1,…,k of any query in a session is (1-p)p^{k-1}.
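
A small Python sketch of the geometric model may help make the formula concrete. The function names are illustrative and not from the talk; the tail probability and the mean follow directly from summing (1-p)p^{k-1}.

```python
def prob_views_exactly(k, p):
    """P(user views precisely pages 1..k) = (1 - p) * p**(k - 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1 - p) * p ** (k - 1)


def prob_views_at_least(k, p):
    """P(user reaches page k): the tail sum of the geometric law, p**(k - 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** (k - 1)


def expected_pages_viewed(p):
    """Mean number of result pages viewed per session, 1 / (1 - p)."""
    return 1.0 / (1.0 - p)
```

In this model p is the probability that a user who has just viewed a page goes on to request the next one.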

Executing an Active Query: Work at segments
• Each of the m segments selects its top z results.
• Assuming documents are evenly distributed, all segments perform roughly the same amount of work, which depends on:
– The number of documents relevant to the query (the query’s breadth).
– The number z of results that each segment returns to the QI.

Executing an Active Query: Work at QI
• The QI’s work is dominated by receiving the mz results from the segments, and merging them to get the top n = 10r results.
• The cache space required depends on n = 10r, the number of results that are prepared.
• A single round of communication between the QI and the segments is performed.

W(r): The expected cost of an Active Search Session
W = Σ_{i=1}^{∞} Pr(executing the i’th Active Query) × (cost of executing each Active Query)
When preparing r pages, this cost is:
W(r) = ar + (b + c z_q(r,m) + dr) / (1 - p^r)
Note: For each topic, there is a fixed optimal value r_opt which optimizes the expected work.
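
The 1/(1 - p^r) factor can be checked numerically. The sketch below assumes the geometric viewing model from the earlier slide and that prefetched pages are still cached when requested, so the i’th Active Query runs exactly when the user reaches page r(i-1)+1, which happens with probability p^{r(i-1)}. The function names are illustrative; a, b, c, d are the constants of the slide’s formula, and z stands for z_q(r,m), supplied by the caller.

```python
def expected_executions(p, r):
    """Closed form for the expected number of Active Query executions."""
    return 1.0 / (1.0 - p ** r)


def expected_executions_by_sum(p, r, terms=1000):
    """Same quantity as a truncated sum over Pr(executing the i'th Active Query)."""
    return sum(p ** (r * (i - 1)) for i in range(1, terms + 1))


def session_cost(r, z, p, a, b, c, d):
    """W(r) = a*r + (b + c*z + d*r) / (1 - p**r), with z = z_q(r, m)."""
    return a * r + (b + c * z + d * r) / (1.0 - p ** r)
```

For 0 < p < 1 the truncated sum converges quickly to the closed form, which is where the denominator in W(r) comes from.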

Part 3: Optimizing the cost of a typical Active Search Session

Optimizing W(r): calculating r_opt
• The paper presents a polynomial time algorithm that, given a query-topic T, determines r_opt, the number of result pages to prepare per query execution so as to minimize the work function W(r).
• In addition, it presents a (simpler) polynomial time algorithm which, given a parameter ε, outputs a value r_ε for which W(r_opt) / W(r_ε) ≥ 1-ε, and which is applicable to any (monotonic) cost function, assuming the number of viewed pages is a geometric random variable.
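
The slides do not reproduce either algorithm. As a rough illustration of what is being optimized, here is a naive exhaustive scan over r = 1..r_max. It is explicitly not the paper’s method, and `toy_work` uses made-up constants in place of a real work function.

```python
def naive_r_opt(work, r_max):
    """Return the r in 1..r_max that minimizes work(r), by exhaustive scan."""
    return min(range(1, r_max + 1), key=work)


def within_epsilon(work, r_candidate, r_best, eps):
    """Check the slide's guarantee: W(r_best) / W(r_candidate) >= 1 - eps."""
    return work(r_best) / work(r_candidate) >= 1 - eps


def toy_work(r, p=0.5):
    """Placeholder cost with invented constants, only to exercise the scan."""
    return 1.0 * r + (20.0 + 2.0 * r) / (1.0 - p ** r)
```

A scan like this needs an explicit upper bound r_max, which is an extra assumption the slide does not state.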

Optimizing W(r): Calculating z_q(r,m)
Problem demonstration: m = 5, r = 1 (hence n = 10). How many results should the QI retrieve from each segment?
[Figure: the top 10 results (ranks 1-10) scattered across the five segments.]
Fetching the top three results from each of the five segments fails to collect all top 10 results!
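
The failure mode in this demonstration can be simulated. The sketch below assumes, as stated earlier in the talk, that documents are assigned to segments uniformly and independently; fetching the top z results from every segment then collects the whole top n exactly when no segment holds more than z of them. Function names, the trial count and the seed are illustrative choices.

```python
import random
from collections import Counter


def covers_top_n(assignment, z):
    """assignment[i] is the segment holding the i-th best result.

    Taking the top z from each segment collects all of them if and only if
    no segment holds more than z of the top results.
    """
    return max(Counter(assignment).values(), default=0) <= z


def estimate_coverage(n, m, z, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(the mz fetched results contain the top n)."""
    rng = random.Random(seed)
    hits = sum(
        covers_top_n([rng.randrange(m) for _ in range(n)], z)
        for _ in range(trials)
    )
    return hits / trials
```

Calling `estimate_coverage(10, 5, 3)` estimates how often three results per segment suffice for the slide’s setting of m = 5 and n = 10.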

Calculating z_q(r,m)
• Notations:
1. m: number of segments.
2. n: number of requested results (n = 10r).
• In order to collect the top-n results with probability 1, the QI must retrieve n results from each segment.
• In order to have any chance of collecting the top-n results, the QI must retrieve at least ⌈n/m⌉ results from each segment.

Calculating z_q(r,m)
• The paper presents a polynomial time DP algorithm to calculate z_q(r,m), the minimal number of results that the QI can retrieve from each segment and still obtain the top-n (n = 10r) results with probability q.
• Applicable to any search engine that uses an m-way locally segmented index where documents are distributed uniformly and independently.
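
The paper’s DP algorithm is not shown in the slides. The following is an independent sketch of one way to compute the same quantity under the stated uniform, independent assignment: count the placements of the n top results into m segments with no segment holding more than z, adding one segment at a time. The function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def prob_all_top_n_collected(n, m, z):
    """Pr(no segment holds more than z of the top n results)."""
    # ways[k]: placements of k labelled results into the segments handled so
    # far, with at most z results per segment. Zero segments hold nothing.
    ways = [1] + [0] * n
    for _ in range(m):
        ways = [
            sum(comb(k, t) * ways[k - t] for t in range(min(z, k) + 1))
            for k in range(n + 1)
        ]
    return ways[n] / m ** n


def min_results_per_segment(n, m, q):
    """Smallest z with coverage probability >= q, searched from ceil(n/m) to n."""
    for z in range(-(-n // m), n + 1):
        if prob_all_top_n_collected(n, m, z) >= q:
            return z
    return n
```

The search range reflects the two bounds from the previous slide: below ⌈n/m⌉ success is impossible, and at z = n it is certain.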

Sample values of z_q(r,m), q = 0.99
[Figure: z_q(r,m) (y-axis, 0 to 40) against the number of result pages r (x-axis, 1 to 12; 10 results per page). Series: 5 segments, lower bound for 5 segments, 25 segments, 50 segments.]

Behavior of W(r) = ar + (b + c z_q(r,m) + dr) / (1 - p^r)
q = 0.99, m = 25, p = 0.5, query breadth: 2^13 results per segment; r_opt = 6.
[Figure: W(r) (y-axis, 8500 to 16500) against r (x-axis, 1 to 8).]

Part 4: Possible implementations

Implementation of prefetching policy
1. A preprocessing stage: calculate and store relevant data.
2. Query execution stage: determine the parameters n and z from the stored data and the query topic.

Prefetching Framework: Preprocessing Stage
• Determine the work function W(r) that describes the resources that the engine consumes during a Search Session.
• For a wide range of query breadths, calculate r_opt and the corresponding values of z_q(r_opt,m).
• Load tables with the values of r_opt and z_q(r_opt,m) into the QI and each of the index segments.

Prefetching Framework: During Query Execution
• Centralized approach:
– The QI estimates the query’s breadth using global term statistics.
– Looks up the corresponding values of r_opt and z_q(r_opt,m).
– Queries each segment for z_q(r_opt,m) results and prepares r_opt result pages.
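
The lookup step of the centralized approach might look like the sketch below. The table layout (rows of `(min_breadth, r_opt, z)` sorted by breadth threshold) and the function names are assumptions made for illustration; the slides do not specify how the precomputed tables are organized.

```python
from bisect import bisect_right


def lookup_policy(table, breadth):
    """Return (r_opt, z) for the largest breadth threshold not above `breadth`.

    table: list of (min_breadth, r_opt, z) rows sorted by min_breadth,
    a hypothetical layout for the precomputed values.
    """
    bounds = [row[0] for row in table]
    idx = bisect_right(bounds, breadth) - 1
    if idx < 0:
        raise ValueError("breadth is below the smallest table entry")
    _, r_opt, z = table[idx]
    return r_opt, z


def plan_active_query(table, estimated_breadth, page_size=10):
    """Translate an estimated query breadth into the QI's request parameters."""
    r_opt, z = lookup_policy(table, estimated_breadth)
    return {"pages": r_opt, "n": page_size * r_opt, "results_per_segment": z}
```

A table such as `[(0, 1, 3), (1000, 3, 5), (100000, 6, 7)]` would exercise the lookup; those numbers are placeholders, not values from the paper.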

Prefetching Framework: During Query Execution
• Distributed approach:
– Each segment estimates the query’s breadth independently by considering the number of query matches it contains.
– Looks up the corresponding z_q(r_opt,m), and returns this number of results to the QI.
