Internet Search (COSC 488) Nazli Goharian nazli@cs.georgetown.edu

Nazli Goharian, 2005, 2012

Outline
• Web: Indexing & Efficiency
  – Partitioned Indexing
  – Index Tiering & other early termination techniques
  – Index in Dynamic Environment
• Improving effectiveness of Web search engines
  – Web page ranking
    • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
  – Result snippets
• Social Search
  – Tagging, collaborative search/filtering, recommender system
  – Real-time search
• Peer-to-Peer Search

The Web
• Document collections are scattered across many geographical areas.
• Constraints prohibiting the centralization of data include:
  – Data security
  – Volume
  – Rate of change
  – Political and legal constraints
  – Other proprietary motivations

Web Search
• Parallel and distributed processing
• Web search tools access data distributed on servers worldwide but indexed centrally.
• Most of these systems have a partitioned index on large clusters of servers with centralized control.
• They store pointers, in the form of hypertext links, to various Web servers.

Partitioned Indexing
• Partitioning of the index across multiple machines, based on either:
  • Terms (global index organization)
    • Each node holds the posting lists for some terms
    • Using a content index, query terms are sent to the nodes holding those terms
    • Higher concurrency level, but larger posting lists
  • Documents (local index organization), the more common choice
    • Each node holds a complete index for its own subset of documents (shorter PLs)
    • Query terms are sent to all nodes
    • Top k results from each node are merged
    • Global statistics (e.g., idf) must be calculated
• A hybrid approach in tiered indexing may be used

Index Tiering
• A popular early termination technique to improve the efficiency of query processing
• Nodes are divided into two tiers: the index of the most popular documents is allocated to tier 1 and the rest to tier 2.
• Search tier 1 first; if it does not return enough results, search tier 2.
• Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester.
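The document-partitioned (local index) organization above can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the function names, the tf * idf scoring, and the toy documents are all assumptions introduced for the example. Each node indexes only its own documents, idf is computed from document frequencies gathered across all nodes, the query is sent to every node, and the per-node top-k lists are merged.

```python
import heapq
import math
from collections import Counter

def build_local_index(docs):
    """Build a small inverted index {term: {doc_id: tf}} for one node's documents."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index

def global_idf(shards, total_docs):
    """Combine per-node document frequencies into collection-wide idf values."""
    df = Counter()
    for index in shards:
        for term, postings in index.items():
            df[term] += len(postings)
    return {term: math.log(total_docs / n) for term, n in df.items()}

def search_shard(index, idf, query_terms, k):
    """Score one node's documents with tf * idf (an assumed scoring) and return its top k."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * idf.get(term, 0.0)
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def search(shards, idf, query, k=3):
    """Send the query to every node, then merge the per-node top-k lists."""
    terms = query.lower().split()
    candidates = []
    for index in shards:
        candidates.extend(search_shard(index, idf, terms, k))
    return [doc_id for _, doc_id in heapq.nlargest(k, candidates)]

# Hypothetical toy collection split across two nodes.
node_docs = [
    {"d1": "web search index", "d2": "index tiering"},
    {"d3": "web graph links", "d4": "search engine index index"},
]
shards = [build_local_index(d) for d in node_docs]
idf = global_idf(shards, total_docs=4)
results = search(shards, idf, "search index", k=2)
```

Scoring with a globally computed idf is what keeps scores comparable across nodes; with purely local statistics the merged ranking could differ from a single-index ranking.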

Distributed Index Construction
• Not possible on a single machine
• Various architectures for distributed indexing
• MapReduce architecture (a term-partitioned index)
  • A master node assigns tasks to worker nodes (map workers & reduce workers) to split up the computing jobs:
    • Map phase: parsing & building localized <term, doc> pairs
    • Reduce phase: combining/merging posting pairs for each term

MapReduce (Cont'd)
• Map & reduce phases can be done in parallel on many machines
• A map machine can also serve as a reducer machine in the process
• Data is broken into pieces (shards), generally 16 MB-64 MB [128 MB], which are sent to map workers as they finish their jobs
• Map workers generally work on one shard at a time (unless they have more than one CPU): they parse it and generate <term, doc> pairs (which can be combined into <term, doc, tf>)
• Sort based on term, and then on a secondary key (doc_id)
• The same keys (terms) are assigned to the same reduce worker
• Load should be balanced across the reducers
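The map and reduce phases described above can be illustrated with a single-process sketch. It is only meant to show the data flow (parse into pairs, route equal terms to the same reducer, sort by term then doc_id, merge into postings with tf). The function names, the byte-sum partitioning rule, and the toy shards are assumptions made for the example; a real deployment would run the workers on separate machines under a master node.

```python
from collections import defaultdict
from itertools import groupby

def map_phase(shard):
    """Parse one shard (a list of (doc_id, text)) into (term, doc_id) pairs."""
    pairs = []
    for doc_id, text in shard:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def partition(pairs, num_reducers):
    """Route every pair with the same term to the same reduce worker.

    The byte-sum rule is a hypothetical, deterministic stand-in for a real
    partitioning function.
    """
    buckets = defaultdict(list)
    for term, doc_id in pairs:
        buckets[sum(term.encode()) % num_reducers].append((term, doc_id))
    return buckets

def reduce_phase(pairs):
    """Sort by term, then doc_id, and merge the pairs into postings with tf."""
    postings = {}
    for term, group in groupby(sorted(pairs), key=lambda p: p[0]):
        tf = defaultdict(int)
        for _, doc_id in group:
            tf[doc_id] += 1
        postings[term] = sorted(tf.items())  # list of (doc_id, tf)
    return postings

# Two hypothetical shards, each handled by one map worker.
shards = [
    [(1, "web search"), (2, "search index")],
    [(3, "index index web")],
]
mapped = [pair for shard in shards for pair in map_phase(shard)]
index = {}
for bucket in partition(mapped, num_reducers=2).values():
    index.update(reduce_phase(bucket))
```

Because each term lands in exactly one bucket, the reducers produce disjoint parts of a term-partitioned index, and their outputs can simply be combined.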

MapReduce (Cont'd)
[Figure: MapReduce index construction. Taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008]

Query Servers
• Each server has its own disk holding a portion of the index
• Queries are distributed, via centralized control, to the servers that contain the related posting lists
• Common terms may map to many servers
• No single point of resource contention (efficient)
• If a server crashes, that portion of the index is not available

Index in Dynamic Environment
• The data collection is not static
• Reconstruct the index periodically from scratch (many search engines use this)
• Maintain an auxiliary index to store new documents & re-merge it with the existing index
• Maintain multiple indexes, which complicates maintaining collection statistics
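The auxiliary-index strategy for a changing collection can be sketched as below. This is a minimal sketch under stated assumptions: the class name, the `merge_threshold` parameter, and the choice to merge after a fixed number of new documents are all hypothetical and only illustrate the idea of buffering new documents and periodically re-merging them with the existing index.

```python
class DynamicIndex:
    """Main index plus a small in-memory auxiliary index for new documents."""

    def __init__(self, merge_threshold=2):
        self.main = {}   # term -> sorted list of doc ids
        self.aux = {}    # same layout, holds recently added documents
        self.aux_docs = 0
        self.merge_threshold = merge_threshold  # hypothetical tuning knob

    def add_document(self, doc_id, text):
        """Index a new document in the auxiliary index only."""
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold the auxiliary postings into the main index and reset it."""
        for term, doc_ids in self.aux.items():
            merged = set(self.main.get(term, [])) | set(doc_ids)
            self.main[term] = sorted(merged)
        self.aux = {}
        self.aux_docs = 0

    def lookup(self, term):
        """Answer queries from both indexes so new documents are visible."""
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))

idx = DynamicIndex(merge_threshold=2)
idx.add_document(1, "web search")
before_merge = idx.lookup("web")   # served from the auxiliary index
idx.add_document(2, "web index")   # reaches the threshold and triggers a merge
after_merge = idx.lookup("web")
```

Queries consult both structures, so a document is searchable as soon as it is added, while the expensive merge happens only occasionally.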

Definitions
• Web graph: each page is a node, and links are directed edges from one node to another
• Out-links (out-degree) of A: links from page A to other pages
• In-links (in-degree) of A: links from other pages to A
• Sink: a page with no out-links
• Source: a page with no in-links
• Static page: a page that is generated prior to any request
• Dynamic page: a page that is generated as the result of a request
• Hidden/deep Web: pages with no links to them, password-protected pages, pages reachable only via a form, ...
• Indexable Web: union of the pages indexed by major search engines

Evaluation of Web Search Engines: High Precision Search
• Traditional IR systems are evaluated based on precision and recall.
• Web search engines are evaluated based on the top N documents.
• Recall estimation is very difficult.
• Precision over the full result list is of limited concern, as many users do not look beyond the first screen.
=> How quickly and accurately is the first results screen generated?
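Evaluating on the top N documents is commonly done with precision at N, the fraction of the first N results that are relevant. The sketch below is illustrative only; the function name and the toy ranking and relevance judgments are assumptions made for the example.

```python
def precision_at_n(ranked_doc_ids, relevant_doc_ids, n):
    """Fraction of the top n results that are relevant."""
    top = ranked_doc_ids[:n]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant_doc_ids)
    return hits / n

# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p_at_3 = precision_at_n(ranked, relevant, 3)
p_at_5 = precision_at_n(ranked, relevant, 5)
```

Note that the relevant document d4 is never retrieved, yet it does not affect either value: the measure looks only at what appears on the first screen, which is why it does not require a recall estimate.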

Web Page Ranking
• Considering both query-dependent scores and query-independent scores (captured during indexing), a global score is generated for each page:
  • Query-dependent score
    • Similarity measures such as cosine, BM25, proximity, ...
  • Query-independent score
    • Link analysis (anchor text, popularity metrics such as authorities and hubs, page rank, ...)
    • Sponsored search
    • Localized search
    • Query log analysis
    • etc.

Query Log Analysis
• Using user query patterns by day and by time of day, week, month, and year, many optimizations are possible:
  • Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
  • Adjust relevance ranking to tune for certain user queries (accuracy optimization)
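One simple way to picture how a query-dependent score and a query-independent score yield a global score is a weighted sum. The slides do not say how the two are combined, so the linear combination, the `weight` parameter, and the toy scores below are assumptions for illustration only, not a description of any real engine's ranking function.

```python
def global_score(query_dependent, query_independent, weight=0.7):
    """Linear mix of the two scores (assumed form); weight favors query similarity."""
    return weight * query_dependent + (1 - weight) * query_independent

# Hypothetical pages: (similarity to the query, link-based popularity), both in [0, 1].
pages = {
    "a": (0.9, 0.1),
    "b": (0.6, 0.9),
    "c": (0.2, 0.3),
}

ranking = sorted(
    pages,
    key=lambda p: global_score(*pages[p]),
    reverse=True,
)
```

Here page "b" outranks page "a" even though "a" matches the query better, because its query-independent score is much higher. The query-independent part can be computed once at indexing time, so only the query-dependent part needs work at query time.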

Anchor Text
• Short (2-3 terms); describes the linked/destination page.
• May or may not reflect a different point of view than the page author's.
• Anchor text of links to a document d_i is included in the index for d_i
• Extended anchor text (text surrounding the anchor text) may also be used
• Generally weighted based on frequency (notion of idf)
• Spamming problem

Page Rank
• A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
• Generally calculated at the time of crawling
• Uses incoming and outgoing links as an indicator of popularity to adjust the Web page score
• A popular page is defined as a page that:
  – Many Web pages link to (inlinks)
  – Important (popular) pages link to
• May be affected by link spam

Page Rank

$$\mathrm{PageRank}(A) = \frac{1 - d}{N} + d \sum_{D_i \in \{D_1, \ldots, D_n\}} \frac{\mathrm{PageRank}(D_i)}{C(D_i)}$$

D_1 ... D_n: the pages that link to page A
C(D_i): number of links out from page D_i
d: damping factor (from 0 to 1; commonly 0.85)
N: total number of pages

An iterative algorithm:
• Initially, all pages are assigned an arbitrary page rank (e.g., 1/N), summing to 1
• Iteratively calculate the scores until the new scores do not change significantly
• To converge faster, page ranks may be initialized based on the number of inlinks, log info, ...

Authorities and Hub
• Various algorithms are based on assigning each retrieved Web page two scores: an authority score and a hub score. (HITS: Hyperlink-Induced Topic Search, 1999)
• Authority page: an authoritative source on a given topic
• Hub page: a page listing pointers to authority pages on a topic
• Authority score: summation of the hub scores of all the hubs pointing to that authority page
• Hub score: summation of the authority scores of all the authority pages the hub is pointing to
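The iterative PageRank computation described above (start every page at 1/N, then repeat the update until the scores stop changing) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the convergence tolerance, the toy graph, and in particular the handling of sinks (spreading a sink's rank evenly over all pages) are choices made for the example and are not specified in the slides.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # initial ranks sum to 1
    for _ in range(max_iter):
        # Assumption: a sink (no out-links) spreads its rank evenly over all pages.
        sink_mass = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - d) / n + d * sink_mass / n for p in pages}
        # Each page D passes rank(D) / C(D) to every page it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += d * rank[page] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # scores no longer change significantly
            break
    return rank

# Hypothetical three-page Web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph page C receives links from both A and B, so it ends up with the highest score; A is linked only by C but inherits C's importance, and B, which gets just half of A's vote, ranks lowest.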

Computing Authority and Hub Scores
• Retrieve all pages containing the query term t. This is called the root set (~200 pages).
• Create a set consisting of the union of the root set pages, the pages that point to root set pages, and the pages that root set pages point to. This is called the base set.
• Use the base set to compute the hub and authority scores.
• An iterative algorithm:
  • Initialize all hub and authority scores to 1
  • Update the hub scores s(H) and the authority scores s(A)

Sponsored Search
• Search system vendors sell keywords to advertisers, so that whenever such words are issued in a query, the advertiser's desired homepage link is returned.
• Sponsored search results are biased towards advertisers with higher bids, higher click frequency of ads, ...
• Significant revenue is generated for search engine vendors via this approach (e.g., 50 cents to 15 dollars per click)
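The iterative hub and authority update from the Computing Authority and Hub Scores slide can be sketched as below, given a base set that has already been collected. This is a minimal sketch: the function name, the fixed iteration count, the toy base set, and the normalization step (dividing by the Euclidean norm so the scores stay bounded) are assumptions for the example; the slides only state the initialization and the two update rules.

```python
import math

def hits(links, iterations=50):
    """links maps each base-set page to the list of pages it points to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Assumed normalization step so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical base set: two hub-like pages pointing at two authority-like pages.
base_set = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(base_set)
```

Page a1 is pointed to by both hubs, so it gets the higher authority score, and h1 points to both authorities, so it gets the higher hub score.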
