Internet Search (COSC 488) Nazli Goharian nazli@cs.georgetown.edu - PDF document


  1. Internet Search (COSC 488)
     Nazli Goharian, nazli@cs.georgetown.edu
     Nazli Goharian, 2005, 2012

     Outline
     • Web: Indexing & Efficiency
       – Partitioned Indexing
       – Index Tiering & other early termination techniques
       – Index in Dynamic Environment
     • Improving effectiveness of Web search engines
       – Web page ranking
         • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
       – Result snippets
     • Social Search
       – Tagging, collaborative search/filtering, recommender system
       – Real-time search
     • Peer-to-Peer Search

  2. The Web
     • Document collections are scattered across many geographical areas.
     • Constraints prohibiting the centralization of data include:
       – Data security
       – Volume
       – Rate of change
       – Political and legal constraints
       – Other proprietary motivations

     Web Search
     • Parallel and distributed processing
     • Web search tools access data distributed on servers worldwide but indexed centrally.
     • Most of these systems have a partitioned index on large clusters of servers with centralized control.
     • They store pointers, in the form of hypertext links, to various Web servers.

  3. Partitioned Indexing
     • The index is partitioned across multiple machines, based on either:
       • Terms (global index organization)
         • Each node holds the posting lists for some terms
         • Using a content index, query terms are sent to the nodes holding those terms
         • Higher concurrency level, but larger posting lists
       • Documents (local index organization), which is more common
         • Each node holds a complete index for its own documents (shorter posting lists)
         • Query terms are sent to all nodes
         • Top k results from each node are merged
         • Global statistics (e.g., idf) must be calculated across nodes
     • A hybrid approach may be used in tiered indexing

     Index Tiering
     • A popular early termination technique to improve the efficiency of query processing
     • Divide nodes into two tiers: tier 1 holds the index of the most popular documents, tier 2 holds the rest.
     • Search tier 1 first; if it does not return enough results, search tier 2.
     • Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester!
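The document-partitioned lookup and the tier 1 / tier 2 fallback described above can be sketched as follows. This is a minimal illustration, not how any particular engine does it: the function names, the in-memory `term -> {doc_id: weight}` layout, and the precomputed weights (assumed to already reflect global statistics such as idf) are all assumptions made for the example.

```python
import heapq

def search_node(local_index, query_terms, k):
    """Score documents on one node; local_index maps term -> {doc_id: weight}."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in local_index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Return this node's top-k as (score, doc_id) pairs.
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def search_tier(nodes, query_terms, k):
    """Send the query to every node in a tier and merge the per-node top-k lists."""
    candidates = []
    for local_index in nodes:
        candidates.extend(search_node(local_index, query_terms, k))
    return heapq.nlargest(k, candidates)

def tiered_search(tier1, tier2, query_terms, k):
    """Search tier 1 first; consult tier 2 only if tier 1 yields fewer than k results."""
    results = search_tier(tier1, query_terms, k)
    if len(results) < k:
        results = heapq.nlargest(k, results + search_tier(tier2, query_terms, k))
    return results
```

Each node only ever returns k candidates, so the merge step handles at most k results per node regardless of how large the local indexes are.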

  4. Distributed Index Construction
     • Not possible on a single machine
     • Various architectures for distributed indexing
     • MapReduce architecture (a term-partitioned index)
       • Master node assigns tasks to worker nodes (map workers & reduce workers) to split up the computing jobs:
         • Map phase: parsing & building localized <term, doc> pairs
         • Reduce phase: combining/merging posting pairs for each term

     MapReduce (Cont’d)
     • Map & reduce phases can be done in parallel on many machines
     • A map machine can also serve as a reducer machine in the process
     • Data is broken into pieces (shards), generally 16 MB to 64 MB [128 MB], which are sent to map workers as they finish their previous job
     • Map workers generally work on one shard at a time (unless they have more than one CPU); they parse and generate <term,doc> pairs (which can be combined into <term,doc,tf>)
     • Sort based on term, and then on the secondary key (doc_id)
     • The same keys (terms) are assigned to the same reduce worker
     • Load should be balanced across the reducers
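The map and reduce phases above can be imitated on a single machine to show the data flow. The sketch below is only an illustration under stated assumptions: whitespace tokenization, shards given as lists of `(doc_id, text)`, and a CRC32-based assignment of terms to reducers are all choices made for the example, and the loops that would run on separate workers run sequentially here.

```python
import zlib
from collections import defaultdict

def map_phase(shard):
    """Parse one shard (a list of (doc_id, text)) into <term, doc_id> pairs."""
    pairs = []
    for doc_id, text in shard:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def partition(pairs, num_reducers):
    """Route every pair with the same term to the same reducer bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for term, doc_id in pairs:
        buckets[zlib.crc32(term.encode("utf-8")) % num_reducers].append((term, doc_id))
    return buckets

def reduce_phase(pairs):
    """Sort by term, then doc_id, and merge into postings of (doc_id, tf)."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        plist = postings[term]
        if plist and plist[-1][0] == doc_id:
            plist[-1] = (doc_id, plist[-1][1] + 1)   # same doc again: bump tf
        else:
            plist.append((doc_id, 1))
    return dict(postings)

def build_index(shards, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for shard in shards:                 # each shard could go to a different map worker
        for r, part in enumerate(partition(map_phase(shard), num_reducers)):
            buckets[r].extend(part)
    index = {}
    for bucket in buckets:               # each bucket could go to a different reduce worker
        index.update(reduce_phase(bucket))
    return index
```

Hashing terms spreads them roughly evenly over reducers, but it does not balance load when a few terms have very long posting lists; a real system would need a smarter assignment.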

  5. MapReduce (Cont’d)
     [Figure omitted] Taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008

     Query Servers
     • Each server has its own disk holding a portion of the index
     • Queries are distributed, via a centralized control, to the servers that contain the related posting lists
     • Common terms may map to many servers
     • No single point of resource contention (efficient)
     • If a server crashes, that portion of the index is not available

  6. Index in Dynamic Environment
     • The data collection is not static
     • Reconstruct the index periodically from scratch (many search engines use this)
     • Maintain an auxiliary index to store new documents & remerge it with the existing index
     • Maintain multiple indexes (complicates maintaining collection statistics)

     Outline
     • Web: Indexing & Efficiency
       – Partitioned Indexing
       – Index Tiering & other early termination techniques
       – Index in Dynamic Environment
     • Improving effectiveness of Web search engines
       – Web page ranking
         • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
       – Result snippets
     • Social Search
       – Tagging, collaborative search/filtering, recommender system
       – Real-time search
     • Peer-to-Peer Search
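The auxiliary-index option above can be sketched as a small class. This is a rough illustration only: the class name, the `merge_threshold` parameter, and the trigger of merging after a fixed number of new documents are assumptions for the example, and deletions or updates are not handled.

```python
class DynamicIndex:
    """Main index plus a small auxiliary index for newly added documents."""

    def __init__(self, merge_threshold=1000):
        self.main = {}            # term -> sorted list of doc_ids
        self.aux = {}             # same layout, holds only recent documents
        self.aux_docs = 0
        self.merge_threshold = merge_threshold   # hypothetical tuning knob

    def add_document(self, doc_id, text):
        """Index a new document in the auxiliary index only."""
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold the auxiliary postings into the main index and reset the auxiliary index."""
        for term, doc_ids in self.aux.items():
            merged = set(self.main.get(term, [])) | set(doc_ids)
            self.main[term] = sorted(merged)
        self.aux = {}
        self.aux_docs = 0

    def postings(self, term):
        """Queries must consult both indexes until the next merge."""
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))
```

The trade-off is that every query pays for a second lookup, in exchange for not rebuilding the main index on each new document.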

  7. Definitions
     • Web graph: each page is a node and links are directed edges from one node to another
     • Out-links (out-degree) of A: links from page A to other pages
     • In-links (in-degree) of A: links from other pages to A
     • Sink: a page with out-links = 0
     • Source: a page with in-links = 0
     • Static page: a page that is generated prior to any request
     • Dynamic page: a page that is generated as the result of a request
     • Hidden/deep web: pages with no links, password-protected pages, pages reachable only via a form, ...
     • Indexable Web: union of the pages indexed by major search engines

     Evaluation of Web Search Engines: High Precision Search
     • Traditional IR systems are evaluated based on precision and recall.
     • Web search engines are evaluated based on the top N documents.
     • Recall estimation is very difficult
     • Overall precision is of limited concern, since many users do not look beyond the first screen.
     => How quickly and accurately is the first results screen generated?
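The graph definitions above (in-degree, out-degree, sink, source) can be made concrete with a few lines of code. This is a small sketch assuming the web graph is given as a list of `(source_page, target_page)` edges; the function names are made up for the example.

```python
def degrees(links):
    """links: list of (source_page, target_page) directed edges."""
    out_deg, in_deg = {}, {}
    for src, dst in links:
        for page in (src, dst):
            out_deg.setdefault(page, 0)
            in_deg.setdefault(page, 0)
        out_deg[src] += 1    # one more out-link for the linking page
        in_deg[dst] += 1     # one more in-link for the linked page
    return out_deg, in_deg

def sinks_and_sources(links):
    """Sinks have no out-links; sources have no in-links."""
    out_deg, in_deg = degrees(links)
    sinks = sorted(p for p, d in out_deg.items() if d == 0)
    sources = sorted(p for p, d in in_deg.items() if d == 0)
    return sinks, sources
```

For the edges A→B, A→C, B→C, page C is the only sink and page A is the only source.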

  8. Web Page Ranking
     • Considering both query-dependent and query-independent scores (the latter captured during indexing), a global score is generated for each page:
       • Query-dependent score
         • Similarity measures such as Cosine, BM25, proximity, ...
       • Query-independent score
         • Link analysis (anchor text, popularity metrics such as authorities and hubs, page rank, ...)
         • Sponsored search
         • Localized search
         • Query log analysis
         • etc.

     Query Log Analysis
     • Using user query patterns by time of day and by day of the week, month, and year, many optimizations are possible:
       • Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
       • Adjust relevance ranking to tune for certain user queries (accuracy optimization)
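The slides do not say how the query-dependent and query-independent scores are combined into a global score. One simple possibility, shown here purely as an assumption, is a weighted sum; the mixing weight `alpha`, the function names, and the requirement that both scores are already normalized to [0, 1] are all hypothetical.

```python
def global_score(query_dependent, query_independent, alpha=0.7):
    """Weighted sum of the two scores; alpha is a hypothetical mixing weight."""
    return alpha * query_dependent + (1 - alpha) * query_independent

def rank(pages, alpha=0.7):
    """pages: dict page -> (query_dependent_score, query_independent_score).

    Returns the pages ordered by descending global score.
    """
    scored = {p: global_score(qd, qi, alpha) for p, (qd, qi) in pages.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Lowering `alpha` lets a popular page with a weaker textual match overtake a less popular page with a stronger match, which is the kind of trade-off such a combination has to tune.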

  9. Anchor Text
     • Short (2-3 terms); describes the linked/destination page.
     • May or may not reflect a different point of view than the page author’s.
     • Anchor text of links to a doc d_i is included in the index for d_i
     • Extended anchor text (text surrounding the anchor text) may also be used
     • Generally weighted based on frequency (notion of idf)
     • Spamming problem

     Page Rank
     • A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
     • Generally calculated at the time of crawling
     • Adjusts a Web page’s score, using incoming and outgoing links as an indicator of popularity
     • A popular page is defined as a page that:
       - Many Web pages link to (inlinks)
       - Important (popular) pages link to
     • May be affected by link spam
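Indexing anchor text under the destination page, as described above, can be sketched like this. It is a minimal illustration assuming links arrive as `(source_url, target_url, anchor_text)` triples with whitespace tokenization; it only counts how many in-links use each term and leaves out the idf-style weighting and any spam filtering.

```python
from collections import defaultdict

def index_anchor_text(links):
    """links: iterable of (source_url, target_url, anchor_text).

    Returns target_url -> {term: number of in-links whose anchor uses the term}.
    """
    anchor_index = defaultdict(lambda: defaultdict(int))
    for _source, target, anchor in links:
        # Count each term once per link so one long anchor cannot dominate.
        for term in set(anchor.lower().split()):
            anchor_index[target][term] += 1
    return {target: dict(terms) for target, terms in anchor_index.items()}
```

Note that the terms are credited to the target page, not to the page that contains the link.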

  10. Page Rank

      $$\mathrm{PageRank}(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{\mathrm{PageRank}(D_i)}{C(D_i)}$$

      D_1 ... D_n: pages that link to A
      C(D_i): number of links out from page D_i
      d: damping factor (from 0 to 1; commonly 0.85)
      N: total number of pages

      An Iterative Algorithm:
      • Initially all pages are assigned an arbitrary page rank (e.g., 1/N each, summing to 1)
      • Iteratively recalculate the scores until the new scores do not change significantly
      • To converge faster, page ranks may be initialized based on number of inlinks, log info, ...

      Authorities and Hub
      • Various algorithms are based on assigning each retrieved web page two scores: an Authority score and a Hub score. (HITS: Hyperlink-Induced Topic Search, 1999)
      • Authority page: an authoritative source on a given topic
      • Hub page: a page listing pointers to authority pages on a topic
      • Authority score: summation of the scores of all the hubs pointing to that authority page
      • Hub score: summation of the scores of all the authority pages the hub is pointing to
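The iterative PageRank computation can be sketched as below. It follows the update PageRank(A) = (1 − d)/N + d · Σ PageRank(D_i)/C(D_i) with every page starting at 1/N. The convergence tolerance, the iteration cap, the adjacency-dict input format, and the treatment of sinks (spreading their score evenly over all pages) are assumptions of this sketch, not something the slides specify.

```python
def pagerank(out_links, d=0.85, tol=1e-6, max_iter=100):
    """out_links: dict page -> list of pages it links to (every page must be a key)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # initial scores sum to 1
    for _ in range(max_iter):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, targets in out_links.items():
            if targets:
                share = d * rank[page] / len(targets)   # d * PageRank(D_i) / C(D_i)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a sink spreads its score evenly over all pages.
                for t in pages:
                    new_rank[t] += d * rank[page] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:                          # scores no longer change significantly
            break
    return rank
```

For the three-page graph A→B, A→C, B→C, C→A, page C ends up with the highest score because it receives links from both A and B.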

  11. Computing Authority and Hub Scores
      • Retrieve all pages containing the query term t. This is called the root set (~200 pages).
      • Create a set that is the union of the root set pages, the pages that point to root set pages, and the pages that root set pages point to. This is called the base set.
      • Use the base set to compute the hub and authority scores.
      • An iterative algorithm:
        • Initialize every hub and authority score to 1
        • Update s(H) and s(A)

      Sponsored Search
      • Search system vendors sell keywords to advertisers so that whenever such words are issued in a query, the advertiser’s desired homepage link is returned.
      • Sponsored search results are biased towards advertisers with higher bids, higher click frequency of ads, ...
      • Significant revenue is generated for search engine vendors via this search approach (e.g., per click: 50 cents to 15 dollars)
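The iterative hub/authority update can be sketched as follows, assuming the base set has already been built and is given as a list of `(linking_page, linked_page)` edges. The fixed iteration count and the per-round normalization are assumptions added here to keep the scores bounded; the slides only say to initialize the scores to 1 and update s(H) and s(A).

```python
import math

def hits(links, iterations=50):
    """links: list of (hub_page, authority_page) edges inside the base set."""
    pages = {p for edge in links for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        # Assumption: normalize each score vector so values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth
```

A page that nothing links to keeps an authority score of 0, and a page with no out-links keeps a hub score of 0.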
