
Link-based Web Search: Web Search, PageRank, HITS, Stability Issues

Link-based Web Search. Vagelis Hristidis, School of Computer Science, Florida International University. COP 6727, 9/5/2004.


  1. Link-based Web Search
     Vagelis Hristidis, School of Computer Science, Florida International University (FIU), COP 6727, 9/5/2004

     Roadmap
     • Web Search
     • PageRank
     • HITS
     • Stability Issues
     • Current Research

     Search the Web

     Standard Web Search Engine Architecture
     • A crawler crawls the web.
     • Duplicates are eliminated (pages are identified by DocIds).
     • An inverted index is created.
     • The user sends a query to the search engine servers, which use the inverted index and show the results.
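
The crawl, de-duplicate, index and query steps of the architecture slide can be sketched in a few lines. This is only an illustration under simple assumptions: the pages are a hard-coded list standing in for crawler output, duplicates are detected by exact text match, and the names `build_inverted_index` and `search` are hypothetical, not part of any engine described in the slides.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the sorted list of DocIds whose text contains it.

    `pages` is an iterable of (url, text) pairs standing in for crawler output.
    Pages whose normalized text duplicates an earlier page are skipped.
    """
    seen_texts = set()
    doc_urls = []                      # DocId -> URL
    postings = defaultdict(set)        # term -> set of DocIds
    for url, text in pages:
        normalized = " ".join(text.lower().split())
        if normalized in seen_texts:   # eliminate exact duplicates
            continue
        seen_texts.add(normalized)
        doc_id = len(doc_urls)
        doc_urls.append(url)
        for term in normalized.split():
            postings[term].add(doc_id)
    index = {term: sorted(ids) for term, ids in postings.items()}
    return index, doc_urls

def search(index, doc_urls, query):
    """Return the URLs of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return [doc_urls[d] for d in sorted(result)]

# Made-up pages; the second one duplicates the first after normalization.
pages = [
    ("http://a.example", "Harvard University home page"),
    ("http://b.example", "harvard university home page"),
    ("http://c.example", "Rankings of university web pages"),
]
index, doc_urls = build_inverted_index(pages)
print(search(index, doc_urls, "university"))
```

The sketch returns matching pages in DocId order; it does no ranking at all, which is the gap the following slides address.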

  2. Before Google
     • Traditional IR ranking: a keyword is matched against a database of web pages.
       - Term frequency (tf)
       - Inverse document frequency (idf)
       - …

     Limitations of traditional IR analysis
     • Text-based ranking function: e.g., could www.harvard.edu be recognized as one of the most authoritative pages, given that many other web pages contain "harvard" more often?
     • Pages are not sufficiently self-descriptive: usually the term "search engine" doesn't appear on search engine web pages.

     Link Analysis [Kleinberg98, PageRank]
     • Assumptions
       - If the pages pointing to this page are good, then this is also a good page.
       - The words on the links pointing to this page are useful indicators of what this page is about.
     • Does it work?
       - Apparently, Google uses it.
       - The link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us the most leverage.
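
The www.harvard.edu limitation can be made concrete with a toy tf-idf scorer. The weighting used here (tf as relative frequency, idf as the log of the inverse document fraction) is one common textbook variant chosen as an assumption; the slides do not say which variant they have in mind, and the three documents are made up.

```python
import math

def tf_idf_scores(docs, term):
    """Score every document for one query term using tf * idf.

    tf  = occurrences of the term / number of words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    term = term.lower()
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    containing = sum(1 for words in tokenized.values() if term in words)
    if containing == 0:
        return {name: 0.0 for name in tokenized}
    idf = math.log(len(tokenized) / containing)
    return {
        name: (words.count(term) / len(words)) * idf if words else 0.0
        for name, words in tokenized.items()
    }

# Hypothetical documents: the "official" page mentions the keyword only once.
docs = {
    "harvard_home": "welcome to harvard university",
    "fan_page": "harvard news harvard photos harvard forum",
    "unrelated": "web search engines crawl the web",
}
scores = tf_idf_scores(docs, "harvard")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection the page that merely repeats the keyword outranks the home page, because a purely text-based score has no notion of authority.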

  3. PageRank
     • Makes use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
     • Each page has a single PageRank value, independent of the keyword query.
     • PageRank does NOT express the relevance of a page to a query.

     PageRank is a Usage Simulation
     • "Random surfer":
       - Is given a random URL.
       - Clicks randomly on links.
       - After a while gets bored and gets a new random URL.
     • A page's PageRank is proportional to the number of visits it receives in the long run.

     PageRank Calculation Intuition
     • The PageRank of page P increases when pages with large PageRanks point to P.

     PageRank Calculation
     PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
     • d: damping factor, normally set to 0.85.
     • T1, …, Tn: pages pointing to page A.
     • PR(A): PageRank of page A.
     • PR(Ti): PageRank of page Ti.
     • C(Ti): the number of links going out of page Ti.
     Note: d is needed due to PageRank sinks, i.e. pages or groups of pages that receive rank but have no links leading back out to the rest of the web.
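
The random-surfer description can be turned into a small Monte Carlo simulation. This is a sketch under stated assumptions: the three-page graph is made up, every page has at least one outgoing link, a "bored" surfer jumps to a uniformly random page with probability 1 - d, and visit frequencies are multiplied by the number of pages so that they are on the same scale as the PR(A) formula (whose values average 1).

```python
import random

def simulate_random_surfer(links, steps=200_000, d=0.85, seed=0):
    """Estimate PageRank by counting the visits of a simulated random surfer.

    links: dict mapping each page to the list of pages it links to
           (every page is assumed to have at least one outgoing link).
    With probability d the surfer follows a random link on the current page;
    otherwise it gets bored and jumps to a page chosen uniformly at random.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d:
            current = rng.choice(links[current])
        else:
            current = rng.choice(pages)
    # Scale visit frequencies by the page count so the estimates are
    # comparable with the (1 - d) + d * (...) formula.
    n = len(pages)
    return {page: n * count / steps for page, count in visits.items()}

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
estimate = simulate_random_surfer(links)
print({page: round(value, 2) for page, value in estimate.items()})
```

Solving the PR(A) equations by hand for this graph gives roughly 1.16, 0.64 and 1.19 for A, B and C, and the simulated values should land close to those, with some sampling noise.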

  4. Example of Calculation (1)
     Four pages, A, B, C and D, each start with a PageRank of 1.

     Example of Calculation (2)
     Each page transfers 0.85 of its PageRank, split evenly over its outgoing links:
     • Page A transfers 1*0.85/2 to Page B and 1*0.85/2 to Page C.
     • Page B transfers 1*0.85 to Page C.
     • Page C transfers 1*0.85 to Page A.
     • Page D transfers 1*0.85 to Page C.

     Example of Calculation (3)
     After the first iteration: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15.
     • Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
     • Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
     • Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
     • Page D: receives nothing; 0.15 (not transferred) = 0.15

     Example of Calculation (4)
     After the second iteration: Page A = 2.08375, Page B = 0.575, Page C = 1.19125, Page D = 0.15.
     • Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
     • Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
     • Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
     • Page D: receives nothing; 0.15 (not transferred) = 0.15
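
The arithmetic of this example can be checked with a short script that applies PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) to all pages at once, using the values of the previous iteration. The link structure (A to B and C, B to C, C to A, D to C) is read off the transfers listed in the example; the function name `pagerank_step` is just a label chosen for this sketch.

```python
def pagerank_step(ranks, links, d=0.85):
    """One synchronous update of PR(A) = (1 - d) + d * sum(PR(T) / C(T))."""
    new_ranks = {}
    for page in ranks:
        # Sum the shares passed on by every page that links to `page`.
        incoming = sum(ranks[src] / len(targets)
                       for src, targets in links.items() if page in targets)
        new_ranks[page] = (1 - d) + d * incoming
    return new_ranks

# Link structure inferred from the transfers in the example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = {page: 1.0 for page in links}   # every page starts at 1
history = [ranks]
for _ in range(100):
    ranks = pagerank_step(ranks, links)
    history.append(ranks)

for i in (1, 2, 100):
    print(i, {page: round(value, 5) for page, value in history[i].items()})
```

Iterations 1 and 2 reproduce the hand calculation, including 2.08375 for Page A and 1.19125 for Page C in the second iteration. The script runs 100 iterations rather than 20 simply to leave a comfortable margin before comparing with the converged values.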

  5. Example of Calculation (5)
     Final values: Page A = 1.490, Page B = 0.783, Page C = 1.577, Page D = 0.15.
     • After about 20 iterations the values have approximately converged.
     • The iteration converges when the random surfer's Markov chain is irreducible (strongly connected) and aperiodic; the random jump guarantees both, even if the link graph alone is not strongly connected.

     Google
     • Uses PageRank as one of the criteria to rank keyword query results.
     • Other criteria (may) include:
       - Term frequencies
       - Term proximities
       - Term position (title, top of page, etc.)
       - Term characteristics (boldface, capitalized, etc.)
       - Link analysis information
       - Category information
       - Popularity information

     HITS [Kleinberg98] Hubs & Authorities
     • Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999)
     • HITS (Hypertext-Induced Topic Search) was developed by Jon Kleinberg while visiting IBM Almaden.
     • IBM expanded HITS into Clever.
     • IBM doesn't see Clever as a real-time search engine, but as a way to create constantly refreshed lists of relevant pages for categories.

  6. Hubs & Authorities
     • Rank pages according to the keyword query (in contrast to PageRank).
     • Given a keyword query, assign a hub value and an authority value to each page.
     • Pages with high authority are the results of the query.

     Hubs & Authorities
     • Good hub: a page that points to many good authorities.
     • Good authority: a page pointed to by many good hubs.

     Hubs & Authorities Calculation: Root Set and Base Set
     • Use the query terms to collect a root set of pages from a text-based search engine (AltaVista).

     Hubs & Authorities Calculation: Root Set and Base Set (Cont'd)
     • Expand the root set into a base set by including (up to a designated size cut-off):
       - all pages linked to by pages in the root set
       - all pages that link to a page in the root set
     • A typical base set contains roughly 1000-5000 pages.
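
The expansion from root set to base set can be sketched as follows. The assumptions are loud ones: the link graph is given as two in-memory dictionaries (`out_links` and `in_links`, hypothetical names), and the "designated size cut-off" is interpreted as a cap on the total size of the base set, which is only one possible reading of the slide.

```python
def build_base_set(root_set, out_links, in_links, max_size=5000):
    """Expand a root set into a base set for hub/authority computation.

    out_links[p]: pages that p links to.
    in_links[p]:  pages that link to p.
    Expansion stops as soon as the base set reaches max_size pages.
    """
    base = list(dict.fromkeys(root_set))      # keep order, drop repeats
    seen = set(base)
    for page in list(base):                   # iterate over root pages only
        neighbours = list(out_links.get(page, [])) + list(in_links.get(page, []))
        for other in neighbours:
            if len(base) >= max_size:
                return base
            if other not in seen:
                seen.add(other)
                base.append(other)
    return base

# Made-up link data for two root pages.
out_links = {"r1": ["x", "y"], "r2": ["y"]}
in_links = {"r1": ["z"], "r2": ["w", "r1"]}

print(build_base_set(["r1", "r2"], out_links, in_links))
print(build_base_set(["r1", "r2"], out_links, in_links, max_size=4))
```

Only neighbours of root pages are added; neighbours of neighbours are not followed, so the base set stays a one-step expansion of the root set.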

  7. Hubs & Authorities Calculation
     • Iterative algorithm on the base set, with authority weights a(p) and hub weights h(p).
     • Set the authority weights a(p) = 1 and the hub weights h(p) = 1 for all p.
     • Repeat the following two operations (and then re-normalize a and h to have unit norm):

     $$a(p) = \sum_{q \text{ points to } p} h(q), \qquad h(p) = \sum_{p \text{ points to } q} a(q)$$

     For example, if v1, v2 and v3 point to p, then a(p) = h(v1) + h(v2) + h(v3); if p points to v1, v2 and v3, then h(p) = a(v1) + a(v2) + a(v3).

     Example: Mini Web
     Three pages X, Y and Z. M is the adjacency matrix (rows and columns in the order X, Y, Z; an entry is 1 when the row page points to the column page), H is the vector of hub weights and A the vector of authority weights:

     $$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad H = \begin{bmatrix} h_x \\ h_y \\ h_z \end{bmatrix}, \qquad A = \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix}$$

     In matrix form the two operations are H = M A and A = M^T H, which gives

     $$H_i = M M^T H_{i-1}, \qquad A_i = M^T M A_{i-1}$$

     Example
     $$M^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \quad M^T M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \quad M M^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}$$

     Hub weights (unnormalized) at iterations 0, 1, 2, 3, …, ∞:

     $$H = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},\ \begin{bmatrix} 6 \\ 2 \\ 4 \end{bmatrix},\ \begin{bmatrix} 28 \\ 8 \\ 20 \end{bmatrix},\ \begin{bmatrix} 132 \\ 36 \\ 96 \end{bmatrix},\ \dots,\ \begin{bmatrix} 2+\sqrt{3} \\ 1 \\ 1+\sqrt{3} \end{bmatrix}$$

     X is the best hub.

     Authority weights (unnormalized) at iterations 0, 1, 2, 3, …, ∞:

     $$A = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},\ \begin{bmatrix} 5 \\ 5 \\ 4 \end{bmatrix},\ \begin{bmatrix} 24 \\ 24 \\ 18 \end{bmatrix},\ \begin{bmatrix} 114 \\ 114 \\ 84 \end{bmatrix},\ \dots,\ \begin{bmatrix} 1+\sqrt{3} \\ 1+\sqrt{3} \\ 2 \end{bmatrix}$$

     X and Y tie as the most authoritative pages (a_x = a_y > a_z).

     Hubs & Authorities Calculation
     • Theorem (Kleinberg, 1998). The iterates a(p) and h(p) converge to the principal eigenvectors of M^T M and M M^T, where M is the adjacency matrix of the (directed) Web subgraph.
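
The two update operations and the re-normalization step can be run directly on the mini web. This is a minimal pure-Python sketch; the function name `hits` and the fixed count of 50 rounds are choices made for illustration. Because each round updates a first and then h from the new a, intermediate values differ from the unnormalized iterates listed in the example, but the limiting directions are the same.

```python
import math

def hits(adjacency, iterations=50):
    """Iterate a(p) = sum of h(q) over q -> p and h(p) = sum of a(q) over p -> q,
    re-normalizing both vectors to unit Euclidean norm after every round.

    adjacency[p][q] is 1 when page p points to page q, else 0.
    """
    n = len(adjacency)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: add up the hub weights of pages pointing to p.
        auths = [sum(adjacency[q][p] * hubs[q] for q in range(n)) for p in range(n)]
        # Hub update: add up the authority weights of pages p points to.
        hubs = [sum(adjacency[p][q] * auths[q] for q in range(n)) for p in range(n)]
        a_norm = math.sqrt(sum(a * a for a in auths))
        h_norm = math.sqrt(sum(h * h for h in hubs))
        auths = [a / a_norm for a in auths]
        hubs = [h / h_norm for h in hubs]
    return hubs, auths

# Mini web, rows and columns ordered X, Y, Z.
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
hubs, auths = hits(M)

# Rescale so the values can be compared with the limits in the example.
print([round(h / hubs[1], 3) for h in hubs])          # compare with 2+sqrt(3), 1, 1+sqrt(3)
print([round(2 * a / auths[2], 3) for a in auths])    # compare with 1+sqrt(3), 1+sqrt(3), 2
```

After rescaling, the hub vector is proportional to (2+√3, 1, 1+√3) and the authority vector to (1+√3, 1+√3, 2), in line with the theorem on principal eigenvectors.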

  8. PageRank vs. Authorities
     PageRank (Google)
     • Computed for all web pages stored in the database, prior to the query.
     • Computes authorities only.
     • Trivial and fast to compute.
     HITS (CLEVER)
     • Performed on the set of retrieved web pages, for each query.
     • Computes authorities and hubs.
     • Easy to compute, but real-time execution is hard.

     How do we analyze algorithm stability?
     General strategy:
     1. Start with the original adjacency matrix, A.
     2. Perturb the matrix to get A*: select k nodes in the graph to add or delete.
     3. Compute the distance d(r(A), r(A*)), for some distance measure d and some objective function r that measures the quality of the results obtained from a matrix.
     4. Compute the amount of perturbation p(A, A*), for some distance function p that measures the amount of perturbation.
     5. Evaluate the conditions, if any, under which small values of p generate large values of d.

     PageRank Stability
     Theoretical result:
     • If the k pages to be modified do not have high overall PR scores in the original graph, then the perturbed scores will not be far from the original ones.
     Note: the result is conditioned on the resetting probability (the probability of jumping to a random page, which is 1 - d in terms of the damping factor of the PageRank formula) not being too small.
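
The general strategy and the PageRank stability result can be tried out on a tiny graph. The sketch below makes several assumptions that are not in the slides: the perturbation rewires the outgoing links of a single page (k = 1) instead of adding or deleting nodes, r is the PageRank vector itself, and d is the L1 distance between the two vectors. The four-page graph is the one from the earlier calculation example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)), starting from all ones."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = {
            page: (1 - d) + d * sum(ranks[src] / len(out)
                                    for src, out in links.items() if page in out)
            for page in links
        }
    return ranks

def l1_distance(r1, r2):
    """Sum of absolute score differences over all pages."""
    return sum(abs(r1[page] - r2[page]) for page in r1)

def perturb(links, page, new_targets):
    """Return a copy of the graph in which `page` links to `new_targets` instead."""
    changed = {p: list(out) for p, out in links.items()}
    changed[page] = list(new_targets)
    return changed

original = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
base = pagerank(original)

# Rewire a low-PageRank page (D) and, separately, a high-PageRank page (C).
low = l1_distance(base, pagerank(perturb(original, "D", ["A"])))
high = l1_distance(base, pagerank(perturb(original, "C", ["B"])))
print(round(low, 2), round(high, 2))
```

On this graph, rewiring D (PageRank 0.15) moves the scores by about 0.16 in total, while rewiring C (PageRank about 1.58) moves them by about 2.68, which matches the direction of the theoretical result. A single example like this is only an illustration, not evidence for the theorem.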
