
Topic-Sensitive PageRank

Taher H. Haveliwala


  1. Topic-Sensitive PageRank
     Taher H. Haveliwala
     Stanford University
     taherh@cs.stanford.edu

     Motivation
     - Improve search results
       - Current engines work well for us “computer types”, but not for novice users
     - Exploit search context in a tractable and effective way
       - Current engines can only do so well when optimizing parameters for Joe User issuing query q

     Search Context
     - Query context
       - Highlighted word on page
       - Previous queries issued
     - User context
       - Bookmarks
       - Browsing history
     Placing Search in Context: The Concept Revisited [Finkelstein et al. WWW10 ’01]

     Link-Based Scoring (HITS)
     - HITS (“Hubs and Authorities”)
       - [Kleinberg SODA ’98]
       - Determine important Hub pages and important Authority pages
       - (+) Query specific rank score
       - (-) Expensive at runtime

  2. Link-Based Scoring (PageRank)
     - PageRank
       - [Page et al. ’98]
       - Assigns a-priori “importance” estimates to pages
       - (-) Query independent rank score
       - (+) Inexpensive at runtime
     - Algorithm has hooks for “personalization”

     Original PageRank
     [Diagram labels: query, Query Processor, page → rank, PageRank(), Web graph; split into Query-time and Offline stages]

     Topic-Sensitive PageRank
     - Assigns multiple a-priori “importance” estimates to pages
     - One PageRank score per basis topic
       - (+) Query specific rank score
       - (+) Make use of context
       - (+) Inexpensive at runtime
     - Related approach: one score per query word was considered in [Richardson, Domingos NIPS ’02] (builds on [Rafiei, Mendelzon WWW ’00])

     Topic-Sensitive PageRank
     [Diagram labels: query, context, Classifier, topic, Query Processor, (page, topic) → rank, TSPageRank(), Web, Yahoo! or ODP; split into Query-time and Offline stages]

  3. Original PageRank Intuition
     - “Page is important if many important pages point to it”
       - Many pages point to Yahoo!, so it is “important”
       - Because Yahoo! is important, anyone it prominently points to is “important”

     PageRank Diagram
     [Diagram: graph structure for entire web]

     PageRank Diagram
     [Diagram: initialize all nodes to rank 1]

     PageRank Diagram
     [Diagram: propagate ranks across links (multiplying by link weights)]

  4. PageRank Diagram
     [Four diagram slides showing node ranks changing as propagation is repeated; the last one is captioned “After a while…”]
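The propagation shown in the diagram slides can be sketched in a few lines. This is a minimal illustration, not the deck's own example: the three-page graph, the page names `A`, `B`, `C`, and the helper `propagate` are assumptions, and the link weight is taken to be 1 divided by the out-degree of the source page.

```python
# Toy 3-page graph (an assumed example, not taken verbatim from the slides).
# Each link carries weight 1 / out-degree of its source page.
links = {
    "A": ["B", "C"],   # A splits its rank: 0.5 to B, 0.5 to C
    "B": ["C"],        # B passes all of its rank to C
    "C": ["A"],        # C passes all of its rank to A
}

def propagate(ranks, links):
    """One step: every page sends rank * link weight along each out-link."""
    new_ranks = {page: 0.0 for page in links}
    for src, targets in links.items():
        weight = 1.0 / len(targets)
        for dst in targets:
            new_ranks[dst] += ranks[src] * weight
    return new_ranks

ranks = {page: 1.0 for page in links}   # initialize all nodes to rank 1
first_step = propagate(ranks, links)    # ranks after a single propagation

for _ in range(100):                    # "after a while..."
    ranks = propagate(ranks, links)
```

On this particular graph the ranks settle near 1.2 for `A`, 0.6 for `B`, and 1.2 for `C`. Plain repeated propagation like this only converges on well-behaved graphs; practical PageRank adds a damping term, which this sketch leaves out.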

  5. Original PageRank
     - Input
       - Web graph G
     - Output
       - Rank vector r : (page → page importance)
     - r = PR(G)

     Influencing the Computation
     - Uninfluenced: “Page is important if many important pages point to it.”
     - Influenced: “Page is important if many important pages point to it, and btw, the following are by definition important pages.”

     Influencing the Computation
     [Diagram: graph structure for entire web]

     Influencing the Computation
     [Diagram: pick a set of influence]

  6. Influenced PageRank
     - Input:
       - Web graph G
       - influence vector v
         v : (page → degree of influence)
     - Output:
       - Rank vector r : (page → page importance wrt v)
     - r = IPR(G, v)
     - How to choose v?

     Topic-Sensitive PageRank
     [Diagram labels: query, context, Classifier, topic, Query Processor, (page, topic) → rank, TSPageRank(), Web, Yahoo! or ODP; split into Query-time and Offline stages]

     Topic-Sensitive PageRank: Part I (preprocessing)
     - Goal: Generate multiple a-priori estimates of page importance, each score providing an importance estimate with respect to a topic
     - Use the Open Directory as a source of representative basis topics (i.e., use ODP pages to form a set of influence vectors v_j)
     - Offline preprocessing step, just as with ordinary PageRank

     Offline Processing
     - Input:
       - Web W
       - Basis topics [c_1, ..., c_16]
         We use 16 categories (first level of ODP)
     - Output:
       - List of rank vectors [r_1, ..., r_16]
         r_j : (page → page importance wrt topic c_j)

  7. Offline Processing
     For each topic c_j ∈ FirstLevel(ODP):
     - set
       $$v_j[i] = \begin{cases} \dfrac{1}{|\mathrm{pages}(c_j)|} & \text{if } i \in \mathrm{pages}(c_j) \\ 0 & \text{otherwise} \end{cases}$$
     - Compute r_j = IPR(W, v_j)

     Graphical Depiction of Part I
     [Diagram: Sports pages and a page d. Select set of influence, calculate PageRank for all pages.]
     For example, $r_{\text{sports}}[d] = .05$

     Graphical Depiction of Part I
     [Diagram: Health pages and a page d. Select set of influence, calculate PageRank for all pages.]
     For example, $r_{\text{health}}[d] = .01$

     Topic-Sensitive PageRank
     [Diagram labels: query, context, Classifier, topic, Query Processor, (page, topic) → rank, TSPageRank(), Web, Yahoo! or ODP; split into Query-time and Offline stages]
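The offline loop above could look roughly like the following. This is a sketch under stated assumptions: the slides define IPR only by its inputs and outputs, so the update rule used here (the common formulation that mixes propagated rank with the influence vector using a weight `alpha`), the value `alpha=0.25`, the fixed iteration count, and the tiny `web` and `odp` data are all illustrative, and the page names are made up.

```python
def influence_vector(topic_pages, all_pages):
    """v_j[i] = 1/|pages(c_j)| if page i is listed under topic c_j, else 0."""
    share = 1.0 / len(topic_pages)
    return {p: (share if p in topic_pages else 0.0) for p in all_pages}

def ipr(links, v, alpha=0.25, iterations=100):
    """Influenced PageRank sketch: r <- (1 - alpha) * propagate(r) + alpha * v.

    alpha and the iteration count are assumptions, not values from the slides.
    Every page in `links` must have at least one out-link.
    """
    r = dict(v)
    for _ in range(iterations):
        nxt = {p: alpha * v[p] for p in links}
        for src, targets in links.items():
            for dst in targets:
                nxt[dst] += (1 - alpha) * r[src] / len(targets)
        r = nxt
    return r

# Hypothetical toy web W and ODP listing.
web = {
    "espn": ["nba", "portal"],
    "nba": ["espn"],
    "webmd": ["clinic", "portal"],
    "clinic": ["webmd"],
    "portal": ["espn", "webmd"],
}
odp = {"Sports": {"espn", "nba"}, "Health": {"webmd", "clinic"}}

# One rank vector per basis topic: r_j = IPR(W, v_j)
rank_vectors = {}
for topic, pages in odp.items():
    v_j = influence_vector(pages, web)
    rank_vectors[topic] = ipr(web, v_j)
```

In this toy setup a sports page such as `espn` scores higher in `rank_vectors["Sports"]` than in `rank_vectors["Health"]`, which is the per-topic bias the preprocessing step is meant to capture.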

  8. Topic-Sensitive PageRank: Part II (query processing)
     - Goal: calculate some distribution of weights over the 16 topics in our basis
     - Use a multinomial Naive Bayes classifier
       - Training set: pages listed in ODP
       - Input: {query} or {query, context}
       - Output: probability distribution (weights) over the basis topics

     Two Usage Scenarios
     - Classify the query
     - Classify the query + context
       - query history
       - words surrounding a highlighted search phrase
       - ...

     Classify the Query
     - Only the link structure of pages relevant to the query topic will be used to rank pages
     - Better to rank query ‘golf’ with the Sports-specific rank vector

     Example Topic Distribution
     For the query ‘golf’, with no additional context, the distribution of topic weights we would use is:
     [Bar chart: weight (0 to 1) per basis topic: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, World]
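A toy version of the query classifier might look like this. It is a sketch, not the classifier used in the talk: the uniform topic prior, the add-one smoothing, the two-topic training snippets, and the choice to model context by appending the previous query to the current one are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(docs_by_topic):
    """Count term frequencies per topic from (hypothetical) ODP page text."""
    counts = {t: Counter(w for doc in docs for w in doc.lower().split())
              for t, docs in docs_by_topic.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def topic_distribution(text, counts, vocab):
    """Return P(topic | text) under multinomial Naive Bayes.

    Assumes a uniform prior over topics and add-one (Laplace) smoothing.
    """
    log_scores = {}
    for topic, c in counts.items():
        total = sum(c.values())
        log_scores[topic] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in text.lower().split() if w in vocab
        )
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp = {t: math.exp(s - top) for t, s in log_scores.items()}
    norm = sum(exp.values())
    return {t: e / norm for t, e in exp.items()}

# Made-up training snippets standing in for ODP-listed pages.
training = {
    "Sports": ["golf tournament scores", "golf clubs and courses",
               "football league scores"],
    "Business": ["financial services investments", "stock market investments",
                 "golf resort investments"],
}
counts, vocab = train(training)

w_query = topic_distribution("golf", counts, vocab)
# Context modeled (by assumption) as the previous query appended to the query.
w_context = topic_distribution("golf financial services investments", counts, vocab)
```

With this toy data, `w_query` leans toward Sports while `w_context` leans toward Business, mirroring the two usage scenarios on the slides.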

  9. Classify the Query Context
     - The topic distribution will influence rankings to prefer pages important to the topic of the query context
     - If user issues queries about investment opportunities, a follow-up query on ‘golf’ should be ranked with the Business-specific rank vector

     Picking the Topic Distribution
     If the query is ‘golf’, but the previous query was ‘financial services investments’, then the distribution of topic weights we would use is:
     [Bar chart: weight (0 to 1) per basis topic: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, World]

     Composite Link Score
     - Use the distribution w to weight the respective topic-specific ranks, forming the topic-sensitive PageRank score for document d:
       $$s_d = \sum_j w_j \, r_j[d]$$

     Interpretation of Composite Score
     - For set of influence vectors {v_j}:
       $$\sum_j \left[ w_j \cdot \mathrm{IPR}(W, v_j) \right] = \mathrm{IPR}\Big(W, \sum_j \left[ w_j \cdot v_j \right]\Big)$$
     - Weighted sum of rank vectors itself forms a valid rank vector
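The composite score and the identity behind its interpretation can be checked numerically on a small example. As before, this is only a sketch: the `ipr` update rule with weight `alpha`, the toy graph, the two influence vectors, and the weights 0.7 and 0.3 are assumptions chosen for illustration.

```python
def ipr(links, v, alpha=0.25, iterations=200):
    """Influenced PageRank sketch; alpha and the iteration count are assumptions."""
    r = dict(v)
    for _ in range(iterations):
        nxt = {p: alpha * v[p] for p in links}
        for src, targets in links.items():
            for dst in targets:
                nxt[dst] += (1 - alpha) * r[src] / len(targets)
        r = nxt
    return r

# Hypothetical toy web and two influence vectors (uniform over each topic's pages).
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}
v = {
    "Sports": {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0},
    "Health": {"a": 0.0, "b": 0.0, "c": 0.5, "d": 0.5},
}
w = {"Sports": 0.7, "Health": 0.3}   # topic weights, e.g. from a query classifier

# Offline: one rank vector per topic.
r = {topic: ipr(web, v_j) for topic, v_j in v.items()}

# Query time, left-hand side: s_d = sum_j w_j * r_j[d]
composite = {d: sum(w[t] * r[t][d] for t in w) for d in web}

# Right-hand side: a single IPR run on the blended vector sum_j w_j * v_j
blended_v = {d: sum(w[t] * v[t][d] for t in w) for d in web}
direct = ipr(web, blended_v)
```

Under this formulation `composite` and `direct` agree up to floating-point error, because the update is linear in the influence vector. That is why the cheap query-time weighted sum behaves like a PageRank computed for the blended set of influence.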

  10. Interpretation
      [Diagram: Health pages and a page d. First set of influence.]

      Interpretation
      [Diagram: Sports pages and a page d. Second set of influence.]

      Interpretation
      [Diagram: Health and Sports pages and a page d. Topic-sensitive score is PageRank of above graph.]
      For example, $r_{\{\text{sports},\,\text{health}\}}[d] = .026$

      Implementation Platform
      - Stanford WebBase repository: 120M pages
      - For research experiments, topic weights can be estimated automatically by classifier, or specified explicitly

  11. Does it make a difference?
      - Do the different topical rank vectors rank results for queries differently?
      - To answer, measure the similarity of induced ranks for some set of test query results
      - Details in paper, but short answer is, “yes, the different rank vectors induce different result rankings”

      User Study (no search context)
      - Test set of 10 queries
      - 5 users were each shown top 10 results to queries, when ranked using
        - Standard PageRank vector
        - Topic-Sensitive PageRank vector
      - A page in the result was “relevant” if 3 of the 5 users judged it to be relevant

      User Study (no search context)
      [Chart]

      User Study Follow-up
      - After factoring in text-based scoring, the precision values for both standard and topic-sensitive ranking go up
      - Topic-sensitive rankings still preferred
      - “Precision” not the best metric to use
        - Some pages are “more relevant”
        - Some pages are of “higher quality”
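The relevance rule and the precision figure mentioned in the study can be expressed as a short calculation. This sketch reads "3 of the 5 users" as "at least 3", which is an assumption, and the judgments below are made up purely to show the arithmetic; they are not data from the study.

```python
def is_relevant(judgments, threshold=3):
    """A page counts as relevant if at least `threshold` users marked it relevant."""
    return sum(judgments) >= threshold

def precision_at_k(result_judgments, k=10):
    """Fraction of the top-k results that are relevant under the rule above."""
    top = result_judgments[:k]
    return sum(is_relevant(j) for j in top) / len(top)

# Made-up judgments: one row per result, one boolean per user (5 users).
judgments = [
    [True, True, True, False, False],    # 3 of 5: relevant
    [False, False, True, False, False],  # 1 of 5: not relevant
    [True, False, False, True, False],   # 2 of 5: not relevant
    [True, True, True, True, False],     # 4 of 5: relevant
]

precision = precision_at_k(judgments)    # 2 relevant out of 4 results
```

For these made-up rows the precision is 0.5. As the follow-up slide notes, a binary measure like this cannot express that some pages are more relevant or of higher quality than others.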

  12. Query for ‘golf’ (topic-sensitivity disabled)
      [Screenshot]

      Results for ‘golf’
      [Screenshot]

      Results
      [Screenshot]

      Enable History Tracking
      [Screenshot]

      ‘financial services investments’
      [Screenshot]

      ‘financial services investments’
      [Screenshot]

  13. ‘golf’ again, but query history judged to be Business topic
      [Screenshot]

      Search Context
      Advantages of mediating through basis topics, as opposed to ‘keyword extraction’:
      - Flexibility: uniformly treat variety of sources of context and personalization
      - Transparency: topic weights are easily interpreted by user
      - Privacy: topic weights reveal less unintentionally
      - Efficiency: low query time cost, with small additional preprocessing cost

      Future Work
      - Finer grained set of representative topics, to reflect more accurately user preferences and search context

      Future Work
      - Graph weighting scheme based on page similarity to ODP category, rather than page membership to ODP category
