Navigating the Web Graph


  1. Navigating the Web graph. Workshop on Networks and Navigation, Santa Fe Institute, August 2008. Filippo Menczer, Informatics & Computer Science, Indiana University, Bloomington.

  2. Outline: Topical locality (content, link, and semantic topologies); Implications for growth models and navigation; Applications (topical Web crawlers, distributed collaborative peer search).

  3. The Web as a text corpus: pages close in word vector space tend to be related (cluster hypothesis, van Rijsbergen 1979). This assumption underlies The WebCrawler (Pinkerton 1994) and the whole first generation of search engines. (Figure: pages p1 and p2 near the terms "mass", "weapons", "destruction".)

  4. Enter the Web's link structure (PageRank): $p(i) = \frac{\alpha}{N} + (1-\alpha)\sum_{j:\, j \to i} \frac{p(j)}{|\{\ell : j \to \ell\}|}$. Brin & Page 1998; Barabási & Albert 1999; Broder et al. 2000.
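
Since the slide gives the PageRank recursion, here is a minimal power-iteration sketch of it in Python. The toy three-page graph, parameter values, and function name are illustrative, not from the talk:

```python
def pagerank(out_links, alpha=0.15, iters=50):
    """p(i) = alpha/N + (1 - alpha) * sum over j->i of p(j)/outdegree(j)."""
    nodes = list(out_links)
    n = len(nodes)
    p = {i: 1.0 / n for i in nodes}
    for _ in range(iters):
        new_p = {i: alpha / n for i in nodes}
        for j, targets in out_links.items():
            if not targets:                       # dangling page: spread its score uniformly
                for i in nodes:
                    new_p[i] += (1 - alpha) * p[j] / n
            else:
                share = (1 - alpha) * p[j] / len(targets)
                for i in targets:
                    new_p[i] += share
        p = new_p
    return p

# Toy Web: A -> B, A -> C; B -> C; C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```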

  5. Three network topologies: text and links.

  6. Three network topologies: text, links, and meaning.

  7. The "link-cluster" conjecture: a connection between semantic topology (topicality or relevance) and link topology (hypertext). G = Pr[rel(p)] ~ fraction of relevant pages (generality); R = Pr[rel(p) | rel(q) AND link(q,p)]. Related nodes are "clustered" if R > G (modularity). This is a necessary and sufficient condition for a random crawler to find pages related to its start points. (Figure: small example graph with G = 5/15 and R = 2/4.) ICML 1997.

  8. Link-cluster conjecture. Stationary hit rate for a random crawler: $\eta(t+1) = \eta(t)\,R + (1-\eta(t))\,G \ge \eta(t)$, with stationary value $\eta^* = \lim_{t\to\infty}\eta(t) = \frac{G}{1-(R-G)}$. Conjecture: $\eta^* > G \Leftrightarrow R > G$. Value added: $\frac{\eta^*}{G} - 1 = \frac{R-G}{1-(R-G)}$.
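
A quick way to see the fixed point: iterate the recurrence numerically and compare with the closed form $\eta^* = G/(1-(R-G))$. The R and G values below are made up for illustration:

```python
def stationary_hit_rate(R, G, steps=1000):
    """Iterate eta(t+1) = eta(t)*R + (1 - eta(t))*G, starting from eta(0) = G."""
    eta = G
    for _ in range(steps):
        eta = eta * R + (1 - eta) * G
    return eta

R, G = 0.5, 0.2                       # R > G: linked pages cluster by relevance
print(stationary_hit_rate(R, G))      # iterated value
print(G / (1 - (R - G)))              # closed form, ~0.2857: the crawler beats the generality G
```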

  9. Link-cluster conjecture (continued): pages that link to each other tend to be related, i.e., links preserve semantics (meaning); the failure of this preservation is known as topic drift. Generalizing to neighborhoods of radius δ: $R(q,\delta) \equiv \Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{path}(q,p) \le \delta]$, $G(q) \equiv \Pr[\mathrm{rel}(p)]$, and average link distance $L(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{path}(q,p)}{|\{p:\, \mathrm{path}(q,p) \le \delta\}|}$. JASIST 2004.

  10. The "link-content" conjecture: correlation of lexical and link topology. $S(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{sim}(q,p)}{|\{p:\, \mathrm{path}(q,p) \le \delta\}|}$. L(δ): average link distance; S(δ): average similarity to the start (topic) page over pages up to distance δ. Correlation ρ(L,S) = −0.76.
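
A sketch of how L(q, δ) and S(q, δ) can be measured with a breadth-first crawl from the start page q. The toy graph, similarity table, and function name below are stand-ins for illustration, not the JASIST 2004 code:

```python
from collections import deque

def neighborhood_stats(out_links, sim, q, delta):
    """Average link distance L and average similarity S over pages within delta links of q."""
    dist = {q: 0}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue
        for v in out_links.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reached = [p for p in dist if p != q]
    L = sum(dist[p] for p in reached) / len(reached)
    S = sum(sim(q, p) for p in reached) / len(reached)
    return L, S

graph = {"q": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
cosine = {("q", "a"): 0.8, ("q", "b"): 0.6, ("q", "c"): 0.4, ("q", "d"): 0.3}
print(neighborhood_stats(graph, lambda x, y: cosine[(x, y)], "q", delta=2))  # (1.5, 0.525)
```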

  11. Heterogeneity of the link-content correlation: fit $S = c + (1-c)\,e^{aL^b}$ separately by top-level domain (edu, net, gov, org, com). Some domains differ significantly in a only (p < 0.05), others in both a and b (p < 0.05).

  12. Mapping the relationship between link, content, and semantic topologies. Given any pair of pages, we need a 'similarity' or 'proximity' metric for each topology: content — textual/lexical (cosine) similarity; link — co-citation/bibliographic coupling; semantic — relatedness inferred from manual classification. Data: Open Directory Project (dmoz.org), ~1 million pages after cleanup, i.e. ~1.3×10^12 page pairs!

  13. Content similarity (cosine in term vector space): $\sigma_c(p_1,p_2) = \frac{\vec{p}_1 \cdot \vec{p}_2}{\|\vec{p}_1\|\,\|\vec{p}_2\|}$. Link similarity (overlap of link neighborhoods): $\sigma_l(p_1,p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$.
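
The two metrics in code, on toy data (the term vectors and link neighborhoods below are invented; U_p is taken here as a set of neighbor URLs):

```python
import math

def content_similarity(v1, v2):
    """Cosine of two term-frequency vectors (dicts mapping term -> weight)."""
    dot = sum(v1[t] * v2.get(t, 0) for t in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

def link_similarity(U1, U2):
    """Jaccard overlap of the link neighborhoods of two pages."""
    union = U1 | U2
    return len(U1 & U2) / len(union) if union else 0.0

p1 = {"mass": 2, "weapons": 1, "destruction": 1}
p2 = {"weapons": 3, "destruction": 2, "inspection": 1}
print(content_similarity(p1, p2))
print(link_similarity({"a.com", "b.org", "c.edu"}, {"b.org", "c.edu", "d.net"}))
```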

  14. Semantic similarity: an information-theoretic measure based on the classification tree (Lin 1998): $\sigma_s(c_1,c_2) = \frac{2\log\Pr[\mathrm{lca}(c_1,c_2)]}{\log\Pr[c_1] + \log\Pr[c_2]}$, where lca is the lowest common ancestor of categories c_1 and c_2. It reduces to the classic path distance in the special case of a balanced tree.
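
A toy sketch of Lin's measure over a small, made-up taxonomy; Pr[c] is estimated as the fraction of classified pages at or below category c (the tree and the page counts are invented for illustration):

```python
import math

tree = {"Top": None, "Science": "Top", "Biology": "Science",
        "Physics": "Science", "Arts": "Top"}
pages = {"Top": 100, "Science": 60, "Biology": 25, "Physics": 20, "Arts": 40}

def ancestors(c):
    path = [c]
    while tree[c] is not None:
        c = tree[c]
        path.append(c)
    return path

def lin_similarity(c1, c2):
    """2*log Pr[lca(c1,c2)] / (log Pr[c1] + log Pr[c2])."""
    lca = next(a for a in ancestors(c1) if a in set(ancestors(c2)))
    pr = lambda c: pages[c] / pages["Top"]
    return 2 * math.log(pr(lca)) / (math.log(pr(c1)) + math.log(pr(c2)))

print(lin_similarity("Biology", "Physics"))   # share the LCA "Science"
print(lin_similarity("Biology", "Arts"))      # LCA is the root, so similarity is 0
```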

  15. Individual metric distributions (histograms of the semantic, content, and link similarity metrics).

  16. Precision = |Retrieved ∩ Relevant| / |Retrieved|; Recall = |Retrieved ∩ Relevant| / |Relevant|.

  17. Precision = |Retrieved ∩ Relevant| / |Retrieved|; Recall = |Retrieved ∩ Relevant| / |Relevant|. Analogues over page pairs binned by content and link similarity — averaging semantic similarity: $P(s_c,s_l) = \frac{\sum_{\{p,q:\, \sigma_c=s_c,\, \sigma_l=s_l\}} \sigma_s(p,q)}{|\{p,q:\, \sigma_c=s_c,\, \sigma_l=s_l\}|}$; summing semantic similarity: $R(s_c,s_l) = \frac{\sum_{\{p,q:\, \sigma_c=s_c,\, \sigma_l=s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$.
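
A minimal sketch of the bookkeeping behind these maps: bin each page pair by (σ_c, σ_l) and aggregate σ_s within each bin. The pair data here is random noise, only to show the computation (bin count and variable names are assumptions):

```python
from collections import defaultdict
import random

random.seed(0)
pairs = [(random.random(), random.random(), random.random())   # (sigma_c, sigma_l, sigma_s)
         for _ in range(10000)]

def bin_of(x, bins=10):
    return min(int(x * bins), bins - 1)

sums, counts = defaultdict(float), defaultdict(int)
total_s = sum(s for _, _, s in pairs)
for sc, sl, ss in pairs:
    key = (bin_of(sc), bin_of(sl))
    sums[key] += ss
    counts[key] += 1

precision = {k: sums[k] / counts[k] for k in sums}   # P(s_c, s_l): mean sigma_s in the bin
recall = {k: sums[k] / total_s for k in sums}        # R(s_c, s_l): bin's share of total sigma_s
print(precision[(0, 0)], recall[(0, 0)])
```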

  18. Science topic: semantic precision and recall maps as functions of σ_c and σ_l (log scale).

  19. Adult topic: the same precision and recall maps over σ_c and σ_l (log scale).

  20. News topic: the same precision and recall maps over σ_c and σ_l (log scale).

  21. All pairs: the same precision and recall maps over σ_c and σ_l (log scale).

  22. Outline: Topical locality (content, link, and semantic topologies); Implications for growth models and navigation; Applications (topical Web crawlers, distributed collaborative peer search).

  23. Link probability vs. lexical distance: define lexical distance $r = \frac{1}{\sigma_c} - 1$ and $\Pr(\lambda \mid \rho) = \frac{|\{(p,q):\, r=\rho \wedge \sigma_l > \lambda\}|}{|\{(p,q):\, r=\rho\}|}$.

  24. Link probability vs. lexical distance, $\Pr(\lambda \mid \rho) = \frac{|\{(p,q):\, r=\rho \wedge \sigma_l > \lambda\}|}{|\{(p,q):\, r=\rho\}|}$ with $r = \frac{1}{\sigma_c} - 1$: there is a phase transition at a critical distance ρ*, followed by a power-law tail $\Pr(\lambda\mid\rho) \sim \rho^{-\alpha(\lambda)}$. Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002.
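
A sketch of how Pr(λ | ρ) can be estimated from a sample of page pairs; the binning scheme, parameter names, and toy pairs below are assumptions for illustration:

```python
from collections import defaultdict

def link_probability(pairs, lam, r_bins=20, r_max=10.0):
    """pairs: iterable of (sigma_c, sigma_l). Returns {rho_bin: Pr(sigma_l > lam | rho)}."""
    linked, total = defaultdict(int), defaultdict(int)
    for sigma_c, sigma_l in pairs:
        if sigma_c == 0:
            continue                                  # infinite lexical distance, skip
        r = 1.0 / sigma_c - 1.0
        b = min(int(r / r_max * r_bins), r_bins - 1)  # coarse bin over lexical distance
        total[b] += 1
        if sigma_l > lam:
            linked[b] += 1
    return {b: linked[b] / total[b] for b in total}

print(link_probability([(0.9, 0.5), (0.2, 0.0), (0.5, 0.3)], lam=0.1))
```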

  25. Local content-based growth model: for $i < t$, $\Pr(p_t \to p_i) = \frac{k(i)}{mt}$ if $r(p_i,p_t) < \rho^*$, and $c\,[r(p_i,p_t)]^{-\alpha}$ otherwise. Similar to preferential attachment (BA), but degree information (popularity/importance) is used only for nearby (similar/related) pages.

  26. So, many models can predict degree distributions... which one is "right"? We need an independent observation (other than degree) to validate the models: the distribution of content similarity across linked pairs.

  27. None of these models is right!

  28. The mixture model: $\Pr(i) \propto \psi \cdot \frac{1}{t} + (1-\psi)\cdot\frac{k(i)}{mt}$, a degree-uniform mixture (with probability ψ the new node t picks a target uniformly at random, otherwise preferentially by degree).

  29. The mixture model, $\Pr(i) \propto \psi \cdot \frac{1}{t} + (1-\psi)\cdot\frac{k(i)}{mt}$: now bias the ψ-component choice by content similarity instead of the uniform distribution.

  30. Degree-similarity mixture model: $\Pr(i) \propto \psi \cdot \widehat{\Pr}(i) + (1-\psi)\cdot\frac{k(i)}{mt}$.

  31. Degree-similarity mixture model: $\Pr(i) \propto \psi \cdot \widehat{\Pr}(i) + (1-\psi)\cdot\frac{k(i)}{mt}$, where $\widehat{\Pr}(i) \propto [r(i,t)]^{-\alpha}$; ψ = 0.2, α = 1.7.
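
A simulation sketch of the degree-similarity mixture model (not the authors' code): a one-dimensional random value stands in for page content, so r(i, t) is just a content distance; psi and alpha follow the slide's values, everything else is an assumption:

```python
import random

def grow(n=1000, m=3, psi=0.2, alpha=1.7, seed=0):
    rng = random.Random(seed)
    content = [rng.random() for _ in range(n)]        # 1-D stand-in for page content
    degree = [0] * n
    edges = []
    for t in range(m, n):
        targets = set()
        while len(targets) < m:
            if rng.random() < psi:                    # similarity-biased choice ~ r(i,t)^(-alpha)
                weights = [(abs(content[i] - content[t]) + 1e-6) ** -alpha for i in range(t)]
            else:                                     # preferential attachment by degree
                weights = [degree[i] + 1 for i in range(t)]
            targets.add(rng.choices(range(t), weights=weights)[0])
        for i in targets:
            edges.append((t, i))
            degree[i] += 1
            degree[t] += 1
    return edges, degree

edges, degree = grow()
print(max(degree), sum(degree) / len(degree))         # a few hubs, low average degree
```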

  32. Both mixture models get the degree distribution right…

  33. …but the degree-similarity mixture model predicts the similarity distribution better Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004

  34. Citation networks: 15,785 articles published in PNAS between 1997 and 2002 (plot of σ_l vs. σ_c).

  35. Citation networks

  36. Citation networks

  37. Open questions: understanding the distribution of content similarity across all pairs of pages; a growth model that explains the co-evolution of link topology and content similarity; the role of search engines.

  38. Efficient crawling algorithms? Theory: since the Web is a small-world network, or has a scale-free degree distribution, short paths exist between any two pages: ~ log N (Barabási & Albert 1999), ~ log N / log log N (Bollobás 2001).

  39. Efficient crawling algorithms? Theory: short paths exist (~ log N, Barabási & Albert 1999; ~ log N / log log N, Bollobás 2001). Practice: we can't find them! Greedy algorithms based on location in geographical small-world networks take ~ poly(N) steps (Kleinberg 2000); greedy algorithms based on degree in power-law networks take ~ N steps (Adamic, Huberman et al. 2001).

  40. Exception #1: geographical networks (Kleinberg 2000). Local links go to all lattice neighbors; long-range link probability follows a power law, $\Pr \sim r^{-\alpha}$, where r is the lattice (Manhattan) distance and α is a constant clustering exponent. Greedy delivery time $t \sim \log^2 N \Leftrightarrow \alpha = D$.
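
Below is a small sketch of Kleinberg's construction and the greedy routing experiment it enables. Lattice size, seeds, and function names are illustrative choices; the real result is about how delivery time scales with N:

```python
import random

def build(L, alpha, rng):
    """L x L lattice; each node keeps its lattice neighbors plus one long-range contact."""
    contacts = {}
    nodes = [(x, y) for x in range(L) for y in range(L)]
    for u in nodes:
        nbrs = [(u[0] + dx, u[1] + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= u[0] + dx < L and 0 <= u[1] + dy < L]
        others = [v for v in nodes if v != u]
        weights = [(abs(u[0] - v[0]) + abs(u[1] - v[1])) ** -alpha for v in others]
        nbrs.append(rng.choices(others, weights=weights)[0])   # Pr ~ r^(-alpha)
        contacts[u] = nbrs
    return contacts

def greedy_route(contacts, s, t):
    """Forward to whichever contact is closest (Manhattan distance) to the target."""
    dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    steps, u = 0, s
    while u != t:
        u = min(contacts[u], key=lambda v: dist(v, t))
        steps += 1
    return steps

rng = random.Random(1)
net = build(L=30, alpha=2, rng=rng)        # alpha = D = 2: the navigable regime
print(greedy_route(net, (0, 0), (29, 29)))
```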

  41. Is the Web a geographical network? Replace lattice distance with lexical distance $r = \frac{1}{\sigma_c} - 1$ (figure: local links vs. long-range links from the power-law tail).

  42. Exception #2: hierarchical networks (Kleinberg 2002; Watts et al. 2002). Nodes are classified at the leaves of a tree; link probability has an exponential tail, $\Pr \sim e^{-h}$, where h is the tree distance (height of the lowest common ancestor). Delivery time $t \sim \log^{\varepsilon} N$, $\varepsilon \ge 1$.

  43. Is the Web a hierarchical network? Replace tree distance with semantic distance $h = 1 - \sigma_s$ (figure: exponential tail; taxonomy with categories c_1, c_2 and their lca).

  44. Take-home message: the Web is a "friendly" place!

  45. Outline: Topical locality (content, link, and semantic topologies); Implications for growth models and navigation; Applications (topical Web crawlers, distributed collaborative peer search).

  46. Crawler applications. Universal crawlers: search engines! Topical crawlers: live search (e.g., myspiders.informatics.indiana.edu), topical search engines & portals, business intelligence (find competitors/partners), distributed collaborative search.

  47. Topical [sic] crawlers (example query: "spears").

  48. Evaluating topical crawlers. Goal: build "better" crawlers to support applications. Build an unbiased evaluation framework: define common tasks of measurable difficulty; identify topics and relevant targets; identify appropriate performance measures — effectiveness (quality of crawled pages, ordering, etc.) and efficiency (separate the CPU & memory costs of crawler algorithms from bandwidth & common utilities). Information Retrieval 2005.

  49. Evaluating topical crawlers: topics. Each topic provides keywords, a description, and target pages; evaluation is automated using edited directories, giving different sources of relevance assessments.

  50. Evaluating topical crawlers: tasks. Start from seed pages and find the targets and/or pages similar to the target descriptions (figure: task difficulty controlled by a seed-target distance of d = 2 or d = 3).
