Navigating the Web Graph
Workshop on Networks and Navigation, Santa Fe Institute, August 2008
Filippo Menczer, Informatics & Computer Science, Indiana University, Bloomington
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
The Web as a text corpus
Pages close in word-vector space tend to be related: the cluster hypothesis (van Rijsbergen 1979)
[Figure: page vectors p1, p2 near the query “weapons of mass destruction”]
The WebCrawler (Pinkerton 1994) and the whole first generation of search engines
Enter the Web’s link structure

p(i) = α/N + (1 − α) Σ_{j: j→i} p(j) / |{ℓ: j→ℓ}|

Brin & Page 1998; Barabási & Albert 1999; Broder et al. 2000
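The formula above can be sketched as a power iteration. This is a minimal illustration, not the production algorithm; the three-page graph is invented, and dangling-node handling is one common convention among several.

```python
# Minimal PageRank power iteration matching the slide's formula:
# p(i) = alpha/N + (1 - alpha) * sum over in-links j of p(j)/outdeg(j).
# The toy graph is hypothetical.

def pagerank(out_links, alpha=0.15, iters=100):
    nodes = list(out_links)
    n = len(nodes)
    p = {i: 1.0 / n for i in nodes}
    for _ in range(iters):
        new = {i: alpha / n for i in nodes}
        for j, targets in out_links.items():
            if not targets:  # dangling node: spread its mass uniformly
                for i in nodes:
                    new[i] += (1 - alpha) * p[j] / n
            else:
                share = (1 - alpha) * p[j] / len(targets)
                for i in targets:
                    new[i] += share
        p = new
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(scores)  # scores form a probability distribution over pages
```

Because "c" receives links from both "a" and "b" while "b" receives only half of "a"'s vote, "c" ends up with the higher score.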
Three network topologies Text Meaning Links
The “link-cluster” conjecture
Connection between semantic topology (topicality/relevance) and link topology (hypertext)
G = Pr[rel(p)] ~ fraction of relevant pages (generality)
R = Pr[rel(p) | rel(q) AND link(q,p)]
Related nodes are “clustered” if R > G (modularity)
[Figure example: G = 5/15, C = 2, R = 3/6 = 2/4]
Necessary and sufficient condition for a random crawler to find pages related to start points
ICML 1997
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η* = lim_{t→∞} η(t) = G / (1 − (R − G))
• Conjecture: η* > G ⇔ R > G
• Value added: η*/G − 1 = (R − G) / (1 − (R − G))
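The recursion above can be checked numerically: iterating the hit-rate update converges to the stated stationary value. A small sketch, with R and G chosen arbitrarily for illustration:

```python
# Iterate eta(t+1) = eta(t)*R + (1 - eta(t))*G and compare the limit
# against the closed-form stationary value G / (1 - (R - G)).
# R and G are illustrative values with R > G (clustered case).
R, G = 0.5, 0.2
eta = G  # start from the background hit rate
for _ in range(1000):
    eta = eta * R + (1 - eta) * G
stationary = G / (1 - (R - G))
print(eta, stationary)  # the two agree; both exceed G since R > G
```

The update is a contraction (slope R − G, with |R − G| < 1), so convergence to the fixed point is geometric.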
Link-cluster conjecture
Pages that link to each other tend to be related: preservation of semantics (meaning); its loss is a.k.a. topic drift

R(q, δ) ≡ Pr[rel(p) | rel(q) ∧ path(q,p) ≤ δ]
G(q) ≡ Pr[rel(p)]
L(q, δ) ≡ Σ_{p: path(q,p) ≤ δ} path(q,p) / |{p: path(q,p) ≤ δ}|

JASIST 2004
The “link-content” conjecture
Correlation of lexical and linkage topology

S(q, δ) ≡ Σ_{p: path(q,p) ≤ δ} sim(q,p) / |{p: path(q,p) ≤ δ}|

L(δ): average link distance from the start (topic) page; S(δ): average similarity to it, over pages up to distance δ
Correlation ρ(L, S) = −0.76
Heterogeneity of link-content correlation
Fit: S = c + (1 − c)·e^{a·L^b}
By top-level domain (edu, net, gov, org, com): some domains differ significantly in a only (p < 0.05), others in both a and b (p < 0.05)
Mapping the relationship between link, content, and semantic topologies
• Given any pair of pages, we need a similarity/proximity metric for each topology:
– Content: textual/lexical (cosine) similarity
– Link: co-citation / bibliographic coupling
– Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org), ~1 M pages after cleanup → ~1.3 × 10^12 page pairs!
Content similarity: cosine in term space,
σ_c(p1, p2) = (p1 · p2) / (‖p1‖ ‖p2‖)

Link similarity: Jaccard coefficient of the pages’ link neighborhoods U_p,
σ_l(p1, p2) = |U_{p1} ∩ U_{p2}| / |U_{p1} ∪ U_{p2}|
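Both pairwise metrics are easy to sketch. The term vectors and neighbor sets below are invented; real pages would use weighted term-frequency vectors and the sets of co-cited/co-citing pages.

```python
import math

# Cosine (content) similarity between sparse term vectors, and
# Jaccard (link) similarity between neighbor sets U_p.

def sigma_c(v1, v2):
    dot = sum(v1.get(t, 0) * w for t, w in v2.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def sigma_l(u1, u2):
    return len(u1 & u2) / len(u1 | u2) if u1 | u2 else 0.0

p1 = {"web": 2, "graph": 1}          # hypothetical term weights
p2 = {"web": 1, "crawler": 1}
print(sigma_c(p1, p2))               # overlap only on "web"
print(sigma_l({"a", "b"}, {"b", "c"}))  # 1 shared of 3 total neighbors
```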
Semantic similarity
• Information-theoretic measure based on the classification tree (Lin 1998):
σ_s(c1, c2) = 2 log Pr[lca(c1, c2)] / (log Pr[c1] + log Pr[c2])
• Reduces to the classic path distance in the special case of a balanced tree
[Figure: tree with top, lca, c1, c2]
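A sketch of Lin's measure on a tiny hypothetical topic tree, where Pr[c] is the fraction of pages filed in the subtree rooted at c. The tree, category names, and page counts are all made up for illustration.

```python
import math

# Lin (1998) semantic similarity on a toy classification tree.
parent = {"science": "top", "arts": "top",
          "physics": "science", "biology": "science"}
pages = {"top": 0, "science": 10, "arts": 30, "physics": 40, "biology": 20}
total = sum(pages.values())

def ancestors(c):
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def prob(c):
    # mass of c plus all of its descendants
    return sum(n for node, n in pages.items() if c in ancestors(node)) / total

def sigma_s(c1, c2):
    lca = next(a for a in ancestors(c1) if a in set(ancestors(c2)))
    if prob(lca) == 1.0:
        return 0.0  # lca is the root: no shared information
    return 2 * math.log(prob(lca)) / (math.log(prob(c1)) + math.log(prob(c2)))

print(sigma_s("physics", "biology"))  # share the "science" ancestor
print(sigma_s("physics", "arts"))     # only shared ancestor is the root: 0
```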
Individual metric distributions [histograms of the semantic, content, and link similarity metrics]
Precision = |Retrieved ∩ Relevant| / |Retrieved|
Recall = |Retrieved ∩ Relevant| / |Relevant|
Averaging semantic similarity → precision analogue:
P(s_c, s_l) = Σ_{(p,q): σ_c = s_c, σ_l = s_l} σ_s(p,q) / |{(p,q): σ_c = s_c, σ_l = s_l}|

Summing semantic similarity → recall analogue:
R(s_c, s_l) = Σ_{(p,q): σ_c = s_c, σ_l = s_l} σ_s(p,q) / Σ_{(p,q)} σ_s(p,q)
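The two aggregations can be sketched directly: bin page pairs by their (content, link) similarity values, then average (precision) or sum-and-normalize (recall) their semantic similarity. The pair data below are invented.

```python
# Generalized precision/recall over page pairs, as on the slide.
pairs = [
    # (sigma_c bin, sigma_l bin, sigma_s)
    (0.9, 0.8, 0.7),
    (0.9, 0.8, 0.5),
    (0.1, 0.0, 0.1),
]

def precision(sc, sl):
    vals = [s for c, l, s in pairs if (c, l) == (sc, sl)]
    return sum(vals) / len(vals) if vals else 0.0

def recall(sc, sl):
    total = sum(s for _, _, s in pairs)
    return sum(s for c, l, s in pairs if (c, l) == (sc, sl)) / total

print(precision(0.9, 0.8))  # (0.7 + 0.5) / 2 = 0.6
print(recall(0.9, 0.8))     # 1.2 / 1.3
```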
[Precision and recall surfaces as functions of σ_c and log σ_l, shown for the topics Science, Adult, and News, and for all pairs]
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
Link probability vs. lexical distance
r = 1/σ_c − 1
Pr(λ | ρ) = |{(p,q): r = ρ ∧ σ_l > λ}| / |{(p,q): r = ρ}|
Phase transition at ρ*; power-law tail Pr(λ | ρ) ~ ρ^{−α(λ)}
Proc. Natl. Acad. Sci. USA 99(22): 14014–14019, 2002
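The empirical quantity above can be sketched as a counting exercise over page pairs: among pairs at lexical distance ρ, take the fraction whose link similarity exceeds λ. The pair data are invented.

```python
# Empirical link probability Pr(lambda | rho), as defined on the slide.
pairs = [
    # (sigma_c, sigma_l)
    (0.5, 0.6), (0.5, 0.1), (0.5, 0.9), (0.2, 0.0),
]

def link_prob(lam, rho, tol=1e-9):
    # restrict to pairs at lexical distance r = 1/sigma_c - 1 == rho
    at_rho = [(c, l) for c, l in pairs if abs(1 / c - 1 - rho) < tol]
    if not at_rho:
        return 0.0
    return sum(1 for _, l in at_rho if l > lam) / len(at_rho)

# sigma_c = 0.5 gives rho = 1; two of those three pairs have sigma_l > 0.5
print(link_prob(0.5, 1.0))
```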
Local content-based growth model

Pr(p_t → p_i), i < t:
  k(i) / (mt)            if r(p_i, p_t) < ρ*
  c·[r(p_i, p_t)]^{−α}   otherwise

• Similar to preferential attachment (BA)
• Uses degree info (popularity/importance) only for nearby (similar/related) pages
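A toy simulation of this growth rule, under loud assumptions: page "content" is reduced to a random scalar, lexical distance is the absolute difference, and ρ*, α, m, and the seed graph are all illustrative rather than fitted values.

```python
import random

# Toy local content-based growth: a new page attaches preferentially by
# degree only to nearby pages (r < rho_star); far pages are chosen with
# weight proportional to r**(-alpha).
random.seed(42)
rho_star, alpha, m = 0.25, 1.7, 2

pos = [random.random() for _ in range(3)]  # seed pages' content coordinates
degree = [2, 2, 2]                         # small seed triangle
edges = [(0, 1), (1, 2), (0, 2)]

for t in range(3, 200):
    pos.append(random.random())
    degree.append(0)
    weights = []
    for i in range(t):
        r = abs(pos[i] - pos[t]) + 1e-9    # lexical distance proxy
        weights.append(degree[i] if r < rho_star else r ** -alpha)
    targets = set()
    while len(targets) < m:                # m distinct targets per new page
        targets.add(random.choices(range(t), weights=weights)[0])
    for i in targets:
        edges.append((t, i))
        degree[i] += 1
        degree[t] += 1

print(len(pos), len(edges))  # 200 nodes, 3 + 2*197 = 397 edges
```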
So, many models can predict degree distributions... Which is “right” ? Need an independent observation (other than degree) to validate models Distribution of content similarity across linked pairs
None of these models is right!
The mixture model
Pr(i) ∝ ψ · 1/t + (1 − ψ) · k(i)/(mt) — a degree-uniform mixture
Bias the choice by content similarity instead of a uniform distribution
Degree-similarity mixture model
Pr(i) ∝ ψ · P̂r(i) + (1 − ψ) · k(i)/(mt), with P̂r(i) ∝ [r(i,t)]^{−α}
Fitted parameters: ψ = 0.2, α = 1.7
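One attachment step of this mixture can be sketched as follows. The quoted ψ = 0.2 and α = 1.7 are the fitted values from the slide; the three candidate nodes, their degrees, and their lexical distances from the new page are invented.

```python
import random

# One attachment step of the degree-similarity mixture model:
# with probability psi pick by content similarity (weight r**(-alpha)),
# otherwise pick preferentially by degree.
random.seed(1)
psi, alpha = 0.2, 1.7

degree = [5, 1, 1]
r_to_new = [0.9, 0.1, 0.5]  # lexical distance from the new page to each node

def pick_target():
    if random.random() < psi:
        weights = [r ** -alpha for r in r_to_new]  # similarity-driven
    else:
        weights = degree                            # degree-driven
    return random.choices(range(len(degree)), weights=weights)[0]

picks = [pick_target() for _ in range(10000)]
print([picks.count(i) / len(picks) for i in range(3)])
```

The high-degree node 0 dominates overall (the degree branch fires 80% of the time), while the lexically closest node 1 gets a boost from the similarity branch.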
Both mixture models get the degree distribution right…
…but the degree-similarity mixture model predicts the similarity distribution better Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
Citation networks: 15,785 articles published in PNAS between 1997 and 2002 [scatter of σ_l vs. σ_c]
Open Questions Understand distribution of content similarity across all pairs of pages Growth model to explain co-evolution of both link topology and content similarity The role of search engines
Efficient crawling algorithms?
Theory: since the Web is a small-world network with a scale-free degree distribution, short paths exist between any two pages:
~ log N (Barabási & Albert 1999)
~ log N / log log N (Bollobás 2001)
Practice: we can’t find them!
• Greedy algorithms based on location in geographical small-world networks: ~ poly(N) (Kleinberg 2000)
• Greedy algorithms based on degree in power-law networks: ~ N (Adamic, Huberman et al. 2001)
Exception #1: geographical networks (Kleinberg 2000)
– Local links to all lattice neighbors
– Long-range link probability: power law, Pr ~ r^{−α}, where r is lattice (Manhattan) distance and α a constant clustering exponent
– Greedy delivery time t ~ log² N ⇔ α = D
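Kleinberg's result is easy to see experimentally. Below is a toy 1-D ring version of the model (one long-range contact per node, drawn with Pr ~ r^{−α}, here at the navigable exponent α = D = 1) with greedy forwarding; N and the sample size are illustrative.

```python
import random

# Greedy routing on a 1-D ring small world: each node links to its two
# lattice neighbors plus one long-range contact with Pr ~ r**(-alpha).
random.seed(7)
N, alpha = 1000, 1.0  # alpha = D = 1: the efficiently navigable exponent

def ring_dist(a, b):
    d = abs(a - b)
    return min(d, N - d)

contacts = {}
for u in range(N):
    cand = [v for v in range(N) if v != u]
    w = [ring_dist(u, v) ** -alpha for v in cand]
    contacts[u] = [(u - 1) % N, (u + 1) % N,
                   random.choices(cand, weights=w)[0]]

def greedy(s, t):
    # always forward to the known contact closest to the target
    hops, u = 0, s
    while u != t:
        u = min(contacts[u], key=lambda v: ring_dist(v, t))
        hops += 1
    return hops

lengths = [greedy(random.randrange(N), random.randrange(N)) for _ in range(50)]
print(sum(lengths) / len(lengths))  # far below the ~N/4 lattice-only average
```

Greedy forwarding always strictly decreases the distance to the target (a lattice neighbor already does), so every walk terminates; the shortcuts are what make the paths short.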
Is the Web a geographical network? Replace lattice distance by lexical distance r = 1/σ_c − 1 [plot: local links vs. long-range links with a power-law tail]
Exception #2: hierarchical networks (Kleinberg 2002; Watts et al. 2002)
– Nodes are classified at the leaves of a tree
– Link probability: exponential tail, Pr ~ e^{−h}, where h is tree distance (height of the lowest common ancestor)
– Greedy delivery time t ~ log^ε N, ε ≥ 1
Is the Web a hierarchical network? Replace tree distance by semantic distance h = 1 − σ_s [figure: tree with top, lca, c1, c2; exponential tail]
Take home message: the Web is a “friendly” place!
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
Crawler applications • Universal Crawlers – Search engines! • Topical crawlers – Live search (e.g., myspiders.informatics.indiana.edu) – Topical search engines & portals – Business intelligence (find competitors/partners) – Distributed, collaborative search
Topical [sic] crawlers [demo screenshot: query “spears”]
Evaluating topical crawlers
• Goal: build “better” crawlers to support applications
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics and relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, their ordering, etc.
• Efficiency: separate the CPU & memory costs of crawler algorithms from bandwidth & common utilities
Information Retrieval 2005
Evaluating topical crawlers : Topics Keywords Description Targets • Automate evaluation using edited directories • Different sources of relevance assessments
Evaluating topical crawlers: Tasks
Start from seeds, find targets and/or pages similar to target descriptions [figure: seeds at link distance d = 2 or d = 3 from targets]