Navigating the Web Graph
Workshop on Networks and Navigation, Santa Fe Institute, August 2008
Filippo Menczer, Informatics & Computer Science, Indiana University, Bloomington
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
The Web as a text corpus
Pages close in word-vector space tend to be related: the cluster hypothesis (van Rijsbergen 1979)
[Figure: page vectors p1, p2 near the query “weapons of mass destruction”]
The WebCrawler (Pinkerton 1994) and the whole first generation of search engines
Enter the Web’s link structure

p(i) = α/N + (1 − α) Σ_{j: j→i} p(j) / |{ℓ: j→ℓ}|

Brin & Page 1998; Barabási & Albert 1999; Broder et al. 2000
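The formula above can be sketched as a power iteration. This is a minimal illustration, not the production algorithm; the three-page graph is invented, and dangling-node handling is one common convention among several.

```python
# Minimal PageRank power iteration matching the slide's formula:
# p(i) = alpha/N + (1 - alpha) * sum over in-links j of p(j)/outdeg(j).
# The toy graph is hypothetical.

def pagerank(out_links, alpha=0.15, iters=100):
    nodes = list(out_links)
    n = len(nodes)
    p = {i: 1.0 / n for i in nodes}
    for _ in range(iters):
        new = {i: alpha / n for i in nodes}
        for j, targets in out_links.items():
            if not targets:  # dangling node: spread its mass uniformly
                for i in nodes:
                    new[i] += (1 - alpha) * p[j] / n
            else:
                share = (1 - alpha) * p[j] / len(targets)
                for i in targets:
                    new[i] += share
        p = new
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(scores)  # scores form a probability distribution over pages
```

Because "c" receives links from both "a" and "b" while "b" receives only half of "a"'s vote, "c" ends up with the higher score.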
Three network topologies Text Meaning Links
The “link-cluster” conjecture
Connection between semantic topology (topicality/relevance) and link topology (hypertext)
G = Pr[rel(p)] ~ fraction of relevant pages (generality)
R = Pr[rel(p) | rel(q) AND link(q,p)]
Related nodes are “clustered” if R > G (modularity)
[Figure example: G = 5/15, C = 2, R = 3/6 = 2/4]
Necessary and sufficient condition for a random crawler to find pages related to start points
ICML 1997
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η* = lim_{t→∞} η(t) = G / (1 − (R − G))
• Conjecture: η* > G ⇔ R > G
• Value added: η*/G − 1 = (R − G) / (1 − (R − G))
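The recursion above can be checked numerically: iterating the hit-rate update converges to the stated stationary value. A small sketch, with R and G chosen arbitrarily for illustration:

```python
# Iterate eta(t+1) = eta(t)*R + (1 - eta(t))*G and compare the limit
# against the closed-form stationary value G / (1 - (R - G)).
# R and G are illustrative values with R > G (clustered case).
R, G = 0.5, 0.2
eta = G  # start from the background hit rate
for _ in range(1000):
    eta = eta * R + (1 - eta) * G
stationary = G / (1 - (R - G))
print(eta, stationary)  # the two agree; both exceed G since R > G
```

The update is a contraction (slope R − G, with |R − G| < 1), so convergence to the fixed point is geometric.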
Link-cluster conjecture
Pages that link to each other tend to be related: preservation of semantics (meaning); its loss is a.k.a. topic drift

R(q, δ) ≡ Pr[rel(p) | rel(q) ∧ path(q,p) ≤ δ]
G(q) ≡ Pr[rel(p)]
L(q, δ) ≡ Σ_{p: path(q,p) ≤ δ} path(q,p) / |{p: path(q,p) ≤ δ}|

JASIST 2004
The “link-content” conjecture
Correlation of lexical and linkage topology

S(q, δ) ≡ Σ_{p: path(q,p) ≤ δ} sim(q,p) / |{p: path(q,p) ≤ δ}|

L(δ): average link distance from the start (topic) page; S(δ): average similarity to it, over pages up to distance δ
Correlation ρ(L, S) = −0.76
Heterogeneity of link-content correlation
Fit: S = c + (1 − c)·e^{a·L^b}
By top-level domain (edu, net, gov, org, com): some domains differ significantly in a only (p < 0.05), others in both a and b (p < 0.05)
Mapping the relationship between link, content, and semantic topologies
• Given any pair of pages, we need a similarity/proximity metric for each topology:
– Content: textual/lexical (cosine) similarity
– Link: co-citation / bibliographic coupling
– Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org), ~1 M pages after cleanup → ~1.3 × 10^12 page pairs!
Content similarity: cosine in term space,
σ_c(p1, p2) = (p1 · p2) / (‖p1‖ ‖p2‖)

Link similarity: Jaccard coefficient of the pages’ link neighborhoods U_p,
σ_l(p1, p2) = |U_{p1} ∩ U_{p2}| / |U_{p1} ∪ U_{p2}|
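Both pairwise metrics are easy to sketch. The term vectors and neighbor sets below are invented; real pages would use weighted term-frequency vectors and the sets of co-cited/co-citing pages.

```python
import math

# Cosine (content) similarity between sparse term vectors, and
# Jaccard (link) similarity between neighbor sets U_p.

def sigma_c(v1, v2):
    dot = sum(v1.get(t, 0) * w for t, w in v2.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def sigma_l(u1, u2):
    return len(u1 & u2) / len(u1 | u2) if u1 | u2 else 0.0

p1 = {"web": 2, "graph": 1}          # hypothetical term weights
p2 = {"web": 1, "crawler": 1}
print(sigma_c(p1, p2))               # overlap only on "web"
print(sigma_l({"a", "b"}, {"b", "c"}))  # 1 shared of 3 total neighbors
```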
Semantic similarity
• Information-theoretic measure based on the classification tree (Lin 1998):
σ_s(c1, c2) = 2 log Pr[lca(c1, c2)] / (log Pr[c1] + log Pr[c2])
• Reduces to the classic path distance in the special case of a balanced tree
[Figure: tree with top, lca, c1, c2]
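A sketch of Lin's measure on a tiny hypothetical topic tree, where Pr[c] is the fraction of pages filed in the subtree rooted at c. The tree, category names, and page counts are all made up for illustration.

```python
import math

# Lin (1998) semantic similarity on a toy classification tree.
parent = {"science": "top", "arts": "top",
          "physics": "science", "biology": "science"}
pages = {"top": 0, "science": 10, "arts": 30, "physics": 40, "biology": 20}
total = sum(pages.values())

def ancestors(c):
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def prob(c):
    # mass of c plus all of its descendants
    return sum(n for node, n in pages.items() if c in ancestors(node)) / total

def sigma_s(c1, c2):
    lca = next(a for a in ancestors(c1) if a in set(ancestors(c2)))
    if prob(lca) == 1.0:
        return 0.0  # lca is the root: no shared information
    return 2 * math.log(prob(lca)) / (math.log(prob(c1)) + math.log(prob(c2)))

print(sigma_s("physics", "biology"))  # share the "science" ancestor
print(sigma_s("physics", "arts"))     # only shared ancestor is the root: 0
```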
Individual metric distributions [histograms of the semantic, content, and link similarity metrics]
Precision = |Retrieved ∩ Relevant| / |Retrieved|
Recall = |Retrieved ∩ Relevant| / |Relevant|
Averaging semantic similarity → precision analogue:
P(s_c, s_l) = Σ_{(p,q): σ_c = s_c, σ_l = s_l} σ_s(p,q) / |{(p,q): σ_c = s_c, σ_l = s_l}|

Summing semantic similarity → recall analogue:
R(s_c, s_l) = Σ_{(p,q): σ_c = s_c, σ_l = s_l} σ_s(p,q) / Σ_{(p,q)} σ_s(p,q)
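The two aggregations can be sketched directly: bin page pairs by their (content, link) similarity values, then average (precision) or sum-and-normalize (recall) their semantic similarity. The pair data below are invented.

```python
# Generalized precision/recall over page pairs, as on the slide.
pairs = [
    # (sigma_c bin, sigma_l bin, sigma_s)
    (0.9, 0.8, 0.7),
    (0.9, 0.8, 0.5),
    (0.1, 0.0, 0.1),
]

def precision(sc, sl):
    vals = [s for c, l, s in pairs if (c, l) == (sc, sl)]
    return sum(vals) / len(vals) if vals else 0.0

def recall(sc, sl):
    total = sum(s for _, _, s in pairs)
    return sum(s for c, l, s in pairs if (c, l) == (sc, sl)) / total

print(precision(0.9, 0.8))  # (0.7 + 0.5) / 2 = 0.6
print(recall(0.9, 0.8))     # 1.2 / 1.3
```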
[Precision and recall surfaces as functions of σ_c and log σ_l, shown for the topics Science, Adult, and News, and for all pairs]
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
Link probability vs. lexical distance
r = 1/σ_c − 1
Pr(λ | ρ) = |{(p,q): r = ρ ∧ σ_l > λ}| / |{(p,q): r = ρ}|
Phase transition at ρ*; power-law tail Pr(λ | ρ) ~ ρ^{−α(λ)}
Proc. Natl. Acad. Sci. USA 99(22): 14014–14019, 2002
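The empirical quantity above can be sketched as a counting exercise over page pairs: among pairs at lexical distance ρ, take the fraction whose link similarity exceeds λ. The pair data are invented.

```python
# Empirical link probability Pr(lambda | rho), as defined on the slide.
pairs = [
    # (sigma_c, sigma_l)
    (0.5, 0.6), (0.5, 0.1), (0.5, 0.9), (0.2, 0.0),
]

def link_prob(lam, rho, tol=1e-9):
    # restrict to pairs at lexical distance r = 1/sigma_c - 1 == rho
    at_rho = [(c, l) for c, l in pairs if abs(1 / c - 1 - rho) < tol]
    if not at_rho:
        return 0.0
    return sum(1 for _, l in at_rho if l > lam) / len(at_rho)

# sigma_c = 0.5 gives rho = 1; two of those three pairs have sigma_l > 0.5
print(link_prob(0.5, 1.0))
```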
Local content-based growth model

Pr(p_t → p_i), i < t:
  k(i) / (mt)            if r(p_i, p_t) < ρ*
  c·[r(p_i, p_t)]^{−α}   otherwise

• Similar to preferential attachment (BA)
• Uses degree info (popularity/importance) only for nearby (similar/related) pages
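A toy simulation of this growth rule, under loud assumptions: page "content" is reduced to a random scalar, lexical distance is the absolute difference, and ρ*, α, m, and the seed graph are all illustrative rather than fitted values.

```python
import random

# Toy local content-based growth: a new page attaches preferentially by
# degree only to nearby pages (r < rho_star); far pages are chosen with
# weight proportional to r**(-alpha).
random.seed(42)
rho_star, alpha, m = 0.25, 1.7, 2

pos = [random.random() for _ in range(3)]  # seed pages' content coordinates
degree = [2, 2, 2]                         # small seed triangle
edges = [(0, 1), (1, 2), (0, 2)]

for t in range(3, 200):
    pos.append(random.random())
    degree.append(0)
    weights = []
    for i in range(t):
        r = abs(pos[i] - pos[t]) + 1e-9    # lexical distance proxy
        weights.append(degree[i] if r < rho_star else r ** -alpha)
    targets = set()
    while len(targets) < m:                # m distinct targets per new page
        targets.add(random.choices(range(t), weights=weights)[0])
    for i in targets:
        edges.append((t, i))
        degree[i] += 1
        degree[t] += 1

print(len(pos), len(edges))  # 200 nodes, 3 + 2*197 = 397 edges
```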
So, many models can predict degree distributions... Which is “right” ? Need an independent observation (other than degree) to validate models Distribution of content similarity across linked pairs
None of these models is right!
The mixture model
Pr(i) ∝ ψ · 1/t + (1 − ψ) · k(i)/(mt) — a degree-uniform mixture
Bias the choice by content similarity instead of a uniform distribution
Degree-similarity mixture model
Pr(i) ∝ ψ · P̂r(i) + (1 − ψ) · k(i)/(mt), with P̂r(i) ∝ [r(i,t)]^{−α}
Fitted parameters: ψ = 0.2, α = 1.7
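One attachment step of this mixture can be sketched as follows. The quoted ψ = 0.2 and α = 1.7 are the fitted values from the slide; the three candidate nodes, their degrees, and their lexical distances from the new page are invented.

```python
import random

# One attachment step of the degree-similarity mixture model:
# with probability psi pick by content similarity (weight r**(-alpha)),
# otherwise pick preferentially by degree.
random.seed(1)
psi, alpha = 0.2, 1.7

degree = [5, 1, 1]
r_to_new = [0.9, 0.1, 0.5]  # lexical distance from the new page to each node

def pick_target():
    if random.random() < psi:
        weights = [r ** -alpha for r in r_to_new]  # similarity-driven
    else:
        weights = degree                            # degree-driven
    return random.choices(range(len(degree)), weights=weights)[0]

picks = [pick_target() for _ in range(10000)]
print([picks.count(i) / len(picks) for i in range(3)])
```

The high-degree node 0 dominates overall (the degree branch fires 80% of the time), while the lexically closest node 1 gets a boost from the similarity branch.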
Both mixture models get the degree distribution right…
…but the degree-similarity mixture model predicts the similarity distribution better Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
Citation networks: 15,785 articles published in PNAS between 1997 and 2002 [scatter of σ_l vs. σ_c]
Open Questions Understand distribution of content similarity across all pairs of pages Growth model to explain co-evolution of both link topology and content similarity The role of search engines
Efficient crawling algorithms?
Theory: since the Web is a small-world network with a scale-free degree distribution, short paths exist between any two pages:
~ log N (Barabási & Albert 1999)
~ log N / log log N (Bollobás 2001)
Practice: we can’t find them!
• Greedy algorithms based on location in geographical small-world networks: ~ poly(N) (Kleinberg 2000)
• Greedy algorithms based on degree in power-law networks: ~ N (Adamic, Huberman et al. 2001)
Exception #1: geographical networks (Kleinberg 2000)
– Local links to all lattice neighbors
– Long-range link probability: power law, Pr ~ r^{−α}, where r is lattice (Manhattan) distance and α a constant clustering exponent
– Greedy delivery time t ~ log² N ⇔ α = D
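Kleinberg's result is easy to see experimentally. Below is a toy 1-D ring version of the model (one long-range contact per node, drawn with Pr ~ r^{−α}, here at the navigable exponent α = D = 1) with greedy forwarding; N and the sample size are illustrative.

```python
import random

# Greedy routing on a 1-D ring small world: each node links to its two
# lattice neighbors plus one long-range contact with Pr ~ r**(-alpha).
random.seed(7)
N, alpha = 1000, 1.0  # alpha = D = 1: the efficiently navigable exponent

def ring_dist(a, b):
    d = abs(a - b)
    return min(d, N - d)

contacts = {}
for u in range(N):
    cand = [v for v in range(N) if v != u]
    w = [ring_dist(u, v) ** -alpha for v in cand]
    contacts[u] = [(u - 1) % N, (u + 1) % N,
                   random.choices(cand, weights=w)[0]]

def greedy(s, t):
    # always forward to the known contact closest to the target
    hops, u = 0, s
    while u != t:
        u = min(contacts[u], key=lambda v: ring_dist(v, t))
        hops += 1
    return hops

lengths = [greedy(random.randrange(N), random.randrange(N)) for _ in range(50)]
print(sum(lengths) / len(lengths))  # far below the ~N/4 lattice-only average
```

Greedy forwarding always strictly decreases the distance to the target (a lattice neighbor already does), so every walk terminates; the shortcuts are what make the paths short.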
Is the Web a geographical network? Replace lattice distance by lexical distance r = 1/σ_c − 1 [plot: local links vs. long-range links with a power-law tail]
Exception #2: hierarchical networks (Kleinberg 2002; Watts et al. 2002)
– Nodes are classified at the leaves of a tree
– Link probability: exponential tail, Pr ~ e^{−h}, where h is tree distance (height of the lowest common ancestor)
– Greedy delivery time t ~ log^ε N, ε ≥ 1
Is the Web a hierarchical network? Replace tree distance by semantic distance h = 1 − σ_s [figure: tree with top, lca, c1, c2; exponential tail]
Take home message: the Web is a “friendly” place!
Outline Topical locality: Content, link, and semantic topologies Implications for growth models and navigation Applications Topical Web crawlers Distributed collaborative peer search
Crawler applications • Universal Crawlers – Search engines! • Topical crawlers – Live search (e.g., myspiders.informatics.indiana.edu) – Topical search engines & portals – Business intelligence (find competitors/partners) – Distributed, collaborative search
Topical [sic] crawlers [demo screenshot: query “spears”]
Evaluating topical crawlers
• Goal: build “better” crawlers to support applications
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics and relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, their ordering, etc.
• Efficiency: separate the CPU & memory costs of crawler algorithms from bandwidth & common utilities
Information Retrieval 2005
Evaluating topical crawlers : Topics Keywords Description Targets • Automate evaluation using edited directories • Different sources of relevance assessments
Evaluating topical crawlers: Tasks
Start from seeds, find targets and/or pages similar to target descriptions [figure: seeds at link distance d = 2 or d = 3 from targets]