Web Dynamics Part 3 – Searching the Dynamic Web – PowerPoint Presentation


  1. Web Dynamics Part 3 – Searching the Dynamic Web
     3.1 Crawling and recrawling policies
     3.2 Accessing the Hidden Web
     Summer Term 2009, Web Dynamics 3-1

  2. Why crawling is difficult
     • Huge size of the Web (billions of pages)
     • High dynamics of the Web (page creations, updates, deletions)
     • High diversity in the Web (page importance, quality, formats, conformance to standards)
     • Huge amount of noise and malicious content (spam), duplicate content (Wikipedia copies)

  3. Requirements for a Crawler
     • Robustness: resilience to (malicious or unintended) crawler traps
     • Politeness: respect servers' policies for accessing pages (which pages & how frequently)
     • Quality: focus on downloading "important" pages
     • Freshness: make sure that crawled snapshots correspond to the current version of pages
     • Scalability: cope with growing load by adding machines & bandwidth
     • Efficiency: make efficient use of system resources
     • Extensibility: possible to add new features (data formats, protocols)

  4. Basic Crawler Architecture
     [Diagram: the URL frontier/queue is initialized with seed URLs; for each URL the crawler performs DNS resolution, fetches the page from the Web, and parses its content; parsed text goes to the indexer, while extracted links pass through a page filter, duplicate elimination, and a link filter before being fed back into the URL frontier.]

  5. Crawler Types
     • Snapshot crawler: get at most one snapshot of each page (important for archiving)
     • Batch-mode crawler: revisit known pages periodically (collection is fixed)
     • Steady crawler: continuously revisit known pages (collection is fixed)
     • Incremental crawler: continuously revisit known pages and increase crawl quality by finding new good pages

  6. Queue design for snapshot crawlers
     Goals:
     • Allow for different crawl priorities, but provide fairness
     • Keep the crawler busy while being polite
     [Diagram: a prioritizer routes incoming URLs into F front queues (1 … F, one per priority level); a biased front-queue selector feeds the back-queue router, which distributes URLs over B back queues (1 … B, a unique set of hosts on each); a heap of entries (back queue, next access time) drives the back-queue selector.]
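The front-/back-queue design above can be sketched in a few lines of Python. This is a minimal, hypothetical simplification (class and method names are my own, not from the slides): front queues hold URLs by priority, each back queue holds URLs of exactly one host, and a heap keyed by each host's next-allowed access time enforces politeness.

```python
import heapq
from collections import deque

class Frontier:
    """Sketch of a Mercator-style URL frontier (no DNS, robots.txt, or persistence)."""

    def __init__(self, num_front=3, politeness_delay=2.0):
        self.front = [deque() for _ in range(num_front)]  # index 0 = highest priority
        self.back = {}     # host -> deque of URLs (one back queue per host)
        self.heap = []     # entries (next allowed access time, host)
        self.delay = politeness_delay

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _host(self, url):
        # crude host extraction, good enough for the sketch
        return url.split("/")[2] if "//" in url else url

    def _refill(self):
        # Front-queue selector: drain queues in priority order into back queues.
        for q in self.front:
            while q:
                url = q.popleft()
                host = self._host(url)
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (0.0, host))  # immediately eligible
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        self._refill()
        if not self.heap or self.heap[0][0] > now:
            return None          # all hosts are still in their politeness delay
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        if self.back[host]:      # host has more URLs: re-schedule after the delay
            heapq.heappush(self.heap, (now + self.delay, host))
        else:
            del self.back[host]
        return url
```

A fetch loop would repeatedly call `next_url(time.time())` and sleep briefly when it returns `None`; this keeps the crawler busy across many hosts while never hitting the same host more often than the politeness delay allows.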

  7. Modeling page changes over time
     Observation: Page changes can be modeled by a Poisson process with change rate λ.
     Probability for at least one change until time t: 1 − e^(−λt)
     Time between changes: expectation E[t] = 1/λ, variance Var[t] = 1/λ²
     Note: change rates differ per page (and maybe over time)
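Under the Poisson model the time until the first change is exponentially distributed, so the change probability 1 − e^(−λt) can be checked empirically. A small sketch (parameter values are illustrative, not from the slides):

```python
import math
import random

def prob_changed_by(lam, t, trials=200_000, seed=42):
    """Fraction of simulated pages whose first change occurs before time t."""
    rng = random.Random(seed)
    # expovariate(lam) draws the exponentially distributed time of the first change
    hits = sum(1 for _ in range(trials) if rng.expovariate(lam) < t)
    return hits / trials

lam, t = 0.5, 2.0                    # e.g. 0.5 changes/day, 2-day horizon
theory = 1 - math.exp(-lam * t)      # closed form: 1 - e^{-1} ≈ 0.632
estimate = prob_changed_by(lam, t)   # Monte-Carlo estimate, should be close
```

With 200,000 trials the estimate agrees with the closed form to within about ±0.003.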

  8. Poisson processes on real data
     Cho & Garcia-Molina, TODS 2003:
     • Daily crawl of 720,000 pages from 270 sites over approximately 4.5 months
     • Seeds: popular pages from a large Web crawl

  9. Change rate distributions on the Web
     Cho & Garcia-Molina, TODS 2003

  10. Sampling change rates
      Goal: determine λ_i for a fixed page i.
      Simple estimator: for X_i monitored updates in time T_i, estimate
        λ̂_i = X_i / T_i
      Question: is this a good estimator?
      • Is it unbiased, i.e., E[λ̂_i] = λ_i? No.
      • Is it consistent, i.e., lim_{n→∞} P[|λ̂_i − λ_i| < ε] = 1 for any positive ε? No.
      Better estimator: for n_i accesses with frequency f_i during which the page was found unchanged Y_i times:
        λ̂_i = −f_i · log( (Y_i + 0.5) / (n_i + 0.5) )
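The bias of the simple estimator is easy to see in simulation: when we only observe whether the page changed between consecutive accesses, several changes within one interval collapse into a single observed update, so X_i/T_i underestimates fast-changing pages. A sketch comparing the two estimators (function names are mine):

```python
import math
import random

def naive_estimate(changed_flags, f):
    # lambda ~ X / T, with X = number of intervals in which a change was observed
    n = len(changed_flags)
    return sum(changed_flags) / (n / f)

def improved_estimate(changed_flags, f):
    # lambda = -f * log((Y + 0.5) / (n + 0.5)), Y = number of unchanged intervals
    n = len(changed_flags)
    y = n - sum(changed_flags)
    return -f * math.log((y + 0.5) / (n + 0.5))

def simulate_accesses(lam, f, n, seed=7):
    # In each access interval of length 1/f, at least one Poisson change
    # occurs with probability 1 - e^{-lam/f}; we only see this binary outcome.
    rng = random.Random(seed)
    p = 1 - math.exp(-lam / f)
    return [rng.random() < p for _ in range(n)]

lam, f, n = 3.0, 1.0, 100_000     # page changes 3x faster than we access it
flags = simulate_accesses(lam, f, n)
naive = naive_estimate(flags, f)   # capped near 1 change per interval: far too low
better = improved_estimate(flags, f)  # recovers a value close to the true rate 3.0
```

The naive estimate can never exceed the access frequency f, while the log-based estimator inverts the "at least one change" probability and stays consistent as long as λ/f is not extremely large.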

  11. Crawling the dynamic Web
      Challenges:
      • How do we model the "up-to-dateness" of our index?
      • How frequently do we recrawl?
        – On average, update each of N pages once within I time units (average update frequency f = 1/I)
      • How frequently do we schedule per-page revisits?
        – uniformly vs. depending on the change rates
      • In which order do we revisit pages?
        – fixed order vs. random order (e.g., by recrawling) vs. purely random

  12. Measures for recency of the index (1)
      Definition: The index is (α, β)-current at time t when the probability that a random page was up-to-date β time units ago is at least α.
      Question to answer: How frequently do we need to recrawl to guarantee being (95%, 1 week)-current?
      Answer: every 18 days, for an 800-million-page sample [Brewington and Cybenko 2000]

  13. Brewington & Cybenko Model
      [Diagram: fetches at times 0 and I; the grace period β means a page only needs to have been up-to-date at time t − β.]
      Probability that a specific document is β-current in the interval [0, I]:
        (1/I) ∫₀^β 1 dt + (1/I) ∫_β^I e^(−λ(t−β)) dt = β/I + (1 − e^(−λ(I−β))) / (λI)
      For t < β the probability is 1; for t > β it decays exponentially with the delay.
      Now average over all documents (assuming a distribution w(λ) for the change rates):
        α = ∫₀^∞ w(λ) [ β/I + (1 − e^(−λ(I−β))) / (λI) ] dλ
      (see the paper: 1/λ is Weibull-distributed)
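For a single change rate λ the bracketed term can be evaluated directly and checked against a brute-force integration of its definition; averaging over w(λ) is omitted here for simplicity (the parameter values are illustrative, loosely echoing the 18-day / 1-week numbers above):

```python
import math

def beta_current_prob(lam, I, beta):
    """Closed form: P[document is beta-current at a random time in [0, I]]."""
    return beta / I + (1 - math.exp(-lam * (I - beta))) / (lam * I)

def beta_current_prob_numeric(lam, I, beta, steps=100_000):
    """Midpoint-rule integration of the integrand: 1 for t < beta,
    e^{-lam(t-beta)} for t > beta, averaged over [0, I]."""
    dt = I / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        total += (1.0 if t < beta else math.exp(-lam * (t - beta))) * dt
    return total / I

lam, I, beta = 0.2, 18.0, 7.0   # hypothetical: 0.2 changes/day, 18-day recrawl, 1-week grace
closed = beta_current_prob(lam, I, beta)
numeric = beta_current_prob_numeric(lam, I, beta)
```

The two values agree to several decimal places, which is a quick sanity check that the closed form matches the piecewise definition.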

  14. Measures for recency of the index (2)
      • Freshness F(p; t) of a page p at time t: 1 if p is up-to-date at time t, 0 otherwise
      • Age A(p; t) of a page p at time t: time since the last update of p that is not reflected in the index
      • Freshness of the index at time t:
          F(t) = (1/N) Σ_{i=1..N} F(p_i, t)
      • Average freshness of a page p:
          F(p) = lim_{t→∞} (1/t) ∫₀^t F(p; s) ds
      • Average freshness of the index:
          F̄ = (1/N) Σ_{i=1..N} F(p_i)

  15. Example: freshness and age for page p
      [Figure: evolution of F(p; t) and A(p; t) over time; Cho & Garcia-Molina, TODS 2003]

  16. Freshness and age for different crawlers
      Cho & Garcia-Molina, VLDB 2000
      [Figure: grey area = time when the crawler is active; solid line = F(t); dotted line = time-average of F(t)]
      Theorem: Average freshness is the same for both crawlers if the load is the same.

  17. Expected freshness and age of a page
      Assume for page p:
      – p changes with rate λ
      – p is synced at time 0
      Then:
      – Expected freshness of p at time t ≥ 0:
          E[F(p; t)] = 0 · (1 − e^(−λt)) + 1 · e^(−λt) = e^(−λt)
        (1 − e^(−λt) is P[p changed by time t])
      – Expected age of p at time t ≥ 0:
          E[A(p; t)] = ∫₀^t (t − s) λ e^(−λs) ds = t · (1 − (1 − e^(−λt)) / (λt))
        (λ e^(−λs) is the density of the first change of p occurring at time s)
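Both closed forms can be verified by Monte-Carlo simulation: draw the time of the first change after the sync and read off freshness and age at time t (a minimal sketch; parameter values are illustrative):

```python
import math
import random

def sample_page(lam, t, rng):
    """Return (freshness, age) of a page at time t, synced at time 0."""
    s = rng.expovariate(lam)   # exponentially distributed time of first change
    if s >= t:
        return 1, 0.0          # no change yet: page is fresh, age is 0
    return 0, t - s            # stale since the first unreflected change at s

rng = random.Random(1)
lam, t, trials = 1.0, 2.0, 200_000
samples = [sample_page(lam, t, rng) for _ in range(trials)]
emp_F = sum(f for f, _ in samples) / trials
emp_A = sum(a for _, a in samples) / trials

theory_F = math.exp(-lam * t)                              # e^{-2} ≈ 0.135
theory_A = t * (1 - (1 - math.exp(-lam * t)) / (lam * t))  # ≈ 1.135
```

Note that the age only depends on the *first* change after the sync (later changes do not make the page "more stale" under this definition), which is why sampling a single exponential variate suffices.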

  18. Expected freshness and age over time
      [Figure: E[F(p; t)] and E[A(p; t)] as functions of t; Cho & Garcia-Molina, TODS 2003]

  19. Which avg. freshness can we achieve?
      Assume that
      • all pages change at the same rate λ,
      • all pages are synced every I time units (at rate f = 1/I),
      • pages are always synced in a fixed order.
      Theorem:
        F(p) = lim_{t→∞} (1/t) ∫₀^t E[F(p; s)] ds = (1/I) ∫₀^I E[F(p; s)] ds
             = (1 − e^(−λI)) / (λI) = (1 − e^(−λ/f)) / (λ/f) = F̄
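The theorem reduces the long-run average to a single sync interval: E[F(p; t)] = e^(−λt) restarts at 1 after every sync, so averaging it over one interval of length I = 1/f gives the result. A quick numerical check of that step (illustrative values):

```python
import math

def avg_freshness_numeric(lam, f, steps=100_000):
    """Midpoint-rule average of e^{-lam*t} over one sync interval I = 1/f."""
    I = 1.0 / f
    dt = I / steps
    return sum(math.exp(-lam * (k + 0.5) * dt) * dt for k in range(steps)) / I

def avg_freshness_closed(lam, f):
    r = lam / f                         # frequency ratio: changes per sync
    return (1 - math.exp(-r)) / r

lam, f = 2.0, 1.0                       # page changes twice per sync interval
closed = avg_freshness_closed(lam, f)   # (1 - e^{-2}) / 2 ≈ 0.432
numeric = avg_freshness_numeric(lam, f)
```

The closed form depends only on the ratio r = λ/f, which is exactly the property the convexity proof on slide 25 exploits.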

  20. Are other orders better?
      • Random order: update all pages once, but in random order (e.g., by recrawling)
      • Purely random order: pick the page to update at random
      Cho & Garcia-Molina, TODS 2003

  21. Non-uniform update frequencies
      Now
      • page p_i changes with rate λ_i,
      • page p_i is updated at a fixed interval I_i (= 1/f_i).
      Question: How are f_i and λ_i related?
      The simple answer f_i ∝ λ_i is wrong!

  22. Simple example: two pages, one update
      [Diagram: over one day, p1 changes once in each of 9 equal intervals, while p2 changes once in the whole day; we consider spending one update in the middle of the day.]
      Assume
      • p1 changes once per interval (= 9 times/day),
      • p2 changes once per day,
      • the change time is uniformly distributed within each interval.
      Now estimate the expected benefit of updating p2 in the middle of the day:
      • with prob. ½ the change occurs later ⇒ benefit 0
      • with prob. ½ the change occurs before ⇒ benefit ½ (half a day of freshness)
      • expected benefit: ½ · ½ = ¼
      Similar computation for p1 (update in the middle of any of its intervals):
      • expected benefit: ½ · 1/18 = 1/36
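The two numbers follow the same pattern: with probability ½ the change falls before the midpoint, and the freshness gained is then half the interval length. A tiny sketch with exact arithmetic (the helper name is mine, not from the slides):

```python
from fractions import Fraction

def expected_benefit(interval_length):
    """Expected freshness gained by syncing at the midpoint of an interval,
    assuming the (single) change time is uniform within the interval."""
    p_change_before_midpoint = Fraction(1, 2)
    avg_gain_if_before = Fraction(interval_length) / 2
    return p_change_before_midpoint * avg_gain_if_before

benefit_p2 = expected_benefit(Fraction(1))     # p2: interval = 1 day  -> 1/4
benefit_p1 = expected_benefit(Fraction(1, 9))  # p1: interval = 1/9 day -> 1/36
```

So the single update is nine times more valuable when spent on the *slowly* changing page, which is the counter-intuitive core of the example.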

  23. Two pages, more updates
      Rules of thumb:
      • When the sync frequency (f1 + f2) is much smaller than the change frequency (λ1 + λ2), don't sync quickly changing pages.
      • Even for f1 + f2 ≈ λ1 + λ2, uniform allocation (5:5) is better than proportional allocation (9:1).
      Can we prove this?

  24. Proof (1)
      Notation/Definitions:
      • F(λ_i, f_i): average freshness of p_i when p_i changes with rate λ_i and is updated with rate f_i
      • average change rate: λ̄ = (1/N) Σ λ_i
      • a function f(x) is convex if (1/n) Σ_{i=1..n} f(x_i) ≥ f( (1/n) Σ_{i=1..n} x_i )
      Key fact: F(λ_i, f_i) is convex in λ_i, independent of the sync strategy.

  25. Proof (2)
      With uniform update frequency (f_i = f):
        F_u = (1/N) Σ F(λ_i, f_i) = (1/N) Σ F(λ_i, f)
      With proportional update frequency (f_i ∝ λ_i):
        F_p = (1/N) Σ F(λ_i, f_i) = (1/N) Σ F(λ̄, f) = F(λ̄, f)
      (here F(λ_i, f_i) = F(λ̄, f) because F depends only on the ratio r = λ/f)
      Then, by convexity:
        F_u = (1/N) Σ F(λ_i, f) ≥ F( (1/N) Σ λ_i, f ) = F(λ̄, f) = F_p
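The inequality is easy to illustrate numerically using the fixed-order average freshness F(λ, f) = (1 − e^(−λ/f)) / (λ/f) from slide 19 and the two-page example from slide 23 (the allocation of a total sync budget is my illustrative setup):

```python
import math

def avg_F(lam, f):
    """Average freshness under fixed-order syncing; depends only on lam/f."""
    r = lam / f
    return (1 - math.exp(-r)) / r

lams = [9.0, 1.0]             # the two-page example: 9 vs. 1 changes per day
total_budget = sum(lams)      # total sync rate we can afford: 10 syncs/day

# Uniform allocation: both pages synced at the same rate (5:5)
f_uniform = total_budget / len(lams)
F_uniform = sum(avg_F(lam, f_uniform) for lam in lams) / len(lams)

# Proportional allocation: f_i proportional to lam_i (9:1), same total budget
F_prop = sum(avg_F(lam, lam / sum(lams) * total_budget) for lam in lams) / len(lams)
```

Under the proportional policy every page has the same ratio λ_i/f_i = 1, so F_prop collapses to F(λ̄, f) = 1 − e^(−1) ≈ 0.632, while the uniform policy averages a convex function over unequal ratios and comes out higher (≈ 0.685), exactly as the convexity argument predicts.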
