Web Crawling and Web Dynamics
• Knut Magne Risvik and Rolf Michelsen, Search engines and Web dynamics. Computer Networks, Volume 39, Issue 3, 21 June 2002, Pages 289-302.
• Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), 10-14 September 2000, Cairo, Egypt, Pages 200-209.

Web Dynamics
• The Web is growing at a high (exponential) pace
• Documents are updated
• The dynamics of the web are shifting
  – More dynamic and real-time information is available
  – News
  – The dynamics of the web create a set of tough challenges for all search engines
• The link structure is changing
  – Using the link structure in ranking creates a slow-working positive feedback loop
• XML: understanding the structure and semantics of the data is a key feature for next-generation engines
• Goal: the local store copy must be fresh

Local Store
• Definition: a local store copy is a snapshot of the Web in which each document is captured as it was at its crawling time

Types of Crawlers
• Periodic/batch crawlers
  – Periodically rebuild the index from scratch
• Incremental crawlers
  – Update/refresh the local collection
  – Replace “less-important” pages with new and “more-important” pages

Trend
• Today, web servers are most commonly applications that serve HTML files directly from a file system on request.
• More advanced publication systems are tying business applications to the web servers.
• The percentage of the web that is actually indexable by search engines is decreasing (the deep web is growing).

Web Models
• Create a model for how documents are updated, e.g. web documents are updated as independent Poisson processes
  – The average document changes once every ten days
  – 50% of all documents have changed after 50 days
• Develop a crawling strategy that maximizes the freshness of the local store.
• Develop mechanisms for measuring the freshness of the local store.
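As a rough illustration of the Poisson update model, the sketch below computes how likely a single document is to have changed after a given number of days. The function names and the single fixed rate are assumptions made for illustration. The aggregate figures on the slide presumably reflect a mix of documents with very different change rates, so one rate cannot reproduce both of them.

```python
import math


def prob_changed(rate_per_day: float, days: float) -> float:
    """Probability of at least one change within `days`, assuming changes
    arrive as a Poisson process with the given average rate."""
    return 1.0 - math.exp(-rate_per_day * days)


def median_days_to_change(rate_per_day: float) -> float:
    """Number of days after which a document with this rate has changed
    with probability 0.5."""
    return math.log(2.0) / rate_per_day


# Illustrative rate: one change every ten days on average.
rate = 1.0 / 10.0
print(round(prob_changed(rate, 10), 3))        # 0.632
print(round(median_days_to_change(rate), 2))   # 6.93
```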

Cho and Garcia-Molina
• Crawling strategies
• Freshness = probability that the copy in the local store is up to date
• Age = time since the last update of the real document
• Interesting measure = average freshness or age over all documents and time
• Refreshing documents using uniform update frequencies is always better than using document update frequencies that are proportional to the estimated document change frequencies
• The scheduling policy that optimizes freshness penalizes documents that change too often
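The comparison between uniform and proportional refresh policies can be checked numerically under the Poisson model. The sketch below is not the authors' code. It assumes a document with change rate λ is refreshed at regular intervals with frequency f, in which case its time-averaged freshness is (1 − e^(−λ/f)) / (λ/f). The two change rates and the refresh budget are hypothetical values chosen for illustration.

```python
import math


def expected_freshness(change_rate: float, refresh_freq: float) -> float:
    """Time-averaged probability that the local copy is up to date, for a
    Poisson-changing document refreshed at regular intervals."""
    if refresh_freq <= 0:
        return 0.0
    if change_rate == 0:
        return 1.0
    r = change_rate / refresh_freq  # expected changes per refresh interval
    return (1.0 - math.exp(-r)) / r


def average_freshness(change_rates, refresh_freqs):
    """Mean freshness over all documents for a given refresh allocation."""
    pairs = list(zip(change_rates, refresh_freqs))
    return sum(expected_freshness(c, f) for c, f in pairs) / len(pairs)


change_rates = [1.0, 9.0]  # hypothetical changes per day for two documents
budget = 2.0               # hypothetical total refreshes per day

# Uniform policy: every document gets the same refresh frequency.
uniform = [budget / len(change_rates)] * len(change_rates)

# Proportional policy: refresh frequency follows the change rate.
total_rate = sum(change_rates)
proportional = [budget * c / total_rate for c in change_rates]

print(average_freshness(change_rates, uniform))
print(average_freshness(change_rates, proportional))
```

With these numbers the uniform policy yields an average freshness of about 0.37 against about 0.20 for the proportional policy. The proportional policy spends most of the budget on the fast-changing document, whose copy goes stale again almost immediately.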

FAST crawler
• Incremental crawler
• Star network
• Crawler machines only exchange information about discovered hyperlinks
• Static mapping from hyperlink information to crawler machines
• Scales linearly with document storage capacity
• Robust with regard to failure (hyperlink information for an unavailable crawler machine is queued on the sending machine until the designated receiver becomes available again)
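The static mapping and the queue-on-failure behaviour can be sketched as follows. This is an illustrative sketch, not the FAST implementation: the `LinkRouter` class, hashing on the host name, and the in-memory `delivered` lists standing in for network sends are all assumptions, and the star topology is not modelled.

```python
import hashlib
from collections import defaultdict, deque
from urllib.parse import urlsplit


class LinkRouter:
    """Routes discovered hyperlinks to crawler machines by a fixed hash
    of the host name (hypothetical mapping, chosen for illustration)."""

    def __init__(self, num_machines: int):
        self.num_machines = num_machines
        self.available = set(range(num_machines))
        self.pending = defaultdict(deque)   # machine id -> links held by the sender
        self.delivered = defaultdict(list)  # stand-in for a real network send

    def machine_for(self, url: str) -> int:
        """Static mapping: the same host always maps to the same machine."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_machines

    def send(self, url: str) -> None:
        """Deliver the link, or queue it locally if the receiver is down."""
        target = self.machine_for(url)
        if target in self.available:
            self.delivered[target].append(url)
        else:
            self.pending[target].append(url)

    def mark_down(self, machine: int) -> None:
        self.available.discard(machine)

    def mark_up(self, machine: int) -> None:
        """Flush everything queued for a machine once it is reachable again."""
        self.available.add(machine)
        queue = self.pending[machine]
        while queue:
            self.delivered[machine].append(queue.popleft())


router = LinkRouter(num_machines=4)
url = "http://example.com/page.html"
target = router.machine_for(url)
router.mark_down(target)
router.send(url)        # receiver is down, so the link waits on the sender
router.mark_up(target)  # queued link is delivered now
print(router.delivered[target])
```

Because the mapping depends only on the link itself, no coordination is needed to decide which machine owns a document, which is consistent with the machines exchanging nothing but hyperlink information.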

FAST crawler (cont.)
• The scheduling algorithm prioritizes retrieval of the documents most likely to have been updated on the web
• New documents are only retrieved when old documents are removed from the local store (e.g. the document no longer exists on the web)
• Maximizes freshness by spending as much of the crawler capacity as possible on refreshing documents that have actually changed
• Adaptively computes an estimate of the refresh frequencies
• Decreases the refresh interval if the document has changed; otherwise increases the refresh interval
• Avoids rescheduling any document more than once per indexing cycle
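The adaptive refresh rule above can be sketched as a small update function. The slides do not say how much the interval is adjusted, so the multiplicative factor, the interval bounds, and the names `RefreshState`, `update_interval` and `try_schedule` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RefreshState:
    interval_days: float = 7.0      # hypothetical starting refresh interval
    last_cycle_scheduled: int = -1  # indexing cycle of the last scheduling


def update_interval(state: RefreshState, changed: bool, factor: float = 2.0,
                    min_days: float = 1.0, max_days: float = 64.0) -> float:
    """Shorten the interval if the document changed since the last fetch,
    lengthen it otherwise. Factor and bounds are assumed values."""
    if changed:
        state.interval_days = max(min_days, state.interval_days / factor)
    else:
        state.interval_days = min(max_days, state.interval_days * factor)
    return state.interval_days


def try_schedule(state: RefreshState, cycle: int) -> bool:
    """Allow a document to be scheduled at most once per indexing cycle."""
    if state.last_cycle_scheduled == cycle:
        return False
    state.last_cycle_scheduled = cycle
    return True


state = RefreshState()
print(update_interval(state, changed=True))   # 3.5
print(update_interval(state, changed=False))  # 7.0
print(try_schedule(state, cycle=1))           # True
print(try_schedule(state, cycle=1))           # False
```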
