  1. Web Crawling gzsun@ustc.edu.cn

  2. Reference  [ACGPR01] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1, pp. 2-43, 2001.  [BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International WWW Conference, 1998.

  3. Search Engines for the Web

  4. What is it?  Web Crawling = Graph Traversal  Abstract:
     S = { start pages }
     repeat:
         remove an element s from S
         foreach (s, v):              // s has a link to v
             if v not crawled before, insert v into S
     Let's look at an example.
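
Before the example, here is a minimal runnable sketch of this abstract loop in Python. It assumes the third-party requests library; the regex-based link extraction and the page limit are illustrative simplifications, not production-quality parsing.

    import re
    import requests

    def crawl(start_pages, max_pages=100):
        frontier = list(start_pages)            # S = {start pages}
        crawled = set()
        while frontier and len(crawled) < max_pages:
            s = frontier.pop()                  # remove an element s from S
            if s in crawled:
                continue
            try:
                html = requests.get(s, timeout=5).text
            except requests.RequestException:
                continue                        # unreachable page: skip it
            crawled.add(s)
            for v in re.findall(r'href="(http[^"]+)"', html):  # each link (s, v)
                if v not in crawled:            # if v not crawled before
                    frontier.append(v)          # insert v into S
        return crawled

    crawl({"http://www.ustc.edu.cn/ch/index.php"})

Note that popping from the end of the list makes this depth-first; the choice of which element to remove from S is exactly the crawl strategy discussed on later slides.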

  5. http://www.ustc.edu.cn/ch/index.php  http://email.ustc.edu.cn/  http://alumni.ustc.edu.cn/  http://news.ustc.edu.cn/  http://netfee.ustc.edu.cn/  http://www.job.ustc.edu.cn/  http://zsb.ustc.edu.cn/  http://bbs.ustc.edu.cn/  http://ustcaf.org/  http://bbs.ustc.edu.cn/main.html

  6. Why is it not trivial? (Theoretical Issues)  How to choose S? (start pages)  E.g. we choose http://www.ustc.edu.cn/ch/index.php because it is believed that from it we can reach almost all the significant pages in USTC, but this is not always true on the Internet.

  7. Why is it not trivial? (Theoretical Issues)  How to choose s from S? (crawl strategy)  E.g. we use a DFS strategy to choose s:  1. email.ustc.edu.cn  2. netfee.ustc.edu.cn  3. bbs.ustc.edu.cn  ...  In the worst case, we get into http://bbs.ustc.edu.cn and then into some board like test or water; then for a long period we get nothing but rubbish. It gets even worse if we can fetch only a limited number of pages (because of a lack of resources).

  8. Why is it not trivial? (Theoretical Issues)  How to handle dynamic pages?  Every dynamic page may be modified after we fetch it, so we need to refresh it.  E.g. every article in our BBS may be modified, deleted, or have its marks changed; it is hard to keep our copy up to date.

  9. Why is it not trivial? (Practical Issues)  Limited resources may bring many problems:  limited storage: pages must be compressed efficiently or even stored in a distributed way  limited CPU resources: may require parallel techniques  limited network resources: fetch pages as well as possible (but how do we define *good*?)

  10. A simple, special-purpose example: fetch all articles from our BBS  visit http://bbs.ustc.edu.cn/cgi/bbsall to get all the board names  for each board name bn (e.g. Algorithm):  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=1 and extract the article links on this page  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=21 and extract the article links on this page  ...  until we cannot get any new pages  then for each article file name we got (e.g. M3DFB1F2C):  fetch http://bbs.ustc.edu.cn/cgi/bbscon?bn=Algorithm&fn=M3DFB1F2C  (see the sketch below)
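
A sketch of this board-by-board procedure in Python, again assuming the requests library. The regexes for board names and article file names are guesses about the page markup, not the BBS's actual format.

    import re
    import requests

    BASE = "http://bbs.ustc.edu.cn/cgi"

    def fetch(url):
        return requests.get(url, timeout=5).text

    def crawl_bbs():
        # bbsall lists every board; assume board names appear as "board=<name>" links
        boards = set(re.findall(r'board=(\w+)', fetch(f"{BASE}/bbsall")))
        articles = {}
        for bn in boards:
            start, seen = 1, set()
            while True:
                page = fetch(f"{BASE}/bbsdoc?board={bn}&start={start}")
                new = set(re.findall(r'fn=(M\w+)', page)) - seen
                if not new:          # until we cannot get any new article links
                    break
                seen |= new
                start += 20          # next index page: start=1, 21, 41, ...
            # finally fetch the body of every article found on this board
            articles[bn] = {fn: fetch(f"{BASE}/bbscon?bn={bn}&fn={fn}")
                            for fn in seen}
        return articles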

  11. A complex and universal example  (architecture diagram of a general-purpose crawler, reconstructed from the slide)  stored URLs for downloading -> choose a URL for downloading (PROBLEM 3: how to choose? Discuss later!) -> resolve host names quickly (PROBLEM 1: how?) -> download the page (PROBLEM 2: how? what's REP?) -> store the page if it is new -> temporarily stored pages, for reading by multiple components -> extract URLs -> filter out unwanted URLs -> has each URL been encountered before? if not, store it for downloading (PROBLEM 4: how?)

  12. Testing whether something is new  A hash is good, and good enough  e.g. MD5
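
A minimal sketch of the hash-based newness test in Python, using MD5 from the standard library as the slide suggests. Storing the 16-byte digests rather than full pages (or URLs) keeps the seen-set small.

    import hashlib

    seen = set()

    def is_new(content: bytes) -> bool:
        digest = hashlib.md5(content).digest()   # 16-byte fingerprint
        if digest in seen:
            return False
        seen.add(digest)
        return True

The same trick works for URL strings as for page bodies; collisions are negligible at crawl scale, and cryptographic strength is not needed here.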

  13. What's REP?  A crawler should obey many rules, such as the so-called Robots Exclusion Protocol:  a crawler should fetch /robots.txt before any crawling, read it, and then determine whether (and what) it may crawl on this site.  It is also important to give your contact information in your requests. Look at an example:  66.249.72.244 - - [07/Mar/2006:16:08:21 +0800] "GET /userstatus.php?user=Quester HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"  Do you know who submitted this query?
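
Python's standard library already implements the robots.txt check; a minimal sketch (the user-agent string with a contact URL is an example value, not a real crawler):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://bbs.ustc.edu.cn/robots.txt")
    rp.read()   # fetch and parse /robots.txt before any crawling
    if rp.can_fetch("MyCrawler/1.0 (+http://example.org/bot.html)",
                    "http://bbs.ustc.edu.cn/cgi/bbsall"):
        pass    # this URL may be crawled; otherwise skip it

(And yes: the log entry above was submitted by Google's crawler, Googlebot, which identifies itself and its contact page in the user-agent string.)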

  14. Now, how to choose an URL? (Crawl strategy)  Breadth-First Search  Maintain the URLs in a FIFO structure  Drawback: we may get into one host and crawl too much from it, which causes many problems, such as wasted BANDWIDTH and overloading the server, especially when many crawlers run in parallel.  Example: 1000 crawlers crawling the BBS server at the same time vs. only about 10 of them crawling the same server at the same time.

  15. How to choose an URL? (Crawl strategy)  Depth-First Search  Maintain the URLs in a LIFO fashion  Shares the same drawback as BFS  Our first example unfortunately falls into this case (see the sketch below).
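
The only difference between the two strategies is which end of the frontier the next URL comes from; a minimal sketch using Python's collections.deque, which supports both ends in O(1):

    from collections import deque

    frontier = deque(["http://www.ustc.edu.cn/ch/index.php"])
    frontier.append("http://bbs.ustc.edu.cn/")   # newly discovered URL goes in at the back

    url = frontier.popleft()   # FIFO -> breadth-first search
    url = frontier.pop()       # LIFO -> depth-first search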

  16. How to choose an URL? (Crawl strategy)  Random  Random is random, so we randomly choose not to describe it (with probability 1).

  17. How to choose an URL? (Crawl strategy)  Priority Search  Use some meaningful priority to determine which URL should be crawled first  E.g. in our BBS, the following should be crawled first:  on-top (pinned) articles  new articles  hot articles (such as the top 10)  notices  maybe: articles posted by your lovely girl (or even girls)

  18. How to choose an URL? (Crawl strategy)  Possible priorities  often-changing pages  pages with high global ranks (e.g. PageRank)  pages that you are focusing on (in crawlers motivated by special aims, such as a crawler for our BBS, or even just for yourself)
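
A sketch of a priority frontier with Python's heapq; the scores are made-up examples standing in for real signals such as change frequency or PageRank. heapq is a min-heap, so scores are negated to pop the highest-priority URL first.

    import heapq

    frontier = []

    def push(url, score):
        heapq.heappush(frontier, (-score, url))   # higher score pops first

    def pop():
        _, url = heapq.heappop(frontier)
        return url

    push("http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Notice", 0.9)   # notices: high
    push("http://bbs.ustc.edu.cn/cgi/bbscon?bn=test&fn=M1", 0.1)  # 'test' board: low
    assert "Notice" in pop()   # the highest-priority URL comes out first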

  19. How to choose an URL? (Crawl strategy)  How to estimate the goodness of a strategy?  If it has crawled 1,000,000 pages, how many of them are hot, that is, of high importance?
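
One way to turn this question into a number (a sketch; the importance scores, e.g. PageRank values, are assumed to be computed elsewhere):

    def strategy_quality(importance_scores, threshold=0.5):
        # fraction of crawled pages that are "hot", i.e. above the threshold
        if not importance_scores:
            return 0.0
        hot = sum(1 for s in importance_scores if s >= threshold)
        return hot / len(importance_scores)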

  20. Comparison between these strategies

  21. Advanced Issues  1. How to keep your pages up to date?  2. How to handle dynamic pages?  E.g. Google shows you dynamic results depending on your query and the pages it has indexed so far. Can you crawl all pages out of Google? If you could, you would be another Google.  3. How to balance your resources to achieve high performance?

  22. Thank you!
