  1. Web Crawling gzsun@ustc.edu.cn

  2. Reference  [ACGPR01] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1, pp. 2-43, 2001.  [BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International WWW Conference, 1998.

  3. Search Engines for the Web

  4. What is it?  Web Crawling = Graph Traversal  Abstract:
     S = { start pages }
     repeat:
         remove an element s from S
         foreach (s, v):              // s has a link to v
             if v not crawled before, insert v into S
     Let's look at an example.
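
Before the example, here is a minimal runnable sketch of this abstract loop in Python. It assumes the third-party requests library; the regex-based link extraction and the page limit are illustrative simplifications, not production-quality parsing.

    import re
    import requests

    def crawl(start_pages, max_pages=100):
        frontier = list(start_pages)            # S = {start pages}
        crawled = set()
        while frontier and len(crawled) < max_pages:
            s = frontier.pop()                  # remove an element s from S
            if s in crawled:
                continue
            try:
                html = requests.get(s, timeout=5).text
            except requests.RequestException:
                continue                        # unreachable page: skip it
            crawled.add(s)
            for v in re.findall(r'href="(http[^"]+)"', html):  # each link (s, v)
                if v not in crawled:            # if v not crawled before
                    frontier.append(v)          # insert v into S
        return crawled

    crawl({"http://www.ustc.edu.cn/ch/index.php"})

Note that popping from the end of the list makes this depth-first; the choice of which element to remove from S is exactly the crawl strategy discussed on later slides.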

  5. http://www.ustc.edu.cn/ch/index.php  http://email.ustc.edu.cn/  http://alumni.ustc.edu.cn/  http://news.ustc.edu.cn/  http://netfee.ustc.edu.cn/  http://www.job.ustc.edu.cn/  http://zsb.ustc.edu.cn/  http://bbs.ustc.edu.cn/  http://ustcaf.org/  http://bbs.ustc.edu.cn/main.html

  6. Why is it not trivial? (Theoretical Issues)  How to choose S? (start pages)  E.g. we choose http://www.ustc.edu.cn/ch/index.php because it is believed that from it we can reach almost all the significant pages in USTC, but this is not always true on the Internet.

  7. Why is it not trivial? (Theoretical Issues)  How to choose s from S? (crawl strategy)  E.g. we use a DFS strategy to choose s:  1. email.ustc.edu.cn  2. netfee.ustc.edu.cn  3. bbs.ustc.edu.cn  ...  In the worst case, we get into http://bbs.ustc.edu.cn and then into some board like test or water; then for a long period we get nothing but rubbish. It gets even worse if we can fetch only a limited number of pages (because of a lack of resources).

  8. Why is it not trivial? (Theoretical Issues)  How to handle dynamic pages?  Every dynamic page may be modified after we fetch it, so we need to refresh it.  E.g. every article in our BBS may be modified, deleted, or have its marks changed; it is hard to keep our copy up to date.

  9. Why is it not trivial? (Practical Issues)  Limited resources may bring many problems:  limited storage: pages must be compressed efficiently or even stored in a distributed way  limited CPU resources: may require parallel techniques  limited network resources: fetch pages as well as possible (but how do we define *good*?)

  10. A simple, special-purpose example: fetch all articles from our BBS  visit http://bbs.ustc.edu.cn/cgi/bbsall to get all the board names  for each board name bn (e.g. Algorithm):  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=1 and extract the article links on this page  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=21 and extract the article links on this page  ...  until we cannot get any new pages  then for each article file name we got (e.g. M3DFB1F2C):  fetch http://bbs.ustc.edu.cn/cgi/bbscon?bn=Algorithm&fn=M3DFB1F2C  (see the sketch below)
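
A sketch of this board-by-board procedure in Python, again assuming the requests library. The regexes for board names and article file names are guesses about the page markup, not the BBS's actual format.

    import re
    import requests

    BASE = "http://bbs.ustc.edu.cn/cgi"

    def fetch(url):
        return requests.get(url, timeout=5).text

    def crawl_bbs():
        # bbsall lists every board; assume board names appear as "board=<name>" links
        boards = set(re.findall(r'board=(\w+)', fetch(f"{BASE}/bbsall")))
        articles = {}
        for bn in boards:
            start, seen = 1, set()
            while True:
                page = fetch(f"{BASE}/bbsdoc?board={bn}&start={start}")
                new = set(re.findall(r'fn=(M\w+)', page)) - seen
                if not new:          # until we cannot get any new article links
                    break
                seen |= new
                start += 20          # next index page: start=1, 21, 41, ...
            # finally fetch the body of every article found on this board
            articles[bn] = {fn: fetch(f"{BASE}/bbscon?bn={bn}&fn={fn}")
                            for fn in seen}
        return articles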

  11. A complex and universal example  (architecture diagram of a general-purpose crawler, reconstructed from the slide)  stored URLs for downloading -> choose a URL for downloading (PROBLEM 3: how to choose? Discuss later!) -> resolve host names quickly (PROBLEM 1: how?) -> download the page (PROBLEM 2: how? what's REP?) -> store the page if it is new -> temporarily stored pages, for reading by multiple components -> extract URLs -> filter out unwanted URLs -> has each URL been encountered before? if not, store it for downloading (PROBLEM 4: how?)

  12. Testing whether something is new  A hash is good, and good enough  e.g. MD5
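
A minimal sketch of the hash-based newness test in Python, using MD5 from the standard library as the slide suggests. Storing the 16-byte digests rather than full pages (or URLs) keeps the seen-set small.

    import hashlib

    seen = set()

    def is_new(content: bytes) -> bool:
        digest = hashlib.md5(content).digest()   # 16-byte fingerprint
        if digest in seen:
            return False
        seen.add(digest)
        return True

The same trick works for URL strings as for page bodies; collisions are negligible at crawl scale, and cryptographic strength is not needed here.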

  13. What's REP?  A crawler should obey many rules, such as the so-called Robots Exclusion Protocol:  a crawler should fetch /robots.txt before any crawling, read it, and then determine whether (and what) it may crawl on this site.  It is also important to give your contact information in your requests. Look at an example:  66.249.72.244 - - [07/Mar/2006:16:08:21 +0800] "GET /userstatus.php?user=Quester HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"  Do you know who submitted this query?
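
Python's standard library already implements the robots.txt check; a minimal sketch (the user-agent string with a contact URL is an example value, not a real crawler):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://bbs.ustc.edu.cn/robots.txt")
    rp.read()   # fetch and parse /robots.txt before any crawling
    if rp.can_fetch("MyCrawler/1.0 (+http://example.org/bot.html)",
                    "http://bbs.ustc.edu.cn/cgi/bbsall"):
        pass    # this URL may be crawled; otherwise skip it

(And yes: the log entry above was submitted by Google's crawler, Googlebot, which identifies itself and its contact page in the user-agent string.)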

  14. Now, how to choose an URL? (Crawl strategy)  Breadth-First Search  Maintain the URLs in a FIFO structure  Drawback: we may get into one host and crawl too much from it, which causes many problems, such as wasted BANDWIDTH and overloading the server, especially when many crawlers run in parallel.  Example: 1000 crawlers crawling the BBS server at the same time vs. only about 10 of them crawling the same server at the same time.

  15. How to choose an URL? (Crawl strategy)  Depth-First Search  Maintain the URLs in a LIFO fashion  Shares the same drawback as BFS  Our first example unfortunately falls into this case (see the sketch below).
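
The only difference between the two strategies is which end of the frontier the next URL comes from; a minimal sketch using Python's collections.deque, which supports both ends in O(1):

    from collections import deque

    frontier = deque(["http://www.ustc.edu.cn/ch/index.php"])
    frontier.append("http://bbs.ustc.edu.cn/")   # newly discovered URL goes in at the back

    url = frontier.popleft()   # FIFO -> breadth-first search
    url = frontier.pop()       # LIFO -> depth-first search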

  16. How to choose an URL? (Crawl strategy)  Random  Random is random, so we randomly choose not to describe it (with probability 1).

  17. How to choose an URL? (Crawl strategy)  Priority Search  Use some meaningful priority to determine which URL should be crawled first  E.g. in our BBS, the following should be crawled first:  on-top (pinned) articles  new articles  hot articles (such as the top 10)  notices  maybe: articles posted by your lovely girl (or even girls)

  18. How to choose an URL? (Crawl strategy)  Possible priorities  often-changing pages  pages with high global ranks (e.g. PageRank)  pages that you are focusing on (in crawlers motivated by special aims, such as a crawler for our BBS, or even just for yourself)
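
A sketch of a priority frontier with Python's heapq; the scores are made-up examples standing in for real signals such as change frequency or PageRank. heapq is a min-heap, so scores are negated to pop the highest-priority URL first.

    import heapq

    frontier = []

    def push(url, score):
        heapq.heappush(frontier, (-score, url))   # higher score pops first

    def pop():
        _, url = heapq.heappop(frontier)
        return url

    push("http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Notice", 0.9)   # notices: high
    push("http://bbs.ustc.edu.cn/cgi/bbscon?bn=test&fn=M1", 0.1)  # 'test' board: low
    assert "Notice" in pop()   # the highest-priority URL comes out first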

  19. How to choose an URL? (Crawl strategy)  How to estimate the goodness of a strategy?  If it has crawled 1,000,000 pages, how many of them are hot, that is, of high importance?
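
One way to turn this question into a number (a sketch; the importance scores, e.g. PageRank values, are assumed to be computed elsewhere):

    def strategy_quality(importance_scores, threshold=0.5):
        # fraction of crawled pages that are "hot", i.e. above the threshold
        if not importance_scores:
            return 0.0
        hot = sum(1 for s in importance_scores if s >= threshold)
        return hot / len(importance_scores)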

  20. Comparison between these strategies

  21. Advanced Issues  1. How to keep your pages up to date?  2. How to handle dynamic pages?  E.g. Google shows you dynamic results depending on your query and the pages it has indexed so far. Can you crawl all pages out of Google? If you could, you would be another Google.  3. How to balance your resources to achieve high performance?

  22. Thank you!
