
Crawling with Apache Nutch - PowerPoint PPT Presentation



  1. CRAWLING WITH APACHE NUTCH
  Ashish Kumar Sinha, Deeksha Kushal Motwani, Shailender Joseph

  2. CONTENTS
  • Web-Crawling
  • Apache Nutch
  • Crawling Algorithm in Apache Nutch
  • REST API
  • Demo
  • Conclusion

  3. Web-Crawling

  4. What is Web-Crawling?
  • Web-crawling is a process by which search engine crawlers/spiders/bots scan a website and collect details about each page: titles, images, keywords, other linked pages, etc.
  • It also discovers updated content on the web, such as new sites or pages, changes to existing sites, and dead links.
  • According to Google, “The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they use links on those sites to discover other pages.”
  http://www.digitalgenx.com/learn/crawling-in-seo.php

  5. Web-Crawling
  • Crawlers are widely used by search engines like Google, Yahoo or Bing to retrieve the content of a URL, examine that page for other links, retrieve the URLs for those links, and so on.
  • Google was the first company to publish its web-crawler, which has two programs: spider and mite.
  • The spider maintains the seeds, and mite is responsible for downloading webpages.
  • Googlebot and Bingbot are the most popular spiders, owned by Google and Bing respectively.
  https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

  6. Web-Crawling vs Web-Scraping
  • Web-scraping is closely related to web-crawling, but it is a different technique.
  • The main purpose of a web-scraper is to convert unstructured data found on the internet into a structured format for analysis or later reference.
  • Web-scraping (like web-crawling) often has the ability to browse different pages and follow links.
  • But (unlike web-crawling) its primary purpose is extracting the data on those pages, not indexing the web.
  https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

  7. Process Flow of a Sequential Web-Crawler
  Popular open-source web-crawlers:
  • Scrapy: a Python-based web-crawling framework
  • Heritrix: a Java-based web-crawler designed for web-archiving, written by the Internet Archive
  • HTTrack: a ‘C’-based web-crawler developed by Xavier Roche
  • Apache Nutch
  https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf

  8. Frontier Initialization
  • A crawl frontier is a data structure used to store URLs eligible for crawling and to support operations such as adding URLs and selecting the next URL to crawl (it can be seen as a priority queue; a minimal sketch follows this slide).
  • The initial URLs contained in the crawl frontier are known as seeds:
  • Crawling “seeds” are the pages at which a crawler commences.
  • Seeds should be selected carefully, and multiple seeds may be necessary to ensure good coverage.
  https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf
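
The frontier described above can be modelled as a small priority queue. The sketch below is an assumed, minimal Python illustration only; the CrawlFrontier class and the integer priorities are inventions for this example, not how Nutch stores this state (Nutch keeps it in its CrawlDb):

```python
import heapq

class CrawlFrontier:
    """A minimal crawl frontier: a priority queue of URLs (lower score = crawled sooner)."""

    def __init__(self, seeds):
        self._heap = []        # (priority, url) pairs
        self._seen = set()     # URLs already added, to avoid duplicates
        for url in seeds:
            self.add(url, priority=0)   # seeds get the highest priority

    def add(self, url, priority=1):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, url))

    def next_url(self):
        """Return the next URL to crawl, or None if the frontier is empty."""
        return heapq.heappop(self._heap)[1] if self._heap else None

# Initialise the frontier with a couple of seed URLs
frontier = CrawlFrontier(["https://example.org/", "https://example.com/"])
print(frontier.next_url())
```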

  9. More on Frontier
  • The web-crawler will constantly ask the frontier what pages to visit.
  • As the crawler visits each of those pages, it will inform the frontier with the response of each page.
  • The crawler will also update the frontier with any new hyperlinks contained in the pages it has visited.
  • These hyperlinks are added to the frontier, and the crawler will visit the new webpages based on the policies of the frontier (a minimal crawl loop is sketched below).
  Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
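
Putting this slide and the previous one together, the ask/fetch/feed-back cycle can be sketched roughly as below. This reuses the hypothetical CrawlFrontier class from the earlier sketch, and the fetching and link extraction are deliberately naive, purely to show the loop:

```python
import re
import urllib.request
from urllib.parse import urljoin

def crawl(frontier, max_pages=10):
    """Drive the frontier: fetch each URL it hands out and feed new links back in."""
    crawled = 0
    while crawled < max_pages:
        url = frontier.next_url()          # ask the frontier what to visit next
        if url is None:
            break
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                        # on failure, just move on to the next URL
        crawled += 1
        # Naive hyperlink extraction; a real crawler would use a proper HTML parser
        for link in re.findall(r'href="([^"#]+)"', html):
            frontier.add(urljoin(url, link), priority=1)
```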

  10. Fetching
  • The fetcher is a multi-threaded application (capable of processing more than one task in parallel) that employs protocol plugins to retrieve the content of a set of URLs (a threaded sketch follows this slide).
  • The protocol plugin collects information about the network protocols supported by the system.
  • A plug-in is a software component that adds a specific feature to an existing computer program.
  • Network protocols are formal standards and policies, made up of rules, procedures and formats, that define communication between two or more devices over a network. They govern the end-to-end process of timely, secure and managed data or network communication.
  • Fetching is analogous to downloading a page (similar to what a browser does when you view it).
  Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
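
A rough illustration of multi-threaded fetching using Python's standard thread pool. This is only a sketch of the idea: the fetch helper, the URL list and the thread count are assumptions, and it hard-wires HTTP rather than dispatching to per-protocol plugins the way Nutch does:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Download one page, roughly what a single fetcher thread does for each URL."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.status, resp.read()

urls = ["https://example.org/", "https://example.com/"]

# Several fetcher threads retrieve content in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status, body in pool.map(fetch, urls):
        print(url, status, len(body), "bytes")
```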

  11. Parsing
  Involves one or more of the following:
  • Simple hyperlink/URL extraction.
  • Tidying up the HTML content in order to analyze the HTML tag tree (the hierarchical representation of the HTML page as a tree structure).
  • Converting the extracted URLs to a canonical form, removing stopwords from the page’s content and stemming the remaining words.
  • Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. E.g., the words consulting, consultant and consultative are stemmed to consult (a toy parsing sketch follows this slide).
  Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
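
The parsing steps above can be illustrated with Python's standard library. Everything here is a toy stand-in (the LinkAndTextParser class, the tiny stopword list and the crude_stem helper are assumptions); real systems use a full HTML parser and a proper stemmer such as Porter's:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkAndTextParser(HTMLParser):
    """Collect hyperlinks and visible text from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links, self.words = base_url, [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Canonicalise: resolve relative URLs and drop the #fragment
                self.links.append(urldefrag(urljoin(self.base_url, href))[0])

    def handle_data(self, data):
        self.words.extend(data.lower().split())

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # tiny illustrative list

def crude_stem(word):
    """A toy stemmer: strip a few common suffixes (real systems use Porter stemming)."""
    for suffix in ("ing", "ant", "ative", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

parser = LinkAndTextParser("https://example.org/docs/")
parser.feed('<p>Consulting a consultant and consultative advice</p> <a href="page.html#top">next</a>')
print(parser.links)                                    # ['https://example.org/docs/page.html']
print([crude_stem(w) for w in parser.words if w not in STOPWORDS])
```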

  12. Web-Crawling Policies
  The behaviour of a web-crawler depends on the outcome of a combination of policies:
  • Selection Policy: states which pages to download, so that the appropriate pages are retrieved.
  • Re-visit Policy: states when to check for changes to the pages. Two simple (or naive) re-visiting policies:
  • Uniform policy: all pages in the collection are re-visited with the same frequency.
  • Proportional policy: pages that change more frequently are re-visited more often. There are quantitative methods to measure the visiting frequency.
  • Politeness Policy: states how to avoid overloading websites (a sketch of the re-visit and politeness policies follows this slide).
  • Parallelization Policy: avoids repeated downloads of the same page when running a parallel crawler (a crawler that runs multiple processes).
  Bamrah NHS, Satpute BS, Patil P., 2014. Web Forum Crawling Techniques. International Journal of Computer Applications, 85, 36-41.
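
A sketch of how the re-visit and politeness policies might be expressed in code. The interval and delay constants and the helper names are assumptions chosen for illustration, not values taken from any particular crawler:

```python
import time

# Uniform re-visit policy: every page gets the same re-visit interval.
UNIFORM_INTERVAL = 24 * 3600          # seconds (assumed value)

def next_visit_uniform(last_visit):
    return last_visit + UNIFORM_INTERVAL

# Proportional re-visit policy: pages that change more often are re-visited more often.
def next_visit_proportional(last_visit, observed_changes_per_day):
    interval = 24 * 3600 / max(observed_changes_per_day, 0.1)   # floor avoids huge intervals
    return last_visit + interval

# Politeness policy: wait between successive requests to the same host.
POLITENESS_DELAY = 5.0                # seconds per host (assumed value)
last_request = {}                     # host -> timestamp of the previous fetch

def wait_politely(host):
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_request[host] = time.time()
```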

  13. Crawl Ordering Policy (or Crawl Strategy)
  • A strategy for a crawler to choose URLs from a crawling queue.
  • It is related to one of the following two main tasks:
  • Downloading newly discovered webpages not represented in the index
  • Refreshing copies of pages likely to have important updates
  Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval, 100-111.

  14. Crawl Ordering Policy
  • A strategy for a crawler to choose URLs from a crawling queue.
  • It is related to one of the following two main tasks:
  • Downloading newly discovered webpages not represented in the index
  • Refreshing copies of pages likely to have important updates
  • Breadth-first search: a technique where all the links in a page are followed in sequential order before the crawler follows the child links.
  • Child links can only be generated from the parent links, so crawlers need to save all the parent links in a page in order to follow the child links. Hence, it consumes a lot of memory.
  https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

  15. Crawl Ordering Policy
  • A strategy for a crawler to choose URLs from a crawling queue.
  • It is related to one of the following two main tasks:
  • Downloading newly discovered webpages not represented in the index
  • Refreshing copies of pages likely to have important updates
  • Breadth-first search (BFS): a technique where all the links in a page are followed in sequential order before the crawler follows the child links. Child links can only be generated from the parent links, so crawlers need to save all the parent links in a page in order to follow the child links. Hence, it consumes a lot of memory.
  • Depth-first search (DFS): an algorithm where the crawler starts with a parent link and crawls its child links until it reaches the end, then continues with another parent link. Since it does not have to save all the parent links in a page, it consumes relatively less memory than BFS (both orderings are sketched below).
  https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
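
The difference between the two orderings comes down to whether the frontier is consumed as a FIFO queue or a LIFO stack. A minimal sketch; the crawl_order helper and the toy link graph are assumptions for illustration only:

```python
from collections import deque

def crawl_order(seeds, get_links, limit=20, strategy="bfs"):
    """Return the order in which pages would be crawled under BFS or DFS.

    get_links(url) is assumed to return the hyperlinks found on a page."""
    frontier, seen, order = deque(seeds), set(seeds), []
    while frontier and len(order) < limit:
        # BFS: take from the front (FIFO queue); DFS: take from the back (LIFO stack)
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(crawl_order(["A"], graph.get, strategy="bfs"))   # ['A', 'B', 'C', 'D', 'E']
print(crawl_order(["A"], graph.get, strategy="dfs"))   # ['A', 'C', 'E', 'B', 'D']
```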

  16. Crawl Ordering Policy
  • Prioritize by indegree: the page with the highest number of incoming hyperlinks from previously downloaded pages is downloaded next (a small sketch follows this slide).
  • Incoming links are links that point to our website from another website; outgoing links are links that point from our site to another site.
  • Prioritize by PageRank: pages are downloaded in descending order of PageRank, as estimated from the pages and links acquired so far by the crawler.
  • PageRank is an algorithm used by Google Search to rank webpages in its search engine results.
  Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval, 100-111.
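
A minimal sketch of indegree prioritization. The helper name and data layout are assumptions; PageRank-based ordering would replace the simple counter with PageRank scores estimated over the links seen so far:

```python
from collections import Counter

def pick_next_by_indegree(candidates, known_links):
    """Choose the candidate URL with the most incoming links from pages crawled so far.

    known_links is a list of (source_url, target_url) pairs seen during the crawl."""
    indegree = Counter(target for _, target in known_links)
    return max(candidates, key=lambda url: indegree[url])

links_seen = [("A", "X"), ("B", "X"), ("C", "Y")]
print(pick_next_by_indegree(["X", "Y"], links_seen))   # 'X' has indegree 2, so it is fetched next
```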

  17. Apache Nutch
