
Crawler.NET: A component-based distributed framework for web traversal

Levente Hunyadi (BME AAIT), March 23, 2007


  1. Crawler.NET: A component-based distributed framework for web traversal
     Levente Hunyadi (BME AAIT), March 23, 2007

  2. Motivation
     The Web:
     - a source of distributed information
     - a giant set of semi-structured data
     ⇒ search engines are invaluable to locate information

  3. Motivation
     - up-to-date index database
       ⇓
     - efficient traversal
       ⇓
     - parallelization
       ⇓
     - distributed architecture
       ⇓
     - increased complexity

  4. Objectives
     - scalability
     - easy configuration and management
     - support for extension
     - robustness, resilience to failures

  5. Architectural overview
     Two separate layers:
     - Component framework (general tasks): component interaction; lifecycle management; transparent interprocess communication
     - Crawling application (field-specific issues): downloading documents; extracting hyperlinks; administering page references; scheduling requests

  6. Design
     - the component framework exposes general component skeletons that realize common behavior
     - new, field-specific components are created by means of inheritance
     - the framework provides loose coupling between components
     Advantages:
     + simpler and faster development
     + openness for extension
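
The skeleton-plus-inheritance design can be sketched roughly as follows. This is a minimal Python illustration, not the Crawler.NET API: the names `GenericFilter`, `HostExtractor`, `transform` and `run_once` are hypothetical, and the standard-library `Queue` merely stands in for whatever the framework uses to decouple components.

```python
from abc import ABC, abstractmethod
from queue import Queue
from urllib.parse import urlsplit


class GenericFilter(ABC):
    """Hypothetical skeleton: owns the message handling, subclasses add logic."""

    def __init__(self, inbox: Queue, outbox: Queue) -> None:
        # Components only know their queues, never each other (loose coupling).
        self.inbox = inbox
        self.outbox = outbox

    @abstractmethod
    def transform(self, message):
        """Field-specific behavior, supplied by a subclass."""

    def run_once(self) -> None:
        # Common behavior: take one message, transform it, pass it on.
        self.outbox.put(self.transform(self.inbox.get()))


class HostExtractor(GenericFilter):
    """Hypothetical field-specific component created by inheritance."""

    def transform(self, message: str) -> str:
        return urlsplit(message).hostname or ""
```

A subclass only has to provide `transform`; queue handling stays in the skeleton, which is where the "simpler and faster development" advantage would come from.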

  7. Building blocks of the architecture
     - Components: encapsulate field-specific functionality; produce, consume or transform data
     - Providers: give access to data sources
     - Connectors: provide asynchronous, message-based communication between components

  8. Components
     - abstract base class implements generic tasks
     - differentiated subclasses based on how they interact with the environment

  9. Components (class hierarchy diagram)
     - GenericComponent
     - GenericProducer, GenericConsumer, SimpleFilter, ComplexFilter
     - SynchronousOutputFilter, AsynchronousComplexFilter, SynchronousComplexFilter, SemiSynchronousComplexFilter

  10. Providers
     - wrap external resources used by components
     - synchronized access to data sources
     - diverse functionality:
       - access databases
       - transparent cache mechanisms
       - network resources
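
A provider that combines synchronized access with a transparent cache might look like the sketch below. It is an assumption-laden Python illustration: `CachingProvider`, `fetch` and `misses` are hypothetical names, and a plain dictionary stands in for the real cache.

```python
import threading
from typing import Callable, Dict


class CachingProvider:
    """Hypothetical provider: wraps a data source behind a lock and a cache."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # the wrapped external resource
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()    # serializes access from components
        self.misses = 0                  # how often the resource was really hit

    def get(self, key: str) -> str:
        with self._lock:
            if key not in self._cache:
                self.misses += 1
                self._cache[key] = self._fetch(key)
            return self._cache[key]
```

Callers see only `get`; whether the value came from the cache or from the underlying resource is invisible to them.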

  11. Connectors
     - abstractions of typed queues
     - represent a message queue
     - intra-process or inter-process
     - support one-to-many, many-to-many relationships, identification by roles

  12. Realization of connectors
     The method of message transfer is transparent to components:
     - local connector
       - typed FIFO queue
       - data is passed by reference
     - remote connector
       - corresponds to two local queues and associated network communication components in separate processes
       - data is serialized (and transmitted over TCP)
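
The difference between the two connector kinds can be illustrated with a small Python sketch. The class and method names are hypothetical, `pickle` is a stand-in for whatever serialization the framework uses, and the TCP link is replaced by an in-memory byte transfer so the example stays self-contained.

```python
import pickle
from collections import deque


class LocalConnector:
    """FIFO queue inside one process: messages are passed by reference."""

    def __init__(self) -> None:
        self._items = deque()

    def __len__(self) -> int:
        return len(self._items)

    def send(self, message) -> None:
        self._items.append(message)

    def receive(self):
        return self._items.popleft()


class RemoteConnector:
    """Two local queues joined by a byte-level transfer (stand-in for TCP)."""

    def __init__(self) -> None:
        self._outgoing = LocalConnector()   # queue on the sender's side
        self._incoming = LocalConnector()   # queue on the receiver's side

    def send(self, message) -> None:
        self._outgoing.send(message)

    def _pump(self) -> None:
        # Plays the role of the network components: serialize, move, deserialize.
        while len(self._outgoing) > 0:
            wire_bytes = pickle.dumps(self._outgoing.receive())
            self._incoming.send(pickle.loads(wire_bytes))

    def receive(self):
        self._pump()
        return self._incoming.receive()
```

Both classes expose the same `send`/`receive` pair, so a component cannot tell which one it is attached to. The visible difference is that the local connector hands over the same object, while the remote one delivers an equal copy.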

  13. Architecture
     Client-server architecture:
     - clients retrieve documents with respect to the appropriate traversal strategy
     - the server partitions the web and assigns partitions to clients
     Implementation using component framework classes

  14. Marshaler component
     - forwards incoming URLs to clients based on domain or host name
     - caches recently forwarded URLs to decrease network load

  15. Marshaler component
     Limited data exchange during web traversal:
     - locality principle: approx. 10% of hyperlinks are outbound from host or domain
     - batch transmission
     - Zipfian distribution: discarding cached URLs leads to sharply reduced load
     Load balancing between marshalers: URL distribution based on URL host name hash
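
Host-name hashing and the cache of recently forwarded URLs could be combined along these lines. This Python sketch rests on several assumptions that the slides do not state: SHA-1 as the hash, a modulo mapping from hash to partition, and a least-recently-used eviction policy. `partition_for`, `Marshaler` and `route` are hypothetical names.

```python
import hashlib
from collections import OrderedDict
from typing import Optional
from urllib.parse import urlsplit


def partition_for(url: str, partition_count: int) -> int:
    """Map a URL to a partition using a stable hash of its host name."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class Marshaler:
    """Hypothetical marshaler: routes by host, drops recently forwarded URLs."""

    def __init__(self, partition_count: int, cache_size: int = 10000) -> None:
        self._partition_count = partition_count
        self._cache_size = cache_size
        self._recent: "OrderedDict[str, None]" = OrderedDict()

    def route(self, url: str) -> Optional[int]:
        if url in self._recent:
            self._recent.move_to_end(url)      # still popular, keep it cached
            return None                        # already forwarded: discard
        self._recent[url] = None
        if len(self._recent) > self._cache_size:
            self._recent.popitem(last=False)   # evict the least recently seen
        return partition_for(url, self._partition_count)
```

Because the hash depends only on the host name, all URLs of one host land in the same partition. With a Zipfian link distribution, a small cache of popular URLs already filters out most repeats.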

  16. Basic client components (diagram)
     Labeled elements: Server, Client 1, local url queue, URL distributor, Traversal component, Downloader, Parser
     Message labels:
     - url belonging to Client 1; external url; internal url; next url
     - host, #new items
     - url, referrer url
     - url, HTTP header, document content
     - document url, base url, links
     - url, length, start/stop time, HTTP status code
     - finished

  17. Traversal component (diagram: the client component diagram of slide 16 is repeated)

  18. Traversal component
     - fetches new URLs to download from persistent storage
     - notification on arrival of new URLs from server or availability of a host
     - selects next URL based on traversal strategy (breadth-first, relevance-based, etc.)
     Diagram labels: Url distributor; Traversal component; Downloader; url queue; host, #new items; url, referrer url; finished
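
How the traversal strategy determines the next URL can be shown with two interchangeable frontiers. This is a simplified in-memory Python sketch (the slides mention persistent storage, which is omitted here); the class names, the `add`/`next_url` interface and the numeric `score` are assumptions made for illustration.

```python
import heapq
from collections import deque
from itertools import count


class BreadthFirstFrontier:
    """Breadth-first strategy: URLs leave in the order they were discovered."""

    def __init__(self) -> None:
        self._queue = deque()

    def add(self, url: str, score: float = 0.0) -> None:
        self._queue.append(url)          # the score is ignored by this strategy

    def next_url(self) -> str:
        return self._queue.popleft()


class RelevanceFrontier:
    """Relevance-based strategy: the highest-scored URL leaves first."""

    def __init__(self) -> None:
        self._heap = []
        self._order = count()            # tie-breaker keeps insertion order

    def add(self, url: str, score: float = 0.0) -> None:
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_url(self) -> str:
        return heapq.heappop(self._heap)[2]
```

Since both frontiers share one interface, a traversal component could switch strategy without changing the rest of the pipeline.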

  19. Load balancing component
     - prevents overloading hosts
     - cooperates with traversal components
     - configurable delay between requests
     - dynamic adaptation based on response times
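
One possible reading of "configurable delay" with "dynamic adaptation" is sketched below in Python. The adaptation rule (wait for the larger of a base delay and a multiple of the last response time) is an assumption chosen for illustration, not the rule used by Crawler.NET; `HostThrottle` and its methods are hypothetical names.

```python
from typing import Dict


class HostThrottle:
    """Hypothetical per-host throttle with a response-time-based delay."""

    def __init__(self, base_delay: float = 1.0, factor: float = 2.0) -> None:
        self._base_delay = base_delay    # configurable minimum gap in seconds
        self._factor = factor            # assumed multiplier for slow hosts
        self._ready_at: Dict[str, float] = {}

    def is_available(self, host: str, now: float) -> bool:
        # A host that was never contacted is available immediately.
        return now >= self._ready_at.get(host, 0.0)

    def record_response(self, host: str, finished_at: float,
                        response_time: float) -> None:
        # Slow responses suggest a loaded host, so the gap grows with them.
        delay = max(self._base_delay, self._factor * response_time)
        self._ready_at[host] = finished_at + delay
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic. A traversal component would ask `is_available` before selecting a URL for that host.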

  20. Load balancing component (diagram)
     Labeled elements: Load balancer, Url distributor, Traversal component, Downloader, url queue
     Message labels: available host; host, #new items; url, referrer url; statistics

  21. Parser component (diagram: the client component diagram of slide 16 is repeated)
