
Web Archiving. Dr. Marc Spaniol, Saarbrücken, May 27, 2010 (PowerPoint presentation)



  1. Web Dynamics: Web Archiving. Dr. Marc Spaniol, Saarbrücken, May 27, 2010. Databases and Information Systems, Prof. Dr. G. Weikum. MPII-Sp-0510-1/77

  2. Agenda
     • Introduction
       - Indexing vs. archiving
       - Temporal coherence of Web archives
     • Aspects of Web archiving
       - Selection
       - Capturing
         ▪ Conceptual approaches
         ▪ Coherence aware archiving
         ▪ Quantifying (in-)coherence
       - Archiving
       - Hosting
     • Summary
     • References

  3. Indexing vs. Archiving
     • Indexing ⇒ “Taking a Photo”
       - Completeness
       - Access to content
       - Scalability (speed)
       - Efficiency
       - Freshness
     • Archiving ⇒ “Shooting a Movie”
       - Completeness
       - Access to content
       - Scalability (coverage)
       - Authenticity
       - Coherence
       - Durability

  4. The Challenge of Web Archiving
     • World Wide Web
       - A disorganized free-for-all
       - Very little metadata
       - Unpredictable additions, deletions, modifications
       - No (coordinated) preservation strategy
     • HTTP cannot ask for only new or modified contents
       - Timestamps have limited benefit
       - No list of pages that have been deleted, changed, and added
       - Each content must be requested, one at a time, by name
     • There is no “SELECT *” in HTTP
       - Crawlers can only GET one resource at a time, by name
       - HTTP cannot give a crawler a list of all URLs for the site
     ⇒ Undiscovered or hidden resources will not be captured or refreshed
     ⇒ “Strategy” required
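
The point that a crawler can only GET one resource at a time, by name, can be illustrated with a minimal sketch. It assumes a hypothetical in-memory site (`SITE`) in place of real HTTP requests; the page names and the `fetch_links` and `crawl` helpers are invented for illustration and are not part of the slides.

```python
from collections import deque

# Hypothetical toy site: each URL maps to the links found on that page.
SITE = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/news/1"],
    "/news/1": [],
    "/orphan": [],  # exists on the server, but no page links to it
}

def fetch_links(url):
    """Stand-in for one HTTP GET plus link extraction (one resource, by name)."""
    return SITE[url]

def crawl(seed):
    """Breadth-first discovery: only URLs reachable via links are ever requested."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

captured = crawl("/")
missed = set(SITE) - captured  # resources the crawler never learns about
print(sorted(captured))
print(sorted(missed))
```

In this toy setting the unlinked page is never requested, because nothing ever hands the crawler its name.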

  5. Temporal Coherence of Web Archives

  6. The Challenge of Archive Coherence
     • Crawler operations
       - Visit (pages)
         ▪ Extract (links from pages)
         ▪ Compare (versions of pages)
       - Follow (links)
     • Website operations
       - Modifications “inside” pages
         ▪ Content (text)
         ▪ Structure (links)
       - Modifications “inside” site
         ▪ Page creation
         ▪ Page deletion
     ⇒ Crawler and website operations take place in parallel, so the capture is potentially incoherent

  7. Potential Pitfalls in Web Archiving
     • Crawling takes a long (!) time
       - Politeness
       - Multiple seeds per crawl
       - Spam
     • Crawlers aren’t “really” smart
       - Highly sensitive to dynamics in CMS
       - Easily trapped, if not configured exactly
       - Don’t recognize patterns of “identical” contents
       ⇒ Pre-analysis of site(s) needed
     • Some examples of crawler behavior
       - Enjoy link generation from JavaScript, PHP, etc.
       - Tend to go shopping
       - Like time travelling in calendars
     ⇒ Crawling is simply “unpredictable”
     ⇒ Crawlers need “constant” monitoring
     Side callouts: Archive in Danger! ⇒ Smart(er) Crawling Strategies; Evaluation of Crawl Coherence

  8. Aspects of Web Archiving

  9. Selection

  10. Selection of Seed(s) and Scope
     • Entry point / seed: Where the capturing process (crawl) starts. Top of the hypertext path that will be followed.
     • Scope: The extent of the area that will be included in the gathering, as defined by criteria applicable to each node.
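
Since scope is defined by criteria applicable to each node, it can be sketched as a predicate evaluated per URL. This is only an illustration under assumed criteria: the host name, path prefix, and depth limit below are hypothetical and do not come from the slides.

```python
from urllib.parse import urlsplit

def in_scope(url, depth, allowed_host="www.example.org",
             path_prefix="/", max_depth=3):
    """Return True if a node meets every (hypothetical) scope criterion."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat an empty path as the site root
    return (
        parts.scheme in ("http", "https")
        and parts.hostname == allowed_host
        and path.startswith(path_prefix)
        and depth <= max_depth  # depth = number of links followed from the seed
    )

print(in_scope("https://www.example.org/news/1", depth=2))
print(in_scope("https://other.example.com/news/1", depth=2))
print(in_scope("https://www.example.org/calendar/2099/01", depth=40))
```

A depth limit of this kind is one simple guard against endlessly generated links such as calendar pages, although a real crawl would likely need further criteria.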

  11. Completeness
     • Vertically: Number of relevant nodes found from entry point
     • Horizontally: Number of relevant entry points found within the designated perimeter

  12. Extensive Collection
     • Horizontal completeness is preferred to vertical completeness
     • Holistic, domain based, or topic-centric archiving

  13. Intensive Collection
     • Vertical completeness is preferred to horizontal completeness
     • Site-based archiving
     • Defines the high level target of a collection
     • Explicit exclusion to avoid duplicate content with other collections

  14. Capturing

  15. A Webmaster’s Omniscient View
     [Figure: site structure as seen by the webmaster. Labels: Entry point / seed; Dynamic (MySQL: 1. Data1, 2. User.abc, 3. Fred.foo); Authenticated; Tagged: No robots; Orphaned (httpd: 1. file1, 2. /dir/wwx, 3. Foo.html); Deep; Unknown/not visible]

  16. Web Server’s View of a Web Site
     [Figure: site structure as seen by the Web server. Labels: Entry point / seed; Require authentication; Generated on-the-fly (e.g. by CGI); Tagged: No robots; Unknown/not visible]

  17. A Crawler’s View of a Web Site
     [Figure: site structure as seen by a crawler. Labels: Entry point / seed; Crawled pages; Not crawled (protected); Not crawled (generated on-the-fly, e.g. by CGI); Not crawled (robots.txt or robots META tag); Not crawled (unadvertised & unlinked); Not crawled (remote link only); Not crawled (too deep); Remote web site]
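
The “not crawled (robots.txt)” case can be sketched with Python’s standard `urllib.robotparser`. The robots.txt content, the user agent name, and the `may_crawl` helper are assumptions made for this example; the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from lines instead of fetching

def may_crawl(url, agent="example-archiver"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_crawl("https://www.example.org/index.html"))
print(may_crawl("https://www.example.org/private/report.html"))
```

A polite crawler would perform such a check before every request, which is one reason parts of a site never reach the archive.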

  18. Web Information Systems
     [Figure labels: Dynamic Web sites; Hidden Web]
     • Each interaction with a Web information system can potentially generate a unique customized response
     ⇒ Document the context of this interaction, or pseudo-transaction

  19. Crawler-Server Collaboration
     • Open Archives Initiative (OAI) Protocol for Metadata Harvesting
     • Provided flat list (possibly hidden from the public)
     • RSS feeds
     • OAI server
       - Pushed by search engines
       - Yahoo content acquisition program, Google
     ⇒ The sitemap standard is intended to list the resources at a site
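
How a crawler might consume a sitemap can be sketched with the standard `xml.etree.ElementTree` module. The sitemap content below is a made-up example in the sitemaps.org XML format, and `sitemap_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the second URL is not linked from any page.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc><lastmod>2010-05-27</lastmod></url>
  <url><loc>https://www.example.org/orphan.html</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_entries(xml_text):
    """Return (loc, lastmod) pairs; lastmod is None when the element is absent."""
    root = ET.fromstring(xml_text)
    return [
        (url.findtext("sm:loc", namespaces=NS),
         url.findtext("sm:lastmod", namespaces=NS))
        for url in root.findall("sm:url", NS)
    ]

entries = sitemap_entries(SITEMAP_XML)
for loc, lastmod in entries:
    print(loc, lastmod)
```

With such a list the server, rather than link discovery alone, tells the crawler which resources exist, including pages that no other page links to.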

  20. Server Side Archiving

  21. Transaction based Archiving

  22. Client Side Archiving

  23. Capturing Approaches Summary
     • Server Side Archiving
       - Benefits
         + Extremely comprehensive
         + Changes are fully traceable
         + Instantaneous snapshots
         + No network latency or limitations
         + Deep Web “compliant”
       - Drawbacks
         - Change monitoring may decrease server performance
         - Needs sophisticated set-up
         - Requires server access
     • Transaction based Archiving
       - Benefits
         + Comes for “free”
         + “Smart” coverage achieved by human interaction
         + Simple maintenance
         + No server collaboration required
       - Drawbacks
         - Unsystematic (requires constant traffic)
         - Data quality is potentially poor
         - Needs traffic monitoring
         - Privacy issues
         - Potential network latency or limitations
     • Client Side Archiving
       - Benefits
         + No server collaboration needed
         + Only crawler set-up required
         + Mostly automated process (daily/weekly/monthly)
       - Drawbacks
         - Changes might get lost
         - Sophisticated crawling strategy needed
         - Potential network latency or limitations
         - Computationally “expensive”

  24. Temporal Coherence
     • What does coherence mean?
       - “The action or fact of cleaving or sticking together”
       - “Harmonious connexion of the several parts, so that the whole ‘hangs together’”
       Oxford English Dictionary [http://dictionary.oed.com]
     • Temporal coherence in Web archiving:
       - Capturing Web sites as “authentically” as possible
       - Ensure an “as of time point x (or interval [x, y])” capture of a Web site
     ⇒ Periodic domain scope crawls of Web sites to obtain the best possible representation with respect to a time point / interval

  25. Assumptions and Notations
     • Basic Assumptions
       - Web site to be crawled consists of $n$ Web pages
       - Changes of Web pages occur per time unit and independently of each other
       - Change rates are assumed / given
       - Delay between downloads of pages is the same
       - Download time is neglected
     • Basic Notation
       - Crawl: $c$
       - Web pages: $p_1, \dots, p_n$
       - Change probability of page $p_i$: $\lambda_i$
       - Time of downloading page $p_i$: $t(p_i)$
       - Last modified value of page $p_i$: $\mu_i$
       - Content hash or etag of page $p_i$: $\theta(p_i)$
       - Crawl interval: $[t_s, t_e]$
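
To make these assumptions concrete, the following sketch estimates how many pages can be expected to be unchanged relative to a reference time point. It uses the slide's change probability λ_i, download time t(p_i), and equal download delay. It additionally assumes that each page changes independently in every time unit, so that the probability of no change between t(p_i) and a reference time is (1 - λ_i) raised to the number of time units in between. That extra assumption, the reference time `t_ref`, and all function names are introduced here for illustration and are not taken from the slides.

```python
def download_times(n, t_s=0, delay=1):
    """Equal delay between downloads; the download time itself is neglected."""
    return [t_s + i * delay for i in range(n)]

def p_unchanged(lam, t_download, t_ref):
    """Probability that a page does not change between download and t_ref.

    Assumes an independent change with probability lam in each time unit.
    """
    return (1.0 - lam) ** abs(t_download - t_ref)

def expected_coherent(lams, t_ref, t_s=0, delay=1):
    """Expected number of pages whose capture matches their state at t_ref."""
    times = download_times(len(lams), t_s, delay)
    return sum(p_unchanged(lam, t, t_ref) for lam, t in zip(lams, times))

lams = [0.5, 0.5, 0.5]                    # hypothetical change probabilities
print(expected_coherent(lams, t_ref=0))  # reference point at crawl start t_s
print(expected_coherent(lams, t_ref=1))  # reference point mid-crawl
```

For three pages with identical change probabilities, the expectation is higher when the reference point lies in the middle of the crawl interval than at its start, because no page is then more than one time unit away from it.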
