Web Dynamics
Introduction to Web Archiving
Marc Spaniol
Saarbrücken, May 28, 2009
Databases and Information Systems, Prof. Dr. G. Weikum
MPII-Sp-0509
Agenda
• Motivation
  - Indexing vs. archiving
  - The challenge of Web archiving
  - Next generation Web archiving
• Aspects of Web archiving
  - Web archiving tools
  - Selection
  - Capturing
  - Archiving
  - Hosting
• Summary
• References
Indexing vs. Archiving
• Indexing ⇒ “Taking a Photo”
  - Completeness
  - Access to content
  - Scalability (speed)
  - Efficiency
  - Freshness
• Archiving ⇒ “Shooting a Movie”
  - Completeness
  - Access to content
  - Scalability (coverage)
  - Authenticity
  - Coherence
  - Durability
The Challenge of Web Archiving
• Digital library
  - Organized
  - Groomed content
  - Lots of metadata
  - Structured changes
  - Active preservation policies
• World Wide Web
  - A disorganized free-for-all
  - Very little metadata
  - Unpredictable additions, deletions, modifications
  - No (coordinated) preservation strategy
Goals of Web Archiving
• Role of the Web
  - Providing information and services for seemingly all domains
  - Reflecting all types of events, opinions, and developments within society, science, politics, environment, business, etc.
  - Giving room for articulation by a multitude of stakeholders
⇒ Archiving this quickly changing, multifaceted information space has become a relevant issue for cultural heritage
• Web archiving imposes various challenges:
  - Hidden Web
  - Inherent ephemeral character
  - New types of content
  - Change & evolution
  - Social Web
  - Preservation
  - ...
Next Generation Web Archiving
Development of Web archiving technology for
  - High quality Web archives
  - Long-term archive usability
⇒ From Web page storage to “Living Web Archives”
[Figure labels: Living, Evolution, Usage, Variety]
Archive Fidelity
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity by
  - Capturing all types of content
  - Capturing the hidden Web
  - Detecting traps
Advanced Filtering
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
  - Capture all types of content
  - Detect traps
  - Filter Web spam
  - Filter noise
Archive Coherence
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
• Improve archive coherence and integrity
  - Deal with issues of temporal Web construction
  - Identify, analyze and repair temporal gaps
  - Consistent Web archive federation
Archive Interpretability
Next generation Web archiving methods and tools:
• Enhance archive fidelity and authenticity
• Provide advanced filtering features
• Improve archive coherence and integrity
• Facilitate (long-term) archive interpretability
  - Dealing with terminology evolution
  - Handling semantic evolution
  - Preparing for evolution-aware access support
Goals of Web Archiving Summarized
• Archiving function α applied to website W produces a capture C_W of the website’s resources and related metadata:
    α(W) → C_W
• Restoration function ρ “unpacks” the capture C_W and reproduces the original site:
    ρ(C_W) → W
• Transformation function τ “unpacks” the capture C_W, converts the components to their modern-day equivalents, and reproduces the original site within a new environment:
    τ(C_W) → W_Δ
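The three functions above can be illustrated with a toy model. This is only a sketch under strong assumptions: a website is modelled as a mapping from paths to bytes, and the names `Capture`, `archive`, `restore`, `transform`, and the `migrate` callback are hypothetical, not part of the lecture or of any archiving tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Toy model (assumption): a website W is a mapping from path to content.
Website = Dict[str, bytes]


@dataclass(frozen=True)
class Capture:
    """C_W: the captured resources plus related metadata."""
    resources: Dict[str, bytes]
    metadata: Dict[str, str]


def archive(site: Website, captured_at: str) -> Capture:
    """alpha(W) -> C_W: copy the resources and attach metadata."""
    metadata = {"captured_at": captured_at, "resource_count": str(len(site))}
    return Capture(resources=dict(site), metadata=metadata)


def restore(capture: Capture) -> Website:
    """rho(C_W) -> W: unpack the capture and reproduce the site unchanged."""
    return dict(capture.resources)


def transform(capture: Capture, migrate: Callable[[str, bytes], bytes]) -> Website:
    """tau(C_W) -> W_Delta: unpack, convert each component, reproduce the site."""
    return {path: migrate(path, body) for path, body in capture.resources.items()}


site = {"/index.html": b"<HTML>hi</HTML>"}
capture = archive(site, captured_at="2009-05-28")
restored = restore(capture)
# Hypothetical migration step: lower-case the markup for a newer environment.
migrated = transform(capture, lambda path, body: body.lower())
```

In this model restoration is a lossless round trip, while transformation deliberately changes the content, which is why it yields W_Δ rather than W.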
Aspects of Web Archiving
Web Archiving Tools
OAIS: Open Archival Information System
SIP: Submission Information Package
AIP: Archival Information Package
DIP: Dissemination Information Package
Selection of Seed(s) and Scope
• Entry point / seed: Where the capturing process (crawl) starts. Top of the hypertext path that will be followed.
• Scope: The extent of the area that will be included in the gathering, as defined by criteria applicable to each node.
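Seed and scope can be sketched as a breadth-first traversal that starts at the seed and applies a per-node scope test. This is a minimal sketch, not a real crawler: the link graph is an in-memory dictionary instead of fetched pages, and `crawl`, `LINKS`, `in_scope`, and the example URLs are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl(seed: str,
          links: Dict[str, List[str]],
          in_scope: Callable[[str], bool]) -> List[str]:
    """Follow the hypertext path from the seed, keeping only in-scope nodes."""
    seen = {seed}
    captured = []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        captured.append(url)
        for target in links.get(url, []):
            # The scope criterion is applied to each node individually.
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return captured


# Hypothetical link graph standing in for real pages.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": [],
}


def in_scope(url: str) -> bool:
    """Scope assumption: stay on the seed's host."""
    return url.startswith("http://example.org/")


captured = crawl("http://example.org/", LINKS, in_scope)
```

Here the page on `other.example` is linked from the seed but excluded by the scope test, so it never enters the queue.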
Completeness
• Vertically: Number of relevant nodes found from entry point.
• Horizontally: Number of relevant entry points found within the designated perimeter.
Extensive Collection
• Horizontal completeness is preferred to vertical completeness
• Holistic, domain-based, or topic-centric archiving
Intensive Collection
• Vertical completeness is preferred to horizontal completeness
• Site-based archiving
• Defines the high-level target of a collection
• Explicit exclusion to avoid duplicate content with other collections
The Challenge of Web Archiving
• HTTP cannot ask for only new or modified content
  - Timestamps have limited benefit
  - No list of pages that have been deleted, changed, or added
  - Each resource must be requested, one at a time, by name
• There is no “SELECT *” in HTTP
  - Crawlers can only GET one resource at a time, by name
  - HTTP cannot give a crawler a list of all URLs for the site
⇒ Undiscovered or hidden resources will not be captured or refreshed
⇒ “Strategy” required
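The consequence of these limits can be sketched as a refresh pass: the crawler has to re-request every URL it already knows, by name, and compare each response with the previous capture. This is a minimal sketch under assumptions: the live site is simulated by a dictionary, change detection uses content digests rather than timestamps, and `refresh`, `digest`, `known`, and `live_site` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def digest(body: bytes) -> str:
    """Content fingerprint used instead of (unreliable) timestamps."""
    return hashlib.sha256(body).hexdigest()


def refresh(known: Dict[str, str],
            fetch: Callable[[str], Optional[bytes]]) -> Dict[str, List[str]]:
    """Re-request every known URL, one at a time, and classify the result."""
    report: Dict[str, List[str]] = {"unchanged": [], "changed": [], "deleted": []}
    for url in sorted(known):
        body = fetch(url)  # one GET per resource, by name
        if body is None:
            report["deleted"].append(url)
        elif digest(body) == known[url]:
            report["unchanged"].append(url)
        else:
            report["changed"].append(url)
    return report


# Digests from the previous capture (simulated).
known = {
    "/": digest(b"home v1"),
    "/about": digest(b"about"),
    "/old": digest(b"gone"),
}
# Current state of the site; "/new" is not linked or known to the crawler.
live_site = {"/": b"home v2", "/about": b"about", "/new": b"never linked"}

report = refresh(known, live_site.get)
reported_urls = {url for urls in report.values() for url in urls}
```

The pass costs one request per known resource even when nothing changed, and `/new` appears in no category because the crawler was never told its name.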
Server Side Archiving
Server Side Archiving Revisited
• Benefits
  + Extremely comprehensive
  + Changes are fully traceable (if budget permits)
  + Instantaneous snapshots possible
  + No network latency or limitations
  + Deep Web compliant
• Drawbacks
  - Change monitoring may decrease server performance
  - Needs sophisticated set-up
  - Requires server access
Transaction based Archiving
Transaction based Archiving Revisited
• Benefits
  + Comes for “free”
  + “Smart” coverage achieved by human interaction
  + Simple maintenance
  + No server collaboration/manipulation required
• Drawbacks
  - Unsystematic
  - Data quality is potentially poor
  - Needs traffic monitoring
  - Privacy issues
  - Potential network latency or limitations
  - Requires constant traffic
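The idea behind these trade-offs can be sketched as a component that observes request/response pairs and stores each distinct response once. This is a simplified sketch under assumptions: it wraps a response function directly instead of monitoring real network traffic, and `TransactionArchive`, `handle`, and `PAGES` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Tuple


class TransactionArchive:
    """Stores each distinct (URL, content) pair observed in the traffic."""

    def __init__(self, handler: Callable[[str], bytes]) -> None:
        self._handler = handler
        # Key: (url, content digest); value: the response body.
        self.store: Dict[Tuple[str, str], bytes] = {}

    def handle(self, url: str) -> bytes:
        """Pass the request through and archive the observed response."""
        body = self._handler(url)
        key = (url, hashlib.sha256(body).hexdigest())
        self.store.setdefault(key, body)  # repeated identical responses are kept once
        return body


# Simulated site content; one page is never requested by any user.
PAGES = {"/": b"home", "/rarely-visited": b"deep page"}
archive = TransactionArchive(lambda url: PAGES[url])

archive.handle("/")
archive.handle("/")            # same content: no new entry
PAGES["/"] = b"home, updated"
archive.handle("/")            # changed content: second version is archived

archived_urls = sorted({url for url, _ in archive.store})
```

Only pages that users actually request are archived, which illustrates both the human-driven coverage and the unsystematic nature listed above: `/rarely-visited` is never captured.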
Client Side Archiving
Client Side Archiving Revisited
• Benefits
  + No server collaboration/manipulation needed
  + Only crawler set-up required
  + Mostly automated process (daily/weekly/monthly)
• Drawbacks
  - Changes might get lost
  - Good data quality requires sophisticated crawling strategies
  - Potential network latency or limitations
  - Computationally “expensive”
Next week’s lecture: “Data Quality in Web Archiving”