Web Data Engineering: A Technical Perspective on Web Archives
Dr. Helge Holzmann, Web Data Engineer, Internet Archive (helge@archive.org)
Open Repositories 2019, Hamburg, Germany, June 12, 2019
2019-06-12 Helge Holzmann (helge@archive.org)
What is a web archive?
• Web archives preserve our history as documented on the web…
• … in huge datasets, consisting of all kinds of web resources
  • e.g., HTML pages, images, video, scripts, …
• … stored as big files in the standardized (W)ARC format
  • along with metadata + request / response headers
  • next to lightweight capture index files (CDX)
• … to provide access to webpages from the past
  • for users through close reading
    • replayed by the Wayback Machine
  • for data analysis at scale through distant reading
    • enabled by Big Data processing methods, like Hadoop / Spark, …
Not today's topic …
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
The (archived) web…
• ... is a very valuable dataset to study the web (and the offline world)
  • Access to very diverse knowledge from various disciplines (history, politics, …)
  • The whole web at your fingertips / processable snapshots
  • Adds a temporal dimension to the Web / captures dynamics
• ... is a widely unstructured collection of data
  • Access and analysis at scale is challenging
  • Processing petabytes of data is expensive and time-consuming
  • Difficult to discover, identify, extract records and contained information
  • Potentially highly technical, complex access and parsing process
  • Low-level details users / researchers / data scientists don't want to / can't deal with
• Data engineering is needed before the data can be used in downstream applications / studies
Different perspectives on web archives
• User-centric View
  • (Temporal) Search / Information Retrieval
  • Direct access / replaying archived pages
• Data-centric View
  • (W)ARC and CDX (metadata) datasets
  • Big data processing: Hadoop, Spark, …
  • Content analysis, historical / evolution studies
• Graph-centric View
  • Structural view on the dataset
  • Graph algorithms / analysis, structured information
  • Hyperlink and host graphs, entity / social networks, facts and more
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis, 2019]
Web (archives) as graph
• Foundational model for most downstream applications / analysis tasks
  • E.g., search index construction, term / entity co-occurrence studies, …
• Different ways / approaches to construct / extract (temporal) graphs
  • (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
• Technical challenges that users don't want to / can't deal with:
  • Efficient generation, effective representation, …
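To make the "hosts vs. URLs" distinction concrete, here is a minimal sketch of how URL-level hyperlinks could be collapsed into a temporal host graph. It is an illustration only, not the pipeline used at the Internet Archive: the input tuples, the function names `host_of` and `temporal_host_graph`, and the choice of the capture year as the temporal bucket are all assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def temporal_host_graph(links):
    """Aggregate URL-level links into host-level edges per capture year.

    links: iterable of (timestamp, source_url, target_url), where timestamp
    is a 14-digit capture time such as '20060616001149' (assumed layout).
    Returns a Counter mapping (year, source_host, target_host) -> link count.
    """
    edges = Counter()
    for timestamp, source_url, target_url in links:
        source, target = host_of(source_url), host_of(target_url)
        # Skip unparsable URLs and links that stay on the same host.
        if source and target and source != target:
            edges[(timestamp[:4], source, target)] += 1
    return edges
```

Keeping the year in the edge key is one simple way to retain the temporal dimension; a finer bucket (month, exact timestamp) trades graph size for resolution.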
(Temporal) search in web archives
• Wanted: Enter a textual query, find relevant captures
• Challenges:
  • Documents are temporal / consist of multiple versions
    • New captures could be near-duplicates or contain relevant changes
  • Temporal relevance in addition to textual relevance
  • Relevance to the query is not always encoded in the content
  • Information needs / query intents are different from traditional IR
    • Mostly navigational: Under which URL can I find a specific resource?
• How to turn (temporal) graphs into a searchable index?
  • Integrate full-text, titles, headlines, anchor texts, ...?
  • Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch
  • Adaptation of existing retrieval models
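One way to picture the step from a temporal graph to a searchable index is to describe each target URL by the anchor texts of its incoming links and emit one JSON document per URL. The sketch below assumes link tuples of the form (timestamp, source URL, target URL, anchor text); the function names and the document fields `url`, `anchor_text`, `first_seen` and `last_seen` are hypothetical, not a schema from the talk. It produces plain JSON lines and makes no calls to any retrieval system.

```python
import json
from collections import defaultdict


def build_anchor_docs(links):
    """Group anchor texts by target URL and record when they were seen.

    links: iterable of (timestamp, source_url, target_url, anchor_text),
    with 14-digit timestamps that sort correctly as strings.
    """
    by_target = defaultdict(list)
    for timestamp, _source_url, target_url, text in links:
        text = " ".join(text.split())  # normalize whitespace
        if text:                       # ignore links without anchor text
            by_target[target_url].append((timestamp, text))

    docs = []
    for url in sorted(by_target):
        entries = sorted(by_target[url])
        docs.append({
            "url": url,
            "anchor_text": sorted({text for _, text in entries}),
            "first_seen": entries[0][0],
            "last_seen": entries[-1][0],
        })
    return docs


def to_json_lines(docs):
    """Serialize documents as newline-delimited JSON."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in docs)
```

The `first_seen` / `last_seen` fields are a simple stand-in for temporal relevance: they let a query be restricted to the period in which a URL was actually linked under a given description.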
Web Data Engineering
• Transforming data into useful information
• Making it usable for downstream applications
  • Search, data science, digital humanities, content analysis, ...
  • Regular users, researchers, data scientists / analysts, ...
• Enabling efficient and effective access through...
  • ... infrastructures
  • ... suitable data formats
  • ... simple tools / APIs
  • ... optimized indexes
• Technical considerations made by computer scientists
  • to help users / researchers focus on their application / study / research
  • to hide complexity / low-level details through flexible abstractions
Example: Language Analysis (1)
• Possible research questions:
  • Which pages of a language exist outside the country's ccTLD?
  • Which languages are used the most in a certain area / topic?
  • How has a language evolved over time on the web?
• Requirements:
  • Tools for (W)ARC access, HTML parsing, language detection
  • Language-annotated pages / captures
• Challenges:
  • Texts too short to detect a language / confidence scores
  • Multiple languages on one page / filtering and weighting
  • Slow and expensive processing due to large-scale content analysis (weeks)
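As a rough illustration of the first research question, the sketch below selects captures whose detected language is confident enough but whose host lies outside the corresponding ccTLD. It assumes that captures are already language-annotated as (URL key, scores) pairs, that URL keys are in SURT form starting with the TLD, and that a language-to-ccTLD mapping is available; `LANGUAGE_CCTLD` and both function names are hypothetical.

```python
# Hypothetical mapping from a language code to the ccTLD of "its" country.
LANGUAGE_CCTLD = {"es": "es", "fr": "fr", "it": "it", "de": "de"}


def tld_of_urlkey(urlkey):
    """Return the TLD of a SURT-style key, e.g. 'com,yahoo,answers,es)/' -> 'com'."""
    host_part = urlkey.split(")", 1)[0]
    return host_part.split(",", 1)[0]


def outside_cctld(captures, language, min_score):
    """List URL keys in `language` (score >= min_score) hosted outside its ccTLD.

    captures: iterable of (urlkey, {language_code: score}) pairs.
    """
    cctld = LANGUAGE_CCTLD[language]
    return [
        urlkey
        for urlkey, scores in captures
        if scores.get(language, 0) >= min_score
        and tld_of_urlkey(urlkey) != cctld
    ]
```

The `min_score` parameter addresses the confidence-score challenge: pages with several languages or very short texts can be included or excluded by moving the threshold rather than re-running detection.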
Example: Language Analysis (2)
• Wanted:
  • Efficient access to comprehensive results
  • Lightweight, reusable exchange format
  • Dynamic threshold / flexible post-filtering
• Solution: (CDX) Attachment Format (ATT / CDXA)
  • Lightweight, efficient loading, integrated data validation, decoupled from data

*.cdx: CDX (Capture Index) with pointers to corresponding (W)ARC records
  com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
  com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
  com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
  com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
  com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
  com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
  com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX

*.cdx.lang_2017-18_v2.cdxa.gz:
  # Language detection using 'square leaf' approach
  Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
  RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
  3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
  5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
  XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
  7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
  45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
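The sample rows suggest how an attachment could be read and post-filtered with a dynamic threshold. The following is a sketch under assumptions, not a reference parser for the CDXA format: it assumes each attachment line holds a record digest followed by optional comma-separated `lang:score` pairs, that lines starting with `#` are comments, and that the file has already been decompressed into text lines. All function names are illustrative.

```python
def parse_lang_scores(value):
    """Parse a value such as 'fr:54,en:7' into {'fr': 54, 'en': 7}."""
    scores = {}
    for pair in value.split(","):
        lang, sep, score = pair.strip().partition(":")
        if sep:  # skip empty or malformed pairs
            scores[lang] = int(score)
    return scores


def read_attachment(lines):
    """Map each record digest to its language scores (assumed line layout)."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        digest, _, value = line.partition(" ")
        result[digest] = parse_lang_scores(value)
    return result


def languages_above(scores, threshold):
    """Return the languages whose score reaches the threshold, sorted by code."""
    return sorted(lang for lang, score in scores.items() if score >= threshold)
```

Because the scores are stored rather than a single hard label, the threshold can be chosen per study: `languages_above(scores, 10)` keeps secondary languages on a page, while a stricter value keeps only the dominant one.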
We have more available (examples)
• Dataset of all homepages in Global Wayback (GWB) – web.archive.org
  • Extracted from snapshot 20180911224740
  • GWB-20180911224740_homepages.cdx.gz
• Pre-processed attachments
  • GWB-20180911224740_homepages-*.cdx.gz
  • GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
  • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
  • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
  • GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
  • GWB-20180911224740_homepages-*.cdx.last.cdxa.gz

  # The last available capture
  Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
  RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
  3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
  5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
  XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
  7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
  45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
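The slide does not define how the `last` and `last-success` attachments are computed. A plausible reading, sketched here purely as an assumption, is "the most recent capture per URL key" and "the most recent capture per URL key with HTTP status 200". The sketch also assumes the first six CDX fields appear in the order URL key, timestamp, original URL, MIME type, status, digest, as the sample rows suggest; `Capture`, `parse_cdx_line` and `last_captures` are illustrative names.

```python
from collections import namedtuple

# First six space-separated CDX fields, in the order assumed above.
Capture = namedtuple("Capture", "urlkey timestamp url mime status digest")


def parse_cdx_line(line):
    """Parse one CDX line; raises TypeError if it has fewer than six fields."""
    return Capture(*line.split()[:6])


def last_captures(captures, success_only=False):
    """Return the latest capture per URL key.

    With success_only=True, captures whose status is not '200' are ignored
    (an assumed definition of 'last-success').
    """
    latest = {}
    for capture in captures:
        if success_only and capture.status != "200":
            continue
        current = latest.get(capture.urlkey)
        # 14-digit timestamps compare correctly as strings.
        if current is None or capture.timestamp > current.timestamp:
            latest[capture.urlkey] = capture
    return latest
```

For the `in.answers.yahoo.com` row above, whose last capture is a 301, the two variants would differ: `last` points to the redirect, while `last-success` falls back to an earlier capture with status 200, if one exists.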
Fatcat.wiki (beta)
Archive and knowledge graph of every publicly-accessible scholarly output with a priority on long-tail, at-risk publications.
Fatcat.wiki (big catalog)
• At-scale web harvesting of scholarly works
  • with descriptive metadata and full-text
  • linked with versions and secondary outputs
• API-first accessible / editable system
Challenge: the Internet Archive is big
• Web archive / Wayback Machine
  • 20+ years of web
  • 625+ library and other partners
  • 753,932,022,000 (captured) URLs
  • 362 billion web pages
  • More than 5,000 URLs archived every second
  • 40+ petabytes
• And there's more:
Challenge: web archives are Big Data
• Processing requires computing clusters
  • e.g., Hadoop, YARN, Spark, …
• MapReduce or variants
  • Homogeneous data types / formats
  • Distributed batch processing
    • load → transform
    • aggregate → write
• Web archive data is heterogeneous, may include text, video, images, …
  • Common header / metadata format, but various / diverse payloads
  • Requires cleaning, filtering, selection, extraction before processing
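The load → transform → aggregate → write pattern, including the filtering step that heterogeneous data requires, can be sketched on a single machine as follows. This is a stand-in for a distributed job (it uses no Hadoop or Spark API). It assumes CDX-like input lines whose first five fields are URL key, timestamp, original URL, MIME type and status; the task itself (counting successful HTML captures per year) and all function names are chosen only for illustration.

```python
from collections import Counter


def load(lines):
    """Load: parse raw lines into records, dropping malformed ones."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 5:
            yield {"timestamp": parts[1], "mime": parts[3], "status": parts[4]}


def transform(records):
    """Transform: keep successful HTML captures and emit their capture year."""
    for record in records:
        if record["status"] == "200" and record["mime"] == "text/html":
            yield record["timestamp"][:4]


def aggregate(keys):
    """Aggregate: count occurrences per key (the 'reduce' step)."""
    return Counter(keys)


def write(counts):
    """Write: render the result as sorted tab-separated lines."""
    return [f"{year}\t{count}" for year, count in sorted(counts.items())]


def run(lines):
    """Chain the four stages into one batch job."""
    return write(aggregate(transform(load(lines))))
```

The selection in `transform` is what turns heterogeneous capture records into a homogeneous stream of keys; in a cluster setting the same stages would be expressed as map, filter and reduce operations over partitions of the input.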