A Collaborative Named Entity Focused URI Collection to Explore Web Archives Workshop on Web Archiving and Digital Libraries Sergej Wildemann & Helge Holzmann June 6, 2019 L3S Research Center - University of Hannover - Germany
Introduction
Motivation • Named entities are evolving • Cities grow • People change positions in their careers • Information is spread on the Web • Content is changed or deleted on web pages • Web search engines forget or rerank resources • Limited search in web archives by: • Exact URI • "Full-text" Challenge Accessing online resources that describe an entity over time 1
Example: Site Search in the Wayback Machine • Indexed anchor texts • Problem: Domains only, no specific paths or dates 2
Dataset Generation
Processing Pipeline 1. Entity collection 2. URI collection and assignment 3. URI unification 4. Ranking 5. Temporal enrichment 3
Incorporated Datasets Dataset Entities Classes URIs Tags Dates Wikipedia • • DBpedia • ⋊ ⋉ Wikidata • ⋊ ⋉ Delicious • • ⋊ • ⋉ ⋊ ⋉ GWA • • ⋊ ⋉ GWW • • ⋊ ⋉ Wayback CDX • ⋊ ⋉ ( • ) extracted information – ( ⋊ ⋉ ) used for joining 4
Entity Collection • Potential entities: List of all Wikipedia article titles • Filtering with DBpedia’s ontology: Type Count % Person 1,500,000 22.7 Place 840,000 12.7 Creative Work 496,000 7.5 Organization 286,000 4.3 Other 2,378,000 36.0 N/A 1,100,000 16.7 6,600,000 100 5
URI Collection • Wikidata provides URIs directly or indirectly via identifiers • Wikipedia articles contain "External Links" section • Official websites (usually without a path) • Databases like IMDb • GWA and GWW were generated from the German Web Archive (1996-2013) • Searching for entities in anchor texts • GWW restricted the search to pages referenced from Wikipedia • Up to 10 URIs per entity • With most prominent years in dataset 6
URI Collection from Delicious Delicious contains tagged bookmarks with timestamps DATE USERID URI TAG... • General idea: Matching of tags against entities • Problem: Tags are single words • Solution: Normalizing tags and entity titles • Disambiguation terms must be found in tags • Additional tags as metadata Normalization Entity: New_York_(State) ⇒ newyork Tag: new-york ⇒ newyork 7
URI Unification URI := protocol://domain[/path][?query][#fragment] • Protocol removal • Subdomain unification ( www. prefix) • Path stripping (index pages) • Query parameter cleanup (empty or tracking keys) • Fragment removal Unification http://www.example.org/index.html?foo=&ref=123#content https://example.org/?ref=456#about ⇒ www.example.org 8
Ranking of Entity-URI Matches • Provide initial votings in range [ 1 , 10 ] per dataset • Wikipedia with links to homepages or databases • Voting: 10 (domains), 5 (URIs with path) • Wikidata contains hand-picked URIs, but many indirect ones seem not useful • Voting: 10 (direct), 5 (indirect) • GWA and GWW seem to have weak results • Voting: 3 • Delicious URI matches are ranked by relative number of users of a tag • Voting: 10 (most used tag) down to 1 9
Dataset by the Numbers • 22.8 M URIs and 1.6 M described entities • 91.3 % of entities have matching URIs • 13.4 % of entities with at least one multi-source URI • 13.7 URIs per entity on average • URIs for 70 % of entities in both Wiki and GWW • 35 % in GWW and only 2.3 % in Delicious Average URIs per type and dataset Dataset Person Org. Place C.W. All Wikipedia 3.52 2.56 2.47 1.81 2.83 Wikidata 5.52 2.79 2.96 5.95 4.60 Delicious 22.15 51.43 63.53 61.43 55.86 GWA 6.54 7.16 6.52 8.49 7.14 GWW 4.43 5.32 4.63 6.10 5.13 10
Dataset Cross-Evaluation
URI Assignments to Entities GWA GWW 9,046,987 3,365,652 Wiki Delicious 171,895 114,048 4,963 8,329,706 2,326,440 13,688 3,810 1,901 17,801 16,056 1,968 5,634 8,913 11
Overlapping URIs in Delicious 10,000 50 Entity Type 9,000 45 CreativeWork Organization 8,000 Person 40 Place 7,000 35 Overlapping URIs 6,000 30 Overlap (%) 5,000 25 4,000 20 3,000 15 2,000 10 1,000 5 0 0 1 2 3 4 5 6 7 8 9 10 Vote 12
Quality of Individual Datasets Mean Reciprocal Rank Person Organization | Q | Delicious Delicious 1 1 � MRR = | Q | rank i GWA GWA Dataset Dataset i = 1 GWW GWW • All URIs of a dataset as Q • Avg. inverse rank in results Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 • Penalty for missing URIs MRR MRR Place CreativeWork Results Delicious Delicious • Delicious performs well GWA GWA Dataset Dataset overall GWW GWW • URI selection of GWW Wiki Wiki better than GWA 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 MRR MRR 13
Quality of Overlapping Results nDCG – simplified in this Person Organization environment: Delicious Delicious • Only overlapping URIs GWA GWA Dataset Dataset • Ideal rank: all results rank 1 GWW GWW Results Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 • Best URIs in Wiki nDCG nDCG Place CreativeWork • GWA contains more of the Delicious Delicious popular URIs than GWW GWA GWA Dataset Dataset GWW GWW Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 nDCG nDCG 14
Comparison with Web Search Engines Bing results for 50,000 queries of our most promising entities • 10 URIs from the first result page • Precision: • Our top URI is in the result set of 83 % of entities • Average Bing position: 2.26 • Recall: • 23.29 % of all Bing URIs in our dataset 15
Conclusion
Conclusion Ordered collection of annotated resources describing 1.6 M named entities over time. • Combination of multiple diverse datasets • Evaluation of dataset quality • Promising results with respect to regular search engines Future Work • Integration of more data sources • Expansion covered time-frames and entities • Date tag evaluation • Language awareness • Improved entity matching 16
Explore Dataset Online https://tempurion.l3s.uni-hannover.de 17
Recommend
More recommend