A Collaborative Named Entity Focused URI Collection to Explore Web - PowerPoint PPT Presentation

A Collaborative Named Entity Focused URI Collection to Explore Web Archives Workshop on Web Archiving and Digital Libraries Sergej Wildemann & Helge Holzmann June 6, 2019 L3S Research Center - University of Hannover - Germany

Introduction

Motivation • Named entities are evolving • Cities grow • People change positions in their careers • Information is spread on the Web • Content is changed or deleted on web pages • Web search engines forget or rerank resources • Limited search in web archives by: • Exact URI • "Full-text" Challenge Accessing online resources that describe an entity over time 1

Example: Site Search in the Wayback Machine • Indexed anchor texts • Problem: Domains only, no specific paths or dates 2

Dataset Generation

Processing Pipeline 1. Entity collection 2. URI collection and assignment 3. URI unification 4. Ranking 5. Temporal enrichment 3

Incorporated Datasets Dataset Entities Classes URIs Tags Dates Wikipedia • • DBpedia • ⋊ ⋉ Wikidata • ⋊ ⋉ Delicious • • ⋊ • ⋉ ⋊ ⋉ GWA • • ⋊ ⋉ GWW • • ⋊ ⋉ Wayback CDX • ⋊ ⋉ ( • ) extracted information – ( ⋊ ⋉ ) used for joining 4

Entity Collection • Potential entities: List of all Wikipedia article titles • Filtering with DBpedia’s ontology: Type Count % Person 1,500,000 22.7 Place 840,000 12.7 Creative Work 496,000 7.5 Organization 286,000 4.3 Other 2,378,000 36.0 N/A 1,100,000 16.7 6,600,000 100 5

URI Collection • Wikidata provides URIs directly or indirectly via identifiers • Wikipedia articles contain "External Links" section • Official websites (usually without a path) • Databases like IMDb • GWA and GWW were generated from the German Web Archive (1996-2013) • Searching for entities in anchor texts • GWW restricted the search to pages referenced from Wikipedia • Up to 10 URIs per entity • With most prominent years in dataset 6

URI Collection from Delicious Delicious contains tagged bookmarks with timestamps DATE USERID URI TAG... • General idea: Matching of tags against entities • Problem: Tags are single words • Solution: Normalizing tags and entity titles • Disambiguation terms must be found in tags • Additional tags as metadata Normalization Entity: New_York_(State) ⇒ newyork Tag: new-york ⇒ newyork 7

URI Unification URI := protocol://domain[/path][?query][#fragment] • Protocol removal • Subdomain unification ( www. prefix) • Path stripping (index pages) • Query parameter cleanup (empty or tracking keys) • Fragment removal Unification http://www.example.org/index.html?foo=&ref=123#content https://example.org/?ref=456#about ⇒ www.example.org 8

Ranking of Entity-URI Matches • Provide initial votings in range [ 1 , 10 ] per dataset • Wikipedia with links to homepages or databases • Voting: 10 (domains), 5 (URIs with path) • Wikidata contains hand-picked URIs, but many indirect ones seem not useful • Voting: 10 (direct), 5 (indirect) • GWA and GWW seem to have weak results • Voting: 3 • Delicious URI matches are ranked by relative number of users of a tag • Voting: 10 (most used tag) down to 1 9

Dataset by the Numbers • 22.8 M URIs and 1.6 M described entities • 91.3 % of entities have matching URIs • 13.4 % of entities with at least one multi-source URI • 13.7 URIs per entity on average • URIs for 70 % of entities in both Wiki and GWW • 35 % in GWW and only 2.3 % in Delicious Average URIs per type and dataset Dataset Person Org. Place C.W. All Wikipedia 3.52 2.56 2.47 1.81 2.83 Wikidata 5.52 2.79 2.96 5.95 4.60 Delicious 22.15 51.43 63.53 61.43 55.86 GWA 6.54 7.16 6.52 8.49 7.14 GWW 4.43 5.32 4.63 6.10 5.13 10

Dataset Cross-Evaluation

URI Assignments to Entities GWA GWW 9,046,987 3,365,652 Wiki Delicious 171,895 114,048 4,963 8,329,706 2,326,440 13,688 3,810 1,901 17,801 16,056 1,968 5,634 8,913 11

Overlapping URIs in Delicious 10,000 50 Entity Type 9,000 45 CreativeWork Organization 8,000 Person 40 Place 7,000 35 Overlapping URIs 6,000 30 Overlap (%) 5,000 25 4,000 20 3,000 15 2,000 10 1,000 5 0 0 1 2 3 4 5 6 7 8 9 10 Vote 12

Quality of Individual Datasets Mean Reciprocal Rank Person Organization | Q | Delicious Delicious 1 1 � MRR = | Q | rank i GWA GWA Dataset Dataset i = 1 GWW GWW • All URIs of a dataset as Q • Avg. inverse rank in results Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 • Penalty for missing URIs MRR MRR Place CreativeWork Results Delicious Delicious • Delicious performs well GWA GWA Dataset Dataset overall GWW GWW • URI selection of GWW Wiki Wiki better than GWA 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 MRR MRR 13

Quality of Overlapping Results nDCG – simplified in this Person Organization environment: Delicious Delicious • Only overlapping URIs GWA GWA Dataset Dataset • Ideal rank: all results rank 1 GWW GWW Results Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 • Best URIs in Wiki nDCG nDCG Place CreativeWork • GWA contains more of the Delicious Delicious popular URIs than GWW GWA GWA Dataset Dataset GWW GWW Wiki Wiki 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 nDCG nDCG 14

Comparison with Web Search Engines Bing results for 50,000 queries of our most promising entities • 10 URIs from the first result page • Precision: • Our top URI is in the result set of 83 % of entities • Average Bing position: 2.26 • Recall: • 23.29 % of all Bing URIs in our dataset 15

Conclusion

Conclusion Ordered collection of annotated resources describing 1.6 M named entities over time. • Combination of multiple diverse datasets • Evaluation of dataset quality • Promising results with respect to regular search engines Future Work • Integration of more data sources • Expansion covered time-frames and entities • Date tag evaluation • Language awareness • Improved entity matching 16

Explore Dataset Online https://tempurion.l3s.uni-hannover.de 17

A Collaborative Named Entity Focused URI Collection to Explore Web - PowerPoint PPT Presentation

A Collaborative Named Entity Focused URI Collection to Explore Web Archives Workshop on Web Archiving and Digital Libraries Sergej Wildemann & Helge Holzmann June 6, 2019 L3S Research Center - University of Hannover - Germany Introduction

Named Entity Recognition Using BERT and ELMo Group 8 : Mikaela Guerrero Vikash Kumar Nitya

Recycling Named Entity Taggers Unsupervised Domain and Language Adaptation for Named Entity

Sunglasses SM001 Collection SM005 Collection YPC001 Collection(swimming goggles) SR001

Named Entity WordNet *Istituto di Linguistica Computazionale (Pisa, Italy) ^University of

Multi-Task Transfer Learning for Fine-Grained Named Entity Recognition Masato Hagiwara 1 , Ryuji

Information Extraction Extracting limited forms of information from text Named entity

AIDA-light: High-Throughput Named-Entity Disambiguation Ba Dat Nguyen Johannes Hoffart Martin

VI.3 Named Entity Reconciliation Problem: Same entity appears in Different spellings

Languages & the Professions: URI as National Leader URI: A National Leader in Languages

10/29/14 URI and Economic Development URI is the flagship public university in RI AND the

MiniApp URI Proposal Zhou Dan 1 Content Use Case - Access a MiniApp Background

Guidelines and Registration Procedures for New URI Schemes draft-ietf-appsawg-uri-scheme-reg-01

All Your URI are Belong to Us Geoffrey Young geoff@modperlcookbook.org 1

Design Challenges for Entity Linking Xiao Ling , Sameer Singh, Daniel S. Weld Entity Linking

Conference + Meeting Spaces Salt + Pepper TONON COLLECTION Macs Table TONON COLLECTION Pit

Conference + Meeting Spaces Salt + Pepper TONON COLLECTION Macs Table TONON COLLECTION Pit

Visualizer Basics Jon Storrick Jon.Storrick@gmail.com Center for Computational Analysis of

Refactoring Sequential Java Code for Concurrency via Concurrent Libraries Danny Dig (MIT

Building a Database on S3 By Sandeepkrishnan Introduction Next wave on the Web to provide

Web Security: Vulnerabilities & Attacks Slide credit: John Mitchell Dawn Song Security User

The use of the SIPS URI Scheme in SIP draft-ietf-sip-sips-05 Franois Audet - audet@nortel.com

What is Scripting? ! Yes! The name comes from written script CSCI: 4500/6500 Programming such as

Multiple Graph Processing Naming and Addressing Alastair Green Neo Technology Cypher Language

OAuth 2.0 authorization using blockchain-based tokens Nikos Fotiou, Iakovos Pittaras, Vasilios

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

A Collaborative Named Entity Focused URI Collection to Explore Web - PowerPoint PPT Presentation

A Collaborative Named Entity Focused URI Collection to Explore Web Archives Workshop on Web Archiving and Digital Libraries Sergej Wildemann & Helge Holzmann June 6, 2019 L3S Research Center - University of Hannover - Germany Introduction

Named Entity Recognition Using BERT and ELMo Group 8 : Mikaela Guerrero Vikash Kumar Nitya

Recycling Named Entity Taggers Unsupervised Domain and Language Adaptation for Named Entity

Sunglasses SM001 Collection SM005 Collection YPC001 Collection(swimming goggles) SR001

Named Entity WordNet *Istituto di Linguistica Computazionale (Pisa, Italy) ^University of

Multi-Task Transfer Learning for Fine-Grained Named Entity Recognition Masato Hagiwara 1 , Ryuji

Information Extraction Extracting limited forms of information from text Named entity

AIDA-light: High-Throughput Named-Entity Disambiguation Ba Dat Nguyen Johannes Hoffart Martin

VI.3 Named Entity Reconciliation Problem: Same entity appears in Different spellings

Languages &amp; the Professions: URI as National Leader URI: A National Leader in Languages

10/29/14 URI and Economic Development URI is the flagship public university in RI AND the

MiniApp URI Proposal Zhou Dan 1 Content Use Case - Access a MiniApp Background

Guidelines and Registration Procedures for New URI Schemes draft-ietf-appsawg-uri-scheme-reg-01

All Your URI are Belong to Us Geoffrey Young geoff@modperlcookbook.org 1

Design Challenges for Entity Linking Xiao Ling , Sameer Singh, Daniel S. Weld Entity Linking

Conference + Meeting Spaces Salt + Pepper TONON COLLECTION Macs Table TONON COLLECTION Pit

Conference + Meeting Spaces Salt + Pepper TONON COLLECTION Macs Table TONON COLLECTION Pit

Visualizer Basics Jon Storrick Jon.Storrick@gmail.com Center for Computational Analysis of

Refactoring Sequential Java Code for Concurrency via Concurrent Libraries Danny Dig (MIT

Building a Database on S3 By Sandeepkrishnan Introduction Next wave on the Web to provide

Web Security: Vulnerabilities &amp; Attacks Slide credit: John Mitchell Dawn Song Security User

The use of the SIPS URI Scheme in SIP draft-ietf-sip-sips-05 Franois Audet - audet@nortel.com

What is Scripting? ! Yes! The name comes from written script CSCI: 4500/6500 Programming such as

Multiple Graph Processing Naming and Addressing Alastair Green Neo Technology Cypher Language

OAuth 2.0 authorization using blockchain-based tokens Nikos Fotiou, Iakovos Pittaras, Vasilios

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

Languages & the Professions: URI as National Leader URI: A National Leader in Languages

Web Security: Vulnerabilities & Attacks Slide credit: John Mitchell Dawn Song Security User