Network-centric Approaches to the Exploration of News Streams
Andreas Spitz
November 12, 2018, EPFL, Lausanne
Heidelberg University, Germany, Database Systems Research Group
Collaborators: Satya Almasian, Gloria Feher, Michael Gertz, Jannik Strötgen
Catching up on the News (image source: www.deviantart.com/clearkid)
Part I Implicit Entity Networks
The Importance of Entities in News
The Five Ws of journalism:
◮ Who was involved?
◮ Where did it take place?
◮ When did it take place?
◮ What happened?
◮ Why did that happen?
A common definition of event in IR:
◮ An event is something that happens at a given place and time between a group of actors.
What Are Implicit Entity Networks?
Implicit Network Construction
Implicit Network Extraction
Implicit Network Aggregation
A. Spitz and M. Gertz. “Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events”. In: SIGIR. 2016
Applications of Implicit Networks
NLP and IR applications:
◮ Entity disambiguation
◮ Entity linking
◮ Extractive summarization
◮ Relationship extraction
◮ ...
Interactive text stream exploration:
◮ Entity participation in events
◮ Evolving topic detection
◮ Visual summarization
◮ ...
Entity-centric News Exploration
News Article Data Set
English news articles from RSS feeds:
◮ 14 news outlets (from US, UK, and AU)
◮ 6 months (Jun 1 - Nov 30, 2016)
◮ 127k articles
◮ 5.4M sentences
NLP processing pipeline:
◮ Part-of-speech and sentence tagging: Stanford POS tagger
◮ Temporal tagging: HeidelTime
◮ Entity classification: YAGO classes (LOC, ORG, PER)
◮ Named entity recognition and linking:
The resulting implicit network has
◮ 125k entities
◮ 351k terms
◮ 83.4M edges
Implicit Network Exploration Pipeline
Interactive Entity-centric Search
Try it yourself:
A. Spitz, S. Almasian, and M. Gertz. “EVELIN: Exploration of Event and Entity Links in Implicit Networks”. In: WWW. 2017. URL: http://evelin.ifi.uni-heidelberg.de:7777
Interactive Entity-centric Search: An Example
Evaluation Data: Entity Participation in Events
Evaluation Results: Entity Participation
[Figure: recall@k (0 to 0.8) over rank k (0 to 50) in three panels: w2v skip-gram, w2v CBOW, GloVe. Legend (neighbourhood mode): implicit netw., SUM, AVG, MINMAX.]
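The metric on this slide, recall@k, can be sketched as follows. This is a minimal illustration of the usual definition, not the evaluation code behind the talk; the function name `recall_at_k` and the example identifiers are assumptions made for the example, and the exact evaluation protocol is not stated on the slide.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(ranked: Sequence[Hashable], relevant: Iterable[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear among the top-k ranked items."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("recall is undefined without relevant items")
    hits = relevant_set.intersection(ranked[:k])
    return len(hits) / len(relevant_set)


# Hypothetical ranking of candidate entities returned for one event query,
# and the (hypothetical) set of entities that actually took part in the event.
ranking = ["Q145", "Q192", "Q159", "Q76", "Q7747"]
participants = {"Q192", "Q76", "Q22686"}

print(recall_at_k(ranking, participants, k=2))  # one of three participants found
print(recall_at_k(ranking, participants, k=5))  # two of three participants found
```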
Evaluation Results: Performance vs. Entity Frequency
[Figure: entity rank over entity frequency in four panels: implicit network, w2v skip-gram, w2v CBOW, GloVe.]
Entity-centric Network Topics
What Are Network Topics?
term          score
skripal       0.83
nerve         0.77
agent         0.76
u.k.          0.61
russia        0.58
diplomat      0.45
intelligence  0.43
poison        0.33
daughter      0.19
yulia         0.17
Implicit Network Extraction for Topic Detection
Andreas Spitz and Michael Gertz. “Entity-Centric Topic Extraction and Exploration: A Network-Based Approach”. In: ECIR. 2018
Edge Aggregation and Weighting � � − 1 | D ( v 1 ) ∪ D ( v 2 ) | + max { T ( e ) } − min { T ( e ) } c ( e ) ω ( e ) = 3 · + � | D ( e ) | | T ( e ) | δ ∈ ∆( e ) exp ( − δ ) � �� � � �� � � �� � coverage temporal coverage distance 16
Topic Extraction and Triangular Growth
Intuition:
◮ edges between entities correspond to seeds of topics
◮ topics can be grown around seeds by adding relevant terms
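The intuition above can be sketched on a small weighted graph. This is an illustrative sketch only: picking the heaviest entity-entity edges as seeds, the adjacency-dictionary representation, the function names, and all weights are assumptions made for the example rather than details given on the slide.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Dict[str, Dict[str, float]]  # node -> neighbour -> edge weight


def top_seed_edges(adj: Adjacency, entities: Set[str], k: int) -> List[Tuple[str, str]]:
    """Return the k heaviest entity-entity edges, used here as topic seeds."""
    seeds = {
        tuple(sorted((u, v))): w
        for u in entities
        for v, w in adj.get(u, {}).items()
        if v in entities and u != v
    }
    # Sort by descending weight; break ties alphabetically for determinism.
    return sorted(seeds, key=lambda e: (-seeds[e], e))[:k]


def triangle_terms(adj: Adjacency, seed: Tuple[str, str], terms: Set[str]) -> Set[str]:
    """Terms adjacent to both seed entities, i.e. terms that close a triangle."""
    e1, e2 = seed
    return set(adj.get(e1, {})) & set(adj.get(e2, {})) & terms


# Hypothetical graph fragment; only the entity rows are needed for this sketch.
entities = {"Russia", "Putin", "Moscow"}
terms = {"russian", "nato", "presid", "soviet"}
adj = {
    "Russia": {"Putin": 0.9, "Moscow": 0.7, "russian": 0.6, "nato": 0.3},
    "Putin": {"Russia": 0.9, "russian": 0.5, "presid": 0.4},
    "Moscow": {"Russia": 0.7, "russian": 0.4, "soviet": 0.2},
}

seed = top_seed_edges(adj, entities, k=1)[0]
print(seed, triangle_terms(adj, seed, terms))
```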
Topic Overlap and Merging Topics
Topic Subgraph Exploration: An Example
Term Ranking in Network Topics
term   score
t_1    min{ω(e_1, t_1), ω(e_2, t_1)}
t_2    min{ω(e_1, t_2), ω(e_2, t_2)}
...    ...
t_n    min{ω(e_1, t_n), ω(e_2, t_n)}
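The ranking rule on this slide scores each term by the weaker of its two edges to the seed entities. A minimal sketch, assuming the weights ω(e_1, t) and ω(e_2, t) are available as plain dictionaries; the function name and the example weights are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_terms(weights_e1: Dict[str, float],
               weights_e2: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every term connected to both entities by min(w(e1, t), w(e2, t))."""
    shared = weights_e1.keys() & weights_e2.keys()
    scored = [(t, min(weights_e1[t], weights_e2[t])) for t in shared]
    # Highest score first; alphabetical order breaks ties.
    return sorted(scored, key=lambda ts: (-ts[1], ts[0]))


# Hypothetical term weights for a seed edge (e1, e2).
w_e1 = {"russian": 0.6, "nato": 0.3, "presid": 0.2}
w_e2 = {"russian": 0.5, "presid": 0.4, "annex": 0.1}

for term, score in rank_terms(w_e1, w_e2):
    print(term, score)
```

Taking the minimum means a term only ranks highly if it is strongly tied to both entities, which keeps terms that are specific to one endpoint out of the topic.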
Deriving Classic Topics From Network Topics
Beirut - Lebanon     Russia - Moscow     Russia - Putin      Trump - Obama
Q3820 - Q822         Q159 - Q649         Q159 - Q7747        Q22686 - Q76
term        score    term      score     term      score     term        score
syrian      0.14     russian   0.28      russian   0.29      presid      0.40
rebel-held  0.12     soviet    0.06      presid    0.18      american    0.21
rebel       0.06     nato      0.06      annex     0.09      republican  0.19
cease-fir   0.05     diplomat  0.06      nato      0.08      democrat    0.19
bombard     0.05     syrian    0.06      hack      0.08      campaign    0.18
bomb        0.04     rebel     0.05      west      0.08      administr   0.17
Network news topics from the New York Times (Jun - Nov 2016)
Benefits of Entity-centric Network Topics
Benefits vs. traditional topics:
◮ faster extraction than LDA topics
◮ number of topics is flexible
◮ runtime contained in data preparation
Stream compatibility:
◮ document updates require only (sub-)graph updates
Interactive Topic Exploration
Try it yourself:
A. Spitz, S. Almasian, and M. Gertz. “TopExNet: Entity-Centric Network Topic Exploration in News Streams”. In: WSDM. 2019. URL: http://topexnet.ifi.uni-heidelberg.de
Linking Topics to Source Articles
Contexts of Entity Mentions
Why the Context Matters
Edge Context Extraction
Andreas Spitz and Michael Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: WWW Companion. 2018
Context-based Aggregation of Edges
Andreas Spitz and Michael Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: WWW Companion. 2018
Edge Aggregation Approaches
Streaming aggregation:
◮ Compare similarity of new edge (v, w, ·) to existing edges (v, w, ·)
◮ If similarity threshold is exceeded: merge with existing edge
◮ Otherwise, insert as new parallel edge
Static aggregation / clustering:
◮ Collect all parallel edges
◮ Cluster parallel edges (density-based)
◮ Discard “noisy” edges
◮ Aggregate edges within clusters
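The streaming variant can be sketched as a small class. The slide names neither the similarity measure nor the merge rule, so several choices here are assumptions: edge contexts are bags of words, similarity is cosine similarity, a new edge is merged into the most similar existing parallel edge, and merging adds term counts. The class name `StreamingEdgeAggregator` and the example contexts are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words contexts (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


class StreamingEdgeAggregator:
    """Keeps a list of parallel edges (contexts) per unordered node pair."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.edges: DefaultDict[Tuple[str, str], List[Counter]] = defaultdict(list)

    def add(self, v: str, w: str, context: List[str]) -> None:
        key = tuple(sorted((v, w)))
        new = Counter(context)
        parallel = self.edges[key]
        # Assumption: compare against all parallel edges and pick the most similar.
        best = max(parallel, key=lambda c: cosine(c, new), default=None)
        if best is not None and cosine(best, new) > self.threshold:
            best.update(new)      # threshold exceeded: merge into existing edge
        else:
            parallel.append(new)  # otherwise: insert as new parallel edge


agg = StreamingEdgeAggregator(threshold=0.4)
agg.add("David Cameron", "UK", ["brexit", "vote"])
agg.add("UK", "David Cameron", ["brexit", "referendum"])  # similarity 0.5: merged
agg.add("David Cameron", "UK", ["resign", "leader"])      # similarity 0.0: new edge
print(len(agg.edges[("David Cameron", "UK")]))
```

The static variant differs in that all parallel edges are collected first and then clustered, which this sketch does not cover.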
Evaluation Results: Entity Participation (with Context)
[Figure: comparison of context aggregation methods; recall@k over rank k (0 to 50) for streaming, static, and no context.]
Edge Deflation Potential
[Figure: edge deflation in streaming aggregation; number of aggregated edges over number of unaggregated edges for aggregation thresholds t = 0.3, 0.4, 0.5, and 0.6.]
Evolving Network Topics
[Figure: topics for David Cameron (Q192) - UK (Q145); relative frequency of mentions from Jun to Oct for the terms brexit, nation, favour, referendum, ukip, vote, prime, minist, leader, demand, govern, westminst, campaign, resign, pro-brexit.]
Summary and Overview (Part I)