Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles

Andreas Spitz and Michael Gertz, Heidelberg University, Institute of Computer Science, Database Systems Research Group, http://dbs.ifi.uni-heidelberg.de


  1. Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles
     Andreas Spitz and Michael Gertz
     Heidelberg University, Institute of Computer Science, Database Systems Research Group
     http://dbs.ifi.uni-heidelberg.de, spitz@informatik.uni-heidelberg.de
     statNLP Kolloquium Heidelberg, June 26, 2015

  2. Motivation | Data Extraction | Network Structure | Citation Characteristics | Applications | Traditional Networks | Summary
     Extracting the Sparse Citation Network Backbone of Online News Articles, © Andreas Spitz, 1 of 37

  5. Networks of news articles are analyzed frequently, e.g. for
     • Information diffusion
     • Event detection
     • Information cascades
     • Media dynamics
     But what about network extraction and emergence?
     • Are all networks of news articles born equal?
     • Or: when is a link a link?

  6. Overview
     1) Data Extraction of networks of news articles
     2) Network Structure of the News Citation Network
     3) Citation Characteristics of the network
     4) Applications and Analysis on the network
     5) Traditional Networks in comparison
     6) Summary

  7. The Ideal Network of News Articles
     Directed, acyclic network with time ordering of nodes
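The time ordering is what makes the ideal network acyclic: if every reference points from a later article to a strictly earlier one, no cycle can close. A minimal Python sketch of that check follows. The article ids, timestamps and the function name `violates_time_order` are hypothetical and do not come from the slides.

```python
from datetime import datetime

def violates_time_order(published, edges):
    """Return the edges whose target is not strictly older than the source."""
    return [(src, dst) for src, dst in edges if published[dst] >= published[src]]

# Hypothetical toy data: article id -> publication time.
published = {
    "a": datetime(2015, 1, 1, 8, 0),
    "b": datetime(2015, 1, 2, 9, 30),
    "c": datetime(2015, 1, 3, 12, 0),
}
# An edge (src, dst) means "src cites dst".
edges = [("b", "a"), ("c", "b"), ("a", "c")]

bad_edges = violates_time_order(published, edges)
print(bad_edges)
```

If the returned list is empty, sorting the articles by publication time is a topological order of the network, so the network is a directed acyclic graph.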

  8. Types of Links Between News Articles
     Classification of links by location and target:
     a) navigational links
     b) anchored references
     c) internal links
     d) advertisement

  10. The Established Approach: Crawling
      For very large data sets:
      • Select a large number of news outlets
      • Crawl the web pages and follow links
      • Extract all articles along the way
      Problems:
      • Determining publication time
      • Extracting the article's content
      • Recombining multi-page articles
      • Distinguishing between link types

  12. The Established Approach: RSS-Feeds
      For streams of news articles:
      • Select news outlets that publish RSS-Feeds
      • Periodically check Feeds
      • Download new articles
      Problems:
      • Determining publication time
      • Extracting the article's content
      • Recombining multi-page articles
      • Distinguishing between link types
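The "periodically check, then download what is new" loop can be sketched with Python's standard library. This is only an illustration under assumptions: the feed content, the URLs and the helper `new_items` are made up, the feed is assumed to be plain RSS 2.0, and the actual download and scheduling steps are left out.

```python
import xml.etree.ElementTree as ET

def new_items(rss_xml, seen_links):
    """Parse one RSS 2.0 document and return the items whose link is not yet seen."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({
                "link": link,
                "title": item.findtext("title", default=""),
                # Kept exactly as the feed states it; not verified against the article.
                "pubDate": item.findtext("pubDate", default=""),
            })
    return fresh

# Hypothetical feed snapshot; a real poller would fetch this from the outlet.
FEED = """<rss version="2.0"><channel><title>Example outlet</title>
<item><title>First</title><link>http://example.org/1</link><pubDate>Fri, 26 Jun 2015 10:00:00 +0200</pubDate></item>
<item><title>Second</title><link>http://example.org/2</link><pubDate>Fri, 26 Jun 2015 11:00:00 +0200</pubDate></item>
</channel></rss>"""

seen = set()
first_poll = new_items(FEED, seen)   # both items are new
second_poll = new_items(FEED, seen)  # nothing new on the second check
print(len(first_poll), len(second_poll))
```

Keeping the set of seen links between polls is what turns repeated feed checks into a stream of new articles.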

  13. Structural Basics of News Articles: HTML DOM-Tree

  15. A Rule-based Approach
      Create a network by
      • limiting the set of nodes to articles published by news outlets
      • downloading all pages of multi-page articles
      • using outlet-dependent rules to extract the article text
      • extracting anchored references within the texts as edges
      Problems:
      • Determining publication time
      • Extracting the article's content
      • Recombining multi-page articles
      • Distinguishing between link types
      • Additional effort to find extraction rules
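One way to picture an outlet-dependent rule is as a pointer to the DOM element that holds the article text; only links inside that element count as anchored references. The sketch below uses Python's `html.parser` and is not the authors' implementation: the rule table `RULES`, the outlet name, the CSS class and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical outlet-dependent rules: outlet -> (tag, class) of the article body.
RULES = {"example-outlet": ("div", "article-body")}

class AnchoredReferenceParser(HTMLParser):
    """Collect hrefs of <a> tags that occur inside the article-body container."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0   # nesting depth of container-type tags inside the body
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Outside the article body: only look for the container itself.
            classes = (attrs.get("class") or "").split()
            if tag == self.tag and self.css_class in classes:
                self.depth = 1
            return
        if tag == self.tag:
            self.depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

def anchored_references(html, outlet):
    parser = AnchoredReferenceParser(*RULES[outlet])
    parser.feed(html)
    return parser.links

PAGE = """
<div class="nav"><a href="/politics">Politics</a></div>
<div class="article-body"><p>As <a href="/2015/06/earlier-story">reported earlier</a>,
<div class="box">see <a href="/2015/05/background">background</a></div> ...</p></div>
<div class="ads"><a href="http://ads.example.com/x">Buy</a></div>
"""
links = anchored_references(PAGE, "example-outlet")
print(links)
```

The navigational and advertisement links are skipped because they sit outside the container. Resolving each href to an article node, and dropping targets that are not news articles, would be separate steps.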

  16. The News Citation Network
      Data collected from 6 German news outlets over 10 months
      [Figure: bar charts of article frequency by outlet (welt, zeit, faz, other) and by category (politics, business, none); bar labels: 11010, 9544, 7630, 5207, 3363, 668, 142]
      |V| = 18,782 articles and |E| = 21,581 references between them
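The two totals on this slide are enough to quantify how sparse the network is. A short Python calculation, using only |V| and |E| as stated and the standard density formula for a directed graph without self-loops (the variable names are illustrative):

```python
n_articles = 18_782    # |V| from the slide
n_references = 21_581  # |E| from the slide

# Average number of references per article (mean in-degree equals mean out-degree).
mean_degree = n_references / n_articles

# Share of all possible directed article pairs that are actually linked.
density = n_references / (n_articles * (n_articles - 1))

print(f"{mean_degree:.3f} references per article, density {density:.2e}")
```

This works out to roughly 1.15 references per article, with only about 6 in every 100,000 possible directed links present.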

  17. Components of the News Network
      • 63.1% of nodes in one giant connected component
      • Component consists of two clusters of articles from Zeit and Welt
      • Other articles are mixed in or form small, homogeneous components
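A figure such as the share of nodes in the giant component can be computed by grouping nodes that are linked when edge direction is ignored. The sketch below assumes "connected" means weakly connected, which the slide does not state, and uses a small union-find structure on hypothetical toy data.

```python
from collections import Counter

def component_sizes(nodes, edges):
    """Sizes of weakly connected components, largest first (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps the trees shallow
            v = parent[v]
        return v

    for src, dst in edges:
        parent[find(src)] = find(dst)  # direction is ignored on purpose

    sizes = Counter(find(v) for v in nodes)
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy network: one chain-like cluster, one pair, one isolated article.
nodes = ["a", "b", "c", "d", "e", "f", "g", "h"]
edges = [("b", "a"), ("c", "a"), ("d", "c"), ("e", "d"), ("g", "f")]

sizes = component_sizes(nodes, edges)
giant_share = sizes[0] / len(nodes)
print(sizes, giant_share)
```

Counting how often each size occurs, for example with `Counter(sizes)`, gives the frequencies that a component size distribution plots.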

  18. Component Size Distribution
      [Figure: log-log plots of frequency against component size in nodes; panels: aggregated, politics, business, welt, zeit, faz]

  19. Degree Distribution
      [Figure: log-log plots of complementary cumulative probability against degree, for in- and out-degree; panels: aggregated, politics, business, welt, zeit, faz]
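For readers unfamiliar with the plotted quantity, a complementary cumulative distribution gives, for each degree k, the fraction of nodes whose degree is at least k. The sketch below is an assumption-laden illustration: the slide does not say whether "at least k" or "greater than k" is used, nor how degree-0 nodes are treated, and the edge list is made up.

```python
from collections import Counter

def ccdf(degrees):
    """Map each observed degree k to the fraction of values that are >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    result, remaining = {}, n
    for k in sorted(counts):
        result[k] = remaining / n
        remaining -= counts[k]
    return result

# Hypothetical edges; (src, dst) means "src cites dst".
edges = [("b", "a"), ("c", "a"), ("d", "a"), ("d", "c"), ("e", "d")]

in_degree = Counter(dst for _, dst in edges)
# Nodes with in-degree 0 never appear here, which suits a log axis starting at 1
# (an assumption about the plot, not something the slide confirms).
in_ccdf = ccdf(list(in_degree.values()))
print(in_ccdf)
```

Replacing `dst` with `src` in the `Counter` expression yields the out-degree counterpart.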
