CiteSeerX Data: Semanticizing Scholarly Papers



  1. CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS
     Jian Wu, IST, Pennsylvania State University
     Chen Liang, IST, Pennsylvania State University
     Huaiyu Yang, EECS, Vanderbilt University
     C. Lee Giles, IST & CSE, Pennsylvania State University
     The International Workshop on Scholarly Big Data (SBD 2016)

  2. Self-Introduction
     • Dr. C. Lee Giles: David Reese Professor; PI and Director of CiteSeerX (Pennsylvania State University)
     • Dr. Jian Wu: Postdoctoral scholar; tech leader of CiteSeerX (Pennsylvania State University)
     • Chen Liang: PhD student (Pennsylvania State University)
     • Huaiyu Yang: Undergraduate student (Vanderbilt University)

  3. Outline
     • Scholarly Big Data and the Uniqueness of CiteSeerX Data
     • Data Acquisition and Extraction
     • Data Products
       • Raw Data
       • Production Database
       • Production Repository
     • Data Management and Access
     • Semantic Entity Extraction from Academic Papers

  4. Scholarly Data as Big Data
     • "Volume"
       • About 120 million scholarly documents on the Web: 120TB or more [1]
       • Growing at a rate of >1 million documents annually
       • English only; other languages add roughly a factor of 2
       • Compare: NASA Earth Exchange Downscaled Climate Projections dataset (17TB)
     [Bar chart: "#Scholarly Documents", number of documents in millions, scale 0-120]
     [1] Khabsa and Giles (2014, PLoS ONE)

  5. Scholarly Big Data Features
     • "Variety"
       • Unstructured: document text
       • Structured: metadata such as title, authors, and citations
       • Semi-structured: tables, figures, algorithms, etc.
       • Rich in facts and knowledge
       • Related data: social networks, slides, course material, data "inside" papers
     • "Velocity"
       • Scholarly data is expected to be available in real time
     • On the whole, scholarly data can be considered an important instance of big data.

  6. Digital Library Search Engine (DLSE)
     • Crawl-based vs. submission-based DLSEs:

                          Crawl-based               Submission-based
       Data Source        Internet                  Author upload
       Metadata Source    Automatically extracted   Author input (majority) + automatically extracted
       Data Quality       Varies                    High
       Human Labor        (Relatively) low          High
       Accessibility      Open (or partially)       Subscription

     • Crawl-based DLSEs are important sources of scholarly data for research tasks such as citation recommendation, author name disambiguation, ontology building, document classification, and the Science of Science.

  7. The Uniqueness of CiteSeerX Data
     • Open-access scholarly datasets:

                                     DBLP             MAG*                    CiteSeerX
       Documents                     5 million        100 million             7 million
       Header                        y                y                       y
       Citations                     n                y                       y
       URLs                          y (publishers)   y (open + publishers)   y (open)
       Full text                     n                n                       y
       Disambiguated author names    n                n                       y

     * MAG: Microsoft Academic Graph

  8. Data Acquisition
     [Diagram: seed URLs feed web crawling, which fills the crawl repository. Seed sources: a whitelist of URLs, external links (Wikipedia, PubMed Central, arXiv, Microsoft Academic Graph URLs), user-submitted URLs, and open-access digital repositories.]

  9. Metadata Extraction
     [Diagram: two extraction pipelines running from the crawl repository and crawl database, one in production and one currently under test.]
     • Text extraction: PDFLib TET; PDFBox/Xpdf via PDFMEF
     • Document filtering: rule-based filter; ML-based filter
     • Header extraction: SVMHeaderParse; GROBID
     • Citation parsing: ParsCit
     (A sketch chaining two of these stages follows below.)
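
To make the stages concrete, here is a minimal Python sketch chaining two of the tools named above: Xpdf's pdftotext for text extraction and a GROBID service for header metadata. The file paths, the localhost port, and the single-pass wiring are assumptions for illustration, not CiteSeerX's production configuration.

```python
# Minimal sketch chaining two extraction stages: Xpdf's pdftotext for
# plain text, then a GROBID service for header metadata (TEI XML).
# Assumes pdftotext is on PATH and GROBID listens on localhost:8070;
# file paths are examples only.
import subprocess
import requests

def extract_text(pdf_path: str, txt_path: str) -> None:
    """Convert a PDF to plain text with pdftotext."""
    subprocess.run(["pdftotext", pdf_path, txt_path], check=True)

def extract_header(pdf_path: str) -> str:
    """Request TEI-encoded header metadata (title, authors, ...) from GROBID."""
    with open(pdf_path, "rb") as f:
        resp = requests.post("http://localhost:8070/api/processHeaderDocument",
                             files={"input": f})
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    extract_text("paper.pdf", "paper.txt")
    print(extract_header("paper.pdf")[:300])
```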

  10. Figure/Table/Bar Chart Extraction
     • Data: CiteSeerX papers
     • Extraction:
       • Extract figures and tables from papers
       • Extract metadata from the figures and tables
     [Diagram: each extracted figure or table yields metadata; semantic trends are inferred from charts, cell descriptions from tables.]
     • Large-scale experiment: 6.7 million papers in 14 days with 8 processes (see the parallel-pass sketch below)
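
The quoted throughput works out to roughly 6.7M / (14 × 24 × 3600) ≈ 5.5 papers per second overall, or about 1.4 s per paper per worker across 8 processes. A minimal sketch of such a parallel pass, with a hypothetical extract_figures_and_tables standing in for the real extractor:

```python
# Parallel extraction pass like the experiment above: 6.7M papers in
# 14 days across 8 processes is roughly 5.5 papers/s overall.
# extract_figures_and_tables is a hypothetical stand-in for the real
# extractor; the corpus paths are examples.
from multiprocessing import Pool

def extract_figures_and_tables(pdf_path: str) -> dict:
    # Placeholder: run figure/table extraction on one paper and return
    # its metadata (captions, table cells, inferred trends, ...).
    return {"paper": pdf_path, "figures": [], "tables": []}

if __name__ == "__main__":
    pdf_paths = [f"papers/{i:06d}.pdf" for i in range(1000)]  # example corpus
    with Pool(processes=8) as pool:  # 8 processes, as in the experiment
        for meta in pool.imap_unordered(extract_figures_and_tables, pdf_paths):
            pass  # e.g., write meta to the metadata store
```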

  11. Ingestion
     • Ingestion feeds data and metadata to the production retrieval system
     • Ingestion clusters near-duplicate documents by matching title and author (see the key-based sketch below)
       • Example: paper cluster 1 (cluster title: Focused Crawling Optimization; cluster author: Jian Wu) groups papers P.1 and P.2
       • Example: paper cluster 2 (cluster title: Deep web crawling; cluster authors: James Schneider, Mary Wilson) contains paper P.3
     • Ingestion generates the citation graph (next slide)
     • Storage: relational database, file system, Apache Solr
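
A minimal sketch of the clustering idea: documents whose normalized title and first-author surname coincide land in the same cluster. The key function below is an illustrative assumption; production matching is fuzzier than an exact key.

```python
# Near-duplicate clustering sketch: papers that share a normalized
# title and first-author surname fall into one cluster.
import re
from collections import defaultdict

def norm(s: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def cluster_key(title: str, first_author: str) -> tuple:
    parts = norm(first_author).split()
    return (norm(title), parts[-1] if parts else "")

papers = [
    {"title": "Focused Crawling Optimization",  "authors": ["Jian Wu"]},
    {"title": "Focused Crawling Optimization.", "authors": ["J. Wu"]},
    {"title": "Deep web crawling", "authors": ["James Schneider", "Mary Wilson"]},
]

clusters = defaultdict(list)
for p in papers:
    clusters[cluster_key(p["title"], p["authors"][0])].append(p)

for key, members in clusters.items():
    print(key, "->", len(members), "document(s)")
# The two "Focused Crawling Optimization" variants share one cluster.
```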

  12. The Citation Graph
     • Type 1 node: clusters with both in-degree and out-degree; contain papers and may contain citations
     • Type 2 node (root): clusters with zero in-degree and non-zero out-degree; contain only papers, i.e., papers that are not cited yet
     • Type 3 node (leaf): clusters with non-zero in-degree and zero out-degree; contain only citation records, i.e., records without full-text papers
     • Characteristics:
       • Directed
       • No cycles: old papers cannot cite new papers (edges run from newer to older)
     [Diagram: a small example graph with Paper 1, Paper 2, Citation 1, and Citation 2; edges run from newer to older.]
     (A degree-based classification sketch follows below.)
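
The three node types follow mechanically from in- and out-degree. A small self-contained sketch, assuming edges point from the citing cluster to the cited cluster as described above:

```python
# Classify citation-graph nodes into the three types purely by degree.
# Assumes edges point from citing cluster to cited cluster, matching
# the newer -> older direction described above.
from collections import defaultdict

def node_types(edges):
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for citing, cited in edges:
        outdeg[citing] += 1
        indeg[cited] += 1
        nodes.update((citing, cited))
    types = {}
    for n in sorted(nodes):
        if indeg[n] and outdeg[n]:
            types[n] = 1   # has papers, may also hold citations
        elif outdeg[n]:
            types[n] = 2   # root: full-text paper not yet cited
        else:
            types[n] = 3   # leaf: citation record without full text
    return types

# Toy example: Paper 1 cites Paper 2 and Citation 1; Paper 2 cites Citation 2.
print(node_types([("Paper 1", "Paper 2"), ("Paper 1", "Citation 1"),
                  ("Paper 2", "Citation 2")]))
# {'Citation 1': 3, 'Citation 2': 3, 'Paper 1': 2, 'Paper 2': 1}
```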

  13. Name Disambiguation
     • Challenging due to name variations and entity ambiguity
     • Task 1: distinguish different entities with the same surface name
       • Example: "Michael Jordan"? Michael J. Jordan, Michael I. Jordan, Michael W. Jordan (footballer), Michael Jordan (mycologist)
     • Task 2: resolve the same entity appearing under different surface names
       • Example: C L Giles, Lee Giles, C Lee Giles, Clyde Lee Giles
     (A blocking sketch for both tasks follows below.)
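
A common first step for both tasks is blocking: names are compared in detail only when they share a coarse key. "Surname plus first initial" is an illustrative choice of key, not necessarily the one CiteSeerX uses. A minimal sketch on the names from the slide:

```python
# Blocking sketch: compare names in detail only within a shared block.
from collections import defaultdict

def block_key(name: str) -> tuple:
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())  # (surname, first initial)

mentions = ["C L Giles", "Lee Giles", "C Lee Giles", "Clyde Lee Giles",
            "Michael I. Jordan", "Michael J. Jordan"]

blocks = defaultdict(list)
for m in mentions:
    blocks[block_key(m)].append(m)

for key, names in blocks.items():
    print(key, "->", names)
# ('giles', 'c') groups the C/Clyde variants (Task 2 candidates);
# ('jordan', 'm') groups both Jordans, which Task 1 must then split
# using coauthors, venues, topics, and other evidence.
# "Lee Giles" lands in its own block, showing the limits of naive keys.
```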

  14. User Correction
     [Figure: user-correction link on a paper summary page.]
     • Users can change almost all metadata fields
     • New values take effect immediately after changes are submitted
     • Metadata can be changed multiple times
     • Version control
     • About 1 million user corrections since 2008

  15. Data Products: Raw Data
     • Crawl repository
       • 24TB of PDFs
     • Crawl database
       • 26 million document URLs
       • 2.5 million parent URLs (1.9 million of them homepages)
       • 16GB
     [Charts: "Document Collection of CiteSeerX": documents crawled, ingested, and indexed per year, 2008-2015, reaching 26 million crawled in 2015; document URLs broken down into PDFs vs. other pages, parent URLs into homepages vs. other pages.]

  16. Data Products: Crawl Website
     • http://csxcrawlweb01.ist.psu.edu/
     • Submit a URL to crawl
     • Country ranking by number of crawled documents
     • Domain ranking by number of crawled documents

  17. What Documents Have We Crawled?
     • Manually labeled 1,000 randomly selected crawled documents
     [Pie chart: paper 47.9%, others 35.0%, non-en 7.2%, poster 0.6%, abstract 0.5%, book 0.3%; slides, report, thesis, and resume hold the remaining shares of 4.5%, 1.8%, 1.5%, and 0.9%.]
     • The crawl repository can be used for document classification experiments to improve web crawling
     • The crawl database can be used to generate whitelists and schedule crawl jobs

  18. Production Databases
     • citeseerx: metadata directly extracted from papers
     • csx_citegraph: paper clusters and the citation graph

       database.table              description               rows
       citeseerx.papers            header metadata           6.8 million
       citeseerx.authors           author metadata           20.6 million
       citeseerx.cannames          authors (disambiguated)   1.2 million
       citeseerx.citations         references                150.2 million
       citeseerx.citationContext   citation context          131.9 million
       csx_citegraph.clusters      citation graph (nodes)    45.7 million
       csx_citegraph.citegraph     citation graph (edges)    112.5 million

     * Data collected at the beginning of 2016.
     (A query sketch follows below.)
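
A sketch of reading these tables from Python with PyMySQL. The table names come from the slide; the host, credentials, and column names (citing/cited) are placeholders, since the actual schema is not shown here.

```python
# Querying the production tables; column names on the edge table are
# hypothetical (citing, cited cluster ids), credentials are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="reader",
                       password="...", database="citeseerx")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM citeseerx.papers")
        print("papers:", cur.fetchone()[0])   # ~6.8 million per the slide
        cur.execute("""
            SELECT cited, COUNT(*) AS indegree
            FROM csx_citegraph.citegraph
            GROUP BY cited
            ORDER BY indegree DESC
            LIMIT 10
        """)
        for cluster_id, indegree in cur.fetchall():
            print(cluster_id, indegree)       # ten most-cited clusters
finally:
    conn.close()
```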

  19. What Does the Citation Graph Look Like?
     [Plots: in-degree and out-degree distributions of the CiteSeerX citation graph, made with SNAP; fitted slopes: in-degree −2.37, out-degree −0.22 and −3.20. Data collected at the beginning of 2016.]
     • Suitable for large-scale graph analysis (a slope-fitting sketch follows below)
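
Slopes like these are typically obtained by fitting a line to the degree histogram on log-log axes. A numpy-only sketch of that fit, run here on synthetic data since the real degree list is not part of the slides:

```python
# Power-law slope fit: least-squares line through the degree histogram
# in log-log space.
import numpy as np

def powerlaw_slope(degrees) -> float:
    vals, counts = np.unique(np.asarray(degrees), return_counts=True)
    mask = vals > 0                       # log10 undefined at degree 0
    x, y = np.log10(vals[mask]), np.log10(counts[mask])
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Synthetic heavy-tailed sample standing in for a real degree sequence.
rng = np.random.default_rng(0)
synthetic = np.round(rng.pareto(1.5, 100_000) + 1).astype(int)
print(powerlaw_slope(synthetic))          # negative, roughly -2.5
```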

  20. Production Repository
     • 7 million academic documents (beginning of 2016)
     • 9TB
     • Per document:
       • PDF
       • XML (metadata)
         • body text
         • reference text
         • full text
       • version metadata files
     • Classification accuracy: academic documents 92.1%
     [Pie chart: paper 83.0%, others 7.5%, report 4.5%, thesis 2.6%, slides 0.8%, book 0.7%, abstract 0.3%, non-en 0.3%, poster 0.2%, resume 0%]

  21. Production Repository: False Negatives
     • Documents mis-classified as non-academic
     [Pie chart: others 70.7%, paper 12.3%, slides 5.7%, report 0.7%, resume 0.7%, thesis 0.3%, abstract 0.3%, non-en 0.3%, poster 0%, book 0%; academic documents 28.3%]
     • Improving classification accuracy:
       • Classifier based on machine learning and structural features (Caragea et al. 2014 WSC; Caragea et al. 2016 IAAI); a toy sketch follows below
       • Accuracy > 90%
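
For flavor, a toy version of a structure-based filter: a few hand-rolled features computed from extracted text, feeding a random forest. The feature set and tiny training data below are invented for illustration; the cited Caragea et al. papers define the actual features.

```python
# Toy structure-based academic/non-academic filter. Features and
# training data are illustrative assumptions, not the published model.
from sklearn.ensemble import RandomForestClassifier

def features(text: str) -> list:
    lower = text.lower()
    return [
        len(text),                                            # length
        int("references" in lower),                           # reference-section cue
        int("abstract" in lower),                             # abstract cue
        sum(c.isdigit() for c in text) / max(len(text), 1),   # digit density
    ]

train_texts = [
    "Abstract We study ... 1 Introduction ... References [1] J. Doe ...",
    "Limited time offer! Sign up now and save 50% on your subscription.",
]
train_labels = [1, 0]  # 1 = academic, 0 = non-academic

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit([features(t) for t in train_texts], train_labels)

print(clf.predict([features("Abstract ... References ...")]))  # likely [1]
```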
