CERMINE: automatic extraction of metadata and references from scientific literature
Dominika Tkaczyk, Pawel Szostek, Piotr Jan Dendek, Mateusz Fedoryszak and Lukasz Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
11th IAPR International Workshop on Document Analysis Systems, 7-10 April 2014
D. Tkaczyk et al. (ICM UW), CERMINE, DAS, 7-10 April 2014

The goal: extract the title, authors, affiliations, emails, abstract and keywords.

The goal: extract from each reference the author, title, source, volume, pages, year and URL.

The motivation
- There are documents without metadata.
- Metadata information may be incomplete or incorrect.

Requirements
The metadata extraction system should be:
- comprehensive,
- automatic,
- modular,
- open and widely available,
- easily applicable,
- flexible and able to adapt to new layouts,
- well tested.

The process
[diagram: PDF -> basic structure extraction -> metadata extraction and references extraction -> JATS XML, with metadata in <front> and parsed references in <back>]

Basic structure extraction
- Character extraction: iText library
- Page segmentation: Docstrum
- Reading order resolving: bottom-up, heuristic-based
- Initial zone classification: SVM (metadata, references, body and other)

The output
TrueViz XML format:
- hierarchical structure containing pages, zones, lines, words and characters
- all elements have bounding boxes
- reading order is given
- zones have labels
Excerpt (truncated on the slide):
    <Page>
      <PageID Value="0"/>
      <Zone>
        <ZoneID Value="0"/>
        <ZoneCorners>
          <Vertex x="55.320" y="34.295"/>
          <Vertex x="235.704" y="58.295"/>
        </ZoneCorners>
        <ZoneNext Value="1"/>
        <Category Value="TITLE"/>
        <Line>
          <Word>
            <Character>
            ...
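
To make the TrueViz structure concrete, here is a minimal Python sketch that reads zone labels, reading-order links and bounding boxes from a file shaped like the excerpt on this slide. The sample document is an assumption that only completes the elements visible on the slide; a real TrueViz file contains more (lines, words, characters), and the function name read_zones is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the slide excerpt only; real TrueViz files carry more elements.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""

def read_zones(xml_text):
    """Return one dict per zone: id, label, next zone id and bounding box."""
    page = ET.fromstring(xml_text.strip())
    zones = []
    for zone in page.iter("Zone"):
        # Only the zone's own corners, not those of nested lines or words.
        corners = zone.find("ZoneCorners")
        xs = [float(v.get("x")) for v in corners.iter("Vertex")]
        ys = [float(v.get("y")) for v in corners.iter("Vertex")]
        zones.append({
            "id": zone.find("ZoneID").get("Value"),
            "label": zone.find("Category").get("Value"),
            "next": zone.find("ZoneNext").get("Value"),
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return zones

zones = read_zones(SAMPLE)
```

Following the "next" value from zone to zone would recover the reading order that the format stores.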

The process
[diagram repeated, metadata extraction stage]

Metadata extraction
- Metadata zone classification: SVM (abstract, bib info, type, title, affiliation, author, keywords, correspondence, dates and editor)
- Metadata extraction: simple, rule-based
Output excerpt (truncated on the slide):
    <XML>
    <title>System ...
    <author>M. Kn...
    <author>J. Illsl...
    <affiliation>Uni...
    <keywords>arti...
    <journal>Journ...
    <volume>19<v...
    <date>14.06.1...

Zone classification
- classifiers are based on the LibSVM library
- a zone is represented by 78 features: geometrical, lexical, sequential, formatting and heuristic
- the best SVM parameters were found by a grid search over a 3-dimensional space of kernel function types, C (penalty parameter) and γ coefficients
- at every grid point a 10-fold cross-validation was performed
- we chose the parameters that gave the best mean accuracy
- the initial classifier was trained on 964 documents with 155,144 zones in total
- the metadata classifier was trained on 1,934 documents and 45,035 metadata zones in total
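
The parameter search described above can be sketched as a small harness: loop over every (kernel, C, γ) combination, run k-fold cross-validation at each grid point, and keep the combination with the best mean accuracy. This is only an illustration under stated assumptions. The grid values are made up, and toy_score is a hypothetical stand-in for "train LibSVM on the training folds and return accuracy on the held-out fold"; it does not train anything.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(samples, labels, grid, train_and_score, k=10):
    """Return (best_params, best_mean_accuracy) over all grid points."""
    folds = k_fold_indices(len(samples), k)
    best_params, best_acc = None, -1.0
    for kernel, c, gamma in product(grid["kernel"], grid["C"], grid["gamma"]):
        params = {"kernel": kernel, "C": c, "gamma": gamma}
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(samples)) if i not in held]
            scores.append(train_and_score(params, samples, labels, train, held_out))
        mean_acc = sum(scores) / len(scores)
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc

def toy_score(params, samples, labels, train_idx, test_idx):
    # Hypothetical stand-in for training a classifier and measuring accuracy.
    bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return 0.8 + bonus - 0.01 * abs(params["C"] - 10) / 10 - abs(params["gamma"] - 0.1)

# Illustrative grid; the values actually searched are not given on the slide.
GRID = {"kernel": ["linear", "rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_acc = grid_search(list(range(50)), [0] * 50, GRID, toy_score)
```

In a real setting the callback would wrap the classifier library, and the folds would usually be shuffled or stratified rather than contiguous.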

The process
[diagram repeated, references extraction stage]

Parsed reference extraction
- Reference strings extraction: K-means clustering
- Reference parsing: CRF
Output excerpt (truncated on the slide):
    <XML>
    <ref>
    <author>M.K. ...
    <title>System...
    <journal>Journ...
    ...
    </ref>
    <ref>...

Reference strings extraction
- clustering text lines into two sets: first lines and the rest
- unsupervised K-means algorithm with Euclidean distance
- 5 features (based on length, indentation, space between lines and the text)
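
As an illustration of the clustering step, the sketch below runs K-means with k = 2 and Euclidean distance on toy line features. It is a simplification under stated assumptions: only two features are used (relative line length and indentation) instead of the five mentioned above, the feature values are invented, and the deterministic initialisation is a choice made here for reproducibility, not something taken from the slides.

```python
import math

def kmeans_two(points, iterations=50):
    """Lloyd's algorithm with k=2 and Euclidean distance.

    Initial centroids: the first point and the point farthest from it.
    """
    first = points[0]
    far = max(points, key=lambda p: math.dist(p, first))
    centroids = [list(first), list(far)]
    assignment = [0] * len(points)
    for step in range(iterations):
        # Assign every point to its nearest centroid.
        updated = [min((0, 1), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if step > 0 and updated == assignment:
            break  # converged
        assignment = updated
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy lines as (relative length, indentation); values are invented.
lines = [
    (1.00, 0.0),  # first line of a reference
    (0.60, 0.3),  # continuation line
    (0.98, 0.0),
    (0.40, 0.3),
    (1.00, 0.0),
    (0.70, 0.3),
]
clusters = kmeans_two(lines)
```

K-means itself does not say which cluster holds the first lines; in this toy data it is the cluster containing the very first line of the block.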

Reference parsing
Example reference string:
    [8] Y. Wang, I.T. Phillips and R.M. Haralick, Document zone content classification and its performance evaluation, Pattern Recognition 39 (1) (2006), pp. 57–73.
- Conditional Random Fields token classifier
- based on GRMM and MALLET packages
- 42 constant features + the most popular words + features of two preceding and two following tokens
- the classifier was trained on 1000 citations from Cora-ref + PubMed
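
The idea of giving each token the features of its two preceding and two following tokens can be sketched as follows. The four per-token features and their names are hypothetical and only stand in for the 42 features mentioned above; the sketch builds the feature rows a sequence classifier such as a CRF would consume, and does not train one.

```python
import re

def token_features(token):
    """A few illustrative per-token features (names are hypothetical)."""
    return {
        "is_number": token.isdigit(),
        "is_capitalized": token[:1].isupper(),
        "is_year": bool(re.fullmatch(r"(19|20)\d{2}", token)),
        "has_period": "." in token,
    }

def windowed_features(tokens, window=2):
    """Attach each token's own features plus those of its neighbours."""
    own = [token_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                # Prefix the feature name with the relative position, e.g. "-2:is_year".
                for name, value in own[j].items():
                    row[f"{offset:+d}:{name}"] = value
        rows.append(row)
    return rows

tokens = "Pattern Recognition 39 ( 1 ) ( 2006 )".split()
rows = windowed_features(tokens)
```

Tokens near the start or end of the string simply have fewer context features, since positions outside the string are skipped.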

GROTOAP2 dataset
[diagram: NLM XML and PDF pairs from PubMed Central, CERMINE tools, zone text matching]
- GROund Truth for Open Access Publications
- built automatically from PubMed Central Open Access Subset
- ~60k ground truth files in TrueViz format with corresponding PDF files

Results

                            avg. precision   avg. recall
initial zone classifier         91.74%          87.31%
metadata zone classifier        92.49%          93.83%
reference parsing               90.18%          89.51%

                 precision   recall
journal title     68.68%     49.23%
volume            97.57%     78.57%
issue             52.50%     56.64%
pages             51.37%     34.71%
year              98.79%     89.18%
DOI               93.60%     57.46%
ISSN              44.29%      3.01%

                 avg. adjustment
article title        95.03%
abstract             91.43%

                 avg. precision   avg. recall
authors             87.19%          82.07%
affiliations        70.13%          59.44%
keywords            61.11%          68.37%

Future work
- a new extraction path for extracting structured full text
- the evaluation of the entire references extraction path
- comparing the results to other similar systems

Links
- CERMINE web service: http://cermine.ceon.pl
- CERMINE source code: https://github.com/CeON/CERMINE
- GROTOAP2: http://cermine.ceon.pl/grotoap2/

Thank you! Questions?
Dominika Tkaczyk, d.tkaczyk@icm.edu.pl
© 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license. The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/
