Machine Learning for Author disambiguation

Gilles Louppe, CERN
October 14, 2015

From publications to signatures

Publications → Signatures

Signature for Doe, John
Title: Lorem ipsum dolor sit amet, consectetur adipiscing elit ...
Author: Doe, John
Affiliation: University of Foo
Co-authors: Smith, John; Chen, Wang
Year: 2015

Author disambiguation

For each author, group together all of their signatures, and only those.

M.S.Smith.1, Z.Liang.4, S.W.Hawking.1, Z.Liang.5, ..., Z.Liang.83

No more, no less, but all and only the correct ones.

Spread of the problem

As extracted from claimed publications in INSPIRE:
• Authors have on average 2.06 name variants (synonyms), e.g., Doe, John; Doe, J.
• Unique name variants are shared on average by 1.04 authors (homonyms).

Clustering on identical surnames and identical given name initials should therefore yield very good results on average.

However, disambiguation issues are expected to grow with the rise of Asian researchers: Caucasian names (currently representative of INSPIRE authors) are almost never ambiguous, while Asian names very often are.
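The name-based grouping described on this slide can be sketched in a few lines. This is a minimal illustration, not the INSPIRE implementation: it assumes names are written "Surname, Given names" and, as a simplification, compares only the first given name initial. The function names `block_key` and `group_by_name` are hypothetical.

```python
from collections import defaultdict


def block_key(full_name):
    """Return (surname, first given-name initial) for a 'Surname, Given names' string."""
    surname, _, given = full_name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return surname.strip().lower(), initial


def group_by_name(names):
    """Group name strings that share a surname and a first given-name initial."""
    groups = defaultdict(list)
    for name in names:
        groups[block_key(name)].append(name)
    return dict(groups)


# Toy input: the two variants of "Doe, John" end up in the same group.
names = ["Doe, John", "Doe, J.", "Doe, Alice", "Smith, John"]
groups = group_by_name(names)
```

Synonyms such as "Doe, John" and "Doe, J." fall into one group, but so would two distinct authors who share a surname and initial, which is exactly the homonym case this rule cannot resolve.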

How would you fare?

[Figure: example pairs of signatures, revealed in turn as ✓ same authors and ✗ different authors.]

Learning from data

• Manual disambiguation is time-consuming and difficult, even for experienced curators.
• Couldn't we automatically find a set of rules to decide whether two signatures belong to the same author?

$$\phi(s_1, s_2) = \begin{cases} 0 & \text{if } s_1 \text{ and } s_2 \text{ belong to the same author,} \\ 1 & \text{otherwise.} \end{cases}$$

• This is a machine learning task called supervised learning.
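Supervised learning needs labelled examples. One plausible way to obtain them, sketched below under the assumption that curators have already assigned some signatures to authors, is to enumerate pairs of claimed signatures and label each pair 0 (same author) or 1 (different authors), matching the definition of ϕ. The signature ids and the helper `labelled_pairs` are hypothetical.

```python
from itertools import combinations


def labelled_pairs(claimed):
    """Yield (s1, s2, label) with label 0 for the same author and 1 otherwise.

    `claimed` maps a signature id to the author id confirmed by a curator.
    """
    for s1, s2 in combinations(sorted(claimed), 2):
        yield s1, s2, 0 if claimed[s1] == claimed[s2] else 1


# Toy input: three claimed signatures belonging to two distinct authors.
claimed = {"sig-a": "Z.Liang.4", "sig-b": "Z.Liang.4", "sig-c": "Z.Liang.5"}
pairs = list(labelled_pairs(claimed))
```

The number of pairs grows quadratically with the number of signatures, so in practice pairs would likely be sampled rather than enumerated exhaustively.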

($s_1$, $s_2$) → Feature extraction → $x$ = (name sim. = 0.7, title sim. = 0.3, ...) → Machine learning model $\phi$ → $p(s_1, s_2 \text{ have different authors} \mid x)$
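To make the last step of this pipeline concrete, here is a small sketch that maps a feature vector x to an estimate of p(different authors | x). The slide does not name the model, so a plain logistic regression fitted by gradient descent is swapped in purely for illustration. The training vectors and labels are invented toy data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Estimate p(different authors | x) for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy feature vectors (name sim., title sim.); label 1 means different authors.
X = [(0.9, 0.8), (0.8, 0.6), (0.95, 0.4), (0.3, 0.2), (0.2, 0.1), (0.4, 0.05)]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
p_similar = predict_proba(w, b, (0.9, 0.7))     # highly similar pair
p_dissimilar = predict_proba(w, b, (0.1, 0.1))  # dissimilar pair
```

Any classifier that outputs a probability could take the place of the logistic regression; only the interface (features in, probability out) matters for the rest of the workflow.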

Feature extraction

Feature                           | Combination operator
Full name                         | Cosine similarity of (2, 4)-TF-IDF
Given names                       | Cosine similarity of (2, 4)-TF-IDF
First given name                  | Jaro-Winkler distance
Second given name                 | Jaro-Winkler distance
Given name initial                | Equality
Affiliation                       | Cosine similarity of (2, 4)-TF-IDF
Co-authors                        | Cosine similarity of TF-IDF
Title                             | Cosine similarity of (2, 4)-TF-IDF
Journal                           | Cosine similarity of (2, 4)-TF-IDF
Abstract                          | Cosine similarity of TF-IDF
Keywords                          | Cosine similarity of TF-IDF
Collaborations                    | Cosine similarity of TF-IDF
References                        | Cosine similarity of TF-IDF
Subject                           | Cosine similarity of TF-IDF
Year difference                   | Absolute difference
White                             | Product of estimated probabilities
Black                             | Product of estimated probabilities
American Indian or Alaska Native  | Product of estimated probabilities
Chinese                           | Product of estimated probabilities
Japanese                          | Product of estimated probabilities
Other Asian or Pacific Islander   | Product of estimated probabilities
Others                            | Product of estimated probabilities
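As an illustration of one of these operators, the sketch below computes a cosine similarity between TF-IDF vectors of two strings. It assumes that "(2, 4)-TF-IDF" means TF-IDF over character n-grams of length 2 to 4, which the slide does not spell out. The smoothed IDF formula is one common convention, and all function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of length n_min..n_max in a lower-cased string."""
    text = text.lower()
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )


def fit_idf(corpus):
    """Compute smoothed inverse document frequencies over a list of strings."""
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(char_ngrams(text)))
    n_docs = len(corpus)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf(text, idf):
    """Map a string to a sparse TF-IDF vector; unseen n-grams are ignored."""
    return {g: tf * idf[g] for g, tf in char_ngrams(text).items() if g in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(g, 0.0) for g, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy affiliation strings: the first two are variants of the same institution.
corpus = ["University of Foo", "Univ. of Foo", "Bar Institute of Technology"]
idf = fit_idf(corpus)
vectors = [tfidf(text, idf) for text in corpus]
sim_close = cosine(vectors[0], vectors[1])
sim_far = cosine(vectors[0], vectors[2])
```

Character n-grams make the similarity tolerant of abbreviations and small spelling differences, which is presumably why they are used for short fields such as names and affiliations, while longer fields such as abstracts use ordinary TF-IDF.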

Disambiguation as a clustering problem

• Author disambiguation = clustering signatures that belong to the same author.
• Using our model ϕ, the probability that two signatures belong to different authors can be used as a (pseudo) distance metric and, e.g., plugged into a hierarchical clustering algorithm.
• The complexity of hierarchical clustering is $O(N^2)$. For $N = 10^7$ signatures, this is impractical.

Solution: pre-cluster signatures into smaller blocks, then cluster each of these blocks independently.
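The block-then-cluster idea can be sketched as follows. This is a deliberately naive average-linkage implementation, suitable only for small blocks, and not the code used in the talk. The distance function stands in for the learned model ϕ, and the threshold of 0.5, the toy signatures, and all names are assumptions.

```python
from itertools import combinations


def average_linkage(items, distance, threshold):
    """Merge clusters greedily while the closest pair is nearer than `threshold`."""
    clusters = [[item] for item in items]

    def linkage(a, b):
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def cluster_by_block(signatures, block_key, distance, threshold=0.5):
    """Split signatures into blocks, then cluster each block independently."""
    blocks = {}
    for sig in signatures:
        blocks.setdefault(block_key(sig), []).append(sig)
    result = []
    for block in blocks.values():
        result.extend(average_linkage(block, distance, threshold))
    return result


def toy_block_key(sig):
    """Block on surname and first given-name initial."""
    surname, _, given = sig[0].partition(",")
    return surname.strip().lower(), given.strip()[:1].lower()


def toy_distance(s1, s2):
    """Hypothetical stand-in for the model: close when affiliations match."""
    return 0.1 if s1[1] == s2[1] else 0.9


signatures = [
    ("Liang, Z.", "University of Foo"),
    ("Liang, Zhi", "University of Foo"),
    ("Liang, Z.", "Bar Institute"),
    ("Smith, John", "University of Foo"),
]
clusters = cluster_by_block(signatures, toy_block_key, toy_distance)
```

Because pairwise distances are only ever computed inside a block, the quadratic cost applies to each block's size rather than to the full set of signatures.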

Workflow

[Figure: workflow diagram.]

Results

              | F-measure
Baseline [1]  | 0.9409
Our model     | 0.9862

[1] Group by same surnames and same given name initials.
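The slide does not state which F-measure variant is reported. As one possibility, the sketch below computes the B-cubed precision, recall, and F-measure, a common choice for evaluating clusterings. The toy labelings are invented and the function name `b_cubed` is hypothetical.

```python
def b_cubed(true_labels, pred_labels):
    """Return B-cubed (precision, recall, F-measure) for two labelings.

    Both arguments map each signature id to a cluster id.
    """
    def members(labels):
        groups = {}
        for item, cluster in labels.items():
            groups.setdefault(cluster, set()).add(item)
        return groups

    true_groups = members(true_labels)
    pred_groups = members(pred_labels)
    precision = recall = 0.0
    for item in true_labels:
        truth = true_groups[true_labels[item]]
        pred = pred_groups[pred_labels[item]]
        overlap = len(truth & pred)  # always >= 1: the item itself
        precision += overlap / len(pred)
        recall += overlap / len(truth)
    precision /= len(true_labels)
    recall /= len(true_labels)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy case: signature "c" is wrongly grouped with "d" instead of "a" and "b".
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
predicted = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
precision, recall, f_measure = b_cubed(truth, predicted)
```

In this toy case the precision is 3/4, the recall is 2/3, and the F-measure is 12/17, about 0.706.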

References

• Implementation available at https://github.com/inveniosoftware/beard
• Gilles Louppe, Hussein Al-Natsheh, Mateusz Susik, Eamonn Maguire. Ethnicity sensitive author disambiguation using semi-supervised learning. http://arxiv.org/abs/1508.07744
