Publications, Identity, and Disambiguation
NIH Workshop on Identifiers and Disambiguation in Scholarly Work
Denise Beaubien Bennett
Gainesville, FL
March 18, 2010

"Until George W. Bush became President, the first President Bush never used his middle initials," George H.W. Bush's chief of staff, Jean Becker, says. "But once his son became President, the elder Bush began to realize that it was necessary, to help identify which President Bush was being referred to."
• How confident are we that all mentions of plain “George Bush” refer to Senior?
• Remember that George H.W. Bush had several roles: CIA Director, Ambassador to China, Vice President

Automated disambiguation
• Scopus
• Web of Science
• CiteSeer
• DBLP author search engine
  – query interpreted as set of prefixes (implicit truncation) of name parts
• Author-ity
• improving recall and precision over time!

Scopus – snapshot from 2007
• 2007 – one solid cluster, 6 ambiguous outliers
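
The DBLP bullet on the automated disambiguation slide describes a query being read as a set of prefixes of name parts. A minimal sketch of that idea follows. It is not DBLP's actual implementation: the function names, the normalization rules, and the sample names are all illustrative assumptions, and the sketch does not require each query token to match a different name part.

```python
def normalize(text):
    """Lowercase and split on whitespace, treating commas and periods as spaces."""
    return text.lower().replace(",", " ").replace(".", " ").split()


def matches_prefix_query(query, name):
    """True if every query token is a prefix of at least one part of the name.

    Simplification (assumption): two tokens may match the same name part.
    """
    parts = normalize(name)
    return all(any(part.startswith(tok) for part in parts)
               for tok in normalize(query))


def search(query, names):
    """Return the names that match the prefix query, in their original order."""
    return [n for n in names if matches_prefix_query(query, n)]


# Illustrative sample data, not taken from any real index.
SAMPLE_NAMES = ["Zhang, Lei", "Zhang, Li", "Bennett, Denise Beaubien"]

if __name__ == "__main__":
    print(search("zhang l", SAMPLE_NAMES))
    print(search("d bennett", SAMPLE_NAMES))
```

Implicit truncation favors recall: the short query "zhang l" returns both Zhang entries, which is exactly where disambiguation then has to take over.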

Scopus in 2010: improving
• 2010 – one solid cluster, 3 ambiguous outlier names

Web of Science
• Their example shows incompleteness of disambiguation; continue using all variations with and without apostrophe

WoS Distinct Author Sets – clustering is improving

Web of Science DIY disambiguation

CiteSeer – disambiguated (but not perfect)
• unclustered items are mostly typos
• alternate name resolves to preferred name

Author-ity clusters

Author-ity pairwise ranking

Author-ity ranking results

Author-ity ranking – the bottom
• super-high probability through 130
• less than 50% with title far off topic

Voluntary Profiles
Author (or proxy) created and maintained
• Compliance challenges with ingestion and updating
• Usually include numbers
• COS Expertise – 480,000 profiles
• ResearcherID (to be used by ORCID)
• RePEc Author Service in IDEAS

COS Community of Science
• 18 months ago
• useful tools

ResearcherID
• author-controlled profile

ResearcherID – features
• value added from WoS – only works on cites in WoS

ResearcherID dups
• keywords helpful when present

RePEc Author Service
• Relies on authors to maintain their profiles and identify articles as written by them
• 23,000+ registered authors and 7,000+ registered non-authors

• from 2007: dups & funnies
• they track lost and deceased authors
• disambiguated index is much cleaner in 2010

In development
• Cooperative Identities Hub
• ISNI
• ORCID

Manual checking
• no guarantee of perfection
• scalability
• MathSciNet
• Mathematics Genealogy Project
• ACM

MathSciNet
• clusters all papers but preserves name on piece

However…
• Even the small, discipline-specific database of MathSciNet cannot corral all the duplicate names.
  – only half of the entries disambiguated for:
    • Zhang, Lei
    • Zhang, Li
• Red herring: how many people only author one paper in their career???
  – about 46% in Medline (sec. 3.5)

Many people, same name

MGP - 30 …

ACM – discloses the weighting

ACM Digital Library – not quite yet

After we disambiguate, we can:
• Link / cluster records within the silo
  – highlighting the preferred version
• Link headings (or records) across silos
• Analyze / repackage / mashup the data

Linking within a silo
• more examples – inspiration from outside the university/research world

Linking in Community-maintained IMDB
• others born the same day or year or place
• links to people, films, etc.
• credit!

Community-maintained – MusicBrainz
• members & years

Community-maintained – MusicBrainz
• please – no “eyes”
• no “pears”
• no hyphen

Linking across silos
• VIAF – Virtual International Authority File
• Getty ULAN – Union List of Artist Names
• Names Project – UK individuals and institutions
  – for benefit of institutional and subject repositories
• BKN People – using Bibliographic Ontology (BIBO) to aggregate author silos
• rely on local silos for maintenance

VIAF – linking across files
• authority record in BNF (France) matches these other files

Getty Union List of Artist Names
• ULAN
• Used mostly by museums
• Merges multiple authority files
• Displays all options and sources
• Guides to preferred name
• name variations
• preferred among options
• relationships
• sources

Names Project (UK)

Names Project (UK)

BKN People: uses BIBO

BKN People: uses BIBO

Analyzing / repackaging the data
– discover outliers through analysis
  • what’s wrong with this picture?
– run the outliers by human checkers
– use the analyzed results to refine the disambiguation

WorldCat Identities
• more than birth/death dates
• the fun stuff

Anne O’Tate (Author-ity)
• analyze by address
• note the fractions of addresses

Anne O’Tate (Author-ity)
• analyze by topic
• neat clustering, compared to “Topics” with 324 results

IDEAS / RePEc
• analyze – author’s impact within silo

MathSciNet collaboration distance
• the Kevin Bacon of Math

How close are these authors?
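
Collaboration distance, as on the MathSciNet slide, can be read as the length of the shortest chain of coauthorships linking two authors. A minimal sketch using breadth-first search follows. It is an illustration of the general idea, not MathSciNet's code; the function names and the toy paper list are assumptions, and it presumes author names have already been disambiguated.

```python
from collections import deque


def build_coauthor_graph(papers):
    """Map each author to the set of people they have shared a paper with.

    `papers` is a list of author lists (hypothetical input format).
    """
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set()).update(
                other for other in authors if other != author
            )
    return graph


def collaboration_distance(graph, start, goal):
    """Shortest coauthorship chain length, or None if no chain exists."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        for coauthor in graph[author]:
            if coauthor == goal:
                return dist + 1
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, dist + 1))
    return None


# Toy data: A-B, B-C, C-D coauthored papers; E has only a solo paper.
TOY_PAPERS = [["A", "B"], ["B", "C"], ["C", "D"], ["E"]]
TOY_GRAPH = build_coauthor_graph(TOY_PAPERS)

if __name__ == "__main__":
    print(collaboration_distance(TOY_GRAPH, "A", "D"))
    print(collaboration_distance(TOY_GRAPH, "A", "E"))
```

The result is only as good as the underlying disambiguation: two people merged under one name create false short paths, and one person split across name variants breaks real ones.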

DBLP Vis – coauthor intensity
• see # papers with coauthor when mouse-over a year

DBLP Vis – coauthor timecolor
• see fatter boxes on graph when mouse-over a year

Features to help disambiguate
• affiliation (how many addresses/year?)
• email address
• coauthors
• keywords from source or all metadata
• dates – degree years, expected range
• web page – URL and other data
• caution – what fuzziness/distance is acceptable? differences by disciplines?

Use with care: one author, many interests
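
Several of the features listed above (email, coauthors, affiliation, keywords) can be combined into a rough pairwise score for whether two records belong to the same author. The sketch below is one possible way to do that, not the method of Author-ity or any other system named in these slides. The record fields, the weights, and the function names are all hypothetical; keywords get the smallest weight here because one author can have many interests.

```python
def jaccard(a, b):
    """Overlap of two collections as |intersection| / |union|; 0.0 if either is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical weights, chosen only for illustration; they sum to 1.
WEIGHTS = {"email": 0.4, "coauthors": 0.3, "affiliation": 0.2, "keywords": 0.1}


def exact_match(r1, r2, field):
    """1.0 if both records carry the same non-empty value for the field, else 0.0."""
    value = r1.get(field)
    return 1.0 if value and value == r2.get(field) else 0.0


def same_author_score(r1, r2):
    """Weighted evidence in [0, 1] that two records share an author."""
    return (
        WEIGHTS["email"] * exact_match(r1, r2, "email")
        + WEIGHTS["coauthors"] * jaccard(r1.get("coauthors", []), r2.get("coauthors", []))
        + WEIGHTS["affiliation"] * exact_match(r1, r2, "affiliation")
        + WEIGHTS["keywords"] * jaccard(r1.get("keywords", []), r2.get("keywords", []))
    )


# Made-up records for demonstration.
REC_A = {"email": "x@example.org", "coauthors": ["P", "Q"],
         "affiliation": "Univ 1", "keywords": ["indexing"]}
REC_B = {"email": "x@example.org", "coauthors": ["P", "Q"],
         "affiliation": "Univ 1", "keywords": ["indexing"]}
REC_C = {"email": "y@example.org", "coauthors": ["R"],
         "affiliation": "Univ 2", "keywords": ["geology"]}

if __name__ == "__main__":
    print(same_author_score(REC_A, REC_B))
    print(same_author_score(REC_A, REC_C))
```

Where to put the accept/reject threshold is the open question raised in the caution bullet; it would likely differ by discipline, and borderline pairs are candidates for human checking.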

For contemplation and discussion

Assigning numbers
• Centralized numbering system
  – governance issues, unpalatable to some
• Individual small silo numbering
  – can be highly accurate
• Record linking across files
  – easily accomplished
• Getting started – authors could include number(s) with all contact info

Trustworthiness
• Am I in control of all of my publications?
• If I’m logged in (to ResearcherID, via my university account, etc.) and I indicate “these items are mine,” should you trust my accuracy?
• Have I captured all of my items?
  – variants on my name
  – items I forgot
  – items credited without my awareness

Issues to explore
• Ingestion vs. maintenance
  – very different problems
  – author compliance needed?
• De-duplication (within and across silos)
• Management and cooperation for updating
• Scalability
• Automated vs. manual techniques
• Optimizing computational performance
• Long tail of one-hit authors (how much attention?)

Researchers, projects, products, models
• Great review (by the Author-ity folks): Smalheiser NR, Torvik VI. (2009) Author name disambiguation.

Databases and those who created or tinkered with them
• MathSciNet
• ULAN
• DBLP – Han
• CiteSeer – Giles, Han
• IMDB – Malin
• ANAC – Levy sheet music
• Medline – Torvik and Smalheiser
• D-Dupe – Getoor
• rexa.info – McCallum
• VIAF – Hickey