
Author / Researcher Databases: Benefits and Challenges (CESSE Annual Meeting, July 18, 2013)



  1. Author / Researcher Databases – Benefits and Challenges
     CESSE Annual Meeting, July 18, 2013
     www.Quatrro.com (Quatrro Confidential)

  2. Trends in STM Research Publishing
     • Exponential growth of scholarly output
     • Evolution of social networks and topical communities
     • Authors seeking more visibility and recognition for their contributions
     • Evolving user expectations of online content (functional efficiency, accuracy)
     • Increasing emphasis on data mining, analysis, and integration
     • Governments, institutions, and funding agencies evaluating their “investments” (faculty, departments, grants, collaborations) for productivity and ROI
     • Increased interest in the “who” of STM research: the producers of the research, not just the research itself

  3. Benefits of Clean, Aggregated Author Data
     • More efficient, enhanced editorial workflow (peer review)
       – Simpler, faster, higher-quality review process (identify “the best” reviewers)
     • Improved online performance and search results
       – Enhanced discovery and more accurate retrieval of author information and content
     • More robust and accurate bibliometric analyses
       – Research productivity of institutions, departments, individuals
       – Indicators such as citations, downloads, articles published, patents
       – Supports decisions such as funding, promotion, and reappointment
       – Better assessment of the impact of money invested in research
     • Increased exposure to and support of the author
       – Visibility, tracking, collaborating
     • Support for the broader community
       – Analysis, networking, productivity, efficiency

  4. ResearcherID – Thomson Reuters

  5. ResearcherID – Thomson Reuters

  6. Elsevier Scopus ID and Author Profile: Author Profile Page

  7. ORCID
     • Its prime aim is to improve the overall research ecosystem by creating unique identifiers for researchers and scholars that link to other records such as publications, grants, and patents.

  8. Society/Association Initiatives

  9. ACM Authorizer

  10. IEEE Xplore Author Search
      • Author profile user interface expected before the end of the year.
      • Authors will be asked to QC the data.

  11. AIP Publishing
      • One of the world’s largest physical science publishers
      • Overview:
        – 5.5 million potential author names
        – 6,000 authors with surname “Wang”
        – 800,000 articles back to the early 20th century
        – Subject areas and keywords
      • Outcome:
        – 980,000 academic authors
        – 33,000 institutions
        – Database of publishing physicists, complete with a record of affiliations, areas of expertise, papers published, and co-authors
      • Next step:
        – Gather feedback from users and explore additional refinements

  12. Why Create an Author Database?
      • Support for authors and researchers
        – Create individual author profiles and provide new value-added services.
        – Enhance the author experience with your publications (service).
      • Support for the specialty / domain which the society serves
        – Having an accurate author record of your publications is important.
        – Enhance interconnectivity and networking of a specific publishing community.
      • Societies also need and want to respond to market needs, trends, and expectations.
      • Author data is important, valuable information that societies want to own, maintain, and develop proactively.
        – Complementary to similar, broader initiatives (ORCID, etc.)
      • Societies believe it is a service their members and community want from them.
      • ACM: “…emphasizing its continuing commitment to the interests of its authors and to the computing community.”

  13. The Bigger Association/Society Picture
      (Diagram labels: Member, Committee Member, Editor, Reviewer, Author, Meeting Attendee, Subscriber, Marketer, Donor)

  14. Practical Considerations

  15. The Grunt Work
      • Extracting, cleansing, and disambiguating the author data is an arduous but essential process: garbage in, garbage out.
        – Automated tools using an algorithm and scoring mechanism can be used (for example, to judge whether the records for John Smith and J L Smith are likely to be the same person).
        – Fully automated solutions are prone to problems (data glitches and missing information result in mapping errors).
        – Expert human intervention is required to achieve a desirable level of quality:
          - at the front end, to analyze the data and establish the rule set for the automation;
          - in the processing phase, to ensure data is validated and standardized;
          - during disambiguation, for “hands on analysis and processing” when necessary.
      • Find a partner with sophisticated data cleansing and disambiguation capabilities and experience to help with analysis, strategy, and execution.
      • Once completed, profiles including papers authored, affiliations, and other information can be created in a highly automated fashion, using existing bibliographic metadata from the publisher and in the “public domain” (e.g. CrossRef).
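
The "algorithm and scoring mechanism" mentioned on the slide above is not specified in the deck. As a minimal sketch of the idea only, the following Python compares two names written in Western order and returns a compatibility score. The function names and the weights (0.5, 0.3, 0.15) are illustrative assumptions, not the presenter's method.

```python
def name_tokens(full_name):
    """Split 'J L Smith' into (given_tokens, surname), assuming Western order."""
    parts = full_name.replace(".", " ").split()
    return [p.lower() for p in parts[:-1]], parts[-1].lower()


def name_match_score(a, b):
    """Return a score in [0, 1]; 0.0 means the names conflict."""
    given_a, surname_a = name_tokens(a)
    given_b, surname_b = name_tokens(b)
    if surname_a != surname_b:
        return 0.0
    score = 0.5  # hypothetical weight: surnames agree
    for tok_a, tok_b in zip(given_a, given_b):
        if tok_a == tok_b and len(tok_a) > 1:
            score += 0.3   # full given names agree
        elif tok_a[0] == tok_b[0] and (len(tok_a) == 1 or len(tok_b) == 1):
            score += 0.15  # an initial is compatible with the other token
        else:
            return 0.0     # two different full given names
    return min(score, 1.0)
```

With these assumed weights, "John Smith" against "J L Smith" gets an intermediate score (a candidate for human review), while "John Smith" against "Jane Smith" scores zero.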

  16. Sourcing and Extracting Author Data
      • Multiple input formats: PDF, TIFF, XML, and HTML (OCR needed?)
      • Inconsistent representation of author data in documents
      • Author data represented in unstructured format
      (Figure labels: Name, Affiliation)

  17. Issues with Names
      • Same authors with multiple name variants
        – Journals use different naming styles:

          First Name  Middle Name  Last Name
          T                        Scullion
          Tom                      Scullion
          Thomas      Hyun         Scullion

      • Name changes due to marriage, e.g. if Adela LANDOVÁ married Jakub ŠTYCHKOV, she may be known as Adela ŠTYCHKOVÁ or Adela LANDOVÁ-ŠTYCHKOVÁ.
      • International naming conventions
        – Eastern order: family name (surname), then forename (given name)
        – Western order: forename (given name), then family name (surname)
        – Surname prefixes: Abdel, Abdul, Abu, Af, Akhu, Al, Ben, De, Della, Des, Du, El, Ibn, La, Le, On, Op
        – Multiple family names: María-Jose Carreño Quiñones
        – Brazilians may have three or four family names.
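
To make the ordering and prefix issues above concrete, here is a small Python sketch that splits a name into given and family parts and derives a coarse grouping key. `split_name`, `blocking_key`, and the `eastern_order` flag are hypothetical names introduced for illustration, and the prefix set is only a subset of the list on the slide. It does not handle multiple family names such as Carreño Quiñones, which would still need rules or manual review.

```python
# Subset of the surname prefixes listed on the slide (lowercased).
SURNAME_PREFIXES = {"abdel", "abdul", "abu", "al", "ben", "de", "della",
                    "des", "du", "el", "ibn", "la", "le"}


def split_name(full_name, eastern_order=False):
    """Return (given, family) for a whitespace-separated name."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return "", full_name.strip()
    if eastern_order:
        # Family name comes first, the rest is the given name.
        return " ".join(tokens[1:]), tokens[0]
    # Western order: last token, plus any prefixes directly before it.
    start = len(tokens) - 1
    while start > 1 and tokens[start - 1].lower() in SURNAME_PREFIXES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


def blocking_key(full_name, eastern_order=False):
    """Coarse key (family name, first initial) for grouping candidate variants."""
    given, family = split_name(full_name, eastern_order)
    return family.lower(), given[:1].lower()
```

Under this sketch, "T Scullion", "Tom Scullion", and "Thomas Hyun Scullion" all share one key, so they would be compared with each other in a later disambiguation step rather than treated as unrelated.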

  18. Issues with Institutional/Affiliation Data
      • Lack of standardization in affiliation names
        – University of California at Davis
        – University of California Davis
        – University of California at Davis School of Medicine
        – University of California, Davis
      • Authors migrating from one affiliation to another

        First Name   Last Name  Department                            Organization                    E-mail
        Abdurrahman  Sahin      Department of Civil Engineering       Karadeniz Technical University  abdurrahmansahin@hotmail.com
        Abdurrahman  Sahin      Department of Earthquake Engineering  Bogazici University             abdurrahman.sahin@boun.edu.tr

      • Data represented in multiple languages
        – Institut für Klinische Pharmakologie und Toxikologie, Charité Campus Benjamin Franklin, Garystr. 5, 14195 Berlin
        – Institut für Arbeitsphysiologie an der Universität Dortmund
        – Institut für Theoretische Physik der Universität Heidelberg
        – Laboratoire d’Electrochimie et des Procédés Membranaires
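
The affiliation variants listed above differ mostly in punctuation and connecting words. A minimal Python sketch of one way to collapse such trivial variants is shown below; `affiliation_key` and the stopword list are assumptions for illustration. It will not resolve translations, abbreviations, or department-level differences.

```python
import re

# Hypothetical list of connecting words to ignore when comparing names.
STOPWORDS = {"at", "of", "the"}


def affiliation_key(name):
    """Lowercase, drop punctuation and stopwords so trivial variants collide."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in STOPWORDS)
```

Three of the four University of California examples map to the same key, while the "School of Medicine" variant yields a longer key that begins with the campus key, which hints at a parent/child relationship to be confirmed by a rule or a reviewer.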

  19. Other Data-Related Issues
      • Accented characters (require conversion into Unicode)
      • Surname prefixes (van, von, de, ...)
      • Names of cities and states being the same in different countries
      • Authors represented by generic email addresses (Yahoo or Gmail) without unique organization IDs
      • Email addresses not in standard formats
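
For the accented-character issue above, a common approach once text is in Unicode is to normalize it to a single form for storage and to use an accent-folded copy only for matching. The sketch below uses Python's standard `unicodedata` module; the helper names `to_nfc` and `fold_accents` are illustrative, and whether accent folding is appropriate for a given data set is an assumption to check.

```python
import unicodedata


def to_nfc(text):
    """Compose characters so 'a' + combining acute becomes a single 'á'."""
    return unicodedata.normalize("NFC", text)


def fold_accents(text):
    """Strip combining marks to build an accent-insensitive matching key."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The folded form should be kept as a search key next to the original spelling, not as a replacement for it, so that profiles still display names such as ŠTYCHKOVÁ correctly.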

  20. Modular Approach to Data Preparation
      Data Preparation and Enhancement
      • Data Parsing
        – Source input documents
        – Identify author data and affiliation mapping
        – SME verification of identified author data with input document
      • Data Validation
        – Error identification using global validation checks across author names and affiliation
      • Data Standardization
        – Standardization of author names and affiliation data using predefined rules and knowledge repositories
      Disambiguation and Visualization
      • Disambiguation by email ID
      • Disambiguation by co-author analysis
      • Manual validation of data if required
      • Creation of unique author profiles
      • Author data clustering and visualization
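
The deck names disambiguation by email ID and by co-author analysis but does not describe the mechanics. One plausible sketch, under the assumption that two records belong together when they share an email address or share a surname plus a co-author, is to merge records with a union-find structure. The record layout (`surname`, `email`, `coauthors`) and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_records(records):
    """Group record indices that share an email, or a surname plus a co-author."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # linking key -> index of the first record that carried it
    for idx, rec in enumerate(records):
        keys = []
        if rec.get("email"):
            keys.append(("email", rec["email"].lower()))
        for coauthor in rec.get("coauthors", []):
            keys.append(("coauthor", rec["surname"].lower(), coauthor.lower()))
        for key in keys:
            if key in seen:
                union(idx, seen[key])
            else:
                seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Each resulting cluster is a candidate author profile. Because links are transitive, one bad link (for example a shared departmental mailbox) can merge two people, which is one reason the slide keeps a manual validation step.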

  21. Parsing Module
      • Author records need to be split into their constituent data fields: surname, first name, email, division, organization, city, state, country, etc.
      • The data parsing module extracts author data from input documents, parses the data, and populates the relevant fields in a predefined template.
      (Figure labels: Source document, Parsed output data)
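
As an illustration of populating a predefined template, the Python sketch below parses one comma-separated author line into named fields. The input layout ("Name, Division, Organization, City, Country, email"), the `AuthorRecord` template, and the email pattern are all assumptions made for the example; real source documents are far less regular, as the earlier slides note.

```python
import re
from dataclasses import dataclass

# Deliberately loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class AuthorRecord:
    first_name: str = ""
    surname: str = ""
    email: str = ""
    division: str = ""
    organization: str = ""
    city: str = ""
    country: str = ""


def parse_author_line(line):
    """Parse 'Name, Division, Organization, City, Country, email' into fields."""
    match = EMAIL_RE.search(line)
    email = match.group(0) if match else ""
    if match:
        # Remove the email so it is not mistaken for an affiliation part.
        line = line[:match.start()] + line[match.end():]
    parts = [p.strip() for p in line.split(",") if p.strip()]
    name, rest = (parts[0], parts[1:]) if parts else ("", [])
    first, _, last = name.rpartition(" ")
    fields = dict(zip(["division", "organization", "city", "country"], rest))
    return AuthorRecord(first_name=first, surname=last, email=email, **fields)
```

Fields that cannot be found stay empty, which gives a later validation step something explicit to flag.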

  22. Validation Module
      • Parsed data needs to be validated for accuracy. Automation based on predefined rules, built-in databases, and other knowledge repositories can help, but manual intervention is typically required to achieve a desirable level of accuracy.
      • The data validation module identifies formatting and parsing errors for human validation and rectification.
      (Figure caption: Output validation using predefined rules)
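
The predefined rules themselves are not listed in the deck. As a hedged sketch of what rule-based validation might look like, the function below applies a few hypothetical checks to a parsed record (a plain dict here) and returns the issues to route to a human reviewer. The rule set, the generic-domain list, and the email pattern are assumptions.

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
# Hypothetical list of domains that do not identify an organization.
GENERIC_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def validate_record(record):
    """Return a list of issue strings; an empty list means no rule fired."""
    issues = []
    if not record.get("surname"):
        issues.append("missing surname")
    email = record.get("email", "")
    if email and not EMAIL_RE.match(email):
        issues.append("malformed email")
    elif email and email.rsplit("@", 1)[1].lower() in GENERIC_DOMAINS:
        issues.append("generic email domain")
    if any(ch.isdigit() for ch in record.get("surname", "")):
        issues.append("digit in surname")
    return issues
```

Records with a non-empty issue list would go to the manual queue; the rest pass through automatically.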

  23. Standardization Module
      • This process isolates incorrect field names after comparing them with standard names in pre-built databases. It enables running partial or complete standardization rules, and manual validation for errors that cannot be corrected automatically.
      • The self-learning standardization module has built-in thesauri which are continuously updated based on automatic and manual corrections.
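
A thesaurus that is "continuously updated" from corrections can be pictured as a lookup table from known variants to a canonical name, with a way to record new variants once a reviewer resolves them. The class below is a minimal sketch of that idea, not the vendor's module; the `Thesaurus` class and its methods are hypothetical.

```python
class Thesaurus:
    """Maps known name variants to a canonical form and learns new ones."""

    def __init__(self, canonical_names):
        self.variants = {self._key(n): n for n in canonical_names}

    @staticmethod
    def _key(name):
        # Case-, comma-, and spacing-insensitive lookup key.
        return " ".join(name.lower().replace(",", " ").split())

    def standardize(self, name):
        """Return (canonical, True) if known, else (name, False) for review."""
        canonical = self.variants.get(self._key(name))
        return (canonical, True) if canonical else (name, False)

    def learn(self, variant, canonical):
        """Record a manual or automatic correction for future lookups."""
        self.variants[self._key(variant)] = canonical
```

A name that comes back with `False` goes to manual validation; once the reviewer picks the canonical form, calling `learn` means the same variant is corrected automatically next time.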

  24. Disambiguation Process
