Evaluating utility of subject headings in a data repository: A - PowerPoint PPT Presentation

Evaluating utility of subject headings in a data repository: A preliminary finding from a data search log and record classification
Presented by: Mingfang Wu, Australian Research Data Commons (mingfang.wu@ardc.edu.au)
Contributors:
- Rowan Brownlee, Australian Research Data Commons
- Ying-Hsang Liu, University of Southern Denmark
- Jenny Xiuzhen Zhang, RMIT University, Australia
NKOS, 10 Sept. 2020

Outline
- Background on the studied data catalogue: Research Data Australia
- Log analysis: the usage of subject headings
- Experiments on data record classification
- Future work

Research Data Australia - A National Data Catalogue
- 144K+ metadata records of datasets
- Schema: The Registry Interchange Format - Collections and Services (RIF-CS, ISO 2146:2010)
- 60K+ research grants
- 99 contributors

Types of subject vocabularies
- Anzsrc-for: The Australian and New Zealand Standard Research Classification (ANZSRC, fields of research)
- Global Change Master Directory (GCMD) keywords
- Australian Pictorial Thesaurus (apt)
- Thesaurus of Psychological Index Terms (psychit)
- Library of Congress Subject Headings (lcsh)

Anzsrc-for: The Australian and New Zealand Standard Research Classification - Fields of Research
- ANZSRC ensures that R&D statistics collected are useful to governments, educational institutions, international organisations, scientific, professional or business organisations, business enterprises, community groups and private individuals in Australia and New Zealand.
- ANZSRC-FoR includes major fields and related sub-fields of research and emerging areas of study investigated by businesses, universities, tertiary institutions, national research institutions and other organisations.

Anzsrc-for: The Australian and New Zealand Standard Research Classification - Fields of Research
- 1417 terms in three layers:
  - 22 two-digit codes
  - 157 four-digit codes
  - 1238 six-digit codes
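The three layers nest by prefix, so a six-digit code carries its four-digit and two-digit parents. The sketch below illustrates that reading and checks that the layer counts on the slide add up. The function name, the dictionary keys and the example code value are illustrative assumptions, not part of the presentation.

```python
# Minimal sketch: read the parent codes off a six-digit anzsrc-for code.
# `anzsrc_levels` and the key names are hypothetical helpers for illustration.

def anzsrc_levels(code: str) -> dict:
    """Return the two-, four- and six-digit layers of a six-digit code."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit code, got {code!r}")
    return {"two_digit": code[:2], "four_digit": code[:4], "six_digit": code}

# Term counts per layer, as given on the slide.
LAYER_COUNTS = {"two_digit": 22, "four_digit": 157, "six_digit": 1238}
total_terms = sum(LAYER_COUNTS.values())

# "060102" is an illustrative value chosen to sit under 06 and 0601.
levels = anzsrc_levels("060102")
print(levels, total_terms)
```

Keeping codes as strings rather than integers preserves leading zeros such as the one in "04".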

[Figure: Number of records per anzsrc-for two-digit code. Labelled categories: 04: Earth Sciences; 06: Biological Sciences; 21: History and Archaeology]

Search interface
- All text strings (including subject headings) are indexed.

[Screenshot callouts: 1. Advanced search; Subject headings; 2. Facet filter]

[Screenshot: Record view. Callout: 3. Facet search (vocabulary + keyword)]

Log analysis: the usage of subject headings
- Transaction log: one year (2019) of activities recorded from the RDA catalogue
- About 2 million entries/activities, 63% from Australia
- About 496,739 sessions (a session being activity from the same IP address within a 30-minute duration)
- 37,056 sessions have at least one search event (keyword search, advanced search, subject (facet) filter, subject search)
- 4668 (12.6%) of search sessions involved filters/searches with anzsrc-for subjects; only 45 (0.1%) involved gcmd subjects
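One way to group log entries into sessions and reproduce the reported shares is sketched below. The slide only says sessions span 30 minutes from the same IP address. Treating that as a 30-minute inactivity gap is an assumption here, as are the function names and the toy log entries.

```python
from datetime import datetime, timedelta

# Assumption: a new session starts when the same IP has been idle for more
# than 30 minutes. The slide does not say whether a gap or a fixed window is used.
SESSION_GAP = timedelta(minutes=30)

def count_sessions(entries):
    """Count sessions in an iterable of (ip, timestamp) log entries."""
    last_seen = {}
    sessions = 0
    for ip, ts in sorted(entries, key=lambda e: e[1]):
        prev = last_seen.get(ip)
        if prev is None or ts - prev > SESSION_GAP:
            sessions += 1
        last_seen[ip] = ts
    return sessions

def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Invented toy entries, not taken from the RDA log.
toy_log = [
    ("10.0.0.1", datetime(2019, 1, 1, 9, 0)),
    ("10.0.0.1", datetime(2019, 1, 1, 9, 20)),
    ("10.0.0.1", datetime(2019, 1, 1, 10, 0)),
    ("10.0.0.2", datetime(2019, 1, 1, 9, 5)),
]
n_sessions = count_sessions(toy_log)

# Shares of the 37,056 search sessions, using the counts from the slide.
anzsrc_share = share(4668, 37056)
gcmd_share = share(45, 37056)
print(n_sessions, anzsrc_share, gcmd_share)
```

Note that sessions keyed on IP address alone can merge different users behind a shared address, which may matter when interpreting the session counts.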

[Figure: Subject usages per anzsrc-for two-digit code]

[Figure: Subject distribution among clicks and the collection]

Log analysis: the usage of subject headings
- Users' application of subject headings is less biased than the content, which is skewed toward a few subject headings.
- However, this log shows low usage of subject headings.
- Exploring causes:
  - Further log analysis, e.g. correlation between subject usage and
    - query types
    - domain knowledge
    - search quality
  - Interface design
  - At the record level: only half of the indexed records have anzsrc-for codes

Machine learning for record classification
- Assign anzsrc-for codes to unlabelled records automatically
  - Aim to improve search experience for both human and machine
  - Understand domain coverage of the collection
- Train models; three components are essential for the training:
  - Labels: anzsrc-for code
  - Classifier: four supervised machine learning methods
    - multinomial logistic regression (MLR), multinomial naive Bayes (MNB), K Nearest Neighbors (KNN), Support Vector Machine (SVM)
  - Data: (~78k) records with anzsrc-for code
    - Split into two sets: training set, test set
- Apply model(s)/best prediction to unlabelled records
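The train, test and apply workflow above can be sketched with one of the four methods, multinomial naive Bayes, written in plain Python. This is a toy illustration and not the code used in the study. The whitespace tokenizer, the add-one smoothing, the every-third-record split and the invented record texts are all assumptions.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real feature extraction.
    return text.lower().split()

class MultinomialNB:
    """Bag-of-words multinomial naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # class prior
            for w in tokenize(text):
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy records labelled with two-digit codes (04, 06).
labelled = [
    ("seismic survey of basin sediment layers", "04"),
    ("groundwater geochemistry and rock samples", "04"),
    ("sediment cores from coastal geology survey", "04"),
    ("gene expression in plant cell cultures", "06"),
    ("species abundance and habitat ecology survey", "06"),
    ("protein sequences from marine species", "06"),
]

# Deterministic split for the sketch: every third record is held out for testing.
test = labelled[2::3]
train = [r for i, r in enumerate(labelled) if i % 3 != 2]

model = MultinomialNB().fit([t for t, _ in train], [c for _, c in train])
accuracy = sum(model.predict(t) == c for t, c in test) / len(test)

# Apply the trained model to an unlabelled record.
predicted = model.predict("rock samples and sediment")
print(accuracy, predicted)
```

With only a handful of records the accuracy figure is meaningless; the point is the shape of the pipeline, which stays the same when the classifier is swapped for MLR, KNN or SVM.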

Record classification with anzsrc-for code
- Use 77918 records that have an anzsrc-for code for training models
- Step by step: first the top two digits, then move down to four and six digits
- Four models: multinomial logistic regression (MLR), multinomial naive Bayes (MNB), K Nearest Neighbors (KNN), Support Vector Machine (SVM)
Acknowledgement: Adapted the code from Miguel Fernández Zafra
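The step-by-step approach can be read as top-down classification: one model picks the two-digit code, then a model trained only on records under that code picks the four-digit code. The sketch below shows that structure under stated assumptions. `OverlapClassifier` is a deliberately simple stand-in and not one of the four models in the study, and the record texts are invented.

```python
from collections import Counter, defaultdict

class OverlapClassifier:
    """Stand-in model: picks the label whose training words overlap most."""

    def fit(self, texts, labels):
        self.words = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.words[label].update(text.lower().split())
        return self

    def predict(self, text):
        tokens = text.lower().split()
        # Ties go to the smallest label because the candidates are sorted.
        return max(sorted(self.words),
                   key=lambda lab: sum(self.words[lab][t] for t in tokens))

def train_top_down(records):
    """Train one two-digit model plus one four-digit model per two-digit code.

    records: list of (text, four_digit_code) pairs.
    """
    top = OverlapClassifier().fit([t for t, _ in records],
                                  [c[:2] for _, c in records])
    by_parent = defaultdict(list)
    for text, code in records:
        by_parent[code[:2]].append((text, code))
    children = {
        parent: OverlapClassifier().fit([t for t, _ in subset],
                                        [c for _, c in subset])
        for parent, subset in by_parent.items()
    }
    return top, children

def predict_top_down(top, children, text):
    """Predict the two-digit code first, then the four-digit code beneath it."""
    parent = top.predict(text)
    return parent, children[parent].predict(text)

# Invented toy records; the texts are not taken from the catalogue.
records = [
    ("enzyme activity in cell membranes", "0601"),
    ("protein folding in cell cultures", "0601"),
    ("gene mutation and inherited traits", "0604"),
    ("genome sequencing of inherited disorders", "0604"),
    ("galaxy spectra from telescope survey", "0201"),
    ("stellar orbits and telescope imaging", "0201"),
]
top, children = train_top_down(records)
bio = predict_top_down(top, children, "inherited gene variants")
phys = predict_top_down(top, children, "telescope imaging of galaxy")
print(bio, phys)
```

A consequence of this design is that an error at the two-digit step cannot be recovered further down, so the top-level accuracy bounds the accuracy of the deeper levels.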

[Figure: Performance per category. Most correlated unigrams shown for 04: Earth Science and 15: Commerce, Management, Tourism and Services]

Examples of classification within a two-digit code
- Method: MLR
- 06: Biological Sciences (41505 records)
  - 17268 records (out of 41505) have both 0601 and 0604 labels
- 02: Physical Sciences (3533 records)

Discussion and future work
- User behaviour:
  - Evidence that subject headings are used
  - Why and why not
  - Low usage of subject headings from this log collection
  - Is this unique to this data catalogue and interface? Log analysis + survey and interview
- Collection characteristics:
  - A large proportion of records in the catalogue lack a “standard” vocabulary for the subject headings (a known issue)
  - Those with subject headings are biased toward a few categories
  - Encourage underrepresented subject areas to publish and share data
- Record classification works for some categories
  - Explore correlation, improvement

Thanks!
