CLASSIEfier: Using Machine Learning to Paint a Picture of Social Trends
Dr. Paola Oliva-Altamirano, Innovation Lab, Our Community, May 2019
Who am I?
• A foreigner: from Honduras to the US to Australia
• From galaxies to taxonomies
Our Community - Innovation Lab
Outline:
• Introducing Our Community’s data initiatives
• Background: CLASSIE, a social dictionary
• How did we scope CLASSIEfier?
• How did CLASSIEfier evolve as a project?
• The data science for social good concept
• Results and conclusions
Our Community is a social enterprise and B Corp that provides advice, connections, training and easy-to-use tech tools for community-builders:
• Training and networking
• Grants database
• Donation platform
• Software for grants applications
From CLASSIE to CLASSIEfier
Main objective – Classification of grants
• Australia lacked a unified taxonomy to classify subjects, beneficiaries and organization types
• In 2016, OC introduced CLASSIE: the classification system for Australian social sector initiatives and entities
• CLASSIE opens the door to standard classification
CLASSIE: a social sector dictionary
• Subjects
• Populations
• Organisation type
Where is the money going, and how is the Australian social sector working?
Hierarchical Classification – e.g. Subjects
• Level 1 (17 categories): Social sciences; Sport and recreation
• Level 2 (132 categories): Anthropology; Interdisciplinary studies; Sport; Community recreation
• Level 3 (492 categories): Biological anthropology; Archeology; Ethnic studies; Asian studies; Camps; Parks; Outdoor sport; Paralympics
• Level 4 (243 categories): Indigenous studies; Mountain and rock climbing; Hiking and walking
Now we have the dictionary – How do we apply it?
Questions:
• How do we ensure that users are choosing the correct category?
• How do we classify historical data? 800,000 grant applications since 2010
CLASSIEfier is a tool that will automatically classify grants
How did we scope CLASSIEfier?
Source: “One model to rule them all” by Christoph Molnar
CLASSIEfier – Two different models
1. To give automatic suggestions to grant applicants
2. To classify historical data
Example suggestion: “Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development”
CLASSIEfier: How does it work?
How did CLASSIEfier evolve?
CLASSIEfier – The Algorithm
What do we have?
• 800,000 grant applications
• 4,000 grant applications labeled by users since CLASSIE went live
How do we generate more labels? We need at least 2,000 applications per category.
CLASSIEfier – The Algorithm
First phase: a simple keyword matching to extract more labels
Keyword matching = the process of searching for literal matches (e.g. “hospital”) in a given piece of text (e.g. a grant description) to identify groups or subjects (e.g. health sector).
Example: “This project will raise awareness and empower deaf people by providing key mental health information in their primary language (Australian Sign Language).” Matched category: People with hearing impediment.
Stages:
• Identify keywords for CLASSIE
• Extract applications that exhibit a strong match
• Score the classification done by users
For example: “orphans” is a confusing category; “wildlife welfare” is a straightforward category.
We found that:
• Keyword matching accuracy differs from one category to another.
• On average it is around 80%.
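The keyword-matching phase can be illustrated with a minimal sketch. This is not the CLASSIEfier code: the `KEYWORDS` table, its keyword lists, the "Health" category name and the `match_categories` helper are all hypothetical. Only the example sentence and the category "People with hearing impediment" come from the slide.

```python
import re

# Hypothetical keyword lists; the real CLASSIE keyword sets are not given in the slides.
KEYWORDS = {
    "People with hearing impediment": ["deaf", "sign language", "hearing impaired"],
    "Health": ["hospital", "mental health"],
}


def match_categories(text, keywords=KEYWORDS, min_hits=1):
    """Return {category: number of distinct keywords found} for literal matches."""
    lowered = text.lower()
    hits = {}
    for category, words in keywords.items():
        count = sum(
            1
            for word in words
            # \b keeps "deaf" from matching inside a longer word such as "deafening".
            if re.search(r"\b" + re.escape(word) + r"\b", lowered)
        )
        if count >= min_hits:
            hits[category] = count
    return hits


description = (
    "This project will raise awareness and empower deaf people by providing "
    "key mental health information in their primary language "
    "(Australian Sign Language)."
)
print(match_categories(description))
```

Raising `min_hits` is one possible way to keep only applications that "exhibit a strong match"; the threshold actually used in the project is not stated.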
CLASSIEfier – The Algorithm
Second phase: Training the Machine Learning model
Training dataset: 128,000 grant applications classified by keyword matching
• DIFFICULTY #1: Multilabel
• DIFFICULTY #2: Hierarchy
• DIFFICULTY #3: Number of labels per category
Multilabels and Hierarchy
Example: A grant application that is aimed at helping teenagers with autism.
Beneficiaries:
• “Children and youth” at level 1
• “Adolescents” at level 2
And also,
• “People with disabilities” at level 1
• “People with intellectual disabilities” at level 2
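One way to picture the multilabel and hierarchy difficulties together is to store child-to-parent links and encode each application as a 0/1 vector over all categories. The sketch below is an assumption about representation, not the project's data model. The helper names and the extra "Seniors" category are hypothetical; the four beneficiary categories are the ones in the example above.

```python
# Child -> parent links for the two branches named in the example.
PARENT = {
    "Adolescents": "Children and youth",
    "People with intellectual disabilities": "People with disabilities",
}


def with_ancestors(labels, parent=PARENT):
    """Expand a set of labels so every ancestor category is included too."""
    expanded = set()
    for label in labels:
        while label is not None:
            expanded.add(label)
            label = parent.get(label)  # None once we reach a level-1 category
    return expanded


def multi_hot(labels, vocabulary):
    """Encode a label set as a 0/1 vector over a fixed category order."""
    return [1 if category in labels else 0 for category in vocabulary]


VOCABULARY = [
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
    "Seniors",  # hypothetical extra category, included only to show a 0 entry
]

labels = with_ancestors({"Adolescents", "People with intellectual disabilities"})
print(sorted(labels))
print(multi_hot(labels, VOCABULARY))
```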
DIFFICULTY #3: Number of labels per category
• Categories such as Confucius, North American people and Nomadic people, among others, will have fewer than 100 grant applications: 20 times fewer than the 2,000 minimum required.
• Niche classification or “black holes”
How do we solve it? – Separate training
1. Reads the application
2. Classification Level 1 – Machine learning (e.g. “Information and communications”, “Sports and recreation”)
3. Classification Level 2: where we have enough labels, we use another ML model; where we do not have enough labels, we use keyword matching
4. Classification Level 3: keyword matching in both cases
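The separate-training flow can be sketched as a small cascade. The slides do not show the actual control flow, so several things here are assumptions: the function names, the idea of counting labels per level-1 branch, and the reuse of the 2,000-application minimum as the switch between ML and keyword matching. The three `demo_*` callables are stand-ins for trained models and keyword tables.

```python
MIN_LABELS = 2000  # minimum applications per category quoted in the slides


def classify(text, level1_model, level2_models, label_counts, keyword_matcher):
    """Cascade: ML at level 1, ML or keyword matching at level 2, keywords at level 3."""
    result = {"level1": level1_model(text)}

    level1 = result["level1"]
    model = level2_models.get(level1)
    if model is not None and label_counts.get(level1, 0) >= MIN_LABELS:
        # Enough labels under this branch: use a dedicated level-2 model.
        result["level2"] = model(text)
        result["level2_method"] = "ml"
    else:
        # Too few labels: fall back to keyword matching.
        result["level2"] = keyword_matcher(text, level=2)
        result["level2_method"] = "keywords"

    # Level 3 always uses keyword matching.
    result["level3"] = keyword_matcher(text, level=3)
    return result


def demo_level1(text):
    # Stand-in for a trained level-1 classifier.
    return "Sports and recreation"


def demo_level2(text):
    # Stand-in for a trained level-2 classifier.
    return "level-2 label from ML model"


def demo_keywords(text, level):
    # Stand-in for keyword matching against level-specific keyword lists.
    return f"level-{level} label from keyword matching"


models = {"Sports and recreation": demo_level2}
rich = classify("A hiking club", demo_level1, models, {"Sports and recreation": 5000}, demo_keywords)
sparse = classify("A hiking club", demo_level1, models, {"Sports and recreation": 80}, demo_keywords)
print(rich["level2_method"], sparse["level2_method"])
```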
CLASSIEfier – The Algorithm
Third phase: Model interpretation: scoring and checking for biases
Stages:
• Choose the best model – k-nearest neighbours (k-nn)
• Choose the best parameters
• Choose the best scoring
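For readers unfamiliar with k-nearest neighbours, here is a minimal standard-library sketch of the idea applied to text. It is simplified to one label per application, whereas the project is multilabel. The bag-of-words features, cosine similarity, `k=3` and the toy training sentences are all assumptions for illustration; the slides do not state the features, distance or parameters that were chosen.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very small text representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(text, training, k=3):
    """Majority vote over the k most similar labeled applications."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up training examples (not real grant data).
TRAINING = [
    ("new nets for the junior football club", "Sports and recreation"),
    ("uniforms for the netball club season", "Sports and recreation"),
    ("community mural painting workshop", "Art and culture"),
    ("youth theatre and painting classes", "Art and culture"),
]

print(knn_predict("equipment for our football club", TRAINING, k=3))
```

Choosing "the best parameters" would then mean trying several values of `k` (and text representations) and comparing them with the scoring scheme.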
Scoring
• Recall: $\text{Recall} = \frac{TP}{TP + FN}$, an indication of missing labels
• Precision: $\text{Precision} = \frac{TP}{TP + FP}$, an indication of bad predictions
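For a single multi-label application, recall and precision can be computed from label sets with the standard definitions TP / (TP + FN) and TP / (TP + FP). The sketch below assumes labels are held as Python sets; the function name and the predicted labels (including "Seniors") are made up for illustration.

```python
def precision_recall(true_labels, predicted):
    """Return (precision, recall) for one application's label sets."""
    tp = len(true_labels & predicted)   # correct labels that were predicted
    fp = len(predicted - true_labels)   # predicted labels that are wrong
    fn = len(true_labels - predicted)   # correct labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


true_labels = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
predicted = {"Children and youth", "Adolescents", "Seniors"}  # hypothetical model output

precision, recall = precision_recall(true_labels, predicted)
print(round(precision, 2), recall)
```

Here two of the four true labels are found (recall 0.5) and one of the three predictions is wrong (precision about 0.67).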
Scoring
Based on the fact that each application has several categories
Recall: How many categories got picked per application
• 0: None
• 1: <45%
• 2: >45%
• 3: Perfect match
Precision: How many categories are wrong per application
• 0: All
• 1: >55%
• 2: <55%
• 3: None – Perfect match
Total score: 0 = useless model, 6 = perfect model. CLASSIEfier scores ~4-5.
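The 0 to 6 scoring scheme can be written down as a small function. This is a sketch under assumptions: the slide does not say how a value of exactly 45% or 55% is handled, nor what happens when nothing is predicted, so those choices (and all function names) are illustrative. It also assumes every application has at least one true label.

```python
def recall_points(true_labels, predicted):
    """0 = none picked, 1 = under 45% picked, 2 = 45% or more, 3 = all picked."""
    found = len(true_labels & predicted) / len(true_labels)
    if found == 0:
        return 0
    if found == 1:
        return 3
    return 1 if found < 0.45 else 2  # assumption: exactly 45% counts as 2


def precision_points(true_labels, predicted):
    """0 = all wrong, 1 = over 55% wrong, 2 = 55% or less wrong, 3 = none wrong."""
    if not predicted:
        return 0  # assumption: no prediction earns no precision points
    wrong = len(predicted - true_labels) / len(predicted)
    if wrong == 1:
        return 0
    if wrong == 0:
        return 3
    return 1 if wrong > 0.55 else 2  # assumption: exactly 55% counts as 2


def application_score(true_labels, predicted):
    """Combined score from 0 (useless) to 6 (perfect)."""
    return recall_points(true_labels, predicted) + precision_points(true_labels, predicted)


truth = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
guess = {"Children and youth", "Adolescents", "Seniors"}  # hypothetical model output
print(application_score(truth, guess))
```

In this made-up case half of the true categories are picked (2 points) and a third of the predictions are wrong (2 points), giving 4, which sits in the ~4-5 range reported for CLASSIEfier.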
Misclassifications and black holes will cause underfunding of minorities that are already overlooked
The Data Science for Social Good Movement
“The best minds of my generation are thinking about how to make people click ads,” he says. “That sucks.” -- Jeff Hammerbacher (Cloudera and Facebook data leader)
Algorithmic bias
• This will happen if you feed the algorithm data that is already biased, or insufficient data: the algorithm will predict biased classifications.
• Algorithms are mirrors
[Image: Sport people]
Know your Model! xkcd.com/1838/
• SHAP (SHapley Additive exPlanations)
• AI Fairness 360
• WEAT tests proposed in Caliskan et al. 2017
Document everything! – this is how we tackle biases. Choose transparency.
Results and conclusions
• Keywords: Church, Religion, Christian
• Model = Religion
• Reality – A fete in a Catholic school
It is not feasible to classify human natural language with 100% accuracy
Results and conclusions
Out of 200 applications classified by users, we found that:
• 63% right
• 18% wrong
• 19% half right
CLASSIEfier performs similarly to humans, not better, not worse: ~70-80% accuracy
Results and conclusions
• Approved grant applications: 85% accuracy
• Declined grant applications: 75% accuracy
The model is also discriminating between good and bad applications
Results and conclusions
“Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development”
CLASSIEfier is now feeding back into CLASSIE
CLASSIEfier – More than just an algorithm
• Data preprocessing
• Writing and testing the algorithm
• Production – back and front end product
• Maintenance
Do you want to learn more?
LinkedIn: paola-oliva-altamirano
Email: paolao@ourcommunity.com.au
Innovation Lab: https://www.ourcommunity.com.au/innovationlab