How have Data Science Skills Evolved? A case study using embeddings. Maryam Jahanshahi, Ph.D., Research Scientist, TapRecruit.co. http://bit.ly/dataengconf2018
TapRecruit uses NLP to understand career content: converting unstructured documents into structured data.
Smart Editor for JDs: Data-driven suggestions on both the content and language use in job descriptions.
Pipeline Health Monitoring: Analytics dashboards to help diagnose quality and diversity issues in talent pipelines.
Salary Estimation: Data-driven salary estimates based on a job’s requirements rather than just title and location.
Language matters in job descriptions
Same title, different job: Finance Manager (Kraft Foods) | Finance Manager (Roche)
  Required Experience: Senior (6-8 Years) | Junior (3 Years)
  Required Responsibility: No Managerial Experience | Division Level Controller
  Preferred Skill: Strategic Finance Role
  Required Education: MBA / CPA
Different title, same job: Performance Marketing Manager (PocketGems) | Senior Analyst, Customer Strategy (The Gap)
  Required Experience: Mid-Level | Mid-Level
  Required Skills: Quantitative Focus | Quantitative Focus
  Required Experience: iBanking Expertise | Finance Expertise
  Required Skills: Data Analysis Tools (SQL) | Relational Database Experience
  Preferred Experience: Consulting Experience Preferred | External Consulting Experience Preferred
  Preferred Education: MBA Preferred | BA in Accounting, Finance, MBA Preferred
How have data science skills changed over time?
Strategies to identify changes within datasets
Manual Feature Extraction (e.g. MBA, SQL, PhD, Tableau, Python, PowerBI): Requires a priori selection of key attributes, which makes it difficult to discover new attributes.
Dynamic Topic Models: Use a bag-of-words approach and require experimentation with the number of topics.
[Figure: top words of one topic over time. 1880: force, energy, motion, differ, light. 1920: atom, theory, electron, energy, measure. 1960: radiat, energy, electron, measure, ray. 2000: state, energy, electron, magnet, field. Labels: Matter, Quantum, Electron. Adapted from Blei and Lafferty, ICML 2006.]
Word embeddings capture semantic similarities
Example sentences, each annotated as context / word / context around the word "Python":
  Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
  Experience in Python, Java or other object-oriented programming languages
  Proficiency programming in Python, Java or C++.
[Figure: embedding space showing Python, Java, C++, Object-oriented and Programming Language, alongside the natural languages Esperanto, French, German and Japanese.]
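The word/context annotation on this slide can be made concrete with a small sketch of how (word, context) training pairs are typically generated for a skip-gram style model. The function name `skipgram_pairs`, the whitespace tokenization and the window of 2 are illustrative assumptions, not details taken from the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Return (word, context) pairs for every token and its neighbours."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


# Third example sentence from the slide, lower-cased and stripped of punctuation.
sentence = "proficiency programming in python java or c++".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words observed around "python" with a window of 2.
python_contexts = sorted(c for w, c in pairs if w == "python")
print(python_contexts)
```

Words that keep appearing with the same contexts (here Python next to Java and "programming") end up with similar vectors once a model is trained on many such pairs.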
Embeddings capture entity relationships
Dimensionality enables comparison between word pairs along many axes.
Hierarchies: Exxon and Tillerson; Wal-Mart and McMillon; Viacom and Dauman; Verizon and McAdam; Vodafone and Colao
Comparatives and Superlatives: Slow, Slower, Slowest; Short, Shorter, Shortest; Strong, Stronger, Strongest
Woman :: Queen as Man :: ? (King)
Adapted from Stanford NLP GloVe Project
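The analogy on this slide is usually solved with vector arithmetic: take queen minus woman plus man and look for the nearest remaining word. The sketch below uses hand-made 2-D vectors (first axis roughly "gender", second roughly "royalty"), which are purely hypothetical stand-ins for real 300-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical toy vectors, not values from any trained model.
vectors = {
    "man": (1.0, 0.1),
    "woman": (-1.0, 0.1),
    "king": (1.0, 1.0),
    "queen": (-1.0, 1.0),
    "apple": (0.1, -1.0),
}


def analogy(a, b, c, vectors):
    """Solve a :: b as c :: ? by finding the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("woman", "queen", "man", vectors)
print(answer)
```

Excluding the three query words from the candidates is a common convention, since the nearest vector to the target is otherwise often one of the inputs.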
Pretrained embeddings facilitate fast prototyping
Pipeline: Corpus Generation, Corpus Processing, Language Model Generation, Language Model Tuning, Final Application

Corpus            Twitter     Common Crawl   GoogleNews   Wikipedia
Tokens            27 B        42-840 B       100 B        6 B
Vocabulary Size   1.2 M       1.9-2.2 M      3 M          400 k
Algorithm         GloVe       GloVe          word2vec     GloVe
Vector Length     25-200 d    300 d          300 d        50-300 d
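Pretrained GloVe vectors are distributed as plain text, one word per line followed by its space-separated vector components, so prototyping can start with a very small loader. The sketch below parses that layout from an in-memory sample; the function name `load_text_vectors` and the two sample rows are made up for illustration.

```python
import io


def load_text_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up 3-d sample standing in for a real pretrained file.
sample = io.StringIO("python 0.1 0.2 0.3\njava 0.2 0.1 0.3\n")
vectors = load_text_vectors(sample)

# All vectors in one file should share the same length.
dims = {len(v) for v in vectors.values()}
print(len(vectors), dims)
```

With a real file, the same function would be called on an opened file handle instead of the `StringIO` sample.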
Problems with pretrained embedding models
Casing: Abbreviations vs words, e.g. IT vs it
Out of Vocabulary Words: Domain-specific words and acronyms
Polysemy: Words with multiple meanings, e.g. drive (a car) vs drive (results); Chef (the job) vs Chef (the language)
Multi-word Expressions: Phrases that have new meanings, e.g. Front-end vs front + end
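Two of these problems, casing and multi-word expressions, can be handled in preprocessing before a custom model is trained. The following is a minimal sketch under the assumption that a curated acronym list and phrase list are available; `ACRONYMS`, `PHRASES` and `normalize` are hypothetical names, and this is not presented as TapRecruit's pipeline.

```python
# Hypothetical curated lists; a real pipeline would derive or maintain these.
ACRONYMS = {"IT", "SQL"}
PHRASES = {("front", "end"): "front_end"}


def normalize(tokens):
    """Lower-case tokens, but keep known acronyms and merge known phrases."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in PHRASES:  # merge a multi-word expression into one token
            out.append(PHRASES[pair])
            i += 2
            continue
        tok = tokens[i]
        # Keep the casing of acronyms so "IT" and "it" stay distinct.
        out.append(tok if tok in ACRONYMS else tok.lower())
        i += 1
    return out


tokens = "Our IT team maintains it and hires Front End developers".split()
normalized = normalize(tokens)
print(normalized)
```

Polysemy and out-of-vocabulary words are not addressed by this kind of normalization; those are the cases where training on a domain corpus matters most.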
Tools for developing custom language models
Modularized for different data and modeling requirements.
Corpus Processing (SyntaxNet, CoreNLP): Tokenization, POS tagging, Sentence Segmentation, Dependency Parsing
Language Modeling: Different word embedding models (GloVe, word2vec, fastText)
Hyperparameter tuning on final model outputs
Window sizes capture semantic similarity vs semantic relatedness.
Small Window Size: Captures semantic similarity, substitutes and word-level differences.
Large Window Size: Captures semantic relatedness, alternatives and domain-level differences.
[Figure: two embedding plots. Small window: Esperanto, French, German, Japanese, Python, Java, C++, Object-oriented, Programming Language. Large window: the same words plus Statistical modeling, SPSS and Software.]
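One way to see why window size matters is to look at which words count as context for a target under different windows. The sketch below uses a simplified paraphrase of the first example sentence from the earlier slide ("e.g." replaced by "like", punctuation dropped); the helper `context_of` and the window values 2 and 8 are illustrative choices.

```python
def context_of(tokens, target, window):
    """Collect the words within `window` positions of each occurrence of target."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            found.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return found


tokens = (
    "statistical modeling through software like spss "
    "or programming language like python"
).split()

small = context_of(tokens, "python", window=2)
large = context_of(tokens, "python", window=8)

print(sorted(small))
print(sorted(large))
```

With the small window only the immediate neighbours are seen, while the large window also pulls in "software" and "spss", which is the kind of domain-level co-occurrence that makes related terms move closer together.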
Career language embedding model Identified equal opportunity and perks language
Career language embedding model Identified 'soft' skills and language around experience
I’ve got 300 dimensions… but time ain’t one
Two approaches to connect embeddings
Time slices shown: 2015, 2016, 2017, 2018.
Static embeddings stitched together:
  Data hungry: Requires sufficient data in each time slice for a quality embedding.
  Requires alignment: Each time slice is trained independently, therefore dimensions are not comparable across slices.
  Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515. Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315.
Dynamic embeddings trained together:
  Data efficient: Treats each time slice as a sequential latent variable, enabling time slices with sparse data.
  Does not require alignment: Treating the time slice as a variable ensures embeddings are connected across slices.
  Bamler and Mandt, arXiv:1702.08359. Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607. Rudolph and Blei, arXiv:1703.08052.
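To illustrate why independently trained slices need alignment, the sketch below rotates one toy 2-D embedding space onto another using the closed-form least-squares rotation angle. This is the 2-D special case of rotation-based alignment (orthogonal Procrustes in higher dimensions), shown only as an assumption about how alignment can be done, not as the method used in the talk or in the cited papers. All vectors and names are hypothetical.

```python
import math


def best_rotation(source, target):
    """Angle rotating source points onto target points in the least-squares sense (2-D only)."""
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(source, target))
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(source, target))
    return math.atan2(cross, dot)


def rotate(point, theta):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1], s * point[0] + c * point[1])


# Hypothetical 2-D vectors for three shared words in the 2015 slice.
slice_2015 = {"sql": (1.0, 0.0), "python": (0.0, 1.0), "excel": (1.0, 1.0)}
# Same geometry in 2018, but rotated by 90 degrees, as if trained independently.
slice_2018 = {w: rotate(v, math.pi / 2) for w, v in slice_2015.items()}

words = sorted(slice_2015)
theta = best_rotation([slice_2018[w] for w in words], [slice_2015[w] for w in words])
aligned_2018 = {w: rotate(slice_2018[w], theta) for w in words}

print(round(math.degrees(theta), 1))
```

After alignment, any remaining distance between a word's 2015 and 2018 vectors can be read as change in usage rather than as an artifact of the arbitrary orientation of each training run.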