An analysis of the user occupational class through Twitter content
Daniel Preoţiuc-Pietro (1), Vasileios Lampos (2), Nikolaos Aletras (2)
(1) Computer and Information Science, University of Pennsylvania
(2) Department of Computer Science, University College London
29 July 2015

Motivation
User attribute prediction from text is successful:
◮ Age (Rao et al. 2010 ACL)
◮ Gender (Burger et al. 2011 EMNLP)
◮ Location (Eisenstein et al. 2011 EMNLP)
◮ Personality (Schwartz et al. 2013 PLoS One)
◮ Impact (Lampos et al. 2014 EACL)
◮ Political orientation (Volkova et al. 2014 ACL)
◮ Mental illness (Coppersmith et al. 2014 ACL)
Downstream applications are benefiting from this:
◮ Sentiment analysis (Volkova et al. 2013 EMNLP)
◮ Text classification (Hovy 2015 ACL)

However...
Socio-economic factors (occupation, social class, education, income) play a vital role in language use (Bernstein 1960, Labov 1972 / 2006)
No large-scale user-level dataset to date
Applications:
◮ sociological analysis of language use
◮ embedding to downstream tasks (e.g. controlling for socio-economic status)

At a Glance
Our contributions:
◮ Predicting a new user attribute: occupation
◮ New dataset: user ←→ occupation
◮ Gaussian Process classification for NLP tasks
◮ Feature ranking and analysis using non-linear methods

Standard Occupational Classification
Standardised job classification taxonomy
Developed and used by the UK Office for National Statistics (ONS)
Hierarchical:
◮ 1-digit (major) groups: 9
◮ 2-digit (sub-major) groups: 25
◮ 3-digit (minor) groups: 90
◮ 4-digit (unit) groups: 369
Jobs grouped by skill requirements

Standard Occupational Classification
C1 Managers, Directors and Senior Officials
◮ 11 Corporate Managers and Directors
  ◮ 111 Chief Executives and Senior Officials
    ◮ 1115 Chief Executives and Senior Officials
      Job: chief executive, bank manager
    ◮ 1116 Elected Officers and Representatives
  ◮ 112 Production Managers and Directors
  ◮ 113 Functional Managers and Directors
  ◮ 115 Financial Institution Managers and Directors
  ◮ 116 Managers and Directors in Transport and Logistics
  ◮ 117 Senior Officers in Protective Services
  ◮ 118 Health and Social Services Managers and Directors
  ◮ 119 Managers and Directors in Retail and Wholesale
◮ 12 Other Managers and Proprietors

Standard Occupational Classification
C2 Professional Occupations
   Job: mechanical engineer, pediatrist, postdoctoral researcher
C3 Associate Professional and Technical Occupations
   Job: system administrator, dispensing optician
C4 Administrative and Secretarial Occupations
   Job: legal clerk, company secretary
C5 Skilled Trades Occupations
   Job: electrical fitter, tailor
C6 Caring, Leisure, Other Service Occupations
   Job: school assistant, hairdresser
C7 Sales and Customer Service Occupations
   Job: sales assistant, telephonist
C8 Process, Plant and Machine Operatives
   Job: factory worker, van driver
C9 Elementary Occupations
   Job: shelf stacker, bartender

Data
5,191 users ←→ 3-digit job group
Users collected by self-disclosure of job title in profile
Manually filtered by the authors
10M tweets, average 94.4 users per 3-digit group

Data
Here we classify only at the 1-digit top-level group (9 classes)
Feature representation and labels available online
Raw data available for research purposes on request (per Twitter TOS)
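
Since users are labelled with 3-digit groups but classified at the 1-digit level, the labels have to be collapsed. A minimal sketch, assuming the major group is simply the first digit of the code, as the C1 → 11 → 111 → 1115 hierarchy shown earlier suggests. The function name `major_group` is hypothetical and not taken from the authors' code.

```python
def major_group(soc_code: str) -> int:
    """Return the 1-digit (major) group of a SOC code given as a digit string.

    Assumption: the leading digit of a 1- to 4-digit SOC code is its major group.
    """
    code = soc_code.strip()
    if not code.isdigit() or not 1 <= len(code) <= 4:
        raise ValueError(f"not a SOC code: {soc_code!r}")
    return int(code[0])


# Codes taken from the C1 slide: all of them collapse to major group 1.
codes = ["111", "112", "119", "12", "1115"]
labels = [major_group(c) for c in codes]
```

With this mapping the 3-digit labels become 9-way class labels without any further lookup table.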

Features
User Level features (18), such as:
◮ number of:
  ◮ followers
  ◮ friends
  ◮ listings
  ◮ tweets
◮ proportion of:
  ◮ retweets
  ◮ hashtags
  ◮ @-replies
  ◮ links
◮ average:
  ◮ tweets / day
  ◮ retweets / tweet

Features
Focus on interpretable features for analysis
Compute over reference corpus of 400M tweets:
◮ SVD embeddings and clusters
◮ Word2Vec (W2V) embeddings and clusters

SVD Features
Compute word × word similarity matrix
Similarity metric is Normalized PMI (Bouma 2009) using the entire tweet as context
SVD with different numbers of dimensions (30, 50, 100, 200)
A user is represented by summing the representations of the words they use
The low-dimensional features offer no interpretability
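
The SVD embedding pipeline can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: whitespace tokenisation, a tiny toy corpus, scaling the left singular vectors by the singular values, and the function names are all choices made here. NPMI follows the usual definition, PMI divided by -log p(x, y), with p estimated from the fraction of tweets containing the word(s).

```python
import numpy as np


def npmi_matrix(tweets, vocab):
    """Word x word NPMI, using the whole tweet as the co-occurrence context."""
    index = {w: i for i, w in enumerate(vocab)}
    occ = np.zeros((len(tweets), len(vocab)))  # binary tweet x word matrix
    for t, tweet in enumerate(tweets):
        for w in set(tweet.split()):  # assumption: whitespace tokenisation
            if w in index:
                occ[t, index[w]] = 1.0
    n = len(tweets)
    p_joint = occ.T @ occ / n        # p(x, y): share of tweets with both words
    p_word = occ.sum(axis=0) / n     # p(x): share of tweets with the word
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / np.outer(p_word, p_word))
        npmi = pmi / -np.log(p_joint)
    npmi[p_joint == 0] = -1.0        # pairs that never co-occur
    return np.nan_to_num(npmi, nan=0.0)  # degenerate case p(x, y) = 1


def svd_embeddings(similarity, k):
    """Rank-k word embeddings from the similarity matrix (scaling is an assumption)."""
    u, s, _ = np.linalg.svd(similarity)
    return u[:, :k] * s[:k]


def user_vector(user_tweets, vocab, embeddings):
    """Sum the embeddings of every in-vocabulary word the user wrote."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = np.zeros(embeddings.shape[1])
    for tweet in user_tweets:
        for w in tweet.split():
            if w in index:
                vec += embeddings[index[w]]
    return vec


toy_tweets = ["art design print", "art design",
              "cancer patients", "cancer treatment patients"]
toy_vocab = ["art", "design", "cancer", "patients"]
sim = npmi_matrix(toy_tweets, toy_vocab)
emb = svd_embeddings(sim, k=2)
user = user_vector(["art art cancer"], toy_vocab, emb)
```

In the toy corpus "art" and "design" always co-occur and never appear with "cancer", so their NPMI values sit at the two ends of the [-1, 1] range.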

SVD Features
Spectral clustering to get hard clusters of words (30, 50, 100, 200 clusters)
Each cluster consists of distributionally similar words ←→ topic
A user is represented by the number of times they use a word from each cluster
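
Once each word has a hard cluster assignment, the user representation is a vector of counts. A minimal sketch, assuming the clustering is already available as a word-to-cluster dictionary and that tweets are lower-cased and split on whitespace; the names `cluster_counts` and `word_to_cluster` are hypothetical.

```python
import numpy as np


def cluster_counts(user_tweets, word_to_cluster, n_clusters):
    """Count how often a user uses a word from each cluster."""
    counts = np.zeros(n_clusters, dtype=int)
    for tweet in user_tweets:
        for word in tweet.lower().split():  # assumption: simple tokenisation
            cluster = word_to_cluster.get(word)
            if cluster is not None:         # words outside the clustering are skipped
                counts[cluster] += 1
    return counts


# Toy clustering: cluster 0 is an "arts" topic, cluster 1 a "health" topic.
toy_clusters = {"art": 0, "design": 0, "cancer": 1}
features = cluster_counts(["Art design art", "cancer art"], toy_clusters, n_clusters=2)
```

Dividing `features` by its sum would give a topic proportion per user, the quantity on the x-axis of the CDF plots later in the deck.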

Word2Vec Features
Trained Word2Vec (layer size 50) on our Twitter reference corpus
Spectral clustering on the word × word similarity matrix (30, 50, 100, 200 clusters)
Similarity is cosine similarity of words in the embedding space
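
The word × word similarity matrix that feeds the clustering can be computed from any embedding matrix with one row per word. A minimal NumPy sketch; the function name is hypothetical, and how negative similarities are handled before spectral clustering is not stated in the slides, so no such step is shown.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Toy 2-dimensional "embeddings": rows 0 and 1 point the same way, row 2 is orthogonal.
toy_embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
similarity = cosine_similarity_matrix(toy_embeddings)
```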

Gaussian Processes
Brings together several key ideas in one framework:
◮ Bayesian
◮ kernelised
◮ non-parametric
◮ non-linear
◮ modelling uncertainty
Elegant and powerful framework, with growing popularity in machine learning and application domains

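Gaussian Process
Graphical Model View
$f \sim \mathcal{GP}(m, k)$
$y \sim \mathcal{N}(f(x), \sigma^2)$
◮ $f : \mathbb{R}^D \to \mathbb{R}$ is a latent function
◮ $y$ is a noisy realisation of $f(x)$
◮ $k$ is the covariance function or kernel
◮ $m$ and $\sigma^2$ are learnt from data
[Figure: graphical model with nodes k, f, σ, x and y, and a plate over the N observations]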
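
The generative view above can be simulated directly: draw a latent function from the GP prior, then add Gaussian noise. A minimal sketch under assumptions made here for illustration: a zero mean function, a squared-exponential kernel, one-dimensional inputs (D = 1), and fixed values for the lengthscale and noise level rather than learnt ones.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)

# f ~ GP(0, k): at a finite set of inputs this is a multivariate normal.
K = rbf_kernel(x, x)
chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))  # jitter for numerical stability
f = chol @ rng.standard_normal(len(x))

# y ~ N(f(x), sigma^2): noisy realisation of the latent function.
sigma = 0.1
y = f + sigma * rng.standard_normal(len(x))
```

Changing `lengthscale` changes how quickly the sampled function varies, which is the property the ARD kernel later exploits per feature.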

Gaussian Process Classification
Pass the latent function through a logistic function to squash it from $(-\infty, \infty)$ into $(0, 1)$ and obtain a probability, $\pi(x) = p(y_i = 1 \mid f_i)$ (similar to logistic regression)
The likelihood is non-Gaussian and the solution is not analytical
Inference using Expectation Propagation (EP)
FITC approximation for large data
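
To see why the solution is not analytical: a class probability requires averaging the logistic function over a Gaussian belief about the latent value, and that integral has no closed form. The sketch below approximates it by plain Monte Carlo sampling. It is not EP or FITC, only an illustration, and the function names and the sample count are assumptions.

```python
import numpy as np


def logistic(f):
    """Numerically stable logistic function, squashing real values into [0, 1]."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(f, dtype=float)))


def predictive_probability(mean, variance, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[logistic(f)] for f ~ N(mean, variance)."""
    rng = np.random.default_rng(seed)
    f = mean + np.sqrt(variance) * rng.standard_normal(n_samples)
    return float(logistic(f).mean())


certain = predictive_probability(2.0, 0.0)    # no latent uncertainty
uncertain = predictive_probability(2.0, 9.0)  # same mean, large uncertainty
```

With the same latent mean, a larger latent variance pulls the predicted probability towards 0.5, which is how the model expresses uncertainty in its class predictions.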

Gaussian Process Classification
ARD kernel learns feature importance → features most discriminative between classes
We learn 9 one-vs-all binary classifiers
This way, we find the most predictive features consistent for all classes
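
An ARD (automatic relevance determination) kernel gives every feature its own lengthscale; a short lengthscale means the function varies quickly along that feature, so the feature matters. A minimal sketch, assuming a squared-exponential base kernel (the slides do not name one) and made-up lengthscale values in place of ones learnt from data. The function names are hypothetical.

```python
import numpy as np


def ard_rbf_kernel(X1, X2, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per feature."""
    A = np.asarray(X1, dtype=float) / lengthscales
    B = np.asarray(X2, dtype=float) / lengthscales
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def rank_features(lengthscales):
    """Feature indices ordered from most to least relevant (shortest lengthscale first)."""
    return np.argsort(lengthscales)


# Hypothetical learnt lengthscales for three features.
lengthscales = np.array([10.0, 0.5, 2.0])
ranking = rank_features(lengthscales)

origin = np.zeros((1, 3))
step_feature_0 = np.array([[1.0, 0.0, 0.0]])  # move along a long-lengthscale feature
step_feature_1 = np.array([[0.0, 1.0, 0.0]])  # move along a short-lengthscale feature
k_far = ard_rbf_kernel(origin, step_feature_0, lengthscales)[0, 0]
k_near = ard_rbf_kernel(origin, step_feature_1, lengthscales)[0, 0]
```

A unit step along feature 1 drops the covariance far more than the same step along feature 0, so feature 1 ranks first.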

Gaussian Process Resources
◮ Free book: http://www.gaussianprocess.org/gpml/chapters/
◮ GPs for Natural Language Processing tutorial (ACL 2014): http://www.preotiuc.ro
◮ GP Schools in Sheffield and roadshows in Kampala, Pereira, Nyeri, Melbourne: http://ml.dcs.shef.ac.uk/gpss/
◮ Annotated bibliography and other materials: http://www.gaussianprocess.org
◮ GPy Toolkit (Python): https://github.com/SheffieldML/GPy

Prediction
[Figure: bar chart of accuracy (%) for LR, SVM-RBF and GP against a baseline, grouped by feature set; y-axis from 25 to 55; stratified 10-fold cross-validation]
Bar values shown for each feature set, listed highest first (the method belonging to each bar is not recoverable from the transcript):
◮ User Level: 34.2, 34, 31.5
◮ SVD-E (200): 43.8, 43.1, 40
◮ SVD-C (200): 48.2, 47.9, 44.2
◮ W2V-E (50): 49, 48.4, 42.5
◮ W2V-C (200): 52.7, 51.7, 46.9
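
Stratified 10-fold cross-validation keeps the class proportions roughly equal across folds, which matters when the 9 classes differ in size. A minimal NumPy sketch of the fold assignment; the function name, the seed and the toy labels are assumptions, and the authors' actual splitting code is not shown in the slides.

```python
import numpy as np


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each example to a fold so every class is spread evenly over the folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                               # random order within the class
        fold_of[idx] = np.arange(len(idx)) % n_folds   # deal members round-robin
    return fold_of


# Toy labels: 9 classes with 20 users each.
toy_labels = np.repeat(np.arange(1, 10), 20)
folds = stratified_folds(toy_labels)

# Fold 0 is held out for testing, the rest is used for training.
test_mask = folds == 0
train_labels, test_labels = toy_labels[~test_mask], toy_labels[test_mask]
```

Repeating the last step for each fold index and averaging the test accuracies gives the cross-validated score.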

Prediction Analysis
User level features have no predictive value
Clusters outperform embeddings
Word2Vec features are better than SVD / NPMI for prediction
Non-linear methods (SVM-RBF and GP) significantly outperform linear methods
52.7% accuracy for 9-class classification is decent

Class Comparison
Jensen-Shannon Divergence between topic distributions across occupational classes
Some clusters of occupations are observable
[Figure: 9 × 9 heatmap of pairwise divergence between classes 1 to 9; colour scale from 0.00 to 0.03]
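
The Jensen-Shannon divergence between two topic distributions is the average KL divergence of each distribution from their mixture. A minimal sketch; the base-2 logarithm and the function name are assumptions, since the slides do not state which base was used.

```python
import numpy as np


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalise counts to probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy topic distributions for two classes over three topics.
class_a = [0.5, 0.3, 0.2]
class_b = [0.4, 0.4, 0.2]
d_ab = js_divergence(class_a, class_b)
```

Computing this for every pair of the 9 classes fills a symmetric 9 × 9 matrix with zeros on the diagonal.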

Feature Analysis
Rank  Manual Label          Topic (most frequent words)
1     Arts                  art, design, print, collection, poster, painting, custom, logo, printing, drawing
2     Health                risk, cancer, mental, stress, patients, treatment, surgery, disease, drugs, doctor
3     Beauty Care           beauty, natural, dry, skin, massage, plastic, spray, facial, treatments, soap
4     Higher Education      students, research, board, student, college, education, library, schools, teaching, teachers
5     Software Engineering  service, data, system, services, access, security, development, software, testing, standard
Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking

Feature Analysis
Rank  Manual Label     Topic (most frequent words)
7     Football         van, foster, cole, winger, terry, reckons, youngster, rooney, fielding, kenny
8     Corporate        patent, industry, reports, global, survey, leading, firm, 2015, innovation, financial
9     Cooking          recipe, meat, salad, egg, soup, sauce, beef, served, pork, rice
12    Elongated Words  wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo
16    Politics         human, culture, justice, religion, democracy, religious, humanity, tradition, ancient, racism
Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking

Feature Analysis - Cumulative distribution functions
Higher Education (#21)
[Figure: cumulative distribution of users over topic proportion, one line per class C1 to C9; x-axis: topic proportion (0.001, 0.01, 0.05); y-axis: user probability (0 to 1)]
Topic more prevalent → CDF line closer to bottom-right corner

Feature Analysis - Cumulative distribution functions
Arts (#116)
[Figure: cumulative distribution of users over topic proportion, one line per class C1 to C9; x-axis: topic proportion (0.001, 0.01, 0.05); y-axis: user probability (0 to 1)]
Topic more prevalent → CDF line closer to bottom-right corner
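
Each line in these plots is an empirical CDF: for a given topic proportion, the share of users in a class whose usage of the topic is at or below that proportion. A minimal sketch with made-up per-user proportions for two hypothetical classes; the function name and the data are illustrative only.

```python
import numpy as np


def empirical_cdf(values, grid):
    """Share of values that are less than or equal to each grid point."""
    values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(values, grid, side="right") / len(values)


grid = np.array([0.001, 0.01, 0.05])

# Hypothetical per-user topic proportions for two classes.
heavy_users = [0.02, 0.03, 0.04, 0.06]      # class that talks about the topic a lot
light_users = [0.0005, 0.002, 0.004, 0.02]  # class that rarely does

cdf_heavy = empirical_cdf(heavy_users, grid)
cdf_light = empirical_cdf(light_users, grid)
```

The class that uses the topic more has the lower curve at every grid point, which is why its line sits closer to the bottom-right corner.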
