DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS
ASHUTOSH BAHETI, 12CS10012
RAHUL GURNANI, 12CS10039
DHRUV JAIN, 12CS30043
NISHKARSH SHASTRI, 12CS10034
SABYASACHEE BARAUH, 12CS30029
OBJECTIVE
● Identify the personality of Quora users with respect to the Big Five personality traits, using linguistic-feature-based analysis of their answers
● Openness to Experience
● Conscientiousness
● Extraversion
● Agreeableness
● Neuroticism
RELATED WORK
● Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."
● Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."
● Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. Workshop on Computational Personality Recognition.
Project Timeline
● Classified the essay data using LIWC features
● Identified the linguistic features for the Big Five personality traits
● Extracted textual features from the essays
● Classified based on the new features and LIWC
● Surveyed Quora users to get a labelled dataset
● Crawled the answers of the surveyed users
● Used the Quora dump to expand LIWC
● Trained the model on the labelled Quora dataset
● Calculated the accuracy of the trained model
Classification of Essay Data
● Straightforward ML approach
● Labelled the essays with binary values for each personality trait
● Sanitized the essay text
● Created a trie structure for LIWC prefix matching
● Extracted features based on the LIWC word count for each category
● Applied SVM to the data using WEKA
● Model accuracy was found to be 53%
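The trie-based LIWC matching above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes LIWC-style patterns in which a trailing "*" marks a prefix entry, and the class name LiwcTrie, the function category_counts, and the toy dictionary entries are all hypothetical.

```python
from collections import Counter


class LiwcTrie:
    """Trie over LIWC-style patterns; a trailing '*' marks a prefix pattern."""

    def __init__(self):
        self.root = {}

    def add(self, pattern, categories):
        is_prefix = pattern.endswith("*")
        node = self.root
        for ch in pattern.rstrip("*"):
            node = node.setdefault(ch, {})
        # Multi-character keys cannot collide with single-character edges.
        key = "$prefix" if is_prefix else "$exact"
        node.setdefault(key, set()).update(categories)

    def lookup(self, word):
        """Return every category whose pattern matches `word`."""
        found = set()
        node = self.root
        for ch in word:
            # Any prefix pattern ending at this node matches the longer word.
            found |= node.get("$prefix", set())
            node = node.get(ch)
            if node is None:
                return found
        return found | node.get("$prefix", set()) | node.get("$exact", set())


def category_counts(tokens, trie):
    """Count how many tokens fall under each category."""
    counts = Counter()
    for token in tokens:
        counts.update(trie.lookup(token.lower()))
    return counts


# Toy dictionary; entries and category names are illustrative only.
trie = LiwcTrie()
trie.add("happi*", ["posemo"])
trie.add("sad", ["negemo"])
trie.add("i", ["pronoun"])

counts = category_counts("I am happily sad".split(), trie)
```

Storing the categories at the node where a pattern ends lets a single walk over the word collect both exact and prefix matches, which is presumably why a trie suits the LIWC wildcard entries.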
Features Identified for Extraversion
● Word variance (repetitiveness)
● Type/token ratio
● Formality measure and informality measure
F-Measure = (noun freq + adjective freq + preposition freq + article freq - pronoun freq - verb freq - adverb freq - interjection freq + 100) / 2
I-Measure = (wrong-typed words freq + interjections freq + emoticon freq) * 100
● Positivity and negativity of text
● Rich vocabulary, use of difficult words
● Concrete and frequent words
● Use of more social words
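As a rough illustration of the F-Measure and the type/token ratio listed above, the sketch below works on text that has already been POS-tagged. It assumes each "freq" is the percentage of tokens carrying that tag, and the coarse tag names are hypothetical stand-ins for whatever tagset the tagger produces.

```python
from collections import Counter

# Hypothetical coarse tag names; a real tagger's tags would need mapping.
PLUS_TAGS = ("noun", "adjective", "preposition", "article")
MINUS_TAGS = ("pronoun", "verb", "adverb", "interjection")


def f_measure(tags):
    """Formality score from a list of coarse POS tags, one per token."""
    if not tags:
        raise ValueError("need at least one tag")
    counts = Counter(tags)

    def pct(tag):
        # Assumption: frequency is expressed as a percentage of all tokens.
        return 100.0 * counts[tag] / len(tags)

    plus = sum(pct(t) for t in PLUS_TAGS)
    minus = sum(pct(t) for t in MINUS_TAGS)
    return (plus - minus + 100.0) / 2.0


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (case-insensitive)."""
    if not tokens:
        raise ValueError("need at least one token")
    return len({t.lower() for t in tokens}) / len(tokens)


tags = ["article", "noun", "verb", "preposition", "article", "noun"]
tokens = ["The", "cat", "saw", "the", "dog"]
formality = f_measure(tags)
ttr = type_token_ratio(tokens)
```

With percentage frequencies the score stays between 0 and 100, with higher values indicating more formal text.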
Features Identified for Openness
● Preference for longer words
● Words expressing tentativeness
● Avoidance of 1st person singular pronouns
● Present tense forms
● Avoidance of past tense
Features Identified for Conscientiousness
● Avoidance of negations
● Avoidance of words reflecting discrepancies (e.g., should and would)
● 2nd person pronouns
● Filler words (in males and not in females); more useful in speech analysis
Features Identified for Agreeableness
● More positive emotions, few negative emotions
● Few articles
● Negative and positive emotion words
● Leisure activity
Features Identified for Neuroticism
● 1st person singular pronouns
● Noun Negative
● Multiple punctuations
● Fewer references to occupation
Extraction of Features
● Python scripts using nltk extract the features listed on the previous five slides
● Speech-based features were not extracted
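A few of the surface features from the trait slides could be extracted along these lines. This is only a sketch: it uses the standard library instead of nltk to stay self-contained, the word lists are small hypothetical samples, and normalizing by token count is an assumption.

```python
import re

# Small illustrative word lists; a real run would use fuller lexicons.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor"}


def extract_features(text):
    """Return a dict of simple per-text linguistic features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty text

    def rate(words):
        return sum(t in words for t in tokens) / n

    return {
        "first_person_singular": rate(FIRST_PERSON_SINGULAR),
        "second_person": rate(SECOND_PERSON),
        "negations": rate(NEGATIONS),
        "avg_word_length": sum(len(t) for t in tokens) / n,
        # Runs of two or more sentence-final marks, e.g. "!!!" or "..."
        "multi_punct": len(re.findall(r"[!?.]{2,}", text)),
    }


features = extract_features("I never said you were wrong!!! Not me...")
```

Contractions such as "don't" are not counted as negations here; handling them would need a tokenizer that splits clitics.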
NLP-Technique-Based Features: Discourse Parsing
● Ran discourse parsing on all the essay data
● Created RST-style discourse trees
● Extracted the main nucleus text from the data
● Extracted the relation counts from the RST trees
● Normalized the relation counts
● Constructed the feature vector to include the discourse relation counts
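The last three steps above (count, normalize, build the vector) might look like the sketch below. The relation inventory is a hypothetical subset, and dividing each count by the total number of relations in the document is an assumed normalization; the slides do not state which scheme was used.

```python
from collections import Counter

# Hypothetical subset of RST relation labels, in a fixed feature order.
RELATIONS = ["Elaboration", "Contrast", "Cause", "Attribution"]


def relation_features(relations, inventory=RELATIONS):
    """Map the relation labels found in one RST tree to a normalized vector."""
    counts = Counter(relations)
    total = sum(counts[r] for r in inventory)
    if total == 0:
        return [0.0] * len(inventory)
    # Assumption: normalize by the number of relations in this document.
    return [counts[r] / total for r in inventory]


vector = relation_features(["Elaboration", "Elaboration", "Contrast", "Cause"])
```

Keeping the inventory in a fixed order means every document yields a vector of the same length, which can be appended to the LIWC and textual features.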
Expansion of LIWC Word Set
Seeded LDA and Word2Vec Methods
Expansion of LIWC: Seeded LDA
● Seeded LDA treats each document as a mixture of topics, and each topic as a probability distribution over words.
● A topic can be seeded with a given word by assigning an asymmetric prior probability to that word-topic pair.
● We implemented seeded LDA with the gensim package through its eta parameter; however, it did not give better results, due to overfitting.
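One way to build the asymmetric word-topic prior described above is sketched here in plain Python. The function name build_seeded_eta and the base and boost values are hypothetical; the slides do not give the values actually used.

```python
def build_seeded_eta(vocab, seed_words, base=0.01, boost=1.0):
    """Build a topics x vocabulary prior that favours seed words.

    vocab: list of words, where the list index is the word id.
    seed_words: one collection of seed words per topic.
    """
    word_id = {word: i for i, word in enumerate(vocab)}
    # Start from a small symmetric prior for every word-topic pair.
    eta = [[base] * len(vocab) for _ in seed_words]
    for topic, seeds in enumerate(seed_words):
        for word in seeds:
            if word in word_id:  # ignore seeds missing from the vocabulary
                eta[topic][word_id[word]] = boost
    return eta


vocab = ["happy", "sad", "joy", "cry"]
eta = build_seeded_eta(vocab, [{"happy"}, {"sad"}])

# Assumed usage with gensim: convert with numpy.array(eta) and pass it as the
# `eta` argument of LdaModel, which accepts a (num_topics, num_terms) matrix.
```

Raising the prior for a seed word in exactly one topic nudges that topic toward the seed's LIWC category while leaving the other topics unconstrained.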
Expansion of LIWC: Word2Vec
● Applied Word2Vec modelling on the Quora dump
● Found the most similar words for each word present under the tag
● Compared the similarity with the 1 Billion Wiki text
● Added the most similar words thus found to a new LIWC dictionary
● Trained the models on the new LIWC dictionary
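The dictionary-expansion step above can be illustrated with cosine similarity over word vectors. In this sketch the vectors are tiny hand-made stand-ins for trained Word2Vec embeddings, and the top_n and min_sim cut-offs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_category(seed_words, vectors, top_n=2, min_sim=0.7):
    """Add each seed's nearest neighbours to a LIWC category word set."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        for sim, word in scored[:top_n]:
            if sim >= min_sim:  # keep only sufficiently close neighbours
                expanded.add(word)
    return expanded


# Toy 2-d embeddings, for illustration only.
vectors = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "sad": [0.0, 1.0],
    "table": [-1.0, 0.0],
}
posemo = expand_category({"happy"}, vectors, top_n=1)
```

A similarity threshold in addition to the top-n cut-off guards against adding unrelated words when a seed has no close neighbours in the corpus.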
Expansion of Posemo, Negemo, Funct-words
● Added more positive and negative words [1]
● Added more function words [2]
1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
2. Leah Gilner and Franc Morales at Sequence Publishing (http://www.sequencepublishing.com), for their list of English function words
User Survey
Survey Method
● Used a 10-question questionnaire, the BFI-10
● Contacted Quora users with more than 30 answers
● 50 users filled in the survey
● Calculated a personality score between 1 and 10 for each of the 5 personality traits
Extraction of Data
● Wrote a Python script to crawl all the answers of these users
● Sanitized the answers
● Pruned all answers with fewer than 200 words
● Labelled the resulting dataset with the survey results
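The pruning and labelling steps above amount to something like the following sketch. The data layout (dicts keyed by a user id) and the function name are assumptions made for illustration; the crawling itself is not shown.

```python
def build_labelled_dataset(answers_by_user, labels_by_user, min_words=200):
    """Pair each sufficiently long answer with its author's survey labels."""
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in labels_by_user:
            continue  # skip users who did not fill in the survey
        for text in answers:
            # Whitespace split is a simple stand-in for a real tokenizer.
            if len(text.split()) >= min_words:
                dataset.append((text, labels_by_user[user]))
    return dataset


answers_by_user = {
    "u1": ["word " * 250, "short answer"],
    "u2": ["word " * 300],
}
labels_by_user = {"u1": {"openness": 7}}
dataset = build_labelled_dataset(answers_by_user, labels_by_user)
```

Every answer inherits its author's trait scores, so one surveyed user can contribute many labelled examples.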
Results
Only LIWC Features on labelled Essays
                   SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  59.9271 %  59.1167 %  51.5397 %  55.1053 %
Conscientiousness  55.4295 %  55.3485 %  55.3485 %  50.8104 %  53.4441 %
Extraversion       54.5786 %  54.7812 %  54.8622 %  51.7423 %  53.201 %
Agreeableness      55.1459 %  53.7682 %  56.0778 %  53.0794 %  54.4165 %
Neuroticism        55.9968 %  56.1183 %  54.3355 %  50.0405 %  52.5932 %
LIWC Features + New Extracted Features on labelled Essays
                   SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  60.3728 %  58.59 %    51.9854 %  57.7391 %
Conscientiousness  56.3614 %  55.0243 %  55.2269 %  51.2156 %  53.282 %
Extraversion       55.1864 %  55.5105 %  55.5105 %  51.2561 %  52.5122 %
Agreeableness      54.9028 %  53.6872 %  56.7261 %  53.0389 %  52.107 %
Neuroticism        56.9692 %  57.7391 %  54.0924 %  50.6888 %  51.7828 %
Expanded LIWC + New Extracted Features on labelled Essays
                   SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.1831 %  61.7504 %  59.8865 %  53.0794 %  56.3209 %
Conscientiousness  55.5105 %  54.6191 %  53.8088 %  51.5802 %  51.6613 %
Extraversion       54.2139 %  54.3355 %  55.8752 %  52.0259 %  50.6078 %
Agreeableness      55.3485 %  54.2139 %  54.2139 %  51.6613 %  51.7423 %
Neuroticism        57.577 %   56.6856 %  54.8622 %  51.2561 %  51.9449 %
Expanded LIWC + New Extracted Features + Discourse Relations on labelled Essays
                   SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.4336 %  60.2726 %  58.9601 %  52.3473 %  57.2943 %
Conscientiousness  56.4866 %  55.679 %   53.054 %   51.5901 %  51.2367 %
Extraversion       54.1141 %  53.4578 %  55.5275 %  52.5492 %  53.054 %
Agreeableness      56.7895 %  56.9914 %  54.5684 %  53.8617 %  54.8208 %
Neuroticism        56.84 %    57.0419 %  53.8617 %  53.5083 %  53.3569 %
Only LIWC Features on Labelled Quora Dataset
                   SMO        Logistic   Adaboost   SVM        Random Forest
Openness           74.8971 %  74.8971 %  70.3704 %  70.535 %   71.6049 %
Conscientiousness  68.9712 %  66.9136 %  68.9712 %  68.9712 %  69.7942 %
Extraversion       76.2963 %  76.7078 %  76.2963 %  76.2963 %  78.93 %
Agreeableness      67.8189 %  66.5844 %  63.4568 %  63.4568 %  66.1728 %
Neuroticism        72.9218 %  71.8519 %  72.9218 %  72.9218 %  71.7695 %
Expanded LIWC + Features on Quora Dataset
                   SMO        Logistic   Adaboost   Adaboost (random forest)  SVM        Random Forest
Openness           75.3909 %  73.9095 %  72.7572 %  77.284 %                  71.0288 %  74.7325 %
Conscientiousness  70.1235 %  67.572 %   68.9712 %  73.6626 %                 68.5597 %  71.2757 %
Extraversion       76.3786 %  77.284 %   76.2963 %  80.4115 %                 77.9424 %  79.7531 %
Agreeableness      66.9959 %  67.1605 %  63.4568 %  69.3827 %                 64.1975 %  66.0905 %
Neuroticism        73.0041 %  70.0412 %  72.9218 %  75.2263 %                 71.9342 %  72.3457 %
Future Work
● Expand LIWC using more unlabelled Quora data
● Gather richer labelled Quora data by conducting paid personality surveys
● Evaluate on more labelled Quora data
● Leverage the discourse parser output to generate better discourse features
● Add more linguistic features by identifying patterns in Quora answers
Thank You