Information Retrieval for Sentiment Analysis
Weighting Schemes for Enhancing Classification Accuracy
Krithika Verma (kverma2@umbc.edu)
CMSC-676 Information Retrieval
Introduction
• Sentiment analysis (SA) → a mechanism for processing vast amounts of information and extracting insightful content.
• The information comes in the form of blogs, tweets, social networks, reviews, etc.
• Ex: government sector, BI for politics, market research
• What is classification in SA?
➢ Identify whether a text is subjective or objective
➢ Determine the polarity of a given text, i.e. positive, negative or neutral
• Document representation → an important part of SA
Delta TFIDF for SA
• Words are weighted using the difference between their tf.idf scores in the positive and negative class documents.
• An SVM is used with delta tf.idf to show the improved accuracy
• Why SVM?
• Method
➢ Assigns feature values for a document based on the difference of each word's tf.idf scores in the positive and negative training corpora.
• Features more prominent in the
➢ negative training set than the positive training set → positive score
➢ positive training set than the negative training set → negative score
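The weighting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `delta_tfidf`, the tokenized-list input format, and the `+ 1` added to each document frequency (a guard against division by zero, which the original method does not include) are all assumptions made for the example.

```python
import math
from collections import Counter


def delta_tfidf(doc_tokens, pos_docs, neg_docs):
    """Return {term: delta tf.idf weight} for one tokenized document.

    pos_docs and neg_docs are lists of tokenized training documents
    from the positive and negative class respectively.
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    # Document frequency of every term within each class.
    df_pos = Counter(t for d in pos_docs for t in set(d))
    df_neg = Counter(t for d in neg_docs for t in set(d))

    weights = {}
    for term, count in Counter(doc_tokens).items():
        # ASSUMPTION: the "+ 1" is an illustrative guard against zero
        # document frequencies; it is not part of the original scheme.
        idf_pos = math.log2(n_pos / (df_pos[term] + 1))
        idf_neg = math.log2(n_neg / (df_neg[term] + 1))
        # Difference of the term's tf.idf scores in the two corpora.
        weights[term] = count * (idf_pos - idf_neg)
    return weights


pos = [["great", "movie"], ["great", "plot"]]
neg = [["awful", "movie"], ["awful", "plot"]]
w = delta_tfidf(["great", "great", "movie", "awful"], pos, neg)
```

In this toy corpus "great" (positive-only) receives a negative weight, "awful" (negative-only) a positive weight, and "movie" (evenly spread) a weight of zero, matching the sign convention on the slide.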
Delta tf.idf features are more sentimental
• Created a subjectivity detector using Pang and Lee's subjectivity dataset
• The transformation yields a clear improvement over the baseline bag of words at the 99.9% confidence level
Related Work: Simple tf.idf and Delta tf.idf
• Using tf in classification results in decreased accuracy
• idf has no additional class preference.
• To solve this issue, delta tf.idf was introduced by Martineau and Finin
• Delta tf.idf performs better than the simple tf and binary weighting schemes
• Limitations of delta tf.idf
➢ does not account for the non-linearity of tf to document relevancy
➢ applies no smoothing to the df_i,j factor
SMART NOTATION AND BM25
• In SMART notation, a term weight is a function of three factors, written as a triple:
➢ term frequency factor (local factor)
➢ inverse document frequency factor (global factor)
➢ normalization
• Ex: SMART.bnn → binary document representation; SMART.nnn → raw term frequency based representation
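The two example codes can be illustrated with a small Python sketch. The function names `smart_bnn` and `smart_nnn` and the fixed vocabulary list are hypothetical, chosen only for this example; in both codes the second and third letters are "n", meaning no idf factor and no normalization are applied.

```python
from collections import Counter


def smart_nnn(tokens, vocab):
    """Raw term frequency vector: natural tf, no idf, no normalization."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def smart_bnn(tokens, vocab):
    """Binary vector: 1 if the term occurs, no idf, no normalization."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


vocab = ["good", "bad", "film"]      # hypothetical vocabulary
doc = ["good", "good", "film"]       # hypothetical tokenized document
raw_vector = smart_nnn(doc, vocab)
binary_vector = smart_bnn(doc, vocab)
```

The only difference between the two representations is the local factor: repeated occurrences of "good" count twice under nnn and once under bnn.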
EXTENDING SMART NOTATION AND BM25 WITH DELTA TF.IDF
• The smoothing factor is added to the product of df_i with N_i rather than to df_i alone
• The best accuracy of 96.90% is attained using BM25 tf weights with the BM25 delta idf variant
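A rough Python sketch of the two ingredients named on this slide follows. It rests on several assumptions that the slide does not state: the smoothing constant 0.5, the assignment of class indices 1 and 2, the BM25 defaults `k1=1.2` and `b=0.75`, and the function names themselves. It shows only the smoothed delta idf and the standard BM25 tf factor, not the full BM25 delta idf variant.

```python
import math


def smoothed_delta_idf(n1, df1, n2, df2, s=0.5):
    """Delta idf with smoothing added to the N * df products.

    n1, n2: number of training documents in class 1 and class 2
    df1, df2: number of documents containing the term in each class
    s: smoothing constant (ASSUMPTION: 0.5 is illustrative)
    """
    # The smoothing is applied to the products n1*df2 and n2*df1,
    # not to df1 or df2 alone, so zero counts stay well defined.
    return math.log2((n1 * df2 + s) / (n2 * df1 + s))


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 term frequency factor (saturating, length-normalized).

    ASSUMPTION: k1 and b are common defaults, not values from the slides.
    """
    length_norm = (1 - b) + b * doc_len / avg_doc_len
    return ((k1 + 1) * tf) / (k1 * length_norm + tf)


# A term seen in 10 of 100 class-2 documents and never in class 1.
delta = smoothed_delta_idf(n1=100, df1=0, n2=100, df2=10)
weight = bm25_tf(tf=3, doc_len=120, avg_doc_len=100) * delta
```

The BM25 tf factor grows sub-linearly with the raw count, and the smoothed ratio remains finite for terms missing from one class.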
Related Supervised Term Weighting Methods
• Text classification → supervised learning task
• Ex: inverse category frequency
➢ the fewer the categories in which a term occurs, the greater the discriminating power of the term
• Fundamental elements of supervised term weighting are used to compute the global importance of a term t_i for a category C_k
• Relevance frequency → considers the term's distribution in the positive and negative examples: in multi-label text categorization, the more a term is concentrated in positive examples rather than negative ones, the greater its contribution to categorization
Supervised Variant of tf.idf
• Idea → avoid decreasing the weight of terms contained in documents belonging to the same category, unlike what happens in idf.
• idfec (inverse document frequency excluding category)
• D_T \ C_k → training documents not labeled with C_k
• How does it help?
➢ Improves classification effectiveness
➢ A term that appears in documents outside category C_k is penalized, as in tf.idf
➢ A term appearing in category C_k retains a high global weight
• Similar to tf.rf, as both penalize the weight of a term t_i according to the number of negative examples in which t_i appears
Example
• A corpus of 100 training documents containing two terms t_1 and t_2, with category C_k (base-10 logarithms)
• For term t_1
➢ idf(t_1) = log(100/(27 + 5)) = log(3.125) ≈ 0.49
➢ rf(t_1, C_k) = log(2 + 27/5) = log(7.4) ≈ 0.87
➢ idfec(t_1, C_k) = log((65 + 5)/5) = log(14) ≈ 1.15
• For term t_2
➢ idf(t_2) = log(2.857) ≈ 0.46
➢ rf(t_2, C_k) = log(2.4) ≈ 0.38
➢ idfec(t_2, C_k) = log(2.8) ≈ 0.45
• Conclusion: t_1 is the more relevant term for classifying documents into C_k. rf and idfec give it a clearly higher weight than t_2, whereas idf rates the two terms almost equally.
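The arithmetic on this slide can be checked with a short Python sketch. The function names are hypothetical, base-10 logarithms are assumed because they match the slide's values, and the idfec here follows the slide's arithmetic without any extra smoothing term. The counts for t_2 (10 documents inside C_k, 25 outside) are not stated on the slide; they are inferred from the ratios 2.857, 2.4 and 2.8 under the assumption that 70 of the 100 documents lie outside C_k, as the t_1 line implies.

```python
import math


def idf(n_docs, in_cat_with_term, out_cat_with_term):
    """Classic idf: ignores which category the documents belong to."""
    return math.log10(n_docs / (in_cat_with_term + out_cat_with_term))


def rf(in_cat_with_term, out_cat_with_term):
    """Relevance frequency: ratio of positive to negative occurrences."""
    return math.log10(2 + in_cat_with_term / max(1, out_cat_with_term))


def idfec(out_cat_total, out_cat_with_term):
    """idf excluding category: computed only over documents outside C_k."""
    return math.log10(out_cat_total / max(1, out_cat_with_term))


N_DOCS = 100     # corpus size from the slide
OUT_CAT = 70     # documents outside C_k (65 + 5 on the slide)

# (documents in C_k with the term, documents outside C_k with the term)
terms = {"t1": (27, 5), "t2": (10, 25)}   # t2 counts are inferred

for name, (inside, outside) in terms.items():
    print(name,
          round(idf(N_DOCS, inside, outside), 2),
          round(rf(inside, outside), 2),
          round(idfec(OUT_CAT, outside), 2))
```

Because idfec looks only at documents outside the category, the 27 in-category occurrences of t_1 do not lower its weight, which is the behaviour the supervised variant is designed to have.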
Related Work
• Delta TFIDF outperforms the baseline at the 95% confidence level on a two-tailed t-test.
➢ Pang and Lee's approach requires an additional trained SVM subjectivity classifier, which requires even more labeled data
➢ Created a subjectivity detector using Pang and Lee's subjectivity dataset
➢ The transformation yields a clear improvement over the baseline bag of words at the 99.9% confidence level
• Reviewed existing unsupervised and supervised term weighting methods and proposed a novel solution as a modification of the classic tf.idf scheme.
Future Work
• Plan to test the proposed weighting functions in other domains, such as topic classification, and to extend the approach to multi-class classification.
• Improve the technique using redundancy, as redundancy is more effective than idf weights
• Plan to test further variants of the supervised scheme and to run tests on larger datasets
References
• Georgios Paltoglou and Mike Thelwall. A Study of Information Retrieval Weighting Schemes for Sentiment Analysis. https://www.aclweb.org/anthology/P10-1141/
• Justin Martineau and Tim Finin. Delta TFIDF: An Improved Feature Space for Sentiment Analysis. https://www.researchgate.net/publication/221298092_Delta_TFIDF_An_Improved_Feature_Space_for_Sentiment_Analysis
• A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf. Conference paper. https://www.researchgate.net/publication/299278964_A_Comparison_of_Term_Weighting_Schemes_for_Text_Classification_and_Sentiment_Analysis_with_a_Supervised_Variant_of_tfidf
• Supervised Term Weighting Metrics for Sentiment Analysis in Short Text. https://arxiv.org/abs/1610.03106