Semi-Supervised Sentiment Analysis in Hindi
Naman Bansal, Umair Z Ahmed
MOTIVATION
Why sentiment analysis?
• Labeling reviews with their sentiment provides succinct summaries to readers
• Helpful in business intelligence applications, recommender systems, message filtering, …
Why semi-supervised?
Problems with supervised polarity classification systems:
• Typically domain-specific
• Annotating a large amount of data is expensive (especially for low-resource languages)
PREVIOUS WORKS
• Dasgupta and Ng (2009) first mine the unambiguous reviews using spectral techniques, then exploit them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning.
• Joshi et al. (2010) created H-SWN using English SentiWordNet and English-Hindi WordNet linking.
• Bakliwal et al. (2012) created a Hindi subjective lexicon, using Hindi WordNet to assign similar polarity to synonyms and opposite polarity to antonyms.
PREVIOUS WORKS ON INDIAN LANGUAGES
Sentiment analysis for Indian languages has primarily focused on using:
• Machine translation to translate the data in English to Hindi
• Bilingual dictionaries for English and Indian languages
• Hindi WordNet expansion to exploit synonym and antonym polarity
Unsupervised and semi-supervised sentiment analysis techniques remain under-investigated in NLP.
DATASET
IIT Bombay Movie Review Dataset
• Open source
• 300 reviews (150 + 150)
IIIT Hyderabad Product Review Dataset
• On request
• 700 reviews (350 + 350)
Our contribution
• Building a movie review dataset from jagran.com
DATASET
<movie sentiment="neg" star="2" link="http://www.jagran.com/entertainment/reviews-mickey-virus-movie-review-10821431.html">
<review> चर्चित टीवी एंकर मनीष पॉल की इस फिल्म से बहुत उम्मीदें थीं। … </review>
<SelectedLines>
<line sentiment="pos"> मिकी वाइरस पूरी तरह से मनीष पॉल की फिल्म है और फिल्म में उनकी इमेज के हिसाब से ही दृश्य और स्थितियां रची गई थीं। मनीष ने अपने किरदार को बखूबी निभाया है। </line>
<line sentiment="neg"> फिल्म देखने के बाद न सिर्फ उम्मीदें धराशायी हुई बल्कि अच्छे खासे विषय को यूं ही जाया हो जाने का अफसोस भी हो रहा है। उनके अभिनय में इंटेंसिटी तो है लेकिन किरदार स्टीरियोटाइप होते जाए तो अच्छा अभिनेता भी बोर कर सकता है। </line>
</SelectedLines>
</movie>
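A record in this layout can be read with Python's standard XML parser. The sketch below is illustrative only: the sample string, its placeholder text and URL, and the helper name `parse_movie` are assumptions, not part of the released dataset or its tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical record mirroring the layout above; text and link are placeholders.
SAMPLE = """<movie sentiment="neg" star="2" link="http://example.invalid/review.html">
  <review>full review text ...</review>
  <SelectedLines>
    <line sentiment="pos">a positive sentence</line>
    <line sentiment="neg">a negative sentence</line>
  </SelectedLines>
</movie>"""


def parse_movie(xml_text):
    """Return (review_label, review_text, [(line_label, line_text), ...])."""
    root = ET.fromstring(xml_text)
    review_text = (root.findtext("review") or "").strip()
    # Each <line> carries its own sentence-level sentiment label.
    lines = [(ln.get("sentiment"), (ln.text or "").strip())
             for ln in root.iter("line")]
    return root.get("sentiment"), review_text, lines


label, review_text, lines = parse_movie(SAMPLE)
```

Note that the parser requires quoted attribute values, so a record with an unquoted `link` would need cleaning first.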
PRE-PROCESSING DATA
Remove:
• Punctuation
• Numbers
• Words of length one
• Words that occur in only a single review
• Words with high document frequency, many of which are stopwords or domain-specific general-purpose words
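The filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, whitespace tokenisation, measuring word length in Unicode code points, and the `max_df=0.5` cut-off are all assumptions (the slides give no threshold).

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Blank out punctuation and digits, split on whitespace, drop 1-char words."""
    # Unicode categories starting with P (punctuation, incl. the danda) or
    # N (numbers, incl. Devanagari digits) are replaced by spaces.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PN" else ch for ch in text
    )
    # Assumption: "length one" is measured in code points.
    return [tok for tok in cleaned.split() if len(tok) > 1]


def preprocess(reviews, max_df=0.5):
    """Return one token list per review after document-frequency filtering."""
    docs = [tokenize(r) for r in reviews]
    # Document frequency: number of reviews containing each word.
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    keep = {tok for tok, count in df.items()
            if count >= 2 and count / n <= max_df}  # assumed cut-off
    return [[tok for tok in doc if tok in keep] for doc in docs]


demo = preprocess(
    ["good film, 10/10 a", "good plot bad film", "bad acting good", "film plot nice"]
)
```

Character categories are used instead of the regex class `\w` because `\w` does not match Devanagari vowel signs, which would split Hindi words.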
DATA REPRESENTATION
Each review is represented as a vector of unigrams, with a binary weight of 1 for each term present in the review.
The dataset is represented as a matrix $X = [x^1, x^2, \dots, x^{R+T}]$ with one $D$-dimensional column per review (orientation assumed here), where R is the number of training samples, T is the number of test samples, and D is the number of feature words in the dataset.
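A binary unigram representation can be built as in the sketch below. The function name and the layout (rows are feature words, columns are reviews) are assumptions chosen for illustration.

```python
def binary_unigram_matrix(token_lists):
    """Return (X, vocab): X[d][j] is 1 if word vocab[d] occurs in review j."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    row = {tok: i for i, tok in enumerate(vocab)}
    # D rows (feature words) by R+T columns (reviews), all zeros to start.
    X = [[0] * len(token_lists) for _ in vocab]
    for col, toks in enumerate(token_lists):
        for tok in toks:
            X[row[tok]][col] = 1  # presence only; repeated words still give 1
    return X, vocab


X, vocab = binary_unigram_matrix([["plot", "bad"], ["bad"], ["plot", "plot"]])
```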
PROPOSED APPROACH
Deep Learning Architecture
• One input layer $h^0$
• N hidden layers $h^1, h^2, \dots, h^N$
• One output layer
The input layer $h^0$ has D units, equal to the number of features of a sample $x$.
We seek the mapping function $X_L \to Y_L$ using the L labeled data and the R+T−L unlabeled data.
PROPOSED APPROACH
The semi-supervised learning method based on the ADN architecture is divided into two stages:
First, the ADN architecture is constructed by greedy layer-wise unsupervised learning, using RBMs as building blocks. All the unlabeled data, together with the L labeled data, are used to find the parameter space W with N layers.
Second, the ADN architecture is trained by gradient descent on an exponential loss function. The parameter space W is refined using only the L labeled data.
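PROPOSED APPROACH
In the standard RBM form, with $b^{k-1}$ and $c^k$ denoting the bias vectors of the two layers (symbols assumed here), the energy of the state $(h^{k-1}, h^k)$ is
$$E(h^{k-1}, h^k; \theta) = -\sum_{s}\sum_{t} w^k_{st}\, h^{k-1}_s h^k_t - \sum_{s} b^{k-1}_s h^{k-1}_s - \sum_{t} c^k_t h^k_t$$
The probability that the model assigns to $h^{k-1}$ is
$$P(h^{k-1}; \theta) = \frac{1}{Z(\theta)} \sum_{h^k} \exp\left(-E(h^{k-1}, h^k; \theta)\right)$$
where $Z(\theta) = \sum_{h^{k-1}} \sum_{h^k} \exp\left(-E(h^{k-1}, h^k; \theta)\right)$ denotes the normalizing constant and $\theta = (w^k, b^{k-1}, c^k)$.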
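PROPOSED APPROACH
The probability of turning on unit $t$ of layer $h^k$ is a logistic function of the states of $h^{k-1}$ and $w^k$:
$$p(h^k_t = 1 \mid h^{k-1}) = \mathrm{sigm}\Big(c^k_t + \sum_{s} w^k_{st}\, h^{k-1}_s\Big)$$
The probability of turning on unit $s$ of layer $h^{k-1}$ is a logistic function of the states of $h^k$ and $w^k$:
$$p(h^{k-1}_s = 1 \mid h^k) = \mathrm{sigm}\Big(b^{k-1}_s + \sum_{t} w^k_{st}\, h^k_t\Big)$$
Here $b^{k-1}_s$ and $c^k_t$ are bias terms (symbols assumed here). The logistic function is:
$$\mathrm{sigm}(\eta) = \frac{1}{1 + e^{-\eta}}$$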
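The two conditional probabilities can be turned into a small training step for one RBM layer. The sketch below uses one-step contrastive divergence (CD-1), a common way to train RBMs; the slides do not say which update rule was used, so CD-1, the learning rate, and all function names are assumptions.

```python
import math
import random


def sigm(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def p_hidden(v, W, c):
    """p(h_t = 1 | v) for each upper-layer unit t; W[s][t] links unit s to t."""
    return [sigm(c[t] + sum(W[s][t] * v[s] for s in range(len(v))))
            for t in range(len(c))]


def p_visible(h, W, b):
    """p(v_s = 1 | h) for each lower-layer unit s."""
    return [sigm(b[s] + sum(W[s][t] * h[t] for t in range(len(h))))
            for s in range(len(b))]


def sample(probs, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, W, b, c, lr=0.1, rng=random):
    """One CD-1 step on a single binary vector v0; updates W, b, c in place."""
    ph0 = p_hidden(v0, W, c)
    h0 = sample(ph0, rng)
    v1 = sample(p_visible(h0, W, b), rng)   # one-step reconstruction
    ph1 = p_hidden(v1, W, c)
    for s in range(len(b)):
        for t in range(len(c)):
            W[s][t] += lr * (v0[s] * ph0[t] - v1[s] * ph1[t])
        b[s] += lr * (v0[s] - v1[s])
    for t in range(len(c)):
        c[t] += lr * (ph0[t] - ph1[t])


# Tiny demo: 3 lower-layer units, 2 upper-layer units, zero-initialised.
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0] * 3
c = [0.0] * 2
before = p_hidden([1, 0, 1], W, c)
cd1_update([1, 0, 1], W, b, c, rng=random.Random(0))
```

In greedy layer-wise pre-training, the activations of one trained layer would serve as the input vectors for the next.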
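PROPOSED APPROACH
The optimization problem is formulated over the L labeled reviews. With $o_j(x^i)$ denoting output unit $j$ for labeled review $x^i$ and $y^i_j \in \{-1, +1\}$ its target (notation assumed here), a form consistent with the exponential loss is
$$\min_{W} \; f(X_L, Y_L) = \sum_{i=1}^{L} \sum_{j} T\big(o_j(x^i)\, y^i_j\big)$$
The loss function is defined as
$$T(r) = \exp(-r)$$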
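As a rough illustration of fine-tuning with an exponential loss, the sketch below computes the loss and its gradient with respect to the network outputs. It assumes the common form exp(−output × target) with targets in {−1, +1}; the exact formulation on the original slide is not available, and the function names are hypothetical.

```python
import math


def exp_loss(outputs, labels):
    """Sum of exp(-o * y) over all samples and output units (y is -1 or +1)."""
    return sum(math.exp(-o * y)
               for out_row, lab_row in zip(outputs, labels)
               for o, y in zip(out_row, lab_row))


def exp_loss_grad(outputs, labels):
    """Gradient of exp_loss with respect to each output value."""
    return [[-y * math.exp(-o * y) for o, y in zip(out_row, lab_row)]
            for out_row, lab_row in zip(outputs, labels)]


# One labeled review, two output units (e.g. positive / negative).
labels = [[1, -1]]
loss_at_zero = exp_loss([[0.0, 0.0]], labels)
grad_at_zero = exp_loss_grad([[0.0, 0.0]], labels)
```

Outputs that agree in sign with their targets shrink the loss toward zero, while confident mistakes are penalised exponentially; gradient descent would propagate this gradient back through the layers to update W.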