
Semi-Supervised Sentiment Analysis in Hindi (Naman Bansal, Umair Z Ahmed) - PowerPoint PPT Presentation



  1. Semi-Supervised Sentiment Analysis in Hindi
     Naman Bansal, Umair Z Ahmed

  2. MOTIVATION
     Why sentiment analysis?
       • Labeling reviews with their sentiment provides succinct summaries to readers
       • Helpful in business intelligence applications, recommender systems, message filtering, …
     Why semi-supervised? Problems with supervised polarity classification systems:
       • Typically domain-specific
       • Annotating a large amount of data is expensive (especially for low-resource languages)

  3. PREVIOUS WORKS
       • Dasgupta and Ng (2009) first mine the unambiguous reviews using spectral techniques, then exploit them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning.
       • Joshi et al. (2010) created H-SWN using English SentiWordNet and English-Hindi WordNet linking.
       • Bakliwal et al. (2012) created the Hindi Subjective Lexicon and used Hindi WordNet to assign similar polarity to synonyms and opposite polarity to antonyms.

  4. PREVIOUS WORKS ON INDIAN LANGUAGES
     Sentiment analysis for Indian languages has primarily relied on:
       • Machine translation to translate English data into Hindi
       • Bilingual dictionaries for English and Indian languages
       • Hindi WordNet expansion to exploit synonym and antonym polarity
     Unsupervised and semi-supervised sentiment analysis techniques are under-investigated in NLP.

  5. DATASET
     IIT Bombay Movie Review Dataset
       • Open source
       • 300 reviews (150 + 150)
     IIIT Hyderabad Product Review Dataset
       • On request
       • 700 reviews (350 + 350)
     Our contribution
       • Building a movie review dataset from jagran.com

  6. DATASET
     <movie sentiment="neg" star="2" link="http://www.jagran.com/entertainment/reviews-mickey-virus-movie-review-10821431.html">
       <review> चर्चित टीवी एंकर मनीष पॉल की इस फिल्म से बहुत उम्मीदें थीं। … </review>
       <SelectedLines>
         <line sentiment="pos"> मिकी वाइरस पूरी तरह से मनीष पॉल की फिल्म है और फिल्म में उनकी इमेज के हिसाब से ही दृश्य और स्थितियां रची गई थीं। मनीष ने अपने किरदार को बखूबी निभाया है। </line>
         <line sentiment="neg"> फिल्म देखने के बाद न सिर्फ उम्मीदें धराशायी हुई बल्कि अच्छे खासे विषय को यूं ही जाया हो जाने का अफसोस भी हो रहा है। उनके अभिनय में इंटेंसिटी तो है लेकिन किरदार स्टीरियोटाइप होते जाए तो अच्छा अभिनेता भी बोर कर सकता है। </line>
       </SelectedLines>
     </movie>

  7. PRE-PROCESSING DATA
     Remove:
       • Punctuation
       • Numbers
       • Words of length one
       • Words that occur in only a single review
       • Words with high document frequency, many of which are stopwords or domain-specific general-purpose words
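     The removal rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the `max_doc_freq` cutoff of 0.5, and measuring word length in Unicode code points are all assumptions, since the slides give no thresholds.

```python
import unicodedata
from collections import Counter

def tokenize(text):
    """Split on whitespace after blanking punctuation, symbols, and digits.

    Unicode categories are used instead of the regex class \\w so that
    Devanagari vowel signs and the virama stay attached to their words.
    """
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        # P* = punctuation (includes the danda), S* = symbols, N* = numbers
        cleaned.append(" " if category[0] in "PSN" else ch)
    return "".join(cleaned).split()

def preprocess(reviews, max_doc_freq=0.5):
    """Return (list of token sets, sorted vocabulary).

    max_doc_freq is a hypothetical cutoff for "high document frequency".
    """
    # Drop words of length one (length counted in code points: an assumption).
    docs = [{t for t in tokenize(r) if len(t) > 1} for r in reviews]
    doc_freq = Counter(word for doc in docs for word in doc)
    n_docs = len(docs)
    vocab = sorted(
        word for word, count in doc_freq.items()
        if count > 1                        # occurs in more than one review
        and count / n_docs <= max_doc_freq  # not too frequent
    )
    keep = set(vocab)
    return [doc & keep for doc in docs], vocab
```

     With a corpus as small as four reviews, the cutoff has to be loose; on the real datasets it would need tuning.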

  8. DATA REPRESENTATION
       • Each review is represented as a vector of unigrams, with a binary weight equal to 1 for terms present in the review.
       • The dataset is represented as a matrix, where R is the number of training samples, T is the number of test samples, and D is the number of feature words in the dataset.
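     A small sketch of the binary unigram representation described on this slide. The matrix orientation (D rows, one column per review, so D x (R+T)) is an assumption, because the matrix itself is not preserved in the transcript; `binary_matrix` is a hypothetical helper name.

```python
def binary_matrix(docs, vocab):
    """Build a D x (R+T) matrix of 0/1 weights.

    docs  : list of token sets, training reviews first, then test reviews
    vocab : list of the D feature words
    X[d][i] is 1 if vocab[d] occurs in review i, else 0.
    """
    return [[1 if word in doc else 0 for doc in docs] for word in vocab]

# Toy example: R = 2 training reviews, T = 1 test review, D = 4 feature words.
train = [{"good", "film"}, {"bad", "film"}]
test = [{"good", "plot"}]
vocab = ["bad", "film", "good", "plot"]
X = binary_matrix(train + test, vocab)
```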

  9. PROPOSED APPROACH: Deep Learning
     Deep learning architecture:
       • One input layer h^0
       • N hidden layers h^1, h^2, …, h^N
       • One output layer
     The input layer h^0 has D units, equal to the number of features of a sample x.
     We seek the mapping function X_L → Y_L using the L labeled data and the R+T−L unlabeled data.

  10. PROPOSED APPROACH
     The semi-supervised learning method based on the ADN architecture can be divided into two stages:
       • First, the ADN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks. All the unlabeled data together with the L labeled data are used to find the parameter space W with N layers.
       • Second, the ADN architecture is trained according to the exponential loss function using gradient descent. The parameter space W is retrained with an exponential loss function using the L labeled data.
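     The first stage can be sketched as below. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the slides do not say how each RBM is trained, so one-step contrastive divergence (CD-1) is assumed here, and the function names, learning rate, epoch count, and bias vectors `b` and `c` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train one RBM with CD-1 (an assumed training rule).

    data: array of shape (n_samples, n_visible) with values in [0, 1].
    Returns weights W (n_visible x n_hidden), visible bias b, hidden bias c.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and sampled states given the data.
        h_prob = sigmoid(data @ W + c)
        h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visible layer, then the hidden layer.
        v_recon = sigmoid(h_state @ W.T + b)
        h_recon = sigmoid(v_recon @ W + c)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b += lr * (data - v_recon).mean(axis=0)
        c += lr * (h_prob - h_recon).mean(axis=0)
    return W, b, c

def pretrain_stack(data, layer_sizes, **kwargs):
    """Greedy layer-wise pretraining: each RBM is trained on the
    hidden activations produced by the layer below it."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, **kwargs)
        params.append((W, b, c))
        layer_input = sigmoid(layer_input @ W + c)
    return params
```

     Labels are not used at this stage, which is why labeled and unlabeled reviews can be pooled. The second stage would then fine-tune the stacked weights on the L labeled reviews.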

  11. PROPOSED APPROACH
       • Energy of the state (h^{k-1}, h^k): [equation not preserved in the transcript]
       • The probability that the model assigns to h^{k-1}: [equation not preserved in the transcript], where Z(θ) denotes the normalizing constant.
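     For reference, the standard RBM energy and marginal probability have the form below, which this slide presumably follows. The bias symbols b_s and c_t and the index names s and t are assumptions; only h^{k-1}, h^k, w^k, θ, and Z(θ) appear in the surrounding text.

```latex
E(h^{k-1}, h^{k}; \theta)
  = -\sum_{s}\sum_{t} w^{k}_{st}\, h^{k-1}_{s} h^{k}_{t}
    - \sum_{s} b_{s}\, h^{k-1}_{s}
    - \sum_{t} c_{t}\, h^{k}_{t}

P(h^{k-1}; \theta)
  = \frac{1}{Z(\theta)} \sum_{h^{k}} \exp\bigl(-E(h^{k-1}, h^{k}; \theta)\bigr),
\qquad
Z(\theta) = \sum_{h^{k-1}} \sum_{h^{k}} \exp\bigl(-E(h^{k-1}, h^{k}; \theta)\bigr)
```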

  12. PROPOSED APPROACH
       • The probability of turning on unit t (in layer h^k) is a logistic function of the states of h^{k-1} and w^k.
       • The probability of turning on unit s (in layer h^{k-1}) is a logistic function of the states of h^k and w^k.
       • The logistic function is: sigm(η) = 1 / (1 + e^{−η})
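     Written out, the two conditional probabilities of a standard RBM are as follows. The formulas on the slide are not preserved, so this is a reconstruction under the usual RBM conventions; the bias terms b_s and c_t and the indices s (units of h^{k-1}) and t (units of h^k) are assumed notation.

```latex
p\bigl(h^{k}_{t} = 1 \mid h^{k-1}\bigr)
  = \operatorname{sigm}\Bigl(c_{t} + \sum_{s} w^{k}_{st}\, h^{k-1}_{s}\Bigr)

p\bigl(h^{k-1}_{s} = 1 \mid h^{k}\bigr)
  = \operatorname{sigm}\Bigl(b_{s} + \sum_{t} w^{k}_{st}\, h^{k}_{t}\Bigr)

\operatorname{sigm}(\eta) = \frac{1}{1 + e^{-\eta}}
```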

  13. PROPOSED APPROACH
       • The optimization problem is formulated as: [equation not preserved in the transcript]
       • The loss function is defined as: [equation not preserved in the transcript]
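     The exact formulas are missing here, but an earlier slide names an exponential loss minimized by gradient descent. A minimal sketch of that idea follows, assuming the common form exp(−y·f) with labels y in {+1, −1} and real-valued output scores f. The function names are hypothetical, and the gradient step is applied directly to the scores for illustration, whereas the actual method would backpropagate into the weights W.

```python
import math

def exponential_loss(outputs, labels):
    """Sum of exp(-y * f) over samples (assumed form of the loss)."""
    return sum(math.exp(-y * f) for f, y in zip(outputs, labels))

def exponential_loss_grad(outputs, labels):
    """Derivative of the loss with respect to each output score f."""
    return [-y * math.exp(-y * f) for f, y in zip(outputs, labels)]

def gradient_step(outputs, labels, lr=0.1):
    """One gradient descent step on the scores themselves (illustration only)."""
    grads = exponential_loss_grad(outputs, labels)
    return [f - lr * g for f, g in zip(outputs, grads)]

scores = [0.0, 0.0]
labels = [1, -1]
updated = gradient_step(scores, labels)
```

     Each step pushes a score toward the sign of its label, so the loss shrinks as correctly signed outputs grow in magnitude.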
