Introduction to Artificial Intelligence
CoreNLP, Semantic Analysis, Naive Bayes Classifier
Janyl Jumadinova
November 18, 2016
CoreNLP
◮ Reference: http://stanfordnlp.github.io/CoreNLP/
◮ Package available in /opt/corenlp/
◮ Run: java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file input.txt
CoreNLP Annotators
http://stanfordnlp.github.io/CoreNLP/annotators.html
◮ tokenize: Creates tokens from the given text.
◮ ssplit: Separates a sequence of tokens into sentences.
◮ pos: Creates part-of-speech (POS) tags for tokens.
◮ ner: Performs Named Entity Recognition classification.
CoreNLP Annotators
◮ lemma: Creates word lemmas for tokens.
  – The goal of lemmatization (as of stemming) is to reduce related forms of a word to a common base form.
  – Lemmatization usually uses a vocabulary and morphological analysis of words to:
    - remove inflectional endings only, and
    - return the base or dictionary form of a word, which is known as the lemma.
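The difference between lemmatization and stemming can be sketched with a toy example. This is only an illustration and not how the CoreNLP lemma annotator works internally: the lookup table `LEMMA_TABLE` and the suffix list in `crude_stem` are hypothetical and far too small for real use.

```python
# Toy illustration only: a real lemmatizer uses a full vocabulary and
# morphological analysis, not a hand-made table like this one.

# Hypothetical mini-vocabulary mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be", "was": "be",
    "cars": "car",
    "studies": "study", "studying": "study",
    "better": "good",
}

def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

def crude_stem(word):
    """Chop a few common suffixes without consulting any vocabulary."""
    w = word.lower()
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip when at least three characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

for w in ["studies", "studying", "are", "better"]:
    print(w, "-> lemma:", lemmatize(w), "| stem:", crude_stem(w))
```

The stemmer turns "studies" into the non-word "stud", while the lemmatizer returns the dictionary form "study" and can also map irregular forms such as "are" to "be".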
Sentiment Analysis
◮ https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
◮ http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
◮ www.sentiment140.com
Sentiment analysis has many other names
◮ Opinion extraction
◮ Opinion mining
◮ Sentiment mining
◮ Subjectivity analysis
Sentiment analysis is the detection of attitudes
◮ “enduring, affectively colored beliefs, dispositions towards objects or persons”
Attitudes
◮ Holder (source) of attitude
◮ Target (aspect) of attitude
◮ Type of attitude
  - From a set of types: like, love, hate, value, desire, etc.
  - Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength
◮ Text containing the attitude
  - Sentence or entire document
Sentiment analysis
◮ Simplest task: Is the attitude of this text positive or negative?
◮ More complex: Rank the attitude of this text from 1 to 5
◮ Advanced: Detect the target, source, or complex attitude types
Baseline Algorithm
◮ Tokenization
◮ Feature Extraction
◮ Classification using different classifiers
  – Naive Bayes
  – MaxEnt
  – SVM
Sentiment Tokenization Issues
◮ Deal with HTML and XML markup
◮ Twitter/Facebook/... mark-up (names, hash tags)
◮ Capitalization (preserve for words in all caps)
◮ Phone numbers, dates
◮ Emoticons
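A minimal sketch of a tokenizer that addresses the issues listed above, using only Python's standard `re` module. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the emoticon set, the phone number format, and the date format are assumptions chosen for illustration.

```python
import re

# Order matters: earlier alternatives win, so markup and emoticons are
# matched before the generic word pattern can split them apart.
TOKEN_RE = re.compile(
    r"""
    <[^>]+>                      # HTML/XML tags
  | [@\#]\w+                     # Twitter-style user names and hashtags
  | [:;=][\-o*']?[)\](\[dDpP]    # a few common Western emoticons (assumed set)
  | \d{3}-\d{3}-\d{4}            # US-style phone numbers (assumed format)
  | \d{1,2}/\d{1,2}/\d{2,4}      # numeric dates (assumed format)
  | \w+(?:'\w+)?                 # words, with an optional apostrophe part
  | [^\w\s]                      # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if tok.startswith("<") and tok.endswith(">"):
            continue                      # drop HTML/XML markup
        if tok.isalpha() and tok.isupper() and len(tok) > 1:
            tokens.append(tok)            # preserve words in all caps
        elif tok[0].isalpha():
            tokens.append(tok.lower())    # lowercase ordinary words
        else:
            tokens.append(tok)            # emoticons, names, numbers, punctuation
    return tokens

print(tokenize("<b>LOVED</b> It :) @janyl #nlp call 555-123-4567 on 11/18/2016!"))
```

Words in all caps keep their case because the emphasis can carry sentiment, while ordinary words are lowercased so that "It" and "it" count as the same feature.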
Extracting Features for Sentiment Classification
◮ How to handle negation: I didn’t like this movie vs. I really like this movie
◮ Which words to use?
  – Only adjectives
  – All words
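Negation
◮ Prepend NOT_ to every word between a negation word and the next punctuation mark
◮ Example: "didn't like this movie , but I" becomes "didn't NOT_like NOT_this NOT_movie , but I"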
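The negation rule above can be sketched in a few lines of Python. The function name `mark_negation`, the small `NEGATIONS` word list, and the set of punctuation marks that end a negated span are assumptions made for this illustration.

```python
import re

NEGATIONS = {"not", "no", "never", "cannot"}   # assumed, deliberately small
PUNCTUATION = re.compile(r"^[.,:;!?]$")        # marks that end a negated span

def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation and the next punctuation."""
    marked, negating = [], False
    for tok in tokens:
        if PUNCTUATION.match(tok):
            negating = False               # punctuation closes the negated span
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
            low = tok.lower()
            # Contractions may use a straight or a typographic apostrophe.
            if low in NEGATIONS or low.endswith(("n't", "n’t")):
                negating = True
    return marked

print(mark_negation("I didn't like this movie , but I liked the book".split()))
# ['I', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I', 'liked', 'the', 'book']
```

After marking, "like" and "NOT_like" are distinct features, so a classifier can learn opposite weights for them.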
Naive Bayes Algorithm
◮ Simple (“naive”) classification method based on Bayes' rule
◮ Relies on a very simple representation of the document:
  - Bag of words
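A bag of words keeps how often each word occurs and discards word order. A minimal sketch, assuming whitespace tokenization and lowercasing (the function name `bag_of_words` is hypothetical):

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercased, whitespace-separated word to its count."""
    return Counter(text.lower().split())

a = bag_of_words("I love this movie I love it")
b = bag_of_words("love it I this movie love I")

print(a)
print(a == b)   # True: the same words in a different order give the same bag
```

Because order is discarded, the two example sentences are indistinguishable to the classifier, which is the simplification the "naive" model accepts.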
Naive Bayes Algorithm
For a document d and a class c, Bayes' rule gives: P(c|d) = P(d|c) P(c) / P(d)
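A minimal sketch of a multinomial Naive Bayes classifier over bags of words. It picks the class c that maximizes log P(c) plus the sum of log P(w|c) over the words w of the document; P(d) is the same for every class and can be dropped. The add-one (Laplace) smoothing, the choice to skip unseen words, the class name `NaiveBayes`, and the four training sentences are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_scores(self, doc):
        """Return log P(c) + sum of log P(w|c) for every class c."""
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / self.total_docs)
            total = sum(self.word_counts[c].values())
            for w in doc.lower().split():
                if w not in self.vocab:
                    continue                       # assumption: ignore unseen words
                # Add-one smoothing keeps unseen (word, class) pairs above zero.
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                )
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

# Hypothetical toy training data.
docs = ["fantastic fun film", "great acting great plot",
        "boring plot", "terrible boring film"]
labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("fantastic great film"))
print(model.predict("boring terrible plot"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.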
Binarized (Boolean feature) Multinomial Naive Bayes
Intuition:
◮ Word occurrence may matter more than word frequency
◮ The occurrence of the word fantastic tells us a lot
◮ The fact that it occurs 5 times may not tell us much more.
Boolean Multinomial Naive Bayes
◮ Clips all the word counts in each document at 1
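Clipping the counts at 1 amounts to counting each word at most once per document before the counts are summed. A minimal sketch, assuming whitespace tokenization (the function name `binarized_counts` and the two example documents are hypothetical):

```python
from collections import Counter

def binarized_counts(docs):
    """Sum word counts over documents, counting each word once per document."""
    totals = Counter()
    for doc in docs:
        totals.update(set(doc.lower().split()))   # set() clips every count at 1
    return totals

docs = ["fantastic fantastic fantastic fantastic fantastic film",
        "fantastic plot"]

raw = Counter(w for d in docs for w in d.lower().split())
print(raw["fantastic"], binarized_counts(docs)["fantastic"])   # 6 2
```

With raw counts the first document contributes "fantastic" five times; after binarization each document contributes it once, so the total drops from 6 to 2.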
Neural Networks and Deep Learning: Next!
◮ http://nlp.stanford.edu/sentiment/
◮ java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.sentiment.SentimentPipeline -file input.txt