Bag-of-words
SENTIMENT ANALYSIS IN PYTHON
Violeta Misheva, Data Scientist

What is a bag-of-words (BOW)?
Describes the occurrence of words within a document or a collection of documents (corpus)
Builds a vocabulary of the words and a measure of their presence

Amazon product reviews

Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!
    {'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2, 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1, 'recommend': 1, 'it': 1}
Word order and grammar rules are lost!
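
The counts in this example can be reproduced with a few lines of plain Python. This is only a minimal sketch: the regular-expression tokenizer is an assumption made for illustration (it drops punctuation and keeps the original casing, as the slide does) and is not how any particular library tokenizes.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Split into word tokens, dropping punctuation; casing is kept as in the slide.
tokens = re.findall(r"\w+", review)

# Count how often each token occurs; word order is discarded at this point.
bow = Counter(tokens)

print(bow["the"], bow["best"], bow["book"])
```

Only "the" and "book" occur twice; every other word, including "best", occurs once.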

BOW end result
The output will look something like a table of word counts, with one row per review and one column per word in the vocabulary.

CountVectorizer function
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    # Keep only the 1000 most frequent words
    vect = CountVectorizer(max_features=1000)
    # Learn the vocabulary from the review column
    vect.fit(data.review)
    # Encode each review as a vector of word counts
    X = vect.transform(data.review)

CountVectorizer output
    X
    <10000x1000 sparse matrix of type '<class 'numpy.int64'>'
        with 406668 stored elements in Compressed Sparse Row format>

Transforming the vectorizer
    # Transform to an array
    my_array = X.toarray()
    # Transform back to a dataframe, assign column names
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())
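
To make the fit/transform/to-array steps concrete, here is a from-scratch sketch of the same idea on a tiny corpus. It is not scikit-learn's implementation; the three documents and the `tokenize` helper are made up for illustration. The token rule (lowercase, words of at least two characters) is chosen to resemble CountVectorizer's default behaviour.

```python
import re

# Hypothetical mini-corpus standing in for data.review
docs = ["I loved the book", "the book was boring", "loved it"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# "fit": the vocabulary is the sorted set of all tokens in the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# "transform": one row per document, one column per vocabulary word
matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a dense version of what the sparse matrix X stores; the vocabulary plays the role of the column names passed to the dataframe.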

Let's practice!

Getting granular with n-grams

Context matters
I am happy, not sad.
I am sad, not happy.
Putting 'not' in front of a word (negation) is one example of how context matters.

Capturing context with a BOW
Unigrams: single tokens
Bigrams: pairs of tokens
Trigrams: triples of tokens
n-grams: sequences of n tokens

Capturing context with BOW
The weather today is wonderful.
Unigrams: { The, weather, today, is, wonderful }
Bigrams: { The weather, weather today, today is, is wonderful }
Trigrams: { The weather today, weather today is, today is wonderful }
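
The unigrams, bigrams and trigrams of this sentence can be generated with a sliding window over the token list. The helper name `ngrams` is hypothetical and the whitespace split is a simplification; this is a sketch of the idea, not a library function.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of five tokens yields five unigrams, four bigrams and three trigrams, so each increase in n adds a new set of features.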

n-grams with the CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer(ngram_range=(min_n, max_n))
    # Only unigrams
    ngram_range=(1, 1)
    # Uni- and bigrams
    ngram_range=(1, 2)

What is the best n?
Longer sequence of tokens
    Results in more features
    Higher precision of machine learning models
    Risk of overfitting

Specifying vocabulary size
    CountVectorizer(max_features, max_df, min_df)
max_features: if specified, it will include only that many of the most frequent words in the vocabulary
    If max_features = None, all words will be included
max_df: ignore terms with a document frequency higher than the specified value
    If it is set to an integer, it is an absolute count of documents; if a float, it is a proportion of documents
    Default is 1.0, which means it does not ignore any terms
min_df: ignore terms with a document frequency lower than the specified value
    If it is set to an integer, it is an absolute count of documents; if a float, it is a proportion of documents
    Default is 1, which means it does not ignore any terms
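
The effect of the three parameters can be sketched in plain Python. The function `prune_vocabulary` and the four sample documents are hypothetical, and details such as tie-breaking are simplified, so treat this as an illustration of the rules above rather than a description of scikit-learn's internals.

```python
from collections import Counter

def prune_vocabulary(docs, max_features=None, max_df=1.0, min_df=1):
    n_docs = len(docs)
    # A float is a proportion of documents; an integer is an absolute count
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df

    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of documents containing each term
    for doc in docs:
        tokens = doc.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Drop terms that are too rare or too common across documents
    kept = [t for t in doc_freq if min_count <= doc_freq[t] <= max_count]

    # Keep only the most frequent terms if max_features is given
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = ["good book", "good movie", "bad book", "good good plot"]
print(prune_vocabulary(docs))
print(prune_vocabulary(docs, min_df=2))
print(prune_vocabulary(docs, max_df=0.5))
print(prune_vocabulary(docs, max_features=1))
```

With the defaults nothing is removed. Here min_df=2 keeps only words found in at least two documents, and max_df=0.5 drops "good" because it appears in three of the four documents.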

Build new features from text

Goal of the video
Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)

Product reviews data
    reviews.head()

Features from the review column
How long is each review?
How many sentences does it contain?
What parts of speech are involved?
How many punctuation marks?

Tokenizing a string
    from nltk import word_tokenize
    anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
    word_tokenize(anna_k)
    ['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']

Tokens from a column
    # General form of list comprehension
    [expression for item in iterable]
    # Tokenize every review in the column
    word_tokens = [word_tokenize(review) for review in reviews.review]
    type(word_tokens)
    list
    type(word_tokens[0])
    list

Tokens from a column
    len_tokens = []
    # Iterate over the word_tokens list
    for i in range(len(word_tokens)):
        len_tokens.append(len(word_tokens[i]))
    # Create a new feature for the length of each review
    reviews['n_tokens'] = len_tokens
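
The same length feature can be computed with a list comprehension instead of an index loop. In this sketch a small regular expression stands in for `word_tokenize` so that the example runs without NLTK; the helper `simple_tokenize` and the sample reviews are assumptions, and the real tokenizer may split some strings differently.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer: words, plus each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical stand-in for the reviews.review column
review_texts = [
    "Happy families are all alike, every unhappy family is unhappy in its own way.",
    "Great book!!!",
]

word_tokens = [simple_tokenize(review) for review in review_texts]

# One token count per review, ready to be assigned as a new column
n_tokens = [len(tokens) for tokens in word_tokens]
print(n_tokens)
```

For the first sentence the stand-in produces the same 16 tokens as the `word_tokenize` output on the earlier slide.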

Dealing with punctuation
We did not address punctuation here, but you can exclude it from the tokens
Alternatively, build a feature that measures the number of punctuation signs
A review with many punctuation signs could signal a very emotionally charged opinion
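
A punctuation-count feature of this kind could be sketched as follows. The function name `count_punctuation` and the sample reviews are made up for illustration, and "punctuation" is taken to mean the ASCII characters in Python's `string.punctuation`.

```python
import string

def count_punctuation(text):
    # Count characters that are ASCII punctuation marks
    return sum(1 for ch in text if ch in string.punctuation)

# Hypothetical stand-in for the reviews.review column
review_texts = [
    "I loved the book and highly recommend it!!!",
    "It was fine.",
    "Happy families are all alike, every unhappy family is unhappy in its own way.",
]

n_punct = [count_punctuation(review) for review in review_texts]
print(n_punct)
```

The resulting list can be assigned to a new column in the same way as the token counts.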

Reviews with a feature for the length
    reviews.head()

Can you guess the language?

Language of a string in Python
    from langdetect import detect_langs
    foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
    detect_langs(foreign)
    [es:0.9999945352697024]

Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in a new column
    from langdetect import detect_langs
    reviews = pd.read_csv('product_reviews.csv')
    reviews.head()

Building a feature for the language
    languages = []
    # Detect the language of the review text in each row (second column)
    for row in range(len(reviews)):
        languages.append(detect_langs(reviews.iloc[row, 1]))
    languages
    [it:0.9999982541301151],
    [es:0.9999954153640488],
    [es:0.7142833997345875, en:0.2857160465706441],
    [es:0.9999942365605781],
    [es:0.999997956049055] ...

Building a feature for the language
    # Transform the first list to a string and split on a colon
    str(languages[0]).split(':')
    ['[es', '0.9999954153640488]']
    str(languages[0]).split(':')[0]
    '[es'
    str(languages[0]).split(':')[0][1:]
    'es'

Building a feature for the language
    # Keep only the code of the most likely language for each review
    languages = [str(lang).split(':')[0][1:] for lang in languages]
    reviews['language'] = languages
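
The string-splitting step can also be written as a small parser that keeps the probabilities, which helps when a review has more than one candidate language. This sketch works only on the printed form of a detection result (the text shown on the slides); the function `top_language` is hypothetical. The langdetect result objects also appear to expose `lang` and `prob` attributes, but that is an assumption to check against the installed version.

```python
def top_language(result_repr):
    # result_repr is the printed form of one result, e.g. '[es:0.71, en:0.29]'
    candidates = []
    for item in result_repr.strip("[]").split(","):
        item = item.strip()
        if not item:
            continue  # nothing detected
        lang, prob = item.split(":")
        candidates.append((lang, float(prob)))
    if not candidates:
        return None
    # Pick the (language, probability) pair with the highest probability
    return max(candidates, key=lambda pair: pair[1])

printed_results = [
    "[es:0.9999954153640488]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]
languages = [top_language(r)[0] for r in printed_results]
print(languages)
```

Unlike slicing the first entry, this does not rely on the candidates being printed in order of probability.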
