Stop words

Stop words: Sentiment Analysis in Python, Violeta Misheva


  1. Stop words
     SENTIMENT ANALYSIS IN PYTHON
     Violeta Misheva, Data Scientist

  2. What are stop words and how to find them?
     - Stop words: words that occur too frequently and not considered informative
     - Lists of stop words in most languages
         {'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}
     - Context matters
         {'movie', 'movies', 'film', 'films', 'cinema'}
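
As a rough illustration of the idea above, the sketch below filters a tokenized sentence against a small stop word set and then extends that set with domain-specific words. The word sets and the sample sentence are made up for this example; real stop word lists are much longer, and whitespace splitting is only a stand-in for proper tokenization.

```python
# A tiny, hand-picked stop word set (illustrative, not a complete list)
general_stop_words = {"the", "a", "an", "and", "but", "for", "on", "in", "at"}

# Context matters: in movie reviews these words carry little information
movie_stop_words = general_stop_words | {"movie", "movies", "film", "films", "cinema"}

# Made-up review, tokenized by whitespace for simplicity
review = "the film was slow but the ending at the cinema moved me"
tokens = review.split()

# Keep only the tokens that are not in the chosen stop word set
without_general = [t for t in tokens if t not in general_stop_words]
without_movie = [t for t in tokens if t not in movie_stop_words]

print(without_general)
print(without_movie)
```

With the general set alone, "film" and "cinema" survive; once the context-specific words are added they are dropped as well.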

  3. Stop words with word clouds
     [Figure: word cloud, not removing stop words]
     [Figure: word cloud with stop words removed]

  4. Remove stop words from word clouds
         # Import libraries
         import matplotlib.pyplot as plt
         from wordcloud import WordCloud, STOPWORDS

         # Define the stopwords list
         my_stopwords = set(STOPWORDS)
         my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

         # Generate and show the word cloud
         # (name_string is the text to visualize, defined elsewhere)
         my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(name_string)
         plt.imshow(my_cloud, interpolation='bilinear')
         plt.axis('off')
         plt.show()

  5. Stop words with BOW
         from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

         # Define the set of stop words
         my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie', 'cinema', 'theatre'])

         # Recent scikit-learn versions expect a list here, not a frozenset
         vect = CountVectorizer(stop_words=list(my_stop_words))
         vect.fit(movies.review)
         X = vect.transform(movies.review)
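
To make the effect of `stop_words` concrete, here is a standard-library sketch of a bag-of-words count that skips stop words. It is not scikit-learn's implementation: the helper `bag_of_words`, the shortened stop word set, and the sample reviews are all hypothetical, and the tokenization is simplified to lowercase runs of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical, shortened stop word set for the illustration
stop_words = {"the", "was", "is", "and", "film", "movie", "cinema", "theatre"}


def bag_of_words(text, stop_words):
    """Count lowercase tokens of two or more word characters, skipping stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Made-up reviews
reviews = [
    "The movie was great and the cast was great",
    "The film is dull",
]

counts = [bag_of_words(review, stop_words) for review in reviews]
print(counts[0])
print(counts[1])
```

Only the words outside the stop word set end up as features, which is the same effect the `stop_words` argument has on the vocabulary of a vectorizer.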

  6. Let's practice!

  7. Capturing a token pattern

  8. String operators and comparisons
         # Checks if a string is composed only of letters
         my_string.isalpha()
         # Checks if a string is composed only of digits
         my_string.isdigit()
         # Checks if a string is composed only of alphanumeric characters
         my_string.isalnum()
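
A quick way to see how these three methods differ is to run them on a few contrasting strings. The sample strings below are made up for the illustration.

```python
# Made-up samples: letters only, digits only, mixed, punctuated, empty
samples = ["great", "2019", "part2", "so-so", ""]

# Map each sample to the tuple (isalpha, isdigit, isalnum)
checks = {s: (s.isalpha(), s.isdigit(), s.isalnum()) for s in samples}

for s, flags in checks.items():
    print(repr(s), flags)
```

Note that a hyphenated word such as "so-so" fails all three checks, and so does the empty string.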

  9. String operators with list comprehension
         from nltk import word_tokenize

         # Original word tokenization
         word_tokens = [word_tokenize(review) for review in reviews.review]
         # Keeping only tokens composed of letters
         cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

         len(word_tokens[0])
         87
         len(cleaned_tokens[0])
         78
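
The same nested list comprehension can be tried without any dependencies. In this sketch, whitespace splitting stands in for `word_tokenize` (an assumption made to keep the example self-contained), and the two reviews are invented.

```python
# Made-up reviews, pre-spaced so that whitespace splitting isolates punctuation
reviews = ["I rated it 10/10 !", "Not bad , but 2 hours too long ..."]

# Stand-in for nltk's word_tokenize: split on whitespace
word_tokens = [review.split() for review in reviews]

# Outer loop walks over reviews, inner loop keeps only purely alphabetic tokens
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

for original, cleaned in zip(word_tokens, cleaned_tokens):
    print(len(original), "->", len(cleaned), cleaned)
```

Numbers, punctuation, and mixed tokens such as "10/10" are dropped, which is why the cleaned lists are shorter than the original ones.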

  10. Regular expressions
          import re
          my_string = '#Wonderfulday'
          # Extract #, followed by any letter, small or capital
          x = re.search('#[A-Za-z]', my_string)
          x
          <re.Match object; span=(0, 2), match='#W'>
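
The pattern on this slide matches only one letter after the `#`. As a small extension, the sketch below adds a `+` quantifier to capture the whole hashtag and uses `re.findall` to collect every hashtag in a string. The sample tweet is made up.

```python
import re

# Made-up tweet containing two hashtags
my_string = "Loving this #Wonderfulday with #friends2020 :)"

# '#' followed by exactly one letter, as on the slide
first = re.search(r"#[A-Za-z]", my_string)

# '+' repeats the letter class, so the whole alphabetic hashtag is captured
whole = re.search(r"#[A-Za-z]+", my_string)

# findall returns every non-overlapping match; \w also allows digits and '_'
all_tags = re.findall(r"#\w+", my_string)

print(first.group())
print(whole.group())
print(all_tags)
```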

  11. Token pattern with a BOW
          # Default token pattern in CountVectorizer
          r'(?u)\b\w\w+\b'
          # Specify a particular token pattern
          CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')
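
To see what the two token patterns actually keep, they can be applied directly with `re.findall`. This is only a sketch of the matching step on a made-up sentence; a vectorizer would additionally lowercase the text by default.

```python
import re

# Made-up sentence mixing words, numbers, and alphanumeric codes
text = "Flight 370 delayed 2h, rebooked on UA2041"

# Default pattern: any run of two or more word characters
default_pattern = r"(?u)\b\w\w+\b"

# Custom pattern: two or more word characters, none of which is a digit
letters_only = r"\b[^\d\W][^\d\W]+\b"

default_tokens = re.findall(default_pattern, text)
letter_tokens = re.findall(letters_only, text)

print(default_tokens)
print(letter_tokens)
```

Tokens that mix letters and digits, such as "2h" and "UA2041", are dropped entirely by the second pattern rather than being trimmed to their letters.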

  12. Let's practice!

  13. Stemming and lemmatization

  14. What is stemming?
      Stemming is the process of transforming words to their root forms, even if the stem itself is not a valid word in the language.
          staying, stays, stayed ----> stay
          house, houses, housing ----> hous

  15. What is lemmatization?
      Lemmatization is quite similar to stemming but unlike stemming, it reduces the words to roots that are valid words in the language.
          stay, stays, staying, stayed ----> stay
          house, houses, housing ----> house

  16. Stemming vs. lemmatization
      Stemming                       | Lemmatization
      Produces roots of words        | Produces actual words
      Fast and efficient to compute  | Slower than stemming and can depend on the part-of-speech
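
The contrast in the table can be mimicked with two toy functions. Both are deliberately crude and hypothetical: `toy_stem` strips a handful of suffixes by rule and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-written table that stands in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Strip the first matching suffix by rule; the result need not be a real word."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real dictionary
toy_lemmas = {
    "stays": "stay", "staying": "stay", "stayed": "stay",
    "houses": "house", "housing": "house",
}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return toy_lemmas.get(word, word)


words = ["stay", "stays", "staying", "stayed", "house", "houses", "housing"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The rule-based function is cheap but yields "hous", while the lookup yields the valid word "house" at the cost of needing a dictionary.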

  17. Stemming of strings
          from nltk.stem import PorterStemmer
          porter = PorterStemmer()
          porter.stem('wonderful')
          'wonder'

  18. Non-English stemmers
      Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish
          from nltk.stem.snowball import SnowballStemmer
          DutchStemmer = SnowballStemmer("dutch")
          DutchStemmer.stem("beginen")
          'begin'

  19. How to stem a sentence?
          porter.stem('Today is a wonderful day!')
          'today is a wonderful day!'

          tokens = word_tokenize('Today is a wonderful day!')
          stemmed_tokens = [porter.stem(token) for token in tokens]
          stemmed_tokens
          ['today', 'is', 'a', 'wonder', 'day', '!']

  20. Lemmatization of a string
          from nltk.stem import WordNetLemmatizer
          WNlemmatizer = WordNetLemmatizer()
          WNlemmatizer.lemmatize('wonderful', pos='a')
          'wonderful'

  21. Let's practice!

  22. TfIdf: More ways to transform text

  23. What are the components of TfIdf?
      - TF: term frequency: How often a given word appears within a document in the corpus
      - Inverse document frequency: Log-ratio between the total number of documents and the number of documents that contain a specific word
      - Used to calculate the weight of words that do not occur frequently

  24. TfIdf score of a word
      - TfIdf score:
          TfIdf = term frequency * inverse document frequency
      - BOW does not account for length of a document, TfIdf does.
      - TfIdf likely to capture words common within a document but not across documents.
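
The product above can be computed by hand on a toy corpus. This sketch uses the plain textbook definitions (term share within a document, natural log of the document ratio); the three tokenized "tweets" and the helper names `tf`, `idf`, and `tfidf` are made up. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Made-up corpus of three already-tokenized tweets
docs = [
    ["united", "flight", "late", "late"],
    ["united", "check", "in", "slow"],
    ["united", "flight", "great"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log-ratio of total documents to documents containing `term`.

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


score_united = tfidf("united", docs[0], docs)  # appears in every document
score_late = tfidf("late", docs[0], docs)      # frequent here, absent elsewhere

print(score_united)
print(score_late)
```

A word present in every document gets an idf of log(1) = 0 and therefore a score of zero, while a word concentrated in one document scores high. Dividing by the document length in `tf` is what lets the score account for how long a document is.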

  25. How is TfIdf useful?
      Twitter airline sentiment
      - Low TfIdf scores: United, Virgin America
      - High TfIdf scores: check-in process (if rare across documents)
      More on TfIdf
      - Since it penalizes frequent words, less need to deal with stop words explicitly.
      - Quite useful in search queries and information retrieval to rank the relevance of returned results.

  26. TfIdf in Python
          # Import the TfidfVectorizer
          from sklearn.feature_extraction.text import TfidfVectorizer
      Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df
          vect = TfidfVectorizer(max_features=100).fit(tweets.text)
          X = vect.transform(tweets.text)

  27. TfidfVectorizer
          X
          <14640x100 sparse matrix of type '<class 'numpy.float64'>'
              with 119182 stored elements in Compressed Sparse Row format>

          import pandas as pd
          # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
          X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
          X_df.head()

  28. Let's practice!
