Building a bag of words model
  1. Building a bag of words model
FEATURE ENGINEERING FOR NLP IN PYTHON
Rounak Banik, Data Scientist

  2. Recap of data format for ML algorithms
For any ML algorithm:
- Data must be in tabular form
- Training features must be numerical

  3. Bag of words model
- Extract word tokens
- Compute the frequency of word tokens
- Construct a word vector out of these frequencies and the vocabulary of the corpus

  4. Bag of words model example
Corpus:
"The lion is the king of the jungle"
"Lions have lifespans of a decade"
"The lion is an endangered species"

  5. Bag of words model example
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
"The lion is the king of the jungle" → [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
"Lions have lifespans of a decade" → [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
"The lion is an endangered species" → [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
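The vectors above can be reproduced with a minimal pure-Python sketch. The `bow_vector` helper is illustrative, not part of the course code; the vocabulary is hard-coded in the slide's order, which is case-sensitive (so "lion"/"Lions" and "the"/"The" are separate entries).

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Vocabulary exactly as listed on the slide (case-sensitive)
vocab = ["a", "an", "decade", "endangered", "have", "is", "jungle", "king",
         "lifespans", "lion", "Lions", "of", "species", "the", "The"]

def bow_vector(doc, vocab):
    # Count whitespace-separated tokens, then read counts off in vocab order
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in corpus:
    print(bow_vector(doc, vocab))
```

Each row matches the slide; for instance "the" occurs twice (plus "The" once) in the first sentence, giving the 2 near the end of its vector.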

  6. Text preprocessing
- Lions, lion → lion
- The, the → the
- No punctuation
- No stop words
Preprocessing leads to smaller vocabularies, and reducing the number of dimensions helps improve performance.
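These steps can be sketched in a few lines. The tiny STOP_WORDS set here is an illustrative assumption (real pipelines use a fuller list such as sklearn's 'english'), and lemmatization (the "Lions" → "lion" step) is omitted; only lowercasing, punctuation removal, and stop-word removal are shown.

```python
import string

# Tiny illustrative stop-word list (assumption, not the course's list)
STOP_WORDS = {"the", "a", "an", "is", "of"}

def preprocess(doc):
    # Lowercase so "The"/"the" collapse to one token
    doc = doc.lower()
    # Strip punctuation characters
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace and drop stop words
    return [tok for tok in doc.split() if tok not in STOP_WORDS]

print(preprocess("The lion is the king of the jungle"))
# ['lion', 'king', 'jungle']
```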

  7. Bag of words model using sklearn

# Import pandas
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

  8. Bag of words model using sklearn

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]

  9. Let's practice!

  10. Building a BoW Naive Bayes classifier

  11. Spam filtering

message | label
WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461 | spam
Ah, work. I vaguely remember that. What does it feel like? | ham

  12. Steps
1. Text preprocessing
2. Building a bag-of-words model (or representation)
3. Machine learning

  13. Text preprocessing using CountVectorizer
CountVectorizer arguments:
- lowercase: False, True
- strip_accents: 'unicode', 'ascii', None
- stop_words: 'english', list, None
- token_pattern: regex
- tokenizer: function

  14. Building the BoW model

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)

  15. Building the BoW model

...

# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)

  16. Training the Naive Bayes classifier

# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

0.760051

  17. Let's practice!

  18. Building n-gram models

  19. BoW shortcomings

review | label
'The movie was good and not boring' | positive
'The movie was not good and boring' | negative

- Exactly the same BoW representation!
- Context of the words is lost.
- Sentiment is dependent on the position of 'not'.
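The claim is easy to verify: under a plain bag-of-words model both reviews map to the same vector, since they contain exactly the same words. A quick sketch with CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    'The movie was good and not boring',
    'The movie was not good and boring',
]

bow = CountVectorizer().fit_transform(reviews).toarray()

# Both reviews contain the same words once each, so the rows are identical
print(bow[0])
print(bow[1])
print((bow[0] == bow[1]).all())
```

Word counts alone cannot tell the positive review from the negative one; position is invisible to the model.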

  20. n-grams
A contiguous sequence of n elements (or words) in a given document.
n = 1 → bag-of-words
'for you a thousand times over'
n = 2, n-grams: ['for you', 'you a', 'a thousand', 'thousand times', 'times over']

  21. n-grams
'for you a thousand times over'
n = 3, n-grams: ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']
Captures more context.
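Generating these n-grams by hand takes only a couple of lines; the `ngrams` helper below is an illustrative sketch, not a course function.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'for you a thousand times over'.split()

print(ngrams(tokens, 2))  # bigrams, as on slide 20
print(ngrams(tokens, 3))  # trigrams, as above
```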

  22. Applications
- Sentence completion
- Spelling correction
- Machine translation correction

  23. Building n-gram models using scikit-learn

Generates only bigrams:
bigrams = CountVectorizer(ngram_range=(2, 2))

Generates unigrams, bigrams and trigrams:
ngrams = CountVectorizer(ngram_range=(1, 3))

  24. Shortcomings
- Curse of dimensionality
- Higher-order n-grams are rare
- Keep n small

  25. Let's practice!
