Tokenization and Lemmatization: Feature Engineering for NLP in Python


  1. Tokenization and Lemmatization
     Feature Engineering for NLP in Python
     Rounak Banik, Data Scientist

  2. Text sources
     - News articles
     - Tweets
     - Comments

  3. Making text machine friendly
     - Dogs, dog
     - reduction, REDUCING, Reduce
     - don't, do not
     - won't, will not

  4. Text preprocessing techniques
     - Converting words into lowercase
     - Removing leading and trailing whitespaces
     - Removing punctuation
     - Removing stop words
     - Expanding contractions
     - Removing special characters (numbers, emojis, etc.)
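
Several of the techniques listed above can be sketched with the standard library alone. The snippet below is a minimal illustration, not the course's implementation: the `basic_clean` function and the tiny `CONTRACTIONS` map are hypothetical names introduced here, and the substring-based contraction expansion is deliberately naive. Emoji handling and stop-word removal are left out.

```python
import re
import string

# Hypothetical, deliberately tiny contraction map; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "won't": "will not", "i'm": "i am"}


def basic_clean(text):
    """Lowercase, trim, expand contractions, then drop punctuation and digits."""
    text = text.lower().strip()
    # Expand contractions first, while the apostrophes are still present.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation and digits.
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Collapse any runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(basic_clean("  Don't stop REDUCING!!  "))  # do not stop reducing
```

The order matters in this sketch: contractions are expanded before punctuation is stripped, otherwise "don't" would already have become "dont".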

  5. Tokenization
     "I have a dog. His name is Hachi."
     Tokens: ["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]
     "Don't do this."
     Tokens: ["Do", "n't", "do", "this", "."]
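
To make the splitting rules in these examples concrete, here is a rough regular-expression tokenizer that reproduces them. It is only a sketch under simple assumptions (English text, straight apostrophes); `TOKEN_RE` and `tokenize` are names invented for this illustration, and a library tokenizer such as spaCy's handles far more cases.

```python
import re

# Alternatives are tried left to right:
#   \w+(?=n't)  a word stem directly followed by "n't" (e.g. "Do" in "Don't")
#   n't         the negation clitic itself
#   '\w+        other clitics such as "'m" or "'ve"
#   \w+         ordinary words
#   [^\w\s]     any single punctuation character
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("Don't do this."))  # ['Do', "n't", 'do', 'this', '.']
```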

  6. Tokenization using spaCy

         import spacy

         # Load the en_core_web_sm model
         nlp = spacy.load('en_core_web_sm')
         # Initialize string
         string = "Hello! I don't know what I'm doing here."
         # Create a Doc object
         doc = nlp(string)
         # Generate list of tokens
         tokens = [token.text for token in doc]
         print(tokens)

     Output:

         ['Hello','!','I','do',"n't",'know','what','I',"'m",'doing','here','.']

  7. Lemmatization
     Convert a word into its base form.
     - reducing, reduces, reduced, reduction → reduce
     - am, are, is → be
     - n't → not
     - 've → have
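
The mappings above can be read as a lookup from surface form to base form. The toy lemmatizer below encodes exactly those slide examples in a dictionary; `LEMMA_TABLE` and `lemmatize` are hypothetical names, and a real lemmatizer relies on much larger tables and part-of-speech information rather than a handful of entries.

```python
# Hypothetical lookup table built only from the examples on the slide.
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce",
    "reduced": "reduce", "reduction": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lemmatize(tokens):
    """Lowercase each token and map it to its base form if one is known."""
    return [LEMMA_TABLE.get(tok.lower(), tok.lower()) for tok in tokens]


print(lemmatize(["Do", "n't", "do", "this", "."]))  # ['do', 'not', 'do', 'this', '.']
```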

  8. Lemmatization using spaCy

         import spacy

         # Load the en_core_web_sm model
         nlp = spacy.load('en_core_web_sm')
         # Initialize string
         string = "Hello! I don't know what I'm doing here."
         # Create a Doc object
         doc = nlp(string)
         # Generate list of lemmas
         lemmas = [token.lemma_ for token in doc]
         print(lemmas)

     Output:

         ['hello','!','-PRON-','do','not','know','what','-PRON-','be','do','here','.']

  9. Let's practice!

  10. Text cleaning

  11. Text cleaning techniques
      - Unnecessary whitespaces and escape sequences
      - Punctuation
      - Special characters (numbers, emojis, etc.)
      - Stop words

  12. isalpha()

          "Dog".isalpha()      # True
          "3dogs".isalpha()    # False
          "12347".isalpha()    # False
          "!".isalpha()        # False
          "?".isalpha()        # False

  13. A word of caution
      - Abbreviations: U.S.A, U.K, etc.
      - Proper nouns: word2vec and xto10x.
      - Write your own custom function (using regex) for the more nuanced cases.
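
One possible shape for such a custom function is sketched below. It keeps purely alphabetic tokens, dotted abbreviations matched by a regex, and terms from an explicit allowlist. The names `ABBREVIATION_RE`, `KEEP_TERMS`, and `is_wordlike` are assumptions made for this example, and the right rules depend on the application.

```python
import re

# Letter-dot pairs followed by a final letter and an optional dot: U.K, U.S.A, U.S.A.
ABBREVIATION_RE = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]\.?")

# Hypothetical allowlist of domain terms that legitimately contain digits.
KEEP_TERMS = {"word2vec", "xto10x"}


def is_wordlike(token):
    """Return True for tokens worth keeping despite failing str.isalpha()."""
    return (
        token.isalpha()
        or token.lower() in KEEP_TERMS
        or ABBREVIATION_RE.fullmatch(token) is not None
    )


tokens = ["U.S.A", "word2vec", "3dogs", "dog", "!", "U.K"]
print([t for t in tokens if is_wordlike(t)])  # ['U.S.A', 'word2vec', 'dog', 'U.K']
```

An allowlist is used for terms like word2vec because a blanket "letters and digits" rule would also let tokens such as "3dogs" through.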

  14. Removing non-alphabetic characters

          string = """
          OMG!!!! This is like the best thing ever \t\n.
          Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
          """

          import spacy

          # Load the model and generate list of lemmas
          nlp = spacy.load('en_core_web_sm')
          doc = nlp(string)
          lemmas = [token.lemma_ for token in doc]

  15. Removing non-alphabetic characters

          ...
          # Remove tokens that are not alphabetic
          a_lemmas = [lemma for lemma in lemmas
                      if lemma.isalpha() or lemma == '-PRON-']
          # Print string after text cleaning
          print(' '.join(a_lemmas))

      Output:

          'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'

  16. Stop words
      - Words that occur extremely commonly
      - E.g. articles, be verbs, pronouns, etc.

  17. Removing stop words using spaCy

          # Get list of stopwords
          stopwords = spacy.lang.en.stop_words.STOP_WORDS

          string = """
          OMG!!!! This is like the best thing ever \t\n.
          Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
          """

  18. Removing stop words using spaCy

          ...
          # Remove stopwords and non-alphabetic tokens
          a_lemmas = [lemma for lemma in lemmas
                      if lemma.isalpha() and lemma not in stopwords]
          # Print string after text cleaning
          print(' '.join(a_lemmas))

      Output:

          'omg like good thing wow amazing song hooked definitely'
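
The filtering step itself does not depend on spaCy, so it can be tried in isolation. In the sketch below the lemma list is hard-coded as an approximation of what the lemmatization step would produce for the example string, and `stopwords` is a small hand-picked set standing in for spaCy's `STOP_WORDS`. Both are assumptions for illustration, as is the helper name `remove_stopwords`.

```python
# Approximate lemmas for the example string, hard-coded so the sketch
# runs without spaCy installed; real spaCy output may differ in detail.
lemmas = ["omg", "!", "!", "!", "!", "this", "be", "like", "the", "good",
          "thing", "ever", "\t\n", ".", "wow", ",", "such", "an", "amazing",
          "song", "!", "-PRON-", "be", "hooked", ".", "top", "5",
          "definitely", ".", "?"]

# Hypothetical, hand-picked subset standing in for spaCy's STOP_WORDS set.
stopwords = {"this", "be", "the", "ever", "such", "an", "top"}


def remove_stopwords(lemmas, stopwords):
    """Keep only alphabetic lemmas that are not stop words."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]


print(" ".join(remove_stopwords(lemmas, stopwords)))
# omg like good thing wow amazing song hooked definitely
```

Note that `'-PRON-'` is dropped here because it fails `isalpha()`, which is why the earlier slide needed the explicit `lemma == '-PRON-'` exception to keep it.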

  19. Other text preprocessing techniques
      - Removing HTML/XML tags
      - Replacing accented characters (such as é)
      - Correcting spelling errors
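
The first two techniques can be approximated with the standard library, as in the sketch below. The function names `strip_tags` and `strip_accents` are made up for this example, and the tag-removal regex is intentionally naive: for real HTML a proper parser is the safer choice. Spelling correction is not covered here.

```python
import re
import unicodedata


def strip_tags(text):
    """Naively remove anything that looks like an HTML/XML tag."""
    return re.sub(r"<[^>]+>", "", text)


def strip_accents(text):
    """Replace accented characters with their closest unaccented form."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(strip_tags("<p>Café <b>résumé</b></p>")))  # Cafe resume
```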

  20. A word of caution
      Always use only those text preprocessing techniques that are relevant to your application.

  21. Let's practice!

  22. Part-of-speech tagging

  23. Applications
      - Word-sense disambiguation
        "The bear is a majestic animal"
        "Please bear with me"
      - Sentiment analysis
      - Question answering
      - Fake news and opinion spam detection

  24. POS tagging
      Assigning every word its corresponding part of speech.
      "Jane is an amazing guitarist."
      POS tagging:
      - Jane → proper noun
      - is → verb
      - an → determiner
      - amazing → adjective
      - guitarist → noun

  25. POS tagging using spaCy

          import spacy

          # Load the en_core_web_sm model
          nlp = spacy.load('en_core_web_sm')
          # Initialize string
          string = "Jane is an amazing guitarist"
          # Create a Doc object
          doc = nlp(string)

  26. POS tagging using spaCy

          ...
          # Generate list of tokens and pos tags
          pos = [(token.text, token.pos_) for token in doc]
          print(pos)

      Output:

          [('Jane', 'PROPN'), ('is', 'VERB'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]
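
Once tokens are tagged, the `(token, tag)` pairs can be turned into simple numeric features, for instance the number of nouns or proper nouns in a text. The sketch below hard-codes the tagged output shown above so that it runs without spaCy; `pos_counts` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Tagged tokens copied from the slide's output (hard-coded so spaCy is not required).
pos = [("Jane", "PROPN"), ("is", "VERB"), ("an", "DET"),
       ("amazing", "ADJ"), ("guitarist", "NOUN")]


def pos_counts(tagged_tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in tagged_tokens)


counts = pos_counts(pos)
# A Counter returns 0 for tags that never occurred, such as ADV here.
print(counts["PROPN"], counts["NOUN"], counts["ADV"])  # 1 1 0
```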

  27. POS annotations in spaCy
      - PROPN → proper noun
      - DET → determiner
      - spaCy annotations at https://spacy.io/api/annotation

  28. Let's practice!

  29. Named entity recognition

  30. Applications
      - Efficient search algorithms
      - Question answering
      - News article classification
      - Customer service

  31. Named entity recognition
      Identifying and classifying named entities into predefined categories.
      Categories include person, organization, country, etc.
      "John Doe is a software engineer working at Google. He lives in France."
      Named entities:
      - John Doe → person
      - Google → organization
      - France → country (geopolitical entity)

  32. NER using spaCy

          import spacy

          string = "John Doe is a software engineer working at Google. He lives in France."
          # Load model and create Doc object
          nlp = spacy.load('en_core_web_sm')
          doc = nlp(string)
          # Generate named entities
          ne = [(ent.text, ent.label_) for ent in doc.ents]
          print(ne)

      Output:

          [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
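
A common next step is to group the extracted entities by label so they can feed features such as "number of organizations mentioned". The sketch below works on the hard-coded output shown above rather than calling spaCy, and `group_entities` is a name chosen for this illustration only.

```python
from collections import defaultdict

# Entities copied from the slide's output (hard-coded so spaCy is not required).
ne = [("John Doe", "PERSON"), ("Google", "ORG"), ("France", "GPE")]


def group_entities(entities):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


print(group_entities(ne))
# {'PERSON': ['John Doe'], 'ORG': ['Google'], 'GPE': ['France']}
```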

  33. NER annotations in spaCy
      - More than 15 categories of named entities
      - NER annotations at https://spacy.io/api/annotation#named-entities

  34. A word of caution
      - Not perfect
      - Performance dependent on training and test data
      - Train models with specialized data for nuanced cases
      - Language specific

  35. Let's practice!
