Introduction to NLP feature engineering

  1. Introduction to NLP feature engineering. FEATURE ENGINEERING FOR NLP IN PYTHON. Rounak Banik, Data Scientist

  2. Numerical data: Iris dataset
          sepal length  sepal width  petal length  petal width  class
          6.3           2.9          5.6           1.8          Iris-virginica
          4.9           3.0          1.4           0.2          Iris-setosa
          5.6           2.9          3.6           1.3          Iris-versicolor
          6.0           2.7          5.1           1.6          Iris-versicolor
          7.2           3.6          6.1           2.5          Iris-virginica

  3. One-hot encoding
          sex
          female
          male
          female
          male
          female
          ...

  4. One-hot encoding
          sex          one-hot encoding
          female   →
          male     →
          female   →
          male     →
          female   →
          ...          ...

  5. One-hot encoding
          sex          sex_female  sex_male
          female   →   1           0
          male     →   0           1
          female   →   1           0
          male     →   0           1
          female   →   1           0
          ...          ...         ...
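
To make the mapping above concrete, here is a minimal pure-Python sketch of one-hot encoding. The helper name `one_hot` and the toy list are assumptions made for illustration; they are not part of the slides.

```python
# Toy data mirroring the 'sex' column on the slide (values are illustrative).
sex = ["female", "male", "female", "male", "female"]

def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

encoded = one_hot(sex, prefix="sex")
for name, column in encoded.items():
    print(name, column)
```

Each distinct category becomes its own 0/1 column, so a row holds exactly one 1 across the new columns.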

  6. One-hot encoding with pandas
          # Import the pandas library
          import pandas as pd
          # Perform one-hot encoding on the 'sex' feature of df
          df = pd.get_dummies(df, columns=['sex'])

  7. Textual data: Movie Review Dataset
          review                                              class
          This movie is for dog lovers. A very poignant ...   positive
          The movie is forgettable. The plot lacked ...       negative
          A truly amazing movie about dogs. A gripping ...    positive

  8. Text pre-processing
      Converting to lowercase. Example: Reduction to reduction
      Converting to base form. Example: reduction to reduce
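
The two steps above can be sketched as follows. This is only an illustration: `BASE_FORMS` is a hypothetical hand-written lookup table standing in for a real base-form reducer, and `preprocess` is a made-up helper name.

```python
# Hypothetical lookup table; a real pipeline would use a proper lemmatizer.
# The first entry is the slide's own example, the second is illustrative.
BASE_FORMS = {"reduction": "reduce", "dogs": "dog"}

def preprocess(text):
    """Lowercase the text, split on whitespace, and map words to base forms."""
    tokens = text.lower().split()
    # Words missing from the table are left unchanged.
    return [BASE_FORMS.get(token, token) for token in tokens]

print(preprocess("Reduction"))
print(preprocess("A movie about Dogs"))
```

Lowercasing first means the lookup table only needs lowercase keys.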

  9. Vectorization
          review                                              class
          This movie is for dog lovers. A very poignant ...   positive
          The movie is forgettable. The plot lacked ...       negative
          A truly amazing movie about dogs. A gripping ...    positive

  10. Vectorization
          0     1     2     ...  n     class
          0.03  0.71  0.00  ...  0.22  positive
          0.45  0.00  0.03  ...  0.19  negative
          0.14  0.18  0.00  ...  0.45  positive
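
The slides do not say which weighting scheme produced the numbers above. As one possible illustration, the sketch below turns each text into a vector of relative word frequencies. The shortened review strings, the `vectorize` helper, and the choice of scheme are all assumptions, and the output will not match the slide's values.

```python
from collections import Counter

# Shortened, lowercased versions of the slide's reviews (illustrative only).
reviews = [
    "this movie is for dog lovers",
    "the movie is forgettable",
    "a truly amazing movie about dogs",
]

# Vocabulary: every distinct word, in sorted order, gets one column index.
vocabulary = sorted({word for review in reviews for word in review.split()})

def vectorize(text):
    """Map a non-empty text to relative word frequencies over the vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return [counts[word] / total for word in vocabulary]

vectors = [vectorize(review) for review in reviews]
print(vocabulary)
for vector in vectors:
    print([round(value, 2) for value in vector])
```

Every review ends up with the same number of columns, which is what lets a model consume text as numerical features.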

  11. Basic features
      Number of words
      Number of characters
      Average length of words
      Tweets

  12. POS tagging
          Word   POS
          I      Pronoun
          have   Verb
          a      Article
          dog    Noun

  13. Named Entity Recognition
      Does a noun refer to a person, organization or country?
          Noun       NER
          Brian      Person
          DataCamp   Organization

  14. Concepts covered
      Text preprocessing
      Basic features
      Word features
      Vectorization

  15. Let's practice!

  16. Basic feature extraction

  17. Number of characters
          "I don't know." # 13 characters
          # Compute the number of characters
          text = "I don't know."
          num_char = len(text)
          # Print the number of characters
          print(num_char)
          # Output: 13
          # Create a 'num_chars' feature
          df['num_chars'] = df['review'].apply(len)

  18. Number of words
          # Split the string into words
          text = "Mary had a little lamb."
          words = text.split()
          # Print the list containing words
          print(words)
          # Output: ['Mary', 'had', 'a', 'little', 'lamb.']
          # Print number of words
          print(len(words))
          # Output: 5

  19. Number of words
          # Function that returns number of words in string
          def word_count(string):
              # Split the string into words
              words = string.split()
              # Return length of words list
              return len(words)
          # Create num_words feature in df
          df['num_words'] = df['review'].apply(word_count)

  20. Average word length
          # Function that returns average word length
          def avg_word_length(x):
              # Split the string into words
              words = x.split()
              # Compute length of each word and store in a separate list
              word_lengths = [len(word) for word in words]
              # Compute average word length
              avg_word_length = sum(word_lengths) / len(words)
              # Return average word length
              return avg_word_length

  21. Average word length
          # Create a new feature avg_word_length
          df['avg_word_length'] = df['review'].apply(avg_word_length)
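
As a quick check of the three basic features without a DataFrame, the sketch below computes them for a single string. The `basic_features` helper and its empty-text guard are additions for illustration, not part of the slides.

```python
def basic_features(text):
    """Return character count, word count and average word length for a string."""
    words = text.split()
    num_words = len(words)
    # Guard against empty or whitespace-only text, which would otherwise
    # raise ZeroDivisionError when averaging.
    avg_len = sum(len(word) for word in words) / num_words if num_words else 0.0
    return {
        "num_chars": len(text),
        "num_words": num_words,
        "avg_word_length": avg_len,
    }

features = basic_features("Mary had a little lamb.")
print(features)
```

Note that punctuation attached to a word (such as the final period in "lamb.") counts toward that word's length, because the split is on whitespace only.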

  22. Special features

  23. Hashtags and mentions
          # Function that returns number of hashtags
          def hashtag_count(string):
              # Split the string into words
              words = string.split()
              # Create a list of hashtags
              hashtags = [word for word in words if word.startswith('#')]
              # Return number of hashtags
              return len(hashtags)
          hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")
          # Output: 2
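
The slide shows only the hashtag counter. A mention counter can be sketched the same way by testing for a leading `@`; the name `mention_count` is an assumption chosen to parallel `hashtag_count`.

```python
def mention_count(string):
    """Return the number of whitespace-separated words that start with '@'."""
    words = string.split()
    mentions = [word for word in words if word.startswith('@')]
    return len(mentions)

tweet = "@janedoe This is my first tweet! #FirstTweet #Happy"
print(mention_count(tweet))
```

Like the hashtag version, this counts any word beginning with the marker, so a lone "@" would also be counted.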

  24. Other features
      Number of sentences
      Number of paragraphs
      Words starting with an uppercase
      All-capital words
      Numeric quantities
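
The features listed above could be approximated with a few regular expressions, as in this rough sketch. The heuristics (sentence ends at `.`, `!` or `?` followed by whitespace, paragraphs separated by blank lines) and the helper name `other_features` are assumptions, not the course's definitions.

```python
import re

def other_features(text):
    """Rough counts for the listed features, using simple heuristics."""
    words = text.split()
    # Strip surrounding punctuation so "DVD." still counts as all-capital.
    stripped = [word.strip(".,!?;:\"'()") for word in words]
    return {
        # Sentence end: . ! or ? followed by whitespace or end of text,
        # so the decimal point in "12.50" is not counted.
        "num_sentences": len(re.findall(r"[.!?]+(?=\s|$)", text)),
        # Paragraphs are assumed to be separated by blank lines.
        "num_paragraphs": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
        "num_capitalized": sum(1 for word in stripped if word[:1].isupper()),
        # Single letters such as "I" are not treated as all-capital words.
        "num_all_caps": sum(1 for word in stripped if len(word) > 1 and word.isupper()),
        "num_numeric": len(re.findall(r"\d+(?:\.\d+)?", text)),
    }

sample = "I watched it on DVD. It cost 12.50 dollars!\n\nWOW, 2 thumbs up."
result = other_features(sample)
print(result)
```

These counts are approximations; abbreviations such as "e.g." would be miscounted as sentence ends.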

  25. Let's practice!

  26. Readability tests

  27. Overview of readability tests
      Determine readability of an English passage
      Scale ranging from primary school up to college graduate level
      A mathematical formula utilizing word, syllable and sentence count
      Used in fake news and opinion spam detection

  28. Readability test examples
      Flesch reading ease
      Gunning fog index
      Simple Measure of Gobbledygook (SMOG)
      Dale-Chall score

  29. Readability test examples
      Flesch reading ease
      Gunning fog index
      Simple Measure of Gobbledygook (SMOG)
      Dale-Chall score

  30. Flesch reading ease
      One of the oldest and most widely used tests
      Dependent on two factors:
          The greater the average sentence length, the harder the text is to read
              "This is a short sentence."
              "This is a longer sentence with more words and it is harder to follow than the first sentence."
          The greater the average number of syllables in a word, the harder the text is to read
              "I live in my home."
              "I reside in my domicile."
      The higher the score, the greater the readability
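
The slide does not give the formula. The standard Flesch reading ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), and the sketch below applies it to the slide's two example sentences. The syllable counter is a deliberately naive assumption (one syllable per run of vowels), so the scores are rough and will differ from a dedicated library's.

```python
import re

def count_syllables(word):
    """Naive estimate: each run of consecutive vowels counts as one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch reading ease for a non-empty English text (approximate)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

simple = flesch_reading_ease("I live in my home.")
formal = flesch_reading_ease("I reside in my domicile.")
print(round(simple, 2), round(formal, 2))
```

Both sentences have five words, so the gap between the two scores comes entirely from the syllables-per-word term.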

  31. Flesch reading ease score interpretation
          Reading ease score   Grade level
          90-100               5
          80-90                6
          70-80                7
          60-70                8-9
          50-60                10-12
          30-50                College
          0-30                 College graduate
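
The table above can be turned into a small lookup, sketched here. The bands share their endpoints (for example 80-90 and 90-100), so this sketch assumes each lower bound is inclusive; `flesch_grade_level` is a hypothetical helper name.

```python
# (lower bound, grade level) pairs from the table above, highest band first.
FLESCH_BANDS = [
    (90, "5"),
    (80, "6"),
    (70, "7"),
    (60, "8-9"),
    (50, "10-12"),
    (30, "College"),
    (0, "College graduate"),
]

def flesch_grade_level(score):
    """Return the grade level whose band contains the given score."""
    for lower_bound, grade in FLESCH_BANDS:
        if score >= lower_bound:
            return grade
    # Scores below 0 fall outside the table; treated here as the hardest band.
    return "College graduate"

print(flesch_grade_level(21.14))
print(flesch_grade_level(85))
```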

  32. Gunning fog index
      Developed in 1954
      Also dependent on average sentence length
      The greater the percentage of complex words, the harder the text is to read
      The higher the index, the lower the readability
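
For reference, the usual Gunning fog formula is 0.4 × ((words / sentences) + 100 × (complex words / words)), where a complex word is conventionally one with three or more syllables. The sketch below applies it with a naive vowel-run syllable counter, which is an assumption and over-counts some words; it also skips the usual exclusions for proper nouns and common suffixes.

```python
import re

def count_syllables(word):
    """Naive estimate: each run of consecutive vowels counts as one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning fog index for a non-empty English text (approximate)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Complex words: three or more (estimated) syllables.
    complex_words = [word for word in words if count_syllables(word) >= 3]
    return 0.4 * (
        len(words) / len(sentences)
        + 100 * len(complex_words) / len(words)
    )

plain = gunning_fog("I live in my home.")
formal = gunning_fog("I reside in my domicile.")
print(round(plain, 2), round(formal, 2))
```

Unlike Flesch reading ease, a larger value here means harder text, so the more formal sentence should score higher.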

  33. Gunning fog index interpretation
          Fog index   Grade level
          17          College graduate
          16          College senior
          15          College junior
          14          College sophomore
          13          College freshman
          12          High school senior
          11          High school junior
          10          High school sophomore
          9           High school freshman
          8           Eighth grade
          7           Seventh grade
          6           Sixth grade

  34. The textatistic library
          # Import the Textatistic class
          from textatistic import Textatistic
          # Create a Textatistic Object
          readability_scores = Textatistic(text).scores
          # Generate scores
          print(readability_scores['flesch_score'])
          print(readability_scores['gunningfog_score'])
          # Output:
          # 21.14
          # 16.26

  35. Let's practice!
