Building tf-idf document vectors
FEATURE ENGINEERING FOR NLP IN PYTHON
Rounak Banik, Data Scientist

n-gram modeling
- The weight of a dimension depends on the frequency of the word corresponding to that dimension.
- A document contains the word human in five places.
- The dimension corresponding to human therefore has weight 5.

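As a small illustration of the count-based weighting described above, the sketch below builds a vector of raw word counts using only the standard library. The sentence, the vocabulary, and the helper name `count_vector` are made up for this example and are not taken from the course.

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Return the raw count of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical document in which "human" occurs five times
document = "human after human the human is human and human"
vocabulary = ["and", "human", "is", "the"]

vector = count_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

The entry for human is 5, matching the weight described on the slide.
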
Motivation
- Some words occur very commonly across all documents.
- Consider a corpus of documents on the universe.
- One document has jupiter and universe occurring 20 times each.
- jupiter rarely occurs in the other documents; universe is common.
- Give more weight to jupiter on account of its exclusivity.

Applications
- Automatically detect stop words
- Search
- Recommender systems
- Better performance in predictive modeling for some cases

Term frequency-inverse document frequency
- Proportional to term frequency
- Inverse function of the number of documents in which the term occurs

Mathematical formula

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

- $w_{i,j}$ → weight of term $i$ in document $j$
- $tf_{i,j}$ → term frequency of term $i$ in document $j$
- $N$ → number of documents in the corpus
- $df_i$ → number of documents containing term $i$

Example:

$$w_{\text{library},\,\text{document}} = 5 \cdot \log\left(\frac{20}{8}\right) \approx 2$$

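The formula can be checked numerically with a few lines of Python. This is a minimal sketch, assuming a base-10 logarithm: that base reproduces the slide's value of about 2, whereas a natural log would give about 4.58. The helper name `tfidf_weight` and the jupiter/universe document counts are hypothetical, chosen only to illustrate the motivation.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """w = tf * log10(N / df); the base-10 log is an assumption."""
    return tf * math.log10(n_docs / df)

# Slide example: tf = 5, N = 20, df = 8
library_weight = tfidf_weight(tf=5, n_docs=20, df=8)
print(round(library_weight, 2))  # 1.99

# Hypothetical corpus of 100 documents: both words occur 20 times in one
# document, but "jupiter" appears in 2 documents and "universe" in 90.
jupiter_weight = tfidf_weight(tf=20, n_docs=100, df=2)
universe_weight = tfidf_weight(tf=20, n_docs=100, df=90)
print(jupiter_weight > universe_weight)  # True
```

With equal term frequencies, the rarer word ends up with the much larger weight.
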
tf-idf using scikit-learn

    # Import TfidfVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Create TfidfVectorizer object
    vectorizer = TfidfVectorizer()

    # Generate matrix of word vectors
    tfidf_matrix = vectorizer.fit_transform(corpus)
    print(tfidf_matrix.toarray())

Output:

    [[0. 0. 0. 0. 0.25434658 0.33443519 0.33443519 0. 0.25434658 0. 0.25434658 0. 0.76303975]
     [0. 0.46735098 0. 0.46735098 0. 0. 0. 0.46735098 0. 0.46735098 0.35543247 0. 0. ]
     ...

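The values printed by `TfidfVectorizer` will not match the textbook formula exactly. With its default settings it uses a smoothed idf, $\ln\left(\frac{1 + N}{1 + df_i}\right) + 1$, and then scales each document vector to unit L2 length. The sketch below reimplements that default weighting in plain Python to make the steps visible. The function name, the whitespace tokenization, and the two-sentence corpus are assumptions for illustration; scikit-learn's own tokenizer differs (for example, it drops single-character tokens by default).

```python
import math
from collections import Counter

def tfidf_sklearn_style(corpus):
    """Smoothed idf, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization."""
    docs = [doc.lower().split() for doc in corpus]  # simplified tokenization
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

    matrix = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in row))
        matrix.append([x / norm for x in row])
    return vocabulary, matrix

# Hypothetical two-document corpus
corpus = ["the lion is the king", "the jungle has one lion"]
vocabulary, matrix = tfidf_sklearn_style(corpus)
for row in matrix:
    print([round(x, 3) for x in row])
```

Each row has unit length after normalization, which is the property the linear kernel shortcut later in this chapter relies on.
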
Let's practice!

Cosine similarity

Image courtesy of techninpink.com

The dot product

Consider two vectors,

$$V = (v_1, v_2, \cdots, v_n), \quad W = (w_1, w_2, \cdots, w_n)$$

Then the dot product of $V$ and $W$ is

$$V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + \cdots + (v_n \times w_n)$$

Example: $A = (4, 7, 1)$, $B = (5, 2, 3)$

$$A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3) = 20 + 14 + 3 = 37$$

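The definition translates directly into code. A minimal sketch in plain Python (the function name `dot` is just a label for this example):

```python
def dot(v, w):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(v, w))

A = (4, 7, 1)
B = (5, 2, 3)
print(dot(A, B))  # 37
```
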
Magnitude of a vector

For any vector

$$V = (v_1, v_2, \cdots, v_n)$$

the magnitude is defined as

$$\lVert V \rVert = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

Example: $A = (4, 7, 1)$, $B = (5, 2, 3)$

$$\lVert A \rVert = \sqrt{4^2 + 7^2 + 1^2} = \sqrt{16 + 49 + 1} = \sqrt{66}$$

The cosine score

$A = (4, 7, 1)$, $B = (5, 2, 3)$

The cosine score is

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \cdot \lVert B \rVert} = \frac{37}{\sqrt{66} \times \sqrt{38}} = 0.7388$$

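The same calculation can be written out by hand before reaching for a library. This is a minimal standard-library sketch; the helper name `cosine_score` is chosen for this example only.

```python
import math

def cosine_score(v, w):
    """Dot product of v and w divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(v, w))
    mag_v = math.sqrt(sum(a * a for a in v))
    mag_w = math.sqrt(sum(b * b for b in w))
    return dot / (mag_v * mag_w)

A = (4, 7, 1)
B = (5, 2, 3)
score = cosine_score(A, B)
print(round(score, 4))  # 0.7388
```
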
Cosine score: points to remember
- The value lies between -1 and 1.
- In NLP, the value lies between 0 and 1, because count and tf-idf vectors have no negative components.
- It is robust to document length: only the direction of a vector matters, not its magnitude.

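The robustness to length can be seen by scaling a vector, which is roughly what happens when a document is repeated several times. In this sketch the tripled vector is a made-up stand-in for a longer document with the same word proportions.

```python
import math

def cosine_score(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

A = (4, 7, 1)
B = (5, 2, 3)
A_tripled = tuple(3 * a for a in A)  # same direction, three times the length

original_score = cosine_score(A, B)
scaled_score = cosine_score(A_tripled, B)
same_direction = cosine_score(A, A_tripled)

print(math.isclose(original_score, scaled_score))  # True
print(math.isclose(same_direction, 1.0))           # True
```

Scaling A leaves its score against B unchanged, and a vector compared with a scaled copy of itself scores 1.
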
Implementation using scikit-learn

    # Import cosine_similarity
    from sklearn.metrics.pairwise import cosine_similarity

    # Define two 3-dimensional vectors A and B
    A = (4, 7, 1)
    B = (5, 2, 3)

    # Compute the cosine score of A and B
    score = cosine_similarity([A], [B])

    # Print the cosine score
    print(score)

Output:

    array([[ 0.73881883]])

Let's practice!

Building a plot line based recommender

Movie recommender

Title: Shanghai Triad
Overview: A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress.

Title: Cry, the Beloved Country
Overview: A South-African preacher goes to search for his wayward son who has committed a crime in the big city.

Movie recommender

    get_recommendations("The Godfather")

Output:

    1178               The Godfather: Part II
    44030    The Godfather Trilogy: 1972-1990
    1914              The Godfather: Part III
    23126                          Blood Ties
    11297                    Household Saints
    34717                   Start Liquidation
    10821                            Election
    38030                          Goodfellas
    17729                   Short Sharp Shock
    26293                  Beck 28 - Familjen
    Name: title, dtype: object

Steps
1. Text preprocessing
2. Generate tf-idf vectors
3. Generate cosine similarity matrix

The recommender function
1. Take a movie title, cosine similarity matrix and indices series as arguments.
2. Extract pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Output titles corresponding to the highest scores.
5. Ignore the highest similarity score (of 1).

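The steps above can be sketched in plain Python. This is a simplified stand-in, not the course's implementation: the slides pass a pandas indices series, whereas this version assumes a dict from title to row index, adds a separate `titles` list and a `top_n` parameter, and runs on a made-up 4 × 4 similarity matrix.

```python
def get_recommendations(title, cosine_sim, indices, titles, top_n=10):
    """Return the top_n titles most similar to `title`.

    cosine_sim: square matrix (list of lists) of pairwise cosine scores
    indices:    mapping from movie title to its row index (assumed dict)
    titles:     movie titles in row order (assumed list)
    """
    idx = indices[title]
    # Pair every movie index with its similarity to the query movie
    scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Ignore the query movie itself (score of 1), then keep the next top_n
    scores = [pair for pair in scores if pair[0] != idx][:top_n]
    return [titles[i] for i, _ in scores]

# Hypothetical data for illustration only
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
indices = {name: i for i, name in enumerate(titles)}
cosine_sim = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
]

recommended = get_recommendations("Movie A", cosine_sim, indices, titles, top_n=2)
print(recommended)  # ['Movie B', 'Movie D']
```

Filtering out the query movie by index, rather than dropping the first sorted entry, avoids returning the movie itself when another movie also happens to score 1.
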
Generating tf-idf vectors

    # Import TfidfVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Create TfidfVectorizer object
    vectorizer = TfidfVectorizer()

    # Generate matrix of tf-idf vectors
    tfidf_matrix = vectorizer.fit_transform(movie_plots)

Generating cosine similarity matrix

    # Import cosine_similarity
    from sklearn.metrics.pairwise import cosine_similarity

    # Generate cosine similarity matrix
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

Output:

    array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
           [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
           ...,
           [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])

The linear_kernel function
- The magnitude of a tf-idf vector is 1 (TfidfVectorizer L2-normalizes each document vector by default).
- The cosine score between two tf-idf vectors is therefore just their dot product.
- Skipping the magnitude computation can significantly improve computation time.
- Use linear_kernel instead of cosine_similarity.

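The reasoning behind the shortcut can be verified on the earlier example vectors: once both vectors are scaled to unit length, their plain dot product equals the cosine score of the originals. A minimal sketch with hypothetical helper names:

```python
import math

def normalize(v):
    """Scale v to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

A_unit = normalize((4, 7, 1))
B_unit = normalize((5, 2, 3))

dot_score = dot(A_unit, B_unit)
print(round(dot_score, 4))  # 0.7388, the cosine score of the unnormalized vectors
```
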
Generating cosine similarity matrix

    # Import linear_kernel
    from sklearn.metrics.pairwise import linear_kernel

    # Generate cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Output:

    array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        , 0.00758112],
           [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        , 0.00740494],
           ...,
           [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        , 1.        ]])

The get_recommendations function

    get_recommendations('The Lion King', cosine_sim, indices)

Output:

    7782                      African Cats
    5877    The Lion King 2: Simba's Pride
    4524                         Born Free
    2719                          The Bear
    4770    Once Upon a Time in China III
    7070                        Crows Zero
    739                   The Wizard of Oz
    8926                   The Jungle Book
    1749                 Shadow of a Doubt
    7993                      October Baby
    Name: title, dtype: object

Let's practice!

Beyond n-grams: word embeddings

The problem with BoW and tf-idf
- 'I am happy'
- 'I am joyous'
- 'I am sad'

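One way to read this slide: a bag-of-words model sees each of these sentences as two shared words plus one different word, so it cannot tell that happy is closer in meaning to joyous than to sad. The sketch below makes that concrete with raw counts and whitespace tokenization, both of which are simplifying assumptions for this example.

```python
import math
from collections import Counter

sentences = ["I am happy", "I am joyous", "I am sad"]
vocabulary = sorted({word for s in sentences for word in s.lower().split()})

def bow_vector(sentence):
    """Raw word counts over the shared vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_score(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

happy, joyous, sad = (bow_vector(s) for s in sentences)
happy_joyous = cosine_score(happy, joyous)
happy_sad = cosine_score(happy, sad)

print(math.isclose(happy_joyous, happy_sad))  # True
```

Both pairs get the same score, even though one pair is near-synonymous and the other is opposite in meaning.
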
Word embeddings
- Mapping words into an n-dimensional vector space
- Produced using deep learning and huge amounts of data
- Discern how similar two words are to each other
- Used to detect synonyms and antonyms
- Capture complex relationships:
  King - Queen → Man - Woman
  France - Paris → Russia - Moscow
- The vectors depend on the spaCy model you load, not on the dataset you use

Word embeddings using spaCy

    import spacy

    # Load model and create Doc object
    nlp = spacy.load('en_core_web_lg')
    doc = nlp('I am happy')

    # Generate word vectors for each token
    for token in doc:
        print(token.vector)

Output:

    [-1.0747459e+00  4.8677087e-02  5.6630421e+00  1.6680446e+00
     -1.3194644e+00 -1.5142369e+00  1.1940931e+00 -3.0168812e+00
     ...