CSE 6240: Web Search and Text Mining, Spring 2020
Word Embeddings
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati
Administrivia
• Homework: will be released today after class
• Project reminder: teams due Monday, Jan 20
• A fun exercise at the end of the class!
Homework Policy
• Late day policy: 3 late days (3 x 24-hour chunks)
  – Use as needed
• Collaboration:
  – It is OK to talk, discuss the questions, and explore potential directions for solving them. However, you must write your own solutions and code separately, NOT as a group activity.
  – Please list the students you collaborated with.
• Zero tolerance on plagiarism
  – Follow the GT academic honesty rules
Recap So Far
1. IR and text processing
2. Evaluation of IR systems
Today’s Lecture
• Representing words and phrases
  – Neural network basics
  – Word2vec
  – Continuous bag of words
  – Skip-gram model
Some slides in this lecture are inspired by the slides by Prof. Leonid Sigal, UBC
Representing a Word: One-Hot Encoding
• Given a vocabulary, assign each word an index and convert it to a one-hot vector:
  dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
  cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
  person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
  holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
  tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
  computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
  using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
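A minimal Python sketch of this encoding, assuming the 7-word vocabulary above (the slide's vectors are padded to length 10, presumably for a larger vocabulary; here the length equals the vocabulary size):

```python
# One-hot encoding: a vector of zeros with a single 1 at the word's index.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the vocabulary."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("person"))  # [0, 0, 1, 0, 0, 0, 0]
```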
Recap: Bag of Words Model
• Represent a document as a collection of words (after cleaning the document)
  – The order of words is irrelevant: the document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
• Rank documents according to the overlap between query words and document words
Representing Phrases: Bag of Words
• A phrase is represented by the counts of its words (word indices shown in braces):
  person holding dog                         {3, 4, 1}            [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
  person holding cat                         {3, 4, 2}            [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
  person using computer                      {3, 7, 6}            [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
  person using computer, person holding cat  {3, 7, 6, 3, 4, 2}   [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
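A minimal Python sketch of the bag-of-words count, using the same illustrative 7-word vocabulary (vector length 7 here, rather than the padded length 10 on the slide):

```python
from collections import Counter

def bag_of_words(phrase, vocab):
    """Count each vocabulary word in the phrase; word order is discarded."""
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocab]

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
print(bag_of_words("person using computer person holding cat", vocab))
# [0, 1, 2, 1, 0, 1, 1] -- only counts remain, not positions
```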
Distributional Hypothesis [Lenci, 2008]
• The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
• Similarity in meaning ∝ similarity of context
• Simple definition: context = surrounding words
What Is The Meaning Of “Bardiwac”?
• He handed her a glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
Inference: bardiwac is an alcoholic beverage made from grapes
Geometric Interpretation: Co-occurrence As Feature
• Recall the term-document matrix
  – Rows are terms, columns are documents, and cells record the number of times a term appears in a document
• Here we create a word-word co-occurrence matrix
  – Rows and columns are words
  – Cell (R, C) records how many times word C appears in the neighborhood of word R
• Neighborhood = a window of fixed size around the word
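A sketch of building such a word-word co-occurrence matrix; the tiny corpus and window size are illustrative, not from the slides:

```python
import numpy as np

corpus = [["person", "holding", "dog"], ["person", "using", "computer"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2  # neighborhood: up to 2 words on each side

M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, word in enumerate(sent):
        # count every word within `window` positions of the center word
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[word], idx[sent[j]]] += 1
```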
Row Vectors in the Co-occurrence Matrix
• A row vector describes the usage of a word in the corpus/document
• Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
• Example: n = 2, with dimensions ‘get’ and ‘use’ (co-occurrence matrix shown as figure)
Distance And Similarity
• Select two dimensions, ‘get’ and ‘use’
• Similarity between words = spatial proximity in this space
• Measured by the Euclidean distance
Distance And Similarity
• The exact position in the space depends on the frequency of the word
• More frequent words appear farther from the origin
  – E.g., if ‘dog’ is more frequent than ‘cat’, its vector is longer; that does not mean it is more important
• Solution: ignore the length and look only at the direction
Angle And Similarity
• The angle ignores the exact location of the point
• Method: normalize by the length of the vectors, or use only the angle as a distance measure
• Standard metric: cosine similarity between vectors
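A sketch contrasting the two measures; the ‘dog’ and ‘cat’ row vectors over the (‘get’, ‘use’) dimensions are made up, with ‘dog’ ten times more frequent:

```python
import numpy as np

dog = np.array([100.0, 80.0])  # frequent word: far from the origin
cat = np.array([10.0, 8.0])    # rarer word with the same usage pattern

# Euclidean distance is large because frequency dominates the position
euclidean = np.linalg.norm(dog - cat)  # ~115.3

# Cosine similarity looks only at direction: the vectors are parallel
cosine = dog @ cat / (np.linalg.norm(dog) * np.linalg.norm(cat))  # 1.0

print(euclidean, cosine)
```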
Issues with the Co-occurrence Matrix
• Problems with using the co-occurrence counts directly:
  – The resulting vectors are very high dimensional
    • Dimension size = number of words in the corpus: billions!
  – Down-sampling dimensions is not straightforward
    • How many columns to select?
    • Which columns to select?
• Solution: compression or dimensionality-reduction techniques
SVD for Dimensionality Reduction
• SVD = Singular Value Decomposition
• An input matrix X factorizes as X = U S V^T
  – U holds the left-singular vectors of X, and V the right-singular vectors
  – S is a diagonal matrix; its diagonal values are called singular values
• Keeping only the top r singular values, each row of U gives an r-dimensional vector for the corresponding row of X
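A sketch of truncated SVD with NumPy; the matrix X below is a small random stand-in for a real word-word co-occurrence matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((7, 7))  # stand-in for a |V| x |V| co-occurrence matrix

# Full SVD: X = U @ diag(S) @ Vt, with singular values S in decreasing order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top-r singular values; each row is an r-dim word vector
r = 2
word_vectors = U[:, :r] * S[:r]
```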
Word Visualization via Dimensionality Reduction
(2-D projection of word vectors shown as figure)
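A hypothetical plotting sketch of the kind of figure this slide shows; the words and 2-D coordinates below are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Pretend these are top-2 SVD dimensions of four word vectors
vocab = ["dog", "cat", "tree", "computer"]
vecs = np.array([[0.9, 0.2], [0.8, 0.3], [0.1, 0.9], [-0.5, 0.4]])

plt.scatter(vecs[:, 0], vecs[:, 1])
for word, (x, y) in zip(vocab, vecs):
    plt.annotate(word, (x, y))  # label each point with its word
plt.show()
```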
Issues with SVD
• Computational cost of SVD for an N x M matrix is O(MN^2) (when N < M)
  – Impossible for large vocabularies or document collections; impractical for a real corpus
• Hard to incorporate out-of-sample (new) words/documents
  – Their entire row in the matrix would be 0
Word2Vec: Representing Word Meanings
• Key idea: predict the surrounding words of every word
• Benefits:
  – Faster
  – Easier to incorporate new words and documents
Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS 2013.
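As a usage illustration (not from the original slides), word2vec models can be trained with the gensim library; this sketch assumes gensim >= 4.0, where the embedding-size parameter is named vector_size:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["person", "holding", "dog"], ["person", "using", "computer"]]

# sg=1 selects the skip-gram model; min_count=1 keeps all words of this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

vec = model.wv["person"]  # 50-dimensional embedding for "person"
print(model.wv.most_similar("person", topn=2))
```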