  1. David Packard, A Concordance to Livy (1968)
 Natural Language Processing, Info 159/259, Lecture 8: Vector semantics (Sept 19, 2017), David Bamman, UC Berkeley

  2. Announcements • Homework 2 party today 5-7pm: 202 South Hall • DB office hours on Monday 10/25 10-noon (no office hours this Friday) • No quiz 10/3 or 10/5

  3. http://dlabctawg.github.io, 356 Barrows Hall (D-Lab), Wed 3-5pm

  4. Recurrent neural network • RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history. from last time

  5. Recurrent neural network from last time Goldberg 2017

  6. Recurrent neural network • Each time step has two inputs: • x_i (the observation at time step i); a one-hot vector, feature vector or distributed representation • s_{i-1} (the output of the previous state); base case: s_0 = the zero vector. from last time

  7. Training RNNs • Given this definition of an RNN:
 s_i = R(x_i, s_{i-1}) = g(s_{i-1} W_s + x_i W_x + b)
 y_i = O(s_i) = softmax(s_i W_o + b_o)
 • We have five sets of parameters to learn: W_s, W_x, W_o, b, b_o. from last time
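To make the slide's update rule concrete, here is a minimal NumPy sketch of one RNN step; the tanh nonlinearity for g, the toy dimensions, and the random initialization are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_i, s_prev, W_s, W_x, b, W_o, b_o):
    """One step of the RNN defined above:
       s_i = g(s_{i-1} W_s + x_i W_x + b),  y_i = softmax(s_i W_o + b_o)."""
    s_i = np.tanh(s_prev @ W_s + x_i @ W_x + b)   # g = tanh (an assumed choice)
    y_i = softmax(s_i @ W_o + b_o)
    return s_i, y_i

# toy sizes: input dim 5, hidden dim 4, output (vocab) dim 5
rng = np.random.default_rng(0)
W_s, W_x = rng.normal(size=(4, 4)), rng.normal(size=(5, 4))
b, W_o, b_o = np.zeros(4), rng.normal(size=(4, 5)), np.zeros(5)

s = np.zeros(4)                     # base case: s_0 = 0 vector
for x in np.eye(5)[[0, 3, 2]]:      # a short sequence of one-hot inputs
    s, y = rnn_step(x, s, W_s, W_x, b, W_o, b_o)
```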

  8. Lexical semantics “You shall know a word by the company it keeps” 
 [Firth 1957]

  9. everyone likes ______________
 a bottle of ______________ is on the table
 ______________ makes you drunk
 a cocktail with ______________ and seltzer

  10. Context “You shall know a word by the company it keeps” 
 [Firth 1957] • A few different ways we can encode the notion of “company” (or context).

  11. context
 everyone likes ______________
 a bottle of ______________ is on the table
 ______________ makes you drunk
 a cocktail with ______________ and seltzer

  12. Distributed representation • Vector representation that encodes information about the distribution of contexts a word appears in • Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis).

  13. Term-document matrix (columns: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear; nonzero counts only)
 knife: 1 1 4 2 2 2
 dog: 2 6 6 2 12
 sword: 17 2 7 12 2 17
 love: 64 135 63 12 48
 like: 75 38 34 36 34 41 27 44
 Context = appearing in the same document.

  14. Vector representation of the document; vector size = V
 Hamlet = [1, 2, 17, 64, 75]
 King Lear = [2, 12, 17, 48, 44]

  15. Vectors • Vector representation of the term; vector size = number of documents
 knife = [1, 1, 4, 2, 2, 2]
 sword = [17, 2, 7, 12, 2, 17]

  16. Weighting dimensions • Not all dimensions are equally informative

  17. TF-IDF • Term frequency-inverse document frequency • A scaling that represents a feature as a function of how frequently it appears in a data point, while accounting for its frequency in the overall collection • IDF for a given term = the number of documents in the collection / the number of documents that contain the term

  18. TF-IDF • Term frequency (tf_{t,d}) = the number of times term t occurs in document d • Inverse document frequency = the inverse fraction of the number of documents containing the term (D_t) among the total number of documents N
 tfidf(t, d) = tf_{t,d} × log(N / D_t)
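As a concrete illustration of this weighting (not code from the lecture), a short NumPy sketch; the toy count matrix and the base-10 log (matching the IDF values on the next slide) are assumptions.

```python
import numpy as np

# toy term-document count matrix: rows = terms, columns = documents
counts = np.array([[1, 0, 4, 2],       # hypothetical counts, not the slide's table
                   [17, 2, 7, 12],
                   [75, 38, 34, 36]])

N = counts.shape[1]                    # total number of documents
D_t = (counts > 0).sum(axis=1)         # number of documents containing each term
idf = np.log10(N / D_t)                # log(N / D_t)
tfidf = counts * idf[:, None]          # tf_{t,d} * log(N / D_t)
```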

  19. IDF (columns as before: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear; rightmost column = IDF)
 knife: 1 1 4 2 2 2 | IDF 0.12
 dog: 2 6 6 2 12 | IDF 0.20
 sword: 17 2 7 12 2 17 | IDF 0.12
 love: 64 135 63 12 48 | IDF 0.20
 like: 75 38 34 36 34 41 27 44 | IDF 0
 IDF captures the informativeness of the terms when comparing documents.

  20. PMI • Mutual information provides a measure of how independent two variables (X and Y) are. • Pointwise mutual information measures the independence of two outcomes (x and y)

  21. PMI
 PMI(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]
 For w = word, c = context: PMI(w, c) = log2 [ P(w, c) / ( P(w) P(c) ) ]
 What's this value for a w and c that never occur together?
 PPMI(w, c) = max( log2 [ P(w, c) / ( P(w) P(c) ) ], 0 )

  22. Counts with row and column totals (columns: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear, total)
 knife: 1 1 4 2 2 2 | total 12
 dog: 2 6 6 2 12 | total 28
 sword: 17 2 7 12 2 17 | total 57
 love: 64 135 63 12 48 | total 322
 like: 75 38 34 36 34 41 27 44 | total 329
 column totals: 159 41 186 119 34 59 27 123 | grand total 748
 PMI(love, R&J) = log2 [ (135/748) / ( (186/748) × (322/748) ) ]
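A sketch of computing PMI and PPMI from a count table like the one above; the toy counts here are placeholders rather than the slide's exact matrix.

```python
import numpy as np

# toy word-by-document count matrix (rows = words, columns = documents)
counts = np.array([[1.,  0.,   4.,  2.],
                   [64., 135., 63., 12.],
                   [75., 38.,  34., 36.]])

total = counts.sum()
P_wc = counts / total                      # joint probability P(w, c)
P_w = P_wc.sum(axis=1, keepdims=True)      # marginal P(w)  (row totals / grand total)
P_c = P_wc.sum(axis=0, keepdims=True)      # marginal P(c)  (column totals / grand total)

with np.errstate(divide="ignore"):         # log2(0) = -inf for pairs that never co-occur
    pmi = np.log2(P_wc / (P_w * P_c))
ppmi = np.maximum(pmi, 0)                  # PPMI = max(PMI, 0)
```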

  23. Term-term matrix • Rows and columns are both words; cell counts = the number of times words w_i and w_j show up in the same document. • It is more common to define the document as some smaller context (e.g., a window of 5 tokens).
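One possible way to build such window-based term-term counts, sketched in Python; the helper name and the example sentence are made up for illustration.

```python
from collections import defaultdict

def term_term_counts(tokens, window=5):
    """Count how often each pair of words co-occurs within a +/- `window`
       token context (treating the window, not the document, as the context)."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

pairs = term_term_counts("everyone likes a cocktail with gin and seltzer".split())
```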

  24. Term-document matrix (columns: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear; nonzero counts only)
 knife: 1 1 4 2 2 2
 dog: 2 6 6 2 12
 sword: 17 2 7 12 2 17
 love: 64 135 63 12 48
 like: 75 38 34 36 34 41 27 44

  25. Term-term matrix
        knife  dog  sword  love  like
 knife    6     5     6     5     5
 dog      5     5     5     5     5
 sword    6     5     6     5     5
 love     5     5     5     5     5
 like     5     5     5     5     8

  26. Term-term matrix Jurafsky and Martin 2017

  27. write a book
 write a poem
 • First-order co-occurrence (syntagmatic association): write co-occurs with book in the same sentence. • Second-order co-occurrence (paradigmatic association): book co-occurs with poem (since each co-occurs with write).

  28. Syntactic context Lin 1998; Levy and Goldberg 2014

  29. Cosine Similarity
 cos(x, y) = ( Σ_{i=1}^F x_i y_i ) / ( √(Σ_{i=1}^F x_i²) √(Σ_{i=1}^F y_i²) )
 • We can calculate the cosine similarity of two vectors to judge the degree of their similarity [Salton 1971] • Euclidean distance measures the magnitude of the distance between two points • Cosine similarity measures their orientation
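The same formula in NumPy, applied to the knife and sword count vectors shown earlier (treated purely as example numbers).

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = sum_i x_i y_i / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))"""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

knife = np.array([1, 1, 4, 2, 2, 2])
sword = np.array([17, 2, 7, 12, 2, 17])
print(cosine(knife, sword))   # unchanged if either vector is rescaled: orientation, not magnitude
```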

  30. Intrinsic Evaluation • Relatedness: correlation (Spearman/Pearson) between vector similarity of a pair of words and human judgments
 word 1      word 2       human score
 midday      noon         9.29
 journey     voyage       9.29
 car         automobile   8.94
 …           …            …
 professor   cucumber     0.31
 king        cabbage      0.23
 WordSim-353 (Finkelstein et al. 2002)
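A sketch of this evaluation with scipy's spearmanr; the model similarity scores below are invented placeholders standing in for cosine similarities of learned vectors.

```python
import numpy as np
from scipy.stats import spearmanr

# human relatedness judgments (from the table above) and hypothetical model similarities
human = np.array([9.29, 9.29, 8.94, 0.31, 0.23])
model = np.array([0.81, 0.74, 0.90, 0.12, 0.20])   # e.g. cosine similarities

rho, _ = spearmanr(human, model)   # rank correlation between model scores and human judgments
print(rho)
```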

  31. Intrinsic Evaluation • Analogical reasoning (Mikolov et al. 2013). For the analogy Germany : Berlin :: France : ???, find the closest vector to v(“Berlin”) - v(“Germany”) + v(“France”)
 Example analogies:
 possibly : impossibly :: certain : uncertain
 generating : generated :: shrinking : shrank
 think : thinking :: look : looking
 Baltimore : Maryland :: Oakland : California
 shrinking : shrank :: slowing : slowed
 Rabat : Morocco :: Astana : Kazakhstan
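A sketch of the analogy test, assuming word vectors are available as a Python dict mapping words to NumPy arrays; the analogy helper itself is hypothetical.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """For a : b :: c : ?, return the word whose vector is closest (by cosine)
       to v(b) - v(a) + v(c), excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vectors.items():
        if w in (a, b, c):
            continue
        sim = np.dot(v, target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# usage with some pretrained vectors:
# analogy("Germany", "Berlin", "France", vectors)   # ideally returns "Paris"
```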

  32. Sparse vectors • “aardvark” is a V-dimensional vector with a single 1 for the identity of the element:
 A 0, a 0, aa 0, aal 0, aalii 0, aam 0, Aani 0, aardvark 1, aardwolf 0, …, zymotoxic 0, zymurgy 0, Zyrenian 0, Zyrian 0, Zyryan 0, zythem 0, Zythia 0, zythum 0, Zyzomys 0, Zyzzogeton 0

  33. Dense vectors: 1 → [0.7, 1.3, -4.5]

  34. Singular value decomposition • Any n ⨉ p matrix X can be decomposed into the product of three matrices (where m = the number of linearly independent rows): an n ⨉ m matrix, an m ⨉ m diagonal matrix, and an m ⨉ p matrix.
 (figure: the three factors, with the singular values, e.g. 9, 4, 3, 1, 2, on the diagonal of the middle matrix)

  35. Singular value decomposition • We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix
 (figure: the same decomposition with only the largest singular values, 9 and 4, kept and the rest zeroed out)

  36. Singular value decomposition • We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values)
 (figure: the same decomposition with only the largest singular values, 9 and 4, kept and the rest zeroed out)
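A minimal NumPy sketch of this truncation: compute the full SVD, keep the k largest singular values, and read off low-dimensional term and document representations; the random toy matrix and k = 2 are assumptions.

```python
import numpy as np

# toy term-document count matrix (5 terms x 8 documents)
X = np.random.default_rng(0).poisson(2.0, size=(5, 8)).astype(float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt

k = 2                                              # keep the k largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of X

term_vectors = U[:, :k] * s[:k]                    # low-dimensional term representations
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T       # low-dimensional document representations
```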

  37. The term-document matrix (columns: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear)
 knife: 1 1 4 2 2 2
 dog: 2 6 6 2 12
 sword: 17 2 7 12 2 17
 love: 64 135 63 12 48
 like: 75 38 34 36 34 41 27 44
 (figure: the SVD of this matrix, with the terms knife, dog, sword, love, like indexing the rows of one factor and the eight plays indexing the columns of another)

  38. Low-dimensional representation for terms (here 2-dim); low-dimensional representation for documents (here 2-dim)
 (figure: a 2-dimensional vector for each term (knife, dog, sword, love, like) and a 2-dimensional vector for each of the eight plays)

  39. Latent semantic analysis • Latent Semantic Analysis/Indexing (Deerwester et al. 1998) is the process of applying SVD to the term-document co-occurrence matrix • Terms are typically weighted by tf-idf • This is a form of dimensionality reduction (for terms, from a D-dimensional sparse vector to a K-dimensional dense one), K << D.

  40. Dense vectors from prediction • Learning low-dimensional representations of words by framing a prediction task: using context to predict words in a surrounding window • Transform this into a supervised prediction problem; similar to language modeling, but we're ignoring order within the context window

  41. Dense vectors from prediction • Example: a cocktail with gin and seltzer • Window size = 3
 x           y
 a           gin
 cocktail    gin
 with        gin
 and         gin
 seltzer     gin
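A sketch of generating such (x, y) pairs for every target word in a sentence; the slide shows only the rows whose target is gin, and the function name here is made up.

```python
def training_pairs(tokens, window=3):
    """Generate (context word, target word) pairs, ignoring order within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs

print(training_pairs("a cocktail with gin and seltzer".split(), window=3))
# includes ('a', 'gin'), ('cocktail', 'gin'), ('with', 'gin'), ('and', 'gin'), ('seltzer', 'gin')
```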

  42. Dimensionality reduction • The one-hot vector for “the” (the 1, a 0, an 0, for 0, in 0, on 0, dog 0, cat 0, …) maps to the dense vector [4.1, -0.9] • “the” is a point in V-dimensional space; “the” is a point in 2-dimensional space

  43. (figure: a network with a one-hot input x over the vocabulary (gin, cocktail, globe, …), a weight matrix W mapping x to a hidden layer h = [h1, h2], and a weight matrix V mapping h to an output y over the same vocabulary, shown with example weight and activation values)
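A sketch of a forward pass through a network of this shape, with a three-word toy vocabulary and random weights standing in for the W and V in the figure; the softmax on the output layer is an assumption.

```python
import numpy as np

vocab = ["gin", "cocktail", "globe"]
V_size, H = len(vocab), 2

rng = np.random.default_rng(0)
W = rng.normal(size=(V_size, H))   # input-to-hidden weights; each row acts as a word embedding
V = rng.normal(size=(H, V_size))   # hidden-to-output weights

def predict(word):
    x = np.eye(V_size)[vocab.index(word)]   # one-hot input x
    h = x @ W                                # hidden layer h (selects the row of W for `word`)
    scores = h @ V
    return np.exp(scores) / np.exp(scores).sum()   # softmax over the output vocabulary

print(predict("cocktail"))   # a distribution over {gin, cocktail, globe}
```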
