Term-document matrix


  1. Term-document matrix

     Word counts in four French novels: Quatrevingt-treize (Hugo), Le voyage en ballon
     (Verne), La bête humaine (Zola) and Mme Bovary (Flaubert). The first row gives the
     size of each document (kw: thousands of words). Glosses: bataille 'battle', clair
     'clear', facile 'easy', politique 'politics', voyage 'journey', idiot 'idiot',
     amour 'love'.

                   QuatreVT   Voyage Bal.   Bête Hum.   Mme Bovary
     (size)        119 kw     82 kw         128 kw      117 kw
     bataille      35         4             6           2
     clair         105        26            96          52
     facile        12         19            6           10
     politique     11         0             9           5
     voyage        17         196           94          44
     idiot         2          1             2           6
     amour         19         0             47          94
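Not part of the slides, but as a concrete companion: the matrix can be held as a plain NumPy array, from which the document vectors of slides 2-3 (columns) and the word vectors of slide 4 (rows) are just slices.

```python
# Minimal sketch (illustration only): the term-document matrix as a NumPy array.
import numpy as np

words = ["bataille", "clair", "facile", "politique", "voyage", "idiot", "amour"]
docs = ["QuatreVT", "Voyage Bal.", "Bête Hum.", "Mme Bovary"]

M = np.array([
    [35, 4, 6, 2],       # bataille
    [105, 26, 96, 52],   # clair
    [12, 19, 6, 10],     # facile
    [11, 0, 9, 5],       # politique
    [17, 196, 94, 44],   # voyage
    [2, 1, 2, 6],        # idiot
    [19, 0, 47, 94],     # amour
])

doc_vector = M[:, docs.index("Mme Bovary")]   # a document as a vector over words
word_vector = M[words.index("amour"), :]      # a word as a vector over documents
print(doc_vector, word_vector)
```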

  2. Documents as vectors

     Each column of the matrix describes a document as a vector of word counts; here
     restricted to the words bataille and amour:

                   QuatreVT   Voyage Bal.   Bête Hum.   Mme Bovary
     bataille      35         4             6           2
     amour         19         0             47          94

  3. Documents as vectors

     The same documents, now over the words voyage and amour:

                   QuatreVT   Voyage Bal.   Bête Hum.   Mme Bovary
     voyage        17         196           94          44
     amour         19         0             47          94

  4. Term-document vectors

     The representation can be flipped: the dimensions are now the documents, and the
     vectors describe words (each row of the matrix is a word vector).

                   QuatreVT   Voyage Bal.   Bête Hum.   Mme Bovary
     (size)        119 kw     82 kw         128 kw      117 kw
     bataille      35         4             6           2
     clair         105        26            96          52
     facile        12         19            6           10
     politique     11         0             9           5
     voyage        17         196           94          44
     idiot         2          1             2           6
     amour         19         0             47          94

     amour (like politique) is the kind of word that does not appear in
     "Le voyage en ballon".

  5. We can visualize the words in the space (Quatrevingt-treize, Mme Bovary):
     bataille (35, 2), politique (11, 5), amour (19, 94), voyage (17, 44).
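A small sketch (not from the slides; assumes matplotlib is installed) reproducing this two-dimensional view, with the count in Quatrevingt-treize on the x-axis and the count in Mme Bovary on the y-axis:

```python
# Plot each word at (count in Quatrevingt-treize, count in Mme Bovary).
import matplotlib.pyplot as plt

points = {"bataille": (35, 2), "politique": (11, 5), "amour": (19, 94), "voyage": (17, 44)}

fig, ax = plt.subplots()
for word, (x, y) in points.items():
    ax.scatter(x, y)
    ax.annotate(word, (x, y))
ax.set_xlabel("count in Quatrevingt-treize")
ax.set_ylabel("count in Mme Bovary")
plt.show()
```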

  6. Distributional counts

     How often each target word (rows) co-occurs with each context verb (columns:
     arriver 'arrive', tomber 'fall', habiller 'dress', mourir 'die'; homme 'man',
     femme 'woman'):

                 arriver   tomber   habiller   mourir
     bataille    246       100      2          180
     voyage      470       83       4          116
     homme       1,819     1,205    339        1,499
     femme       890       660      384        1,088

  7. Distributional counts

     The same counts, with an extra column that is presumably each word's total
     frequency; it is the quantity used to normalise the counts on the next slide.

                 arriver   tomber   habiller   mourir   total
     bataille    246       100      2          180      55,331
     voyage      470       83       4          116      208,520
     homme       1,819     1,205    339        1,499    668,289
     femme       890       660      384        1,088    346,093

  8. Distributional counts (normalised)

                 arriver   tomber   habiller   mourir
     bataille    44        18       0          32
     voyage      23        4        0          6
     homme       27        18       5          22
     femme       26        19       11         31
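The slides do not spell out the normalisation, but the values are roughly what you get by dividing each raw count by the word's total from the previous slide and scaling to occurrences per 10,000 (e.g. 246 / 55,331 × 10,000 ≈ 44). A minimal sketch under that assumption:

```python
# Hedged sketch: normalise raw co-occurrence counts by each word's (assumed) total
# frequency, scaled to "per 10,000 occurrences". The results approximately match
# the slide, up to rounding.
raw = {
    "bataille": ([246, 100, 2, 180], 55_331),
    "voyage":   ([470, 83, 4, 116], 208_520),
    "homme":    ([1_819, 1_205, 339, 1_499], 668_289),
    "femme":    ([890, 660, 384, 1_088], 346_093),
}
contexts = ["arriver", "tomber", "habiller", "mourir"]

for word, (counts, total) in raw.items():
    per_10k = [round(c / total * 10_000) for c in counts]
    print(word, dict(zip(contexts, per_10k)))
```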

  9. Term-term matrix

     More common: the word-word matrix (or "term-context matrix"). Two words are
     similar in meaning if their context vectors are similar. Example contexts:

       ... sugar, a sliced lemon, a tablespoonful of apricot jam, a pinch each of ...
       ... their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened ...
       ... well suited to programming on the digital computer. In finding the optimal R-stage policy from ...
       ... for the purpose of gathering data and information necessary for the study authorized in the ...

                   aardvark   computer   data   pinch   result   sugar   ...
     apricot       0          0          0      1       0        1
     pineapple     0          0          0      1       0        1
     digital       0          2          1      0       1        0
     information   0          1          6      0       4        0
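An illustrative sketch (not from the slides) of how such a term-context matrix is collected: for every token, count the words that occur within a fixed window around it. The toy corpus and window size below are made up for the example.

```python
# Count, for each word, the words appearing within +/- `window` tokens of it.
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

corpus = "a tablespoonful of apricot jam a pinch each of sugar".split()
print(cooccurrence_counts(corpus)["apricot"])
# neighbours of 'apricot' within the window: tablespoonful, of, jam, a (once each)
```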

  10. [Figure: the vectors digital = [1, 1] and information = [6, 4] plotted in the
      plane spanned by the contexts 'data' (x-axis) and 'result' (y-axis).]

  11. Reminders from linear algebra

      Dot product:

      $$ \vec{v} \cdot \vec{w} \;=\; \sum_{i=1}^{N} v_i w_i \;=\; v_1 w_1 + v_2 w_2 + \dots + v_N w_N $$

      Vector length:

      $$ |\vec{v}| \;=\; \sqrt{\sum_{i=1}^{N} v_i^2} $$
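The two definitions, written out in plain Python (illustration only):

```python
import math

def dot_product(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))   # v1*w1 + v2*w2 + ... + vN*wN

def length(v):
    return math.sqrt(sum(vi * vi for vi in v))    # |v| = sqrt(sum of squares)

print(dot_product([1, 6, 1], [0, 1, 2]), length([0, 1, 2]))  # 8 2.236...
```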

  12. Cosine for computing similarity

      $$ \mathrm{cosine}(\vec{v}, \vec{w}) \;=\; \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} \;=\; \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}} $$

      where v_i is the count for word v in context i and w_i is the count for word w
      in context i. Since $\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta$,
      we have $\frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta$:
      cos(v, w) is the cosine of the angle between v and w, i.e. their cosine
      similarity.

  13. Cosine as a similarity metric

      -1: vectors point in opposite directions
      +1: vectors point in the same direction
       0: vectors are orthogonal

      Raw frequencies are non-negative, so in practice the cosine ranges from 0 to 1.

  14. Which pair of words is more similar?

                    large   data   computer
      apricot       1       0      0
      digital       0       1      2
      information   1       6      1

      $$ \mathrm{cosine}(\text{apricot}, \text{information}) = \frac{1 + 0 + 0}{\sqrt{1 + 0 + 0}\,\sqrt{1 + 36 + 1}} = \frac{1}{\sqrt{38}} \approx 0.16 $$

      $$ \mathrm{cosine}(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} \approx 0.58 $$

      $$ \mathrm{cosine}(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{1 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0 $$
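The same computation in a few lines of Python (not from the slides; it simply applies the cosine formula of slide 12 to the three rows above):

```python
# Reproduce the three cosine similarities from the worked example.
import math

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(x * x for x in w)))

vectors = {
    "apricot":     [1, 0, 0],   # counts over the contexts (large, data, computer)
    "digital":     [0, 1, 2],
    "information": [1, 6, 1],
}

for a, b in [("apricot", "information"), ("digital", "information"), ("apricot", "digital")]:
    print(f"cosine({a}, {b}) = {cosine(vectors[a], vectors[b]):.2f}")
# cosine(apricot, information) = 0.16
# cosine(digital, information) = 0.58
# cosine(apricot, digital) = 0.00
```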

  15. Visualizing cosines (well, angles)

      [Figure: apricot, digital and information drawn as vectors in the plane with
      dimension 1 = count of 'large' and dimension 2 = count of 'data'; the angle
      between digital and information is much smaller than the angle between apricot
      and information.]

  16. Contextual lexical representations (Smith, 2020): words as distributional vectors

      Example: clustering.

      [Figure 2, "Sample subtrees from a 1,000-word mutual information tree", source:
      (Brown et al., 1992). The subtrees group words such as evaluation / assessment /
      analysis / understanding / opinion / conversation / discussion; plan / letter /
      request / memo / case / question / charge / statement / draft; day / year / week /
      month / half / quarter; and accounts / people / customers / individuals /
      employees / students / reps / representatives / representative / rep.]

      Excerpt from Brown et al. (1992) describing the clustering algorithm:

      "[...] to this single cluster and the leaves of which correspond to the words in
      the vocabulary. Intermediate nodes of the tree correspond to groupings of words
      intermediate between single words and the entire vocabulary. Words that are
      statistically similar with respect to their immediate neighbors in running text
      will be close together in the tree. We have applied this tree-building algorithm
      to vocabularies of up to 5,000 words. Figure 2 shows some of the substructures in
      a tree constructed in this manner for the 1,000 most frequent words in a
      collection of office correspondence.

      Beyond 5,000 words this algorithm also fails of practicality. To obtain clusters
      for larger vocabularies, we proceed as follows. We arrange the words in the
      vocabulary in order of frequency with the most frequent words first and assign
      each of the first C words to its own, distinct class. At the first step of the
      algorithm, we assign the (C + 1)st most probable word to a new class and merge
      that pair among the resulting C + 1 classes for which the loss in average mutual
      information is least. At the kth step of the algorithm, we assign the (C + k)th
      most probable word to a new class. This restores the number of classes to C + 1,
      and we again merge that pair for which the loss in average mutual information is
      least. After V - C steps, each of the words in the vocabulary will have been
      assigned to one of C classes. We have used this algorithm to divide the
      260,741-word vocabulary of Table 1 into 1,000 classes. Table 2 contains examples
      of classes that we find particularly interesting. Table 3 contains examples that
      were selected at random. Each of the lines in the tables contains members of a
      different class. The average class has 260 words and so to make the table
      manageable, we include only words that occur at least ten times and [...]"
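The excerpt describes a greedy procedure: repeatedly merge the pair of word classes whose merge loses the least average mutual information. Below is a toy, brute-force sketch of that idea. It is my illustration, not the authors' implementation: it skips their frequency-ordered variant and recomputes the objective from scratch at every step, which is only workable for tiny corpora; the example corpus and target number of classes are made up.

```python
# Toy sketch of Brown-style greedy class merging: start with one class per word,
# then repeatedly merge the pair of classes whose merge costs the least average
# mutual information (AMI) over class bigrams.
import math
from collections import Counter
from itertools import combinations

def average_mutual_information(tokens, word2class):
    """Sum over class bigrams (c1, c2) of p(c1, c2) * log(p(c1, c2) / (p(c1) * p(c2)))."""
    bigrams = Counter((word2class[a], word2class[b]) for a, b in zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in bigrams.items():
        left[c1] += k
        right[c2] += k
    return sum((k / n) * math.log(k * n / (left[c1] * right[c2]))
               for (c1, c2), k in bigrams.items())

def brown_clusters(tokens, n_classes):
    word2class = {w: w for w in set(tokens)}          # start: one class per word
    classes = set(word2class.values())
    while len(classes) > n_classes:
        best = None
        for c1, c2 in combinations(sorted(classes), 2):
            trial = {w: (c1 if c == c2 else c) for w, c in word2class.items()}
            ami = average_mutual_information(tokens, trial)
            if best is None or ami > best[0]:          # merge losing the least AMI
                best = (ami, c1, c2)
        _, keep, drop = best
        word2class = {w: (keep if c == drop else c) for w, c in word2class.items()}
        classes.discard(drop)
    return word2class

# Made-up toy corpus; real Brown clustering runs over millions of tokens.
tokens = "the cat sat on the mat the dog sat on the rug".split()
print(brown_clusters(tokens, 4))
```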

  17. Contextual lexical representations (Smith, 2020): words as distributional vectors

      Example: Brown clusters over a corpus of tweets.

      Figure 1 from (Smith, 2020): "Example Brown clusters. These were derived from 56M
      tweets, see Owoputi et al. for details. Shown are the 10 most frequent words in
      clusters in the section of the hierarchy with prefix bit string 00110.
      Intermediate nodes in the tree correspond to clusters that contain all words in
      their descendants. Note that differently spelled variants of words tend to
      cluster together, as do words that express similar meanings, including hashtags.
      The full set of clusters can be explored at
      http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html. Note that there are
      several Unicode characters that are visually similar to the apostrophe, resulting
      in different strings with similar usage."
