Distributed Representations of Sentences and Documents
QUOC LE, TOMAS MIKOLOV
PRESENTERS: AMIN and ALI

Outline
▶ Introduction
▶ Algorithm
  Learning Vector Representation of Words
  Paragraph Vector: A distributed memory model
  Paragraph Vector without word ordering: Distributed bag of words
▶ Experiments
▶ Conclusion
▶ Demo
Introduction
▶ Many machine learning algorithms require the input to be represented as a fixed-length feature vector.
▶ When it comes to texts, one of the most common fixed-length features is bag-of-words.

Bag of Words
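A minimal bag-of-words sketch in plain Python (the toy sentences and vocabulary are made up for illustration, not part of the original slides):

```python
from collections import Counter

# Toy corpus; sentences are illustrative only.
sentences = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({w for s in sentences for w in s.split()})

def bag_of_words(sentence, vocab):
    """Return a fixed-length count vector over the vocabulary."""
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

for s in sentences:
    print(s, "->", bag_of_words(s, vocab))
# Word order is lost: "mat the on sat cat the" maps to the same vector.
```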
Bag of Words Disadvantages
▶ The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used.
▶ Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and high dimensionality.
▶ Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, the distances between the words. (powerful, Paris, strong)

Word Embedding
[Diagram: CBOW sums the context words word(i-k) … word(i+k) into a projection that predicts word(i); Skipgram projects word(i) to predict its surrounding words.]
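As a rough illustration of the two architectures in the diagram (not part of the original slides), gensim's Word2Vec can train either one: sg=0 selects CBOW and sg=1 Skipgram. The corpus and hyperparameters are placeholders, and the keyword names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (placeholder data).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["paris", "is", "a", "powerful", "strong", "city"]]

# sg=0 -> CBOW: sum/average the context words to predict the center word.
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skipgram: project the center word to predict its context words.
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])                     # dense vector for "cat"
print(skipgram.wv.most_similar("cat", topn=2))
```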
Word Embedding

Proposed Method
▶ The Distributed Representations of Sentences and Documents model was proposed.
▶ Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.
▶ The proposed algorithm represents each document by a dense vector which is trained to predict words in the document.
Learning Vector Representation of Words
▶ The task is to predict a word given the other words in a context.

Paragraph Vector: A Distributed Memory Model (PV-DM)
▶ Paragraph vectors are used for prediction.
▶ Every paragraph is mapped to a unique vector.
▶ Every word is also mapped to a unique vector.
Paragraph Vector: A Distributed Memory Model (PV-DM)
▶ The contexts are sampled from a sliding window over the paragraph.
▶ The paragraph vector is shared across all contexts from the same paragraph.
▶ Word vectors are shared across paragraphs.

Advantages over BOW
▶ Semantics of the words: in this space, "powerful" is closer to "strong" than to "Paris".
▶ Takes the word order into consideration.
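The slides describe PV-DM conceptually; as a hedged sketch (not the authors' implementation), gensim's Doc2Vec with dm=1 trains a PV-DM-style model in which each paragraph tag gets its own vector. The toy paragraphs and hyperparameters below are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder paragraphs; each paragraph gets a unique tag (its paragraph-vector id).
paragraphs = ["the cat sat on the mat", "paris is a city in france"]
docs = [TaggedDocument(words=p.split(), tags=[i]) for i, p in enumerate(paragraphs)]

# dm=1 -> PV-DM: the paragraph vector is combined with context word vectors
# sampled from a sliding window to predict the next word.
pv_dm = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)

print(pv_dm.dv[0][:5])                                    # trained vector of paragraph 0
print(pv_dm.infer_vector("a cat on a mat".split())[:5])   # vector inferred for unseen text
```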
Paragraph Vector: Distributed Bag of Words (PV-DBOW)
▶ In this version, the paragraph vector is trained to predict the words in a small window.

Experiments
▶ Each paragraph vector is a combination of two vectors: one learned by PV-DM and one learned by PV-DBOW.
▶ Sentiment Analysis
  Stanford Sentiment Treebank: 11,855 sentences
  IMDB: 100,000 movie reviews
▶ Information Retrieval
Stanford Sentiment Treebank
▶ Learn the representations for all the sentences.
▶ The paragraph vector is the concatenation of two vectors from PV-DBOW and PV-DM.
▶ Logistic Regression was used for prediction.
▶ Every sentence has a label which goes from 0.0 to 1.0.

Stanford Sentiment Treebank (results table)
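A hedged sketch of the pipeline described on this slide, assuming gensim for the paragraph vectors and scikit-learn for the classifier; the sentences, labels, and hyperparameters are placeholders, not the actual treebank setup.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Placeholder sentences and binary sentiment labels (not the real treebank).
sentences = ["a wonderful and moving film", "a dull and lifeless movie"]
labels = [1, 0]
docs = [TaggedDocument(s.split(), [i]) for i, s in enumerate(sentences)]

pv_dm = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)

# Each sentence is represented by the concatenation of its PV-DM and PV-DBOW vectors.
X = np.array([np.concatenate([pv_dm.dv[i], pv_dbow.dv[i]])
              for i in range(len(sentences))])
clf = LogisticRegression().fit(X, labels)

# Infer vectors for a new sentence and classify it.
new = "a moving film".split()
test = np.concatenate([pv_dm.infer_vector(new), pv_dbow.infer_vector(new)])
print(clf.predict([test]))
```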
IMDB
▶ Using Neural Networks and Logistic Regression for prediction.
▶ The paragraph vector is the concatenation of two vectors from PV-DBOW and PV-DM.

IMDB (results table)
Information Retrieval (example paragraph snippets)
▶ "calls from ( 000 ) 000 - 0000 . 3913 calls reported from this number . according to 4 reports the identity of this caller is american airlines ."
▶ "do you want to find out who called you from +1 000 - 000 - 0000 , +1 0000000000 or ( 000 ) 000 - 0000 ? see reports and share information you have about this caller"
▶ "allina health clinic patients for your convenience , you can pay your allina health clinic bill online . pay your clinic bill now , question and answers ..."

Observations
▶ PV-DM is consistently better than PV-DBOW.
▶ PV-DM alone can achieve good results.
▶ The combination of PV-DM and PV-DBOW gains the best results.
▶ A good guess for window size is between 5 and 12.
▶ The proposed method must be run in parallel.
Advantages and Disadvantages
▶ The proposed method is competitive with state-of-the-art methods.
▶ The good performance demonstrates the merits of Paragraph Vector in capturing the semantics of paragraphs.
▶ It is scalable (sentences, paragraphs, and documents).
▶ Paragraph vectors have the potential to overcome many weaknesses of bag-of-words (word order, word meaning, …).
▶ Paragraph Vector can be expensive.
▶ Too many parameters.
▶ If the input corpus is one with lots of misspellings, like tweets, this algorithm may not be a good choice.

Demo
Slide 23 [diagram]: CBOW example — predict "sat" from the context words "cat" and "on". Each context word is a V-dimensional one-hot vector in the input layer (the index of "cat" in the vocabulary is set to 1, all other entries are 0); the inputs feed a hidden layer, and the output layer is the one-hot vector for the target word "sat".

We must learn W and W'
Slide 24 [diagram]: the V-dim one-hot inputs for "cat" and "on" are multiplied by $W_{V \times N}$ to form the N-dim hidden layer, which is multiplied by $W'_{N \times V}$ to produce the V-dim output for "sat". N will be the size of the word vector.
Note on Slide 23 (vh2): One-hot encoding is used to encode categorical integer features using a one-hot, aka one-of-K, scheme. Suppose you have a 'color' feature which can take the values 'green', 'red', and 'blue'. One-hot encoding will convert this 'color' feature to three features, namely 'is_green', 'is_red', and 'is_blue', all of which are binary. (vagelis hristidis, 2016-11-06)
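A tiny sketch of the note's color example in plain Python (feature names as in the note):

```python
# One-hot encode a categorical 'color' feature with values green / red / blue.
categories = ["green", "red", "blue"]

def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("red", categories))   # [0, 1, 0]  -> is_green, is_red, is_blue
```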
Slides 25-26 [diagrams]: each one-hot input selects one row of W, i.e. $W^{\top} x_{cat}$ and $W^{\top} x_{on}$ are the word vectors of "cat" and "on"; the hidden layer is their average, $\hat{v} = \frac{1}{2}\left(W^{\top} x_{cat} + W^{\top} x_{on}\right)$, an N-dim vector used to predict "sat" at the V-dim output layer.
Slides 27-28 [diagrams]: the hidden vector is multiplied by $W'$ and passed through a softmax, $\hat{y} = \mathrm{softmax}\left(W'^{\top} \hat{v}\right)$, giving a probability over the V vocabulary words. We would prefer $\hat{y}$ to be close to the true one-hot target $y_{sat}$: in the example the entry for "sat" gets probability 0.7 while the other entries stay near 0. N will be the size of the word vector.
Slide 29 [diagram]: the matrix W contains the word vectors (one per vocabulary word).
We can consider either W or W' as the word's representation, or even take the average.
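A small numpy sketch of the forward pass walked through in these demo slides; the vocabulary size, embedding size, word indices, and random weights are arbitrary placeholders.

```python
import numpy as np

V, N = 5, 3                       # vocabulary size, word-vector size (arbitrary)
rng = np.random.default_rng(0)
W = rng.random((V, N))            # input->hidden weights (rows are word vectors)
Wp = rng.random((N, V))           # hidden->output weights (W')

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

x_cat, x_on = one_hot(0, V), one_hot(3, V)       # toy indices for "cat" and "on"
h = (W.T @ x_cat + W.T @ x_on) / 2               # hidden layer: average of context vectors
scores = Wp.T @ h
y_hat = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
print(y_hat)                                     # training pushes the "sat" entry toward 1

# Word representation: a row of W, a column of W', or their average.
word_vec_cat = (W[0] + Wp[:, 0]) / 2
```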