Distributed Representations of Sentences and Documents
QUOC LE, TOMAS MIKOLOV
PRESENTERS: AMIN and ALI

Outline
▶ Introduction
▶ Algorithm
  Learning Vector Representation of Words
  Paragraph Vector: A distributed memory model
  Paragraph Vector without word ordering: Distributed bag of words
▶ Experiments
▶ Conclusion
▶ Demo
Introduction
▶ Many machine learning algorithms require the input to be represented as a fixed-length feature vector.
▶ When it comes to texts, one of the most common fixed-length features is bag-of-words.

Bag of Words
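A minimal bag-of-words sketch in plain Python (the toy sentences and vocabulary are made up for illustration, not part of the original slides):

```python
from collections import Counter

# Toy corpus; sentences are illustrative only.
sentences = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({w for s in sentences for w in s.split()})

def bag_of_words(sentence, vocab):
    """Return a fixed-length count vector over the vocabulary."""
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

for s in sentences:
    print(s, "->", bag_of_words(s, vocab))
# Word order is lost: "mat the on sat cat the" maps to the same vector.
```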
Bag of Words Disadvantages
▶ The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used.
▶ Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and high dimensionality.
▶ Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, the distances between the words. (powerful, Paris, strong)

Word Embedding
[Diagram: CBOW sums the context words word(i-k) … word(i+k) into a projection that predicts word(i); Skipgram projects word(i) to predict its surrounding words.]
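As a rough illustration of the two architectures in the diagram (not part of the original slides), gensim's Word2Vec can train either one: sg=0 selects CBOW and sg=1 Skipgram. The corpus and hyperparameters are placeholders, and the keyword names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (placeholder data).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["paris", "is", "a", "powerful", "strong", "city"]]

# sg=0 -> CBOW: sum/average the context words to predict the center word.
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skipgram: project the center word to predict its context words.
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])                     # dense vector for "cat"
print(skipgram.wv.most_similar("cat", topn=2))
```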
Word Embedding

Proposed Method
▶ The Distributed Representations of Sentences and Documents model was proposed.
▶ Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.
▶ The proposed algorithm represents each document by a dense vector which is trained to predict words in the document.
Learning Vector Representation of Words
▶ The task is to predict a word given the other words in a context.

Paragraph Vector: A Distributed Memory Model (PV-DM)
▶ Paragraph vectors are used for prediction.
▶ Every paragraph is mapped to a unique vector.
▶ Every word is also mapped to a unique vector.
Paragraph Vector: A Distributed Memory Model (PV-DM)
▶ The contexts are sampled from a sliding window over the paragraph.
▶ The paragraph vector is shared across all contexts from the same paragraph.
▶ Word vectors are shared across paragraphs.

Advantages over BOW
▶ Semantics of the words: in this space, "powerful" is closer to "strong" than to "Paris".
▶ Takes the word order into consideration.
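The slides describe PV-DM conceptually; as a hedged sketch (not the authors' implementation), gensim's Doc2Vec with dm=1 trains a PV-DM-style model in which each paragraph tag gets its own vector. The toy paragraphs and hyperparameters below are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder paragraphs; each paragraph gets a unique tag (its paragraph-vector id).
paragraphs = ["the cat sat on the mat", "paris is a city in france"]
docs = [TaggedDocument(words=p.split(), tags=[i]) for i, p in enumerate(paragraphs)]

# dm=1 -> PV-DM: the paragraph vector is combined with context word vectors
# sampled from a sliding window to predict the next word.
pv_dm = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)

print(pv_dm.dv[0][:5])                                    # trained vector of paragraph 0
print(pv_dm.infer_vector("a cat on a mat".split())[:5])   # vector inferred for unseen text
```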
Paragraph Vector: Distributed Bag of Words (PV-DBOW)
▶ In this version, the paragraph vector is trained to predict the words in a small window.

Experiments
▶ Each paragraph vector is a combination of two vectors: one learned by PV-DM and one learned by PV-DBOW.
▶ Sentiment Analysis
  Stanford Sentiment Treebank: 11,855 sentences
  IMDB: 100,000 movie reviews
▶ Information Retrieval
Stanford Sentiment Treebank
▶ Learn the representations for all the sentences.
▶ The paragraph vector is the concatenation of two vectors from PV-DBOW and PV-DM.
▶ Logistic Regression was used for prediction.
▶ Every sentence has a label which goes from 0.0 to 1.0.

Stanford Sentiment Treebank (results table)
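A hedged sketch of the pipeline described on this slide, assuming gensim for the paragraph vectors and scikit-learn for the classifier; the sentences, labels, and hyperparameters are placeholders, not the actual treebank setup.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Placeholder sentences and binary sentiment labels (not the real treebank).
sentences = ["a wonderful and moving film", "a dull and lifeless movie"]
labels = [1, 0]
docs = [TaggedDocument(s.split(), [i]) for i, s in enumerate(sentences)]

pv_dm = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)

# Each sentence is represented by the concatenation of its PV-DM and PV-DBOW vectors.
X = np.array([np.concatenate([pv_dm.dv[i], pv_dbow.dv[i]])
              for i in range(len(sentences))])
clf = LogisticRegression().fit(X, labels)

# Infer vectors for a new sentence and classify it.
new = "a moving film".split()
test = np.concatenate([pv_dm.infer_vector(new), pv_dbow.infer_vector(new)])
print(clf.predict([test]))
```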
IMDB
▶ Using Neural Networks and Logistic Regression for prediction.
▶ The paragraph vector is the concatenation of two vectors from PV-DBOW and PV-DM.

IMDB (results table)
Information Retrieval (example paragraph snippets)
▶ "calls from ( 000 ) 000 - 0000 . 3913 calls reported from this number . according to 4 reports the identity of this caller is american airlines ."
▶ "do you want to find out who called you from +1 000 - 000 - 0000 , +1 0000000000 or ( 000 ) 000 - 0000 ? see reports and share information you have about this caller"
▶ "allina health clinic patients for your convenience , you can pay your allina health clinic bill online . pay your clinic bill now , question and answers ..."

Observations
▶ PV-DM is consistently better than PV-DBOW.
▶ PV-DM alone can achieve good results.
▶ The combination of PV-DM and PV-DBOW gains the best results.
▶ A good guess for window size is between 5 and 12.
▶ The proposed method must be run in parallel.
Advantages and Disadvantages
▶ The proposed method is competitive with state-of-the-art methods.
▶ The good performance demonstrates the merits of Paragraph Vector in capturing the semantics of paragraphs.
▶ It is scalable (sentences, paragraphs, and documents).
▶ Paragraph vectors have the potential to overcome many weaknesses of bag-of-words (word order, word meaning, …).
▶ Paragraph Vector can be expensive.
▶ Too many parameters.
▶ If the input corpus is one with lots of misspellings, like tweets, this algorithm may not be a good choice.

Demo
Slide 23 [diagram]: CBOW example — predict "sat" from the context words "cat" and "on". Each context word is a V-dimensional one-hot vector in the input layer (the index of "cat" in the vocabulary is set to 1, all other entries are 0); the inputs feed a hidden layer, and the output layer is the one-hot vector for the target word "sat".

We must learn W and W'
Slide 24 [diagram]: the V-dim one-hot inputs for "cat" and "on" are multiplied by $W_{V \times N}$ to form the N-dim hidden layer, which is multiplied by $W'_{N \times V}$ to produce the V-dim output for "sat". N will be the size of the word vector.
Note on Slide 23 (vh2): One-hot encoding is used to encode categorical integer features using a one-hot, aka one-of-K, scheme. Suppose you have a 'color' feature which can take the values 'green', 'red', and 'blue'. One-hot encoding will convert this 'color' feature to three features, namely 'is_green', 'is_red', and 'is_blue', all of which are binary. (vagelis hristidis, 2016-11-06)
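A tiny sketch of the note's color example in plain Python (feature names as in the note):

```python
# One-hot encode a categorical 'color' feature with values green / red / blue.
categories = ["green", "red", "blue"]

def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("red", categories))   # [0, 1, 0]  -> is_green, is_red, is_blue
```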
Slides 25-26 [diagrams]: each one-hot input selects one row of W, i.e. $W^{\top} x_{cat}$ and $W^{\top} x_{on}$ are the word vectors of "cat" and "on"; the hidden layer is their average, $\hat{v} = \frac{1}{2}\left(W^{\top} x_{cat} + W^{\top} x_{on}\right)$, an N-dim vector used to predict "sat" at the V-dim output layer.
Slides 27-28 [diagrams]: the hidden vector is multiplied by $W'$ and passed through a softmax, $\hat{y} = \mathrm{softmax}\left(W'^{\top} \hat{v}\right)$, giving a probability over the V vocabulary words. We would prefer $\hat{y}$ to be close to the true one-hot target $y_{sat}$: in the example the entry for "sat" gets probability 0.7 while the other entries stay near 0. N will be the size of the word vector.
Slide 29 [diagram]: the matrix W contains the word vectors (one per vocabulary word).
We can consider either W or W' as the word's representation, or even take the average.
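A small numpy sketch of the forward pass walked through in these demo slides; the vocabulary size, embedding size, word indices, and random weights are arbitrary placeholders.

```python
import numpy as np

V, N = 5, 3                       # vocabulary size, word-vector size (arbitrary)
rng = np.random.default_rng(0)
W = rng.random((V, N))            # input->hidden weights (rows are word vectors)
Wp = rng.random((N, V))           # hidden->output weights (W')

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

x_cat, x_on = one_hot(0, V), one_hot(3, V)       # toy indices for "cat" and "on"
h = (W.T @ x_cat + W.T @ x_on) / 2               # hidden layer: average of context vectors
scores = Wp.T @ h
y_hat = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
print(y_hat)                                     # training pushes the "sat" entry toward 1

# Word representation: a row of W, a column of W', or their average.
word_vec_cat = (W[0] + Wp[:, 0]) / 2
```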