

  1. Inside Out: Two Jointly Predictive Models for Word Representations and Phrase Representations. Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. ofey.sunfei@gmail.com. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. February 14, 2016

  2. Word Representation: used across NLP tasks, including sentiment analysis [Maas et al., 2011], machine translation [Kalchbrenner and Blunsom, 2013], language modeling [Bengio et al., 2003], POS tagging [Collobert et al., 2011], word-sense disambiguation [Collobert et al., 2011], and parsing [Socher et al., 2011].

  3. Models. [Figure montage of earlier word representation models: the singular value decomposition of a term-by-document matrix (LSA), the neural probabilistic language model with its shared lookup table, the window-based tagging network of Collobert et al., and the CBOW and Skip-gram architectures.]

  4. The Distributional Hypothesis [Harris, 1954; Firth, 1957]: “You shall know a word by the company it keeps.” —J.R. Firth

  5. The Distributional Hypothesis: We found a cute little wampimuk (?) sleeping in a tree.

  6. The Distributional Hypothesis: We found a cute little wampimuk (?) sleeping in a tree.

  7. What if there were no context for a word?

  8. What if there were no context for a word? Which word is closer to buy: buys or sells?

  9. Morphology: how words are built from morphemes.

  10. Morphology: how words are built from morphemes. breakable → break + able; buys → buy + s
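The slides do not say how morpheme segmentation is obtained; below is a minimal, purely illustrative Python sketch of suffix stripping (a real pipeline would more likely use an unsupervised segmenter such as Morfessor, and the suffix list here is a placeholder):

```python
# Illustrative only: a naive suffix-stripping segmenter. The presentation
# does not specify the segmentation method; this just shows the kind of
# morpheme lists the models consume.
SUFFIXES = ["able", "ness", "ing", "ed", "s"]

def segment(word):
    """Split a word into a crude [stem, suffix] morpheme list."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], suf]
    return [word]  # no recognised suffix: the word is its own morpheme

print(segment("breakable"))  # ['break', 'able']
print(segment("buys"))       # ['buy', 's']
```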

  11. Limitation of Morphology: some related words share no morphemes, e.g., dog vs. husky.

  12. Motivation. Example: “. . . glass is breakable, take care . . .” For the word breakable, the EXTERNAL CONTEXTS are glass, is, take, care; the INTERNAL MORPHEMES are break, able.
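To make the two information sources concrete, here is a small sketch that collects them for one target word; the function name and the hard-coded morpheme table are purely illustrative:

```python
# Sketch: the two information sources the models use for one target word --
# external contexts (a symmetric window) and internal morphemes.
# The morpheme table is hard-coded here only for illustration.
MORPHEMES = {"breakable": ["break", "able"]}

def contexts_and_morphemes(tokens, i, window=2):
    ctx = [tokens[j]
           for j in range(max(0, i - window), min(len(tokens), i + window + 1))
           if j != i]
    return ctx, MORPHEMES.get(tokens[i], [tokens[i]])

tokens = ["glass", "is", "breakable", "take", "care"]
print(contexts_and_morphemes(tokens, 2))
# (['glass', 'is', 'take', 'care'], ['break', 'able'])
```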

  13. Model

  14. BEING. The context words c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} are composed into a projection \vec{P}_{c_i}, which predicts the target word w_i:
      L = \sum_{i=1}^{N} \log p(w_i \mid \vec{P}_{c_i})

  15. BEING. The morphemes m_i^{(1)}, m_i^{(2)}, ... of w_i are likewise composed into a projection \vec{P}_{m_i}, and the two projections jointly predict the target word:
      L = \sum_{i=1}^{N} \Big( \log p(w_i \mid \vec{P}_{c_i}) + \log p(w_i \mid \vec{P}_{m_i}) \Big)
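A minimal numpy sketch of one term of this objective, under assumptions: context and morpheme vectors are averaged into the projections, and p(w | P) is a plain full softmax over an output embedding matrix. All names are illustrative, not the paper's code.

```python
# Sketch: one term of the BEING objective with a full softmax. Context and
# morpheme vectors are averaged into projections P_c and P_m (the exact
# composition is an assumption for illustration).
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                              # vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # context / morpheme vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # target word vectors

def log_softmax_prob(proj, target_id):
    """log p(target | proj) under a full softmax over the vocabulary."""
    scores = W_out @ proj
    return scores[target_id] - np.log(np.sum(np.exp(scores)))

def being_term(target_id, context_ids, morpheme_ids):
    P_c = W_in[context_ids].mean(axis=0)      # projection of external contexts
    P_m = W_in[morpheme_ids].mean(axis=0)     # projection of internal morphemes
    return log_softmax_prob(P_c, target_id) + log_softmax_prob(P_m, target_id)

# one term of L; the full objective sums this over all positions i = 1..N
print(being_term(target_id=7, context_ids=[3, 5, 11, 13], morpheme_ids=[7, 42]))
```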

  16. BEING with Negative Sampling:
      L = \sum_{i=1}^{N} \Big( \log \sigma(\vec{w}_i \cdot \vec{P}_{c_i}) + k \cdot \mathbb{E}_{\tilde{w} \sim P_W} \log \sigma(-\vec{\tilde{w}} \cdot \vec{P}_{c_i}) + \log \sigma(\vec{w}_i \cdot \vec{P}_{m_i}) + k \cdot \mathbb{E}_{\tilde{w} \sim P_W} \log \sigma(-\vec{\tilde{w}} \cdot \vec{P}_{m_i}) \Big), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}
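A sketch of the negative-sampling form for one position, where each expectation over the noise distribution P_W is approximated by drawing k samples; the uniform noise distribution and the value of k below are placeholders, not the paper's settings.

```python
# Sketch: one BEING term with negative sampling. Each projection (contexts,
# morphemes) is scored against the true word and k noise words; uniform
# sampling stands in for P_W purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
V, D, k = 1000, 50, 5
W_in = rng.normal(scale=0.1, size=(V, D))    # context / morpheme vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # target word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_term(proj, target_id):
    """log sigma(w_i . P) plus k negative terms approximating k * E[...]."""
    pos = np.log(sigmoid(W_out[target_id] @ proj))
    negs = rng.choice(V, size=k)              # stand-in draw from P_W
    return pos + np.sum(np.log(sigmoid(-(W_out[negs] @ proj))))

def being_ns_term(target_id, context_ids, morpheme_ids):
    P_c = W_in[context_ids].mean(axis=0)      # projection of external contexts
    P_m = W_in[morpheme_ids].mean(axis=0)     # projection of internal morphemes
    return ns_term(P_c, target_id) + ns_term(P_m, target_id)

print(being_ns_term(target_id=7, context_ids=[3, 5, 11, 13], morpheme_ids=[7, 42]))
```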

  17. SEING. The target word w_i predicts each context word within a window of size l:
      L = \sum_{i=1}^{N} \sum_{j=i-l,\, j \neq i}^{i+l} \log p(c_j \mid w_i)

  18. SEING. The target word additionally predicts each of its own morphemes m_i^{(1)}, ..., m_i^{(s(w_i))}:
      L = \sum_{i=1}^{N} \Big( \sum_{j=i-l,\, j \neq i}^{i+l} \log p(c_j \mid w_i) + \sum_{z=1}^{s(w_i)} \log p(m_i^{(z)} \mid w_i) \Big)

  19. SEING with Negative Sampling:
      L = \sum_{i=1}^{N} \Big( \sum_{j=i-l,\, j \neq i}^{i+l} \big( \log \sigma(\vec{c}_j \cdot \vec{w}_i) + k \cdot \mathbb{E}_{\tilde{c} \sim P_C} \log \sigma(-\vec{\tilde{c}} \cdot \vec{w}_i) \big) + \sum_{z=1}^{s(w_i)} \big( \log \sigma(\vec{m}_i^{(z)} \cdot \vec{w}_i) + k \cdot \mathbb{E}_{\tilde{m} \sim P_M} \log \sigma(-\vec{\tilde{m}} \cdot \vec{w}_i) \big) \Big), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}
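Correspondingly, a sketch of one SEING term with negative sampling: the word vector scores each observed context word and morpheme against k noise samples, with uniform sampling standing in for P_C and P_M.

```python
# Sketch: one SEING term with negative sampling. The word vector w_i scores
# each observed context word and each of its morphemes against k noise
# samples; the uniform noise and k are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(2)
V, D, k = 1000, 50, 5
W_in = rng.normal(scale=0.1, size=(V, D))    # word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # context / morpheme ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_pair(w_vec, target_id):
    """log sigma(target . w) plus k negative terms log sigma(-noise . w)."""
    pos = np.log(sigmoid(W_out[target_id] @ w_vec))
    negs = rng.choice(V, size=k)              # stand-in draw from P_C / P_M
    return pos + np.sum(np.log(sigmoid(-(W_out[negs] @ w_vec))))

def seing_ns_term(word_id, context_ids, morpheme_ids):
    w = W_in[word_id]
    return (sum(ns_pair(w, c) for c in context_ids)      # j = i-l..i+l, j != i
            + sum(ns_pair(w, m) for m in morpheme_ids))  # z = 1..s(w_i)

print(seing_ns_term(word_id=7, context_ids=[3, 5, 11, 13], morpheme_ids=[7, 42]))
```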

  20. Related Work I: CLBL++ [Botha and Blunsom, 2014]; context-sensitive morphological RNN [Luong et al., 2013]. • Neural language models with morphological information • Simple and straightforward models can acquire better word representations
