NLU lecture 5: Word representations and morphology (Adam Lopez)


  1. NLU lecture 5: Word representations and morphology Adam Lopez alopez@inf.ed.ac.uk

  2. • Essential epistemology • Word representations and word2vec • Word representations and compositional morphology Reading: Mikolov et al. 2013, Luong et al. 2013

  3. Essential epistemology

                 |  Exact sciences                         |  Empirical sciences             |  Engineering sciences
     Deals with  |  Axioms & theorems                      |  Facts & theories               |  Artifacts
     Truth is    |  Forever                                |  Temporary                      |  It works
     Examples    |  Mathematics, C.S. theory, F.L. theory  |  Physics, Biology, Linguistics  |  Many, including applied C.S. (e.g. NLP, MT)

  4–8. Essential epistemology (the same table, with annotations added step by step): morphological properties of words are facts, so they belong to the empirical sciences; Optimality Theory is a linguistic theory of those facts, and it is finite-state; therefore we can represent morphological properties of words with finite-state automata.
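To make the finite-state claim concrete, here is a toy sketch of an analyzer equivalent to a small two-state transducer; the lexicon, affix rules, and feature labels are my own illustrative assumptions, not material from the lecture.

```python
# Toy morphological analyzer, equivalent to a tiny finite-state transducer:
# state 0 reads a stem, state 1 optionally reads an affix and emits features.
# The lexicon and affix rules below are illustrative assumptions only.
STEMS = {"love", "walk", "read"}

def analyze(word):
    """Map a surface form to 'stem +FEATURES', or None if out of vocabulary."""
    for stem in sorted(STEMS, key=len, reverse=True):  # prefer the longest matching stem
        if word == stem:
            return f"{stem} +PRES"           # bare form
        if word == stem + "s":
            return f"{stem} +3rd.SG.PRES"    # -s affix, as in the "loves" example later
        if word == stem + "ed":
            return f"{stem} +PAST"           # -ed affix
    return None

for w in ["loves", "walked", "read", "fleeked"]:
    print(w, "->", analyze(w))
# loves -> love +3rd.SG.PRES
# walked -> walk +PAST
# read -> read +PRES
# fleeked -> None  (closed lexicon: a real analyzer needs guesser rules for new stems)
```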

  9. Remember the bandwagon

  10. Word representations

  11. Feedforward model: $p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})$, where each factor $p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})$ is computed by the network. [Figure: feedforward LM diagram; context words e_{i-1}, e_{i-2}, e_{i-3} are each mapped through the embedding matrix C, the concatenation feeds a tanh hidden layer, and weight matrices W and V produce a softmax distribution over e_i.]

  12. Feedforward model: every word is a vector (a one-hot vector); the concatenation of these vectors is an n-gram.

  13. Feedforward model: word embeddings are vectors: continuous representations of each word.

  14. Feedforward model: n-grams are vectors: continuous representations of n-grams (or, via recursion, larger structures).

  15. Feedforward model: a discrete probability distribution over V outcomes is a vector: V non-negative reals summing to 1.

  16. Feedforward model: no matter what we do in NLP, we'll (almost) always have words… Can we reuse these vectors?
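As a rough, runnable guide to the architecture on slide 11, here is a minimal numpy version of the forward pass; the dimensions, random initialization, and variable names are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

V, d, h, n = 10_000, 50, 100, 4          # vocab size, embedding dim, hidden dim, n-gram order
rng = np.random.default_rng(0)

C = rng.normal(0.0, 0.1, (V, d))             # word embeddings: one row per vocabulary word
W = rng.normal(0.0, 0.1, ((n - 1) * d, h))   # concatenated context -> hidden layer
b = np.zeros(h)
U = rng.normal(0.0, 0.1, (h, V))             # hidden layer -> vocabulary scores
c = np.zeros(V)

def next_word_distribution(context_ids):
    """p(e_i | e_{i-3}, e_{i-2}, e_{i-1}) for a feedforward 4-gram LM."""
    x = np.concatenate([C[j] for j in context_ids])  # look up and concatenate embeddings
    hidden = np.tanh(x @ W + b)                      # tanh hidden layer
    scores = hidden @ U + c
    scores -= scores.max()                           # for numerical stability
    p = np.exp(scores)
    return p / p.sum()                               # softmax over all V words

p = next_word_distribution([12, 345, 6789])          # three context word ids
print(p.shape, round(float(p.sum()), 6))             # (10000,) 1.0
```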

  17. Design a POS tagger using an RNNLM

  18. Design a POS tagger using an RNNLM. What are some difficulties with this? What limitation do you have in learning a POS tagger that you don’t have when learning an LM?

  19. Design a POS tagger using an RNNLM. What are some difficulties with this? What limitation do you have in learning a POS tagger that you don’t have when learning an LM? One big problem: LIMITED DATA
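One possible answer to the design exercise, sketched under my own assumptions rather than as the lecture's reference solution: copy the embedding matrix (and recurrence) from a trained RNNLM, add a new task-specific output layer over the hidden state, and train that layer on the small labelled corpus.

```python
import numpy as np

def tag_sentence(word_ids, C, W_x, W_h, b_h, W_tag, b_tag):
    """Predict one POS tag per token with a simple Elman RNN over reused embeddings.

    C, W_x, W_h, b_h are copied from a trained RNNLM (frozen or fine-tuned);
    W_tag, b_tag form a new output layer trained on the limited labelled data.
    """
    h = np.zeros(W_h.shape[0])
    tags = []
    for i in word_ids:
        h = np.tanh(C[i] @ W_x + h @ W_h + b_h)   # same recurrence as the language model
        scores = h @ W_tag + b_tag                # new tag-scoring layer
        tags.append(int(np.argmax(scores)))
    return tags
```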

  20. “You shall know a word by the company it keeps” –John Rupert Firth (1957)

  21. Learning word representations using language modeling • Idea: we’ll learn word representations using a language model, then reuse them in our POS tagger (or anything else we predict from words). • Problem: the Bengio feedforward language model is slow to train. Imagine computing a softmax over 10,000 words!
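The expensive part is the output layer: every prediction (and every gradient update) touches all V output weights. A back-of-the-envelope count with illustrative sizes (a 100-unit hidden layer and the 10,000-word vocabulary mentioned above):

```python
h, V = 100, 10_000
mults_per_prediction = h * V    # hidden -> vocabulary scores, before the softmax itself
print(mults_per_prediction)     # 1,000,000 multiplications for every single word predicted
# word2vec avoids this cost with hierarchical softmax or negative sampling
# (Mikolov et al. 2013), so each update scales with log V or a few sampled words.
```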

  22. Continuous bag-of-words (CBOW)

  23. Skip-gram

  24. Skip-gram

  25. Learning skip-gram

  26. Learning skip-gram
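The CBOW and skip-gram slides above are shown as figures; as a rough sketch of one skip-gram training step with negative sampling (the training trick from Mikolov et al. 2013), here is a simplified update. The learning rate, dimensions, and uniform negative-sampling distribution are my own simplifications (word2vec samples negatives from the unigram distribution raised to the 0.75 power).

```python
import numpy as np

V, d = 10_000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, (V, d))    # vectors used when a word is the centre word
W_out = rng.normal(0.0, 0.1, (V, d))   # vectors used when a word is a context word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.025):
    """One update: pull (center, context) together, push k random 'negative' words apart."""
    negatives = rng.integers(0, V, size=k)
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for word, label in [(context, 1.0)] + [(int(neg), 0.0) for neg in negatives]:
        u = W_out[word]
        g = sigmoid(v @ u) - label        # gradient of the logistic loss w.r.t. the score
        grad_v += g * u
        W_out[word] -= lr * g * v         # update the context / negative vector
    W_in[center] -= lr * grad_v           # update the centre-word vector

# Training slides a window over unlabeled text and calls this on each (centre, context) pair.
sgns_step(center=42, context=137)
```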

  27. Word representations capture some world knowledge

  28. Continuous word representations. [Figure: embedding-space plots labelled Syntactic (walk/walks, read/reads) and Semantics (man/woman, king/queen).]

  29. Will it learn this?
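Slide 29's question refers to the analogy test: if the relations above are learned, the offset man → woman should roughly equal king → queen. Here is a sketch of how such a test can be run, assuming word vectors stored in a plain dict (the helper below is hypothetical, not part of any word2vec release):

```python
import numpy as np

def analogy(emb, a, b, c, topn=1):
    """Find word(s) d such that a : b :: c : d, via d ≈ b - a + c in embedding space."""
    target = emb[b] - emb[a] + emb[c]

    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    candidates = [(w, cos(vec, target)) for w, vec in emb.items() if w not in (a, b, c)]
    return sorted(candidates, key=lambda t: -t[1])[:topn]

# e.g. analogy(embeddings, "man", "woman", "king") -> [("queen", ...)] if the model learned it
```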

  30. (Additional) limitations of word2vec • Closed vocabulary assumption • Cannot exploit functional relationships in learning

  31. Is this language? What our data contains: A Lorillard spokeswoman said, “This is an old story.” What word2vec thinks our data contains: A UNK UNK said, “This is an old story.”

  32. Is it ok to ignore words?

  33. Is it ok to ignore words?

  34. What we know about linguistic structure. Morpheme: the smallest meaningful unit of language. Example: “loves” = love + -s (root/stem: love; affix: -s; morphological analysis: 3rd.SG.PRES).

  35. What if we embed morphemes rather than words? Basic idea: compute each representation recursively from its children, where f is an activation function (e.g. tanh). [Figure legend: vectors in green are morpheme embeddings (parameters); vectors in grey are computed recursively from them (functions).]
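A minimal sketch of that recursive composition, assuming the affine-plus-nonlinearity form f(W[x_left; x_right] + b) with f = tanh, as in Luong et al. 2013; the toy morpheme vocabulary and the dimensions are my own illustrative choices.

```python
import numpy as np

d = 50
rng = np.random.default_rng(0)
morpheme_emb = {m: rng.normal(0.0, 0.1, d) for m in ["love", "-s", "un-", "fortunate", "-ly"]}
W = rng.normal(0.0, 0.1, (d, 2 * d))   # composition matrix (a learned parameter)
b = np.zeros(d)

def compose(left, right):
    """Parent vector from two children: tanh(W [left; right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# "loves" = love + -s, composed bottom-up from morpheme embeddings
v_loves = compose(morpheme_emb["love"], morpheme_emb["-s"])
# "unfortunately" = (un- + fortunate) + -ly: the same function applied recursively
v_unfortunately = compose(compose(morpheme_emb["un-"], morpheme_emb["fortunate"]),
                          morpheme_emb["-ly"])
```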

  36. Train the compositional morpheme model by minimizing the distance between the constructed vector p_c and a reference vector p_r (the target output).
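A sketch of that training signal, assuming the objective is the squared Euclidean distance between the constructed vector p_c and the reference vector p_r (for example, a pre-trained vector for the whole word), in the spirit of Luong et al. 2013:

```python
def reconstruction_loss(p_c, p_r):
    """Squared Euclidean distance between the constructed and reference vectors."""
    diff = p_c - p_r
    return float(diff @ diff)

# e.g. pull the composed vector for "loves" towards a pre-trained whole-word vector;
# gradients flow back through compose() into W, b, and the morpheme embeddings.
```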

  37. Or, train in context using backpropagation (basically a feedforward LM). [Figure legend: vectors in blue are word or n-gram embeddings (parameters); vectors in green are morpheme embeddings (parameters); vectors in grey are computed from them (functions).]

  38. Where do we get morphemes? • Use an unsupervised morphological analyzer (we’ll talk about unsupervised learning later on). • How many morphemes are there?

  39. New stems are invented every day! fleeking, fleeked, and fleeker are all attested…

  40. Representations learned by the compositional morphology model

  41. Summary
      • Deep learning is not magic and will not solve all of your problems, but representation learning is a very powerful idea.
      • Word representations can be transferred between models.
      • Word2vec trains word representations using an objective based on language modeling, so it can be trained on unlabeled data.
      • It is sometimes called unsupervised, but the objective is supervised!
      • Vocabulary is not finite.
      • Compositional representations based on morphemes bring our models closer to an open vocabulary.
