Introduction to Latent Sequences & Expectation Maximization

  1. Introduction to Latent Sequences & Expectation Maximization CMSC 473/673 UMBC

  2. • EMBEDDINGS/DISTRIBUTED REPRESENTATIONS • COURSE SO FAR REVIEW

  3. Neural Language Models: given some context (w_{i-3}, w_{i-2}, w_{i-1}), create/use “distributed representations” (embeddings e_{i-3}, e_{i-2}, e_{i-1}); combine these representations with a matrix-vector product, C = f_θ(·); compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1})); predict the next word w_i.
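The pipeline on this slide (look up context embeddings, combine them, score every word in the vocabulary, normalize with a softmax) fits in a few lines of numpy. The sketch below is a toy stand-in with made-up dimensions and random weights, not the model used in the course:

```python
# Minimal sketch of the feedforward neural LM step above; vocabulary size,
# embedding size, and weights are all toy/random placeholders.
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # embedding matrix: one row e_w per word
W = rng.normal(size=(d, 3 * d))   # combines the three context embeddings
theta = rng.normal(size=(V, d))   # output weights, one vector theta_w per word

def next_word_probs(ctx):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) via softmax(theta_{w_i} . f(context))."""
    e = np.concatenate([E[w] for w in ctx])   # look up and stack e_{i-3}, e_{i-2}, e_{i-1}
    f = np.tanh(W @ e)                        # combine via a matrix-vector product
    scores = theta @ f                        # one score per candidate next word
    scores -= scores.max()                    # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()                        # "proportional to" -> proper distribution

probs = next_word_probs([1, 5, 7])            # three context word ids
print(probs.argmax(), probs.sum())            # most likely next word id; probabilities sum to 1
```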

  4. (Some) Properties of Embeddings: they capture “like” (similar) words, and they capture relationships: vector(‘king’) – vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’); vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’). Mikolov et al. (2013)
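The analogy arithmetic above can be checked mechanically: subtract, add, then return the nearest remaining vocabulary word by cosine similarity. The snippet below is only a hypothetical sketch with random stand-in vectors; reproducing the famous results requires pretrained embeddings such as word2vec (Mikolov et al., 2013):

```python
# Toy analogy lookup: vector('king') - vector('man') + vector('woman'),
# then nearest neighbor by cosine. The vectors here are random placeholders.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
vocab = ["king", "man", "woman", "queen", "banana"]
vec = {w: rng.normal(size=50) for w in vocab}        # stand-in embeddings

target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vocab if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vec[w], target))
print(best)   # with real pretrained embeddings this tends to be "queen"
```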

  5. Four kinds of vector models. Sparse vector representations: 1. mutual-information weighted word co-occurrence matrices. Dense vector representations: 2. singular value decomposition / Latent Semantic Analysis; 3. neural-network-inspired models (skip-grams, CBOW); 4. Brown clusters. Learn more in: your project, the paper (673), other classes (478/678).

  6. Shared Intuition Model the meaning of a word by “embedding” in a vector space The meaning of a word is a vector of numbers Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”) or the string itself

  7. Intrinsic Evaluation: Cosine Similarity. Are the vectors parallel? Divide the dot product by the lengths of the two vectors; this is the cosine of the angle between them. +1: vectors point in the same direction; 0: vectors are orthogonal; -1: vectors point in opposite directions.
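A quick numerical check of the three cases above, using my own toy vectors rather than anything from the course:

```python
# cosine similarity: dot product divided by the lengths of the two vectors
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])
print(cosine(u, 2 * u))                   # +1.0: same direction (parallel)
print(cosine(u, np.array([-2.0, 1.0])))   #  0.0: orthogonal
print(cosine(u, -u))                      # -1.0: opposite directions
```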

  8. Course Recap So Far. Basics of Probability: requirements to be a distribution (“proportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule.

  9. Course Recap So Far. Basics of Probability (as above). Basics of language modeling: goal: model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based model; smoothing (and why we need it): Laplace (add-λ), interpolation, backoff; evaluation: perplexity.
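Two of the recap items lend themselves to a tiny worked example: add-λ (Laplace) smoothing and perplexity. The corpus, λ value, and bigram model below are toy choices of mine, not course code:

```python
# Add-lambda smoothed bigram probabilities and perplexity on a toy corpus.
import math
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
lam = 0.5                                   # the add-lambda smoothing constant

def p_bigram(w, prev):
    # (count + lambda) / (context count + lambda * |V|): avoids zero probabilities
    return (bigrams[(prev, w)] + lam) / (unigrams[prev] + lam * len(vocab))

def perplexity(words):
    # one common convention: normalize by the number of predicted words
    logp = sum(math.log(p_bigram(w, prev)) for prev, w in zip(words, words[1:]))
    return math.exp(-logp / (len(words) - 1))

print(p_bigram("cat", "the"), perplexity("the cat sat".split()))
```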

  10. Course Recap So Far. Basics of Probability and basics of language modeling (as above). Tasks and Classification (use Bayes rule!): posterior decoding, noisy channel model; evaluations: accuracy, precision, recall, and F_β (F_1) scores; Naïve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling.
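For the evaluation metrics named here, a small illustration with invented confusion counts:

```python
# precision, recall, and the F_beta score (F_1 when beta = 1)
def f_beta(precision, recall, beta=1.0):
    b2 = beta ** 2                      # beta > 1 weights recall more, beta < 1 precision
    return (1 + b2) * precision * recall / (b2 * precision + recall)

tp, fp, fn = 8, 2, 4                    # toy counts: true/false positives, false negatives
precision = tp / (tp + fp)              # 0.8
recall = tp / (tp + fn)                 # ~0.667
print(precision, recall, f_beta(precision, recall))   # F_1 ~ 0.727
```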

  11. Course Recap So Far. Basics of Probability, language modeling, and tasks and classification (as above). Maximum Entropy Models: meanings of feature functions and weights; use for language modeling or conditional classification (“posterior in one go”); how to learn the weights: gradient descent.
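For the "how to learn the weights" item, here is a deliberately minimal gradient-descent loop on a stand-in objective (not the course's maxent training code):

```python
# Generic gradient descent: repeatedly step against the gradient of a loss.
def grad_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)            # move opposite the gradient direction
    return w

# example: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
print(grad_descent(lambda w: 2 * (w - 3), w0=0.0))    # converges toward 3.0
```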

  12. Course Recap So Far. Basics of Probability, language modeling, classification, and Maximum Entropy Models (as above). Distributed Representations & Neural Language Models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity.

  13. Course Recap So Far. Basics of Probability: requirements to be a distribution (“proportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule. Basics of language modeling: goal: model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based model; smoothing (and why we need it): Laplace (add-λ), interpolation, backoff; evaluation: perplexity. Tasks and Classification (use Bayes rule!): posterior decoding, noisy channel model; evaluations: accuracy, precision, recall, and F_β (F_1) scores; Naïve Bayes (given the label, generate/explain each feature independently) and connection to language modeling. Maximum Entropy Models: meanings of feature functions and weights; use for language modeling or conditional classification (“posterior in one go”); how to learn the weights: gradient descent. Distributed Representations & Neural Language Models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity.

  14. LATENT SEQUENCES AND LATENT VARIABLE MODELS

  15. Is Language Modeling “Latent?” p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

  16. Is Language Modeling “Latent?” Most* of what we’ve discussed: not really. p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep). These values are unknown, but the generation process (explanation) is transparent. (*Neural language modeling as an exception.)
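The factorization on this slide can be written as a loop: multiply one conditional probability per word, each conditioned on (up to) the two previous words. The `cond_prob` below is a hypothetical placeholder standing in for whichever trained model supplies the unknown values:

```python
# Score a whole sentence as a product of per-word conditionals (two-word context).
def cond_prob(word, context):
    return 0.1                                    # placeholder, not a real model

def sentence_prob(words, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - 2):i])   # up to two previous words
        p *= cond_prob(w, context)                # these conditionals are the unknown values
    return p

print(sentence_prob("Colorless green ideas sleep furiously".split(), cond_prob))
```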

  17. Is Document Classification “Latent?” [Document, labeled ATTACK:] Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

  18. Is Document Classification “Latent?” As we’ve discussed. [Document, labeled ATTACK:] Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. argmax_x exp(θ ⋅ f(x, y))

  19. Is Document Classification “Latent?” As we’ve discussed: not really. [Document, labeled ATTACK:] Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. These values are unknown, but the generation process (explanation) is transparent. argmax_x exp(θ ⋅ f(x, y))
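The decision rule on these slides, an argmax over labels of exp(θ ⋅ f(x, y)), is easy to sketch. The feature function and weights below are invented for illustration; they are not the course's feature set:

```python
# Maxent-style classification: pick the label with the highest exp(theta . f).
import math

def features(doc, label):
    # toy indicator features: one (word, label) feature per token
    return {(w.lower(), label): 1.0 for w in doc.split()}

def classify(theta, doc, labels=("ATTACK", "OTHER")):
    def score(label):
        return sum(theta.get(k, 0.0) * v for k, v in features(doc, label).items())
    # exp() is monotonic, so maximizing exp(theta . f) is the same as maximizing theta . f
    return max(labels, key=lambda label: math.exp(score(label)))

theta = {("shot", "ATTACK"): 2.0, ("wounded", "ATTACK"): 1.5}   # made-up weights
print(classify(theta, "Three people have been fatally shot and wounded"))   # -> "ATTACK"
```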

  20. Ambiguity → Part of Speech Tagging. British Left Waffles on Falkland Islands. One reading: British/Adjective Left/Noun Waffles/Verb; another: British/Noun Left/Verb Waffles/Noun.

  21. [Diagram: observed text; orthography; morphology; lexemes; syntax; semantics; pragmatics; discourse. Adapted from Jason Eisner, Noah Smith.]

  22. Latent Modeling: explain what you see / annotate with things “of importance” you don’t. [Diagram as on the previous slide: observed text, orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse. Adapted from Jason Eisner, Noah Smith.]

  23. Latent Sequence Models: Part of Speech p(British Left Waffles on Falkland Islands)

  24. Latent Sequence Models: Part of Speech (i): Adjective Noun Verb Prep Noun Noun (ii): Noun Verb Noun Prep Noun Noun p(British Left Waffles on Falkland Islands)

  25. Latent Sequence Models: Part of Speech (i): Adjective Noun Verb Prep Noun Noun (ii): Noun Verb Noun Prep Noun Noun p(British Left Waffles on Falkland Islands) 1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

  26. Latent Sequence Models: Part of Speech (i): Adjective Noun Verb Prep Noun Noun (ii): Noun Verb Noun Prep Noun Noun p(British Left Waffles on Falkland Islands) 1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels) 2. Produce a tag sequence for this sentence
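To make these two steps concrete, here is a deliberately naive sketch: enumerate every possible tag sequence, score each one as a candidate explanation of the sentence under an invented joint score, and output the best. A trained latent-sequence model (for example, the kind estimated with Expectation Maximization later in this material) would replace the invented scores with learned probabilities and the brute-force search with dynamic programming:

```python
# Brute-force latent tag sequences: score every (word, tag) assignment and keep the best.
from itertools import product

words = "British Left Waffles on Falkland Islands".split()
TAGS = ["Adjective", "Noun", "Verb", "Prep"]

def joint_score(words, tags):
    # stand-in for p(words, tags); the numbers below are invented, not learned
    prefer = {("British", "Noun"): 3.0, ("British", "Adjective"): 2.0,
              ("Left", "Verb"): 2.5, ("Left", "Noun"): 2.0,
              ("Waffles", "Noun"): 2.5, ("Waffles", "Verb"): 2.0,
              ("on", "Prep"): 5.0, ("Falkland", "Noun"): 5.0, ("Islands", "Noun"): 5.0}
    score = 1.0
    for w, t in zip(words, tags):
        score *= prefer.get((w, t), 0.1)
    return score

# 4^6 = 4096 candidate explanations: fine to enumerate here; dynamic programming in general
best = max(product(TAGS, repeat=len(words)), key=lambda tags: joint_score(words, tags))
print(list(zip(words, best)))   # with these made-up scores, this is reading (ii)
```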
