Introduction to Latent Sequences & Expectation Maximization
CMSC 473/673, UMBC
REVIEW: COURSE SO FAR
EMBEDDINGS / DISTRIBUTED REPRESENTATIONS
Neural Language Models
predict the next word given some context: given w_{i-3}, w_{i-2}, w_{i-1}, compute beliefs about what w_i is likely:
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(e_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))
create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1}; combine these representations into a context vector C = f(e_{i-3}, e_{i-2}, e_{i-1}) (e.g., via a matrix-vector product), then score each candidate word's output vector θ_{w_i} against it
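To make the pipeline above concrete, here is a minimal sketch (not the course's actual model) of turning context embeddings into next-word probabilities with a softmax; the toy vocabulary, the random parameters, and the averaging combiner f are all illustrative assumptions:

```python
import numpy as np

# Toy vocabulary and dimensionality; all values here are illustrative, not learned.
vocab = ["colorless", "green", "ideas", "sleep", "furiously"]
V, d = len(vocab), 8
rng = np.random.default_rng(0)

E = rng.normal(size=(V, d))       # "distributed representations" e_w (input embeddings)
theta = rng.normal(size=(V, d))   # output vectors theta_w used to score candidate words

def next_word_probs(context_ids):
    """p(x_i | context) via softmax(theta_w . C) over every candidate word w."""
    # combine the context embeddings into one vector C = f(e_{i-3}, e_{i-2}, e_{i-1});
    # a simple average stands in for the model's learned combiner here
    C = E[context_ids].mean(axis=0)
    scores = theta @ C                # one score per vocabulary word
    scores -= scores.max()            # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

print(next_word_probs([0, 1, 2]))     # context: "colorless green ideas"
```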
(Some) Properties of Embeddings
Capture "like" (similar) words; capture relationships
Mikolov et al. (2013)
vector("king") − vector("man") + vector("woman") ≈ vector("queen"); vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome") (a small nearest-neighbor sketch follows this slide)
Learn more in:
- Your project
- Paper (673)
- Other classes (478/678)
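As a small, hedged illustration of the analogy arithmetic above, here is a nearest-neighbor lookup over a few made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical embeddings; in practice these come from word2vec, GloVe, etc.
emb = {
    "king":  np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.8, 0.05, 0.7]),
    "man":   np.array([0.2, 0.9, 0.05]),
    "woman": np.array([0.2, 0.3, 0.65]),
}

def nearest(query, exclude):
    """Return the vocabulary word whose vector has the highest cosine similarity to `query`."""
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in exclude:
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))   # "queen" in this toy setup
```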
Four kinds of vector models
Sparse vector representations:
1. Mutual-information weighted word co-occurrence matrices (a small PPMI sketch follows this list)
Dense vector representations:
2. Singular value decomposition / Latent Semantic Analysis
3. Neural-network-inspired models (skip-grams, CBOW)
4. Brown clusters
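A hedged sketch of option 1 (a mutual-information-weighted co-occurrence matrix): the tiny corpus, the ±1-word window, and the use of PPMI specifically are illustrative choices, not the only way to build such a matrix:

```python
import numpy as np

# Toy corpus; each row of the resulting PPMI matrix is a sparse word vector.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count co-occurrences within a +/-1-word window
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                counts[idx[w], idx[sent[j]]] += 1

total = counts.sum()
p_wc = counts / total                                  # joint p(word, context)
p_w = counts.sum(axis=1, keepdims=True) / total        # marginal p(word)
p_c = counts.sum(axis=0, keepdims=True) / total        # marginal p(context)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                              # positive PMI: clip negatives / -inf
ppmi[np.isnan(ppmi)] = 0

print(ppmi[idx["cat"], idx["sat"]])                    # weight for the ("cat", "sat") pair
```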
Shared Intuition
Model the meaning of a word by "embedding" it in a vector space: the meaning of a word is a vector of numbers.
Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index ("word number 545") or by the string itself.
Intrinsic Evaluation: Cosine Similarity
Divide the dot product by the lengths of the two vectors: this is the cosine of the angle between them. Are the vectors parallel?
+1: vectors point in the same direction
−1: vectors point in opposite directions
0: vectors are orthogonal
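A minimal sketch of the computation, with vectors chosen only to show the three cases above:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of lengths."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
print(cosine(a, 2 * a))                                      # +1: same direction
print(cosine(a, -a))                                         # -1: opposite directions
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    #  0: orthogonal
```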
Course Recap So Far
Basics of probability: requirements to be a distribution ("proportional to", ∝); definitions of conditional probability, joint probability, and independence; Bayes rule; the (probability) chain rule
Basics of language modeling: goal: model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-λ), interpolation, backoff; evaluation: perplexity
Tasks and classification (use Bayes rule!): posterior decoding, the noisy channel model; evaluations: accuracy, precision, recall, and Fβ (F1) scores; Naïve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling
Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification ("posterior in one go"); how to learn the weights: gradient descent
Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity
LATENT SEQUENCES AND LATENT VARIABLE MODELS
Is Language Modeling "Latent"? Most* of What We've Discussed: Not Really
p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)
these values are unknown, but the generation process (explanation) is transparent
*Neural language modeling as an exception
Is Document Classification "Latent"? As We've Discussed: Not Really
argmax_y exp(θ · f(x, y))
ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
these values are unknown, but the generation process (explanation) is transparent
Ambiguity: Part of Speech Tagging
British Left Waffles on Falkland Islands
(i) British/Adjective Left/Noun Waffles/Verb …
(ii) British/Noun Left/Verb Waffles/Noun …
Levels of linguistic analysis layered above the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse.
(Adapted from Jason Eisner, Noah Smith)
Latent Modeling
explain what you see (the observed text) by annotating it with things "of importance" that you don't see
(the same levels as above: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse, sitting above the observed text)
Latent Sequence Models: Part of Speech
p(British Left Waffles on Falkland Islands)
1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Noisy Channel Model
p(clean | observed) ∝ p(observed | clean) · p(clean)
decode: recover the possible (clean) output from the observed (noisy) text
p(observed | clean) is the translation/decode model; p(clean) is the (clean) language model, which reranks candidates
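A toy sketch of noisy-channel decoding: the candidate "clean" strings and all probabilities below are invented for illustration; the point is only that the channel model and the language model are multiplied (added in log space) and the best-scoring candidate wins:

```python
import math

observed = "teh cat"
# p(observed | clean) and p(clean) for a few hypothetical clean candidates
candidates = {
    "the cat": {"channel": 0.6, "lm": 0.05},
    "ten cat": {"channel": 0.3, "lm": 0.001},
    "teh cat": {"channel": 0.9, "lm": 1e-6},   # matches the observation, but the LM hates it
}

def score(c):
    s = candidates[c]
    # work in log space, as real decoders do, to avoid underflow
    return math.log(s["channel"]) + math.log(s["lm"])

best = max(candidates, key=score)
print(best)   # "the cat": the language model reranks away the literal match
```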
Latent Sequence Model: Machine Translation
The cat is on the chair. Le chat est sur la chaise.
Eddie Izzard, "Dress to Kill" (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY
How do you know what words translate as? Learn the translations! How? Learn a "reverse" latent alignment model: p(French words, alignments | English words)
p(English | French) ∝ p(French | English) · p(English)
Why "reverse"? The noisy channel decodes English using p(French | English), so that is the direction we model. Alignment? Words can have different meanings/senses.
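The slides do not commit to a particular alignment model, but as a hedged sketch of what "learning a latent alignment" can look like, here is an IBM-Model-1-style expected-count loop over two toy sentence pairs (the sentence pairs, the uniform initialization, and the variable names are all illustrative):

```python
from collections import defaultdict

# t[f][e] is the current guess for p(French word f | English word e)
pairs = [("le chat".split(), "the cat".split()),
         ("la chaise".split(), "the chair".split())]
t = defaultdict(lambda: defaultdict(lambda: 0.25))   # start uniform

for _ in range(10):
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for french, english in pairs:
        for f in french:
            # E-step flavor: posterior over which English word f aligns to
            norm = sum(t[f][e] for e in english)
            for e in english:
                p = t[f][e] / norm
                counts[f][e] += p         # fractional ("expected") count
                totals[e] += p
    # M-step flavor: renormalize expected counts into probabilities
    for f in counts:
        for e in counts[f]:
            t[f][e] = counts[f][e] / totals[e]

print(t["chat"]["cat"], t["chat"]["the"])   # "chat" ends up preferring "cat" over "the"
```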
How to Learn With Latent Variables (Sequences)
Expectation Maximization
Example: Unigram Language Modeling
maximize the (log-)likelihood to learn the probability parameters (for a unigram model, the likelihood of the corpus is ∏_i p(w_i))
Example: Unigram Language Modeling with Hidden Class
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N)
add complexity to better explain what we see
examples of latent classes z:
- part of speech tag
- topic ("sports" vs. "politics")
goal: maximize (log-)likelihood
we don't actually observe these z values; we just see the words w
if we did observe z, estimating the probability parameters would be easy… but we don't! :(
if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
since we don't observe the z values, the goal becomes: maximize the marginalized (log-)likelihood
Marginal(ized) Probability
The event w decomposes into the disjoint events (z_1 & w), (z_2 & w), (z_3 & w), (z_4 & w), so:
p(w) = p(z_1, w) + p(z_2, w) + p(z_3, w) + p(z_4, w)
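A small sketch of these two computations for a toy class-based unigram model; the two classes, the three-word vocabulary, and all probabilities are made up for illustration:

```python
# p(z, w) = p(z) * p(w | z) for a toy class-based unigram model
p_z = {"z1": 0.7, "z2": 0.3}
p_w_given_z = {
    "z1": {"run": 0.5, "code": 0.4, "the": 0.1},
    "z2": {"run": 0.1, "code": 0.1, "the": 0.8},
}

def joint(zs, ws):
    """p(z_1, w_1, ..., z_N, w_N) = prod_i p(z_i) p(w_i | z_i)"""
    prob = 1.0
    for z, w in zip(zs, ws):
        prob *= p_z[z] * p_w_given_z[z][w]
    return prob

def marginal_word(w):
    """p(w) = sum_z p(z, w): marginalize out the unseen class."""
    return sum(p_z[z] * p_w_given_z[z][w] for z in p_z)

print(joint(["z2", "z1", "z1"], ["the", "run", "code"]))
print(marginal_word("the"))   # 0.7*0.1 + 0.3*0.8 = 0.31
```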
The chicken-and-egg problem: if we observed z, estimating the probability parameters would be easy; if we knew the parameters, we could estimate z and evaluate the likelihood. We have neither.
(Image: http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg)
Expectation Maximization (EM)
Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts
Expectation Maximization (EM): E-step
1. E-step: count under uncertainty, assuming the current parameters: compute expected counts, e.g. the expected count of (z_k, w_i), weighting each possible z_k by its (posterior) probability.
We've already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts: use the expected counts to re-estimate the parameters, moving from p^(t)(z) to p^(t+1)(z).
EM Math
E-step: count under uncertainty: use the old parameters to compute the posterior distribution over the latent values z.
M-step: maximize the (expected) log-likelihood under those counts to get the new parameters.
Three Coins/Unigram With Class Example
Imagine three coins:
Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).
We only observe the outcome of the 2nd/3rd flip (record heads vs. tails); we don't observe the penny flip.
Analogy to language: observed: the letters a, b, e, etc., or "We run the code" vs. "The run failed"; unobserved: vowel or consonant? part of speech?
penny: p(heads) = λ, p(tails) = 1 − λ
dollar coin: p(heads) = γ, p(tails) = 1 − γ
dime: p(heads) = ψ, p(tails) = 1 − ψ
Three parameters to estimate: λ, γ, and ψ
If all flips were observed:
penny (1st coin): H H T H T H
2nd/3rd coin (observed): H T H T T T
With everything observed, estimating each parameter is just counting, e.g. p(heads for penny) = (# penny heads) / (# penny flips).
But not all flips are observed → set parameter values
observed (2nd/3rd coin) flips: H T H T T T (the penny flips are hidden)
penny: p(heads) = λ = .6, p(tails) = .4
dollar coin: p(heads) = .8, p(tails) = .2
dime: p(heads) = .6, p(tails) = .4
Use these values to compute posteriors
(to get each posterior, rewrite the joint using Bayes rule; the denominator is the marginal likelihood)
p(H | heads) = .8, p(T | heads) = .2
p(H) = p(H | heads) · p(heads) + p(H | tails) · p(tails) = .8 · .6 + .6 · .4 = .72
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)
Use the posteriors to update the parameters:
fully observed setting: p(heads) = (# heads from penny) / (# total flips of penny)
our setting, partially observed: p(heads) = (expected # of heads from penny) / (# total flips of penny)
p^(t+1)(heads) = (expected # of heads from penny) / (# total flips of penny) = E_{p^(t)}[# heads from penny] / (# total flips of penny) = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.508
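Putting the E-step and M-step together, here is a hedged sketch of EM for this three-coin model. The observed flips and the initial parameter values (λ = .6, dollar coin = .8, dime = .6) follow the worked example above; the updates for the dollar and dime parameters, which the slides do not work out, use the standard expected-count form, and all variable names are my own:

```python
import numpy as np

obs = ["H", "T", "H", "T", "T", "T"]   # observed 2nd/3rd-coin flips

# initial guesses (same as the worked example)
lam   = 0.6   # penny:  p(heads)
gamma = 0.8   # dollar: p(obs H | penny heads)
psi   = 0.6   # dime:   p(obs H | penny tails)

def em_step(lam, gamma, psi, obs):
    # E-step: posterior that the hidden penny flip was heads, for each observation
    post_heads = []
    for o in obs:
        p_o_given_heads = gamma if o == "H" else 1 - gamma
        p_o_given_tails = psi   if o == "H" else 1 - psi
        num = p_o_given_heads * lam
        den = num + p_o_given_tails * (1 - lam)
        post_heads.append(num / den)
    post_heads = np.array(post_heads)
    obs_is_H = np.array([o == "H" for o in obs], dtype=float)

    # M-step: re-estimate parameters from expected counts
    new_lam   = post_heads.mean()                                          # expected penny heads / flips
    new_gamma = (post_heads * obs_is_H).sum() / post_heads.sum()           # dollar coin
    new_psi   = ((1 - post_heads) * obs_is_H).sum() / (1 - post_heads).sum()  # dime
    return new_lam, new_gamma, new_psi

for t in range(5):
    lam, gamma, psi = em_step(lam, gamma, psi, obs)
    print(f"iter {t+1}: lambda={lam:.3f} gamma={gamma:.3f} psi={psi:.3f}")
```

The first iteration reproduces the numbers above: the posteriors come out to about 0.667 and 0.429, and the updated p(heads) for the penny is about 0.508.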
Expectation Maximization (EM)
Two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize log-likelihood, assuming these uncertain counts
Related to EM
Latent clustering, e.g. K-means:
https://www.csee.umbc.edu/courses/undergraduate/473/f18/kmeans/
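For comparison, a minimal K-means sketch (not the course demo at the link above), showing the EM-like alternation between assigning points and re-estimating centers; the toy 2-D data and K = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy 2-D data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
K = 2
centers = X[rng.choice(len(X), K, replace=False)]   # initialize at random data points

for _ in range(10):
    # assignment step: each point goes to its nearest center
    # (a hard assignment, unlike EM's soft posterior weights)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center becomes the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(K)])

print(centers)   # should land near the two blob means
```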