Introduction to Latent Sequences & Expectation Maximization (PowerPoint PPT Presentation)

SLIDE 1

Introduction to Latent Sequences & Expectation Maximization

CMSC 473/673 UMBC

SLIDE 2

REVIEW

  • EMBEDDINGS/DISTRIBUTED REPRESENTATIONS
  • COURSE SO FAR
SLIDE 3

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ softmax(ΞΈwi Β· f(ei-3, ei-2, ei-1))

create/use β€œdistributed representations”… ei-3 ei-2 ei-1 combine these representations (matrix-vector product)… C = f(ei-3, ei-2, ei-1), scored against each candidate word’s output embedding ΞΈwi
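As a toy illustration of the scoring rule above, here is a minimal sketch with invented embeddings; the vocabulary, the dimensions, and the simple sum standing in for the combination function f are all placeholders, not the course's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (illustrative, not learned).
vocab = ["colorless", "green", "ideas", "sleep", "furiously"]
V, d = len(vocab), 4

E = rng.normal(size=(V, d))      # input embeddings e_w
Theta = rng.normal(size=(V, d))  # output embeddings theta_w

def next_word_distribution(context_ids):
    """p(wi | wi-3, wi-2, wi-1) ∝ softmax(theta_wi Β· f(e's))."""
    # f: combine the context embeddings; a plain sum stands in for the
    # matrix-vector combination on the slide.
    c = E[context_ids].sum(axis=0)
    scores = Theta @ c            # one dot product per vocabulary word
    scores -= scores.max()        # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()

p = next_word_distribution([0, 1, 2])  # context: "colorless green ideas"
```

The softmax guarantees the scores form a distribution over the whole vocabulary, which is what lets the model "compute beliefs about what is likely."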

SLIDE 4

(Some) Properties of Embeddings

  β€’ Capture β€œlike” (similar) words
  β€’ Capture relationships

Mikolov et al. (2013)

vector(β€˜king’) – vector(β€˜man’) + vector(β€˜woman’) β‰ˆ vector(β€˜queen’) vector(β€˜Paris’) - vector(β€˜France’) + vector(β€˜Italy’) β‰ˆ vector(β€˜Rome’)

SLIDE 5

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)

Four kinds of vector models

Sparse vector representations:

  β€’ 1. Mutual-information-weighted word co-occurrence matrices

Dense vector representations:

  β€’ 2. Singular value decomposition / Latent Semantic Analysis
  β€’ 3. Neural-network-inspired models (skip-grams, CBOW)
  β€’ 4. Brown clusters

SLIDE 6

Shared Intuition

Model the meaning of a word by β€œembedding” it in a vector space: the meaning of a word is a vector of numbers. Contrast: in many computational linguistic applications, word meaning is represented by a vocabulary index (β€œword number 545”) or the string itself.

SLIDE 7

Intrinsic Evaluation: Cosine Similarity

Divide the dot product by the lengths of the two vectors. This is the cosine of the angle between them: are the vectors parallel?

  β€’ +1: vectors point in the same direction
  β€’ 0: vectors are orthogonal
  β€’ βˆ’1: vectors point in opposite directions
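The three reference values can be checked directly; a minimal sketch, with vectors chosen only for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the two vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, -1.0, 5.0])   # dot(a, b) = 2 - 2 + 0 = 0

same = cosine(a, 2 * a)   # scaled copy: same direction
opp = cosine(a, -a)       # negated copy: opposite direction
orth = cosine(a, b)       # orthogonal
```

Because the dot product is divided by both lengths, scaling a vector leaves its cosine similarity unchanged; only direction matters.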

SLIDE 8–13

Course Recap So Far (built up across these slides)

  β€’ Basics of probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule; (probability) chain rule
  β€’ Basics of language modeling: goal: model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity
  β€’ Tasks and classification (use Bayes rule!): posterior decoding, noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling
  β€’ Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent
  β€’ Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity

SLIDE 14

LATENT SEQUENCES AND LATENT VARIABLE MODELS

SLIDE 15–16

Is Language Modeling β€œLatent?” Most* of what we’ve discussed: not really

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

these values are unknown, but the generation process (explanation) is transparent

*neural language modeling is an exception
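The factorization above is just a product of conditional probabilities; a sketch with invented component values (a trained n-gram model would supply the real ones):

```python
# Each factor below mirrors one term of the chain-rule factorization;
# the numeric values are made up for illustration.
probs = {
    "p(Colorless)": 0.10,
    "p(green | Colorless)": 0.20,
    "p(ideas | Colorless green)": 0.05,
    "p(sleep | green ideas)": 0.05,
    "p(furiously | ideas sleep)": 0.10,
}

# p(Colorless green ideas sleep furiously) = product of all factors
p_sentence = 1.0
for factor in probs.values():
    p_sentence *= factor
```

Every factor is fully specified once the model's tables are estimated, which is why this kind of language model is not β€œlatent”: nothing unobserved has to be inferred to score a sentence.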

SLIDE 17–19

Is Document Classification β€œLatent?” As we’ve discussed: not really

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

argmax_y exp(θ · g(y, x))

these values are unknown, but the generation process (explanation) is transparent

SLIDE 20

Ambiguity β†’ Part of Speech Tagging

British Left Waffles on Falkland Islands

Two possible taggings of β€œBritish Left Waffles”:

  β€’ Adjective Noun Verb
  β€’ Noun Verb Noun

SLIDE 21–22

Latent Modeling

explain what you see/annotate with things β€œof importance” you don’t

observed text β†’ orthography β†’ morphology β†’ lexemes β†’ syntax β†’ semantics β†’ pragmatics β†’ discourse

Adapted from Jason Eisner, Noah Smith
SLIDE 23–26

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

  β€’ 1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
  β€’ 2. Produce a tag sequence for this sentence

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

SLIDE 27–28

Noisy Channel Model / Latent Sequence Model: Machine Translation

Decode / Rerank

p(Y | Z) ∝ p(Z | Y) * p(Y)

  β€’ Y: possible (clean) output
  β€’ Z: observed (noisy) text
  β€’ p(Z | Y): translation/decode model
  β€’ p(Y): (clean) language model
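Decoding under this model means picking the clean output Y that maximizes p(Z | Y) * p(Y); a minimal sketch with made-up candidate scores (the candidate strings and probabilities are invented):

```python
import math

# Hypothetical candidate clean outputs Y for one observed noisy Z.
# channel[y] plays the role of p(Z | Y=y); lm[y] plays the role of p(Y=y).
channel = {"the cat": 0.4, "the cart": 0.5, "thecat": 0.1}
lm      = {"the cat": 0.6, "the cart": 0.3, "thecat": 0.1}

def decode(channel, lm):
    """argmax_y p(Z | y) * p(y), computed in log space to avoid underflow."""
    return max(channel, key=lambda y: math.log(channel[y]) + math.log(lm[y]))

best = decode(channel, lm)
```

Note that the channel model alone would prefer "the cart"; the language model prior p(Y) is what reranks "the cat" to the top.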

SLIDE 29–34

Latent Sequence Model: Machine Translation

The cat is on the chair. ↔ Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations!

How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words)

Alignment? Words can have different meanings/senses

Why reverse? Decode with the noisy channel:

p(English | French) ∝ p(French | English) * p(English)

SLIDE 35

How to Learn With Latent Variables (Sequences)

Expectation Maximization

SLIDE 36–37

Example: Unigram Language Modeling

maximize the (log-)likelihood to learn the probability parameters
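For a fully observed unigram model, maximizing the (log-)likelihood gives relative-frequency estimates; a minimal sketch on an invented corpus:

```python
from collections import Counter

# Maximum-likelihood unigram estimates are just relative frequencies.
corpus = "the cat saw the dog the end".split()
counts = Counter(corpus)
total = len(corpus)

p = {w: c / total for w, c in counts.items()}
```

No iteration is needed here: with every word observed, the likelihood-maximizing parameters come straight from counting. The latent-class version below is what breaks this simplicity.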

SLIDE 38–43

Example: Unigram Language Modeling with Hidden Class

p(z1, w1, z2, w2, …, zN, wN) = p(z1) p(w1 | z1) ⋯ p(zN) p(wN | zN)

add complexity to better explain what we see

examples of latent classes z:

  β€’ part of speech tag
  β€’ topic (β€œsports” vs. β€œpolitics”)

goal: maximize the (log-)likelihood. We don’t actually observe these z values; we just see the words w.

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(

SLIDE 44–52

Example: Unigram Language Modeling with Hidden Class (Marginalized Probability)

p(z1, w1, z2, w2, …, zN, wN) = p(z1) p(w1 | z1) ⋯ p(zN) p(wN | zN)

we don’t actually observe these z values, so the goal becomes: maximize the marginalized (log-)likelihood

sum the joint over every value the latent class can take; with four classes:

p(w) = p(z1, w) + p(z2, w) + p(z3, w) + p(z4, w)

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(
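A sketch of computing the marginalized (log-)likelihood for a toy two-class unigram model; all probabilities and the word list are invented:

```python
import math

# Toy model: class prior p(z) and per-class word distributions p(w | z).
p_z = {"z1": 0.7, "z2": 0.3}
p_w_given_z = {
    "z1": {"run": 0.5, "code": 0.5},
    "z2": {"run": 0.9, "code": 0.1},
}

def marginal(w):
    """p(w) = sum_z p(z, w) = sum_z p(z) * p(w | z)."""
    return sum(p_z[z] * p_w_given_z[z][w] for z in p_z)

# Marginalized log-likelihood of an observed word sequence:
words = ["run", "code", "run"]
loglik = sum(math.log(marginal(w)) for w in words)
```

Each word's probability sums over both hidden classes, so the likelihood can be evaluated without ever knowing which class generated which word; EM exploits exactly this quantity.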

SLIDE 53–55

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

a chicken-and-egg problem:

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(

SLIDE 56

Expectation Maximization (EM)

A two-step, iterative algorithm:

  β€’ 0. Assume some value for your parameters
  β€’ 1. E-step: count under uncertainty (compute expectations)
  β€’ 2. M-step: maximize the log-likelihood, assuming these uncertain counts

SLIDE 57–58

Expectation Maximization (EM): E-step

  β€’ 0. Assume some value for your parameters
  β€’ 1. E-step: count under uncertainty, assuming these parameters: accumulate the expected counts count(zi, wi), weighting each by the model’s current belief p(zi)
  β€’ 2. M-step: maximize the log-likelihood, assuming these uncertain counts

We’ve already seen this type of counting, when computing the gradient in maxent models.

SLIDE 59

Expectation Maximization (EM): M-step

  β€’ 0. Assume some value for your parameters
  β€’ 1. E-step: count under uncertainty, assuming these parameters
  β€’ 2. M-step: maximize the log-likelihood, assuming these uncertain counts: turn the estimated (expected) counts under the current parameters p^(t)(z) into new parameters p^(t+1)(z)
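Putting the two steps together for the hidden-class unigram model; a minimal EM sketch in which the data, the class names, and the starting values are all invented:

```python
from collections import defaultdict

# Toy observed words and a two-class model with arbitrary starting values.
words = ["a", "a", "b", "a", "b", "b", "b", "b"]
classes = ["z1", "z2"]
p_z = {"z1": 0.5, "z2": 0.5}
p_w = {"z1": {"a": 0.7, "b": 0.3}, "z2": {"a": 0.3, "b": 0.7}}

for _ in range(20):
    # E-step: posterior over the hidden class for each token (soft counts).
    exp_z = {z: 0.0 for z in classes}
    exp_zw = {z: defaultdict(float) for z in classes}
    for w in words:
        norm = sum(p_z[z] * p_w[z][w] for z in classes)
        for z in classes:
            post = p_z[z] * p_w[z][w] / norm   # p(z | w) under current params
            exp_z[z] += post
            exp_zw[z][w] += post
    # M-step: re-estimate the parameters from the expected counts.
    p_z = {z: exp_z[z] / len(words) for z in classes}
    p_w = {z: {w: exp_zw[z][w] / exp_z[z] for w in exp_zw[z]} for z in classes}
```

The M-step is the same relative-frequency estimation as in the fully observed case; the only change is that hard counts are replaced by expected counts from the E-step.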

SLIDE 60–63

EM Math

E-step: count under uncertainty, using the posterior distribution p(z | w; θ^(t)) computed with the old parameters

M-step: maximize the log-likelihood with respect to the new parameters:

θ^(t+1) = argmax_θ Σ_z p(z | w; θ^(t)) log p(z, w; θ)

SLIDE 64–68

Three Coins/Unigram With Class Example

Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).

We only observe the second flip (record its heads vs. tails outcome); we don’t observe the penny.

observed: a, b, e, etc.; β€œWe run the code” vs. β€œThe run failed”
unobserved: vowel or consonant? part of speech?

penny: p(heads) = λ, p(tails) = 1 − λ
dollar: p(heads) = γ, p(tails) = 1 − γ
dime: p(heads) = ψ, p(tails) = 1 − ψ

Three parameters to estimate: λ, γ, and ψ

SLIDE 69–70

Three Coins/Unigram With Class Example

If all flips were observed:

penny:    H H T H T H
observed: H T H T T T

then estimating λ, γ, and ψ is just counting relative frequencies (e.g., λ = 4/6 from the penny row).

SLIDE 71–73

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values:

penny: p(heads) = λ = .6, p(tails) = .4
dollar: p(heads) = .8, p(tails) = .2
dime: p(heads) = .6, p(tails) = .4

observed flips: H T H T T T (the penny flips are hidden)

Use these values to compute posteriors: compute the marginal likelihood, and rewrite the joint using Bayes rule.

SLIDE 74–75

Three Coins/Unigram With Class Example

Use these values to compute posteriors:

p(H | heads) = .8, p(T | heads) = .2

p(H) = p(H | heads) * p(heads) + p(H | tails) * p(tails) = .8 * .6 + .6 * .4 = .72

SLIDE 76–79

Three Coins/Unigram With Class Example

observed flips: H T H T T T β†’ 2 heads, 4 tails

Use posteriors to update parameters:

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 * .6) / (.8 * .6 + .6 * .4) β‰ˆ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 * .6) / (.2 * .6 + .4 * .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

Fully observed setting:

p(heads) = (# heads from penny) / (# total flips of penny)

Our setting (partially observed): replace the count with its expected value under the current parameters:

p^(t+1)(heads) = E_{p^(t)}[# heads from penny] / (# total flips of penny)
             = (2 * p(heads | obs. H) + 4 * p(heads | obs. T)) / 6 β‰ˆ 0.508

SLIDE 80

Expectation Maximization (EM)

A two-step, iterative algorithm:

  β€’ 0. Assume some value for your parameters
  β€’ 1. E-step: count under uncertainty (compute expectations)
  β€’ 2. M-step: maximize the log-likelihood, assuming these uncertain counts

SLIDE 81

Related to EM

Latent clustering:

  β€’ K-means: https://www.csee.umbc.edu/courses/undergraduate/473/f18/kmeans/
  β€’ Gaussian mixture modeling