10-418 / 10-618 Machine Learning for Structured Data
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Bayesian Inference for Parameter Estimation + Topic Modeling
Matt Gormley
Lecture 20, Nov. 4, 2019
Reminders
• Homework 3: Structured SVM
  – Out: Fri, Oct. 24
  – Due: Wed, Nov. 6 at 11:59pm
• Homework 4: Topic Modeling
  – Out: Wed, Nov. 6
  – Due: Mon, Nov. 18 at 11:59pm
TOPIC MODELING
Topic Modeling
Motivation: Suppose you're given a massive corpus and asked to carry out the following tasks:
• Organize the documents into thematic categories
• Describe the evolution of those categories over time
• Enable a domain expert to analyze and understand the content
• Find relationships between the categories
• Understand how authorship influences the content

Topic Modeling: A method of (usually unsupervised) discovery of latent or hidden structure in a corpus
• Applied primarily to text corpora, but the techniques are more general
• Provides a modeling toolbox
• Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets
Topic Modeling
• Dirichlet-multinomial regression (DMR) topic model on ICML (Mimno & McCallum, 2008)
  http://www.cs.umass.edu/~mimno/icml100.html
Topic Modeling • Map of NIH Grants (Talley et al., 2011) https://app.nihmaps.org/
Other Applications of Topic Models
• Spatial LDA (Wang & Grimson, 2007)
  [figure: image segmentation results, panels labeled Manual, LDA, SLDA]
Outline
• Applications of Topic Modeling
• Latent Dirichlet Allocation (LDA)
  1. Beta-Bernoulli
  2. Dirichlet-Multinomial
  3. Dirichlet-Multinomial Mixture Model
  4. LDA
• Bayesian Inference for Parameter Estimation
  – Exact inference
  – EM
  – Monte Carlo EM
  – Gibbs sampler
  – Collapsed Gibbs sampler
• Extensions of LDA
  – Correlated topic models
  – Dynamic topic models
  – Polylingual topic models
  – Supervised LDA
BAYESIAN INFERENCE FOR NAÏVE BAYES
Beta-Bernoulli Model
• Beta Distribution:

$$f(\phi \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \phi^{\alpha - 1} (1 - \phi)^{\beta - 1}$$

[figure: Beta densities $f(\phi \mid \alpha, \beta)$ on $\phi \in [0, 1]$ for $(\alpha, \beta) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)$]
Beta-Bernoulli Model
• Generative Process:
  φ ∼ Beta(α, β) [draw distribution over words]
  For each word n ∈ {1, ..., N}:
    x_n ∼ Bernoulli(φ) [draw word]
• Example corpus (heads/tails):
  H T T H H T T H H H
  x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
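To make the process concrete, here is a minimal numpy sketch of the Beta-Bernoulli generative story; the seed, hyperparameter values, and corpus size are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)     # illustrative seed
alpha, beta = 5.0, 5.0             # illustrative hyperparameters
N = 10                             # corpus size, matching the toy example

phi = rng.beta(alpha, beta)        # phi ~ Beta(alpha, beta): probability of heads
x = rng.binomial(1, phi, size=N)   # x_n ~ Bernoulli(phi), n = 1..N (1 = H, 0 = T)
print(f"phi = {phi:.3f}, corpus = {' '.join('H' if xi else 'T' for xi in x)}")
```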
Dirichlet-Multinomial Model
• Dirichlet Distribution: the generalization of the Beta distribution from 2 outcomes to K. The Beta density and plots on the previous slide are the special case K = 2.
Dirichlet-Multinomial Model
• Dirichlet Distribution:

$$p(\vec{\phi} \mid \vec{\alpha}) = \frac{1}{B(\vec{\alpha})} \prod_{k=1}^{K} \phi_k^{\alpha_k - 1}, \quad \text{where } B(\vec{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}$$

[figure: two 3D surface plots of the density $p(\vec{\phi} \mid \vec{\alpha})$ over the simplex coordinates $(\phi_1, \phi_2)$]
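As a quick sanity check of the density above, scipy's `dirichlet` supports both evaluation and sampling; the concentration parameters and test point below are illustrative.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([5.0, 5.0, 5.0])   # illustrative symmetric concentration, K = 3
phi = np.array([0.2, 0.3, 0.5])     # a point on the probability simplex

print(dirichlet.pdf(phi, alpha))    # evaluates p(phi | alpha) per the formula above
print(dirichlet.rvs(alpha, size=2)) # two sampled distributions over K = 3 outcomes
```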
Dirichlet-Multinomial Model
• Generative Process:
  φ ∼ Dir(β) [draw distribution over words]
  For each word n ∈ {1, ..., N}:
    x_n ∼ Mult(1, φ) [draw word]
• Example corpus:
  the he is the and the she she is is
  x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
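A minimal numpy sketch of this generative process, assuming a toy five-word vocabulary and a symmetric prior (both illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)              # illustrative seed
vocab = ["the", "he", "is", "and", "she"]   # toy vocabulary (illustrative)
beta = np.ones(len(vocab))                  # symmetric Dirichlet prior (illustrative)
N = 10

phi = rng.dirichlet(beta)                   # phi ~ Dir(beta): distribution over words
x = [vocab[rng.choice(len(vocab), p=phi)]   # x_n ~ Mult(1, phi): draw each word
     for _ in range(N)]
print(" ".join(x))
```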
Dirichlet-Multinomial Mixture Model
• Generative Process:
  For each topic k ∈ {1, ..., K}:
    φ_k ∼ Dir(β) [draw distribution over words]
  θ ∼ Dir(α) [draw distribution over topics]
  For each document m ∈ {1, ..., M}:
    z_m ∼ Mult(1, θ) [draw topic assignment]
    For each word n ∈ {1, ..., N_m}:
      x_mn ∼ Mult(1, φ_{z_m}) [draw word]
• Example corpus:
  Document 1: the he is (x_11 x_12 x_13)
  Document 2: the and the (x_21 x_22 x_23)
  Document 3: she she is is (x_31 x_32 x_33 x_34)
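A numpy sketch of the mixture model's generative story, adding the topic layer to the previous snippet. The document lengths mirror the toy corpus; the number of topics, vocabulary, priors, and seed are illustrative assumptions. The key property to notice: each document gets exactly one topic.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
vocab = ["the", "he", "is", "and", "she"]      # toy vocabulary (illustrative)
K, N_m = 2, [3, 3, 4]                          # K topics; lengths of the 3 toy documents
beta, alpha = np.ones(len(vocab)), np.ones(K)  # symmetric priors (illustrative)

phi = rng.dirichlet(beta, size=K)              # phi_k ~ Dir(beta) for each topic k
theta = rng.dirichlet(alpha)                   # theta ~ Dir(alpha): ONE corpus-wide topic dist.
for m, N in enumerate(N_m):
    z_m = rng.choice(K, p=theta)               # z_m ~ Mult(1, theta): one topic per document
    doc = [vocab[rng.choice(len(vocab), p=phi[z_m])] for _ in range(N)]
    print(f"Document {m + 1} (topic {z_m}): {' '.join(doc)}")
```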
Bayesian Inference for Naïve Bayes
Whiteboard:
– Naïve Bayes is not Bayesian
– What if we observed both words and topics?
– Dirichlet-Multinomial in the fully observed setting is just Naïve Bayes
– Three ways of estimating parameters:
  1. MLE for Naïve Bayes
  2. MAP estimation for Naïve Bayes
  3. Bayesian parameter estimation for Naïve Bayes
Dirichlet-Multinomial Model
• The Dirichlet is conjugate to the Multinomial:
  φ ∼ Dir(β) [draw distribution over words]
  For each word n ∈ {1, ..., N}:
    x_n ∼ Mult(1, φ) [draw word]
• The posterior of φ is

$$p(\phi \mid X) = \frac{p(X \mid \phi)\, p(\phi)}{p(X)}$$

• Define the count vector n such that n_t denotes the number of times word t appeared
• Then the posterior is also a Dirichlet distribution:

$$\phi \mid X \sim \mathrm{Dir}(\beta + n)$$
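The one-line derivation behind this conjugacy result: multiply the multinomial likelihood by the Dirichlet prior defined above and match exponents (V is the vocabulary size):

$$p(\phi \mid X) \propto p(X \mid \phi)\, p(\phi) \propto \prod_{t=1}^{V} \phi_t^{n_t} \cdot \prod_{t=1}^{V} \phi_t^{\beta_t - 1} = \prod_{t=1}^{V} \phi_t^{(\beta_t + n_t) - 1} \propto \mathrm{Dir}(\phi \mid \beta + n)$$

Normalizing recovers exactly the Dirichlet density with updated parameters β + n, so Bayesian updating reduces to adding observed counts to the prior's pseudo-counts.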
LATENT DIRICHLET ALLOCATION (LDA)
Mixture vs. Admixture (LDA)
[plate diagrams: the mixture model (one topic assignment per document) vs. the admixture model (one topic assignment per word)]
Diagrams from Wallach, JHU 2011, slides
Latent Dirichlet Allocation
• Generative Process:
  For each topic k ∈ {1, ..., K}:
    φ_k ∼ Dir(β) [draw distribution over words]
  For each document m ∈ {1, ..., M}:
    θ_m ∼ Dir(α) [draw distribution over topics]
    For each word n ∈ {1, ..., N_m}:
      z_mn ∼ Mult(1, θ_m) [draw topic assignment]
      x_mn ∼ Mult(1, φ_{z_mn}) [draw word]
• Example corpus:
  Document 1: the he is (x_11 x_12 x_13)
  Document 2: the and the (x_21 x_22 x_23)
  Document 3: she she is is (x_31 x_32 x_33 x_34)
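A numpy sketch of LDA's generative process, for contrast with the mixture model sketch above: θ_m is now drawn per document and z_mn per word, so a single document can mix topics. Vocabulary, priors, and seed are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
vocab = ["the", "he", "is", "and", "she"]      # toy vocabulary (illustrative)
K, N_m = 2, [3, 3, 4]                          # K topics; lengths of the 3 toy documents
beta, alpha = np.ones(len(vocab)), np.ones(K)  # symmetric priors (illustrative)

phi = rng.dirichlet(beta, size=K)              # phi_k ~ Dir(beta) for each topic k
for m, N in enumerate(N_m):
    theta_m = rng.dirichlet(alpha)             # theta_m ~ Dir(alpha): PER-DOCUMENT topic dist.
    doc = []
    for n in range(N):
        z_mn = rng.choice(K, p=theta_m)        # z_mn ~ Mult(1, theta_m): topic PER WORD
        doc.append(vocab[rng.choice(len(vocab), p=phi[z_mn])])  # x_mn ~ Mult(1, phi_{z_mn})
    print(f"Document {m + 1}: {' '.join(doc)}")
```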
(Blei, Ng, & Jordan, 2003) LDA for Topic Modeling
Dirichlet(β)
[figure: six bar charts of word probabilities, ϕ_1, ..., ϕ_6, one per topic, each drawn from the Dirichlet(β) prior]
• The generative story begins with only a Dirichlet prior over the topics.
• Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕ_k.
(Blei, Ng, & Jordan, 2003) LDA for Topic Modeling
Dirichlet(β)
[figure: bar charts for topics ϕ_1, ..., ϕ_6; one topic's high-probability words are shown]
{hockey}: team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup
• A topic is visualized as its high probability words.
• A pedagogical label is used to identify the topic.
(Blei, Ng, & Jordan, 2003) LDA for Topic Modeling
Dirichlet(β)
[figure: bar charts for topics ϕ_1, ..., ϕ_6, now labeled: {Canadian gov.}, {government}, {hockey}, {U.S. gov.}, {baseball}, {Japan}]
• A topic is visualized as its high probability words.
• A pedagogical label is used to identify the topic.
(Blei, Ng, & Jordan, 2003) LDA for Topic Modeling
[figure: the six labeled topics ϕ_1, ..., ϕ_6 drawn from Dirichlet(β)]
Dirichlet(α)
θ_1 = [bar chart: document 1's distribution over the six topics, drawn from Dirichlet(α)]
(Blei, Ng, & Jordan, 2003) LDA for Topic Modeling
[figure: the six labeled topics and θ_1 as above; document 1's words are then generated one at a time]
The 54/40' boundary dispute is still unresolved, and Canadian and US