
Dr. Nonparametric Bayes
Or: How I Learned to Stop Worrying and Love the Dirichlet Process

Kurt Miller
CS 294: Practical Machine Learning
November 19, 2009

Today we will discuss Nonparametric Bayesian methods.


1. Preliminaries: Clustering – A Parametric Approach

Frequentist approach: Gaussian mixture models with K mixtures.
• Distribution over classes: π = (π_1, ..., π_K).
• Each cluster has a mean and covariance: φ_k = (μ_k, Σ_k).
Then
$$p(x \mid \pi, \phi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \phi_k).$$
Use Expectation Maximization (EM) to maximize the likelihood of the data with respect to (π, φ). (A short sketch follows.)
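For reference, the whole frequentist pipeline is a few lines with scikit-learn; this is an added illustration (the synthetic data and seed are ours, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data from two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

# Fit a K=2 Gaussian mixture by EM; this recovers (pi, mu_k, Sigma_k).
gmm = GaussianMixture(n_components=2).fit(X)
print(gmm.weights_)   # estimated pi
print(gmm.means_)     # estimated cluster means
```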

2. Preliminaries: Clustering – A Parametric Approach

Frequentist approach: Gaussian mixture models with K mixtures. Alternate definition:
$$G = \sum_{k=1}^{K} \pi_k \, \delta_{\phi_k}$$
where δ_{φ_k} is an atom at φ_k. Then
$$\theta_i \sim G, \qquad x_i \sim p(x \mid \theta_i).$$
[Graphical model: G, a measure on Ω; θ_i ∼ G and x_i ∼ p(x | θ_i) for i = 1, ..., N.]

3. Parametric Bayesian Clustering: Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian mixture models with K mixtures.
• Distribution over classes: π = (π_1, ..., π_K), with
$$\pi \sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K).$$
(We'll review the Dirichlet distribution in a few slides.)
• Each cluster has a mean and covariance: φ_k = (μ_k, Σ_k), with
$$(\mu_k, \Sigma_k) \sim \text{Normal-Inverse-Wishart}(\nu).$$
We still have
$$p(x \mid \pi, \phi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \phi_k).$$

4. Parametric Bayesian Clustering: Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian mixture models with K mixtures. G is now a random measure:
$$\phi_k \sim G_0, \qquad \pi \sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K), \qquad G = \sum_{k=1}^{K} \pi_k \, \delta_{\phi_k}$$
$$\theta_i \sim G, \qquad x_i \sim p(x \mid \theta_i).$$
[Graphical model: α and G_0 generate G on Ω; θ_i ∼ G and x_i for i = 1, ..., N.]

5. Parametric Bayesian Clustering: The Dirichlet Distribution

We had π ∼ Dirichlet(α_1, ..., α_K). The Dirichlet density is defined as
$$p(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \; \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} \cdots \pi_K^{\alpha_K - 1}$$
where $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$. The expectations of π are
$$\mathbb{E}(\pi_i) = \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}.$$
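A quick way to sanity-check the expectation formula is to draw Dirichlet samples with NumPy and compare the empirical mean to α_i / Σ_k α_k. This snippet is an illustrative addition (the particular α is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])

# Draw many samples pi ~ Dirichlet(alpha); each row sums to 1.
samples = rng.dirichlet(alpha, size=100_000)

print("empirical mean:", samples.mean(axis=0))   # ~ [0.2, 0.3, 0.5]
print("theoretical   :", alpha / alpha.sum())    # alpha_i / sum(alpha)
```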

6. Parametric Bayesian Clustering: The Beta Distribution

A special case of the Dirichlet distribution is the Beta distribution, for when K = 2:
$$p(\pi \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)} \; \pi^{\alpha_1 - 1} (1 - \pi)^{\alpha_2 - 1}$$
[Figure: Beta densities on [0, 1] for (α_1, α_2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), and (9.0, 3.0).]
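The figure can be reproduced with SciPy and matplotlib; a sketch added for this write-up, using the parameter pairs from the slide's legend:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

x = np.linspace(0.001, 0.999, 500)  # avoid the endpoints where the pdf may blow up
for a1, a2 in [(1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]:
    plt.plot(x, beta.pdf(x, a1, a2), label=f"a1={a1}, a2={a2}")
plt.legend()
plt.show()
```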

7. Parametric Bayesian Clustering: The Dirichlet Distribution

In three dimensions:
$$p(\pi \mid \alpha_1, \alpha_2, \alpha_3) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,\Gamma(\alpha_3)} \; \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} (1 - \pi_1 - \pi_2)^{\alpha_3 - 1}$$
[Figure: densities over the simplex for α = (2, 2, 2), α = (5, 5, 5), and α = (2, 2, 25).]

8. Parametric Bayesian Clustering: Draws from the Dirichlet Distribution

[Figure: three independent draws of π, shown as bar plots over components 1–3, for each of α = (2, 2, 2), α = (5, 5, 5), and α = (2, 2, 5).]

9. Parametric Bayesian Clustering: Key Property of the Dirichlet Distribution

The Aggregation Property: If
$$(\pi_1, \ldots, \pi_i, \pi_{i+1}, \ldots, \pi_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_i, \alpha_{i+1}, \ldots, \alpha_K)$$
then
$$(\pi_1, \ldots, \pi_i + \pi_{i+1}, \ldots, \pi_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_i + \alpha_{i+1}, \ldots, \alpha_K).$$

10. Parametric Bayesian Clustering: Key Property of the Dirichlet Distribution

The Aggregation Property: If
$$(\pi_1, \ldots, \pi_i, \pi_{i+1}, \ldots, \pi_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_i, \alpha_{i+1}, \ldots, \alpha_K)$$
then
$$(\pi_1, \ldots, \pi_i + \pi_{i+1}, \ldots, \pi_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_i + \alpha_{i+1}, \ldots, \alpha_K).$$
This is also valid for any aggregation, e.g.
$$\left(\pi_1 + \pi_2, \; \sum_{k=3}^{K} \pi_k\right) \sim \text{Beta}\!\left(\alpha_1 + \alpha_2, \; \sum_{k=3}^{K} \alpha_k\right).$$
(A numerical check follows.)
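A minimal Monte Carlo check of the aggregation property, added here as an illustration: aggregate the first two coordinates of Dirichlet draws and compare moments against the implied Beta distribution.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])

pi = rng.dirichlet(alpha, size=200_000)
agg = pi[:, 0] + pi[:, 1]                        # aggregate pi_1 + pi_2

b = beta(alpha[0] + alpha[1], alpha[2:].sum())   # implied Beta(a1+a2, a3+a4)
print("empirical mean/var:", agg.mean(), agg.var())
print("Beta mean/var     :", b.mean(), b.var())
```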

11. Parametric Bayesian Clustering: Multinomial-Dirichlet Conjugacy

Let Z ∼ Multinomial(π) and π ∼ Dir(α). Posterior:
$$p(\pi \mid z) \propto p(z \mid \pi)\, p(\pi) = \left(\pi_1^{z_1} \cdots \pi_K^{z_K}\right)\left(\pi_1^{\alpha_1 - 1} \cdots \pi_K^{\alpha_K - 1}\right) = \pi_1^{z_1 + \alpha_1 - 1} \cdots \pi_K^{z_K + \alpha_K - 1}$$
which is Dir(α + z).
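Because the posterior is again Dirichlet, the update is a one-liner. This added sketch draws counts from a hypothetical "true" π and forms the posterior parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 1.0, 1.0])            # Dirichlet prior
pi_true = np.array([0.2, 0.3, 0.5])          # illustrative ground truth

z = rng.multinomial(n=100, pvals=pi_true)    # observed category counts
alpha_post = alpha + z                       # posterior is Dir(alpha + z)

print("posterior mean:", alpha_post / alpha_post.sum())  # shrinks toward the prior
```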

12. Parametric Bayesian Clustering: Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian mixture models with K mixtures. G is now a random measure:
$$\phi_k \sim G_0, \qquad \pi \sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K), \qquad G = \sum_{k=1}^{K} \pi_k \, \delta_{\phi_k}$$
$$\theta_i \sim G, \qquad x_i \sim p(x \mid \theta_i).$$
[Graphical model: α and G_0 generate G on Ω; θ_i ∼ G and x_i for i = 1, ..., N.]

13. Parametric Bayesian Clustering: Bayesian Mixture Models

We no longer want just the maximum likelihood parameters; we want the full posterior:
$$p(\pi, \phi \mid X) \propto p(X \mid \pi, \phi)\, p(\pi, \phi).$$
Unfortunately, this is not analytically tractable. Two main approaches to approximate inference:
• Markov Chain Monte Carlo (MCMC) methods
• Variational approximations

14. Parametric Bayesian Clustering: Monte Carlo Methods

Suppose we wish to reason about p(θ | X), but we cannot compute this distribution exactly. If instead we can sample θ ∼ p(θ | X), what can we do?
[Figure: the density p(θ | X) alongside a histogram of samples drawn from p(θ | X).]
This is the idea behind Monte Carlo methods.

15. Parametric Bayesian Clustering: Markov Chain Monte Carlo (MCMC)

We do not have access to an oracle that will give us samples θ ∼ p(θ | X). How do we get these samples? Markov Chain Monte Carlo (MCMC) methods have been developed to solve this problem. We focus on Gibbs sampling, a special case of the Metropolis-Hastings algorithm.

16. Parametric Bayesian Clustering: Gibbs Sampling – An MCMC Technique

Assume θ consists of several parameters: θ = (θ_1, ..., θ_m). In the finite mixture model, θ = (π, μ_1, ..., μ_K, Σ_1, ..., Σ_K). Then:
• Initialize θ^(0) = (θ_1^(0), ..., θ_m^(0)) at time step 0.
• For t = 1, 2, ..., draw θ^(t) given θ^(t−1) in such a way that eventually the θ^(t) are samples from p(θ | X).

17. Parametric Bayesian Clustering: Gibbs Sampling – An MCMC Technique

In Gibbs sampling, we only need to be able to sample
$$\theta_i^{(t)} \sim p\!\left(\theta_i \mid \theta_1^{(t)}, \ldots, \theta_{i-1}^{(t)}, \theta_{i+1}^{(t-1)}, \ldots, \theta_m^{(t-1)}, X\right).$$
If we repeat this for any model we discuss today, theory tells us that eventually we get samples θ^(t) from p(θ | X).

18. Parametric Bayesian Clustering: Gibbs Sampling – An MCMC Technique

In Gibbs sampling, we only need to be able to sample
$$\theta_i^{(t)} \sim p\!\left(\theta_i \mid \theta_1^{(t)}, \ldots, \theta_{i-1}^{(t)}, \theta_{i+1}^{(t-1)}, \ldots, \theta_m^{(t-1)}, X\right).$$
If we repeat this for any model we discuss today, theory tells us that eventually we get samples θ^(t) from p(θ | X).

Example: θ = (θ_1, θ_2) with θ ∼ N(μ, Σ).
[Figure: Gibbs sampling traces for a bivariate Gaussian; left panel shows the first 50 samples, right panel the first 500. A code sketch follows.]
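To make the example concrete, here is a minimal Gibbs sampler for a bivariate Gaussian; the values of μ, σ, and the correlation ρ are arbitrary choices of ours. Each step samples one coordinate from its exact Gaussian conditional (a standard result for the bivariate normal):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
sigma = np.array([1.0, 1.5])
rho = 0.8  # correlation between theta_1 and theta_2

theta = np.zeros(2)              # arbitrary initialization
samples = []
for t in range(500):
    # theta_1 | theta_2 ~ N(mu_1 + rho*(s1/s2)*(theta_2 - mu_2), s1^2*(1 - rho^2))
    m1 = mu[0] + rho * sigma[0] / sigma[1] * (theta[1] - mu[1])
    theta[0] = rng.normal(m1, sigma[0] * np.sqrt(1 - rho**2))
    # theta_2 | theta_1, symmetrically.
    m2 = mu[1] + rho * sigma[1] / sigma[0] * (theta[0] - mu[0])
    theta[1] = rng.normal(m2, sigma[1] * np.sqrt(1 - rho**2))
    samples.append(theta.copy())

samples = np.array(samples)
print(samples.mean(axis=0))      # approaches mu as the chain mixes
```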

19. Parametric Bayesian Clustering: Bayesian Mixture Models – MCMC Inference

Introduce "membership" indicators z_i, where z_i ∼ Multinomial(π) indicates which cluster the i-th data point belongs to:
$$p(\pi, Z, \phi \mid X) \propto p(X \mid Z, \phi)\, p(Z \mid \pi)\, p(\pi, \phi).$$
[Graphical model: α → π → z_i; G_0 → φ_k (plate over K); z_i and φ select the component generating x_i (plate over N).]

20. Parametric Bayesian Clustering: Gibbs Sampling for the Bayesian Mixture Model

Randomly initialize Z, π, φ. Repeat until we have enough samples (a code sketch follows):
1. Sample each z_i from
$$p(z_i = k \mid Z_{-i}, \pi, \phi, X) \propto \pi_k \, p(x_i \mid \phi_k).$$
2. Sample π from
$$\pi \mid Z, \phi, X \sim \text{Dir}(n_1 + \alpha/K, \ldots, n_K + \alpha/K)$$
where n_k is the number of points assigned to cluster k.
3. Sample each φ_k from the NIW posterior based on Z and X.
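A compact sketch of one Gibbs sweep, assuming a 1-D Gaussian likelihood with known variance σ² and a conjugate Normal(μ₀, τ²) prior on the cluster means; this simplifies the slide's NIW step, and all names here are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_sweep(X, z, mu, pi, alpha, K, sigma2=1.0, mu0=0.0, tau2=10.0):
    """One sweep of the (uncollapsed) Gibbs sampler for a 1-D Gaussian mixture."""
    N = len(X)
    # 1. Resample each membership indicator z_i.
    for i in range(N):
        # log pi_k + log N(x_i | mu_k, sigma2), shared constants dropped.
        logp = np.log(pi) - 0.5 * (X[i] - mu) ** 2 / sigma2
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # 2. Resample the mixing weights: pi | Z ~ Dir(n + alpha/K).
    n = np.bincount(z, minlength=K)
    pi = rng.dirichlet(n + alpha / K)
    # 3. Resample each cluster mean from its conjugate Normal posterior.
    for k in range(K):
        xs = X[z == k]
        prec = 1.0 / tau2 + len(xs) / sigma2
        mean = (mu0 / tau2 + xs.sum() / sigma2) / prec
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
    return z, mu, pi
```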

21. Parametric Bayesian Clustering: MCMC in Action

[Figure: clustering state from a bad initialization point, at iteration 25, and at iteration 65. Matlab demo.]

22. Parametric Bayesian Clustering: Collapsed Gibbs Sampler

Idea for an improvement: conjugacy lets us marginalize out some variables, so we do not need to sample them. This is called a collapsed sampler. Here we marginalize out π.

Randomly initialize Z, φ. Repeat (see the snippet below):
1. Sample each z_i from
$$p(z_i = k \mid Z_{-i}, \phi, X) \propto (n_k + \alpha/K)\, p(x_i \mid \phi_k).$$
2. Sample each φ_k from the NIW posterior based on Z and X.
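Relative to the previous sketch, only the indicator update changes: the sampled weight π_k is replaced by the count-based term (n_k + α/K). A hypothetical replacement for step 1, under the same 1-D Gaussian assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_z_collapsed(i, X, z, mu, alpha, K, sigma2=1.0):
    """Collapsed update for z_i: pi has been integrated out via conjugacy."""
    n = np.bincount(np.delete(z, i), minlength=K)   # counts excluding point i
    logp = np.log(n + alpha / K) - 0.5 * (X[i] - mu) ** 2 / sigma2
    p = np.exp(logp - logp.max())
    return rng.choice(K, p=p / p.sum())
```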

23. Parametric Bayesian Clustering: Note About the Likelihood Term

For easy visualization, we used a Gaussian mixture model. You should use the appropriate likelihood model for your application!

24. Parametric Bayesian Clustering: Summary – Parametric Bayesian Clustering
• First specify the likelihood - application specific.
• Next specify a prior on all parameters.
• Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference.

25. 5 minute break

26. Parametric Bayesian Clustering: How to Choose K?

Generic model selection: cross-validation, AIC, BIC, MDL, etc. Can place a parametric prior on K.

27. Parametric Bayesian Clustering: How to Choose K?

Generic model selection: cross-validation, AIC, BIC, MDL, etc. Can place a parametric prior on K.

What if we just let K → ∞ in our parametric model?

28. Parametric Bayesian Clustering: Thought Experiment

Let K → ∞:
$$\phi_k \sim G_0, \qquad \pi \sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K), \qquad G = \sum_{k=1}^{K} \pi_k \, \delta_{\phi_k}$$
$$\theta_i \sim G, \qquad x_i \sim p(x \mid \theta_i).$$

29. Parametric Bayesian Clustering: Thought Experiment – Collapsed Gibbs Sampler

Randomly initialize Z, φ. Repeat:
1. Sample each z_i from
$$p(z_i = k \mid Z_{-i}, \phi, X) \propto (n_k + \alpha/K)\, p(x_i \mid \phi_k) \;\longrightarrow\; n_k \, p(x_i \mid \phi_k) \quad \text{as } K \to \infty.$$
Note that n_k = 0 for empty clusters.
2. Sample each φ_k based on Z and X.

30. Parametric Bayesian Clustering: Thought Experiment – Collapsed Gibbs Sampler

What about empty clusters? Lump all empty clusters together. Let K⁺ be the number of occupied clusters. Then the posterior probability of sitting at any empty cluster is
$$p(z_i = \text{new} \mid Z_{-i}, \phi, X) \propto \frac{\alpha}{K} \times (K - K^+)\, f(x_i \mid G_0) \;\longrightarrow\; \alpha f(x_i \mid G_0)$$
for $f(x_i \mid G_0) = \int p(x \mid \phi)\, dG_0(\phi)$.

31. Parametric Bayesian Clustering: Key Ideas to Be Discussed Today
• A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
• A nonparametric approach to clustering
  • Defining the model - the Dirichlet Process!
  • MCMC inference
• Extensions

32. The Dirichlet Process Model: A Nonparametric Bayesian Approach to Clustering

We must again specify two things:
• The likelihood term (how data is affected by the parameters): p(X | θ). Identical to the parametric case.
• The prior (the prior distribution on the parameters): p(θ). The Dirichlet Process!
Exact posterior inference is still intractable. But we have already derived the Gibbs update equations!

33. The Dirichlet Process Model: What is the Dirichlet Process?

[Image from http://www.nature.com/nsmb/journal/v7/n6/fig_tab/nsb0600_443_F1.html]

34. The Dirichlet Process Model: What is the Dirichlet Process?

$$(G(A_1), \ldots, G(A_n)) \sim \text{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_n))$$

35. The Dirichlet Process Model: The Dirichlet Process

A flexible, nonparametric prior over an infinite number of clusters/classes as well as the parameters for those classes.

36. The Dirichlet Process Model: Parameters for the Dirichlet Process
• α - the concentration parameter.
• G_0 - the base measure; a prior distribution for the cluster-specific parameters.
The Dirichlet Process (DP) is a distribution over distributions. We write G ∼ DP(α, G_0) to indicate that G is a distribution drawn from the DP. It will become clearer in a bit what α and G_0 are.

37. The Dirichlet Process Model: The DP, CRP, and Stick-Breaking Process

[Graphical model: G ∼ DP(α, G_0), a measure on Ω; θ_i ∼ G and x_i for i = 1, ..., N. The Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out.]

38. The Dirichlet Process Model: The Dirichlet Process

Definition: Let G_0 be a probability measure on the measurable space (Ω, B) and α ∈ ℝ⁺. The Dirichlet Process DP(α, G_0) is the distribution on probability measures G such that for any finite partition (A_1, ..., A_m) of Ω,
$$(G(A_1), \ldots, G(A_m)) \sim \text{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_m)).$$
[Figure: a partition of Ω into regions A_1, ..., A_5.]
(Ferguson, '73)

39. The Dirichlet Process Model: Mathematical Properties of the Dirichlet Process

Suppose we sample
• G ∼ DP(α, G_0)
• θ_1 ∼ G
What is the posterior distribution of G given θ_1?

40. The Dirichlet Process Model: Mathematical Properties of the Dirichlet Process

Suppose we sample
• G ∼ DP(α, G_0)
• θ_1 ∼ G
What is the posterior distribution of G given θ_1?
$$G \mid \theta_1 \sim \text{DP}\!\left(\alpha + 1, \; \frac{\alpha}{\alpha+1} G_0 + \frac{1}{\alpha+1} \delta_{\theta_1}\right)$$
More generally,
$$G \mid \theta_1, \ldots, \theta_n \sim \text{DP}\!\left(\alpha + n, \; \frac{\alpha}{\alpha+n} G_0 + \frac{1}{\alpha+n} \sum_{i=1}^{n} \delta_{\theta_i}\right).$$

41. The Dirichlet Process Model: Mathematical Properties of the Dirichlet Process

With probability 1, a sample G ∼ DP(α, G_0) is of the form
$$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}.$$
(Sethuraman, '94)

42. The Dirichlet Process Model: The Dirichlet Process and Clustering

Draw G ∼ DP(α, G_0) to get
$$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}.$$
Use this in a mixture model:
[Graphical model: α and G_0 generate G on Ω; θ_i ∼ G and x_i for i = 1, ..., N.]

43. The Dirichlet Process Model: The Stick-Breaking Process
• Define an infinite sequence of Beta random variables:
$$\beta_k \sim \text{Beta}(1, \alpha), \quad k = 1, 2, \ldots$$
• And then define an infinite sequence of mixing proportions as:
$$\pi_1 = \beta_1, \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \quad k = 2, 3, \ldots$$
• This can be viewed as breaking off portions of a stick: first β_1, then β_2(1 − β_1), and so on.
• When π is drawn this way, we write π ∼ GEM(α). (A sampling sketch follows.)
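A stick-breaking draw is easy to simulate; this added sketch truncates the infinite sequence at a finite level, which is an approximation to a true GEM(α) draw:

```python
import numpy as np

rng = np.random.default_rng(6)

def gem_weights(alpha, trunc=1000):
    """Draw approximate GEM(alpha) weights by truncated stick-breaking."""
    betas = rng.beta(1.0, alpha, size=trunc)                 # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                                 # pi_k = beta_k * prod_{l<k}(1 - beta_l)

pi = gem_weights(alpha=2.0)
print(pi[:5], pi.sum())   # weights decay quickly; the sum approaches 1
```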

44. The Dirichlet Process Model: The Stick-Breaking Process
• We now have an explicit formula for each π_k: $\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l)$.
• We can also easily see that $\sum_{k=1}^{\infty} \pi_k = 1$ (w.p. 1):
$$1 - \sum_{k=1}^{K} \pi_k = 1 - \beta_1 - \beta_2(1 - \beta_1) - \beta_3(1 - \beta_1)(1 - \beta_2) - \cdots$$
$$= (1 - \beta_1)\bigl(1 - \beta_2 - \beta_3(1 - \beta_2) - \cdots\bigr) = \prod_{k=1}^{K} (1 - \beta_k) \to 0 \quad \text{(w.p. 1 as } K \to \infty).$$
• So now $G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$ has a clean definition as a random measure.

45. The Dirichlet Process Model: The Stick-Breaking Process

[Graphical model: α generates the weights π_k and G_0 generates the atoms φ_k (both plates infinite); together they define G on Ω; θ_i ∼ G and x_i for i = 1, ..., N.]

46. The Dirichlet Process Model: The Chinese Restaurant Process (CRP)
• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables.
• The first customer sits at the first table.
• The m-th subsequent customer sits at a table drawn from the following distribution:
$$P(\text{previously occupied table } i \mid \mathcal{F}_{m-1}) \propto n_i$$
$$P(\text{the next unoccupied table} \mid \mathcal{F}_{m-1}) \propto \alpha$$
where n_i is the number of customers currently at table i and $\mathcal{F}_{m-1}$ denotes the state of the restaurant after m − 1 customers have been seated. (A simulation sketch follows.)
[Diagram: customers seated around a sequence of tables.]
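Simulating the CRP is a direct translation of the seating rule; this snippet is an added illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def crp(n, alpha):
    """Simulate table assignments for n customers in a CRP(alpha)."""
    tables = []                        # tables[k] = list of customers at table k
    assignments = []
    for m in range(n):
        # Occupied tables weighted by size; one extra slot weighted by alpha.
        counts = np.array([len(t) for t in tables] + [alpha])
        k = rng.choice(len(counts), p=counts / counts.sum())
        if k == len(tables):           # open the next unoccupied table
            tables.append([])
        tables[k].append(m)
        assignments.append(k)
    return assignments

print(crp(20, alpha=1.0))   # e.g. [0, 0, 1, 0, 2, ...]
```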

47. The Dirichlet Process Model: The CRP and Clustering
• Data points are customers; tables are clusters.
• The CRP defines a prior distribution on the partitioning of the data and on the number of tables.
• This prior can be completed with:
  • a likelihood - e.g., associate a parameterized probability distribution with each table;
  • a prior for the parameters - the first customer to sit at table k chooses the parameter vector φ_k for that table from the prior.
[Diagram: tables labeled φ_1, φ_2, φ_3, φ_4 with seated customers.]
• So we now have a distribution, or can obtain one, for any quantity that we might care about in the clustering setting.

48. The Dirichlet Process Model: The CRP Prior, Gaussian Likelihood, Conjugate Prior

$$\phi_k = (\mu_k, \Sigma_k) \sim N(a, b) \otimes \text{IW}(\alpha, \beta)$$
$$x_i \sim N(\phi_k) \quad \text{for a data point } i \text{ sitting at table } k$$

49. The Dirichlet Process Model: The CRP and the DP

OK, so we've seen how the CRP relates to clustering. How does it relate to the DP?

50. The Dirichlet Process Model: The CRP and the DP

OK, so we've seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: the CRP is exchangeable.

Remember De Finetti's Theorem: if (x_1, x_2, ...) are infinitely exchangeable, then for all n
$$p(x_1, \ldots, x_n) = \int \left( \prod_{i=1}^{n} p(x_i \mid G) \right) dP(G)$$
for some random variable G.

51. The Dirichlet Process Model: The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

52. The Dirichlet Process Model: The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP. That means, when we integrate out G, we get the CRP:
$$p(\theta_1, \ldots, \theta_n) = \int \left( \prod_{i=1}^{n} p(\theta_i \mid G) \right) dP(G)$$
[Graphical model: α and G_0 generate G on Ω; θ_i ∼ G and x_i for i = 1, ..., N.]

53. The Dirichlet Process Model: The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP. In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.

54. The Dirichlet Process Model: The DP, CRP, and Stick-Breaking Process – Summary

[Graphical model: G ∼ DP(α, G_0), a measure on Ω; θ_i ∼ G and x_i for i = 1, ..., N. The Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out.]

55. Inference for the Dirichlet Process: Inference for the DP – Gibbs Sampler

We introduce the indicators z_i and use the CRP representation. Randomly initialize Z, φ. Repeat (a code sketch follows):
1. Sample each z_i from
$$p(z_i = k \mid Z_{-i}, \phi, X) \propto \begin{cases} n_k \, p(x_i \mid \phi_k) & k \le K^+ \text{ (occupied cluster)} \\ \alpha f(x_i \mid G_0) & k = K^+ + 1 \text{ (new cluster)} \end{cases}$$
2. Sample each φ_k based on Z and X, only for occupied clusters.
This is the sampler we saw earlier, but now with some theoretical basis.
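Below is a compact, illustrative sweep of this sampler, assuming a 1-D Gaussian likelihood with known variance σ² and a conjugate Normal(μ₀, τ²) base measure G_0, so that the marginal f(x | G_0) is available in closed form as N(x; μ₀, τ² + σ²). All names are ours, not from the slides:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

def dp_gibbs_sweep(X, z, mus, alpha, sigma2=1.0, mu0=0.0, tau2=4.0):
    """One CRP-based Gibbs sweep for a 1-D DP Gaussian mixture (known variance)."""
    for i in range(len(X)):
        z[i] = -1                                   # remove point i from its cluster
        counts = np.bincount(z[z >= 0], minlength=len(mus))
        keep = counts > 0                           # drop clusters left empty ...
        relabel = np.cumsum(keep) - 1               # ... and relabel the survivors
        mus = mus[keep]
        z[z >= 0] = relabel[z[z >= 0]]
        counts = counts[keep]
        # Occupied clusters: n_k * p(x_i | phi_k); new cluster: alpha * f(x_i | G0).
        p = np.append(counts * norm.pdf(X[i], mus, np.sqrt(sigma2)),
                      alpha * norm.pdf(X[i], mu0, np.sqrt(tau2 + sigma2)))
        k = rng.choice(len(p), p=p / p.sum())
        if k == len(mus):                           # open a new cluster for point i
            prec = 1.0 / tau2 + 1.0 / sigma2
            mean = (mu0 / tau2 + X[i] / sigma2) / prec
            mus = np.append(mus, rng.normal(mean, np.sqrt(1.0 / prec)))
        z[i] = k
    # Resample means of occupied clusters from their conjugate posteriors.
    for k in range(len(mus)):
        xs = X[z == k]
        prec = 1.0 / tau2 + len(xs) / sigma2
        mean = (mu0 / tau2 + xs.sum() / sigma2) / prec
        mus[k] = rng.normal(mean, np.sqrt(1.0 / prec))
    return z, mus
```

Removing a point before resampling its indicator, and relabeling clusters it may have emptied, keeps the state consistent with the CRP representation.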

56. Inference for the Dirichlet Process: MCMC in Action for the DP

What does this look like in action?
[Figure: snapshots of the DP mixture Gibbs sampler on 2-D data. Matlab demo.]

57. Inference for the Dirichlet Process: Improvements to the MCMC Algorithm
• Collapse out the φ_k if the model is conjugate.
• Split-merge algorithms.

58. Inference for the Dirichlet Process: Summary – Nonparametric Bayesian Clustering
• First specify the likelihood - application specific.
• Next specify a prior on all parameters - the Dirichlet Process!
• Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference. This is based on the CRP representation.

59. Inference for the Dirichlet Process: Key Ideas to Be Discussed Today
• A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
• A nonparametric approach to clustering
  • Defining the model - the Dirichlet Process!
  • MCMC inference
• Extensions

60. Hierarchical Dirichlet Process: Hierarchical Bayesian Models

Original Bayesian idea: view parameters as random variables and place a prior on them.

61. Hierarchical Dirichlet Process: Hierarchical Bayesian Models

Original Bayesian idea: view parameters as random variables and place a prior on them.

"Problem"? Often the priors themselves need parameters.

62. Hierarchical Dirichlet Process: Hierarchical Bayesian Models

Original Bayesian idea: view parameters as random variables and place a prior on them.

"Problem"? Often the priors themselves need parameters.

Solution: place a prior on these parameters!

63. Hierarchical Dirichlet Process: Multiple Learning Problems

Example: x_{ij} ∼ N(θ_i, σ²) in m different groups.
[Graphical model: separate θ_1, θ_2, ..., θ_m, each generating its group's data x_{1j}, x_{2j}, ..., x_{mj}, with group sizes N_1, N_2, ..., N_m.]
How do we estimate θ_i for each group?

64. Hierarchical Dirichlet Process: Multiple Learning Problems

Example: x_{ij} ∼ N(θ_i, σ²) in m different groups. Treat the θ_i's as random variables sampled from a common prior: θ_i ∼ N(θ_0, σ_0²).
[Graphical model: θ_0 at the top generates θ_1, ..., θ_m, each generating its group's data.]

65. Hierarchical Dirichlet Process: Recall Plate Notation

[Diagram: the plate representation θ_0 → θ_i → x_{ij} (j = 1, ..., N_i; plate over i = 1, ..., m) is equivalent to the unrolled graph with θ_0 above θ_1, θ_2, ..., θ_m and x_{1j}, x_{2j}, ..., x_{mj} below.]

66. Hierarchical Dirichlet Process: Let's Be Bold!

[Diagram: independent estimation (separate θ_1, ..., θ_m over their groups) versus hierarchical Bayesian (a shared θ_0 above the θ_i, with a plate over the m groups).]

67. Hierarchical Dirichlet Process: Let's Be Bold!

[Diagram: independent estimation versus hierarchical Bayesian, as on the previous slide.]

What do we do if we have DPs for multiple related datasets?
[Diagram: m separate DP mixtures, each with its own base measure H_i and concentration α_i generating G_i, θ_{ij}, and data x_{ij}.]

68. Hierarchical Dirichlet Process: Let's Be Bold!

[Diagram: independent estimation versus hierarchical Bayesian, as on the previous slides.]

What do we do if we have DPs for multiple related datasets?
[Diagram: the hierarchical solution: a shared base measure G_0 is itself drawn from a DP with base measure H; each group then draws G_i ∼ DP(α, G_0), with θ_{ij} ∼ G_i and x_{ij} ∼ p(x | θ_{ij}), plates over j = 1, ..., N_i and i = 1, ..., m.]
