

  1. Unsupervised learning (part 1) Lecture 19 David Sontag New York University Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

  2. Bayesian networks enable use of domain knowledge: $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$. Will my car start this morning? [Heckerman et al., Decision-Theoretic Troubleshooting, 1995]

  3. Bayesian networks enable use of domain knowledge: $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$. What is the differential diagnosis? [Beinlich et al., The ALARM Monitoring System, 1989]

  4. Bayesian networks are generative models • Can sample from the joint distribution, top-down • Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail • Let's try generating a few emails! [Figure: label Y with feature children X_1, …, X_n] • Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
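A minimal sketch of this top-down (ancestral) sampling story for the spam example; the vocabulary and all probabilities below are made-up illustrative values, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: P(Y = spam) and P(X_i = 1 | Y) for a tiny vocabulary.
p_spam = 0.3
p_word_given_y = {
    "spam":     {"viagra": 0.40, "meeting": 0.05, "free": 0.50},
    "not spam": {"viagra": 0.01, "meeting": 0.30, "free": 0.10},
}

def sample_email():
    # Sample the root Y first, then each word indicator X_i given Y (top-down).
    y = "spam" if rng.random() < p_spam else "not spam"
    x = {w: int(rng.random() < p) for w, p in p_word_given_y[y].items()}
    return y, x

for _ in range(3):
    print(sample_email())
```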

  5. Inference in Bayesian networks • Computing marginal probabilities in tree-structured Bayesian networks is easy – the algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees [Figures: the naive Bayes model (label Y with features X_1, …, X_n) and a non-tree model over Y_1, …, Y_6 and X_1, …, X_6] • Wait… this isn't a tree! What can we do?

  6. Inference in Bayesian networks • In some cases (such as this) we can transform the model into what is called a "junction tree", and then run belief propagation

  7. Approximate inference • There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these • Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals • Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution

  8. Maximum likelihood estimation in Bayesian networks
  Suppose that we know the Bayesian network structure G. Let $\theta_{x_i \mid x_{\mathrm{pa}(i)}}$ be the parameter giving the value of the CPD $p(x_i \mid x_{\mathrm{pa}(i)})$. Maximum likelihood estimation corresponds to solving:

  $$\max_\theta \; \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)$$

  subject to the non-negativity and normalization constraints. This is equal to:

  $$\max_\theta \; \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta) \;=\; \max_\theta \; \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m \mid x_{\mathrm{pa}(i)}^m; \theta) \;=\; \sum_{i=1}^{N} \max_\theta \; \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m \mid x_{\mathrm{pa}(i)}^m; \theta)$$

  The optimization problem decomposes into an independent optimization problem for each CPD, each with a simple closed-form solution.
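As a concrete illustration of that closed-form solution, the sketch below estimates one discrete CPD by counting, i.e. theta_{x_i | x_pa(i)} = count(x_i, x_pa(i)) / count(x_pa(i)); the toy variables and data are invented for the example.

```python
from collections import Counter

# Toy samples over a child variable ("starts") and its parent ("battery").
samples = [
    {"battery": 1, "starts": 1},
    {"battery": 1, "starts": 1},
    {"battery": 1, "starts": 0},
    {"battery": 0, "starts": 0},
]

def mle_cpd(samples, child, parents):
    # theta_{child = c | parents = pa} = count(pa, c) / count(pa)
    joint, marginal = Counter(), Counter()
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint[(pa, s[child])] += 1
        marginal[pa] += 1
    return {key: cnt / marginal[key[0]] for key, cnt in joint.items()}

print(mle_cpd(samples, child="starts", parents=["battery"]))
# e.g. {((1,), 1): 0.667, ((1,), 0): 0.333, ((0,), 0): 1.0}
```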

  9. Returning to clustering… • Clusters may overlap • Some clusters may be “wider” than others • Can we model this explicitly? • With what probability is a point from a cluster?

  10. Probabilistic Clustering • Try a probabilistic model! • Allows overlaps, clusters of different size, etc. • Can tell a generative story for the data: P(Y) P(X|Y) • Challenge: we need to estimate model parameters without labeled Ys

      Y     X1     X2
      ??    0.1    2.1
      ??    0.5   -1.1
      ??    0.0    3.0
      ??   -0.1   -2.0
      ??    0.2    1.5
      …     …      …

  11. Gaussian Mixture Models • P(Y): There are k components • P(X|Y): Each component generates data from a multivariate Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$ • Each data point is assumed to have been sampled from a generative process: 1. Choose component i with probability P(y = i) [multinomial] 2. Generate a datapoint $\sim N(\mu_i, \Sigma_i)$:

  $$P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)$$

  By fitting this model (unsupervised learning), we can learn new insights about the data. [Figure: three Gaussian components with means $\mu_1$, $\mu_2$, $\mu_3$]
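The two-step generative process can be written directly; this is a rough sketch with arbitrary, made-up mixture weights, means, and covariances.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.5, 0.3, 0.2])                  # P(y = i), a multinomial
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

def sample_point():
    i = rng.choice(len(weights), p=weights)          # 1. choose component i ~ P(y = i)
    x = rng.multivariate_normal(means[i], covs[i])   # 2. generate x ~ N(mu_i, Sigma_i)
    return i, x

points = [sample_point() for _ in range(5)]
print(points)
```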

  12. Multivariate Gaussians

  $$P(X = x_j) = \frac{1}{(2\pi)^{m/2} \, |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \right)$$

  $\Sigma \propto$ identity matrix

  13. Multivariate Gaussians

  $$P(X = x_j) = \frac{1}{(2\pi)^{m/2} \, |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \right)$$

  $\Sigma$ = diagonal matrix: the $X_i$ are independent, à la Gaussian Naive Bayes

  14. Multivariate Gaussians

  $$P(X = x_j) = \frac{1}{(2\pi)^{m/2} \, |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \right)$$

  $\Sigma$ = arbitrary (positive semidefinite) matrix: specifies rotation (change of basis); eigenvalues specify relative elongation

  15. Multivariate Gaussians • Covariance matrix $\Sigma$ = degree to which the $x_i$ vary together • The eigenvalues $\lambda$ of $\Sigma$ give the spread along the principal axes of the density's contour ellipses

  $$P(X = x_j) = \frac{1}{(2\pi)^{m/2} \, |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \right)$$
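To make the effect of the covariance structure concrete, here is a small sketch (with arbitrary example values) that evaluates the density above for a spherical, a diagonal, and a full covariance matrix using scipy.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
x = np.array([1.0, -0.5])

sigma_spherical = 2.0 * np.eye(2)              # Sigma proportional to identity: equal spread
sigma_diagonal  = np.diag([2.0, 0.5])          # diagonal: independent, axis-aligned coordinates
sigma_full      = np.array([[2.0, 0.8],
                            [0.8, 0.5]])       # arbitrary PSD: rotated, elongated ellipses

for sigma in (sigma_spherical, sigma_diagonal, sigma_full):
    print(multivariate_normal(mean=mu, cov=sigma).pdf(x))

# Eigenvalues of Sigma give the relative elongation along the ellipse axes.
print(np.linalg.eigvalsh(sigma_full))
```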

  16. Modelling eruption of geysers • Old Faithful data set [Scatter plot: time to eruption vs. duration of last eruption]

  17. Modelling eruption of geysers • Old Faithful data set [Plots: fit with a single Gaussian vs. a mixture of two Gaussians]

  18. Marginal distribution for mixtures of Gaussians

  $$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

  Each $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component and each $\pi_k$ is its mixing coefficient; the figure shows K = 3 components.
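A sketch of evaluating this marginal directly, summing mixing coefficients times component densities; the K = 3 parameter values below are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

pis = [0.5, 0.3, 0.2]                                             # mixing coefficients, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

def mixture_density(x):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal(mu, sigma).pdf(x)
               for pi, mu, sigma in zip(pis, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```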

  19. Marginal distribution for mixtures of Gaussians

  20. Learning mixtures of Gaussians • Original data (hypothesized) • Observed data (y missing) • Inferred y's (learned model) • Shown is the posterior probability that a point was generated from the i'th Gaussian: Pr(Y = i | x)

  21. ML estimation in the supervised setting • Univariate Gaussian • Mixture of multivariate Gaussians • The ML estimate for each of the multivariate Gaussians is given by:

  $$\mu_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j \qquad \Sigma_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} \left( x_j - \mu_k^{ML} \right) \left( x_j - \mu_k^{ML} \right)^T$$

  where the sums are just over the x generated from the k'th Gaussian.
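In code, these supervised estimates are just the per-class sample mean and (outer-product) sample covariance, computed over the points known to come from that class; the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these 100 points are the ones labeled as coming from the k'th Gaussian.
X_k = rng.multivariate_normal([2.0, -1.0], [[1.0, 0.3], [0.3, 0.5]], size=100)

n_k = len(X_k)
mu_ml = X_k.sum(axis=0) / n_k          # mu_ML = (1/n_k) sum_j x_j
diff = X_k - mu_ml
sigma_ml = diff.T @ diff / n_k         # Sigma_ML = (1/n_k) sum_j (x_j - mu)(x_j - mu)^T

print(mu_ml)
print(sigma_ml)
```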

  22. What about with unobserved data? • Maximize the marginal likelihood: $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$ • Almost always a hard problem! – Usually no closed-form solution – Even when log P(X, Y) is convex, log P(X) generally isn't… – Many local optima

  23. Expectation Maximization • 1977: Dempster, Laird, & Rubin

  24. The EM Algorithm • A clever method for maximizing marginal likelihood: $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$ – Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.) • Alternate between two steps: – Compute an expectation – Compute a maximization • Not magic: still optimizing a non-convex function with lots of local optima – The computations are just easier (often, significantly so)

  25. EM: Two Easy Steps
  Objective: $\arg\max_\theta \log \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j; \theta) = \arg\max_\theta \sum_j \log \sum_{k=1}^{K} P(Y_j = k, x_j; \theta)$
  Data: {x_j | j = 1 .. n}
  • E-step: Compute expectations to "fill in" missing y values according to the current parameters θ – for all examples j and values k for Y_j, compute P(Y_j = k | x_j; θ)
  • M-step: Re-estimate the parameters with "weighted" MLE estimates – set $\theta_{\text{new}} = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j; \theta_{\text{old}}) \log P(Y_j = k, x_j; \theta)$
  Particularly useful when the E and M steps have closed-form solutions

  26. Gaussian Mixture Example: Start

  27. After first iteration

  28. After 2nd iteration

  29. After 3rd iteration

  30. After 4th iteration

  31. After 5th iteration

  32. After 6th iteration

  33. After 20th iteration

  34. EM for GMMs: only learning means (1D)
  Iterate: on the t'th iteration let our estimates be λ_t = { μ_1^(t), μ_2^(t), …, μ_K^(t) }
  E-step: Compute "expected" classes of all datapoints

  $$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)$$

  M-step: Compute the most likely new μ's given the class expectations

  $$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j) \, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$$
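A runnable sketch of these two updates on toy 1-D data, assuming (as the slide does) a known, shared variance σ² and, additionally for simplicity, uniform class priors P(Y_j = k).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])  # toy data
K, sigma2 = 2, 1.0
mu = rng.normal(size=K)                                      # initial means mu_k^(0)

for t in range(50):
    # E-step: P(Y_j = k | x_j) proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2))
    log_w = -(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2)
    resp = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mu_k = sum_j P(Y_j=k|x_j) x_j / sum_j P(Y_j=k|x_j)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # should land near the two cluster centers
```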

  35. What if we do hard assignments?
  Iterate: on the t'th iteration let our estimates be λ_t = { μ_1^(t), μ_2^(t), …, μ_K^(t) }
  E-step: Compute "expected" classes of all datapoints

  $$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)$$

  M-step: Compute the most likely new μ's given the class expectations, where δ represents a hard assignment to the "most likely" or nearest cluster

  $$\mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j) \, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}$$

  Equivalent to the k-means clustering algorithm!
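The hard-assignment variant in code: replacing the soft posterior with an argmin over squared distance turns the updates into Lloyd's k-means steps (same kind of toy 1-D data as above).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
K = 2
mu = rng.choice(x, size=K, replace=False)        # initialize centers at random data points

for t in range(20):
    # "E-step": hard-assign each point to its nearest center (the delta above).
    assign = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
    # "M-step": each center becomes the mean of the points assigned to it.
    mu = np.array([x[assign == k].mean() for k in range(K)])

print(mu)
```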

  36. EM for General GMMs
  Iterate: on the t'th iteration let our estimates be λ_t = { μ_1^(t), …, μ_K^(t), Σ_1^(t), …, Σ_K^(t), p_1^(t), …, p_K^(t) }, where p_k^(t) is shorthand for the estimate of P(y = k) on the t'th iteration
  E-step: Compute "expected" classes of all datapoints for each class (evaluate the probability of a multivariate Gaussian at x_j)

  $$P(Y_j = k \mid x_j; \lambda_t) \propto p_k^{(t)} \, p(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})$$

  M-step: Compute weighted MLEs given the expected classes above, with m = #training examples

  $$\mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t) \, x_j}{\sum_j P(Y_j = k \mid x_j; \lambda_t)} \qquad
  \Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t) \, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j; \lambda_t)} \qquad
  p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)}{m}$$
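Putting the full set of updates together, a compact sketch of EM for a general GMM on synthetic 2-D data (two components, random initialization); this is just one way to write the updates above, not the lecture's own code.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0.0, 0.0], np.eye(2), 150),
               rng.multivariate_normal([4.0, 4.0], [[1.0, 0.5], [0.5, 1.0]], 150)])
m, d, K = X.shape[0], X.shape[1], 2

mu = X[rng.choice(m, size=K, replace=False)]     # mu_k^(0): random data points
sigma = np.array([np.eye(d) for _ in range(K)])  # Sigma_k^(0): identity
p = np.full(K, 1.0 / K)                          # p_k^(0): uniform mixing weights

for t in range(100):
    # E-step: responsibilities P(Y_j = k | x_j; lambda_t) ~ p_k * N(x_j; mu_k, Sigma_k)
    resp = np.column_stack([p[k] * multivariate_normal(mu[k], sigma[k]).pdf(X)
                            for k in range(K)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted MLE updates for means, covariances, and mixing weights
    Nk = resp.sum(axis=0)
    mu = (resp.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    p = Nk / m

print(mu)
print(p)
```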

  37. The general learning problem with missing data • Marginal likelihood: X is observed, Z (e.g., the class labels Y) is missing:

  $$\ell(\theta : \text{Data}) = \sum_j \log P(x_j; \theta) = \sum_j \log \sum_z P(x_j, z; \theta)$$

  • Objective: Find argmax_θ ℓ(θ : Data) • Assumes the hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing)
