Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions


  1. Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions
     Devdatt Dubhashi (dubhashi@chalmers.se)
     Department of Computer Science and Engineering, Chalmers University
     January 21, 2016

  2. Coin Tossing
     ◮ Estimate the probability that a coin shows heads, based on observed coin tosses.
     ◮ Simple but fundamental!
     ◮ Historically important: originally used by Bayes (1763) and generalized by Pierre-Simon de Laplace (1774), creating Bayes' Rule.

  3. Likelihood
     Suppose X_i ∼ Ber(θ), i.e. P(X_i = 1) = θ ("heads") and P(X_i = 0) = 1 − θ ("tails"), where θ ∈ [0, 1] is the parameter to be estimated. Given a series of N observed coin tosses, the probability that we observe k heads is given by the Binomial distribution:

       Bin(k | N, θ) = \binom{N}{k} θ^k (1 − θ)^{N−k}.

  4. Likelihood
     Thus, the likelihood has the form

       P(D | θ) ∝ θ^{N_1} (1 − θ)^{N_0},

     where N_0 and N_1 are the numbers of tails and heads seen, respectively. These are called sufficient statistics, since they are all we need to know about the data to estimate θ. Formally, s(D) is a set of sufficient statistics for D if P(θ | D) = P(θ | s(D)).
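     As a small illustration of sufficient statistics (not part of the original slides; the toss sequence below is made up), the counts N_1 and N_0 can be computed directly from the data:

```python
# Sketch: sufficient statistics for Bernoulli coin tosses.
# `tosses` is a hypothetical observation sequence: 1 = heads, 0 = tails.
tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

N1 = sum(tosses)           # number of heads
N0 = len(tosses) - N1      # number of tails

# The likelihood depends on the data only through (N1, N0):
theta = 0.7
likelihood = theta ** N1 * (1 - theta) ** N0
print(N1, N0, likelihood)
```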

  5. Binomial Distribution
     [Figure: Binomial pmfs over k = 0, …, 10 for θ = 0.250 (left) and θ = 0.900 (right).]

  6. Bayes Rule for the Posterior

       P(θ | D) = P(D | θ) P(θ) / ∫_0^1 P(D | θ) P(θ) dθ

     It is a bit daunting to compute the integral in the denominator! We can avoid it via a clever trick: choosing a suitable prior.

  7. Prior
     We need a prior with support on [0, 1] and, to make the math easy, of the same form as the likelihood: the Beta distribution

       Beta(θ | a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1},

     with hyperparameters a, b, where

       B(a, b) = Γ(a) Γ(b) / Γ(a + b)

     is a normalizing factor.

       Mean: a / (a + b)        Mode: (a − 1) / (a + b − 2)

     Prior knowledge: if we believe that the mean is 0.7 and the standard deviation is 0.2, we can set a = 2.975 and b = 1.275 (exercise!).
     Uninformative prior: the uniform prior a = 1 = b.
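     A possible way to do the "(exercise!)" above is moment matching: set the Beta mean a/(a + b) and variance ab/((a + b)^2 (a + b + 1)) equal to the believed values and solve for a and b. A minimal sketch (the function name is mine):

```python
# Sketch: choose Beta hyperparameters (a, b) matching a desired mean m
# and standard deviation s, via the method of moments.
def beta_from_mean_std(m, s):
    # Beta variance is m * (1 - m) / (a + b + 1), so a + b = m*(1-m)/s**2 - 1.
    strength = m * (1 - m) / s ** 2 - 1
    return m * strength, (1 - m) * strength

a, b = beta_from_mean_std(0.7, 0.2)
print(a, b)   # approximately (2.975, 1.275)
```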

  8. Beta Distribution
     [Figure: Beta densities on [0, 1] for (a, b) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), and (8.0, 4.0).]

  9. Posterior, Conjugate Prior
     Multiplying prior and likelihood gives the posterior:

       P(θ | D) ∝ Bin(N_1 | N_0 + N_1, θ) Beta(θ | a, b) ∝ Beta(θ | N_1 + a, N_0 + b).

     The posterior has the same form as the prior (with different parameters): the Beta distribution is said to be a conjugate prior for the Binomial distribution. The posterior is obtained by simply adding the empirical counts to the prior parameters, hence the hyperparameters are often called pseudo-counts.
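     A minimal sketch of the conjugate update in code, assuming SciPy is available; the prior hyperparameters and counts are invented for illustration:

```python
from scipy import stats

a, b = 2.0, 2.0        # hypothetical Beta prior hyperparameters
N1, N0 = 3, 17         # hypothetical observed heads / tails

posterior = stats.beta(N1 + a, N0 + b)   # Beta(theta | N1 + a, N0 + b)
print(posterior.mean())                  # posterior mean (N1 + a) / (N + a + b)
print(posterior.interval(0.95))          # central 95% credible interval
```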

  10. Updating a Beta Prior with a Binomial Likelihood
      [Figure: two examples of the Beta update. Left: prior Be(2.0, 2.0), likelihood Be(4.0, 18.0), posterior Be(5.0, 19.0). Right: prior Be(5.0, 2.0), likelihood Be(12.0, 14.0), posterior Be(16.0, 15.0).]

  11. Sequential Update versus Batch
      Suppose we have two data sets D_1 and D_2 with sufficient statistics N_1^1, N_0^1 and N_1^2, N_0^2 respectively. Let N_1 := N_1^1 + N_1^2 and N_0 := N_0^1 + N_0^2 be the combined sufficient statistics.

      In batch mode,

        P(θ | D_1, D_2) ∝ Bin(N_1 | N_1 + N_0, θ) Beta(θ | a, b) ∝ Beta(θ | N_1 + a, N_0 + b).

      In sequential mode,

        P(θ | D_1, D_2) ∝ P(D_2 | θ) P(θ | D_1)
                        ∝ Bin(N_1^2 | N_1^2 + N_0^2, θ) Beta(θ | N_1^1 + a, N_0^1 + b)
                        ∝ Beta(θ | N_1^1 + N_1^2 + a, N_0^1 + N_0^2 + b)
                        = Beta(θ | N_1 + a, N_0 + b).

      Very suitable for online learning!
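      To see that sequential updating agrees with the batch update, one can compare the resulting hyperparameters directly (the counts below are made up):

```python
a, b = 1.0, 1.0        # uniform prior
d1 = (4, 6)            # (heads, tails) in D1
d2 = (7, 3)            # (heads, tails) in D2

# Batch: update once with the combined counts.
batch = (a + d1[0] + d2[0], b + d1[1] + d2[1])

# Sequential: the posterior after D1 becomes the prior for D2.
a1, b1 = a + d1[0], b + d1[1]
sequential = (a1 + d2[0], b1 + d2[1])

assert batch == sequential     # same Beta posterior either way
print(batch)                   # Beta(theta | 12.0, 10.0)
```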

  12. Posterior Mean and Mode
      ◮ The MAP estimate is given by

          θ̂_MAP = (a + N_1 − 1) / (a + b + N − 2).

      ◮ With the uniform prior a = 1 = b, this becomes

          θ̂_MAP = N_1 / N = θ̂_MLE,

        which is just the MLE.
      ◮ The posterior mean is

          θ̄ = (a + N_1) / (a + b + N),

        which is a convex combination of the prior mean and the MLE:

          θ̄ = λ · a / (a + b) + (1 − λ) · θ̂_MLE,   with λ := (a + b) / (a + b + N).

        Note that as N → ∞, θ̄ → θ̂_MLE.
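      The three point estimates above are easy to compare numerically; a sketch with made-up counts:

```python
def point_estimates(a, b, N1, N0):
    """MAP estimate, MLE and posterior mean for a Beta(a, b) prior and counts (N1, N0)."""
    N = N1 + N0
    theta_map = (a + N1 - 1) / (a + b + N - 2)
    theta_mle = N1 / N
    theta_mean = (a + N1) / (a + b + N)
    return theta_map, theta_mle, theta_mean

# With the uniform prior a = b = 1 the MAP estimate coincides with the MLE:
print(point_estimates(1, 1, N1=7, N0=3))   # (0.7, 0.7, 0.666...)
```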

  13. Posterior Predictive Distribution
      The probability of heads in a single new coin toss is:

        P(x̃ = 1 | D) = ∫_0^1 P(x = 1 | θ) P(θ | D) dθ
                      = ∫_0^1 θ Beta(θ | N_1 + a, N_0 + b) dθ
                      = E_Beta[θ]
                      = (N_1 + a) / (N_1 + N_0 + a + b).
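      As a sketch, the posterior predictive for a single toss is just one ratio; with the uniform prior a = b = 1 it reduces to Laplace's rule of succession, (N_1 + 1) / (N + 2):

```python
def prob_next_heads(a, b, N1, N0):
    # Posterior predictive P(x = 1 | D) = E[theta | D] = (N1 + a) / (N + a + b).
    return (N1 + a) / (N1 + N0 + a + b)

# Even after 3 heads and 0 tails, the predicted probability is 0.8, not 1.0:
print(prob_next_heads(1, 1, N1=3, N0=0))
```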

  14. Predicting Multiple Future Trials
      The probability of predicting x heads in M future trials:

        P(x | D) = ∫_0^1 Bin(x | M, θ) P(θ | D) dθ
                 = ∫_0^1 Bin(x | M, θ) Beta(θ | N_1 + a, N_0 + b) dθ
                 = \binom{M}{x} (1 / B(N_1 + a, N_0 + b)) ∫_0^1 θ^{x + N_1 + a − 1} (1 − θ)^{M − x + N_0 + b − 1} dθ
                 = \binom{M}{x} B(x + N_1 + a, (M − x) + N_0 + b) / B(N_1 + a, N_0 + b),

      the compound Beta-Binomial distribution, with mean and variance

        E[x] = M (N_1 + a) / (N + a + b),
        var[x] = M (N_1 + a)(N_0 + b)(N + a + b + M) / ((N_1 + a + N_0 + b)^2 (N + a + b + 1)).
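      A sketch of the Beta-Binomial pmf using only the log-Beta function (recent SciPy versions also ship scipy.stats.betabinom); the posterior parameters below are made up:

```python
from math import comb, exp, lgamma

def betaln(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(x, M, a_post, b_post):
    """P(x | D) for M future trials, with posterior Beta(a_post, b_post)."""
    return comb(M, x) * exp(betaln(x + a_post, M - x + b_post) - betaln(a_post, b_post))

a_post, b_post = 4.0, 18.0     # hypothetical Beta(N1 + a, N0 + b) posterior
pmf = [beta_binomial_pmf(x, 10, a_post, b_post) for x in range(11)]
print(round(sum(pmf), 6))      # sums to 1 (up to floating point)
```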

  15. Posterior Predictive and Plug-in
      [Figure: posterior predictive distribution (left) versus plug-in predictive distribution (right) over x = 0, …, 10.]

  16. Tossing a Die
      ◮ From coins and two outcomes to dice and many outcomes.
      ◮ Given observations from a die with K faces, predict the next roll.
      ◮ Suppose we observe N die rolls D = {x_1, x_2, …, x_N}, where each x_i ∈ {1, …, K}.

  17. Likelihood
      ◮ Suppose we observe N die rolls D = {x_1, x_2, …, x_N}, where each x_i ∈ {1, …, K}.
      ◮ The likelihood is

          P(D | θ) = ∏_{k=1}^{K} θ_k^{N_k},

        where θ_k is the (unknown) probability of showing face k and N_k is the observed number of outcomes showing face k.
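      The sufficient statistics N_k are again just counts; a small sketch with invented die rolls:

```python
from collections import Counter
from math import prod

K = 6
rolls = [1, 3, 3, 6, 2, 3, 5, 6, 1, 4]            # hypothetical die rolls

counts = Counter(rolls)
N = [counts.get(k, 0) for k in range(1, K + 1)]   # sufficient statistics N_k

theta = [1 / K] * K                               # a candidate parameter vector
likelihood = prod(t ** n for t, n in zip(theta, N))
print(N, likelihood)
```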

  18. Multinomial Distribution
      The probability of observing each face k appear x_k times in n rolls of a die with face probabilities θ := (θ_k, k ∈ {1, …, K}) is given by the Multinomial distribution:

        Mu(x | n, θ) = \binom{n}{x_1, x_2, …, x_K} ∏_{k=1}^{K} θ_k^{x_k}.

  19. Priors
      Since the parameter vector θ lives in the K-dimensional simplex

        S_K := {(x_1, …, x_K) | x_k ≥ 0, x_1 + ⋯ + x_K = 1},

      we need a prior that
      ◮ has support on this simplex, and
      ◮ is ideally also conjugate to the likelihood distribution, i.e. the multinomial.

  20. Dirichlet Distribution

        Dir(x | α) := (1 / B(α)) ∏_{k=1}^{K} x_k^{α_k − 1} · 1[x ∈ S_K],

      where B(α) is the normalization factor

        B(α) := (∏_{k=1}^{K} Γ(α_k)) / Γ(α_0),

      with α_0 := Σ_{k=1}^{K} α_k.
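      A quick way to get a feel for the Dirichlet is to draw samples for different α; a sketch using NumPy (the parameter choices mirror the examples on the next slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Dirichlet distributions on the 3-dimensional simplex.
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [20.0, 20.0, 20.0]):
    samples = rng.dirichlet(alpha, size=3)   # each row lies on the simplex
    print(alpha, samples.round(2))
```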

  21. Dirichlet Distribution
      [Figure: surface plot of the Dirichlet density with symmetric parameter α = 0.10 over the probability simplex.]

  22. Dirichlet Distribution: Examples
      α_0 controls the strength of the distribution (how peaked it is) and the α_k's control where the peak occurs:
      ◮ (1, 1, 1): uniform distribution.
      ◮ (2, 2, 2): broad distribution centered at (1/3, 1/3, 1/3).
      ◮ (20, 20, 20): narrow distribution centered at (1/3, 1/3, 1/3).
      When α_k < 1 for all k, we get "spikes" at the corners of the simplex.

  23. Samples from the Dirichlet Distribution
      [Figure: five samples each from a symmetric Dir(α = 0.1) (left column) and Dir(α = 1) (right column), shown as bar plots over K = 5 outcomes.]

  24. Prior and Posterior
      Prior: the Dirichlet prior

        Dir(θ | α) = (1 / B(α)) ∏_{k=1}^{K} θ_k^{α_k − 1}.

      Posterior:

        P(θ | D) ∝ P(D | θ) P(θ)
                 ∝ ∏_{k=1}^{K} θ_k^{N_k} θ_k^{α_k − 1}
                 = ∏_{k=1}^{K} θ_k^{α_k + N_k − 1}
                 ∝ Dir(θ | α_1 + N_1, …, α_K + N_K).
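      The conjugate update is again just adding counts to pseudo-counts; a minimal sketch with invented numbers:

```python
import numpy as np

alpha = np.ones(6)                      # uniform Dirichlet prior, K = 6
N_k = np.array([2, 0, 3, 1, 1, 3])      # hypothetical face counts

alpha_post = alpha + N_k                # Dir(theta | alpha_1 + N_1, ..., alpha_K + N_K)
print(alpha_post)
```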

  25. MAP Estimate using Lagrange Multipliers

        max_θ  Dir(θ | α_1 + N_1, …, α_K + N_K) ∝ ∏_{k=1}^{K} θ_k^{α_k + N_k − 1}
        subject to  θ_1 + θ_2 + ⋯ + θ_K = 1.

      Use Lagrange multipliers! Solution:

        θ̂_k = (α_k + N_k − 1) / (α_0 + N − K).

      With the uniform prior (α_k = 1) this becomes θ̂_k = N_k / N.
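      The closed-form MAP solution can be checked numerically; a sketch (function name mine):

```python
import numpy as np

def dirichlet_map(alpha, counts):
    """MAP estimate (alpha_k + N_k - 1) / (alpha_0 + N - K) on the simplex."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    K = len(alpha)
    return (alpha + counts - 1) / (alpha.sum() + counts.sum() - K)

theta_map = dirichlet_map(np.ones(6), [2, 0, 3, 1, 1, 3])
print(theta_map)     # with the uniform prior this is just N_k / N
```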

  26. Posterior Predictive

        P(X = k | D) = ∫ P(X = k | θ) P(θ | D) dθ
                     = ∫ P(X = k | θ_k) [∫ P(θ_{−k}, θ_k | D) dθ_{−k}] dθ_k
                     = ∫ θ_k P(θ_k | D) dθ_k
                     = E[θ_k | D]
                     = (α_k + N_k) / (α_0 + N).
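      The posterior predictive smooths the empirical frequencies with the pseudo-counts; a sketch with the same invented counts as above:

```python
import numpy as np

def dirichlet_predictive(alpha, counts):
    """Posterior predictive P(X = k | D) = (alpha_k + N_k) / (alpha_0 + N)."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (alpha + counts) / (alpha.sum() + counts.sum())

print(dirichlet_predictive(np.ones(6), [2, 0, 3, 1, 1, 3]))
# Face 2 was never observed but still gets non-zero predictive probability.
```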

  27. Application to Language Modelling
      Suppose we observe the following sentences:

        Mary had a little lamb, little lamb, little lamb
        Mary had a little lamb, its fleece as white as snow

      Can we predict which word comes next?
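      A hedged sketch of how the Dirichlet machinery applies here, treating the sentences as a bag of words with a symmetric Dirichlet prior over the observed vocabulary (the smoothing constant and the restriction to seen words are my own simplifications, not taken from the slides):

```python
from collections import Counter

text = ("mary had a little lamb little lamb little lamb "
        "mary had a little lamb its fleece as white as snow")
words = text.split()

counts = Counter(words)        # N_k for each word type
vocab = sorted(counts)         # word types seen so far (K of them)
alpha = 1.0                    # symmetric Dirichlet pseudo-count per word

N, K = len(words), len(vocab)
predictive = {w: (counts[w] + alpha) / (N + alpha * K) for w in vocab}

# "lamb" and "little" (4 occurrences each) get the highest predictive probability.
for w in sorted(predictive, key=predictive.get, reverse=True)[:3]:
    print(w, round(predictive[w], 3))
```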
