
Probabilistic Graphical Models: Guest Lecture by Narges Razavian



  1. Probabilistic Graphical Models. Guest Lecture by Narges Razavian. Machine Learning Class, April 14, 2017.

  2. Today: What is a probabilistic graphical model and why is it useful? Bayesian Networks; Basic Inference; Generative Models; Fancy Inference (when some variables are unobserved); How to learn model parameters from data; Undirected Graphical Models; Inference (Belief Propagation); New directions in PGM research & wrapping up.

  3. “What I cannot create, I do not understand.” - Richard Feynman

  4. Generative models vs. discriminative models. Discriminative models learn P(Y|X). That is easier and requires less data, but it is only useful for one particular task: given X, what is P(Y|X)? [Examples: logistic regression, feed-forward or convolutional neural networks, etc.] Generative models instead learn the full joint P(Y, X). Once they do that, they can compute everything: P(X) = ∫_Y P(X,Y) dY, P(Y) = ∫_X P(X,Y) dX, P(Y|X) = P(Y,X) / ∫_Y P(Y,X) dY. [Caveat: no free lunch! If you want to answer every question under the sun, you need more data.]
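A minimal sketch (not from the slides; the joint table below is made up) of why having the full joint is enough: every marginal and conditional listed above falls out by summing and dividing.

```python
# With the full joint P(X, Y) in hand, all queries reduce to sums and ratios.
import numpy as np

# Hypothetical joint over a binary X and binary Y; rows index X, columns index Y.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])          # P(X, Y), entries sum to 1

p_x = joint.sum(axis=1)                   # P(X)   = sum_Y P(X, Y)
p_y = joint.sum(axis=0)                   # P(Y)   = sum_X P(X, Y)
p_y_given_x = joint / p_x[:, None]        # P(Y|X) = P(X, Y) / P(X)

print(p_x, p_y, p_y_given_x, sep="\n")
```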

  5. Probabilistic Graphical Models: the main “classic” approach to modeling P(Y, X) = P(Y_1, …, Y_M, X_1, …, X_D).

  6. Some Calculations on Space. Imagine each variable is binary: P(Y_1, …, Y_M, X_1, …, X_D).

  7. Some Calculations on Space. Imagine each variable is binary: P(Y_1, …, Y_M, X_1, …, X_D). How many parameters do we need to estimate from data to specify P(Y, X)?

  8. Some Calculations on Space. Imagine each variable is binary: P(Y_1, …, Y_M, X_1, …, X_D). How many parameters do we need to estimate from data to specify P(Y, X)? Answer: 2^(M+D) − 1.

  9. Too many parameters! What can be done? 1) Look for conditional independences. 2) Use the chain rule for probabilities to break P(Y,X) into smaller pieces. 3) Rewrite P(Y,X) as a product of smaller factors: a) maybe you have more data for a subset of variables. 4) Simplify some of the modeling assumptions to cut parameters: a) e.g., assume the data is multivariate Gaussian; b) e.g., assume conditional independencies even if they don't really always apply.

  10. Bayesian Networks. Use the chain rule for probabilities: P(X_1, …, X_n) = ∏_i P(X_i | X_1, …, X_{i-1}). ● This is always true, with no approximations or assumptions, so there is no reduction in the number of parameters either. ● BNs add a conditional independence assumption: ○ for some variables, P(X_i | X_1, …, X_{i-1}) is approximated with P(X_i | subset of (X_1, …, X_{i-1})). ■ This “subset of (X_1, …, X_{i-1})” is referred to as Parents(X_i). ■ This reduces the parameters (for binary variables, for instance) from 2^(i-1) to 2^|Parents(X_i)|.
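A rough illustration (my own; the variables X1, X2 and the CPT numbers are hypothetical) of evaluating the joint with the BN chain rule and parent sets:

```python
# A BN as a dict of parent sets plus CPTs; the joint is the product of
# local conditionals P(X_i | Parents(X_i)).

parents = {"X1": [], "X2": ["X1"]}

# CPTs indexed by (own value, tuple of parent values); binary variables 0/1.
cpt = {
    "X1": {(0, ()): 0.6, (1, ()): 0.4},
    "X2": {(0, (0,)): 0.7, (1, (0,)): 0.3,
           (0, (1,)): 0.2, (1, (1,)): 0.8},
}

def joint_prob(assignment):
    """P(assignment) via the chain rule with the parent-set approximation."""
    p = 1.0
    for var, pars in parents.items():
        par_vals = tuple(assignment[q] for q in pars)
        p *= cpt[var][(assignment[var], par_vals)]
    return p

print(joint_prob({"X1": 1, "X2": 0}))   # 0.4 * 0.2 = 0.08
```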

  11. Bayesian Networks: the student example. X1: Difficulty, X2: Intelligence, X3: Grade, X4: SAT, X5: Letter of recommendation. Number of parameters per variable in the binary case, raw chain rule vs. BN chain rule:
      X1 (Difficulty):   P(X1) stays P(X1).               Raw: 1,  BN: 1
      X2 (Intelligence): P(X2|X1) = P(X2).                Raw: 2,  BN: 1
      X3 (Grade):        P(X3|X1,X2) = P(X3|X1,X2).       Raw: 4,  BN: 4
      X4 (SAT score):    P(X4|X1,X2,X3) = P(X4|X2).       Raw: 8,  BN: 2
      X5 (Letter):       P(X5|X1,X2,X3,X4) = P(X5|X3).    Raw: 16, BN: 2
      Total for P(X1,X2,X3,X4,X5): raw 1+2+4+8+16 = 31, BN 1+1+4+2+2 = 10.
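A quick check (my own sketch) of the totals in the table above: for binary variables, the raw chain rule needs 2^(i-1) parameters for the i-th factor, while the BN needs 2^|Parents(X_i)| per variable.

```python
# Reproduce the 31 vs. 10 parameter counts for the student network.
order = ["X1", "X2", "X3", "X4", "X5"]
parents = {"X1": [], "X2": [], "X3": ["X1", "X2"], "X4": ["X2"], "X5": ["X3"]}

raw = [2 ** i for i in range(len(order))]    # raw chain rule: [1, 2, 4, 8, 16]
bn = [2 ** len(parents[v]) for v in order]   # BN chain rule:  [1, 1, 4, 2, 2]

print(sum(raw), sum(bn))                     # 31 10
```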

  12. An example of a BN for SNPs.

  13. Benefits of Bayesian Networks: 1) Once estimated, they can answer any conditional or marginal query! a) This is called inference. 2) Fewer parameters to estimate! 3) We can start putting prior information into the network. 4) We can incorporate LATENT (hidden/unobserved) variables based on how we or domain experts think variables might be related. 5) Generating samples from the distribution becomes super easy (see the sketch below).
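A minimal sketch (mine, with made-up CPT numbers for the student network) of point 5, ancestral/forward sampling: sample each variable in topological order, conditioning on the values already drawn for its parents.

```python
import random

order = ["X1", "X2", "X3", "X4", "X5"]                 # topological order
parents = {"X1": [], "X2": [], "X3": ["X1", "X2"], "X4": ["X2"], "X5": ["X3"]}

def p_true(var, parent_vals):
    """Hypothetical CPTs: P(var = 1 | parents). Stand-ins for real estimates."""
    table = {
        "X1": {(): 0.4},
        "X2": {(): 0.7},
        "X3": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.05, (1, 1): 0.4},
        "X4": {(0,): 0.2, (1,): 0.8},
        "X5": {(0,): 0.1, (1,): 0.9},
    }
    return table[var][parent_vals]

def ancestral_sample():
    sample = {}
    for var in order:
        pv = tuple(sample[p] for p in parents[var])
        sample[var] = int(random.random() < p_true(var, pv))
    return sample

print(ancestral_sample())
```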

  14. Inference in Bayesian Networks. Query types (same network: X1 Difficulty, X2 Intelligence, X3 Grade, X4 SAT, X5 Letter of recommendation): 1) Conditional probabilities: P(Y|X) = ?, P(X_i = a | X_{\i} = B, Y = C) = ? 2) Maximum a posteriori estimates: argmax_{x_i} P(X_i | X_{\i}) = ?, argmax_{y_i} P(Y_i | X) = ?

  15. Key operation: marginalization, P(X) = Σ_Y P(X,Y). Example query on the same network: P(X5 | X2=a) = ? We have P(X5 | X2=a) = P(X5, X2=a) / P(X2=a), with P(X5, X2=a) = Σ_{X1,X3,X4} P(X1, X2=a, X3, X4, X5) and P(X2=a) = Σ_{X1,X3,X4,X5} P(X1, X2=a, X3, X4, X5).
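A brute-force version of this key operation (my own toy example with a made-up three-variable joint): sum the joint over everything except the query. This is exponential in the number of variables, which is exactly what the elimination scheme on the next slides avoids.

```python
# Marginalization as plain summation: P(X) = sum_Y P(X, Y).
# Full joint over three binary variables (A, B, C) as a 2x2x2 array; query P(C | A=1).
import numpy as np

joint = np.array([[[0.06, 0.04], [0.10, 0.05]],
                  [[0.15, 0.10], [0.20, 0.30]]])   # axes: A, B, C; entries sum to 1

p_c_and_a1 = joint[1].sum(axis=0)        # sum out B:    P(C, A=1)
p_a1 = joint[1].sum()                    # sum out B, C: P(A=1)
p_c_given_a1 = p_c_and_a1 / p_a1         # P(C | A=1)
print(p_c_given_a1)
```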

  16.–21. Marginalize from the first parents (the roots) down to the query variable, one step at a time. (Figure-only build slides; the intermediate diagrams are not reproduced here.)

  22. Marginalize from the first parents (the roots) down to the query variable. This method is called sum-product or variable elimination.

  23. Marginalization on the student network (X1 Difficulty, X2 Intelligence, X3 Grade, X4 SAT, X5 Letter of recommendation), using P(X) = Σ_Y P(X,Y): P(X5 | X2=a) = ? We use P(X5 | X2=a) = P(X5, X2=a) / P(X2=a).

  24. Working out the numerator:
      P(X5, X2=a) = Σ_{X1,X3,X4} P(X1, X2=a, X3, X4, X5)
      = Σ_{X1,X3,X4} P(X1) P(X2=a) P(X3|X1, X2=a) P(X4|X2=a) P(X5|X3)
      = P(X2=a) Σ_{X1,X3,X4} P(X1) P(X3|X1, X2=a) P(X4|X2=a) P(X5|X3)
      = P(X2=a) Σ_{X1,X3} P(X1) P(X3|X1, X2=a) P(X5|X3) Σ_{X4} P(X4|X2=a)
      = P(X2=a) Σ_{X1,X3} P(X1) P(X3|X1, X2=a) P(X5|X3)        [since Σ_{X4} P(X4|X2=a) = 1]
      = P(X2=a) Σ_{X1,X3} P(X1) P_{X2=a}(X3|X1) P(X5|X3)
      = P(X2=a) Σ_{X3} P(X5|X3) Σ_{X1} P_{X2=a}(X3|X1) P(X1)
      = P(X2=a) Σ_{X3} P(X5|X3) f_{X2=a}(X3)
      = P(X2=a) g_{X2=a}(X5)

  25. Putting it together: P(X5 | X2=a) = P(X5, X2=a) / P(X2=a) = g_{X2=a}(X5).
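The same elimination order as the derivation above, as a sketch (mine, with hypothetical CPT numbers): eliminate X1 to get f_{X2=a}(X3), then X3 to get g_{X2=a}(X5); X4 drops out because Σ_{X4} P(X4|X2=a) = 1.

```python
import numpy as np

p_x1 = np.array([0.6, 0.4])                        # P(X1)
p_x3_given_x1_x2a = np.array([[0.9, 0.1],          # rows: X1, cols: X3, at X2 = a
                              [0.5, 0.5]])
p_x5_given_x3 = np.array([[0.8, 0.2],              # rows: X3, cols: X5
                          [0.3, 0.7]])

f_x3 = p_x1 @ p_x3_given_x1_x2a                    # eliminate X1: f_{X2=a}(X3)
g_x5 = f_x3 @ p_x5_given_x3                        # eliminate X3: g_{X2=a}(X5)
print(g_x5, g_x5.sum())                            # P(X5 | X2=a); sums to 1
```

Note that g_{X2=a}(X5) already sums to 1, so no extra normalization is needed in this particular query.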

  26. Estimating Parameters of a Bayesian Network ● Maximum Likelihood Estimation ● Also sometimes Maximum Pseudolikelihood estimation

  27. How to estimate parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known. If you remember from other lectures:
      Likelihood(D; parameters) = ∏_{D_j in data} P(D_j | parameters)
      = ∏_{D_j in data} ∏_{X_ij in D_j} P(X_ij | Par(X_ij), parameters{Par(X_i) → X_i})
      = ∏_{i in variable set} ∏_{D_j in data} P(X_ij | Par(X_ij), parameters{Par(X_i) → X_i})
      = ∏_{i in variable set} (independent local term, a function only of the observed X_ij and Par(X_ij))
      MLE parameters{Par(X_i) → X_i} = argmax (local likelihood of the observed X_ij and Par(X_ij) in the data).

  28. How to estimate parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known. ● If variables are discrete: P(X_i = a | Parents(X_i) = B) = Count(X_i = a & Pa(X_i) = B) / Count(Pa(X_i) = B).

  29. How to estimate parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known. ● If variables are discrete: P(X_i = a | Parents(X_i) = B) = Count(X_i = a & Pa(X_i) = B) / Count(Pa(X_i) = B). ● If variables are continuous: P(X_i = a | Parents(X_i) = B) = fit some PDF function(a, B).
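A minimal sketch (mine, with toy data and made-up variable names) of the discrete count-based MLE above:

```python
# P(X_i = a | Pa(X_i) = B) = Count(X_i = a, Pa(X_i) = B) / Count(Pa(X_i) = B)
from collections import Counter

data = [  # fully observed samples of (parent value, child value)
    {"pa": 0, "x": 0}, {"pa": 0, "x": 1}, {"pa": 0, "x": 0},
    {"pa": 1, "x": 1}, {"pa": 1, "x": 1}, {"pa": 1, "x": 0},
]

pair_counts = Counter((d["pa"], d["x"]) for d in data)
parent_counts = Counter(d["pa"] for d in data)

cpt = {(pa, x): pair_counts[(pa, x)] / parent_counts[pa]
       for (pa, x) in pair_counts}
print(cpt)   # e.g. P(X=0 | Pa=0) = 2/3, P(X=1 | Pa=1) = 2/3
```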

  30. How to estimate parameters of a Bayesian Network? (1) You have observed all Y, X variables and the dependency structure is known. P(X_i = a | Parents(X_i) = B) = some PDF function(a, B): e.g., a single multivariate Gaussian, a mixture of multivariate Gaussians, or non-parametric density functions.
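One common concrete choice, shown as a sketch (mine, on synthetic data; the slide does not prescribe this particular family): a linear-Gaussian CPD, where X_i given its parents is normal with a mean linear in the parent values, fit by least squares.

```python
# X_i | Pa(X_i) = b  ~  Normal(w . b + c, sigma^2), estimated from data.
import numpy as np

rng = np.random.default_rng(0)
pa = rng.normal(size=(500, 2))                     # observed parent values
x = pa @ np.array([1.5, -2.0]) + 0.3 + rng.normal(scale=0.5, size=500)

design = np.column_stack([pa, np.ones(len(pa))])   # add intercept column
coef, *_ = np.linalg.lstsq(design, x, rcond=None)  # fitted [w1, w2, c]
sigma = np.std(x - design @ coef)                  # residual standard deviation

print(coef, sigma)                                 # approx [1.5, -2.0, 0.3], 0.5
```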

  31. How to estimate parameters of a Bayesian Network? (2) You have observed all Y,X variables, but dependency structure is NOT known

  32. Structure learning when all variables are observed. 1) Neighborhood selection: ● Lasso: L1-regularized regression of each variable on all the other variables. ● Not necessarily a tree structure. 2) Tree learning via the Chow-Liu method (a sketch follows below): ● For each variable pair, find the empirical distribution P(X_i, X_j) = Count(X_i, X_j) / M. ● For each variable pair, compute the mutual information I(X_i, X_j). ● Use I(X_i, X_j) as the edge weight in a graph and learn a maximum spanning tree.
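A sketch (mine, on synthetic binary data) of the Chow-Liu recipe: plug-in mutual information from empirical counts, then a maximum-weight spanning tree, here via a tiny Kruskal with union-find.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 4))                    # columns = variables X0..X3
X[:, 1] = X[:, 0] ^ (rng.random(1000) < 0.1).astype(int)  # make X1 depend on X0
X[:, 2] = X[:, 1] ^ (rng.random(1000) < 0.2).astype(int)  # and X2 on X1

def mutual_info(a, b):
    """Plug-in MI estimate from empirical counts of two binary columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

edges = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                for i, j in combinations(range(X.shape[1]), 2)), reverse=True)

parent = list(range(X.shape[1]))                   # union-find for Kruskal
def find(u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u

tree = []
for w, i, j in edges:                              # greedily add heaviest edges
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j, round(w, 3)))

print(tree)                                        # edges of the learned tree
```

Edge directions can then be assigned by picking any root and orienting edges away from it.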

  33. How to estimate parameters of a Bayesian Network? (3) You have unobserved variables, but the dependency structure is known. These are the most commonly used Bayesian Networks these days!

  34. In practice, Bayes nets are most often used to inject priors and structure into the task. Example: modeling documents as a collection of topics, where each topic is a distribution over words: topic modeling via Latent Dirichlet Allocation (LDA).
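A toy illustration (mine; it assumes scikit-learn is available and uses a made-up four-document corpus) of fitting LDA and reading off the topics and per-document topic mixtures:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene expression protein cell",
    "protein cell biology gene",
    "stock market price trading",
    "market trading price economy",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)                            # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):           # word weights per topic
    top = [vocab[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
print(lda.transform(X))                               # per-document topic mixtures
```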

  35.–36. In practice, Bayes nets are most often used to inject priors and structure. Example: correcting for hidden confounders in expression data (figure-only slides).
