

  1. Advanced Machine Learning: Introduction to Probabilistic Graphical Models. Amit Sethi, Electrical Engineering, IIT Bombay

  2. Objectives • Learn about statistical dependencies among variables • Understand how these dependencies can be encoded in graphs • Understand the basic intuition behind Bayesian Networks

  3. Bayesian models with which you are familiar • Bayes theorem: p(C_k | x) = p(C_k) p(x | C_k) / p(x) • Posterior = prior × likelihood / evidence • Naïve Bayes: p(C_k | x) = p(C_k | x_1, …, x_n) ∝ p(C_k) ∏_i p(x_i | C_k) • A decision about the class can now be based on: • the prior, and • simplified class-conditional densities of x • The log of the posterior probability leads to a linear discriminant for certain class conditionals from the exponential family
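
As a concrete illustration of the naïve Bayes factorization above, here is a minimal Python sketch that computes the posterior over two classes from two binary features. The class names and every probability value are assumed purely for illustration.

    # Minimal naive Bayes posterior sketch: p(C_k | x) is proportional to p(C_k) * prod_i p(x_i | C_k).
    priors = {"spam": 0.4, "ham": 0.6}          # p(C_k), assumed values
    likelihoods = {                             # p(x_i = 1 | C_k), assumed values
        "spam": [0.8, 0.3],
        "ham":  [0.1, 0.5],
    }
    x = [1, 0]                                  # observed binary feature vector

    unnormalized = {}
    for c in priors:
        p = priors[c]                           # start from the prior
        for xi, p1 in zip(x, likelihoods[c]):
            p *= p1 if xi == 1 else (1.0 - p1)  # multiply in each class-conditional factor
        unnormalized[c] = p

    evidence = sum(unnormalized.values())       # p(x), the normalizer
    posterior = {c: v / evidence for c, v in unnormalized.items()}
    print(posterior)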

  4. Consider an inference problem • Trying to guess if the family is out: – When the wife leaves the house, she leaves the outdoor light on (but sometimes she leaves it on for a guest) – When the wife leaves the house, she usually puts the dog out – When the dog has a bowel problem, it goes to the backyard – If the dog is in the backyard, I will probably hear it (but it might be the neighbour's dog) • If the dog is barking and the light is off, is the family out? Example source: “Bayesian Networks without Tears” by Eugene Charniak, AI Magazine, AAAI 1991

  5. Some observations • Many events in the world are related • The relations are not deterministic but probabilistic – Some events are causes and others are effects – An effect usually has a sharper conditional distribution given its cause than when the cause is unknown

  6. Bayesian Network definition • A Bayesian network is a directed graph in which each node (variable) is annotated with a conditional probability distribution that encodes its statistical dependencies: – Each node corresponds to a random variable – If there is an arrow (edge) from node X to node Y, X is said to be a parent of Y – Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)) that quantifies the effect of the parents on the node – The graph has no directed cycles (and hence is a directed acyclic graph, or DAG) Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
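
A minimal way to store such a network in code is a parent list per node plus a conditional probability table P(X_i | Parents(X_i)) for each node. The sketch below uses the variable names of the family-out example above; the CPT numbers are assumed for illustration only.

    # Sketch of a Bayesian network as a parent map plus per-node CPTs (assumed numbers).
    parents = {
        "fo": [],            # family out
        "bp": [],            # bowel problem
        "lo": ["fo"],        # light observation depends on family-out
        "do": ["fo", "bp"],  # dog out depends on family-out and bowel problem
        "hb": ["do"],        # hearing the bark depends on dog out
    }

    # CPT for one node: P(do = 1 | fo, bp), indexed by the parents' values.
    cpt_do = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.97, (0, 0): 0.30}

    # Because the graph is a DAG, this parent map admits a topological ordering of the nodes.
    print(parents, cpt_do[(1, 0)])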

  7. Back to our inference problem • Trying to guess if the family is out, given that the light is off and the dog is barking (lo and hb are clamped to their observed values) • 1: Brute force (no independence assumptions): – p(fo, lo, hb) = Σ_bp Σ_do p(fo, bp, do, lo, hb) = Σ_bp Σ_do p(fo) p(bp | fo) p(do | fo, bp) … (chain rule, with each factor conditioned on all of the preceding variables) • 2: Using the factorization property of BNs: – p(fo, lo, hb) = Σ_bp Σ_do p(fo) p(bp) p(lo | fo) p(do | fo, bp) p(hb | do) = p(fo) p(lo | fo) Σ_bp p(bp) Σ_do p(do | fo, bp) p(hb | do) • Normalizing over the two values of fo then gives the posterior p(fo | lo, hb) Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
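
The factorized sum above can be evaluated directly by enumeration. The Python sketch below does exactly that; the CPT values are illustrative (loosely in the spirit of Charniak's example) rather than numbers taken from these slides.

    # Posterior for the family-out network by enumerating bp and do (assumed CPT values).
    p_fo = {1: 0.15, 0: 0.85}                      # P(fo)
    p_bp = {1: 0.01, 0: 0.99}                      # P(bp)
    p_lo = {1: 0.60, 0: 0.05}                      # P(lo = 1 | fo)
    p_do = {(1, 1): 0.99, (1, 0): 0.90,
            (0, 1): 0.97, (0, 0): 0.30}            # P(do = 1 | fo, bp)
    p_hb = {1: 0.70, 0: 0.01}                      # P(hb = 1 | do)

    lo_obs, hb_obs = 1, 1                          # clamp the evidence to its observed values

    def joint(fo, bp, do):
        """p(fo, bp, do, lo = lo_obs, hb = hb_obs) via the BN factorization."""
        return (p_fo[fo] * p_bp[bp]
                * (p_lo[fo] if lo_obs else 1 - p_lo[fo])
                * (p_do[(fo, bp)] if do else 1 - p_do[(fo, bp)])
                * (p_hb[do] if hb_obs else 1 - p_hb[do]))

    # Marginalize out bp and do for each value of fo, then normalize over fo.
    unnorm = {fo: sum(joint(fo, bp, do) for bp in (0, 1) for do in (0, 1)) for fo in (0, 1)}
    print(unnorm[1] / (unnorm[0] + unnorm[1]))     # P(fo = 1 | lo = lo_obs, hb = hb_obs)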

  8. What have we gained so far? 1. It is easier to visualize the relationships between variables 2. The joint distribution is simplified into a product of lower-dimensional conditional distributions 3. Marginalization of the joint distribution is simplified by pulling terms that do not depend on the variable being summed out in front of the corresponding sum

  9. Notion of D-separation • The influence of x can “flow through” z to y along a path • Four canonical three-node structures (figure omitted): (a) chain x → z → y, (b) chain x ← z ← y, (c) common cause x ← z → y, (d) common effect x → z ← y • In which cases does the influence stop flowing iff z is known (i.e. z D-separates x and y, and the path becomes inactive given z)? • Ans: (a), (b) and (c). For (d), the path is inactive iff z is unknown Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
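
The common-effect case (d) is the counter-intuitive one, so here is a small numeric check with assumed probability tables: x and y come out independent when z is marginalized away, and dependent once z is observed.

    # Collider x -> z <- y: the path is inactive until z is observed (assumed numbers).
    from itertools import product

    p_x = {1: 0.3, 0: 0.7}
    p_y = {1: 0.6, 0: 0.4}
    p_z = {(1, 1): 0.95, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 0.05}   # P(z = 1 | x, y)

    def p_xyz(x, y, z):
        pz1 = p_z[(x, y)]
        return p_x[x] * p_y[y] * (pz1 if z else 1 - pz1)

    # Marginally, p(x=1, y=1) equals p(x=1) * p(y=1): no influence flows.
    print(sum(p_xyz(1, 1, z) for z in (0, 1)), p_x[1] * p_y[1])

    # Given z = 1, p(x=1, y=1 | z=1) differs from p(x=1 | z=1) * p(y=1 | z=1): path is active.
    pz1_total = sum(p_xyz(x, y, 1) for x, y in product((0, 1), repeat=2))
    p_x1y1 = p_xyz(1, 1, 1) / pz1_total
    p_x1 = sum(p_xyz(1, y, 1) for y in (0, 1)) / pz1_total
    p_y1 = sum(p_xyz(x, 1, 1) for x in (0, 1)) / pz1_total
    print(p_x1y1, p_x1 * p_y1)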

  10. Statistical independence in BNs • Independence: x ⊥ y iff p(x, y) = p(x) p(y) • Conditional independence: x ⊥ y | z iff p(x, y | z) = p(x | z) p(y | z) – x is conditionally independent of y given z • In Bayesian networks: x ⊥ NonDescendants(x) | Parents(x) – x is conditionally independent of all its non-descendants given its parents

  11. Markov Networks, aka MRFs • Definition – Graphical models with undirected edges – Variables are nodes – Relationships between variables are undirected edges • Properties – The notion of conditional independence is simpler – Joint distributions are represented by potentials over the maximal cliques (fully connected subsets of nodes that cannot be extended) Source: “Pattern Recognition and Machine Learning” by Christopher Bishop

  12. Joint distributions in MRFs • If there is no link between two nodes x_i and x_j, their conditional independence can be expressed as: p(x_i, x_j | x_\{i,j}) = p(x_i | x_\{i,j}) p(x_j | x_\{i,j}), where x_\{i,j} denotes all variables other than x_i and x_j • By the Hammersley-Clifford theorem, the set of (strictly positive) distributions satisfying the MRF's conditional independence structure is the same as the set that can be written as a product of maximal clique potentials, i.e. the joint distribution is a product of potential functions ψ_c(x_c) over the maximal cliques of the graph: p(x) = (1/Z) ∏_c ψ_c(x_c) • Here Z = Σ_x ∏_c ψ_c(x_c) is the partition function, a normalization constant Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
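
To make the product-of-potentials form and the partition function concrete, here is a tiny sketch for a three-node chain MRF; the potential values are assumed.

    # p(x) = (1/Z) * prod_c psi_c(x_c) for the chain x1 - x2 - x3 (assumed potentials).
    from itertools import product

    psi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # favours equal neighbours

    def unnorm(x1, x2, x3):
        # The maximal cliques of the chain are {x1, x2} and {x2, x3}.
        return psi[(x1, x2)] * psi[(x2, x3)]

    # Partition function: sum of the unnormalized product over all configurations.
    Z = sum(unnorm(*x) for x in product((0, 1), repeat=3))

    def p(x1, x2, x3):
        return unnorm(x1, x2, x3) / Z

    print(Z, p(0, 0, 0), p(0, 1, 0))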

  13. Clique potentials can be represented in terms of an energy function • Since the clique potentials are strictly positive, each can be defined in terms of an energy: ψ_c(x_c) = exp(−E(x_c)) • A product of clique potentials then corresponds to a sum of energies in the exponent • However, the clique potentials do not have a specific probabilistic interpretation
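
The correspondence between a product of potentials and a sum of energies can be checked in a few lines (the energy values are assumed):

    # With psi_c(x_c) = exp(-E(x_c)), multiplying potentials adds energies in the exponent.
    import math

    energies = [0.2, 1.5, 0.7]                     # E(x_c) for three cliques (assumed)
    potentials = [math.exp(-e) for e in energies]  # strictly positive by construction

    print(math.prod(potentials), math.exp(-sum(energies)))  # identical up to float error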

  14. In general, BNs and MRFs represent overlapping but distinct families of distributions: each can encode conditional independence structures that the other cannot • There are directed graphs whose conditional independence properties cannot be expressed by any undirected graph • There are undirected graphs whose conditional independence properties cannot be expressed by any directed graph Source: “Pattern Recognition and Machine Learning” by Christopher Bishop

  15. An example use of an MRF in image denoising or binary segmentation • Objective: – Find the underlying clean image • Assumptions: – Most of the pixels are not corrupted – Neighbouring pixels are likely to be the same • Define (with values in {−1, +1}): – x_i to be the underlying true pixel – y_i to be the observed pixel (conditionally independent of the other observations given x_i) • Energy contributions: – Observation term: −η x_i y_i – Spatial coherence term: −β x_i x_j – Prior (bias) term: h x_i • These contributions sum to the total energy given on the next slide Source: “Pattern Recognition and Machine Learning” by Christopher Bishop

  16. Now we minimize the energy to get the desired result • Energy function: E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i, where {i, j} ranges over pairs of neighbouring pixels • And p(x, y) = (1/Z) exp{−E(x, y)} • We observe y, initialize x to y, and then search for the x that minimizes the energy; the original slide shows results for two different energy minimization algorithms (result images omitted), and a sketch of one simple minimizer follows below Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
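
One simple way to carry out the minimization is iterated conditional modes (ICM): sweep over the pixels and flip each one to whichever of {−1, +1} lowers its local energy. The sketch below is a minimal ICM implementation on a toy image; the image, the noise level and the coefficients h, β, η are all assumed.

    # Minimal ICM sketch for E(x, y) = h*sum_i x_i - beta*sum_{i,j} x_i x_j - eta*sum_i x_i y_i.
    import numpy as np

    rng = np.random.default_rng(0)
    clean = np.ones((32, 32), dtype=int)
    clean[8:24, 8:24] = -1                         # a toy binary "image" in {-1, +1}
    noise = rng.random(clean.shape) < 0.1
    y = np.where(noise, -clean, clean)             # flip roughly 10% of the pixels

    h, beta, eta = 0.0, 1.0, 2.1                   # assumed coefficients
    x = y.copy()                                   # initialize the estimate to the observation

    def local_energy(x, y, i, j, value):
        """Energy terms that involve pixel (i, j), with x[i, j] set to `value`."""
        nb = sum(x[a, b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                 if 0 <= a < x.shape[0] and 0 <= b < x.shape[1])
        return h * value - beta * value * nb - eta * value * y[i, j]

    for _ in range(5):                             # a few coordinate-wise sweeps
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] = min((1, -1), key=lambda v: local_energy(x, y, i, j, v))

    print("pixels still differing from the clean image:", int((x != clean).sum()))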

  17. Factor graphs are the most general form of graphical models • Factor graphs make the relationships among variables explicit by using factor nodes • Factorization: – If we can represent p(x) as a product of factors: p(x) = ∏_s f_s(x_s) = f_a(x_1, x_2) f_b(x_1, x_2) f_c(x_2, x_3) f_d(x_3) • Then we can draw a bipartite (undirected) graph such that: – A set of nodes V represents the variables – A set of nodes F represents the functions (factors) – No node in V is connected to another node in V – No node in F is connected to another node in F Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
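
The bipartite structure is straightforward to mirror in code: variable nodes on one side, factor nodes (each with its scope and a table) on the other, with edges only between the two sides. The factor tables below are assumed.

    # Factor graph for p(x) = f_a(x1, x2) f_b(x1, x2) f_c(x2, x3) f_d(x3) (assumed tables).
    variables = ["x1", "x2", "x3"]                      # variable nodes V

    factors = {                                         # factor nodes F: (scope, table)
        "f_a": (("x1", "x2"), {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}),
        "f_b": (("x1", "x2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}),
        "f_c": (("x2", "x3"), {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}),
        "f_d": (("x3",),      {(0,): 0.2, (1,): 0.8}),
    }

    # Edges only run between V and F, never within V or within F: the graph is bipartite.
    edges = [(f, v) for f, (scope, _) in factors.items() for v in scope]

    def p_unnormalized(assignment):
        """Evaluate prod_s f_s(x_s) for an assignment like {"x1": 0, "x2": 1, "x3": 1}."""
        result = 1.0
        for scope, table in factors.values():
            result *= table[tuple(assignment[v] for v in scope)]
        return result

    print(edges)
    print(p_unnormalized({"x1": 0, "x2": 1, "x3": 1}))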

  18. Relation between the three PGMs • When converting a Bayesian network to an MRF, co-parents need to be moralized (married), i.e. joined by an edge, because they are not independent given their children (see the moralization sketch below) • When converting an MRF to a factor graph, every maximal clique potential is represented by a factor node • Priors of parentless variables can also be incorporated in factor graphs as single-variable factors • Loops can be avoided in factor graphs by combining the functions that form a loop into a single factor Source: “Pattern Recognition and Machine Learning” by Christopher Bishop
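
Moralization itself is a small graph operation: drop the edge directions and add an edge between every pair of co-parents of the same child. The sketch below applies it to the family-out parent map used earlier.

    # Moralize a BN: undirect parent-child edges and marry co-parents.
    from itertools import combinations

    parents = {"fo": [], "bp": [], "lo": ["fo"], "do": ["fo", "bp"], "hb": ["do"]}

    undirected_edges = set()
    for child, ps in parents.items():
        for p in ps:                                # keep each parent-child edge, undirected
            undirected_edges.add(frozenset((p, child)))
        for p1, p2 in combinations(ps, 2):          # marry co-parents of the same child
            undirected_edges.add(frozenset((p1, p2)))

    print(sorted(tuple(sorted(e)) for e in undirected_edges))
    # [('bp', 'do'), ('bp', 'fo'), ('do', 'fo'), ('do', 'hb'), ('fo', 'lo')]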
