Probabilistic Graphical Models: Introduction to Learning — Siamak Ravanbakhsh, Fall 2019
Learning objectives
- different goals of learning a graphical model
- effect of goals on the learning setup
Where does a graphical model come from?
- designed by domain experts: more suitable for directed models — conditional probabilities are more intuitive than unnormalized factors, and there is no need to estimate the partition function
- learned from data:
  - fixed structure (easy for directed models) or unknown structure
  - fully or partially observed data (hidden variables)

image: http://blog.londolozi.com/
Goals of learning: density estimation

assumption: the data is an IID sample from some $P^*$:
$$\mathcal{D} = \{x^{(1)}, \dots, x^{(M)}\} \sim P^*$$

empirical distribution: $P_{\mathcal{D}}(x) = \frac{1}{|\mathcal{D}|} \sum_{m} \mathbb{I}(x^{(m)} = x)$

objective: learn a $\hat{P} \in \mathcal{P}$ close to $P^*$:
$$\hat{P} = \arg\min_{P \in \mathcal{P}} KL(P^* \,\|\, P) = \arg\min_{P \in \mathcal{P}} \; \mathbb{E}_{P^*}[\log P^*] - \mathbb{E}_{P^*}[\log P]$$

the first term is the negative entropy of $P^*$ and does not depend on $P$; substituting $P_{\mathcal{D}}$ for $P^*$ gives
$$\hat{P} = \arg\max_{P \in \mathcal{P}} \sum_{x \in \mathcal{D}} \log P(x)$$
the log-likelihood; its negative is called the log loss.

how to compare two log-likelihood values?
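For a single discrete variable, the objective above has a closed form: the maximum-likelihood estimate under a categorical model is just the normalized counts, and the log-likelihood is the quantity being maximized. A minimal stand-alone sketch (not from the lecture; the example data is made up):

```python
from collections import Counter
from math import log

def mle_categorical(data):
    """Maximum-likelihood estimate of a categorical distribution:
    P_hat(x) = count(x) / |D|, which maximizes sum_{x in D} log P(x)."""
    counts = Counter(data)
    n = len(data)
    return {x: c / n for x, c in counts.items()}

def log_likelihood(p, data):
    """Log-likelihood of the dataset under distribution p
    (its negative is the log loss)."""
    return sum(log(p[x]) for x in data)

data = ["a", "a", "b", "c"]
p_hat = mle_categorical(data)  # {'a': 0.5, 'b': 0.25, 'c': 0.25}

# Any other distribution on the same support achieves a lower log-likelihood:
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
assert log_likelihood(p_hat, data) > log_likelihood(q, data)
```

The same counting argument is what makes fully observed Bayesian networks easy to learn: each conditional probability table is fit by (conditional) counts.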
Goals of learning: prediction

given $\mathcal{D} = \{(x^{(m)}, y^{(m)})\}$, we are interested in learning $\hat{P}(X \mid Y)$

the output of our prediction is structured, e.g. in image segmentation

making a prediction: $\hat{X}(Y) = \arg\max_{x} \hat{P}(x \mid Y)$

error measures:
- 0/1 loss (unforgiving): $\mathbb{E}_{(X,Y)\sim P^*}\, \mathbb{I}(X \neq \hat{X}(Y))$
- Hamming loss: $\mathbb{E}_{(X,Y)\sim P^*} \sum_i \mathbb{I}(X_i \neq \hat{X}(Y)_i)$
- conditional log-likelihood: $\mathbb{E}_{(X,Y)\sim P^*} \log \hat{P}(X \mid Y)$ — takes prediction uncertainty into account
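The difference between the first two error measures is easy to see on a toy structured output. A sketch (the 6-component "segmentation" is invented for illustration):

```python
def zero_one_loss(x_true, x_pred):
    """0/1 loss: counts 1 unless the entire structured output is exactly right."""
    return 0 if x_true == x_pred else 1

def hamming_loss(x_true, x_pred):
    """Hamming loss: number of components that disagree."""
    return sum(t != p for t, p in zip(x_true, x_pred))

# A 6-pixel binary "segmentation" that is wrong in exactly one position:
x_true = [1, 1, 0, 0, 1, 0]
x_pred = [1, 1, 0, 1, 1, 0]

print(zero_one_loss(x_true, x_pred))  # 1: unforgiving, one error spoils everything
print(hamming_loss(x_true, x_pred))   # 1: only one of six components is wrong
```

The 0/1 loss treats this near-perfect prediction the same as an all-wrong one, which is why per-component losses are usually preferred for structured outputs.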
Goals of learning: knowledge discovery

given $\mathcal{D} = \{x^{(m)}\}$, we are interested in learning the structure $G$ or $H$ itself — finding conditional independencies or causal relationships, e.g. in a gene regulatory network

the structure is not always uniquely identifiable: recall that two DAGs are I-equivalent, $I(G) = I(G')$, iff they have the same undirected skeleton and the same immoralities

image credit: Chen et al., 2014
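The skeleton-and-immoralities criterion can be checked mechanically for small DAGs. A sketch, assuming DAGs are given as parent-set dictionaries (a representation chosen here for illustration):

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set_of_parents}."""
    return {frozenset((u, v)) for v, parents in dag.items() for u in parents}

def immoralities(dag):
    """Unshielded colliders: two parents of a common child
    that are not themselves adjacent."""
    skel = skeleton(dag)
    out = set()
    for child, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in skel:
                    out.add((ps[i], ps[j], child))
    return out

def i_equivalent(g1, g2):
    """Two DAGs are I-equivalent iff they share skeleton and immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Y -> Z and X <- Y <- Z encode the same conditional independencies,
g1 = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
g2 = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}
# but the collider X -> Y <- Z does not (same skeleton, extra immorality):
g3 = {"X": set(), "Z": set(), "Y": {"X", "Z"}}

print(i_equivalent(g1, g2))  # True
print(i_equivalent(g1, g3))  # False
```

This is exactly why observational data alone cannot orient the edge in $X \to Y$ vs. $X \leftarrow Y$: both members of an I-equivalence class fit the data equally well.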
bias-variance trade-off

learning ideally minimizes some risk (expected loss): $\mathbb{E}_{X \sim P^*}[\mathrm{loss}(X)]$
in reality we minimize the empirical risk: $\mathbb{E}_{x \in \mathcal{D}}[\mathrm{loss}(x)]$

if our model is expressive, we can overfit (high variance):
- low empirical risk does not translate to low risk — the model does not generalize to samples outside $\mathcal{D}$, as measured by a validation set
- different choices of $\mathcal{D} \sim P^*$ produce very different models $\hat{P}$ (overfitting in density estimation)

if the model is too simple, it cannot fit the data (high bias): the model is biased, and even a large dataset $\mathcal{D}$ cannot help

a solution: penalize model complexity — regularization

image: http://ipython-books.github.io
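A concrete instance of this in density estimation: the unregularized categorical MLE assigns probability zero to any value never seen in training, so a single unseen held-out point drives the held-out log loss to infinity. A small pseudo-count (add-alpha / Laplace smoothing) acts as a regularizer. A sketch under those assumptions, with an invented toy dataset:

```python
from collections import Counter

def smoothed_categorical(data, support, alpha=1.0):
    """Add-alpha (Laplace) smoothing: a simple regularizer that keeps
    every value in the support at nonzero probability.
    alpha = 0 recovers the plain (overfitting-prone) MLE."""
    counts = Counter(data)
    n = len(data)
    k = len(support)
    return {x: (counts[x] + alpha) / (n + alpha * k) for x in support}

train = ["a", "a", "b"]
support = ["a", "b", "c"]

mle = smoothed_categorical(train, support, alpha=0.0)  # plain MLE
reg = smoothed_categorical(train, support, alpha=1.0)  # regularized

print(mle["c"])  # 0.0 -> log P("c") = -inf: infinite held-out log loss
print(reg["c"])  # 1/6 -> finite held-out log loss
```

As the dataset grows, the pseudo-counts are swamped by the real counts, so the regularized estimate converges to the MLE; the bias it introduces matters most exactly when data is scarce.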
Discriminative vs. generative training

if the goal is prediction of $\hat{P}(X \mid Y)$:
- generative: learn $\hat{P}(X, Y)$ and condition on $Y$ (e.g., MRF)
- discriminative: directly learn $\hat{P}(X \mid Y)$ (e.g., CRF)

Example: naive Bayes vs. logistic regression

naive Bayes — $P(X \mid Y) \propto P(X)\, P(Y \mid X)$ — trained generatively (log-likelihood):
- works better on small datasets (higher bias)
- makes unnecessary conditional independence assumptions about $Y$
- can deal with missing values and learn from unlabeled data

logistic regression — $P(X = 1 \mid Y) = \sigma(W^T Y + b)$ — trained discriminatively (conditional log-likelihood):
- works better on large datasets
- makes no conditional independence assumptions about $Y$
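Generative training of naive Bayes is just counting: estimate $P(X)$ and each $P(Y_i \mid X)$ from frequencies, then classify by Bayes rule. A minimal sketch for binary features (the tiny dataset and add-one smoothing constant are illustrative choices, not from the lecture):

```python
from collections import Counter, defaultdict
from math import log

def train_nb(examples):
    """Generative training: estimate P(X) and P(Y_i | X) by counting.
    Each example is (label x, feature tuple y)."""
    class_counts = Counter(x for x, _ in examples)
    feat_counts = defaultdict(Counter)  # (class, position) -> value counts
    for x, y in examples:
        for i, v in enumerate(y):
            feat_counts[(x, i)][v] += 1
    return class_counts, feat_counts, len(examples)

def predict_nb(model, y):
    """Bayes rule: argmax_x  log P(x) + sum_i log P(y_i | x),
    with add-one smoothing for binary features."""
    class_counts, feat_counts, n = model
    best, best_score = None, float("-inf")
    for x, cx in class_counts.items():
        score = log(cx / n)
        for i, v in enumerate(y):
            score += log((feat_counts[(x, i)][v] + 1) / (cx + 2))
        if score > best_score:
            best, best_score = x, score
    return best

data = [(1, (1, 1)), (1, (1, 0)), (0, (0, 0)), (0, (0, 1))]
model = train_nb(data)
print(predict_nb(model, (1, 1)))  # 1
print(predict_nb(model, (0, 0)))  # 0
```

Note that the per-feature factorization $\prod_i P(y_i \mid x)$ is exactly the conditional independence assumption the slide warns about: it makes counting tractable but is wrong whenever features are correlated given the class.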
Example: naive Bayes vs. logistic regression on UCI datasets (figure from Ng & Jordan, 2001)
summary

learning can have different objectives:
- density estimation: calculating P(x), sampling from P (generative modeling)
- prediction (conditional density estimation): discriminative and generative modeling
- knowledge discovery

learning is expressed as empirical risk minimization; the bias-variance trade-off motivates regularizing the model