Probabilistic Graphical Models: Introduction to Learning — Siamak Ravanbakhsh, Fall 2019
Learning objectives
- different goals of learning a graphical model
- effect of goals on the learning setup
Where does a graphical model come from?
- designed by domain experts: more suitable for directed models — conditional probabilities are more intuitive than unnormalized factors, and there is no need to estimate the partition function
- learned from data:
  - fixed structure (easy for directed models) or unknown structure
  - fully or partially observed data (hidden variables)

image: http://blog.londolozi.com/
Goals of learning: density estimation

assumption: the data is an IID sample from some $P^*$:
$$\mathcal{D} = \{x^{(1)}, \dots, x^{(M)}\} \sim P^*$$

empirical distribution: $P_{\mathcal{D}}(x) = \frac{1}{|\mathcal{D}|} \sum_{m} \mathbb{I}(x^{(m)} = x)$

objective: learn a $\hat{P} \in \mathcal{P}$ close to $P^*$:
$$\hat{P} = \arg\min_{P \in \mathcal{P}} KL(P^* \,\|\, P) = \arg\min_{P \in \mathcal{P}} \; \mathbb{E}_{P^*}[\log P^*] - \mathbb{E}_{P^*}[\log P]$$

the first term is the negative entropy of $P^*$ and does not depend on $P$; substituting $P_{\mathcal{D}}$ for $P^*$ gives
$$\hat{P} = \arg\max_{P \in \mathcal{P}} \sum_{x \in \mathcal{D}} \log P(x)$$
the log-likelihood; its negative is called the log loss.

how to compare two log-likelihood values?
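For a single discrete variable, the objective above has a closed form: the maximum-likelihood estimate under a categorical model is just the normalized counts, and the log-likelihood is the quantity being maximized. A minimal stand-alone sketch (not from the lecture; the example data is made up):

```python
from collections import Counter
from math import log

def mle_categorical(data):
    """Maximum-likelihood estimate of a categorical distribution:
    P_hat(x) = count(x) / |D|, which maximizes sum_{x in D} log P(x)."""
    counts = Counter(data)
    n = len(data)
    return {x: c / n for x, c in counts.items()}

def log_likelihood(p, data):
    """Log-likelihood of the dataset under distribution p
    (its negative is the log loss)."""
    return sum(log(p[x]) for x in data)

data = ["a", "a", "b", "c"]
p_hat = mle_categorical(data)  # {'a': 0.5, 'b': 0.25, 'c': 0.25}

# Any other distribution on the same support achieves a lower log-likelihood:
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
assert log_likelihood(p_hat, data) > log_likelihood(q, data)
```

The same counting argument is what makes fully observed Bayesian networks easy to learn: each conditional probability table is fit by (conditional) counts.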
Goals of learning: prediction

given $\mathcal{D} = \{(x^{(m)}, y^{(m)})\}$, we are interested in learning $\hat{P}(X \mid Y)$

the output of our prediction is structured, e.g. in image segmentation

making a prediction: $\hat{X}(Y) = \arg\max_{x} \hat{P}(x \mid Y)$

error measures:
- 0/1 loss (unforgiving): $\mathbb{E}_{(X,Y)\sim P^*}\, \mathbb{I}(X \neq \hat{X}(Y))$
- Hamming loss: $\mathbb{E}_{(X,Y)\sim P^*} \sum_i \mathbb{I}(X_i \neq \hat{X}(Y)_i)$
- conditional log-likelihood: $\mathbb{E}_{(X,Y)\sim P^*} \log \hat{P}(X \mid Y)$ — takes prediction uncertainty into account
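The difference between the first two error measures is easy to see on a toy structured output. A sketch (the 6-component "segmentation" is invented for illustration):

```python
def zero_one_loss(x_true, x_pred):
    """0/1 loss: counts 1 unless the entire structured output is exactly right."""
    return 0 if x_true == x_pred else 1

def hamming_loss(x_true, x_pred):
    """Hamming loss: number of components that disagree."""
    return sum(t != p for t, p in zip(x_true, x_pred))

# A 6-pixel binary "segmentation" that is wrong in exactly one position:
x_true = [1, 1, 0, 0, 1, 0]
x_pred = [1, 1, 0, 1, 1, 0]

print(zero_one_loss(x_true, x_pred))  # 1: unforgiving, one error spoils everything
print(hamming_loss(x_true, x_pred))   # 1: only one of six components is wrong
```

The 0/1 loss treats this near-perfect prediction the same as an all-wrong one, which is why per-component losses are usually preferred for structured outputs.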
Goals of learning: knowledge discovery

given $\mathcal{D} = \{x^{(m)}\}$, we are interested in learning the structure $G$ or $H$ itself — finding conditional independencies or causal relationships, e.g. in a gene regulatory network

the structure is not always uniquely identifiable: recall that two DAGs are I-equivalent, $I(G) = I(G')$, iff they have the same undirected skeleton and the same immoralities

image credit: Chen et al., 2014
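The skeleton-and-immoralities criterion can be checked mechanically for small DAGs. A sketch, assuming DAGs are given as parent-set dictionaries (a representation chosen here for illustration):

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set_of_parents}."""
    return {frozenset((u, v)) for v, parents in dag.items() for u in parents}

def immoralities(dag):
    """Unshielded colliders: two parents of a common child
    that are not themselves adjacent."""
    skel = skeleton(dag)
    out = set()
    for child, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in skel:
                    out.add((ps[i], ps[j], child))
    return out

def i_equivalent(g1, g2):
    """Two DAGs are I-equivalent iff they share skeleton and immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Y -> Z and X <- Y <- Z encode the same conditional independencies,
g1 = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
g2 = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}
# but the collider X -> Y <- Z does not (same skeleton, extra immorality):
g3 = {"X": set(), "Z": set(), "Y": {"X", "Z"}}

print(i_equivalent(g1, g2))  # True
print(i_equivalent(g1, g3))  # False
```

This is exactly why observational data alone cannot orient the edge in $X \to Y$ vs. $X \leftarrow Y$: both members of an I-equivalence class fit the data equally well.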
bias-variance trade-off

learning ideally minimizes some risk (expected loss): $\mathbb{E}_{X \sim P^*}[\mathrm{loss}(X)]$
in reality we minimize the empirical risk: $\mathbb{E}_{x \in \mathcal{D}}[\mathrm{loss}(x)]$

if our model is expressive, we can overfit (high variance):
- low empirical risk does not translate to low risk — the model does not generalize to samples outside $\mathcal{D}$, as measured by a validation set
- different choices of $\mathcal{D} \sim P^*$ produce very different models $\hat{P}$ (overfitting in density estimation)

if the model is too simple, it cannot fit the data (high bias): the model is biased, and even a large dataset $\mathcal{D}$ cannot help

a solution: penalize model complexity — regularization

image: http://ipython-books.github.io
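A concrete instance of this in density estimation: the unregularized categorical MLE assigns probability zero to any value never seen in training, so a single unseen held-out point drives the held-out log loss to infinity. A small pseudo-count (add-alpha / Laplace smoothing) acts as a regularizer. A sketch under those assumptions, with an invented toy dataset:

```python
from collections import Counter

def smoothed_categorical(data, support, alpha=1.0):
    """Add-alpha (Laplace) smoothing: a simple regularizer that keeps
    every value in the support at nonzero probability.
    alpha = 0 recovers the plain (overfitting-prone) MLE."""
    counts = Counter(data)
    n = len(data)
    k = len(support)
    return {x: (counts[x] + alpha) / (n + alpha * k) for x in support}

train = ["a", "a", "b"]
support = ["a", "b", "c"]

mle = smoothed_categorical(train, support, alpha=0.0)  # plain MLE
reg = smoothed_categorical(train, support, alpha=1.0)  # regularized

print(mle["c"])  # 0.0 -> log P("c") = -inf: infinite held-out log loss
print(reg["c"])  # 1/6 -> finite held-out log loss
```

As the dataset grows, the pseudo-counts are swamped by the real counts, so the regularized estimate converges to the MLE; the bias it introduces matters most exactly when data is scarce.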
Discriminative vs. generative training

if the goal is prediction of $\hat{P}(X \mid Y)$:
- generative: learn $\hat{P}(X, Y)$ and condition on $Y$ (e.g., MRF)
- discriminative: directly learn $\hat{P}(X \mid Y)$ (e.g., CRF)

Example: naive Bayes vs. logistic regression

naive Bayes — $P(X \mid Y) \propto P(X)\, P(Y \mid X)$ — trained generatively (log-likelihood):
- works better on small datasets (higher bias)
- makes unnecessary conditional independence assumptions about $Y$
- can deal with missing values and learn from unlabeled data

logistic regression — $P(X = 1 \mid Y) = \sigma(W^T Y + b)$ — trained discriminatively (conditional log-likelihood):
- works better on large datasets
- makes no conditional independence assumptions about $Y$
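Generative training of naive Bayes is just counting: estimate $P(X)$ and each $P(Y_i \mid X)$ from frequencies, then classify by Bayes rule. A minimal sketch for binary features (the tiny dataset and add-one smoothing constant are illustrative choices, not from the lecture):

```python
from collections import Counter, defaultdict
from math import log

def train_nb(examples):
    """Generative training: estimate P(X) and P(Y_i | X) by counting.
    Each example is (label x, feature tuple y)."""
    class_counts = Counter(x for x, _ in examples)
    feat_counts = defaultdict(Counter)  # (class, position) -> value counts
    for x, y in examples:
        for i, v in enumerate(y):
            feat_counts[(x, i)][v] += 1
    return class_counts, feat_counts, len(examples)

def predict_nb(model, y):
    """Bayes rule: argmax_x  log P(x) + sum_i log P(y_i | x),
    with add-one smoothing for binary features."""
    class_counts, feat_counts, n = model
    best, best_score = None, float("-inf")
    for x, cx in class_counts.items():
        score = log(cx / n)
        for i, v in enumerate(y):
            score += log((feat_counts[(x, i)][v] + 1) / (cx + 2))
        if score > best_score:
            best, best_score = x, score
    return best

data = [(1, (1, 1)), (1, (1, 0)), (0, (0, 0)), (0, (0, 1))]
model = train_nb(data)
print(predict_nb(model, (1, 1)))  # 1
print(predict_nb(model, (0, 0)))  # 0
```

Note that the per-feature factorization $\prod_i P(y_i \mid x)$ is exactly the conditional independence assumption the slide warns about: it makes counting tractable but is wrong whenever features are correlated given the class.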
Example: naive Bayes vs. logistic regression on UCI datasets (figure from Ng & Jordan, 2001)
summary

learning can have different objectives:
- density estimation: calculating P(x), sampling from P (generative modeling)
- prediction (conditional density estimation): discriminative and generative modeling
- knowledge discovery

learning is expressed as empirical risk minimization; the bias-variance trade-off motivates regularizing the model