Probabilistic & Unsupervised Learning
Model selection, Hyperparameter optimisation, and Gaussian Processes

Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2015
Learning model structure

◮ How many clusters in the data?
◮ How smooth should the function be?
◮ Is this input relevant to predicting that output?
◮ What is the order of a dynamical system?

SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG

◮ How many states in a hidden Markov model?
◮ How many auditory sources in the input?
Model selection

Models (labelled by m) have parameters θ_m that specify the probability of data: P(D | θ_m, m).
If the model is known, learning θ_m means finding a posterior or a point estimate (ML, MAP, ...).

What if we need to learn the model too?
◮ Could combine models into a single "supermodel", with composite parameter (m, θ_m).
◮ ML learning will overfit: it favours the most flexible (nested) model with the most parameters, even if the data actually come from a simpler one.
◮ A density function on the composite parameter space (a union of manifolds of different dimensionalities) is difficult to define ⇒ MAP learning is ill-posed.
◮ The joint posterior is difficult to compute, since the dimension of the composite parameter varies [but Monte-Carlo methods may sample from such a posterior].

⇒ Separate model selection step:
\[
P(\theta_m, m \mid \mathcal{D}) = \underbrace{P(\theta_m \mid m, \mathcal{D})}_{\text{model-specific posterior}} \cdot \underbrace{P(m \mid \mathcal{D})}_{\text{model selection}}
\]
Model complexity and overfitting: a simple example

[Figure: least-squares polynomial fits of order M = 0 through M = 7 to the same data set, one panel per order.]
Model selection

Given models labelled by m with parameters θ_m, identify the "correct" model for data D.
ML/MAP has no good answer: P(D | θ_m^ML) is always larger for more complex (nested) models.

Neyman-Pearson hypothesis testing
◮ For nested models. Starting with the simplest model (m = 1), compare (e.g. by a likelihood-ratio test) the null hypothesis m to the alternative m + 1. Continue until m + 1 is rejected.
◮ Usually only valid asymptotically in the number of data points.
◮ Conservative (N-P hypothesis tests are asymmetric).

Likelihood validation
◮ Partition the data into disjoint training and validation sets, D = D_tr ∪ D_vld. Choose the model with the greatest P(D_vld | θ_m^ML), where θ_m^ML = argmax_θ P(D_tr | θ). [Or, better, the greatest P(D_vld | D_tr, m).]
◮ May be biased towards simpler models; often high-variance.
◮ Cross-validation uses multiple partitions and averages the likelihoods (see the sketch below).

Bayesian model selection
◮ Choose the most likely model: argmax_m P(m | D).
◮ Principled from a probabilistic viewpoint (if the true model is in the set being considered), but sensitive to the assumed priors etc.
◮ Can use posterior probabilities to weight models for combined predictions (no need to select at all).
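A minimal numerical sketch of the likelihood-validation / cross-validation strategy above, assuming a synthetic data set and a family of polynomial models fit by maximum likelihood; the data, noise model, fold count and helper names are illustrative choices, not taken from the lecture:

```python
# A rough sketch, not the lecture's code: choose polynomial order M by
# K-fold cross-validated held-out log-likelihood under a Gaussian noise model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)                        # hypothetical inputs
y = 5 * np.sin(x) + rng.normal(scale=2.0, size=40)     # hypothetical targets

def heldout_loglik(M, x_tr, y_tr, x_va, y_va):
    """ML-fit a degree-M polynomial on the training split, estimate the noise
    variance from the training residuals, then score the validation split."""
    w = np.polyfit(x_tr, y_tr, deg=M)
    sigma2 = max(np.var(y_tr - np.polyval(w, x_tr)), 1e-12)
    r = y_va - np.polyval(w, x_va)
    return -0.5 * np.sum(r**2 / sigma2 + np.log(2 * np.pi * sigma2))

def cv_score(M, x, y, K=5):
    """Average held-out log-likelihood over K folds (one fixed random split)."""
    folds = np.array_split(np.random.default_rng(1).permutation(len(x)), K)
    total = 0.0
    for k in range(K):
        va = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        total += heldout_loglik(M, x[tr], y[tr], x[va], y[va])
    return total / K

scores = {M: cv_score(M, x, y) for M in range(8)}
print("cross-validation picks M =", max(scores, key=scores.get))
```

Unlike the training likelihood, the held-out score does not grow monotonically with M, which is what makes it usable for selection.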
Bayesian model selection: some terminology

A model class m is a set of distributions parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians.

The model implies both a prior over the parameters, P(θ_m | m), and a likelihood of the data given the parameters (which might require integrating out latent variables), P(D | θ_m, m).

The posterior distribution over parameters is
\[
P(\theta_m \mid \mathcal{D}, m) = \frac{P(\mathcal{D} \mid \theta_m, m)\, P(\theta_m \mid m)}{P(\mathcal{D} \mid m)} .
\]

The marginal probability of the data under model class m is
\[
P(\mathcal{D} \mid m) = \int_{\Theta_m} P(\mathcal{D} \mid \theta_m, m)\, P(\theta_m \mid m)\, d\theta_m
\]
(also called the Bayesian evidence for model m).

The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor:
\[
\frac{P(\mathcal{D} \mid m)}{P(\mathcal{D} \mid m')} = \frac{P(m \mid \mathcal{D})}{P(m' \mid \mathcal{D})} \cdot \frac{p(m')}{p(m)} .
\]
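As a small worked illustration of evidences and a Bayes factor (not an example from the slides), consider two models of a sequence of coin flips: m1 fixes the heads probability at 0.5 and has no free parameters, while m2 places a Beta(1, 1) prior on it, which integrates out in closed form. The data below are made up:

```python
# Sketch: evidences and a Bayes factor for two hypothetical coin models.
import numpy as np
from scipy.special import betaln

flips = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])      # made-up observations
N, k = len(flips), flips.sum()

log_ev_m1 = N * np.log(0.5)                            # P(D | m1): theta = 0.5 fixed
log_ev_m2 = betaln(1 + k, 1 + N - k) - betaln(1, 1)    # P(D | m2): Beta-Bernoulli marginal

log_bayes_factor = log_ev_m2 - log_ev_m1               # log [P(D|m2) / P(D|m1)]

# With equal prior probabilities P(m1) = P(m2), the posterior over models
# follows directly from the evidences.
post_m2 = 1.0 / (1.0 + np.exp(-log_bayes_factor))
print(log_bayes_factor, post_m2)
```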
The Bayesian Occam's razor

Occam's razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred.

Bayesian inference formalises and automatically implements a form of Occam's razor.

Compare model classes m using their posterior probability given the data:
\[
P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m)\, P(m)}{P(\mathcal{D})}, \qquad
P(\mathcal{D} \mid m) = \int_{\Theta_m} P(\mathcal{D} \mid \theta_m, m)\, P(\theta_m \mid m)\, d\theta_m .
\]

P(D | m) is the probability that randomly selected parameter values from the model class would generate data set D. Model classes that are too simple are unlikely to generate the observed data set. Model classes that are too complex can generate many possible data sets, so, again, they are unlikely to generate that particular data set at random. Like Goldilocks, we favour a model that is just right.

[Figure: the evidence P(D | m) plotted as a distribution over possible data sets D, with the observed data set marked D_0.]
Bayesian model comparison: Occam's razor at work

[Figure: left, the polynomial fits of order M = 0, …, 7 from the earlier example; right, the model evidence P(Y|M) plotted against M.]
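A sketch of the kind of calculation behind a plot like this, assuming Bayesian polynomial regression with a zero-mean Gaussian prior on the weights and known Gaussian observation noise, so that the evidence P(Y|M) is just a multivariate-Gaussian density in y once the weights are integrated out. The noise variance, prior variance and data set are illustrative assumptions, not the values used for the slide:

```python
# Sketch: closed-form evidence P(Y|M) for Bayesian polynomial regression.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 5 * np.sin(x) + rng.normal(scale=2.0, size=x.shape)    # toy data

def log_evidence(M, x, y, noise_var=4.0, prior_var=25.0):
    """log P(y | M) for a degree-M polynomial with weights w ~ N(0, prior_var I)
    and Gaussian noise; marginalising w gives y ~ N(0, prior_var Phi Phi^T + noise_var I)."""
    Phi = np.vander(x / 10.0, M + 1, increasing=True)       # rescaled for numerical stability
    C = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    return multivariate_normal(mean=np.zeros(len(x)), cov=C).logpdf(y)

log_ev = np.array([log_evidence(M, x, y) for M in range(8)])
post = np.exp(log_ev - log_ev.max())
post /= post.sum()                                          # P(M | Y) under a flat prior on M
print(dict(enumerate(np.round(post, 3))))
```

With a flat prior over M, normalising the evidences gives P(M|Y), which is what a bar plot of this kind reports; the most complex model does not automatically win.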
Conjugate-exponential families (recap)

Can we compute P(D | m)? ... Sometimes.

Suppose P(D | θ_m, m) is a member of the exponential family:
\[
P(\mathcal{D} \mid \theta_m, m) = \prod_{i=1}^{N} P(x_i \mid \theta_m, m) = \prod_{i=1}^{N} e^{s(x_i)^{\mathsf{T}} \theta_m - A(\theta_m)} .
\]

If our prior on θ_m is conjugate,
\[
P(\theta_m \mid m) = e^{s_p^{\mathsf{T}} \theta_m - n_p A(\theta_m)} \big/ Z(s_p, n_p) ,
\]
then the joint is in the same family,
\[
P(\mathcal{D}, \theta_m \mid m) = e^{\left(\sum_i s(x_i) + s_p\right)^{\mathsf{T}} \theta_m - (N + n_p) A(\theta_m)} \big/ Z(s_p, n_p) ,
\]
and so
\[
P(\mathcal{D} \mid m) = \int d\theta_m\, P(\mathcal{D}, \theta_m \mid m) = \frac{Z\!\left(\sum_i s(x_i) + s_p,\; N + n_p\right)}{Z(s_p, n_p)} .
\]

But this is a special case. In general, we need to approximate ...
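To make the normaliser-ratio formula concrete, here is one worked special case (an illustration added here, not part of the slides): a Bernoulli likelihood with sufficient statistic s(x) = x, whose conjugate prior normaliser Z(s, n) is a Beta function, so Z(Σ_i s(x_i) + s_p, N + n_p) / Z(s_p, n_p) reduces to the familiar Beta-Bernoulli marginal likelihood. The prior pseudo-counts and data are made up, and the numerical integral is only a sanity check:

```python
# Sketch: P(D|m) = Z(sum_i s(x_i) + s_p, N + n_p) / Z(s_p, n_p) for a Bernoulli
# likelihood with its conjugate (Beta) prior.
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

def log_Z(s, n):
    # Log normaliser of the conjugate prior in natural-parameter form;
    # for the Bernoulli family this is a log Beta function, log B(s, n - s).
    return betaln(s, n - s)

s_p, n_p = 2.0, 4.0                      # prior pseudo-counts, i.e. a Beta(2, 2) prior
x = np.array([1, 0, 1, 1, 1, 0, 1])      # hypothetical observations
N, s_D = len(x), x.sum()

log_evidence = log_Z(s_p + s_D, n_p + N) - log_Z(s_p, n_p)

# Sanity check by brute-force integration over theta:
#   P(D) = integral of theta^k (1 - theta)^(N - k) * Beta(theta; 2, 2) dtheta
prior = lambda th: th**(s_p - 1) * (1 - th)**(n_p - s_p - 1) / np.exp(betaln(s_p, n_p - s_p))
lik = lambda th: th**s_D * (1 - th)**(N - s_D)
numeric, _ = quad(lambda th: lik(th) * prior(th), 0.0, 1.0)

print(np.exp(log_evidence), numeric)     # the two numbers should agree
```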