minimum description length

Minimum Description Length Principle in Model Selection (Bono Nonchev) - PowerPoint PPT Presentation



  1. Minimum Description Length Principle in Model Selection
     Bono Nonchev, Faculty of Mathematics and Informatics, Sofia University
     ISCPS, SDA, WPA 2012, Pomorie

  2. Contents
     1 Information Theory
     2 The MDL Principle
     3 Model Selection
     4 Model Complexity

  3. Knowledge = compression
     - Regularities in data lead to compression. Example:
       0101010101010101010101010101010101010...
       1101100111111101111110110011111111111...
       1010101000111010001110100011101011111...
     - Denote by $x^n$ an n-tuple of real numbers: the data, generated by some process.
     - Make inference about the generating process by finding a way to encode the data using the patterns it exhibits.
     - Kolmogorov complexity and its problems: uncomputability, arbitrariness.
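The link between regularity and compression can be illustrated with an off-the-shelf compressor. The sketch below is only an illustration: `zlib` stands in for the ideal (uncomputable) Kolmogorov-style encoder, and the helper name `compressed_size`, the string length and the random seed are assumptions made for this example, not part of the slides.

```python
import random
import zlib


def compressed_size(bits: str) -> int:
    """Size in bytes of the zlib-compressed ASCII string of 0s and 1s."""
    return len(zlib.compress(bits.encode("ascii"), 9))


n = 2000
# A perfectly regular sequence, like the first example on the slide.
regular = "01" * (n // 2)
# A sequence with no pattern by construction (pseudo-random, fixed seed).
rng = random.Random(0)
irregular = "".join(rng.choice("01") for _ in range(n))

print("regular:  ", compressed_size(regular), "bytes")
print("irregular:", compressed_size(irregular), "bytes")
```

The regular string should compress to far fewer bytes than the irregular one, which is the sense in which finding a pattern amounts to finding a short description.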

  4. Equivalence between code and distribution
     - Let $x^n$ be a realization of the random vector $X^n$ in $(\Omega, \mathcal{F}, P)$.
     - $x^n$ is encoded as a uniquely decodable string of 0s and 1s of length $L(x^n)$.
     - The shortest code length (in the expected sense) is achieved by a code that, for a given observation $x^n$, has length (Shannon-Fano coding)
       $$L(x^n) = -\log P(x^n)$$
     - A probability distribution defines a code and vice versa.
     - The requirement for integer code lengths is not essential.
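A small numeric sketch of the code/distribution correspondence. It assumes a toy four-symbol distribution (not from the slides) and uses base-2 logarithms so lengths are in bits. The integer lengths $\lceil -\log_2 P(x) \rceil$ satisfy the Kraft inequality, so a uniquely decodable code with these lengths exists; for dyadic probabilities the expected length equals the entropy.

```python
import math


def shannon_code_lengths(probs):
    """Integer code lengths ceil(-log2 p) for every outcome with p > 0."""
    return {x: math.ceil(-math.log2(p)) for x, p in probs.items() if p > 0}


def kraft_sum(lengths):
    """Sum of 2^(-length); a value <= 1 means a prefix code with these lengths exists."""
    return sum(2.0 ** -length for length in lengths.values())


def expected_length(probs, lengths):
    """Expected code length in bits under the distribution probs."""
    return sum(p * lengths[x] for x, p in probs.items() if p > 0)


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical toy distribution chosen so that all -log2 p are integers.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = shannon_code_lengths(probs)

print(lengths)
print("Kraft sum:", kraft_sum(lengths))
print("expected length:", expected_length(probs, lengths), "entropy:", entropy(probs))
```

When the probabilities are not powers of two, rounding up costs less than one bit per encoded object, which is why the integer requirement is treated as inessential.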

  5. The MDL principle I
     - Restrict the class of models and codes to probability distributions.
     - Define a set of candidate models $\mathcal{H}$, e.g. $N(\mu, \sigma)$.
     - Encode the data "optimally" using a code with length $L(x^n \mid H)$ for each point hypothesis $H \in \mathcal{H}$, e.g. $H = \{X^n \sim N(1.42, 0.443)\}$.
     - Encode the point hypothesis "optimally" with code length $L(H)$.
     - The optimal point hypothesis $H \in \mathcal{H}$ is the one for which $L(x^n \mid H) + L(H)$ is minimal.
     - It is not clear how to find the code lengths for $H$ and $x^n \mid H$.

  6. The MDL principle II
     - $L$ is called a universal code with respect to a family of codes $\mathcal{L}$ if
       $$\frac{1}{n}\left[ L(x^n) - \min_{L^* \in \mathcal{L}} L^*(x^n) \right] \xrightarrow[n \to \infty]{} 0$$
     - Examples:
       - Two-step coding: code $H \in \mathcal{H}$ "uniformly", then code $x^n$ using the code corresponding to $P(x^n \mid H)$.
       - Bayesian approach (Minimum Message Length): define a prior probability $P_{\mathcal{H}}(H)$, then code using
         $$L(x^n) = -\log P(x^n \mid H) - \log P_{\mathcal{H}}(H)$$
     - Instead of a set of codes $\mathcal{L}$, examine a set of distributions $\mathcal{M}$.
     - $M \in \mathcal{M}$ that corresponds to a universal code is called a universal model.
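A minimal sketch of two-step coding, under assumptions that are not in the slides: the family is Bernoulli, the hypothesis set is a hypothetical uniform grid of `grid_size` parameter values, and lengths are measured in bits. The hypothesis costs $\log_2(\text{grid\_size})$ bits and the data cost $-\log_2 P(x^n \mid H)$ bits.

```python
import math


def two_part_code_length(bits, grid_size):
    """Return (total length in bits, chosen theta) of a two-step code.

    Step 1: name one of grid_size candidate Bernoulli parameters uniformly.
    Step 2: encode the data with -log2 P(x^n | theta) bits for that theta.
    """
    n, k = len(bits), sum(bits)
    # Candidates lie strictly inside (0, 1), so every log below is finite.
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    hypothesis_cost = math.log2(grid_size)
    data_cost, theta = min(
        (-(k * math.log2(t) + (n - k) * math.log2(1 - t)), t) for t in thetas
    )
    return hypothesis_cost + data_cost, theta


skewed = [1] * 90 + [0] * 10      # mostly ones: a regularity this family can exploit
balanced = [0, 1] * 50            # half ones: nothing for a Bernoulli code to exploit

print(two_part_code_length(skewed, 16))
print(two_part_code_length(balanced, 16))
```

For the skewed sequence the total is well below the 100 bits of the literal encoding. For the balanced one it is slightly above, because the Bernoulli family cannot express the alternation and the hypothesis still has to be paid for.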

  7. Measures of Goodness
     - For a model $\mathcal{M}$ and a distribution $\tilde{P}$, the Regret is a measure of distance: how much we lose if we encode the data using $\tilde{P}$ instead of the best distribution in $\mathcal{M}$:
       $$R_{\mathcal{M}}(\tilde{P}, x^n) = -\log \tilde{P}(x^n) - \min_{P \in \mathcal{M}} \{ -\log P(x^n) \}$$
     - To remove the dependence on $x^n$ we take the maximum:
       $$R_{\max}(\tilde{P}) = \max_{x^n \in \mathcal{X}^n} R_{\mathcal{M}}(\tilde{P}, x^n)$$
     - For a parametric family $\mathcal{M}_\theta$ we use the maximum likelihood estimate $\hat{\theta}(x^n)$ and define model complexity as
       $$\mathrm{COMP}_n(\mathcal{M}_\theta) = \sum_{x^n \in \mathcal{X}^n} P(x^n \mid \hat{\theta}(x^n))$$
     - Also called Stochastic Complexity.

  8. Normalized Maximum Likelihood (NML) Distribution
     - Find a distribution for $x^n$ that is universal w.r.t. $\mathcal{M}_\theta$.
     - Idea: use $P(x^n \mid \hat{\theta}(x^n))$.
     - Not possible (it does not in general sum to one over $x^n$), so do the next best thing and normalize the above probability:
       $$\tilde{P}_{\mathrm{NML}}(x^n) = \frac{P(x^n \mid \hat{\theta}(x^n))}{\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat{\theta}(y^n))}$$
     - The NML distribution achieves constant regret
       $$R_{\mathcal{M}_\theta}(\tilde{P}_{\mathrm{NML}}, x^n) = \log \mathrm{COMP}_n(\mathcal{M}_\theta)$$
     - NML is a universal model with code length
       $$L(x^n) = -\log P(x^n \mid \hat{\theta}(x^n)) + \log \mathrm{COMP}_n(\mathcal{M}_\theta)$$
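The constant-regret property can be checked by brute force on a small discrete family. The sketch below assumes the Bernoulli family on binary sequences (an illustrative choice, not one made in the slides); the function names are hypothetical and natural logarithms are used.

```python
import math
from itertools import product


def max_likelihood(k, n):
    """P(x^n | theta_hat(x^n)) for a binary sequence with k ones; theta_hat = k / n."""
    t = k / n
    # Python evaluates 0.0 ** 0 as 1.0, which handles k = 0 and k = n.
    return (t ** k) * ((1 - t) ** (n - k))


def comp(n):
    """COMP_n: the sum of maximized likelihoods over all 2^n sequences, grouped by k."""
    return sum(math.comb(n, k) * max_likelihood(k, n) for k in range(n + 1))


def nml_prob(bits):
    """NML probability of one binary sequence."""
    n, k = len(bits), sum(bits)
    return max_likelihood(k, n) / comp(n)


def regret(bits):
    """Regret of NML against the best Bernoulli distribution for this sequence."""
    n, k = len(bits), sum(bits)
    return -math.log(nml_prob(bits)) + math.log(max_likelihood(k, n))


n = 6
sequences = list(product([0, 1], repeat=n))
print("sum of NML probabilities:", sum(nml_prob(s) for s in sequences))
print("log COMP_n:", math.log(comp(n)))
print("distinct regrets:", {round(regret(s), 12) for s in sequences})
```

Every sequence gets the same regret, equal to $\log \mathrm{COMP}_n$, and the NML probabilities sum to one.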

  9. Properties of Model Complexity
     - $\mathrm{COMP}_n(\mathcal{M}_\theta)$ is a measure of the complexity or flexibility of the family of distributions $\mathcal{M}_\theta$.
     - In case $\mathcal{M}_\theta$ is discrete, $\mathrm{COMP}_n(\mathcal{M}_\theta)$ can be interpreted as the number of "essentially different models" in the family.
     - $\mathrm{COMP}_n(\mathcal{M}_\theta)$ is invariant under change of parametrization.
     - When some regularity conditions are imposed, as $n \to \infty$,
       $$\log \mathrm{COMP}_n(\mathcal{M}_\theta) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\theta \in \Theta} \sqrt{|I(\theta)|}\, d\theta + o(1)$$
       where $k$ is the number of free parameters and $I(\theta)$ is the Fisher information matrix.
     - It is possible that $\mathrm{COMP}_n(\mathcal{M}_\theta) = \infty$.
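The asymptotic expansion can be compared with an exact computation in a case where both are tractable. The sketch assumes the Bernoulli family (an illustrative choice), for which $k = 1$, $I(\theta) = 1/(\theta(1-\theta))$ and $\int_0^1 \sqrt{I(\theta)}\, d\theta = \pi$. Function names are hypothetical and logarithms are natural.

```python
import math


def log_comp_bernoulli(n):
    """Exact log COMP_n for the Bernoulli family, computed in log space."""
    terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_ml = 0.0  # maximized likelihood is 1 when k = 0 or k = n
        if 0 < k < n:
            log_ml = k * math.log(k / n) + (n - k) * math.log(1 - k / n)
        terms.append(log_binom + log_ml)
    # Log-sum-exp for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def log_comp_asymptotic(n):
    """(k/2) log(n / (2 pi)) + log of the Fisher-information integral, with k = 1."""
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)


for n in (10, 100, 1000):
    exact, approx = log_comp_bernoulli(n), log_comp_asymptotic(n)
    print(n, round(exact, 4), round(approx, 4), round(exact - approx, 4))
```

The gap between the exact value and the approximation shrinks as $n$ grows, consistent with the $o(1)$ term.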

  10. Model Selection Problem
     - Given a sample $X_1, \ldots, X_n$ of a random variable, decide from which of a myriad of distributions the sample originates.
     - Example:
       - $y = a x^b + Z$ (Stevens' model)
       - $y = a \ln(x + b) + Z$ (Fechner's model)
     - More data patterns can be explained by Stevens' model than by Fechner's (see [Grünwald, Rissanen, 2007]).
     - Our goal is to select between:
       - (N) the sample is from a distribution in $N(\mu, \sigma^2 I)$.
       - (T) the sample is from a distribution in $T_\nu(\mu, \sigma^2 I)$.

  11. Model Selection Solution
     - Ordinary information criteria (AIC, BIC, GIC, DIC) do not account for complexity beyond the number of free parameters.
     - Using the MDL principle:
       - Calculate $\mathrm{COMP}_n(\mathcal{M}_\theta)$ for both models.
       - Calculate the MLE for $\mu$ and $\sigma$ and the log-likelihood of the data.
       - Select the model with the smallest total description length
         $$L(x^n) = -\log f(x^n \mid \hat{\mu}, \hat{\sigma}) + \log \mathrm{COMP}_n(\mathcal{M}_\theta)$$

  12. Solution for Gaussian Model with Known Variance
     - [Barron et al, 1998] show that when (jointly) sufficient statistics $T$ exist for $\theta$,
       $$\mathrm{COMP}_n(\mathcal{M}_\theta) = \int_{\mathcal{X}^n} P(x^n \mid \hat{\theta}(x^n))\, dx^n = \int_{T} P(t \mid \hat{\theta}(t))\, dt$$
     - For the normal model with known variance, $\mathrm{COMP}_n(\mathcal{M}_\theta) = \infty$.
     - They propose using the conditional complexity instead, restricting the data to $A = \left\{ x^n : \left| \frac{1}{n} \sum x_i \right| \le R \right\}$:
       $$\mathrm{COMP}_n(\mathcal{M}_\theta \mid x^n \in A) = \int_A P(x^n \mid \hat{\theta}(x^n))\, dx^n$$
     - In that case
       $$\log \mathrm{COMP}_n(\mathcal{M}_\theta \mid x^n \in A) = -\tfrac{1}{2} \log 2\pi - \log \sigma + \tfrac{1}{2} \log n + \log 2R$$
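The sufficient-statistic form of the integral is easy to check numerically for the known-variance Gaussian. The sketch assumes i.i.d. $N(\mu, \sigma^2)$ data, so the sample mean is $N(\mu, \sigma^2/n)$ and its density evaluated at its own maximum likelihood estimate is the constant $\sqrt{n/(2\pi\sigma^2)}$. Integrating over $[-R, R]$ gives $2R\sqrt{n/(2\pi\sigma^2)}$. The function names and the parameter values are hypothetical.

```python
import math


def density_of_mean(t, mu, sigma, n):
    """Density at t of the sample mean of n i.i.d. N(mu, sigma^2) observations."""
    s = sigma / math.sqrt(n)
    return math.exp(-0.5 * ((t - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def log_conditional_comp_numeric(sigma, n, R, steps=10_000):
    """log of the integral over [-R, R] of P(t | theta_hat(t)) dt, midpoint rule.

    The MLE of mu given the sufficient statistic t is t itself.
    """
    h = 2 * R / steps
    total = 0.0
    for i in range(steps):
        t = -R + (i + 0.5) * h
        total += density_of_mean(t, t, sigma, n) * h
    return math.log(total)


def log_conditional_comp_closed_form(sigma, n, R):
    """log(2R * sqrt(n / (2 pi sigma^2))), written term by term."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            + 0.5 * math.log(n) + math.log(2 * R))


sigma, n, R = 2.0, 50, 3.0
print(log_conditional_comp_numeric(sigma, n, R))
print(log_conditional_comp_closed_form(sigma, n, R))
```

Because the integrand does not depend on $t$, the integral grows linearly in $R$, which is why the unconditional complexity ($R \to \infty$) is infinite.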

  13. Solution for Gaussian Model with Unknown Variance
     - Use the sufficient statistics $\bar{x}$ and $s^2$.
     - Compute the conditional complexity for $A = \left\{ |\bar{x}| \le R,\ D \le s^2 \right\}$:
       $$\mathrm{COMP}_n(\mathcal{M}_\theta \mid x^n \in A) = \frac{2 \left(\frac{n}{2}\right)^{n/2} e^{-n/2}}{\sqrt{\pi}\, \Gamma\!\left(\frac{n-1}{2}\right)} \times 2RD^{-1}$$
     - The unconditional complexity is again infinite.
     - An idea emerges: try to extract the last term and ignore it when comparing models.

  14. Complexity of Absolutely Continuous Location-Scale Family I
     - Define the p.d.f. of a multivariate location-scale family
       $$f(x^n \mid \mu, \sigma) = \sigma^{-n} g\!\left( \frac{x^n - \mu}{\sigma} \right)$$
     - The conditional complexity, conditional on $x^n \in A$ for
       $$A = \left\{ |\hat{\mu}(x^n)| \le R,\ D \le \hat{\sigma}^2(x^n) \right\}$$
       is the following integral:
       $$\mathrm{COMP}_n(\mathcal{M}_\theta \mid x^n \in A) = \int_A (\hat{\sigma}(x^n))^{-n}\, g\!\left( \frac{x^n - \hat{\mu}(x^n)}{\hat{\sigma}(x^n)} \right) dx^n$$

  15. Complexity of Absolutely Continuous Location-Scale Family II
     - Theorem. If $\delta$ is the Dirac delta function, then
       $$\mathrm{COMP}_n(\mathcal{M}_\theta \mid x^n \in A) = 2RD^{-1} \times \int \left[ \delta(\hat{\mu}(y^n))\, \delta(1 - \hat{\sigma}(y^n)) \right] g(y^n)\, dy^n = 2RD^{-1} \times E_{Y^n}\!\left[ \delta(\hat{\mu}(Y^n))\, \delta(1 - \hat{\sigma}(Y^n)) \right]$$
     - Corollary. The unconditional parametric complexity of an absolutely continuous location-scale family is either zero or infinity.
     - Note: the dependence structure in a sample is treated in the integral.

  16. Future work
     - Calculate the parametric complexity of the multivariate Student-T using the theorem and the fact that $\bar{x}$ and $s^2$ are the maximum likelihood estimators of the parameters.
     - Calculate the stochastic complexity of linear regression with Student-T distributed residuals.
     - Extend the theorem and corollary to cases where the sample statistics are not maximum likelihood estimators:
       - An i.i.d. sample with Student-T marginals.
       - Stable and CTS.
       - Models for time dependence.
