The Catch-up Phenomenon in Bayesian and MDL Model Selection

1. The Catch-up Phenomenon in Bayesian and MDL Model Selection. Tim van Erven, www.timvanerven.nl, 23 May 2013. Joint work with Peter Grünwald, Steven de Rooij and Wouter Koolen.

2. Outline
✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

3-4. Two Desirable Properties in Model Selection
✤ Suppose $M_1, \ldots, M_K$ are statistical models (sets of probability distributions: $M_k = \{p_\theta \mid \theta \in \Theta_k\}$).
✤ Consistency: If some $p^*$ in model $M_{k^*}$ generates the data, then $M_{k^*}$ is selected with probability one as the amount of data goes to infinity.
✤ Rate of convergence: How fast does an estimator based on the available models converge to the true distribution?

The AIC-BIC Dilemma:

                            Consistent    Optimal rate of convergence?
  BIC, Bayes, MDL           Yes           No
  AIC, LOO cross-validation No            Yes

5. Bayesian Prediction
✤ Given model $M_k = \{p_\theta \mid \theta \in \Theta_k\}$ with prior $w_k$ and data $x^n = (x_1, \ldots, x_n)$, the Bayesian marginal likelihood is
$$\bar{p}_k(x^n) \equiv p(x^n \mid M_k) := \int_{\Theta_k} p_\theta(x^n)\, w_k(\theta)\, d\theta$$
✤ Given $M_k$, predict $x_{n+1}$ with the estimator
$$\bar{p}_k(x_{n+1} \mid x^n) = \int_{\Theta_k} p_\theta(x_{n+1} \mid x^n)\, w_k(\theta \mid x^n)\, d\theta = \frac{\bar{p}_k(x^{n+1})}{\bar{p}_k(x^n)}$$
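
To make these formulas concrete, here is a minimal Python sketch (not part of the talk) for the simplest case: a Bernoulli model with a uniform Beta(1,1) prior, where both the marginal likelihood and the predictive have closed forms.

```python
import math

def marginal_likelihood(xs):
    """p_bar(x^n) = integral over [0,1] of theta^h (1-theta)^t dtheta
    = h! t! / (n+1)! for h ones and t zeros (uniform prior)."""
    h = sum(xs)
    t = len(xs) - h
    return math.factorial(h) * math.factorial(t) / math.factorial(len(xs) + 1)

def predictive(x_next, xs):
    """p_bar(x_{n+1} | x^n) = p_bar(x^{n+1}) / p_bar(x^n);
    here this reduces to Laplace's rule of succession (h+1)/(n+2)."""
    return marginal_likelihood(xs + [x_next]) / marginal_likelihood(xs)

print(predictive(1, [1, 0, 1, 1]))  # (3+1)/(4+2) = 0.666...
```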

6-7. Bayes Factors and MDL Model Selection
✤ Suppose we have multiple models $M_1, M_2, \ldots$
✤ Bayes factors: Put a prior $\pi$ on the model index $k$ and choose $\hat{k}(x^n)$ to maximize the posterior probability
$$p(M_k \mid x^n) := \frac{\bar{p}_k(x^n)\, \pi(k)}{\sum_{k'} \bar{p}_{k'}(x^n)\, \pi(k')}$$
✤ Equivalently, $\hat{k}(x^n)$ minimizes $-\log \bar{p}_k(x^n) - \log \pi(k) \approx -\log \bar{p}_k(x^n)$, which is the Minimum Description Length (MDL) criterion.
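
A hedged sketch of this selection rule (the names and callables are illustrative, not from the talk): given functions p_bars[k] that return $\bar{p}_k(x^n)$, for instance marginal_likelihood above for each model, selection minimizes the two-part code length.

```python
import math

def select_model(p_bars, prior, xs):
    """Return the index k minimizing -log p_bar_k(x^n) - log pi(k),
    i.e. maximizing the posterior probability of model k."""
    def code_length(k):
        return -math.log(p_bars[k](xs)) - math.log(prior[k])
    return min(range(len(p_bars)), key=code_length)
```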

8. Example: Histogram Density Estimation
✤ Model: $M_k = \{p_\theta \mid \theta \in \Theta_k \subset \mathbb{R}^k\}$, i.i.d. data in the interval [0,1].
[Figure: a 4-bin histogram estimator with bin probabilities $\theta_1 = \frac{4+1}{n+4}$, $\theta_2 = \frac{0+1}{n+4}$, $\theta_3 = \frac{2+1}{n+4}$, $\theta_4 = \frac{1+1}{n+4}$.]
✤ Given k, estimate the density by the estimator in the figure.
✤ This is equivalent to $\bar{p}_k$ for the conjugate Dirichlet(1,...,1) prior.
✤ How should we choose the number of bins k?
✤ Too few: does not capture enough structure.
✤ Too many: overfitting (many bins will be empty).
✤ [Yu, Speed '92]: Bayes does not achieve the optimal rate of convergence!
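
A small sketch of this estimator (assuming equal-width bins on [0,1) and data as Python floats): with a Dirichlet(1,...,1) prior, the predictive probability of bin j is $(n_j + 1)/(n + k)$, exactly the add-one values shown in the figure.

```python
def histogram_predictive(xs, k):
    """Predictive density on [0,1) of the Bayesian k-bin histogram
    with a Dirichlet(1,...,1) prior, given past observations xs."""
    counts = [0] * k
    for x in xs:
        counts[min(int(x * k), k - 1)] += 1
    n = len(xs)

    def density(x):
        j = min(int(x * k), k - 1)
        # bin probability (n_j + 1)/(n + k), times k because
        # each bin has width 1/k
        return k * (counts[j] + 1) / (n + k)

    return density
```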

9. CV Selects More Bins than Bayes
[Figure: left panel, the true density f(x) on [0,1]; right panel, the average number of bins selected (0 to 100) versus sample size n (up to 200,000) for Bayes and LOO-CV, with LOO-CV selecting many more bins.]

10. CV Predicts Better than Bayes
Prediction error in log loss at sample size n: $-\log \bar{p}_{\hat{k}(x^n)}(x_{n+1} \mid x^n)$
Cumulative prediction error: $\sum_{i=1}^{n} -\log \bar{p}_{\hat{k}(x^{i-1})}(x_i \mid x^{i-1})$
[Figure: cumulative log loss (0 to 350) versus sample size n (up to 200,000); the Bayes curve lies above the LOO-CV curve.]
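
The cumulative curve in the figure corresponds to the following sketch (select and predictives are hypothetical stand-ins for the concrete choices in the experiment): at each step i, select a model on the past $x^{i-1}$, predict $x_i$ with it, and pay the log loss.

```python
import math

def cumulative_log_loss(xs, select, predictives):
    """Cumulative log loss of the select-then-predict strategy:
    select(past) returns a model index (e.g. by Bayes factors or
    LOO-CV); predictives[k](x, past) returns p_bar_k(x | past)."""
    total = 0.0
    for i, x in enumerate(xs):
        past = xs[:i]
        k = select(past)
        total += -math.log(predictives[k](x, past))
    return total
```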

11. CV Predicts Better than Bayes...
[Figure: left panel, the true density f(x) on [0,1]; right panel, the average number of bins selected versus sample size n for Bayes and LOO-CV.]
✤ The true density is not a histogram, but can be approximated arbitrarily well.
✤ LOO-CV and AIC converge at the optimal rate.
✤ Bayesian model selection selects too few bins (it underfits)!

12. ... but CV is Inconsistent!
✤ Now suppose the data are sampled from the uniform distribution.
[Figure: left panel, the number of bins selected (0 to 8) versus sample size n (up to 200,000) for Bayes and LOO-CV; right panel, the uniform density f(x) on [0,1].]
✤ LOO cross-validation selects 2.5 bins on average: it is inconsistent!

13. Outline
✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

14. Logarithmic Loss
If we measure prediction quality by log loss, $\mathrm{loss}(p, x) := -\log p(x)$, then minus log likelihood = cumulative log loss:
$$-\log p(x_1, \ldots, x_n) = \sum_{i=1}^{n} -\log p(x_i \mid x^{i-1}), \quad \text{where } x^{i-1} = (x_1, \ldots, x_{i-1})$$
Proof: take the negative logarithm of the chain rule $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x^{i-1})$.

15. The Most Important Slide
Bayes factors and MDL pick the k minimizing
$$-\log \bar{p}_k(x_1, \ldots, x_n) = \sum_{i=1}^{n} -\log \bar{p}_k(x_i \mid x^{i-1})$$
where each term $-\log \bar{p}_k(x_i \mid x^{i-1})$ is the prediction error for model $M_k$ at sample size i.
Prequential/predictive MDL interpretation: select the model $M_k$ such that $\bar{p}_k$, when used as a sequential prediction strategy, minimizes the cumulative sequential prediction error. [Dawid '84, Rissanen '84]
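
A quick numerical check of this identity for the Bernoulli model sketched earlier: the joint code length and the cumulative one-step losses are the same number, because the product of predictives telescopes.

```python
import math

def marg(xs):
    """Bernoulli marginal likelihood under a uniform Beta(1,1) prior."""
    h = sum(xs)
    t = len(xs) - h
    return math.factorial(h) * math.factorial(t) / math.factorial(len(xs) + 1)

xs = [1, 0, 1, 1, 0, 1]
joint = -math.log(marg(xs))
# one-step predictive: p_bar(x_i | x^{i-1}) = marg(x^i) / marg(x^{i-1})
sequential = sum(-math.log(marg(xs[:i + 1]) / marg(xs[:i]))
                 for i in range(len(xs)))
print(abs(joint - sequential) < 1e-12)  # True: same number
```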

16-17. Example: Markov Chains
Natural language text: "The Picture of Dorian Gray" by Oscar Wilde.
"... But beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a mode of exaggeration, and destroys the harmony of any face. The moment one sits down to think, one becomes all nose, or all forehead, or something horrid. Look at the successful men in any of the learned professions. How perfectly hideous they are! ..."
Compare the first-order Markov chain model (128×127 parameters) and the second-order Markov chain model (128×128×127 parameters) on the first n characters in the book, with uniform priors on the transition probabilities.
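
A sketch of this comparison (the filename is hypothetical; the rest follows from the uniform Dirichlet(1,...,1) prior per context): the Bayesian predictive of the next character given its length-m context is the add-one estimate, and $-\log \bar{p}(x^n)$ accumulates these one-step losses.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 128  # matches the slide's parameter counts

def markov_code_length(text, order):
    """-log p_bar(x^n) in nats for the order-m Bayesian Markov chain
    with a Dirichlet(1,...,1) prior on each context's transition
    probabilities (the first `order` characters are conditioned on)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    total = 0.0
    for i in range(order, len(text)):
        ctx, c = text[i - order:i], text[i]
        p = (counts[ctx][c] + 1) / (totals[ctx] + ALPHABET_SIZE)
        total += -math.log(p)
        counts[ctx][c] += 1
        totals[ctx] += 1
    return total

text = open('dorian_gray.txt').read()  # hypothetical local copy of the book
for n in (1000, 10000, 100000):
    # log Bayes factor (up to the prior on k): negative favors model 2
    print(n, markov_code_length(text[:n], 2) - markov_code_length(text[:n], 1))
```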

18-21. Example: Markov chains
[Figure: minus log marginal likelihood versus sample size n for both models; the green line equals the log of the Bayes factor:]
$$-\log \bar{p}_2(x^n) - [-\log \bar{p}_1(x^n)] = \sum_{i=1}^{n} \mathrm{loss}(\bar{p}_2, x_i) - \sum_{i=1}^{n} \mathrm{loss}(\bar{p}_1, x_i)$$
For n beyond the threshold marked in the figure, Bayes selects the complex model. But the complex model already makes the best predictions from a much earlier sample size on!

22. The Catch-up Phenomenon
✤ Given a "simple" model $M_1$ and a "complex" model $M_2$.
✤ Common phenomenon: for some sample size s,
✤ the simple model predicts better if n ≤ s,
✤ the complex model predicts better if n > s.
✤ Catch-up Phenomenon: Bayes/MDL exhibit inertia. The complex model has to "catch up", so we prefer the simpler model for a while even after n > s!
✤ Remark: Methods similar to Bayes factors (e.g. BIC) also exhibit the catch-up phenomenon. Bayesian model averaging does not help either!
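
The inertia can be made precise with a small sketch (loss1 and loss2 are hypothetical per-sample log-loss sequences, e.g. produced by the Markov-chain code above): the per-sample comparison flips at s, but the cumulative comparison that Bayes factors perform only flips once the complex model has repaid its accumulated arrears.

```python
from itertools import accumulate

def bayes_switch_point(loss1, loss2):
    """First n at which the cumulative loss of model 2 drops below
    that of model 1, i.e. where the Bayes factor starts favoring the
    complex model; typically much later than the crossing point s
    of the per-sample losses."""
    for n, (c1, c2) in enumerate(zip(accumulate(loss1),
                                     accumulate(loss2)), start=1):
        if c2 < c1:
            return n
    return None
```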

23-24. Example: Markov chains
[Figure: the cumulative log loss of Bayes/MDL compared with a better black curve.]
Can we modify Bayes so as to do as well as the black curve? Almost!

25. Outline
✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

26. The Best of Both Worlds
✤ Catch-up phenomenon: a new explanation for the poor predictions of Bayes (and other BIC-like methods).
✤ We want a model selection/averaging method that, in a wide variety of circumstances,
✤ is provably consistent, and
✤ provably achieves optimal convergence rates.
✤ But it has previously been suggested that this is impossible! [Yang '05]
