Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories

Ole Winther
The Bioinformatics Center, University of Copenhagen
Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark
ole.winther@gmail.com, owi@imm.dtu.dk
Overview

1. How many species? Predicting sequence tags.
   • Non-parametric Bayes
   • Averaging beats maximum likelihood
   • The model is always wrong (and Bayes can’t tell)
2. Computing the marginal likelihood with MCMC
   • Motivation: computing corrections to EP/C
   • The trouble with Gibbs sampling
   • Gaussian process classification
How many species?
DNA sequence tags - CAGE (Cap Analysis of Gene Expression)
Look at the data - cerebellum

[Figure: count-of-counts m(n) versus tag count n for the cerebellum library; left panel shows the full range of n, right panel zooms in on small n (up to about 30).]
Look at the data - embryo

[Figure: count-of-counts m(n) versus tag count n for the embryo library; left panel shows the full range of n, right panel zooms in on small n (up to about 30).]
Chinese restaurant process - Yor-Pitman sampling formula

Observing a new species given counts n = n_1, ..., n_k in k bins:

$$p(n_1, \dots, n_k, 1 \mid n, \sigma, \theta) = \frac{\theta + k\sigma}{n + \theta}, \qquad \text{with } \sum_{i=1}^{k} n_i = n$$

Re-observing species j:

$$p(n_1, \dots, n_{j-1}, n_j + 1, n_{j+1}, \dots, n_k \mid n, \sigma, \theta) = \frac{n_j - \sigma}{n + \theta}$$

Exchangeability – invariant to re-ordering:

$$\text{E, E, M, T, T}: \quad p_1 = \frac{\theta}{\theta}\,\frac{1-\sigma}{1+\theta}\,\frac{\theta+\sigma}{2+\theta}\,\frac{\theta+2\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta}$$

$$\text{M, T, E, T, E}: \quad p_2 = \frac{\theta}{\theta}\,\frac{\theta+\sigma}{1+\theta}\,\frac{\theta+2\sigma}{2+\theta}\,\frac{1-\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta} = \dots = p_1$$
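As a concrete illustration of these predictive rules (my own sketch, not part of the slides), the snippet below simulates the urn scheme forward: each new observation starts a new species with probability (θ + kσ)/(n + θ) and otherwise re-observes species j with probability (n_j − σ)/(n + θ). All function and variable names are made up for the example.

```python
import numpy as np

def sample_pitman_yor_counts(n_draws, sigma, theta, rng=None):
    """Simulate the urn scheme: returns the count per species after n_draws observations."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []                          # counts[j] = n_j; number of species k = len(counts)
    for n in range(n_draws):             # n = number of observations made so far
        k = len(counts)
        p_new = (theta + k * sigma) / (n + theta)
        if rng.random() < p_new:
            counts.append(1)             # observe a new species
        else:
            # re-observe species j with probability proportional to n_j - sigma
            probs = (np.array(counts) - sigma) / (n + theta) / (1 - p_new)
            j = rng.choice(k, p=probs)
            counts[j] += 1
    return np.array(counts)

counts = sample_pitman_yor_counts(10_000, sigma=0.5, theta=10.0)
print("species observed:", len(counts), "total draws:", counts.sum())
```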
Inference and prediction

Likelihood function, e.g. for E, E, M, T, T:

$$p(n \mid \sigma, \theta) = \frac{\theta}{\theta}\,\frac{1-\sigma}{1+\theta}\,\frac{\theta+\sigma}{2+\theta}\,\frac{\theta+2\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta}
= \frac{\prod_{j=1}^{k-1}(\theta + j\sigma)}{\prod_{i=1}^{n-1}(i + \theta)}\;\prod_{i'=1}^{k}\;\prod_{j'=1}^{n_{i'}-1}(j' - \sigma)$$

Flat prior distribution for σ ∈ [0, 1] and for the pseudo-count parameter θ.

Predictions for a new count m:

$$p(m \mid n) = \int p(m \mid n, \sigma, \theta)\, p(\sigma, \theta \mid n)\, d\sigma\, d\theta$$

with Gibbs sampling of (σ, θ) and Yor-Pitman sampling for m.
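A hedged sketch of the inference step: the exchangeable likelihood in log form (products rewritten via gamma functions), evaluated on a (σ, θ) grid under the flat prior as a simple stand-in for the Gibbs sampler mentioned on the slide. The toy count vector and the grid ranges are invented for illustration; prediction for a new count m would then amount to running the urn forward from the observed counts for each posterior draw, as in the previous snippet.

```python
import numpy as np
from scipy.special import gammaln

def log_likelihood(counts, sigma, theta):
    """log p(n | sigma, theta) for the exchangeable partition formula above."""
    counts = np.asarray(counts)
    n, k = counts.sum(), len(counts)
    # prod_{j=1}^{k-1} (theta + j*sigma)
    lp = np.sum(np.log(theta + sigma * np.arange(1, k)))
    # 1 / prod_{i=1}^{n-1} (i + theta) = Gamma(1 + theta) / Gamma(n + theta)
    lp += gammaln(1 + theta) - gammaln(n + theta)
    # prod_i prod_{j'=1}^{n_i - 1} (j' - sigma) = prod_i Gamma(n_i - sigma) / Gamma(1 - sigma)
    lp += np.sum(gammaln(counts - sigma) - gammaln(1 - sigma))
    return lp

# Posterior over (sigma, theta) on a grid under flat priors (toy tag counts, illustrative ranges)
counts = np.concatenate([np.ones(900, int), np.arange(2, 30)])
sigmas = np.linspace(0.01, 0.99, 50)
thetas = np.linspace(1.0, 3000.0, 60)
logpost = np.array([[log_likelihood(counts, s, t) for t in thetas] for s in sigmas])
post = np.exp(logpost - logpost.max())
post /= post.sum()
print("posterior mean sigma:", (post.sum(axis=1) * sigmas).sum())
print("posterior mean theta:", (post.sum(axis=0) * thetas).sum())
```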
Averaging versus max. likelihood

[Figure: contour plots of log L − log L_ML in the (σ, θ) plane for the cerebellum (left) and embryo (right) libraries.]
Notice anything funny?

[Figure: cerebellum library, number of distinct tags k versus total number of tags n, shown out to n ≈ 6 × 10⁵ with the observed library size 186257 marked; insets zoom in on parts of the curve.]
Notice anything funny? Example 2

[Figure: embryo library, number of distinct tags k versus total number of tags n, shown out to n ≈ 1.2 × 10⁶ with the observed library size 344794 marked; insets zoom in on parts of the curve.]
(Well-known) take home messages

• Parameter averaging works!
• The model is always wrong! (as revealed with sufficient data)
• Only one model considered!
• What happened to being Bayesian about model selection?
Calculating the marginal likelihood

The marginal likelihood:

$$p(\mathcal{D} \mid \mathcal{H}) = \int p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H})\, d\mathbf{f}$$

Approximate inference needed!

Monte Carlo: slow mixing, non-trivial to get marginal likelihood estimates.

Expectation propagation+, variational Bayes, loopy BP+: sometimes not precise, approximation errors not controllable, not applicable to all models.
Motivation: validating EP corrections

Kuss-Rasmussen (JMLR 2006): N = 767, 3-vs-5 GP USPS digit classification with

$$K(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$

I: $\log R = \log Z_{\mathrm{EPc}} - \log Z_{\mathrm{EP}}$, and II+III: $\log Z_{\mathrm{MCMC}} - \log Z_{\mathrm{EP}}$.

[Figure: three contour plots (I, II, III) of these quantities over log lengthscale ln(ℓ) and log magnitude ln(σ).]

Thanks to Malte Kuss for making III available.
Marginal likelihood from importance sampling

Importance sampling:

$$p(\mathcal{D} \mid \mathcal{H}) = \int \frac{p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H})}{q(\mathbf{f})}\, q(\mathbf{f})\, d\mathbf{f}$$

Draw samples $\mathbf{f}_1, \dots, \mathbf{f}_R$ from $q(\mathbf{f})$ and set

$$p(\mathcal{D} \mid \mathcal{H}) \approx \frac{1}{R} \sum_{r=1}^{R} \frac{p(\mathcal{D} \mid \mathbf{f}_r, \mathcal{H})\, p(\mathbf{f}_r \mid \mathcal{H})}{q(\mathbf{f}_r)}$$

This will usually not work because the ratio varies too much (the estimator has very high variance).
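A toy illustration of the failure mode (my own example, not from the slides): a 20-dimensional Gaussian model with the prior as proposal q, so the true log marginal likelihood is known in closed form. The weights vary over many orders of magnitude and a handful of samples dominate the estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model: prior f ~ N(0, I_d), likelihood D_i | f_i ~ N(f_i, s^2), observed D_i = 1
d, s = 20, 0.3
D = np.ones(d)
true_logZ = norm.logpdf(D, loc=0.0, scale=np.sqrt(1 + s**2)).sum()   # closed form for this toy

# Importance sampling with the prior as proposal q(f): the weight is just p(D|f)
R = 10_000
f = rng.normal(0.0, 1.0, size=(R, d))
log_w = norm.logpdf(D, loc=f, scale=s).sum(axis=1)                   # log weights
logZ_hat = np.logaddexp.reduce(log_w) - np.log(R)                    # log of the weight average
ess = np.exp(2 * np.logaddexp.reduce(log_w) - np.logaddexp.reduce(2 * log_w))
print(f"true log Z = {true_logZ:.1f}, IS estimate = {logZ_hat:.1f}, effective sample size = {ess:.1f}")
# The effective sample size is close to 1 and log Z is badly underestimated.
```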
Marginal likelihood from thermodynamic integration

Variants: parallel tempering, simulated tempering and annealed importance sampling.

$$h(\mathbf{f}) = p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H}), \qquad p(\mathbf{f} \mid \beta) = \frac{1}{Z(\beta)}\, h^{\beta}(\mathbf{f})\, q^{1-\beta}(\mathbf{f})$$

$$\log Z(\beta_2) - \log Z(\beta_1) = \int_{\beta_1}^{\beta_2} \frac{d \log Z(\beta)}{d\beta}\, d\beta = \int_{\beta_1}^{\beta_2} \!\int \log\frac{h(\mathbf{f})}{q(\mathbf{f})}\, p(\mathbf{f} \mid \beta)\, d\mathbf{f}\, d\beta$$

Run $N_\beta$ chains and interpolate:

$$\log Z(\beta_2) - \log Z(\beta_1) \approx \frac{\Delta\beta}{R} \sum_{b=1}^{N_\beta} \sum_{r=1}^{R} \log\frac{h(\mathbf{f}_{rb})}{q(\mathbf{f}_{rb})}$$

Other things that might work even better: multi-canonical sampling.
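A sketch of the interpolation estimate on a 1-D Gaussian toy problem of my own (prior N(0, 1), likelihood N(f, 0.3²), observation D = 1), chosen so the answer is known in closed form. With q equal to the prior, h/q is just the likelihood and p(f | β) happens to be Gaussian, so it is sampled exactly here; for a real model each β level would need its own MCMC chain.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# 1-D toy model: prior q(f) = N(0, 1), likelihood p(D|f) = N(D; f, s^2), observed D = 1
D, s = 1.0, 0.3
true_logZ = norm.logpdf(D, loc=0.0, scale=np.sqrt(1 + s**2))

betas = np.linspace(0.0, 1.0, 41)            # N_beta inverse-temperature levels
R = 2_000                                    # samples per level
expect = []
for b in betas:
    prec = 1.0 + b / s**2                    # precision of the Gaussian p(f | beta)
    mu = (b / s**2) * D / prec
    f = rng.normal(mu, 1.0 / np.sqrt(prec), size=R)
    expect.append(norm.logpdf(D, loc=f, scale=s).mean())   # <log h/q> under p(f | beta)
expect = np.array(expect)
# Trapezoid version of the Delta-beta sum on the slide
logZ_hat = np.sum(np.diff(betas) * (expect[1:] + expect[:-1]) / 2)
print(f"true log Z = {true_logZ:.3f}, thermodynamic integration estimate = {logZ_hat:.3f}")
```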
The trouble with Gibbs sampling

Cycle over the variables $f_i$, $i = 1, \dots, N$, and sample the conditionals

$$p(f_i \mid \mathbf{f}_{\setminus i}) = \frac{p(\mathbf{f})}{p(\mathbf{f}_{\setminus i})} \propto p(\mathbf{f})$$

[Figure: two-dimensional illustration of Gibbs sampling on a correlated Gaussian.]
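A small sketch (my own example) of the problem: Gibbs sampling a strongly correlated bivariate Gaussian N(0, C). The conditional standard deviation is sqrt(1 − ρ²), which is tiny when ρ is close to 1, so the axis-aligned updates creep along the ridge and successive samples are highly autocorrelated.

```python
import numpy as np

rho = 0.99
C = np.array([[1.0, rho], [rho, 1.0]])       # strongly correlated target N(0, C)
rng = np.random.default_rng(2)

f = np.zeros(2)
samples = []
for _ in range(5_000):
    for i in range(2):
        j = 1 - i
        # p(f_i | f_j) for a zero-mean bivariate Gaussian
        cond_mean = C[i, j] / C[j, j] * f[j]
        cond_var = C[i, i] - C[i, j]**2 / C[j, j]
        f[i] = rng.normal(cond_mean, np.sqrt(cond_var))
    samples.append(f.copy())
samples = np.asarray(samples)
print("empirical covariance:\n", np.cov(samples.T))
# With rho = 0.99 the conditional std is sqrt(1 - rho^2) ~ 0.14, so the chain mixes slowly.
```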
A trivial cure for N( f | 0, C )

Gibbs sample $z_i$, $i = 1, \dots, N$, with $\mathbf{z} \sim \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I})$ and set

$$\mathbf{f} = \mathbf{L}\mathbf{z} \qquad \text{with} \qquad \mathbf{C} = \mathbf{L}\mathbf{L}^{\mathsf{T}}$$

[Figure: the same two-dimensional example before (left) and after (right) the change of variables.]
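The cure in code form, as a sketch on the same toy target: draw (or Gibbs sample) independent standard normals z and map them through the Cholesky factor L of C. The updates in z-space are uncorrelated, and f = Lz has the desired covariance.

```python
import numpy as np

rho = 0.99
C = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(C)                 # C = L L^T
rng = np.random.default_rng(3)

z = rng.normal(size=(5_000, 2))           # i.i.d. N(0, I); Gibbs over independent z_i mixes perfectly
f = z @ L.T                               # f = L z has covariance L I L^T = C
print("empirical covariance:\n", np.cov(f.T))
```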
Gaussian process classification (GPC)

$$p(\mathbf{f} \mid \mathbf{y}, \mathbf{K}, \beta) = \frac{1}{Z(\beta)} \prod_n \phi(y_n f_n)\, \exp\!\left(-\frac{\beta}{2}\, \mathbf{f}^{\mathsf{T}} \mathbf{K}^{-1} \mathbf{f}\right)$$

Noise-free formulation with auxiliary variables $\mathbf{f}_{\mathrm{nf}}$:

$$\phi(y f) = \int \theta(y f_{\mathrm{nf}})\, \mathcal{N}(f_{\mathrm{nf}} \mid f, 1)\, df_{\mathrm{nf}}$$

Joint distribution:

$$p(\mathbf{f}, \mathbf{f}_{\mathrm{nf}}, \mathbf{y} \mid \mathbf{K}, \beta) = p(\mathbf{y} \mid \mathbf{f}_{\mathrm{nf}})\, p(\mathbf{f}_{\mathrm{nf}} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{K}, \beta)$$

Marginalize out $\mathbf{f}$:

$$p(\mathbf{f}_{\mathrm{nf}} \mid \mathbf{y}, \mathbf{K}, \beta) \propto \prod_n \theta(y_n f_{\mathrm{nf},n})\, \mathcal{N}(\mathbf{f}_{\mathrm{nf}} \mid \mathbf{0}, \mathbf{I} + \mathbf{K}/\beta)$$

Samples of $\mathbf{f}$ can be recovered from $p(\mathbf{f} \mid \mathbf{f}_{\mathrm{nf}})$ (Gaussian).

Efficient sampler of a truncated Gaussian needed!
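One possible such sampler, sketched below with invented data, kernel and β: coordinate-wise Gibbs sampling of f_nf from the truncated Gaussian above, where each conditional is a univariate Gaussian restricted to the half-line with y_n f_nf,n > 0, drawn with scipy's truncnorm. This is a generic scheme for illustration, not necessarily the sampler used for the slides.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

# --- toy GPC setup (illustrative data, squared-exponential kernel and beta) ---
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
ell, sf2, beta = 1.0, 1.0, 1.0
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = sf2 * np.exp(-d2 / (2 * ell**2))

Sigma = np.eye(len(y)) + K / beta          # covariance of f_nf, as on the slide
Lam = np.linalg.inv(Sigma)                 # precision matrix used for the conditionals

f = y.copy()                               # start inside the correct orthant
samples = []
for sweep in range(500):
    for n in range(len(y)):
        cond_var = 1.0 / Lam[n, n]
        cond_mean = -cond_var * (Lam[n] @ f - Lam[n, n] * f[n])
        sd = np.sqrt(cond_var)
        if y[n] > 0:                       # truncate to f_nf,n > 0
            a, b = (0.0 - cond_mean) / sd, np.inf
        else:                              # truncate to f_nf,n < 0
            a, b = -np.inf, (0.0 - cond_mean) / sd
        f[n] = truncnorm.rvs(a, b, loc=cond_mean, scale=sd, random_state=rng)
    samples.append(f.copy())
print("posterior mean of f_nf (first 5 points):", np.mean(samples, axis=0)[:5])
```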