Meta-Bayesian Analysis

Jun Yang
joint work with Daniel M. Roy

Department of Statistical Sciences, University of Toronto

ISBA 2016, June 16, 2016
Motivation

“All models are wrong, some are useful.” — George Box
“Truth [...] is much too complicated to allow anything but approximations.” — John von Neumann

◮ Subjective Bayesianism: alluring, but impossible to practice when the model is wrong
◮ Prior probability = degree of belief... in what? What is a prior?
◮ Is there any role for (subjective) Bayesianism?

Our proposal: a more inclusive and pragmatic definition of “prior”.
Our approach: Bayesian decision theory
Example: Grossly Misspecified Model

Setting: Machine learning
data are a collection of documents:
◮ Model: Latent Dirichlet Allocation (LDA), aka “topic modeling”
◮ Prior belief: ˜π ≡ 0, i.e., no setting of LDA is faithful to our true beliefs about the data.
◮ Conjugate prior: π(dθ) ∼ Dirichlet(α)

What is the meaning of a prior on LDA parameters?

Pragmatic question: If we use an LDA model (for whatever reason), how should we choose our “prior”?
Example: Accurate but still Misspecified Model

Setting: Careful science
data are experimental measurements:
◮ Model: (Q_θ)_{θ ∈ Θ}, painstakingly produced after years of effort
◮ Prior belief: ˜π ≡ 0, i.e., no Q_θ is 100% faithful to our true beliefs about the data.

What is the meaning of a prior in a misspecified model? (All models are misspecified.)

Pragmatic question: How should we choose a “prior”?
Standard Bayesian Analysis for Prediction

Q_θ(·)                     model on X × Y given parameter θ
                           X: what you will observe, Y: what you will then predict
π(·)                       prior on θ
(πQ)(·) = ∫ Q_θ(·) π(dθ)   marginal distribution on X × Y

The Task
1. Observe X.
2. Take action Ŷ.
3. Suffer loss L(Ŷ, Y).

The Goal
Minimize expected loss.

Believe (X, Y) ∼ πQ. The Bayes optimal action minimizes expected loss under the conditional distribution of Y given X = x, written πQ(dy | x):

   BayesOptAction(πQ, x) = arg min_a ∫ L(a, y) πQ(dy | x).

◮ Quadratic loss → posterior mean.
◮ Self-information loss (log loss) → posterior πQ(· | x).
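To make BayesOptAction concrete, here is a minimal Python sketch for a hypothetical Beta–Bernoulli setup (the Beta(1, 1) prior, the data, and the grid search are illustrative choices, not part of the talk): under quadratic loss, the action minimizing the expected loss coincides with the posterior predictive mean, matching the first bullet above.

```python
import numpy as np

# Hypothetical setup: theta ~ Beta(a, b), X = number of 1s in n tosses, Y = next toss.
# pi_Q(Y = 1 | x) is the posterior predictive probability of a 1.
a, b, n, x = 1.0, 1.0, 10, 7
p1 = (a + x) / (a + b + n)

def expected_loss(action, loss):
    """Expected loss of an action under the predictive distribution pi_Q(. | x)."""
    return (1 - p1) * loss(action, 0) + p1 * loss(action, 1)

quadratic = lambda action, y: (action - y) ** 2

# A grid search stands in for the arg min in BayesOptAction(pi Q, x).
actions = np.linspace(0.0, 1.0, 10001)
best = actions[np.argmin([expected_loss(act, quadratic) for act in actions])]

print("Bayes-optimal action (quadratic loss):", best)   # ~0.6667
print("posterior predictive mean            :", p1)     # (1 + 7) / (2 + 10) = 0.6667
```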
Meta-Bayesian Analysis

◮ (Q_θ)_{θ ∈ Θ}: the model, i.e., a family of distributions on X × Y.
◮ Don’t believe Q_θ, i.e., the model is misspecified.
◮ P: represents our true belief on X × Y.

Believe (X, Y) ∼ P. But we will use Q_θ.

The Task
1. Choose a (surrogate) prior π.
2. Observe X.
3. Take action Ŷ = BayesOptAction(πQ, x).
4. Suffer loss L(Ŷ, Y).

The Goal
Minimize expected loss with respect to P, not πQ.
Meta-Bayesian Analysis

Key ideas:
◮ Believe (X, Y) ∼ P.
◮ But predict using πQ(· | X = x) for some prior π.
◮ The prior π is a choice/decision/action.
◮ The loss associated with π and (x, y) is

   L*(π, (x, y)) = L(BayesOptAction(πQ, x), y).

Meta-Bayesian risk
◮ The Bayes risk under P of doing Bayesian analysis under πQ:

   R(P, π) = ∫ L*(π, (x, y)) P(dx × dy).

◮ The meta-Bayesian optimal prior minimizes the meta-Bayesian risk:

   inf_{π ∈ F} R(P, π),

  where F is some set of priors under consideration.
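Once P, the model, and the loss are fixed, the meta-Bayesian risk R(P, π) can be estimated by Monte Carlo. The sketch below is a hypothetical instance, not an example from the talk: P is a sticky two-state chain, the model is i.i.d. Bernoulli with a Beta(a, a) surrogate prior, and the loss is log loss; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true belief P: X = n tosses of a sticky {0,1} Markov chain, Y = the next toss.
p, q, n = 0.1, 0.1, 5

def sample_xy():
    z = np.empty(n + 1, dtype=int)
    z[0] = rng.integers(2)
    for t in range(1, n + 1):
        flip = p if z[t - 1] == 0 else q
        z[t] = z[t - 1] if rng.random() > flip else 1 - z[t - 1]
    return z[:n], z[n]

def meta_bayes_risk(a, b, n_mc=20_000):
    """Monte Carlo estimate of R(P, pi) = E_P[ L*(pi, (X, Y)) ] under log loss,
    for the surrogate prior pi = Beta(a, b) on the i.i.d. Bernoulli model."""
    total = 0.0
    for _ in range(n_mc):
        x, y = sample_xy()
        pred1 = (a + x.sum()) / (a + b + n)      # pi_Q(Y = 1 | x), the Bayes-optimal "action"
        total += -np.log(pred1 if y == 1 else 1.0 - pred1)
    return total / n_mc

for a in [0.01, 0.5, 1.0, 10.0]:
    print(f"Beta({a}, {a}): estimated risk = {meta_bayes_risk(a, a):.4f}")
```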
Meta-Bayesian Analysis

Recipe
◮ Step 1: State P and Q_θ, and select a loss function L.
◮ Step 2: Choose the prior π that minimizes the meta-Bayesian risk.

Examples
◮ Log loss: minimize the conditional relative entropy

   inf_π ∫ KL( P_2(x, ·) || πQ(· | x) ) P_1(dx),

  where P(dx, dy) = P_1(dx) P_2(x, dy).
◮ Quadratic loss: minimize the expected squared distance between the posterior means of πQ(· | x) and P_2(x, ·):

   inf_π ∫ ‖ m_{πQ}(x) − m_{P_2}(x) ‖_2^2 P_1(dx)
Meta-Bayesian Analysis

High-level Goals
◮ Meta-Bayesian analysis for Q_θ under P is generally no easier than doing Bayesian analysis under P directly.
◮ But P serves only as a placeholder for an impossible-to-express true belief.
◮ Our theoretical approach is to prove general theorems that hold for broad classes of “true beliefs” P.
◮ The hope is that this will tell us something deep about subjective Bayesianism.

The remaining slides present some key findings.
Meta-Bayesian 101: optimal prior depends on loss

data are coin tosses: 10001001100001000100100
◮ Model: i.i.d. Bernoulli(θ) sequence, unknown θ
◮ True prior belief: ˜π(dθ)

Problem Setting
◮ X = {0, 1}^n, Y = {0, 1}^k.
◮ P: [Bernoulli(θ)]^{n+k}, θ ∼ ˜π(dθ)
◮ Q_θ: [Bernoulli(θ)]^{n+k}, θ ∼ π(dθ)

Results from Meta-Bayesian Analysis
◮ Log loss: π should match the first n + k moments of ˜π.
The optimal prior usually depends on n and k!
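The log-loss result rests on the fact that, under the Bernoulli mixture, the probability of any sequence of n + k tosses depends on the prior only through its first n + k moments, since E[θ^s (1 − θ)^f] expands into moments of order at most s + f. The sketch below checks this numerically with a hypothetical Beta(2, 3) prior (an illustrative choice, not from the talk).

```python
from math import comb
from scipy.integrate import quad
from scipy.stats import beta

def beta_moment(a, b, r):
    """E[theta^r] for theta ~ Beta(a, b)."""
    out = 1.0
    for i in range(r):
        out *= (a + i) / (a + b + i)
    return out

def seq_prob_from_moments(s, f, moments):
    """P(a particular sequence with s ones and f zeros) = E[theta^s (1 - theta)^f],
    written using only prior moments up to order s + f."""
    return sum(comb(f, j) * (-1) ** j * moments[s + j] for j in range(f + 1))

a, b = 2.0, 3.0          # hypothetical true prior belief on theta
m = 5                    # n + k tosses in total
moments = [beta_moment(a, b, r) for r in range(m + 1)]

for s in range(m + 1):   # s ones and m - s zeros
    via_moments = seq_prob_from_moments(s, m - s, moments)
    direct, _ = quad(lambda t: t**s * (1 - t)**(m - s) * beta(a, b).pdf(t), 0, 1)
    print(s, round(via_moments, 6), round(direct, 6))   # the two columns agree
```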
Meta-Bayesian Analysis for i.i.d. Bernoulli Model

Example
◮ True belief P: two-state {0, 1} discrete Markov chain with transition matrix

      [ 1 − p     p   ]
      [   q     1 − q ]

◮ Model: Q_θ^k = Bernoulli(θ)^k.
◮ True prior belief: ˜ν(dp, dq) = ˜π(dθ) ˜κ(dψ | θ), where θ = p / (p + q) is the limiting relative frequency of 1’s (LRF).
What does a prior on an i.i.d. Bernoulli model mean?

Conjecture
The optimal prior for the model Q_θ^k is our true belief ˜π(dθ) on the LRF.

Theorem (Y.–Roy)
False.

Example for n = 1 and k = 1
◮ Sticky Markov chain: 0000001111111100000011111111
◮ i.i.d. model:        0010011101001011001001001001
◮ Beta(0.01, 0.01) is better.

[Figure: density f_π(θ) of the Beta(0.01, 0.01) prior on θ ∈ [0, 1].]
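For a symmetric sticky chain (taking p = q, so the LRF is 1/2 with certainty and the true prior on the LRF is a point mass at 1/2), the two risks in this n = 1, k = 1 example have closed forms under log loss. The sketch below uses a hypothetical flip probability of 0.1; the symmetry assumption and the numbers are illustrative, but they show why the extreme Beta(0.01, 0.01) prior beats the true belief about the LRF.

```python
import numpy as np

flip = 0.1          # hypothetical flip probability of the sticky chain (p = q = 0.1)
# Under the stationary symmetric chain: P(Y = X) = 1 - flip and P(Y != X) = flip.

def risk_beta(a):
    """Meta-Bayesian risk (expected log loss) of the surrogate prior Beta(a, a)
    when one toss X is observed and the next toss Y is predicted."""
    repeat = (a + 1) / (2 * a + 1)     # pi_Q(Y = X | X): predictive prob. of repeating
    switch = a / (2 * a + 1)           # pi_Q(Y != X | X): predictive prob. of switching
    return -((1 - flip) * np.log(repeat) + flip * np.log(switch))

risk_lrf = np.log(2)                   # point mass at theta = 1/2 predicts 50/50, whatever X is

print("risk of Beta(0.01, 0.01):", round(risk_beta(0.01), 3))   # ~0.47
print("risk of true LRF prior  :", round(risk_lrf, 3))          # ~0.69
```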
What does a prior on an i.i.d. Bernoulli model mean?

Theorem (Y.–Roy)
1. Let Q_θ^k be the i.i.d. Bernoulli model.
2. Let P be the true belief, and assume P believes in the LRF.
3. Let ˜π(dθ) be the true belief about the LRF, and assume ˜π is absolutely continuous.
4. Let π*_k = arg min_π R(P, π) be an optimal surrogate prior.
Then

   | KL(P^(k) || π*_k Q^k) − KL(P^(k) || ˜π Q^k) | → 0   as k → ∞,

where the first KL term is R(P, π*_k) and the second is R(P, ˜π).

True belief about the limiting relative frequency is an asymptotically optimal (surrogate) prior.
General Results when P is a mixture of i.i.d.

Theorem (Berk 1966)
The posterior distribution of θ concentrates asymptotically on the point minimizing the KL divergence.

[Figure: hand-drawn sketch of the posterior concentrating at the θ minimizing the KL divergence to Q_θ.]

Problem Setting
◮ P = ∫ ˜P_ψ ˜ν(dψ), where each ˜P_ψ is i.i.d.
◮ Let Ψ_θ be the set of ψ such that Q_θ is closest to ˜P_ψ.
◮ Define P_θ = ∫_{Ψ_θ} ˜P_ψ ˜ν(dψ | θ) and ˜π(dθ) = ∫_{Ψ_θ} ˜ν(dψ).
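Berk's result is easy to see numerically. In the running example, when the data come from a two-state Markov chain but the model is i.i.d. Bernoulli(θ), the posterior piles up at the KL-minimizing point, which is the limiting relative frequency p / (p + q). The simulation below is a sketch with hypothetical transition probabilities and a Beta(1, 1) prior, not an experiment from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

p, q = 0.3, 0.1                    # hypothetical transitions: 0 -> 1 w.p. p, 1 -> 0 w.p. q
theta_star = p / (p + q)           # KL-minimizer for the i.i.d. Bernoulli model = LRF of 1's

def simulate_chain(n):
    z = np.empty(n, dtype=int)
    z[0] = 0
    for t in range(1, n):
        flip = p if z[t - 1] == 0 else q
        z[t] = z[t - 1] if rng.random() > flip else 1 - z[t - 1]
    return z

a = b = 1.0                        # Beta(1, 1) surrogate prior on theta
for n in [100, 1_000, 10_000, 100_000]:
    s = simulate_chain(n).sum()
    alpha, beta_ = a + s, b + n - s              # posterior under the (misspecified) i.i.d. model
    mean = alpha / (alpha + beta_)
    sd = np.sqrt(alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))
    print(f"n={n:>6}  posterior mean={mean:.4f}  sd={sd:.4f}  theta*={theta_star:.4f}")
```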
General Results when P is a mixture of i.i.d.

Theorem (Y.–Roy)
1. P^(k) = ∫ P_θ^(k) ˜π(dθ), where each P_θ^(k) is i.i.d.
2. For every θ ∈ Θ, the point θ is the unique point in Θ achieving the infimum inf_{θ' ∈ Θ} KL( Q_{θ'}^(k) || P_θ^(k) ) for k = 1.
Then

   | KL(P^(k) || π*_k Q^k) − KL(P^(k) || ˜π Q^k) | → 0   as k → ∞,

where the first KL term is R(P, π*_k) and the second is R(P, ˜π).

True belief about the asymptotic “location” of the posterior distribution is an asymptotically optimal (surrogate) prior.
Conclusion and Future Work

Conclusion
◮ The standard definition of a (subjective) prior is too restrictive.
◮ A more useful definition comes from Bayesian decision theory.
◮ The meta-Bayesian prior is the one you believe will lead to the best results.

Future Work
◮ Beyond choosing priors: general meta-Bayesian analysis (optimal prediction algorithms)
◮ Analysis of the rationality of non-subjective procedures (e.g., switching, empirical Bayes)