Selecting explanatory variables with the modified version of Bayesian Information Criterion



  1. Selecting explanatory variables with the modified version of Bayesian Information Criterion
     - Małgorzata Bogdan, Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland
     - In cooperation with:
       J.K. Ghosh, R.W. Doerge, R. Cheng (Purdue University);
       A. Baierl, F. Frommlet, A. Futschik (Vienna University);
       A. Chakrabarti (Indian Statistical Institute);
       P. Biecek, A. Ochman, M. Żak (Wrocław University of Technology)
     - Vienna, 24/07/2008
     - Slide footer: Małgorzata Bogdan, Modified BIC

  5. Searching large databases
     - $Y$: the quantitative variable of interest (fruit size, survival time, process yield)
     - Aim: identify factors influencing $Y$
     - Properties of the database: the number of potential factors, $m$, may be much larger than the number of cases, $n$
     - Assumption of sparsity: only a small proportion of potential explanatory variables influences $Y$

  6. Specific application: locating Quantitative Trait Loci (QTL)

  11. Data for QTL mapping in backcross population and recombinant inbred lines
      - Only two genotypes are possible at a given locus
      - $X_{ij}$: dummy variable encoding the genotype of the $i$-th individual at locus $j$, with $X_{ij} \in \{-1/2, 1/2\}$
      - Multiple regression model:
        $$Y_i = \beta_0 + \sum_{j=1}^{m} \beta_j X_{ij} + \epsilon_i, \tag{0.1}$$
        where $i \in \{1, \ldots, n\}$ and $\epsilon_i \sim N(0, \sigma^2)$
      - Problem: estimation of the number of influential genes
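To make model (0.1) concrete, the following is a minimal sketch that simulates data from it. The function name `simulate_backcross`, the chosen effect sizes, and the assumption that loci are drawn independently (which ignores linkage between neighbouring markers) are illustrative choices, not part of the slides.

```python
import random

def simulate_backcross(n, m, effects, beta0=0.0, sigma=1.0, seed=0):
    """Simulate (X, Y) from Y_i = beta0 + sum_j beta_j X_ij + eps_i.

    effects maps a 0-based locus index j to its coefficient beta_j; every
    other locus has beta_j = 0, which mirrors the sparsity assumption.
    Assumption: genotypes at different loci are independent (no linkage).
    """
    rng = random.Random(seed)
    # Each genotype code is -1/2 or 1/2 with equal probability.
    X = [[rng.choice((-0.5, 0.5)) for _ in range(m)] for _ in range(n)]
    Y = [
        beta0 + sum(b * row[j] for j, b in effects.items()) + rng.gauss(0.0, sigma)
        for row in X
    ]
    return X, Y

# Hypothetical example: 200 individuals, 50 loci, two of which influence Y.
X, Y = simulate_backcross(n=200, m=50, effects={3: 1.0, 17: -0.8})
```

Only two of the 50 coefficients are non-zero here, so the task of a model selection criterion is to recover that number from `X` and `Y` alone.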

  17. Bayesian Information Criterion (1)
      - $M_i$: the $i$-th linear model, with $k_i < n$ regressors
      - $\theta_i = (\beta_0, \beta_1, \ldots, \beta_{k_i}, \sigma)$: vector of model parameters
      - Bayesian Information Criterion (Schwarz, 1978): choose the model maximizing
        $$\mathrm{BIC} = \log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2} k_i \log n$$
      - If $m$ is fixed, $n \to \infty$ and $X'X/n \to Q$, where $Q$ is a positive definite matrix, then BIC is consistent: the probability of choosing the proper model converges to 1.
      - When $n \ge 8$, BIC never chooses more regressors than AIC, and it is usually considered one of the most restrictive model selection criteria.
      - Surprise? Broman and Speed (JRSS, 2002) report that BIC overestimates the number of regressors when applied to QTL mapping.
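The BIC formula and the $n \ge 8$ remark can be checked with a small sketch. It assumes a Gaussian linear model whose maximised log-likelihood is written in terms of the residual sum of squares, and it writes AIC on the same "maximise" scale as $\log L - k$; the function names are illustrative.

```python
import math

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood, using sigma^2 estimated as rss / n."""
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def bic(rss, n, k):
    """BIC in the 'maximise' form of the slide: log L - (1/2) k log n."""
    return gaussian_loglik(rss, n) - 0.5 * k * math.log(n)

def aic(rss, n, k):
    """AIC rescaled to the same form (an assumption of this sketch): log L - k."""
    return gaussian_loglik(rss, n) - k

# Per-regressor penalties: BIC charges (1/2) log n, AIC charges 1.
# (1/2) log n exceeds 1 exactly when n > e^2 (about 7.39), i.e. from n = 8 on.
for n in (7, 8, 200):
    print(n, 0.5 * math.log(n) > 1.0)
```

From $n = 8$ onwards every extra regressor costs more under BIC than under AIC, which is why BIC cannot select the larger of two nested models when AIC prefers the smaller one.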

  21. Explanation - Bayesian roots of BIC (1)
      - $f(\theta_i)$: prior density of $\theta_i$; $\pi(M_i)$: prior probability of $M_i$
      - $m_i(Y) = \int L(Y \mid M_i, \theta_i)\, f(\theta_i)\, d\theta_i$: integrated likelihood of the data given the model $M_i$
      - Posterior probability of $M_i$: $P(M_i \mid Y) \propto m_i(Y)\, \pi(M_i)$
      - BIC neglects $\pi(M_i)$ and uses the approximation
        $$\log m_i(Y) \approx \log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2}(k_i + 2) \log n + R_i,$$
        where $R_i$ is bounded in $n$.

  24. Explanation - Bayesian roots of BIC (2)
      - Neglecting $\pi(M_i)$ ≡ assuming all the models have the same prior probability ≡ assigning a large prior probability to the event that the true model contains approximately $m/2$ regressors
      - For $m = 200$: 200 models with one regressor, $\binom{200}{2} = 19900$ models with two regressors, $\binom{200}{100} \approx 9 \times 10^{58}$ models with 100 regressors
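The model counts on this slide can be reproduced directly, together with the prior mass that a uniform prior over all $2^m$ models places on model sizes close to $m/2$. This is a sketch; the helper name `prior_mass_of_sizes` and the size window 90 to 110 are illustrative choices.

```python
import math

m = 200  # number of candidate regressors, as on the slide

# Number of distinct models containing exactly k of the m regressors.
counts = {k: math.comb(m, k) for k in (1, 2, 100)}

def prior_mass_of_sizes(m, lo, hi):
    """Fraction of the 2**m equally likely models whose size lies in [lo, hi]."""
    return sum(math.comb(m, k) for k in range(lo, hi + 1)) / 2 ** m

print(counts[1], counts[2], f"{counts[100]:.1e}")
print(prior_mass_of_sizes(m, 90, 110))  # mass on sizes near m/2
print(prior_mass_of_sizes(m, 0, 5))     # mass on sparse models
```

Under the uniform prior most of the mass sits on models with roughly 100 regressors, while sparse models receive a negligible share, which is the opposite of the sparsity assumption.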

  28. Modified version of BIC, mBIC (1)
      - M. Bogdan, J.K. Ghosh, R.W. Doerge, Genetics (2004)
      - Proposed solution: supplementing BIC with an informative prior distribution on the set of possible models, proposed in George and McCulloch (1993)
      - $p$: prior probability that a randomly chosen regressor influences $Y$
        $$\pi(M_i) = p^{k_i} (1 - p)^{m - k_i}$$
        $$\log \pi(M_i) = m \log(1 - p) - k_i \log\left(\frac{1 - p}{p}\right)$$
      - The modified version of BIC recommends choosing the model maximizing
        $$\log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2} k_i \log n - k_i \log\left(\frac{1 - p}{p}\right)$$
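The prior and the resulting criterion translate into a few lines of code. The sketch below assumes the maximised log-likelihood of each candidate model is already available; the function names and the numbers in the example are hypothetical.

```python
import math

def log_model_prior(k, m, p):
    """log pi(M_i) = k log p + (m - k) log(1 - p) for a model with k of m regressors."""
    return k * math.log(p) + (m - k) * math.log(1.0 - p)

def mbic(loglik, k, n, p):
    """mBIC as on the slide: log L - (1/2) k log n - k log((1 - p) / p)."""
    return loglik - 0.5 * k * math.log(n) - k * math.log((1.0 - p) / p)

# Hypothetical log-likelihoods for models with 1, 2 and 3 regressors.
n, p = 200, 0.02
for k, loglik in [(1, -310.0), (2, -305.0), (3, -303.5)]:
    print(k, round(mbic(loglik, k, n, p), 2))
```

The term $m \log(1 - p)$ of $\log \pi(M_i)$ is the same for every model, so it is dropped from the criterion. With $p = 1/2$ the extra penalty vanishes and mBIC reduces to BIC; any $p < 1/2$ penalises each additional regressor more heavily.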

  29. mBIC (2)
      - $c = mp$: expected number of true regressors
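As a worked step, substituting $p = c/m$ into the mBIC prior penalty expresses it through the expected number of true regressors. This follows algebraically from the definitions on the slides and is shown here only as a sketch of where the parametrization leads.

```latex
\log\left(\frac{1 - p}{p}\right)
  = \log\left(\frac{1 - c/m}{c/m}\right)
  = \log\left(\frac{m - c}{c}\right)
  = \log\left(\frac{m}{c} - 1\right),
\qquad\text{so}\qquad
\mathrm{mBIC} = \log L(Y \mid M_i, \hat{\theta}_i)
  - \frac{1}{2} k_i \log n
  - k_i \log\left(\frac{m}{c} - 1\right).
```

For fixed $c$, the per-regressor penalty grows with $\log m$, so the criterion becomes stricter as the number of candidate regressors increases.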
