Selecting explanatory variables with the modified version of Bayesian Information Criterion
Małgorzata Bogdan
Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland
in cooperation with
J.K. Ghosh, R.W. Doerge, R. Cheng – Purdue University
A. Baierl, F. Frommlet, A. Futschik – Vienna University
A. Chakrabarti – Indian Statistical Institute
P. Biecek, A. Ochman, M. Żak – Wrocław University of Technology
Vienna, 24/07/2008
Małgorzata Bogdan Modified BIC

Searching large data bases
$Y$ - the quantitative variable of interest (fruit size, survival time, process yield)
Aim – identify factors influencing $Y$
Properties of the data base – number of potential factors, $m$, may be much larger than the number of cases, $n$
Assumption of Sparsity - only a small proportion of potential explanatory variables influences $Y$

Specific application - Locating Quantitative Trait Loci

Data for QTL mapping in backcross population and recombinant inbred lines
Only two genotypes possible at a given locus
$X_{ij}$ - dummy variable encoding the genotype of the $i$-th individual at locus $j$, $X_{ij} \in \{-1/2, 1/2\}$
Multiple regression model:
$$Y_i = \beta_0 + \sum_{j=1}^{m} \beta_j X_{ij} + \epsilon_i, \tag{0.1}$$
where $i \in \{1, \ldots, n\}$ and $\epsilon_i \sim N(0, \sigma^2)$
Problem: estimation of the number of influential genes

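A minimal sketch of data following model (0.1) may help fix the notation. The function name `simulate_backcross`, the chosen effect sizes, and the independent sampling of markers are assumptions made for illustration only; real backcross markers are correlated through genetic linkage, which this sketch ignores.

```python
import numpy as np

def simulate_backcross(n, m, effects, sigma=1.0, beta0=0.0, seed=0):
    """Draw (X, y) from Y_i = beta0 + sum_j beta_j X_ij + eps_i.

    `effects` maps a marker index j to its coefficient beta_j; every other
    coefficient is zero, which mimics the sparsity assumption.
    Markers are sampled independently (an assumption: no linkage).
    """
    rng = np.random.default_rng(seed)
    # Genotype codes: 0/1 draws shifted to -1/2 and 1/2.
    X = rng.integers(0, 2, size=(n, m)) - 0.5
    beta = np.zeros(m)
    for j, b in effects.items():
        beta[j] = b
    # Gaussian noise with standard deviation sigma.
    y = beta0 + X @ beta + rng.normal(0.0, sigma, size=n)
    return X, y, beta

# n = 200 individuals, m = 500 markers, only two of which influence Y.
X, y, beta = simulate_backcross(n=200, m=500, effects={3: 1.0, 120: -0.8})
```

Here m exceeds n, so the full regression on all markers cannot be fitted by least squares, which is why a selection criterion over small submodels is needed.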
Bayesian Information Criterion (1)
$M_i$ - $i$-th linear model with $k_i < n$ regressors
$\theta_i = (\beta_0, \beta_1, \ldots, \beta_{k_i}, \sigma)$ - vector of model parameters
Bayesian Information Criterion (Schwarz, 1978) – maximize
$$\mathrm{BIC} = \log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2} k_i \log n$$
If $m$ is fixed, $n \to \infty$ and $X'X/n \to Q$, where $Q$ is a positive definite matrix, then BIC is consistent - the probability of choosing the proper model converges to 1.
When $n \ge 8$, BIC never chooses more regressors than AIC and is usually considered one of the most restrictive model selection criteria.
Surprise? Broman and Speed (JRSS, 2002) report that BIC overestimates the number of regressors when applied to QTL mapping.

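As a rough illustration, the criterion can be computed for a Gaussian linear model as follows. This is a sketch under stated assumptions: the helper names `max_log_likelihood`, `bic` and `aic` are hypothetical, the variance is estimated by its maximum likelihood value RSS/n, and AIC is written on the same "maximize" scale as BIC above (log-likelihood minus k).

```python
import numpy as np

def max_log_likelihood(X, y):
    """Maximized Gaussian log-likelihood of a linear model with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    # Plugging the ML variance estimate rss / n into the normal log-density.
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)

def bic(X, y):
    """BIC on the 'maximize' scale: log L - (1/2) k log n."""
    n, k = X.shape
    return max_log_likelihood(X, y) - 0.5 * k * np.log(n)

def aic(X, y):
    """AIC on the same scale: log L - k."""
    k = X.shape[1]
    return max_log_likelihood(X, y) - k

# Small synthetic example (made-up data, for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 2, size=(50, 3)) - 0.5
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(size=50)
print(bic(X_demo, y_demo), aic(X_demo, y_demo))
```

The per-regressor penalty of BIC, (1/2) log n, exceeds the AIC penalty of 1 exactly when n > e² ≈ 7.39, which is where the threshold n ≥ 8 comes from.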
Explanation - Bayesian roots of BIC (1)
$f(\theta_i)$ – prior density of $\theta_i$, $\pi(M_i)$ – prior probability of $M_i$
$$m_i(Y) = \int L(Y \mid M_i, \theta_i)\, f(\theta_i)\, d\theta_i$$
– integrated likelihood of the data given the model $M_i$
posterior probability of $M_i$: $P(M_i \mid Y) \propto m_i(Y)\, \pi(M_i)$
BIC neglects $\pi(M_i)$ and uses the approximation
$$\log m_i(Y) \approx \log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2}(k_i + 2) \log n + R_i,$$
where $R_i$ is bounded in $n$.

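The proportionality for the posterior model probability can be turned into actual probabilities by normalizing over the candidate models. The sketch below assumes the log integrated likelihoods are already available (the numbers are made up) and uses a hypothetical helper name, `posterior_model_probs`; it works on the log scale to avoid underflow.

```python
import numpy as np

def posterior_model_probs(log_marginals, log_priors):
    """Normalize P(M_i | Y) proportional to m_i(Y) * pi(M_i) on the log scale."""
    log_post = np.asarray(log_marginals, dtype=float) + np.asarray(log_priors, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    log_post = log_post - log_post.max()
    weights = np.exp(log_post)
    return weights / weights.sum()

# Three candidate models with equal prior probability (illustrative values).
probs = posterior_model_probs([-120.0, -118.5, -119.2], np.log([1 / 3, 1 / 3, 1 / 3]))
print(probs)
```

With equal prior probabilities the ranking is driven by the integrated likelihood alone, which is the situation BIC implicitly assumes.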
Explanation - Bayesian roots of BIC (2)
neglecting $\pi(M_i)$ ≡ assuming all the models have the same prior probability ≡ assigning a large prior probability to the event that the true model contains approximately $m/2$ regressors
$m = 200$: 200 models with one regressor, $\binom{200}{2} = 19900$ models with two regressors, $\binom{200}{100} \approx 9 \times 10^{58}$ models with 100 regressors

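The counts for m = 200 can be checked directly, and the same arithmetic shows how much prior mass a uniform prior over models places near m/2. The window of 90 to 110 regressors below is an arbitrary choice made for illustration.

```python
from math import comb

m = 200

# Number of models with exactly k regressors.
for k in (1, 2, 100):
    print(k, comb(m, k))

# Under equal prior probability for each of the 2**m models, the prior
# probability of a model size k is comb(m, k) / 2**m, i.e. Binomial(m, 1/2).
share_near_half = sum(comb(m, k) for k in range(90, 111)) / 2 ** m
share_small = sum(comb(m, k) for k in range(0, 11)) / 2 ** m
print(share_near_half, share_small)
```

Most of the prior mass sits on models with roughly 100 regressors, while models with at most 10 regressors receive a negligible share, which is the opposite of what the sparsity assumption suggests.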
Modified version of BIC, mBIC (1)
M. Bogdan, J.K. Ghosh, R.W. Doerge, Genetics (2004)
Proposed solution - supplementing BIC with an informative prior distribution on the set of possible models, proposed in George and McCulloch (1993)
$p$ - prior probability that a randomly chosen regressor influences $Y$
$$\pi(M_i) = p^{k_i} (1 - p)^{m - k_i}$$
$$\log \pi(M_i) = m \log(1 - p) - k_i \log\left(\frac{1 - p}{p}\right)$$
Modified version of BIC recommends choosing the model maximizing
$$\log L(Y \mid M_i, \hat{\theta}_i) - \frac{1}{2} k_i \log n - k_i \log\left(\frac{1 - p}{p}\right)$$

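A small sketch of the criterion follows, assuming the maximized log-likelihood of each candidate model has already been computed. The function names `log_model_prior` and `mbic` and the numerical inputs are hypothetical. The term m log(1 − p) is the same for every model, so it is dropped from the criterion.

```python
import math

def log_model_prior(k, m, p):
    """log pi(M) for a model with k of m regressors, each included with probability p."""
    return k * math.log(p) + (m - k) * math.log(1.0 - p)

def mbic(log_lik, k, n, p):
    """log L - (1/2) k log n - k log((1 - p) / p); larger values are better."""
    return log_lik - 0.5 * k * math.log(n) - k * math.log((1.0 - p) / p)

# Illustrative values: n = 200 cases, m = 500 candidate regressors, p = 0.01.
n, m, p = 200, 500, 0.01
small = mbic(log_lik=-310.0, k=2, n=n, p=p)
large = mbic(log_lik=-300.0, k=6, n=n, p=p)
print(small, large)
```

With p = 1/2 the extra term vanishes and the criterion reduces to the standard BIC; for small p each additional regressor must raise the log-likelihood by considerably more before the larger model is preferred.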
mBIC (2)
$c = mp$ - expected number of true regressors
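As a short worked step, the prior-dependent penalty can be rewritten in terms of $c$. Under the prior above each of the $m$ regressors is included independently with probability $p$, so the model size is binomial with mean $mp = c$. Substituting $p = c/m$ (assuming $0 < c < m$) gives:

```latex
\[
\frac{1-p}{p} = \frac{1 - c/m}{c/m} = \frac{m - c}{c} = \frac{m}{c} - 1,
\qquad\text{so}\qquad
k_i \log\!\left(\frac{1-p}{p}\right) = k_i \log\!\left(\frac{m}{c} - 1\right).
\]
```

Written this way, the penalty grows with the number of candidate regressors $m$ for a fixed expected number $c$ of true ones.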