
Mixture models of truncated data for estimating the number of species



  1. Mixture models of truncated data for estimating the number of species.
     Li-Thiao-Té Sébastien, Jean-Jacques Daudin, Stéphane Robin
     Equipe Statistique et Génome, UMR 518 AgroParisTech / INRA MIA
     19th COMPSTAT symposium, 23rd August 2010
     Outline: Context; Mixture models; Estimation; Application to Metagenomics

  2. Context
     Situation: individuals are sampled from a population, then classified into species.
     - Goal 1: estimate the species abundance distribution
     - Goal 2: estimate the number of species with no sampled individual
     Applications:
     - ecological surveys: number of species of butterflies [Fisher et al., 1943]
     - metagenomics (our interest: large number of unobserved species, large datasets)
     - other: number of words in a language, number of unreported drug addicts

  3. Example
     Observations:

     | Species               | A  | B   | C  | D   | E | ... |
     | Number of individuals | 10 | 430 | 10 | 289 | 3 | ... |

     Species abundance distribution:

     | Number of individuals | 1   | 2   | 3  | 4  | 5  | ... |
     | Number of species     | 513 | 149 | 65 | 34 | 24 | ... |

     [Figure: frequency/count data, shown in two panels. Number of observations (log scale) against abundance class (0 to 25, from rare species to abundant species). One panel is labelled "Species Abundance Distribution" and annotated "nb of missing species: 2121".]
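
The step from the first table to the second is a "count of counts". A minimal sketch, assuming the per-species counts are held in a dictionary; only the values for species A to E come from the slide, the remaining entries and the function name `abundance_distribution` are made up for illustration:

```python
from collections import Counter

# Per-species counts. Species A to E are taken from the slide;
# F, G and H are hypothetical entries added to make the example less trivial.
individuals_per_species = {
    "A": 10, "B": 430, "C": 10, "D": 289, "E": 3,
    "F": 1, "G": 1, "H": 3,
}


def abundance_distribution(counts):
    """Map each abundance k to the number of species observed exactly k times."""
    return dict(sorted(Counter(counts.values()).items()))


sad = abundance_distribution(individuals_per_species)
print(sad)  # {1: 2, 3: 2, 10: 2, 289: 1, 430: 1}
```

Species with zero sampled individuals never appear in the dictionary, which is why the abundance distribution has no entry for 0 and the data are truncated.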

  4. Sampling model
     Species abundance distribution:
     $$f(\lambda) = \sum_{q} \alpha_q f_q(\lambda)$$
     $X_i$ individuals are observed for species $i$ conditionally on its abundance $\lambda_i$ (Poisson distributed number):
     $$f(X_i \mid \lambda_i) = e^{-\lambda_i} \frac{\lambda_i^{X_i}}{X_i!}$$
     Only positive numbers of individuals are recorded in the data set:
     $$f(X^+ \mid \lambda_i) = f(X_i \mid \lambda_i, X_i > 0)$$
     Truncated model ($\vartheta = \{\alpha_q, \pi_q\}$):
     $$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$
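
To make the truncated density concrete, here is a minimal sketch. It assumes geometric components of the form $f_q(x, \pi_q) = (1 - \pi_q)\,\pi_q^{x}$ on $x = 0, 1, 2, \dots$ (the slides name geometric components but do not spell out the parameterisation), and the parameter values are hypothetical. The last function is the usual plug-in estimate of the number of unobserved species, $n\,p_0 / (1 - p_0)$, where $p_0$ is the untruncated probability of a zero count; it is shown as an illustration, not as the authors' estimator.

```python
def geometric_pmf(x, pi):
    """Assumed component: P(X = x) = (1 - pi) * pi**x for x = 0, 1, 2, ..."""
    return (1.0 - pi) * pi ** x


def mixture_pmf(x, alphas, pis):
    """Untruncated mixture: sum_q alpha_q * f_q(x, pi_q)."""
    return sum(a * geometric_pmf(x, p) for a, p in zip(alphas, pis))


def truncated_mixture_pmf(x, alphas, pis):
    """Zero-truncated mixture: f(x) / (1 - f(0)) for x >= 1, and 0 otherwise."""
    if x < 1:
        return 0.0
    return mixture_pmf(x, alphas, pis) / (1.0 - mixture_pmf(0, alphas, pis))


def expected_unobserved(n_observed, alphas, pis):
    """Plug-in estimate of the number of species with no sampled individual."""
    p0 = mixture_pmf(0, alphas, pis)
    return n_observed * p0 / (1.0 - p0)


# Hypothetical two-component parameters.
alphas = [0.7, 0.3]
pis = [0.5, 0.9]

total_mass = sum(truncated_mixture_pmf(x, alphas, pis) for x in range(1, 2000))
missing = expected_unobserved(62, alphas, pis)
print(total_mass, missing)
```

With these values the probability of a zero count is 0.7 × 0.5 + 0.3 × 0.1 = 0.38, so 62 observed species would correspond to about 38 unobserved ones.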

  5. Bayesian model
     $$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$
     A priori:
     $$\alpha \sim \mathrm{Dirichlet}(\vec{a}), \qquad \pi_q \sim \mathrm{Beta}(b_q, c_q), \qquad Z \sim \mathrm{Multinom}(\vec{a}), \qquad X \mid Z \sim \mathrm{Geom}(\pi_q)$$
     Approximate a posteriori distribution:
     $$\alpha \mid X \sim \mathrm{Dirichlet}(\tilde{a}), \qquad \pi_q \mid X \sim \mathrm{Beta}(\tilde{b}_q, \tilde{c}_q)$$
     The hyperparameters $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ provide an approximation of the a posteriori distribution and hence credibility intervals.
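
The latent-variable structure of this model (a component label $Z$, then a geometric count given $Z$, then truncation at zero) can be simulated in a few lines. This is only a sketch under stated assumptions: the weights and the $\pi_q$ are held fixed at hypothetical values rather than drawn from their priors, labels are drawn with the mixture weights $\alpha$, the geometric distribution is taken on $\{0, 1, \dots\}$ with $P(X = x) = (1 - \pi)\pi^x$, and the function name is made up.

```python
import math
import random


def sample_truncated_counts(n_species, alphas, pis, seed=0):
    """Simulate counts for n_species species and keep only the positive ones."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_species):
        # Component label drawn with the mixture weights (assumption).
        q = rng.choices(range(len(alphas)), weights=alphas)[0]
        # Inverse-transform sampling: P(X >= k) = pi**k, so X is geometric on {0, 1, ...}.
        u = 1.0 - rng.random()  # uniform on (0, 1]
        x = int(math.log(u) / math.log(pis[q]))
        if x > 0:  # species with zero individuals are never recorded
            observed.append(x)
    return observed


counts = sample_truncated_counts(20000, [0.7, 0.3], [0.5, 0.9])
print(len(counts), min(counts))
```

The number of species dropped by the truncation step is exactly the quantity that Goal 2 asks for, which the simulation makes visible because the total `n_species` is known here.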

  6. Variational framework
     Theorem. The log-likelihood can be decomposed into:
     $$\log P(X) = \log \iint P(X, Z, \vartheta)\, dZ\, d\vartheta = F(X, Q) + KL(Q, P(\cdot \mid X))$$
     where
     $$F(X, Q) := \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)}\, dZ\, d\vartheta.$$
     Consequently:
     - $\log P(X) \geq F(X, Q)$
     - if $Q = \arg\max F(X, Q)$ then $Q = \arg\min KL(Q, P(\cdot \mid X))$
     - $\arg\max F(X, Q) = P(Z, \vartheta \mid X)$
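
The decomposition can be checked numerically on a toy case. The sketch below assumes a single discrete latent variable with two states and no continuous parameter, so the integrals become sums; the joint probabilities and the trial distribution `q` are hypothetical numbers.

```python
import math

# Hypothetical joint probabilities P(X = x_obs, Z = z) for z in {0, 1}.
joint = [0.3, 0.1]
evidence = sum(joint)                      # P(X)
posterior = [p / evidence for p in joint]  # P(Z | X)

# An arbitrary trial distribution Q(Z).
q = [0.6, 0.4]

# Lower bound F(X, Q) = sum_z Q(z) * log(P(X, z) / Q(z)).
lower_bound = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))

# KL(Q, P(. | X)) = sum_z Q(z) * log(Q(z) / P(z | X)).
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print(math.log(evidence), lower_bound + kl)

# Choosing Q equal to the posterior makes the bound tight.
tight_bound = sum(pz * math.log(jz / pz) for pz, jz in zip(posterior, joint))
```

Because the KL term is non-negative, `lower_bound` never exceeds `math.log(evidence)`, and the gap closes when `q` equals the posterior.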

  7. VB-EM algorithm
     Application of [Beal and Ghahramani, 2003] leads to the following update formulae:
     $$a_q^{(n+1)} = a_q^{0} + \sum_i \tau_{iq}^{(n)}$$
     $$b_q^{(n+1)} = b_q^{0} + \sum_i \tau_{iq}^{(n)} (X_i - 1)$$
     $$c_q^{(n+1)} = c_q^{0} + \sum_i \tau_{iq}^{(n)}$$
     where $\tau_{iq}^{(n)} = P_{Q_{Z_i}}(Z_i = q)$.
     Consequences:
     - approximate posterior distribution
     - approximate non-asymptotic credibility intervals
     - proposal distribution for importance sampling
     [Figure: two density plots comparing VB-EM, VB-EM with importance sampling (IS) and MCMC: the density of a parameter (x axis 0.35 to 0.60) and the density of the total number of species (x axis 4800 to 5400).]
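
The three update formulae translate directly into code. The sketch below covers only this hyperparameter update, given responsibilities $\tau_{iq}$ that are assumed to be already available; the step that computes $\tau$ is not shown on the slide and is not reproduced here. The function name, the data and the prior values are hypothetical.

```python
def vbem_hyperparameter_update(counts, tau, a0, b0, c0):
    """One update of (a, b, c) from responsibilities tau[i][q] and counts X_i >= 1."""
    n_components = len(a0)
    a = [a0[q] + sum(t[q] for t in tau) for q in range(n_components)]
    b = [
        b0[q] + sum(t[q] * (x - 1) for t, x in zip(tau, counts))
        for q in range(n_components)
    ]
    c = [c0[q] + sum(t[q] for t in tau) for q in range(n_components)]
    return a, b, c


# Hypothetical example: three species, two components, flat priors.
counts = [1, 2, 5]
tau = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # each row sums to 1
a, b, c = vbem_hyperparameter_update(counts, tau, [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
print(a, b, c)
```

Each observation contributes to every component in proportion to its responsibility, so singletons ($X_i = 1$) add nothing to $b_q$ while still adding to $a_q$ and $c_q$.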

  8. Bayesian Model Averaging
     Let $M$ denote the (random) number of components in the mixture model. Then the BMA model is
     $$f_{BMA} = \sum_m P(M = m \mid X)\, f_m$$
     where $f_m$ is the posterior density of the observations given a model with $m$ components.
     The weights $P(M = m \mid X)$ can be computed from the Bayes formula:
     $$P(M = m \mid X) \propto P(X \mid M = m)\, P(M = m)$$
     where $P(M)$ is the a priori distribution on $M$.
     The evidence $P(X \mid M)$ is hard to compute in general; the VB-EM algorithm provides the approximation
     $$\log P(X) \approx F(X, Q) = \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)}\, dZ\, d\vartheta$$
     where the error term $KL(Q, P(\cdot \mid X))$ has been neglected.
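
Once a lower bound $F$ is available for each candidate number of components, the weights follow by normalising $\exp(F_m)\,P(M = m)$. A minimal sketch, assuming a uniform prior on $M$ unless one is supplied and using the log-sum-exp trick to avoid underflow; the function name and the bound values are hypothetical.

```python
import math


def bma_weights(lower_bounds, log_prior=None):
    """Approximate P(M = m | X) from per-model lower bounds on log P(X | M = m)."""
    if log_prior is None:
        log_prior = [0.0] * len(lower_bounds)  # uniform prior on M (assumption)
    scores = [f + lp for f, lp in zip(lower_bounds, log_prior)]
    top = max(scores)  # subtract the maximum before exponentiating
    unnormalised = [math.exp(s - top) for s in scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical lower bounds for models with 1 to 5 components.
bounds = [-1052.3, -1040.1, -1039.8, -1041.5, -1043.0]
weights = bma_weights(bounds)
print(weights)
```

Exponentiating the raw bounds directly would underflow to zero for values of this magnitude, which is why the maximum is subtracted first.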

  9. Bayesian Model Averaging (example)
     [Figure: BMA weights and P(X|M) estimated by importance sampling (IS), plotted against the number of components (1 to 5).]

  10. Metagenomics
     - High-throughput DNA sequencing
     - Complex environmental samples: soil, seawater, intestine microflora

  11. Real dataset example
     Model fit to the data (human gut microbiota [Tap et al., 2009]).
     [Figure: density (log scale) against species abundance (0 to 60), showing the data, the fitted models with 1 to 5 components, and the BMA fit.]

  12. Real dataset example
     Estimated number of species and approximate posterior distributions.
     [Figure: density against number of species (0 to 30000) for the models with 1 to 5 components and for BMA.]

  13. References
     - Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7, pp. 453–464.
     - Fisher, R., Corbet, A., and Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1):42–58.
     - Tap, J., Mondot, S., Levenez, F., Pelletier, E., Caron, C., Furet, J., Ugarte, E., Muñoz-Tamayo, R., Paslier, D., Nalin, R., et al. (2009). Towards the human intestinal microbiota phylogenetic core. Environmental Microbiology, 11(10):2574–2584.
