

  1. Is the best model good enough? Assessing the absolute fit of phylogenetic models via posterior predictive sampling Gerhard Jäger Tübingen University Workshop Computational and phylogenetic historical linguistics ICHL24, Canberra, July 4, 2019

  2. Introduction

  3. “What I cannot create, I do not understand” (Feynman)

  4. Motivation
  • Bayesian model comparison (BF, WAIC, LOOIC, DIC, ...) compares models:
    • tells us which model, out of a pre-defined collection, best explains the data
    • does not tell us how plausible it is that the data were generated by a process akin to the one specified by the model
  • Posterior predictive sampling simulates possible data from the posterior distribution, after fitting the model to the observed data:
    • tells us what the data might have been, provided the model accurately represents our state of knowledge

  5. Workflow
  [diagram: data → model → posterior → posterior predictive simulation]

  6. Example: regression

  7. Toy example

  8. An example
  Suppose you observe a roulette wheel 20 times, and it comes up with the following sequence of colors:
  BBBBBBBBBRRRRRRRRBBB
  How can you model the wheel’s behavior?

  9. An example
  • 8 times red and 12 times black
  • maximum likelihood estimate: P(R) = 0.4
  • straightforward Bayesian analysis:
    • prior distribution over P(R): uniform
    • posterior distribution: Beta(9, 13)
  [plot: posterior probability of P(R)]
  The model assumes that the wheel has no memory! Should we believe this?
  BBBBBBBBBRRRRRRRRBBB
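  A minimal sketch of this conjugate update (variable names are illustrative; SciPy's beta is the only dependency):

```python
from scipy.stats import beta

n_red, n_black = 8, 12                     # counts from the observed sequence
posterior = beta(1 + n_red, 1 + n_black)   # uniform Beta(1, 1) prior -> Beta(9, 13)

print(round(posterior.mean(), 3))   # ~0.409 (vs. the ML estimate 0.4)
print(posterior.interval(0.95))     # central 95% credible interval for P(R)
```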

  10. An example
  • the observed sequence contains only 2 changes of color between subsequent draws
  • how many such changes should we expect within 20 draws if
    • the wheel is memory-less,
    • our prior belief over P(R) is uniform, and
    • we have observed the sequence above?

  11. An example
  [plot: posterior predictive distribution]

  12. An example
  • posterior distribution of model parameters sampled via MCMC
  • posterior predictive distribution sampled by repeatedly
    1. drawing a sample of parameter values from the posterior
    2. simulating mock data according to the generative model, using the parameters from the previous step
  • the mock data can be used to sample from the posterior predictive distribution of some summary statistic (such as the number of color changes); see the sketch below
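  Below is a sketch of this two-step recipe for the wheel example, assuming the Beta(9, 13) posterior from above; simulation size and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def n_changes(seq):
    """Number of color changes between subsequent draws."""
    return int(np.sum(seq[1:] != seq[:-1]))

n_sims, n_draws = 10_000, 20
stats = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    p_red = rng.beta(9, 13)              # 1. draw a parameter from the posterior
    mock = rng.random(n_draws) < p_red   # 2. simulate mock data (True = red)
    stats[i] = n_changes(mock)           # summary statistic of the mock data

observed = 2                             # changes in BBBBBBBBBRRRRRRRRBBB
print(np.mean(stats <= observed))        # tail probability of the observed value
```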

  13. An example
  [bar plot: posterior probability vs. number of changes (0 to 15)]

  14. Better model
  [diagram: two-state Markov chain over red/black, with initial-state probabilities π0, π1 and transition parameters α, β]
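  A sketch of a simulator for such a two-state Markov model; the parameterization (α = probability of switching after red, β = probability of switching after black) is my reading of the diagram, not necessarily the slide's exact convention:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(pi_red, alpha, beta, n=20):
    """Simulate a color sequence (True = red) from the two-state chain."""
    seq = np.empty(n, dtype=bool)
    seq[0] = rng.random() < pi_red           # initial state ~ (pi_red, 1 - pi_red)
    for t in range(1, n):
        p_switch = alpha if seq[t - 1] else beta
        seq[t] = seq[t - 1] ^ (rng.random() < p_switch)
    return seq

# A posterior predictive check would pair MCMC draws of (pi_red, alpha, beta)
# with this simulator, exactly as in the recipe of slide 12.
print(''.join('R' if r else 'B' for r in simulate(0.5, 0.1, 0.1)))
```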

  15. Better model
  [bar plot: posterior probability vs. number of changes (0 to 15)]

  16. Sound inventories and population size

  17. Sound inventories and population size
  [scatter plot: number of phonemes (Phoible) vs. number of speakers]

  18. Sound inventories and population size
  [scatter plot: number of phonemes (10 to 100, log scale) vs. population (1e+02 to 1e+08), colored by area: Africa, Eurasia, NorthAmerica, Oceania, SouthAmerica]

  19. Sound inventories and population size
  • Hay and Bauer (2007) (a.o.): positive correlation of population size with sound inventory size
  • Phoible data: correlation = 0.37
  • debunked by Moran et al. (2012)

  20. Phylogenetically controlled regression
  [tree figure: world-wide sample of languages, family-level trees attached to a common root with depth τ]
  • tree estimation above established language families is unreliable:
    • very high phylogenetic uncertainty
    • branches tend to be much too short
  • compromise used here:
    • infer trees for individual language families
    • connect them with a rake-shaped “proto-world” root with unknown depth
    • isolates are connected directly to the root
    • total tree depth τ is estimated from data

  21. [plots: mean sound inventory size (in standard deviations); without phylogenetic control, DIC: 4132; with phylogenetic control, DIC: 3475]

  22. Posterior predictive check: correlations under null models
  [plot: distribution of the correlation with log-population size (−0.2 to 0.2) under the naive and the phylogenetic null model]
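  The check itself can be sketched as follows, with synthetic placeholders standing in for the Phoible data and the fitted null-model parameters; the phylogenetic variant would replace the iid draws with a multivariate normal whose covariance is derived from the tree:

```python
import numpy as np

rng = np.random.default_rng(2)

log_pop = rng.normal(10, 3, size=200)   # placeholder for the Phoible languages
mu, sigma = 0.0, 1.0                    # placeholder fitted null-model parameters

corrs = np.empty(2000)
for i in range(corrs.size):
    sim = rng.normal(mu, sigma, size=log_pop.size)   # naive null: iid inventory sizes
    corrs[i] = np.corrcoef(log_pop, sim)[0, 1]

# Under the naive null the simulated correlations hug 0, so an observed 0.37
# looks extreme; a phylogenetic null typically yields a wider reference
# distribution, as in the slide's figure.
print(np.quantile(corrs, [0.025, 0.975]))
```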

  23. Applying PPS to phylogenetic inference

  24. Case study
  • data: IELex
  • 30 randomly sampled languages
  • binarized cognate classes
  [tree figure: Urdu, Hindi, Bihari, Nepali, Marwari, Gujarati, Marathi, Pashto, Shughni, Tajik, Persian, Armenian_Eastern, Latvian, Macedonian, Sorbian_Lower, Ukrainian, Russian, Danish, Swedish, Faroese, Luxembourgish, Afrikaans, Flemish, Welsh, Romanian, Portuguese, Spanish, French, Dolomite_Ladino, Romansh]
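  Binarization turns each cognate class into one presence/absence character per language; a toy sketch with invented data (not IELex itself):

```python
# toy cognate-class assignments: language -> {meaning: class}
cognates = {
    "Hindi":  {"hand": "A", "water": "X"},
    "Nepali": {"hand": "A", "water": "Y"},
    "French": {"hand": "B", "water": "Y"},
}

# every attested (meaning, class) pair becomes one binary character
chars = sorted({(m, c) for v in cognates.values() for m, c in v.items()})
matrix = {lang: [int(v.get(m) == c) for (m, c) in chars]
          for lang, v in cognates.items()}

for lang, row in matrix.items():
    print(lang, row)
```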

  25. Summary statistics of interest: Retention Index
  • instead of comparing y with ŷ, I will compare the distribution of (y, θ) with that of (ŷ, θ)
  • if the model is credible, we expect strong overlap between the distributions
  • summary statistic to be used: Retention Index
  • the most parsimonious reconstruction gives the minimal number of mutations, given a phylogeny
  [tree examples, six taxa with states A, A, B, B, C, C: reconstructions with 2, 3, and 4 mutations; a parsimony counter is sketched below]
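  A minimal Fitch-style counter for one character, matching the "minimal number of mutations, given a phylogeny" used in these examples; the tree encoding is ad hoc:

```python
def fitch(tree, states):
    """Minimal number of mutations for one character on a rooted binary tree.
    `tree` is a nested tuple of leaf names; `states` maps leaf -> state."""
    def post(node):
        if isinstance(node, str):            # leaf
            return {states[node]}, 0
        (ls, lc), (rs, rc) = post(node[0]), post(node[1])
        inter = ls & rs
        if inter:
            return inter, lc + rc            # children agree: no mutation needed
        return ls | rs, lc + rc + 1          # children disagree: count a mutation
    return post(tree)[1]

tree = ((("t1", "t2"), "t3"), (("t4", "t5"), "t6"))
states = {"t1": "B", "t2": "B", "t3": "C", "t4": "A", "t5": "A", "t6": "C"}
print(fitch(tree, states))   # 2 mutations: the minimum for three states
```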

  29. Retention Index
  • minimal number of mutations: number of states − 1
  • maximal number of mutations: number of taxa − number of occurrences of the most frequent state
  • number of avoidable mutations: maximal number of mutations − minimal number of mutations
  • number of mutations avoided in T: maximal number of mutations − (minimal) number of mutations in T
  • Retention Index (RI) of a tree T:
    RI(T) = (number of mutations avoided in T) / (number of avoidable mutations)
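  A direct transcription of these definitions (the helper name is mine), using the state counts of the six-taxon examples that follow:

```python
from collections import Counter

def retention_index(state_counts, tree_mut):
    """RI from the slide: avoided mutations / avoidable mutations."""
    min_mut = len(state_counts) - 1                                   # states - 1
    max_mut = sum(state_counts.values()) - max(state_counts.values()) # taxa - max freq
    if max_mut == min_mut:                   # parsimony-uninformative character
        return float("nan")
    return (max_mut - tree_mut) / (max_mut - min_mut)

# the three tree examples below use 2, 3, and 4 mutations
counts = Counter({"A": 2, "B": 2, "C": 2})
for m in (2, 3, 4):
    print(m, retention_index(counts, m))   # 1.0, 0.5, 0.0
```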

  30. Retention Index: RI = 1
  [tree example: 2 mutations]

  31. Retention Index: RI = 1/2
  [tree example: 3 mutations]

  32. Retention Index: RI = 0
  [tree example: 4 mutations]

  33. Model 1: CTMC
  [diagram: two-state (0/1) process with rates α and β]
  • Γ-distributed rates
  • relaxed clock
  • uniform tree prior
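  For a two-state CTMC, finite-time transition probabilities follow from the rate matrix via the matrix exponential; a sketch, assuming α is the 0-to-1 rate and β the 1-to-0 rate (the rate values are arbitrary):

```python
import numpy as np
from scipy.linalg import expm

alpha, beta, t = 0.3, 0.5, 1.0
Q = np.array([[-alpha, alpha],     # state 0: leaves at rate alpha
              [beta,  -beta]])     # state 1: leaves at rate beta
P = expm(Q * t)                    # P[i, j] = P(state j at time t | state i at 0)
print(P)
print(P.sum(axis=1))               # each row sums to 1
```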

  34. Model 1: CTMC
  [density plot: empirical vs. simulated Retention Index (0.45 to 0.75)]
  marginal log-density: −14391

  35. Model 2: CTMC + ascertainment bias correction
  [density plot: empirical vs. simulated Retention Index (0.75 to 0.85)]
  marginal log-density: −12862
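  The standard correction for binary cognate data conditions each character's likelihood on the character being observable at all, i.e. divides by 1 minus the probability of the all-absent pattern. A sketch with placeholder likelihood values (the real ones would come from the pruning algorithm):

```python
import numpy as np

def corrected_log_lik(log_lik_sites, log_lik_all_absent):
    """log L = sum_i log L_i  -  n * log(1 - L(all-absent))."""
    log_denom = np.log1p(-np.exp(log_lik_all_absent))
    return np.sum(log_lik_sites) - len(log_lik_sites) * log_denom

# placeholder per-site likelihoods and all-absent-pattern likelihood
print(corrected_log_lik(np.log([0.02, 0.05]), np.log(0.4)))
```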

  36. Model 3: Covarion + ascertainment bias correction
  [diagram: two-state (0/1) substitution process with rates α, β, switched on and off with rates γ, δ]
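  A sketch of one common covarion parameterization, assuming the 0/1 process (rates α, β) runs only in the "on" state while γ and δ switch between "off" and "on"; the slide's exact rate assignments may differ:

```python
import numpy as np

def covarion_Q(alpha, beta, gamma, delta):
    # hidden states ordered (0, on), (1, on), (0, off), (1, off)
    Q = np.array([
        [0.0,   alpha, delta, 0.0  ],   # (0, on): substitute or switch off
        [beta,  0.0,   0.0,   delta],   # (1, on): substitute or switch off
        [gamma, 0.0,   0.0,   0.0  ],   # (0, off): can only switch back on
        [0.0,   gamma, 0.0,   0.0  ],   # (1, off): can only switch back on
    ])
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of a generator sum to 0
    return Q

print(covarion_Q(0.3, 0.5, 0.1, 0.2))
```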

  37. Model 3: Covarion + ascertainment bias correction
  [density plot: empirical vs. simulated Retention Index (0.68 to 0.74)]
  marginal log-density: −12983
