2017-07-29

part 4: phenomenological load and biological inference

phenomenological load
review types of models

phenomenological (Newton): $F = -\dfrac{G m_1 m_2}{r^2}$
mechanistic (Einstein): $G_{\alpha\beta} = 8\pi T_{\alpha\beta}$

phenomenological load
molecular evolution is process and pattern

process: "MutSel models" (Halpern and Bruno 1998)

$$
\Pr = \begin{cases}
\mu_{ij} N \times \dfrac{1}{N} = \mu_{ij} & \text{if neutral} \\[6pt]
\mu_{ij} N \times \dfrac{2 s_{ij}}{1 - e^{-2 N s_{ij}}} & \text{if selected}
\end{cases}
\qquad s_{ij} = \Delta f_{ij}
$$

pattern: [Figure: codon alignment. The first sequence is written in full (GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC ...); in the remaining sequences a dot marks a nucleotide identical to the first sequence.]

phenomenological load
Maximum phenomenological model for sequence data: explains all variation in a particular dataset
so-called "saturated model" (multinomial model)
• does not generalize to other datasets
• no information about process
• highest lnL score (useless?)

[Figure: the same codon alignment with site pattern 4 marked.]

Question: Does anyone really care, at all, that site pattern No. 4 occurs 33 times in my sample of 5 mammalian mt genomes?
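
The two formulas on these slides, the MutSel substitution rate and the log-likelihood of the saturated (multinomial) model, can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function names, the toy site patterns, and the choice to drop the constant multinomial coefficient from the log-likelihood are all assumptions made here.

```python
import math
from collections import Counter


def mutsel_rate(mu_ij, s_ij, N):
    """Rate i -> j: N * mu_ij new mutants times their fixation probability.

    mu_ij, s_ij and N follow the slide's symbols; the function name is hypothetical.
    """
    if s_ij == 0.0:
        # Neutral case: mu_ij * N * (1 / N) = mu_ij
        return mu_ij
    p_fix = 2.0 * s_ij / (1.0 - math.exp(-2.0 * N * s_ij))
    return mu_ij * N * p_fix


def saturated_lnl(site_patterns):
    """lnL of the saturated model: sum over patterns of n_k * ln(n_k / n).

    The constant multinomial coefficient is omitted (an assumption of this sketch).
    """
    counts = Counter(site_patterns)
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values())


# Toy data (hypothetical): four alignment columns, two distinct site patterns.
toy_patterns = ["AAAAA", "AAAAA", "ACGTA", "ACGTA"]
toy_lnl = saturated_lnl(toy_patterns)
```

Because the saturated model assigns each observed pattern its own frequency, no other model fitted to the same alignment can score a higher lnL, which is why the score says nothing about process.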

phenomenological load
Review phenomenological models:

"The good"
• all we have to model are "outcomes" (site pattern distribution)
• they can be predictive (e.g., Newtonian models)
• they can tell us about process (e.g., some codon models)

"The bad"
• a "saturated model" is useless
• must "decide" how much variability to "soak up" with model parameters
• matching variability to mechanistic process is hard
• traditional statistical methods manage phenomenological variability (NOT process variability)

"The ugly"
• getting it wrong = false biological conclusions

phenomenological load
new concept: move phenomenological from model to parameter

phenomenological load (PL): if a parameter has a mechanistic interpretation, and if the process it represents did not actually occur, then when it absorbs significant variance that parameter has taken on phenomenological load (measured via PRD*).

two conditions for PL:
1. confounding of model parameters
2. underspecified model

* PRD = percent reduction of deviance, and is defined in subsequent slides

phenomenological load
codon models

$$
Q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ by} > 1 \text{ nucleotide} \\
\pi_j & \text{for synonymous tv.} \\
\kappa \pi_j & \text{for synonymous ts.} \\
\omega \pi_j & \text{for non-synonymous tv.} \\
\omega \kappa \pi_j & \text{for non-synonymous ts.}
\end{cases}
$$

[Figure labels: $\Delta f^{h}_{\mathrm{Ile} \to \mathrm{Leu}}$ and $\Delta f^{h}_{\mathrm{Ile} \to \mathrm{Lys}}$]

1. confounding

DNA sub-model:
• κ and π
• applied to all sites equally
• ≠ mutation sub-model

protein level sub-model:
• ω and π
• direct selective interpretation
• affected by mutation process

2. underspecified

missing model variability:
• different fitness landscapes for sites
• different AA exchangeabilities ($s_{ij}$)
• different equilibrium for sites
• independent mutational sub-model
• mechanistic effect of $N_e$
• high-level non-independence (global epistasis for stability)
• low-level non-independence (local epistasis for function)

1. sub-models are confounded!
2. models are heavily underspecified

phenomenological load
a different look at the issue …
[Figure: true model ($M_T$) and fitted model (M0)]
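
The piecewise definition of $Q_{ij}$ on the codon models slide can be written out as a small function. This is a minimal sketch under stated assumptions: it uses the standard genetic code for brevity (an analysis of mammalian mtDNA would need the vertebrate mitochondrial code), the name `m0_rate` is hypothetical, and diagonal entries and stop codons are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (assumption: NOT the mitochondrial code).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def m0_rate(codon_i, codon_j, kappa, omega, pi_j):
    """Off-diagonal rate Q_ij following the five cases on the slide."""
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        # More than one nucleotide difference: rate 0.
        # Identical codons also return 0 here; the diagonal is set separately.
        return 0.0
    rate = pi_j
    if frozenset(diffs[0]) in TRANSITIONS:
        rate *= kappa  # transition (ts)
    if CODE[codon_i] != CODE[codon_j]:
        rate *= omega  # non-synonymous change
    return rate


# The slide's two labeled changes, with hypothetical parameter values:
ile_to_leu = m0_rate("ATT", "CTT", kappa=2.0, omega=0.5, pi_j=0.1)  # non-syn. tv
ile_to_lys = m0_rate("ATA", "AAA", kappa=2.0, omega=0.5, pi_j=0.1)  # non-syn. tv
```

Both changes receive the same rate $\omega\pi_j$ even though their fitness effects $\Delta f^h$ may differ, which illustrates the underspecification the slide points to.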
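
Kullback-Leibler divergence

$$
P_T = P(X \mid \hat{\theta}_T), \qquad P_{M0} = P(X \mid \hat{\theta}_{M0})
$$

$$
KL = \sum_{X} P_T(X \mid \hat{\theta}_T) \, \log \frac{P_T(X \mid \hat{\theta}_T)}{P_{M0}(X \mid \hat{\theta}_{M0})}
$$

"Deviance M0"

$$
D_{M0} = -2 \left\{ l_{M0}(\hat{\theta}_{M0} \mid X) - l_{M_S}(X) \right\}
$$

[Figure: diagram of the saturated model $M_S$, the true model $M_T$ and the fitted model M0, with KL marked.]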
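
As a worked illustration of the two quantities defined above, the sketch below computes a KL divergence between two discrete distributions over site patterns and a deviance from two log-likelihoods. The function names and all numeric values are hypothetical; in practice $P_T$ is unknown, which is why the deviance against the saturated model $M_S$ is used instead.

```python
import math


def kl_divergence(p_true, p_fit):
    """KL = sum_X P_T(X) * log(P_T(X) / P_M0(X)) over site patterns X.

    Both arguments map each site pattern to its probability.
    Patterns with P_T(X) = 0 contribute nothing to the sum.
    """
    return sum(p * math.log(p / p_fit[x]) for x, p in p_true.items() if p > 0)


def deviance(lnl_model, lnl_saturated):
    """D = -2 * { l_model - l_saturated }, matching the definition of D_M0."""
    return -2.0 * (lnl_model - lnl_saturated)


# Hypothetical two-pattern example.
p_t = {"pattern_1": 0.5, "pattern_2": 0.5}
p_m0 = {"pattern_1": 0.25, "pattern_2": 0.75}
kl = kl_divergence(p_t, p_m0)

# Hypothetical log-likelihoods: fitted model and saturated model.
d_m0 = deviance(-110.0, -100.0)
```

The deviance is non-negative whenever the saturated model has the higher log-likelihood, which it does by construction on the dataset it was fitted to.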

[Figure, not to scale: $M_S$, $M_T$, M0 and M3, with KL and PRD marked.]

Percent Reduction in Deviance (PRD)

$$
PRD = \frac{D_{M0} - D_{M3}}{D_{poisson}}
$$

[Figure: $M_S$, $M_T$, M0 and M3 with two paths marked.]
• Hypothesis tests along THIS PATH [first path marked in the figure] have direct connection to mechanism of evolution
• Hypothesis tests along THIS PATH [second path marked in the figure] have phenomenological load

PRD between M0 and M3:
• significant LRTs because variation is not random
• interpretation is not directly about the mechanism of evolution
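
The PRD formula can be checked with a short calculation. The sketch below assumes log-likelihoods are available for the saturated model, a Poisson baseline model, M0 and M3; the function names and every number are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def deviance(lnl_model, lnl_saturated):
    """D = -2 * { l_model - l_saturated }."""
    return -2.0 * (lnl_model - lnl_saturated)


def prd(lnl_simple, lnl_complex, lnl_baseline, lnl_saturated):
    """PRD = (D_simple - D_complex) / D_baseline, returned as a fraction.

    Multiply by 100 to express it as a percentage.
    """
    d_simple = deviance(lnl_simple, lnl_saturated)
    d_complex = deviance(lnl_complex, lnl_saturated)
    d_baseline = deviance(lnl_baseline, lnl_saturated)
    return (d_simple - d_complex) / d_baseline


# Hypothetical log-likelihoods (not from any real dataset):
# saturated = -9000, Poisson = -12000, M0 = -11000, M3 = -10800
# D_poisson = 6000, D_M0 = 4000, D_M3 = 3600
example_prd = prd(-11000.0, -10800.0, -12000.0, -9000.0)
```

The saturated log-likelihood cancels in the numerator, so $D_{M0} - D_{M3} = 2(l_{M3} - l_{M0})$, which is the usual LRT statistic scaled by the baseline deviance.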

DT: Double and Triple mutations

Example double: ATG (Met) → AAA (Lys) [α parameter]
Example triple: AAA (Lys) → GGG (Gly) [β parameter]

M0 Q matrix
• 2 parameters (κ and ω)
• DT not allowed

New Q matrix
• 4 parameters (κ, ω, α, β)
• DT allowed (via α and β)

[Figure: the two Q matrices; white: probability = 0]

Is such a model warranted?

Let's do a simulation study!

process ($M_T$) → outcome ($X$): we need outcomes to match up

[Figure: phylogeny of the 20 mammals. Tip labels as extracted: African, chimpanzee, bonobo, gorilla, orangutan, Sumatran orangutan, common gibbon, harbor seal, grey seal, cat, horse, Indian rhinoceros, cow, fin whale, blue whale, rat, mouse, wallaroo, opossum, platypus.]

simulation
• MutSel
• $f^h$ differ for each site
• NO DT-mutations
• 12 mt proteins (3331 codons)
• 20 mammals

[Figure: heat maps for the real mtDNA data and for the simulation outcome; heat maps: proportion of sites having a given pair of AAs]

Our simulated data LOOKS LIKE the REAL DATA!
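
The distinction between single, double and triple changes on the DT slide comes down to counting nucleotide differences between two codons. The sketch below reproduces the slide's two examples. How α and β enter the new Q matrix is not given on the slide, so treating them as plain multipliers on the rate is an assumption of this sketch, as are the function names.

```python
def n_differences(codon_i, codon_j):
    """Number of nucleotide positions at which two codons differ (0 to 3)."""
    return sum(a != b for a, b in zip(codon_i, codon_j))


def dt_multiplier(codon_i, codon_j, alpha, beta):
    """Hypothetical rate multiplier: 1 for single, alpha for double, beta for triple.

    ASSUMPTION: alpha and beta act as simple multipliers; the slide only
    states that double and triple changes are allowed "via alpha and beta".
    Identical codons get 0 because diagonal entries are set separately.
    """
    n = n_differences(codon_i, codon_j)
    return {0: 0.0, 1: 1.0, 2: alpha, 3: beta}[n]


double_example = n_differences("ATG", "AAA")  # Met -> Lys on the slide
triple_example = n_differences("AAA", "GGG")  # Lys -> Gly on the slide
```

Under this assumption, setting α = β = 0 gives a zero rate to every double and triple change, which recovers the M0 matrix in which DT changes are not allowed.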

simulation for $M_T$: MutSel with NO DT-mutations

[Figure: model space with $M_S$, $M_T$ and KL, and the model pairs C3 / C3+DT, M3 / M3+DT and M0 / M0+DT, each pair labeled with its PRD and LRT result.]

LRT results, as labeled in the figure:
• C3 vs. C3+DT: 47%
• M3 vs. M3+DT: 97%
• M0 vs. M0+DT: 100%

since there are NO DT-mutations, PRD is a measure of PL

[Figure: PRD for M0+DT, M3+DT and C3+DT. Series: PL associated with α and β; PRD with true DT process; PRD for real mtDNA dataset.]

Conclusions:
• DT parameters (α and β) carry PL
• there is evidence for a DT process in mtDNA in excess of PL
• estimated level of DT is very small in the real data
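
Each +DT comparison above is a likelihood ratio test of a model against the same model with α and β added. A minimal sketch of such a test follows, assuming the statistic is referred to a chi-square distribution with 2 degrees of freedom (the slides do not state which null distribution was used). For 2 degrees of freedom the chi-square tail probability has the closed form $e^{-x/2}$, so no statistics library is needed. The log-likelihood values are hypothetical.

```python
import math


def lrt_two_df(lnl_null, lnl_alt):
    """LRT of a null model against an alternative with two extra parameters.

    ASSUMPTION: the statistic is compared with a chi-square distribution
    with 2 degrees of freedom, whose survival function is exp(-x / 2).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    p_value = math.exp(-stat / 2.0) if stat > 0 else 1.0
    return stat, p_value


# Hypothetical log-likelihoods for M0 and M0+DT.
stat, p_value = lrt_two_df(-1000.0, -997.0)
significant = p_value < 0.05
```

A significant result here shows only that α and β absorbed non-random variation. As the simulation shows, that can happen even when no DT mutations occurred.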

[Figure: three panels, each showing $M_S$ and a path through model space.]
• model path for "shallow" phylogenetics (labels: Poisson for DNA, JC69)
• model path for "deep" phylogenetics
• model path for inference of process
[Other labels in the figure: Poisson for codons; Poisson for amino acids; $M_T$]

Alternative model paths:
• research objective differs
• target model differs
• PL differs
• impact on inferences differs

Why should you care?

[Figure: $M_S$, $M_T$, M0 and M3, with KL and PRD marked.]

1. All of molecular evolution depends on models to some extent.
2. All models are wrong (underspecified).
3. Model parameters will carry some PL.
4. Faster computers → more complex models
5. Next Gen sequencing → minor effects detectable
6. Standard model selection tools will NOT inform you about levels of PL.
7. Excessive PL will lead to false biological conclusions.
8. Modelers MUST have biological expertise, and they MUST use that expertise as part of the modeling process.

How can you really tell if you have learned anything relevant to the function of your protein?
• formally combine computational and experimental approaches (B. Chang, next lecture)
• formally combine phenotypic information within the computational analysis of sequence evolution

The End.