2015-07-21

part II
codon substitution models and the analysis of natural selection pressure

Joseph P. Bielawski
Department of Biology
Department of Mathematics & Statistics
Dalhousie University

part II: model based inference
1. 3 inference tasks
2. example gene analysis (& experimental validation)
3. MLE instabilities
4. false biological conclusions
5. example evolutionary survey (& experimental validation)
6. closing thoughts

1. three tasks
model based inference: 3 analytical tasks
Task 1. Parameter estimation (e.g., ω)
Task 2. Hypothesis testing
Task 3. Make predictions (e.g., sites having ω > 1)

1. analytical task-1
task 1: parameter estimation
t, κ, ω = unknown constants estimated by ML
π's = empirical [GY: F3×4 or F61 in Lab]
use a numerical hill-climbing algorithm to maximize the likelihood function
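
To make "maximize the likelihood function numerically" concrete, here is a minimal sketch on a deliberately simple stand-in problem. It is not the codon model from these slides: it assumes the one-parameter JC69 nucleotide model for two sequences and uses a golden-section search in place of a general multi-parameter hill-climbing optimizer. The function names and the counts (1000 sites, 100 differences) are hypothetical.

```python
import math

def jc69_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of distance t (substitutions per site) for two
    aligned sequences under JC69 (toy stand-in for a codon model)."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    return n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1.0 - p_diff)

def golden_section_maximize(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

# Hypothetical data: 100 differences observed over 1000 aligned sites.
n_sites, n_diff = 1000, 100
t_hat = golden_section_maximize(
    lambda t: jc69_log_likelihood(t, n_sites, n_diff), 1e-6, 5.0
)
# JC69 has a closed-form MLE, which makes the numerical answer checkable.
t_closed = -0.75 * math.log(1.0 - (4.0 / 3.0) * (n_diff / n_sites))
print(t_hat, t_closed)
```

For codon models no closed form exists and t, κ and ω are optimized jointly, so the search is multi-dimensional, but the principle is the same: propose parameter values, evaluate lnL, and move uphill.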

1. analytical task-1
parameter estimation
Parameters: t and ω
Gene: acetylcholine α receptor
[Figure: human and mouse sequences joined at their common ancestor]
lnL = -2399

1. three tasks
How do we know that the estimate is significant?
Task 1. Parameter estimation (e.g., ω) ✔
Task 2. Hypothesis testing: LRT
Task 3. Prediction / Site identification

1. analytical task-2
LRT No. 1: Does selection pressure vary among sites?
H0: uniform selective pressure among sites (M0)
H1: variable selective pressure among sites (M3)
Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a χ² distribution
[Figure: bar plots of site-class proportions (y-axis 0 to 1) for Model 0 and Model 3; ω̂ values shown: 0.65, 0.01, 0.90, 5.55]

1. analytical task-2
LRT No. 2: Have some sites evolved under positive selection?
H0: variable selective pressure but NO positive selection (M1)
H1: variable selective pressure with positive selection (M2)
Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a χ² distribution
[Figure: bar plots of site-class proportions for Model 1a and Model 2a.
 Model 1a: ω̂ = 0.5, (ω = 1)
 Model 2a: ω̂ = 0.5, (ω = 1), ω̂ = 3.25]
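
The two LRTs above reduce to the same arithmetic, sketched below with the standard library only. The closed-form χ² tail used here is valid for even degrees of freedom, which covers the usual choices of df = 4 for M0 vs. M3 (three site classes) and df = 2 for M1a vs. M2a. The function names are hypothetical, and the M3 log-likelihood is an invented number for illustration; only lnL = -2399 appears in the slides.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with an
    even number of degrees of freedom, using the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form is valid for positive even df only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, approximate chi-square p-value)."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)

# Hypothetical log-likelihoods for a null (M0) and alternative (M3) fit.
lnl_m0 = -2399.0   # value shown on the parameter-estimation slide
lnl_m3 = -2380.0   # invented for illustration
stat, p_value = likelihood_ratio_test(lnl_m0, lnl_m3, df=4)
print(stat, p_value)
```

The p-value is only an approximation: the simulations cited in this lecture (Anisimova, Bielawski, and Yang 2001) show that the statistic does not follow the χ² distribution exactly for these comparisons, and that using χ² makes the test conservative.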

1. analytical task-2
the LRT does not follow the χ² distribution
[Figure: histogram of simulated 2Δℓ values compared with the χ² distribution with 4 degrees of freedom; x-axis: 2Δℓ (0.5 to 9.5), y-axis: frequency (0 to 0.2)]
Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592

1. analytical task-2
the LRT is conservative
Number of cases out of 100 for which the null hypothesis was rejected at the α = 1% (5%) significance levels

                         Simulation parameters                Type I error at α = 1% (5%)
Exp.  Simulation  LRT      Taxa  κ  ω                  t      N = 100   N = 500
A     M0          M0 & M3  6     2  0.40               0.11   0 (0)     0 (0)
                                                       1.1    0 (0)     0 (0)
                                                       11     0 (0)     0 (0)
B     M0          M0 & M3  17    2  0.40               2.11   0 (0)     0 (1)
                                                       8.44   0 (0)     0 (1)
                                                       16.88  0 (1)     0 (0)
C     M0          M0 & M3  5     5  0.25               0.91   0 (0)     0 (0)
                                                       9.1    0 (0)     0 (1)
                                                       18.2   0 (1)     2 (3)
D     M7          M7 & M8  6     2  p = 0.41, q = 1.10 0.11   N/A       0 (0)
                                                       1.1    N/A       1 (5)
                                                       11     N/A       1 (4)

NOTE: Here t denotes total tree length (sum of all branch lengths in the tree)
Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592

1. analytical task-2
the LRT can be powerful
Power of the LRT: Number of replicates out of 100 in which positive selection was indicated by parameter estimates (P+), or detected by the LRT at the 1% (P+s, 0.01) and 5% (P+s, 0.05, in parentheses) significance levels

Simulation parameters:
  Simulation: M3    LRT: M0 & M3    Taxa: 17    κ: 2
  ω distribution: ω0 = 0.018, p0 = 0.386
                  ω1 = 0.304, p1 = 0.535
                  ω2 = 1.691, p2 = 0.079

         P+                    P+s, 0.01 (0.05)
t        LC = 100   LC = 500   LC = 100   LC = 500
0.38     61         80         10 (17)    66 (72)
2.11     93         100        91 (92)    100 (100)
8.44     99         100        99 (99)    100 (100)
16.88    99         99         99 (99)    99 (99)
105.5    31         58         31 (31)    58 (58)

NOTE: Here t denotes total tree length (sum of all branch lengths in the tree)
Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592

1. three tasks
How do we identify the selected sites?
Task 1. Parameter estimation (e.g., ω) ✔
Task 2. Hypothesis testing ✔
Task 3. Prediction / Site identification: Bayes' rule

1. analytical task-3
Which sites have ω > 1?
model: 9% have ω > 1
Bayes' rule: site 4, 12 & 13
structure: sites are in contact
[Figure: bar plot over the codon sites of the alignment below, y-axis 0 to 1]

GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC
... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A..
... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ...
... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T
... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ...

1. analytical task-3
yet another (silly) example of Bayes' rule
Suppose that a population consists of 60% males and 40% females, and a disease occurs at the rate 1% in males and 0.1% in females.
Q1: What is the probability that any individual carries the disease?
A1: 0.6 × 0.01 + 0.4 × 0.001 = 0.0064
$P(D) = P(M) P(D \mid M) + P(F) P(D \mid F)$
See Yang and Bielawski (2000) TREE 15:496-503 for a detailed presentation of this example

1. analytical task-3
yet another (silly) example of Bayes' rule
Q2: Given that an individual carries the disease, what is the probability that it is a male?
A2: 0.6 × 0.01 / 0.0064 ≈ 0.94
$P(M \mid D) = \frac{P(M) P(D \mid M)}{P(D)}$
See Yang and Bielawski (2000) TREE 15:496-503 for a detailed presentation of this example

From Paul Lewis' lecture ...
Bayes' rule in statistics
$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta) \Pr(\theta)}{\sum_{\theta} \Pr(D \mid \theta) \Pr(\theta)}$
$\Pr(\theta \mid D)$: posterior probability of hypothesis θ
$\Pr(D \mid \theta)$: likelihood of hypothesis θ
$\Pr(\theta)$: prior probability of hypothesis θ
$\sum_{\theta} \Pr(D \mid \theta) \Pr(\theta)$: marginal probability of the data (marginalizing over hypotheses)
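
The disease example (Q1 and Q2) can be checked in a few lines. This sketch uses only the numbers given on the slides; the variable names are illustrative.

```python
# Prior probabilities of each group and disease rates within each group,
# taken directly from the example.
prior = {"male": 0.6, "female": 0.4}
p_disease_given = {"male": 0.01, "female": 0.001}

# Q1: total (marginal) probability that an individual carries the disease.
p_disease = sum(prior[g] * p_disease_given[g] for g in prior)

# Q2: posterior probability of being male, given the disease (Bayes' rule).
posterior_male = prior["male"] * p_disease_given["male"] / p_disease

print(round(p_disease, 4))       # 0.0064
print(round(posterior_male, 4))  # 0.9375, reported as 0.94 on the slide
```

The denominator in Q2 is exactly the answer to Q1, which is the same role the marginal probability of the data plays in the general formula.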

1. analytical task-3
identifying selected sites under a codon model
$P(x_h) = \sum_{i=0}^{K-1} P(\omega_i) P(x_h \mid \omega_i)$
$P(x_h)$: total probability of the data at site h
$P(\omega_i) = p_i$: prior probability of site class i
$P(x_h \mid \omega_i)$: likelihood of site class i
ω0 = 0.03, p0 = 0.85
ω1 = 0.40, p1 = 0.10
ω2 = 14.1, p2 = 0.05

1. analytical task-3
Bayes' rule for identifying selected sites
Site class 0: ω0 = .03, 85% of codon sites
Site class 1: ω1 = .40, 10% of codon sites
Site class 2: ω2 = 14, 5% of codon sites
$P(\omega_2 \mid x_h) = \frac{P(\omega_2) P(x_h \mid \omega_2)}{\sum_{i=0}^{K-1} P(\omega_i) P(x_h \mid \omega_i)}$
$P(\omega_2 \mid x_h)$: posterior probability of hypothesis (ω2)
$P(\omega_2)$: prior probability of hypothesis (ω2)
$P(x_h \mid \omega_2)$: likelihood of hypothesis (ω2)
denominator: marginal probability (total probability) of the data
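
A minimal sketch of the site-class calculation follows. The prior proportions (0.85, 0.10, 0.05) come from the slide; the per-class site likelihoods P(x_h | ω_i) are invented for illustration, since in practice they are computed on the tree for each site class. The function name is hypothetical.

```python
def site_class_posteriors(priors, site_likelihoods):
    """Posterior probability of each site class at one codon site:
    prior times likelihood, divided by the total probability of the site."""
    joint = [p * lik for p, lik in zip(priors, site_likelihoods)]
    total = sum(joint)  # P(x_h), summed over the K site classes
    return [j / total for j in joint]

# Priors from the slide: proportions of sites in classes 0, 1 and 2.
priors = [0.85, 0.10, 0.05]

# Hypothetical likelihoods of the data at one site under each class.
site_likelihoods = [1.0e-6, 4.0e-6, 3.0e-5]

posteriors = site_class_posteriors(priors, site_likelihoods)
print(posteriors)
```

With these invented likelihoods, class 2 ends up with the largest posterior even though its prior is only 5%, because the data at the site are far more probable under ω2 than under the other classes.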

1. analytical task-3
Bayes' rule for identifying selected sites
[Figure: posterior probability (y-axis, 0 to 1) by codon site (x-axis, sites 1 to 201); annotated regions: "Rapidly evolving region" and "Conserved region"]
Site class 0: ω0 = .03 (strong purifying selection)
Site class 1: ω1 = .40 (weak purifying selection)
Site class 2: ω2 = 14 (positive selection)
NOTE: The posterior probability should NOT be interpreted as a "P-value"; it can be interpreted as a measure of relative support, although there is rarely any attempt at "calibration"

1. analytical task-3
Bayes' rule for identifying selected sites
Empirical Bayes

Naive Empirical Bayes (NEB)
• Nielsen and Yang, 1998
• assumes no MLE errors

Bayes Empirical Bayes (BEB)
• Yang et al., 2005
• accommodates MLE errors

1. analytical task-3
Bayes Empirical Bayes (BEB)
1. assign a prior to ω distribution parameters
2. fix branch lengths to MLEs
3. integrate over uncertainty
4. BEB is faster than "Full Bayes" (FB)

False classification rates
Small datasets: FB/BEB < NEB
Large datasets: FB/BEB ≈ NEB*
* exception: extreme parameter estimates
See: Yang Z, Wong WS, Nielsen R. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol. 22(4):1107-1118.

model based inference
progress ...
Task 1. Parameter estimation (e.g., ω)
Task 2. Hypothesis testing
Task 3. Prediction / Site identification
let's put this into practice ...

model based inference
progress
1. 3 inference tasks ✔
2. example gene analysis (& experimental validation)
3. MLE instabilities
4. false biological conclusions
5. example evolutionary survey (& experimental validation)
6. closing thoughts

2. example analysis
colour diversity of coral pigments
Red/blue colour morphs of the great star coral Montastraea cavernosa
o Is colour diversity tuned by natural selection?
o Is there a relationship between colour and endosymbiotic algae?
See Field et al. 2006 J. Mol. Evol. 62(3):332-9 for details.