Model Misspecification due to Site Specific Rate Heterogeneity: how is tree inference affected? Stephen Crotty School of Mathematical Sciences, University of Adelaide October, 2013 Stephen Crotty (School of Math. Sci.) Model Misspecification due to SSRH October, 2013 1 / 21
What is Site Specific Rate Heterogeneity (SSRH)?
The model contains 3 site types: invariable sites, variable sites and switching sites.
[Figure: tree on taxa A, B, C, D. Branch labels in the first builds of the slide: 0.1, 0.4, 0.4, 0.4, 0.4. Branch labels in the final build: 0.1, 0.1, 0.1, 0.4, 0.3, 0.3, 0.4.]
Why should we care about SSRH?
[Images: Tasmanian Pygmy Possum; Tasmanian Native Hen; Tasmanian Devil; "What's up Doc?" Devil Facial Tumour Syndrome.]
Experimental Procedure
1. Data were simulated using the program LineageSpecificSeqgen [1].
2. The Phylip [2] software package was used to perform tree inference using the maximum parsimony (MP), neighbour joining (NJ) and maximum likelihood (ML) methods.
3. A theoretical analysis of each method was carried out in an effort to understand its performance.
[1] Source: L. Shavit Grievink, D. Penny, M. D. Hendy, and B. R. Holland. BMC Evolutionary Biology, 8:317, 2008.
[2] http://evolution.genetics.washington.edu/phylip/
Simulation Parameters
$p_{\mathrm{inv}} = 80\%$
$p_{\mathrm{var}} = 20\%$
$p_{\mathrm{switch}} = 0, 1, 2, \ldots, 100\%$
100000 base pairs
Jukes Cantor substitution model
100 replications
[Figure: tree on taxa A, B, C, D with branch labels 0.1, 0.1, 0.1, 0.4, 0.3, 0.3, 0.4.]
Maximum Parsimony
Site pattern analysis predicts the asymptotic failure point of MP to be 26.56%.
[Plot: Correct Tree Inferred (%) against $p_{\mathrm{switch}}$ (%), both axes from 0 to 100.]
Neighbour Joining
[Plot: Correct Tree Inferred (%) against $p_{\mathrm{switch}}$ (%), both axes from 0 to 100, with series MP and NJ.]
Neighbour Joining - why the recovery?
The neighbour joining algorithm
$r$ = number of taxa.
$D_{ij}$ = JC distance between taxa $i$ and $j$.
\[ Q_{ij} = (r - 2) D_{ij} - \sum_{k=1}^{r} D_{ik} - \sum_{k=1}^{r} D_{jk} \]
$Q$ is the matrix used by the NJ algorithm: the pair of taxa with the smallest $Q_{ij}$ is joined, and the process is repeated.
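The Q matrix defined above can be sketched in a few lines of Python. This is only an illustration, not the Phylip implementation: the function names `q_matrix` and `first_join` are made up here, and the distance matrix `D_EXAMPLE` is a hypothetical set of additive distances on a tree AB | CD with terminal branches 0.4 and internal branch 0.1.

```python
import numpy as np


def q_matrix(D):
    """Return the NJ Q matrix for a symmetric distance matrix D (zero diagonal)."""
    D = np.asarray(D, dtype=float)
    r = D.shape[0]                      # number of taxa
    row_sums = D.sum(axis=1)            # sum_k D_ik for each taxon i
    Q = (r - 2) * D - row_sums[:, None] - row_sums[None, :]
    np.fill_diagonal(Q, np.inf)         # a taxon is never joined with itself
    return Q


def first_join(D):
    """Return the index pair (i, j), i < j, with the smallest Q_ij."""
    Q = q_matrix(D)
    i, j = np.unravel_index(np.argmin(Q), Q.shape)
    return tuple(sorted((int(i), int(j))))


# Hypothetical distances, taxa ordered A, B, C, D (assumption for illustration).
D_EXAMPLE = np.array([
    [0.0, 0.8, 0.9, 0.9],
    [0.8, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
])
```

For four taxa, $Q_{AB}$ and $Q_{CD}$ are algebraically equal, so `first_join(D_EXAMPLE)` may return either the pair (A, B) or the pair (C, D); both correspond to the split AB | CD.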
The Q matrix for a 4-taxon tree
\[ Q_{AB} = (4 - 2) D_{AB} - \sum_{k \in \{B, C, D\}} D_{Ak} - \sum_{k \in \{A, C, D\}} D_{Bk} = -(D_{AC} + D_{AD} + D_{BC} + D_{BD}) \]
Similarly,
\[ Q_{AD} = -(D_{AB} + D_{AC} + D_{BD} + D_{CD}) \]
and
\[ Q_{AC} = -(D_{AB} + D_{AD} + D_{BC} + D_{CD}) \]
Digression - what tree might we infer?
$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AB} \implies$ AB | CD (✔)
$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AD} \implies$ AD | BC (✘)
$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AC} \implies$ AC | BD (✘)
Comparing AB | CD with AD | BC only:
$Q_{AB} < Q_{AD} \implies$ AB | CD (✔)
$Q_{AD} < Q_{AB} \implies$ AD | BC (✘)
The Q matrix for a 4-taxon tree
The correct tree (AB | CD) will be inferred given the condition:
\[ Q_{AB} < Q_{AD} \implies 0 < Q_{AD} - Q_{AB} \implies 0 < D_{AD} + D_{BC} - D_{AB} - D_{CD} \]
We now define $C = D_{AD} + D_{BC} - D_{AB} - D_{CD}$, so that the correct tree will be inferred when $C > 0$.
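The condition $C > 0$ can be checked directly from four pairwise distances. The sketch below is an illustration under assumptions: the function names and the example distances are hypothetical and do not come from the talk.

```python
def nj_critical_quantity(d_ab, d_cd, d_ad, d_bc):
    """Return C = D_AD + D_BC - D_AB - D_CD for a four-taxon distance set."""
    return d_ad + d_bc - d_ab - d_cd


def prefers_ab_cd(d_ab, d_cd, d_ad, d_bc):
    """True when Q_AB < Q_AD, i.e. when C > 0, so AB | CD beats AD | BC."""
    return nj_critical_quantity(d_ab, d_cd, d_ad, d_bc) > 0


# Hypothetical distances (assumption): A-B and C-D are close, A-D and B-C are not.
c_good = nj_critical_quantity(d_ab=0.8, d_cd=0.8, d_ad=0.9, d_bc=0.9)

# Swapping the roles of the pairs makes C negative, so AD | BC would be preferred.
c_bad = nj_critical_quantity(d_ab=0.9, d_cd=0.9, d_ad=0.8, d_bc=0.8)
```

Only the sign of C matters for the choice between the two topologies; its magnitude indicates how much sampling noise the estimated distances can absorb before the choice flips.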
Deriving the expected value of C
$T$ = the tree topology
$P_{ij}$ = the proportion of differing sites between taxa $i$ and $j$
\[ E[P_{ij}] = f(p_{\mathrm{switch}}, T) \]
\[ E[D_{ij}] = -\frac{3}{4} \ln\left(1 - \frac{4}{3} E[P_{ij}]\right) \]
\[ E[C] = E[D_{AD}] + E[D_{BC}] - E[D_{AB}] - E[D_{CD}] \]
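The last two steps of this derivation can be sketched numerically. The slides do not give the form of $f(p_{\mathrm{switch}}, T)$, so the expected proportions of differing sites are taken here as plain inputs; the function names and any values passed to them are hypothetical.

```python
import math


def jc_distance(p):
    """Jukes Cantor distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def expected_c(p_ab, p_cd, p_ad, p_bc):
    """E[C] = E[D_AD] + E[D_BC] - E[D_AB] - E[D_CD], from expected proportions."""
    return (jc_distance(p_ad) + jc_distance(p_bc)
            - jc_distance(p_ab) - jc_distance(p_cd))
```

For example, `expected_c(0.2, 0.2, 0.3, 0.3)` is positive, because the JC distance increases with the proportion of differing sites; a full reproduction of the curve would need the actual $E[P_{ij}]$ values for each $p_{\mathrm{switch}}$.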
Expected value of C
[Plot: NJ critical quantity (axis from −0.01 to 0.01) against $p_{\mathrm{switch}}$ (%) from 0 to 100.]
Neighbour Joining
[Plot: Correct Tree Inferred (%) against $p_{\mathrm{switch}}$ (%), both axes from 0 to 100, with regions labelled $E[C] > 0$, $E[C] < 0$ and $E[C] > 0$.]
Maximum Likelihood
[Plot: Correct Tree Inferred (%) against $p_{\mathrm{switch}}$ (%), both axes from 0 to 100, with series MP, NJ and ML.]
Why is this important?
Traditional methods of phylogenetic inference may be compromised by SSRH.
Diagnostic tools need to be developed to help identify the presence and extent of SSRH in sequence data.
Data-driven model checking will be the focus of my PhD going forward.
Acknowledgements
I would like to thank my supervisory team for their input and guidance:
Prof. Nigel Bean - University of Adelaide
Dr Lars Jermiin - CSIRO
Dr Barbara Holland - University of Tasmania
Dr Jono Tuke - University of Adelaide