On the identifiability of two tree mixtures for group-based models

On the identifiability of two tree mixtures for group-based models. E. Allman (1), S. Petrović (2), J. Rhodes (1), S. Sullivant (3). (1) University of Fairbanks, Alaska; (2) University of Illinois Chicago; (3) North Carolina State University. Phylomania 2010.


  1. On the identifiability of two tree mixtures for group-based models
     E. Allman (1), S. Petrović (2), J. Rhodes (1), S. Sullivant (3)
     (1) University of Fairbanks, Alaska; (2) University of Illinois Chicago; (3) North Carolina State University
     Phylomania 2010, Hobart, Tasmania, November 2010

  2. Today's talk:
     - identifiability of 2-tree mixture models
     - work dates from 2009 and before
     - focus today on algebraic techniques (technical at times)

  3. Background
     Interest sparked by papers and conversations:
     - Kolaczkowski and Thornton: 2004, Nature
     - Mossel and Vigoda; Ronquist et al.: 2005, 2006, Science
     - Štefankovič and Vigoda: 2007, JCB, Phylogeny of Mixture Models: Robustness of Maximum Likelihood and Non-identifiable Distributions
     - Matsen and Steel: 2007, Sys. Bio., Phylogenetic mixtures on a single tree can mimic a tree of another topology
     - Matsen, Mossel, and Steel: 2008, BMB, Mixed-up trees: the structure of phylogenetic mixtures
     - Junhyong Kim

  4. Due to incomplete lineage sorting or other biological phenomena, sequence data may have evolved along two or more trees.
     [Figure: panels labeled Species Tree, Gene 1, Gene 2]
     Q: Is it theoretically possible to identify the two trees giving rise to the expected pattern frequencies?
     Q': If so, what about the numerical parameters for these trees?

  6. Modeling sequence evolution along a tree (or trees)
     For a fixed tree $T$ and a model of sequence evolution (GTR, GTR+Γ, JC, ...), the distribution of states at the leaves of $T$ is a function $\psi_T$ of the model's parameters.
     E.g., the GTR model on an $n$-taxon tree $T$ has the parameterization map
     $$\psi_T : S_T \to \Delta^{4^n - 1}, \qquad (\pi, Q, \{t_e\}) \mapsto P = (p_{i_1 \cdots i_n}),$$
     where $p_{i_1 \cdots i_n}$ is the expected frequency of pattern $i = i_1 \cdots i_n$ at the leaves of $T$.

  7. Mixture models
     Modeling sequence evolution along two or more trees requires using a mixture model.
     E.g., suppose $T_1$ and $T_2$ are two $n$-taxon trees. Then the distribution is a point in the image of
     $$\psi_{T_1,T_2} : S_{T_1} \times S_{T_2} \times [0,1] \to \Delta^{4^n - 1}, \qquad (s_1, s_2, w) \mapsto P = (p_{i_1 \cdots i_n}),$$
     where $P = w\,\psi_{T_1}(s_1) + (1 - w)\,\psi_{T_2}(s_2)$ is the weighted sum of the distributions for parameter choices on $T_1$ and $T_2$.
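
The weighted sum on the mixture-model slide can be sketched numerically. The example below is an illustration under stated assumptions, not code from the talk: it uses the two-state CFN model on quartet trees (so the simplex is $\Delta^{2^4 - 1}$ rather than $\Delta^{4^n - 1}$), a uniform root distribution, and arbitrarily chosen flip probabilities. The function names `cfn_quartet_12_34`, `swap_leaves_2_3`, and `mixture` are hypothetical.

```python
from itertools import product


def cfn_quartet_12_34(flips):
    """CFN pattern distribution on the quartet tree 12|34.

    flips = (e1, e2, e3, e4, e_int) are per-edge flip probabilities
    (illustrative parameter names, not taken from the slides).
    """
    e1, e2, e3, e4, e_int = flips

    def m(e, x, y):
        # Entry of the 2x2 CFN transition matrix for an edge with flip probability e.
        return e if x != y else 1.0 - e

    dist = {}
    for i in product((0, 1), repeat=4):
        # Sum over the states u, v of the two internal nodes; uniform root at u.
        dist[i] = sum(
            0.5 * m(e1, u, i[0]) * m(e2, u, i[1]) * m(e_int, u, v)
            * m(e3, v, i[2]) * m(e4, v, i[3])
            for u in (0, 1) for v in (0, 1)
        )
    return dist


def swap_leaves_2_3(dist):
    """Relabel leaves 2 and 3, turning a 12|34 distribution into a 13|24 one."""
    return {(i1, i3, i2, i4): p for (i1, i2, i3, i4), p in dist.items()}


def mixture(p1, p2, w):
    """Return P = w * p1 + (1 - w) * p2 for a mixing weight w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return {i: w * p1[i] + (1.0 - w) * p2[i] for i in p1}


# Arbitrary example parameters s1, s2 and weight w (assumptions for illustration).
psi_t1 = cfn_quartet_12_34((0.05, 0.10, 0.15, 0.20, 0.08))
psi_t2 = swap_leaves_2_3(cfn_quartet_12_34((0.12, 0.07, 0.09, 0.11, 0.03)))
P = mixture(psi_t1, psi_t2, 0.3)
```

Because each component is a probability distribution and the weights are convex, `P` is again a point of the simplex.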

  8. Group-based models
     Today: focus on group-based models: Cavender-Farris-Neyman (CFN), Jukes-Cantor (JC), Kimura 2-Parameter (K2P), Kimura 3-Parameter (K3P).
     These models, as well as the general Markov model (GM), have an algebraic structure useful for analysis.

  9. Model parameters $\pi$, $\{M_e\}$ on tree $T$
     [Figure: rooted three-leaf tree with root distribution $\pi$, edge matrices $M_1, M_2, M_3, M_4$, and leaves S1, S2, S3]
     The pattern frequencies
     $$p_{ijk} = \sum_{l=1}^{4} \sum_{m=1}^{4} \pi_l \, M_1(l,m) \, M_2(m,i) \, M_3(m,j) \, M_4(l,k)$$
     lead to a polynomial parameterization map $\psi_T$. Thus, any mixture distribution $P_{T_1,T_2}$ in the image of $\psi_{T_1,T_2}$ is also parameterized by polynomials.
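
The double sum for $p_{ijk}$ can be evaluated directly. The sketch below assumes Jukes-Cantor edge matrices with arbitrarily chosen branch lengths and a uniform root distribution; the helpers `jc_matrix` and `pattern_probs` are hypothetical names introduced for illustration, and states are indexed 0 to 3 rather than 1 to 4.

```python
from itertools import product
from math import exp


def jc_matrix(t):
    """Jukes-Cantor transition matrix for branch length t (hypothetical helper)."""
    same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * exp(-4.0 * t / 3.0)
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def pattern_probs(pi, M1, M2, M3, M4):
    """p_ijk = sum_l sum_m pi_l M1(l,m) M2(m,i) M3(m,j) M4(l,k)."""
    p = {}
    for i, j, k in product(range(4), repeat=3):
        p[(i, j, k)] = sum(
            pi[l] * M1[l][m] * M2[m][i] * M3[m][j] * M4[l][k]
            for l in range(4) for m in range(4)
        )
    return p


# Arbitrary branch lengths, chosen only for illustration.
pi = [0.25, 0.25, 0.25, 0.25]
p = pattern_probs(pi, jc_matrix(0.1), jc_matrix(0.2), jc_matrix(0.3), jc_matrix(0.4))
total = sum(p.values())  # the 64 pattern frequencies sum to 1 up to rounding
```

Each $p_{ijk}$ is a polynomial in the entries of $\pi$ and the $M_e$, which is what makes the parameterization map polynomial.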

  10. Mixture varieties
      Extending the parameterization to complex parameters, define
      $$V_{T_1} * V_{T_2} = \operatorname{Im} \psi_{T_1,T_2},$$
      the phylogenetic mixture variety.
      (Point: This allows ideas from algebraic geometry to be used.)

  11. Algebraic geometry reminders
      - Fundamental correspondence: Geometry $\longleftrightarrow$ Algebra, $V \longleftrightarrow I_V$.
        Corresponding to any phylogenetic variety $V$ is its ideal $I_V$ of phylogenetic invariants, the ideal of polynomials $f$ in the pattern frequencies $p_i$ such that $f(P) = 0$ for any $P \in V$.
      - Inclusion-reversing correspondence: $V_1 \subseteq V_2 \iff I_{V_2} \subseteq I_{V_1}$

  12. More notation
      For stochastic parameter choices, denote the collection of joint distributions by $M_{T_1} * M_{T_2}$.
      Note that $M_{T_1} * M_{T_2} \subsetneq V_{T_1} * V_{T_2}$.
      Though the varieties are used for proofs because of their algebraic structure (dimension, good intersection properties, etc.), all results today hold for the stochastic distributions.

  13. Monomial parameterization
      Hendy, Penny, Székely, Erdős, Evans, Speed, Sturmfels, Sullivant: Group-based models can be diagonalized by means of the discrete Fourier transform over $G$ (Hadamard transform).
      In the Fourier coordinates, group-based models give rise to toric varieties. (In this setting, $\psi_T$ is parameterized by monomials.)
      Moreover, the discrete Fourier transform $F$ is a linear change of variables, so it behaves well with respect to taking mixtures of group-based models:
      $$F(M_{T_1}) * F(M_{T_2}) = F(M_{T_1} * M_{T_2})$$

  14. Fourier coordinates
      For each split $A|B$ in $T$, introduce a set of Fourier parameters $\{a^{A|B}_g : g \in G\}$.
      Theorem (Hendy-Penny). In the Fourier coordinates, a group-based phylogenetic model is given parametrically by
      $$q_{g_1,\ldots,g_n} = \begin{cases} \prod_{A|B \in \Sigma(T)} a^{A|B}_{\sum_{a \in A} g_a} & \text{if } g_1 + \cdots + g_n = 0, \\ 0 & \text{if } g_1 + \cdots + g_n \neq 0. \end{cases}$$
      'Coordinates' in this parameterization are called q-coordinates.
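
The monomial formula in the Hendy-Penny theorem can be checked numerically in the simplest group-based case. The sketch below is an illustration under assumptions, not code from the talk: it takes the CFN model ($G = \mathbb{Z}_2$) on the quartet 12|34 with a uniform root distribution and arbitrary flip probabilities, where an edge with flip probability $e$ has Fourier parameters $a_0 = 1$ and $a_1 = 1 - 2e$. All function names are hypothetical.

```python
from itertools import product


def cfn_quartet(flips):
    """CFN pattern distribution on the quartet 12|34 with a uniform root.

    flips = (e1, e2, e3, e4, e_int): illustrative per-edge flip probabilities.
    """
    e1, e2, e3, e4, e_int = flips

    def m(e, x, y):
        # 2x2 CFN transition matrix entry for flip probability e.
        return e if x != y else 1.0 - e

    return {
        i: sum(
            0.5 * m(e1, u, i[0]) * m(e2, u, i[1]) * m(e_int, u, v)
            * m(e3, v, i[2]) * m(e4, v, i[3])
            for u in (0, 1) for v in (0, 1)
        )
        for i in product((0, 1), repeat=4)
    }


def fourier(p):
    """Discrete Fourier (Hadamard) transform over Z_2^4: q_g = sum_i (-1)^(g.i) p_i."""
    return {
        g: sum((-1) ** sum(gk * ik for gk, ik in zip(g, i)) * prob
               for i, prob in p.items())
        for g in product((0, 1), repeat=4)
    }


def monomial(flips, g):
    """Right-hand side of the theorem for the splits of the quartet 12|34."""
    if sum(g) % 2 != 0:
        return 0.0
    a1 = [1.0 - 2.0 * e for e in flips]  # a_1 for each edge; a_0 = 1 throughout
    value = 1.0
    for leaf in range(4):                # trivial splits {leaf} | rest
        if g[leaf] == 1:
            value *= a1[leaf]
    if (g[0] + g[1]) % 2 == 1:           # internal split 12|34
        value *= a1[4]
    return value


flips = (0.05, 0.10, 0.15, 0.20, 0.08)
q = fourier(cfn_quartet(flips))
max_err = max(abs(q[g] - monomial(flips, g)) for g in q)
```

Here `max_err` measures how far the transformed distribution is from the monomial parameterization; it should vanish up to floating-point rounding.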

  15. Fourier coordinates
      For JC and K2P, we take $G = \mathbb{Z}_2 \times \mathbb{Z}_2 = \{A, C, G, T\}$.
      - For the K2P model, we have $a^{A|B}_G = a^{A|B}_T$ for all $A|B$.
      - For the JC model, we have $a^{A|B}_C = a^{A|B}_G = a^{A|B}_T$ for all $A|B$.

  16. Tree parameter identifiability (stochastic version)
      Definition. The tree parameters $T_1, \ldots, T_k$ in a $k$-class phylogenetic mixture model are identifiable if, for all $P \in M_{T_1} * \cdots * M_{T_k}$, there does not exist another set of $k$ trees $T'_1, \ldots, T'_k$ such that $P \in M_{T'_1} * \cdots * M_{T'_k}$.

  17. Tree parameter identifiability (geometric version)
      Definition. The tree parameters in a $k$-class phylogenetic mixture model are generically identifiable if, for all non-equal multisets $\{T_1, \ldots, T_k\}$ and $\{T'_1, \ldots, T'_k\}$,
      $$\dim\bigl(V_{T_1} * \cdots * V_{T_k} \cap V_{T'_1} * \cdots * V_{T'_k}\bigr) < \dim\bigl(V_{T_1} * \cdots * V_{T_k}\bigr).$$
      [Figure: schematic with regions labeled $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$]

  18. Generic identifiability of tree parameters
      An immediate consequence of the geometric definition,
      $$\dim\bigl(V_{T_1} * V_{T_2} \cap V_{T'_1} * V_{T'_2}\bigr) < \dim\bigl(V_{T_1} * V_{T_2}\bigr),$$
      is that tree parameters are generically identifiable for stochastic parameter choices too.
      That is, the trees giving rise to $M_{T_1} * M_{T_2}$ are identifiable, except on a non-generic set $E$ of stochastic parameters $(s_1, s_2, \pi)$ of Lebesgue measure zero where
      $$\psi_{T_1,T_2}(s_1, s_2, \pi) = \psi_{T'_1,T'_2}(s'_1, s'_2, \pi').$$
      ($E$ is the set of bad parameters.)

  19. Algebraic methods for proofs
      Use
      - dimension counts for phylogenetic varieties
      - all phylogenetic mixture varieties are irreducible, since they are parameterized
      - two irreducible varieties of the same dimension either coincide or intersect in a sub-variety of lower dimension (analogy with linear spaces)
        $\Longrightarrow$ if two phylogenetic varieties are distinct, then parameters will be generically identifiable
      - two varieties $V_1$ and $V_2$ are distinct if $I_{V_1} \neq I_{V_2}$, and $V_1 \not\subseteq V_2$ if there exists an invariant $f_2 \in I_{V_2} \setminus I_{V_1}$
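
The analogy with linear spaces can be made concrete with a small worked example. It is an illustration added here, not taken from the slides: two distinct coordinate planes in $\mathbb{R}^3$ stand in for two distinct irreducible varieties of the same dimension.

```latex
\[
\begin{aligned}
V_1 &= \{(x,y,z) \in \mathbb{R}^3 : z = 0\}, &
V_2 &= \{(x,y,z) \in \mathbb{R}^3 : y = 0\}, &
\dim V_1 &= \dim V_2 = 2, \\
V_1 \cap V_2 &= \{(x,0,0) : x \in \mathbb{R}\}, &
&&
\dim(V_1 \cap V_2) &= 1 < 2 .
\end{aligned}
\]
```

Here the polynomial $f_2 = y$ vanishes on $V_2$ but not on $V_1$, so $f_2 \in I_{V_2} \setminus I_{V_1}$ and $V_1 \not\subseteq V_2$. A generic point of $V_1$ (any point with $y \neq 0$) therefore lies off $V_2$.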

  20. [Figure: schematic with regions labeled $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$]
      $$I_{V_{T_1} * V_{T_2}} \neq I_{V_{T_3} * V_{T_4}}, \qquad \exists f \in I_{V_{T_1} * V_{T_2}} \setminus I_{V_{T_3} * V_{T_4}}$$

  21. Algebraic methods for proofs
      Use
      - group-based models (JC and K2P) have linear invariants, which can be used to construct invariants for 2-tree mixtures
      - computational algebra packages like Singular

  22. Main theorem (tree parameters)
      Theorem. The tree parameters of the 2-tree mixture model $M_{T_1} * M_{T_2}$ are generically identifiable under the Jukes-Cantor and Kimura 2-parameter models if $T_1, T_2$ are binary with $n \geq 4$ leaves.
      Strategy: Prove the theorem for quartets ($n = 4$), then lift to larger trees.
