Quartet Inference from SNP Data Under the Coalescent Model



  1. Quartet Inference from SNP Data Under the Coalescent Model
     Syed Shalan Naqvi

  2. Problem Statement
     ◮ We’re given aligned sequence data from multiple genes
     ◮ We want a good estimate of the species tree

  4. Two Common Approaches
     ◮ Summary methods (e.g. STEM, MP-EST)
       ◮ Use sequence data to estimate gene trees
       ◮ Use gene trees to estimate species trees
     ◮ Bayesian methods (e.g. BEST, *BEAST)
       ◮ Co-estimate gene trees and species trees using MCMC

  6. Issues
     ◮ Summary methods (e.g. STEM, MP-EST)
       ◮ Assume the estimated gene trees are error-free
       ◮ For short sequences, this can be a big problem
     ◮ Bayesian methods (e.g. BEST, *BEAST)
       ◮ Don’t scale to large datasets

  8. A New Approach
     ◮ SVDQuartets uses the sequence data directly
     ◮ Does not use a Bayesian approach

  11. Background
      ◮ Suppose we’re given a species tree and a model for sequence evolution along gene trees (e.g. Jukes-Cantor, GTR)
      ◮ A species tree defines a probability distribution on gene trees
      ◮ Using this, and the model for sequence evolution, we can compute the probability of observing a particular character pattern at the leaves of the species tree
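To make "a species tree defines a probability distribution on gene trees" concrete, here is a minimal sketch for the standard three-taxon case, which is not taken from the slides. It assumes a species tree ((A,B),C) with one sampled lineage per species and an internal branch of length `t` in coalescent units. The function name and labels are illustrative.

```python
import math

def triplet_gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t: internal branch length in coalescent units (assumed parameterization).
    """
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    p_no_coal = math.exp(-t)
    # If they fail, all three lineages reach the root population and each of
    # the three topologies is equally likely.
    discordant = p_no_coal / 3.0
    concordant = 1.0 - 2.0 * discordant
    return {
        "((A,B),C)": concordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

probs = triplet_gene_tree_probs(1.0)
print(probs)
```

Short internal branches push all three topologies toward probability 1/3, which is why gene trees can disagree with the species tree.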

  13. Background II
      ◮ For a species tree with 4 taxa, write $p_{ijkl}$ for the probability $P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$ (for a given split)
      ◮ We can calculate all these probabilities, and write them in a 16 × 16 matrix (with rows representing the possible values for $X_1, X_2$)
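The 16 × 16 matrix can be estimated from data by counting site patterns. The sketch below is one way to do that, assuming four aligned sequences over A, C, G, T and treating each site as one observation. The names `flattening`, `BASES`, and `IDX` are illustrative and do not come from the slides or from any SVDQuartets implementation.

```python
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def flattening(seqs, split=((0, 1), (2, 3))):
    """Estimate the 16 x 16 matrix of pattern frequencies for one split.

    seqs: four equal-length strings over ACGT (sites containing any other
          character are skipped).
    split: ((a, b), (c, d)), indices into seqs. Rows are indexed by the
           states of taxa (a, b), columns by the states of taxa (c, d).
    """
    (a, b), (c, d) = split
    counts = np.zeros((16, 16))
    for site in zip(*seqs):
        if any(ch not in IDX for ch in site):
            continue  # skip gaps and ambiguity codes
        row = 4 * IDX[site[a]] + IDX[site[b]]
        col = 4 * IDX[site[c]] + IDX[site[d]]
        counts[row, col] += 1
    total = counts.sum()
    return counts / total if total else counts

# Usage: the same data flattened along two different splits.
seqs = ["AAC", "AAG", "AAT", "AAT"]
M_12 = flattening(seqs, ((0, 1), (2, 3)))
M_13 = flattening(seqs, ((0, 2), (1, 3)))
```

Changing `split` only rearranges the same 256 pattern frequencies into a different matrix, so the three splits share one set of estimates.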

  15. Background III
      ◮ We can make this matrix for all 3 possible splits (12|34, 13|24, 14|23)
      ◮ We then have the following theorem
        Theorem 1. Assuming a strict molecular clock, for the split corresponding to the true species tree, the rank of the corresponding matrix is at most 10. For all other splits, the rank is strictly greater than 10.

  17. SVDQuartets
      ◮ Theorem 1 suggests the following procedure for estimating the species tree
        ◮ Estimate the probabilities from the sequences (assuming each site has its own genealogy)
        ◮ For all 3 splits, compute the rank of the matrices
        ◮ The matrix with rank ≤ 10 gives the topology of the species tree

  20. SVDQuartets II
      ◮ Since the calculated probabilities are estimates, we don’t really expect to find a matrix with rank ≤ 10
      ◮ So instead we use the SVD to pick the matrix that’s closest to a rank-10 matrix
      ◮ We can use bootstrap samples to estimate uncertainty
      ◮ For more than four taxa
        ◮ Do this for all $\binom{n}{4}$ sets of four taxa
        ◮ Use a quartet assembly algorithm to get the full species tree
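The bookkeeping behind the last two bullets can be sketched as follows. This assumes an alignment stored as a list of equal-length strings and a nonparametric bootstrap that resamples sites (columns) with replacement. The function names are illustrative, and the quartet assembly step is not shown.

```python
import itertools
import random

def all_quartets(taxa):
    """Every 4-element subset of the taxa, as tuples in input order."""
    return list(itertools.combinations(taxa, 4))

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of equal-length strings, one per taxon.
    rng: a random.Random instance, passed in so replicates are reproducible.
    """
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every taxon, so each site
    # pattern is kept intact.
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Usage: 6 taxa give C(6, 4) = 15 quartets to score.
quartets = all_quartets(["t1", "t2", "t3", "t4", "t5", "t6"])
replicate = bootstrap_columns(["ACGT", "AAGT", "ACGA"], random.Random(0))
```

The number of quartets grows as n^4, so scoring every one becomes expensive for large numbers of taxa.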

  25. SVDQuartets III
      ◮ Let $M_{12}$ denote the matrix for the split 12|34
      ◮ Then factoring $M_{12}$ using SVD, we get $M_{12} = U \Sigma V^T$ (where $\Sigma$ is a diagonal matrix)
      ◮ Then the distance of $M_{12}$ to a rank 10 matrix is $\sqrt{\sum_{i=11}^{16} \Sigma_{ii}^2}$
      ◮ This is defined as the SVDScore for the split 12|34
      ◮ We pick the split corresponding to the lowest SVDScore
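The score on this slide can be computed with NumPy as sketched below: take the singular values of a 16 × 16 matrix and sum the squares of the 11th through 16th. The names `svd_score` and `best_split` are illustrative and are not the API of any SVDQuartets implementation.

```python
import numpy as np

def svd_score(M, rank=10):
    """Frobenius distance from M to the nearest matrix of the given rank."""
    # Singular values are returned in descending order.
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def best_split(flattenings):
    """Return the split label whose matrix has the lowest score.

    flattenings: dict mapping a split label (e.g. "12|34") to its matrix.
    """
    return min(flattenings, key=lambda label: svd_score(flattenings[label]))

# Usage with toy matrices: one of exact rank 10, one of full rank.
low_rank = np.diag([1.0] * 10 + [0.0] * 6)
full_rank = np.diag(np.arange(16.0, 0.0, -1.0))
print(best_split({"12|34": low_rank, "13|24": full_rank}))
```

A matrix of rank at most 10 scores zero up to floating-point error, so with estimated frequencies the true split is expected to give the smallest score, not an exact zero.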

  29. Simulations
      ◮ Generated a model species tree of the form ((1:x, 2:x):x, (3:x, 4:x):x) (where x is the branch length)
      ◮ Sampled g gene trees according to the model species tree
      ◮ Generated sequences of length n on those gene trees
      ◮ For generating unlinked SNP data, g was set to 5000 and n to 1
      ◮ For multi-locus data, g = 10, n = 500 was considered
      ◮ Simulations were done for Jukes-Cantor and GTR + I + Γ
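The unlinked setting (n = 1) can be imitated with a small simulator. The sketch below is not the software used for the slides. It assumes one lineage per species, branch lengths `x` in coalescent units, a hypothetical substitution rate `mu` per coalescent unit, and Jukes-Cantor evolution from a uniformly chosen root base.

```python
import math
import random

BASES = "ACGT"

def coalesce(lineages, start, end, rng):
    """Coalesce lineages within one population between times start and end."""
    t, lineages = start, list(lineages)
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)  # waiting time to next coalescence
        if t >= end:
            break
        i, j = rng.sample(range(k), 2)
        merged = {"height": t, "children": [lineages[i], lineages[j]]}
        lineages = [l for m, l in enumerate(lineages) if m not in (i, j)] + [merged]
    return lineages

def sample_gene_tree(x, rng):
    """Gene tree for species tree ((1:x, 2:x):x, (3:x, 4:x):x)."""
    leaves = [{"height": 0.0, "children": [], "leaf": i} for i in range(4)]
    left = coalesce(leaves[:2], x, 2 * x, rng)    # ancestor of taxa 1 and 2
    right = coalesce(leaves[2:], x, 2 * x, rng)   # ancestor of taxa 3 and 4
    (root,) = coalesce(left + right, 2 * x, float("inf"), rng)
    return root

def evolve_site(node, state, mu, rng, out):
    """Drop one Jukes-Cantor site down the gene tree, filling out[leaf]."""
    if not node["children"]:
        out[node["leaf"]] = state
        return
    for child in node["children"]:
        length = node["height"] - child["height"]
        p_same = 0.25 + 0.75 * math.exp(-4.0 * mu * length / 3.0)
        if rng.random() < p_same:
            child_state = state
        else:
            child_state = rng.choice([b for b in BASES if b != state])
        evolve_site(child, child_state, mu, rng, out)

def simulate_sites(x, g, mu=0.1, seed=0):
    """Simulate g unlinked sites, one per independently sampled gene tree."""
    rng = random.Random(seed)
    seqs = [[] for _ in range(4)]
    for _ in range(g):
        out = [None] * 4
        evolve_site(sample_gene_tree(x, rng), rng.choice(BASES), mu, rng, out)
        for i in range(4):
            seqs[i].append(out[i])
    return ["".join(s) for s in seqs]

alignment = simulate_sites(x=1.0, g=5000)
```

Because every site is drawn from its own gene tree, the output matches the assumption that each site has its own genealogy. Invariant sites are kept in this sketch.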

  30. Results (Jukes-Cantor)

  31. Results (GTR + I + Γ)

  33. Discussion
      ◮ Across both datasets, SVDQuartets easily identifies the correct split
      ◮ The theory for the model was derived for SNP sites, and for GTR and its submodels
