Quartet Inference from SNP Data Under the Coalescent Model
Syed Shalan Naqvi

Problem Statement
◮ We are given aligned sequence data from multiple genes
◮ We want a good estimate of the species tree

Two Common Approaches
◮ Summary methods (e.g. STEM, MP-EST)
  ◮ Use sequence data to estimate gene trees
  ◮ Use gene trees to estimate the species tree
◮ Bayesian methods (e.g. BEST, *BEAST)
  ◮ Co-estimate gene trees and species trees using MCMC

Issues
◮ Summary methods (e.g. STEM, MP-EST)
  ◮ Assume the estimated gene trees are error-free
  ◮ For short sequences, where gene trees are estimated poorly, this can be a big problem
◮ Bayesian methods (e.g. BEST, *BEAST)
  ◮ Do not scale to large datasets

A New Approach
◮ SVDQuartets uses the sequence data directly
◮ It does not use a Bayesian approach

Background
◮ Suppose we are given a species tree and a model for sequence evolution along gene trees (e.g. Jukes-Cantor, GTR)
◮ A species tree defines a probability distribution on gene trees
◮ Using this, and the model for sequence evolution, we can compute the probability of observing a particular character pattern at the leaves of the species tree

Background II
◮ For a species tree with 4 taxa, write $p_{ijkl}$ for the probability $P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$
◮ We can calculate all $4^4 = 256$ of these probabilities and, for a given split such as 12|34, write them in a 16 × 16 matrix (with rows representing the possible values of $(X_1, X_2)$ and columns the possible values of $(X_3, X_4)$)
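
In practice the entries of this matrix are estimated from observed site-pattern frequencies. The following is a minimal sketch of that step, not the SVDQuartets implementation: the function name `flattening`, the input format (four equal-length strings over A, C, G, T) and the row/column index convention `4 * i + j` are all assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}

def flattening(seqs, split=((0, 1), (2, 3))):
    """Estimate the 16 x 16 matrix of site-pattern frequencies for one split.

    seqs  : four aligned, equal-length strings (hypothetical input format)
    split : ((a, b), (c, d)); sequences a, b index the rows and
            sequences c, d index the columns
    """
    (a, b), (c, d) = split
    counts = np.zeros((16, 16))
    for column in zip(*seqs):
        if any(ch not in INDEX for ch in column):
            continue  # skip gaps and ambiguity codes (an assumption)
        s = [INDEX[ch] for ch in column]
        # Row index encodes the pair of states on one side of the split,
        # column index the pair of states on the other side.
        counts[4 * s[a] + s[b], 4 * s[c] + s[d]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Calling `flattening(seqs, split=((0, 2), (1, 3)))` gives the matrix for the split 13|24 under the same convention.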

Background III
◮ We can build this matrix for each of the 3 possible splits (12|34, 13|24, 14|23)
◮ We then have the following theorem

Theorem. Assuming a strict molecular clock, for the split corresponding to the true species tree, the rank of the corresponding matrix is at most 10. For all other splits, the rank is strictly greater than 10.

SVDQuartets
◮ The theorem suggests the following procedure for estimating the species tree
  ◮ Estimate the probabilities from the sequences (assuming each site has its own genealogy)
  ◮ For all 3 splits, compute the rank of the corresponding matrix
  ◮ The matrix with rank ≤ 10 gives the topology of the species tree

SVDQuartets II
◮ Since the calculated probabilities are estimates, we do not really expect to find a matrix with rank ≤ 10
◮ So instead we pick the matrix that is closest to a rank-10 matrix, using the SVD
◮ We can use bootstrap samples to estimate uncertainty
◮ For more than four taxa
  ◮ Do this for all $\binom{n}{4}$ sets of four taxa
  ◮ Use a quartet assembly algorithm to get the full species tree
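
The two bookkeeping steps on this slide, enumerating quartets and drawing bootstrap samples, can be sketched as follows. This is an illustrative sketch only: the helper names `all_quartets` and `bootstrap_columns` are hypothetical, resampling whole alignment columns with replacement is an assumed bootstrap scheme, and the quartet assembly step is not shown.

```python
from itertools import combinations
import random

def all_quartets(taxa):
    """Return every set of four taxa; there are C(n, 4) of them."""
    return list(combinations(taxa, 4))

def bootstrap_columns(seqs, rng):
    """Resample alignment columns with replacement (assumed scheme).

    The same column indices are used for every sequence, so each
    resampled site pattern is one that occurs in the original data.
    """
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(s[i] for i in picks) for s in seqs]
```

Each quartet would then be scored on the original data and on each bootstrap replicate, and the inferred quartet topologies passed to a separate assembly method.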

SVDQuartets III
◮ Let $M_{12}$ denote the matrix for the split 12|34
◮ Factoring $M_{12}$ using the SVD, we get $M_{12} = U \Sigma V^T$ (where $\Sigma$ is a diagonal matrix holding the singular values in decreasing order)
◮ Then the distance of $M_{12}$ to the nearest rank-10 matrix is $\sqrt{\sum_{i=11}^{16} \Sigma_{ii}^2}$
◮ This is defined as the SVDScore for the split 12|34
◮ We pick the split corresponding to the lowest SVDScore
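
The score described on this slide is straightforward to compute with NumPy. The sketch below is illustrative rather than the SVDQuartets code: the function names `svd_score` and `best_split` and the dictionary-of-matrices input are assumptions. It relies on `numpy.linalg.svd` returning singular values sorted from largest to smallest.

```python
import numpy as np

def svd_score(M, rank=10):
    """Frobenius distance from M to the nearest matrix of the given rank.

    This is the square root of the sum of the squared singular values
    beyond the first `rank` (the 11th to 16th for a 16 x 16 matrix).
    """
    singular_values = np.linalg.svd(M, compute_uv=False)  # largest first
    return float(np.sqrt(np.sum(singular_values[rank:] ** 2)))

def best_split(matrices):
    """Pick the split with the lowest score.

    matrices : dict mapping a split label such as "12|34" to its
               16 x 16 matrix (hypothetical input format)
    """
    scores = {label: svd_score(M) for label, M in matrices.items()}
    return min(scores, key=scores.get), scores
```

A matrix that is exactly rank 10 or lower scores zero, so with exact probabilities the true split would be the only one with a zero score.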

Simulations
◮ Generated a model species tree of the form ((1:x, 2:x):x, (3:x, 4:x):x), where x is the branch length
◮ Sampled g gene trees according to the model species tree
◮ Generated sequences of length n on those gene trees
◮ For generating unlinked SNP data, g was set to 5000 and n to 1
◮ For multi-locus data, g = 10 and n = 500 were considered
◮ Simulations were done for Jukes-Cantor and GTR + I + Γ
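
To make the unlinked case (n = 1) concrete, here is a rough sketch of one way such data could be simulated under Jukes-Cantor. It is not the procedure or software used for the results in this talk. The assumptions are: one sampled lineage per species, branch lengths x in coalescent units, a hypothetical parameter `mu` giving substitutions per site per coalescent unit, and invariant sites being kept in the output. All function names are hypothetical.

```python
import math
import random

def simulate_gene_tree(x, rng):
    """Sample a gene tree inside ((1:x, 2:x):x, (3:x, 4:x):x)."""
    def leaf(label):
        return {"time": 0.0, "children": [], "label": label}

    def join(a, b, t):
        return {"time": t, "children": [a, b], "label": None}

    def cherry_population(a, b):
        # Two lineages share a population from time x to 2x and
        # coalesce there at rate 1, if at all.
        wait = rng.expovariate(1.0)
        return [join(a, b, x + wait)] if wait < x else [a, b]

    lineages = (cherry_population(leaf(0), leaf(1))
                + cherry_population(leaf(2), leaf(3)))
    t = 2 * x  # the root population starts here
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)  # k lineages: rate C(k, 2)
        i, j = rng.sample(range(k), 2)
        merged = join(lineages[i], lineages[j], t)
        lineages = [n for idx, n in enumerate(lineages)
                    if idx not in (i, j)] + [merged]
    return lineages[0]

def evolve_site(node, state, mu, rng, out):
    """Drop one Jukes-Cantor site down the tree, filling out[label]."""
    if not node["children"]:
        out[node["label"]] = state
        return
    for child in node["children"]:
        t = node["time"] - child["time"]
        p_same = 0.25 + 0.75 * math.exp(-4.0 * mu * t / 3.0)
        if rng.random() < p_same:
            child_state = state
        else:
            child_state = rng.choice([b for b in "ACGT" if b != state])
        evolve_site(child, child_state, mu, rng, out)

def simulate_sites(x, g, mu, seed=0):
    """Return four sequences of length g, one site per gene tree."""
    rng = random.Random(seed)
    columns = []
    for _ in range(g):
        tree = simulate_gene_tree(x, rng)
        out = [None] * 4
        evolve_site(tree, rng.choice("ACGT"), mu, rng, out)
        columns.append(out)
    return ["".join(col[i] for col in columns) for i in range(4)]
```

For example, `simulate_sites(1.0, 5000, 0.1)` would produce 5000 independent sites for the four taxa. Restricting the output to variable sites, or using GTR + I + Γ, would need additional code.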

Results (Jukes-Cantor)

Results (GTR + I + Γ)

Discussion
◮ Across both datasets, we can see that SVDQuartets easily identifies the correct split
◮ The theory for the model was derived for SNP sites, and for GTR and its submodels