1 GENE TREE CORRECTION GUIDED BY ORTHOLOGY
Manuel Lafond 1, Magali Semeria 2, Krister M. Swenson 1,4, Eric Tannier 2,3 and Nadia El-Mabrouk 1
1 Université de Montréal
2 Laboratoire de Biométrie et Biologie Évolutive
3 INRIA Grenoble Rhône-Alpes
4 McGill Center for Bioinformatics
2 Introduction
• Gene trees reflect the evolutionary history of a family of homologous genes
• Ancestral genes may have undergone a duplication or a speciation
[Figure: gene tree G of the Ensembl ZincFinger protein 800 (ZNF800) gene for the species Zebrafish, Stickleback, Medaka and Tetraodon, with leaves ZNF800 Z1, ZNF800 M, ZNF800 Z2, ZNF800 S, ZNF800 T; internal nodes are marked as duplication or speciation]
3 Introduction (LCA = Lowest Common Ancestor)
• Pairwise relationships between extant genes:
• Orthologs: the LCA is a speciation (e.g. ZNF800 Z2, ZNF800 T)
• Paralogs: the LCA is a duplication (e.g. ZNF800 Z1, ZNF800 T)
[Figure: gene tree G with leaves ZNF800 Z1, ZNF800 M, ZNF800 Z2, ZNF800 S, ZNF800 T; internal nodes are marked as duplication or speciation]
4 Introduction
• Each gene tree has an associated species tree
• Each extant gene g is mapped to an extant species by a function s(g)
[Figure: gene tree G for the ZincFinger protein 800 (leaves ZNF800 Z1, ZNF800 Z2, ZNF800 S, ZNF800 M, ZNF800 T) and species tree S (leaves M, T, S, Z)]
5 Introduction
• Each gene tree has an associated species tree
• Each extant gene g is mapped to an extant species by a function s(g)
• We use this mapping to simplify notation: each gene is named after its species (e.g. Z1 and Z2 are the two Zebrafish genes)
[Figure: gene tree G for the ZincFinger protein 800 (leaves T1, M1, Z1, Z2, S1) and species tree S (leaves M, T, S, Z)]
6 Introduction
• Each gene tree has an associated species tree
• s(g) for ancestral genes: we use the LCA mapping, where each ancestral gene is mapped to the LCA, in S, of the species its descendants are mapped to
[Figure: gene tree G for the ZincFinger protein 800 (leaves T1, M1, Z1, Z2, S1; internal nodes γ, γ1, γ2, β1, β2, named after their mapping) and species tree S (leaves M, T, S, Z; internal nodes γ, β, α)]
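The LCA mapping can be sketched in a few lines of Python. This is only an illustration: trees are written as nested tuples, the helper names (`leaves`, `species_of`, `lca_mapping`) are made up for this sketch, and the two toy topologies are assumptions, not copies of the trees in the figure.

```python
def leaves(tree):
    """Leaf labels of a tree written as nested tuples (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def species_of(gene):
    """s(g) for an extant gene: strip the copy number, e.g. 'Z2' -> 'Z'."""
    return gene.rstrip("0123456789")


def lca_mapping(gene_subtree, species_tree):
    """s(g) for any gene-tree node: smallest clade of S covering its species."""
    wanted = {species_of(g) for g in leaves(gene_subtree)}
    node = species_tree
    while not isinstance(node, str):
        below = [child for child in node if wanted <= leaves(child)]
        if not below:
            break  # the species are split across the children: node is the LCA
        node = below[0]
    return node


# Toy trees (assumed topologies, for illustration only).
S = ((("M", "T"), "S"), "Z")
G = ((("M1", "T1"), "Z2"), ("S1", "Z1"))

print(lca_mapping(("M1", "T1"), S))  # the clade ('M', 'T')
print(lca_mapping(G[0], S) == S)     # {M, T, Z} is only covered by the root of S
```

An extant gene is handled by the same function: `lca_mapping("Z1", S)` walks down to the leaf `"Z"`.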
7 Introduction
• Reconciliation infers speciation/duplication events
• If g has the same mapping as one of its children, infer a duplication (otherwise, infer a speciation)
[Figure: gene tree G for the ZincFinger protein 800 (leaves T1, M1, Z1, Z2, S1; internal nodes γ, γ1, γ2, β1, β2) and species tree S (leaves M, T, S, Z; internal nodes γ, β, α)]
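The reconciliation rule, together with the ortholog/paralog definition from the earlier slide, can be sketched as follows. The function names (`reconcile`, `relation`) and the toy topologies are assumptions of this sketch; the gene tree was chosen so that Z2/T1 come out as orthologs and Z1/T1 as paralogs, as in the examples above, but it is not copied from the figure.

```python
def leaves(tree):
    """Leaf labels of a tree written as nested tuples (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def species_of(gene):
    """s(g) for an extant gene: strip the copy number, e.g. 'Z2' -> 'Z'."""
    return gene.rstrip("0123456789")


def lca_mapping(gene_subtree, species_tree):
    """Smallest clade of the species tree covering the species of the subtree."""
    wanted = {species_of(g) for g in leaves(gene_subtree)}
    node = species_tree
    while not isinstance(node, str):
        below = [child for child in node if wanted <= leaves(child)]
        if not below:
            break
        node = below[0]
    return node


def reconcile(gene_tree, species_tree, events=None):
    """Label every internal node of G; keys are frozensets of leaf genes."""
    if events is None:
        events = {}
    if not isinstance(gene_tree, str):
        here = lca_mapping(gene_tree, species_tree)
        # Duplication iff a child is mapped to the same species-tree node.
        same = any(lca_mapping(c, species_tree) == here for c in gene_tree)
        events[frozenset(leaves(gene_tree))] = "duplication" if same else "speciation"
        for child in gene_tree:
            reconcile(child, species_tree, events)
    return events


def relation(g1, g2, events):
    """Orthologs if the LCA of g1 and g2 in G is a speciation, else paralogs."""
    lca = min((clade for clade in events if {g1, g2} <= clade), key=len)
    return "orthologs" if events[lca] == "speciation" else "paralogs"


# Toy trees (assumed topologies, for illustration only).
S = ((("M", "T"), "S"), "Z")
G = ((("M1", "T1"), "Z2"), ("S1", "Z1"))

events = reconcile(G, S)
print(relation("Z2", "T1", events))  # orthologs
print(relation("Z1", "T1", events))  # paralogs
```

In this toy tree the root is the only duplication: it is mapped to the root of S, and so is its child with leaves M1, T1, Z2.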
12 Introduction
• Orthology and paralogy are usually inferred from the gene tree
• Can we instead infer (or correct) parts of the gene tree, given orthology/paralogy relationships?
[Figure: gene tree G for the ZincFinger protein 800 (leaves T1, M1, Z1, Z2, S1; internal nodes γ, γ1, γ2, β1, β2) and species tree S (leaves M, T, S, Z; internal nodes γ, β, α)]
13 Introduction
• CASE 1: Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
• Correct the gene tree making the minimum number of “moves”
[Figure: gene tree G (leaves T1, M1, Z1, Z2, S1) with the untrusted duplication highlighted, and species tree S (leaves M, T, S, Z)]
14 Introduction
• CASE 1: Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
• Correct the gene tree making the minimum number of “moves”
[Figure: gene tree G after the correction (leaf order S1, M1, Z1, Z2, T1) and species tree S (leaves M, T, S, Z)]
15 Introduction
• CASE 2: Suppose we KNOW Z1 and T1 are orthologous
• Correct the gene tree making the minimum number of “moves”
[Figure: gene tree G (leaves S1, M1, Z1, Z2, T1) with the untrusted duplication highlighted, and species tree S (leaves M, T, S, Z)]
16 Introduction
• CASE 2: Suppose we KNOW Z1 and T1 are orthologous
• Correct the gene tree making the minimum number of “moves”
[Figure: gene tree G after the correction (leaves S1, M1, Z1, Z2, T1) and species tree S (leaves M, T, S, Z)]
17 Two correction problems
• Cases 1 and 2 give us speciation (orthology) constraints
• Given G containing untrusted duplications, find a gene tree G’ that satisfies the given constraints AND modifies G as little as possible
• e.g. minimize the Robinson-Foulds distance between G and G’
[Figure: gene trees G and G’ on the leaves M1, T1, Z1, Z2, S1]
18 RF distance
• In the case of rooted binary trees T1, T2 with the same leaves:
• RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
• Clades of T1 in the example: x = {g1, g2, g3}, y = {g2, g3}, z = {g4, g5}
• Exactly one of these clades is missing from T2, so RFDist(T1, T2) = 2
[Figure: trees T1 (leaf order g1, g2, g3, g4, g5; root r, internal nodes x, y, z) and T2 (leaf order g1, g3, g2, g4, g5)]
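As a rough sketch, the clade-counting definition can be checked on the five-leaf example. The function names are illustrative, and the shape of T2 is an assumption: only its leaf order is visible on the slide.

```python
def clades(tree):
    """Return (leaf set of tree, set of the clades of all its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaf_set, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaf_set |= child_leaves
        found |= child_clades
    found.add(leaf_set)
    return leaf_set, found


def rf_dist(t1, t2):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    c1, c2 = clades(t1)[1], clades(t2)[1]
    return len(c1 ^ c2)


# T1 has the clades x = {g1, g2, g3}, y = {g2, g3}, z = {g4, g5}.
T1 = (("g1", ("g2", "g3")), ("g4", "g5"))
# Assumed shape for T2, compatible with its leaf order g1, g3, g2, g4, g5.
T2 = ((("g1", "g3"), "g2"), ("g4", "g5"))

missing = clades(T1)[1] - clades(T2)[1]
print(rf_dist(T1, T2))   # 2
print(2 * len(missing))  # 2, the "twice the missing clades" shortcut
```

The shortcut works because two rooted binary trees on the same leaves have the same number of clades, so each tree contributes equally to the symmetric difference.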
23 Detecting untrustworthy duplications
• Some duplications are labeled “dubious” or given low confidence values by Ensembl
• We can use synteny to infer orthology/paralogy relationships [1]
• Software inferring ancestral adjacencies might pick up erroneous duplications
• Using DeCo, one can identify bad duplications when more than two adjacencies are inferred on an ancestral gene [2]
[1] Lafond, Swenson, El-Mabrouk, “Error detection and correction of gene trees”, MAGE (2013)
[2] Chauve, El-Mabrouk, Guéguen, Semeria, Tannier, “Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later”, MAGE (2013)
24 Detecting untrustworthy duplications
• Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they lie in a conserved region of homologous genes)
• In this example, the conserved region involves 5 gene families
[Figure: Genome a with genes a--, a-, a1, a+, a++ aligned with Genome b with genes b--, b-, b1, b+, b++]
25 Detecting untrustworthy duplications
• Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they lie in a conserved region of homologous genes)
• In this example, the conserved region involves 5 gene families
• Look at the gene tree of each family involved
[Figure: Genome a with genes a--, a-, a1, a+, a++ aligned with Genome b with genes b--, b-, b1, b+, b++]
26 Detecting untrustworthy duplications
• If all the other homologous genes in the regions are orthologous, we expect a1 and b1 to be orthologous as well
[Figure: Genome a with genes a--, a-, a1, a+, a++ aligned with Genome b with genes b--, b-, b1, b+, b++]
27 Detecting untrustworthy duplications
• If all the other homologous genes in the regions are orthologous, we expect a1 and b1 to be orthologous as well
• If they are not, some unlikely event must have occurred
[Figure: Genome a with genes a--, a-, a1, a+, a++ aligned with Genome b with genes b--, b-, b1, b+, b++]
28 Detecting untrustworthy duplications
• What’s wrong with this?
• If only the ancestral gene ab duplicated, the copy typically went somewhere else on the ancestral genome
• And somehow, it ended up in a region similar to that of the original gene, mostly by chance
[Figure: ancestral gene ab and its copy, above Genome a (genes a--, a-, a1, a+, a++) and Genome b (genes b--, b-, b1, b+, b++)]
29 Detecting untrustworthy duplications
• We looked at ~6000 Ensembl gene trees, restricted to the Zebrafish, Medaka, Tetraodon and Stickleback species
• 22% (~1200) of these trees contained this type of bad duplication
[Figure: Genome a with genes a--, a-, a1, a+, a++ aligned with Genome b with genes b--, b-, b1, b+, b++]
31 Problem 1
• Given: a gene tree G, a species tree S, and a set C of clades of G that are required to be speciations
• Find: a corrected gene tree G’ in which all clades in C are preserved and are speciations, and such that RFDist(G, G’) is minimized (as many clades as possible are preserved)
[Figure: gene trees G and G’ on the leaves a1, b1, c1, c2, d1, d2; preserved clades are shown in green]
32 Problem 1
• A solution doesn’t always exist
• In this example, if C = {x, y}, we cannot correct both x and y into speciations
• A solution exists iff there are no two nodes x, y in C such that x is an ancestor of y and s(x) = s(y)
• We will assume that a solution exists
[Figure: gene tree G (leaves a1, c1, d1, b1) with nodes x and y such that s(x) = s(y), C = {x, y}, and species tree S (leaves a, b, c, d)]
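The existence condition is easy to test directly. The sketch below assumes that each constraint in C is given as the subtree of G rooted at the constrained node; `solvable` and the toy trees are hypothetical, chosen to mirror the situation on the slide (x is an ancestor of y and both map to the same node of S).

```python
def leaves(tree):
    """Leaf labels of a tree written as nested tuples (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def species_of(gene):
    """s(g) for an extant gene: strip the copy number, e.g. 'c2' -> 'c'."""
    return gene.rstrip("0123456789")


def lca_mapping(gene_subtree, species_tree):
    """Smallest clade of the species tree covering the species of the subtree."""
    wanted = {species_of(g) for g in leaves(gene_subtree)}
    node = species_tree
    while not isinstance(node, str):
        below = [child for child in node if wanted <= leaves(child)]
        if not below:
            break
        node = below[0]
    return node


def solvable(constraints, species_tree):
    """False iff some x in C is a strict ancestor of some y in C with s(x) = s(y)."""
    for x in constraints:
        for y in constraints:
            is_ancestor = leaves(y) < leaves(x)  # strict subset of the leaves
            if is_ancestor and lca_mapping(x, species_tree) == lca_mapping(y, species_tree):
                return False
    return True


# Toy trees (assumed topologies, for illustration only).
S = (("a", "b"), ("c", "d"))
G = (("a1", "c1"), ("d1", "b1"))
x, y = G, G[0]               # s(x) = s(y) = root of S

print(solvable([x, y], S))   # False
print(solvable([x], S))      # True
```

Intuitively, once x is a speciation each of its children only holds genes from one side of s(x), so no descendant of x can still be mapped to s(x).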
33 Problem 1
• To transform x into a speciation:
• Let L and R be the two children of s(x)
[Figure: species tree S (leaves a, b, c, d) with L and R the children of s(x); gene tree G with node x and leaves b2, a1, c1, c2, c3, b1, d1, d2]
34 Problem 1
• Find G_L (resp. G_R), the set of maximal subtrees of G that contain only genes mapped to species in L (resp. R)
[Figure: species tree S (leaves a, b, c, d) with L and R the children of s(x); gene tree G with node x and leaves b2, a1, c1, c2, c3, b1, d1, d2]
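35 Problem 1
• Form G* by making two polytomies (non-binary subtrees), one from the subtrees in G_L and one from the subtrees in G_R, joined under a common parent
[Figure: species tree S (leaves a, b, c, d) with L and R the children of s(x); gene tree G with node x and leaves b2, a1, c1, c2, c3, b1, d1, d2; tree G* with nodes L1 and R1 and leaves a1, b1, b2, c1, c2, c3, d1, d2]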
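36 Problem 1
• Theorem: any binary resolution of G* is a solution to Problem 1.
[Figure: species tree S (leaves a, b, c, d) with L and R the children of s(x); gene tree G with node x and leaves b2, a1, c1, c2, c3, b1, d1, d2; tree G* with nodes L1 and R1 and leaves a1, b1, b2, c1, c2, c3, d1, d2]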
37 Problem 1
• Theorem: any binary resolution of G* is a solution to Problem 1.
• In fact, every solution is the result of a binary resolution of G*.
[Figure: species tree S (leaves a, b, c, d) with L and R the children of s(x); gene tree G with node x and leaves b2, a1, c1, c2, c3, b1, d1, d2; tree G* with nodes L1 and R1 and leaves a1, b1, b2, c1, c2, c3, d1, d2]
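The construction of G* from the last few slides can be sketched as below. Everything here is illustrative: the helper names are made up, the gene tree topology is assumed (only its leaves are taken from the example), x is taken to be the root of the subtree being corrected, and `resolve` picks one arbitrary binary resolution among the many possible ones.

```python
def leaves(tree):
    """Leaf labels of a tree written as nested tuples (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def species_of(gene):
    """s(g) for an extant gene: strip the copy number, e.g. 'c2' -> 'c'."""
    return gene.rstrip("0123456789")


def maximal_subtrees(tree, allowed):
    """Maximal subtrees whose genes are all mapped to species in `allowed`."""
    if {species_of(g) for g in leaves(tree)} <= allowed:
        return [tree]
    if isinstance(tree, str):
        return []
    return [sub for child in tree for sub in maximal_subtrees(child, allowed)]


def group(subtrees):
    """One polytomy over the subtrees (no extra node for a single subtree)."""
    return subtrees[0] if len(subtrees) == 1 else tuple(subtrees)


def build_g_star(x, s_x):
    """G*: polytomies over G_L and G_R; s_x = (L, R) must be an internal node of S."""
    left, right = s_x
    g_l = maximal_subtrees(x, leaves(left))
    g_r = maximal_subtrees(x, leaves(right))
    return (group(g_l), group(g_r))


def resolve(tree):
    """One arbitrary binary resolution: join the children left to right."""
    if isinstance(tree, str):
        return tree
    kids = [resolve(child) for child in tree]
    out = kids[0]
    for kid in kids[1:]:
        out = (out, kid)
    return out


# Species tree with L = (a, b) and R = (c, d); assumed gene tree topology.
S = (("a", "b"), ("c", "d"))
G = (("b2", ("a1", ("c1", "c2"))), (("c3", "b1"), ("d1", "d2")))

g_star = build_g_star(G, S)  # here s(x) is the root of S
print(g_star)
print(resolve(g_star))
```

In this toy example the subtrees (c1, c2) and (d1, d2) survive intact inside the R polytomy, which is how G* keeps as many clades of G as possible.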