Phylogenetic Trees in ACL2
Warren A. Hunt Jr. and Serita M. Nelesen
The University of Texas at Austin

Phylogenetic Trees
  Representation of the evolutionary relationship between species
[Figure: a tree drawn along a time axis labeled Very Long Ago, Long Ago, Present.]

From Organisms to Trees
Pipeline: a set of taxa -> DNA sequencing -> unaligned sequences -> multiple sequence alignment -> aligned sequences -> maximum parsimony analysis (search) -> set of optimal trees -> consensus -> consensus tree

Taxon  Unaligned    Aligned
Ape    ACCGTAGCTT   ACCGTAGCTT
Bear   ATAGTAACT    ATAGTAACT-
Dog    CCGTATTT     -CCGTA-TTT
Emu    CGCATAGC     CGCATAGC--
Frog   CCTAAAC      C-C-TA-AAC
Goat   GTAATAGAAC   GTAATAGAAC

[Figure: trees over the taxa A, B, D, E, F, G illustrating the set of optimal trees and the consensus tree.]

Lots and lots of trees
  The number of possible trees grows at least exponentially with the number of leaves in the tree
  Two main methods are used to determine the correct tree:
    A heuristic search through tree space
    A Bayesian estimation of phylogeny using Markov chain Monte Carlo
  Both of these methods may produce hundreds or thousands of trees, which are then the input to further processing
  Need a system to store these trees efficiently and perform post-tree analysis

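To make the growth of tree space concrete, the sketch below counts fully resolved (binary) trees on n labeled leaves using the standard double-factorial formulas, (2n-5)!! for unrooted and (2n-3)!! for rooted trees. These formulas are a well-known combinatorial result and are not taken from the slides; the function names are hypothetical, and Python is used only for illustration.

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1 (and 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_binary_trees(n):
    """Unrooted binary trees on n >= 3 labeled leaves: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


def rooted_binary_trees(n):
    """Rooted binary trees on n >= 2 labeled leaves: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


for n in (6, 10, 20):
    print(n, unrooted_binary_trees(n), rooted_binary_trees(n))
```

For the six taxa of the Ape-to-Goat example this already gives 105 unrooted and 945 rooted binary trees, and the count passes two million unrooted trees at ten taxa.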
Why Use ACL2?
Standard answer: Accuracy
  Explicit specification of input and output for all functions, together with proof that the specification is met within the code (guards)
  Two representations of trees, with proof that we can accurately move from one representation to the other and back
Additional answers: Storage space and performance speed
  Hash-consing gives greatly reduced storage space
  Memoization gives improved performance speed
Overall: Medical systems of the future

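The memoization point can be illustrated with a small sketch: when many trees share subtrees, a cached function does the work for each distinct subtree only once. This is not ACL2's memoization mechanism; it uses Python's `functools.lru_cache` as a stand-in, and `leaf_count` is a hypothetical example function chosen for brevity.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def leaf_count(tree):
    """Number of taxa below a node; the result is cached per distinct subtree."""
    if isinstance(tree, str):  # a leaf is a taxon name
        return 1
    return sum(leaf_count(child) for child in tree)


# Two trees written as nested tuples; they share ((A B) C) and (D E).
tree1 = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))
tree2 = ((("A", "B"), "C"), (("D", "E"), "F", "G"))

print(leaf_count(tree1), leaf_count(tree2))
print(leaf_count.cache_info())  # cache hits come from the subtrees shared with tree1
```

Because the cache is keyed on tuple equality, the shared subtrees of the second tree are answered from the cache rather than recomputed.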
Representation
[Figure: five trees over the taxa A, B, C, D, E, F, G.]
TASPI High-Level Representation:
  (((A B) C) ((D E) (F G)))
  (((A B) C) ((D E) F G))
  (((A B) C) (D (E (F G))))
  ((A (B C)) ((D E F) G))
  ((A (B C)) ((D E) (F G)))

Representation
TASPI Low-Level Representation:
  ((#1=((A B) C) #5=(#6=(D E) #9=(F G)))
   (#1# (#6# F G))
   (#1# (D (E #9#)))
   (#12=(A (B C)) ((D E F) G))
   (#12# #5#))

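The `#n=` and `#n#` labels in the low-level form mark subtrees that are stored once and referenced again. A minimal hash-consing sketch, assuming trees are written as nested Python tuples, shows the same sharing on the five example trees. The names `hash_cons` and `count_nodes` are hypothetical, and this illustrates the idea rather than TASPI's implementation.

```python
def hash_cons(tree, table):
    """Return a canonical copy of tree in which equal subtrees are one object."""
    if isinstance(tree, str):  # leaves are taxon names
        return tree
    node = tuple(hash_cons(child, table) for child in tree)
    # Reuse the subtree already stored if an equal one exists.
    return table.setdefault(node, node)


def count_nodes(tree):
    """Count internal nodes, counting repeated subtrees every time they occur."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(count_nodes(child) for child in tree)


trees = [
    ((("A", "B"), "C"), (("D", "E"), ("F", "G"))),
    ((("A", "B"), "C"), (("D", "E"), "F", "G")),
    ((("A", "B"), "C"), ("D", ("E", ("F", "G")))),
    (("A", ("B", "C")), (("D", "E", "F"), "G")),
    (("A", ("B", "C")), (("D", "E"), ("F", "G"))),
]

table = {}
shared = [hash_cons(tree, table) for tree in trees]

total = sum(count_nodes(tree) for tree in trees)
print("internal nodes written out in full:", total)
print("distinct internal nodes stored:", len(table))
print("first subtree shared by trees 1 and 2:", shared[0][0] is shared[1][0])
```

On this input 28 internal nodes collapse to 17 stored ones, and the saving grows as more trees over the same taxa are added.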
Reduced Storage Space
[Figure: file size in bytes (log scale, 10K to 1G) versus data set number (1 to 12), comparing Newick and TASPI.bhz.]

Bipartition Representation
[Figure: three trees over the taxa A, B, C, D, E, F.]

                              Tree 1                Tree 2                Tree 3
Parenthetical Notation:       (A B (C ((D E) F)))   (A (B ((D E) F)) C)   (A B ((C (D E)) F))

Bipartition Representation:   AB | CDEF             AC | BDEF             AB | CDEF
                              ABC | DEF             ABC | DEF             ABF | CDE
                              ABCF | DE             ABCF | DE             ABCF | DE

Our Bipartitions:             (A B C D E F)         (A B C D E F)         (A B C D E F)
                              (C D E F)             (B D E F)             (C D E F)
                              (D E F)               (D E F)               (C D E)
                              (D E)                 (D E)                 (D E)

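The step from parenthetical notation to the two bipartition forms can be sketched as follows, assuming each tree is written as a nested Python tuple and that a fringe is the list of taxa below an internal node. The function names are hypothetical and Python is used only for illustration.

```python
def leaves(tree):
    """Taxa below a node, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def fringes(tree, ordering):
    """Fringe of every internal node, root first, each sorted by the ordering."""
    if isinstance(tree, str):
        return []
    rank = {taxon: i for i, taxon in enumerate(ordering)}
    result = [tuple(sorted(leaves(tree), key=rank.__getitem__))]
    for child in tree:
        result.extend(fringes(child, ordering))
    return result


def bipartitions(tree, ordering):
    """Write every non-root fringe as 'rest | fringe'."""
    all_fringes = fringes(tree, ordering)
    all_taxa = all_fringes[0]
    return [
        "".join(t for t in all_taxa if t not in fringe) + " | " + "".join(fringe)
        for fringe in all_fringes[1:]
    ]


ordering = ["A", "B", "C", "D", "E", "F"]
tree1 = ("A", "B", ("C", (("D", "E"), "F")))
tree2 = ("A", ("B", (("D", "E"), "F")), "C")
tree3 = ("A", "B", (("C", ("D", "E")), "F"))

for tree in (tree1, tree2, tree3):
    print(fringes(tree, ordering), bipartitions(tree, ordering))
```

Keeping only one side of each split, plus the full taxon list for the root, yields the rows listed under Our Bipartitions.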
Relationship of Representations

  (defthm paren-partition-paren
    (implies (and < properties of input tree >
                  < properties of ordering >
                  < properties of tree and ordering >)
             (equal (tree-from-fringes (get-fringes tree ordering) ordering)
                    tree)))

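The round trip stated by the theorem can be mimicked in a small sketch. The Python functions below borrow the names `get_fringes` and `tree_from_fringes` from the theorem but are hypothetical reconstructions, not the ACL2 definitions. They assume trees are nested tuples of taxon names, that no taxon occurs twice, and that every internal node has at least two children.

```python
def get_fringes(tree, ordering):
    """Set of fringes (leaf sets of internal nodes); ordering is unused here
    and kept only to mirror the theorem's signature."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def tree_from_fringes(fringes, ordering):
    """Rebuild the tree whose internal nodes have exactly these fringes."""
    rank = {taxon: i for i, taxon in enumerate(ordering)}
    fringes = sorted(fringes, key=len, reverse=True)

    def build(fringe):
        inside = [f for f in fringes if f < fringe]  # proper sub-fringes
        maximal = [f for f in inside if not any(f < g for g in inside)]
        covered = frozenset().union(*maximal)
        # Children are the maximal sub-fringes plus the taxa they do not cover,
        # placed in the order of their first taxon.
        children = [(min(rank[t] for t in f), build(f)) for f in maximal]
        children += [(rank[t], t) for t in fringe - covered]
        return tuple(child for _, child in sorted(children, key=lambda p: p[0]))

    return build(fringes[0])  # the largest fringe belongs to the root


ordering = ["A", "B", "C", "D", "E", "F"]
tree = ("A", ("B", (("D", "E"), "F")), "C")
print(tree_from_fringes(get_fringes(tree, ordering), ordering) == tree)
```

The equality holds only because the children of every node in `tree` already follow the ordering. A tree such as `("B", "A")` comes back as `("A", "B")`, which suggests why the theorem needs hypotheses relating the tree to the ordering.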
Strict and Majority Consensus
  Strict consensus: Any branch that appears in every input tree is in the consensus tree
  Majority consensus: Any branch that appears in more than half of the input trees is in the consensus tree

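Both definitions reduce to counting how many input trees contain each branch. The sketch below assumes each tree is given as its set of fringes, written as strings of taxon letters and copied from the three trees of the bipartition slide. The function names are hypothetical, and this is an illustration, not TASPI's algorithm.

```python
from collections import Counter


def branch_counts(trees):
    """Number of input trees in which each branch (fringe) appears."""
    return Counter(branch for tree in trees for branch in set(tree))


def strict_consensus(trees):
    """Branches present in every input tree."""
    return {b for b, n in branch_counts(trees).items() if n == len(trees)}


def majority_consensus(trees):
    """Branches present in more than half of the input trees."""
    return {b for b, n in branch_counts(trees).items() if 2 * n > len(trees)}


trees = [
    {"ABCDEF", "CDEF", "DEF", "DE"},  # (A B (C ((D E) F)))
    {"ABCDEF", "BDEF", "DEF", "DE"},  # (A (B ((D E) F)) C)
    {"ABCDEF", "CDEF", "CDE", "DE"},  # (A B ((C (D E)) F))
]

print(sorted(strict_consensus(trees)))
print(sorted(majority_consensus(trees)))
```

With these inputs only the root fringe and (D E) survive the strict rule, while the majority rule also keeps (C D E F) and (D E F), each of which appears in two of the three trees.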
Example
[Figure: three input trees over the taxa A, B, C, D, E, F, followed by their majority and strict consensus trees.]

Improved Consensus Performance
[Figure: time in seconds (200 to 10000) versus data set number (1 to 12), comparing PAUP total, TNT total, TASPI total, and TASPI.bhz total.]

Conclusion and Future Work
  TASPI provides accuracy guarantees while delivering state-of-the-art performance in both storage size and speed
  TASPI is being extended to perform further post-tree analyses, as well as database operations

Questions?