Constructing parsimonious hybridization networks using a SAT-solver Vladimir Ulyantsev and Mikhail Melnik, presented by Alexey Sergushichev AlCoB 2015, Mexico
Phylogenetic tree • Binary tree with a set of taxa as leaves • Can be defined for a particular gene
Hybridization network • Directed acyclic graph with a single root • Reticulation nodes: in-degree = 2, out-degree = 1 • Regular nodes: in-degree = 1, out-degree = 2 • Leaves: taxa
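The degree conditions on this slide can be checked mechanically. The following is a minimal sketch, not code from PhyloSAT: the function name, the edge-list representation and the example network are illustrative assumptions, and the root is assumed to be a regular node that simply has no parent.

```python
from collections import defaultdict


def is_hybridization_network(edges):
    """Check degree and acyclicity conditions for a list of (parent, child) edges.

    Returns (ok, kinds), where kinds maps each node to its role.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        children[parent].append(child)
        nodes.update((parent, child))

    # Allowed (in-degree, out-degree) pairs. Assumption: the root is a
    # regular node without a parent.
    allowed = {(0, 2): "root", (1, 2): "regular",
               (2, 1): "reticulation", (1, 0): "leaf"}
    kinds = {v: allowed.get((indeg[v], outdeg[v])) for v in nodes}
    if None in kinds.values() or list(kinds.values()).count("root") != 1:
        return False, kinds

    # Kahn's algorithm: the graph is acyclic iff every node can be removed
    # once all of its parents have been removed.
    remaining = {v: indeg[v] for v in nodes}
    stack = [v for v in nodes if remaining[v] == 0]
    removed = 0
    while stack:
        v = stack.pop()
        removed += 1
        for c in children[v]:
            remaining[c] -= 1
            if remaining[c] == 0:
                stack.append(c)
    return removed == len(nodes), kinds


# Hypothetical example: taxa a, b, c and one reticulation node h.
example_edges = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "h"),
                 ("y", "h"), ("y", "c"), ("h", "b")]
ok, kinds = is_hybridization_network(example_edges)
hybridization_number = sum(1 for kind in kinds.values() if kind == "reticulation")
```

In this example the network has three leaves, three regular nodes (including the root) and one reticulation node, so its hybridization number is 1.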
Displaying a tree • Select a direction (one of the two parents) at each reticulation node • Collapse simple (non-branching) paths
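The two steps on this slide can be sketched as a short recursive procedure. This is an illustrative sketch under assumptions, not the authors' implementation: the network is a dictionary from node to list of children, `chosen_parent` (a hypothetical name) records which parent edge is kept for each reticulation node, and trees are returned as nested tuples of leaf labels.

```python
def displayed_tree(children, root, chosen_parent):
    """Return the tree displayed by a network for one choice of parents.

    children: dict mapping each internal node to its list of children.
    chosen_parent: dict mapping each reticulation node to the parent it keeps.
    """
    def build(v):
        if v not in children:            # leaf (taxon)
            return v
        # Keep an edge v -> c unless c is a reticulation node that chose
        # its other parent.
        kept = [c for c in children[v] if chosen_parent.get(c, v) == v]
        subtrees = [t for t in map(build, kept) if t is not None]
        if not subtrees:                 # dead end after dropping edges
            return None
        if len(subtrees) == 1:           # simple path: collapse this node
            return subtrees[0]
        # Sort for a canonical form so that trees can be compared.
        return tuple(sorted(subtrees, key=repr))

    return build(root)


# Hypothetical network with taxa a, b, c and one reticulation node h.
network = {"root": ["x", "y"], "x": ["a", "h"], "y": ["h", "c"], "h": ["b"]}
tree_via_x = displayed_tree(network, "root", {"h": "x"})
tree_via_y = displayed_tree(network, "root", {"h": "y"})
```

With `h` attached to `x` the network displays the tree grouping a with b; with `h` attached to `y` it displays the tree grouping b with c. A network with k reticulation nodes therefore displays at most 2^k trees.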
Hybridization network problem
Most parsimonious network • Find a hybridization network that displays each tree in a set of phylogenetic trees $T_1, T_2, \dots, T_t$ and has the minimal number of reticulation nodes • NP-complete even for $t = 2$
Existing solutions • For two trees: – CASS (heuristic) – MURPAR (heuristic) • For multiple trees: – PIRN_CH (heuristic) – PIRN_C (exact)
Reduction to SAT • Fix the hybridization number k • Build a Boolean formula f such that f ∈ SAT iff a hybridization network with k reticulation nodes exists • Check satisfiability with a SAT solver • Find the minimal k for which the formula is satisfiable • Restore the network from the satisfying assignment
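The outer loop of this scheme can be sketched as follows. Everything here is an assumption made for illustration: `find_min_k`, `brute_force_sat` and `toy_formula` are hypothetical names, the brute-force solver stands in for a real SAT solver, and the toy formula is a placeholder rather than the actual reduction. Clauses are lists of non-zero integers in the DIMACS style (a negative number is a negated variable).

```python
from itertools import product


def brute_force_sat(clauses):
    """Tiny stand-in for a SAT solver: return a model dict or None."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return model
    return None


def find_min_k(build_formula, solve, k_max):
    """Try k = 0, 1, ..., k_max and return the first satisfiable (k, model)."""
    for k in range(k_max + 1):
        model = solve(build_formula(k))
        if model is not None:
            return k, model      # the network would be restored from model
    return None


def toy_formula(k):
    # Placeholder, NOT the real reduction: satisfiable only when k >= 2.
    return [[1]] if k >= 2 else [[1], [-1]]


result = find_min_k(toy_formula, brute_force_sat, k_max=5)
```

Because the formula for k is unsatisfiable whenever no network with k reticulation nodes exists, the first satisfiable k found by the upward scan is the minimal hybridization number.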
SAT • Boolean formula f in CNF: $f(v_1, v_2, \dots) = (v_1 \lor \lnot v_2 \lor \dots) \land \dots \land (\dots)$ • Question: do values for $v_1, v_2, \dots$ exist that make f true? • Can be seen as a conjunction of multiple constraints • Constraints can be of the form $v_1 \land \lnot v_2 \land \dots \to v_3$
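An implication-shaped constraint such as the last one on this slide becomes a single CNF clause: negate every premise and add the conclusion. The sketch below shows the conversion and checks it against the truth table; the function names and the DIMACS-style integer literals are assumptions made for illustration.

```python
from itertools import product


def implication_to_clause(premises, conclusion):
    """(p1 AND p2 AND ...) -> q is equivalent to (NOT p1 OR NOT p2 OR ... OR q)."""
    return [-lit for lit in premises] + [conclusion]


def holds(literal, model):
    """Evaluate a DIMACS-style literal under a {variable: bool} model."""
    return model[abs(literal)] == (literal > 0)


def equivalent_on_all_assignments(premises, conclusion):
    """Compare the implication with its clause on every assignment."""
    clause = implication_to_clause(premises, conclusion)
    variables = sorted({abs(lit) for lit in premises + [conclusion]})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        implication = (not all(holds(p, model) for p in premises)
                       or holds(conclusion, model))
        if implication != any(holds(lit, model) for lit in clause):
            return False
    return True


# Variable 1 true and variable 2 false together imply variable 3.
clause = implication_to_clause([1, -2], 3)
```

Here `clause` is `[-1, 2, 3]`, that is, "not 1, or 2, or 3".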
Network structure • 2n + 2k - 1 nodes – [1, n]: leaves (L) – [n+1, 2n+k-1]: regular nodes (V) – [2n+k, 2n+2k-1]: reticulation nodes (R)
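The index ranges on this slide translate directly into code. This is a small sketch with a hypothetical helper name, `node_ranges`, that also confirms the total node count for one example.

```python
def node_ranges(n, k):
    """Return the index ranges (L, V, R) for n taxa and k reticulation nodes."""
    leaves = range(1, n + 1)                          # [1, n]
    regular = range(n + 1, 2 * n + k)                 # [n+1, 2n+k-1]
    reticulation = range(2 * n + k, 2 * n + 2 * k)    # [2n+k, 2n+2k-1]
    return leaves, regular, reticulation


# Example: 3 taxa and 1 reticulation node.
L, V, R = node_ranges(3, 1)
total_nodes = len(L) + len(V) + len(R)
```

For n = 3 and k = 1 this gives leaves 1 to 3, regular nodes 4 to 6 and the single reticulation node 7, for 2n + 2k - 1 = 7 nodes in total.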
Network structure variables • $l_{v,u}$ and $r_{v,u}$: u is the left (right) child of v, for v in V • $p_{v,u}$: u is the parent of v, for v in L ∪ V • $p^l$, $p^r$ and $c$: parent-child relations for reticulation nodes • $O(n^2)$ variables
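A SAT solver works with numbered variables, so each indexed Boolean variable needs its own integer. The sketch below shows one common way to do that; the `VarPool` class is a hypothetical helper, and allocating a variable for every ordered pair of distinct nodes is a simplifying assumption (an actual encoding may restrict which pairs are allowed).

```python
class VarPool:
    """Assign consecutive positive integers to named, indexed variables."""

    def __init__(self):
        self._ids = {}

    def var(self, name, v, u):
        """Return the integer for variable name_{v,u}, allocating on first use."""
        key = (name, v, u)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def __len__(self):
        return len(self._ids)


n, k = 3, 1
all_nodes = range(1, 2 * n + 2 * k)    # nodes 1 .. 2n+2k-1
regular = range(n + 1, 2 * n + k)      # regular nodes n+1 .. 2n+k-1

pool = VarPool()
for v in regular:
    for u in all_nodes:
        if u != v:
            pool.var("l", v, u)        # u is the left child of v
            pool.var("r", v, u)        # u is the right child of v
```

With roughly n + k regular nodes and 2n + 2k candidate children each, the number of such variables grows quadratically in the number of nodes, in line with the quadratic bound on the slide.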
Network consistency constraints • Each node has only one left child, one right child and one parent • u is a child of v → v is the parent of u • u is the parent of v → v is the left or right child of u • $O(n^3)$ constraints
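A constraint like "only one left child" is commonly written in CNF as one "at least one" clause plus pairwise "at most one" clauses. The sketch below assumes that "only one" means "exactly one" and uses the pairwise encoding; whether the original reduction uses this particular encoding is not stated on the slide, and the function names are illustrative.

```python
from itertools import combinations, product


def at_most_one(literals):
    """Pairwise encoding: no two of the literals may be true together."""
    return [[-a, -b] for a, b in combinations(literals, 2)]


def exactly_one(literals):
    """At least one literal is true, and no two are true together."""
    return [list(literals)] + at_most_one(literals)


def count_models(clauses, variables):
    """Count assignments over the given variables that satisfy all clauses."""
    count = 0
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count


# Suppose variables 1, 2, 3 mean "node 1 / 2 / 3 is the left child of v".
clauses = exactly_one([1, 2, 3])
models = count_models(clauses, [1, 2, 3])
```

For m candidate children the pairwise encoding produces 1 + m(m-1)/2 clauses per node, so applying it to every node gives a cubic number of clauses overall, which matches the order of growth stated on the slide.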
Network consistency constraints: Actual clauses
Displaying structure • For a tree T • Choice of a parent for each reticulation node • Variables for the correspondence between network and tree nodes • Collapsing non-branching paths – Whether particular nodes were removed or not – Parent relations after collapsing • $O(t n^2)$ variables
Displaying consistency constraints • All nodes of T are uniquely mapped to network nodes • Parent relations in the tree uniquely correspond to the network structure after selecting directions at reticulation nodes and collapsing paths • Parent relations in the network are consistent • $O(t n^3)$ constraints
Displaying consistency constraints: Actual clauses (1)
Displaying consistency constraints: Actual clauses (2)
All clauses
Additional optimizations • Splitting into independent problems • Symmetry breaking
Experiments • 57 grass datasets by the Grass Phylogeny Working Group et al. • CryptoMiniSAT solver • 1000 s time limit • Comparison with PIRN_C and PIRN_CH
Results • Exact solution (out of 57) – PhyloSAT: 36 – PIRN_C: 29 • Non-exact – PhyloSAT: 48 (40 optimal) – PIRN_CH: 43 (36 optimal)
Results for k >= 6 • [Table not recovered; entries: hybridization number (time in seconds)]
Future work • Trying different SAT solvers • Improving the reduction • Using upper and lower bounds on k • Searching for all minimal solutions
Conclusions • Parsimonious hybridization networks can be constructed by reduction to SAT • This approach outperforms the known exact solver (PIRN_C) and compares well with the heuristic solver (PIRN_CH) • Solving bigger instances is still challenging
The End https://github.com/ctlab/PhyloSAT Vladimir Ulyantsev (ulyntsev@rain.ifmo.ru)