Fast Recovery of Evolutionary Trees with Thousands of Leaves
Miklós Csűrös
Department of Computer Science, Yale University
Molecular evolution

Evolutionary tree (Noro et al. 1998): Woolly mammoth, African elephant, Asian elephant, Dugong, Manatee

Homologous gene sequences:

Woolly mammoth    ...CTAAATCATCACTGATC--AAAGAGAGC...
African elephant  ...CTAAATCATCACCGATC--AAAGAGAGC...
Asian elephant    ...CTAAATCATCGCTGATC--AAAGAGAGC...
Dugong            ...TTAAATCACTCCCGATCATAAAG-GAGC...
Manatee           ...TCAAATCATTACTGACCATAAAG-GAGC...

Differences between sequences grow with time.
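Markov model

- each character evolves independently
- root sequence characters are i.i.d.
- character transitions on edges (parent to child): 1 stays 1 with probability $1-p$ and becomes 0 with probability $p$; 0 stays 0 with probability $1-q$ and becomes 1 with probability $q$
  example: parent 10010..., child 11010...
- character at node $u$: $\xi(u)$; these random variables form a Markov chain on each path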
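The transition rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used in the talk: the function names `mutate` and `evolve_path`, the sequence length, and the edge parameters are all assumptions made for the example.

```python
import random

def mutate(parent, p, q, rng):
    """Evolve a binary sequence along one edge (hypothetical helper).

    A 1 becomes 0 with probability p; a 0 becomes 1 with probability q.
    Every character is handled independently of the others.
    """
    child = []
    for ch in parent:
        flip_prob = p if ch == 1 else q
        child.append(1 - ch if rng.random() < flip_prob else ch)
    return child

def evolve_path(root, edge_params, rng):
    """Return the sequences at every node along one root-to-leaf path."""
    seqs = [root]
    for p, q in edge_params:
        seqs.append(mutate(seqs[-1], p, q, rng))
    return seqs

rng = random.Random(0)
root = [rng.randint(0, 1) for _ in range(20)]  # i.i.d. root characters
path = evolve_path(root, [(0.1, 0.1), (0.2, 0.05)], rng)
```

Because each child sequence depends only on its parent, the sequences along `path` form a Markov chain, as stated above.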
Distance based algorithms

Distance [coin-toss model: symmetric mutations]:

$D(u,v) = -\ln\bigl(P\{\xi(u)=\xi(v)\} - P\{\xi(u)\neq\xi(v)\}\bigr)$

- symmetric
- additive along paths

Distance-based algorithm:
1. distance estimation between leaves ($\hat{D}$)
2. algorithm using pairwise distance matrix
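Additivity can be checked numerically for the symmetric model. In this sketch (function names are illustrative, not from the talk) an edge with mutation probability p has $P\{=\} - P\{\neq\} = 1 - 2p$, so its distance is $-\ln(1-2p)$; two consecutive edges compose into a single flip probability.

```python
import math

def edge_distance(p):
    """D = -ln(1 - 2p) for a symmetric edge with mutation probability p < 1/2."""
    return -math.log(1.0 - 2.0 * p)

def compose(p1, p2):
    """Probability that a character differs across two consecutive symmetric edges."""
    return p1 * (1.0 - p2) + p2 * (1.0 - p1)

p1, p2 = 0.1, 0.25
d_path = edge_distance(compose(p1, p2))          # distance over the two-edge path
d_sum = edge_distance(p1) + edge_distance(p2)    # sum of the two edge distances
```

The two values agree up to rounding because $1 - 2\,\mathrm{compose}(p_1,p_2) = (1-2p_1)(1-2p_2)$.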
Additive tree problem

Build an edge-weighted tree from the sums of edge weights on paths between leaves.

- use triplets (e.g., Waterman, Smith, Singh, Beyer 1977)

For a triplet of leaves $u, v, w$ with center $o$:

$D(u,o) = \dfrac{D(u,v) + D(u,w) - D(v,w)}{2}$
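As a small worked example of the triplet formula, the following sketch computes the distance from each leaf to the center; the other two distances follow from the same formula by symmetry. The function name is an assumption for illustration.

```python
def triplet_center(d_uv, d_uw, d_vw):
    """Distances from leaves u, v, w to the center o of their triplet."""
    d_uo = (d_uv + d_uw - d_vw) / 2.0
    d_vo = (d_uv + d_vw - d_uw) / 2.0
    d_wo = (d_uw + d_vw - d_uv) / 2.0
    return d_uo, d_vo, d_wo

# A star with edge lengths 1, 2, 3 gives leaf distances 3, 4, 5.
center_distances = triplet_center(3.0, 4.0, 5.0)
```

With exact additive distances the star's edge lengths are recovered; with estimated distances the result is only approximate.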
Estimated distances

Use relative frequencies in the sample:

$\hat{D}(u,v) = -\ln\bigl(\hat{P}\{\xi(u)=\xi(v)\} - \hat{P}\{\xi(u)\neq\xi(v)\}\bigr)$

- estimation error
- harder to recognize separate triplet centers
- estimation error grows with distance
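A minimal sketch of the estimator, assuming two aligned sequences of equal length. The function name and the choice to return infinity when the empirical similarity is not positive are assumptions of this example, not details from the talk.

```python
import math

def estimated_distance(seq_u, seq_v):
    """Estimate D(u, v) from relative frequencies of agreement and disagreement."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_u)
    agree = sum(a == b for a, b in zip(seq_u, seq_v))
    similarity = (agree - (n - agree)) / n  # P^{equal} - P^{differ}
    if similarity <= 0.0:
        return math.inf  # assumption: treat saturated pairs as infinitely distant
    return -math.log(similarity)

d_hat = estimated_distance([1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
                           [1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
```

Here one character out of ten differs, so the empirical similarity is 0.8 and `d_hat` is $-\ln 0.8$.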
Triplet center estimation

Similarity:

$S(u,v) = \exp\bigl(-D(u,v)\bigr) = P\{\xi(u)=\xi(v)\} - P\{\xi(u)\neq\xi(v)\}$

Distance estimation error: for $0 < \varepsilon < 1$,

$P\Bigl\{\bigl|\hat{D}(u,o) - D(u,o)\bigr| \ge \dfrac{-\ln(1-\varepsilon)}{2}\Bigr\} \le a \exp\bigl(-b\,\ell\,\varepsilon^2 S^2(u,v,w)\bigr)$

(with $a, b > 0$ constants and $\ell$ the sample length)

Average similarity:

$S(u,v,w) = \dfrac{3}{\dfrac{1}{S(u,v)} + \dfrac{1}{S(u,w)} + \dfrac{1}{S(v,w)}}$
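The average similarity is the harmonic mean of the three pairwise similarities, so one weak pair pulls it down sharply. The sketch below computes it and picks the triple with the largest value from a small candidate set; the helper `best_triplet` and the numbers are made up for illustration.

```python
def average_similarity(s_uv, s_uw, s_vw):
    """Harmonic mean of the three pairwise similarities of a triplet."""
    return 3.0 / (1.0 / s_uv + 1.0 / s_uw + 1.0 / s_vw)

def best_triplet(candidates):
    """candidates maps a leaf triple to its three pairwise similarities."""
    return max(candidates, key=lambda t: average_similarity(*candidates[t]))

candidates = {
    ("a", "b", "c"): (0.9, 0.8, 0.85),
    ("a", "b", "d"): (0.9, 0.1, 0.12),  # two distant pairs dominate the mean
}
chosen = best_triplet(candidates)
```

Since the error bound depends on $S^2(u,v,w)$, triples with a larger average similarity give more reliable center estimates.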
Harmonic Greedy Triplets

Add one internal node and leaf at a time.

- greedy selection of triplet by average similarity
- recognize separate inner nodes (four-point condition)
- restrict pool of triplets considered (relevant triplets)
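For reference, the textbook four-point condition can be sketched as follows: among the three pairwise sums for four leaves, the smallest identifies the split, and half the gap to the largest is the length of the internal edge separating the two inner nodes. This is a generic illustration under the assumption of an exactly additive distance; it is not claimed to be how the algorithm in the talk applies the condition, and `quartet_split` is a hypothetical name.

```python
def quartet_split(d, u, v, w, x):
    """Return (pairing, internal_edge_length) for four leaves.

    d is a nested dict of pairwise distances, assumed additive.
    """
    sums = {
        ((u, v), (w, x)): d[u][v] + d[w][x],
        ((u, w), (v, x)): d[u][w] + d[v][x],
        ((u, x), (v, w)): d[u][x] + d[v][w],
    }
    pairing = min(sums, key=sums.get)
    ordered = sorted(sums.values())
    internal = (ordered[2] - ordered[0]) / 2.0
    return pairing, internal

# Quartet uv|wx: all leaf edges have length 1, the internal edge has length 2.
pairs = {("u", "v"): 2.0, ("w", "x"): 2.0, ("u", "w"): 4.0,
         ("u", "x"): 4.0, ("v", "w"): 4.0, ("v", "x"): 4.0}
d = {}
for (a, b), dist in pairs.items():
    d.setdefault(a, {})[b] = dist
    d.setdefault(b, {})[a] = dist

pairing, internal = quartet_split(d, "u", "v", "w", "x")
```

An internal length close to zero would suggest that the two inner nodes coincide; with estimated distances a threshold is needed, which this sketch does not attempt to choose.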
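Sample length

Bounded mutation probabilities on edges: $0 < f \le p_e \le g < \frac{1}{2}$

There exists a sample length

$\ell = O\!\left(\dfrac{\log\frac{1}{\delta} + \log n}{\cdots}\right)$

(the denominator, which depends on $f$, $g$ and $d$, is not legible in this copy) such that, with probability at least $1-\delta$, the topology is recovered correctly.

Tree depth: $d \le 1 + \log_2(n-1)$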
Simulated experiments

- compare to Neighbor Joining (Saitou and Nei 1987) and other algorithms
- simulate DNA sequence evolution (Jukes-Cantor and K2P+Γ)
  500-leaf tree (Chase et al. 1993): tree of 500 seed plants from rbcL gene
  1895-leaf tree (RDP 1999): tree of 1895 eukaryotes from ribosomal SSU
  3135-leaf tree (RDP 1999): tree of 3135 Proteobacteria from ribosomal SSU
- evaluate by Robinson-Foulds distance (1981): percentage of misplaced internal edges
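The evaluation measure can be sketched by representing each internal edge as the split of the leaf set it induces. This is a hedged illustration: the normalization used here (true-tree internal edges whose split is missing from the estimated tree, as a percentage) is one reading of "percentage of misplaced internal edges", and the function names are assumptions.

```python
def normalize(split, leaves):
    """Canonical form of a split: the side that omits a fixed reference leaf."""
    side = frozenset(split)
    ref = min(leaves)
    return frozenset(leaves) - side if ref in side else side

def rf_percent(true_splits, est_splits, leaves):
    """Percentage of true internal edges whose split is absent from the estimate."""
    t = {normalize(s, leaves) for s in true_splits}
    e = {normalize(s, leaves) for s in est_splits}
    return 100.0 * len(t - e) / len(t)

leaves = {"a", "b", "c", "d", "e"}
true_splits = [{"a", "b"}, {"d", "e"}]   # ab|cde and abc|de
est_splits = [{"a", "b"}, {"c", "e"}]    # ab|cde and abd|ce
score = rf_percent(true_splits, est_splits, leaves)
```

In this toy case one of the two true internal edges is missing from the estimate, so `score` is 50.0.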
[Figure: the 500-leaf tree, with leaves numbered 1 to 500]
Experimental sample length: 500-leaf tree

varying sample length

[Plot: 500-leaf tree, high mutation probabilities. RF% versus sample length (200 to 10000) for Neighbor-Joining and HGT/FP.]
Experimental sample length: 1895-leaf tree

[Plot: 1895-leaf tree, high mutation probabilities. RF% versus sample length (200 to 10000) for Neighbor-Joining and HGT/FP.]
Experimental success: 1895-leaf tree

varying mutation probabilities

[Plot: 1895-leaf tree, high mutation probabilities. RF% versus maximum edge length (0.1 to 2) for Neighbor-Joining and HGT/FP.]
Experimental success: 3135-leaf tree

[Plot: 3135-leaf tree, high mutation probabilities. RF% versus maximum edge length (0.1 to 2) for Neighbor-Joining and HGT/FP.]
Summary

- distance-based algorithm with polynomial sample size (Jukes-Cantor, Kimura's, paralinear, LogDet)
- $O(n^2)$ running time, $O(n)$ work space
- good experimental performance on large divergent trees
- fastest algorithm with polynomial sample size