
Information-Theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units



  1. Information-Theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units
     Michael Waechter, Kathrin Jaeger, Stephanie Weissgraeber, Sven Widmer, Michael Goesele, and Kay Hamacher
     MSA excerpt (title graphic):
       ...AEERYAEYKEAFTLFDSDGD...
       ...TEEQGRQFRQMFEMFDKNGD...
       ...TDEQQRQYRQMFETFDKDGN...
       ...TKEQVEEFKQAFSMFDTDGD...
       ...SEEQVAEFKEAFDRFDKNKD...
       ...SKEQVAKFKEAFDRIDKNKD...
       ...SPEQVAEFKQAFSRFDKNGD...
       ...SEEQVAKFKAAFSRFDTNGD...
       ...PPEQVAKFKEVFSRFDKNGD...
     June 18, 2012 | ECMLS 2012 | Michael Waechter & Stephanie Weissgraeber

  2. Motivation
     ● A huge number of Multiple Sequence Alignments (MSAs) are available, some of them really large
     ● E.g., HIV protease [1]: > 45,000 sequences of length > 1400
     ● Put them to use for coevolutionary and structural analysis
     ● But: our computations take > 25 days
     [1] Pan et al.: “The HIV positive selection mutation database”

  3. Outline
     ● In this talk we will show…
       ● MSA analysis using Mutual Information
       ● GPU parallelization & speed improvements
       ● 3-point Mutual Information contributions
       ● an application to a well-known protein
       ● that this approach is beneficial

  4. Introduction – Mutual Information
     ● Given an MSA:
         Sequence 1: AEERYAEYKEAFTLFDSDGD...
         Sequence 2: TEEQGRQFRQMFEMFDKNGD...
         Sequence 3: TDEQQRQYRQMFETFDKDGN...
         Sequence 4: TKEQVEEFKQAFSMFDTDGD...
         Sequence 5: SEEQVAEFKEAFDRFDKNKD...
         Sequence 6: SKEQVAKFKEAFDRIDKNKD...
         Sequence 7: SPEQVAEFKQAFSRFDKNGD...
         Sequence 8: SEEQVAKFKAAFSRFDTNGD...
     ● Mutual Information between two columns (correlation → coevolution): [equation]
     ● Iteration over all column pairs → MI matrix: [figure: MI matrix]
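The equation image on this slide did not survive extraction. As a minimal sketch, assuming the standard definition MI(X, Y) = Σ p(a, b) · log2( p(a, b) / (p(a) · p(b)) ) with probabilities estimated from column frequencies, the per-pair computation and the resulting MI matrix could look as follows. The function names and the choice of log base 2 are assumptions for illustration, not taken from the talk; the toy MSA is the first four columns of the eight sequences above.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """MI in bits between two equally long MSA columns (frequency estimate)."""
    n = len(col_x)
    if n != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) / (p(a) * p(b)) simplifies to c * n / (count_x[a] * count_y[b])
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


def mi_matrix(msa):
    """Symmetric matrix of MI values for all column pairs of an MSA."""
    columns = list(zip(*msa))  # transpose: one tuple per alignment column
    size = len(columns)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            value = mutual_information(columns[i], columns[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


# First four columns of the eight sequences shown on the slide.
msa = ["AEER", "TEEQ", "TDEQ", "TKEQ", "SEEQ", "SKEQ", "SPEQ", "SEEQ"]
matrix = mi_matrix(msa)
```

The third column of this toy MSA is fully conserved (all `E`), so its MI with every other column is zero.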

  5. Introduction – Shuffling Null-Model
     ● MI is sensitive to the underlying amino acid distribution
     ● Computational normalization: Shuffling Null-Model [2]
     ● Is MI distinguishable from “random evolution” MI?
     [2] K. Hamacher: “Relating sequence evolution of HIV1-protease to its underlying molecular mechanics”

  6. Introduction – Shuffling Null-Model
     ● Compute original MI
     ● Iterate 10,000 times:
       ● Shuffle each MSA column
       ● Compute randomized MI matrix
     ● Normalize original MI using random MI: [equation]
     [Illustration: MSA excerpt before and after column shuffling]
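The steps of the shuffling null model can be sketched for a single column pair as below. The normalization formula on the slide was not preserved, so this sketch assumes a z-score (observed MI minus the mean of the shuffled MI values, divided by their standard deviation); that choice, the function names, and the seeding are assumptions for illustration.

```python
import random
from collections import Counter
from math import log2, sqrt


def mutual_information(col_x, col_y):
    """MI in bits between two equally long columns (frequency estimate)."""
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum((c / n) * log2(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in count_xy.items())


def shuffled_null_zscore(col_x, col_y, iterations=10_000, seed=0):
    """Return (observed MI, null mean, null std, z-score) for one column pair.

    Assumption: normalization is a z-score against the shuffled MI values.
    """
    rng = random.Random(seed)
    observed = mutual_information(col_x, col_y)
    x, y = list(col_x), list(col_y)
    total = total_sq = 0.0
    for _ in range(iterations):
        # Shuffling each column independently keeps its amino acid
        # distribution but destroys any coupling between the columns.
        rng.shuffle(x)
        rng.shuffle(y)
        value = mutual_information(x, y)
        total += value
        total_sq += value * value
    mean = total / iterations
    std = sqrt(max(total_sq / iterations - mean * mean, 0.0))
    z = (observed - mean) / std if std > 0 else 0.0
    return observed, mean, std, z


# Two perfectly coupled toy columns, with fewer iterations to keep it quick.
observed, mean, std, z = shuffled_null_zscore("AAAABBBB", "CCCCDDDD",
                                              iterations=500, seed=1)
```

A fully conserved column yields zero MI before and after shuffling, so the sketch returns a z-score of 0 in that case rather than dividing by zero.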

  7. Massive parallelism
     ● Highly compute-intensive
     ● HIV-1 protease on a single core:
       ● MI computation for all column pairs: ~3.5 min
       ● Repeat for 10,000 iterations: > 25 days
     ● But:
       ● Computation of each MI matrix entry is independent of all others
       ● Shuffling of each MSA column is independent of all others
     ● Parallelizable (to hundreds of thousands of threads)
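The numbers on this slide can be checked with a short back-of-the-envelope calculation. This sketch assumes exactly 1400 columns and exactly 3.5 minutes per MI matrix, whereas the slide gives both only approximately ("length > 1400", "~3.5 min"); the function names are illustrative.

```python
def column_pairs(length):
    """Number of distinct column pairs, i.e. independent MI matrix entries."""
    return length * (length - 1) // 2


def single_core_days(minutes_per_matrix, iterations):
    """Total single-core time in days for repeated MI matrix computations."""
    return minutes_per_matrix * iterations / (60 * 24)


pairs = column_pairs(1400)            # 979,300 independent entries
days = single_core_days(3.5, 10_000)  # about 24.3 days for the MI part alone
print(pairs, round(days, 1))
```

Under these assumptions there are close to a million independent matrix entries per iteration, which matches the "hundreds of thousands of threads" scale, and the MI computations alone come to roughly 24 days, the same order as the quoted > 25 days. Shuffling time is not part of this estimate.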

  8. GPU Implementation
     ● Iterate 10,000 times:
       ● Shuffling
         – Map MSA columns to blocks of threads
         – Shuffle columns (GPU-suited algorithm)
         – Synchronize
       ● MI computation
         – Map MI matrix entries to blocks of threads (suitable for MSA access pattern)
         – Compute MI matrix entries
         – Synchronize
     ● Combine results & normalize original MI with randomized MI
     [Illustration: MSA excerpt]
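The two-phase structure of one iteration (shuffle all columns, synchronize, compute all matrix entries, synchronize) can be mimicked on a CPU thread pool. This is only a structural sketch and not the authors' CUDA code: the thread pool stands in for GPU thread blocks, the per-column seeding scheme and all function names are assumptions, and Python threads will not deliver a GPU-like speed-up.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import log2


def mutual_information(col_x, col_y):
    """MI in bits between two equally long columns (frequency estimate)."""
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum((c / n) * log2(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in count_xy.items())


def shuffle_column(task):
    """One independent unit of work: shuffle a single column with its own RNG."""
    column, seed = task
    rng = random.Random(seed)
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled


def one_iteration(columns, iteration, pool):
    """Run one null-model iteration as two synchronized parallel phases."""
    # Phase 1: one task per MSA column. list() waits for all tasks to finish,
    # which plays the role of the "Synchronize" step.
    tasks = [(col, iteration * len(columns) + k) for k, col in enumerate(columns)]
    shuffled = list(pool.map(shuffle_column, tasks))

    # Phase 2: one task per MI matrix entry (upper triangle only).
    pairs = list(combinations(range(len(columns)), 2))
    values = list(pool.map(
        lambda p: mutual_information(shuffled[p[0]], shuffled[p[1]]), pairs))
    return dict(zip(pairs, values))


msa = ["AEER", "TEEQ", "TDEQ", "TKEQ", "SEEQ", "SKEQ", "SPEQ", "SEEQ"]
columns = list(zip(*msa))
with ThreadPoolExecutor(max_workers=4) as pool:
    random_mi = one_iteration(columns, iteration=0, pool=pool)
```

Giving each column its own seeded generator keeps the shuffles independent of scheduling order, which is the property the slide relies on for parallelization.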

  9. Speed Results
     MSA                                                GeForce GTX 480   4 threads on Core i7-960   Speed-up
     Calmodulin (753 sequences of length 264)           1.1 min           13.4 min                   ~12x
     HIV-1 protease (> 45,000 seqs. of length > 1400)   1.85 days         7.3 days                   ~4x
     ● Problem size dependent

  10. Implications
      ● One order of magnitude speed-up
      ● Quickly redo previous steps (e.g., alignment) and recompute MI
      ● New analysis tool feasible: 3-point MI: coevolution of a ‘3-clique’ of MSA columns
      ● Can we deduce more information from 3-point MI than we could from 2-point MI alone?
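The slides do not state which definition of 3-point MI was used. One common generalization is the interaction information I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z), which equals I(X;Y) - I(X;Y|Z). The sketch below assumes that definition and sign convention; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more aligned columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def two_point_mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def three_point_mi(x, y, z):
    """Interaction information (assumed definition of 3-point MI)."""
    return (entropy(x) + entropy(y) + entropy(z)
            - entropy(x, y) - entropy(x, z) - entropy(y, z)
            + entropy(x, y, z))


# Toy columns where z is the XOR of x and y: every pair looks independent,
# yet the three columns together are fully determined.
x, y, z = "0011", "0101", "0110"
pairwise = [two_point_mi(x, y), two_point_mi(x, z), two_point_mi(y, z)]
triple = three_point_mi(x, y, z)
```

In this toy case all three pairwise MI values are zero while the 3-point value is -1 bit under the assumed sign convention, a small illustration of a dependency that 2-point MI alone cannot see.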

  11. Calmodulin
      ● 149 amino acids
      ● Ca²⁺ binding → conformational change
      ● Regulates various signaling pathways

  12. Coevolution in Calmodulin – 2-point MI
      ● Finding coevolving pairs of amino acids
      ● Structural or functional connection
      ● Here: Coevolution within N- and C-terminus
      ● Ca²⁺ binding
      ● Propagation of conformational change
      ● Conserved inner helix
      ● No coevolution without variation

  13. Coevolution in Calmodulin – 3-point MI
      ● ‘3-cliques’ of amino acids
      ● Higher order correlations
      ● Concerted motions
      ● Binding sites

  14. Coevolution in Calmodulin – 3-point MI
      ● ‘3-cliques’ of amino acids
      ● Color indicates the frequency with which an amino acid contributes to the ‘3-cliques’ set
      ● Key residues for important functions
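The per-residue frequency used for the coloring can be sketched as a simple count over a set of selected triplets. How the ‘3-cliques’ set was selected is not stated on the slide, so the sketch takes the set as given; the column indices below are made up for illustration and are not Calmodulin results.

```python
from collections import Counter


def clique_participation(cliques):
    """Count in how many 3-cliques each column (residue position) appears."""
    counts = Counter()
    for clique in cliques:
        # set() guards against counting a position twice within one clique.
        counts.update(set(clique))
    return counts


# Hypothetical triplets of column indices (not real data).
cliques = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (4, 1, 2)]
counts = clique_participation(cliques)
ranking = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Positions that rank highest in such a count would be the candidates for the "key residues" mentioned on the slide.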

  15. Conclusions
      ● MI for coevolutionary analysis
      ● GPU implementation ~10x faster on typical MSAs
      ● 3-point MI analysis possible in acceptable time
      ● 3-point MI does reveal new insights
      ● Next step could be k-point MI
      ● It may be possible to detect key residues in unknown proteins

  16. What happened since?
      ● Multi-GPU parallelization:
        ● Distribute Shuffling Null-Model iterations among GPUs
        ● First tests: 32 GPUs → ~32x speed-up (on top of basic GPU speed-up!)
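Because null-model iterations are independent, distributing them amounts to splitting the iteration count across devices and merging per-device statistics afterwards. The following is a minimal sketch of that bookkeeping only, with no GPU code; the function names and the choice of running sums as the partial result are assumptions.

```python
from math import sqrt


def split_iterations(total, workers):
    """Split `total` iterations as evenly as possible among `workers`."""
    base, extra = divmod(total, workers)
    return [base + (1 if w < extra else 0) for w in range(workers)]


def combine(partials):
    """Merge per-worker (count, sum, sum of squares) into overall mean and std."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    std = sqrt(max(total_sq / n - mean * mean, 0.0))
    return mean, std


shares = split_iterations(10_000, 32)  # 16 workers get 313, 16 get 312
# Example merge: worker A saw MI values 1 and 2, worker B saw 3 and 4.
mean, std = combine([(2, 3.0, 5.0), (2, 7.0, 25.0)])
```

Each worker only needs to return three numbers per matrix entry, so the merge step stays cheap, which is consistent with the near-linear scaling reported in the first tests.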

  17. Thank you.
      Please visit tinyurl.com/tud-comic for code & documentation or contact us.
