
GPU Accelerated Virtual Cell Biology and SIMD Enhanced High Throughput Computational Biology


  1. GPU Accelerated Virtual Cell Biology and SIMD Enhanced High Throughput Computational Biology
     • Narayan Ganesan, Assistant Professor, Department of Electrical and Computer Engineering
     • Hanyu Jiang, Ph.D. student and research assistant, Department of Electrical and Computer Engineering

  2. PART I: Agent Based Virtual Cell Biology

  3. Advantages of Process Simulation in 3D Space
     • Serves as a computational microscope into the behavior of the cell.
     • Helps observe behavioral patterns, such as the expected time to DNA transcription and the induced variance in protein translation and decay.
     • Helps model and study noise in biological processes.
     • Helps study crosstalk between several large and complex pathways.

  4. Computational Tasks Involved
     • Each particle maintains its own identity and attributes.
     • The list of interactions between the different chemical species is specified at the start of the simulation.
     • The particles diffuse independently, at the given rate of diffusion and with the given variance in their velocities (a random walk in 3D space).
     • When two particles that can react with each other come within each other's vicinity (the radius of reaction), the scheduler schedules a reaction between them and marks the particles as inactive.
     • The reaction is then executed: new particles (the products of the reaction) are added to the system, along with their new identities and attributes.
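The diffusion and proximity steps above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's code: the particle dictionaries, the `can_react` set, and the Gaussian step model are all assumptions made for the example, and the all-pairs scan stands in for whatever neighbor search the real simulator uses.

```python
import math
import random

def diffuse(position, rate, rng):
    """One random-walk step: add an independent Gaussian displacement
    (standard deviation `rate`, an assumed step model) to each coordinate."""
    return tuple(c + rng.gauss(0.0, rate) for c in position)

def find_candidate_pairs(particles, can_react, radius):
    """Return index pairs (i, j), i < j, of active particles whose species
    may react and whose distance is within the radius of reaction."""
    pairs = []
    for i in range(len(particles)):
        for j in range(i + 1, len(particles)):
            a, b = particles[i], particles[j]
            if not (a["active"] and b["active"]):
                continue
            species = frozenset((a["species"], b["species"]))
            if species in can_react and math.dist(a["pos"], b["pos"]) <= radius:
                pairs.append((i, j))
    return pairs

# Hypothetical three-particle system with a single allowed reaction R + JAK.
particles = [
    {"species": "R", "pos": (0.0, 0.0, 0.0), "active": True},
    {"species": "JAK", "pos": (0.5, 0.0, 0.0), "active": True},
    {"species": "IFN", "pos": (10.0, 10.0, 10.0), "active": True},
]
can_react = {frozenset(("R", "JAK"))}
candidates = find_candidate_pairs(particles, can_react, radius=1.0)
moved = diffuse(particles[2]["pos"], rate=0.5, rng=random.Random(0))
```

On a GPU each particle would be handled by its own thread, so the double loop becomes a per-thread scan of a neighbor list.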

  5. Computational Challenge: Parallel Selection
     • Each thread is assigned to a particle, along with its identity and attributes.
     • Thus each particle is an independent and autonomous agent within the 3D space.
     • The feasible list holds the neighboring particles that a particle can react with, based on all the reactions within the system.
     [Figure: example reactions. The generated feasible list includes the particles feasible for every reaction.]

  6. Computational Challenge: Parallel Selection
     [Figure: feasible list generation by a computational workgroup of threads, contrasting inconsistent reaction selection with consistent reaction selection.]

  7. Algorithm For Consistent Parallel Selection
     1) Build the feasibleList for each particle. It is a subset of neighborList and contains the particles capable of reacting with the current particle.
     2) Sort the feasibleList by Euclidean distance to set the reaction priority.
     3) Each particle selects the first available particle in its sorted feasibleList for reaction.
     4) If the selection is mutual, schedule the corresponding reaction in the reaction pipeline and mark the particle as unavailable for further selection; otherwise, mark the particle as still available for selection by other particles.
     5) Repeat steps 3) and 4) until convergence, or until no available particles remain in the feasibleList.
     The algorithm converges within 6 iterations.
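The selection steps above can be sketched serially in Python. This is an illustrative sketch under assumptions, not the authors' GPU kernel: the function name, the dictionary-based inputs, and the tie-break by particle id are all hypothetical, and the loop over particles stands in for one thread per particle.

```python
import math

def consistent_selection(positions, feasible, max_iters=6):
    """positions: id -> (x, y, z); feasible: id -> ids it may react with.
    Returns the list of mutually selected (scheduled) pairs."""
    # Step 2: sort every feasible list by Euclidean distance
    # (ties broken by id so the result is deterministic).
    ranked = {
        p: sorted(partners, key=lambda q: (math.dist(positions[p], positions[q]), q))
        for p, partners in feasible.items()
    }
    available = set(positions)
    scheduled = []
    for _ in range(max_iters):
        # Step 3: each available particle picks its first available partner.
        choice = {}
        for p in sorted(available):
            for q in ranked.get(p, ()):
                if q in available:
                    choice[p] = q
                    break
        # Step 4: keep only mutual selections; they leave the available pool.
        matched = [(p, q) for p, q in choice.items() if p < q and choice.get(q) == p]
        if not matched:
            break  # converged: nothing new can be scheduled
        for p, q in matched:
            available.discard(p)
            available.discard(q)
        scheduled.extend(matched)
    return scheduled

# Hypothetical particles on a line: b and c are the closest feasible pair.
positions = {"a": (0.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0),
             "c": (1.5, 0.0, 0.0), "d": (5.0, 0.0, 0.0)}
feasible = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
pairs = consistent_selection(positions, feasible)
```

Here `a` and `d` both lose their only partner to the `b`/`c` pair, so they stay unscheduled for this time step. Requiring mutual agreement is what keeps the result consistent: no particle can end up in two reactions.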

  8. Example: JAK-STAT Signaling Mechanism
     1. JAK binds to the IFN-γ receptor and forms the IFNR-JAK complex (RJ).
     2. IFN-γ binds to the extracellular domain of the RJ complex and forms the IFNRJ complex.
     3. Dimerization of IFNRJ leads to the formation of IFNRJ2.
     4. IFNRJ2 is phosphorylated and IFNRJ2* is formed.
     5. STAT1c binds to IFNRJ2* and is phosphorylated (STAT1c*).
     6. Phosphorylated STAT1c (STAT1c*) forms a homodimer (STAT1c*-STAT1c*).
     7. The homodimer (STAT1c*-STAT1c*) is translocated to the nucleus (STAT1n*-STAT1n*).
     8. STAT1n*-STAT1n* works as a transcription factor.
     9. SOCS1 is induced by the JAK/STAT pathway.
     10. SOCS1 binds to the activated receptor (IFNRJ2*) and inhibits its activity.

  9. ODEs for JAK-STAT Signaling Pathway

  10. Computing Framework – Input Config. File

      #----------------------------------------------------------------
      Regions
      # Regionid, x_orig, y_orig, z_orig, x_length, y_length, z_length
      #----------------------------------------------------------------
      4 0.0 0.0 57.0 60.0 60.0 3.0 #extracellular medium
      3 0.0 0.0 54.0 60.0 60.0 3.0 #cellplasma membrane
      2 0.0 0.0 8.0 60.0 60.0 46.0 #cytoplasm
      1 0.0 0.0 5.0 60.0 60.0 3.0 #nuclear membrane
      0 0.0 0.0 0.0 60.0 60.0 5.0 #nucleus
      # all concentrations are in nM/L. 1nM/L = 602.3*VOL*conc parts in Cell, VOL = 3.3 ncc.

      #----------------------------------------
      Reagents
      # Reagent inertia, init_cond, region_id
      #----------------------------------------
      R, 0.5, 12.0
      JAK, 0.5, 12.0
      RJ, 0.5, 0.0
      IFN, 0.5, 15.0
      IFNRJ, 1.0, 0.0
      IFNRJ2, 1.0, 0.0
      IFNRJ2x, 1.0, 0.0
      STAT1c, 1.0, 300.0

      #-----------------------------------------------
      Reactions
      # reaction, forward_rate, reverse_rate
      #-----------------------------------------------
      IFNRJ2 = IFNRJ2x, 0.005, 0.0,
      IFNRJ2x + STAT1c = IFNRJ2x-STAT1c, 1.0, 0.1,
      IFNRJ2x-STAT1c = IFNRJ2x + STAT1cx, 0.4, 0.0,
      IFNRJ2x + STAT1cx = IFNRJ2x-STAT1cx, 1.0, 0.1,
      STAT1cx + STAT1cx = STAT1cx2, 1.0, 0.005,
      ….
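To make the Regions section concrete, here is a small Python sketch that reads the five region lines from the slide and looks up which region contains a point. The parser, the `Region` type, and `region_of` are hypothetical helpers written for this example; the slides do not show how the framework itself reads the file.

```python
from typing import NamedTuple

# Region lines copied from the slide: id, origin (x, y, z), lengths (x, y, z).
REGIONS_TEXT = """\
# Regionid, x_orig, y_orig, z_orig, x_length, y_length, z_length
4 0.0 0.0 57.0 60.0 60.0 3.0 #extracellular medium
3 0.0 0.0 54.0 60.0 60.0 3.0 #cellplasma membrane
2 0.0 0.0 8.0 60.0 60.0 46.0 #cytoplasm
1 0.0 0.0 5.0 60.0 60.0 3.0 #nuclear membrane
0 0.0 0.0 0.0 60.0 60.0 5.0 #nucleus
"""

class Region(NamedTuple):
    region_id: int
    origin: tuple
    size: tuple
    label: str

def parse_regions(text):
    """Parse region lines; lines without exactly seven fields are skipped."""
    regions = {}
    for line in text.splitlines():
        data, _, comment = line.partition("#")
        fields = data.split()
        if len(fields) != 7:
            continue  # comment or separator line
        values = [float(f) for f in fields[1:]]
        rid = int(fields[0])
        regions[rid] = Region(rid, tuple(values[:3]), tuple(values[3:]), comment.strip())
    return regions

def region_of(point, regions):
    """Return the id of the axis-aligned box containing `point`, or None."""
    for region in regions.values():
        inside = all(o <= c < o + s
                     for c, o, s in zip(point, region.origin, region.size))
        if inside:
            return region.region_id
    return None

regions = parse_regions(REGIONS_TEXT)
```

With these values the five boxes stack along the z axis from the nucleus (z from 0 to 5) up to the extracellular medium (z from 57 to 60), so `region_of((30.0, 30.0, 10.0), regions)` falls in the cytoplasm.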

  11. Process Simulation Framework: Workflow
      • Input: configuration of particles within the biological cell; time to simulate.
      • Process: simulation framework.
      • Output: 3D trajectory and snapshots; particle concentrations.

      Sample output:

      Step   R      JAK    RJ   IFN    IFNRJ  IFNRJ2  IFNRJ2x  STAT1c  ...
      0      24140  24140  0    60350  0      0       0        603504  ...
      100    15913  15913  153  52276  584    3649    15       603372  ...
      200    8855   8855   69   45134  666    6928    28       602761  ...
      300    5963   5963   25   42198  768    7967    69       601335  ...
      400    4485   4485   25   40720  700    8351    72       599184  ...
      ...

  12. GPU Enabled Virtual Cell Biology Simulation The particle concentration is output as a function of time.

  13. Performance and Scalability
      [Figures: weak scaling w.r.t. number of processors; strong linear scaling w.r.t. number of agents.]

  14. Part II: SIMD Enhanced Protein Motif Detection

  15. Hidden Markov Model and hmmsearch of HMMER
      Each sample path follows a set of predefined transition probabilities between the states.

  16. HMM Model & Sequence Database
      [Figure: HMM model; protein sequence database.]

  17. Dependencies and Computational Hotspot
      • Match states: $V_M(i,j) = \varepsilon(R_i, M_j) + \max\{\, V_M(i-1,j-1) + T_{MM}(j-1,j),\; V_I(i-1,j-1) + T_{IM}(j-1,j),\; V_D(i-1,j-1) + T_{DM}(j-1,j),\; B + T_{BM}(M_j) \,\}$
      • Insert states: $V_I(i,j) = \max\{\, V_M(i-1,j) + T_{MI}(j,j),\; V_I(i-1,j) + T_{II}(j,j) \,\}$
      • Delete states: $V_D(i,j) = \max\{\, V_M(i,j-1) + T_{MD}(j-1,j),\; V_D(i,j-1) + T_{DD}(j-1,j) \,\}$
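The three recurrences can be written as a plain row-by-row dynamic program. The Python sketch below is a serial illustration only, with scores in log space: the function name, the argument layout, and the toy model are assumptions, and the special states of the full HMMER model are left out. Index i runs over sequence residues and j over model positions.

```python
NEG_INF = float("-inf")

def viterbi_rows(seq, match_emit, trans, begin):
    """Fill match (VM), insert (VI) and delete (VD) score tables.

    seq        : residues R_1..R_N
    match_emit : match_emit[j][r], emission score of residue r at match state j
    trans      : trans[j][key]; "MM", "IM", "DM", "MD", "DD" enter position j
                 from j-1, while "MI" and "II" stay at position j
    begin      : begin[j], score of entering match state j from the begin state
    Lists are 1-based; index 0 is an unused boundary.
    """
    n, m = len(seq), len(match_emit) - 1
    VM = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VI = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VD = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # row-major: one residue per row
        r = seq[i - 1]
        for j in range(1, m + 1):
            t = trans[j]
            VM[i][j] = match_emit[j][r] + max(
                VM[i - 1][j - 1] + t["MM"],
                VI[i - 1][j - 1] + t["IM"],
                VD[i - 1][j - 1] + t["DM"],
                begin[j],
            )
            VI[i][j] = max(VM[i - 1][j] + t["MI"], VI[i - 1][j] + t["II"])
            # The delete score depends on the current row (position j-1).
            VD[i][j] = max(VM[i][j - 1] + t["MD"], VD[i][j - 1] + t["DD"])
    return VM, VI, VD

# Toy two-position model with made-up integer scores.
T = {"MM": -1, "IM": -2, "DM": -2, "MD": -3, "DD": -1, "MI": -3, "II": -1}
match_emit = [None, {"A": 2, "B": -1}, {"A": -1, "B": 3}]
VM, VI, VD = viterbi_rows("AB", match_emit, [None, T, T], [None, 0, -4])
```

The match and insert updates read only the previous row, so all positions j could be updated together. The delete update reads position j-1 of the same row, which is the dependency that makes the kernel hard to parallelize.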

  18. What the computational kernel looks like
      • HMM states: Match, Insert, Delete.
      • MSV needs the Match score and X_E of the previous row.
      • Viterbi needs the adjacent Delete score in the current row.
      • The dependence on X_E imposes a row-major order of computation.
      [Figure: dynamic programming matrix for one sequence, with model positions 1..M and sequence positions 1..N. The result is the maximum probability that the sequence was generated by the model; complexity O(M×N).]

  19. Multi-tiered Parallel Framework for Acceleration

  20. Detail #1: Synchronize-free Execution
      • One warp picks up one sequence.
      • Once done, it moves on to the next one; scheduling is automatic.
      • Eliminates the block-scoped __syncthreads() caused by:
        • intra-state dependencies of the HMM model
        • unbalanced sequence data
      • Keeps threads active.
      • High throughput.
      [Figure: warps #1, #2 and #3 each pulling sequences from the sequence database.]
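The "pick up a sequence, finish, move to the next" pattern can be imitated on a CPU with worker threads that claim work from a shared counter. This Python sketch is only an analogy: the slides do not say how a warp claims its next sequence, so the counter, the lock (standing in for an atomic increment), and all names here are assumptions.

```python
import threading

def process_database(sequences, score_fn, num_workers=3):
    """Each worker (standing in for a warp) repeatedly claims the next
    unprocessed sequence; no worker ever waits for another to finish."""
    results = [None] * len(sequences)
    next_index = 0
    lock = threading.Lock()  # plays the role of an atomic counter increment

    def worker():
        nonlocal next_index
        while True:
            with lock:
                i = next_index
                next_index += 1
            if i >= len(sequences):
                return  # database exhausted
            results[i] = score_fn(sequences[i])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Sequences of very different lengths; `len` is a placeholder scoring function.
db = ["MKV", "MKVLAAGIVGLLLAQ", "M", "MKVLA"]
scores = process_database(db, len)
```

Because each worker claims new work as soon as it is free, a long sequence delays only the worker that holds it, which is the load-balancing effect the slide describes.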

  21. Detail #2: Striped Layout vs. Sequential Layout
      Sequential layout:
      • Straightforward.
      • Private data dependence across adjacent threads.
      • More sequential overhead and thread idling.
      Striped layout:
      • Only one reordering request per DP row.
      • All parallel execution.
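The difference between the two layouts is easiest to see on a tiny example. The Python sketch below assumes the striped layout is the usual interleaved striping used by SIMD sequence-alignment filters, where model position k is stored in vector k mod Q at lane k div Q for Q vectors; the slides do not spell out the exact mapping, and the function names are made up for this illustration.

```python
def sequential_layout(scores, lanes):
    """Consecutive model positions sit side by side in the same vector."""
    q = -(-len(scores) // lanes)                 # number of vectors (ceiling)
    padded = list(scores) + [None] * (q * lanes - len(scores))
    return [padded[v * lanes:(v + 1) * lanes] for v in range(q)]

def striped_layout(scores, lanes):
    """Vector v, lane l holds model position l * Q + v."""
    q = -(-len(scores) // lanes)
    padded = list(scores) + [None] * (q * lanes - len(scores))
    return [[padded[lane * q + v] for lane in range(lanes)] for v in range(q)]

def shift_right(vector, fill):
    """Move every lane up by one, inserting `fill` at lane 0."""
    return [fill] + vector[:-1]

positions = list(range(8))                       # model positions 0..7
seq_vecs = sequential_layout(positions, lanes=4)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
str_vecs = striped_layout(positions, lanes=4)     # [[0, 2, 4, 6], [1, 3, 5, 7]]

# Predecessors (position k-1) of vector 0 come from the last vector, shifted.
prev_of_first = shift_right(str_vecs[-1], fill=None)  # [None, 1, 3, 5]
```

In the striped form, the predecessors of every other vector are simply the vector before it, lane for lane. Only vector 0 needs the one-lane shift, which is consistent with the single reordering request per DP row noted on the slide.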

  22. Detail #3: PTX assembly for Reordering
      • Reorder 128 scores within one warp.
      • Shifting.
      • Exchange (intra-warp shuffle).
      • Merge.
      • Ready to go next!

  23. Detail #4: PTX assembly for Max-Reduction
      • Max-reduction.
      • SIMD max.
      • Intra-warp shuffle.
      • Broadcast.
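A shuffle-based max-reduction can be simulated in Python to show the data movement. This is a sketch of the general butterfly pattern, not the authors' PTX: `shfl_xor` below only imitates what an XOR-style warp shuffle (such as CUDA's `__shfl_xor_sync`) does, and treating the warp as a 32-element list is an assumption of the example.

```python
def shfl_xor(values, lane_mask):
    """Every lane reads the value held by lane (own_lane XOR lane_mask)."""
    return [values[lane ^ lane_mask] for lane in range(len(values))]

def warp_max_reduce(values):
    """Butterfly reduction: after log2(width) exchange steps every lane
    holds the maximum, so no separate broadcast step is needed."""
    width = len(values)
    if width == 0 or width & (width - 1):
        raise ValueError("width must be a power of two")
    offset = width // 2
    while offset >= 1:
        partner = shfl_xor(values, offset)
        values = [max(a, b) for a, b in zip(values, partner)]
        offset //= 2
    return values

# One made-up score per lane of a 32-lane warp.
lane_scores = [(lane * 7) % 32 for lane in range(32)]
reduced = warp_max_reduce(lane_scores)
```

With 32 lanes the loop runs five times (offsets 16, 8, 4, 2, 1), and all exchanges stay within the warp, so no shared memory or block-level barrier is involved.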

  24. Benchmark Performance
      • Overused shared memory hurts occupancy and overall performance.
      • A larger local memory capacity available to each thread is good news.
      • Given factors such as pipeline usage and available registers, more threads/warps may not result in further speedup. Conversely, they may cause stalls and register spills.

  25. Benchmark Performance – cont.
      GCUPS = Giga Cell Updates Per Second
      • Complex algorithms bring intensive register pressure and off-chip data transfer.
      • A lower hit ratio on the L1, L2 and read-only caches is the performance killer.
      • Larger model, better performance.
      • About 5x faster than highly optimized CPU implementations.
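As a reminder of how the GCUPS metric is usually computed, the short Python sketch below divides the number of dynamic programming cells (model length times total residues searched) by the elapsed time. The function name and the sample numbers are made up for illustration and are not measurements from the slides.

```python
def gcups(model_length, total_residues, seconds):
    """Giga cell updates per second: one cell per (model position, residue)."""
    cells = model_length * total_residues
    return cells / (seconds * 1e9)

# Hypothetical run: a 400-state model against 10 million residues in 2 s.
example = gcups(model_length=400, total_residues=10_000_000, seconds=2.0)
```

Since the cell count grows with model length while fixed per-sequence overheads do not, longer models tend to report higher GCUPS, in line with the "larger model, better performance" remark.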

  26. Acknowledgements
      • NVIDIA-Professor Partnership
      • Xilinx University Program (XUP)
      • Stevens Institute of Technology Start-up Foundation
      Contact Information
      • Narayan Ganesan, email: nganesan@stevens.edu
      • Hanyu Jiang, email: hjiang5@stevens.edu
