Toward Real-Time Simulation of Cardiac Dynamics
Ezio Bartocci, SUNY Stony Brook
Joint work with E. Cherry, J. Glimm, R. Grosu, S. A. Smolka, F. Fenton
Outline
• Motivation
• Cardiac Models as Reaction-Diffusion Systems
• CUDA Programming Model
• Reaction-Diffusion in CUDA
• Case Studies
• Work in Progress
Motivation for our Work: Simulation-Based Analysis
• Spiral Formation
• Spiral Breakup
• Tip Tracking
• Wave Front Tracking
• Curvature Analysis
• Conduction Velocity
• Restitution Analysis
Cardiac Models as Reaction-Diffusion Systems

The membrane's action potential (AP) depends on:
• Stimulus (voltage or current): external / from neighboring cells
• The cell's state

The AP has nonlinear, threshold behavior: a sub-threshold stimulus fails to initiate an AP and the membrane returns to its resting potential.

Reaction-diffusion system:

∂u/∂t = R(u) + ∇·(D∇u)

where R(u) captures the in-time (reaction) behavior of the cell and ∇·(D∇u) the diffusion between cells.
[Figure: schematic action potential over time, marking the threshold, a failed initiation, and the resting potential]
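The threshold behavior above can be seen in a tiny explicit-time-stepping sketch. This is plain sequential C with a generic cubic excitable reaction term (an illustrative assumption, not one of the cardiac models in this talk), on a 1-D fiber with no-flux boundaries:

```c
#define N 64

/* Illustrative cubic reaction term with excitation threshold `a`
   (an assumption for this sketch, not a real cardiac model). */
static double reaction(double u) {
    const double a = 0.15;
    return u * (1.0 - u) * (u - a);
}

/* One forward-Euler step of du/dt = R(u) + D * d2u/dx2 on a 1-D
   fiber with no-flux (reflecting) boundaries. */
void fiber_step(const double *u, double *unew,
                double D, double dx, double dt) {
    for (int i = 0; i < N; i++) {
        int im = (i == 0)     ? 1     : i - 1;   /* reflect at left edge  */
        int ip = (i == N - 1) ? N - 2 : i + 1;   /* reflect at right edge */
        double lap = (u[im] - 2.0 * u[i] + u[ip]) / (dx * dx);
        unew[i] = u[i] + dt * (reaction(u[i]) + D * lap);
    }
}
```

Stimulating one end above the threshold launches a wavefront that propagates down the fiber, while a uniform sub-threshold stimulus simply decays back to rest — exactly the failed-initiation behavior in the schematic.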
Cardiac Models
• Minimal Model (Flavio–Cherry) (4 v), Human
• Beeler–Reuter (8 v), Canine
• ten Tusscher–Panfilov (19 v), Human
• Iyer (67 v), Human
Available Technologies: CPU-based vs. GPU-based
The GPU devotes more transistors to data processing. (Image from the CUDA Programming Guide.)
GPU vs CPU
GPU Architecture
Each GPU consists of a set of multiprocessors.
[Figure: each multiprocessor has its own registers, stream processors (SP), and shared memory; the device also provides a constant cache, a texture cache, and global memory]
GPU Architecture
Each multiprocessor can have 8 or 32 stream processors (SPs, also called "cores" by NVIDIA), which share access to local memory.
GPU Architecture
Each stream processor (core) contains a fused multiply-adder capable of single-precision arithmetic. It can complete 3 floating-point operations per cycle: a fused MADD and a MUL.
GPU Architecture
Each multiprocessor can contain one or more 64-bit fused multiply-adders for double-precision operations.
Memory Hierarchy
The fastest memory available for GPU computation is the device registers. Each multiprocessor contains 16 KB of registers, which are partitioned among the threads resident on that multiprocessor.
Memory Hierarchy
Shared memory (16 KB) is primarily intended as a means of fast communication between threads executed by the same multiprocessor, although, due to its speed, it can also be used as a programmer-controlled memory cache.
Memory Hierarchy
GPUs also have DRAM (global memory). Its latency is about 150x higher than that of registers and shared memory.
Memory Hierarchy
Constant memory, as the name implies, is a read-only region which also has a small cache.
Memory Hierarchy
Texture memory is read-only, with a small cache optimized for the manipulation of textures. It also provides built-in linear interpolation of the data.
Memory Hierarchy
Global memory is available to all threads and is persistent between GPU calls.
CUDA Programming Model
Single Instruction Multiple Threads (SIMT): similar to the CPU's Single Instruction Multiple Data (SIMD).
Serial code runs on the host; a kernel invocation launches a grid of thread blocks on the GPU device.
Every thread runs the same program on its own data. For example, with ID the thread's index into a vector A:

A[ID] = ID;
if (ID % 2) { A[ID] += 2; A[ID] *= 2; } else { A[ID] = 0; }
A[ID] += 2;

When branches occur in the code (e.g. due to if statements), the divergent threads become inactive until the conforming threads complete their separate execution. When execution merges, the threads can continue to operate in parallel.
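The per-thread program of this divergence example can be traced sequentially in plain C (a sketch: on the GPU all threads execute it in lockstep, with each side of the branch masked off in turn):

```c
/* Sequentially compute what each SIMT thread produces in the
   branch-divergence example above; `id` plays the role of the
   thread ID. */
void simt_trace(int *A, int nthreads) {
    for (int id = 0; id < nthreads; id++) {
        A[id] = id;
        if (id % 2) {        /* odd threads run; even threads sit idle */
            A[id] += 2;
            A[id] *= 2;
        } else {             /* even threads run; odd threads sit idle */
            A[id] = 0;
        }
        A[id] += 2;          /* execution re-converges; all threads active */
    }
}
```

For thread IDs 0..3 this yields A = {2, 8, 2, 12}: the even threads are overwritten to 0 and then incremented, while the odd threads go through the add-then-multiply path before the common increment.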
CUDA Programming Model
Different threads are multiplexed onto and executed by the same core in order to hide the latency of memory accesses. Each thread block is executed by one multiprocessor. The maximum number of threads per block is 512, and the usable number also depends on the amount of registers each thread needs.
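Given the per-block thread cap, a launch is usually sized by a ceiling division of the problem size by the block size. A minimal sketch of that idiom (the function name is ours, for illustration):

```c
/* Number of thread blocks needed to cover n elements with `threads`
   threads per block (ceiling division, the standard CUDA launch idiom). */
int blocks_needed(int n, int threads) {
    return (n + threads - 1) / threads;
}
```

For example, covering a 1000-element array with 256-thread blocks requires 4 blocks; kernels then guard against the out-of-range thread IDs in the last block.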
Tesla C1060 vs. Fermi C2070

                   Tesla C1060     Fermi C2070
Multiprocessors    30              14
Cores              240             448
Core clock         1.296 GHz       1.15 GHz
Single precision   933 Gigaflops   1030 Gigaflops
Double precision   78 Gigaflops    515 Gigaflops
Max bandwidth      102 GBytes/sec  144 GBytes/sec
DRAM               4 GB            6 GB
Cost               $1000           $3200
Reaction-Diffusion in CUDA

∂u/∂t = R(u) + ∇·(D∇u)

For each time step, a set of Ordinary Differential Equations (ODEs) and Partial Differential Equations (PDEs) must be solved:

for (timestep = 1; timestep < nend; timestep++) {
    solveODEs <<<grid, block>>> (….);
    calcLaplacian <<<grid, block>>> (….);
}

The ODEs are solved with a different method depending on the model:
• 4th-order Runge-Kutta
• Forward Euler
• Backward Euler
• Semi-implicit
…

The PDE part (computing the Laplacian) uses the 5-point stencil:

(∇·(D∇u))_{i,j} ≈ (D·dt/dx²) (u_{i−1,j} + u_{i,j−1} + u_{i+1,j} + u_{i,j+1} − 4 u_{i,j})
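A sequential C sketch of one explicit step of the 2-D update — the loop body is what the two kernels perform per grid point. The `react` callback stands in for a model-specific ODE right-hand side, and `no_reaction` is a placeholder we introduce for illustration:

```c
/* Placeholder reaction term used for illustration (a real model's
   ODE right-hand side would go here). */
static double no_reaction(double u) { (void)u; return 0.0; }

/* One explicit step of the 2-D reaction-diffusion update on an n x n
   row-major grid: u_new = u + dt*R(u) + (D*dt/dx^2) * 5-point Laplacian,
   with no-flux (clamped-index) boundaries. */
void rd_step(const double *u, double *unew, int n,
             double D, double dx, double dt, double (*react)(double)) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int im = (i > 0) ? i - 1 : 1, ip = (i < n - 1) ? i + 1 : n - 2;
            int jm = (j > 0) ? j - 1 : 1, jp = (j < n - 1) ? j + 1 : n - 2;
            double lap = u[im*n + j] + u[ip*n + j]
                       + u[i*n + jm] + u[i*n + jp] - 4.0 * u[i*n + j];
            unew[i*n + j] = u[i*n + j] + dt * react(u[i*n + j])
                          + (D * dt / (dx * dx)) * lap;
        }
    }
}
```

On the GPU each (i, j) iteration becomes one thread, which is why the two sub-steps map so naturally onto kernel launches.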
Optimizing the Reaction Term in CUDA

Heaviside Simplification
• To avoid divergent branches among threads, we replace the if-then-else encoding of a Heaviside function

if (x > a) { y = b; } else { y = c; }

with the branch-free form

y = c + (x > a) * b_c;

where a, b, c are constants and b_c = (b − c).

Precomputing Lookup Tables Using Textures
• We build tables that precompute the nonlinear parts of the ODEs, bind these tables to textures, and exploit the texture unit's built-in linear interpolation.