GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy
Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik
Department of Computer Engineering, Rochester Institute of Technology, United States
WCCI 2010, IJCNN, July 23
Agenda
- Motivation
- Neural network models
- Simulation systems of neural networks
- Parker-Sochacki numerical integration method
- CUDA GPU architecture
- Implementation: software architecture, computation phases
- Verification
- Results
- Conclusion and future work
- Q&A
Motivation
Other works face accuracy and verification problems:
- J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
- A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 137-144, 2009.
- J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.

Goals:
- Provide scalable accuracy
- Perform direct verification

Based on: R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.
Neuron Models: IF, HH, IZ
- IF (integrate-and-fire): simple, but has a poor spiking response
- HH (Hodgkin-Huxley): rich response, but complex
- IZ (Izhikevich): simple, with a rich response, but phenomenological
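The IZ model used in this work can be sketched in a few lines. The sketch below simulates a single Izhikevich neuron with forward Euler; the parameter values follow Izhikevich's published regular-spiking set, while the step size dt, input current I, and simulation length are illustrative choices, not values from this talk.

```python
# Minimal sketch of the Izhikevich (IZ) neuron model with forward Euler.
# Parameters (a, b, c, d) are the standard regular-spiking values;
# dt, I, and t_end are illustrative.

def simulate_izhikevich(a=0.02, b=0.2, c=-65.0, d=8.0,
                        I=10.0, dt=0.5, t_end=200.0):
    """Return the list of spike times (ms) for a single IZ neuron."""
    v, u = c, b * c          # membrane potential and recovery variable
    spikes = []
    t = 0.0
    while t < t_end:
        # IZ model equations:
        #   v' = 0.04 v^2 + 5 v + 140 - u + I
        #   u' = a (b v - u)
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:        # spike: reset v, bump u
            spikes.append(t)
            v, u = c, u + d
        t += dt
    return spikes
```

With a constant 10 pA input the neuron fires tonically, which is the "rich response from a simple model" trade-off the slide refers to.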
System Modeling: Synchronous Systems
- Aligned events: good for parallel computing
- Time quantization error introduced by dt
- Smaller dt is more precise, but computationally hungry
- May result in missed events
- STDP unfriendly
Order of computation per second of simulated time: c_U·N/dt + c_P·F·N·p
  N - network size, F - average firing rate of a neuron, p - average target neurons per spike source, c_U / c_P - per-update and per-propagation costs
(R. Brette et al.)
System Modeling: Asynchronous Systems
- Small computation order
- Events are unique in time: no quantization error, more accurate, STDP friendly
- Assumes an analytical solution exists
- Events are processed sequentially: more computation per unit of time
- Spike predictor-corrector causes excessive re-computation
Order of computation per second of simulated time: (c_U + c_P)·F·N·p
  N - network size, F - average firing rate of a neuron, p - average target neurons per spike source, c_U / c_P - per-update and per-propagation costs
(R. Brette et al.)
System Modeling: Hybrid Systems
- Refreshes every dt: more structured than event-driven, good for parallel computing
- Events are unique in time: no quantization error, more accurate, STDP friendly
- Does not require an analytical solution
- Events are processed sequentially
- Largest possible dt is limited by the minimum delay and the highest possible transient
Order of computation per second of simulated time: c_U·N/dt + c_P·F·N·p, with dt as large as the minimum delay allows
  N - network size, F - average firing rate of a neuron, p - average target neurons per spike source
(R. Brette et al.)
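The operation-count comparison behind these three slides can be sketched numerically. This is a rough sketch in the style of the Brette et al. estimates: the cost coefficients c_update and c_prop, and the example network sizes, are illustrative assumptions, not measured values.

```python
# Rough per-second-of-simulated-time operation counts for clock-driven
# (synchronous) vs. event-driven (asynchronous) simulation.
# c_update / c_prop are illustrative unit costs.

def clock_driven_ops(N, F, p, dt, c_update=1.0, c_prop=1.0):
    # Every neuron is updated every dt, plus one propagation per synaptic event.
    return c_update * N / dt + c_prop * F * N * p

def event_driven_ops(N, F, p, c_update=1.0, c_prop=1.0):
    # Work is done only on events: each spike updates state and reaches p targets.
    return (c_update + c_prop) * F * N * p
```

For a sparse, low-rate network (e.g. N=4096, F=1 Hz, p=100, dt=1 ms) the event-driven count is far smaller; as rate and connectivity grow, the clock-driven scheme catches up, which is the trade-off the hybrid system exploits.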
Choice of Numerical Integration Method
Motivation: need to solve an IVP (initial value problem)
- Euler: compute the next y based on the tangent at the current y
- Modified Euler: predict with Euler, correct with the average slope
- Runge-Kutta 4th order: evaluate and average several slopes
- Bulirsch-Stoer: modified midpoint method with an error-tolerance check using extrapolation with rational functions; adaptive order; generally better suited for smooth functions
- Parker-Sochacki: express the IVP solution as a power series; adaptive order
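The first two methods on the list can be contrasted with a small sketch. The test IVP y' = -y, y(0) = 1 (exact solution e^-t) is an illustrative choice, not a problem from the talk.

```python
# Sketch: forward Euler vs. modified (Heun) Euler on y' = -y, y(0) = 1.
import math

def euler(f, y0, dt, steps):
    y = y0
    for _ in range(steps):
        y += dt * f(y)            # follow the tangent at the current y
    return y

def modified_euler(f, y0, dt, steps):
    y = y0
    for _ in range(steps):
        k1 = f(y)                 # predict with the current slope
        k2 = f(y + dt * k1)       # slope at the Euler-predicted point
        y += dt * 0.5 * (k1 + k2) # correct with the averaged slope
    return y

f = lambda y: -y
exact = math.exp(-1.0)
err_euler = abs(euler(f, 1.0, 0.1, 10) - exact)
err_heun = abs(modified_euler(f, 1.0, 0.1, 10) - exact)
```

The predictor-corrector step costs one extra function evaluation but gains an order of accuracy, which is the same cost/accuracy trade-off that motivates the higher-order methods further down the list.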
Parker-Sochacki Method
A typical IVP:
  y' = f(y), y(t0) = y0
Assume the solution can be represented as a power series:
  y(t) = Σ_{k=0}^{∞} a_k·(t - t0)^k
Therefore, by Maclaurin-series properties, its derivative is:
  y'(t) = Σ_{k=0}^{∞} (k+1)·a_{k+1}·(t - t0)^k
so matching coefficients of f(y) gives a_{k+1} = [f(y)]_k / (k+1)
Parker-Sochacki Method
If f(y) is linear, f(y) = λ·y + c:
Shift the variable to eliminate the constant term: z = y + c/λ
As a result, the equation becomes: z' = λ·z, with coefficient recurrence
  b_{k+1} = λ·b_k / (k+1)
With finite order N: z(t) ≈ Σ_{k=0}^{N} b_k·(t - t0)^k
Series evaluation maps to a parallel reduction (loop-level parallelism)
Parker-Sochacki Method
If f(y) is quadratic, f(y) = α·y² + β·y + γ:
Shift the variable to eliminate the constant term (choose the shift so the constant vanishes)
As a result, the equation becomes: z' = α·z² + β'·z
The quadratic term is converted with series (Cauchy) multiplication:
  [z²]_k = Σ_{j=0}^{k} b_j·b_{k-j}
Parker-Sochacki Method
and the coefficient recurrence becomes:
  b_{k+1} = (α·[z²]_k + β'·b_k) / (k+1)
With finite order N: z(t) ≈ Σ_{k=0}^{N} b_k·(t - t0)^k
Loop-carried circular dependence between the series coefficients
Only partial parallelism possible
Parker-Sochacki Method
- The local Lipschitz constant determines the number of iterations needed to reach a given error tolerance
- Power series representation gives adaptive order and error-tolerance control
- Limitation: the Cauchy product reduces parallelism
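The coefficient recurrence and the adaptive-order stopping rule can be sketched together. The example IVP y' = y², y(0) = 1 (exact solution 1/(1 - t)) is an illustrative quadratic case, not the IZ equations from the talk; the function name and tolerance are assumptions.

```python
# Sketch of one Parker-Sochacki step on y' = y^2, y(0) = 1.
# Coefficients of y^2 come from a Cauchy product; the order adapts
# until the last series term falls below `tol`.

def parker_sochacki_step(y0, t, tol=1e-12, max_order=50):
    a = [y0]                      # Maclaurin coefficients of the solution
    y = a[0]
    for k in range(max_order):
        # Cauchy product: k-th coefficient of y^2
        y2_k = sum(a[j] * a[k - j] for j in range(k + 1))
        a.append(y2_k / (k + 1))  # a_{k+1} = [y^2]_k / (k + 1)
        term = a[-1] * t ** (k + 1)
        y += term
        if abs(term) < tol:       # adaptive order: stop when converged
            break
    return y
```

Note how the Cauchy-product sum at order k depends on all earlier coefficients: that is the loop-carried dependence that limits parallelism on the GPU.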
CUDA: SW
- Kernel: separate code, task division
- Thread → Block (1D, 2D, 3D) → Grid (1D, 2D)
- Computation divided based on thread and block IDs
- Granularity: down to bit level (after warp broadcast access)
CUDA: HW Scheduling
- Scheduling: parallel and sequential
- Scalability requires blocks to be independent
- Warp = 32 threads
- Warp divergence
- Warp-level synchronization
- Active blocks and threads: maximum 1024 active threads per SM
- Goal: full occupancy = 1024 threads
Software Architecture
Update Phase
- Based on Stewart and Bair
- Adaptive order p chosen according to the required error tolerance
- Can be processed in parallel for each neuron
Propagation Phase
- Translate spikes into synaptic events: global communication is required
- Encoded spikes are written to global memory: bit mask + time values
- A propagation block reads and filters all spikes, decodes them, fetches synaptic data, and distributes events into time slots
Sorting Phase
Synaptic events are sorted by delivery time using the GPU sorting approach of Satish et al.
Results: Verification
Input conditions:
- Random parameter allocation
- Random connectivity
- Zero PS error tolerance

GPU device: GTX 260
- 24 streaming multiprocessors
- Shared memory: 16 KB per SM
- Global memory: 938 MB
- Clock rate: 1.3 GHz

CPU device: AMD Opteron 285
- Dual core
- L2 cache: 1 MB per core
- RAM: 4 GB
- Clock rate: 2.6 GHz

Output:
- Membrane potential traces
- Passed test for equality
Results: Simulation Time vs. Network Size
[Plot: simulation time (sec.) vs. network size (1000 × neurons), for 2%, 4%, 8%, and 16% connectivity on GPU and CPU]
Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by 0-200 pA current.
Results: GPU simulation is 8-9 times faster; real-time performance for 2-4%-connected networks of 2048-4096 neurons.
Major limiting factors: shared memory size, number of SMs.
Results: Simulation Time vs. Event Throughput
[Plot: simulation time (sec.) vs. mean event throughput (1000 × events/(sec. × neuron)), for 2%, 4%, 8%, and 16% connectivity on GPU and CPU]
Conditions: excitatory/inhibitory ratio increased from 0.8/0.2 to 0.98/0.02, network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by 0-200 pA current.
Results: GPU simulation is 6-9 times faster, up to 10,000 events per second per neuron; real-time performance for 0-2%-connected networks of 2048-4096 neurons.
Major limiting factors: shared memory size, number of SMs.
Results: Comparison with Other Works

Metric                    This Work                  Other works
Increase in speed         6-9, RT                    10-35, RT
Network size              2K-8K                      16K-200K
Connectivity per neuron   100-1.3K                   100-1K
Accuracy                  Full single-precision FP   Undefined
Verification              Direct                     Indirect

Reasons for the speed, size, and connectivity differences: GPU device, complexity of computation, numerical integration methods, simulation type, time scale. Reason for the accuracy difference: numerical integration method.
Conclusion
- Implemented a highly accurate PS-based hybrid simulation system of spiking neural networks with IZ neurons on a GPU
- Directly verified the implementation

Future Work
- Add an accurate STDP implementation
- Characterize accuracy in relation to signal processing, network size, network speed, and learning
- Provide an example application
- Port to OpenCL
- Further optimization