

  1. GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy
  Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik
  Department of Computer Engineering, Rochester Institute of Technology, United States
  WCCI 2010, IJCNN, July 23

  2. Agenda
  - Motivation
  - Neural network models
  - Simulation systems of neural networks
  - Parker-Sochacki numerical integration method
  - CUDA GPU architecture
  - Implementation: software architecture, computation phases
  - Verification
  - Results
  - Conclusion and future work
  - Q&A

  3. Motivation
  - Other works: accuracy and verification remain a problem
    - J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
    - A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 137-144, 2009.
    - J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.
  - Goals of this work: provide scalable accuracy and perform direct verification
  - Based on: R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.

  4. Neuron Models: IF, HH, IZ
  - IF: simple, but has a poor spiking response
  - HH: has a rich response, but is complex
  - IZ: simple and has a rich response, but is phenomenological

  5. System Modeling: Synchronous Systems
  - Aligned events → good for parallel computing
  - Time quantization error introduced by dt
  - Smaller dt → more precise, but computationally hungry
  - May result in missing events → STDP unfriendly
  - Order of computation per second of simulated time (see the reconstruction below): N – network size, F – average firing rate of a neuron, p – average number of target neurons per spike source (R. Brette et al.)
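
  The cost expression itself was a slide graphic and is not in the text; a plausible reconstruction, following the clock-driven estimate in R. Brette et al. (2007) and using the symbols defined above (the per-neuron update cost c_U and per-event propagation cost c_P are assumed names), is:

  $$\text{cost}_{\text{clock-driven}} \;\approx\; c_U\,\frac{N}{dt} \;+\; c_P\, F\, N\, p$$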

  6. System Modeling: Asynchronous (Event-Driven) Systems
  - Small computation order
  - Events are unique in time → no quantization error → more accurate, STDP friendly
  - Events are processed sequentially → more computation per unit of time
  - Spike predictor-corrector scheme → excessive re-computation
  - Assumes an analytical solution
  - Order of computation per second of simulated time (see below): N – network size, F – average firing rate of a neuron, p – average number of target neurons per spike source (R. Brette et al.)
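
  As above, the formula was a graphic; the corresponding event-driven estimate from Brette et al. (2007), with the same assumed cost constants c_U and c_P, is:

  $$\text{cost}_{\text{event-driven}} \;\approx\; (c_U + c_P)\, F\, N\, p$$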

  7. System Modeling: Hybrid Systems
  - Refreshes every dt → more structured than event-driven → good for parallel computing
  - Events are unique in time → no quantization error → more accurate, STDP friendly
  - Does not require an analytical solution
  - Events are processed sequentially
  - Largest possible dt is limited by the minimum delay and the fastest possible transient
  - Order of computation per second of simulated time: N – network size, F – average firing rate of a neuron, p – average number of target neurons per spike source (R. Brette et al.)

  8. Choice of Numerical Integration Method
  Motivation: need to solve an initial value problem (IVP).
  - Euler: compute the next y from the tangent at the current y (see the standard formulas below)
  - Modified Euler: predict with Euler, correct with the average slope
  - Runge-Kutta 4th order: evaluate the slope at several points and average
  - Bulirsch-Stoer: modified midpoint method with an error-tolerance check using extrapolation with rational functions; adaptive order; generally better suited to smooth functions
  - Parker-Sochacki: express the IVP as a power series; adaptive order
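
  For reference, the first two methods on a step of size dt for y' = f(y) are the standard textbook formulas (not specific to this work):

  $$\text{Euler:}\quad y_{n+1} = y_n + dt\, f(y_n)$$
  $$\text{Modified Euler:}\quad \tilde{y}_{n+1} = y_n + dt\, f(y_n),\qquad y_{n+1} = y_n + \tfrac{dt}{2}\big(f(y_n) + f(\tilde{y}_{n+1})\big)$$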

  9. Parker-Sochacki Method
  A typical IVP (the slide's equations were graphics; see the reconstruction below): assume that the solution function can be represented as a power series; its derivative then follows from the properties of the Maclaurin series.
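
  A standard statement of the setup described on this slide:

  $$y'(t) = f\big(y(t)\big),\qquad y(0) = y_0,\qquad y(t) = \sum_{k=0}^{\infty} y_k\, t^k$$

  Differentiating the series term by term gives $y'(t) = \sum_{k=0}^{\infty} (k+1)\, y_{k+1}\, t^k$; matching coefficients with the series of $f(y(t))$ yields the recurrence $y_{k+1} = [f(y)]_k / (k+1)$, where $[\cdot]_k$ denotes the k-th series coefficient.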

  10. Parker-Sochacki Method
  If the right-hand side is linear: shift it to eliminate the constant term. The equation then becomes a simple coefficient recurrence; with a finite order N, the truncated series exposes loop-level parallelism (LLP) → parallel reduction (see below).
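
  A hedged reconstruction of the linear case (the exact equations on the slide are not available): for y' = a·y + b with a ≠ 0, the shift z = y + b/a removes the constant term and the coefficients decouple:

  $$z' = a z \;\Rightarrow\; z_{k+1} = \frac{a\, z_k}{k+1} \;\Rightarrow\; z_k = z_0\,\frac{a^k}{k!},\qquad z(dt) \approx z_0 \sum_{k=0}^{N} \frac{(a\,dt)^k}{k!}$$

  Every term of the truncated sum is independent of the others, so the evaluation can be distributed across threads and combined with a parallel reduction.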

  11. Parker-Sochacki Method
  If the right-hand side is quadratic: shift it to eliminate the constant term. The quadratic term can be converted with series multiplication, i.e. the Cauchy product (see below).
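
  The series multiplication referred to here is the Cauchy product: if $y(t) = \sum_k y_k t^k$, then

  $$\big[y^2\big]_k = \sum_{j=0}^{k} y_j\, y_{k-j}$$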

  12. Parker-Sochacki Method
  With the Cauchy product substituted, the equation becomes a coefficient recurrence of finite order N (see below). The loop-carried circular dependence between coefficients means only partial parallelism is possible.
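
  Assuming the constant term has been shifted away as on the previous slide, a quadratic right-hand side f(y) = a·y² + b·y gives the recurrence

  $$y_{k+1} = \frac{1}{k+1}\left(a \sum_{j=0}^{k} y_j\, y_{k-j} \;+\; b\, y_k\right),\qquad k = 0,\dots,N-1$$

  Each new coefficient depends on all previous ones, so the outer loop over k is inherently sequential; only the inner Cauchy sum can be parallelized, e.g. as a reduction.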

  13. Parker-Sochacki Method
  - The local Lipschitz constant determines the number of iterations needed to reach a given error tolerance
  - Power series representation → adaptive order → error tolerance control
  - Limitation: the Cauchy product reduces parallelism

  14. CUDA: Software Model
  - Kernel: separate code, task division
  - Thread → Block (1D, 2D, 3D) → Grid (1D, 2D)
  - Divide computation based on thread and block IDs (see the sketch below)
  - Granularity: down to bit level (after warp broadcast access)
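
  A minimal CUDA sketch (not code from the paper) of ID-based work division, where each thread handles one neuron selected by its global thread ID:

```cuda
// Minimal illustration: map one thread to one neuron via block/thread IDs.
__global__ void touch_neurons(float *v, int num_neurons)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (id < num_neurons) {
        v[id] += 1.0f;  // placeholder for a per-neuron state update
    }
}

// Host-side launch: enough 256-thread blocks to cover all neurons, e.g.
// touch_neurons<<<(num_neurons + 255) / 256, 256>>>(d_v, num_neurons);
```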

  15. CUDA: Hardware Scheduling
  - Scheduling is both parallel and sequential → scalability → requires blocks to be independent
  - Warp = 32 threads; warp divergence; warp-level synchronization
  - Active blocks and threads: at most 1024 active threads per SM
  - Goal: full occupancy = 1024 active threads per SM
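
  For example, with the 1024-active-thread limit per SM noted above, four resident blocks of 256 threads (or two of 512) reach full occupancy, provided the per-block register and shared-memory usage allows that many blocks to be resident.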

  16. Software Architecture

  17. Update Phase (Stewart and Bair)
  - Adaptive order p chosen according to the required error tolerance
  - Can be processed in parallel for each neuron (see the sketch below)
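
  A simplified, hypothetical CUDA sketch of the per-neuron update idea: each thread generates Parker-Sochacki coefficients for one Izhikevich neuron and evaluates the series at t = dt. The actual kernel follows Stewart and Bair (shifted variables, adaptive order with an error-tolerance test, spike handling), none of which is shown here; the fixed order PS_ORDER and the argument names are assumptions.

```cuda
#define PS_ORDER 8  // fixed series order for this illustration only

// One thread per neuron: build power-series coefficients for the Izhikevich
// equations v' = 0.04 v^2 + 5 v + 140 - u + I, u' = a (b v - u), then
// evaluate both series at t = dt with Horner's scheme.
__global__ void ps_update(float *v, float *u, const float *I,
                          float a, float b, float dt, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    float vc[PS_ORDER + 1], uc[PS_ORDER + 1];  // series coefficients
    vc[0] = v[id];
    uc[0] = u[id];

    for (int k = 0; k < PS_ORDER; ++k) {
        // Cauchy product term (v*v)_k = sum_{j=0..k} vc[j] * vc[k-j]
        float vv = 0.0f;
        for (int j = 0; j <= k; ++j) vv += vc[j] * vc[k - j];

        float c0 = (k == 0) ? (140.0f + I[id]) : 0.0f;  // constant enters only at k = 0
        vc[k + 1] = (0.04f * vv + 5.0f * vc[k] + c0 - uc[k]) / (k + 1);
        uc[k + 1] = a * (b * vc[k] - uc[k]) / (k + 1);
    }

    float vn = vc[PS_ORDER], un = uc[PS_ORDER];
    for (int k = PS_ORDER - 1; k >= 0; --k) {  // Horner evaluation at t = dt
        vn = vn * dt + vc[k];
        un = un * dt + uc[k];
    }
    v[id] = vn;  // spike detection/reset omitted in this sketch
    u[id] = un;
}
```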

  18. Propagation Phase
  - Translating spikes into synaptic events requires global communication
  - Encoded spikes are written to global memory as a bit mask plus time values (see the sketch below)
  - A propagation block reads and filters all spikes, decodes them, fetches synaptic data, and distributes the events into time slots
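
  A hypothetical sketch of the bit-mask encoding idea; the names and memory layout are assumptions, not the paper's actual data structures:

```cuda
// Each thread checks one neuron; a firing neuron sets its bit in a packed
// 32-bit mask (atomicOr avoids races within a word) and records its
// within-step spike time. spike_mask must hold (n + 31) / 32 words.
__global__ void encode_spikes(const int *fired, const float *t_spike,
                              unsigned int *spike_mask, float *spike_time_out,
                              int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n && fired[id]) {
        atomicOr(&spike_mask[id / 32], 1u << (id % 32));
        spike_time_out[id] = t_spike[id];
    }
}
```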

  19. Sorting Phase (Satish et al.)

  20. Software Architecture

  21. Results: Verification
  Input conditions:
  - Random parameter allocation
  - Random connectivity
  - Zero PS error tolerance

  Devices:
  GPU: GTX 260                      | CPU: AMD Opteron 285
  24 streaming multiprocessors      | Dual core
  Shared memory: 16 KB / SM         | L2 cache: 1 MB / core
  Global memory: 938 MB             | RAM: 4 GB
  Clock rate: 1.3 GHz               | Clock rate: 2.6 GHz

  Output:
  - Membrane potential traces
  - Passed the test for equality

  22. Results: Simulation Time vs. Network Size
  [Plot: simulation time (sec.) vs. network size (×1000 neurons) for 2%, 4%, 8%, and 16% connectivity]
  Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by a 0-200 pA current.
  Results: GPU simulation is 8-9 times faster; real-time performance for 2-4%-connected networks of 2048-4096 neurons. Major limiting factors: shared memory size and the number of SMs.

  23. Results: Simulation Time vs. Event Throughput
  [Plot: simulation time (sec.) vs. mean event throughput (×1000 events/(sec. × neuron)) for 2%, 4%, 8%, and 16% connectivity]
  Conditions: excitatory/inhibitory ratio increased from 0.8/0.2 to 0.98/0.02, network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by a 0-200 pA current.
  Results: GPU simulation is 6-9 times faster, handling up to 10,000 events per second per neuron; real-time performance for 0-2%-connected networks of 2048-4096 neurons. Major limiting factors: shared memory size and the number of SMs.

  24. Results: Comparison with Other Works
  Metric                   | This work                 | Other works  | Reason
  Increase in speed        | 6-9x, RT                  | 10-35x, RT   | GPU device, complexity of computation,
  Network size             | 2K-8K                     | 16K-200K     |   numerical integration methods,
  Connectivity per neuron  | 100-1.3K                  | 100-1K       |   simulation type, time scale
  Accuracy                 | Full single-precision FP  | Undefined    | Numerical integration method
  Verification             | Direct                    | Indirect     |

  25. Conclusion
  - Implemented a highly accurate PS-based hybrid simulation of spiking neural networks with IZ neurons on the GPU
  - Directly verified the implementation
  Future Work
  - Add an accurate STDP implementation
  - Characterize accuracy in relation to signal processing, network size, network speed, and learning
  - Provide an example application
  - Port to OpenCL
  - Further optimization
