A GPU-Accelerated 3D Kinematic Modeling Platform for Behavioral Neuroscience John Long, PhD Buzsáki Laboratory Neuroscience Institute New York University Langone Medical Center 03.17.2015
A little about me… Jose Carmena György Buzsáki
…and my previous work. Venkatraman et al. 2009 Koralek et al. 2012 Long and Carmena 2011 Long and Carmena 2013
Why am I at a GPU conference? • As a neuroscientist, I find small form factor, massively parallel computing machines intriguing. • For ease of interface and visualization, I often program in Matlab or Python, and I suffer agonizing computational bottlenecks, which has led me to GPUs. • I’ve had a fair amount of success applying GPU computing to my scientific work.
What I have in store for you… • An introduction to the work I do in behavioral neuroscience that led me to GPU computing. • A detailed description of one of the CUDA programs I have implemented in the context of my research. • Throughout, I’ll mention a workflow I’ve found useful for porting CUDA code into Matlab and Python.
Who reads the maps in the brain? Hippocampal "place" cells (O'Keefe and Nadel, 1978; O'Keefe and Recce, 1993); sensory receptive fields (Hubel and Wiesel, 1959). Geisler, Sirota, Zugaro, Robbe, Buzsáki, PNAS 2007; Iurilli et al. 2012
The State of the Art in Behavioral Neuroscience More and more neural data!
The State of the Art in Behavioral Neuroscience The behaving rat…
Advances in Motion Capture Corazza et al. 2006
Environment Construction (figure: numbered panels 1–6)
Multiple Camera Synchronization (sync lines to the cameras; line to the Amplipex recording system)
Image Segmentation
Camera Calibration 3D to 2D perspective transformation Svoboda et al. 2005
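As a minimal illustration of the 3D-to-2D perspective transformation mentioned above: the snippet below is generic pinhole-camera math (a 3x4 projection matrix followed by the homogeneous divide), not the calibration code used in this work; the function and variable names are illustrative.

// Minimal sketch: project a 3D world point into image coordinates using a
// row-major 3x4 projection matrix P. Generic camera-model math, names are
// placeholders (not taken from the actual calibration code).
#include <cuda_runtime.h>

__host__ __device__ void projectPoint(const float P[12],
                                      float X, float Y, float Z,
                                      float *u, float *v)
{
    // Homogeneous image coordinates: [x y w]^T = P * [X Y Z 1]^T
    float x = P[0]*X + P[1]*Y + P[2]*Z  + P[3];
    float y = P[4]*X + P[5]*Y + P[6]*Z  + P[7];
    float w = P[8]*X + P[9]*Y + P[10]*Z + P[11];

    // Perspective divide gives pixel coordinates
    *u = x / w;
    *v = y / w;
}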
Visual Hull Construction Visual Hull Algorithm modified from Forbes et al. 2006
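The visual hull construction here is modified from Forbes et al. 2006; the kernel below sketches only the generic voxel-carving idea behind such algorithms (a voxel survives only if its projection falls inside every camera's silhouette). The kernel name, silhouette layout, and argument list are assumptions for illustration, not the implementation used in this work.

// Sketch of generic voxel carving (not the exact Forbes et al. variant):
// one thread per voxel; a voxel is kept only if it projects inside the
// foreground silhouette of every camera.
#include <cuda_runtime.h>

__global__ void carveVoxels(const float *P,           // nCams * 12 row-major 3x4 projection matrices
                            const unsigned char *sil, // nCams * W * H binary silhouette masks
                            int nCams, int W, int H,
                            const float3 *voxelPos,   // world position of each voxel center
                            unsigned char *occupied,  // output: 1 = voxel inside the visual hull
                            int nVox)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVox) return;

    float3 p = voxelPos[v];
    unsigned char inside = 1;
    for (int c = 0; c < nCams && inside; ++c) {
        const float *Pc = &P[12*c];
        // Perspective projection of the voxel center into camera c
        float x = Pc[0]*p.x + Pc[1]*p.y + Pc[2]*p.z  + Pc[3];
        float y = Pc[4]*p.x + Pc[5]*p.y + Pc[6]*p.z  + Pc[7];
        float w = Pc[8]*p.x + Pc[9]*p.y + Pc[10]*p.z + Pc[11];
        int u = (int)(x / w), vpix = (int)(y / w);
        // Carve the voxel away if it falls outside any camera's silhouette
        if (u < 0 || u >= W || vpix < 0 || vpix >= H ||
            sil[c*W*H + vpix*W + u] == 0)
            inside = 0;
    }
    occupied[v] = inside;
}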
Kinematic Model Design
Kinematic Model Manipulation Murray et al. 1994
Kinematic Model Fitting: Generate Candidate Poses → Compute Cost Components → Score Each Pose → Update Posterior Estimate. For each model point M_i (surface normal n_i) and visual-hull data point D_j (normal n_j), the cost components are the squared distance d_ij = ||M_i - D_j||^2 and the normal agreement alpha_ij = dot(n_i, n_j).
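These two cost components lend themselves to one thread per (model point, data point) pair. The kernel below is a minimal sketch of that idea only; the names, memory layout, and flat nM x nD output are illustrative assumptions, not the actual fitting code.

// Sketch: per-pair cost components from the slide,
//   d_ij     = ||M_i - D_j||^2   (squared distance)
//   alpha_ij = dot(n_i, n_j)     (surface-normal agreement)
#include <cuda_runtime.h>

__global__ void costComponents(const float3 *M, const float3 *Mn, int nM,  // model points and normals
                               const float3 *D, const float3 *Dn, int nD,  // data points and normals
                               float *d, float *alpha)                     // nM x nD outputs, row-major
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nM * nD) return;

    int i = idx / nD;   // model point index
    int j = idx % nD;   // data (visual hull) point index

    float dx = M[i].x - D[j].x;
    float dy = M[i].y - D[j].y;
    float dz = M[i].z - D[j].z;
    d[idx]     = dx*dx + dy*dy + dz*dz;
    alpha[idx] = Mn[i].x*Dn[j].x + Mn[i].y*Dn[j].y + Mn[i].z*Dn[j].z;
}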
Open Chain Kinematics
P1 = t1a * t1b * t1c * t1d * t1e * P1;
N1 = t1a * t1b * t1c * t1d * t1e * N1;
N1 = N1 - P1;
P4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * P4;
N4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * N4;
N4 = N4 - P4;
P5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * P5;
N5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * N5;
N5 = N5 - P5;
P7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * P7;
N7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * N7;
N7 = N7 - P7;
"A Mathematical Introduction to Robotic Manipulation" by Murray, Li, and Sastry, 1994
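Each line above is just a product of 4x4 homogeneous twist transforms applied to a point or normal, and the chains share long common prefixes (t1a through t1e). A minimal host-side sketch of the composition is below; the mat4 type and helper names are illustrative, not the data structures used in the real implementation.

// Sketch of composing the 4x4 twist transforms along one open chain.
struct mat4 { float m[4][4]; };

__host__ __device__ mat4 mul4(const mat4 &A, const mat4 &B)
{
    mat4 C;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += A.m[r][k] * B.m[k][c];
            C.m[r][c] = s;
        }
    return C;
}

// e.g. the transform applied to P4/N4 on the slide,
//   T4 = t1a*t1b*t1c*t1d*t1e * t4a*t4b*t4c,
// built by left-to-right accumulation over the joint's twist matrices.
mat4 chainTransform(const mat4 *twists, int nTwists)
{
    mat4 T = twists[0];
    for (int k = 1; k < nTwists; ++k) T = mul4(T, twists[k]);
    return T;
}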
Open Chain Kinematics: On the GPU: Exposing Parallelism
• All 4x4 transformation matrices t_i can be computed in parallel.
• There are many shared computational temporary variables across twist blocks.
//Thread parameters
unsigned int hWID  = threadIdx.x / halfWarpSz;
unsigned int hWoff = threadIdx.x % halfWarpSz;
unsigned int x = hWoff % DimXY;
unsigned int y = hWoff / DimXY;
//MATRIX REDUCTION: across twists
float sum[2];
//1st reduction: from 16 4x4 matrices to 8 4x4 matrices
if (hWID < 8) {
    sum[0] = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; k++) {
        sum[0] += Stwists[4*(2*hWID) + y][k] * Stwists[4*(2*hWID+1) + k][x];
    }
    Transtmp0[4*hWID + y][x] = sum[0];
}
__syncthreads();
Open Chain Kinematics: On the GPU: Results
CUDA ported into Matlab via MEX.
Compute Time Comparison (per-frame compute time, seconds):
• single Matlab process: mean = 12.6 sec
• parfor Matlab (6 CPUs): mean = 8.2 sec
• CUDA in Matlab: mean = 0.55 sec
• 22.5x speedup relative to a single Matlab process
• 14.6x speedup relative to a parallel Matlab process (6 CPUs)
Model Fit Comparison (per-frame model fit error, a.u., vs. frame number; errors prior to vs. after tuning):
• The qualitative speedup allowed for parameter tuning, resulting in an average 50% reduction in per-frame model fit error, i.e. better model fits.
• A promising approach to open chain kinematic model fitting.
CUDA where you need it
• Qualitative speedups mean more efficient science.
• Work where you need to, and let user-friendly languages like Matlab and Python do the rest.
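For the "CUDA ported into Matlab via MEX" step, the gateway below is a minimal sketch of the general pattern (copy inputs to the device, launch a kernel, copy results back, hand Matlab the output array). The kernel and the one-input/one-output signature are placeholders, not this project's actual interface.

// Minimal MEX gateway around a CUDA kernel (build with mexcuda or mex + nvcc).
#include "mex.h"
#include <cuda_runtime.h>

__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // stand-in for the real computation
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsSingle(prhs[0]))
        mexErrMsgTxt("Expected one single-precision input array.");

    int n = (int)mxGetNumberOfElements(prhs[0]);
    const float *hIn = (const float *)mxGetData(prhs[0]);

    plhs[0] = mxCreateNumericMatrix(n, 1, mxSINGLE_CLASS, mxREAL);
    float *hOut = (float *)mxGetData(plhs[0]);

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);

    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}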
Putting it all together
Promising Directions Berman et al. 2014
Kinematic Modeling → Parameterize Dynamics (wavelet analysis) → Map Dynamics (1st, 2nd, and 3rd principal components over time, seconds) → Cluster Embedded Dynamics (t-SNE embedding) → Label Clusters (behavioral classification, e.g. forward gaze, rearing, tight scan)
Conclusion • Fitting kinematic models to 3D visual hull data is greatly accelerated by GPUs. • The framework I’ve presented can be applied to open chain kinematic models in general. • In science, Big Data too often means sitting around waiting to find out you need to run your analysis again. GPUs are a game changer. • Interfaces like MEX (Matlab) and ctypes (Python) let you tackle the hard parts and be lazy about the easy parts (see the sketch below). – You can incrementally deal with bottlenecks of decreasing priority.
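The ctypes route mentioned above works the same way as the MEX route: expose the CUDA entry point with C linkage in a shared library and load it from Python. The function name and signature below are placeholders for illustration (the Python-side ctypes.CDLL loading code is omitted).

// Sketch of a C-linkage entry point around a CUDA kernel, suitable for
// loading from Python via ctypes (build as a shared library, e.g.
// nvcc -shared -Xcompiler -fPIC). Names and signature are placeholders.
#include <cuda_runtime.h>

__global__ void scaleKernel2(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // stand-in for the real computation
}

extern "C" void run_on_gpu(const float *hIn, float *hOut, int n)
{
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel2<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}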
Acknowledgements György Buzsáki The entire Buzsáki lab Andres Grosmark Antal Berenyi Thank you!