Peter Zinterhof Scientific Computing, Salzburg University Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors Peter Zinterhof Third Workshop on UnConventional High Performance Computing 2010 (UCHPC 2010), Ischia – Naples, Italy
Task:
• Automated recognition of 'regions of interest' within medical imaging data
  Here: recognition of kidneys within CT slices
General approach:
Evolve filter-sets that generate feature vectors which can be classified by means of a Kohonen self-organizing map.

Image filtering:
• Filter: 4 x 4 pixels, fed into a perceptron
• The perceptron sums its 16 inputs linearly and fires if a threshold is exceeded
  → sliding window (convolution)
• Each filter delivers a single number for one image
• Aggregate 8 filters into a feature vector of length 8
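The filtering step described above can be sketched in plain C as a CPU reference. This is a minimal sketch, not the author's implementation: the function names, the row-major image layout, the stride of 1, and the per-filter threshold array `theta` are all assumptions made for illustration.

```c
#include <assert.h>

#define IMG   64   /* image edge length in pixels (assumed 64 x 64) */
#define FSIZE 4    /* filter edge length (4 x 4 pixels) */

/* Count how often one perceptron fires while its 4 x 4 window slides
 * over a row-major image with stride 1 (stride is an assumption). */
int perceptron_fire_count(const unsigned char *img, const float *w, float theta)
{
    int count = 0;
    for (int y = 0; y + FSIZE <= IMG; y++) {
        for (int x = 0; x + FSIZE <= IMG; x++) {
            float sum = 0.0f;
            /* linear sum over the 16 inputs of the window */
            for (int dy = 0; dy < FSIZE; dy++)
                for (int dx = 0; dx < FSIZE; dx++)
                    sum += w[dy * FSIZE + dx] * (float)img[(y + dy) * IMG + (x + dx)];
            if (sum > theta)   /* the perceptron fires */
                count++;
        }
    }
    return count;
}

/* Apply n_filters perceptrons and collect one number per filter.
 * w holds n_filters * 16 weights; theta holds one threshold per filter
 * (hypothetical layout). */
void feature_vector(const unsigned char *img, const float *w,
                    const float *theta, int n_filters, int *features)
{
    assert(n_filters > 0);
    for (int f = 0; f < n_filters; f++)
        features[f] = perceptron_fire_count(img, w + f * FSIZE * FSIZE, theta[f]);
}
```

Calling `feature_vector` with `n_filters = 8` would give the length-8 feature vector mentioned on the slide.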
Training phase:
1. Apply eight perceptron-based filters to the training images (64 x 64 pixels)
2. Sum up the fire events to create feature vectors (8-bin histograms)
3. Train the Kohonen map
4. Repeat steps 1-3 for many 'individuals' in parallel and compute the fitness of each filter-set (recognition rate)
5. Apply an evolutionary strategy to the perceptrons to evolve filter-sets that eventually yield high recognition rates

Recall phase:
1. Apply the best filter-set to the image to be classified (kidney / non-kidney)
2. Use the pre-computed Kohonen map to classify the feature vector
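Training step 3 and recall step 2 both rely on finding the map unit closest to a feature vector. The following C sketch shows one possible form of that search and of a single training update. The slides do not specify the map topology, distance measure, or learning schedule, so the 1-D map, squared Euclidean distance, fixed `rate`, rectangular neighbourhood `radius`, and all function names are assumptions.

```c
#include <assert.h>
#include <float.h>

#define FEAT_LEN 8   /* feature-vector length (eight filters) */

/* Index of the map unit closest to x (squared Euclidean distance).
 * map holds units * FEAT_LEN floats, one unit after another. */
int best_matching_unit(const float *map, int units, const float *x)
{
    assert(units > 0);
    int best = 0;
    float best_d = FLT_MAX;
    for (int u = 0; u < units; u++) {
        float d = 0.0f;
        for (int k = 0; k < FEAT_LEN; k++) {
            float diff = map[u * FEAT_LEN + k] - x[k];
            d += diff * diff;
        }
        if (d < best_d) {
            best_d = d;
            best = u;
        }
    }
    return best;
}

/* One training step on a 1-D map (assumed topology): move the winner and
 * its neighbours within `radius` towards x by the fraction `rate`.
 * Returns the index of the winning unit. */
int som_train_step(float *map, int units, const float *x, float rate, int radius)
{
    int bmu = best_matching_unit(map, units, x);
    for (int u = bmu - radius; u <= bmu + radius; u++) {
        if (u < 0 || u >= units)
            continue;   /* neighbourhood is clipped at the map border */
        for (int k = 0; k < FEAT_LEN; k++)
            map[u * FEAT_LEN + k] += rate * (x[k] - map[u * FEAT_LEN + k]);
    }
    return bmu;
}
```

In the recall phase only `best_matching_unit` would be needed; the class label attached to the winning unit (kidney / non-kidney) would then be looked up in a separate table, which is not shown here.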
Solution:
• Clusters of General Purpose Graphics Processing Units (GPGPUs) and Cell processors enable training within days instead of many months.
• Two levels of parallelism:
    coarse grained: systems level
    fine grained: image filtering, Kohonen map
GPU: mpich2 + gbic
GPU + Cell: mpich2 + pvm3 + gbic
Sample code GPU: image filtering

    // Channels (the 8 filters) and THETA (firing threshold) are defined elsewhere.
    // Launch configuration:
    //   dim3 dimBlock(60, 8);     // 480 threads per block
    //   dim3 dimGrid(images, 1);  // 12000 blocks, about 5.76 million threads

    __global__ void Perceptrons(unsigned char *a1, float *weights,
                                unsigned char *border, int *ReturnVectors_dlocal)
    {
        __shared__ float w[Channels][4][4];
        __shared__ unsigned char borders[Channels + Channels][4][4];
        __shared__ unsigned char a[64][64];
        int i, j;

        if (threadIdx.y == 0) {    // fetch data from global memory
            float *aw = &w[0][0][0];
            for (i = threadIdx.x; i < (1 * Channels * 16); i += 60) {
                aw[i] = weights[i];
            }
            unsigned char *ue = &borders[0][0][0];
            for (i = threadIdx.x; i < (2 * Channels * 16); i += 60) {
                ue[i] = border[i];
            }
            int base = blockIdx.x * 4096;   // one 64 x 64 image per block
            unsigned char *uw = &a[0][0];
            for (i = threadIdx.x; i < 4096; i += 60) {
                uw[i] = a1[base + i];
            }
        }
        __syncthreads();   // shared data must be complete before any thread reads it

        int mask = threadIdx.y;    // which of the 8 filters to compute locally
        int Counter = 0;
        for (j = 0; j < 60; j++) {
            float sum = 0.0f;
            // The slide lists only the four taps [0][0..3] at column offset 0;
            // they are written here as a loop over the row offset r.
            for (int r = 0; r < 4; r++) {
                unsigned char p = a[j + r][threadIdx.x + 0];
                if ((p >= borders[mask][0][r]) && (p < borders[mask + Channels][0][r]))
                    sum += w[mask][0][r];
            }
            if (sum > THETA) {
                Counter++;
            }
        } // next j

        // ReturnVectors_dlocal must be zeroed before the kernel is launched.
        atomicAdd(&ReturnVectors_dlocal[(blockIdx.x * Channels) + mask], Counter);
    }
Code sample Cell: nearest-neighbor search

    // *********************************************************************
    // DEMONSTRATION OF spu_cmpgt and spu_sel for a nearest-neighbor search
    // mymap: array of Kohonen-map vectors (1/6 of total map)
    // testvector: random vector, whose nearest neighbor is computed
    // *********************************************************************
    for (i = 0; i < RES*DIM; i++) {
        sum = (vector float){0.0, 0.0, 0.0, 0.0};
        current_nr = spu_splats(i);
        diff = mymap[i][0] - testvector[0];
        sum = spu_madd(diff, diff, sum);
        diff = mymap[i][1] - testvector[1];
        sum = spu_madd(diff, diff, sum);
        ...
        diff = mymap[i][7] - testvector[7];
        sum = spu_madd(diff, diff, sum);
        mask = spu_cmpgt(localbest, sum);
        // if localbest > sum -> arg 1, else arg 0 is new minimum
        localbest = spu_sel(localbest, sum, mask);
        localbest_nr = spu_sel(localbest_nr, current_nr, mask);
    }
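The compare-and-select pattern above can be emulated in portable C to show what each of the four SIMD lanes does. This is a sketch under assumptions, not the SPU code itself: the lane-interleaved memory layout, the helper names, and the final reduction across the four lanes (which the slide does not show) are all hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

#define LANES    4   /* a vector float holds four lanes */
#define FEAT_LEN 8   /* feature-vector length */

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Bitwise select in the style of spu_sel: bits of b where mask is 1, else a. */
static uint32_t sel(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* map[(g * FEAT_LEN + k) * LANES + l] is component k of unit g * LANES + l
 * (assumed layout). Returns the index of the unit nearest to x. */
int nearest_unit(const float *map, int groups, const float *x)
{
    assert(groups > 0);
    float best[LANES];
    uint32_t best_nr[LANES];
    for (int l = 0; l < LANES; l++) {
        best[l] = FLT_MAX;
        best_nr[l] = 0;
    }
    for (int g = 0; g < groups; g++) {
        for (int l = 0; l < LANES; l++) {
            float sum = 0.0f;
            for (int k = 0; k < FEAT_LEN; k++) {
                float diff = map[(g * FEAT_LEN + k) * LANES + l] - x[k];
                sum += diff * diff;
            }
            /* like spu_cmpgt: all ones if best > sum, else all zeros */
            uint32_t mask = (best[l] > sum) ? 0xFFFFFFFFu : 0u;
            best[l] = u2f(sel(f2u(best[l]), f2u(sum), mask));
            best_nr[l] = sel(best_nr[l], (uint32_t)g, mask);
        }
    }
    /* reduce the four per-lane minima to one winner */
    int win = 0;
    for (int l = 1; l < LANES; l++)
        if (best[l] < best[win])
            win = l;
    return (int)best_nr[win] * LANES + win;
}
```

The select replaces a data-dependent branch with bitwise operations, so every lane executes the same instruction sequence regardless of which candidate is closer.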
Code sample Cell: improved barrier function
Transporting nearest-neighbor information 'piggy-back' style yields approx. 14 % speedup.
Benchmarks:
Conclusions:
• Kohonen map training is 12x faster on the Cell processor than on a single i7 core (920, 2.67 GHz).
• In conjunction with a single GPGPU (used for filtering only), a Cell reaches 50 % of a GPGPU's performance; even for a cluster of 8 PS3 consoles, one additional GPGPU suffices.
• GPGPUs reach an overall speedup of 22x over a Xeon.
• Combining 8 GPGPUs with 8 Cells yields a speedup of ~248x over a single i7 core.
  → 5 minutes on a cluster of 'unconventional hardware' is equivalent to 1 day on the CPU.
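As a rough check of the last statement, assuming that "1 day" means 24 hours of single-core i7 time and that the ~248x speedup applies to the whole run:

```latex
\[
t_{\text{cluster}} \approx \frac{24 \cdot 60\ \text{min}}{248} \approx 5.8\ \text{min}
\]
```

This is consistent with the rounded figure of 5 minutes given on the slide.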