The context : bioimage data explosion High-throughput imaging techniques have led to bioimage data explosion Submicrometer resolution Whole mouse brain mapping TeraByte scale images Heterogeneous images Need for tools capable to process these huge datasets Which requirements? Need for automation to extract information Fully- or semi-automated analysis? Is human assistance needed? March 26, 2018 2 University Campus Bio-Medico of Rome
TeraStitcher : born for Confocal Light Sheet Microscope Images Generates a 2D matrix of 3D tiles (image stacks) Tiles are regularly spaced with reasonably reliable nominal positions - Stack acquisition is performed with a single movement guaranteeing a reliable constant inter-slices distance along z - Long range movements (order of cm) make necessary an alignment step in all directions Sufficient overlap between adjacent tiles March 26, 2018 4 University Campus Bio-Medico of Rome
TeraStitcher : Ultra-terabyte images Need to process 16 bits images larger than 20,000 x 20,000 x 3,000 voxels ( ~2.5 TB) Increasingly high resolutions in X-Y Multiple channels Very large slices (e.g. 40 K x 30 K pixels) Solutions: Thousands of slices Sparse datasets Compressed formats Parallelism Multi-channel support March 26, 2018 5 University Campus Bio-Medico of Rome
?
What is image stitching? Image stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image (from Wikipedia) March 26, 2018 7 University Campus Bio-Medico of Rome
TeraStitcher : overview Initially developed for confocal light sheet microscopy (CLSM) images Fast 2D approach to align adjacent stacks Efficient use of memory resources Minimum I/O workload (2 readings, 1 writing) Can generate a multi-resolution representation suited for further processing March 26, 2018 8 University Campus Bio-Medico of Rome
TeraStitcher : software structure VolumeManager Stack StackedVolume 1 N « use s» Stitcher 1 Stitcher + computeDisplacements(algorithm_type: int , … ) + projectDisplacements() + thresholdDisplacements(thres: float ) + computeTilesPlacement(algorithm_type: int ) + mergeTiles(blending_type: int , … ) The software structure provides N « executes » « executes » « use s» means for incorporating new « interfac e» « interfac e» « interfac e» functionalities Ti l esPl acem ent Al go Pai r w i seD i spl Al go D i spl acem ent # def int[3] _di spl acem ent s: … + execut … + execut Displacement e( ) : e( ) « produces » int[3] # di spl acem ent s: + eval float Rel i abi l i t y( ) : Object-oriented software int + get D i spl acem ent ( di r ect i on: i nt ) : « use s» float ) + t hr eshol d( t hr es: Displacement ) + pr oj ect ( di spl : architecture and design patterns … … MST MIP-NCC … + execute( … ) + execute (… MIP-NCC-Displ ): Displacement Functional decomposition visible to - MIP_displacements: int[6] - NCC_widths: float[3] Strategy - NCC_maxs: float[3] the user through command line - reliab_factors: float[3] + evalReliability(): float options + … Abstract Factory March 26, 2018 9 University Campus Bio-Medico of Rome
TeraStitcher : separating the strategy from the algorithm Strategy (aims at minimizing memory requirements) o Divide the volume into layers (substacks) along Z o For each layer o Keep in memory a row (or column) of substacks at the time o For each pair of adjacent substacks, call Algorithm o Compute the Maximum Intensity Projections (MIPs) along X, Y, and Z for the two substacks o Apply 2D Normalized Cross-Correlation (NCC) to the three pairs of MIPs to find three 2D displacements o Output “most - reliable” displacement for each direction March 26, 2018 10 University Campus Bio-Medico of Rome
TeraStitcher : global optimization step Global optimization is horizontal axis H performed using V:375( ) V:375( ) V:375( ) V:0( ) computed (or even H:375( ) H:0( ) H:0( ) H:0( ) D:0 ( ) D:0( ) D:0( ) D:0( ) V:375( ) V:375( ) V:375( ) V:375( ) V:372( 1,41 ) MST along V H:0( ) H:0( ) H:0( ) H:0( ) nominal) alignments H:3( 1,22 ) D:0( ) D:0( ) D:0( ) D:0( ) D:-4( 1,35 ) V:0( ) V:0( ) V:0( ) V:1( 1,37 ) and their reliability H:375( ) H:375( ) H:373( 1,15 ) H:371( 1,35 ) D:0 ( ) D:0 ( ) D:2( 1,11 ) D:-3( 1,37 ) vertical axis V V:375( ) V:375( ) V:371( 1,36 ) V:372( 1,21 ) V:374( 1,25 ) H:0( ) H:0( ) H:2( 1,35 ) H:4( 1,35 ) H:6( 1,15 ) D:0( ) D:0( ) D:2( 1,33 ) D:-1( 1,31 ) D:2( 1,12 ) The resulting V:0( ) V:0( ) V:1( 1,40 ) V:3( 1,34 ) MST along H H:375( ) H:375( ) H:375( 1,26 ) H:373( 1,23 ) D:0 ( ) D:0 ( ) D:-1( 1,12 ) D:0( 1,08 ) alignments are used to V:375( ) V:375( ) V:375( ) V:375( ) V:372( 1,35 ) H:0( ) H:0( ) H:0( ) H:0( ) H:3( 1,22 ) D:0( ) D:0( ) D:0( ) D:0( ) D:-1( 1,37 ) generate the xml file V:0( ) V:0( ) V:0( ) V:0( ) H:375( ) H:375( ) H:375 ( ) H:375( ) D:0 ( ) D:0( ) D:0( ) D:0 ( ) V:375( ) V:375( ) V:375( ) V:372( 1,37 ) V:374( 1,35 ) MST along D H:0( ) H:0( ) H:0( ) H:1( 1,30 ) H:5( 1,22 ) Manual intervention is D:0( ) D:0( ) D:0( ) D:0( 1,40 ) D:-1( 1,12 ) V:0( ) V:0( ) V:1( 1,35 ) V:0( 1,32 ) H:375( ) H:375( ) H:372( 1,40 ) H:373( 1,22 ) possible to correct D:0 ( ) D:0 ( ) D:-3( 1,09 ) D:-1( 1,10 ) isolated errors March 26, 2018 12 University Campus Bio-Medico of Rome
TeraStitcher : developments and issues Parallelization: maintenance issues different alternatives to exploit parallel processing different code for sequential and parallel version Solution: launching multiple instances on disjoint sub- regions command line options to specify the sub-region to process use a scripting language (python) to implement a driver Alignment computation (computation bound) each instance processes a group of adjacent sub-stacks merge the xml files with alignment information Generation of the stitched image (I/O bound) the directory structure is first generated each instance writes disjoint groups of tiles in parallel metadata to be used in subsequent steps are finally generated March 26, 2018 13 University Campus Bio-Medico of Rome
Why do we need GPUs? Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel) Tile matrix: 3 x 3 Total dataset size: Tile size x #tiles ( >220 GByte ) Stitching on one Power8 (2.061 Ghz) core: ~15000 secs ! Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs ! GPUs come to rescue! March 26, 2018 16
Normalized Cross Correlation The most time-consuming part of the stitching process is the evaluation of the Normalized Cross- Correlation (NCC) between the Maximum Intensity Projections (MIP) of the tiles. NCC is a variant of the classic Cross Correlation in which data are normalized by subtracting the mean and dividing by the standard deviation of the two datasets. TeraStitcher implements a fast NCC from J. P. Lewis, “Fast Template Matching”, Vision Interface (1995). March 26, 2018 17
CPU Normalized Cross Correlation f_mean = t_mean = 0.; for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) { for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) { f_mean += *pxl1; t_mean += *pxl2; } } f_mean /= (dimi*dimj); t_mean /= (dimi*dimj); numerator = factor1 = factor2 = 0.; for ( ij=0, pxl1=im1, pxl2=im2; ij <dimi*dimj; ij++) { f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean; t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean; numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime; factor1 += f_prime * f_prime; factor2 += t_prime * t_prime; } March 26, 2018 18
From CPU TO GPU NCC The previous fragment of code is repeated ( 2*delayu+1 ) x ( 2*delayv+1 ) times at each iteration (there are few hundreds iterations per run). A typical value of delayu and delayv is ~ 100 We developed a gpu_NCC kernel that does all the work with a single invocation per iteration: minimize the overhead due to the invocation allow to overlap memory copies from CPU to GPU (using streams ) March 26, 2018 19
GPU NCC The gpu_NCC kernel heavily relies on shuffle primitives to compute the mean values required by the NCC #define CALCSUM(v) (v)=warpReduceSumD((v)); \ if(lane==0) shared[wid]=(v); \ __syncthreads(); \ (v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \ if(wid==0) (v)=warpReduceSumD((v)); \ if(tid==0) shared[0]=(v); \ __syncthreads(); \ (v)=shared[0]; \ __syncthreads(); #define CALCAVE(v) CALCSUM(v) \ (v) /= (dimi*dimj); March 26, 2018 20
Test Platform IBM S822LC HPC, courtesy of IBM Italy (G. Richelli) 512 GB of RAM, 16 Power8 cores (SMT 8) 2 960 GB solid state disks 4 P100 GPUs. March 26, 2018 21
Recommend
More recommend