the context bioimage data explosion
play

The context : bioimage data explosion High-throughput imaging - PowerPoint PPT Presentation

The context : bioimage data explosion High-throughput imaging techniques have led to bioimage data explosion Submicrometer resolution Whole mouse brain mapping TeraByte scale images Heterogeneous images Need for tools capable to


  1. The context : bioimage data explosion  High-throughput imaging techniques have led to bioimage data explosion  Submicrometer resolution Whole mouse brain mapping  TeraByte scale images  Heterogeneous images  Need for tools capable to process these huge datasets  Which requirements?  Need for automation to extract information  Fully- or semi-automated analysis?  Is human assistance needed? March 26, 2018 2 University Campus Bio-Medico of Rome

  2. TeraStitcher : born for Confocal Light Sheet Microscope Images  Generates a 2D matrix of 3D tiles (image stacks)  Tiles are regularly spaced with reasonably reliable nominal positions - Stack acquisition is performed with a single movement guaranteeing a reliable constant inter-slices distance along z - Long range movements (order of cm) make necessary an alignment step in all directions  Sufficient overlap between adjacent tiles March 26, 2018 4 University Campus Bio-Medico of Rome

  3. TeraStitcher : Ultra-terabyte images  Need to process 16 bits images larger than 20,000 x 20,000 x 3,000 voxels ( ~2.5 TB)  Increasingly high resolutions in X-Y  Multiple channels  Very large slices (e.g. 40 K x 30 K pixels)  Solutions:  Thousands of slices  Sparse datasets  Compressed formats  Parallelism  Multi-channel support March 26, 2018 5 University Campus Bio-Medico of Rome

  4. ?

  5. What is image stitching? Image stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image (from Wikipedia) March 26, 2018 7 University Campus Bio-Medico of Rome

  6. TeraStitcher : overview  Initially developed for confocal light sheet microscopy (CLSM) images  Fast 2D approach to align adjacent stacks  Efficient use of memory resources  Minimum I/O workload (2 readings, 1 writing)  Can generate a multi-resolution representation suited for further processing March 26, 2018 8 University Campus Bio-Medico of Rome

  7. TeraStitcher : software structure VolumeManager Stack StackedVolume 1 N « use s» Stitcher 1 Stitcher + computeDisplacements(algorithm_type: int , … ) + projectDisplacements() + thresholdDisplacements(thres: float ) + computeTilesPlacement(algorithm_type: int ) + mergeTiles(blending_type: int , … )  The software structure provides N « executes » « executes » « use s» means for incorporating new « interfac e» « interfac e» « interfac e» functionalities Ti l esPl acem ent Al go Pai r w i seD i spl Al go D i spl acem ent # def int[3] _di spl acem ent s: … + execut … + execut Displacement e( ) : e( ) « produces » int[3] # di spl acem ent s: + eval float Rel i abi l i t y( ) :  Object-oriented software int + get D i spl acem ent ( di r ect i on: i nt ) : « use s» float ) + t hr eshol d( t hr es: Displacement ) + pr oj ect ( di spl : architecture and design patterns … … MST MIP-NCC … + execute( … ) + execute (… MIP-NCC-Displ ): Displacement  Functional decomposition visible to - MIP_displacements: int[6] - NCC_widths: float[3] Strategy - NCC_maxs: float[3] the user through command line - reliab_factors: float[3] + evalReliability(): float options + … Abstract Factory March 26, 2018 9 University Campus Bio-Medico of Rome

  8. TeraStitcher : separating the strategy from the algorithm  Strategy (aims at minimizing memory requirements) o Divide the volume into layers (substacks) along Z o For each layer o Keep in memory a row (or column) of substacks at the time o For each pair of adjacent substacks, call  Algorithm o Compute the Maximum Intensity Projections (MIPs) along X, Y, and Z for the two substacks o Apply 2D Normalized Cross-Correlation (NCC) to the three pairs of MIPs to find three 2D displacements o Output “most - reliable” displacement for each direction March 26, 2018 10 University Campus Bio-Medico of Rome

  9. TeraStitcher : global optimization step  Global optimization is horizontal axis H performed using V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:0( ฀ ) computed (or even H:375( ฀ ) H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) D:0 ( ฀ ) D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:372( 1,41 ) MST along V H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) nominal) alignments H:3( 1,22 ) D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) D:-4( 1,35 ) V:0( ฀ ) V:0( ฀ ) V:0( ฀ ) V:1( 1,37 ) and their reliability H:375( ฀ ) H:375( ฀ ) H:373( 1,15 ) H:371( 1,35 ) D:0 ( ฀ ) D:0 ( ฀ ) D:2( 1,11 ) D:-3( 1,37 ) vertical axis V V:375( ฀ ) V:375( ฀ ) V:371( 1,36 ) V:372( 1,21 ) V:374( 1,25 ) H:0( ฀ ) H:0( ฀ ) H:2( 1,35 ) H:4( 1,35 ) H:6( 1,15 ) D:0( ฀ ) D:0( ฀ ) D:2( 1,33 ) D:-1( 1,31 ) D:2( 1,12 )  The resulting V:0( ฀ ) V:0( ฀ ) V:1( 1,40 ) V:3( 1,34 ) MST along H H:375( ฀ ) H:375( ฀ ) H:375( 1,26 ) H:373( 1,23 ) D:0 ( ฀ ) D:0 ( ฀ ) D:-1( 1,12 ) D:0( 1,08 ) alignments are used to V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:372( 1,35 ) H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) H:3( 1,22 ) D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) D:-1( 1,37 ) generate the xml file V:0( ฀ ) V:0( ฀ ) V:0( ฀ ) V:0( ฀ ) H:375( ฀ ) H:375( ฀ ) H:375 ( ฀ ) H:375( ฀ ) D:0 ( ฀ ) D:0( ฀ ) D:0( ฀ ) D:0 ( ฀ ) V:375( ฀ ) V:375( ฀ ) V:375( ฀ ) V:372( 1,37 ) V:374( 1,35 ) MST along D H:0( ฀ ) H:0( ฀ ) H:0( ฀ ) H:1( 1,30 ) H:5( 1,22 )  Manual intervention is D:0( ฀ ) D:0( ฀ ) D:0( ฀ ) D:0( 1,40 ) D:-1( 1,12 ) V:0( ฀ ) V:0( ฀ ) V:1( 1,35 ) V:0( 1,32 ) H:375( ฀ ) H:375( ฀ ) H:372( 1,40 ) H:373( 1,22 ) possible to correct D:0 ( ฀ ) D:0 ( ฀ ) D:-3( 1,09 ) D:-1( 1,10 ) isolated errors March 26, 2018 12 University Campus Bio-Medico of Rome

  10. TeraStitcher : developments and issues  Parallelization: maintenance issues  different alternatives to exploit parallel processing  different code for sequential and parallel version  Solution: launching multiple instances on disjoint sub- regions  command line options to specify the sub-region to process  use a scripting language (python) to implement a driver  Alignment computation (computation bound)  each instance processes a group of adjacent sub-stacks  merge the xml files with alignment information  Generation of the stitched image (I/O bound)  the directory structure is first generated  each instance writes disjoint groups of tiles in parallel  metadata to be used in subsequent steps are finally generated March 26, 2018 13 University Campus Bio-Medico of Rome

  11. Why do we need GPUs?  Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel)  Tile matrix: 3 x 3  Total dataset size: Tile size x #tiles ( >220 GByte )  Stitching on one Power8 (2.061 Ghz) core: ~15000 secs !  Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs !  GPUs come to rescue! March 26, 2018 16

  12. Normalized Cross Correlation  The most time-consuming part of the stitching process is the evaluation of the Normalized Cross- Correlation (NCC) between the Maximum Intensity Projections (MIP) of the tiles.  NCC is a variant of the classic Cross Correlation in which data are normalized by subtracting the mean and dividing by the standard deviation of the two datasets.  TeraStitcher implements a fast NCC from J. P. Lewis, “Fast Template Matching”, Vision Interface (1995). March 26, 2018 17

  13. CPU Normalized Cross Correlation f_mean = t_mean = 0.; for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) { for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) { f_mean += *pxl1; t_mean += *pxl2; } } f_mean /= (dimi*dimj); t_mean /= (dimi*dimj); numerator = factor1 = factor2 = 0.; for ( ij=0, pxl1=im1, pxl2=im2; ij <dimi*dimj; ij++) { f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean; t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean; numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime; factor1 += f_prime * f_prime; factor2 += t_prime * t_prime; } March 26, 2018 18

  14. From CPU TO GPU NCC  The previous fragment of code is repeated ( 2*delayu+1 ) x ( 2*delayv+1 ) times at each iteration (there are few hundreds iterations per run).  A typical value of delayu and delayv is ~ 100 We developed a gpu_NCC kernel that does all the work with a single invocation per iteration:  minimize the overhead due to the invocation  allow to overlap memory copies from CPU to GPU (using streams ) March 26, 2018 19

  15. GPU NCC  The gpu_NCC kernel heavily relies on shuffle primitives to compute the mean values required by the NCC #define CALCSUM(v) (v)=warpReduceSumD((v)); \ if(lane==0) shared[wid]=(v); \ __syncthreads(); \ (v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \ if(wid==0) (v)=warpReduceSumD((v)); \ if(tid==0) shared[0]=(v); \ __syncthreads(); \ (v)=shared[0]; \ __syncthreads(); #define CALCAVE(v) CALCSUM(v) \ (v) /= (dimi*dimj); March 26, 2018 20

  16. Test Platform  IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)  512 GB of RAM,  16 Power8 cores (SMT 8)  2 960 GB solid state disks  4 P100 GPUs. March 26, 2018 21

Recommend


More recommend