Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology
Xiangwei Zhang, Haili Chui
Imaging and CAD Science, Hologic Inc., Santa Clara, CA
03/19/2015
Summary
- Computer aided detection (CAD) of breast cancer in 3D digital breast tomosynthesis: an introduction;
- GPU kernel optimization: a case study of convolution filtering;
- GPU optimization in DBT CAD: GPU/CPU data copy; GPU memory management;
- Conclusions/Questions
Cancer Incidence/Mortality Rates (USA)

Disease Type        Incidence   Mortality
Lung Cancer         169,400     154,900
Colorectal Cancer   148,300     56,600
Breast Cancer       205,000     40,000
Prostate Cancer     189,000     30,200

Source: American Cancer Society, 2001.
Computer Aided Detection (CAD)
- Early detection of cancer is the key to reducing the mortality rate;
- Medical imaging can help with the early detection of cancer: breast X-ray mammography, chest X-ray, lung CT, colonoscopy, brain MRI, etc.;
- Interpreting the images to find signs of cancer is very challenging for radiologists;
- Automated processing using computer software supports radiologists in clinical decision making;
- Various image analysis software exists, including computer aided detection (CAD) and computer aided diagnosis (CADx);
Medical Imaging Applications
[Figure: example images of lung CT, breast mammography, chest X-ray, and colonoscopy.]
Micro-calcifications in Digital Mammography
[Figure: micro-calcifications in a digital mammogram.]
CAD for 2D Mammography
- Each patient/examination has 4 views (left/right breast, CC/MLO views), so there are four 2D images to be processed;
- CAD generates marker overlays (triangle: micro-calcification cluster; star: mass density or spiculation/architectural distortion);
2D Mammography CAD Processing Flow
- Pre-processing: pixel value transformation (log, invert, scaling);
- Segmentation: breast, pectoral muscle (MLO view), roll-off region;
- Suspicious candidate generation: filtering (general and dedicated), region growing;
- Region analysis/classification: feature extraction/selection, classification;
- It takes ~10 seconds/view to complete (pure CPU implementation);
Digital Breast Tomosynthesis (DBT)
- Acquisition: multiple 2D projection views (PVs), typically 11 to 15, are acquired at different angles; the angular span is limited (15 to 30 degrees) to obtain high in-plane resolution; each projection uses a much lower dose than 2D mammography;
- Reconstruction: back projection is used to reconstruct a 3D volume with a 1 mm slice interval; a volume usually consists of 40 to 80 slices (1560x2457 pixels/slice);
- Advantage (vs 2D mammogram): reduced tissue overlap reveals 3D anatomical structures hidden in 2D;
- Disadvantage: much more data to interpret and store;
DBT Acquisition and Reconstruction
[Diagram: an X-ray tube sweeps about a center of rotation above the compression paddle and compressed breast, acquiring projection views PV 1, PV 2, ..., PV m on a digital detector; the PVs are reconstructed into a stack of slices.]
DBT CAD Processing Flow
- Slice-by-slice processing (similar to the first three steps in 2D CAD);
- 3D region growing;
- 3D region analysis/classification;
- Prototype (2007): pure CPU implementation; it takes ~10 minutes/view to complete, which is clinically unacceptable;
- What can we do to speed this up? CUDA computation on a GPGPU;
GPU Kernel Performance Optimization
- Key requirements for good GPU kernel performance: sufficient parallelism; efficient memory access; efficient instruction execution;
- Efficient memory access: a case study of 1D convolution on a 2D image with different implementations: CPU; GPU global memory; texture memory; shared memory;
GPU CUDA Memory Space
[Diagram: CUDA memory hierarchy. Each thread has its own registers and local memory; each block of a grid has shared memory; all blocks share the device-wide global, constant, and texture memories, which the host can also access.]
GPU Global Memory Access Optimization
- GPU global memory is DRAM: high latency, and not necessarily cached;
- Many algorithms are memory-limited, or at least somewhat sensitive to memory bandwidth: their ratio of arithmetic operations to memory accesses is low;
- Optimization goal: maximize bandwidth utilization;
- Memory accesses are issued per warp (a warp is 32 consecutive threads in a single block) and served in discrete chunks (lines of 128 bytes; segments of 32 bytes);
- The key is to have sufficient concurrent memory access per warp;
Efficient GPU Memory Access
[Diagram: two examples of efficient access. The 32 addresses from a warp fall within one aligned 128-byte region of the memory address range shown (0 to 448), so the warp is served by a single 4-segment transaction.]
Not Efficient GPU Memory Access
[Diagram: an example of inefficient access. The addresses from a warp are spread across or misaligned with the 128-byte lines, so multiple transactions are issued and part of each fetched chunk is wasted.]
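To make the two access patterns concrete, here is a minimal CUDA sketch (kernel and variable names are ours, not from the slides) contrasting a coalesced copy, where each warp touches one aligned 128-byte line, with a strided copy, where each thread lands in a different segment:

```cuda
// Coalesced: consecutive threads read consecutive addresses.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // one warp is served by one aligned 128-byte line
}

// Strided: consecutive threads read addresses `stride` elements apart.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];  // each thread hits a different 32-byte segment
}
```

With a large stride, every thread of the warp triggers its own memory transaction, and most of each fetched chunk is thrown away.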
A Case Study: 1D Vertical Convolution on a 2D Image
- Convolution: for each pixel of interest, the new pixel value is the weighted sum of the pixel values in a defined local neighborhood (here, a vertical 1D neighborhood);
[Figure: a pixel of interest X in a 2D image, with its local vertical neighborhood marked.]
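In symbols (notation ours): writing f for the input image, w for the kernel weights, and r for the kernel half-width, the vertical 1D convolution computes

    g(x, y) = \sum_{k=-r}^{r} w_k \, f(x, y + k)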
Running Time Comparison: CPU and GPU Platform
- Host: Dell Precision 7500; CPU: dual Intel Xeon dual-core @3.07GHz/@3.06GHz; RAM: 16.0GB;
- Device: GeForce GTX 690 (dual card); 3072 CUDA cores (1536x2), 16 SMs; 4GB 512-bit GDDR5; PCI Express 3.0 x16; CUDA 5.5;
- OS: Windows 7 Professional Service Pack 1, 2009;
CPU Based: Two Ways of Serial Processing
- Assumption: the 2D image is stored in a contiguous linear memory space;
- Left: row-wise traversal; right: column-wise traversal (see the sketch below);
[Figure: the two traversal orders over the 2D image.]
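A host-side sketch of the two loop orders (process() is a hypothetical stand-in for the per-pixel convolution work; all names are ours):

```cuda
void process(float v);  // hypothetical per-pixel work

// The image is stored row-major: pixel (x, y) lives at img[y * width + x].

// Row-wise (horizontal): consecutive iterations touch consecutive
// addresses, so cache lines and hardware prefetching are used well.
void rowWise(const float* img, int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            process(img[y * width + x]);
}

// Column-wise (vertical): consecutive iterations jump by `width`
// elements, so each access is likely a cache miss.
void columnWise(const float* img, int width, int height)
{
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            process(img[y * width + x]);
}
```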
CPU Based: Two Ways of Serial Processing (Results)
- The speeds are different, due to the linear memory structure and data caching;
- Vertical (column-wise) = 146.59 ms/run; horizontal (row-wise) = 104.19 ms/run;
Global Memory Based: Thread-Block Design 1
- CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread-blocks (1x128);
[Figure: a vertical 1x128 CUDA thread-block over the 2D image, with a pixel of interest X and its local neighborhood.]
Global Memory Based: Thread-Block Design 2
- CUDA implementation: the whole image is divided into multiple horizontal-bar-shaped thread-blocks (128x1); a kernel sketch for this design follows;
[Figure: a horizontal 128x1 CUDA thread-block over the 2D image.]
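A minimal sketch of the design-2 kernel (KRADIUS, d_weights, and all names are our assumptions, not the authors' code). With 128x1 blocks, the threads of a warp map to adjacent x positions, so each read of a row is one coalesced line of global memory:

```cuda
#define KRADIUS 8  // kernel half-width; an assumed value, not from the slides

__constant__ float d_weights[2 * KRADIUS + 1];  // convolution weights

__global__ void convV_global(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // adjacent threads -> adjacent x
    int y = blockIdx.y;                             // one image row per block row
    if (x >= width) return;

    float sum = 0.0f;
    for (int k = -KRADIUS; k <= KRADIUS; ++k) {
        int yy = min(max(y + k, 0), height - 1);    // clamp at the image border
        sum += d_weights[k + KRADIUS] * in[yy * width + x];
    }
    out[y * width + x] = sum;
}
```

It would be launched with, e.g., dim3 block(128, 1) and dim3 grid((width + 127) / 128, height). The 1x128 vertical design is the same kernel with x and y roles swapped, which makes every global read strided by `width`.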
Global Memory Based: Vertical vs Horizontal (Results)
- The speeds are quite different, due to the linear memory structure and concurrent aligned reading within a warp;
- Vertical = 13.325 ms/run; horizontal = 1.652 ms/run;
- The horizontal version is a ~60x speedup over the CPU version;
Texture Memory Based Version
- Texture memory: a read-only cache; good for scattered reads; the cache granularity is 32 bytes (one segment);
- Two different thread-block shapes: vertical (1x128); horizontal (128x1); a kernel sketch follows;
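A hedged sketch of the texture variant, reusing KRADIUS and d_weights from the global-memory sketch. We assume the cudaTextureObject_t is created on the host with cudaCreateTextureObject over the input image, with clamp addressing and unnormalized coordinates:

```cuda
__global__ void convV_texture(cudaTextureObject_t texIn, float* out,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -KRADIUS; k <= KRADIUS; ++k)
        // The texture cache absorbs the scattered vertical reads that
        // penalize the plain global-memory version; clamp addressing
        // handles the image border, and +0.5f centers on the texel.
        sum += d_weights[k + KRADIUS] *
               tex2D<float>(texIn, x + 0.5f, y + k + 0.5f);
    out[y * width + x] = sum;
}
```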
Texture Memory Based: Vertical vs Horizontal (Results)
- The speeds are different: vertical = 2.507 ms/run; horizontal = 1.707 ms/run;
- Horizontal: comparable to the global memory version;
- Vertical: much better than the global memory version (the texture cache is better at scattered reads);
Shared Memory Based Version
- Shared memory: a read/write cache in each SM; low latency compared to global memory, or even texture memory;
- When used as a read cache, the original data still needs to be loaded from global memory first;
- Two different thread-block shapes: vertical (1x768); horizontal (32x24);
Shared Memory Based: Thread-Block Design 1
- CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread-blocks (1x768);
[Figure: a vertical 1x768 CUDA thread-block over the 2D image.]
Shared Memory Based: Thread-Block Design 2
- CUDA implementation: the whole image is divided into multiple 2D tile-shaped thread-blocks (32x24 = 768 threads; the "horizontal" configuration); a sketch of this design follows;
[Figure: a 32x24 CUDA thread-block tile over the 2D image.]
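A sketch of the 32x24 tile design; the tile sizes, halo handling, and names are our reconstruction of the idea, not the authors' code (KRADIUS and d_weights are reused from the earlier sketches):

```cuda
#define TILE_X 32
#define TILE_Y 24

__global__ void convV_shared(const float* in, float* out, int width, int height)
{
    // The tile plus its vertical halo of KRADIUS rows on each side.
    __shared__ float tile[TILE_Y + 2 * KRADIUS][TILE_X];

    int x  = blockIdx.x * TILE_X + threadIdx.x;
    int y0 = blockIdx.y * TILE_Y;

    // Cooperatively stage the tile and halo; each load is a coalesced
    // row read of 32 consecutive floats from global memory.
    for (int row = threadIdx.y; row < TILE_Y + 2 * KRADIUS; row += TILE_Y) {
        int yy = min(max(y0 + row - KRADIUS, 0), height - 1);  // clamp border
        if (x < width)
            tile[row][threadIdx.x] = in[yy * width + x];
    }
    __syncthreads();

    int y = y0 + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -KRADIUS; k <= KRADIUS; ++k)  // all reads now hit shared memory
        sum += d_weights[k + KRADIUS] * tile[threadIdx.y + KRADIUS + k][threadIdx.x];
    out[y * width + x] = sum;
}
```

Launched with dim3 block(TILE_X, TILE_Y), each pixel of the tile is read from global memory roughly once, and the (2*KRADIUS + 1) neighborhood reads come from low-latency shared memory.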
Shared Memory Based: Vertical vs Horizontal (Results)
- The speeds are quite different: vertical = 3.725 ms/run; horizontal = 1.084 ms/run;
- Horizontal: better than both the global and texture memory versions;
- Vertical: better than global memory, worse than texture memory;