
Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology



  1. Computer Aided Detection (CAD) for 3D Breast Imaging and GPU Technology
     Xiangwei Zhang, Chui Haili
     Imaging and CAD Science, Hologic Inc., Santa Clara, CA
     03/19/2015

  2. Summary
     - Computer aided detection (CAD) of breast cancer in 3D digital breast tomosynthesis: an introduction;
     - GPU kernel optimization: a case study of convolution filtering;
     - GPU optimization in DBT CAD: GPU/CPU data copy; GPU memory management;
     - Conclusions/Questions;

  3. Cancer Incidence/Mortality Rates (USA)
     (Chart: number of cases by disease type, 2001 – American Cancer Society)

     Disease type        Incidence   Mortality
     Lung cancer         169,400     154,900
     Colorectal cancer   148,300      56,600
     Breast cancer       205,000      40,000
     Prostate cancer     189,000      30,200

  4. Computer Aided Detection (CAD)
     - Early detection of cancer is the key to reducing the mortality rate;
     - Medical imaging can help the early detection of cancer: breast X-ray mammography, chest X-ray, lung CT, colonoscopy, brain MRI, etc.;
     - Interpreting the images to find signs of cancer is very challenging for radiologists;
     - Automated processing using computer software helps radiologists in clinical decisions;
     - Various image analysis software exists, including computer aided detection (CAD) and diagnosis (CADx);

  5. Medical Imaging Applications
     (Images: lung CT, breast mammography, chest X-ray, colonoscopy)

  6. Micro-calcifications in digital mammography

  7. CAD for 2D Mammography
     - Each patient/examination has 4 views (left/right breast, CC/MLO views);
     - There are four 2D images to be processed;
     - CAD generates marker overlays (triangle: micro-calcification clusters; star: mass density or spiculation/architectural distortion);

  8. 2D Mammography CAD processing flow
     - Pre-processing: pixel value transformation (log, invert, scaling);
     - Segmentations: breast, pectoral muscle (MLO view), roll-off region;
     - Suspicious candidate generation: filtering (general and dedicated), region growing;
     - Region analysis/classification: feature extraction/selection, classification;
     - It takes ~10 seconds/view to complete (pure CPU implementation);

  9. Digital Breast Tomosynthesis (DBT)
     - Acquisition
       - Multiple 2D projection views (PVs, 11 to 15 of them) are acquired at different angles;
       - The angle span is limited (15 to 30 degrees) to get high in-plane resolution;
       - Each projection uses a much lower dose than 2D mammography;
     - Reconstruction
       - Back projection is used to reconstruct a 3D volume with a 1 mm slice interval;
       - A volume usually consists of 40 to 80 slices (1560x2457 pixels/slice);
     - Advantage (vs. 2D mammogram): reduced tissue overlap reveals 3D anatomical structures hidden in 2D;
     - Disadvantage: much more data to interpret and store;

  10. DBT acquisition and reconstruction
      (Diagram: an X-ray tube sweeps around the center of rotation above the compression paddle and compressed breast, acquiring projection views PV 1, PV 2, ..., PV m on the digital detector; the PVs are then reconstructed into a stack of slices)

  11. DBT CAD processing flow
      - Slice-by-slice processing (similar to the first three steps in 2D CAD);
      - 3D region growing;
      - 3D region analysis/classification;
      - Prototype (2007): pure CPU implementation; it takes ~10 minutes/view to complete; clinically unacceptable;
      - What can we do to speed it up? CUDA computation on GPGPU;

  12. GPU kernel performance optimization
      - Key requirements for good GPU kernel performance: sufficient parallelism; efficient memory access; efficient instruction execution;
      - Efficient memory access: a case study of 1D convolution on a 2D image with different implementations
        - CPU;
        - GPU: global memory; texture memory; shared memory;

  13. GPU CUDA memory space
      (Diagram: a grid of thread blocks; each block has its own shared memory, and each thread its own registers and local memory; the host and all blocks access global, constant and texture memory)

  14. GPU global memory access optimization
      - GPU global memory is DRAM: high latency, and not necessarily cached;
      - Many algorithms are memory-limited, or at least somewhat sensitive to memory bandwidth; the arithmetic operation to memory access ratio is low;
      - Optimization goal: maximize bandwidth utilization
        - Memory accesses are per warp (a warp is 32 consecutive threads in a single block);
        - Memory accesses are in discrete chunks (line: 128 bytes; segment: 32 bytes);
        - The key is to have sufficient concurrent memory access per warp;

  15. Efficient GPU memory access
      (Diagram: the addresses requested by a warp fall into consecutive, aligned 32-byte segments, so they are served by a single 4-segment, 128-byte transaction)

  16. Inefficient GPU memory access
      (Diagram: the addresses requested by a warp are scattered and misaligned across segments, so multiple transactions are needed and bandwidth is wasted)
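
To make the two access-pattern diagrams concrete, here is a minimal CUDA sketch (not from the deck; kernel names are mine) contrasting a coalesced copy, where the 32 threads of a warp read 32 consecutive 4-byte words and one 128-byte line serves the whole warp, with a strided copy that scatters the warp's requests over many lines:

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads word i, so a warp's 32 reads fall in one
// contiguous, aligned 128-byte line (32 threads x 4 bytes).
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read words "stride" apart, so one warp can
// touch up to 32 different lines and most of each transaction is wasted.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}

// Example launches:
//   copyCoalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
//   copyStrided  <<<(n + 255) / 256, 256>>>(d_in, d_out, n, 32);
```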

  17. A case study: 1D vertical convolution on 2D image
      (Diagram: a pixel of interest X in a 2D image, with its local vertical neighborhood highlighted)
      - Convolution: for each pixel, the new pixel value is the weighted sum of the pixel values in the defined neighborhood;
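
In symbols (notation added here, not on the slide), with a vertical filter of half-width K and weights w_k, the new value at row r and column c of image x is:

```latex
y(r, c) = \sum_{k=-K}^{K} w_k \, x(r + k,\; c)
```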

  18. Running time comparison: CPU and GPU
      - Platform
        - Host: Dell Precision 7500; CPU: Intel Xeon dual core @ 3.07 GHz / 3.06 GHz; RAM: 16.0 GB;
        - Device: GeForce GTX 690 (dual card); 3072 CUDA cores (1536x2), 16 SMs; 4 GB 512-bit GDDR5; PCI Express 3.0 x16; CUDA 5.5;
        - OS: Windows 7 Professional Service Pack 1 (2009);

  19. CPU based – two ways of serial processing
      - Assumption: the 2D image is stored in a contiguous linear memory space;
      (Diagram: left, row-wise traversal of the image; right, column-wise traversal)

  20. CPU based – two ways of serial processing
      - The speeds are different, due to the linear memory layout and data caching;
      - Vertical = 146.59 ms/run; Horizontal = 104.19 ms/run;
      (Chart: running time for the two CPU versions, in milliseconds/run)
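
A minimal host-side sketch of the two traversal orders (function and parameter names are mine, and the filter half-width is an assumed example). With the image stored row-major in one contiguous array, the column-wise ("vertical") loop strides by the image width on every access and keeps missing the cache, while the row-wise ("horizontal") loop walks memory sequentially:

```cuda
// Row-major storage: pixel (r, c) lives at img[r * width + c].
// Both functions compute the same 1D vertical convolution of half-width K;
// only the order in which output pixels are visited differs.
// Border rows (first/last K) are skipped for brevity.

void convolveColumnWise(const float* in, float* out, const float* w,
                        int K, int width, int height)
{
    for (int c = 0; c < width; ++c)              // outer loop over columns
        for (int r = K; r < height - K; ++r) {   // inner accesses stride by 'width'
            float sum = 0.0f;
            for (int k = -K; k <= K; ++k)
                sum += w[k + K] * in[(r + k) * width + c];
            out[r * width + c] = sum;
        }
}

void convolveRowWise(const float* in, float* out, const float* w,
                     int K, int width, int height)
{
    for (int r = K; r < height - K; ++r)         // outer loop over rows
        for (int c = 0; c < width; ++c) {        // inner accesses are sequential
            float sum = 0.0f;
            for (int k = -K; k <= K; ++k)
                sum += w[k + K] * in[(r + k) * width + c];
            out[r * width + c] = sum;
        }
}
```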

  21. Global memory based – thread-block design 1
      (Diagram: a pixel of interest X, its local neighborhood, and a vertical-bar CUDA thread-block in the 2D image)
      - CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread-blocks (1x128);

  22. Global memory based – thread-block design 2
      (Diagram: a pixel of interest X, its local neighborhood, and a horizontal-bar CUDA thread-block in the 2D image)
      - CUDA implementation: the whole image is divided into multiple horizontal-bar-shaped thread-blocks (128x1);

  23. Global memory based – Vertical vs Horizontal
      - The speeds are quite different, due to the linear memory layout and concurrent aligned reads within a warp;
      - Vertical = 13.325 ms/run; Horizontal = 1.652 ms/run;
      - ~60x speedup compared to the CPU version;
      (Chart: running time for the two GPU global-memory versions alongside the CPU versions, in milliseconds/run)
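
A minimal sketch of the global-memory kernel with the faster horizontal (128x1) thread-block layout; the kernel name, the half-width and keeping the weights in __constant__ memory are my assumptions, not details given in the deck. Threads with consecutive threadIdx.x handle consecutive columns, so every read of a given neighborhood row is fully coalesced across the warp, which is what drives the roughly 8x gap between the two layouts:

```cuda
#include <cuda_runtime.h>

#define HALF_K 7                               // example half-width (assumption)

__constant__ float d_w[2 * HALF_K + 1];        // filter weights, set once with
                                               // cudaMemcpyToSymbol on the host

// One thread per output pixel. Launch with
//   dim3 block(128, 1);  dim3 grid((width + 127) / 128, height);
__global__ void convolveVerticalGlobal(const float* in, float* out,
                                       int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int r = blockIdx.y;                              // row
    if (c >= width || r < HALF_K || r >= height - HALF_K)
        return;                                      // borders skipped for brevity

    float sum = 0.0f;
    for (int k = -HALF_K; k <= HALF_K; ++k)
        // Consecutive threads of a warp read consecutive addresses here,
        // so each neighborhood row is fetched as one coalesced transaction.
        sum += d_w[k + HALF_K] * in[(r + k) * width + c];
    out[r * width + c] = sum;
}
```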

  24. Texture memory based version
      - Texture memory: a read-only cache; good for scattered reads; caching granularity is 32 bytes (one segment);
      - Two different thread-blocks: vertical (1x128); horizontal (128x1);

  25. Texture memory based – Vertical vs Horizontal
      - The speeds are different;
      - Vertical = 2.507 ms/run; Horizontal = 1.707 ms/run;
      - Horizontal: comparable to the global memory version;
      - Vertical: much better than the global memory version (the texture cache handles scattered reads better);
      (Chart: running time for the global-memory and texture-memory GPU versions, in milliseconds/run)
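
A sketch of the texture-memory path using the texture-object API (available since CUDA 5.0 on compute capability 3.0 devices such as the GTX 690); the setup helper, names and half-width are my assumptions. All pixel reads go through tex2D, so the read-only texture cache absorbs the scattered per-warp accesses of the vertical layout:

```cuda
#include <cuda_runtime.h>
#include <cstring>

#define HALF_K 7   // example half-width (assumption)

// Works with either block layout: dim3 block(1, 128) or dim3 block(128, 1).
__global__ void convolveVerticalTexture(cudaTextureObject_t tex, float* out,
                                        const float* w, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c >= width || r >= height) return;

    float sum = 0.0f;
    for (int k = -HALF_K; k <= HALF_K; ++k)
        // Reads are served by the read-only texture cache; the clamp
        // address mode chosen at setup time handles the borders.
        sum += w[k + HALF_K] * tex2D<float>(tex, c + 0.5f, r + k + 0.5f);
    out[r * width + c] = sum;
}

// Host-side setup: bind a pitched 2D float image in global memory
// (e.g. allocated with cudaMallocPitch) to a texture object with
// point sampling and clamped addressing.
cudaTextureObject_t makeTexture(float* d_img, int width, int height,
                                size_t pitchInBytes)
{
    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType                  = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr       = d_img;
    res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width        = width;
    res.res.pitch2D.height       = height;
    res.res.pitch2D.pitchInBytes = pitchInBytes;

    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode     = cudaFilterModePoint;
    td.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;
}
```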

  26. Shared memory based version
      - Shared memory: a read/write cache inside each SM; lower latency than global memory, or even texture memory;
      - When used as a read cache, the original data still needs to be loaded from global memory first;
      - Two different thread-blocks: vertical (1x768); horizontal (32x24);

  27. Shared memory based – thread-block design 1
      (Diagram: a pixel of interest X, its local neighborhood, and a vertical-bar CUDA thread-block in the 2D image)
      - CUDA implementation: the whole image is divided into multiple vertical-bar-shaped thread-blocks (1x768);

  28. Shared memory based – thread-block design 2
      (Diagram: a pixel of interest X, its local neighborhood, and a 2D-tile CUDA thread-block in the 2D image)
      - CUDA implementation: the whole image is divided into multiple 2D tile thread-blocks (32x24 = 768 threads), the "horizontal" variant on the next slide;

  29. Shared memory based – Vertical vs Horizontal
      - The speeds are quite different;
      - Vertical = 3.725 ms/run; Horizontal = 1.084 ms/run;
      - Horizontal: better than both the global and texture memory versions;
      - Vertical: better than the global memory version, worse than the texture memory version;
      (Chart: running time for the global-, texture- and shared-memory GPU versions, in milliseconds/run)
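
A sketch of the shared-memory variant for the 32x24 ("horizontal") tile; the names, half-width and halo handling are my assumptions. Each block first stages its tile plus a HALF_K-row halo above and below into shared memory using coalesced global reads, synchronizes, and then convolves entirely out of the low-latency on-chip memory:

```cuda
#include <cuda_runtime.h>

#define HALF_K  7    // example half-width (assumption)
#define TILE_W 32    // threads in x
#define TILE_H 24    // threads in y; 32 x 24 = 768 threads per block

__constant__ float d_w[2 * HALF_K + 1];   // filter weights (cudaMemcpyToSymbol)

// Launch with dim3 block(TILE_W, TILE_H),
// dim3 grid((width + TILE_W - 1) / TILE_W, (height + TILE_H - 1) / TILE_H).
__global__ void convolveVerticalShared(const float* in, float* out,
                                       int width, int height)
{
    // The block's tile plus a HALF_K halo of rows above and below it.
    __shared__ float tile[TILE_H + 2 * HALF_K][TILE_W];

    int c      = blockIdx.x * TILE_W + threadIdx.x;   // global column
    int rowTop = blockIdx.y * TILE_H - HALF_K;        // first staged row

    // Cooperative load: the 768 threads sweep the (TILE_H + 2*HALF_K) x TILE_W
    // region row by row; each row load is coalesced across threadIdx.x.
    for (int y = threadIdx.y; y < TILE_H + 2 * HALF_K; y += TILE_H) {
        int r = rowTop + y;
        r = max(0, min(r, height - 1));               // clamp rows at the borders
        tile[y][threadIdx.x] = (c < width) ? in[r * width + c] : 0.0f;
    }
    __syncthreads();

    int r = blockIdx.y * TILE_H + threadIdx.y;        // global output row
    if (c >= width || r >= height) return;

    float sum = 0.0f;
    for (int k = -HALF_K; k <= HALF_K; ++k)
        // All neighborhood reads now hit shared memory instead of DRAM.
        sum += d_w[k + HALF_K] * tile[threadIdx.y + HALF_K + k][threadIdx.x];
    out[r * width + c] = sum;
}
```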
