
A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm



  1. A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm. Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang. Parallel Processing Institute, Fudan University.

  2. Exploding Multimedia Data. [Figure sources: Report on American Consumers 09; Cisco VNI Global Consumer Internet Traffic Forecast.]

  3. Multimedia Retrieval Applications: it is important to retrieve useful data (e.g. medical imagery, video recommendation). Retrieval is data-intensive and computing-intensive, which poses significant challenges for real-time retrieval.

  4. Multi-core Era Is Coming. [Figure: performance vs. # of cores, with the labels "Software Performance" and "Single Core"; from Michael McCool's (Intel) many-core slides.]

  5. New Opportunities: the multi-core era needs parallelism. A comprehensive study of the parallelism characteristics of multimedia retrieval is needed, both to optimize these applications on current architectures and to design future architectures for them.

  6. Image Retrieval: image retrieval is also the core of video retrieval.

  7. Image Retrieval: feature extraction.

  8. Image Retrieval: feature extraction + feature match.

  9. Image Retrieval (cont.): two classes of algorithms. Global feature based (~60% precision): color features [Wan 08], texture features. Local feature based (shape context, SIFT features, SURF features): accurate but time consuming; robust & appealing, insensitive to scale and rotation transformation [Mikolajczyk 05, Bauer 07].

  10. SURF Overview: Input Image → Detection (Integral Image, Scale Space Analysis, Interest Point Localization) → Description (Orientation Assignment, Descriptor Vector Construction) → Features.

  11. Integral Image: I(x,y) = ∑ g(i,j) over all i ≤ x and j ≤ y, where g is the input image and I is the integral image.
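The definition on this slide can be sketched in a few lines. This is a minimal pure-Python illustration, not the OpenSURF code; the function names `integral_image` and `box_sum` are made up for the example. It also shows why the integral image is useful: any rectangular sum of `g` needs at most four lookups.

```python
def integral_image(g):
    """Return I where I[y][x] is the sum of g over rows 0..y and columns 0..x."""
    height, width = len(g), len(g[0])
    I = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += g[y][x]
            # Sum of the current row up to x, plus everything above it.
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I


def box_sum(I, top, left, bottom, right):
    """Sum of g over the inclusive rectangle, using at most four lookups in I."""
    total = I[bottom][right]
    if top > 0:
        total -= I[top - 1][right]
    if left > 0:
        total -= I[bottom][left - 1]
    if top > 0 and left > 0:
        total += I[top - 1][left - 1]
    return total


g = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
I = integral_image(g)
lower_right = box_sum(I, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

Box sums of this kind are what the later box-filter responses (Dxx, Dyy, Dxy) are built from.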

  12. Scale Space Analysis (insensitive to scale transformation): Hessian matrix for (x,y) [Bay 06], H = [Dxx Dxy; Dxy Dyy]. Det(x,y) = Dxx*Dyy - 0.81*Dxy*Dxy.
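As a small worked example of the determinant formula above, the sketch below evaluates it for two sets of made-up filter responses. The function name `hessian_det` and the sample values are assumptions for illustration; only the formula and the 0.81 weight come from the slide.

```python
def hessian_det(dxx, dyy, dxy, weight=0.81):
    """Approximate Hessian determinant: Dxx*Dyy - 0.81*Dxy*Dxy."""
    return dxx * dyy - weight * dxy * dxy


# Hypothetical responses: same-signed Dxx and Dyy give a positive determinant
# (blob-like structure), opposite signs give a negative one (saddle).
blob = hessian_det(2.0, 3.0, 1.0)     # 6.0 - 0.81
saddle = hessian_det(1.0, -1.0, 0.0)  # -1.0 - 0.0
```

In practice the function would be applied at every (x,y) and every scale to build the response maps that the next step searches.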

  13. Interest Point Localization: an interest point (Ipoint) is a point with the maximum det value.
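One way to read "maximum det value" is as a local maximum over position and scale. The sketch below assumes a 3x3x3 neighbourhood (the point, its 8 spatial neighbours, and the 9 points in each adjacent scale) and a hypothetical `threshold` parameter; neither detail is stated on the slide, and sub-pixel interpolation is left out.

```python
def local_maxima(det, threshold=0.0):
    """det[s][y][x] holds determinant responses per scale; returns (s, y, x) maxima."""
    points = []
    n_scales, height, width = len(det), len(det[0]), len(det[0][0])
    for s in range(1, n_scales - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                value = det[s][y][x]
                if value <= threshold:
                    continue
                # All 26 neighbours in the 3x3x3 cube around (s, y, x).
                neighbours = [
                    det[s + ds][y + dy][x + dx]
                    for ds in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                    if (ds, dy, dx) != (0, 0, 0)
                ]
                if value > max(neighbours):
                    points.append((s, y, x))
    return points


# Toy response volume: 3 scales of 3x3 responses with one clear peak.
det = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
det[1][1][1] = 5.0
peaks = local_maxima(det)
```

Because every candidate only reads its own small neighbourhood, the loop bodies are independent, which is what the later block-level decomposition relies on.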

  14. Orientation Assignment (insensitive to rotation transformation): the orientation is based on Haar wavelet responses [Bay 06].

  15. Descriptor Vector Construction.

  16. Descriptor Vector Construction: a 64-dimension feature vector, assembled from 4-dimension vectors calculated based on Haar wavelet responses.
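The 64 dimensions can be read as 16 sub-regions times one 4-dimension vector each. The sketch below assumes the layout of the original SURF descriptor (a 4x4 grid of sub-regions, each contributing Σdx, Σdy, Σ|dx|, Σ|dy|, followed by normalization to unit length); the slide itself only states the 64 and 4 dimensions, and the function names are hypothetical.

```python
import math


def subregion_vector(responses):
    """4-dimension vector from the (dx, dy) Haar wavelet responses of one sub-region."""
    responses = list(responses)
    return [
        sum(dx for dx, _ in responses),
        sum(dy for _, dy in responses),
        sum(abs(dx) for dx, _ in responses),
        sum(abs(dy) for _, dy in responses),
    ]


def descriptor(subregions):
    """Concatenate 16 sub-region vectors into a unit-length 64-dimension vector."""
    if len(subregions) != 16:
        raise ValueError("expected 16 sub-regions (assumed 4x4 grid)")
    vec = []
    for responses in subregions:
        vec.extend(subregion_vector(responses))
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Made-up responses: two samples per sub-region, identical in all 16 sub-regions.
subregions = [[(1.0, -2.0), (-1.0, 2.0)] for _ in range(16)]
vec = descriptor(subregions)
```

Each interest point is described independently of the others, so this stage parallelizes naturally over interest points.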

  17. Execution Profile of SURF. Experiment environment: program OpenSURF; input 48 images; hardware 16-core server with 32GB memory. Detection takes 27% of the time (Integral Image 1%, Scale Space Analysis 24%, Interest Point Localization 2%); Description takes 73% of the time (Orientation Assignment 20%, Descriptor Vector Construction 53%).

  18. Interest Points Distribution: the distribution is imbalanced across images/blocks. [Chart: # of interest points (axis 0 to 600) per block ID (axis 0 to 192), with an average line.]

  19. Parallel Analysis: pipeline parallelism; task parallelism (scale-level parallelism, block-level parallelism); combination of different parallelism (combination of SIMD and other parallelism, combination of pipeline and task parallelism).

  20. 2-stage Pipeline: Detection writes interest points to the buffer; Description reads interest points from the buffer. [Diagram: Detection and Description stages connected through buffers.]
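The producer/consumer structure of the 2-stage pipeline can be sketched with a bounded queue as the buffer. This is an illustration of the structure only: `detect` and `describe` are placeholder stand-ins rather than SURF code, the number of Description workers is an assumed parameter, and CPython threads will not show real CPU speedup because of the global interpreter lock.

```python
import queue
import threading

SENTINEL = None  # marks the end of the interest point stream


def detect(image):
    """Placeholder detection: every positive value counts as an interest point."""
    return [(i, v) for i, v in enumerate(image) if v > 0]


def describe(point):
    """Placeholder description: derive a 'feature' from one interest point."""
    index, value = point
    return (index, value * value)


def run_pipeline(images, n_describers=2):
    buffer = queue.Queue(maxsize=64)  # the buffer between the two stages
    results = []
    lock = threading.Lock()

    def detection_stage():
        for image in images:
            for point in detect(image):
                buffer.put(point)      # Detection writes interest points
        for _ in range(n_describers):
            buffer.put(SENTINEL)       # one stop marker per Description worker

    def description_stage():
        while True:
            point = buffer.get()       # Description reads interest points
            if point is SENTINEL:
                break
            feature = describe(point)
            with lock:
                results.append(feature)

    threads = [threading.Thread(target=detection_stage)]
    threads += [threading.Thread(target=description_stage)
                for _ in range(n_describers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


features = run_pipeline([[0, 3, 0], [2, 0, 1]])
```

Every interest point crosses the shared buffer, so the buffer is the point where the two stages contend with each other.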

  21. 3-stage Pipeline: further divide Description into two stages, Orientation Assignment and Descriptor Vector Construction. [Diagram: Detection → Orientation Assignment → Descriptor Vector Construction.]

  22. Results of Pipeline Parallelism: pipeline parallelism does not scale. [Chart: speedup (axis 0 to 4) of the 2-stage and 3-stage pipelines.]

  23. Parallel Analysis: pipeline parallelism; task parallelism (scale-level parallelism, block-level parallelism); combination of different parallelism (combination of SIMD and other parallelism, combination of task and pipeline parallelism).

  24. Scale-level Parallelism: each scale is computed concurrently (Integral Image → Scale Space Analysis → Interest Point Localization), and each group of interest points is described concurrently (Description).

  25. Results of Scale-level Parallelism: does not scale beyond 12 cores. [Chart: speedup (axis 0 to 8) on 4, 8, 12 and 16 cores.]

  26. Results of Scale-level Parallelism: does not scale beyond 12 cores. Reasons: imbalanced computation; non-trivial communication overhead. [Chart: speedup (axis 0 to 8) on 4, 8, 12 and 16 cores.]
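The effect of imbalanced computation can be illustrated with a toy scheduling model. The task costs below are invented, not measured values from the talk, and the greedy longest-task-first assignment is just one simple scheduling assumption. The point is that with a fixed set of unequal tasks, speedup saturates at (total work) / (largest task) no matter how many cores are added.

```python
import heapq


def makespan(task_costs, n_cores):
    """Greedy longest-task-first assignment; returns the busiest core's load."""
    loads = [0.0] * n_cores
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)  # give the task to the least loaded core
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def speedup(task_costs, n_cores):
    """Sequential time divided by the parallel makespan."""
    return sum(task_costs) / makespan(task_costs, n_cores)


costs = [8, 6, 4, 3, 2, 1]  # hypothetical per-scale work, in arbitrary units
for cores in (2, 4, 16):
    print(cores, speedup(costs, cores))
```

With these invented costs the model gives 2.0 on 2 cores and 3.0 on both 4 and 16 cores, since the largest task (8 of 24 units) bounds the makespan. Communication overhead is not modelled here.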

  27. Parallel Analysis: pipeline parallelism; task parallelism (scale-level parallelism, block-level parallelism); combination of different parallelism (combination of SIMD and other parallelism, combination of task and pipeline parallelism).

  28. Block-level Parallelism: the input image is split into image blocks; each block runs Detection and then Description, with synchronization between neighbor blocks.

  29. Block-level Parallelism: the input image is split into image blocks; each block runs Detection and then Description, with synchronization between neighbor blocks. This is block-level parallelism with synchronization (Block-Sync).

  30. Block-level Parallelism: the input image is split into image blocks; each block runs Detection and then Description, using additional computation to avoid synchronization.

  31. Block-level Parallelism: the input image is split into image blocks; each block runs Detection and then Description, using additional computation to avoid synchronization. This is block-level parallelism without synchronization (BlockPar).
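One plausible way to trade additional computation for synchronization is to give each block a private copy that includes a few overlapping border rows (a halo), so that no block has to wait for its neighbors. The sketch below assumes this halo scheme and uses a 3x3 box sum as a stand-in for the real filters; the slides do not say how BlockPar does it, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

RADIUS = 1  # half-width of the 3x3 filter; also the halo height in rows


def box_filter(image):
    """3x3 box sums; pixels outside the image count as zero."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for yy in range(max(0, y - RADIUS), min(height, y + RADIUS + 1)):
                for xx in range(max(0, x - RADIUS), min(width, x + RADIUS + 1)):
                    out[y][x] += image[yy][xx]
    return out


def process_block(image, first, last):
    """Filter rows first..last-1 from a private copy that includes halo rows."""
    lo = max(0, first - RADIUS)
    hi = min(len(image), last + RADIUS)
    local = [row[:] for row in image[lo:hi]]  # block + halo: the extra work
    filtered = box_filter(local)
    return filtered[first - lo:last - lo]     # discard results for halo rows


def block_parallel(image, n_blocks):
    """Split the image into row blocks and filter them with no neighbor sync."""
    height = len(image)
    bounds = [(i * height // n_blocks, (i + 1) * height // n_blocks)
              for i in range(n_blocks)]
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        parts = pool.map(lambda b: process_block(image, *b), bounds)
        return [row for part in parts for row in part]


image = [[(3 * y + x) % 7 for x in range(5)] for y in range(8)]
same = block_parallel(image, 4) == box_filter(image)
```

The halo rows are filtered twice (once by each neighboring block), which is the price paid for removing the synchronization.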

  32. Results of Block-level Parallelism: BlockPar scales well. [Chart: speedup (axis 0 to 10) of BlockPar and Block-Sync on 4, 8, 12 and 16 cores.]

  33. Results of Block-level Parallelism: BlockPar scales well. Communication overhead between cores is non-trivial, and it can be reduced by additional computation. [Chart: speedup (axis 0 to 10) of BlockPar and Block-Sync on 4, 8, 12 and 16 cores.]

  34. Comparison for Each Parallelism: block-level parallelism is more efficient. [Chart: speedup (axis 0 to 10) of Pipeline, ScalePar and BlockPar on 4, 8, 12 and 16 cores.]

  35. Parallel Analysis: pipeline parallelism; task parallelism (scale-level parallelism, block-level parallelism); combination of different parallelism (combination of SIMD and other parallelism, combination of task and pipeline parallelism).

  36. Combination of SIMD with Other Parallelism: ICC is used to generate SIMD instructions. [Chart: speedup (axis 0 to 12) of Pipeline, ScalePar and BlockPar on 4, 8, 12 and 16 cores.]

  37. Combination of SIMD with Other Parallelism: ICC is used to generate SIMD instructions; callout: 11% speedup. [Chart: speedup (axis 0 to 12) of Pipeline+SIMD, ScalePar+SIMD and BlockPar+SIMD on 4, 8, 12 and 16 cores.]

  38. Parallel Analysis: pipeline parallelism; task parallelism (scale-level parallelism, block-level parallelism); combination of different parallelism (combination of SIMD and other parallelism, combination of task and pipeline parallelism).

  39. Combination of Task & Pipeline: BlockPar + Pipeline is the most efficient; callout: 13X. [Chart: speedup (axis 0 to 14) of BlockPar, Block+Pipe, BlockPar+SIMD and Block+Pipe+SIMD on 4, 8, 12 and 16 cores.]

  40. Combination of Task & Pipeline: BlockPar + Pipeline is the most efficient; callout: 13X. Reasons: less computation; better locality. [Chart: speedup (axis 0 to 14) of BlockPar, Block+Pipe, BlockPar+SIMD and Block+Pipe+SIMD on 4, 8, 12 and 16 cores.]

  41. Comparison to Prior Work: compared to P-SURF [Zhang 10] on multi-core CPU. [Chart: speedup (axis 0 to 12) of P-SURF, our BlockPar and our Block+Pipe on 4, 8, 12 and 16 cores.]

  42. Comparison to Prior Work: compared to P-SURF [Zhang 10] on multi-core CPU. 1.84X speedup over P-SURF; non-trivial communication overhead. [Chart: speedup (axis 0 to 12) of P-SURF, our BlockPar and our Block+Pipe on 4, 8, 12 and 16 cores.]

  43. Comparison to Prior Work (cont.): our implementation on GPGPU, with initialization on CPU and BlockPar on GPGPU. With sequential SURF on CPU, SURF accounts for 99% of the execution time and initialization (Init) for 1%. [Chart: execution time % of Init and SURF; sequential CPU + GPU.] * Can be downloaded from http://www.mis.tu-darmstadt.de/surf

  44. Comparison to Prior Work (cont.): our implementation on GPGPU. After BlockPar on GPGPU, the execution time is split roughly in half (53% and 47%) between initialization on CPU and SURF (BlockPar) on GPGPU. [Chart: execution time % of Init and SURF; sequential CPU + GPU.]

  45. Comparison to Prior Work (cont.): our implementation on GPGPU. After BlockPar on GPGPU, the execution time is split roughly in half (53% and 47%) between initialization on CPU and SURF (BlockPar) on GPGPU. [Diagram: CPU + GPU pipeline, with initialization on CPU and BlockPar on GPGPU as stages.]

  46. Comparison to Prior Work (cont.): compared to CUDA SURF * on GPGPU (Nvidia GTX 260). [Chart: speedup (axis 0 to 50) of CUDA SURF, our BlockPar and our Block+Pipe; bar labels 46X, 30X, 30X.] * Can be downloaded from http://www.mis.tu-darmstadt.de/surf

  47. Comparison to Prior Work (cont.): compared to CUDA SURF * on GPGPU (Nvidia GTX 260). 1.53X speedup over CUDA SURF; CPU+GPU pipeline not exploited. [Chart: speedup (axis 0 to 50) of CUDA SURF, our BlockPar and our Block+Pipe; bar labels 46X, 30X, 30X.] * Can be downloaded from http://www.mis.tu-darmstadt.de/surf
