A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang Parallel Processing Institute, Fudan University 1
Exploding Multimedia Data Figure from [Report on American Consumers 09] Cisco VNI Global Consumer Internet Traffic Forecast 2
Multimedia Retrieval App. Important to retrieve useful data E.g. medical imagery, video recommendation Data-intensive and computing-intensive Significant challenges for real-time retrieval 3
Multi-core Era Is Coming [Figure: software performance vs. # of cores, compared with single-core performance; from Michael McCool's (Intel) many-core slides] 4
New Opportunities Multi-core era needs parallelism Need a comprehensive study on parallelism characteristics in multimedia retrieval To optimize them on current architectures To design future architectures for them 5
Image Retrieval Image retrieval: also core of video retrieval 6
Image Retrieval feature extraction 7
Image Retrieval feature extraction + feature match 8
Image Retrieval (cont.) Two classes of algorithms Global feature based ~60% precision Color features [Wan 08] Texture features Local feature based Shape context accurate but time consuming SIFT Features SURF Features robust & appealing: insensitive to scale and rotation transformation [ Mikolajczyk 05, Bauer 07 ] 9
SURF Overview Input Image -> Detection (Integral Image -> Scale Space Analysis -> Interest Point Localization) -> Description (Orientation Assignment -> Descriptor Vector Construction) -> Features 10
Integral Image $I(x,y) = \sum_{i \le x,\; j \le y} g(i,j)$, where $g$ is the input image and $I$ is the integral image 11
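The integral image definition can be illustrated with a minimal sketch. It assumes the image is a plain list of rows; the names `integral_image` and `box_sum` are hypothetical and are not taken from OpenSURF. The point of the structure is that any rectangular sum then costs four lookups, independent of the rectangle size.

```python
def integral_image(g):
    """Return I where I[y][x] is the sum of g over rows 0..y and columns 0..x."""
    h, w = len(g), len(g[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(w):
            row_sum += g[y][x]
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I


def box_sum(I, y0, x0, y1, x1):
    """Sum of g over the inclusive rectangle rows y0..y1, columns x0..x1."""
    a = I[y1][x1]
    b = I[y0 - 1][x1] if y0 > 0 else 0
    c = I[y1][x0 - 1] if x0 > 0 else 0
    d = I[y0 - 1][x0 - 1] if y0 > 0 and x0 > 0 else 0
    return a - b - c + d


g = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
I = integral_image(g)
```

With this toy input, `box_sum(I, 1, 1, 2, 2)` adds the lower-right 2x2 corner (5 + 6 + 8 + 9).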
Scale Space Analysis Hessian matrix for (x,y) [Bay 06]: $H(x,y) = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}$; $\mathrm{Det}(x,y) = D_{xx} D_{yy} - 0.81\, D_{xy} D_{xy}$; insensitive to scale transformation 12
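As a small illustration of the determinant formula, the sketch below evaluates it per pixel. It assumes the box-filter responses Dxx, Dyy, and Dxy have already been computed as lists of rows; `hessian_det` and `det_map` are hypothetical names.

```python
def hessian_det(dxx, dyy, dxy, weight=0.81):
    """Approximate Hessian determinant: Dxx*Dyy - 0.81*Dxy*Dxy."""
    return dxx * dyy - weight * dxy * dxy


def det_map(Dxx, Dyy, Dxy):
    """Apply hessian_det element-wise to three equally sized response maps."""
    return [[hessian_det(a, b, c) for a, b, c in zip(ra, rb, rc)]
            for ra, rb, rc in zip(Dxx, Dyy, Dxy)]


# Toy 1x2 response maps, values chosen only for illustration.
dets = det_map([[2.0, 1.0]], [[3.0, 1.0]], [[1.0, 0.0]])
```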
Interest Point Localization Ipoint with max det value 13
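One common way to pick the points with the maximum determinant value is non-maximum suppression over position and scale. The sketch below assumes a 3x3x3 neighborhood and a fixed threshold, neither of which is stated on the slide; `local_maxima` is a hypothetical name, and sub-pixel interpolation is left out.

```python
from itertools import product


def local_maxima(det, threshold=0.0):
    """det[s][y][x] holds determinant responses for scale s.

    Returns (s, y, x) for every response that exceeds the threshold and is
    strictly larger than all 26 neighbors in the 3x3x3 neighborhood.
    """
    S, H, W = len(det), len(det[0]), len(det[0][0])
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    points = []
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                v = det[s][y][x]
                if v > threshold and all(
                        v > det[s + ds][y + dy][x + dx] for ds, dy, dx in offsets):
                    points.append((s, y, x))
    return points


# Toy response volume: a single peak in the middle of a 3x3x3 cube.
volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][1][1] = 5.0
peaks = local_maxima(volume)
```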
Orientation Assignment insensitive to rotation transformation Orientation Based on Haar Wavelet [ Bay 06 ] 14
Descriptor Vector Construction 15
Descriptor Vector Construction 64-dimension feature vector 4-dimension vector calculated based on Haar Wavelet 16
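How a 64-dimension vector arises from 4-dimension vectors can be sketched as follows. The sketch assumes the layout described in the SURF paper [Bay 06], which the slide does not spell out: a 4x4 grid of subregions, each contributing (sum dx, sum dy, sum |dx|, sum |dy|) of its Haar wavelet responses. Gaussian weighting is omitted, and `surf_descriptor` is a hypothetical name.

```python
import math


def surf_descriptor(dx, dy, grid=4):
    """Build a unit-length descriptor from square grids of Haar responses.

    dx and dy are n x n lists of responses with n divisible by `grid`.
    Each of the grid*grid subregions contributes four sums.
    """
    step = len(dx) // grid
    vec = []
    for by in range(grid):
        for bx in range(grid):
            cells = [(y, x)
                     for y in range(by * step, (by + 1) * step)
                     for x in range(bx * step, (bx + 1) * step)]
            vec += [sum(dx[y][x] for y, x in cells),
                    sum(dy[y][x] for y, x in cells),
                    sum(abs(dx[y][x]) for y, x in cells),
                    sum(abs(dy[y][x]) for y, x in cells)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


# Toy input: constant responses on a 20x20 sample grid.
dx = [[1.0] * 20 for _ in range(20)]
dy = [[-1.0] * 20 for _ in range(20)]
desc = surf_descriptor(dx, dy)
```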
Execution Profile of SURF Experiment environment: program: OpenSURF; input: 48 images; hardware: 16-core server, 32GB memory. Detection (27% time): Integral Image 1%, Scale Space Analysis 24%, Interest Point Localization 2%. Description (73% time): Orientation Assignment 20%, Descriptor Vector Construction 53%. 17
Interest Points Distribution Imbalanced distribution for images/blocks [Figure: # of interest points per block ID, with average line] 18
Parallel Analysis: Pipeline Parallelism; Task Parallelism (Scale-level Parallelism, Block-level Parallelism); Combination of Different Parallelism (Combination of SIMD and Other Parallelism, Combination of Pipeline and Task Parallelism) 19
2-stage Pipeline Detection writes interest points to the buffer; Description reads interest points from the buffer [Figure: Detection and Description stages connected through the buffer] 20
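The 2-stage structure is a producer-consumer pattern around a shared buffer. The sketch below only shows that structure, under several assumptions: `detect` and `describe` are hypothetical stand-ins that pass integers around instead of running SURF, the buffer is a bounded `queue.Queue`, and a sentinel value marks the end of the stream. Python threads will not reproduce the measured speedups.

```python
import queue
import threading

SENTINEL = None  # marks the end of the interest point stream


def detect(images, buf):
    """Stage 1: write every interest point of every image to the buffer."""
    for image in images:
        for point in image:  # stand-in: an "image" is a list of points
            buf.put(point)
    buf.put(SENTINEL)


def describe(buf, out):
    """Stage 2: read interest points from the buffer and describe them."""
    while True:
        point = buf.get()
        if point is SENTINEL:
            break
        out.append(("descriptor", point))  # stand-in for the real descriptor


def run_pipeline(images):
    buf = queue.Queue(maxsize=8)  # bounded buffer between the two stages
    out = []
    producer = threading.Thread(target=detect, args=(images, buf))
    consumer = threading.Thread(target=describe, args=(buf, out))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return out


features = run_pipeline([[1, 2], [3]])
```

With a single consumer the output order matches the detection order; adding more Description workers would need one sentinel per worker.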
3-stage Pipeline Further divide Description into two stages: Orientation Assignment and Descriptor Vector Construction [Figure: Detection -> Orientation Assignment -> Descriptor Vector Construction] 21
Results of Pipeline Parallelism Pipeline parallelism does not scale [Figure: speedup of the 2-stage and 3-stage pipelines] 22
Scale-level Parallelism Each scale computed concurrently; each group of interest points described concurrently [Figure: Integral Image -> Scale Space Analysis -> Interest Point Localization -> Description] 24
Results of Scale-level Parallelism Does not scale beyond 12 cores [Figure: speedup on 4, 8, 12, and 16 cores] 25
Results of Scale-level Parallelism Does not scale beyond 12 cores: imbalanced computation; non-trivial communication overhead [Figure: speedup on 4, 8, 12, and 16 cores] 26
Block-level Parallelism Sync between neighbor blocks [Figure: input image split into image blocks, each with its own Detection and Description] 28
Block-level Parallelism Sync between neighbor blocks: block-level parallelism with synchronization (Block-Sync) [Figure: input image split into image blocks, each with its own Detection and Description] 29
Block-level Parallelism Use additional computation to avoid sync [Figure: input image split into image blocks, each with its own Detection and Description] 30
Block-level Parallelism Use additional computation to avoid sync: block-level parallelism without synchronization (BlockPar) [Figure: input image split into image blocks, each with its own Detection and Description] 31
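One plausible reading of "additional computation to avoid sync" is that each block also processes a small overlap (halo) of its neighbors' rows, so no block has to wait for a neighbor. The sketch below illustrates that idea only. It assumes row-wise blocks and uses a 3x3 maximum filter as a stand-in for detection; `detect_rows`, `process_block`, and `blockpar` are hypothetical names, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_rows(tile):
    """Stand-in for detection: 3x3 maximum, neighbors clamped to the tile."""
    h, w = len(tile), len(tile[0])
    return [[max(tile[yy][xx]
                 for yy in range(max(y - 1, 0), min(y + 2, h))
                 for xx in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)]
            for y in range(h)]


def process_block(img, r0, r1, halo=1):
    """Process rows r0..r1-1 using a private tile that includes halo rows."""
    lo, hi = max(r0 - halo, 0), min(r1 + halo, len(img))
    tile = img[lo:hi]            # the block plus its halo rows
    out = detect_rows(tile)      # halo rows are computed redundantly
    return out[r0 - lo:r1 - lo]  # keep only the rows the block owns


def blockpar(img, n_blocks, halo=1):
    """Split the image into row blocks and process them independently."""
    h = len(img)
    bounds = [h * i // n_blocks for i in range(n_blocks + 1)]
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        parts = pool.map(lambda b: process_block(img, b[0], b[1], halo),
                         zip(bounds, bounds[1:]))
        return [row for part in parts for row in part]


# Toy image with deterministic pseudo-random values.
img = [[(7 * y * y + 3 * x * x + x * y) % 11 for x in range(6)] for y in range(10)]
```

Because the halo covers the filter radius, `blockpar(img, 3)` gives the same result as `detect_rows(img)` on the whole image, at the cost of recomputing the overlapping rows.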
Results of Block-level Parallelism BlockPar scales well [Figure: speedup of BlockPar and Block-Sync on 4, 8, 12, and 16 cores] 32
Results of Block-level Parallelism BlockPar scales well: communication overhead between cores is non-trivial, and it can be reduced by additional computation [Figure: speedup of BlockPar and Block-Sync on 4, 8, 12, and 16 cores] 33
Comparison for Each Parallelism Block-level parallelism is more efficient [Figure: speedup of Pipeline, ScalePar, and BlockPar on 4, 8, 12, and 16 cores] 34
Combination of SIMD and Other Parallelism Use ICC to generate SIMD instructions [Figure: speedup of Pipeline, ScalePar, and BlockPar on 4, 8, 12, and 16 cores] 36
Combination of SIMD and Other Parallelism Use ICC to generate SIMD instructions: 11% speedup [Figure: speedup of Pipeline+SIMD, ScalePar+SIMD, and BlockPar+SIMD on 4, 8, 12, and 16 cores] 37
Combination of Task & Pipeline BlockPar + Pipeline is the most efficient (13X) [Figure: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, and Block+Pipe+SIMD on 4, 8, 12, and 16 cores] 39
Combination of Task & Pipeline BlockPar + Pipeline is the most efficient (13X): less computation, better locality [Figure: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, and Block+Pipe+SIMD on 4, 8, 12, and 16 cores] 40
Comparison to Prior Work Compared to P-SURF [Zhang 10] on multi-core CPU [Figure: speedup of P-SURF, Our BlockPar, and Our Block+Pipe on 4, 8, 12, and 16 cores] 41
Comparison to Prior Work Compared to P-SURF [Zhang 10] on multi-core CPU: 1.84X speedup over P-SURF; non-trivial communication overhead [Figure: speedup of P-SURF, Our BlockPar, and Our Block+Pipe on 4, 8, 12, and 16 cores] 42
Comparison to Prior Work (cont.) Our implementation on GPGPU: initialization on CPU, BlockPar on GPGPU [Figure: execution time % of Init and SURF, Sequential vs. CPU + GPU; sequential SURF on CPU takes 99%, initialization 1%] 43 * Can be downloaded from http://www.mis.tu-darmstadt.de/surf
Comparison to Prior Work (cont.) Our implementation on GPGPU: initialization on CPU, BlockPar on GPGPU [Figure: execution time % of Init and SURF, Sequential vs. CPU + GPU; after BlockPar on GPGPU the bars are labeled 53% and 47%] 44
Comparison to Prior Work (cont.) Our implementation on GPGPU: initialization on CPU, BlockPar on GPGPU [Figure: execution time % of Init and SURF after BlockPar on GPGPU, bars labeled 53% and 47%, followed by a CPU + GPU pipeline] 45
Comparison to Prior Work (cont.) Compared to CUDA SURF * on GPGPU (Nvidia GTX 260) [Figure: speedup of CUDA SURF, Our BlockPar, and Our Block+Pipe; bar labels 46X, 30X, 30X] 46 * Can be downloaded from http://www.mis.tu-darmstadt.de/surf
Comparison to Prior Work (cont.) Compared to CUDA SURF * on GPGPU (Nvidia GTX 260): 1.53X speedup over CUDA SURF; CPU+GPU pipeline not exploited [Figure: speedup of CUDA SURF, Our BlockPar, and Our Block+Pipe; bar labels 46X, 30X, 30X] 47