A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm. Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang. Parallel Processing Institute, Fudan University 1

Exploding Multimedia Data Figure from [Report on American Consumers 09] Cisco VNI Global Consumer Internet Traffic Forecast 2

Multimedia Retrieval App.
- Important to retrieve useful data, e.g. medical imagery, video recommendation
- Data-intensive and computing-intensive
- Significant challenges for real-time retrieval 3

Multi-core Era Is Coming. [Figure: software performance vs. # of cores, compared with single-core performance; from Michael McCool's (Intel) many-core slides] 4

New Opportunities
- Multi-core era needs parallelism
- Need a comprehensive study on parallelism characteristics in multimedia retrieval
  - To optimize them on current architectures
  - To design future architectures for them 5

Image Retrieval. Image retrieval is also the core of video retrieval 6

Image Retrieval: feature extraction 7

Image Retrieval: feature extraction + feature match 8

Image Retrieval (cont.)
Two classes of algorithms:
- Global feature based (~60% precision)
  - Color features [Wan 08]
  - Texture features
- Local feature based
  - Shape context: accurate but time consuming
  - SIFT features
  - SURF features: robust and appealing, insensitive to scale and rotation transformation [Mikolajczyk 05, Bauer 07] 9

SURF Overview
Input Image -> Detection -> Description -> Features
- Detection: Integral Image, Scale Space Analysis, Interest Point Localization
- Description: Orientation Assignment, Descriptor Vector Construction 10

Integral Image. For an input image $g(x,y)$, the integral image is $I(x,y) = \sum_{i \le x} \sum_{j \le y} g(i,j)$ 11
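A minimal sketch of the integral image in plain Python, assuming the image is a list of rows of numbers. The helper `box_sum` is a hypothetical name for illustration and is not taken from OpenSURF; it shows why the integral image is useful, since any rectangular sum then needs at most four lookups.

```python
def integral_image(g):
    """Return I where I[y][x] is the sum of g over all rows <= y and columns <= x."""
    height, width = len(g), len(g[0])
    I = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(width):
            row_sum += g[y][x]
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I


def box_sum(I, top, left, bottom, right):
    """Sum of the original image over an inclusive rectangle (hypothetical helper)."""
    total = I[bottom][right]
    if top > 0:
        total -= I[top - 1][right]
    if left > 0:
        total -= I[bottom][left - 1]
    if top > 0 and left > 0:
        total += I[top - 1][left - 1]
    return total
```

Box sums of this kind are what make the box-filter responses used in the later stages cheap to evaluate at any filter size.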

Scale Space Analysis. Hessian matrix for $(x,y)$ [Bay 06]: $H(x,y) = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$, with $\mathrm{Det}(x,y) = D_{xx} D_{yy} - 0.81\, D_{xy} D_{xy}$. Insensitive to scale transformation 12

Interest Point Localization: an interest point (Ipoint) is the point with the max Det value 13
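The two detection steps above can be sketched as follows. `hessian_det` applies the determinant formula from the Scale Space Analysis slide. `local_maxima` is a hypothetical helper: the slides only say that an interest point is the point with the max Det value, so comparing each point against its 26 neighbours in a 3x3x3 scale-space neighbourhood is an assumption here (it is the usual choice in SURF implementations), as is the `threshold` parameter.

```python
def hessian_det(dxx, dyy, dxy):
    """Approximate Hessian determinant: Dxx*Dyy - 0.81*Dxy*Dxy."""
    return dxx * dyy - 0.81 * dxy * dxy


def local_maxima(det, threshold=0.0):
    """det[s][y][x] is the response at scale index s, row y, column x.

    Returns (s, y, x) for every interior point whose response exceeds
    `threshold` and is strictly greater than all 26 neighbours
    (assumed 3x3x3 neighbourhood, not stated on the slides).
    """
    points = []
    for s in range(1, len(det) - 1):
        for y in range(1, len(det[s]) - 1):
            for x in range(1, len(det[s][y]) - 1):
                value = det[s][y][x]
                if value <= threshold:
                    continue
                neighbours = [
                    det[s + ds][y + dy][x + dx]
                    for ds in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                    if (ds, dy, dx) != (0, 0, 0)
                ]
                if all(value > n for n in neighbours):
                    points.append((s, y, x))
    return points
```

Because each point only reads its immediate neighbours, this step is easy to run per block or per scale, which is what the later parallelization slides exploit.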

Orientation Assignment. Orientation is based on Haar wavelets [Bay 06]; insensitive to rotation transformation 14

Descriptor Vector Construction 15

Descriptor Vector Construction: 64-dimension feature vector, built from 4-dimension vectors calculated based on Haar wavelets 16
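A sketch of how a 64-dimension vector can arise from 4-dimension vectors. The slides do not spell out the layout, so the 4x4 grid of sub-regions and the four sums per sub-region (sum of dx, sum of dy, sum of |dx|, sum of |dy|) are assumptions taken from the standard SURF description in [Bay 06]; `surf_descriptor` and its parameters are hypothetical names, and the Gaussian weighting used in real implementations is omitted.

```python
import math


def surf_descriptor(dx, dy, grid=4):
    """Build a grid*grid*4 vector from Haar wavelet responses.

    dx, dy: square 2-D lists of responses around one interest point,
            with a side length divisible by `grid`.
    """
    side = len(dx)
    cell = side // grid
    vec = []
    for gy in range(grid):
        for gx in range(grid):
            sum_dx = sum_dy = abs_dx = abs_dy = 0.0
            for y in range(gy * cell, (gy + 1) * cell):
                for x in range(gx * cell, (gx + 1) * cell):
                    sum_dx += dx[y][x]
                    sum_dy += dy[y][x]
                    abs_dx += abs(dx[y][x])
                    abs_dy += abs(dy[y][x])
            # One 4-dimension vector per sub-region.
            vec.extend([sum_dx, sum_dy, abs_dx, abs_dy])
    # Normalise to unit length so that matching is less contrast dependent.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```

With the default `grid=4`, the result has 4 x 4 x 4 = 64 entries, matching the dimension on the slide.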

Execution Profile of SURF
Experiment environment:
- Prog: OpenSURF
- Input: 48 images
- HW: 16-core server, 32GB memory
Share of execution time per stage:
- Detection: 27% time
  - Integral Image: 1% time
  - Scale Space Analysis: 24% time
  - Interest Point Localization: 2% time
- Description: 73% time
  - Orientation Assignment: 20% time
  - Descriptor Vector Construction: 53% time 17

Interest Points Distribution: imbalanced distribution for images/blocks. [Figure: # of interest points (0-600) per block ID (0-192), with the average line] 18

Parallel Analysis
- Pipeline Parallelism
- Task Parallelism
  - Scale-level Parallelism
  - Block-level Parallelism
- Combination of Different Parallelism
  - Combination of SIMD and Other Parallelism
  - Combination of Pipeline and Task Parallelism 19

2-stage Pipeline. Detection writes interest points to the buffer; Description reads interest points from the buffer. [Figure: one Detection stage feeding several Description stages through the buffer] 20
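The producer/consumer structure of the 2-stage pipeline can be sketched with a bounded queue as the buffer. This is only an illustration of the structure, not the authors' implementation: `run_two_stage_pipeline`, `detect`, `describe`, `num_describers` and `buffer_size` are hypothetical names, and Python threads are used for brevity even though they would not give a real speedup for CPU-bound work.

```python
import queue
import threading

_DONE = object()  # sentinel telling a Description worker to stop


def run_two_stage_pipeline(images, detect, describe, num_describers=2, buffer_size=64):
    """Detection writes interest points to a buffer; Description workers read them."""
    buffer = queue.Queue(maxsize=buffer_size)
    features = []
    features_lock = threading.Lock()

    def detection_stage():
        for image in images:
            for point in detect(image):
                buffer.put(point)  # blocks while the buffer is full
        for _ in range(num_describers):
            buffer.put(_DONE)  # one stop marker per Description worker

    def description_stage():
        while True:
            point = buffer.get()
            if point is _DONE:
                break
            feature = describe(point)
            with features_lock:
                features.append(feature)

    workers = [threading.Thread(target=description_stage) for _ in range(num_describers)]
    producer = threading.Thread(target=detection_stage)
    for worker in workers:
        worker.start()
    producer.start()
    producer.join()
    for worker in workers:
        worker.join()
    return features
```

The order of the returned features depends on thread scheduling, so callers that need a stable order should sort or tag the results.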

3-stage Pipeline. Further divide Description into two stages: Orientation Assignment and Descriptor Vector Construction. [Figure: Detection -> Orientation Assignment -> several Descriptor Vector Construction stages] 21

Results of Pipeline Parallelism: pipeline parallelism does not scale. [Figure: speedup (0-4) for the 2-stage and 3-stage pipelines] 22
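One way to see why a pipeline stops scaling is a simple bottleneck bound. The sketch below is an interpretation, not something stated on the slides: it assumes the stage fractions from the execution profile (Detection 27%, Orientation Assignment 20%, Descriptor Vector Construction 53%), ignores buffer and communication costs, and uses hypothetical names (`pipeline_speedup_bound`, `replicas`).

```python
def pipeline_speedup_bound(stage_fractions, replicas):
    """Upper bound on pipeline speedup over sequential execution.

    stage_fractions[i]: share of sequential time spent in stage i.
    replicas[i]: number of workers running stage i in parallel.
    Throughput is limited by the slowest stage after replication.
    """
    bottleneck = max(f / r for f, r in zip(stage_fractions, replicas))
    return 1.0 / bottleneck


# 2-stage pipeline, one worker per stage: limited by Description (73%).
print(round(pipeline_speedup_bound([0.27, 0.73], [1, 1]), 2))  # about 1.37
# 3-stage pipeline, one worker per stage: limited by descriptor construction (53%).
print(round(pipeline_speedup_bound([0.27, 0.20, 0.53], [1, 1, 1]), 2))  # about 1.89
# Replicating Description does not help past the single Detection stage (27%).
print(round(pipeline_speedup_bound([0.27, 0.73], [1, 15]), 2))  # about 3.7
```

Under these assumptions the bound stays below 4 however many cores are added, because the unreplicated Detection stage remains the bottleneck.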

Parallel Analysis
- Pipeline Parallelism
- Task Parallelism
  - Scale-level Parallelism
  - Block-level Parallelism
- Combination of Different Parallelism
  - Combination of SIMD and Other Parallelism
  - Combination of Task and Pipeline Parallelism 23

Scale-level Parallelism. Each scale is computed concurrently; each group of interest points is described concurrently. [Figure: Integral Image -> Scale Space Analysis -> Interest Point Localization -> Description] 24

Results of Scale-level Parallelism: does not scale beyond 12 cores. [Figure: speedup (0-8) on 4, 8, 12 and 16 cores] 25

Results of Scale-level Parallelism: does not scale beyond 12 cores.
- Imbalanced computation
- Non-trivial communication overhead
[Figure: speedup (0-8) on 4, 8, 12 and 16 cores] 26

Block-level Parallelism. Sync between neighbor blocks. [Figure: the input image is split into image blocks; each block runs Detection, then Description] 28

Block-level Parallelism. Sync between neighbor blocks: block-level parallelism with synchronization (Block-Sync). [Figure: the input image is split into image blocks; each block runs Detection, then Description] 29

Block-level Parallelism. Use additional computation to avoid sync. [Figure: the input image is split into image blocks; each block runs Detection, then Description] 30

Block-level Parallelism. Use additional computation to avoid sync: block-level parallelism without synchronization (BlockPar). [Figure: the input image is split into image blocks; each block runs Detection, then Description] 31
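The slides do not say which additional computation BlockPar performs. One plausible reading, shown here purely as an assumption, is that each block also processes a border of rows belonging to its neighbours (a halo), so it never has to wait for them. `split_with_halo` and its parameters are hypothetical names.

```python
def split_with_halo(height, num_blocks, halo):
    """Split rows [0, height) into horizontal blocks with overlapping borders.

    Each block "owns" a disjoint row range but reads `halo` extra rows on
    each side (clamped to the image), trading redundant work for the
    absence of synchronization between neighbouring blocks.
    """
    blocks = []
    base, extra = divmod(height, num_blocks)
    start = 0
    for b in range(num_blocks):
        rows = base + (1 if b < extra else 0)  # spread the remainder rows
        end = start + rows
        blocks.append({
            "own": (start, end),
            "read": (max(0, start - halo), min(height, end + halo)),
        })
        start = end
    return blocks


blocks = split_with_halo(height=100, num_blocks=4, halo=5)
redundant_rows = sum(b["read"][1] - b["read"][0] for b in blocks) - 100
print(redundant_rows)  # 30 rows are processed twice in this configuration
```

The redundant rows are the price of removing the neighbour synchronization; whether that trade pays off depends on the halo size relative to the block size.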

Results of Block-level Parallelism: BlockPar scales well. [Figure: speedup (0-10) of BlockPar and Block-Sync on 4, 8, 12 and 16 cores] 32

Results of Block-level Parallelism: BlockPar scales well. Communication overhead between cores is non-trivial, and it could be reduced by additional computation. [Figure: speedup (0-10) of BlockPar and Block-Sync on 4, 8, 12 and 16 cores] 33

Comparison for Each Parallelism: block-level parallelism is more efficient. [Figure: speedup (0-10) of Pipeline, ScalePar and BlockPar on 4, 8, 12 and 16 cores] 34

Combination of SIMD with Others: use ICC to generate SIMD instructions. [Figure: speedup (0-12) of Pipeline, ScalePar and BlockPar on 4, 8, 12 and 16 cores] 36

Combination of SIMD with Others: use ICC to generate SIMD instructions. 11% speedup. [Figure: speedup (0-12) of Pipeline+SIMD, ScalePar+SIMD and BlockPar+SIMD on 4, 8, 12 and 16 cores] 37

Combination of Task & Pipeline: BlockPar + Pipeline is the most efficient (13X). [Figure: speedup (0-14) of BlockPar, Block+Pipe, BlockPar+SIMD and Block+Pipe+SIMD on 4, 8, 12 and 16 cores] 39

Combination of Task & Pipeline: BlockPar + Pipeline is the most efficient (13X).
- Less computation
- Better locality
[Figure: speedup (0-14) of BlockPar, Block+Pipe, BlockPar+SIMD and Block+Pipe+SIMD on 4, 8, 12 and 16 cores] 40

Comparison to Prior Work: compared to P-SURF [Zhang 10] on multi-core CPU. [Figure: speedup (0-12) of P-SURF, Our BlockPar and Our Block+Pipe on 4, 8, 12 and 16 cores] 41

Comparison to Prior Work: compared to P-SURF [Zhang 10] on multi-core CPU.
- 1.84X speedup over P-SURF
- Non-trivial communication overhead
[Figure: speedup (0-12) of P-SURF, Our BlockPar and Our Block+Pipe on 4, 8, 12 and 16 cores] 42

Comparison to Prior Work (cont.): our implementation on GPGPU. [Figure: execution time % for sequential SURF on CPU: Init 1%, SURF 99%] 43
* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

Comparison to Prior Work (cont.): our implementation on GPGPU. [Figure: execution time % after BlockPar on GPGPU: Initialization on CPU 53%, BlockPar on GPGPU 47%] 44

Comparison to Prior Work (cont.): our implementation on GPGPU. [Figure: CPU + GPU pipeline after BlockPar on GPGPU: Initialization on CPU 53%, BlockPar on GPGPU 47%] 45

Comparison to Prior Work (cont.): compared to CUDA SURF* on GPGPU (Nvidia GTX 260). [Figure: speedup (0-50) of CUDA SURF, Our BlockPar and Our Block+Pipe; bar labels 46X, 30X, 30X] 46
* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

Comparison to Prior Work (cont.): compared to CUDA SURF* on GPGPU (Nvidia GTX 260).
- 1.53X speedup over CUDA SURF
- CPU+GPU pipeline not exploited
[Figure: speedup (0-50) of CUDA SURF, Our BlockPar and Our Block+Pipe; bar labels 46X, 30X, 30X] 47
* Can be downloaded from http://www.mis.tu-darmstadt.de/surf
