Optimizing Discrete Wavelet Transform on the Cell Broadband Engine
Seunghwa Kang, David A. Bader
Key Contributions
• We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
• We introduce multiple Cell/B.E. and DWT specific optimization issues and solutions
• Our implementation achieves 34 and 56 times speedup over one PPE performance, and 4.7 and 3.7 times speedup over the cutting edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.
Presentation Outline
• Discrete Wavelet Transform
• Cell Broadband Engine architecture
  - Comparison with the traditional multicore processor
  - Impact in performance and programmability
• Optimization Strategies
  - Previous work
  - Data decomposition scheme
  - Real number representation
  - Loop interleaving
  - Fine-grain data transfer control
• Performance Evaluation
  - Comparison with the AMD Barcelona
• Conclusions
Discrete Wavelet Transform (in JPEG2000)
• Decompose an image in both the vertical and horizontal directions into sub-bands representing the coarse and detail parts while preserving spatial information
[Figure: the four sub-bands LL, HL, LH, HH]
Discrete Wavelet Transform (in JPEG2000)
• Vertical filtering followed by horizontal filtering
• Highly parallel but bandwidth intensive
• Distinct memory access pattern becomes a problem
• Adopt Jasper [Adams2005] as the baseline code
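The slides do not show the filter itself. As a rough illustration of what one filtering pass computes in the lossless case, here is a minimal sketch of the reversible 5/3 lifting steps used by JPEG2000, applied to a single row or column. The function names (`dwt53_forward`, `dwt53_inverse`, `floor_div`), the separate `low`/`high` output arrays, and the even-length restriction are assumptions made for this example; they are not taken from Jasper or from the authors' implementation.

```c
#include <stddef.h>

/* Floor division for a positive divisor b (C's `/` truncates toward zero). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && (a < 0)) {
        q--;
    }
    return q;
}

/* Forward 1-D reversible 5/3 lifting on n samples (n even, n >= 2).
 * low[] receives the n/2 coarse coefficients, high[] the n/2 detail ones.
 * Boundaries use whole-sample symmetric extension. */
void dwt53_forward(const int *x, size_t n, int *low, int *high)
{
    size_t half = n / 2;
    size_t i;

    /* Predict step: detail = odd sample minus the average of its even neighbors. */
    for (i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    /* Update step: coarse = even sample plus a rounded quarter of neighboring details. */
    for (i = 0; i < half; i++) {
        int left = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + floor_div(left + high[i] + 2, 4);
    }
}

/* Inverse transform: undo the update step, then undo the predict step. */
void dwt53_inverse(const int *low, const int *high, size_t n, int *x)
{
    size_t half = n / 2;
    size_t i;

    for (i = 0; i < half; i++) {
        int left = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - floor_div(left + high[i] + 2, 4);
    }
    for (i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = high[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Running this routine down every column and then along every row gives one decomposition level. The column pass strides through memory by the row width, which is one plausible reading of why the access pattern becomes a problem.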
Cell/B.E. vs Traditional Multi-core Processor
SPE:
• In-order
• No dynamic branch prediction
• SIMD only
=> Small and simple core
Traditional multi-core processor:
• Out-of-order
• Dynamic branch prediction
• Scalar + SIMD
=> Large and complex core
Cell/B.E. vs Traditional Multi-core Processor
SPE (Exe. Pipeline -> LS -> Main Memory):
• Isolated, constant latency LS access
• Software controlled DMA data transfer between LS and main memory
Traditional multi-core processor (Exe. Pipeline -> I1/D1 -> L2 -> L3 -> Main Memory):
• Every memory access is cache coherent
• Hardware controlled data transfer
Cell/B.E. Architecture - Performance
• More cores within power and transistor budget
• Invest the larger fraction of the die area for actual computation
• Highly scalable memory architecture
• Enable fine-grain data transfer control
• Efficient vectorization is even more important (no scalar unit)
Cell/B.E. Architecture - Programmability
• Software controlled (to date, mostly programmer controlled) data transfer
• Limited LS size
• Manual vectorization
• Manual branch hint, loop unrolling, etc.
• Efficient DMA data transfer requires cache line alignment, and the transfer size needs to be a multiple of the cache line size.
• Vectorization (SIMD) requires 16 byte alignment, and the vector size needs to be 16 bytes.
=> Challenging to deal with misaligned data!!!
Cell/B.E. Architecture - Programmability
Original loop:
    for( i = 0 ; i < n ; i++ ) {
        a[i] = b[i] + c[i];
    }
Vectorized, when a, b, and c satisfy the alignment and size requirements:
    // n_c: a constant multiple of 4
    v_a = ( vector int* )a;
    v_b = ( vector int* )b;
    v_c = ( vector int* )c;
    for( i = 0 ; i < n_c / 4 ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
Vectorized, with no guarantee in alignment and size:
    // assumes a, b, and c share the same offset within a 16 byte vector
    n_head = ( 16 - ( ( unsigned int )a % 16 ) ) / 4;
    n_head = n_head % 4;
    n_body = ( n - n_head ) / 4;
    n_tail = ( n - n_head ) % 4;
    // Head
    for( i = 0 ; i < n_head ; i++ ) {
        a[i] = b[i] + c[i];
    }
    // Body
    v_a = ( vector int* )( a + n_head );
    v_b = ( vector int* )( b + n_head );
    v_c = ( vector int* )( c + n_head );
    for( i = 0 ; i < n_body ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
    // Tail
    a = ( int* )( v_a + n_body );
    b = ( int* )( v_b + n_body );
    c = ( int* )( v_c + n_body );
    for( i = 0 ; i < n_tail ; i++ ) {
        a[i] = b[i] + c[i];
    }
=> Even more complex if a, b, and c are misaligned!!!
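The head/body/tail arithmetic on this slide can be checked off-target with a portable sketch. The version below replaces the vector operation with a plain four-element inner loop so that it compiles on any C compiler. The names `split_head_body_tail` and `add_arrays`, the `struct split` type, and the clamp for arrays shorter than the head are assumptions added for this example and do not appear on the slide. Like the slide's code, it assumes a, b, and c share the same offset within a 16 byte vector.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16 /* SIMD vector size in bytes */
#define VEC_INTS  4  /* 32-bit integers per vector */

struct split { size_t head, body, tail; };

/* Split n ints starting at address addr into a scalar head (up to the next
 * 16 byte boundary), a body counted in whole vectors, and a scalar tail.
 * addr is assumed to be 4 byte aligned. */
struct split split_head_body_tail(uintptr_t addr, size_t n)
{
    struct split s;
    s.head = ((VEC_BYTES - addr % VEC_BYTES) / sizeof(int32_t)) % VEC_INTS;
    if (s.head > n) {
        s.head = n; /* added guard: the array ends before the boundary */
    }
    s.body = (n - s.head) / VEC_INTS;
    s.tail = (n - s.head) % VEC_INTS;
    return s;
}

/* a[i] = b[i] + c[i], visiting the elements in head/body/tail order.
 * The inner j loop stands in for one 4-wide vector add. */
void add_arrays(int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
    struct split s = split_head_body_tail((uintptr_t)a, n);
    size_t i, j, k = 0;

    for (i = 0; i < s.head; i++, k++) {
        a[k] = b[k] + c[k];
    }
    for (i = 0; i < s.body; i++) {
        for (j = 0; j < VEC_INTS; j++, k++) {
            a[k] = b[k] + c[k];
        }
    }
    for (i = 0; i < s.tail; i++, k++) {
        a[k] = b[k] + c[k];
    }
}
```

For example, an array of 10 ints starting 4 bytes past a 16 byte boundary splits into a head of 3, a body of 1 vector, and a tail of 3.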
Previous work
• Column grouping [Chaver2002] to enhance cache behavior in vertical filtering
• Muta et al. [Muta2007] optimized convolution based DWT for Cell/B.E. (the convolution based approach requires up to 2 times more operations than the lifting based approach)
  - High single SPE performance
  - Does not scale above 1 SPE
Data Decomposition Scheme
[Figure: a 2-D array (array width x array height) with row padding. Labels: each row is cache line aligned and its padded width is a multiple of the cache line size; a unit of data transfer and computation; a unit of data distribution to the processing elements, whose width is a multiple of the cache line size and which is distributed to the SPEs; the remainder is processed by the PPE.]
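The figure's labels suggest two pieces of arithmetic: padding each row to a multiple of the cache line size, and splitting the columns into cache line multiples for the SPEs with the remainder left to the PPE. The sketch below is one possible reading, not the authors' implementation. It assumes a 128 byte cache line, an element size that divides 128, and an equal share per SPE. The names `padded_row_bytes`, `split_columns`, and `struct column_split` are hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 128 /* assumed cache line size */

/* Row stride in bytes after padding the row up to a multiple of the cache
 * line size, so that every row starts cache line aligned when the array
 * base is aligned. */
size_t padded_row_bytes(size_t width, size_t elem_bytes)
{
    size_t raw = width * elem_bytes;
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

struct column_split {
    size_t cols_per_spe; /* columns given to each SPE (a cache line multiple) */
    size_t ppe_cols;     /* remaining columns, processed by the PPE */
};

/* Give every SPE the same number of whole cache lines of columns and leave
 * whatever does not divide evenly to the PPE. Requires num_spes > 0 and
 * elem_bytes dividing CACHE_LINE_BYTES. */
struct column_split split_columns(size_t width, size_t elem_bytes, size_t num_spes)
{
    size_t cols_per_line = CACHE_LINE_BYTES / elem_bytes;
    size_t full_lines = width / cols_per_line;
    size_t lines_per_spe = full_lines / num_spes;
    struct column_split s;

    s.cols_per_spe = lines_per_spe * cols_per_line;
    s.ppe_cols = width - s.cols_per_spe * num_spes;
    return s;
}
```

With 1000 columns of 4 byte elements and 8 SPEs, this reading pads each row to 4096 bytes, gives each SPE 96 columns, and leaves 232 columns to the PPE. A real scheme might assign uneven shares to shrink that remainder.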