High performance data processing with Halide
Roel Jordans
High performance computing
Architecture Trends
Processing platform architectures
[Diagram: CPU and GPU cores with their caches, local memory, and shared main memory; latency and energy cost increase with distance from the cores]
Programming challenges
● Data movement is costly
  • Requires early planning of data distribution
  • May require explicit moves from CPU to GPU memory
● Traditional programming languages (like C) are not designed for this kind of architecture
  • Require library support
  • Highly customized code
  • Vendor specific, low portability
An example: 3x3 box filter
[Pipeline diagram: horizontal blur → vertical blur]
• A simple, two-stage imaging pipeline: a 3x3 blur.
• Basic function: a summation over a 3x3 area:
    by(x, y) = sum of in(x+i, y+j) for i, j in {-1, 0, 1}
• We leave out the averaging step (a true box filter would also divide by 9).
Blur: Inlined implementation (C code)

    int in[W*H];
    int by[W*H];

    for (int y = 1; y < (H-1); y++) {
        for (int x = 1; x < (W-1); x++) {
            by[x + y*W] = in[(x-1) + (y-1)*W] + in[(x-1) + y*W] + in[(x-1) + (y+1)*W]
                        + in[ x    + (y-1)*W] + in[ x    + y*W] + in[ x    + (y+1)*W]
                        + in[(x+1) + (y-1)*W] + in[(x+1) + y*W] + in[(x+1) + (y+1)*W];
        }
    }

- 9 loads per output pixel
- 8 additions per output pixel
- Minimal memory footprint
- Completely parallelizable (independent pixels)
- Unnecessary recomputation (each horizontal sum is redone for three output rows)
Blur: Stored implementation (C code)

    int in[W*H];
    int bx[W*H];
    int by[W*H];

    for (int y = 0; y < H; y++) {
        for (int x = 1; x < (W-1); x++) {
            bx[x + y*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
        }
    }
    for (int y = 1; y < (H-1); y++) {
        for (int x = 1; x < (W-1); x++) {
            by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
        }
    }

- 6 loads, 1 store per output pixel
- 4 additions per output pixel
- Very low locality (big intermediate buffer)
- No recomputation
- Still parallelizable
Blur: Fused pipeline (C code)

    int in[W*H];
    int bx[W*H];
    int by[W*H];

    /* prologue: compute the first two rows of bx */
    for (int y = 0; y < 2; y++) {
        for (int x = 1; x < (W-1); x++) {
            bx[x + y*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
        }
    }
    /* steady state: produce one new bx row, consume three */
    for (int y = 1; y < (H-1); y++) {
        for (int x = 1; x < (W-1); x++) {
            bx[x + (y+1)*W] = in[x-1 + (y+1)*W] + in[x + (y+1)*W] + in[x+1 + (y+1)*W];
            by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
        }
    }

- 6 loads, 1 store per output pixel
- 4 additions per output pixel
- Not directly parallelizable
- High locality (producer and consumer moved together)
- No recomputation
Blur: Folded storage (C code)

    int in[W*H];
    int bx[W*3];   /* only three rows of bx are live at any time */
    int by[W*H];

    for (int y = 0; y < 2; y++) {
        for (int x = 1; x < (W-1); x++) {
            bx[x + (y%3)*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
        }
    }
    for (int y = 1; y < (H-1); y++) {
        for (int x = 1; x < (W-1); x++) {
            bx[x + ((y+1)%3)*W] = in[x-1 + (y+1)*W] + in[x + (y+1)*W] + in[x+1 + (y+1)*W];
            by[x + y*W] = bx[x + ((y-1)%3)*W] + bx[x + (y%3)*W] + bx[x + ((y+1)%3)*W];
        }
    }

- Same results as the last slide, but:
- with a smaller intermediate buffer (W*3 instead of W*H): rows of bx are addressed modulo 3
Data mapping freedom
● Two extremes
  • Inline everything → lots of computation
  • Store everything → lots of memory required
● Many options in between
  • Can be tuned to match the memory hierarchy!
  • Can result in really complex loop structures
Next-level optimizations
C allows us to program the execution order explicitly. Many optimizations:
• Loop fusion, storage folding, tiling, multi-threading, vectorization, ...
• Most obscure the functionality of the code
• Most are architecture specific
• Optimizing requires rewriting and debugging
• Exploration of optimizations is increasingly difficult
Blur optimized

Halide (complete implementation):

    Func blur_3x3(Func in) {
        Func bx, by;
        Var x, y, xi, yi;

        // The algorithm - no storage or order
        bx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y));
        by(x, y) = (bx(x, y-1) + bx(x, y) + bx(x, y+1));

        // The schedule - defines order, locality; implies storage
        by.tile(x, y, xi, yi, 256, 32)
          .vectorize(xi, 8).parallel(y);
        bx.compute_at(by, x).vectorize(x, 8);

        return by;
    }

C (partial implementation; one_third is a fixed-point constant defined elsewhere):

    #pragma omp parallel for
    for (int yTile = 0; yTile < out.height(); yTile += 32) {
        __m128i a, b, c, sum, avg;
        __m128i tmp[(128/8) * (32+2)];
        for (int xTile = 0; xTile < out.width(); xTile += 128) {
            __m128i *tmpPtr = tmp;
            for (int y = 0; y < 32+2; y++) {
                const uint16_t *inPtr = &(in(xTile, yTile+y));
                for (int x = 0; x < 128; x += 8) {
                    a = _mm_load_si128((const __m128i*)(inPtr));
                    b = _mm_loadu_si128((const __m128i*)(inPtr+1));
                    c = _mm_loadu_si128((const __m128i*)(inPtr+2));
                    sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
                    avg = _mm_mulhi_epi16(sum, one_third);
                    _mm_store_si128(tmpPtr++, avg);
                    inPtr += 8;
                }
            }
            tmpPtr = tmp;
            for (int y = 0; y < 32; y++) {
                __m128i *outPtr = (__m128i *)(&(out(xTile, yTile+y)));
                for (int x = 0; x < 128; x += 8) {
                    a = _mm_load_si128(tmpPtr + (2*128)/8);
                    b = _mm_load_si128(tmpPtr + 128/8);
                    c = _mm_load_si128(tmpPtr++);
                    sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
                    avg = _mm_mulhi_epi16(sum, one_third);
                    _mm_store_si128(outPtr++, avg);
                }
            }
        }
    }
Halide?
● A domain-specific language (DSL) targeting image processing pipelines
● Embedded in C++ → uses a pre-existing compiler for most of the heavy lifting
● Available for many target architectures (x86, ARM, CUDA, ...)
● Support from industry: Google, Adobe
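Since Halide is embedded in C++, a complete pipeline is an ordinary C++ program built with a regular C++ compiler. A minimal sketch (the braced realize({...}) call follows recent Halide releases; older versions use realize(800, 600)):

    #include "Halide.h"
    using namespace Halide;

    int main() {
        Func gradient;                 // one pipeline stage
        Var x, y;                      // pure index variables
        gradient(x, y) = x + y;        // the algorithm, written as plain C++
        // JIT-compile for the host architecture and run over an 800x600 domain:
        Buffer<int32_t> out = gradient.realize({800, 600});
        return 0;
    }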
Halide!
● Main idea
  • Decouple the algorithm definition from the optimization schedule
    → Apply optimizations without complicating the code
● Result
  • Easier and faster design space exploration
  • Improved readability and portability
    → For a new architecture we should only change the schedule
Blur: Halide
[Pipeline diagram: horizontal blur → vertical blur]

    Func in, bx, by;
    Var x, y;

    bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
    by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);

    by.realize(10,10); // build and execute the loop nest over a 10x10 area
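On the slide, in is left abstract. To actually run this, in would be backed by a concrete image; a minimal sketch assuming a hypothetical 16-bit 12x12 input buffer, where the clamp keeps the 3x3 stencil inside the image:

    #include "Halide.h"
    using namespace Halide;

    int main() {
        Buffer<uint16_t> input(12, 12);    // hypothetical source image
        Func in, bx, by;
        Var x, y;

        // Bound the coordinates so x-1, x+1, y-1, y+1 never read out of range:
        in(x, y) = input(clamp(x, 0, input.width() - 1),
                         clamp(y, 0, input.height() - 1));

        bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
        by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

        Buffer<uint16_t> out = by.realize({10, 10});
        return 0;
    }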
Blur: Halide

    Func bx, by, in;
    Var x, y;
    bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
    by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

Note that in the body, there is no notion of:
• time (execution order)
• space (buffer assignment, image size, memory allocation)
• hardware (because there is no time and space)
This gives us:
• a very clear, concise and readable algorithm
• no commitment to any optimization strategy yet
  • e.g. we can use this same starting point on any target architecture
  • (in C, even a naïve implementation would already require scheduling decisions)
Scheduling Halide

    Func bx, by, in;
    Var x, y;
    bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
    by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

    by.realize(10,10);

• Internally, Halide converts this functional representation to a C-like loop nest.
• By default, if nothing else is done, everything is inlined (see the sketch below).
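A sketch of what the default, fully inlined loop nest looks like conceptually. This is an illustrative pseudo-C rendering, not Halide's actual generated code; in(x, y) stands for a bounds-safe read of the input:

    for (int y = 0; y < 10; y++) {
        for (int x = 0; x < 10; x++) {
            /* bx is substituted into by, so every horizontal 3-pixel sum
               is recomputed for each of the three vertical taps that uses it */
            by[y][x] = (in(x-1, y-1) + in(x, y-1) + in(x+1, y-1))
                     + (in(x-1, y  ) + in(x, y  ) + in(x+1, y  ))
                     + (in(x-1, y+1) + in(x, y+1) + in(x+1, y+1));
        }
    }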
Scheduling Halide

    Func bx, by, in;
    Var x, y;
    bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
    by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

    bx.compute_root();
    by.realize(10,10);

compute_root(): compute and store all outputs of a (producer) function before starting computation of the next stage.
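Conceptually, bx.compute_root() yields a loop nest like the "stored" C version from earlier. Again an illustrative pseudo-C sketch: realizing by over 10x10 needs bx rows -1..10, hence the extended range and the index shift:

    int bx[12][10];                       /* rows -1..10, stored shifted by 1 */
    for (int r = -1; r <= 10; r++)
        for (int x = 0; x < 10; x++)
            bx[r+1][x] = in(x-1, r) + in(x, r) + in(x+1, r);

    for (int y = 0; y < 10; y++)
        for (int x = 0; x < 10; x++)
            by[y][x] = bx[y][x] + bx[y+1][x] + bx[y+2][x];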
Scheduling Halide

    Func bx, by, in;
    Var x, y;
    bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
    by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

    bx.compute_at(by,y);
    by.realize(10,10);

• compute_root() is actually a special case of compute_at().
• compute_at(by, y) means: "Whenever stage by starts an iteration of the y loop, first calculate the pixels of stage bx that will be consumed."
• In other words: computation of bx is fused into the loop over y of by.
• Not completely equivalent to our initial fused version (see the sketch below).
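An illustrative pseudo-C sketch of the resulting loop nest: at every y iteration of by, the three bx rows it consumes are (re)computed into a small scratch buffer. Each bx row is thus recomputed for three consecutive values of y, which is exactly why this is not yet equivalent to the hand-fused version:

    for (int y = 0; y < 10; y++) {
        int bx_rows[3][10];               /* scratch: bx rows y-1, y, y+1 */
        for (int r = 0; r < 3; r++)
            for (int x = 0; x < 10; x++)
                bx_rows[r][x] = in(x-1, y-1+r) + in(x, y-1+r) + in(x+1, y-1+r);
        for (int x = 0; x < 10; x++)
            by[y][x] = bx_rows[0][x] + bx_rows[1][x] + bx_rows[2][x];
    }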
Scheduling Halide

    Func bx, by, in;
    Var x, y;
    bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
    by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

    bx.compute_at(by,y);
    bx.store_root();
    by.realize(10,10);

• To fix this, we can separate computation from storage using store_at() and store_root().
• bx.store_root() means: "Allocate the buffer for bx outside the loop nest."
• Halide automatically applies storage folding as well!
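A pseudo-C sketch of the combined effect of compute_at(by, y) plus store_root(), including the automatic fold down to three rows; this is my own rendering of the conceptual result, not Halide's literal output:

    int bx[3][10];                        /* folded circular buffer: row r lives at r % 3 */
    for (int r = -1; r <= 0; r++)         /* prologue: rows -1 and 0 */
        for (int x = 0; x < 10; x++)
            bx[(r+3) % 3][x] = in(x-1, r) + in(x, r) + in(x+1, r);

    for (int y = 0; y < 10; y++) {
        int r = y + 1;                    /* only one new bx row per iteration */
        for (int x = 0; x < 10; x++)
            bx[r % 3][x] = in(x-1, r) + in(x, r) + in(x+1, r);
        for (int x = 0; x < 10; x++)
            by[y][x] = bx[(y+2) % 3][x] + bx[y % 3][x] + bx[(y+1) % 3][x];
    }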
Many more scheduling options
• So far we looked at syntax which interleaves computation between stages.
• There is also syntax which changes the order of computation within a single stage:
  • Reorder loop variables → by.reorder(y, x);
  • Split a loop variable into inner and outer → by.split(x, xout, xin, 4);
• Tiling is just a combination of the above:
  • by.split(x, xout, xin, 4);
  • by.split(y, yout, yin, 4);
  • by.reorder(xin, yin, xout, yout);
• Because this is so common, syntactic sugar (a "shortcut") is offered:
  • by.tile(x, y, xout, yout, xin, yin, 4, 4);
• Execute loop iterations in parallel using multi-threading:
  • by.parallel(x); // iterations of x run concurrently on worker threads
• Turn a loop into a (series of) vector operation(s):
  • by.vectorize(xin); // the loop over xin, which has 4 iterations, is vectorized
  • by.vectorize(x, 4); // shortcut: split x into outer and inner of size 4, then vectorize the inner
These primitives combine freely, as the sketch below shows.
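Putting several of these together on the blur pipeline; a sketch in which the variable names xo/yo/xi/yi and the tile and vector sizes are illustrative choices, not prescribed values:

    Var x, y, xo, yo, xi, yi;
    by.tile(x, y, xo, yo, xi, yi, 64, 64)  // split into 64x64 tiles
      .parallel(yo)                        // rows of tiles across threads
      .vectorize(xi, 8);                   // 8-wide vectors innermost
    bx.compute_at(by, xo);                 // produce bx per tile, for locality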