Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing
M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich
Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg
ASAP, July 11, 2017, Seattle

Motivation: Coarse-grained parallelism on FPGA
Memory bandwidth limits can be reached by processing multiple pixels per cycle:
• A memory bandwidth of 12 GBytes/s for each DDR3 channel is feasible on a “modern” FPGA, which leads to 512-bit-wide interfaces at around 200 MHz logic frequency (Zynq zc706)
• High-speed serial transceiver technology on FPGAs enables the communication interfaces to operate at high data rates
How to provide efficient coarse-grained parallelism for image processing applications?

Motivation: Image Processing Applications
We can define three characteristic data operations in image processing applications:
• Point Operators: Output data is determined by single input data
• Local Operators: Output data is determined by a local region of the input data (stencil pattern-based calculations)
• Global Operators: Output data is determined by all of the input data
[Figures: input image → output image, one per operator type]
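The three operator classes can be illustrated on a single image row. This is a minimal sketch, not taken from the slides: the concrete operations (doubling, a 3-tap sum with clamped edges, a maximum) and the function names are assumptions chosen only to show the different data dependencies, and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Point operator: each output pixel depends on exactly one input pixel.
std::vector<int> point_op(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = 2 * in[i];
    return out;
}

// Local operator: each output pixel depends on a small neighbourhood
// (here a 3-tap sum; out-of-range neighbours are clamped to the edge).
std::vector<int> local_op(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    for (int i = 0; i < n; ++i) {
        const int left = in[std::max(i - 1, 0)];
        const int right = in[std::min(i + 1, n - 1)];
        out[i] = left + in[i] + right;
    }
    return out;
}

// Global operator: the output depends on all input pixels (here the maximum).
int global_op(const std::vector<int>& in) {
    return *std::max_element(in.begin(), in.end());
}
```

Only the local operator needs neighbouring pixels, which is why it is the case that requires line buffers, sliding windows, and border handling in hardware.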

Motivation: Image Processing Applications
A great portion of image processing applications can be described as task graphs of point, local, and global operators.
[Figure: An example task graph for Harris Corner Detection with the nodes dx, dy, sx, sy, sxy, gx, gy, gxy, and hc between input and output (square: local operator, circle: point operator)]

Motivation: Parallelization of Image Processing Applications
A naive way would be replicating the accelerator hardware.
[Figure: the Harris Corner task graph (dx, dy, sx, sy, sxy, gx, gy, gxy, hc) replicated four times between input and output]

Motivation: Parallelization of Image Processing Applications
Is there a more resource-efficient approach?
[Figure: a single task graph between input and output in which every node is coarsened by four: {dx, dx, dx, dx}, {dy, dy, dy, dy}, {sx, sx, sx, sx}, {sy, sy, sy, sy}, {sxy, sxy, sxy, sxy}, {gx, gx, gx, gx}, {gy, gy, gy, gy}, {gxy, gxy, gxy, gxy}, {hc, hc, hc, hc}]

Motivation: Parallelization of point operators
Coarse-grained parallelization of point operators is rather straightforward.
[Figure: input → f → output becomes input → {f, f, f, f} → output]
The throughput is linear with the resource usage (when further data-path optimizations are ignored).
What are efficient parallelization methods for local operators?

Motivation: Image border handling
• a fundamental image processing issue for local operators
• mostly overlooked by the digital hardware design community
• should be considered together with coarse-grained parallelization
[Figure: Common border handling modes, shown on a 4×4 image with pixel values 0 to 15 extended by two pixels on each side: (a) clamp, (b) mirror, (c) mirror-101, (d) constant (border pixels take the constant value c)]
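The four border modes differ only in how an out-of-range coordinate is mapped back into the image. The sketch below is a software illustration under assumptions, not the hardware design from the talk: the names `Border`, `border_index`, and `extend_row` are hypothetical, and the border is assumed to be narrower than the image.

```cpp
#include <vector>

// Border handling modes named in the figure caption.
enum class Border { Clamp, Mirror, Mirror101, Constant };

// Map a possibly out-of-range coordinate i onto [0, n).
// Returns -1 in Constant mode when i lies outside the image, meaning
// "use the constant c instead of reading a pixel".
// Assumption: the border is narrower than the image.
int border_index(int i, int n, Border mode) {
    if (i >= 0 && i < n) return i;
    switch (mode) {
        case Border::Clamp:     return i < 0 ? 0 : n - 1;           // repeat the edge pixel
        case Border::Mirror:    return i < 0 ? -i - 1 : 2 * n - 1 - i; // reflect, edge pixel repeated
        case Border::Mirror101: return i < 0 ? -i : 2 * n - 2 - i;  // reflect, edge pixel not repeated
        case Border::Constant:  return -1;
    }
    return -1;
}

// Extend one image row by `border` pixels on each side.
std::vector<int> extend_row(const std::vector<int>& row, int border,
                            Border mode, int c) {
    const int n = static_cast<int>(row.size());
    std::vector<int> out;
    for (int i = -border; i < n + border; ++i) {
        const int j = border_index(i, n, mode);
        out.push_back(j < 0 ? c : row[j]);
    }
    return out;
}
```

Applied to the first image row 0 1 2 3 with a border of two pixels, this yields 0 0 0 1 2 3 3 3 for clamp, 1 0 0 1 2 3 3 2 for mirror, and 2 1 0 1 2 3 2 1 for mirror-101, matching the rows visible in the figure.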

Outline
• Loop Coarsening
• Border Handling
• Best Architecture Selection
• Evaluation and Results

Loop Coarsening

Loop Coarsening: Schmid’s Approach [1]
Coarsening the outer horizontal loop of a 2D input by a factor of v:
    for (int y = 0; y < IMAGE_HEIGHT; y++) {
        for (int x = 0; x < IMAGE_WIDTH; x += v) {  // advance by v pixels per iteration
            // write v output pixels at once as one data beat
            *(DataBeatType*)(&out[y][x]) = local_op(stencil_p1(y, x), ..);
        }
    }
Raster order processing facilitates burst mode read, thus highest external memory bandwidth!
[1] M. Schmid, O. Reiche, F. Hannig, and J. Teich, “Loop coarsening in C-based high-level synthesis”, ASAP 2015.
ASAP’17 | M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing
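A software model can make the effect of coarsening concrete: the coarsened loop must produce the same pixels as the original loop while taking v times fewer iterations. The following is a sketch under assumptions, not Schmid's HLS code: `V`, `DataBeat`, `local_op_at`, and both `filter_row` functions are hypothetical names, the local operator is an arbitrary 3-tap sum with clamped borders, and the row length is assumed to be a multiple of V.

```cpp
#include <algorithm>
#include <array>
#include <vector>

constexpr int V = 4;                  // coarsening factor v (assumed value)
using DataBeat = std::array<int, V>;  // V neighbouring pixels handled together

// Hypothetical local operator: 3-tap horizontal sum with clamped borders.
int local_op_at(const std::vector<int>& row, int x) {
    const int n = static_cast<int>(row.size());
    return row[std::max(x - 1, 0)] + row[x] + row[std::min(x + 1, n - 1)];
}

// Reference loop: one pixel per iteration.
std::vector<int> filter_row(const std::vector<int>& row) {
    std::vector<int> out(row.size());
    for (int x = 0; x < static_cast<int>(row.size()); ++x)
        out[x] = local_op_at(row, x);
    return out;
}

// Coarsened loop: one data beat (V pixels) per iteration.
// Assumption: row.size() is a multiple of V.
std::vector<int> filter_row_coarsened(const std::vector<int>& row) {
    std::vector<int> out(row.size());
    for (int x = 0; x < static_cast<int>(row.size()); x += V) {
        DataBeat beat;
        for (int i = 0; i < V; ++i)   // models V replicated operator instances
            beat[i] = local_op_at(row, x + i);
        std::copy(beat.begin(), beat.end(), out.begin() + x);
    }
    return out;
}
```

In hardware the inner loop over `i` would be fully unrolled, so each outer iteration corresponds to one cycle that consumes and produces a whole data beat.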

Loop Coarsening: Schmid’s Approach
The line buffer and sliding window are modified to store so-called data beats.
[Figure: Line Buffer → Sliding Window → {f, f, f, f}]
The throughput is sub-linear with the resource usage.

Loop Coarsening: Schmid’s Sliding Window
(kernel width) w = 3, (coarsening factor) v = 4
[Figure: input → Line Buffer → sliding window (shift, shift) → {f, f, f, f}]
Current timestamp: (coarsening) initial latency − 1
  FETCH: 0 1 2 3
  CALC: (empty)
  OUT: (empty)
Current timestamp: (coarsening) initial latency
  FETCH: 0 1 2 3 4 5 6 7
  CALC: 0 1 2 3
  OUT: 0 1 2 3
Current timestamp: (coarsening) initial latency + 1
  FETCH: 0 1 2 3 4 5 6 7 8 9 10 11
  CALC: 0 1 2 3 4 5 6 7
  OUT: 0 1 2 3 4 5 6 7
Deploys additional registers when $r_w \bmod v \neq 0$

Loop Coarsening: Fetch and Calc (F&C)
Redundant registers in Schmid’s architecture are eliminated.
(kernel width) w = 3, (coarsening factor) v = 4
[Figure: Schmid’s sliding window compared with the Fetch and Calc sliding window; both take the input and line buffer contents through two shift stages and drive {f, f, f, f}]

Loop Coarsening: Calc and Pack (C&P)
(kernel width) w = 3, (coarsening factor) v = 4
[Figure: input → Line Buffer → sliding window (single shift) → {f, f, f, f} → output packing registers]
Current timestamp: (coarsening) initial latency − 1
  FETCH: 0 1 2 3
  CALC: x 0 1 2
  OUT: (empty)
Current timestamp: (coarsening) initial latency
  FETCH: 0 1 2 3 4 5 6 7
  CALC: x 0 1 2 3 4 5 6
  OUT: 0 1 2 3
Current timestamp: (coarsening) initial latency + 1
  FETCH: 0 1 2 3 4 5 6 7 8 9 10 11
  CALC: x 0 1 2 3 4 5 6 7 8 9 10
  OUT: 0 1 2 3 4 5 6 7
$\mathrm{output}([x, x+v], t) = \mathrm{pack}\{\, \mathrm{out}([0, v - r_w - 1], t-1),\ \mathrm{out}([v - r_w, v - 1], t) \,\}$
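The packing step can be modelled in a few lines. This sketch is one reading of the slide's timeline, not the authors' implementation: it assumes that r_w is the horizontal kernel radius (r_w = 1 for w = 3), that each cycle's v results are shifted right by r_w positions as in the CALC rows, and that 0 ≤ r_w ≤ v. The function name `pack` mirrors the formula; its signature is hypothetical.

```cpp
#include <vector>

// Combine the results of two consecutive cycles into one aligned output beat.
// prev: the v results computed at time t-1 (e.g. x 0 1 2)
// cur:  the v results computed at time t   (e.g. 3 4 5 6)
// The first v - r_w output positions come from the tail of prev,
// the last r_w output positions come from the head of cur.
// Assumptions: prev.size() == cur.size() == v and 0 <= r_w <= v.
std::vector<int> pack(const std::vector<int>& prev,
                      const std::vector<int>& cur, int r_w) {
    std::vector<int> out;
    out.insert(out.end(), prev.begin() + r_w, prev.end());  // positions 0 .. v-r_w-1
    out.insert(out.end(), cur.begin(), cur.begin() + r_w);  // positions v-r_w .. v-1
    return out;
}
```

With the values from the timeline, packing (x, 0, 1, 2) and (3, 4, 5, 6) gives the output beat 0 1 2 3, and packing (3, 4, 5, 6) and (7, 8, 9, 10) gives 4 5 6 7.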

Loop Coarsening Overview
Resource usage is the # of registers when border handling is ignored:
• Schmid’s: $k_{in} \cdot h \cdot (v + 2 \cdot (v \cdot \lceil r_w / v \rceil))$
• Fetch and Calc: $C^{F\&C}_{reg} = k_{in} \cdot h \cdot (r_w + v \cdot (\lceil r_w / v \rceil + 1))$
• Calc and Pack: $C^{C\&P}_{reg} = k_{in} \cdot h \cdot (2 \cdot r_w + v) + k_{out} \cdot (v - (r_w \bmod v))$
[Figures: the sliding window of each architecture driving {f, f, f, f}; Schmid’s and Fetch and Calc use two shift stages, Calc and Pack uses one]
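The three register-count formulas are easy to compare numerically. The sketch below evaluates them exactly as written on the slide. The interpretation of the symbols is an assumption, since the slide does not define them: k_in and k_out are taken to be the input and output pixel bit widths, h the kernel height, r_w the horizontal kernel radius, and v the coarsening factor. `RegCounts` and `register_counts` are hypothetical names.

```cpp
// Register counts of the three sliding-window architectures.
struct RegCounts {
    long schmid;
    long fetch_and_calc;
    long calc_and_pack;
};

// Integer ceiling of a / b for positive b.
long ceil_div(long a, long b) { return (a + b - 1) / b; }

// Evaluate the slide's formulas (border handling ignored).
RegCounts register_counts(long k_in, long k_out, long h, long r_w, long v) {
    RegCounts c;
    c.schmid = k_in * h * (v + 2 * (v * ceil_div(r_w, v)));
    c.fetch_and_calc = k_in * h * (r_w + v * (ceil_div(r_w, v) + 1));
    c.calc_and_pack = k_in * h * (2 * r_w + v) + k_out * (v - (r_w % v));
    return c;
}
```

For example, with 8-bit input and output pixels, a 3×3 kernel (h = 3, r_w = 1), and v = 4, the formulas give 288 registers for Schmid's architecture, 216 for Fetch and Calc, and 168 for Calc and Pack.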

Border Handling
