 
              SOCAO: Source-to-Source OpenCL Compiler for Intel-Altera FPGAs Johanna Rohde 1 , Marcos Martinez-Peiró 2 , Rafael Gadea-Gironés 2 Date: 7.9.2017 1 Computer Systems Group, TU Darmstadt 2 Department of Electronic Engineering, Universitat Politècnica de València 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 1
Overview ● Introduction ● Background ● Design ● Implementation ● Evaluation ● Conclusion 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 2
Introduction Problem: Accelerate a program with an FPGA ● How do I program the FPGA? ● How do I communicate with the FPGA? ● How much time do I need to rewrite the code? 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 3
Introduction Low Level OpenCL SOCAO Compiler ASIC Tools for FPGA for C to OpenCL FPGA Programmers Parallel Programmers Sofuware Programmers 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 4
Background ● OpenCL Open programming standard for heterogeneous parallel systems – Calculations are passed to external accelerator – Accelerator can be – ● CPU ● GPU ● FPGA ● ... 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 5
Background OpenCL Platform ● Host – Manages the system ● Host Is connected to one or more ● compute devices Compute Device – Compute Device Executes a kernel Compute Unit Compute Unit Compute Unit ● PE PE PE PE PE PE Consists of multiple compute ... ... ... ... ● units PE PE PE PE PE PE Compute Unit – Consists of multiple processing ● elements 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 6
Background OpenCL ● Memory Model – Global Memory ● Used to transfer data ● Accessible by all work groups DEVICE Global ● Normally the slowest memory Memory WORK GROUP WORK Host HOST – Constant Memory ITEM Memory Local Private Memory Memory ● Used to save constants Constant WORK ITEM Memory – Local Memory Private Memory ... ● Accessible by all work items of one work group ● Not accessible by host – Private Memory ● Accessible by one work item ● Holds intermediate values 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 7
Background OpenCL Host Program Flow ● Host OpenCL Host OpenCL Program Kernel Program Kernel Set Kernel Write Buffer Arguments Initialize Load Kernel Environment Launch Kernel Read Buffer Execution 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 8
Background OpenCL for FPGAs ● 2 additional forms of parallelism ● Instruction-level parallelism – Instructions that are independent of each other can be calculated at the same time a b + a b c d c d + + + x = ( a + b )*( c + d ); y = x - e ; * e 4 * time e z = x << 4 ; - << - y z y 4 << z 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 9
Background OpenCL for FPGAs ● Loop Pipelining – Iterations are overlapped – Ideal case: One Iteration per clock cycle – Problem when the loop has loop-carried dependencies time +/+ * <</- for( int a = 0 ; a < 4 ; a ++) iteration 1: { +/+ * <</- iteration 2: x = ( a + b )*( c + d ); +/+ * <</- y = x << 4 ; iteration 3: z = x - e ; +/+ * <</- } iteration 4: 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 10
Background OpenCL for FPGAs ● Intel’s SDK for OpenCL Host Code OpenCL Accelerator Code void sum ( int * A , int * B , __kernel void void sum ( int * A , int * B , __kernel void int * res , int size ) sum ( __global int * A , int * res , int size ) sum ( __global int * A , { __global int * B , { __global int * B , clEnqueueWriteBuffer (...); __global int * res , clEnqueueWriteBuffer (...); __global int * res , ClEnqueueTask (...); int size ) ClEnqueueTask (...); int size ) ClEnqueueReadBuffer (...); { ClEnqueueReadBuffer (...); { } for( int i = 0 ; i < size ; i ++) } for( int i = 0 ; i < size ; i ++) Intel Intel res [ i ] = A [ i ] + B [ i ]; res [ i ] = A [ i ] + B [ i ]; Cross Cross Offmine } Offmine } Compiler Compiler Compiler Compiler . exe . exe . aocx . aocx 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 11
Design Input Program //Altera_OpenCL_Accelerate //Altera_OpenCL_Accelerate //Altera_OpcnCL_size A size //Altera_OpcnCL_size A size //Altera_OpenCL... //Altera_OpenCL... void sum ( int * A , int * B , int * res , int size ) void sum ( int * A , int * B , int * res , int size ) { { for( int i = 0 ; i < size ; i ++) for( int i = 0 ; i < size ; i ++) res [ i ] = A [ i ]+ B [ i ]; res [ i ] = A [ i ]+ B [ i ]; } } SOCAO Compiler SOCAO Compiler SOCAO Compiler SOCAO Compiler Host Code OpenCL Accelerator Code void sum ( int * A , int * B , __kernel void void sum ( int * A , int * B , __kernel void int * res , int size ) aocl_generated_kernel ( int * res , int size ) aocl_generated_kernel ( { __global int * A , { __global int * A , clEnqueueWriteBuffer ( ... ); __global int * B , clEnqueueWriteBuffer ( ... ); __global int * B , clEnqueueTask ( … ); __global int * res , clEnqueueTask ( … ); __global int * res , clEnqueueReadBuffer ( … ); int size ) clEnqueueReadBuffer ( … ); int size ) } { } { for( int i = 0 ; i < size ; i ++) for( int i = 0 ; i < size ; i ++) res [ i ] = A [ i ] + B [ i ]; res [ i ] = A [ i ] + B [ i ]; } } 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 12
Implementation Front-end Middle-end Back-end Semantic Check Unparse Start Function Detection yes Parse Successful End Function Analysis & no yes Transformation Successful Abort no Host Program Abort Transformation Kernel Creation • ROSE Framework – Open-source compiler framework – Provides front-end, back-end and additional functionalities 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 13
Implementation Front-end Middle-end Back-end Semantic Check Unparse Start Function Detection yes Parse Successful End Function Analysis & no yes Transformation Successful Abort no Host Program Abort Transformation Kernel Creation • ROSE Framework – Open-source compiler framework – Provides front-end, back-end and additional functionalities 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 14
Implementation Inline Constant Value Constant Constant Array 2D to 1D Array Transformation Transformation Folding Analysis Transformation Memory Typedef Parameter Loop Unrolling In/Out Analysis Analysis Analysis Analysis ● The Function Analysis & Transformation phase is the most important ● All decisions are made during this phase ● Consists of 10 analysis/transformation steps 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 15
Implementation Inline Constant Value Constant Constant Array 2D to 1D Array Transformation Transformation Folding Analysis Transformation Memory Typedef Parameter Loop Unrolling In/Out Analysis Analysis Analysis Analysis void vector_process ( char * input , void vector_update ( char * input , char value ) int ilen ) { { int i ; { for( i = 0 ; i < 64 ; i ++) char value_1 = 'c' ; input [ i ] += value ; int i ; } for( i = 0 ; i < 64 ; i ++) input [ i ] += value_1 ; void vector_update ( char * input , } int ilen ) } { vector_process ( input , 'c' ); } 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 16
Implementation Inline Constant Value Constant Constant Array 2D to 1D Array Transformation Transformation Folding Analysis Transformation Memory Typedef Parameter Loop Unrolling In/Out Analysis Analysis Analysis Analysis program.cpp aocl_kernel.cl const uint32_t K [] = { 0x428A2F98 , __constant const uint32_t K [] = { …}; 0x428A2F98 , …}; //Altera_OpenCL_Accelerate __kernel void //Altera_OpenCL_size K 64 aocl_generated_kernel ( …. ) //Altera_OpenCL_const_vec K { … … void update_accelerated ( … ) } 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 17
Implementation Inline Constant Value Constant Constant Array 2D to 1D Array Transformation Transformation Folding Analysis Transformation Memory Typedef Parameter Loop Unrolling In/Out Analysis Analysis Analysis Analysis #define WIDTH 512 #define WIDTH 512 int [ 5 ][ WIDTH ] A ; __kernel void aocl_generated_kernel ( int //Altera_OpenCL_Accelerate __global __restrict__ * A , ...) //Altera_OpenCL_size A 2560 { void func (...) ... { int tmp = A [ 2 * WIDTH + 3 ]; int tmp = A [ 2 ][ 3 ]; } } 07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 18
Recommend
More recommend