Programming Models and Runtime Systems for Heterogeneous Architectures
Sylvain Henry <sylvain.henry@inria.fr>
Advisors: Denis Barthou and Alexandre Denis
November 14, 2013
High-Performance Computing
[Slide of example application domains; image sources: Dassault Aviation, BMW, Larousse, Interstices]
Evolution of the architecture models: parallel architectures
- Single-core performance improvement has stalled since 2003
  - Power wall: raising the processor frequency increases power consumption exponentially
  - Memory wall: the gap between memory and processor speeds keeps widening
- The number of transistors per chip keeps increasing
  - More cores per chip
  - Multi-core architectures are now ubiquitous
- Trend: multi-core processors with lower frequencies and more cores
Evolution of the architecture models: specialized parallel architectures
- Cell Broadband Engine (2005)
  - 8 co-processors
  - Used in the PlayStation 3 and in supercomputers
- Graphics Processing Units (GPUs)
  - Massively parallel architectures
  - Now also used for scientific computation
- Systems-on-chip (SoC), e.g. ARM, AMD Fusion
  - Integrate CPU, GPU, DSP...
- Trend: heterogeneous architectures, composing different architecture models
Heterogeneous architectures
- Multi-core CPU + several accelerators
- Most general case:
  - any number of accelerators
  - any kind of accelerator
  - any kind of interconnection network
- [Example configurations pictured]
- Goal: use the best-suited processing unit for each computation
- Manual tuning has to be repeated for each architecture, so code portability is difficult to achieve
Abstract architecture model
[Diagram: a network of memories (the host memory plus device memories), each with associated heterogeneous processing units: CPU cores attached to the host memory; CUDA, OpenCL, and MIC processing units attached to the device memories]
Execution model: master-slave (host program model)
[Animated diagram: the host program drives the processing units]
- Input buffers A and B are allocated in a device memory and their contents transferred from host memory
- A kernel executes on the device's processing units, producing C in device memory
- The result C is transferred back into host memory and the device buffers are released
Programming model: low-level approach (e.g. OpenCL, CUDA...)
[Diagram: the host submits commands through per-device command queues; an OpenCL command graph of transfers (Tr), kernels (K), and callbacks spans Device 1 ... Device N; devices are grouped into contexts]
OpenCL example (uncluttered): C ← A + B

    float A[256], B[256], C[256];

    /* Select accelerator */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq = clCreateCommandQueue(context, devices[0] ...);

    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024 ...);
    cl_mem bufB = clCreateBuffer(context, 1024 ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);

    /* Send data */
    clEnqueueWriteBuffer(cq, bufA, 0, 1024, A, NULL, &event1 ...);
    clEnqueueWriteBuffer(cq, bufB, 0, 1024, B, NULL, &event2 ...);

    /* Execute kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    cl_event deps[] = {event1, event2};
    clEnqueueNDRangeKernel(cq, kernelAdd, deps, &event3 ...);

    /* Receive data */
    clEnqueueReadBuffer(cq, bufC, 0, 1024, C, &event3, &event4 ...);
    clWaitForEvents(1, &event4);

    /* Release buffers */
    clReleaseMemObject(bufA);
    clReleaseMemObject(bufB);
    clReleaseMemObject(bufC);