HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) AND THE SOFTWARE ECOSYSTEM
Manju Hegde, Corporate VP, Products Group, AMD
OUTLINE
- Motivation
- HSA architecture v1
- Software stack
- Workload analysis
- Software ecosystem
PARADIGM SHIFTS
[Diagram: three eras of computing, each plotted as performance over time]
- Single-Core Era: enabled by Moore's Law and voltage scaling; constrained by power and complexity. Languages: Assembly, C/C++, Java. Metric: single-thread performance over time.
- Multi-Core Era: enabled by Moore's Law and SMP architecture; constrained by power, parallel software, and scalability. Models: pthreads, OpenMP / TBB. Metric: throughput performance over time (# of processors).
- Heterogeneous Systems Era: enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Models: Shader, CUDA, OpenCL. Metric: modern application performance over time (data-parallel exploitation). We are here.
WITNESS DISCRETE CPU AND DISCRETE GPU COMPUTE
[Diagram: CPUs 1..N with coherent CPU memory, connected over PCIe to a GPU with its own memory]
- Compute acceleration works well for large offloads
- Slow data transfer between CPU and GPU
- Expert programming necessary to take advantage of GPU compute
FIRST AND SECOND GENERATION APUs
[Diagram: CPUs 1..N and GPU on one chip, with a coherent CPU partition and a GPU partition linked by a high-speed internal bus]
- First integration of CPU and GPU on-chip
- Common physical memory, but not visible as such to the programmer
- Faster transfer of data between CPU and GPU enables more code to run on the GPU
COMMON PHYSICAL MEMORY, BUT NOT TO THE PROGRAMMER
- CPU explicitly copies data to GPU memory
- GPU completes the computation
- CPU explicitly copies the result back to CPU memory
[Diagram: separate CPU memory and GPU memory with data copied across the boundary]
WHAT ARE THE PROBLEMS WE ARE TRYING TO SOLVE?
- SoCs are quickly running into the same many-CPU-core bottlenecks as the PC
- To move beyond this, we need to match each workload to the right processor(s) and/or execution device at reasonable power
- While addressing the core issues: easier to program, easier to optimize, easier to load balance, high performance, lower power
COMBINE INTO A UNIFIED PROGRAMMING MODEL
[Diagram: CPU, GPU, audio processor, video encode/decode hardware engines, DSP, image signal processing, and fixed-function accelerators joined by shared memory, coherency, and user-mode queues]
WHO IS DOING THIS? HSA FOUNDATION MEMBERSHIP – JUNE 2013
[Member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic, Associates]
HSA FOUNDATION'S FOCUS
- Identify design features that make accelerators first-class processors
- Attract mainstream programmers
- Create a platform architecture for ALL accelerators
HSA ARCHITECTURE V1
- GPU compute C++ support
- User-mode scheduling
- Fully coherent memory between CPU and GPU
- GPU uses pageable system memory via CPU pointers
- GPU graphics pre-emption
- GPU compute context switch
[Diagram: the unified programming model of the previous slide, with shared memory, coherency, and user-mode queues linking CPU, GPU, DSP, and fixed-function engines]
HSA KEY FEATURES
- Coherent memory: ensures CPU and GPU caches both see an up-to-date view of data
- Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory
- Entire memory space: both CPU and GPU can access and allocate any location in the system's virtual memory space
[Diagram: CPU and GPU caches kept coherent over shared physical and virtual memory]
WITH HSA
- CPU simply passes a pointer to the GPU
- GPU completes the computation
- CPU can read the result directly – no copying needed!
[Diagram: CPU and GPU sharing one uniform memory]
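The contrast between the two slides above can be sketched in ordinary Python. This is an illustrative simulation only: there is no real GPU here, and `gpu_compute` is a made-up stand-in worker thread, not an HSA API.

```python
# Simulates the discrete-GPU copy model vs. HSA's shared-memory model.
# "gpu_compute" is a hypothetical stand-in kernel, not a real GPU call.
import threading

def gpu_compute(data):
    """Stand-in for a GPU kernel: squares every element in place."""
    for i in range(len(data)):
        data[i] = data[i] * data[i]

# --- Legacy model: explicit copies across a (simulated) PCIe boundary ---
def legacy_dispatch(cpu_buffer):
    gpu_buffer = list(cpu_buffer)          # CPU copies data to GPU memory
    t = threading.Thread(target=gpu_compute, args=(gpu_buffer,))
    t.start(); t.join()                    # GPU completes the computation
    return list(gpu_buffer)                # CPU copies the result back

# --- HSA model: CPU simply passes a pointer; no copies ---
def hsa_dispatch(shared_buffer):
    t = threading.Thread(target=gpu_compute, args=(shared_buffer,))
    t.start(); t.join()
    return shared_buffer                   # same object: result read in place

data = [1, 2, 3, 4]
assert legacy_dispatch(data) == [1, 4, 9, 16]   # two copies were made
assert hsa_dispatch(data) == [1, 4, 9, 16]      # zero copies: mutated in place
```

In the HSA path the "CPU" and "GPU" operate on the very same buffer, which is the point of the slide: passing a pointer replaces two full copies.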
HSA SOFTWARE STACK ON HSA ARCHITECTURE V1
[Diagram: applications on top of HSA domain libraries and the OpenCL™ 2.x runtime; beneath them the HSA runtime with task-queuing libraries and the HSA JIT; then the HSA kernel-mode driver; all running on the HSA architecture v1 features listed on the previous slides]
HETEROGENEOUS COMPUTE DISPATCH
- How compute dispatch operates today in the driver model
- How compute dispatch improves under HSA
TODAY'S COMMAND AND DISPATCH FLOW
[Diagram, built up over several slides: each application (A, B, C) pushes work through a soft queue into the Direct3D user-mode driver, then through the kernel-mode driver into a command buffer and a DMA buffer; all applications funnel into a single hardware queue on the GPU, which serializes their work (e.g. B, A, B, C)]
HSA COMMAND AND DISPATCH FLOW
- Application codes to the hardware
- User-mode queuing
- Hardware scheduling
- Low dispatch times
- No APIs, no soft queues, no user-mode drivers, no kernel-mode transitions – no overhead!
[Diagram: each application (A, B, C) owns its own hardware queue and enqueues packets directly; an optional dispatch buffer feeds the GPU hardware]
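The user-mode queuing idea above can be modeled with a plain in-process queue. This is a conceptual sketch under loose assumptions, not the HSA packet format: the application writes dispatch packets straight into a queue that a "hardware" consumer drains, with no driver or kernel-mode layer in between.

```python
# Models HSA user-mode queuing: the application enqueues work packets
# directly; a consumer thread stands in for the GPU's hardware scheduler.
import queue
import threading

hardware_queue = queue.Queue()
results = []

def hardware_consumer():
    """Stand-in for hardware draining its queue and executing packets."""
    while True:
        packet = hardware_queue.get()
        if packet is None:              # sentinel: no more work
            break
        kernel, args = packet
        results.append(kernel(*args))

consumer = threading.Thread(target=hardware_consumer)
consumer.start()

# The application dispatches directly -- no API layers, no soft queues,
# no user-mode driver, no kernel-mode transition on this path.
hardware_queue.put((pow, (2, 10)))
hardware_queue.put((sum, ([1, 2, 3],)))
hardware_queue.put(None)
consumer.join()
assert results == [1024, 6]
```

The design point being illustrated: once the queue lives in the application's own address space, dispatch is just a memory write, which is where the slide's "low dispatch times" claim comes from.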
COMMAND AND DISPATCH: CPU <-> GPU
[Diagram: the application/runtime dispatches to CPU1, CPU2, and the GPU through the same queuing mechanism]
MAKING GPUS AND APUS EASIER TO PROGRAM: TASK QUEUING RUNTIMES
- A popular pattern for task- and data-parallel programming on SMP systems today
- Characterized by: a work queue per core; a runtime library that divides large loops into tasks and distributes them to queues; a work-stealing runtime that keeps the system balanced
- HSA is designed to extend this pattern to run on heterogeneous systems
TASK QUEUING RUNTIME ON CPUs
[Diagram: a work-stealing runtime with one queue per x86 CPU worker; all CPU threads share memory]
TASK QUEUING RUNTIME ON THE HSA PLATFORM
[Diagram: the same work-stealing runtime with an added queue served by a GPU manager, which fetches and dispatches tasks to the GPU's SIMD units; CPU threads and GPU threads share the same memory]
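The work-stealing pattern the slides describe can be sketched in miniature. This is not AMD's or any particular library's implementation; it is a toy with a coarse global lock, showing the shape of the idea: one deque per worker, and idle workers steal from the back of a busy worker's deque.

```python
# Minimal work-stealing sketch: per-worker deques, steal-from-the-back
# when a worker's own deque runs dry. A single coarse lock keeps it simple.
import collections
import random
import threading

NUM_WORKERS = 4
deques = [collections.deque() for _ in range(NUM_WORKERS)]
lock = threading.Lock()
results = []

def worker(wid):
    while True:
        with lock:
            if deques[wid]:
                task = deques[wid].popleft()         # take own work first
            else:
                victims = [d for d in deques if d]
                if not victims:
                    return                           # everything drained
                task = random.choice(victims).pop()  # steal from the back
        value = task()                               # run outside the lock
        with lock:
            results.append(value)

# The runtime divides a "large loop" into tasks distributed across queues.
for i in range(20):
    deques[i % NUM_WORKERS].append(lambda i=i: i * i)

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [i * i for i in range(20)]
```

Stealing from the *back* of a victim's deque while owners pop from the front is the classic design choice: it reduces contention and tends to steal larger, older chunks of work. HSA's extension is to let a GPU-side manager participate as one more consumer of these queues.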
DRIVER STACK VS. HSA SOFTWARE STACK
- Driver stack: apps on OpenCL™ 1.x and DX runtimes with user-mode drivers, over the graphics kernel-mode driver
- HSA software stack: apps on HSA domain libraries and the OpenCL™ 2.x runtime, over the HSA runtime with task-queuing libraries and the HSA JIT, over the HSA kernel-mode driver
- Both stacks run on the hardware: APUs, CPUs, GPUs
- [Diagram legend: user-mode components, kernel-mode components, and components contributed by third parties]
HSA INTERMEDIATE LANGUAGE – HSAIL
- HSAIL is the intermediate language for parallel compute in HSA
- Generated by a high-level compiler (LLVM, gcc, Java VM, etc.)
- Compiled down to GPU ISA or another parallel processor ISA by an IHV finalizer
- The finalizer may execute at run time, install time, or build time, depending on platform type
- HSAIL is a low-level instruction set designed for parallel compute in a shared virtual memory environment
- HSAIL is SIMT in form and does not dictate hardware microarchitecture
- HSAIL is designed for fast compile time, moving most optimizations to the high-level compiler
- HSAIL is at the same level as PTX: an intermediate assembly or virtual machine target
- Represented as bit-code in a BRIG file format, with support for late binding of libraries
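The finalizer idea can be illustrated with a toy. The IR below is entirely made up and bears no relation to the real HSAIL instruction set or BRIG encoding; it only mirrors the concept of lowering a portable intermediate form to something executable at run time, then dispatching it SIMT-style across work-items.

```python
# Toy "finalizer": lowers a made-up (op, operand) IR to a host-executable
# kernel at run time. Not real HSAIL -- a conceptual sketch only.
def finalize(ir):
    """Lower a list of (op, operand) instructions to a Python callable."""
    ops = {
        "mul": lambda x, k: x * k,
        "add": lambda x, k: x + k,
    }
    def kernel(workitem_id):
        value = workitem_id              # each work-item starts from its id
        for op, operand in ir:
            value = ops[op](value, operand)
        return value
    return kernel

# Hypothetical high-level compiler output: multiply by 3, then add 1.
ir = [("mul", 3), ("add", 1)]
kernel = finalize(ir)                    # "finalizer may execute at run time"

# SIMT-style dispatch: the same kernel runs over a grid of work-items.
assert [kernel(i) for i in range(4)] == [1, 4, 7, 10]
```

The split the slide describes falls out naturally: the heavy optimizations happen once in the high-level compiler that produces the IR, while the finalizer's job at run/install/build time is a fast, mostly mechanical lowering to the target ISA.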
HSA BRINGS A MODERN OPEN COMPILATION FOUNDATION
[Diagram: OpenCL™ and CUDA front ends (EDG or CLANG) emit SPIR and NVVM IR respectively; both feed LLVM, which lowers to HSAIL on one path and PTX on the other, and finally to hardware]
- This brings about a fully competitive, rich, complete compilation stack architecture for the creation of a broader set of GPU computing tools, languages, and libraries
- HSAIL supports LLVM and other compilers – GCC, Java VM