AMD GPU Jasper Manousek Ying Li 05.02.2015 Seminar | High-Performance and Scientific Computing Prof. Paolo Bientinesi, Ph.D.
Agenda Architecture Dwarfs Sparse Linear Algebra Dense Linear Algebra Graph Traversal MapReduce Conclusion 2
Architecture 3
Comparison: Nvidia GTX 640 vs. Radeon HD 6850
Nvidia GTX 640
• 1 controlling unit for every 8 stream processors
• one SP with FP/Int arithmetic functions
• advantage: easier for developers due to the simple structure
Radeon HD 6850
• blocks of 6 SP: 4 general ones and one overseer
• advantage: more potential if used correctly
• disadvantage: requires the developer to specifically program towards it
Architecture 4
Comparison
• Less power overall
• Smaller die size through the structure
• Less expensive
• Other small differences
Architecture 5
Dense Linear Algebra
Classic vector and matrix operations [1]
Data is typically laid out as a contiguous array and computations on elements, rows, columns, or matrix blocks are the norm [2]
Examples [3]
[1, 2, 3]: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
Dense Linear Algebra 6
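To make the contiguous layout concrete, here is a minimal C++ sketch (illustrative only, not taken from any of the papers) of a row-major dense matrix and the matrix-vector product that operates on it:

```cpp
#include <vector>
#include <cstddef>

// Row-major dense matrix: element (i, j) lives at data[i * cols + j],
// so each row is contiguous in memory and can be streamed efficiently.
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;  // one contiguous allocation
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// y = A * x, a classic dense kernel: every entry of A is touched exactly once.
std::vector<double> matvec(DenseMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < A.cols; ++j)
            y[i] += A.at(i, j) * x[j];
    return y;
}
```

Blocked algorithms (as used by BLAS and clMAGMA) operate on sub-matrices of this same contiguous array to improve cache and GPU memory reuse.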
Paper
Title: Pannotia: Understanding Irregular GPGPU Graph Applications
Author: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron
Publication: Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013
Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf
Dense Linear Algebra 7
Overview of the Paper
Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (the clMAGMA library)
Efficient implementation on AMD's Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines
Observation of wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs
Dense Linear Algebra 8
Performance Study
Hardware: AMD Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU's multicore host
Library: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host
Results: Higher performance of clMAGMA, applied to heterogeneous systems of multicore processors with GPU accelerators and coprocessors, in the area of dense linear algebra compared with MKL applied to the CPU
Dense Linear Algebra 9
Results in Detail (1)
1) LU factorization (up to 5.7x speedup vs. the CPU host)
2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)
[Figures: CPU+GPU with clMAGMA vs. CPU with MKL 11.1]
Source of the figures: (1)
Dense Linear Algebra 10
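For context, the CPU baseline in these comparisons is MKL's LAPACK. A minimal sketch of the same LU factorization (dgetrf) through the standard LAPACKE C interface, which MKL and other LAPACK builds provide; the matrix values here are made up purely for illustration:

```cpp
#include <vector>
#include <cstdio>
#include <lapacke.h>  // LAPACKE C interface (shipped with MKL and OpenBLAS)

int main() {
    const lapack_int n = 3;
    // Row-major 3x3 matrix; values are arbitrary illustration data.
    std::vector<double> A = { 4.0,  3.0,  0.0,
                              3.0,  4.0, -1.0,
                              0.0, -1.0,  4.0 };
    std::vector<lapack_int> ipiv(n);

    // In-place LU factorization with partial pivoting: A = P * L * U.
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n,
                                     A.data(), n, ipiv.data());
    std::printf("dgetrf info = %d\n", static_cast<int>(info));  // 0 means success
}
```

clMAGMA exposes the same factorizations with an OpenCL back end, which is where the reported speedups come from.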
Results in Detail (2)
3) QR factorization (up to 5.9x speedup vs. the CPU host)
4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)
[Figures: CPU+GPU with clMAGMA vs. CPU with MKL 11.1]
Source of the figures: (1)
Dense Linear Algebra 11
Results in Detail (3)
5) Matrix inversion (up to 1.2x speedup vs. the CPU host)
[Figure: CPU+GPU with clMAGMA vs. CPU with MKL 11.1]
Source of the figures: (1)
Dense Linear Algebra 12
Sparse Linear Algebra
Used when input matrices have a large number of zero entries [1]
Compressed data structures, keeping only the non-zero entries and their indices, are the norm here [2] [3]
[1, 2]: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
[3]: http://www.lanl.gov/Caesar/node223.html
Sparse Linear Algebra 13
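A common compressed layout is CSR (compressed sparse row). A minimal C++ sketch (illustrative, not taken from the paper) of CSR storage and the sparse matrix-vector product it enables:

```cpp
#include <vector>
#include <cstddef>

// CSR: only non-zero values are stored, together with their column indices
// and one row-pointer array delimiting where each row's entries begin.
struct CsrMatrix {
    std::size_t n;                      // number of rows
    std::vector<std::size_t> row_ptr;   // size n + 1
    std::vector<std::size_t> col_idx;   // size nnz
    std::vector<double>      val;       // size nnz
};

// y = A * x: each row only walks over its own non-zeros.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col_idx[k]];
    return y;
}
```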
Paper
Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Author: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling
Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5
Link: http://arxiv.org/pdf/1212.6326v2.pdf
Sparse Linear Algebra 14
Overview of the Paper
Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL
One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, the implementation of which is essentially a sparse matrix-vector product
In general, all the experiments, including the nonlinear disordered Hamiltonian lattice, show 10x to 20x acceleration when running on a GPU compared to the CPU path
Sparse Linear Algebra 15
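To give a flavour of what "high-level interfaces on top of CUDA or OpenCL" means, here is a small VexCL-style example following the library's basic documented usage (a sketch, not code from the paper; the exact API may vary between VexCL versions): a C++ vector expression is turned into a device kernel automatically.

```cpp
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    // Pick any compute device that supports double precision.
    vex::Context ctx(vex::Filter::DoublePrecision);

    const size_t n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);

    // Device-side vectors initialized from host data.
    vex::vector<double> A(ctx, a), B(ctx, b), C(ctx, n);

    // The expression template below is compiled into a single OpenCL/CUDA
    // kernel and launched on the device; no hand-written kernel code needed.
    C = 2.0 * A + B;
}
```

This is the productivity argument the paper quantifies: the library hides kernel generation, yet the measured overhead versus hand-written OpenCL is small.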
Performance Study
Hardware
− GPUs: AMD Radeon HD 7970/Tahiti & NVIDIA Tesla C2070
− CPU: Intel Core i7 930
Implementation
− GPUs: OpenCL implementations from AMD and NVIDIA
− CPU: OpenCL implementations from AMD and Intel
Results
− Distinct acceleration is observed when running the GPU path vs. the CPU path
− Significant acceleration requires problem sizes between 10^3 and 10^5 due to considerable overhead at smaller problem sizes
− The overhead of using the high-level libraries is negligible compared to the effort spent in getting familiar with the details of CUDA or OpenCL
Sparse Linear Algebra 16
Results in Detail (1)
[Table: VexCL performance on the CPU (Intel) vs. the GPU (AMD)]
Source of the table: (2)
Sparse Linear Algebra 17
Results in Detail (2)
Performance under the largest problem size (Hamiltonian lattice):

Library                  Time (sec)   Achieved throughput GB/sec (% of theoretical peak)
GPU: NVIDIA
  Thrust                   319.60      120 (81%)
  CMTL4                    370.31      104 (70%)
  VexCL                    401.39       96 (65%)
  ViennaCL                 433.50       89 (60%)
GPU: Tahiti
  VexCL                    225.41      170 (65%)
  ViennaCL                 214.87      179 (68%)
CPU: Intel Core i7 930
  Thrust                      N/A      N/A
  VexCL (AMD)             2934.99       13 (51%)
  VexCL (Intel)           3171.74       12 (47%)
  ViennaCL (AMD)          2608.80       15 (58%)
  ViennaCL (Intel)        2580.47       15 (58%)

Source of the table: (2)
Sparse Linear Algebra 18
Graph Traversal
http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png
Graph Traversal 19
Divergence
Branch Divergence
Multiple threads on the same wavefront
Threads can go into lockstep
Memory Divergence
All threads on one wavefront must access memory before the next step
Some threads must go through multiple adjacency lists to find the correct memory
Load Imbalance
Graphs are by their nature unbalanced
Some threads will get much more workload than others
Graph Traversal 20
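The load-imbalance point is easiest to see in code. Below is an illustrative C++ loop over a CSR-stored graph (hypothetical names, not taken from Pannotia): on a GPU each vertex would be handled by one work-item, so vertices with much longer adjacency lists keep their whole wavefront busy while other lanes idle, and the data-dependent branches diverge within the wavefront.

```cpp
#include <vector>
#include <cstddef>

// Graph in CSR form: vertex v's neighbours are adj[offset[v] .. offset[v+1]).
struct Graph {
    std::vector<std::size_t> offset;   // size |V| + 1
    std::vector<std::size_t> adj;      // size |E|
};

// One sweep of a BFS-like traversal. On a GPU, the body of the outer loop is
// one work-item; its running time depends on the vertex degree, which is
// exactly where load imbalance and divergence come from.
void sweep(const Graph& g, const std::vector<bool>& active,
           std::vector<int>& level, int current) {
    for (std::size_t v = 0; v + 1 < g.offset.size(); ++v) {
        if (!active[v]) continue;                       // divergent branch
        for (std::size_t e = g.offset[v]; e < g.offset[v + 1]; ++e) {
            std::size_t u = g.adj[e];                   // irregular, data-dependent access
            if (level[u] < 0) level[u] = current + 1;   // more divergence
        }
    }
}
```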
Speedup
All data was gathered using an AMD Radeon HD 7000 GPU and an AMD A8-5500 accelerated processing unit
Pannotia was used as the application suite
Graph Traversal 21
Dijkstra and Graph Coloring
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg
http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
Graph Traversal 22
Dijkstra and Graph Coloring
Speedups ranging from 4 to 8
Speedup tends to be better for larger graphs
Strong parallelization
Graph Traversal 23
Dijkstra and Graph Coloring
[Figure: speedup results]
Source: (4)
Graph Traversal 24
Friend Recommendation and Connected Components Labelling
http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png
Graph Traversal 25
Friend Recommendation and Connected Components Labelling
Speedups ranging from 1 to 2
Relatively little speedup due to strong imbalance
Graph Traversal 26
Summary
Effectiveness depends on the exact problem
Deep understanding of the GPU required
Deep understanding of the problem required
Graph Traversal 27
Map Reduce
http://de.wikipedia.org/wiki/Datei:MapReduce2.svg
Map Reduce 28
Map Reduce
AMD GPUs have two ways of accessing memory: FastPath and CompletePath
All current GPU implementations use global atomic operations
Use of global atomic operations causes AMD GPUs to use the CompletePath
Tests show 32 times slower memory access over the CompletePath
Map Reduce 29
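The pattern that triggers the CompletePath is every thread appending its output with a global atomic. A hypothetical host-side C++ analogue (std::atomic standing in for the OpenCL global atomic; names invented for illustration):

```cpp
#include <atomic>
#include <vector>
#include <cstddef>

// Naive emit: every produced value reserves its slot with one global atomic
// increment. On the AMD hardware discussed here, the mere presence of such
// atomics forces all memory accesses through the slow CompletePath.
void emit_naive(std::atomic<std::size_t>& write_pos,
                std::vector<int>& out, int value) {
    std::size_t slot = write_pos.fetch_add(1);   // one global atomic per element
    out[slot] = value;                           // assumes out was pre-sized
}
```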
Software-based Atomic Add
Source: A Map Reduce Framework for Heterogeneous Computing Architectures
Map Reduce 30
Map Reduce
Master thread quickly becomes a bottleneck
Instead, group by wavefront
Define the first thread as the dominant thread
Create 4 global arrays with one element per wavefront: WavefrontsAddresse, WavefrontsSum, WavefrontsPrefixSums, Finished
Map Reduce 31
Map Reduce Step 1
[Flowchart: threads load the address and the sums, then sync]
Map Reduce 32
Map Reduce Step 2
[Flowchart: a local atomic add generates the prefix sums and the local increment; the dominant thread checks whether its wavefront is the only one on the address and, if true, sets WFprefixSum to the address and WFincrement to localSum, otherwise sets the increment to 0; threads sync]
Map Reduce 33
Map Reduce Step 3
[Flowchart: if a wavefront is requesting, set addresses = 0; if dominant, reset the local data and update the global variable; sync]
Map Reduce 34
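Putting the three steps together, the essence of the software-based atomic add is: aggregate within a wavefront, then let only the dominant thread touch global memory. A sequential C++ simulation of that idea (one call plays the role of one wavefront; on the GPU the per-lane work runs in parallel; array names and sizes are illustrative, not the paper's exact data structures):

```cpp
#include <atomic>
#include <vector>
#include <cstddef>

constexpr std::size_t WAVEFRONT = 64;   // lanes per wavefront on AMD GPUs

// Each lane of a wavefront wants to append lane_data[lane] integers to `out`.
// Instead of one global atomic per element, the wavefront:
//   1. computes per-lane prefix sums and the wavefront total (local work),
//   2. lets the dominant (first) lane do a single fetch_add on the global
//      write position to reserve room for the whole wavefront,
//   3. has every lane copy its data to base + its prefix offset.
void append_wavefront(std::atomic<std::size_t>& write_pos,
                      std::vector<int>& out,                       // pre-sized output buffer
                      const std::vector<std::vector<int>>& lane_data) {
    std::vector<std::size_t> prefix(WAVEFRONT, 0);
    std::size_t total = 0;
    for (std::size_t lane = 0; lane < WAVEFRONT; ++lane) {         // step 1
        prefix[lane] = total;
        total += lane_data[lane].size();
    }
    std::size_t base = write_pos.fetch_add(total);                 // step 2: one atomic
    for (std::size_t lane = 0; lane < WAVEFRONT; ++lane) {         // step 3
        std::size_t pos = base + prefix[lane];
        for (int v : lane_data[lane]) out[pos++] = v;
    }
}
```

The single fetch_add per wavefront is what removes the per-element global atomics and lets the FastPath be used for the bulk of the memory traffic.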
Evaluation
Hardware
− GPU: ATI Radeon HD 5870 (Cypress)
− CPU: 2x Intel Xeon E5405
Key performance measures
− Total execution time in nanoseconds
− Ratio of FastPath to CompletePath memory transactions
MapReduce 35
Experiment
Micro benchmarks
1) Without memory transactions (up to 3x vs. system atomic operation)
2) With memory transactions (up to 1.9x vs. system atomic operation)
Source of the figures: (3)
MapReduce 36