AMD GPU
Jasper Manousek, Ying Li, 05.02.2015
Seminar | High-Performance and Scientific Computing
Prof. Paolo Bientinesi, Ph.D.
Agenda
Architecture
Dwarfs: Sparse Linear Algebra, Dense Linear Algebra, Graph Traversal
Conclusion
Architecture
Less power overall through structure
Smaller die size
Less expensive
Other small differences
Architecture
Dense Linear Algebra
1,2,3: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
Title: Pannotia: Understanding Irregular GPGPU Graph Applications
Authors: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron
Publication: Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept. 2013
Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf
Dense Linear Algebra
Dense Linear Algebra
Dense Linear Algebra
1) LU factorization: up to 5.7x speedup vs. the CPU host
2) Cholesky factorization: up to 5.4x speedup vs. the CPU host
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1. Source of the figures: (1)
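To make concrete what is being benchmarked: LU factorization rewrites A as the product of a unit lower triangular L and an upper triangular U. Below is a minimal scalar C++ sketch of the unblocked, unpivoted algorithm, for illustration only; clMAGMA's LAPACK-style routines are blocked and pivoted and offload the large trailing updates to the GPU.

```cpp
#include <vector>
#include <cstdio>

// In-place LU factorization without pivoting (Doolittle): A = L*U,
// where L has a unit diagonal and both factors overwrite A (row-major, n x n).
void lu_factor(std::vector<double>& A, int n) {
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            A[i*n + k] /= A[k*n + k];                    // multiplier l_ik
            for (int j = k + 1; j < n; ++j)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];   // trailing submatrix update
        }
    }
}

int main() {
    std::vector<double> A = {4, 3,
                             6, 3};                      // 2 x 2 example
    lu_factor(A, 2);
    std::printf("L21 = %.2f, U22 = %.2f\n", A[2], A[3]); // prints 1.50, -1.50
}
```

The GPU speedups reported above come from running exactly this kind of factorization in blocked form, where most of the work is in the matrix-matrix updates that map well to the GPU.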
3) QR factorization: up to 5.9x speedup vs. the CPU host
4) Hessenberg factorization: up to 5.5x speedup vs. the CPU host
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1. Source of the figures: (1)
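Similarly, QR factorization writes A = QR with orthonormal Q and upper triangular R. A compact C++ sketch using modified Gram-Schmidt, for illustration only; the benchmarked library routines use blocked Householder transformations on CPU+GPU.

```cpp
#include <vector>
#include <cmath>
#include <cstdio>

// Modified Gram-Schmidt QR of a column-major m x n matrix A: A = Q * R,
// with Q (m x n, orthonormal columns) and R (n x n, upper triangular).
void qr_mgs(int m, int n, std::vector<double> A,
            std::vector<double>& Q, std::vector<double>& R) {
    Q = A;                                  // Q starts as a copy of A
    R.assign(n * n, 0.0);
    for (int k = 0; k < n; ++k) {
        double nrm = 0.0;
        for (int i = 0; i < m; ++i) nrm += Q[k*m + i] * Q[k*m + i];
        R[k*n + k] = std::sqrt(nrm);        // R(k,k) = ||q_k||
        for (int i = 0; i < m; ++i) Q[k*m + i] /= R[k*n + k];
        for (int j = k + 1; j < n; ++j) {
            double dot = 0.0;
            for (int i = 0; i < m; ++i) dot += Q[k*m + i] * Q[j*m + i];
            R[j*n + k] = dot;               // R(k,j) = q_k . a_j
            for (int i = 0; i < m; ++i) Q[j*m + i] -= dot * Q[k*m + i];
        }
    }
}

int main() {
    // 2 x 2 example, column-major: columns (3,4) and (0,5).
    std::vector<double> Q, R;
    qr_mgs(2, 2, {3, 4, 0, 5}, Q, R);
    std::printf("R = [%.1f %.1f; 0 %.1f]\n", R[0], R[2], R[3]);  // 5.0 4.0 3.0
}
```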
5) Matrix inversion: up to 1.2x speedup vs. the CPU host
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1. Source of the figures: (1)
1, 2: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
3: http://www.lanl.gov/Caesar/node223.html
Sparse Linear Algebra
Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Authors: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling
Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5
Link: http://arxiv.org/pdf/1212.6326v2.pdf
Sparse Linear Algebra
Sparse Linear Algebra
Hardware
Implementation
Results
Sparse Linear Algebra
Sparse Linear Algebra
Source of the table: (2)
VexCL: CPU (Intel) vs. GPU (AMD)
Sparse Linear Algebra
Performance for the largest problem size (Hamiltonian lattice); time in sec, achieved throughput in GB/sec (percentage of theoretical peak):

GPU: NVIDIA
  Thrust            319.60    120 (81%)
  CMTL4             370.31    104 (70%)
  VexCL             401.39     96 (65%)
  ViennaCL          433.50     89 (60%)

GPU: Tahiti
  VexCL             225.41    170 (65%)
  ViennaCL          214.87    179 (68%)
  Thrust            N/A       N/A

CPU: Intel Core i7 930
  VexCL (AMD)      2934.99     13 (51%)
  VexCL (Intel)    3171.74     12 (47%)
  ViennaCL (AMD)   2608.80     15 (58%)
  ViennaCL (Intel) 2580.47     15 (58%)

Source of the table: (2)
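For context, the libraries compared above all let the user write vector expressions in C++ and generate the device kernels automatically. A minimal VexCL-style sketch based on VexCL's documented usage (the device filters and sizes here are illustrative, not the paper's benchmark code):

```cpp
#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    // Select OpenCL devices that support double precision, preferring GPUs.
    vex::Context ctx(vex::Filter::GPU && vex::Filter::DoublePrecision);
    if (!ctx) return 1;                    // no suitable device found

    const size_t n = 1 << 20;
    vex::vector<double> x(ctx, n), y(ctx, n), z(ctx, n);

    x = 1.0;
    y = 2.0;
    z = 2 * x + sin(y);                    // expression templates fuse this into one OpenCL kernel

    std::cout << z[0] << std::endl;        // reads a single element back from the device
}
```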
Graph Traversal
http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png
Branch Divergence
Multiple threads run on the same wavefront and execute in lockstep, so threads that take different branches force the wavefront to execute both paths.
Memory Divergence
All threads on one wavefront must finish their memory accesses before the next step; some threads must walk through multiple adjacency lists to find the correct memory location.
Load Imbalance
Graphs are by nature unbalanced, so some threads get much more work than others (see the sketch below).
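A CPU-side C++ sketch (hypothetical names, not Pannotia code) of the vertex-per-thread pattern behind these issues: each GPU work-item would handle one vertex of a CSR graph, so work-items in the same wavefront walk adjacency lists of very different lengths (load imbalance) at scattered addresses (memory divergence).

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

// CSR graph: row_ptr[v] .. row_ptr[v+1] indexes the edges (col_idx, weight) of vertex v.
struct CsrGraph {
    std::vector<int>   row_ptr;
    std::vector<int>   col_idx;
    std::vector<float> weight;
};

// One SSSP-style relaxation sweep, vertex-per-"thread".
// On a GPU each v maps to one work-item; work-items in the same wavefront get
// inner loops of very different length and chase different adjacency lists.
void relax_all(const CsrGraph& g, std::vector<float>& dist) {
    int n = static_cast<int>(g.row_ptr.size()) - 1;
    for (int v = 0; v < n; ++v) {                        // ~ one GPU thread per vertex
        for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
            int   u = g.col_idx[e];
            float d = dist[v] + g.weight[e];
            dist[u] = std::min(dist[u], d);              // GPU code would need an atomic min here
        }
    }
}

int main() {
    // Vertex 0 has 3 edges, vertices 1-3 have none: an unbalanced toy graph.
    CsrGraph g{{0, 3, 3, 3, 3}, {1, 2, 3}, {1.f, 4.f, 2.f}};
    std::vector<float> dist{0.f, 1e9f, 1e9f, 1e9f};
    relax_all(g, dist);
    std::printf("dist[3] = %.1f\n", dist[3]);            // prints 2.0
}
```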
Graph Traversal
All data was gathered using an AMD Radeon HD 7000 series GPU and an AMD A8-5500 accelerated processing unit; Pannotia was used as the application suite.
Graph Traversal
Graph Traversal
http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg
Speedups ranging from 4x to 8x; speedup tends to be better for larger graphs; strong parallelization.
Graph Traversal
Graph Traversal
Source: (4)
Graph Traversal
http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png
Speedups ranging from 1x to 2x; relatively little speedup due to strong imbalance.
Graph Traversal
Effectiveness depends on the exact problem; a deep understanding of both the GPU and the problem is required.
Graph Traversal
Map Reduce
http://de.wikipedia.org/wiki/Datei:MapReduce2.svg
AMD GPUs have two ways of accessing memory: the FastPath and the CompletePath. All current GPU MapReduce implementations use global atomic operations, and the use of global atomic operations forces AMD GPUs onto the CompletePath. Tests show memory access over the CompletePath is 32 times slower.
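The costly pattern is the per-emit global atomic used to reserve output slots. A host-side C++ sketch of that pattern with std::atomic (illustrative only; in the OpenCL MapReduce implementations the analogous global atomic is what forces AMD GPUs onto the CompletePath):

```cpp
#include <atomic>
#include <vector>
#include <thread>
#include <cstdio>

std::atomic<int> out_count{0};           // global output counter
std::vector<int> out(1024);              // global output buffer

// Each "map task" emits results by atomically reserving one slot per value.
void map_task(int id) {
    for (int v = 0; v < 4; ++v) {
        int slot = out_count.fetch_add(1);   // one global atomic per emitted value
        out[slot] = id * 100 + v;
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 8; ++id) workers.emplace_back(map_task, id);
    for (auto& t : workers) t.join();
    std::printf("emitted %d values\n", out_count.load());  // prints 32
}
```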
Map Reduce
Map Reduce
A Map Reduce Framework for Heterogeneous Computing Architectures
The master thread quickly becomes a bottleneck. Instead, group by wavefront: define the first thread as the dominant thread and create 4 global arrays with one element per wavefront (WavefrontsAddresse, WavefrontsSum, ...).
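A hedged CPU-side sketch of that idea (the wavefront size, names such as reserve_per_wavefront, and the emulation loop are illustrative, not the paper's code): the threads of one wavefront first combine their output sizes with a local prefix sum, and only the dominant thread issues a single global atomic for the whole wavefront.

```cpp
#include <atomic>
#include <vector>
#include <cstdio>

constexpr int WAVEFRONT = 64;                  // AMD wavefront width
std::atomic<int> global_offset{0};             // shared output cursor

// Emulate one wavefront: sizes[i] is how many values "thread" i wants to emit.
// Returns the per-thread write offsets into the global output buffer.
std::vector<int> reserve_per_wavefront(const std::vector<int>& sizes) {
    std::vector<int> offset(WAVEFRONT);
    int local_sum = 0;
    for (int i = 0; i < WAVEFRONT; ++i) {      // local prefix sum (a local atomic
        offset[i] = local_sum;                 // add on the actual GPU)
        local_sum += sizes[i];
    }
    // Only the dominant (first) thread touches the global counter: one global
    // atomic per wavefront instead of one per thread.
    int base = global_offset.fetch_add(local_sum);
    for (int i = 0; i < WAVEFRONT; ++i) offset[i] += base;
    return offset;
}

int main() {
    std::vector<int> sizes(WAVEFRONT, 2);      // every thread emits two values
    auto off = reserve_per_wavefront(sizes);
    std::printf("thread 63 writes at %d, next free slot %d\n",
                off[63], global_offset.load());  // prints 126, 128
}
```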
Map Reduce
Map Reduce
Step 1: threads load addresses and sums, then sync.
Map Reduce
Step 2 (flowchart): if this is the only wavefront, set WFprefixSum = address and WFincrement = localSum; otherwise, a local atomic add generates the prefix sum and the increment. Sync, then update the dominant thread and set the local increment to 0.
Map Reduce
Step 3 (flowchart): sync; if this is the requesting wavefront, set addresses = 0; if this is the dominant thread, update the global variable and reset the local data.
MapReduce
Hardware
Key Performance measures
MapReduce
Source of the figures: (3)
MapReduce
Benchmark applications:
Matrix multiplication: map phase only; each map task calculates one element of Matrix Z.
Keyword search: map phase only; each map task scans a chunk of the input document and outputs the found keyword locations.
Clustering: map and reduce phases; the map function assigns each input point to a closest cluster and the reduce function recalculates the clusters.
MapReduce
Figures: the speedup of using software-based atomic operations, and the ratio of FastPath to CompletePath memory accesses.
Source of the figures: (3)
MapReduce
The software atomic approach helps to improve the memory read performance. In the case of a large number of matches, the overhead incurred by the software atomic approach for writing results offsets the benefit of using FastPath for read accesses.
Ratio of FastPath to CompletePath memory accesses: 12:0 for software-based atomic and 1:19 for system-provided atomic implementations.
Source of the figures: (3)
MapReduce
The speedup of using software-based atomic operations.
Source of the figures: (3)
MapReduce
Source of the figures: (3)