AMD GPU | Jasper Manousek, Ying Li | 05.02.2015 | Seminar


SLIDE 1

AMD GPU

Jasper Manousek, Ying Li
05.02.2015

Seminar | High-Performance and Scientific Computing
Prof. Paolo Bientinesi, Ph.D.

SLIDE 2

Agenda

  • Architecture
  • Dwarfs
      − Sparse Linear Algebra
      − Dense Linear Algebra
      − Graph Traversal
      − MapReduce
  • Conclusion

SLIDE 3

Architecture

SLIDE 4

Comparison

NVIDIA GTX640
  • 1 controlling unit for every 8 stream processors
  • Advantage: easier for developers due to the simple structure

Radeon HD 6850
  • Blocks of 6 stream processors (SP)
  • 4 general ones and one overseer
  • One SP with FP/Int arithmetic functions
  • Advantage: more potential if used correctly
  • Disadvantage: requires the developer to program specifically towards it

SLIDE 5

Comparison

  • Less power overall
  • Smaller die size due to the structure
  • Less expensive
  • Other small differences

SLIDE 6

Dense Linear Algebra

  • Classic vector and matrix operations¹
  • Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm²
  • Examples³

¹ ² ³: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
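
The two points above (contiguous storage, computation on matrix blocks) can be sketched in a few lines of Python. This is only an illustration: the function name `matmul_blocked` and the `block` parameter are assumptions made for the example and do not come from the cited page.

```python
def matmul_blocked(a, b, n, block=2):
    """Multiply two n x n matrices stored as flat row-major lists.

    The three outer loops walk over block x block tiles, the three inner
    loops do the arithmetic inside one tile, so every tile of the inputs
    is reused while it is still "hot".
    """
    c = [0.0] * (n * n)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        a_ik = a[i * n + k]  # element (i, k) of the flat array
                        for j in range(j0, min(j0 + block, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c
```

Element (i, j) of a row-major matrix lives at index `i * n + j`, which is what makes rows contiguous in memory. The block size only changes the order of the operations, not the result.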

SLIDE 7

Paper

Title: Pannotia: Understanding Irregular GPGPU Graph Applications
Author: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron
Publication: Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013
Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf

SLIDE 8

Overview of the Paper

  • Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)
  • Efficient implementation on AMD’s Tahiti GPUs using the OpenCL standard and optimized BLAS routines
  • Wide applicability and a many-fold performance improvement over the highly tuned codes that constitute the state-of-the-art libraries for the current generation of multicore CPUs

SLIDE 9

Performance Study

  • Hardware: AMD’s Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU’s multicore host
  • Libraries: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host
  • Results: in dense linear algebra, clMAGMA on a heterogeneous system (multicore processors with GPU accelerators and coprocessors) achieves higher performance than MKL on the CPU

SLIDE 10

Results in Detail (1)

1) LU factorization (up to 5.7x speedup vs. the CPU host)
2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
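
For readers unfamiliar with the factorizations that were benchmarked, here is a minimal sketch of the Cholesky factorization A = L·Lᵀ. It is a plain, unblocked textbook version in Python for illustration only; it is not clMAGMA's implementation, and the function name `cholesky` is an assumption made for this example.

```python
import math


def cholesky(a):
    """Return the lower-triangular L with A = L * L^T.

    `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what is left of a[j][j] after removing the
        # contribution of the columns that are already finished.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l
```

The entries of one column below the diagonal are independent of each other, which is the kind of parallelism a GPU can exploit.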

SLIDE 11

Results in Detail (2)

3) QR factorization (up to 5.9x speedup vs. the CPU host)
4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)

SLIDE 12

Results in Detail (3)

5) Matrix inversion (up to 1.2x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)

SLIDE 13

Sparse Linear Algebra

  • Used when input matrices have a large number of zero entries¹
  • Compressed data structures, which keep only the non-zero entries and their indices, are the norm here²

[Figure]³

¹ ²: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
³: http://www.lanl.gov/Caesar/node223.html
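
One common compressed data structure of this kind is the compressed sparse row (CSR) format. The sketch below is an assumption-laden illustration in Python (the function names `to_csr` and `csr_matvec` are made up for the example): it stores only the non-zero values with their column indices and then computes a sparse matrix-vector product.

```python
def to_csr(dense):
    """Convert a list-of-rows matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # the non-zero entry itself
                col_idx.append(j)  # the column it came from
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y
```

Rows can be processed independently, but rows with many non-zeros take longer than rows with few, which already hints at the load-balancing issues on GPUs.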

SLIDE 14

Paper

Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Author: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling
Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5
Link: http://arxiv.org/pdf/1212.6326v2.pdf

SLIDE 15

Overview of the Paper

  • Comparison of several modern C++ libraries that provide high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL
  • One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose implementation is a sparse matrix-vector product
  • In general, all experiments, including the nonlinear disordered Hamiltonian lattice, show up to 10x to 20x acceleration when running on a GPU compared to the CPU path

SLIDE 16

Performance Study

  • Hardware
      − GPUs: AMD Radeon HD 7970 (Tahiti) and NVIDIA Tesla C2070
      − CPU: Intel Core i7 930
  • Implementation
      − GPUs: OpenCL implementations from AMD and NVIDIA
      − CPU: OpenCL implementations from AMD and Intel
  • Results
      − Distinct acceleration is observed on the GPU path compared to the CPU path
      − Significant acceleration requires problem sizes between 10^3 and 10^5, because the overhead is considerable at smaller problem sizes
      − The overhead of using high-level libraries is negligible compared to the effort spent getting familiar with the details of CUDA or OpenCL

SLIDE 17

Results in Detail (1)

[Figure: VexCL, CPU (Intel) vs. GPU (AMD)]
Source of the table: (2)

SLIDE 18

Results in Detail (2)

Performance under largest problem size (Hamiltonian lattice):

Device                  Library            Time (sec)   Achieved throughput in GB/sec (% of theoretical peak)
GPU: NVIDIA             Thrust                319.60    120 (81%)
GPU: NVIDIA             CMTL4                 370.31    104 (70%)
GPU: NVIDIA             VexCL                 401.39    96 (65%)
GPU: NVIDIA             ViennaCL              433.50    89 (60%)
GPU: Tahiti             VexCL                 225.41    170 (65%)
GPU: Tahiti             ViennaCL              214.87    179 (68%)
CPU: Intel Core i7 930  Thrust                   N/A    N/A
CPU: Intel Core i7 930  VexCL (AMD)          2934.99    13 (51%)
CPU: Intel Core i7 930  VexCL (Intel)        3171.74    12 (47%)
CPU: Intel Core i7 930  ViennaCL (AMD)       2608.80    15 (58%)
CPU: Intel Core i7 930  ViennaCL (Intel)     2580.47    15 (58%)

Source of the table: (2)
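
The numbers on this slide can be turned into speedups with simple arithmetic. The following Python sketch copies four entries from the slide (the Tahiti GPU and the CPU runs); the constant and function names are assumptions made for the example, and the "implied peak" is only a back-calculation from the rounded percentages, not a figure from the paper.

```python
# (time in seconds, throughput in GB/sec, percent of theoretical peak),
# copied from the slide for the largest problem size.
TAHITI_VEXCL = (225.41, 170, 65)
TAHITI_VIENNACL = (214.87, 179, 68)
CPU_VEXCL_AMD = (2934.99, 13, 51)
CPU_VIENNACL_INTEL = (2580.47, 15, 58)


def speedup(slow, fast):
    """Ratio of the run times: how many times faster `fast` is."""
    return slow[0] / fast[0]


def implied_peak(entry):
    """Theoretical peak bandwidth implied by throughput and percentage."""
    _, throughput, percent = entry
    return throughput / (percent / 100.0)
```

With these entries, `speedup(CPU_VIENNACL_INTEL, TAHITI_VIENNACL)` is about 12 and `speedup(CPU_VEXCL_AMD, TAHITI_VEXCL)` is about 13, which is in line with the 10x to 20x range quoted for the paper.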

SLIDE 19

Graph Traversal

Image source: http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png

SLIDE 20

Divergence

  • Branch divergence
      − Multiple threads run on the same wavefront
      − Threads of a wavefront execute in lockstep, so divergent branches are serialized
  • Memory divergence
      − All threads of a wavefront must complete their memory accesses before the next step
      − Some threads must go through multiple adjacency lists to find the correct memory
  • Load imbalance
      − Graphs are unbalanced by nature
      − Some threads get a much larger workload than others
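
The cost of load imbalance can be made concrete with a very simple model, sketched here in Python. The model is an assumption for illustration: one thread per vertex, work proportional to the vertex degree, and a wavefront that stays occupied until its slowest thread is done. The function name `wavefront_utilization` is hypothetical; the default width of 64 matches the usual wavefront size on AMD GPUs.

```python
def wavefront_utilization(work_per_thread, wavefront_size=64):
    """Fraction of occupied lane-steps that do useful work.

    Threads are grouped into wavefronts in order. Because the threads of
    a wavefront run in lockstep, every lane is occupied for as many
    steps as the busiest thread of that wavefront needs.
    """
    useful = 0
    occupied = 0
    for start in range(0, len(work_per_thread), wavefront_size):
        wavefront = work_per_thread[start:start + wavefront_size]
        useful += sum(wavefront)
        occupied += max(wavefront) * wavefront_size
    return useful / occupied if occupied else 1.0
```

For example, with a wavefront of 4 threads, the balanced workload `[3, 3, 3, 3]` gives a utilization of 1.0, while `[1, 1, 1, 9]` (one high-degree vertex) drops to one third.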

SLIDE 21

Speedup

  • All data was gathered using an AMD Radeon HD 7000
  • AMD A8-5500 accelerated processing unit
  • Pannotia was used as the application suite

SLIDE 22

Dijkstra and Graph Coloring

Image sources:
http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg

SLIDE 23

Dijkstra and Graph Coloring

  • Speedups ranging from 4 to 8
  • Speedup tends to be better for larger graphs
  • Strong parallelization
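
To show why graph coloring parallelizes well, here is a sketch of a round-based coloring scheme in Python. It is one textbook-style approach chosen for illustration and not necessarily the variant used in Pannotia; the function name `color_in_rounds` is an assumption, vertex ids must be comparable, and the graph must not contain self-loops.

```python
def color_in_rounds(adj):
    """Color a graph given as {vertex: set_of_neighbors}.

    In every round, each uncolored vertex whose id is larger than the
    ids of all its uncolored neighbors picks the smallest color not used
    by its colored neighbors. Two such vertices are never adjacent, so
    all of them could be processed by different threads at once.
    Returns (colors, number_of_rounds).
    """
    colors = {}
    uncolored = set(adj)
    rounds = 0
    while uncolored:
        winners = [v for v in uncolored
                   if all(u < v for u in adj[v] if u in uncolored)]
        for v in winners:  # conceptually one parallel step
            used = {colors[u] for u in adj[v] if u in colors}
            color = 0
            while color in used:
                color += 1
            colors[v] = color
        uncolored.difference_update(winners)
        rounds += 1
    return colors, rounds
```

Large graphs usually offer many such independent vertices per round, which fits the observation that the speedup tends to be better for larger graphs.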

SLIDE 24

Dijkstra and Graph Coloring

[Figure]
Source: (4)

SLIDE 25

Friend Recommendation and Connected Components Labelling

Image source: http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png

SLIDE 26

Friend Recommendation and Connected Components Labelling

  • Speedups ranging from 1 to 2
  • Relatively little speedup due to strong imbalance
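
Connected components labelling can be sketched as repeated label propagation. The Python version below is a generic illustration under stated assumptions (one conceptual thread per vertex, labels are the smallest vertex id seen so far); it is not claimed to be the algorithm benchmarked in the study, and the function name `label_components` is hypothetical.

```python
def label_components(adj):
    """Label connected components of a graph {vertex: list_of_neighbors}.

    Every vertex starts with its own id as label. In each sweep, every
    vertex takes the minimum label among itself and its neighbors, using
    the labels of the previous sweep (as parallel threads would).
    Returns (labels, number_of_sweeps).
    """
    labels = {v: v for v in adj}
    sweeps = 0
    changed = True
    while changed:
        changed = False
        new_labels = dict(labels)
        for v in adj:  # each iteration could be one GPU thread
            best = min([labels[v]] + [labels[u] for u in adj[v]])
            if best < labels[v]:
                new_labels[v] = best
                changed = True
        labels = new_labels
        sweeps += 1
    return labels, sweeps
```

The work per vertex depends on its degree and the number of sweeps depends on the shape of the graph, so a few high-degree vertices or long chains can keep most threads idle.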

SLIDE 27

Summary

  • Effectiveness depends on the exact problem
  • Deep understanding of the GPU required
  • Deep understanding of the problem required

SLIDE 28

MapReduce

Image source: http://de.wikipedia.org/wiki/Datei:MapReduce2.svg

SLIDE 29

MapReduce

  • AMD GPUs have two ways of accessing memory: FastPath and CompletePath
  • All current GPU implementations use global atomic operations
  • Using global atomic operations causes AMD GPUs to use the CompletePath
  • Tests show that memory access over the CompletePath is 32 times slower

SLIDE 30

Software-based Atomic Add

Source: A MapReduce Framework for Heterogeneous Computing Architectures

SLIDE 31

MapReduce

  • A master thread quickly becomes a bottleneck
  • Instead, group by wavefront
  • Define the first thread as the dominant thread
  • Create 4 global arrays with one element per wavefront: WavefrontsAddresses, WavefrontsSum, WavefrontsPrefixSums, Finished

SLIDE 32

MapReduce

Step 1
  • Threads load address and sums
  • Sync

SLIDE 33

MapReduce

Step 2
  • Is only wavefront on address?
      − True: WFprefixSum = address, WFincrement = localSum
      − False: local atomic add to generate prefixSum and increment
  • Sync
  • Update dominant and set local increment to 0

SLIDE 34

MapReduce

Step 3
  • Sync
  • If requesting wavefront (true/false): set addresses = 0
  • If dominant (true/false): update global variable
  • Reset local data
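
The core idea behind a software-based atomic add, combining per-wavefront sums and prefix sums instead of letting every thread hit a global atomic, can be sketched as a sequential Python simulation. This is a strongly simplified illustration and not the protocol from the dissertation: there is no synchronization, no dominant thread and no `Finished` array, and the name `software_atomic_add` and its parameters are assumptions made for the example.

```python
from itertools import accumulate


def software_atomic_add(increments, global_value=0, wavefront_size=4):
    """Simulate a fetch-and-add for every thread without a global atomic.

    Returns (old_values, new_global_value), where old_values[i] is the
    value thread i would have received from an atomic add if the threads
    had been served in order.
    """
    wavefronts = [increments[i:i + wavefront_size]
                  for i in range(0, len(increments), wavefront_size)]
    # Inside each wavefront: exclusive prefix sums and the total sum.
    local_prefix = [[0] + list(accumulate(wf))[:-1] for wf in wavefronts]
    wavefront_sums = [sum(wf) for wf in wavefronts]
    # Across wavefronts: exclusive prefix sums of the per-wavefront totals.
    wavefront_prefix = [0] + list(accumulate(wavefront_sums))[:-1]
    old_values = []
    for base, prefixes in zip(wavefront_prefix, local_prefix):
        old_values.extend(global_value + base + p for p in prefixes)
    return old_values, global_value + sum(wavefront_sums)
```

For instance, `software_atomic_add([1, 2, 3, 4, 5, 6], global_value=100)` hands out the same values as six ordered atomic adds would, while the shared counter itself is updated only once with the combined sum.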

SLIDE 35

Evaluation

  • Hardware
      − GPU: ATI Radeon HD 5870 (Cypress)
      − CPU: Intel Xeon E5405 x2
  • Key performance measures
      − Total execution time in nanoseconds
      − Ratio of FastPath to CompletePath memory transactions

SLIDE 36

Experiment: Micro Benchmarks

1) Without memory transactions (up to 1.9x vs. system atomic operation)
2) With memory transactions (up to 3x vs. system atomic operation)

Source of the figures: (3)

SLIDE 37

Experiment MapReduce: Test Applications

Matrix Multiplication (MM)
  • Matrices X and Y as input
  • Outputs matrix Z
  • Implementation: only the map phase
  • Each map task is responsible for calculating one element of matrix Z

String Match (SM)
  • Searches for an input keyword
  • Outputs all matching locations
  • Implementation: only the map phase
  • Each map task reads a chunk of the input document and outputs the found locations

KMeans (KM)
  • Iterative clustering algorithm
  • Each iteration assigns each input point to the closest cluster and recalculates the clusters
  • Implementation: both the map and reduce phases
  • Map function assigns points and reduce function recalculates clusters
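
The split of KMeans into a map and a reduce function can be sketched as follows. This is a small sequential Python illustration of one iteration, not the GPU framework evaluated here; the function names and the choice to keep an old center when no point is assigned to it are assumptions made for the example.

```python
from collections import defaultdict


def kmeans_map(point, centers):
    """Map: emit (index of the closest center, point)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists)), point


def kmeans_reduce(points):
    """Reduce: the new center is the mean of the points of one cluster."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(points, centers):
    """One KMeans iteration: map every point, group by key, reduce."""
    groups = defaultdict(list)
    for point in points:  # every point is an independent map task
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    return [kmeans_reduce(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]
```

Matrix Multiplication and String Match need no reduce function because each map task writes its final result directly.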

SLIDE 38

Experiment MapReduce: Result for Matrix Multiplication

  • The speedup of the software-based atomic add over the system one increases as the input matrices get larger (up to 13.55-fold)
  • Ratio of FastPath to CompletePath memory accesses: 30:0 for the software-based atomic and 3:28 for the system-provided atomic implementation

Source of the figures: (3)

SLIDE 39

Experiment MapReduce: Result for String Match

  • The software atomic approach helps to improve the memory read performance.
  • In the case of a large number of matches, the overhead incurred by the software atomic approach for writing results offsets the benefit of using FastPath for read accesses.
  • Ratio of FastPath to CompletePath memory accesses: 12:0 for the software-based atomic and 1:19 for the system-provided atomic implementation

Source of the figures: (3)

SLIDE 40

Experiment MapReduce: Result KMeans

  • The speedup of the software-based atomic add over the system one increases with the number of points (up to 67.3-fold)

Source of the figures: (3)

SLIDE 41

Conclusion AMD GPU

  • Significant speedup has been observed
  • Readily available in most computers
  • Requires a deep understanding of the architecture and the programming language
  • In contrast to NVIDIA, a more complicated implementation is needed to enhance the efficiency

Source of the figures: (3)

SLIDE 42

References

1) Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek and Stanimire Tomov (2013): clMAGMA: High Performance Dense Linear Algebra with OpenCL. International Workshop on OpenCL 2013.
2) Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling (2013): Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing, Vol. 35, No. 5.
3) Marwa K. Elteir (2012): A MapReduce Framework for Heterogeneous Computing Architectures. Dissertation, Virginia Polytechnic Institute and State University.
4) Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron (2013): Pannotia: Understanding Irregular GPGPU Graph Applications. Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013.

SLIDE 43

Work Distribution

Ying, Jasper

Architecture: p.3-5
Graph Traversal: p.19-27
Dense Linear Algebra: p.6-12
Sparse Linear Algebra: p.13-18
MapReduce: p.28-34, p.35-40
Conclusion: p.41