

  1. Outline
     1.) N-Body Methods
     2.) Dynamic Programming
     3.) Sparse Linear Algebra
     4.) Unstructured Grids
     5.) Conclusion

  2. N-Body Methods Overview

  3. N-Body Methods
     • Particle simulation
       – Large number of simple entities
       – Each entity depends on every other entity
       – Regularly structured floating-point computations
     • Xeon Phi offers acceptable performance
       – Regularity leads to good auto-vectorization
       – Cache misses result in very high memory latency
         • L1 misses mostly lead to L2 misses
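To make the "regularly structured floating-point computations" point concrete, here is a minimal all-pairs force loop. This is an illustrative sketch, not code from the presentation; the function name, the softening constant eps2, and the unit masses are assumptions.

```c
/* Minimal all-pairs N-body force sketch. The inner loop is regular
 * floating-point arithmetic over contiguous arrays, which is what
 * lets the compiler auto-vectorize it for the Xeon Phi's wide
 * vector registers. */
#include <math.h>
#include <stddef.h>

void compute_forces(size_t n,
                    const float *x, const float *y, const float *z,
                    float *fx, float *fy, float *fz)
{
    const float eps2 = 1e-6f;           /* softening to avoid r = 0 */
    for (size_t i = 0; i < n; ++i) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        /* Every entity depends on every other entity: O(n^2) pairs,
         * but the j-loop streams linearly through memory. */
        for (size_t j = 0; j < n; ++j) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + eps2;
            float inv_r = 1.0f / sqrtf(r2);
            float s = inv_r * inv_r * inv_r;   /* 1 / r^3, unit masses */
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        fx[i] = ax; fy[i] = ay; fz[i] = az;
    }
}
```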

  4. N-Body platform comparison Source: [3]

  5. Dynamic Programming Overview

  6. Dynamic Programming
     • The problem is divided into smaller problems
       – Each smaller (and simpler) problem is solved first
       – Results are combined to compute larger problems
       – Often used for sequence alignment algorithms
     • Example: DNA sequence alignment
       – Xeon Phi 5110P vs. two GPU configurations (GeForce GTX 480, K20c)
         • Outperforms the GTX 480
         • Reaches 74.3% of the K20c's performance
       – Computation relies heavily on peak integer performance
         • The Xeon Phi has ~57.3% of the K20c's peak integer performance
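The slides do not say which alignment algorithm the benchmark uses; the standard Needleman-Wunsch scoring recurrence below is a stand-in that shows the dynamic-programming structure: each cell depends only on already-solved smaller subproblems, and the work is dominated by integer compares and adds, which is why peak integer performance decides the Xeon Phi vs. K20c comparison. Scoring constants are illustrative.

```c
/* Needleman-Wunsch alignment score via dynamic programming.
 * Only two rows of the DP table are kept, since row i depends
 * only on row i-1 and earlier cells of row i. */
#include <stdlib.h>
#include <string.h>

int nw_score(const char *a, const char *b)
{
    const int match = 2, mismatch = -1, gap = -2;
    size_t n = strlen(a), m = strlen(b);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);

    for (size_t j = 0; j <= m; ++j)
        prev[j] = (int)j * gap;              /* first row: all gaps */

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i * gap;
        for (size_t j = 1; j <= m; ++j) {
            /* Combine the three smaller subproblems. */
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;
            int left = curr[j - 1] + gap;
            int best = diag > up ? diag : up;
            curr[j] = best > left ? best : left;
        }
        int *tmp = prev; prev = curr; curr = tmp;   /* roll the rows */
    }
    int score = prev[m];
    free(prev); free(curr);
    return score;
}
```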

  7. Sparse Linear Algebra Overview

  8. Sparse Linear Algebra
     • Matrices with scattered non-zero entries
     • Linear system solvers, eigensolvers
     • Different matrix structures
       – Patterns can arise
       – Irregularities
         • False predictions
         • Vector registers contain zero entries

  9. Case Study: Sparse Matrix-Vector Operations
     • Large sparse matrix A
     • Dense vector x
     • Multiply-add
     • Simple instructions, in large quantity
     • Parallel execution
     • Access patterns
     Image Source: [2]
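A plain CSR sparse matrix-vector product makes these points visible. The field names (row_ptr, col_idx, vals) follow the usual CSR convention; this is a generic sketch, not the kernel from [1] or [2].

```c
/* CSR sparse matrix-vector product, y = A*x. The body is a long
 * stream of simple multiply-adds, independent across rows (parallel
 * execution), but the gather x[col_idx[k]] is data-dependent: these
 * irregular access patterns are what hurt vectorization and caching. */
#include <stddef.h>

void spmv_csr(size_t nrows,
              const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```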

  10. Bottleneck Considerations
      • Bandwidth limits
        – Memory bandwidth
        – Computation bandwidth
      • Memory latency
      • SIMD efficiency
        – Vector register utilization
      • Core utilization

  11. Algorithm Showcase
      • CSR / SELLPACK format
        – Column blocks
        – Finite-window sorting
      • Adaptive load balancing
      Image Source: [1]
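To illustrate the idea behind such a format, here is a sliced-ELLPACK-style SpMV sketch. The layout assumed here (chunks of C rows, each padded to the chunk's longest row, values stored column-major within the chunk) is the common SELL arrangement, not necessarily the exact data structure of [1]; padding entries are assumed to store vals = 0 with a valid col_idx, and the row count is assumed padded to a multiple of C.

```c
/* Sliced-ELLPACK-style SpMV sketch. Storing a chunk column-major
 * means the r-loop touches C consecutive rows at once: one SIMD
 * vector, with no per-row remainder handling. */
#include <stddef.h>

#define C 8   /* rows per chunk; would match the SIMD width */

void spmv_sell(size_t nchunks,
               const size_t *chunk_ptr,   /* start of each chunk in vals */
               const size_t *chunk_len,   /* padded row length per chunk */
               const size_t *col_idx, const double *vals,
               const double *x, double *y)
{
    for (size_t c = 0; c < nchunks; ++c) {
        size_t base = chunk_ptr[c];
        for (size_t k = 0; k < chunk_len[c]; ++k) {
            for (size_t r = 0; r < C; ++r) {
                size_t idx = base + k * C + r;
                y[c * C + r] += vals[idx] * x[col_idx[idx]];
            }
        }
    }
}
```

This is also where the finite-window sorting pays off: sorting rows by length within a bounded window groups similar-length rows into the same chunk, so little padding is wasted, while keeping rows close to their original positions.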

  12. Evaluation: Effects of individual improvements. Source: [1]

  13. Evaluation: Platform comparison Source: [1]

  14. Evaluation: Platform comparison [GFlop/s]
      Matrices from the UFL Sparse Matrix Collection (/20). Matrix 8 has many more non-zero entries than the rest. Based on data from [2]

  15. Conclusion: Sparse Linear Algebra on Xeon Phi
      • High potential for sparse linear algebra
        – Wide vector registers
        – High number of cores
      • Main problem areas:
        – Irregular access patterns
        – Sparse data structures
        – High memory latency on L2 cache misses
      • Performance is extremely data-dependent

  16. Unstructured Grids. Image Source: [4]

  17. Image Source: [5]

  18. Image Source: [6]

  19. Unstructured Grids
      • Partitioning the grid (few connections between partitions)
      • Irregular access patterns => bad for auto-vectorization
      • Software prefetching of grid cells
      • Ideal prefetch amount: main-memory latency × main-memory bandwidth
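A minimal software-prefetch sketch for an indirect gather over grid cells follows. The lookahead distance PF_DIST is illustrative; the slide's rule of thumb says the data kept in flight should cover the bandwidth-delay product (main-memory latency × main-memory bandwidth). __builtin_prefetch is GCC/Clang-specific.

```c
/* Software prefetching for an indirect (gather) access pattern.
 * The index array is read sequentially, so a future index is cheap
 * to obtain; we prefetch the irregular target it points to, hiding
 * main-memory latency behind useful work on earlier edges. */
#include <stddef.h>

void gather_cells(size_t nedges, const size_t *cell_of_edge,
                  const double *cell_data, double *edge_val)
{
    enum { PF_DIST = 16 };   /* edges to look ahead; tune per machine */
    for (size_t e = 0; e < nedges; ++e) {
        if (e + PF_DIST < nedges)
            __builtin_prefetch(&cell_data[cell_of_edge[e + PF_DIST]], 0, 1);
        edge_val[e] = cell_data[cell_of_edge[e]];
    }
}
```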

  20. Unstructured Grids
      • Data races cannot be determined statically
      • "Looping over edges and accessing data on edges" => data races
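A sketch of how such a race arises: when an edge loop writes through the mesh's indirection arrays, two edges sharing an endpoint update the same entry concurrently, and which edges conflict depends on the mesh, so it cannot be determined statically. The atomics below are one simple (but slow) fix; coloring the edges so that no two edges in a batch share an endpoint is the usual faster alternative. The array names are illustrative.

```c
/* Parallel edge loop with indirect updates. Without the atomics,
 * edges e1 and e2 with node_a[e1] == node_b[e2] would race on
 * node_val[]; the conflict set is only known at runtime. */
#include <stddef.h>

void accumulate_edges(size_t nedges,
                      const size_t *node_a, const size_t *node_b,
                      const double *flux, double *node_val)
{
    #pragma omp parallel for
    for (size_t e = 0; e < nedges; ++e) {
        #pragma omp atomic
        node_val[node_a[e]] += flux[e];
        #pragma omp atomic
        node_val[node_b[e]] -= flux[e];
    }
}
```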

  21. Airfoil Benchmark
      • 2D inviscid airfoil code
      • Considered bandwidth-bound
      • 2,800,000-cell mesh
      Image Source: [7]

  22. Competitors: chart comparing 2 × Xeon E5-2680, Xeon Phi 5110P, and Tesla K40 by price ($), core count, memory bandwidth, double-precision GFlop/s, and last-level cache.

  23. Airfoil on Xeon Phi
      • At the highest level: Message Passing Interface (MPI)
      • At the lower level: OpenMP + vector intrinsics
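A minimal hybrid skeleton showing the two-level scheme: MPI ranks at the top, OpenMP threads (whose innermost loops would use vector intrinsics) within each rank. The mesh partitioning and halo exchange are placeholders, not a real API.

```c
/* Hybrid MPI + OpenMP skeleton. MPI_THREAD_FUNNELED declares that
 * only the main thread of each rank will make MPI calls, which fits
 * the pattern of exchanging halos between OpenMP parallel regions. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* ... partition the mesh across ranks, exchange halo cells ... */

    #pragma omp parallel
    {
        /* Each rank's share of the mesh is processed by OpenMP
         * threads; inner loops would add vector intrinsics. */
        printf("rank %d/%d, thread %d/%d\n", rank, nranks,
               omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```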

  24. Xeon Phi Benchmark graph. Image Source: [8]

  25. Benchmark graph. Image Source: [8]

  26. Unstructured Grids: Conclusion
      • Benchmark probably not fair (price)
      • Do manual vectorization: speedup of 1.7-1.82
      • Use MPI + OpenMP

  27. Who did what
      Philipp Bartels:
      • Architecture introduction: slide 18 and slides 1 to 8
      • Presentation team x2: slide 1 and slides 16 to 27
      Eugen Seljutin:
      • Presentation team x2: slides 2 to 15

  28. Credits
      [1] Xing Liu et al.: Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors
      [2] Erik Saule et al.: Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
      [3] Konstantinos Krommydas et al.: On the Characterization of OpenCL Dwarfs on Fixed and Reconfigurable Platforms
      [4] http://en.wikipedia.org/wiki/Unstructured_grid
      [5] http://view.eecs.berkeley.edu/wiki/Unstructured_Grids
      [6] https://cfd.gmu.edu/~jcebral/gallery/vis04/index.html

  29. Credits
      [7] http://en.wikipedia.org/wiki/File:Airfoil_with_flow.png
      [8] I. Z. Reguly et al.: Vectorizing Unstructured Mesh Computations for Many-core Architectures
