  1. Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood, Department of Computer Science, The University of Sheffield

  2. Graphics Processing Units (GPUs)

  3. Context: GPU Performance • Serial computing: ~40 GigaFLOPS (1 core) • Parallel computing: 384 GigaFLOPS (16 cores) • Accelerated computing: 8.74 TeraFLOPS (4992 cores)

  4. [Chart: a single CPU core (~40 GigaFLOPS) vs. a GPU with 4992 cores (8.74 TeraFLOPS); 6 hours of CPU time vs. 1 minute of GPU time]

  5. Accelerators • Much of the functionality of CPUs is unused for HPC: complex pipelines, branch prediction, out-of-order execution, etc. • Ideally for HPC we want simple, low-power and highly parallel cores

  6. An accelerated system [Diagram: CPU with DRAM and I/O connected over PCIe to a GPU/accelerator with GDRAM and I/O] • Co-processor, not a CPU replacement

  7. Thinking Parallel • Hardware considerations • High Memory Latency (PCI-e) • Huge Numbers of processing cores • Algorithmic changes required • High level of parallelism required • Data parallel != task parallel “If your problem is not parallel then think again”
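As a minimal, self-contained illustration of the data-parallel model described above (not code from this work; the array contents and kernel name are arbitrary), the sketch below launches one CUDA thread per array element and moves the data across PCIe before and after the kernel, which is exactly the transfer cost the hardware considerations warn about:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Data parallel: every thread applies the same operation to its own element.
__global__ void scaleEdgeCosts(float* costs, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        costs[i] *= factor;  // no interaction between threads
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_costs = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_costs[i] = 1.0f;

    float* d_costs;
    cudaMalloc(&d_costs, bytes);

    // Host-to-device copy crosses PCIe: high latency, so transfer rarely and
    // keep data resident on the GPU between kernel launches.
    cudaMemcpy(d_costs, h_costs, bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;  // huge numbers of threads
    scaleEdgeCosts<<<grid, block>>>(d_costs, 0.5f, n);

    cudaMemcpy(h_costs, d_costs, bytes, cudaMemcpyDeviceToHost);
    printf("costs[0] = %f\n", h_costs[0]);

    cudaFree(d_costs);
    free(h_costs);
    return 0;
}
```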

  8. Amdahl's Law: Speedup S(N) = 1 / ((1 - P) + P / N), where P is the parallelisable proportion of the program and N is the number of processors [Chart: speedup vs. number of processors for P = 25%, 50%, 90% and 95%] • Speedup of a program is limited by the proportion that can be parallelised • Addition of processing cores gives diminishing returns
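For concreteness (numbers chosen purely for illustration, not taken from the talk), with P = 95% of the work parallelisable:

```latex
S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad
S(16) = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{0.05} = 20
```

So even with unlimited processors the speedup is capped at 20x, which is why the serial fraction of an application matters so much.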

  9. SATALL Optimisation

  10. Profile the Application • 11 hour runtime • Function A: 97.4% of runtime, 2000 calls • Hardware: Intel Core i7-4770K 3.50GHz, 16GB DDR3, Nvidia GeForce Titan X [Chart: time per function for the largest network (LoHAM); Function A at 97.4%, functions B-F at 0.3% or less each]

  11. Function A • Input: network (directed weighted graph) and origin-destination matrix • Output: traffic flow per edge • 2 distinct steps: 1. Single Source Shortest Path (SSSP), an all-or-nothing path for each origin in the O-D matrix; 2. Flow Accumulation, applying the O-D value for each trip to each link on the route [Chart: distribution of serial runtime (LoHAM) between SSSP and Flow]

  12. Single Source Shortest Path • For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost)

  13. Serial SSSP Algorithm • D'Esopo-Pape (1974) • Maintains a double-ended queue of vertices to explore • Highly serial: not a data-parallel algorithm • We must change the algorithm to match the hardware (see the sketch below) Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
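A minimal host-side sketch of the D'Esopo-Pape label-correcting scheme (illustrative only, not the SATALL source; the CSR graph layout and names are assumptions). The single work queue is what makes it inherently serial:

```cuda
// Host-side C++ (compiles as ordinary host code under nvcc).
// Graph in CSR form: edges of vertex v are rowStart[v] .. rowStart[v+1]-1.
#include <deque>
#include <vector>
#include <limits>

void desopoPapeSSSP(int source,
                    const std::vector<int>&   rowStart,
                    const std::vector<int>&   adjVertex,
                    const std::vector<float>& adjWeight,
                    std::vector<float>&       dist) {
    const float INF = std::numeric_limits<float>::infinity();
    const int numVertices = (int)rowStart.size() - 1;
    dist.assign(numVertices, INF);

    // 0 = never queued, 1 = currently queued, 2 = previously queued
    std::vector<char> state(numVertices, 0);
    std::deque<int> queue;

    dist[source] = 0.0f;
    queue.push_back(source);
    state[source] = 1;

    while (!queue.empty()) {              // strictly one vertex at a time
        int u = queue.front();
        queue.pop_front();
        state[u] = 2;
        for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
            int v = adjVertex[e];
            float candidate = dist[u] + adjWeight[e];
            if (candidate < dist[v]) {
                dist[v] = candidate;
                if (state[v] == 0)      queue.push_back(v);   // new: append
                else if (state[v] == 2) queue.push_front(v);  // re-opened: prepend
                state[v] = 1;
            }
        }
    }
}
```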

  14. Parallel SSSP Algorithm • Bellman-Ford algorithm (1956) • Poor serial performance and time complexity: performs significantly more work • Highly parallel: suitable for GPU acceleration [Chart: total edges considered, Bellman-Ford vs. D'Esopo-Pape] Bellman, Richard. On a routing problem. 1956.
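A sketch of one naïve data-parallel Bellman-Ford relaxation pass, roughly in the spirit of A2 (not the authors' kernel; one thread per vertex over a CSR graph, with assumed names). The host relaunches it until no distance changes, or at most V-1 times:

```cuda
// One relaxation pass: each thread relaxes all outgoing edges of its vertex.
__global__ void bellmanFordPass(int numVertices,
                                const int*   rowStart,   // CSR row offsets
                                const int*   adjVertex,  // CSR column indices
                                const float* adjWeight,  // edge costs
                                float*       dist)       // distance from the origin
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= numVertices) return;

    float du = dist[u];
    if (isinf(du)) return;  // not reached yet, nothing to relax

    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int v = adjVertex[e];
        float candidate = du + adjWeight[e];
        // Concurrent writes to dist[v] can lose the smaller of two updates
        // within a pass; repeating passes until nothing changes still reaches
        // the correct result, and a compare-and-swap loop would make each
        // update exact at the cost of more atomics.
        if (candidate < dist[v]) {
            dist[v] = candidate;
        }
    }
}
```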

  15. Implementation • A2: naïve Bellman-Ford using CUDA • Up to 369x slower than serial • Striped bars continue off the scale [Chart: time to compute all required SSSP results per model, Serial vs. A2; A2 times of 36.5 s (Derby), 1777.2 s (CLoHAM) and 5712.6 s (LoHAM)]

  16. Implementation • Followed an iterative cycle of performance optimisations • A3: early termination (see the sketch below) • A4: node frontier • A8: multiple origins concurrently, SSSP for each origin in the OD matrix • A10: improved load balancing via cooperative thread arrays • A11: improved array access [Chart: time to compute all required SSSP results per model for Serial, A2, A3, A4, A8, A10 and A11; Derby, CLoHAM, LoHAM]
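As an illustration of early termination in the spirit of A3 (a sketch with assumed names, not the authors' code), a device-side flag records whether any distance improved during a pass, and the host stops relaunching the relaxation kernel once the flag stays at zero:

```cuda
// Set to 1 by any thread that lowers a distance during the current pass.
__device__ int d_anyUpdate;

__global__ void bellmanFordPassFlagged(int numVertices,
                                       const int*   rowStart,
                                       const int*   adjVertex,
                                       const float* adjWeight,
                                       float*       dist)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= numVertices || isinf(dist[u])) return;

    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int v = adjVertex[e];
        float candidate = dist[u] + adjWeight[e];
        if (candidate < dist[v]) {
            dist[v] = candidate;
            d_anyUpdate = 1;  // a plain store is enough; the exact count is irrelevant
        }
    }
}

// Host loop (sketch): reset the flag, run one pass, read the flag back,
// and stop as soon as a pass makes no updates.
//
//   int h_anyUpdate;
//   do {
//       h_anyUpdate = 0;
//       cudaMemcpyToSymbol(d_anyUpdate, &h_anyUpdate, sizeof(int));
//       bellmanFordPassFlagged<<<grid, block>>>(V, rowStart, adjVertex, adjWeight, dist);
//       cudaMemcpyFromSymbol(&h_anyUpdate, d_anyUpdate, sizeof(int));
//   } while (h_anyUpdate != 0);
```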

  17. Limiting Factor (Function A) [Charts: distribution of serial runtime between SSSP and Flow for CLoHAM and LoHAM]

  18. Limiting Factor (Function A) • The limiting factor has now changed • Need to parallelise Flow Accumulation [Charts: distribution of runtime between SSSP and Flow, Serial vs. A11, for CLoHAM and LoHAM]

  19. Flow Accumulation • Shortest path + OD = flow-per-link: for each origin-destination pair, trace the route from the destination to the origin, increasing the flow value for each link visited • Parallel problem, but requires synchronised access to a shared data structure for all trips (atomic operations) [Illustration: O-D matrix alongside the per-link flow array]
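A sketch of a data-parallel flow accumulation kernel (assumed names, not the authors' implementation): for a given origin, one thread per destination walks the predecessor tree produced by SSSP back to the origin and atomically adds the O-D demand to each link it crosses. The atomics are exactly where the contention discussed on the following slides comes from:

```cuda
// One thread per destination: trace the all-or-nothing route back to the
// origin and add this O-D pair's trips to every link on the route.
__global__ void accumulateFlow(int          origin,
                               int          numVertices,
                               const int*   predecessor,  // SSSP tree for this origin
                               const int*   predEdge,     // edge used to reach each vertex
                               const float* odDemand,     // trips from origin to each vertex
                               float*       linkFlow)     // flow per edge (output)
{
    int dest = blockIdx.x * blockDim.x + threadIdx.x;
    if (dest >= numVertices || dest == origin) return;

    float trips = odDemand[dest];
    if (trips <= 0.0f) return;

    int v = dest;
    while (v != origin && predecessor[v] >= 0) {   // stop at origin or unreachable vertex
        atomicAdd(&linkFlow[predEdge[v]], trips);  // routes overlap, so this is contended
        v = predecessor[v];
    }
}
```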

  20. Flow Accumulation • Problem: A12 performs lots of atomic operations, which serialise the execution [Chart: time to compute flow values per link, Serial vs. A12; Derby, CLoHAM, LoHAM]

  21. Flow Accumulation • Problem: A12 performs lots of atomic operations, which serialise the execution • Solutions: A15, reduce the number of atomic operations by solving in batches using parallel reduction; A14, use fast hardware-supported single-precision atomics, minimising the loss of precision using multiple 32-bit summations (0.000022% total error) [Chart: time to compute flow values per link for Serial, A12, A15 (double) and A14 (single); Derby, CLoHAM, LoHAM]
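The general pattern behind "solve in batches using parallel reduction" (A15) is shown below as a hedged sketch (not the authors' kernel): each block first combines its threads' contributions in shared memory, so only one global atomicAdd per block is issued instead of one per contribution. How contributions are grouped per link is an implementation detail not shown here:

```cuda
// Block-level tree reduction followed by a single atomic per block.
#define BLOCK_SIZE 256

__global__ void reduceThenAtomicAdd(const float* contributions, int n, float* total)
{
    __shared__ float partial[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? contributions[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // One global atomic per block rather than one per element.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}
```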

  22. Integrated Results

  23. Relative Assignment Runtime Performance vs. Serial • Serial: LoHAM 12h 12m • Double precision (A15): LoHAM 35m 22s • Single precision (A14, reduced loss of precision): LoHAM 17m 32s • Hardware: Intel Core i7-4770K, 16GB DDR3, Nvidia GeForce Titan X [Chart: assignment speedup relative to serial for Serial, Multicore, A15 and A14 (single) on Derby, CLoHAM and LoHAM; up to 41.77x for LoHAM with A14 (single)]

  24. Relative Assignment Runtime Performance vs. Multicore • Multicore: LoHAM 1h 47m • Double precision (A15): LoHAM 35m 22s • Single precision (A14, reduced loss of precision): LoHAM 17m 32s • Hardware: Intel Core i7-4770K, 16GB DDR3, Nvidia GeForce Titan X [Chart: assignment speedup relative to multicore for Serial, Multicore, A15 and A14 (single) on Derby, CLoHAM and LoHAM; up to 6.14x for LoHAM with A14 (single)]

  25. GPU Computing at UoS

  26. Expertise at Sheffield • Specialists in GPU Computing and performance optimisation • Complex Systems Simulations via FLAME and FLAME GPU • Visual Simulation, Computer Graphics and Virtual Reality • Training and Education for GPU Computing

  27. Thank You • Paul Richmond, p.richmond@sheffield.ac.uk, paulrichmond.shef.ac.uk • Peter Heywood, p.heywood@sheffield.ac.uk, ptheywood.uk

      Largest model (LoHAM) results:

      Version                 | Runtime  | Speedup vs. Serial | Speedup vs. Multicore
      Serial                  | 12:12:24 | 1.00               | 0.15
      Multicore               | 01:47:36 | 6.81               | 1.00
      A15 (double precision)  | 00:35:22 | 20.70              | 3.04
      A14 (single precision)  | 00:17:32 | 41.77              | 6.14

  28. Backup Slides

  29. Benchmark Models • 3 benchmark networks • Range of sizes, small to very large • Up to 12 hour runtime

      Model  | Vertices (Nodes) | Edges (Links) | O-D trips
      Derby  | 2700             | 25385         | 547²
      CLoHAM | 15179            | 132600        | 2548²
      LoHAM  | 18427            | 192711        | 5194²

      [Chart: benchmark model performance, runtime in seconds per model and version]

  30. Edges considered per algorithm [Charts: edges considered per iteration and total edges considered, Bellman-Ford vs. D'Esopo-Pape]

  31. Vertex Frontier (A4) • Only vertices which were updated in the previous iteration can result in an update • Far fewer threads launched per iteration: up to 2500 instead of 18427 (see the sketch below) [Chart: frontier size at each iteration for source 1; Derby, CLoHAM, LoHAM]
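A sketch of how a vertex frontier can be maintained (a common pattern with assumed names, not necessarily the A4 implementation): only vertices in the current frontier are processed, and any vertex whose distance improves is appended to the next frontier through an atomic counter; duplicates simply cause a little redundant work in the next pass:

```cuda
// Frontier-based relaxation: threads cover the current frontier only, and
// build the next frontier as they go.
__global__ void relaxFrontier(int          frontierSize,
                              const int*   frontier,      // vertices updated last pass
                              const int*   rowStart,
                              const int*   adjVertex,
                              const float* adjWeight,
                              float*       dist,
                              int*         nextFrontier,
                              int*         nextFrontierSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= frontierSize) return;

    int u = frontier[idx];
    float du = dist[u];

    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int v = adjVertex[e];
        float candidate = du + adjWeight[e];
        if (candidate < dist[v]) {
            dist[v] = candidate;                        // benign race, as before
            int slot = atomicAdd(nextFrontierSize, 1);  // compacted append
            nextFrontier[slot] = v;                     // may append duplicates
        }
    }
}
```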

  32. Multiple Concurrent Origins (A8) [Charts: frontier size at each iteration for source 1 vs. frontier size for all concurrent sources; Derby, CLoHAM, LoHAM]

  33. Atomic Contention • Atomic operations are guaranteed to occur • Atomic contention: multiple threads atomically modify the same address, so the operations are serialised • atomicAdd(double) is not implemented in hardware (not yet, at the time of this work; the software fallback is sketched below) • Solutions: 1. Algorithmic change to minimise atomic contention 2. Single precision [Chart: atomic contention per iteration; CLoHAM, LoHAM]
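For reference, the usual software fallback for a double-precision atomicAdd on GPUs without hardware support (compute capability below 6.0) is a compare-and-swap loop, essentially as given in the CUDA C Programming Guide; it is renamed here to avoid clashing with the native intrinsic on newer hardware. Under heavy contention every thread may spin in this loop repeatedly, which is why it is so much slower than the hardware single-precision atomicAdd:

```cuda
// Software double-precision atomic add via atomicCAS on the 64-bit bit pattern.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread modified the value between our read and the CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```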
