Alternative GPU-friendly assignment algorithms
Paul Richmond and Peter Heywood
Department of Computer Science, The University of Sheffield
Graphics Processing Units (GPUs)
Context: GPU Performance
[Diagram: Serial Computing, 1 core, ~40 GigaFLOPS; Parallel Computing, 16 cores, 384 GigaFLOPS; Accelerated Computing, 4992 cores, 8.74 TeraFLOPS]
[Chart: ~6 hours of CPU time on 1 CPU core (~40 GigaFLOPS) vs. ~1 minute of GPU time on a GPU with 4992 cores (8.74 TeraFLOPS)]
Accelerators
• Much of the functionality of CPUs is unused for HPC: complex pipelines, branch prediction, out-of-order execution, etc.
• Ideally for HPC we want simple, low-power and highly parallel cores
An accelerated system
[Diagram: CPU with DRAM and I/O connected over PCIe to a GPU/accelerator with GDRAM and I/O]
• A co-processor, not a CPU replacement
Thinking Parallel
• Hardware considerations: high memory latency (PCIe), huge numbers of processing cores
• Algorithmic changes required: a high level of parallelism is needed; data parallel != task parallel
“If your problem is not parallel then think again”
Amdahl’s Law
S = 1 / ((1 − P) + P / N)
[Chart: speedup (S) vs. number of processors (N) for P = 25%, 50%, 90% and 95%]
• The speedup of a program is limited by the proportion that can be parallelised
• Adding processing cores gives diminishing returns
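As a quick worked example of why the curves flatten, the achievable speedup is capped by the serial fraction even with unlimited processors:

```latex
S(N) = \frac{1}{(1 - P) + P/N}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P}
% e.g. P = 0.95  =>  S_max = 1 / 0.05 = 20x
%      P = 0.50  =>  S_max = 1 / 0.50 = 2x
```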
SATALL Optimisation
Profile the Application
• 11 hour runtime
• Function A: 97.4% of runtime, 2000 calls
• Hardware: Intel Core i7-4770K 3.50 GHz, 16 GB DDR3, Nvidia GeForce Titan X
[Chart: time per function for the largest network (LoHAM); Function A takes 97.4% of the time, functions B-F take 0.3% or less each]
Function A
• Input: network (directed weighted graph) and origin-destination (O-D) matrix
• Output: traffic flow per edge
• 2 distinct steps:
  1. Single Source Shortest Path (SSSP): an all-or-nothing path for each origin in the O-D matrix
  2. Flow accumulation: apply the O-D value for each trip to each link on the route
[Chart: distribution of serial runtime (LoHAM), split between SSSP and flow accumulation]
Single Source Shortest Path
• For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost)
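The sketches on the following slides assume the network is held in a compressed sparse row (CSR) layout; this struct and its field names are illustrative assumptions, not SATALL's actual data structures.

```cuda
// Assumed CSR (compressed sparse row) layout for the directed weighted network.
// Field names are illustrative only.
struct CsrGraph {
    int    num_vertices;   // number of nodes
    int    num_edges;      // number of directed links
    int   *row_offsets;    // size num_vertices + 1: start of each vertex's out-edge range
    int   *col_indices;    // size num_edges: destination vertex of each edge
    float *weights;        // size num_edges: link cost (e.g. travel time)
};
```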
Serial SSSP Algorithm
• D’Esopo-Pape (1974): maintains a priority queue of vertices to explore (sketched below)
• Highly serial: not a data-parallel algorithm
• We must change the algorithm to match the hardware
Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
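For context, a minimal serial sketch of the D’Esopo-Pape label-correcting scheme (host C++ code, using the assumed CsrGraph layout above). The deque, whose front/back insertion depends on each vertex's history, is what makes the algorithm inherently sequential.

```cuda
#include <deque>
#include <vector>
#include <limits>

// Serial D'Esopo-Pape label-correcting SSSP (host code), for illustration only.
// Vertices that re-enter the queue are pushed to the FRONT; this ordering
// decision is made one vertex at a time and does not map to data parallelism.
std::vector<float> desopo_pape(const CsrGraph &g, int source) {
    const float INF = std::numeric_limits<float>::max();
    std::vector<float> dist(g.num_vertices, INF);
    std::vector<int> state(g.num_vertices, 0);   // 0 = never queued, 1 = in queue, 2 = previously queued
    std::deque<int> queue;

    dist[source] = 0.0f;
    queue.push_back(source);

    while (!queue.empty()) {
        int u = queue.front(); queue.pop_front();
        state[u] = 2;
        for (int e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
            int v = g.col_indices[e];
            float nd = dist[u] + g.weights[e];
            if (nd < dist[v]) {
                dist[v] = nd;
                if (state[v] == 0)      queue.push_back(v);   // first visit: back of the queue
                else if (state[v] == 2) queue.push_front(v);  // revisit: front of the queue
                state[v] = 1;
            }
        }
    }
    return dist;
}
```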
Parallel SSSP Algorithm
• Bellman-Ford algorithm (1956)
• Poor serial performance and time complexity: performs significantly more work
• Highly parallel: suitable for GPU acceleration (see the kernel sketch below)
[Chart: total edges considered, Bellman-Ford vs. D’Esopo-Pape]
Bellman, Richard. On a routing problem. 1956.
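A minimal sketch of one Bellman-Ford relaxation pass as a CUDA kernel, with one thread per vertex relaxing its outgoing edges. The names and the atomicMin-on-bit-pattern update (valid because non-negative IEEE-754 floats order the same as their unsigned integer bit patterns) are assumptions for illustration, not the implementation described in the talk.

```cuda
#include <cfloat>

// One Bellman-Ford relaxation pass: one thread per vertex relaxes all of its
// outgoing edges. Distances are stored as unsigned int bit patterns so that
// atomicMin() can be used; for non-negative floats the integer ordering
// matches the float ordering. Illustrative sketch only.
__global__ void bellman_ford_relax(const int *row_offsets,
                                   const int *col_indices,
                                   const float *weights,
                                   unsigned int *dist_bits,   // distances as ordered bit patterns
                                   int num_vertices,
                                   int *changed)              // set to 1 if any distance improved
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= num_vertices) return;

    float du = __uint_as_float(dist_bits[u]);
    if (du == FLT_MAX) return;                  // unreached vertex: nothing to relax

    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int v = col_indices[e];
        unsigned int nd = __float_as_uint(du + weights[e]);
        unsigned int old = atomicMin(&dist_bits[v], nd);
        if (nd < old) *changed = 1;             // signal the host to launch another pass
    }
}
```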
Implementation
• A2: naïve Bellman-Ford using CUDA
• Up to 369x slower than the serial implementation
[Chart: time to compute all required SSSP results per model, Serial vs. A2; the striped A2 bars continue off the scale: Derby 36.5 s, CLoHAM 1777.2 s, LoHAM 5712.6 s]
Implementation
• Followed an iterative cycle of performance optimisations:
  • A3: early termination
  • A4: node frontier
  • A8: multiple origins concurrently (SSSP for each origin in the O-D matrix)
  • A10: improved load balancing (cooperative thread array)
  • A11: improved array access
[Chart: time to compute all required SSSP results per model for Serial, A2, A3, A4, A8, A10 and A11 (Derby, CLoHAM, LoHAM)]
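To illustrate the kind of early-termination test A3 refers to, a possible host-side driver for the kernel sketched above: stop as soon as a full pass changes nothing, instead of always running the worst-case number of passes. Names follow the earlier sketches and are assumptions.

```cuda
#include <cuda_runtime.h>

// Host driver: keep relaxing until a full pass makes no improvement (early
// termination), rather than always running |V|-1 passes. Illustrative sketch only.
void sssp_bellman_ford(const CsrGraph &g, unsigned int *d_dist_bits,
                       const int *d_row_offsets, const int *d_col_indices,
                       const float *d_weights)
{
    int *d_changed;
    cudaMalloc(&d_changed, sizeof(int));

    int threads = 256;
    int blocks = (g.num_vertices + threads - 1) / threads;

    for (int pass = 0; pass < g.num_vertices - 1; ++pass) {
        int zero = 0;
        cudaMemcpy(d_changed, &zero, sizeof(int), cudaMemcpyHostToDevice);
        bellman_ford_relax<<<blocks, threads>>>(d_row_offsets, d_col_indices, d_weights,
                                                d_dist_bits, g.num_vertices, d_changed);
        int changed = 0;
        cudaMemcpy(&changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!changed) break;   // no distance improved: the solution has converged
    }
    cudaFree(d_changed);
}
```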
Limiting Factor (Function A)
[Charts: distribution of runtime for CLoHAM and LoHAM (serial), split between flow accumulation and SSSP]
Limiting Factor (Function A)
• The limiting factor has now changed: we need to parallelise flow accumulation
[Charts: distribution of runtime for CLoHAM and LoHAM, serial vs. A11, split between flow accumulation and SSSP]
Flow Accumulation
• Shortest path + O-D matrix = flow per link
• For each origin-destination pair, trace the route from the destination back to the origin, increasing the flow value for each link visited (see the sketch below)
• A parallel problem, but it requires synchronised access to a shared data structure for all trips (atomic operations)
[Diagram: example O-D matrix and per-link flow table]
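A sketch of how the atomic version of this step might look: one thread per destination walks the shortest-path tree back to the origin and atomically adds the trip demand to every link it crosses. The predecessor and demand arrays are illustrative assumptions.

```cuda
// One thread per destination: walk the shortest-path tree from the destination
// back to the origin, adding this O-D pair's demand onto every link visited.
// Concurrent threads share links, so the additions must be atomic.
// Illustrative sketch; array names are assumptions.
__global__ void accumulate_flow(const int *pred_edge,      // per vertex: edge used to reach it (-1 at the origin)
                                const int *edge_source,    // per edge: its source vertex
                                const float *od_demand,    // demand to each destination from this origin
                                float *link_flow,          // per-edge flow being accumulated
                                int num_vertices)
{
    int dest = blockIdx.x * blockDim.x + threadIdx.x;
    if (dest >= num_vertices) return;

    float demand = od_demand[dest];
    if (demand == 0.0f) return;

    int v = dest;
    while (pred_edge[v] != -1) {               // stop once the origin is reached
        int e = pred_edge[v];
        atomicAdd(&link_flow[e], demand);      // many destinations share links -> contention
        v = edge_source[e];
    }
}
```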
Flow Accumulation
• Problem (A12): lots of atomic operations serialise the execution
[Chart: time to compute flow values per link, Serial vs. A12, for Derby, CLoHAM and LoHAM]
Flow Accumulation
• Problem (A12): lots of atomic operations serialise the execution
• Solutions:
  • A15: reduce the number of atomic operations by solving in batches using parallel reduction
  • A14: use fast hardware-supported single-precision atomics, minimising the loss of precision with multiple 32-bit summations (0.000022% total error)
[Chart: time to compute flow values per link for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM]
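A sketch of the "multiple 32-bit summations" idea behind A14/A15: accumulate flows into several single-precision partial arrays (one per batch of origins) with fast float atomics, then combine the partials in double precision so the rounding error stays small. The array layout is an assumption for illustration, not the exact SATALL scheme.

```cuda
// Reduce K per-batch single-precision partial flows into a double-precision
// total. Each partial sum stays small, which limits float rounding error while
// still allowing hardware float atomics during accumulation. Illustrative sketch.
__global__ void reduce_partial_flows(const float *partial_flows,  // K * num_edges, batch-major
                                     double *total_flow,          // num_edges
                                     int num_edges,
                                     int num_batches)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    double sum = 0.0;
    for (int b = 0; b < num_batches; ++b)
        sum += (double)partial_flows[b * num_edges + e];  // sum the per-batch partials in double
    total_flow[e] = sum;
}
```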
Integrated Results
Relative Assignment Runtime Performance vs. Serial
• Serial: LoHAM 12h 12m
• Double precision (A15): LoHAM 35m 22s
• Single precision, reduced loss of precision (A14): LoHAM 17m 32s
• Hardware: Intel Core i7-4770K, 16 GB DDR3, Nvidia GeForce Titan X
[Chart: assignment speedup relative to serial for Derby, CLoHAM and LoHAM; LoHAM reaches 6.81x (multicore), 20.70x (A15, double) and 41.77x (A14, single)]
Relative Assignment Runtime Performance vs. Multicore
• Multicore: LoHAM 1h 47m
• Double precision (A15): LoHAM 35m 22s
• Single precision, reduced loss of precision (A14): LoHAM 17m 32s
• Hardware: Intel Core i7-4770K, 16 GB DDR3, Nvidia GeForce Titan X
[Chart: assignment speedup relative to multicore for Derby, CLoHAM and LoHAM; for LoHAM, serial is 0.15x, A15 (double) 3.04x and A14 (single) 6.14x]
GPU Computing at UoS
Expertise at Sheffield
• Specialists in GPU computing and performance optimisation
• Complex systems simulations via FLAME and FLAME GPU
• Visual simulation, computer graphics and virtual reality
• Training and education for GPU computing
Thank You
Paul Richmond, p.richmond@sheffield.ac.uk, paulrichmond.shef.ac.uk
Peter Heywood, p.heywood@sheffield.ac.uk, ptheywood.uk

Largest model (LoHAM) results:
Version                | Runtime  | Speedup vs. Serial | Speedup vs. Multicore
Serial                 | 12:12:24 | 1.00               | 0.15
Multicore              | 01:47:36 | 6.81               | 1.00
A15 (double precision) | 00:35:22 | 20.70              | 3.04
A14 (single precision) | 00:17:32 | 41.77              | 6.14
Backup Slides
Benchmark Models
• 3 benchmark networks, ranging in size from small to very large
• Up to 12 hour runtime
Model  | Vertices (Nodes) | Edges (Links) | O-D trips
Derby  | 547              | 2700          | 25385
CLoHAM | 2548             | 15179         | 132600
LoHAM  | 5194             | 18427         | 192711
[Chart: benchmark model performance, time (s) for the Serial, Salford and Multicore versions, per model]
Edges Considered per Algorithm
[Charts: edges considered per iteration and total edges considered, Bellman-Ford vs. D’Esopo-Pape]
Vertex Frontier (A4)
• Only vertices which were updated in the previous iteration can result in an update
• Far fewer threads are launched per iteration: up to 2500 instead of 18427
[Chart: frontier size per iteration for source 1, for Derby, CLoHAM and LoHAM]
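A sketch of a frontier-based relaxation pass in the spirit of A4: threads are launched only for the vertices updated in the previous pass, and any vertex whose distance improves is appended to the next frontier. Duplicates can appear in the frontier and a real implementation would typically deduplicate them; names are assumptions carried over from the earlier sketches.

```cuda
// Frontier-based relaxation: one thread per vertex in the current frontier,
// rather than one per vertex in the whole graph. Improved vertices are appended
// to the next frontier with an atomic counter. Illustrative sketch only.
__global__ void relax_frontier(const int *row_offsets, const int *col_indices,
                               const float *weights,
                               unsigned int *dist_bits,
                               const int *frontier, int frontier_size,
                               int *next_frontier, int *next_frontier_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int u = frontier[i];
    float du = __uint_as_float(dist_bits[u]);

    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int v = col_indices[e];
        unsigned int nd = __float_as_uint(du + weights[e]);
        if (nd < atomicMin(&dist_bits[v], nd)) {
            int slot = atomicAdd(next_frontier_size, 1);   // v improved: revisit it next pass
            next_frontier[slot] = v;
        }
    }
}
```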
Multiple Concurrent Origins (A8)
[Charts: frontier size per iteration for source 1 vs. frontier size for all concurrent sources, for Derby, CLoHAM and LoHAM]
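One way A8-style concurrency could be expressed: a 2D grid where blockIdx.y selects the origin and each origin owns a row of a num_origins x num_vertices distance array, so all origins are relaxed in a single launch and far more parallelism is exposed. This is a sketch under those assumptions, not the implementation described in the talk.

```cuda
#include <cfloat>

// Relax all origins in one launch: blockIdx.y selects the origin, and each
// origin has its own row in the distance array. Illustrative sketch only.
__global__ void bellman_ford_relax_multi(const int *row_offsets, const int *col_indices,
                                         const float *weights,
                                         unsigned int *dist_bits,   // num_origins * num_vertices
                                         int num_vertices,
                                         int *changed)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int origin = blockIdx.y;
    if (u >= num_vertices) return;

    unsigned int *dist = dist_bits + (size_t)origin * num_vertices;  // this origin's distances
    float du = __uint_as_float(dist[u]);
    if (du == FLT_MAX) return;

    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int v = col_indices[e];
        unsigned int nd = __float_as_uint(du + weights[e]);
        if (nd < atomicMin(&dist[v], nd)) *changed = 1;
    }
}
```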
Atomic Contention
• Atomic operations are guaranteed to occur
• Atomic contention: multiple threads atomically modify the same address, so the operations are serialised
• atomicAdd(double) is not implemented in hardware (not yet)
• Solutions: 1. algorithmic change to minimise atomic contention; 2. single precision
[Chart: atomic contention per iteration, for CLoHAM and LoHAM]
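For reference, the standard software fallback for atomicAdd on doubles (the compare-and-swap loop given in the NVIDIA CUDA programming guide) on GPUs without native double atomics, such as the Maxwell-generation Titan X used here. Every contended update retries the CAS, which is why heavy contention on double-precision flow accumulation serialises so badly.

```cuda
// Software double-precision atomicAdd via compare-and-swap. Under contention
// each thread may loop several times, so contended updates are serialised.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry until no other thread modified the value in between
    return __longlong_as_double(old);
}
```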