The Sliding Window Algorithm ● The “Sliding Window” algorithm sums several small sub-matrices of a matrix of values. ● The maximum of these sums is then located and its coordinates are passed out of the program. ● This is used as a calorimetry trigger – used for locating events with high-energy jets. – (The following 6 slides have been copied from Matthew's presentation)
Sliding window: serial
Sliding window: serial
Sliding window: serial
Sliding window: serial
Sliding window: serial etc...
Sliding window: parallel (5x5 border around each submatrix – use your imagination)
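As a rough illustration of the serial version pictured on the slides above, here is a minimal Python sketch. It is not the code from the talk: the function names `window_sums` and `locate_max`, the 2x2 window and the sample grid are made up for this example, and it assumes the window never wraps around the edges of the matrix.

```python
def window_sums(grid, win_rows, win_cols):
    """Sum every win_rows x win_cols sub-matrix of grid, one window at a time."""
    n_rows, n_cols = len(grid), len(grid[0])
    sums = []
    for r in range(n_rows - win_rows + 1):
        row_of_sums = []
        for c in range(n_cols - win_cols + 1):
            total = 0
            # Add up every cell covered by the window whose top-left corner is (r, c).
            for dr in range(win_rows):
                for dc in range(win_cols):
                    total += grid[r + dr][c + dc]
            row_of_sums.append(total)
        sums.append(row_of_sums)
    return sums


def locate_max(sums):
    """Return (row, col, value) of the largest window sum (first one found on ties)."""
    best = (0, 0, sums[0][0])
    for r, row in enumerate(sums):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


# Hypothetical 4x4 grid of energies with a 2x2 window.
grid = [
    [1, 2, 0, 1],
    [0, 9, 8, 1],
    [1, 7, 9, 0],
    [2, 0, 1, 1],
]
sums = window_sums(grid, 2, 2)
print(locate_max(sums))
```

In the parallel version each window sum is independent of the others, so every (r, c) iteration of the two outer loops could be given to its own GPU thread.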
Two Approaches to the Sliding Window ● CPU Algorithm (Standard) ● GPU Algorithm ● Hybrid Algorithm – Use the GPU to perform the sliding window sum – Transfer the resulting matrix of sums to CPU – Use the CPU to locate the maximum
Motivation : Time Complexity ● The time complexity of the algorithm to locate a maximum on the CPU is linear O(N) ● The time complexity of the best algorithm to do the same on the GPU is O(N log N) – Even if the GPU cores and CPU cores had the same processing speed, more calculations are required by the GPU to perform the same task.
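The slides do not say which max-finding algorithm was run on the GPU. One common parallel approach is a pairwise (tree) reduction, sketched below in plain Python as an assumption, only to show the repeated rounds and extra index bookkeeping it needs compared with a single linear scan. The name `tree_max` is hypothetical.

```python
def tree_max(values):
    """Pairwise (tree) reduction: returns (max value, its index, rounds used)."""
    # Carry (value, index) pairs so the location of the maximum survives each round.
    level = [(v, i) for i, v in enumerate(values)]
    rounds = 0
    while len(level) > 1:
        next_level = []
        # On a GPU, each pair compared in this loop would be handled by its own thread.
        for k in range(0, len(level) - 1, 2):
            a, b = level[k], level[k + 1]
            next_level.append(a if a[0] >= b[0] else b)
        # An unpaired last element passes through to the next round unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
        rounds += 1
    value, index = level[0]
    return value, index, rounds


print(tree_max([3, 7, 1, 9, 4, 9, 2, 5]))
```

A CPU scan visits each element once and needs no synchronization, whereas every round of the reduction has to finish before the next one can start.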
Small Problem : Find Max Note that at ATLAS scale problems (~5,000) this algorithm performs MUCH worse than the CPU version. Matthew wrote a new algorithm that works better at ATLAS scale problems but not as well at extreme values. The speed-up is not fully realized for small window sizes because the GPU finishes the calculation nearly as fast as new calculation commands are issued.
Small Problem : Sliding Window Speed-Up For those concerned : The sudden drops on this plot are because of my testing procedure with rectangular grids of varying dimensions. Threads are issued 1 warp (32 threads) at a time and I declared each block to be a constant 256 threads. Because of this there are problem sizes for which a large number of threads are inactive.
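The effect of the fixed block size can be estimated with a few lines of Python. This is a sketch under a stated assumption that is not taken from the slides: one thread per matrix element in a one-dimensional launch. The helper name `inactive_threads` is hypothetical.

```python
import math

BLOCK_SIZE = 256  # threads per block, as stated on the slide


def inactive_threads(n_items, block_size=BLOCK_SIZE):
    """Threads launched but left without work when n_items is padded to whole blocks."""
    blocks = math.ceil(n_items / block_size)
    return blocks * block_size - n_items


# Problem sizes just above a multiple of 256 waste almost a whole block.
for n in (256, 257, 5000):
    idle = inactive_threads(n)
    print(n, idle, f"{idle / (n + idle):.1%}")
```

Under this assumption a size of 257 launches 512 threads and leaves 255 of them idle, which is the kind of step that would show up as a sudden drop when the grid dimensions are varied.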
Large Problem : Find Max Even at extremely large sizes the speed-up offered by the GPU algorithm pales compared to the speed-up of the sliding window.
Large Problem : Sliding Window Speed-Up Here the algorithm has plateaued. The speed-up for any algorithm is limited by the number of GPU cores which can run simultaneously.
Motivation : Processing Speed vs Copy Speed ● The GPU cores are individually much slower than a CPU core. ● The copy speed from the GPU to CPU is very fast – and the result that needs to be copied is relatively small. – It may be worth the time to copy the matrix of sums to the CPU, since the CPU can locate the maximum much faster.
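A back-of-the-envelope estimate of the copy cost, using the ~15.75 GB/s host bandwidth quoted on the backup slide. The 4-byte value size is an assumption (32-bit floats), the function name `copy_time_us` is hypothetical, and the estimate ignores the fixed per-transfer overhead that real copies also carry.

```python
BANDWIDTH_BYTES_PER_S = 15.75e9  # ~15.75 GB/s to host, from the backup slide
BYTES_PER_VALUE = 4              # assumption: sums stored as 32-bit floats


def copy_time_us(n_values):
    """Idealized GPU-to-CPU transfer time in microseconds (bandwidth only)."""
    return n_values * BYTES_PER_VALUE / BANDWIDTH_BYTES_PER_S * 1e6


# Roughly ATLAS scale (~5,000 values) and ten times larger.
for n in (5_000, 50_000):
    print(n, round(copy_time_us(n), 2))
```

At ATLAS scale the bandwidth-limited part of the copy comes out on the order of a microsecond, which is why handing the find-max step to the CPU can pay off.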
Small Problem Fraction Plots
Large Problem Fraction Plots
Conclusion ● At ATLAS scale, the speed-up grows fastest and is greatest for the Hybrid algorithm (see ratio plot) ● Beyond the ATLAS scale (a factor of ~10 greater) the purely GPU algorithm becomes better.
Small Problem : Ratio At small problem sizes (current ATLAS size is around 5,000) the Hybrid algorithm provides a greater speed-up than the purely GPU-based algorithm.
Large Problem : Ratio This shows that at extremely large problem sizes the purely GPU based algorithm provides a greater speed-up than the hybrid.
Small Problem Speed-Up
Large Problem Speed-Up
Backup Slide : CUDA Card Specs ● 8 SMs (streaming multiprocessors) with 192 cores each (1,536 cores) @ ~1000 MHz each ● ~15.75 GB/s bandwidth to host