Roofline plot for RICH pattern detection algorithm on Intel's Knights Landing Platform
Christina Quast
tCSC 2017, June 9, 2017

Outline: The math behind; Memory layout; Performance improvements; Results

Intel Xeon Phi Knights Landing
Figure: KNL (source: https://www.extremetech.com/wp-content/uploads/2016/04/KnightsLanding.png)

Memory rearrangement
Theoretical performance
- Data moved per photon: 40 B input + 12 B output = 52 B
- DRAM bandwidth: ≈ 80 GBps
- MCDRAM (High Bandwidth Memory on KNL) bandwidth: ≈ 340 GBps
- Theoretically best performance in time per photon: 52.0 B / 340 GBps = 0.153 ns

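The bandwidth bound above is a single division, sketched here in C++ for reference. The function and constant names are hypothetical; only the byte counts and bandwidths come from the slide.

```cpp
#include <cassert>

// Lower bound on the time per photon in nanoseconds for a purely
// bandwidth-bound kernel. 1 GB/s moves 1 byte per nanosecond, so
// bytes / (GB/s) is already expressed in ns.
inline double min_ns_per_photon(double bytes_per_photon, double bandwidth_gbps) {
    assert(bytes_per_photon > 0.0 && bandwidth_gbps > 0.0);
    return bytes_per_photon / bandwidth_gbps;
}

// Values from the slide: 40 B input + 12 B output per photon.
const double kBytesPerPhoton = 40.0 + 12.0;
const double kMcdramGBps = 340.0;  // approximate MCDRAM bandwidth
const double kDramGBps = 80.0;     // approximate DRAM bandwidth
```

Calling min_ns_per_photon(kBytesPerPhoton, kMcdramGBps) gives about 0.153 ns, the limit quoted above; the same call with kDramGBps gives 0.65 ns.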
Memory rearrangement
Memory layout: AOS to SOA
- Arrange memory for better unit strides
Figure: AOS to SOA (source: http://www.spuify.co.uk/?p=645)

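A minimal sketch of the two layouts, assuming a plain 3D point of floats. The type and function names are hypothetical and are not taken from the RICH code.

```cpp
#include <cassert>
#include <vector>

// Array of structures (AoS): x, y and z of one point are adjacent in memory,
// so a loop that reads only x jumps over y and z (stride of 3 floats).
struct PointAoS {
    float x, y, z;
};

// Structure of arrays (SoA): all x values are contiguous, likewise y and z,
// so a loop over one coordinate has unit stride.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Convert an AoS container into the SoA layout.
inline PointsSoA to_soa(const std::vector<PointAoS>& in) {
    PointsSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    for (const PointAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    assert(out.x.size() == in.size());
    return out;
}

// Unit-stride reduction over a single coordinate in the SoA layout.
inline float sum_x(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;
    return s;
}
```

With SoA, a loop such as sum_x touches only the bytes it needs and reads them consecutively, which is the unit-stride access pattern the slide refers to.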
Performance improvements
Memory and cacheline optimizations
- Alignment of variables to 64-byte boundaries (cacheline size)
- Vectorization through the vectorclass library (basically an abstraction over intrinsics)
- The const keyword helps the compiler to optimize

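The alignment point can be illustrated with standard C++ alone. This sketch uses alignas rather than the vectorclass library, and all names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // cacheline size in bytes (from the slide)

// 16 floats are exactly 64 bytes: one cacheline and one 512-bit vector.
// alignas places every instance on a 64-byte boundary.
struct alignas(kCacheLine) FloatBlock {
    float v[16];
};

static_assert(sizeof(FloatBlock) == kCacheLine, "one block fills one cacheline");
static_assert(alignof(FloatBlock) == kCacheLine, "blocks start on a cacheline boundary");

// True if the address is a multiple of the cacheline size.
inline bool is_cacheline_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// The const reference documents that the block is read-only here.
inline float sum(const FloatBlock& b) {
    assert(is_cacheline_aligned(&b));
    float s = 0.0f;
    for (std::size_t i = 0; i < 16; ++i) s += b.v[i];
    return s;
}
```

Because each block starts on a cacheline boundary, loading it never straddles two cachelines.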
Performance improvements
Approximate functions
- Approximate inverse functions are up to 10 times faster
- sqrt() replaced with approx_recipr(approx_rsqrt())
- Division replaced by approx_recipr()

Instruction    uops    reciprocal throughput
VSQRT14PS      18      16
VRSQRT14PS     1       3
VDIVPS         18      32
VRCP28PS       1       3

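The two rewrites rest on the identities sqrt(x) = 1 / (1 / sqrt(x)) and a / b = a * (1 / b). The sketch below shows only that restructuring: the scalar stand-in functions are hypothetical and compute full-precision results, whereas the hardware instructions return reduced-precision estimates.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar stand-ins for the approximate vector operations.
// They are exact here; the real instructions trade precision for speed.
inline float approx_rsqrt_standin(float x) { return 1.0f / std::sqrt(x); }
inline float approx_recipr_standin(float x) { return 1.0f / x; }

// sqrt(x) rewritten as the reciprocal of the reciprocal square root.
inline float sqrt_via_rsqrt(float x) {
    assert(x > 0.0f);
    return approx_recipr_standin(approx_rsqrt_standin(x));
}

// a / b rewritten as a multiplication by the reciprocal of b.
inline float div_via_recipr(float a, float b) {
    assert(b != 0.0f);
    return a * approx_recipr_standin(b);
}
```

Whether the reduced precision is acceptable depends on the accuracy the reconstruction needs, so results should be compared against the exact version.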
Performance improvements
Mathematical improvements
- Removed some divisions
- Extracted multiplication factors
- Term cos(arcsin(sin(β))) replaced by √(1 − sin²(β))
- Removed cubic root (very expensive) and quartic solver, replaced with Newton

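Two of these rewrites can be sketched generically. The first function uses the identity cos(arcsin(s)) = √(1 − s²); the second runs Newton's method on a general quartic. The Quartic struct, the coefficient layout and the fixed iteration count are assumptions for illustration, not the polynomial or stopping rule of the RICH code.

```cpp
#include <cassert>
#include <cmath>

// cos(asin(s)) equals sqrt(1 - s*s) for s in [-1, 1]: two transcendental
// calls collapse into one multiply, one subtract and one square root.
inline double cos_of_asin(double s) {
    assert(s >= -1.0 && s <= 1.0);
    return std::sqrt(1.0 - s * s);
}

// Quartic a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0 (hypothetical container).
struct Quartic {
    double a4, a3, a2, a1, a0;
};

// Polynomial value via Horner's scheme.
inline double eval(const Quartic& q, double x) {
    return (((q.a4 * x + q.a3) * x + q.a2) * x + q.a1) * x + q.a0;
}

// First derivative via Horner's scheme.
inline double eval_derivative(const Quartic& q, double x) {
    return ((4.0 * q.a4 * x + 3.0 * q.a3) * x + 2.0 * q.a2) * x + q.a1;
}

// A fixed number of Newton steps x <- x - f(x) / f'(x) from a starting guess.
inline double newton_root(const Quartic& q, double x0, int iterations) {
    double x = x0;
    for (int i = 0; i < iterations; ++i) {
        const double d = eval_derivative(q, x);
        if (d == 0.0) break;  // flat tangent: no further update possible
        x -= eval(q, x) / d;
    }
    return x;
}
```

A fixed iteration count keeps the loop free of data-dependent branches, which suits vectorized code, but it only converges when the starting guess is close enough to the wanted root.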
Performance improvements
MCDRAM (340 GBps)
- numactl binds execution to CPUs and to MCDRAM memory
Figure: Quadrant clustering mode [2]

Speedup
Nanoseconds per photon
Theoretical limit: 52.0 B / 340 GBps = 0.153 ns per photon

Improvement                            Execution time per photon    Speedup over baseline code with OMP
Baseline code without OMP              1000.26 ns                   -
(From here: always OpenMP with 256 threads)
Baseline code                          7.13 ns                      -
Pinned on MCDRAM (with numactl)        6.63 ns                      1.07x
Mathematical improvement               4.67 ns                      1.53x
Vectorization and memory alignment     0.933 ns                     7.64x
All three                              0.195 ns                     36.47x

Speedup
Roofline plot
Figure: Roofline plot with mathematical improvements

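For readers unfamiliar with the roofline model, the bound it draws can be sketched as below. The function names are hypothetical, and no peak compute rate is taken from the slides, so any peak value passed in is the caller's assumption.

```cpp
#include <algorithm>
#include <cassert>

// Roofline model: attainable performance is capped either by the compute
// peak or by memory bandwidth times arithmetic intensity.
//   peak_gflops          : peak compute rate in GFLOP/s
//   bandwidth_gbps       : sustained memory bandwidth in GB/s
//   intensity_flop_per_b : arithmetic intensity in FLOP per byte moved
inline double attainable_gflops(double peak_gflops, double bandwidth_gbps,
                                double intensity_flop_per_b) {
    assert(peak_gflops > 0.0 && bandwidth_gbps > 0.0 && intensity_flop_per_b >= 0.0);
    return std::min(peak_gflops, bandwidth_gbps * intensity_flop_per_b);
}

// Ridge point: the intensity at which a kernel stops being bandwidth-bound.
inline double ridge_intensity(double peak_gflops, double bandwidth_gbps) {
    assert(peak_gflops > 0.0 && bandwidth_gbps > 0.0);
    return peak_gflops / bandwidth_gbps;
}
```

Left of the ridge point a kernel sits on the sloped bandwidth roof, so raising the bandwidth (for example from DRAM to MCDRAM) lifts the bound; right of it only the compute peak matters.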
Speedup
Speedup and Efficiency
Figure: Strong scaling speedup (left) and efficiency (right) for 10485760 photons and OMP workgroup size of 128. Horizontal axis: #OMP_threads; vertical axes: speedup (left) and efficiency (right).

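The two plotted quantities are assumed here to follow the usual strong-scaling definitions: speedup is the single-thread time divided by the time with p threads, and efficiency is speedup divided by p. A minimal sketch with hypothetical function names:

```cpp
#include <cassert>

// Strong scaling: the problem size is fixed while the thread count varies.
// speedup(p) = T(1) / T(p)
inline double speedup(double t_one_thread, double t_p_threads) {
    assert(t_one_thread > 0.0 && t_p_threads > 0.0);
    return t_one_thread / t_p_threads;
}

// efficiency(p) = speedup(p) / p; 1.0 means perfect scaling.
inline double efficiency(double t_one_thread, double t_p_threads, int threads) {
    assert(threads > 0);
    return speedup(t_one_thread, t_p_threads) / threads;
}
```

For example, a run that takes 100 time units on one thread and 25 on eight threads has a speedup of 4 and an efficiency of 0.5.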
References
[1] R. Forty and O. Schneider. RICH pattern recognition. LHCB/98-040, 30 April 1998.
[2] A. Vladimirov and R. Asai. Clustering modes in Knights Landing processors: Developer's guide. Colfax International, May 11, 2016.

Cherenkov angle
Figure: Cherenkov angle calculation [1]

Struct

    #include <array>
    #include <cstddef>

    // XYZPoints<T, DIM> is defined elsewhere (not shown on this slide).
    // DIM defaults to 16; for T = float, 16 values fill one 64-byte cacheline.
    template <typename T, std::size_t DIM = 16>
    class PhotonReflection {
    public:
        typedef typename XYZPoints<T, DIM>::vec_type vector;

        XYZPoints<T, DIM> emissPnt;
        XYZPoints<T, DIM> centOfCurv;
        XYZPoints<T, DIM> virtDetPoint;
        XYZPoints<T, DIM> sphReflPoint;
        std::array<T, DIM> radius;
    };