
RICH Cherenkov angle status report March 2017, Christina Quast



  1. RICH Cherenkov angle status report March 2017. Christina Quast, March 6, 2017. Sections: Memory layout; Performance improvements.

  2. Nanoseconds per photon. Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns. For 33554432 photons:
     - Old solver (float): 1000.26 ns
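
The theoretical limit on this slide is the bytes moved per photon divided by the memory bandwidth. A minimal sketch of that arithmetic follows; the function name `bandwidthLimitNs` is hypothetical, and it assumes 1 GB/s means 1e9 bytes per second.

```cpp
#include <cassert>

// Lower bound on the time per photon, in nanoseconds, for a kernel that is
// limited purely by memory bandwidth. Hypothetical helper, not from the slides.
// Assumption: 1 GB/s = 1e9 bytes per second.
inline double bandwidthLimitNs(double bytesPerPhoton, double bandwidthGBps) {
    assert(bandwidthGBps > 0.0);
    // bytes / (1e9 bytes/s) gives seconds; converting seconds to nanoseconds
    // multiplies by 1e9, so the two factors cancel.
    return bytesPerPhoton / bandwidthGBps;
}
```

With the slide's figures, `bandwidthLimitNs(52.0, 340.0)` evaluates to roughly 0.153 ns per photon.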

  3. Memory layout before [figure]

  4. Memory layout before 2 [figure]

  5. Memory layout after [figure]

  6. Nanoseconds per photon. Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns. For 33554432 photons, 1024 wg size, 256 threads:
     - Old solver (float): 1000.26 ns
     - Agner Fog’s Vectorclass: 1.04248 ns

  7. Diff adding 64-byte alignment attributes:

          --- a/QuarticSolverCacheline.h
          +++ b/QuarticSolverCacheline.h
          - T reflPointX;
          - T reflPointY;
          - T reflPointZ;
          + T reflPointX __attribute__((__aligned__(64)));
          + T reflPointY __attribute__((__aligned__(64)));
          + T reflPointZ __attribute__((__aligned__(64)));
            reflPointX = ex + CoCX;
            reflPointY = ey + CoCY;
          @@
            // TODO: align 64
            // FIXME: add const everywhere?
          - VECT emissionPointVecX;
          + VECT emissionPointVecX __attribute__((__aligned__(64)));
            emissionPointVecX.load_a(&data.emissPnt.x()[0]);
          - VECT emissionPointVecY;
          + VECT emissionPointVecY __attribute__((__aligned__(64)));
            emissionPointVecY.load_a(&data.emissPnt.y()[0]);
          - VECT emissionPointVecZ;
          + VECT emissionPointVecZ __attribute__((__aligned__(64)));
            emissionPointVecZ.load_a(&data.emissPnt.z()[0]);
          - VECT CoCX;
          + VECT CoCX __attribute__((__aligned__(64)));
            CoCX.load_a(&data.centOfCurv.x()[0]);
          - VECT CoCY;
          + VECT CoCY __attribute__((__aligned__(64)));
            CoCY.load_a(&data.centOfCurv.y()[0]);
          - VECT CoCZ;

  8. Diff adding 64-byte alignment attributes, continued:

          + VECT CoCZ __attribute__((__aligned__(64)));
            CoCZ.load_a(&data.centOfCurv.z()[0]);
          @@
            VECT e2 = evecX*evecX + evecY*evecY + evecZ*evecZ;
            // vector from mirror centre of curvature to virtual detec
          - VECT virtDetPointVecX;
          + VECT virtDetPointVecX __attribute__((__aligned__(64)));
            virtDetPointVecX.load_a(&data.virtDetPoint.x()[0]);
          - VECT virtDetPointVecY;
          + VECT virtDetPointVecY __attribute__((__aligned__(64)));
            virtDetPointVecY.load_a(&data.virtDetPoint.y()[0]);
          - VECT virtDetPointVecZ;
          + VECT virtDetPointVecZ __attribute__((__aligned__(64)));
            virtDetPointVecZ.load_a(&data.virtDetPoint.z()[0]);
            // const Vector dvec( virtDetPoint - CoC );
          @@ -220,7 +220,7 @@ namespace RichCacheline
          - VECT radius;
          + VECT radius __attribute__((__aligned__(64)));
            radius.load_a(&data.radius[0]);
          - VECT reflPointX;
          - VECT reflPointY;
          - VECT reflPointZ;
          + VECT reflPointX __attribute__((__aligned__(64)));
          + VECT reflPointY __attribute__((__aligned__(64)));
          + VECT reflPointZ __attribute__((__aligned__(64)));
          --- a/main.cpp

  9. Diff adding 64-byte alignment attributes, continued (main.cpp and vectype.h):

          +++ b/main.cpp
          @@ -227,8 +227,8 @@ int main(int argc, char** argv)
          - VECTYPE::PhotonReflections<float> dataV0_vect;
          - VECTYPE::PhotonReflections<float> dataV1_vect;
          + VECTYPE::PhotonReflections<float> dataV0_vect __attribute__((__aligned__(64)));
          + VECTYPE::PhotonReflections<float> dataV1_vect __attribute__((__aligned__(64)));
          diff --git a/vectype.h b/vectype.h
          index 75c05bf..72db553 100644
          --- a/vectype.h
          +++ b/vectype.h
            template <typename T, std::size_t DIM = 16>
          - using PhotonReflections = std::vector<PhotonReflection<T, DIM>>;
          + using PhotonReflections = std::vector<PhotonReflection<T, DIM>, aligned_alloca
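
The diff above aligns stack variables with a GCC attribute and switches the vector to an aligned allocator, whose definition is cut off on the slide. The sketch below shows both ideas in standard C++17 under stated assumptions: `AlignedAllocator`, `isAligned64`, `localIsAligned` and `vectorIsAligned` are hypothetical names, and the allocator is not the one used in the slides. Note that an alignment attribute on a `std::vector` object only aligns the vector object itself; the element storage on the heap is aligned by the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// True if p lies on a 64-byte boundary (assumed cache-line size).
inline bool isAligned64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Stack variables: standard alignas(64) has the same effect as the
// __attribute__((__aligned__(64))) used in the diff.
inline bool localIsAligned() {
    alignas(64) float lanes[16] = {};
    return isAligned64(lanes);
}

// Heap storage: hypothetical minimal allocator that hands out memory aligned
// to Alignment bytes via the C++17 aligned operator new.
template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
    static_assert(Alignment >= alignof(T), "Alignment too small for T");
    using value_type = T;

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// All instances are interchangeable because the allocator is stateless.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return false;
}

// Allocates n floats through the allocator and checks the data pointer.
inline bool vectorIsAligned(std::size_t n) {
    assert(n > 0);
    std::vector<float, AlignedAllocator<float>> v(n, 0.0f);
    return isAligned64(v.data());
}
```

Aligned storage is what makes aligned loads such as `load_a` in the diff valid; with an unaligned address they may fault.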

  10. Nanoseconds per photon. Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns. For 33554432 photons, 1024 wg size, 256 threads:
      - Old solver (float): 1000.26 ns
      - Agner Fog’s Vectorclass: 1.04248 ns
      - Aligned allocator: 0.946315 ns

  11. Nanoseconds per photon. Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns. For 33554432 photons, 1024 wg size, 256 threads:
      - Old solver (float): 1000.26 ns
      - Agner Fog’s Vectorclass: 1.04248 ns
      - Aligned allocator: 0.946315 ns
      - Const variables: 0.932545 ns

  12. Diff replacing divisions and square roots with approximate functions:

          --- a/QuarticSolverCacheline.h
          +++ b/QuarticSolverCacheline.h
          @@ -81,8 +81,8 @@ namespace RichCacheline
          - const T divnorm = 1.0f/norm;
          - const T norm_sqrt = sqrt(norm);
          + const T divnorm = approx_recipr(norm);
          + const T norm_sqrt = approx_recipr(approx_rsqrt(norm));
            nx *= divnorm;
            ny *= divnorm;
            nz *= divnorm;
          @@
          - const auto enorm = radius/e;
          + const auto enorm = radius* approx_recip
          @@
          - VECT cosgamma2 = (evecDvec * evecDvec)/ed2;
          + VECT cosgamma2 = (evecDvec * evecDvec) * approx_recipr(ed2);
          - const VECT e = sqrt(e2);
          - const VECT d = sqrt(d2);
          + const VECT e = approx_recipr(approx_rsqrt(e2));
          + const VECT d = approx_recipr(approx_rsqrt(d2));
          - const VECT singamma = sqrt(1.0f - cosgamma2));
          - const VECT cosgamma = approx_recipr(approx_rsqrt(cosgamma2));
          + const VECT singamma = approx_recipr(approx_rsqrt(1.0f - cosgamma2));
          + const VECT cosgamma = approx_recipr(approx_rsqrt(cosgamma2));
          @@
            const VECT maxval = std::numeric_limits<SKALART>::max();
          - const VECT inv_a0 = ((a0 > 0)? 1.0f/a0: maxval);
          + const VECT inv_a0 = ((a0 > 0)? approx_recipr(a0): maxval);
          @@
          - const auto toberooted = (abs(R) + sqrt(abs(R2-Q3)));
          + const auto toberooted = (abs(R) + approx_recipr(approx_rsqrt(abs(R2-Q3))));

  13. Diff replacing divisions and square roots with approximate functions, continued:

            // FIXME: or first into a plain array, then load?
            // FIXME: also for double?
          @@
            const auto A = sgnR * rooted;
            PR(A);
          - const auto B = Q / A;
          + const auto B = Q * approx_recipr(A);
          - const auto u1 = -0.5 * (A + B) - rc / 3.0;
          + const auto u1 = -0.5 * (A + B) - rc * (1.0f / 3.0f);
            // FIXME: saturated or not?
            // const const auto u2 = UU * abs_saturated(A-B);
            const auto u2 = UU * abs(A-B);
          - const auto V = sqrt(u1*u1 + u2*u2);
          + const auto V = approx_recipr(approx_rsqrt(u1*u1 + u2*u2));
            // std::complex<TYPE> w3 = ( abs_satured(V) != 0.0 ? (TYPE)( qq * -0.125 ) / V :
            // std::complex<TYPE>(0,0) );
            // FIXME: why abs saturated when compared to 0.0 ??
          - const auto w3r = ((V != 0.0)? (qq * -0.125)/V : 0.0);
          + const auto w3r = ((V != 0.0)? (qq * -0.125)* approx_recipr(V) : 0.0);
            // TYPE res = std::real(w1) + std::real(w2) + std::real(w3) - (r4*a);
          - const auto res = sqrt((u1+V)*2) + w3r - (r4*a);
          + const auto res = approx_recipr(approx_rsqrt((u1+V)*2)) + w3r - (r4*a);
            // return the final result
            // FIXME: std::move ?
            const auto r = ((res > 1.0)? 1.0: ((res < -1.0)? -1.0: res));
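
The diff above rewrites `sqrt(x)` as `approx_recipr(approx_rsqrt(x))` and `a / b` as `a * approx_recipr(b)`, trading precision for speed. The sketch below illustrates the same substitution on single floats. It is not the Vectorclass code: `approxRecip`, `approxRsqrt`, `approxSqrt` and `sqrtRelError` are hypothetical names, the fast path assumes an x86 target with SSE, and other targets fall back to exact arithmetic.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Approximate 1/x. With SSE this maps to the RCPSS instruction, which Intel
// documents with a relative error of at most 1.5 * 2^-12.
inline float approxRecip(float x) {
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
#else
    return 1.0f / x;  // exact fallback on non-SSE targets
#endif
}

// Approximate 1/sqrt(x). With SSE this maps to RSQRTSS (same error bound).
inline float approxRsqrt(float x) {
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / std::sqrt(x);  // exact fallback on non-SSE targets
#endif
}

// sqrt(x) = 1 / (1 / sqrt(x)): the substitution applied throughout the diff.
inline float approxSqrt(float x) {
    return approxRecip(approxRsqrt(x));
}

// Relative error of approxSqrt against std::sqrt, for x > 0.
inline float sqrtRelError(float x) {
    assert(x > 0.0f);
    const float exact = std::sqrt(x);
    return std::fabs(approxSqrt(x) - exact) / exact;
}
```

Chaining two approximations compounds their errors, so the result carries only about 11 correct bits; whether that is acceptable for the Cherenkov angle is a physics question the timing numbers alone do not answer.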

  14. Nanoseconds per photon. Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns. For 33554432 photons, 1024 wg size, 256 threads:
      - Old solver (float): 1000.26 ns
      - Agner Fog’s Vectorclass: 1.04248 ns
      - Aligned allocator: 0.946315 ns
      - Const variables: 0.932545 ns
      - Approx. functions: 0.851242 ns
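
To put the measurements above in proportion, the gain of each step and the remaining distance to the bandwidth limit can be computed directly. This is a small sketch with hypothetical helper names (`relativeGain`, `factorAboveLimit`); the input numbers are the ones quoted on the slide.

```cpp
#include <cassert>

// Fractional reduction in time per photon when going from beforeNs to afterNs.
inline double relativeGain(double beforeNs, double afterNs) {
    assert(beforeNs > 0.0);
    return (beforeNs - afterNs) / beforeNs;
}

// How many times slower a measurement is than the theoretical limit.
inline double factorAboveLimit(double measuredNs, double limitNs) {
    assert(limitNs > 0.0);
    return measuredNs / limitNs;
}
```

With the slide's values, the aligned allocator saves about 9.2% over the plain Vectorclass version, const variables about 1.5%, and the approximate functions about 8.7%. The final 0.851242 ns per photon is still roughly 5.6 times the 0.153 ns limit.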

  15. Diff adding prefetch hints:

          --- a/QuarticSolverCacheline.h
          +++ b/QuarticSolverCacheline.h
          @@ -142,6 +142,18 @@ namespace RichCacheline {
          + __builtin_prefetch(&(((&data)+0)->radius[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->emissPnt.x())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->emissPnt.y())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->emissPnt.z())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->centOfCurv.x())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->centOfCurv.y())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->centOfCurv.z())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->virtDetPoint.x())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->virtDetPoint.y())[0]), 0, 3);
          + __builtin_prefetch(&(((&data+1)->virtDetPoint.z())[0]), 0, 3);
            VECT emissionPointVecX __attribute__((__aligned__(64)));
            emissionPointVecX.load_a(&data.emissPnt.x()[0]);
            VECT emissionPointVecY __attribute__((__aligned__(64)));
          @@
          + __builtin_prefetch(&data.sphReflPoint.x()[0], 1, 0);
          + __builtin_prefetch(&data.sphReflPoint.y()[0], 1, 0);
          + __builtin_prefetch(&data.sphReflPoint.z()[0], 1, 0);
            reflPointX.store_a(&data.sphReflPoint.x()[0]);
            reflPointY.store_a(&data.sphReflPoint.y()[0]);
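
The diff above uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)` to request the next element's inputs while the current one is processed (`rw = 0` for reads, `locality = 3` to keep the line in all cache levels) and to announce the upcoming stores (`rw = 1`, `locality = 0`). The sketch below shows the read pattern on a plain array. `sumWithPrefetch` is a hypothetical function, not code from the slides, and whether prefetching helps at all depends on the hardware prefetcher and the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums data in blocks of `block` floats, issuing a prefetch for the next
// block before working on the current one. Requires GCC or Clang.
inline float sumWithPrefetch(const std::vector<float>& data, std::size_t block) {
    assert(block > 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < data.size(); i += block) {
        const std::size_t next = i + block;
        if (next < data.size()) {
            // rw = 0: read access; locality = 3: keep in all cache levels.
            __builtin_prefetch(&data[next], 0, 3);
        }
        // Process the current block while the next one is being fetched.
        const std::size_t end = next < data.size() ? next : data.size();
        for (std::size_t j = i; j < end; ++j) {
            total += data[j];
        }
    }
    return total;
}
```

The prefetch is only a hint: it never changes the result, so the function returns the same sum with or without it.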
