
Algorithmic time, energy, and power on candidate HPC compute building blocks

Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. Presented at IPDPS’14, May 20, 2014.


  1. Algorithmic time, energy, and power on candidate HPC compute building blocks. Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. May 20, 2014. Presented at IPDPS’14.

  2. Contributions
     • Energy roofline (IPDPS’13) quantifies the energy cost of computation relative to data movement
     • Power “cap” limits performance
     • μbenchmark suite for testing different levels of the memory hierarchy
     • Empirical data on systems ranging from server- to mobile-class platforms
     • Analysis using our methodology

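  3. Roofline in energy (IPDPS’13). Machine model: an xPU executes W flops at τ_flop = time per flop; it is backed by a fast memory of total size Z; Q memory operations (mops) move data between slow memory and fast memory at τ_mem = time per mop.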
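The machine model on this slide can be written down in a few lines. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes that flops and mops overlap perfectly, so time is the larger of the two components, and that energy is the sum of flop energy, mop energy, and a constant power drawn for the whole run. The symbols `eps_flop`, `eps_mem`, and `pi0` do not appear on the slide and are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    tau_flop: float   # time per flop (from the slide)
    tau_mem: float    # time per mop (from the slide)
    eps_flop: float   # energy per flop (assumed symbol, not on the slide)
    eps_mem: float    # energy per mop (assumed symbol, not on the slide)
    pi0: float = 0.0  # constant power (assumed symbol, not on the slide)


def time_cost(m: Machine, W: float, Q: float) -> float:
    """Time for W flops and Q mops, assuming perfect overlap."""
    return max(W * m.tau_flop, Q * m.tau_mem)


def energy_cost(m: Machine, W: float, Q: float) -> float:
    """Energy for W flops and Q mops, plus constant power over the run."""
    return W * m.eps_flop + Q * m.eps_mem + m.pi0 * time_cost(m, W, Q)
```

With this form, a computation is compute-bound in time when `W * tau_flop` exceeds `Q * tau_mem`, while every flop and every mop always contributes to energy.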

  4–5. Roofline in energy (IPDPS’13). [Figure: relative performance (log scale, 1/32 to 1) versus intensity (FLOP:Byte, 1/2 to 128), showing the “Roofline” [1] in GFLOP/s and the “Arch line” in GFLOP/J; intensities 3.6 and 14 are marked on the x-axis.]
     [1] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
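The two curves in this figure can be approximated with closed forms. The sketch below assumes that the marked intensities 3.6 and 14 are the time balance and the energy balance of the plotted machine (an interpretation, since the slide text does not label them), and that constant power is zero. Under those assumptions the roofline reaches its peak at the time balance, while the arch line reaches only half of its peak at the energy balance.

```python
B_TAU = 3.6   # assumed: time balance in flop:byte (x-axis mark on the slide)
B_EPS = 14.0  # assumed: energy balance in flop:byte (x-axis mark on the slide)


def roofline(intensity: float, b_tau: float = B_TAU) -> float:
    """Performance (flop/s) relative to peak: min(1, I / B_tau)."""
    return min(1.0, intensity / b_tau)


def arch_line(intensity: float, b_eps: float = B_EPS) -> float:
    """Energy efficiency (flop/J) relative to peak: 1 / (1 + B_eps / I)."""
    return 1.0 / (1.0 + b_eps / intensity)


if __name__ == "__main__":
    # Same intensity range as the slide's x-axis.
    for i in (0.5, 1, 2, 4, 8, 16, 32, 64, 128):
        print(f"I={i:6.1f}  roofline={roofline(i):.3f}  arch line={arch_line(i):.3f}")
```

The roofline has a sharp knee because time takes the maximum of the two components; the arch line is smooth because energy adds them.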

  6. Roofline in energy (IPDPS’13). [Figure: power, relative to flop-power, versus intensity (flop:byte, 1/2 to 512); intensities 3.6 and 14 are marked on the x-axis and power levels 1.0, 4.0, and 5.0 are labeled; legend: power dissipated by compute units, power dissipated by memory units.]
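Average power is energy divided by time, so the shape of this curve follows from the time and energy models. The sketch below is a hedged reconstruction, not the authors' code: it assumes zero constant power, a time balance of 3.6, and an energy balance of 14.4 (the slide shows the rounded mark 14; the value 14.4 is an assumption). With those values the model gives about 4.0 at very low intensity, a maximum of about 5.0 at the time balance, and about 1.0 at very high intensity, which appears consistent with the labels on the slide.

```python
def relative_power(intensity: float, b_tau: float, b_eps: float) -> float:
    """Average power divided by flop-power (energy per flop / time per flop)."""
    energy = 1.0 + b_eps / intensity     # energy per flop, in units of eps_flop
    time = max(1.0, b_tau / intensity)   # time per flop, in units of tau_flop
    return energy / time


if __name__ == "__main__":
    b_tau, b_eps = 3.6, 14.4  # assumed balance values (see the text above)
    for i in (0.5, 1, 2, 3.6, 8, 16, 32, 64, 128, 256, 512):
        print(f"I={i:6.1f}  power/flop-power={relative_power(i, b_tau, b_eps):.2f}")
```

The peak sits at the time balance because that is where compute units and memory units are both fully busy.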

  7. Roofline in energy (IPDPS’13)

  8–9. Roofline in energy (IPDPS’13). [Figure, two panels: NVIDIA GTX 580 (GPU-only) and Intel i7-950 (Desktop); power (normalized to flop+const) versus intensity (FLOP:Byte, 1/4 to 64), with absolute power marked in watts.] Slide 9 adds the annotation: power cap prevents peak performance.

  10–13. Power Cap
     • Power is determined by performance
     • Performance is limited by power
     • “Usable” power (added in slides 11–13; slide 13 also adds a ✗ mark)

  14–25. Power Cap. [Figure, repeated across animation builds: power, relative to flop-power, versus intensity (flop:byte, 1/2 to 512); intensities 3.6 and 14 are marked on the x-axis and power levels 1.0, 4.0, and 5.0 are labeled.]
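One way to express the "performance is limited by power" view is to add a third term to the time model: the time needed to spend the required energy at no more than the usable power. The sketch below is an assumption-laden illustration; the slides do not confirm that this is the authors' exact formulation. The parameter `cap` (usable power divided by flop-power) is a hypothetical name, and the balance values 3.6 and 14.4 are assumed as well.

```python
def capped_time(intensity: float, b_tau: float, b_eps: float, cap: float) -> float:
    """Time per flop (in units of tau_flop) under a usable-power cap.

    cap is usable power divided by flop-power (hypothetical parameter).
    """
    energy = 1.0 + b_eps / intensity  # energy per flop, in units of eps_flop
    return max(1.0, b_tau / intensity, energy / cap)


def capped_performance(intensity: float, b_tau: float, b_eps: float, cap: float) -> float:
    """Performance relative to peak flop/s."""
    return 1.0 / capped_time(intensity, b_tau, b_eps, cap)


def capped_power(intensity: float, b_tau: float, b_eps: float, cap: float) -> float:
    """Average power relative to flop-power; never exceeds cap."""
    energy = 1.0 + b_eps / intensity
    return energy / capped_time(intensity, b_tau, b_eps, cap)


if __name__ == "__main__":
    b_tau, b_eps, cap = 3.6, 14.4, 4.0  # assumed values for illustration
    for i in (0.5, 1, 2, 3.6, 8, 16, 32, 64, 128):
        print(f"I={i:6.1f}  perf={capped_performance(i, b_tau, b_eps, cap):.2f}"
              f"  power={capped_power(i, b_tau, b_eps, cap):.2f}")
```

With these assumed values, the uncapped model would need 5.0 times flop-power at intensity 3.6; a cap of 4.0 stretches the run time by a factor of 1.25, so peak performance is not reached there.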

  26. μbenchmark Suite
     • Intensity
       - flops, bytes
     • Cache
       - shared memory
       - L1, L2, etc.
     • Random access
     ×
     • Performance
     • Energy
     ×
     • x86 CPU
       - Intel, AMD
     • ARM CPU
       - A9, A15
     • GPU
       - NVIDIA, AMD, ARM
     • Xeon Phi
     http://hpcgarage.org/archline

  27–29. μbenchmark Suite
     • CPU Intensity μbenchmark for Ivy Bridge
       – aligned memory loads
       – 1 MUL and 1 ADD AVX instructions issued per cycle
       – maximize AVX register usage to increase ILP
       – parallelized over all available cores

         ; load eight aligned 256-bit vectors (256 bytes) around rdi
         vmovapd ymm0, [rdi - 128]
         vmovapd ymm1, [rdi - 96]
         vmovapd ymm2, [rdi - 64]
         vmovapd ymm3, [rdi - 32]
         vmovapd ymm4, [rdi]
         vmovapd ymm5, [rdi + 32]
         vmovapd ymm6, [rdi + 64]
         vmovapd ymm7, [rdi + 96]
         ; each repetition issues one multiply and one add per loaded vector
         %rep MAD_PER_ELEMENT
         vmulpd ymm0, ymm0, ymm0
         vaddpd ymm8, ymm8, ymm0
         vmulpd ymm1, ymm1, ymm1
         vaddpd ymm9, ymm9, ymm1
         vmulpd ymm2, ymm2, ymm2
         vaddpd ymm10, ymm10, ymm2
         vmulpd ymm3, ymm3, ymm3
         vaddpd ymm11, ymm11, ymm3
         vmulpd ymm4, ymm4, ymm4
         vaddpd ymm12, ymm12, ymm4
         vmulpd ymm5, ymm5, ymm5
         vaddpd ymm13, ymm13, ymm5
         vmulpd ymm6, ymm6, ymm6
         vaddpd ymm14, ymm14, ymm6
         vmulpd ymm7, ymm7, ymm7
         vaddpd ymm15, ymm15, ymm7
         %endrep

     http://hpcgarage.org/archline
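The intensity that this kernel targets can be worked out from the instruction counts on the slide. The sketch below assumes that only the eight `vmovapd` loads count as memory traffic and that each double-precision lane of a `vmulpd` or `vaddpd` counts as one flop; the function and constant names are hypothetical helpers, not part of the benchmark suite.

```python
VECTOR_BYTES = 32        # one 256-bit ymm register
DOUBLES_PER_VECTOR = 4   # 64-bit doubles per ymm register
LOADS = 8                # vmovapd instructions before the %rep block
ARITH_PER_REP = 16       # 8 vmulpd + 8 vaddpd per repetition


def kernel_intensity(mad_per_element: int) -> float:
    """Flop:byte intensity of the kernel for a given MAD_PER_ELEMENT."""
    flops = ARITH_PER_REP * DOUBLES_PER_VECTOR * mad_per_element
    bytes_loaded = LOADS * VECTOR_BYTES
    return flops / bytes_loaded


if __name__ == "__main__":
    for reps in (1, 2, 4, 16, 64):
        print(f"MAD_PER_ELEMENT={reps:3d}  intensity={kernel_intensity(reps):.2f} flop:byte")
```

Under these assumptions the intensity is `MAD_PER_ELEMENT / 4`, so sweeping the repeat count sweeps the x-axis of the roofline plots.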
