  1. ACMP: An Architecture to Handle Amdahl’s Law M. Aater Suleman Advisor: Yale Patt HPS Research Group

  2. Acknowledgements
  • Eric Sprangle, Intel
  • Anwar Rohillah, Intel
  • Anwar Ghuloum, Intel
  • Doug Carmean, Intel

  3. Background
  • Single-thread performance is power constrained
  • To leverage CMPs for a single application, it must be parallelized
  • Many kernels cannot be parallelized completely
  • Applications likely include both serial and parallel portions
  • Amdahl’s law is more applicable now than ever
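
For reference, Amdahl’s law as it is usually stated (the slide cites the law but does not restate it); the assumption that the parallel portion scales perfectly across N cores is the standard textbook one, not a claim made on the slide:

```latex
% Amdahl's law: upper bound on speedup with N cores when a fraction P of the
% execution is parallelizable (assumes the parallel part scales perfectly and
% the serial part runs on a single core at its original speed).
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}}
```

Even with P = 0.9, the speedup is capped at 10 no matter how many cores are added, which is why the serial portion matters more as core counts grow.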

  4. Serial Bottlenecks
  • Inherently serial kernels, e.g.
      For I = 1 to N
          A[I] = (A[I-1] + A[I])/2
  • Parallelization requires effort
  [Figure: degree of parallelism (0 to 1) achieved vs. programmer effort; data-parallel loops are parallelized with little effort, loops with early termination take more, and irregular code takes the most.]
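
A minimal C rendering of the slide’s pseudocode, just to make the loop-carried dependence explicit; the function and variable names are illustrative, not from the talk:

```c
#include <stddef.h>

/* Each iteration reads a[i-1], which the previous iteration just wrote,
 * so the iterations cannot run concurrently as written. */
void serial_smooth(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = (a[i - 1] + a[i]) / 2.0;
}

/* By contrast, a data-parallel loop has no cross-iteration dependence,
 * so each iteration could be handed to a different core. */
void scale(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * k;
}
```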

  5. CMP Architectures
  • Tile small cores, e.g. Sun Niagara, Intel Larrabee
     – High throughput on the parallel part
     – Low serial thread performance
     – Highest performance for completely parallelized applications
  • Tile large cores, e.g. Intel Core 2 Duo, AMD Barcelona, and IBM Power 5
     – High serial thread performance
     – Lower throughput than Niagara

  6. ACMP
  • Run the serial thread on the large core to extract ILP
  • Run parallel threads on small cores

  9. Performance vs. Parallelism
  [Figure: speedup over a single P6-type core as a function of degree of parallelism (0 to 1), comparing ACMP, Niagara, and P6-Tile.]

  10. Performance vs. Parallelism
  • At low parallelism, ACMP and P6-Tile outperform Niagara

  11. Performance vs. Parallelism
  • At high parallelism, Niagara outperforms ACMP

  12. Performance vs. Parallelism
  • At medium parallelism, ACMP wins

  13. Performance vs. Parallelism
  • The cut-off point moves to the right in the future
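
The curves on slides 9 to 13 can be approximated with a simple Amdahl-style model; the sketch below is an illustration, not the authors’ model. It takes the core counts from slide 14, and it assumes perfect scaling of the parallel portion, a large core that is 2x as fast as a small core (the slides do not give this ratio), and an ACMP parallel phase that uses the large core plus the 12 small cores.

```c
#include <stdio.h>

/* Illustrative Amdahl-style model of slides 9-13 (not the authors' model).
 * Work is measured in small-core time units.  Assumptions (mine, not the
 * slides'): a large core is LARGE_OVER_SMALL times faster than a small core,
 * the parallel portion scales perfectly, and ACMP's parallel phase uses the
 * large core plus all 12 small cores.  Core counts follow slide 14. */
#define LARGE_OVER_SMALL 2.0

static double exec_time(double f, double serial_perf, double parallel_perf)
{
    /* serial fraction on one core + parallel fraction across all cores */
    return (1.0 - f) / serial_perf + f / parallel_perf;
}

int main(void)
{
    const double big  = LARGE_OVER_SMALL;
    const double base = 1.0 / big;                       /* one P6-type core   */

    printf("%4s %9s %9s %9s\n", "f", "Niagara", "P6-Tile", "ACMP");
    for (double f = 0.0; f <= 1.0001; f += 0.1) {
        double niagara = exec_time(f, 1.0, 16.0);        /* 16 small cores     */
        double p6tile  = exec_time(f, big, 4.0 * big);   /* 4 large cores      */
        double acmp    = exec_time(f, big, big + 12.0);  /* 1 large + 12 small */
        printf("%4.1f %9.2f %9.2f %9.2f\n",
               f, base / niagara, base / p6tile, base / acmp);
    }
    return 0;
}
```

With these assumptions the model reproduces the qualitative story on the slides: ACMP and P6-Tile win at low parallelism, ACMP wins over a broad middle range, and Niagara overtakes ACMP only when the parallel fraction is very close to 1.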

  14. Experimental Methodology
  • Large core: out-of-order (similar to P6)
  • Small core: 2-wide, in-order
  • Configurations:
     – Niagara: 16 small cores
     – P6-Tile: 4 large cores
     – ACMP: 1 large core, 12 small cores
  • Single ISA, shared memory, private L1 and L2 caches, bi-directional ring interconnect
  • Simulated existing multi-threaded applications without modification
  • ACMP thread scheduling:
     – Master thread → large core
     – All additional threads → small cores
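
The scheduling policy on slide 14 (master thread on the large core, worker threads on the small cores) could be expressed in software roughly as below. This is only a sketch using Linux pthread affinity: the CPU numbering (large core = CPU 0, small cores = CPUs 1 through 12) is an assumption, and the evaluation on the slides was done in a simulator, not with code like this.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical CPU numbering; the slides do not specify one. */
#define LARGE_CORE_CPU   0
#define FIRST_SMALL_CPU  1
#define NUM_SMALL_CORES 12

/* Pin the calling thread to a single CPU. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Master thread (runs the serial portion): bind to the large core. */
void bind_master_thread(void)
{
    pin_to_cpu(LARGE_CORE_CPU);
}

/* Worker thread i of the parallel portion: bind to one of the small cores. */
void bind_worker_thread(int i)
{
    pin_to_cpu(FIRST_SMALL_CPU + (i % NUM_SMALL_CORES));
}
```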

  15. Performance Results
  [Figure: speedup of P6-Tile and ACMP relative to Niagara (Niagara = 1), for benchmarks grouped into low-, medium-, and high-parallelism classes: is_nasp, ep_nasp, art_omp, mg_nasp, fmm_splash, cholesky, page convert, h.264, mcf, fft_splash, cg_nasp.]

  16. Summary
  • ACMP trades peak parallel performance for serial performance
  • Improves performance for a wide range of applications
  • Performance is less dependent on the length of the serial portion
  • Improves programmer efficiency
     – Programmers need to parallelize only the easier-to-parallelize kernels

  17. Future Work
  • Enhanced ACMP scheduling
     – Accelerate execution of finer-grain serial portions (critical sections) using the large core
     – Requires compiler support and minimal hardware
  • Improved threading decisions based on run-time feedback
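
To make the first future-work bullet concrete: even inside the parallel portion, code that is serialized (a critical section) could execute on the large core. The slide proposes compiler and hardware support for this; the sketch below is only a software approximation of the intent, with an assumed CPU numbering and a placeholder lock, not the proposed mechanism.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Software-only approximation of the idea on slide 17: run the critical
 * section on the large core, then return to the thread's home small core.
 * CPU numbers are assumptions. */
#define LARGE_CORE_CPU 0

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void set_affinity(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Execute a critical section on the large core, then migrate back. */
void run_critical_section_on_large_core(void (*body)(void *), void *arg,
                                        int home_cpu)
{
    set_affinity(LARGE_CORE_CPU);   /* migrate to the large core        */
    pthread_mutex_lock(&lock);
    body(arg);                      /* serialized work runs faster here */
    pthread_mutex_unlock(&lock);
    set_affinity(home_cpu);         /* return to the small core         */
}
```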

  18. Thank you
