Performance, Energy, and Thermal Considerations for SMT and CMP Architectures

Yingmin Li†, David Brooks‡, Zhigang Hu††, Kevin Skadron†

† Dept. of Computer Science, University of Virginia
†† IBM T.J. Watson Research Center
‡ Division of Engineering and Applied Sciences, Harvard University

{yingmin,skadron}@cs.virginia.edu, zhigangh@us.ibm.com, dbrooks@eecs.harvard.edu

Abstract

Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their relative energy-efficiency and thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to explore this design space for a POWER4/POWER5-like core. For an equal-area comparison with this style of core, we find CMP to be superior in terms of performance and energy-efficiency for CPU-bound benchmarks, but SMT to be superior for memory-bound benchmarks due to a larger L2 cache. Although both exhibit similar peak operating temperatures and thermal management overheads, the mechanisms by which SMT and CMP heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures, whereas CMP heating is mainly caused by the global impact of increased energy output. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Indeed, non-DVS localized thermal management can outperform DVS for SMT. Finally, we show that CMP and SMT will scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core’s higher temperature and the exponential temperature-dependence of subthreshold leakage.

1. Introduction

Simultaneous multithreading (SMT) [27] is a recent microarchitectural paradigm that has found industrial application [12, 18]. SMT allows instructions from multiple threads to be simultaneously fetched and executed in the same pipeline, thus amortizing the cost of many microarchitectural structures across more instructions per cycle. The promise of SMT is area-efficient throughput enhancement. But even though SMT has been shown to be energy efficient for most workloads [17, 21], the significant boost in instructions per cycle (IPC) means increased power dissipation and possibly increased power density. Since the area increase reported for SMT execution is relatively small (10-20%), thermal behavior and cooling costs are major concerns.

Chip multiprocessing (CMP) [7] is another relatively new microarchitectural paradigm that has found industrial application [12, 14]. CMP instantiates multiple processor “cores” on a single die. Typically the cores each have private branch predictors and first-level caches and share a second-level, on-chip cache. For multi-threaded or multi-programmed workloads, CMP architectures amortize the cost of a die across two or more processors and allow data sharing within a common L2 cache. Like SMT, the promise of CMP is a boost in throughput. The replication of cores means that the area and power overhead to support extra threads is much greater with CMP than SMT. For a given die size, a single-core SMT chip will therefore support a larger L2 size than a multi-core chip. Yet the lack of execution contention between threads typically yields a much greater throughput for CMP than SMT [4, 7, 20]. A side effect is that each additional core on a chip dramatically increases its power dissipation, so thermal behavior and cooling costs are also major concerns for CMP.

Because both paradigms target increased throughput for multi-threaded and multi-programmed workloads, it is natural to compare them. This paper provides a thorough analysis of the performance benefits, energy efficiency, and thermal behavior of SMT and CMP in the context of a POWER4-like microarchitecture. In this research we assume POWER4-like cores with similar complexity for both SMT and CMP except for necessary SMT-related hardware enhancements. Although reducing the CMP core complexity may improve the energy and thermal efficiency for CMP, it is cost-effective to design a CMP processor by reusing an existing core. The POWER5 dual SMT core processor is an example of this design philosophy. We combine IBM’s cycle-accurate Turandot [19] and PowerTimer [3, 9] performance and power modeling tools, modified to support both SMT and CMP, with University of