energy reduction via critical path prediction


  1. Energy Reduction via Critical Path Prediction
     Toshinori Sato, Akihiro Chiyonobu, Itsujiro Arita
     Kyushu Institute of Technology
     WCED'02, The KIT COSMOS Processor

  2. Overview
     • Background
     • Power of CMOS circuits
     • Device and circuit design techniques
     • Architectural-level design techniques
     • Criticality-based instruction scheduling
     • Clustered microarchitecture
     • Summary

  3. Nanometer design
     • Nanometer design poses many challenges for processor designers.
     • reliability, signal integrity, speed, power …

  4. Applications
     • The growing popularity of mobile devices such as smart cell phones is a driving force for investigating high-performance and energy-efficient microprocessors.
     • 3D video games, flight ticket reservation, mobile banking, mobile trading, digital camera, MP3 player …
     • As the computing power of a mobile device increases, so does its power consumption.
     • Since mobile devices are battery-operated, energy efficiency is a first-class constraint for microprocessors.

  5. Active power of CMOS circuit
     • $P_{active} = f \times C_{load} \times V_{dd}^2$
     • $f$: clock frequency
     • $C_{load}$: load capacitance
     • $V_{dd}$: supply voltage
     • Supply voltage reduction is the most effective way to lower power consumption.

  6. Gate delay of CMOS circuit
     • $T_{pd} \propto \dfrac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$
     • $V_{dd}$: supply voltage
     • $V_{th}$: threshold voltage of device
     • $\alpha$: factor depending upon carrier velocity saturation
     • Supply voltage reduction increases gate delay, which results in slower clock frequency.
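The two formulas on slides 5 and 6 pull in opposite directions, and a small numerical sketch can make the tradeoff concrete. The device parameters below (`VTH`, `ALPHA`, `C_LOAD`) are hypothetical values chosen only for illustration and do not come from the slides. The power column also holds the clock frequency fixed, which a real design could not do once the gates slow down.

```python
def active_power(f_hz, c_load, vdd):
    """Active power P = f * C_load * Vdd^2 (slide 5)."""
    return f_hz * c_load * vdd ** 2


def relative_gate_delay(vdd, vth, alpha):
    """Gate delay up to a constant factor: Vdd / (Vdd - Vth)^alpha (slide 6)."""
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return vdd / (vdd - vth) ** alpha


# Hypothetical parameters, assumed for this sketch only.
VTH = 0.4        # threshold voltage in volts
ALPHA = 1.3      # velocity-saturation factor
C_LOAD = 1e-12   # load capacitance in farads
F = 1e9          # clock frequency in hertz, held fixed here
VDD_REF = 1.6    # reference supply voltage in volts

for vdd in (1.6, 1.4, 1.2, 1.0):
    power = active_power(F, C_LOAD, vdd) / active_power(F, C_LOAD, VDD_REF)
    delay = (relative_gate_delay(vdd, VTH, ALPHA)
             / relative_gate_delay(VDD_REF, VTH, ALPHA))
    print(f"Vdd={vdd:.1f} V  power x{power:.2f}  delay x{delay:.2f}")
```

Under these assumed parameters the power factor falls quadratically with the supply voltage while the delay factor rises, which is the tension the rest of the talk tries to exploit.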

  7. Device & circuit optimizations
     • We can exploit critical path information.
     • The critical path (CP) is the path that determines the processor cycle time (frequency).
     • Small transistors on non-CP
     • Low supply voltage for transistors on non-CP
     • High-threshold transistors on non-CP
     • Background
     • Multiple supply voltages (on-chip supplies)
     • Multiple-threshold CMOS, variable-threshold CMOS

  8. Architectural-level techniques
     • We can select functional units based on each instruction's criticality.
     • High-speed and power-hungry units, e.g. CSA, CLA
     • Low-speed and power-efficient units, e.g. RCA
     • Criticality-based instruction scheduling

  9. Criticality-based scheduling
     • Dispatch policy is based on each instruction's criticality.
     • Only instructions on critical paths should be dispatched into fast functional units.
     • Non-critical instructions can use slow functional units.
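The dispatch policy described on this slide can be sketched as a small selection function. This is a minimal illustration, not the authors' implementation: the function name, the unit counts passed as arguments, and the optional `allow_fallback` switch (letting an instruction use the other class of unit when its preferred class is busy) are all assumptions of the sketch.

```python
def dispatch(is_critical, free_fast, free_slow, allow_fallback=False):
    """Pick a functional-unit class for one instruction.

    Returns "fast", "slow", or None when the instruction has to wait.
    allow_fallback is a hypothetical knob, not something the slides specify.
    """
    preferred, other = ("fast", "slow") if is_critical else ("slow", "fast")
    free = {"fast": free_fast, "slow": free_slow}
    if free[preferred] > 0:
        return preferred
    if allow_fallback and free[other] > 0:
        return other
    return None


# Critical instructions go to fast units, non-critical ones to slow units.
print(dispatch(True, free_fast=1, free_slow=3))
print(dispatch(False, free_fast=1, free_slow=3))
# With no fast unit free, a critical instruction waits unless fallback is on.
print(dispatch(True, free_fast=0, free_slow=3))
print(dispatch(True, free_fast=0, free_slow=3, allow_fallback=True))
```

Keeping the strict policy as the default mirrors the wording of the slide. Whether the evaluated processor lets instructions fall back to the other class is not stated here.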

  10. Critical path
     • The chain of instructions that determines the number of cycles needed to execute the program.
     [Figure: dependence graph of instructions I0 to I9 with the critical path marked]
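One simple way to picture this definition is as the longest latency-weighted chain in a data-dependence graph. The sketch below uses a small hypothetical graph and assumed latencies, not the graph drawn on the slide. It also considers data dependences only, so resource and fetch constraints that affect a real processor's critical path are left out.

```python
from functools import lru_cache

# Hypothetical dependence graph: each instruction maps to the instructions
# that consume its result. This is not the graph from the slide.
CONSUMERS = {
    "I0": ["I1", "I2"],
    "I1": ["I3"],
    "I2": ["I4"],
    "I3": [],
    "I4": ["I5"],
    "I5": [],
}
# Assumed execution latencies in cycles.
LATENCY = {"I0": 1, "I1": 1, "I2": 2, "I3": 1, "I4": 1, "I5": 1}


def critical_path(consumers, latency):
    """Return (total_cycles, chain) for the longest latency-weighted chain."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        best_len, best_chain = 0, ()
        for nxt in consumers[node]:
            length, chain = longest_from(nxt)
            if length > best_len:
                best_len, best_chain = length, chain
        return latency[node] + best_len, (node,) + best_chain

    return max((longest_from(n) for n in consumers), key=lambda r: r[0])


cycles, chain = critical_path(CONSUMERS, LATENCY)
print(cycles, " -> ".join(chain))  # prints: 5 I0 -> I2 -> I4 -> I5
```

Instructions off the returned chain (here I1 and I3) have slack, which is what makes them candidates for slower, more power-efficient units.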

  11. Critical path prediction
     • Tune's critical path prediction buffer [HPCA'01]
     • Simple but uses only local information
     • Fields's token passing CP predictor [ISCA'01]
     • More accurate due to use of global information, but complex

  12. Tune's CPP buffer
     [Figure: the PC indexes a table of counters; the selected counter is compared against a threshold ("> Th?") to predict critical / not critical]

  13. Criticality-based scheduling
     [Figure: critical instructions go to fast and power-hungry units; non-critical instructions go to slow and power-efficient units]

  14. Evaluation
     • Out-of-order (OOO) 8-way superscalar processor
     • Based on the SimpleScalar/Alpha tool set
     • 6 integer units (fast / slow)
     • 4K-entry CPP buffer with 3-bit counters
     • Counter update: +1 if critical, -1 if not. Threshold = 5.
     • SPEC2000 benchmarks
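The predictor configuration listed above (4K entries, 3-bit counters, +1 / -1 updates, threshold 5) can be sketched as a table of counters indexed by the program counter. The class and method names are hypothetical. Several details are assumptions of this sketch and are not stated on the slides: counters saturate at 0 and 7, they start at 0, the index drops the two low PC bits, and the comparison is strictly greater than the threshold, following the "> Th?" label on slide 12. How an instruction is observed to be critical during execution is outside the sketch and is passed in as `was_critical`.

```python
class CriticalPathBuffer:
    """Table of small counters indexed by PC, in the style of a CPP buffer."""

    def __init__(self, entries=4096, counter_bits=3, threshold=5):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1   # 7 for 3-bit counters
        self.threshold = threshold
        self.counters = [0] * entries              # initial value 0 is assumed

    def _index(self, pc):
        # Assumed indexing: drop the 2 low bits of a 4-byte-aligned PC, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predict critical when the counter exceeds the threshold."""
        return self.counters[self._index(pc)] > self.threshold

    def update(self, pc, was_critical):
        """Add 1 if the instruction was critical, subtract 1 if not (saturating)."""
        i = self._index(pc)
        if was_critical:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


buf = CriticalPathBuffer()
pc = 0x120001000
for _ in range(6):
    buf.update(pc, was_critical=True)
print(buf.predict(pc))  # True: the counter is 6, which exceeds the threshold of 5
```

Because prediction uses only the per-PC counter, this structure relies on local information alone, which matches the characterization on slide 11.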

  15. Processor models
     [Figure: baseline model and power-efficient model]

  16. Vdd and frequency scaling

                    Fast units   Slow units
      Vdd           1.6 V        1.15 V
      Frequency     1.0 GHz      500 MHz
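As a rough check of what these operating points imply under the active-power formula from slide 5, the ratios below assume that fast and slow units have the same load capacitance and switching activity. That is an assumption of this calculation, not something the slides state. Energy per switching event is taken as proportional to $C_{load} V_{dd}^2$.

$$
\frac{E_{slow}}{E_{fast}} = \left(\frac{1.15}{1.6}\right)^{2} \approx 0.52,
\qquad
\frac{P_{slow}}{P_{fast}} = \frac{0.5\ \text{GHz}}{1.0\ \text{GHz}} \times \left(\frac{1.15}{1.6}\right)^{2} \approx 0.26
$$

Under these assumptions an operation executed on a slow unit would use roughly half the energy of the same operation on a fast unit, at the cost of a clock that is half as fast.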

  17. %Increase in cycles
     [Figure: bar charts of the percentage increase in cycles for parser, eon, vortex, and bzip2, configuration 3 fast / 3 slow / 4K; two panels, pipelined and non-pipelined; y-axis from 0 to 25]

  18. %Distribution of dispatch
     [Figure: stacked bar charts of the dispatch distribution (legend: NS, CS, NF, CF) for parser, eon, vortex, and bzip2; two panels, pipelined and non-pipelined; y-axis from 0% to 100%]

  19. %Energy reduction in FU
     [Figure: bar chart of the percentage energy reduction in functional units for parser, eon, vortex, and bzip2; series: pipelined, non-pipelined; y-axis from 0 to 45]

  20. Clustered microarchitecture
     • To further reduce power, we split the instruction queue into a fast queue and a slow queue.
     • The fast cluster consists of the fast queue and the fast functional units.
     • The slow cluster consists of the slow queue and the slow functional units.
     • The two clusters are connected by small FIFOs where necessary.

  21. Clustered datapath

  22. Inter-cluster bypassing

  23. Evaluation
     • 16-entry fast queue and 48-entry slow queue
     • A dispatched instruction does not release its queue entry, just as in the RUU.
     • 36.3% power reduction in the queues.
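The queue organization above can be sketched as two fixed-capacity queues with steering by predicted criticality. The class and method names are hypothetical, and two behaviours are assumptions of the sketch rather than statements from the slides: an instruction stalls when its target queue is full, and an entry is freed at commit (the slide only says that dispatch does not release it, as in the RUU).

```python
class ClusteredQueues:
    """Occupancy model of a fast and a slow instruction queue."""

    def __init__(self, fast_entries=16, slow_entries=48):
        self.capacity = {"fast": fast_entries, "slow": slow_entries}
        self.occupied = {"fast": 0, "slow": 0}

    def insert(self, predicted_critical):
        """Steer by predicted criticality; return the cluster or None on a stall."""
        cluster = "fast" if predicted_critical else "slow"
        if self.occupied[cluster] >= self.capacity[cluster]:
            return None  # assumed behaviour: stall when the queue is full
        self.occupied[cluster] += 1
        return cluster

    def on_dispatch(self, cluster):
        """Dispatch to a functional unit keeps the entry allocated."""
        return self.occupied[cluster]

    def on_commit(self, cluster):
        """Assumed release point: the entry is freed when the instruction commits."""
        if self.occupied[cluster] == 0:
            raise ValueError(f"no occupied entry in the {cluster} queue")
        self.occupied[cluster] -= 1


queues = ClusteredQueues()
print(queues.insert(predicted_critical=True))   # fast
print(queues.insert(predicted_critical=False))  # slow
```

Holding entries until commit means the small fast queue can fill up with instructions that have already executed, which is one plausible source of the cycle increases reported for the clustered model.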

  24. Processor models
     [Figure: non-clustered model and clustered model]

  25. %Increase in cycles
     [Figure: bar charts of the percentage increase in cycles for parser, eon, vortex, and bzip2, configuration 16 fast Q / 48 slow Q; two panels, pipelined and non-pipelined; y-axis from 0 to 80]

  26. %Distribution of dispatch
     [Figure: stacked bar charts of the dispatch distribution (legend: NS, CF) for parser, eon, vortex, and bzip2; two panels, pipelined and non-pipelined; y-axis from 0% to 100%]

  27. %Energy reduction in FU
     [Figure: bar chart of the percentage energy reduction in functional units for parser, eon, vortex, and bzip2; series: pipelined, non-pipelined; y-axis from -40 to 30]

  28. Summary
     • The tradeoff between power and performance can be carefully investigated by exploiting critical path information in architectural-level design as well as in device and circuit design.
     • We evaluated criticality-based scheduling. It reduces energy in FUs by over 30%.
     • We also evaluated a clustered microarchitecture. Currently, it is not always a good design choice for energy reduction.
