An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch Dongkun Shin* and Jihong Kim Computer Architecture Lab School of Computer Science and Engineering Seoul National University, Korea
Outline • Motivations • VLIW Instruction Encodings • LOR Problem and Solution • GOR Problem and Solution • Experiment • Conclusions School of CSE 2 Workshop on Complexity-Effective Design Seoul National University
Motivations
• Many mobile devices are designed using VLIW processors for high performance, which usually consume more power than single-issue processors.
• In digital CMOS circuits, switching activity accounts for over 90% of total power consumption.
• We propose a post-pass optimization technique that can reduce switching activity during the instruction fetch phase in VLIW processors.
VLIW Instruction Encoding - Uncompressed
Program:
  IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MEMU*/
  ISUB /*IntU*/ || IMUL /*IntU*/
  IADD /*IntU*/ || BEG /*BrU*/
Encoding (one slot per functional unit):
  IntU   IntU   FpU    FpU    MemU   MemU   CmpU   BrU
  IADD   NOP    FADD   NOP    LOAD   STORE  NOP    NOP
  ISUB   IMUL   NOP    NOP    NOP    NOP    NOP    NOP
  IADD   NOP    NOP    NOP    NOP    NOP    NOP    BEG
Alternative encoding:
  NOP    IADD   NOP    FADD   STORE  LOAD   NOP    NOP
  IMUL   ISUB   NOP    NOP    NOP    NOP    NOP    NOP
  IADD   NOP    NOP    NOP    NOP    NOP    BEG    NOP
VLIW Instruction Encoding - Compressed
Program:
  IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MEMU*/
  ISUB /*IntU*/ || IMUL /*IntU*/
  IADD /*IntU*/ || BEG /*BrU*/
Encoding (operation, functional unit, parallel bit):
  Instruction 1: IADD IntU 1 | FADD FpU 1 | LOAD MemU 1 | STORE MemU 0
  Instruction 2: ISUB IntU 1 | IMUL IntU 0
  Instruction 3: IADD IntU 1 | BEG BrU 0
Alternative encoding:
  Instruction 1: FADD FpU 1 | STORE MemU 1 | IADD IntU 1 | LOAD MemU 0
  Instruction 2: IMUL IntU 1 | ISUB IntU 0
  Instruction 3: BEG BrU 1 | IADD IntU 0
Possible choices = 4! · 2! · 2!
Which encoding is the best for low power consumption?
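The count of possible choices on this slide can be checked with a short sketch. It assumes, as the example suggests, that operations may be freely permuted within each instruction and that the parallel bit is 1 on every operation except the last one of an instruction. The function names `count_orderings` and `encodings_of` are illustrative, not taken from the slides.

```python
from itertools import permutations
from math import factorial

# Operations of each VLIW instruction in the example program.
instructions = [
    ["IADD", "FADD", "LOAD", "STORE"],
    ["ISUB", "IMUL"],
    ["IADD", "BEG"],
]

def count_orderings(instrs):
    """Number of encodings obtained by permuting operations inside each instruction."""
    total = 1
    for ops in instrs:
        total *= factorial(len(ops))
    return total

def encodings_of(ops):
    """Yield every ordering of one instruction as (opcode, parallel_bit) pairs.

    Assumption: the parallel bit is 1 on all operations but the last.
    """
    for order in permutations(ops):
        last = len(order) - 1
        yield [(op, 1 if k < last else 0) for k, op in enumerate(order)]

print(count_orderings(instructions))  # 4! * 2! * 2! = 96
```

Each of these orderings is functionally equivalent, so the choice among them can be made purely on the bit pattern it produces.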
Machine Model
[Figure: External Memory connected to the Internal Cache by a b_mem-bit width bus carrying single operations (OP); Internal Cache connected to the VLIW Processor Core by a b_cache-bit width bus carrying fetch packets (FP), each containing several instructions (Ins).]
• A memory block is fetched from the main memory through the b_mem-bit width instruction bus on a cache miss.
• Because of the compressed encoding format, several VLIW instructions are fetched together in a single fetch from the instruction cache.
• A fetch packet consists of N operations, and b_mem = b_cache / N.
Basic Idea
(a) Before operation rearrangement (instruction cache contents):
  00010101 10010101 10011001 00000000
    14 bit transitions
  10001111 00000011 00011101 01011100
    12 bit transitions
  10011101 10011001 10010001 11111110
    13 bit transitions
  10100101 10001111 00011101 00011100
  Total: 39 bit transitions
(b) After operation rearrangement (instruction cache contents):
  00010101 10010101 10011001 00000000
    8 bit transitions
  00011101 10001111 01011101 00000010
    10 bit transitions
  10011101 10011001 11111111 10010000
    11 bit transitions
  10001111 00011101 10100101 00011100
  Total: 29 bit transitions
The total number of bit changes is reduced by about 25% (from 39 to 29).
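The transition counts in this example can be reproduced by summing the Hamming distance between successive cache rows. The sketch below uses the bit patterns from the slide; the helper names `bit_transitions` and `total_transitions` are illustrative.

```python
def bit_transitions(row_a, row_b):
    """Hamming distance between two rows given as lists of 8-bit strings."""
    return sum(bin(int(x, 2) ^ int(y, 2)).count("1") for x, y in zip(row_a, row_b))

def total_transitions(rows):
    """Sum of bit transitions between each pair of successive rows."""
    return sum(bit_transitions(a, b) for a, b in zip(rows, rows[1:]))

before = [
    "00010101 10010101 10011001 00000000".split(),
    "10001111 00000011 00011101 01011100".split(),
    "10011101 10011001 10010001 11111110".split(),
    "10100101 10001111 00011101 00011100".split(),
]
after = [
    "00010101 10010101 10011001 00000000".split(),
    "00011101 10001111 01011101 00000010".split(),
    "10011101 10011001 11111111 10010000".split(),
    "10001111 00011101 10100101 00011100".split(),
]

print(total_transitions(before))  # 39
print(total_transitions(after))   # 29
```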
Problem Formulation
Problem
• How to reorder the operations of given VLIW instructions to reduce the number of bit transitions between successive instruction fetches.
Solutions
• Local Operation Rearrangement (LOR): each basic block is considered independently.
• Global Operation Rearrangement (GOR): all the basic blocks are considered simultaneously.
LOR Problem
$SW^{B} = SW^{B}_{cache} + \alpha \cdot SW^{B}_{mem}$
• $\alpha$ is the load capacitance ratio of the external instruction bus to the internal instruction bus.
• $SW^{B}_{cache}$ is the number of bit changes at the internal instruction bus.
• $SW^{B}_{mem}$ is the number of bit changes at the external instruction bus.
LOR Problem
[Figure: External Memory sends operations OP 1, OP 2, ..., OP N to the Internal Cache, which holds fetch packets FP 1, FP 2, FP 3 and feeds the VLIW Processor Core. $SW_{mem}$ is marked on the external bus and $SW_{cache}$ on the internal bus; $SW^{FP}_{intra}$ is marked within a fetch packet and $SW^{FP}_{inter}$ between fetch packets.]
$SW^{B} = \sum SW^{FP}_{intra} + \sum SW^{FP}_{inter}$
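The weighted sum $SW^{B} = SW^{B}_{cache} + \alpha \cdot SW^{B}_{mem}$ can be illustrated with a small sketch. The slides only define the two terms, so the bus model here is an assumption: every fetch packet is loaded from external memory exactly once, one operation at a time and in order, and packets then appear back to back on the internal bus. The function names are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")

def sw_cache(packets):
    """Bit changes on the internal bus: successive packets compared slot by slot."""
    return sum(
        sum(hamming(x, y) for x, y in zip(p, q))
        for p, q in zip(packets, packets[1:])
    )

def sw_mem(packets):
    """Bit changes on the external bus.

    Assumption: operations cross the narrow bus one after another, in packet order.
    """
    stream = [op for packet in packets for op in packet]
    return sum(hamming(x, y) for x, y in zip(stream, stream[1:]))

def sw_block(packets, alpha):
    """Weighted switching activity of one basic block."""
    return sw_cache(packets) + alpha * sw_mem(packets)

# Two fetch packets of two 4-bit operations each (toy values).
packets = [[0b0001, 0b0011], [0b0111, 0b0011]]
print(sw_cache(packets), sw_mem(packets), sw_block(packets, alpha=2.0))  # 2 3 8.0
```

A larger $\alpha$ makes the order of operations inside a packet matter more, since that order determines the external bus traffic.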
Solution for LOR
[Figure: layered graph. START is connected by zero-weight edges to the nodes $FP^{B}_{i,1}$, $FP^{B}_{i,2}$, $FP^{B}_{i,3}$, $FP^{B}_{i,4}$, which form EQ($FP_i$). Each of these nodes is connected to the nodes $FP^{B}_{i+1,1}$, $FP^{B}_{i+1,2}$, $FP^{B}_{i+1,3}$, $FP^{B}_{i+1,4}$, which are connected by zero-weight edges to END. The graph is annotated with $SW^{FP}_{intra}$ and $SW^{FP}_{inter}$.]
EQ($FP_i$): the set of equivalent fetch packets of $FP_i$.
Solution for LOR
• We find the shortest path from START to END, which is the solution of operation rearrangement that minimizes $SW^{B}$.
• A node $v_{i+1}$ in the graph finds the node $v_i$ through which the shortest path from START to the node $v_{i+1}$ should pass.
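The shortest path search over such a layered graph can be sketched as a dynamic program in which each candidate in layer i+1 records its best predecessor in layer i. This is a minimal sketch, not the authors' implementation: the cost functions `intra` and `inter` are stand-ins, and treating every permutation of a packet as an equivalent candidate is a simplifying assumption.

```python
from itertools import permutations

def hamming(a, b):
    return bin(a ^ b).count("1")

def intra(packet):
    """Assumed node cost: bit changes between consecutive operations of a packet."""
    return sum(hamming(x, y) for x, y in zip(packet, packet[1:]))

def inter(prev, cur):
    """Assumed edge cost: slot-wise bit changes between successive packets."""
    return sum(hamming(x, y) for x, y in zip(prev, cur))

def shortest_rearrangement(layers):
    """layers[i] lists the candidates in EQ(FP_i); returns (cost, chosen packets)."""
    cost = [intra(c) for c in layers[0]]          # zero-weight edges from START
    back = []                                     # best predecessor per candidate
    for prev, cur in zip(layers, layers[1:]):
        new_cost, choice = [], []
        for c in cur:
            best = min(range(len(prev)), key=lambda k: cost[k] + inter(prev[k], c))
            new_cost.append(cost[best] + inter(prev[best], c) + intra(c))
            choice.append(best)
        back.append(choice)
        cost = new_cost
    k = min(range(len(cost)), key=cost.__getitem__)  # zero-weight edges to END
    total, path = cost[k], [k]
    for choice in reversed(back):                 # walk predecessors back to START
        k = choice[k]
        path.append(k)
    path.reverse()
    return total, [layers[i][j] for i, j in enumerate(path)]

# Toy block: two fetch packets of two 4-bit operations each.
block = [(0b0001, 0b1110), (0b1110, 0b0001)]
layers = [list(permutations(p)) for p in block]
total, chosen = shortest_rearrangement(layers)
print(total, chosen)
```

In this toy block the original order costs 16 under the assumed cost functions, while the search aligns identical operations in the same slot and reaches 8.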
GOR Problem
• All the basic blocks in a program are considered simultaneously:
  – how many times each basic block is executed.
  – how often each basic block experiences cache misses.
  – how basic blocks are related to each other.
$SW^{S} = \sum_{i} \sum_{j} SW^{BB}_{inter}(bb_i, bb_j) + \sum_{i} SW^{BB}_{intra}(bb_i)$
• $SW^{BB}_{inter}$ and $SW^{BB}_{intra}$ are expressed in terms of $SW^{FP}_{inter}$, $SW^{FP}_{intra}$, the weight of each basic block, and the cache miss rate.
Solution for GOR
[Figure: flow from the GOR problem to a solution: graph construction, graph transformation (branch merging, loop rolling), shortest path problem, LOR algorithm, solution.]
• This method may require an excessive amount of memory and cycles.
• We need a heuristic solution.
Heuristic for GOR
• Not all basic blocks are treated equally.
  – Basic blocks with larger effects on the total switching activity are reordered more thoroughly than ones with smaller effects.
• Not all the equivalent basic blocks in EQ($bb_i$) are tried in the search for an optimal solution.
  – Only $N_{cand}$ equivalent basic blocks are created and included in the graph.
Experiment
TMS320C6201
• Fixed-point DSP
• VLIW processor that can specify eight 32-bit operations in a single 256-bit instruction.
• Uses a compressed encoding
[Figure: VLIW Processor Core with functional units FU1 to FU8, connected to the Instruction Cache by the internal bus (256-bit width); the Instruction Cache is connected to External Memory by the external bus (32-bit width).]
Experiment Results
[Figure: bar chart of relative BT/FI (y-axis, 0.0 to 1.2) for default, LOR, and GOR-H across the benchmark programs (x-axis): vector multiply, FIR8, IIR, lattice analysis, W_vec, minerror, and average.]
For our benchmark programs, the bit transitions were reduced by 34% on average.
Conclusions
• Described a post-pass optimal operation rearrangement method for low-power VLIW instruction fetch.
  – The switching activity was reduced by 34% on average.
• Future work
  – The phase-ordering problem between operation rearrangement and other compiler optimization steps.
  – The operation rearrangement problem in superscalar processors.