Extraction of Efficient Instruction Schedulers from Cycle-true Processor Models
Oliver Wahlen, Manuel Hohenauer, Rainer Leupers, Gerd Ascheid, Gunnar Braun, Xiaoning Nie, Heinrich Meyr
CoWare, Inc.; Infineon Technologies; RWTH Aachen
Institute for Integrated Signal Processing Systems
Motivation: Why ASIPs?
Application Specific Instruction-Set Processors (ASIPs) combine advantages of processors and ASICs:
• Provide system programmability and reconfigurability
• Good tradeoff between performance, power consumption, and area
• Can easily be integrated into embedded systems
[Figure: efficiency (MIPS/Watt) versus flexibility; labels: ASICs, ASIPs, domain specific, GPPs]
Solution: LISA Processor Design Platform
LISA: Language for Instruction-Set Architectures
[Figure: LISA 2.0 design platform. Tools: EDGE™ Processor Designer, RIM™ Software Designer, HUB™ System Integrator. Phases: Architecture Specification, Architecture Exploration, Architecture Implementation, Software Application Design, System on Chip Integration and Verification. Generated software tools: C compiler (research), assembler, linker, simulator/debugger, profiler.]
http://www.coware.com
Architecture Exploration Loop
Automatic tool generation:
• Speeds up design cycles
• Eliminates the consistency problem
C compiler in the loop:
• Reduction in implementation and verification time
• IP reuse
[Figure: exploration loop. The application (.c) is compiled by a C compiler generated automatically from the LISA processor model; the resulting .asm passes through the generated assembler & linker and simulator & profiler. If the design criteria are not met, the model is changed manually and the loop repeats; if they are met, a VHDL model is produced.]
Compiler Structure and Generation
[Figure: compiler flow built with the CoSy Compiler Development System. .c → C front-end engine → IR optimizations → architecture-specific backend (Instruction Selector, Register Allocator, Scheduler, Emitter) → .asm. The compiler backend description (.cgd) is generated semiautomatically from the LISA processor model and drives the generation engines.]
Scheduler Generation
Related approaches: [EXPRESSION, PEAS-III]
[Figure: the same compiler flow (.c → C front-end engine → IR optimizations → Instruction Selector → Register Allocator → Scheduler → Emitter → .asm, CoSy Compiler Development System) with the Scheduler highlighted. Labels: LISA processor model, Semiautomatic Generation, compiler backend description (.cgd), Scheduler Generation, postpass tool lpacker.]
Scheduler Description
Reservation Tables: elimination of structural hazards [O. Wahlen, M. Hohenauer, R. Leupers, H. Meyr, 2003]
Example:
  0: MUL R1,R2,R3
  1: NOP
  2: MUL R4,R5,R6
Reservation table:
            ALU_op    MUL_op
  cycle 0
  cycle 1
  cycle 2   EX_alu    EX_mul
  cycle 3             EX_mul
  cycle 4
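A reservation table can be read as a set of (cycle, resource) pairs per operation class. The Python sketch below is only an illustration of how such a table yields the NOP in the example: the table contents are an assumption read off the slide (EX_alu in cycle 2 for ALU_op, EX_mul in cycles 2 and 3 for MUL_op), and the names `RESERVATION`, `conflicts`, and `min_issue_distance` are hypothetical.

```python
# Reservation tables: operation class -> set of (cycle, resource) pairs.
# The contents are an assumption read off the slide example.
RESERVATION = {
    "ALU_op": {(2, "EX_alu")},
    "MUL_op": {(2, "EX_mul"), (3, "EX_mul")},
}


def conflicts(first, second, distance):
    """True if `second`, issued `distance` cycles after `first`, needs a
    resource in a cycle in which `first` already occupies it."""
    shifted = {(cycle + distance, res) for cycle, res in RESERVATION[second]}
    return bool(RESERVATION[first] & shifted)


def min_issue_distance(first, second, limit=16):
    """Smallest issue distance >= 1 at which the two operations do not collide."""
    for distance in range(1, limit + 1):
        if not conflicts(first, second, distance):
            return distance
    raise ValueError("no conflict-free distance within limit")


mul_mul = min_issue_distance("MUL_op", "MUL_op")  # two multiplies back to back
mul_alu = min_issue_distance("MUL_op", "ALU_op")  # ALU operation after a multiply
```

Under these assumptions two MUL operations need an issue distance of 2, which corresponds to the single NOP between the two MUL instructions in the example, while an ALU operation can follow a MUL directly.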
Scheduler Description
Latency Tables: elimination of dataflow hazards
Separate tables exist for RAW, WAW, and WAR dependencies; the RAW table is shown:
            ALU_in    MUL_in
  ALU_out   1         1
  MUL_out   2         2
Example:
  0: MUL R3,R1,R2
  1: NOP
  2: ADD R5,R3,R4
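The latency table can be used as a simple lookup from producer class and consumer class to a minimum issue distance. The following Python sketch assumes the RAW table values shown on the slide; `RAW_LATENCY` and `nops_between` are hypothetical names introduced for illustration.

```python
# RAW latency table: producer output class -> consumer input class -> cycles.
# Values are taken from the slide's RAW table.
RAW_LATENCY = {
    "ALU_out": {"ALU_in": 1, "MUL_in": 1},
    "MUL_out": {"ALU_in": 2, "MUL_in": 2},
}


def nops_between(producer_out, consumer_in):
    """Number of NOPs needed if the consumer directly follows the producer
    and no independent instruction is available to fill the gap."""
    latency = RAW_LATENCY[producer_out][consumer_in]
    return max(latency - 1, 0)


mul_then_add = nops_between("MUL_out", "ALU_in")
add_then_add = nops_between("ALU_out", "ALU_in")
```

For the example, a MUL result consumed by an ADD has latency 2, so one NOP is required between the two instructions.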
LISA Description
  ...
  OPERATION reg_alu_instr IN pipe.ID
  {
    DECLARE {
      GROUP Opcode = { ADD || SUB };
      GROUP Rs1, Rs2, Rd = { gp_reg };
    }
    CODING { Opcode Rs2 Rs1 Rd 0b0[10] }
    SYNTAX { Opcode ~" " Rd ~" " Rs1 ~" " Rs2 }
    BEHAVIOR {
      PIPELINE_REGISTER(pipe,ID/EX).src1 = GP_Regs[Rs1];
      PIPELINE_REGISTER(pipe,ID/EX).src2 = GP_Regs[Rs2];
      PIPELINE_REGISTER(pipe,ID/EX).dst = Rd;
    }
    ACTIVATION { Opcode }
  }
[Figure: operation hierarchy with labels decode, alu control, imm_alu_instr, reg_alu_instr, ADD, SUB, and the instruction word fields opcode, Rd, Rs1, Rs2]
Scheduler Generation: Operation Schedule
[Figure: activation DAG. main → fetch → decode → register_alu_instr (ADD, SUB) and imm_alu_instr (ADDI, SUBI) → alu_wb; edges are annotated with delays of 0 or 1 cycle.]
Operation schedule (ALU):
  Cycle   Resource usage
  0       ---
  1       2 x read of GP-register file
  2       ---
  3       1 x write of GP-register file
[Figure: register file R with read ports 1, 2, 3 and write port 1]
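One way to picture how an operation schedule follows from the activation DAG is to accumulate the activation delays along the path of one instruction and record the resource accesses of each operation at the resulting cycle. The Python sketch below does this for ADD. The individual delays and the access annotations are assumptions chosen to be consistent with the operation schedule above, since the edge weights cannot be read reliably from the slide; `ACTIVATION_PATH`, `ACCESSES`, and `operation_schedule` are hypothetical names.

```python
# Activation path for ADD as (operation, delay in cycles relative to the
# activating operation). The delays are assumptions, not slide data.
ACTIVATION_PATH = [
    ("main", 0),
    ("fetch", 0),
    ("decode", 1),
    ("reg_alu_instr", 0),
    ("ADD", 1),
    ("alu_wb", 1),
]

# Resource accesses per operation (hypothetical annotation).
ACCESSES = {
    "reg_alu_instr": ["read GP_Regs", "read GP_Regs"],
    "alu_wb": ["write GP_Regs"],
}


def operation_schedule(path, accesses):
    """Map each cycle (relative to issue) to the resource accesses in it."""
    schedule = {}
    cycle = 0
    for operation, delay in path:
        cycle += delay  # cycle in which this operation executes
        if operation in accesses:
            schedule.setdefault(cycle, []).extend(accesses[operation])
    return schedule


alu_schedule = operation_schedule(ACTIVATION_PATH, ACCESSES)
```

With these assumed delays the result has two reads of the GP-register file in cycle 1 and one write in cycle 3, matching the table.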
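Latency Calculation
Latencies between two instructions i and j (R is a processor resource). In the examples, j is the instruction issued first and i is the dependent instruction that follows it.
$$L_{\mathrm{raw}}(i, j) = \max_{R}\bigl(\mathrm{last\_write\_cycle}(j, R) - \mathrm{first\_read\_cycle}(i, R) + 1\bigr)$$
Resource accesses relative to issue (ADD writes GPR in cycle 3, SUB reads GPR in cycle 1):
  cycle             0     1     2     3
  ADD R1, R2, R3    …     …     …     GPR
  SUB R4, R1, R5    …     GPR   …     …
Resulting schedule:
  cycle             0     1     2     3     4     5     6
  ADD R1, R2, R3    …     …     …     GPR
  SUB R4, R1, R5                      …     GPR   …     …
$$L_{\mathrm{raw}} = 3 - 1 + 1 = 3$$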
Latency Calculation (WAW)
$$L_{\mathrm{waw}}(i, j) = \max_{R}\bigl(\mathrm{last\_write\_cycle}(j, R) - \mathrm{first\_write\_cycle}(i, R) + 1\bigr)$$
Resource accesses relative to issue (both instructions write GPR in cycle 3; j is ADD, i is SUB):
  cycle             0     1     2     3
  ADD R1, R2, R3    …     …     …     GPR
  SUB R1, R4, R5    …     …     …     GPR
Resulting schedule:
  cycle             0     1     2     3     4
  ADD R1, R2, R3    …     …     …     GPR
  SUB R1, R4, R5          …     …     …     GPR
$$L_{\mathrm{waw}} = 3 - 3 + 1 = 1$$
Latency Calculation (WAR)
$$L_{\mathrm{war}}(i, j) = \max_{R}\bigl(\mathrm{last\_read\_cycle}(j, R) - \mathrm{first\_write\_cycle}(i, R)\bigr)$$
Resource accesses relative to issue (ADD reads PC in cycle 0, JMP writes PC in cycle 1; j is ADD, i is JMP):
  cycle             0     1     2     3
  ADD R2, R1, R3    PC    …     …     …
  JMP addr          …     PC    …     …
$$L_{\mathrm{war}} = 0 - 1 = -1$$
negative latency = delay slot
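The three latency formulas can be checked against the slide examples with a few lines of Python. This is a minimal sketch under stated assumptions: each instruction is described only by the access cycles that appear in the examples, the dictionary layout and the names `_latency`, `l_raw`, `l_waw`, and `l_war` are hypothetical, and j is taken to be the instruction issued first, as in the examples.

```python
# Access cycles relative to issue: kind -> resource -> list of cycles.
# Only the accesses visible in the slide examples are modeled.
ADD = {"reads": {"PC": [0]}, "writes": {"GPR": [3]}}
SUB = {"reads": {"GPR": [1]}, "writes": {"GPR": [3]}}
JMP = {"reads": {}, "writes": {"PC": [1]}}


def _latency(i, i_kind, j, j_kind, offset):
    """max over shared resources R of last(j, R) - first(i, R) + offset,
    or None if i and j share no resource for this dependence type."""
    shared = i[i_kind].keys() & j[j_kind].keys()
    return max(
        (max(j[j_kind][r]) - min(i[i_kind][r]) + offset for r in shared),
        default=None,
    )


def l_raw(i, j):
    # last_write_cycle(j, R) - first_read_cycle(i, R) + 1
    return _latency(i, "reads", j, "writes", 1)


def l_waw(i, j):
    # last_write_cycle(j, R) - first_write_cycle(i, R) + 1
    return _latency(i, "writes", j, "writes", 1)


def l_war(i, j):
    # last_read_cycle(j, R) - first_write_cycle(i, R)
    return _latency(i, "writes", j, "reads", 0)


raw = l_raw(SUB, ADD)  # SUB reads what ADD writes
waw = l_waw(SUB, ADD)  # both write GPR
war = l_war(JMP, ADD)  # JMP writes PC, which ADD reads
```

The three results reproduce the values on the slides: 3 for RAW, 1 for WAW, and -1 for WAR, the last one being the negative latency that indicates a delay slot.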
List Scheduling Example
Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1); SUB R3 R4 → JMP addr (PC: -1)
Ready set: initially ADD R1 R2 and SUB R3 R4; JMP addr enters the ready set once both have been scheduled.
  Cycle   Step 1       Step 2       Step 3       Step 4
  0       ADD R1 R2    ADD R1 R2    ADD R1 R2    ADD R1 R2
  1                    SUB R3 R4    SUB R3 R4    SUB R3 R4
  2                                 JMP addr     JMP addr
  3                                              NOP
Step 4: the delay slot must be filled, here with a NOP.
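The example can be replayed with a small single-issue list scheduler. The Python sketch below is a simplified illustration, not the scheduler generated from the LISA model: it picks ready instructions in program order, never revokes a decision, and pads unfilled delay slots with NOPs at the end. The function name `list_schedule` and the `delay_slots` parameter are hypothetical.

```python
def list_schedule(instrs, edges, delay_slots):
    """Single-issue list scheduling without backtracking.

    instrs:      instruction names in priority (program) order
    edges:       (earlier, later, latency) dependence edges
    delay_slots: instruction -> number of delay slots it exposes
    Returns one instruction name (or "NOP") per cycle.
    """
    preds = {ins: [] for ins in instrs}
    for src, dst, latency in edges:
        preds[dst].append((src, latency))

    issue = {}     # instruction -> issue cycle
    schedule = []  # index = cycle
    cycle = 0
    while len(issue) < len(instrs):
        # Ready: every predecessor is scheduled and its latency has elapsed.
        ready = [
            ins for ins in instrs
            if ins not in issue
            and all(p in issue and issue[p] + lat <= cycle
                    for p, lat in preds[ins])
        ]
        if ready:
            issue[ready[0]] = cycle
            schedule.append(ready[0])
        else:
            schedule.append("NOP")
        cycle += 1

    # Delay slots that no instruction occupies must be filled with NOPs.
    needed = max(issue[ins] + 1 + delay_slots.get(ins, 0) for ins in instrs)
    schedule.extend(["NOP"] * (needed - len(schedule)))
    return schedule


result = list_schedule(
    ["ADD R1 R2", "SUB R3 R4", "JMP addr"],
    [("ADD R1 R2", "JMP addr", -1), ("SUB R3 R4", "JMP addr", -1)],
    {"JMP addr": 1},
)
```

The result is ADD, SUB, JMP, NOP, as in step 4 of the table. The latency of -1 is never exploited, because JMP only becomes ready after both predecessors have already been placed.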
Backtracking Scheduler
• Negative latencies can be extracted automatically from the LISA model
• They indicate delay slots
• List schedulers cannot exploit negative weights in the dependence DAG, because doing so would require revoking earlier scheduling decisions
Consequence: development of a retargetable backtracking scheduler [S. G. Abraham, W. Meleis, I. D. Baev, 2000]
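To see what exploiting the negative weights would gain on the earlier ADD/SUB/JMP example, the following Python sketch enumerates all issue orders by brute force and keeps the shortest legal one. It is explicitly not the backtracking algorithm of the cited work, only a check of the schedule length such a scheduler could reach; `length_if_legal` and `best_order` are hypothetical names, and a single-issue machine with one delay slot after JMP is assumed.

```python
from itertools import permutations

INSTRS = ["ADD R1 R2", "SUB R3 R4", "JMP addr"]
# (earlier, later, latency): issue(later) - issue(earlier) >= latency
EDGES = [("ADD R1 R2", "JMP addr", -1), ("SUB R3 R4", "JMP addr", -1)]
DELAY_SLOTS = {"JMP addr": 1}  # assumption: JMP has exactly one delay slot


def length_if_legal(order):
    """Schedule length in cycles, or None if a latency constraint is violated."""
    issue = {ins: cycle for cycle, ins in enumerate(order)}
    for src, dst, latency in EDGES:
        if issue[dst] - issue[src] < latency:
            return None
    # Delay slots count toward the length even if they hold a NOP.
    return max(issue[ins] + 1 + DELAY_SLOTS.get(ins, 0) for ins in order)


def best_order():
    """Shortest legal order; ties are resolved by enumeration order."""
    candidates = [(length_if_legal(o), list(o)) for o in permutations(INSTRS)]
    legal = [(n, o) for n, o in candidates if n is not None]
    return min(legal, key=lambda item: item[0])


best_len, best = best_order()
```

Under these assumptions the shortest schedule takes 3 cycles (ADD, JMP, SUB, with SUB in the delay slot) instead of the 4 cycles of the list schedule. Reaching it requires placing JMP before SUB, which is the kind of decision a plain list scheduler cannot make.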
mixedBT Backtracking Scheduler
Concept: three scheduling modes
1. Normal scheduling: if there is no conflict, instructions are scheduled according to their data dependencies
2. Displace scheduling: unschedule instructions that have lower priority and are causing a structural hazard
3. Force scheduling: if 1 and 2 are not possible, unschedule the conflicting instructions and force the scheduling of the candidate