
Extraction of Efficient Instruction Schedulers from Cycle-true Processor Models

Extraction of Efficient Instruction Schedulers from Cycle-true Processor Models. Authors: Oliver Wahlen, Manuel Hohenauer, Rainer Leupers, Gerd Ascheid, Gunnar Braun, Xiaoning Nie, Heinrich Meyr. Affiliations: CoWare, Inc.; Infineon Technologies; RWTH Aachen.


  1. Extraction of Efficient Instruction Schedulers from Cycle-true Processor Models
     Oliver Wahlen, Manuel Hohenauer, Rainer Leupers, Gerd Ascheid, Gunnar Braun, Xiaoning Nie, Heinrich Meyr
     CoWare, Inc.; Infineon Technologies; RWTH Aachen
     Institute for Integrated Signal Processing Systems

  2. Motivation: Why ASIPs?
     Application Specific Instruction-Set Processors (ASIPs) combine the advantages of processors and ASICs:
     • Provide system programmability and reconfigurability
     • Good tradeoff between performance, power consumption, and area
     • Can easily be integrated into embedded systems
     [Figure: efficiency (MIPS/Watt) versus flexibility, positioning ASICs, ASIPs (domain specific), and GPPs]

  3. Solution: LISA Processor Design Platform
     LISA: Language for Instruction-Set Architectures
     [Figure: LISA 2.0 architecture specification at the center of three tools. EDGE™ Processor Designer: architecture exploration and architecture implementation, with C compiler (research), assembler, linker, simulator, and profiler. RIM™ Software Designer: software application design with assembler / C compiler, linker, and simulator / debugger. HUB™ System Integrator: system-on-chip integration and verification. http://www.coware.com]

  5. Architecture Exploration Loop
     Automatic tool generation:
     • Speeds up design cycles
     • Eliminates the consistency problem
     C compiler in the loop:
     • Reduction in implementation and verification time
     • IP reuse
     [Figure: exploration loop. The application (.c) is translated by the C compiler to .asm, passed through the assembler & linker, and run on the simulator & profiler. If the design criteria are not met, the LISA processor model is changed manually and the tools are generated automatically again; if they are met, the result is a VHDL model.]

  6. Compiler Structure and Generation
     [Figure: CoSy Compiler Development System. The .c source passes through the C front-end engine, the IR optimizations, and the architecture-specific backend (instruction selector, register allocator, scheduler, emitter) to .asm. Generation engines build the backend from a .cgd compiler backend description, which is generated semiautomatically from the LISA processor model.]

  7. Scheduler Generation [EXPRESSION, PEAS-III]
     [Figure: CoSy compiler flow (.c, C front-end engine, IR optimizations, instruction selector, register allocator, scheduler, emitter, .asm) with the .cgd compiler backend description generated semiautomatically from the LISA processor model. The scheduler and emitter are highlighted, together with the postpass tool lpacker.]

  8. Scheduler Description [O. Wahlen, M. Hohenauer, R. Leupers, H. Meyr, 2003]
     Reservation tables: elimination of structural hazards
     Cycle     ALU_op    MUL_op
     cycle 0
     cycle 1
     cycle 2   EX_alu    EX_mul
     cycle 3             EX_mul
     cycle 4
     Example:
     0: MUL R1,R2,R3
     1: NOP
     2: MUL R4,R5,R6
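     A minimal sketch of how a reservation table can be used to detect structural hazards. The table contents follow the slide (MUL_op occupies EX_mul in cycles 2 and 3 after issue); the function names `conflicts` and `earliest_issue` are illustrative assumptions, not part of the LISA tools.

```python
# Reservation table: for each operation class, the resources it occupies
# in each cycle relative to its issue cycle (values read off the slide).
RESERVATION = {
    "ALU_op": {2: {"EX_alu"}},
    "MUL_op": {2: {"EX_mul"}, 3: {"EX_mul"}},
}


def conflicts(op_a, issue_a, op_b, issue_b):
    """Return True if both operations need the same resource in the same absolute cycle."""
    for rel_a, res_a in RESERVATION[op_a].items():
        for rel_b, res_b in RESERVATION[op_b].items():
            if issue_a + rel_a == issue_b + rel_b and res_a & res_b:
                return True
    return False


def earliest_issue(op_a, issue_a, op_b):
    """Earliest cycle after issue_a at which op_b can issue without a structural hazard."""
    cycle = issue_a + 1
    while conflicts(op_a, issue_a, op_b, cycle):
        cycle += 1
    return cycle


# Two back-to-back multiplications: the second one has to wait one cycle,
# which corresponds to the NOP in the slide's example.
print(earliest_issue("MUL_op", 0, "MUL_op"))
```

     Under these assumptions, a MUL issued in cycle 0 blocks a second MUL until cycle 2, while an ALU operation could issue in cycle 1.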

  9. Scheduler Description
     Latency tables (RAW, WAW, WAR): elimination of dataflow hazards
     RAW       ALU_in    MUL_in
     ALU_out   1         1
     MUL_out   2         2
     Example:
     0: MUL R3,R1,R2
     1: NOP
     2: ADD R5,R3,R4
     Reservation tables: as on slide 8 [O. Wahlen, M. Hohenauer, R. Leupers, H. Meyr, 2003]
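     As a rough illustration, the RAW latency table can be read as a lookup from (producer class, consumer class) to the minimum issue distance. The dictionary below copies the slide's values; the helper `nops_between` is a hypothetical name introduced only for this sketch.

```python
# RAW latency table from the slide: (producer output class, consumer input class) -> latency
RAW_LATENCY = {
    ("ALU_out", "ALU_in"): 1,
    ("ALU_out", "MUL_in"): 1,
    ("MUL_out", "ALU_in"): 2,
    ("MUL_out", "MUL_in"): 2,
}


def nops_between(producer_out, consumer_in):
    """Number of filler cycles needed if the consumer directly follows the producer."""
    return max(RAW_LATENCY[(producer_out, consumer_in)] - 1, 0)


# MUL R3,R1,R2 followed by ADD R5,R3,R4: latency 2, so one NOP in between.
print(nops_between("MUL_out", "ALU_in"))
```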

  10. LISA Description
      ...
      OPERATION reg_alu_instr IN pipe.ID
      {
        DECLARE {
          GROUP Opcode = { ADD || SUB };
          GROUP Rs1, Rs2, Rd = { gp_reg };
        }
        ...
        CODING { Opcode Rs2 Rs1 Rd 0b0[10] }
        SYNTAX { Opcode ~" " Rd ~" " Rs1 ~" " Rs2 }
        BEHAVIOR {
          PIPELINE_REGISTER(pipe,ID/EX).src1 = GP_Regs[Rs1];
          PIPELINE_REGISTER(pipe,ID/EX).src2 = GP_Regs[Rs2];
          ...
          PIPELINE_REGISTER(pipe,ID/EX).dst = Rd;
        }
        ACTIVATION { Opcode }
      }
      [Figure: operation hierarchy with the nodes decode, alu control, imm_alu_instr, reg_alu_instr, opcode, Rd, Rs1, Rs2, ADD, SUB]

  11. Scheduler Generation: Operation Schedule
      [Figure: activation DAG with the nodes main, fetch, decode, register_alu_instr (ADD, SUB), imm_alu_instr (ADDI, SUBI), and alu_wb; edges are annotated with 0 or 1]
      Operation schedule (ALU):
      Cycle   Resource usage
      0       ---
      1       2 x read of GP-register file
      2       ---
      3       1 x write of GP-register file
      [Figure: register file R with read ports 1, 2, 3 and write port 1]

  12. Latency Calculation
      Latencies between two instructions i and j (R is a processor resource):
      $L_{\mathrm{raw}}(i,j) = \max_R(\mathrm{last\_write\_cycle}(j,R) - \mathrm{first\_read\_cycle}(i,R) + 1)$
      Resource accesses per pipeline cycle:
      Instruction       0    1    2    3
      ADD R1, R2, R3    …    …    …    GPR
      SUB R4, R1, R5    …    GPR  …    …
      Resulting schedule:
      Cycle             0    1    2    3    4    5    6
      ADD R1, R2, R3    …    …    …    GPR
      SUB R4, R1, R5                   …    GPR  …    …
      $L_{\mathrm{raw}} = 3 - 1 + 1 = 3$

  13. Latency Calculation
      Latencies between two instructions i and j (R is a processor resource):
      $L_{\mathrm{raw}}(i,j) = \max_R(\mathrm{last\_write\_cycle}(j,R) - \mathrm{first\_read\_cycle}(i,R) + 1)$
      $L_{\mathrm{waw}}(i,j) = \max_R(\mathrm{last\_write\_cycle}(j,R) - \mathrm{first\_write\_cycle}(i,R) + 1)$
      Resource accesses per pipeline cycle:
      Instruction       0    1    2    3
      ADD R1, R2, R3    …    …    …    GPR
      SUB R1, R4, R5    …    …    …    GPR
      Resulting schedule:
      Cycle             0    1    2    3    4
      ADD R1, R2, R3    …    …    …    GPR
      SUB R1, R4, R5         …    …    …    GPR
      $L_{\mathrm{waw}} = 3 - 3 + 1 = 1$

  14. Latency Calculation
      Latencies between two instructions i and j (R is a processor resource):
      $L_{\mathrm{raw}}(i,j) = \max_R(\mathrm{last\_write\_cycle}(j,R) - \mathrm{first\_read\_cycle}(i,R) + 1)$
      $L_{\mathrm{waw}}(i,j) = \max_R(\mathrm{last\_write\_cycle}(j,R) - \mathrm{first\_write\_cycle}(i,R) + 1)$
      $L_{\mathrm{war}}(i,j) = \max_R(\mathrm{last\_read\_cycle}(j,R) - \mathrm{first\_write\_cycle}(i,R))$
      Resource accesses per pipeline cycle:
      Instruction       0    1    2    3
      ADD R2, R1, R3    PC   …    …    …
      JMP addr          …    PC   …    …
      [Figure: the two instructions overlapped on a timeline of cycles 0 to 4]
      $L_{\mathrm{war}} = 0 - 1 = -1$
      Negative latency = delay slot
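      The three formulas can be sketched in a few lines. This is an illustration under assumptions: each instruction is modelled as a list of resource accesses (the `Access` type and all function names are hypothetical), and, as in the slides' worked examples, j is taken to be the instruction issued first and i the dependent one. Register-level dependences are ignored; only the resource accesses shown on the slides are modelled.

```python
from typing import NamedTuple


class Access(NamedTuple):
    cycle: int      # pipeline cycle relative to issue
    resource: str   # e.g. "GPR" or "PC"
    kind: str       # "read" or "write"


def _cycles(accesses, resource, kind):
    """All cycles in which the instruction performs `kind` accesses to `resource`."""
    return [a.cycle for a in accesses if a.resource == resource and a.kind == kind]


def _latency(j, i, j_kind, i_kind, offset):
    """max over shared resources R of last_<j_kind>(j, R) - first_<i_kind>(i, R) + offset."""
    values = []
    for res in {a.resource for a in j} & {a.resource for a in i}:
        j_cycles = _cycles(j, res, j_kind)
        i_cycles = _cycles(i, res, i_kind)
        if j_cycles and i_cycles:
            values.append(max(j_cycles) - min(i_cycles) + offset)
    return max(values) if values else None  # None: no dependence of this type


def l_raw(i, j):
    return _latency(j, i, "write", "read", 1)


def l_waw(i, j):
    return _latency(j, i, "write", "write", 1)


def l_war(i, j):
    return _latency(j, i, "read", "write", 0)


# Accesses as shown on the slides.
add_gpr = [Access(3, "GPR", "write")]   # ADD writes the GPR file in cycle 3
sub_read = [Access(1, "GPR", "read")]   # SUB reads the GPR file in cycle 1
sub_write = [Access(3, "GPR", "write")] # SUB writes the GPR file in cycle 3
add_pc = [Access(0, "PC", "read")]      # ADD reads the PC in cycle 0
jmp = [Access(1, "PC", "write")]        # JMP writes the PC in cycle 1

print(l_raw(sub_read, add_gpr), l_waw(sub_write, add_gpr), l_war(jmp, add_pc))
```

      With these inputs the sketch reproduces the slide values 3, 1, and -1.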

  15. List Scheduling Example
      Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1), SUB R3 R4 → JMP addr (PC: -1)
      Ready set: ADD R1 R2, SUB R3 R4

  16. List Scheduling Example
      Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1), SUB R3 R4 → JMP addr (PC: -1)
      Ready set: SUB R3 R4
      Cycle   Step 1
      0       ADD R1 R2
      1
      2
      3

  17. List Scheduling Example
      Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1), SUB R3 R4 → JMP addr (PC: -1)
      Ready set: JMP addr
      Cycle   Step 1      Step 2
      0       ADD R1 R2   ADD R1 R2
      1                   SUB R3 R4
      2
      3

  18. List Scheduling Example
      Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1), SUB R3 R4 → JMP addr (PC: -1)
      Ready set: (empty)
      Cycle   Step 1      Step 2      Step 3
      0       ADD R1 R2   ADD R1 R2   ADD R1 R2
      1                   SUB R3 R4   SUB R3 R4
      2                               JMP addr
      3

  19. List Scheduling Example
      Data dependence DAG: ADD R1 R2 → JMP addr (PC: -1), SUB R3 R4 → JMP addr (PC: -1)
      Ready set: (empty)
      The delay slot must be filled:
      Cycle   Step 1      Step 2      Step 3      Step 4
      0       ADD R1 R2   ADD R1 R2   ADD R1 R2   ADD R1 R2
      1                   SUB R3 R4   SUB R3 R4   SUB R3 R4
      2                               JMP addr    JMP addr
      3                                           NOP
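      The example can be replayed with a very small forward list scheduler. This is a sketch under assumptions, not the scheduler used by the authors: it models a single-issue machine, takes the node order as the priority order, and uses a hypothetical `delay_slots` parameter to say how many slots follow the jump.

```python
def list_schedule(nodes, edges, delay_slots):
    """Forward list scheduling on a single-issue machine.

    nodes:       instructions in priority order
    edges:       {(pred, succ): latency}, meaning cycle(succ) >= cycle(pred) + latency
    delay_slots: {instruction: number of delay slots}  (assumption for this sketch)
    """
    cycle_of = {}
    schedule = []            # schedule[c] is the instruction issued in cycle c
    remaining = list(nodes)
    while remaining:
        # An instruction is ready once all of its predecessors are scheduled.
        ready = [n for n in remaining
                 if all(p in cycle_of for (p, s) in edges if s == n)]
        node = ready[0]
        earliest = max((cycle_of[p] + lat
                        for (p, s), lat in edges.items() if s == node), default=0)
        # A list scheduler only moves forward: it never revisits earlier cycles,
        # so a negative latency cannot pull the instruction ahead of its predecessors.
        cycle = max(earliest, len(schedule))
        while len(schedule) < cycle:
            schedule.append("NOP")
        schedule.append(node)
        cycle_of[node] = cycle
        remaining.remove(node)
    # Unfilled delay slots at the end have to be padded with NOPs.
    for node, slots in delay_slots.items():
        while len(schedule) < cycle_of[node] + slots + 1:
            schedule.append("NOP")
    return schedule


nodes = ["ADD R1 R2", "SUB R3 R4", "JMP addr"]
edges = {("ADD R1 R2", "JMP addr"): -1, ("SUB R3 R4", "JMP addr"): -1}
print(list_schedule(nodes, edges, {"JMP addr": 1}))
```

      The jump lands in cycle 2 even though its dependences would allow an earlier cycle, and the delay slot in cycle 3 is filled with a NOP, as in step 4 of the slide.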

  20. Backtracking Scheduler
      • Negative latencies can be extracted automatically from the LISA model
      • They indicate delay slots
      • Negative weights in the dependence DAG cannot be exploited by list schedulers, because scheduling decisions would need to be revoked
      Consequence: development of a retargetable backtracking scheduler [S. G. Abraham, W. Meleis, I. D. Baev, 2000]

  21. mixedBT Backtracking Scheduler
      Concept: three scheduling modes
      1. Normal scheduling: if there is no conflict, instructions are scheduled according to their data dependencies
      2. Displace scheduling: unschedule instructions that have lower priority and are causing a structural hazard
      3. Force scheduling: if 1 and 2 are not possible, unschedule the conflicting instructions and force the scheduling of the candidate
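      One possible reading of the three modes, written as a single placement step on a single-issue schedule. This is only a sketch: the function `place`, the `violates` callback, and the data layout are assumptions made for illustration and are not taken from the mixedBT implementation.

```python
def place(candidate, cycle, slots, priority, violates):
    """Put `candidate` into `cycle`, choosing one of the three modes.

    slots:    dict cycle -> instruction already scheduled there (modified in place)
    priority: dict instruction -> number, higher means more important
    violates: hypothetical callback (instr, candidate, cycle) -> True if keeping
              `instr` scheduled would break a data dependence
    Returns (mode, list of unscheduled instructions).
    """
    occupant = slots.get(cycle)  # structural conflict: the issue slot is taken
    dep_conflicts = [i for i in slots.values()
                     if i != occupant and violates(i, candidate, cycle)]
    if occupant is None and not dep_conflicts:
        mode, evicted = "normal", []
    elif (occupant is not None and not dep_conflicts
          and priority[occupant] < priority[candidate]):
        mode, evicted = "displace", [occupant]
    else:
        # Neither normal nor displace applies: remove every conflict and force.
        mode = "force"
        evicted = ([occupant] if occupant is not None else []) + dep_conflicts
    for c in [c for c, i in slots.items() if i in evicted]:
        del slots[c]
    slots[cycle] = candidate
    return mode, evicted


slots = {0: "ADD R1 R2", 1: "SUB R3 R4"}
priority = {"ADD R1 R2": 1, "SUB R3 R4": 1, "JMP addr": 2}
no_dep_conflict = lambda instr, cand, cycle: False
print(place("JMP addr", 1, slots, priority, no_dep_conflict))
```

      In this run the jump displaces the lower-priority SUB from cycle 1; the unscheduled SUB would then be offered to the scheduler again, for example for the delay slot in cycle 2.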
