Sorin Manolache Sorin Manolache Outline Specific Issues Related to Embedded Processor Architectures • General approach Sorin Manolache • SIMD example sorma@ida.liu.se • VLIW example 1 2 November 7, 2001 November 7, 2001 Sorin Manolache Sorin Manolache Programmable Architectures of Interest (2) Programmable Architectures of Interest (1) • Multimedia processors • Microcontrollers (MCU) – VLIW – CISC – High performance – High code density – Low code density – Reduced computational resources – High power consumption • RISC processors • Application-Specific Processors (ASIP) – Large file of GP registers – Application-specific data paths – Orthogonal instruction set – Sometimes customizable (reg. file size, reg. width, • DSP word width) – DSP-specific data path architectures (irregular) 3 4 November 7, 2001 November 7, 2001
Sorin Manolache Sorin Manolache Source Level Optimization Optimization Levels • Architecture independent • Standard optimizations – Constant folding – Common subexpression elimination • Source level optimization – Jump optimization • Optimized instruction set mapping • Address code transformations • Assembly level optimizations – Attempt to optimize address generation for array indexes (up to 50% code for addr. gen.) • Loop transformations – Loop unrolling – Loop folding (software pipelining) • Function inlining 5 6 November 7, 2001 November 7, 2001 Sorin Manolache Sorin Manolache Optimized Instruction Set Mapping Tree Pattern Matching (1) • Mapping of machine-independent intermediate representation (IR) to matching instructions SUB • Consider: ADD + SUB – – – – Special-purpose registers – Complex instruction patterns (MAC, SIMD) + + * * ADD – Inter-instruction constraints (resource) + + • Phases: * MAC MAC + – Tree pattern matching (mapping of “functionality”) – Register allocation (mapping of data) – Instruction scheduling (multiprocessor scheduling, with dependency and communication constraints) 7 8 November 7, 2001 November 7, 2001
Sorin Manolache Sorin Manolache Tree Pattern Matching (2) Register Allocation and Instruction Scheduling • Optimal cover in linear time with dynamic programming • Graph colouring problems (heuristics) • There are compiler generators for such problems • Simulated annealing, ILP , CLP • Limitations: • Cost function, number of register spillings – Only for DFT, not DFG • Better results when applied concurrently with instruction sched- – Difficult to apply for irregular data paths uling ( phase coupling ) – Doesn’t apply to SIMD processors • Mutation scheduling, list scheduling • Algebraic transformations on operations 9 10 November 7, 2001 November 7, 2001 Sorin Manolache Sorin Manolache Assembly Level Optimization Even More Problems • Goals: • Compiling for low-power (Siemens reported 60% power saving for a mobile phone in stand-by mode achieved only from soft- – Memory access optimization ware!) – Instruction scheduling optimization • Retargetability • Consider: • Industrial trends: VLIW – Memory architecture (many DSP have two memory banks) – Address generation hardware – Instruction level parallelism • Fewer pipeline stalls • Speed-ups of 24% reported 11 12 November 7, 2001 November 7, 2001
Sorin Manolache Sorin Manolache Compilation for SIMD (2) Compilation for SIMD (1) • Constraints: • Authors go for an ILP approach – Each node in the DFT has to be covered by exactly • Write a grammar for tree covers one machine operation • A derivation in this grammar is a possible cover – Result destination of an operation have to be the • A cost is associated to each rule same as operand sources for result consumers DFT: store (REG, plus (REG, OFFS)) – Common subexpressions REG: plus (REG, REG) – SIMD selection: same operation, correct alignment in REG_LO: plus (REG_LO, REG_LO) memory, correct placing relative to the SIMD register, REG_HI: plus (REG_HI, REG_HI) no dependency • same cost associated to the three rules for covering an ADD. – Data dependency constraints • The problem is to find a derivation with an optimal cost 13 14 November 7, 2001 November 7, 2001 Sorin Manolache Sorin Manolache Code Generation for Irregular Data Paths Compilation for VLIW • Authors go for a CLP approach • Consider only the partitioning (mapping) and scheduling phases (code selection is considered to be done) • Representation by means of a factored machine operation • Done concurrently by feeding back the results of the scheduling (Op, R, [O 1 , O 2 ,..., O n ], ERI, Cons) phase to a new iteration of the mapping phase • Works even if Op is not available on the target processor (alge- • Mapping by means of a simulated annealing approach braic transformations) • Scheduling based on a list scheduling approach • Able to model chained operations (MACs), large class of restric- tions on instruction-level parallelism • For the most complex models, performance improvement to up to 50%, at the expense of sometimes 24 hours of compilation 15 16 November 7, 2001 November 7, 2001
Sorin Manolache Sorin Manolache Scheduling Algorithm (1) The Problems Come from the Architecture mark all nodes as not scheduled S = empty_set while not scheduled nodes A Register File B Register File m = NEXT_READY_NODE(); X2 X1 S = SCHEDULE_NODE(S, m, P); mark m as scheduled L1 S1 M1 D1 D2 M2 S2 L2 • NEXT_READY_NODE selects the node that can be scheduled the earliest addr bus • ties are broken by selecting the node that has the smallest data bus ALAP time (first_fail in CLP) 17 18 November 7, 2001 November 7, 2001 Sorin Manolache Sorin Manolache Scheduling Algorithm (2) Scheduling Algorithm (3) cs = EARLIEST(m) - 1; repeat • Use copies cs = cs + 1; fm = GET_NODE_UNIT(m, cs, P); • If no copies, use cross path rather than inserting a move instruc- tion in an earlier step if (fm == 0) continue; if (m has an argument in a diff. cluster) • Exploit commutativity CHECK_ARG_TRANSFER(); if (at least one transfer impossible) continue ; • Approach best when the ratio of the length of the critical path SCHEDULE_TRANSFERS(); and the number of instructions is low until m is scheduled if (m is a LOAD) DETERMINE_LOAD_PATH(m) if (m is a CSE) INSERT_FORWARD(S, m); 19 20 November 7, 2001 November 7, 2001
Sorin Manolache Conclusions • Generally difficult problem • Additionally, difficulty very much dependent on the particular problem instance => difficult to create a retargetable compiler • Heuristics, but which? General (simulated annealing)? Problem specific (CP)? 21 November 7, 2001
Recommend
More recommend