UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC)
A Comparative Study of Modulo Scheduling Techniques
Josep M. Codina, Josep Llosa and Antonio González
Dept. of Computer Architecture, Universitat Politècnica de Catalunya, Barcelona, SPAIN
E-mail: {jmcodina,josepll,antonio}@ac.upc.es
Introduction: Software Pipelining
• Instruction scheduling for VLIW/superscalar processors
  - VLIW processors in the DSP market
  - EPIC/IPF
• Loop scheduling: software pipelining
  - Loops consume most of an application's execution time
  - Software pipelining a loop is an NP-complete problem
  - Software pipelining is a large family of techniques
  - Modulo scheduling is based on heuristics
Motivation
• Modulo scheduling is a framework within which techniques are defined
  - Different factors to take into account
  - Many techniques fit in the framework, each built on different ideas
• Proposals in the literature are evaluated without a common
  - Platform (i.e. compiler)
  - Benchmarks
  - Target architectures
  - Measures
• Lack of a thorough comparison
Objectives
• Perform a comparison of state-of-the-art MS techniques
  - Qualitative
  - Quantitative
• The work is targeted at compiler writers
  - Is one of the techniques better than the others for all architectures?
  - Which is the most powerful technique for a given architecture?
Talk Outline
• Modulo Scheduling Background
• Selection Criteria
• Techniques Compared
• Study Environment
• Results
• Conclusions
Modulo Scheduling: Basic Ideas
[Figure: iterations 1 to 4 overlapped, each one starting an Initiation Interval (II) after the previous one; labels: Stage 1, Stage 2, Stage 3, Prolog, Kernel, Epilog.]
Basic Scheme
1. Find MII and set II = MII
2. Look for a schedule
3. Found it? If yes, stop; if no, increase the II and go back to step 2
Basic Scheme: Find MII and set II = MII
• MII depends on
  - Resources
  - Recurrences
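The slide only states that the MII depends on resources and recurrences. A minimal sketch of the usual lower bound follows, assuming fully pipelined functional units (one issue slot per operation) and recurrences given as (total latency, total iteration distance) pairs. The function names and the example loop are hypothetical and are not taken from the study.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: the busiest resource class decides (assumes pipelined units)."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(recurrences):
    """Recurrence bound over (total_latency, total_distance) pairs, one per cycle."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def mii(op_counts, unit_counts, recurrences):
    """MII is the larger of the resource bound and the recurrence bound."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop: 5 memory ops on 2 ports, 3 FP ops on 2 FP units,
# and one recurrence of total latency 6 that spans 2 iterations.
ops = {"mem": 5, "fp": 3}
units = {"mem": 2, "fp": 2}
print(mii(ops, units, [(6, 2)]))  # 3
```

Here both bounds happen to equal 3: ceil(5/2) for the memory ports and ceil(6/2) for the recurrence.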
Basic Scheme: Look for a schedule
• Ordering the nodes
• Finding a feasible cycle
  - Top-down/bottom-up
  - Bi-directional
• When no feasible cycle
  - Use of backtracking
  - Increase the II
Basic Scheme: Found it?
• Can we meet the constraints?
  - Resources
  - Dependences
Basic Scheme: Increase the II
• The larger the II, the more likely it is that a schedule is found
• The larger the II, the lower the performance
• II lower than the length of a single iteration
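The Basic Scheme loop can be sketched as follows. This is an illustrative top-down scheduler without backtracking, not one of the techniques compared in the talk. It assumes fully pipelined units, operations listed so that same-iteration predecessors come first, and loop-carried edges with distance of at least 1. All names and the example loop are hypothetical.

```python
def try_schedule(ops, edges, units, ii):
    """Top-down placement, no backtracking. Returns {op: cycle} or None.

    ops:   list of (name, resource, latency); distance-0 predecessors first
    edges: list of (src, dst, distance)
    units: {resource: number of units}
    """
    latency = {name: lat for name, _, lat in ops}
    mrt = {}    # modulo reservation table: (resource, cycle mod ii) -> units in use
    start = {}
    for name, res, _ in ops:
        # Earliest cycle allowed by the predecessors scheduled so far.
        earliest = max([start[s] + latency[s] - d * ii
                        for s, t, d in edges if t == name and s in start] + [0])
        # Slots repeat modulo ii, so ii consecutive cycles cover every slot.
        for cycle in range(earliest, earliest + ii):
            slot = (res, cycle % ii)
            if mrt.get(slot, 0) < units[res]:
                mrt[slot] = mrt.get(slot, 0) + 1
                start[name] = cycle
                break
        else:
            return None  # resource constraints cannot be met at this II
    # Re-check every dependence, including loop-carried edges into earlier ops.
    for s, t, d in edges:
        if start[t] < start[s] + latency[s] - d * ii:
            return None
    return start

def modulo_schedule(ops, edges, units, mii, max_ii=64):
    """Start at II = MII and increase the II until a schedule is found."""
    ii = mii
    while ii <= max_ii:
        sched = try_schedule(ops, edges, units, ii)
        if sched is not None:
            return ii, sched
        ii += 1
    raise RuntimeError("no schedule found up to max_ii")

# Hypothetical loop body: load -> mul -> add -> store, with add feeding itself
# across one iteration (latency 3, distance 1, so the recurrence bound is 3).
ops = [("load", "mem", 2), ("mul", "fp", 6), ("add", "fp", 3), ("store", "mem", 2)]
edges = [("load", "mul", 0), ("mul", "add", 0), ("add", "store", 0), ("add", "add", 1)]
units = {"mem": 2, "fp": 2}
ii, sched = modulo_schedule(ops, edges, units, mii=3)
print(ii, sched)  # 3 {'load': 0, 'mul': 2, 'add': 8, 'store': 11}
```

The resulting iteration is 13 cycles long while the II is 3, which is the situation the last bullet describes: the II is lower than the length of a single iteration.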
Backtracking
• Not always beneficial
  - Can produce better schedules
  - Can just lengthen the process of finding a schedule
  - In some cases, there is no feasible schedule for a given II
• Backtracking must be limited
  - BudgetRatio: ratio of the maximum number of operation scheduling steps attempted before increasing the II to the number of operations in the loop
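As a small worked example of how a BudgetRatio limits backtracking, the sketch below turns the ratio into an absolute number of scheduling steps. It assumes the budget is the ratio multiplied by the number of operations in the loop. The function name and the 20-operation loop are hypothetical; the ratios are the ones studied later in the talk.

```python
def scheduling_budget(num_ops, budget_ratio):
    """Maximum scheduling steps to attempt at one II before increasing it."""
    return int(budget_ratio * num_ops)

# Hypothetical loop with 20 operations, for the ratios 1, 2.5, 5 and 10.
for ratio in (1, 2.5, 5, 10):
    print(ratio, scheduling_budget(20, ratio))
# 1 20
# 2.5 50
# 5 100
# 10 200
```

With a ratio of 1 every operation can be placed exactly once, so no backtracking step fits in the budget; larger ratios leave room for operations to be unscheduled and placed again.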
Selection Criteria
• Value of the code generated
  - Parallelism
  - Register pressure
  - Code size
  - Execution time
• Effectiveness/Cost of the technique
Selection Criteria: Parallelism
• Effectiveness in exploiting ILP
• What is the difference between II and MII?
Selection Criteria: Register pressure
• Software pipelining increases register pressure
• How many registers are needed?
• How many loops can be scheduled within a given number of registers?
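One way to estimate how many registers a modulo schedule needs is MaxLive, the largest number of values simultaneously live in any cycle of the kernel. The sketch below assumes each value's lifetime is given as a half-open cycle interval within one iteration. The function name and the lifetimes are hypothetical.

```python
def max_live(lifetimes, ii):
    """Peak number of live values per kernel cycle.

    lifetimes: list of (def_cycle, last_use_cycle), live over [def, last_use)
    """
    live = [0] * ii
    for start, end in lifetimes:
        # A lifetime longer than ii wraps around and overlaps copies of itself
        # belonging to later iterations, so it is counted once per wrap.
        for cycle in range(start, end):
            live[cycle % ii] += 1
    return max(live)

# Hypothetical lifetimes in a schedule with II = 2.
print(max_live([(0, 3), (1, 2), (2, 6)], ii=2))  # 4
```

The value live over cycles 2 to 5 spans two full kernel iterations, so it alone accounts for two registers in every kernel cycle.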
Selection Criteria: Code size
• Crucial in embedded domains
• Stages of a schedule
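To show why the number of stages drives code size, here is a minimal sketch. It uses one common definition of stage count (iteration length divided by II, rounded up) and assumes the prolog and epilog are generated explicitly, with no hardware support for collapsing them. The function names and the schedule are hypothetical.

```python
from math import ceil

def stage_count(start, latency, ii):
    """Number of II-cycle stages spanned by one iteration of the schedule."""
    length = max(start[op] + latency[op] for op in start)
    return ceil(length / ii)

def pipelined_code_cycles(sc, ii):
    """Prolog and epilog take (sc - 1) * ii cycles each; the kernel takes ii."""
    return (2 * sc - 1) * ii

# Hypothetical schedule with II = 3 and an iteration length of 13 cycles.
start = {"load": 0, "mul": 2, "add": 8, "store": 11}
latency = {"load": 2, "mul": 6, "add": 3, "store": 2}
sc = stage_count(start, latency, ii=3)
print(sc, pipelined_code_cycles(sc, ii=3))  # 5 27
```

Under these assumptions a 13-cycle loop body expands to 27 cycles of generated code, so a schedule with fewer stages is cheaper at the same II.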
Selection Criteria: Execution time
• Main objective
Selection Criteria: Effectiveness/Cost of the technique
• Can all the loops be scheduled?
• Compilation time
Techniques Compared
• Modulo scheduling techniques
  - Iterative Modulo Scheduling (IMS)
  - Swing Modulo Scheduling (SMS)
  - Slack Modulo Scheduling (Slack MS)
  - Integrated Register-sensitive Iterative Software Pipelining method (IRIS)
• Complementary techniques
  - Stage Modulo Scheduling (Stage MS)
Stage MS
• Post-pass that can be applied after an MS technique
• Reduces the register pressure
• Without increasing the II
• Moves operations by multiples of II cycles
• Various heuristics; we selected the 3UP+RSS heuristic
Main Differences
                | IMS      | SMS | Slack MS | IRIS
Order of nodes  | Top-down | Priority to recurrences; no pred. and succ. scheduled in partial schedule | Dynamic; based on slack | Top-down
Finding a cycle | Top-down | Bi-directional; close to pred. or succ. | Bi-directional; close to pred. or succ. depending on the benefit | Stage MS heuristics
Backtracking    | Yes      | No  | Yes | Yes
Qualitative Comparison
[Table: columns IMS, SMS, Slack MS, IRIS; rows Parallelism, Register Pressure, Code Size, Effectiveness, Cost; cell values drawn from Order, Backtracking, Bi-directional, Stage Heuristics, Yes, No.]
Study Environment
• Platform (i.e. compiler): ICTINEO
• Benchmarks (1936 loops)
  - SPECfp95
  - Perfect Club
• Target architectures
  - Several architectures of varying complexity, from less constrained to more constrained
    - Low Complexity architecture
    - Medium Complexity architecture
    - Complex architecture
Architectures Description
Architectures: Low Complexity, Medium Complexity, Complex Architecture
Features listed: fully pipelined ops; fully pipelined simple ops with non-pipelined complex ops; 8-issue; 4-issue; 4 write ports; 8 read ports; unlimited register ports; 2 memory ports; 2 Int FU and 2 FP FU
Benchmarks: 1936 loops
Latencies (cycles)
Operation           | Low/Medium | Complex
MEM                 | 2          | 3
INT ADD, SUB, COMP  | 1          | 3
INT MUL             | 2          | 4
INT DIV, MOD, SQRT  | 6          | 8
FP ADD, SUB, COMP   | 3          | 5
FP MUL              | 6          | 8
FP DIV, MOD, SQRT   | 18         | 20
Methodology
• Study of the BudgetRatio for each architecture: 1, 2.5, 5 and 10
  - Effectiveness
  - Performance
  - Cost
• Measures for each technique with and without Stage MS
  - Effectiveness and cost
  - Parallelism
  - Register pressure
  - Code size
  - Execution time
Results: BudgetRatio Study
[Figure: BudgetRatio study on the Low Complexity architecture for IMS, IRIS and Slack. Three plots against BudgetRatio (1, 2.5, 5, 10): Sum II / Sum MII (performance), % non scheduled ops (effectiveness), total time (cost).]
II vs MII
[Figure: Average (II/MII) for IMS, SMS, IRIS and Slack on the Low and Medium Complexity architectures; y-axis from 1 to 1.014.]
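The two parallelism measures that label the result plots, Sum II / Sum MII and Average (II/MII), can be computed as in the sketch below. The function names and the per-loop (II, MII) pairs are hypothetical and are not measurements from the study.

```python
def sum_ratio(loops):
    """Sum of II over sum of MII; loops with a large II weigh more."""
    return sum(ii for ii, _ in loops) / sum(mii for _, mii in loops)

def average_ratio(loops):
    """Mean of the per-loop II/MII ratios; every loop weighs the same."""
    return sum(ii / mii for ii, mii in loops) / len(loops)

# Hypothetical (II, MII) pairs for three loops; only the second misses its MII.
loops = [(3, 3), (5, 4), (10, 10)]
print(round(sum_ratio(loops), 4))      # 1.0588
print(round(average_ratio(loops), 4))  # 1.0833
```

Both measures equal 1 when every loop is scheduled at its MII, so values close to 1 mean the scheduler rarely has to increase the II.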
Register Pressure
[Figure: MaxLive/MinAvg for IMS, SMS, IRIS and Slack, each with and without Stage MS (+ST), on the Low and Medium Complexity architectures; y-axis from 1 to 1.9.]
Execution Time
[Figure: execution cycles (millions) per technique (IMS, SMS, IRIS and Slack, each with and without +ST) on the Low Complexity architecture (y-axis 28000 to 28700) and on the Medium Complexity architecture (y-axis 28000 to 42000).]
Complex Architecture
[Figure: three plots for the Complex architecture: II vs MII as Average (II/MII) (y-axis 1 to 1.07), register pressure as MaxLive/MinAvg (y-axis 1 to 1.8), and execution cycles in millions (y-axis 28000 to 42000), per technique with and without +ST.]