WCED’02, Anchorage, USA
Power Estimation of a C algorithm on a VLIW Processor
Nathalie Julien, Eric Senn, Johann Laurent, Eric Martin
LESTER, University of South Brittany, Lorient, France
Eric.Senn@univ-ubs.fr
http://lester.univ-ubs.fr
E. SENN - LESTER / UBS - WCED'02

Context

C-level estimation WITHOUT compilation

    /* A and B are global arrays declared elsewhere. */
    void test1(int IU, int JU, int KU)
    {
        int i, j, k;

        for (i = 0; i < IU; i++)
            for (j = 0; j < JU; j++) {
                for (k = 8; k < KU; k++)
                    A[k] = A[k-8];
                B[i][j] = B[i+1][j] + A[i];
            }

        for (i = 0; i < IU; i++)
            for (j = 0; j < JU; j++)
                B[i][j] = B[i][j] + B[i+1][j];
    }

C code + complete power model → P ?

Power Estimation

Gate level estimation:
- very accurate but long simulation time
- RTL description needed
RTL description not available

Instruction Level P. A.
- accurate
- limited for VLIW processor
- compiler dependent
- memories and pipeline stalls not taken into account

Functional Level P. A.
- accurate & fast
- based on architecture analysis
- compiler independent
- memories and pipeline stalls taken into account

Methodology: Model Definition

Processor → FLPA → parameters
- Algorithmic parameters: parallelism, processing units, cache miss, pipeline stalls...
- Configuration parameters: frequency, memory mode...
Measures → power model: P = a α + b, for example P = 4 α + 1

[Figure: consumption in mapped mode. Current (mA) versus parallelism rate α, with one curve for each of the settings labelled 80, 133, 160 and 200.]

Methodology: Estimation Process

Model definition
- Processor → FLPA + measures → power model (P = 4 α + 1)

Estimation process
- C algorithm + prediction models (assumptions on the compiler efficiency) → algorithmic parameters (α = 0.5)
- Processor configuration with the application → configuration parameters
- Parameter values + power model → C-level power estimation (P = 3 W)

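The two stages above can be illustrated with a minimal C sketch: a least-squares fit of the toy model P = a α + b to measured points (model definition), followed by evaluation at a predicted α (estimation process). The names `linear_model`, `fit_linear` and `estimate_power` are hypothetical, and the sample points used in the usage note are invented so that they lie on the slide's example line P = 4 α + 1. They are not measurements from the deck.

```c
#include <stddef.h>

/* Toy power model P = a * alpha + b (names are illustrative only). */
typedef struct {
    double a;
    double b;
} linear_model;

/*
 * Ordinary least-squares fit of P = a * alpha + b to n measured points.
 * Returns 0 on success, -1 when the fit is undefined (fewer than two
 * points, or all alpha values identical).
 */
int fit_linear(const double *alpha, const double *p, size_t n, linear_model *out)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    double denom;
    size_t i;

    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        sx += alpha[i];
        sy += p[i];
        sxx += alpha[i] * alpha[i];
        sxy += alpha[i] * p[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    out->a = ((double)n * sxy - sx * sy) / denom;
    out->b = (sy - out->a * sx) / (double)n;
    return 0;
}

/* Estimation step: plug a predicted parallelism rate into the model. */
double estimate_power(const linear_model *m, double alpha)
{
    return m->a * alpha + m->b;
}
```

With the hypothetical points (0, 1), (0.25, 2), (0.5, 3), (0.75, 4), (1, 5), the fit should recover a = 4 and b = 1, and `estimate_power` at α = 0.5 should give the slide's P = 3 W.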
TI C6x: Model Definition

TI TMS320C6201: VLIW processor
- up to 8 instructions in parallel
- deep pipeline (up to 11 stages)
- 4 memory modes: mapped, bypass, cache and freeze

FLPA: Functional-Level Power Analysis
- α : parallelism rate
- β : processing unit rate (use of the CPU processing units)
- γ : cache miss rate
- PSR : pipeline stall rate
- F : clock frequency
- MM : memory mode

[Figure: block diagram of the processor annotated with α, β, γ and ε. Blocks: program RAM/cache, data RAM, EMIF, DMA, program/data buses; program fetch, instruction dispatch and instruction decode; sides A and B, each with 4 processing units and a register file; functional blocks PU, IMU and MMU.]

TI C6x: Power Model

Algorithmic parameters (α, β, γ, PSR) + configuration parameters (F, MM) → TMS320C6201 power model → P_core

Power consumption rule in mapped mode:

    P_core = V_DD * ( [a β (1 - PSR) + b_m] F + α (1 - PSR) [a_m F + c_m] + d_m )

Measurements: a = 0.64, a_m = 5.21, b_m = 4.19, c_m = 42.401 and d_m = 7.6

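As a hedged illustration, the mapped-mode rule can be written as a small C function. The slide gives no units, so this sketch assumes F in MHz and a bracketed term in mA, which the caller then scales by V_DD; that assumption should be checked against the authors' publications. The names `c62_coeffs`, `MAPPED_COEFFS`, `core_current` and `core_power` are hypothetical.

```c
/* Coefficients of the mapped-mode rule, as quoted on the slide. */
typedef struct {
    double a;
    double a_m;
    double b_m;
    double c_m;
    double d_m;
} c62_coeffs;

static const c62_coeffs MAPPED_COEFFS = { 0.64, 5.21, 4.19, 42.401, 7.6 };

/*
 * Bracketed term of the rule:
 *   [a*beta*(1-PSR) + b_m]*F + alpha*(1-PSR)*[a_m*F + c_m] + d_m
 * alpha, beta and psr are rates in [0, 1]; f is the clock frequency.
 * ASSUMPTION: f in MHz and a result in mA (units are not on the slide).
 */
double core_current(const c62_coeffs *k, double alpha, double beta,
                    double psr, double f)
{
    double active = 1.0 - psr;  /* share of cycles without a pipeline stall */

    return (k->a * beta * active + k->b_m) * f
         + alpha * active * (k->a_m * f + k->c_m)
         + k->d_m;
}

/* P_core = V_DD * current; vdd is supplied by the caller. */
double core_power(const c62_coeffs *k, double alpha, double beta,
                  double psr, double f, double vdd)
{
    return vdd * core_current(k, alpha, beta, psr, f);
}
```

Keeping the coefficients in a struct leaves room for one coefficient set per memory mode, since the slide only gives the mapped-mode values.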
Parameters extraction

Loop nests analysis:

    X=a+b;
    Y=c+d;
    for (i=0;i<10;i++)
        y[i]=c[i]*d[i+1];              /* local parameters prediction (α, β) */
    Z=a+d;
    for (j=0;j<50;j++)
    {
        for(k=0;k<32;k++)
            tab[k]=h[k-1]+l[k+1];      /* local parameters prediction (α, β) */
    }

Global parameters (α, β): average of local values

Parameters extraction

    for (i=0; i<512; i++)
        Y = x[i]*(h[i] + h[i+1] + h[i-1]) + y;

Loop body: 8 instructions = 4 LD, 4 OP
NFP = 1; NPU = 8

- NFP: Number of Fetch Packets
- NPU: Number of Processing Units
- NEP: Number of Execution Packets

    α = NFP / NEP ≤ 1
    β = (1/8) * (NPU / NEP) ≤ 1

PREDICTION MODEL   EP1    EP2    EP3    EP4    OP     α, β
SEQ                8 EP                               0.125
MAX                2 LD   2 LD   -      -      4 OP   0.5
MIN                1 LD   1 LD   1 LD   1 LD   4 OP   0.25
DATA               2 LD   1 LD   1 LD   -      4 OP   0.33

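The two rates defined on this slide are simple ratios, so a short C sketch is enough to reproduce the table's last column. The function names `parallelism_rate` and `pu_rate` are hypothetical; the constant 8 is the number of processing units stated for the C6201.

```c
/* alpha = NFP / NEP: fetch packets per execution packet. */
double parallelism_rate(unsigned nfp, unsigned nep)
{
    return (double)nfp / (double)nep;
}

/* beta = (1/8) * NPU / NEP: average share of the 8 processing units in use. */
double pu_rate(unsigned npu, unsigned nep)
{
    return (double)npu / (8.0 * (double)nep);
}
```

For the loop body above (NFP = 1, NPU = 8), NEP = 8, 2, 4 and 3 give the SEQ, MAX, MIN and DATA values of the table. Both functions expect NEP > 0.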
Prediction models

- SEQ model: instructions executed sequentially
- Max model: full exploitation of the architecture
- Min model: load/store never executed in parallel
- Data model: load/store executed in parallel only on different data

[Figure: processor block diagram (program RAM/cache, data RAM, EMIF, DMA, program/data buses, program fetch, instruction dispatch, instruction decode, CPU sides A and B) showing, for each model, which of the processing units PU1 to PU4 are used on each side.]

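One possible reading of the four models is as four ways of counting execution packets (NEP) for a loop body. The sketch below is an illustration under stated assumptions, not the authors' tool: at most two load/store instructions and eight instructions in total per execution packet, data dependencies ignored, and, for the Data model, accesses to the same array never issued in parallel. The names `pred_model` and `predict_nep` are hypothetical.

```c
#include <stddef.h>

typedef enum { MODEL_SEQ, MODEL_MAX, MODEL_MIN, MODEL_DATA } pred_model;

/* Ceiling of n / d for d > 0. */
static unsigned ceil_div(unsigned n, unsigned d)
{
    return (n + d - 1) / d;
}

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/*
 * Predict the number of execution packets of a loop body.
 * mem_per_array[i] is the number of load/store instructions touching
 * array i; n_ops is the number of remaining instructions.
 * ASSUMPTIONS (not from the slides): 2 load/store and 8 instructions
 * at most per execution packet, no data dependencies.
 */
unsigned predict_nep(pred_model model, const unsigned *mem_per_array,
                     size_t n_arrays, unsigned n_ops)
{
    unsigned n_mem = 0, worst_array = 0, total, width_bound;
    size_t i;

    for (i = 0; i < n_arrays; i++) {
        n_mem += mem_per_array[i];
        worst_array = max_u(worst_array, mem_per_array[i]);
    }
    total = n_mem + n_ops;
    width_bound = ceil_div(total, 8);   /* 8 processing units per packet */

    switch (model) {
    case MODEL_SEQ:   /* one instruction per execution packet */
        return total;
    case MODEL_MAX:   /* memory accesses paired whenever possible */
        return max_u(width_bound, ceil_div(n_mem, 2));
    case MODEL_MIN:   /* memory accesses never in parallel */
        return max_u(width_bound, n_mem);
    case MODEL_DATA:  /* accesses to the same array are serialized */
        return max_u(width_bound, max_u(worst_array, ceil_div(n_mem, 2)));
    }
    return total;
}
```

For a loop body with one load on `x`, three loads on `h` and four other operations, this counting yields 8, 2, 4 and 3 execution packets for the SEQ, Max, Min and Data models.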
Results

Algorithm                         Measures   Estimation vs Measures (%)
Application      MM   INT/EXT     P (W)      SEQ     MAX      MIN      DATA
FIR              M    INT         4.5        -39%    +5%      -33%     +5%
FFT              M    INT         2.65       -11%    +12%     -3%      -2.6%
LMS              B    INT         4.97       +1%     +3%      +2%      +3%
LMS              C    INT         5.67       -55%    +5.8%    -16%     +5.8%
DWT 64*64        M    INT         3.75       -25%    +13%     -13%     -5.9%
DWT 64*64        M    EXT         2.55       -10%    +3%      -5.9%    -3.5%
DWT 512*512      M    EXT         2.55       -11%    +2.4%    -7%      -3.9%
EFR vocoder      M    INT         5.08       -50%    +11%     -24%     +1%
MPEG decoder     M    INT         5.82       -54%    +9.6%    -32%     -8%
Average error                                32%     7.8%     17%      4.8%

• Estimation vs Measures < 8%
• Minimum and maximum bounds provided

Consumption "maps"

• Consumption maps for the EFR Vocoder

[Figure: two plots, labelled "In cache mode" and "In mapped mode". One shows power (W) versus PSR (%), comparing the DATA prediction with the measure; the other shows power (W) versus PSR (%) and cache miss rate (%).]

PSR estimation

• PSR = NPS / NTC
  – NPS: number of cycles where the pipeline is stalled
  – NTC: total number of cycles
• NPS = NPS_τ + NPS_bc + NPS_γ
  – NPS_τ: external data access - NEXT - Data Mapping (C-level)
  – NPS_bc: internal data bank conflict - NCONFLICT - Data Mapping (C-level)
  – NPS_γ: program cache misses - NFRAME - Compilation (A-level)

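The PSR definition above can be sketched in C as the sum of the three stall contributions divided by the total cycle count. The function name `pipeline_stall_rate` and its parameter names are hypothetical. How each contribution is derived from NEXT, NCONFLICT and NFRAME is not given on the slide, so the sketch takes the stall cycle counts as inputs.

```c
/*
 * PSR = NPS / NTC with NPS = NPS_tau + NPS_bc + NPS_gamma.
 *   nps_tau:   stall cycles due to external data accesses
 *   nps_bc:    stall cycles due to internal data bank conflicts
 *   nps_gamma: stall cycles due to program cache misses
 *   ntc:       total number of cycles
 * Returns 0.0 for an empty run (ntc == 0) to avoid dividing by zero.
 */
double pipeline_stall_rate(unsigned long nps_tau, unsigned long nps_bc,
                           unsigned long nps_gamma, unsigned long ntc)
{
    unsigned long nps = nps_tau + nps_bc + nps_gamma;

    if (ntc == 0)
        return 0.0;
    return (double)nps / (double)ntc;
}
```

For example, 100 + 50 + 50 stalled cycles out of 1000 give a PSR of 0.2, which can then be fed into the power model as the PSR parameter.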
Complexity reduction

• Only a portion of the code is to be studied
• Optimization effort can be focussed

               # of code lines     # of lines studied
Application    C        ASM        Number    %C
FFT            77       408        10        13
LMS            30       408        4         13.3
DWT 64*64      46       714        17        37
EFR            118      1323       37        31.2
MPEG           2267     8488       30        1.3

Conclusion

• Original and general approach validated on a VLIW DSP architecture
• Estimation of minimum and maximum bounds of an algorithm power consumption
• Fast and accurate power estimation at the C-level (error max = 8%)
• Refining at the assembly level (error max = 3%)
  – but compilation is needed then

Conclusion

• Co-design HW/SW, SOC
• High level abstraction decision
  – no compilation
  – no physical measurements
  – no development tools and evaluation boards
• Fast feedback on software performance
  – hot spots
  – pieces of code not suitable for compilation yet
• Complexity reduction

Current and future work

• Development of an automatic tool in progress (available on-line before 2003)
• Extension of the power model library in progress (TI C55, ARM7)
• Execution time estimation for energy consumption
• Generic model for external memories