SCOPES 2003

Tailoring Software Pipelining For Effective Exploitation Of Zero Overhead Loop Buffer

Gang-Ryung Uh
CS Department
Boise State University

Outline
1. Low-power DSP16000 and ZOLB
2. Compiler Mission
3. Conventional Approach
4. Alternative Approach
5. Intermediate Results
6. Conclusion

DSP (Digital Signal Processor)
• Programmable processor for mathematical operations to manipulate signals with
  1. High performance
  2. Minimal power consumption
  3. Minimal memory footprint

Signal Processing Algorithm
FIR: $y_k = \sum_{n} b_n x_{k-n}$, for $n = 0, \ldots, N$
FFT: $y_k = \sum_{j} w^{jk} x_j$, for $j = 0, \ldots, N-1$, where $w = e^{-2\pi i/N}$
2D-DCT: $F(u,v) = \frac{1}{N^2} \sum_{m} \sum_{n} f(m,n) \cos[(2m+1)u\pi/2N] \cos[(2n+1)v\pi/2N]$, for $m, n = 0, \ldots, N-1$
I. Heavy arithmetic computations
II. Can be easily programmed into Tight Small Loops

Finite Impulse Response (FIR)
At the least, compute one Tap in a Single Cycle

Lucent DSP16000 Architecture Features
1. Harvard Architecture
2. Separate AGU from DALU for rich addressing modes
3. Zero-wait State High Speed Memory
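To make the one-tap-per-cycle target concrete, the FIR sum from the Signal Processing Algorithm slide can be written as a tight loop in which every tap is a single multiply-accumulate. The following is only a minimal sketch: the name fir_direct, the convention that samples before x[0] are zero, and the parameter taps (standing for N + 1 coefficients) are assumptions made for illustration, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR: y[k] = sum over n of b[n] * x[k - n].
 * Assumption of this sketch: samples before x[0] are treated as zero. */
void fir_direct(const int *x, size_t len, const int *b, size_t taps, long *y)
{
    assert(x != NULL && b != NULL && y != NULL);
    for (size_t k = 0; k < len; k++) {
        long acc = 0;
        /* n <= k keeps the index k - n inside the input array */
        for (size_t n = 0; n < taps && n <= k; n++) {
            acc += (long)b[n] * x[k - n];   /* one tap = one multiply-accumulate */
        }
        y[k] = acc;
    }
}
```

The inner loop body is the unit of work that a DSP aims to finish in one cycle: two operand fetches, one multiply, one accumulate.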
Lucent DSP16000 Architecture Features (cont)
4. Compiler (Programmer) Controlled On-Core Instruction Cache – ZOLB (Zero Overhead Loop Buffer) to support high performance with minimal power dissipation

[Figure: ZOLB with registers cloop, cstate, and zolbpc, and an instruction buffer holding Instruction 1 to Instruction 31; code labels: do cloop { instruction 1 ... instruction n }, redo k]

Lucent DSP16000 Instruction Set Design
In order to achieve performance & higher code density:

    A0 = A0 + P0   P0 = Xh*Yh   P1 = Xl*Yl   Y = *R0++   X = *PT0++
    (16 bit word instruction)

• Permissible order of operations is very limited
• The register usage is restricted to only a few different registers

Compiler Mission!
Where are the compound/complex instructions?

    A0 = A0 + P0   P0 = Xh*Yh   P1 = Xl*Yl   Y = *R0++   X = *PT0++

Experience with Iterative Modulo Scheduling Techniques
EDN Benchmark: FIR Filter

    // EDN Benchmarks
    void fir(const short array1[], const short coeff[], short output[])
    {
        int i, j, sum;

        for (i = 0; i < N - ORDER; i++) {
            sum = 0;
            for (j = 0; j < ORDER; j++) {
                sum += array1[i + j] * coeff[j];   // one tap: multiply-accumulate
            }
            output[i] = sum >> 15;                 // scale the accumulated sum
        }
    }

    a2=0
    j=a4
    do 50 {
        /* inst 1 */ xh = *(r0 + j)
        /* inst 2 */ yh = *r3++
        /* inst 3 */ r4 = j
        /* inst 4 */ p0 = xh*yh  p1 = xl*yl
        /* inst 5 */ a2 = a2+p0
        /* inst 6 */ j = r4+1
    }

Step 1: Resource Initiation Interval
MII = MAX(RecII, ResII)
ResII: Smallest Loop Initiation Interval to meet the system resource requirement.

Step 2: Recurrence Initiation Interval
MII = MAX(RecII, ResII)
RecII: Smallest Integer Loop Initiation Interval to meet all the deadlines imposed by data dependence circuits.
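The bound MII = MAX(RecII, ResII) can be sketched in a few lines of C. The slides define ResII only in words; the sketch below assumes the usual formulation from the modulo scheduling literature, where each resource class contributes the ceiling of (uses per iteration / available units). The function names and any resource counts passed in are hypothetical and do not describe the actual DSP16000 resource model.

```c
#include <assert.h>

/* ResII under the assumed model: for each resource class take
 * ceil(uses / units), then take the maximum over all classes. */
int res_ii(const int *uses, const int *units, int nres)
{
    int ii = 1;                       /* an initiation interval is at least 1 */
    for (int r = 0; r < nres; r++) {
        assert(units[r] > 0);
        int need = (uses[r] + units[r] - 1) / units[r];   /* ceiling division */
        if (need > ii) {
            ii = need;
        }
    }
    return ii;
}

/* MII = MAX(RecII, ResII) */
int mii(int rec_ii_value, int res_ii_value)
{
    return rec_ii_value > res_ii_value ? rec_ii_value : res_ii_value;
}
```

For instance, with two hypothetical resource classes used 2 and 3 times per iteration and offering 1 and 2 units respectively, res_ii returns 2; combined with a RecII of 2 the scheduler would start its search at II = 2.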
    do 50 {
        /* inst 1 */ xh = *(r0 + j)
        /* inst 2 */ yh = *r3++
        /* inst 3 */ r4 = j
        /* inst 4 */ p0 = xh*yh  p1 = xl*yl
        /* inst 5 */ a2 = a2+p0
        /* inst 6 */ j = r4+1
    }

ResII: Resource Initiation Interval ≥ 2

[Figure: data dependence graph over Inst 1 to Inst 6 with edges labeled (1,1), (1,0), and (0,1); legend: True Dependence, Output Dependence, Anti Dependence]
Step 2: RecII (cont)

[Figure: dependence graph extended with Start and End nodes; edges from Start labeled (0,0)]

Adjacency Matrix

              Start   Inst-1  Inst-2  Inst-3  Inst-4  Inst-5  Inst-6  End
    Start     X       (0,0)   (0,0)   (0,0)   (0,0)   (0,0)   (0,0)   (0,0)
    Inst-1    X       X       X       X       (0,1)   X       X       (0,0)
    Inst-2    X       X       X       X       (0,1)   X       X       (0,0)
    Inst-3    X       X       X       X       X       X       (0,1)   (0,0)
    Inst-4    X       (1,0)   (1,0)   X       X       (0,1)   X       (0,0)
    Inst-5    X       X       X       X       (1,0)   X       X       (0,0)
    Inst-6    X       (1,1)   X       (1,1)   X       X       X       (0,0)
    End       X       X       X       X       X       X       X       X

Step 2: Compute MinDIST Matrix
Floyd Algorithm: MinDist[i,i] ≤ 0 with II (Initiation Interval) = 2

              Start   Inst-1  Inst-2  Inst-3  Inst-4  Inst-5  Inst-6  End
    Start     X       0       0       0       1       2       1       2
    Inst-1    X       -1      -1      X       1       2       X       2
    Inst-2    X       -1      -1      X       1       2       X       2
    Inst-3    X       0       -1      0       1       2       1       2
    Inst-4    X       -2      -2      X       -1      1       X       1
    Inst-5    X       -4      -4      X       -2      -1      X       0
    Inst-6    X       -1      -2      -1      0       1       0       1
    End       X       X       X       X       X       X       X       X

Step 3: Slack Scheduling by computing Estart and Lstart
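Step 3 builds on the MinDist matrix from Step 2, so it may help to see that computation spelled out. The sketch below is an illustration under stated assumptions, not the author's implementation: it reads each adjacency label as (distance, delay), weights an edge as delay - II * distance, and runs a Floyd-style longest-path closure. An II is accepted when no diagonal entry is positive, and RecII is the smallest such II. The names dep_edge, min_dist, is_feasible, and rec_ii are hypothetical, and only the edges listed in the adjacency matrix are included.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NODES 8          /* 0 = Start, 1..6 = Inst-1..Inst-6, 7 = End */
#define NO_PATH INT_MIN  /* plays the role of the "X" entries */

struct dep_edge { int from, to, distance, delay; };

/* Edges read off the adjacency matrix, assumed to be (distance, delay). */
static const struct dep_edge EDGES[] = {
    {1, 4, 0, 1}, {2, 4, 0, 1}, {3, 6, 0, 1},
    {4, 1, 1, 0}, {4, 2, 1, 0}, {4, 5, 0, 1},
    {5, 4, 1, 0}, {6, 1, 1, 1}, {6, 3, 1, 1},
};

/* Fill dist with the longest-path distances for the given II. */
void min_dist(int ii, int dist[NODES][NODES])
{
    assert(ii > 0);
    for (int i = 0; i < NODES; i++)
        for (int j = 0; j < NODES; j++)
            dist[i][j] = NO_PATH;
    for (int i = 1; i < NODES; i++)
        dist[0][i] = 0;                 /* Start -> every other node, (0,0) */
    for (int i = 1; i < NODES - 1; i++)
        dist[i][NODES - 1] = 0;         /* every instruction -> End, (0,0) */
    for (size_t e = 0; e < sizeof EDGES / sizeof EDGES[0]; e++)
        dist[EDGES[e].from][EDGES[e].to] =
            EDGES[e].delay - ii * EDGES[e].distance;

    /* Floyd-style closure, maximizing instead of minimizing. */
    for (int k = 0; k < NODES; k++)
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < NODES; j++)
                if (dist[i][k] != NO_PATH && dist[k][j] != NO_PATH &&
                    dist[i][k] + dist[k][j] > dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}

/* II is feasible when no dependence circuit has positive total weight. */
int is_feasible(int dist[NODES][NODES])
{
    for (int i = 0; i < NODES; i++)
        if (dist[i][i] > 0)
            return 0;
    return 1;
}

/* Smallest feasible II up to max_ii, or -1 if none is found. */
int rec_ii(int max_ii)
{
    int dist[NODES][NODES];
    for (int ii = 1; ii <= max_ii; ii++) {
        min_dist(ii, dist);
        if (is_feasible(dist))
            return ii;
    }
    return -1;
}
```

Under these assumptions, calling min_dist(2, d) yields the entries shown in the MinDist matrix (for example d[0][7] is 2 and d[5][1] is -4), and II = 1 is rejected because the circuit between Inst-3 and Inst-6 has positive weight.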
Floyd Algorithm: MinDist[i,i] ≤ 0 with II (Initiation Interval) = 2

              Start   Inst-1  Inst-2  Inst-3  Inst-4  Inst-5  Inst-6  End
    Start     X       0       0       0       1       2       1       2
    Inst-1    X       0       -1      X       1       2       X       2
    Inst-2    X       -1      0       X       1       2       X       2
    Inst-3    X       0       -1      0       1       2       1       2
    Inst-4    X       -2      -2      X       0       1       X       1
    Inst-5    X       -4      -4      X       -2      0       X       0
    Inst-6    X       -1      -2      -1      0       1       0       1
    End       X       X       X       X       X       X       X       X

Legal Partial Schedule based on Estart and Lstart

    Operation   Slack: Estart   Slack: Lstart   Issue Time
    Inst-1      0               1               0
    Inst-2      0               1               1
    Inst-3      0               1               0
    Inst-4      1               1               1
    Inst-5      0               1               1
    Inst-6      1               1               1

Why Modulo Scheduling is not suitable?

Legal Partial Schedule based on SLACK

    // inst-1 && inst-3
    xh=*(r0+j)   r4=j

    // inst-2 && inst-4 && inst-5 && inst-6
    yh=*r3++   p0=xh*yh   p1=xl*yl   a2=a2+p0   j=r4+1

No Legal Encoding

Due to limited encoding space, the DSP16000 has compound instructions that account for {Inst-i, Inst-j, Inst-k}, but there is NO legal encoding to capture any subset of {Inst-i, Inst-j, Inst-k}.

How to Overcome?
• Software pipelining optimization must be sensitive to Instruction Selection
• This requires that Instruction Selection performs the following tasks in a demand-driven manner:
  - proactively perform Register Renaming
  - proactively introduce additional micro-operations on the fly