I am trying to optimize the inner loop of a filter algorithm written in C/C++, using Code Composer 2.21 and a TI C6713 target system. After trying several optimization techniques, the compiler generates the loop kernel shown below. This code seems pretty good to me: the compiler generates 2 ADDs and 2 MPYs for almost every block of parallel instructions, which is the best a C6713 can do. Ideally, this loop kernel would execute in 6 clock cycles (I am disregarding external memory delays here).

However, I am wondering how the C67xx executes this code, and how to minimize possible pipeline stalls. For example: in line [2] an LDW instruction loads a value from memory into B0. In the next block of instructions, an MPYSP instruction uses B0 [line 7]. Since an LDW instruction needs 4 delay slots to complete, I wonder how the processor handles this. The compiler does not generate a NOP 4 instruction to allow the LDW to complete.

Does this sequence of instructions cause a pipeline stall? I.e., does the DSP wait for the LDW to complete before executing the MPYSP instruction? Or does it continue execution using a value PREVIOUSLY loaded into B0, and use the value newly loaded in line [2] the next time through the loop?

The same question applies to the MPYSP instructions, which need 5 delay slots to complete (although a new MPYSP can be started on every clock cycle).

Any input?

Rene

---------------------------------------------------------------------
Loop kernel generated by CCS 2.21:

L16: ; PIPED LOOP KERNEL
[1]     [ A1] B .S2 L16 ; |233| <0,31>
[2]  || LDW .D2T2 *+B9(28),B0 ; |232| <2,19>
[3]  || ADDSP .L2 B1,B2,B1 ; |232| <2,19>
[4]  || MPYSP .M1 A11,A0,A3 ; |232| <4,7>
[5]  || LDW .D1T1 *A5,A4 ; |232| <5,1>
[6]     ADDSP .L2X B1,A13,B2 ; |232| <0,32>
[7]  || MPYSP .M2 B4,B0,B3 ; |232| <2,20>
[8]  || MPYSP .M1 A8,A6,A3 ; |232| <2,20>
[9]  || ADDSP .L1 A0,A4,A6 ; |232| <2,20>
[10] || LDW .D2T2 *+B9(28),B0 ; |232| <3,14>
[11] || LDW .D1T1 *+A5(4),A12 ; |232| <5,2>
[12]    MPYSP .M1 A7,A6,A3 ; |232| <2,21>
[13] || LDW .D1T1 *+A5(4),A6 ; |232| <3,15>
[14] || MPYSP .M2 B5,B3,B1 ; |232| <3,15>
[15] || ADDSP .L2 B1,B2,B2 ; |232| <3,15>
[16] || ADDSP .L1 A4,A0,A4 ; |232| <3,15>
[17] || LDW .D2T2 *+B9(28),B3 ; |232| <4,9>
[18]    ADDSP .L2 B3,B2,B1 ; |232| <1,28>
[19] || ADDSP .L1 A3,A6,A13 ; |232| <1,28>
[20] || LDW .D1T1 *+A5(8),A6 ; |232| <3,16>
[21] || MPYSP .M1 A9,A0,A0 ; |232| <3,16>
[22] || MPYSP .M2 B6,B1,B2 ; |232| <4,10>
[23] || LDW .D2T2 *+B9(28),B0 ; |232| <5,4>
[25]    LDW .D1T1 *+A5(4),A0 ; |232| <4,11>
[26] || MPYSP .M2 B8,B0,B1 ; |232| <4,11>
[27] || MPYSP .M1 A2,A12,A4 ; |232| <4,11>
[28] || ADDSP .L1 A4,A3,A0 ; |232| <4,11>
[29] || LDW .D2T2 *+B9(24),B1 ; |232| <5,5>
[30]    STW .D2T2 B2,*B9++ ; |232| <0,36>
[31] || [ A1] SUB .S1 A1,1,A1 ; |233| <1,30>
[32] || MPYSP .M2 B7,B0,B3 ; |232| <2,24>
[33] || ADDSP .L2 B3,B1,B2 ; |232| <2,24>
[34] || ADDSP .L1 A3,A6,A6 ; |232| <2,24>
[35] || MPYSP .M1 A10,A4,A4 ; |232| <5,6>
[36] || LDW .D1T1 *A5++,A0 ; |232| <6,0>

TI C67xx pipeline question
Started by ●April 5, 2004
Reply by ●April 5, 2004
In article <hcc2705eho40snvrh10a0a48365lcomgio@4ax.com>, Rene Kellenbach <me@nowhere> wrote:

| For example: in line [2] a LDW instruction loads a value from memory in B0.
| In the next block instructions a MPYSP instruction uses B0 [line 7].
| Since a LDW instruction needs 4 delay slots to complete, I wonder how
| the processor handles this. The compiler does not generate a NOP 4
| instruction to allow the LDW to complete.
|
| Does this sequence of instructions cause a pipeline stall? I.e. does
| the DSP wait for the LDW to complete before executing the MPYSP instruction?
| Or does it continue execution by using a value PREVIOUSLY loaded in B0,
| and use the newly loaded value in line [2] next time in the loop?

No, the sequence you showed does not stall; it uses previous values as you thought.

The C6x architecture has an "exposed pipeline" which doesn't perform dependency interlocking -- the code must insert explicit NOPs to "stall" the pipeline, if required. However, this does allow software loop pipelining techniques which can perform read references to previous values in a register during the update latency of a prior instruction which is writing that register.

-- Tim Olson

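To make the "previous value" behaviour concrete, here is a minimal C sketch of a toy model of the exposed pipeline. It is not TI code and does not reflect how the hardware is built. The names `ToyCpu`, `issue_ldw_b0`, `end_cycle` and `b0_seen_at` are hypothetical, and the only figure taken from the thread is the 4 delay slots for LDW. The model assumes no memory stalls: a load issued in cycle 0 updates B0 at the end of cycle 4, so readers in cycles 0 through 4 still see the old value.

```c
#include <assert.h>

#define LDW_DELAY_SLOTS 4  /* from the thread: LDW needs 4 delay slots */
#define MAX_IN_FLIGHT   8  /* arbitrary capacity for this toy model */

typedef struct {
    int   cycles_left;  /* cycle boundaries left before the write lands; 0 = slot free */
    float value;        /* value the load will eventually write */
} PendingLoad;

typedef struct {
    float       b0;                        /* what a reader of B0 sees right now */
    PendingLoad in_flight[MAX_IN_FLIGHT];  /* loads issued but not yet written back */
} ToyCpu;

/* Issue a load into B0. The register is NOT updated here; nothing interlocks. */
static void issue_ldw_b0(ToyCpu *cpu, float loaded)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        if (cpu->in_flight[i].cycles_left == 0) {
            cpu->in_flight[i].cycles_left = LDW_DELAY_SLOTS + 1;
            cpu->in_flight[i].value = loaded;
            return;
        }
    }
    assert(!"too many loads in flight for this toy model");
}

/* Close one cycle: any load whose delay slots have elapsed writes B0 now. */
static void end_cycle(ToyCpu *cpu)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        if (cpu->in_flight[i].cycles_left > 0 &&
            --cpu->in_flight[i].cycles_left == 0) {
            cpu->b0 = cpu->in_flight[i].value;
        }
    }
}

/* B0 holds old_value and a load of new_value issues in cycle 0.
 * Returns what an instruction reading B0 in cycle read_cycle would see. */
static float b0_seen_at(int read_cycle, float old_value, float new_value)
{
    ToyCpu cpu = { .b0 = old_value };
    issue_ldw_b0(&cpu, new_value);
    for (int c = 0; c < read_cycle; c++) {
        end_cycle(&cpu);
    }
    return cpu.b0;
}
```

In this model, `b0_seen_at(1, 1.0f, 2.0f)` corresponds to the MPYSP in the execute packet right after the LDW and returns the old value; the new value is first visible at `read_cycle` 5. The compiler's software pipelining relies on exactly this window, which is why no NOP 4 is needed in the kernel.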
Reply by ●April 5, 2004
Tim Olson <ogailx502@NOSPAMsneakemail.com> wrote:

>| Does this sequence of instructions cause a pipeline stall? I.e. does
>| the DSP wait for the LDW to complete before executing the MPYSP instruction?
>| Or does it continue execution by using a value PREVIOUSLY loaded in B0,
>| and use the newly loaded value in line [2] next time in the loop?
>
>No, the sequence you showed does not stall; it uses previous values as
>you thought.
>
>The C6x architecture has an "exposed pipeline" which doesn't perform
>dependency interlocking -- the code must insert explicit NOPs to "stall"
>the pipeline, if required. However, this does allow software loop
>pipelining techniques which can perform read references to previous
>values in a register during the update latency of a prior instruction
>which is writing that register.

Great! Thanks, Tim.

Rene