I am trying to optimize the inner loop of a filter algorithm written in C/C++,
using Code Composer 2.21 and a TI C6713 target system.
After I tried several optimization techniques, the compiler now generates
the loop kernel shown below.
This code seems pretty good to me: the compiler generates 2 ADDs
and 2 MPYs for almost every block of parallel instructions - this is the
best a C6713 can do.
Ideally, this loop kernel would execute in 6 clock cycles, one per block
of parallel instructions (I am ignoring external memory delays here).
However, I am wondering how the C67xx executes this code, and how
to minimize possible pipeline stalls.
For example: in line [2] an LDW instruction loads a value from memory into B0.
In the next block of instructions, an MPYSP instruction uses B0 (line [7]).
Since an LDW instruction needs 4 delay slots to complete, I wonder how
the processor handles this. The compiler does not generate a NOP 4
instruction to allow the LDW to complete.
Does this sequence of instructions cause a pipeline stall? I.e. does
the DSP wait for the LDW to complete before executing the MPYSP instruction?
Or does it continue execution by using a value PREVIOUSLY loaded in B0,
and use the newly loaded value in line [2] next time in the loop?
The same question applies to the MPYSP instructions, which need
3 delay slots to complete (although a new MPYSP can be issued
every clock cycle).
Any input?
Rene
---------------------------------------------------------------------
Loop kernel generated by CCS 2.21:
L16: ; PIPED LOOP KERNEL
[1] [ A1] B .S2 L16 ; |233| <0,31>
[2] || LDW .D2T2 *+B9(28),B0 ; |232| <2,19>
[3] || ADDSP .L2 B1,B2,B1 ; |232| <2,19>
[4] || MPYSP .M1 A11,A0,A3 ; |232| <4,7>
[5] || LDW .D1T1 *A5,A4 ; |232| <5,1>
[6] ADDSP .L2X B1,A13,B2 ; |232| <0,32>
[7] || MPYSP .M2 B4,B0,B3 ; |232| <2,20>
[8] || MPYSP .M1 A8,A6,A3 ; |232| <2,20>
[9] || ADDSP .L1 A0,A4,A6 ; |232| <2,20>
[10] || LDW .D2T2 *+B9(28),B0 ; |232| <3,14>
[11] || LDW .D1T1 *+A5(4),A12 ; |232| <5,2>
[12] MPYSP .M1 A7,A6,A3 ; |232| <2,21>
[13] || LDW .D1T1 *+A5(4),A6 ; |232| <3,15>
[14] || MPYSP .M2 B5,B3,B1 ; |232| <3,15>
[15] || ADDSP .L2 B1,B2,B2 ; |232| <3,15>
[16] || ADDSP .L1 A4,A0,A4 ; |232| <3,15>
[17] || LDW .D2T2 *+B9(28),B3 ; |232| <4,9>
[18] ADDSP .L2 B3,B2,B1 ; |232| <1,28>
[19] || ADDSP .L1 A3,A6,A13 ; |232| <1,28>
[20] || LDW .D1T1 *+A5(8),A6 ; |232| <3,16>
[21] || MPYSP .M1 A9,A0,A0 ; |232| <3,16>
[22] || MPYSP .M2 B6,B1,B2 ; |232| <4,10>
[23] || LDW .D2T2 *+B9(28),B0 ; |232| <5,4>
[25] LDW .D1T1 *+A5(4),A0 ; |232| <4,11>
[26] || MPYSP .M2 B8,B0,B1 ; |232| <4,11>
[27] || MPYSP .M1 A2,A12,A4 ; |232| <4,11>
[28] || ADDSP .L1 A4,A3,A0 ; |232| <4,11>
[29] || LDW .D2T2 *+B9(24),B1 ; |232| <5,5>
[30] STW .D2T2 B2,*B9++ ; |232| <0,36>
[31] || [ A1] SUB .S1 A1,1,A1 ; |233| <1,30>
[32] || MPYSP .M2 B7,B0,B3 ; |232| <2,24>
[33] || ADDSP .L2 B3,B1,B2 ; |232| <2,24>
[34] || ADDSP .L1 A3,A6,A6 ; |232| <2,24>
[35] || MPYSP .M1 A10,A4,A4 ; |232| <5,6>
[36] || LDW .D1T1 *A5++,A0 ; |232| <6,0>