Reply by Rene Kellenbach April 5, 2004
Tim Olson <ogailx502@NOSPAMsneakemail.com> wrote:

>| Does this sequence of instructions cause a pipeline stall? I.e. does
>| the DSP wait for the LDW to complete before executing the MPYSP instruction?
>| Or does it continue execution by using a value PREVIOUSLY loaded in B0,
>| and use the newly loaded value in line [2] next time in the loop?
>
>No, the sequence you showed does not stall; it uses previous values as
>you thought.
>
>The C6x architecture has an "exposed pipeline" which doesn't perform
>dependency interlocking -- the code must insert explicit NOPs to "stall"
>the pipeline, if required. However, this does allow software loop
>pipelining techniques which can perform read references to previous
>values in a register during the update latency of a prior instruction
>which is writing that register.

Great! Thanks, Tim.

Rene
Reply by Tim Olson April 5, 2004
In article <hcc2705eho40snvrh10a0a48365lcomgio@4ax.com>,
 Rene Kellenbach <me@nowhere> wrote:

|  For example: in line [2] an LDW instruction loads a value from memory into B0.
|  In the next block of instructions an MPYSP instruction uses B0 (line [7]).
|  Since an LDW instruction needs 4 delay slots to complete, I wonder how
|  the processor handles this. The compiler does not generate a NOP 4
|  instruction to allow the LDW to complete.
|  
|  Does this sequence of instructions cause a pipeline stall? I.e. does
|  the DSP wait for the LDW to complete before executing the MPYSP instruction?
|  Or does it continue execution by using a value PREVIOUSLY loaded in B0,
|  and use the newly loaded value in line [2] next time in the loop?

No, the sequence you showed does not stall; it uses previous values as 
you thought.

The C6x architecture has an "exposed pipeline" which doesn't perform 
dependency interlocking -- the code must insert explicit NOPs to "stall" 
the pipeline, if required.  However, this does allow software loop 
pipelining techniques which can perform read references to previous 
values in a register during the update latency of a prior instruction 
which is writing that register.
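
A toy model may make this concrete. The sketch below is Python, not C6x code, and the class ExposedPipeline and its methods are invented purely for illustration. It assumes only what is stated above: a load's result becomes visible after its delay slots have elapsed (4 for LDW), and nothing stalls in the meantime.

```python
class ExposedPipeline:
    """Toy register file with delayed write-back and no interlocks (hypothetical model)."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.cycle = 0
        self.pending = []  # (cycle at which the write becomes visible, register, value)

    def read(self, reg):
        # Reads never wait: they return whatever the register holds right now.
        return self.regs[reg]

    def issue_write(self, reg, value, delay_slots):
        # The result is visible to instructions issued delay_slots + 1 cycles later.
        self.pending.append((self.cycle + delay_slots + 1, reg, value))

    def tick(self):
        # Advance one cycle and commit every write that has become due.
        self.cycle += 1
        still_pending = []
        for due, reg, value in self.pending:
            if due <= self.cycle:
                self.regs[reg] = value
            else:
                still_pending.append((due, reg, value))
        self.pending = still_pending


cpu = ExposedPipeline({"B0": 1.0, "B4": 2.0})

cpu.issue_write("B0", 10.0, delay_slots=4)   # cycle 0: LDW ...,B0
cpu.tick()
product_next_cycle = cpu.read("B4") * cpu.read("B0")   # cycle 1: MPYSP sees the OLD B0

for _ in range(4):
    cpu.tick()
product_after_delay = cpu.read("B4") * cpu.read("B0")  # cycle 5: MPYSP sees the NEW B0
```

In this model the multiply issued one cycle after the load computes 2.0 * 1.0, and only a multiply issued five cycles after the load computes 2.0 * 10.0. A software-pipelined loop relies on exactly this window to keep using the previous value.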

   -- Tim Olson
Reply by Rene Kellenbach April 5, 2004
I am trying to optimize the inner loop of a filter algorithm written in C/C++,
using Code Composer 2.21 and a TI C6713 target system.

After I tried several optimization techniques, the compiler generates
the following loop kernel (see below).
This code seems pretty good to me: the compiler issues 2 ADDSPs
and 2 MPYSPs in almost every block of parallel instructions, which is the
best a C6713 can do.

Ideally, this loop kernel would execute in 6 clock cycles (I am ignoring
external memory delays here).

However, I am wondering how the C67xx executes this code, and how
to minimize possible pipeline stalls.

For example: in line [2] an LDW instruction loads a value from memory into B0.
In the next block of instructions an MPYSP instruction uses B0 (line [7]).
Since an LDW instruction needs 4 delay slots to complete, I wonder how
the processor handles this. The compiler does not generate a NOP 4
instruction to allow the LDW to complete.

Does this sequence of instructions cause a pipeline stall? I.e. does
the DSP wait for the LDW to complete before executing the MPYSP instruction?
Or does it continue execution by using a value PREVIOUSLY loaded in B0,
and use the newly loaded value in line [2] next time in the loop?

The same question applies to the MPYSP instructions, which need
3 delay slots to complete (although a new MPYSP can be issued
every clock cycle).

Any input?

Rene

---------------------------------------------------------------------
Loop kernel generated by CCS 2.21:

L16:    ; PIPED LOOP KERNEL

[1]       [ A1]  B       .S2     L16               ; |233| <0,31>
[2]   ||         LDW     .D2T2   *+B9(28),B0       ; |232| <2,19>
[3]   ||         ADDSP   .L2     B1,B2,B1          ; |232| <2,19>
[4]   ||         MPYSP   .M1     A11,A0,A3         ; |232| <4,7>
[5]   ||         LDW     .D1T1   *A5,A4            ; |232| <5,1>

[6]              ADDSP   .L2X    B1,A13,B2         ; |232| <0,32>
[7]   ||         MPYSP   .M2     B4,B0,B3          ; |232| <2,20>
[8]   ||         MPYSP   .M1     A8,A6,A3          ; |232| <2,20>
[9]   ||         ADDSP   .L1     A0,A4,A6          ; |232| <2,20>
[10]  ||         LDW     .D2T2   *+B9(28),B0       ; |232| <3,14>
[11]  ||         LDW     .D1T1   *+A5(4),A12       ; |232| <5,2>

[12]             MPYSP   .M1     A7,A6,A3          ; |232| <2,21>
[13]  ||         LDW     .D1T1   *+A5(4),A6        ; |232| <3,15>
[14]  ||         MPYSP   .M2     B5,B3,B1          ; |232| <3,15>
[15]  ||         ADDSP   .L2     B1,B2,B2          ; |232| <3,15>
[16]  ||         ADDSP   .L1     A4,A0,A4          ; |232| <3,15>
[17]  ||         LDW     .D2T2   *+B9(28),B3       ; |232| <4,9>

[18]             ADDSP   .L2     B3,B2,B1          ; |232| <1,28>
[19]  ||         ADDSP   .L1     A3,A6,A13         ; |232| <1,28>
[20]  ||         LDW     .D1T1   *+A5(8),A6        ; |232| <3,16>
[21]  ||         MPYSP   .M1     A9,A0,A0          ; |232| <3,16>
[22]  ||         MPYSP   .M2     B6,B1,B2          ; |232| <4,10>
[23]  ||         LDW     .D2T2   *+B9(28),B0       ; |232| <5,4>

[25]             LDW     .D1T1   *+A5(4),A0        ; |232| <4,11>
[26]  ||         MPYSP   .M2     B8,B0,B1          ; |232| <4,11>
[27]  ||         MPYSP   .M1     A2,A12,A4         ; |232| <4,11>
[28]  ||         ADDSP   .L1     A4,A3,A0          ; |232| <4,11>
[29]  ||         LDW     .D2T2   *+B9(24),B1       ; |232| <5,5>

[30]             STW     .D2T2   B2,*B9++          ; |232| <0,36>
[31]  || [ A1]  SUB     .S1     A1,1,A1           ; |233| <1,30>
[32]  ||         MPYSP   .M2     B7,B0,B3          ; |232| <2,24>
[33]  ||         ADDSP   .L2     B3,B1,B2          ; |232| <2,24>
[34]  ||         ADDSP   .L1     A3,A6,A6          ; |232| <2,24>
[35]  ||         MPYSP   .M1     A10,A4,A4         ; |232| <5,6>
[36]  ||         LDW     .D1T1   *A5++,A0          ; |232| <6,0>
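
As a rough cross-check of the 6-cycle figure, the kernel above can be tallied by functional unit. The sketch below is Python, with the listing transcribed by hand into (mnemonic, unit) pairs. The names KERNEL, unit_usage and resource_bound are made up for illustration. It assumes only that the C67x has two units of each kind (.L, .S, .M, .D), and it ignores cross-path limits, register pressure and memory bank conflicts.

```python
from collections import Counter
from math import ceil

# Each inner list is one execute packet (one cycle) of the kernel above,
# transcribed as (mnemonic, functional unit) pairs.
KERNEL = [
    [("B", "S2"), ("LDW", "D2"), ("ADDSP", "L2"), ("MPYSP", "M1"), ("LDW", "D1")],
    [("ADDSP", "L2"), ("MPYSP", "M2"), ("MPYSP", "M1"), ("ADDSP", "L1"),
     ("LDW", "D2"), ("LDW", "D1")],
    [("MPYSP", "M1"), ("LDW", "D1"), ("MPYSP", "M2"), ("ADDSP", "L2"),
     ("ADDSP", "L1"), ("LDW", "D2")],
    [("ADDSP", "L2"), ("ADDSP", "L1"), ("LDW", "D1"), ("MPYSP", "M1"),
     ("MPYSP", "M2"), ("LDW", "D2")],
    [("LDW", "D1"), ("MPYSP", "M2"), ("MPYSP", "M1"), ("ADDSP", "L1"), ("LDW", "D2")],
    [("STW", "D2"), ("SUB", "S1"), ("MPYSP", "M2"), ("ADDSP", "L2"),
     ("ADDSP", "L1"), ("MPYSP", "M1"), ("LDW", "D1")],
]

UNITS_PER_KIND = 2  # assumption: two each of .L, .S, .M and .D


def unit_usage(kernel):
    """Count instructions per unit kind (L, S, M or D) over the whole kernel."""
    return Counter(unit[0] for packet in kernel for _, unit in packet)


def resource_bound(kernel):
    """Fewest cycles the instruction mix can take, given two units of each kind."""
    return max(ceil(n / UNITS_PER_KIND) for n in unit_usage(kernel).values())


usage = unit_usage(KERNEL)
bound = resource_bound(KERNEL)
print(sorted(usage.items()), bound)
```

By this count the .D units carry 12 memory operations (11 LDW plus 1 STW) in 6 cycles, so they are fully occupied. The .M units carry 11 MPYSPs and the .L units 10 ADDSPs. Under these assumptions the loop is limited by loads and stores rather than by arithmetic, and 6 cycles is the floor for this instruction mix.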