A lot of my experience has been with fixed point DSP's C62x
and more so C64x DSP. However I wanted to write out some code
and prove to myself that the tools are indeed the best way to
get there, that too for an example as simple as an FIR.
Here is my first pass stab at a single sample FIR C code.
This code models merely the FIR part of the dot-product,
it does not model the circular buffer. I shall address
this in the latter part of my e-mail.
#include <stdio.h>
#include <stdlib.h>
float fir(float *restrict x, float *restrict h, int N)
{
int i;
float sum;
/*-----------*/
/* Initialize FIR accumulator to zero, prior */
/* to the start of the computation. */
/*-----------*/
sum = 0;
/*-----------*/
/* If this is a C6000 build, then inform the */
/* compiler about any safe assumptions that */
/* can be made. In this case we assume that */
/* input array is word aligned and filter */
/* arrays is double word aligned. */
/* word aligned. I am assuming that your */
/* filter has at least 16 taps and is a */
/* multiple of 8. */
/*-----------*/
#ifdef TMS320C6X
_nassert((int)(x)%4 == 0);
_nassert((int)(h)%8 == 0);
_nassert((int)(N)%8 == 0);
_nassert((int)(N >= 16));
#endif
/*-----------*/
/* The following loop iterates over N filter */
/* taps computing the FIR sum, one tap at a */
/* time. */
/*-----------*/
for ( i = 0; i < N; i++)
{
/*--------*/
/* Compute sum of products over all filter */
/* taps accumulating the result into sum. */
/*--------*/
sum += x[i] * h[i];
}
/*-----------*/
/* Return accumulated single sample FIR. */
/*-----------*/
return sum;
}
I compiled using the current shipping 4.31 tools. I used
the following flags for my compile:
cl6x -k -o2 -mwtx -mv6700 -mh -dTMS320C6X fir.c
The compiler produced the following output in which two
multiplies are issued every cycle. Although the code is
written in a straight forward way, tthe odd and even taps
in parallel and accumulates them into seperate accumulators.
It finally adds the seperate accumulators prior to returning.
I will now reproduce the assembler output, which you can
reproduce as well {Isnt that nice ?}.
.sect ".text"
.global _fir
;******************************************************************************
;* FUNCTION NAME:
_fir *
;*
*
;* Regs Modified :
A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;*
B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Regs Used :
A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;*
B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Local Frame Size : 0 Args + 0 Auto + 24 Save = 24
byte *
;******************************************************************************
_fir:
;**
--*
MV .S1X SP,A9 ; |5|
|| STW .D2T1 A10,*SP--(24) ; |5|
STW .D2T2 B13,*+SP(20)
MVK .S2 32,B5
|| STW .D2T2 B12,*+SP(16)
SUB .L2X A4,B5,B5
|| STW .D2T2 B11,*+SP(12)
MV .S1X DP,A10 ; save dp
|| STW .D1T1 A14,*-A9(20)
|| MV .L2 B4,B11
|| STW .D2T2 B10,*+SP(8)
|| MVC .S2 CSR,B4
ADD .L2 4,B5,B12
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <0,1>
|| MV .S1X B4,A8
|| AND .S2 -2,B4,B4
SHR .S1 A6,3,A4 ; |47|
|| MV .L1X B5,A6
|| MVC .S2 B4,CSR ; interrupts off
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <0,0>
LDW .D1T1 *++A6(32),A3 ; |47| (P) <0,2>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| (P) <0,2>
;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 40
;* Loop opening brace source line : 41
;* Loop closing brace source line : 48
;* Loop Unroll Multiple : 8x
;* Known Minimum Trip Count : 2
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 4
;* Unpartitioned Resource Bound : 6
;* Partitioned Resource Bound(*) : 6
;* Resource Partition:
;* A-side B-side
;* .L units 2 6*
;* .S units 1 0
;* .D units 6* 6*
;* .M units 2 6*
;* .X cross paths 2 4
;* .T address paths 6* 6*
;* Long read paths 0 0
;* Long write paths 0 4
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D
unit)
;* Bound(.L .S .LS) 2 3
;* Bound(.L .S .D .LS .LSD) 4 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 6 Schedule found with 4 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |**** * | ** **** *** |
;* 1: |***** ** | ****** *** |
;* 2: |******** | *** ******* |
;* 3: | ******* |* *** ***** * |
;* 4: | **** * |** **** **** * |
;* 5: | *** * |*** ***** ** * |
;* +---------------------------------+
;*
;* Done
;*
;* Epilog not entirely removed
;* Collapsed epilog stages : 2
;*
;* Prolog not entirely removed
;* Collapsed prolog stages : 1
;*
;* Minimum required memory pad : 64 bytes
;*
;* Minimum safe trip count : 1 (after unrolling)
;*----*
;* SETUP CODE
;*
;* MV A6,B12
;* ADD 4,B12,B12
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C23:
;* 0 LDW .D2T2 *++B12(32),DP ; |47|
;* 1 LDDW .D2T2 *B11++(32),B7:B6 ; |47|
;* 2 LDW .D1T1 *++A6(32),A3 ; |47|
;* || LDDW .D2T2 *-B11(24),B5:B4 ; |47|
;* 3 LDW .D2T2 *+B12(4),B6 ; |47|
;* || LDW .D1T1 *+A6(12),A3 ; |47|
;* 4 LDDW .D2T2 *-B11(16),B7:B6 ; |47|
;* || LDW .D1T1 *+A6(16),A4 ; |47|
;* 5 LDW .D1T1 *+A6(20),A4 ; |47|
;* || LDDW .D2T2 *-B11(8),B9:B8 ; |47|
;* 6 LDW .D1T1 *+A6(24),A5 ; |47|
;* 7 MPYSP .M1X B6,A3,A4 ; |47|
;* || MPYSP .M2 B7,DP,B4 ; |47|
;* || LDW .D1T1 *+A6(28),A4 ; |47|
;* 8 MPYSP .M2 B4,B6,B4 ; |47|
;* 9 MPYSP .M2X B6,A4,B8 ; |47|
;* 10 MPYSP .M2X B7,A4,B7 ; |47|
;* 11 ADDSP .L1 A4,A7,A7 ; |47| ^
;* || ADDSP .L2 B4,B3,B3 ; |47| ^
;* || MPYSP .M2X B8,A5,B4 ; |47|
;* 12 ADDSP .L2 B4,B10,B10 ; |47| ^
;* || MPYSP .M2X B5,A3,B4 ; |47|
;* || MPYSP .M1X B9,A4,A5 ; |47|
;* 13 ADDSP .L2 B8,B0,B0 ; |47| ^
;* || [ A1] SUB .S1 A1,1,A1 ; |48|
;* 14 ADDSP .L2 B7,B1,B1 ; |47| ^
;* || [ A1] B .S1 C23 ; |48|
;* 15 ADDSP .L2 B4,B2,B2 ; |47| ^
;* 16 ADDSP .L2 B4,B13,B13 ; |47| ^
;* || ADDSP .L1 A5,A0,A0 ; |47| ^
;* 17 NOP 3
;* ; BRANCH OCCURS ; |48|
;*----*
L1: ; PIPED LOOP PROLOG
LDW .D1T1 *+A6(12),A3 ; |47| (P) <0,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| (P) <0,3>
LDW .D1T1 *+A6(16),A4 ; |47| (P) <0,4>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| (P) <0,4>
ZERO .S2 B0 ; |47|
|| ZERO .L2 B2 ; |47|
|| ZERO .S1 A7 ; |47|
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| (P) <0,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| (P) <0,5>
ZERO .S2 B1 ; |47|
|| ZERO .L1 A0 ; |47|
|| ZERO .L2 B13 ; |47|
|| MV .S1X B3,A14
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <1,0>
|| LDW .D1T1 *+A6(24),A5 ; |47| (P) <0,6>
MVK .S1 0x1,A2 ; init prolog collapse
predicate
|| SUB .L1 A4,1,A1
|| ZERO .S2 B3 ; |47|
|| ZERO .L2 B10 ; |47|
|| MPYSP .M1X B6,A3,A4 ; |47| (P) <0,7>
|| MPYSP .M2 B7,DP,B4 ; |47| (P) <0,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <1,1>
|| LDW .D1T1 *+A6(28),A4 ; |47| (P) <0,7>
;**
--*
L2: ; PIPED LOOP KERNEL
[ A1] B .S1 L2 ; |48| <0,14>
|| [!A2] ADDSP .L2 B7,B1,B1 ; |47| <0,14> ^
|| MPYSP .M2 B4,B6,B4 ; |47| <1,8>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| <2,2>
|| LDW .D1T1 *++A6(32),A3 ; |47| <2,2>
[!A2] ADDSP .L2 B4,B2,B2 ; |47| <0,15> ^
|| MPYSP .M2X B6,A4,B8 ; |47| <1,9>
|| LDW .D1T1 *+A6(12),A3 ; |47| <2,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| <2,3>
[!A2] ADDSP .L2 B4,B13,B13 ; |47| <0,16> ^
|| [!A2] ADDSP .L1 A5,A0,A0 ; |47| <0,16> ^
|| MPYSP .M2X B7,A4,B7 ; |47| <1,10>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| <2,4>
|| LDW .D1T1 *+A6(16),A4 ; |47| <2,4>
ADDSP .L1 A4,A7,A7 ; |47| <1,11> ^
|| MPYSP .M2X B8,A5,B4 ; |47| <1,11>
|| ADDSP .L2 B4,B3,B3 ; |47| <1,11> ^
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| <2,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| <2,5>
[ A2] SUB .S1 A2,1,A2 ; <0,18>
|| MPYSP .M2X B5,A3,B4 ; |47| <1,12>
|| MPYSP .M1X B9,A4,A5 ; |47| <1,12>
|| ADDSP .L2 B4,B10,B10 ; |47| <1,12> ^
|| LDW .D1T1 *+A6(24),A5 ; |47| <2,6>
|| LDW .D2T2 *++B12(32),DP ; |47| <3,0>
[ A1] SUB .S1 A1,1,A1 ; |48| <1,13>
|| ADDSP .L2 B8,B0,B0 ; |47| <1,13> ^
|| MPYSP .M1X B6,A3,A4 ; |47| <2,7>
|| LDW .D1T1 *+A6(28),A4 ; |47| <2,7>
|| MPYSP .M2 B7,DP,B4 ; |47| <2,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| <3,1>
;**
--*
L3: ; PIPED LOOP EPILOG
MV .S1X SP,A9 ; |55|
|| ADDSP .L2 B7,B1,B1 ; |47| (E) <3,14> ^
ADDSP .L2 B4,B2,B2 ; |47| (E) <3,15> ^
ADDSP .L1 A5,A0,A0 ; |47| (E) <3,16> ^
|| ADDSP .L2 B4,B13,B13 ; |47| (E) <3,16> ^
LDDW .D2T2 *+SP(8),B11:B10 ; |55|
|| MV .S1X B10,A8
|| MV .S2X A8,B4
MV .S2X A10,DP ; restore dp
MVC .S2 B4,CSR ; interrupts on
LDDW .D2T2 *+SP(16),B13:B12 ; |55|
|| ADDSP .L1X A8,B13,A6 ; |54|
NOP 3
LDW .D1T1 *+A9(4),A14 ; |55|
|| MV .S2X A14,B3 ; |55|
|| ADDSP .L1X B3,A6,A3 ; |54|
LDW .D2T1 *++SP(24),A10 ; |55|
NOP 2
ADDSP .L1 A7,A3,A3 ; |54|
NOP 3
ADDSP .L1 A0,A3,A0 ; |54|
NOP 3
ADDSP .L1X B2,A0,A0 ; |54|
NOP 3
ADDSP .L1X B1,A0,A0 ; |54|
NOP 1
RET .S2 B3 ; |55|
NOP 1
ADDSP .L1X B0,A0,A4 ; |54|
NOP 3
; BRANCH OCCURS ; |55|
Extra Comments
---------------
a. This code decrements the stack frame by 16-bytes to store
3 words A10, B10, B11. Even though 12 bytes of stack storage
would have been adequate, it needs to decrement 16 bytes,
in order to leave the incoming double word aligned stack
frame double word aligned at the end of the transaction.
b. You only need to save A10-A15 and B10-B15 if these are being
modified by your code. You need not worry about other registers
you are modifying.
c. Also notice the comments "interrupts off". This is where the
compiler truns off the "GIE" bit off CSR to guarantee that
interrupts dont mess you up, now that you are no longer in
single register assignment mode.
d. Also notice how two prolog and epilog stages have been
collapsed to achieve code-size reductions.
e. The compiler finds a 6 cycle loop in which 8 multiplies
are performed to achieve 1.3 multiplies/cycle. The reason,
this is not a 4 cycle loop, is because the input array
can only be assumed to be word-aligned, as only one
output fir sample is computed at a time. the filter
array can be assumed to be double word aligned.
f. Notice from the compiler feedback {-mw} flag, how all
units are maxed out.
g. Also, if one could compute even two fir output samples
at a time. In the C code, specifying
_nassert((int)(x)%8 == 0);
and computing two output samples in parallel would give
a better multiplier utilization.
Circular buffer
---------------
The delay line is modeled by keeping an array of input samples,
in the input array of size KN and copying the last N -1 samples,
after computing (K-1)N output samples, to the head of the array.
Since the memcpy is done once, for every (K-1)N output samples,
it is inexpensive, as opposed to maintaining the delay line manually.
This avoids explictly having to incorporate circular buffering in
your code.
X:
<-N input samples->|<N input samples->|..............|<N input
samples>|
If you still want to use circular buffering you could use serial
assembly to do so.
Regards
Jagadeesh Sankaran
|