DSPRelated.com
Forums

sample-by-sample FIR optimized

Started by Wojciech Rewers September 17, 2003
--- Arius - Rick Collins <> wrote:
> One final suggestion and I will let you get on with
> it. You can optimize
> the code to use the LDDW if you copy your input data
> into *two* buffers,
> one odd aligned and one even aligned. Alternate
> between them on each
> sample and the LDDW should be in the right place
> each time.

hm... well... I'm affraid it's not gonna work as I'm
using circular addressing - so there data block has to
be aligned to the block size boundary...

__________________________________



At 03:31 AM 9/22/2003, you wrote:
>--- Arius - Rick Collins <> wrote:
> > One final suggestion and I will let you get on with
> > it. You can optimize
> > the code to use the LDDW if you copy your input data
> > into *two* buffers,
> > one odd aligned and one even aligned. Alternate
> > between them on each
> > sample and the LDDW should be in the right place
> > each time.
>
>hm... well... I'm affraid it's not gonna work as I'm
>using circular addressing - so there data block has to
>be aligned to the block size boundary...


You need to stop looking at the limitations and start looking at the ways
to make it work. Circular addressing does not need to match the buffer
size to the block size. The block size just needs to be as large or
larger. If you are tight on buffer space, use 254 taps instead of 256; or
use 512 words for your blocks.

Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX



Rick-

> One final suggestion and I will let you get on with it. You can optimize
> the code to use the LDDW if you copy your input data into *two* buffers,
> one odd aligned and one even aligned. Alternate between them on each
> sample and the LDDW should be in the right place each time.

Yeah, that's a good idea -- the function could still pass one buffer, and the
"copy
split" could be done at function start, transparently to the caller. Especially
if
Wojciech makes the algorithm do more per tap (LMS) then the extra copy loop
would
become less significant.

-Jeff

> At 03:15 AM 9/22/2003, you wrote:
> >--- Arius - Rick Collins <> wrote:
> > > I took a look at the code Andrew Elder posted and I
> > > think I understand what they are doing.
> >
> >yeah... I actually realized I should digest that code
> >even before Andrew advised that... so I did... at
> >first I couldn't understand any of it, but after a
> >while I realized what's going on ;-) god it's simple!
> >and very neat too...
> >
> > > I think this only works from internal memory).
> >
> >and why would that be?
> >
> > > The final adds account for
> > > the fact that since the pipeline in the mpy and add
> > > are 4 clocks, there are
> > > really four MACs going on in each register, so they
> > > have to be summed to
> > > get the final result.
> >
> >yeah... understanding most of that algorithm was easy
> >- the ADDSPs was the most difficult part as I couldn't
> >figure out how you can accumulate into just one
> >register if the ADDSP takes 4 cycles... in fact - even
> >though only one register is used - still there are
> >founr accumulation going on - each for every fourth
> >cycle - and the results are kept somewhere in the
> >pipeline I guess - so then those four sums for each
> >register have to be added together... the spru198
> >explains the whole process quite well... so - you know
> >how it is - if all possibilities fail - read the
> >manual ;-)
> >
> > > As you can see, it takes a lot to write code like
> > > this.
> >
> >well - again I was proved, there are no difficult
> >things in life - only new ones... at first when I
> >looked at that dotp function - it was all black magic
> >to me - now when I look at it all so simple and neat -
> >I think that functions shows excellently the beauty
> >and power of C6x devices...
> >
> > > It is much better to use library code if possible.
> >
> >well - nobody likes to reinvent the wheel...
> >especially when the deadline is close...
> >
> > > So this code will do what you want if you
> > > can change the addressing for a circular buffer.
> > > Can you figure out how to do that?
> >
> >and what do you think I was doing when you were making
> >out with Isabelle? ;-)
> >
> >I had to give up half the speed though... since I need
> >"sample-by-sample" algotithm - I can't have any
> >constraints on the location of samples in the memory -
> >and using LDDW imposes on you that the samples are
> >aligned on odd-even addresses or something like
> >that...
> >
> >so what I did is that I changed LDDW into LDW, skipped
> >the parallel MPYSPs and ADDSPs plus I use circular
> >buffer for data... if anyone is interested - the code
> >is here:
> >
> >http://www.wrewers.karolin.pl/firc.asm
> >
> >maybe somebody can even find a bug or something -
> >please do let me know...
> >
> >well - I started from thousands of cycles first to
> >filter the data (using library code) then to shift all
> >the samples in the buffer by one - now I have a code
> >that is performing "the same" in around 500 cycles -
> >it's enough for me as for now - but the interesting
> >aspect is that the same thing can be done for so many
> >different ways ;-) the beauty of DSPs I guess...
> >
> >anyway - I have that FIR now - now I will only need a
> >LMS to adapt FIR's coeficients - but I think I'll
> >figure something out...
> >
> >Wojciech Rewers (still looking for a job)
> >
> >__________________________________
> >
> Rick Collins >
> Arius - A Signal Processing Solutions Company
> Specializing in DSP and FPGA design http://www.arius.com
> 4 King Ave 301-682-7772 Voice
> Frederick, MD 21701-3110 301-682-7666 FAX



--- Arius - Rick Collins <> wrote:
> You need to stop looking at the limitations and
> start looking at the ways to make it work.

well... as I said before I'm fine with the results I
obrained... sure there is some space for optimization,
but it's not gonna help my application to spend _even_
_more_ time waiting... or is it?

so - instead of spending my time for further
optimizations - I moved on and I'm working on other
critical parts of the project... if I finish before my
deadline - I may even come back to that optimization
problem - we'll see...

> Circular addressing does not need to match the
buffer size to the block size.

I know that, however - my algorithm is filtering each
coming sample 4 times with 4 different filters - the
first has 256 coefs, the second and third are 32 taps
long and the last one is 64 coefs - they are all power
of two length - that's all I need... and with the code
I have I'm able to apply all those filters plus some
other minor calculations on every single sample taken
with 16kHz rate - that's all I need...

Wojciech Rewers

__________________________________


Wojciech Rewers wrote:

Hey nice work. Every so often there is a vigorous debate on the group about
whether
CCS linear assembly can produce code as efficient as hand-optimized. If you
search
the C6x archive for the thread "C-Coding" in the Nov/Dec 2002 time-frame, you
will
see some interesting discussion. There was one TI person in particular,
Jagadeesh
Sankaran, who argued strongly that hand-optimization should be avoided. He made
some
excellent suggestions for optimization within CCS.

But when I see some excellent work like yours and the dramatic speed increase,
it
tends to reinforce the idea that humans can still be pretty useful.

-Jeff

> --- Arius - Rick Collins <> wrote:
> > I took a look at the code Andrew Elder posted and I
> > think I understand what they are doing.
>
> yeah... I actually realized I should digest that code
> even before Andrew advised that... so I did... at
> first I couldn't understand any of it, but after a
> while I realized what's going on ;-) god it's simple!
> and very neat too...
>
> > I think this only works from internal memory).
>
> and why would that be?
>
> > The final adds account for
> > the fact that since the pipeline in the mpy and add
> > are 4 clocks, there are
> > really four MACs going on in each register, so they
> > have to be summed to
> > get the final result.
>
> yeah... understanding most of that algorithm was easy
> - the ADDSPs was the most difficult part as I couldn't
> figure out how you can accumulate into just one
> register if the ADDSP takes 4 cycles... in fact - even
> though only one register is used - still there are
> founr accumulation going on - each for every fourth
> cycle - and the results are kept somewhere in the
> pipeline I guess - so then those four sums for each
> register have to be added together... the spru198
> explains the whole process quite well... so - you know
> how it is - if all possibilities fail - read the
> manual ;-)
>
> > As you can see, it takes a lot to write code like
> > this.
>
> well - again I was proved, there are no difficult
> things in life - only new ones... at first when I
> looked at that dotp function - it was all black magic
> to me - now when I look at it all so simple and neat -
> I think that functions shows excellently the beauty
> and power of C6x devices...
>
> > It is much better to use library code if possible.
>
> well - nobody likes to reinvent the wheel...
> especially when the deadline is close...
>
> > So this code will do what you want if you
> > can change the addressing for a circular buffer.
> > Can you figure out how to do that?
>
> and what do you think I was doing when you were making
> out with Isabelle? ;-)
>
> I had to give up half the speed though... since I need
> "sample-by-sample" algotithm - I can't have any
> constraints on the location of samples in the memory -
> and using LDDW imposes on you that the samples are
> aligned on odd-even addresses or something like
> that...
>
> so what I did is that I changed LDDW into LDW, skipped
> the parallel MPYSPs and ADDSPs plus I use circular
> buffer for data... if anyone is interested - the code
> is here:
>
> http://www.wrewers.karolin.pl/firc.asm
>
> maybe somebody can even find a bug or something -
> please do let me know...
>
> well - I started from thousands of cycles first to
> filter the data (using library code) then to shift all
> the samples in the buffer by one - now I have a code
> that is performing "the same" in around 500 cycles -
> it's enough for me as for now - but the interesting
> aspect is that the same thing can be done for so many
> different ways ;-) the beauty of DSPs I guess...
>
> anyway - I have that FIR now - now I will only need a
> LMS to adapt FIR's coeficients - but I think I'll
> figure something out...
>
> Wojciech Rewers (still looking for a job)




Jagadeesh-

Thanks! I had a feeling you were still there, listening to all of this :-)

-Jeff

Jagadeesh Sankaran wrote:
>
> First of all, I would still strongly advise folks! not to
> give up hopes on tools and automatic code generation in a
> flash. I will illustrate the perils, based on the code that
> has been developed at:
>
> http://www.wrewers.karolin.pl/firc.asm
>
> This code has several issues. I am not trying to nit-pick.
> Developing hand-optimized VLIW code has its perils. Let the
> tools do their job. I will list some of the bugs that immediately
> catch my eye. I cannot vouch that I have caught all of them.
> BTW all these bugs can be avoided by using tools, so that
> one does not have to become intricately familiar to do code
> development of a simple FIR.
>
> a. The stack pointer needs to be double word aligned, at all
> times. The stores to the stack need to be done by pre-decrementing
> the stack frame and leaving it double word aligned.
>
> b. The save on entry registers A10-A15 and B10-B15 need to be saved
> for sure, upon entry to a function.
>
> c. This code is not single register assignment, and hence you need to
> turn off interrupts while you are in this code.
>
> d. This single cycle loop as shown could have been achieved with the
> tools,
> without a doubt. Further, if you allow at least two output samples to be
> computed in parallel, you can get 100% multiplier utilization.
>
> e. Most fir implementations are written to perform block processing,
> which is why the serial ports are buffered McBSP, with the B for
> Buffering.
>
> f. Also the delay line is implemented once as a memcpy, by moving the
> block of samples required for overlap, without explictly doing it in
> the kernel. This removes the need for circular buffering.
>
> g. Take a look at codec_edma.c under the DSK directory for an example of
> block based interaction with the serial port.
>
> By the way I have not seen anything spectacular to give up on the tools
> yet!
>
> Regards
> Jagadeesh Sankaran



First of all, I would still strongly advise folks! not to
give up hopes on tools and automatic code generation in a
flash. I will illustrate the perils, based on the code that
has been developed at:

http://www.wrewers.karolin.pl/firc.asm

This code has several issues. I am not trying to nit-pick.
Developing hand-optimized VLIW code has its perils. Let the
tools do their job. I will list some of the bugs that immediately
catch my eye. I cannot vouch that I have caught all of them.
BTW all these bugs can be avoided by using tools, so that
one does not have to become intricately familiar to do code
development of a simple FIR.

a. The stack pointer needs to be double word aligned, at all
times. The stores to the stack need to be done by pre-decrementing
the stack frame and leaving it double word aligned.

b. The save on entry registers A10-A15 and B10-B15 need to be saved
for sure, upon entry to a function.

c. This code is not single register assignment, and hence you need to
turn off interrupts while you are in this code.

d. This single cycle loop as shown could have been achieved with the
tools,
without a doubt. Further, if you allow at least two output samples to be
computed in parallel, you can get 100% multiplier utilization.

e. Most fir implementations are written to perform block processing,
which is why the serial ports are buffered McBSP, with the B for
Buffering.

f. Also the delay line is implemented once as a memcpy, by moving the
block of samples required for overlap, without explictly doing it in
the kernel. This removes the need for circular buffering.

g. Take a look at codec_edma.c under the DSK directory for an example of
block based interaction with the serial port.

By the way I have not seen anything spectacular to give up on the tools
yet!

Regards
Jagadeesh Sankaran



Wojciech-

> but what do you mean that the stack pointer needs to
> be double word aligned? isn't the stack pointer B15?
> how can that be double word aligned? or do you mean
> that I should push/pop always double words? hm... is
> it even possible? could you please elaborate on that?

Sub 8 and AND with 0xfffffff8 before doing any additional push/pop. Watch out
for
alignment upon return.

-Jeff



A lot of my experience has been with fixed point DSP's C62x
and more so C64x DSP. However I wanted to write out some code
and prove to myself that the tools are indeed the best way to
get there, that too for an example as simple as an FIR.

Here is my first pass stab at a single sample FIR C code.
This code models merely the FIR part of the dot-product,
it does not model the circular buffer. I shall address
this in the latter part of my e-mail.

#include <stdio.h>
#include <stdlib.h>

float fir(float *restrict x, float *restrict h, int N)
{
int i;
float sum;

/*-----------*/
/* Initialize FIR accumulator to zero, prior */
/* to the start of the computation. */
/*-----------*/

sum = 0;

/*-----------*/
/* If this is a C6000 build, then inform the */
/* compiler about any safe assumptions that */
/* can be made. In this case we assume that */
/* input array is word aligned and filter */
/* arrays is double word aligned. */
/* word aligned. I am assuming that your */
/* filter has at least 16 taps and is a */
/* multiple of 8. */
/*-----------*/ #ifdef TMS320C6X
_nassert((int)(x)%4 == 0);
_nassert((int)(h)%8 == 0);
_nassert((int)(N)%8 == 0);
_nassert((int)(N >= 16));
#endif

/*-----------*/
/* The following loop iterates over N filter */
/* taps computing the FIR sum, one tap at a */
/* time. */
/*-----------*/

for ( i = 0; i < N; i++)
{
/*--------*/
/* Compute sum of products over all filter */
/* taps accumulating the result into sum. */
/*--------*/

sum += x[i] * h[i];
}

/*-----------*/
/* Return accumulated single sample FIR. */
/*-----------*/

return sum;
}

I compiled using the current shipping 4.31 tools. I used
the following flags for my compile:

cl6x -k -o2 -mwtx -mv6700 -mh -dTMS320C6X fir.c

The compiler produced the following output in which two
multiplies are issued every cycle. Although the code is
written in a straight forward way, tthe odd and even taps
in parallel and accumulates them into seperate accumulators.
It finally adds the seperate accumulators prior to returning.
I will now reproduce the assembler output, which you can
reproduce as well {Isnt that nice ?}.

.sect ".text"
.global _fir

;******************************************************************************
;* FUNCTION NAME:
_fir *
;*
*
;* Regs Modified :
A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;*
B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Regs Used :
A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;*
B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Local Frame Size : 0 Args + 0 Auto + 24 Save = 24
byte *
;******************************************************************************
_fir:
;**
--*

MV .S1X SP,A9 ; |5|
|| STW .D2T1 A10,*SP--(24) ; |5|

STW .D2T2 B13,*+SP(20)

MVK .S2 32,B5
|| STW .D2T2 B12,*+SP(16)

SUB .L2X A4,B5,B5
|| STW .D2T2 B11,*+SP(12)

MV .S1X DP,A10 ; save dp
|| STW .D1T1 A14,*-A9(20)
|| MV .L2 B4,B11
|| STW .D2T2 B10,*+SP(8)
|| MVC .S2 CSR,B4

ADD .L2 4,B5,B12
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <0,1>
|| MV .S1X B4,A8
|| AND .S2 -2,B4,B4

SHR .S1 A6,3,A4 ; |47|
|| MV .L1X B5,A6
|| MVC .S2 B4,CSR ; interrupts off
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <0,0>

LDW .D1T1 *++A6(32),A3 ; |47| (P) <0,2>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| (P) <0,2>

;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 40
;* Loop opening brace source line : 41
;* Loop closing brace source line : 48
;* Loop Unroll Multiple : 8x
;* Known Minimum Trip Count : 2
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 4
;* Unpartitioned Resource Bound : 6
;* Partitioned Resource Bound(*) : 6
;* Resource Partition:
;* A-side B-side
;* .L units 2 6*
;* .S units 1 0
;* .D units 6* 6*
;* .M units 2 6*
;* .X cross paths 2 4
;* .T address paths 6* 6*
;* Long read paths 0 0
;* Long write paths 0 4
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D
unit)
;* Bound(.L .S .LS) 2 3
;* Bound(.L .S .D .LS .LSD) 4 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 6 Schedule found with 4 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |**** * | ** **** *** |
;* 1: |***** ** | ****** *** |
;* 2: |******** | *** ******* |
;* 3: | ******* |* *** ***** * |
;* 4: | **** * |** **** **** * |
;* 5: | *** * |*** ***** ** * |
;* +---------------------------------+
;*
;* Done
;*
;* Epilog not entirely removed
;* Collapsed epilog stages : 2
;*
;* Prolog not entirely removed
;* Collapsed prolog stages : 1
;*
;* Minimum required memory pad : 64 bytes
;*
;* Minimum safe trip count : 1 (after unrolling)
;*----*
;* SETUP CODE
;*
;* MV A6,B12
;* ADD 4,B12,B12
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C23:
;* 0 LDW .D2T2 *++B12(32),DP ; |47|
;* 1 LDDW .D2T2 *B11++(32),B7:B6 ; |47|
;* 2 LDW .D1T1 *++A6(32),A3 ; |47|
;* || LDDW .D2T2 *-B11(24),B5:B4 ; |47|
;* 3 LDW .D2T2 *+B12(4),B6 ; |47|
;* || LDW .D1T1 *+A6(12),A3 ; |47|
;* 4 LDDW .D2T2 *-B11(16),B7:B6 ; |47|
;* || LDW .D1T1 *+A6(16),A4 ; |47|
;* 5 LDW .D1T1 *+A6(20),A4 ; |47|
;* || LDDW .D2T2 *-B11(8),B9:B8 ; |47|
;* 6 LDW .D1T1 *+A6(24),A5 ; |47|
;* 7 MPYSP .M1X B6,A3,A4 ; |47|
;* || MPYSP .M2 B7,DP,B4 ; |47|
;* || LDW .D1T1 *+A6(28),A4 ; |47|
;* 8 MPYSP .M2 B4,B6,B4 ; |47|
;* 9 MPYSP .M2X B6,A4,B8 ; |47|
;* 10 MPYSP .M2X B7,A4,B7 ; |47|
;* 11 ADDSP .L1 A4,A7,A7 ; |47| ^
;* || ADDSP .L2 B4,B3,B3 ; |47| ^
;* || MPYSP .M2X B8,A5,B4 ; |47|
;* 12 ADDSP .L2 B4,B10,B10 ; |47| ^
;* || MPYSP .M2X B5,A3,B4 ; |47|
;* || MPYSP .M1X B9,A4,A5 ; |47|
;* 13 ADDSP .L2 B8,B0,B0 ; |47| ^
;* || [ A1] SUB .S1 A1,1,A1 ; |48|
;* 14 ADDSP .L2 B7,B1,B1 ; |47| ^
;* || [ A1] B .S1 C23 ; |48|
;* 15 ADDSP .L2 B4,B2,B2 ; |47| ^
;* 16 ADDSP .L2 B4,B13,B13 ; |47| ^
;* || ADDSP .L1 A5,A0,A0 ; |47| ^
;* 17 NOP 3
;* ; BRANCH OCCURS ; |48|
;*----*
L1: ; PIPED LOOP PROLOG

LDW .D1T1 *+A6(12),A3 ; |47| (P) <0,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| (P) <0,3>

LDW .D1T1 *+A6(16),A4 ; |47| (P) <0,4>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| (P) <0,4>

ZERO .S2 B0 ; |47|
|| ZERO .L2 B2 ; |47|
|| ZERO .S1 A7 ; |47|
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| (P) <0,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| (P) <0,5>

ZERO .S2 B1 ; |47|
|| ZERO .L1 A0 ; |47|
|| ZERO .L2 B13 ; |47|
|| MV .S1X B3,A14
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <1,0>
|| LDW .D1T1 *+A6(24),A5 ; |47| (P) <0,6>

MVK .S1 0x1,A2 ; init prolog collapse
predicate
|| SUB .L1 A4,1,A1
|| ZERO .S2 B3 ; |47|
|| ZERO .L2 B10 ; |47|
|| MPYSP .M1X B6,A3,A4 ; |47| (P) <0,7>
|| MPYSP .M2 B7,DP,B4 ; |47| (P) <0,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <1,1>
|| LDW .D1T1 *+A6(28),A4 ; |47| (P) <0,7>

;**
--*
L2: ; PIPED LOOP KERNEL

[ A1] B .S1 L2 ; |48| <0,14>
|| [!A2] ADDSP .L2 B7,B1,B1 ; |47| <0,14> ^
|| MPYSP .M2 B4,B6,B4 ; |47| <1,8>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| <2,2>
|| LDW .D1T1 *++A6(32),A3 ; |47| <2,2>

[!A2] ADDSP .L2 B4,B2,B2 ; |47| <0,15> ^
|| MPYSP .M2X B6,A4,B8 ; |47| <1,9>
|| LDW .D1T1 *+A6(12),A3 ; |47| <2,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| <2,3>

[!A2] ADDSP .L2 B4,B13,B13 ; |47| <0,16> ^
|| [!A2] ADDSP .L1 A5,A0,A0 ; |47| <0,16> ^
|| MPYSP .M2X B7,A4,B7 ; |47| <1,10>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| <2,4>
|| LDW .D1T1 *+A6(16),A4 ; |47| <2,4>

ADDSP .L1 A4,A7,A7 ; |47| <1,11> ^
|| MPYSP .M2X B8,A5,B4 ; |47| <1,11>
|| ADDSP .L2 B4,B3,B3 ; |47| <1,11> ^
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| <2,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| <2,5>

[ A2] SUB .S1 A2,1,A2 ; <0,18>
|| MPYSP .M2X B5,A3,B4 ; |47| <1,12>
|| MPYSP .M1X B9,A4,A5 ; |47| <1,12>
|| ADDSP .L2 B4,B10,B10 ; |47| <1,12> ^
|| LDW .D1T1 *+A6(24),A5 ; |47| <2,6>
|| LDW .D2T2 *++B12(32),DP ; |47| <3,0>

[ A1] SUB .S1 A1,1,A1 ; |48| <1,13>
|| ADDSP .L2 B8,B0,B0 ; |47| <1,13> ^
|| MPYSP .M1X B6,A3,A4 ; |47| <2,7>
|| LDW .D1T1 *+A6(28),A4 ; |47| <2,7>
|| MPYSP .M2 B7,DP,B4 ; |47| <2,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| <3,1>

;**
--*
L3: ; PIPED LOOP EPILOG

MV .S1X SP,A9 ; |55|
|| ADDSP .L2 B7,B1,B1 ; |47| (E) <3,14> ^

ADDSP .L2 B4,B2,B2 ; |47| (E) <3,15> ^

ADDSP .L1 A5,A0,A0 ; |47| (E) <3,16> ^
|| ADDSP .L2 B4,B13,B13 ; |47| (E) <3,16> ^

LDDW .D2T2 *+SP(8),B11:B10 ; |55|
|| MV .S1X B10,A8
|| MV .S2X A8,B4

MV .S2X A10,DP ; restore dp
MVC .S2 B4,CSR ; interrupts on

LDDW .D2T2 *+SP(16),B13:B12 ; |55|
|| ADDSP .L1X A8,B13,A6 ; |54|

NOP 3

LDW .D1T1 *+A9(4),A14 ; |55|
|| MV .S2X A14,B3 ; |55|
|| ADDSP .L1X B3,A6,A3 ; |54|

LDW .D2T1 *++SP(24),A10 ; |55|
NOP 2
ADDSP .L1 A7,A3,A3 ; |54|
NOP 3
ADDSP .L1 A0,A3,A0 ; |54|
NOP 3
ADDSP .L1X B2,A0,A0 ; |54|
NOP 3
ADDSP .L1X B1,A0,A0 ; |54|
NOP 1
RET .S2 B3 ; |55|
NOP 1
ADDSP .L1X B0,A0,A4 ; |54|
NOP 3
; BRANCH OCCURS ; |55| Extra Comments
---------------

a. This code decrements the stack frame by 16-bytes to store
3 words A10, B10, B11. Even though 12 bytes of stack storage
would have been adequate, it needs to decrement 16 bytes,
in order to leave the incoming double word aligned stack
frame double word aligned at the end of the transaction.

b. You only need to save A10-A15 and B10-B15 if these are being
modified by your code. You need not worry about other registers
you are modifying.

c. Also notice the comments "interrupts off". This is where the
compiler truns off the "GIE" bit off CSR to guarantee that
interrupts dont mess you up, now that you are no longer in
single register assignment mode.

d. Also notice how two prolog and epilog stages have been
collapsed to achieve code-size reductions.

e. The compiler finds a 6 cycle loop in which 8 multiplies
are performed to achieve 1.3 multiplies/cycle. The reason,
this is not a 4 cycle loop, is because the input array
can only be assumed to be word-aligned, as only one
output fir sample is computed at a time. the filter
array can be assumed to be double word aligned.

f. Notice from the compiler feedback {-mw} flag, how all
units are maxed out.

g. Also, if one could compute even two fir output samples
at a time. In the C code, specifying

_nassert((int)(x)%8 == 0);

and computing two output samples in parallel would give
a better multiplier utilization.

Circular buffer
---------------

The delay line is modeled by keeping an array of input samples,
in the input array of size KN and copying the last N -1 samples,
after computing (K-1)N output samples, to the head of the array.

Since the memcpy is done once, for every (K-1)N output samples,
it is inexpensive, as opposed to maintaining the delay line manually.
This avoids explictly having to incorporate circular buffering in
your code.

X:
<-N input samples->|<N input samples->|..............|<N input
samples>|

If you still want to use circular buffering you could use serial
assembly to do so. Regards
Jagadeesh Sankaran


There is a way to get 1.6 multiplies/cycle for single sample
FIR. This involves carrying two versions, one for the even
output samples where both the input and the filter arrays
are double word aligned, and one for the odd output samples
where the input is word aligned and filter array is double
word aligned. In this case you compute even output samples
at the rate of 2 multiplies/cycle and odd output samples
at the rate of 1.2 multiplies/cycle. Since both these
versions are going to be called for an equal number of
times, in steady state you will get an averaging effect
of computing at the rate of 1.6 multiplies/cycle.

double -word aligned: are addresses that end in 0x0 and 0x8.
word-aligned : are addresses that end in 0x0, 0x4, 0x8, 0xC

All double-word addresses are word_aligned. Not all word
aligned addresses are double word aligned.

You can request alignment from C for the linker to use
by saying:

#pragma DATA_ALIGN(x, 8)
float x[100];

These statements align the start of array x, or x itself
to be double-word aligned from C.

Regards
Jagadeesh Sankaran