Reply by Jeff Brower September 24, 2003
Jagadeesh-

> Agreed! It is hard to know the expertise level of the person
> from an e-mail, so I erred in the direction of providing more
> information. But let's move on!
>
> I do look forward to stimulating discussions on optimizations
> for C62x, C64x and C67x. I will keep my answers fairly general,
> so that everybody in the user's group can benefit.

You know of course that any time the discussion focuses on optimization, I will
drag out your name and refer to your original Nov/Dec '02 posts, which I have
memorized.

That's what you get for being an expert :-)

-Jeff



Reply by Bhooshan iyer September 24, 2003

Hi Wojciech Rewers,

>well... as I said before - I do respect all participants of this group and all people in
>general... I admit - my remark was a bit "cynical" but  spitful?

I'm sorry, you have misunderstood the word I used here! I meant spiteful (showing a disposition or inclination to annoy or hurt). I guess it's a lot different from spitful!!!!!!

Anyway, I will have to concede, even spitful was probably over the top. So sorry about that! Just wanted to prove a point!

>well... I'm not going to mow anybody down or anything... plus - I don't consider the groups as a number 1 source of knowledge - I have been participating in the group for many months now and I was trying to help much more often than I actually asked about anything...

I agree, you are trying to be an active member. No qualms there either.

>I didn't mow down  Rick when he tried to help me - we both agreed that
>neither of us is an expert on C6x assembly...

Well, even there I was slightly uneasy with your tongue-in-cheek comment about whether Rick had ever programmed the C6000 at all... but I'll let that one pass...

>sum up - I don't think I'm spitful towards anybody -

Not spiteful perhaps, maybe more like cynical, if you will...

>and even if I was, here it goes:
>I do apologize to Jagadeesh and anybody else that I ever offended on this group...

That's a nice gesture. Just Sankaran, nobody else...

>I'm sorry - it's just my sense of humour - I can't
>resist... is it about mr Sankaran or TI engineers? ;-)

I'm there with you on that. :)

>well - this will not be about mr Sankaran or anyone else in particular - rather a general statement about being clever... I think being a good teacher means
>giving clever one-line answers wherever you can!

Not always, definitely not all the time?

>I believe my help on this group is like that - I try not to give lectures to people

Some people, even if they try, can't give good lectures. And some people give great lectures.
It's an altogether different issue whether you like lectures or not!

>well... as I said - I don't hold anything against anybody on this group - on the contrary - I appreciate a good discussion as it's just another stimulus for my brain... that's all...

Couldn't agree with you more (notwithstanding my desire to disagree!) on that. Your current thread has been a great stimulus all right. It has opened up my eyes to so many new things.

>so - now I wish everybody a nice day ;-) I know it's morning out there in the US ;-)

Good day to you too, Mr. Rewers. But hey, I live in India. It's night here. So, good night will do fine.

PS: You are a good engineer and you seem to have all the right things. I am sure you will make it in the field. Come to India??!!
 
Bhooshan



Reply by Jagadeesh Sankaran September 24, 2003
Agreed! It is hard to know the expertise level of the person
from an e-mail, so I erred in the direction of providing more
information. But let's move on!

I do look forward to stimulating discussions on optimizations
for C62x, C64x and C67x. I will keep my answers fairly general,
so that everybody in the user's group can benefit.

Regards
Jagadeesh Sankaran


Reply by Wojciech Rewers September 24, 2003
> Hi all, and Wojciech Rewers,

hi all...

>> yeah - all squares are rectangles, but not all
>> rectangles are squares - logic taught to a 10 year
>> old...

> May be iam slightly old fashioned here...but are'nt
> we supposed to be respectful to people who are trying
> to help us? and not be cynical/spiteful?

well... as I said before - I do respect all
participants of this group and all people in
general... I admit - my remark was a bit "cynical" but
spitful? hm... after all - I only rephrased what
Jagadeesh said and I quote:
All double-word addresses are word_aligned. Not all
word aligned addresses are double word aligned.

well - for me this is a logic taught to a 10 year old
and there is nothing spitful in it... and my comment
was not to insult Jagadeesh or anybody else - it was
rather in the tone of "yeah - let's skip the obvious
and get to the point"...

> Wow, it makes me wonder what will happen if someone
> gives a wrong answer to your question,you would mow
> them down? intellectually,perhaps? what with all your
> "i need know why am doing what am doing,or ill kill
> you attitude"?

well... I'm not going to mow anybody down or
anything... plus - I don't consider the groups as a
number 1 source of knowledge - I have been
participating in the group for many months now and I
was trying to help much more often than I actually
asked about anything... now I asked about this
sample-by-sample FIR and with the help from the group
I managed to develop the code I needed - I appreciate
the group's help... that's all... but my point is - I
don't demand any help from anybody - if you can help
me - or have something to add in the subject - go
ahead and do so - but I'm not going to mow you down if
your contribution is of no value... I didn't mow down
Rick when he tried to help me - we both agreed that
neither of us is an expert on C6x assembly... so - to
sum up - I don't think I'm spitful towards anybody -
and even if I was, here it goes:
I do apologize to Jagadeesh and anybody else that I
ever offended on this group...

> BTW,Mr.sankaran has been one of the brightest and
> most erudite ti engineer to participate in the group.

I'm sorry - it's just my sense of humour - I can't
resist... is it about mr Sankaran or TI engineers? ;-)

> And if you notice he doesnt give CLEVER one line
> answers-He is a great teacher who PROVES most things
> he says.

well - this will not be about mr Sankaran or anyone
else in particular - rather a general statement about
being clever... I think being a good teacher means
giving clever one-line answers wherever you can! I
believe my help on this group is like that - I try not
to give lectures to people - rather point to the
source or even just give one hint from which the
person in need can solve his problem... because after
all - I'm not going to solve anybody's problems, but I
do offer a hint wherever I can...

> Anyways, to your credit,(i hope...) i have to say,
> you seem(?) to be the kind who could take as much as
> you can give!
> So,friends? adults?
> So,Dont bother apologising, or may be you should!

well... as I said - I don't hold anything against
anybody on this group - on the contrary - I appreciate
a good discussion as it's just another stimulus for my
brain... that's all...

so - now I wish everybody a nice day ;-) I know it's
morning out there in the US ;-)

Wojciech Rewers



Reply by Bhooshan iyer September 24, 2003

Hi all, and Wojciech Rewers,

>yeah - all squares are rectangles, but not all
>rectangles are squares - logic taught to a 10 year old...

Maybe I am slightly old fashioned here... but aren't we supposed to be respectful to people who are trying to help us? And not be cynical/spiteful?

Wow, it makes me wonder what will happen if someone gives a wrong answer to your question. You would mow them down? Intellectually, perhaps? What with all your "I need to know why I am doing what I am doing, or I'll kill you" attitude?

BTW, Mr. Sankaran has been one of the brightest and most erudite TI engineers to participate in the group. And if you notice, he doesn't give CLEVER one-line answers. He is a great teacher who PROVES most things he says. I agree, he didn't initially fall into the "sample-by-sample" drift right away, but c'mon man... cut him some slack here!

Anyways, to your credit (I hope...), I have to say, you seem(?) to be the kind who could take as much as you can give!

So, friends? Adults?

So, don't bother apologising, or maybe you should!

Hah!

 Bhooshan

 





Reply by Jagadeesh Sankaran September 23, 2003
My comments were general, meant to be clear, and not directed
at the understanding, or lack of it, of any one individual in
particular. That was not my aim. I was trying to be clear.
I hope I did not annoy anybody. At the same time, I did not
want to leave anybody out. First of all, apologies to all,
Wojciech Rewers in particular, if I came across as being so.

The reason that the stack pointer needs to be double word
aligned is to facilitate running code across multiple
platforms, in particular C62x code on C67x: the C67x can
perform LDDW to load registers from the stack.

Even then the correct way to do it is to pre-decrement the
stack by the number of words you intend to use upfront, and
not decrement it one at a time. However the code shown on
Page 8-12 is correct because within an ISR unless you re-enable
GIE {shown on next page of PRG} you do not respond to interrupts.
Further, since an even number of registers are pushed and popped,
if the stack pointer is double-word aligned to begin with,
it will be double-word aligned at the end as well. So, this
example "happens" to work.
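
The even-push argument can be checked with a little address arithmetic. What follows is a minimal host-side sketch in plain C, not target code. The helper names is_dword_aligned and sp_after_pushes, and the starting address used in the note below, are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when addr is a multiple of 8, i.e. double-word aligned. */
static int is_dword_aligned(uint32_t addr)
{
    return (addr & 7u) == 0u;
}

/* Model pushing n_words one word at a time: each push moves the
 * stack pointer down by 4 bytes. The caller must pass a stack
 * pointer that is double-word aligned to begin with. */
static uint32_t sp_after_pushes(uint32_t sp, unsigned n_words)
{
    assert(is_dword_aligned(sp));
    return sp - 4u * n_words;
}
```

Starting from a hypothetical SP of 0x80001000, sp_after_pushes(sp, 6) is double-word aligned again, while sp_after_pushes(sp, 3) is only word aligned. Between the individual pushes the pointer is misaligned half of the time, which is why a single up-front decrement is the safer habit.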

However the "best" way to do it would be the way the C compiler
does it {yet another reason why tools are better }. I will show
the assembly statements used by the C compiler to accomplish
maintenance of the stack. Notice that even the loads from the
stack are being done using load double words. Notice the pre-
decrement and post increment.

Saves to stack:

MV .S1X SP,A9 ; |5|
|| STW .D2T1 A10,*SP--(24) ; |5|

STW .D2T2 B13,*+SP(20)

MVK .S2 32,B5
|| STW .D2T2 B12,*+SP(16)

SUB .L2X A4,B5,B5
|| STW .D2T2 B11,*+SP(12)

MV .S1X DP,A10 ; save dp
|| STW .D1T1 A14,*-A9(20)
|| MV .L2 B4,B11
|| STW .D2T2 B10,*+SP(8)
|| MVC .S2 CSR,B4

ADD .L2 4,B5,B12
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <0,1>
|| MV .S1X B4,A8
|| AND .S2 -2,B4,B4

SHR .S1 A6,3,A4 ; |47|
|| MV .L1X B5,A6
|| MVC .S2 B4,CSR ; interrupts off
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <0,0>

Restores from the stack:

MV .S1X SP,A9 ; |55|
|| ADDSP .L2 B7,B1,B1 ; |47| (E) <3,14> ^

ADDSP .L2 B4,B2,B2 ; |47| (E) <3,15> ^

ADDSP .L1 A5,A0,A0 ; |47| (E) <3,16> ^
|| ADDSP .L2 B4,B13,B13 ; |47| (E) <3,16> ^

LDDW .D2T2 *+SP(8),B11:B10 ; |55|
|| MV .S1X B10,A8
|| MV .S2X A8,B4

MV .S2X A10,DP ; restore dp
MVC .S2 B4,CSR ; interrupts on

LDDW .D2T2 *+SP(16),B13:B12 ; |55|
|| ADDSP .L1X A8,B13,A6 ; |54|

NOP 3

LDW .D1T1 *+A9(4),A14 ; |55|
|| MV .S2X A14,B3 ; |55|
|| ADDSP .L1X B3,A6,A3 ; |54|

LDW .D2T1 *++SP(24),A10 ; |55|

Regards
Jagadeesh Sankaran



Reply by Jagadeesh Sankaran September 23, 2003
There is a way to get 1.6 multiplies/cycle for single sample
FIR. This involves carrying two versions, one for the even
output samples where both the input and the filter arrays
are double word aligned, and one for the odd output samples
where the input is word aligned and filter array is double
word aligned. In this case you compute even output samples
at the rate of 2 multiplies/cycle and odd output samples
at the rate of 1.2 multiplies/cycle. Since both these
versions are going to be called for an equal number of
times, in steady state you will get an averaging effect
of computing at the rate of 1.6 multiplies/cycle.

Double-word aligned: addresses that end in 0x0 or 0x8.
Word aligned: addresses that end in 0x0, 0x4, 0x8 or 0xC.

All double-word aligned addresses are word aligned. Not all
word aligned addresses are double-word aligned.
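
The two definitions boil down to masking the low address bits. Here is a small sketch in portable C; the function names are invented for this example and are not part of any TI header.

```c
#include <assert.h>
#include <stdint.h>

/* Word aligned: the address is a multiple of 4. */
static int is_word_aligned(uintptr_t addr)
{
    return (addr & 3u) == 0u;
}

/* Double-word aligned: the address is a multiple of 8. */
static int is_dword_aligned(uintptr_t addr)
{
    return (addr & 7u) == 0u;
}

/* Walk the address endings named in the text. */
static void check_examples(void)
{
    assert(is_dword_aligned(0x0) && is_dword_aligned(0x8));
    assert(is_word_aligned(0x4) && !is_dword_aligned(0x4));
    assert(is_word_aligned(0xC) && !is_dword_aligned(0xC));
}
```

Since 8 is a multiple of 4, any address that passes is_dword_aligned also passes is_word_aligned. Endings 0x4 and 0xC are the counterexamples for the converse.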

You can request alignment from C for the linker to use
by saying:

#pragma DATA_ALIGN(x, 8)
float x[100];

These statements ask for the start of array x to be placed
on a double-word (8-byte) boundary.
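
To try this off-target, one option is to fall back to C11 _Alignas when not building with the TI tools. This is only a sketch: the TMS320C6X macro is assumed to be the user-defined symbol passed with -d on the cl6x command line, the _Alignas branch is a host stand-in rather than what cl6x does, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#ifdef TMS320C6X
/* TI compiler build: the pragma from the text above. */
#pragma DATA_ALIGN(x, 8)
float x[100];
#else
/* C11 host build: _Alignas requests the same 8-byte alignment. */
_Alignas(8) float x[100];
#endif

/* Nonzero when the array really starts on an 8-byte boundary. */
static int x_is_dword_aligned(void)
{
    return ((uintptr_t)x % 8u) == 0u;
}

/* Aborts via assert if the alignment request was not honoured. */
static void check_x_alignment(void)
{
    assert(x_is_dword_aligned());
}
```

A run-time check like check_x_alignment is cheap insurance before handing the array to a routine that makes an alignment promise to the compiler.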

Regards
Jagadeesh Sankaran



Reply by Jagadeesh Sankaran September 22, 2003
A lot of my experience has been with the fixed point DSPs, C62x
and more so C64x. However, I wanted to write out some code
and prove to myself that the tools are indeed the best way to
get there, even for an example as simple as an FIR.

Here is my first pass stab at a single sample FIR C code.
This code models merely the FIR part of the dot-product,
it does not model the circular buffer. I shall address
this in the latter part of my e-mail.

#include <stdio.h>
#include <stdlib.h>

float fir(float *restrict x, float *restrict h, int N)
{
    int i;
    float sum;

    /*-----------*/
    /* Initialize FIR accumulator to zero, prior */
    /* to the start of the computation. */
    /*-----------*/

    sum = 0;

    /*-----------*/
    /* If this is a C6000 build, then inform the */
    /* compiler about any safe assumptions that */
    /* can be made. In this case we assume that */
    /* the input array is word aligned and the */
    /* filter array is double word aligned. I am */
    /* assuming that your filter has at least 16 */
    /* taps and that the tap count is a multiple */
    /* of 8. */
    /*-----------*/

#ifdef TMS320C6X
    _nassert((int)(x)%4 == 0);
    _nassert((int)(h)%8 == 0);
    _nassert((int)(N)%8 == 0);
    _nassert(N >= 16);
#endif

    /*-----------*/
    /* The following loop iterates over N filter */
    /* taps computing the FIR sum, one tap at a */
    /* time. */
    /*-----------*/

    for (i = 0; i < N; i++)
    {
        /*--------*/
        /* Compute sum of products over all filter */
        /* taps accumulating the result into sum. */
        /*--------*/

        sum += x[i] * h[i];
    }

    /*-----------*/
    /* Return accumulated single sample FIR. */
    /*-----------*/

    return sum;
}

I compiled using the current shipping 4.31 tools. I used
the following flags for my compile:

cl6x -k -o2 -mwtx -mv6700 -mh -dTMS320C6X fir.c

The compiler produced the following output, in which up to two
multiplies are issued per cycle. Although the code is
written in a straightforward way, the compiler computes the odd
and even taps in parallel and accumulates them into separate
accumulators. It finally adds the separate accumulators prior to
returning. I will now reproduce the assembler output, which you can
reproduce as well {Isn't that nice?}.
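
In C terms, the transformation looks roughly like the sketch below. This is a hand-written illustration, not compiler output: the listing that follows zeroes eight accumulator registers (A0, A7, B0, B1, B2, B3, B10, B13), so the sketch uses eight partial sums as well. The names fir_simple and fir_split are made up here.

```c
#include <assert.h>

/* Straightforward single-sample FIR, same loop as fir() above. */
static float fir_simple(const float *x, const float *h, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i] * h[i];
    return sum;
}

/* Picture of the unrolled form: the loop is unrolled by 8 and each
 * unrolled tap gets its own accumulator, so no floating-point add
 * has to wait for the result of the previous one. */
static float fir_split(const float *x, const float *h, int n)
{
    float acc[8] = {0};
    float sum = 0.0f;

    assert(n >= 16 && n % 8 == 0);   /* same assumptions as the _nassert lines */

    for (int i = 0; i < n; i += 8)
        for (int k = 0; k < 8; k++)
            acc[k] += x[i + k] * h[i + k];

    for (int k = 0; k < 8; k++)      /* final reduction of the partial sums */
        sum += acc[k];
    return sum;
}
```

Floating-point addition is not associative, so fir_split can differ from fir_simple in the last bits for arbitrary data. For inputs whose partial sums are exactly representable the two agree.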

.sect ".text"
.global _fir

;******************************************************************************
;* FUNCTION NAME: _fir *
;* *
;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;* B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,*
;* B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;* Local Frame Size : 0 Args + 0 Auto + 24 Save = 24 byte *
;******************************************************************************
_fir:
;** --*

MV .S1X SP,A9 ; |5|
|| STW .D2T1 A10,*SP--(24) ; |5|

STW .D2T2 B13,*+SP(20)

MVK .S2 32,B5
|| STW .D2T2 B12,*+SP(16)

SUB .L2X A4,B5,B5
|| STW .D2T2 B11,*+SP(12)

MV .S1X DP,A10 ; save dp
|| STW .D1T1 A14,*-A9(20)
|| MV .L2 B4,B11
|| STW .D2T2 B10,*+SP(8)
|| MVC .S2 CSR,B4

ADD .L2 4,B5,B12
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <0,1>
|| MV .S1X B4,A8
|| AND .S2 -2,B4,B4

SHR .S1 A6,3,A4 ; |47|
|| MV .L1X B5,A6
|| MVC .S2 B4,CSR ; interrupts off
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <0,0>

LDW .D1T1 *++A6(32),A3 ; |47| (P) <0,2>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| (P) <0,2>

;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 40
;* Loop opening brace source line : 41
;* Loop closing brace source line : 48
;* Loop Unroll Multiple : 8x
;* Known Minimum Trip Count : 2
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 4
;* Unpartitioned Resource Bound : 6
;* Partitioned Resource Bound(*) : 6
;* Resource Partition:
;* A-side B-side
;* .L units 2 6*
;* .S units 1 0
;* .D units 6* 6*
;* .M units 2 6*
;* .X cross paths 2 4
;* .T address paths 6* 6*
;* Long read paths 0 0
;* Long write paths 0 4
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 2 3
;* Bound(.L .S .D .LS .LSD) 4 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 6 Schedule found with 4 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |**** * | ** **** *** |
;* 1: |***** ** | ****** *** |
;* 2: |******** | *** ******* |
;* 3: | ******* |* *** ***** * |
;* 4: | **** * |** **** **** * |
;* 5: | *** * |*** ***** ** * |
;* +---------------------------------+
;*
;* Done
;*
;* Epilog not entirely removed
;* Collapsed epilog stages : 2
;*
;* Prolog not entirely removed
;* Collapsed prolog stages : 1
;*
;* Minimum required memory pad : 64 bytes
;*
;* Minimum safe trip count : 1 (after unrolling)
;*----*
;* SETUP CODE
;*
;* MV A6,B12
;* ADD 4,B12,B12
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C23:
;* 0 LDW .D2T2 *++B12(32),DP ; |47|
;* 1 LDDW .D2T2 *B11++(32),B7:B6 ; |47|
;* 2 LDW .D1T1 *++A6(32),A3 ; |47|
;* || LDDW .D2T2 *-B11(24),B5:B4 ; |47|
;* 3 LDW .D2T2 *+B12(4),B6 ; |47|
;* || LDW .D1T1 *+A6(12),A3 ; |47|
;* 4 LDDW .D2T2 *-B11(16),B7:B6 ; |47|
;* || LDW .D1T1 *+A6(16),A4 ; |47|
;* 5 LDW .D1T1 *+A6(20),A4 ; |47|
;* || LDDW .D2T2 *-B11(8),B9:B8 ; |47|
;* 6 LDW .D1T1 *+A6(24),A5 ; |47|
;* 7 MPYSP .M1X B6,A3,A4 ; |47|
;* || MPYSP .M2 B7,DP,B4 ; |47|
;* || LDW .D1T1 *+A6(28),A4 ; |47|
;* 8 MPYSP .M2 B4,B6,B4 ; |47|
;* 9 MPYSP .M2X B6,A4,B8 ; |47|
;* 10 MPYSP .M2X B7,A4,B7 ; |47|
;* 11 ADDSP .L1 A4,A7,A7 ; |47| ^
;* || ADDSP .L2 B4,B3,B3 ; |47| ^
;* || MPYSP .M2X B8,A5,B4 ; |47|
;* 12 ADDSP .L2 B4,B10,B10 ; |47| ^
;* || MPYSP .M2X B5,A3,B4 ; |47|
;* || MPYSP .M1X B9,A4,A5 ; |47|
;* 13 ADDSP .L2 B8,B0,B0 ; |47| ^
;* || [ A1] SUB .S1 A1,1,A1 ; |48|
;* 14 ADDSP .L2 B7,B1,B1 ; |47| ^
;* || [ A1] B .S1 C23 ; |48|
;* 15 ADDSP .L2 B4,B2,B2 ; |47| ^
;* 16 ADDSP .L2 B4,B13,B13 ; |47| ^
;* || ADDSP .L1 A5,A0,A0 ; |47| ^
;* 17 NOP 3
;* ; BRANCH OCCURS ; |48|
;*----*
L1: ; PIPED LOOP PROLOG

LDW .D1T1 *+A6(12),A3 ; |47| (P) <0,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| (P) <0,3>

LDW .D1T1 *+A6(16),A4 ; |47| (P) <0,4>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| (P) <0,4>

ZERO .S2 B0 ; |47|
|| ZERO .L2 B2 ; |47|
|| ZERO .S1 A7 ; |47|
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| (P) <0,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| (P) <0,5>

ZERO .S2 B1 ; |47|
|| ZERO .L1 A0 ; |47|
|| ZERO .L2 B13 ; |47|
|| MV .S1X B3,A14
|| LDW .D2T2 *++B12(32),DP ; |47| (P) <1,0>
|| LDW .D1T1 *+A6(24),A5 ; |47| (P) <0,6>

MVK .S1 0x1,A2 ; init prolog collapse predicate
|| SUB .L1 A4,1,A1
|| ZERO .S2 B3 ; |47|
|| ZERO .L2 B10 ; |47|
|| MPYSP .M1X B6,A3,A4 ; |47| (P) <0,7>
|| MPYSP .M2 B7,DP,B4 ; |47| (P) <0,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <1,1>
|| LDW .D1T1 *+A6(28),A4 ; |47| (P) <0,7>

;** --*
L2: ; PIPED LOOP KERNEL

[ A1] B .S1 L2 ; |48| <0,14>
|| [!A2] ADDSP .L2 B7,B1,B1 ; |47| <0,14> ^
|| MPYSP .M2 B4,B6,B4 ; |47| <1,8>
|| LDDW .D2T2 *-B11(24),B5:B4 ; |47| <2,2>
|| LDW .D1T1 *++A6(32),A3 ; |47| <2,2>

[!A2] ADDSP .L2 B4,B2,B2 ; |47| <0,15> ^
|| MPYSP .M2X B6,A4,B8 ; |47| <1,9>
|| LDW .D1T1 *+A6(12),A3 ; |47| <2,3>
|| LDW .D2T2 *+B12(4),B6 ; |47| <2,3>

[!A2] ADDSP .L2 B4,B13,B13 ; |47| <0,16> ^
|| [!A2] ADDSP .L1 A5,A0,A0 ; |47| <0,16> ^
|| MPYSP .M2X B7,A4,B7 ; |47| <1,10>
|| LDDW .D2T2 *-B11(16),B7:B6 ; |47| <2,4>
|| LDW .D1T1 *+A6(16),A4 ; |47| <2,4>

ADDSP .L1 A4,A7,A7 ; |47| <1,11> ^
|| MPYSP .M2X B8,A5,B4 ; |47| <1,11>
|| ADDSP .L2 B4,B3,B3 ; |47| <1,11> ^
|| LDDW .D2T2 *-B11(8),B9:B8 ; |47| <2,5>
|| LDW .D1T1 *+A6(20),A4 ; |47| <2,5>

[ A2] SUB .S1 A2,1,A2 ; <0,18>
|| MPYSP .M2X B5,A3,B4 ; |47| <1,12>
|| MPYSP .M1X B9,A4,A5 ; |47| <1,12>
|| ADDSP .L2 B4,B10,B10 ; |47| <1,12> ^
|| LDW .D1T1 *+A6(24),A5 ; |47| <2,6>
|| LDW .D2T2 *++B12(32),DP ; |47| <3,0>

[ A1] SUB .S1 A1,1,A1 ; |48| <1,13>
|| ADDSP .L2 B8,B0,B0 ; |47| <1,13> ^
|| MPYSP .M1X B6,A3,A4 ; |47| <2,7>
|| LDW .D1T1 *+A6(28),A4 ; |47| <2,7>
|| MPYSP .M2 B7,DP,B4 ; |47| <2,7>
|| LDDW .D2T2 *B11++(32),B7:B6 ; |47| <3,1>

;** --*
L3: ; PIPED LOOP EPILOG

MV .S1X SP,A9 ; |55|
|| ADDSP .L2 B7,B1,B1 ; |47| (E) <3,14> ^

ADDSP .L2 B4,B2,B2 ; |47| (E) <3,15> ^

ADDSP .L1 A5,A0,A0 ; |47| (E) <3,16> ^
|| ADDSP .L2 B4,B13,B13 ; |47| (E) <3,16> ^

LDDW .D2T2 *+SP(8),B11:B10 ; |55|
|| MV .S1X B10,A8
|| MV .S2X A8,B4

MV .S2X A10,DP ; restore dp
MVC .S2 B4,CSR ; interrupts on

LDDW .D2T2 *+SP(16),B13:B12 ; |55|
|| ADDSP .L1X A8,B13,A6 ; |54|

NOP 3

LDW .D1T1 *+A9(4),A14 ; |55|
|| MV .S2X A14,B3 ; |55|
|| ADDSP .L1X B3,A6,A3 ; |54|

LDW .D2T1 *++SP(24),A10 ; |55|
NOP 2
ADDSP .L1 A7,A3,A3 ; |54|
NOP 3
ADDSP .L1 A0,A3,A0 ; |54|
NOP 3
ADDSP .L1X B2,A0,A0 ; |54|
NOP 3
ADDSP .L1X B1,A0,A0 ; |54|
NOP 1
RET .S2 B3 ; |55|
NOP 1
ADDSP .L1X B0,A0,A4 ; |54|
NOP 3
; BRANCH OCCURS ; |55|

Extra Comments
---------------

a. This code decrements the stack pointer by 24 bytes to store
six words: A10, A14, B10, B11, B12 and B13. The frame size is
always a multiple of 8 bytes. If only three words had to be saved,
12 bytes of stack storage would have been adequate, but the frame
would still need to be 16 bytes, in order to leave the incoming
double word aligned stack pointer double word aligned at the end
of the transaction.

b. You only need to save A10-A15 and B10-B15 if these are being
modified by your code. You need not worry about other registers
you are modifying.

c. Also notice the comment "interrupts off". This is where the
compiler turns off the GIE bit of CSR to guarantee that
interrupts don't mess you up, now that you are no longer in
single register assignment mode.

d. Also notice how prolog and epilog stages have been collapsed
(one prolog stage and two epilog stages, per the feedback above)
to achieve code-size reductions.

e. The compiler finds a 6 cycle loop in which 8 multiplies
are performed, to achieve about 1.3 multiplies/cycle (8/6).
The reason this is not a 4 cycle loop is that the input array
can only be assumed to be word aligned, as only one
output FIR sample is computed at a time. The filter
array can be assumed to be double word aligned.

f. Notice from the compiler feedback {-mw} flag, how all
units are maxed out.

g. Also, if one could compute even two FIR output samples
at a time, then in the C code, specifying

_nassert((int)(x)%8 == 0);

and computing two output samples in parallel would give
a better multiplier utilization.
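
A rough sketch of the two-outputs-at-a-time idea from item g, in plain C. The function name fir_two_outputs and its output convention are assumptions made for illustration. On a C6000 build one would add the restrict qualifiers and _nassert lines from fir() above, with the x alignment raised to 8.

```c
#include <assert.h>

/* Compute two consecutive outputs in one pass: y[0] from x[0..n-1]
 * and y[1] from x[1..n]. Each coefficient h[i] is loaded once and
 * feeds two multiplies, and x[i + 1] is shared between neighbouring
 * iterations. The input must therefore hold n + 1 samples. */
static void fir_two_outputs(const float *x, const float *h, int n, float y[2])
{
    float sum0 = 0.0f;
    float sum1 = 0.0f;

    assert(n > 0);

    for (int i = 0; i < n; i++) {
        sum0 += x[i]     * h[i];
        sum1 += x[i + 1] * h[i];
    }
    y[0] = sum0;
    y[1] = sum1;
}
```

The point of the pairing is that fewer loads are needed per multiply, so the multipliers are less likely to be starved by the load units.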

Circular buffer
---------------

The delay line is modeled by keeping an array of input samples,
in the input array of size KN and copying the last N -1 samples,
after computing (K-1)N output samples, to the head of the array.

Since the memcpy is done once for every (K-1)N output samples,
it is inexpensive, as opposed to maintaining the delay line manually.
This avoids explicitly having to incorporate circular buffering in
your code.

X:
|<-N input samples->|<-N input samples->|..............|<-N input samples->|
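
The copy-to-head scheme can be sketched as follows. The buffer layout (TAPS-1 history samples followed by one block of new samples), the sizes, and the name fir_block are all assumptions chosen for a small host-side example. They are not taken from the TI examples.

```c
#include <assert.h>
#include <string.h>

#define TAPS   4    /* N: filter length (hypothetical value)          */
#define BLOCK  8    /* new input samples per call (hypothetical value) */

/* Layout: buf[0 .. TAPS-2] holds history, buf[TAPS-1 ..] the new block. */
static float buf[TAPS - 1 + BLOCK];

/* Filter one block: copy it in behind the history, compute BLOCK
 * outputs with a plain linear dot product, then move the last
 * TAPS-1 samples to the head so they become the history for the
 * next call. No circular addressing is needed in the inner loop. */
static void fir_block(const float *in, const float *h, float *out)
{
    assert(in != NULL && h != NULL && out != NULL);

    memcpy(&buf[TAPS - 1], in, BLOCK * sizeof(float));

    for (int n = 0; n < BLOCK; n++) {
        float sum = 0.0f;
        for (int i = 0; i < TAPS; i++)
            sum += buf[n + i] * h[i];
        out[n] = sum;
    }

    memmove(buf, &buf[BLOCK], (TAPS - 1) * sizeof(float));
}
```

The cost of the memmove is TAPS-1 samples per BLOCK outputs, so the larger the block relative to the filter length, the cheaper the delay-line upkeep becomes.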

If you still want to use circular buffering you could use serial
assembly to do so.

Regards
Jagadeesh Sankaran


Reply by Jeff Brower September 22, 2003
Wojciech-

> but what do you mean that the stack pointer needs to
> be double word aligned? isn't the stack pointer B15?
> how can that be double word aligned? or do you mean
> that I should push/pop always double words? hm... is
> it even possible? could you please elaborate on that?

Sub 8 and AND with 0xfffffff8 before doing any additional push/pop. Watch out for
alignment upon return.
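
As plain address arithmetic, that recipe looks like the sketch below. It runs on a host, and the name reserve_aligned_dword is made up for the example; on the DSP the same two operations would be applied to B15.

```c
#include <assert.h>
#include <stdint.h>

/* Reserve 8 bytes below sp, then round the result down to a
 * multiple of 8 by clearing the low three address bits. */
static uint32_t reserve_aligned_dword(uint32_t sp)
{
    uint32_t new_sp = (sp - 8u) & 0xFFFFFFF8u;

    assert((new_sp & 7u) == 0u);    /* always double-word aligned */
    assert(sp - new_sp >= 8u);      /* at least 8 bytes reserved  */
    return new_sp;
}
```

Because the amount subtracted depends on the incoming value (8 bytes if SP was already aligned, 12 if it was only word aligned), the original SP has to be kept somewhere so it can be restored exactly on return.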

-Jeff



Reply by Jagadeesh Sankaran September 22, 2003
First of all, I would still strongly advise folks not to
give up hope on the tools and automatic code generation in a
flash. I will illustrate the perils, based on the code that
has been developed at:

http://www.wrewers.karolin.pl/firc.asm

This code has several issues. I am not trying to nit-pick.
Developing hand-optimized VLIW code has its perils. Let the
tools do their job. I will list some of the bugs that immediately
catch my eye. I cannot vouch that I have caught all of them.
BTW, all these bugs can be avoided by using the tools, so that
one does not have to become intimately familiar with the hardware
just to develop a simple FIR.

a. The stack pointer needs to be double word aligned, at all
times. The stores to the stack need to be done by pre-decrementing
the stack frame and leaving it double word aligned.

b. The save on entry registers A10-A15 and B10-B15 need to be saved
for sure, upon entry to a function.

c. This code is not single register assignment, and hence you need to
turn off interrupts while you are in this code.

d. This single cycle loop as shown could have been achieved with the
tools, without a doubt. Further, if you allow at least two output samples
to be computed in parallel, you can get 100% multiplier utilization.

e. Most fir implementations are written to perform block processing,
which is why the serial ports are buffered McBSP, with the B for
Buffering.

f. Also the delay line is implemented once as a memcpy, by moving the
block of samples required for overlap, without explicitly doing it in
the kernel. This removes the need for circular buffering.

g. Take a look at codec_edma.c under the DSK directory for an example of
block based interaction with the serial port.

By the way, I have not yet seen anything spectacular enough to give up
on the tools!

Regards
Jagadeesh Sankaran