DSPRelated.com
Forums

sample-by-sample FIR optimized

Started by Wojciech Rewers September 17, 2003
Two days ago I issued a question here about
sample-by-sample FIR implementation...

J G referenced me to this place:
http://www.integrated-dsp.com/software/c6xdsk/C6711/fira/

at first I thought "great - I have what I was looking
for" - but then - I looked closer to the code and
figured, that it's not only producing incorrect
results, but also it's not optimized that well...

so I fixed both problems... from 11+13n I got 16+7n
cycles where n is the number of taps calculated...

while checking the clock counts for 256 taps long FIR
I get results:
fira: 3941 counts (and incorrect results!)
fir_my: 2215 counts (and correct results)
that is almost 50% time reduction!

however - my profiler is giving me absurd results... I
have never really worked with the profiler so maybe
I'm doing something wrong - are there any "gotchas"
I'm not aware of?

anyway - here is the code I wrote if anybody needs it:
www.wrewers.karolin.pl/fira_my.asm

Wojciech Rewers
PS. I'm a young (26) male with MSEE degree
(specialized in DSP) with 15 monts of work experience
as embedded systems engineer looking for a job in
DSP/embedded field - ready to relocate anywhere. __________________________________



Perhaps I am missing something. But it is a poor algorithm indeed that
can't process a FIR filter with 1 clock per tap! I have not taken the time
to look at the code since I am packing up now to travel closer to Isabel's
path. I am thinking there is a disconnect in how you are counting your
clock cycles. Are your figures for just one data sample input?

The MAC on every DSP can process one tap in one clock cycle. The C6x
processors should be able to fetch two operands, perform a multiply and add
into the accumulator on each clock cycle. This would give you some fixed
number of startup clock cycles plus N cycles where N is the number of
taps. There should be no coefficient for the N unless you perform two
inputs at once in which case the coefficient could be 0.5.

Instead of working with the clearly inefficient fira code, perhaps you
should start with some other optimized code to see how to do it right. The
entire filter loop should be a single parallel operation. Of course you
will have a bit of messy setup and exit code, but the core loop should be
one set of parallel ops.

0) Setup pointers, counts...
1) fetch two inputs (one coeff, one data)
2) fetch two inputs (one coeff, one data) ||
multiply operands
3) fetch two inputs (one coeff, one data) ||
multiply operands ||
add to accumulator ||
loop on count to (3)
4) multiply operands ||
add to accumulator
5) add to accumulator

6) done!

This is the form your code should have.

Am I out to lunch on this? If you can't get to k + N cycles, it might be
faster running on an MCU! At 05:34 PM 9/17/2003, Wojciech Rewers wrote:
>Two days ago I issued a question here about
>sample-by-sample FIR implementation...
>
>J G referenced me to this place:
>http://www.integrated-dsp.com/software/c6xdsk/C6711/fira/
>
>at first I thought "great - I have what I was looking
>for" - but then - I looked closer to the code and
>figured, that it's not only producing incorrect
>results, but also it's not optimized that well...
>
>so I fixed both problems... from 11+13n I got 16+7n
>cycles where n is the number of taps calculated...
>
>while checking the clock counts for 256 taps long FIR
>I get results:
>fira: 3941 counts (and incorrect results!)
>fir_my: 2215 counts (and correct results)
>that is almost 50% time reduction!
>
>however - my profiler is giving me absurd results... I
>have never really worked with the profiler so maybe
>I'm doing something wrong - are there any "gotchas"
>I'm not aware of?
>
>anyway - here is the code I wrote if anybody needs it:
>www.wrewers.karolin.pl/fira_my.asm

Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX




--- Arius - Rick Collins <> wrote:
> Perhaps I am missing something.

clearly one os us is ;-)

> But it is a poor algorithm indeed that
> can't process a FIR filter with 1 clock per tap!

hm...

> I have not taken the time
> to look at the code since I am packing up now to
> travel closer to Isabel's path.

I though the stream of people is flowing the other
way?

> I am thinking there is a disconnect in how
> you are counting your
> clock cycles. Are your figures for just one data
> sample input?

? one data sample input? hm... I'm not sure what you
mean by that... my figure was for "one data sample
output" - meaning - the input parameters are: dly
(delay line of input samples), h (vector of
coefitients), nh (number of coeficients)... and as a
result I get one output sample (from calculating
dotprod as you wish)

> The MAC on every DSP can process one tap in one
> clock cycle.

hm... that's what I remember from my DSP classes and
labs where we were using C5x Starter Kits...

> The C6x
> processors should be able to fetch two operands,
> perform a multiply and add
> into the accumulator on each clock cycle.

hm... well - from what I understand so far C6711 has
this LDW instruction which can fetch a float number
but it takes 5 cycles to do so... then the MPYSP takes
4 cycles to multiply two float operands and then the
ADDSP takes 4 cycles to accumulate the result...

but then... I'm not sure myself whether the ADDSP can
directly follow MPYSP - from what I figured so far
from single-stepping some sample code - it cannot
because only after 3 delay slots the results are ready
from MPYSP... but then - I also took a look at the
assembly code published by TI for block FIR filter
where the "main loop" includes 4 cycles with many
parallel instructions in each - I didn't really
analyze all that though...

well - as I said - I'm not an expert here in assembly
coding - when J G gave referenced me to that
Integrated-DSP code I analyzed that and since it's
much more straightforward I stack with it and
optimized it by almost 50% - I thought it's OK and I
should move on to some other stuff I have on my to do
list... but now you're telling me it sucks? ;-)

> This would give you some fixed
> number of startup clock cycles plus N cycles where N
> is the number of taps.

well... as I said - I would expect that on C5x, but
I'm affraid you're wrong here about C67x and
floating-point operations... or is it really me who is
missing something?

> Instead of working with the clearly inefficient fira
> code, perhaps you
> should start with some other optimized code to see
> how to do it right. The
> entire filter loop should be a single parallel
> operation. Of course you
> will have a bit of messy setup and exit code, but
> the core loop should be
> one set of parallel ops.

well... I'm really not buying it for now - have you
actually done some assembly coding on C6x?

> Am I out to lunch on this? If you can't get to k +
> N cycles, it might be
> faster running on an MCU!

well... I'm not sure if you're "out to lunch" or not -
for me you are - but maybe someone else could be the
judge?

Wojciech Rewers (still looking for a job)

__________________________________



Yes, I see that neither of us are experts in C6x assembly programming. ;)

I know that you can get to 1 clock per tap, but to do so requires that you
interleave all the pipelined instructions so that you are feeding a new
pair of operands in on every clock, starting a new multiply on every clock
and starting a new add on every clock. Even though it may take more than
one clock to complete each of these steps, a new operation can be started
on each clock cycle. You don't have to wait for one multiply to finish
before you start the next. This is the same on the C6x and the C5x and all
other high speed DSPs, they are all pipelined, which means it takes more
than one clock to complete an operation, but you can start a new one on
every clock.

The TI code for a dot product should be a good place to start, assuming
that they achieve the k + N timing value. I would be surprised if they
don't. This code only needs to be modified for circular buffering to be
what you need.

BTW, my comment about processing one input sample meant one *new* input
sample. The idea is that you want a new output sample every time you get a
new input sample, rather than waiting until you have a new block of input
samples to yield a new block of output samples. Unless you are decimating
or interpolating, input and output size will be the same. At 12:23 PM 9/18/2003, you wrote:

>--- Arius - Rick Collins <> wrote:
> > Perhaps I am missing something.
>
>clearly one os us is ;-)
>
> > But it is a poor algorithm indeed that
> > can't process a FIR filter with 1 clock per tap!
>
>hm...
>
> > I have not taken the time
> > to look at the code since I am packing up now to
> > travel closer to Isabel's path.
>
>I though the stream of people is flowing the other
>way?
>
> > I am thinking there is a disconnect in how
> > you are counting your
> > clock cycles. Are your figures for just one data
> > sample input?
>
>? one data sample input? hm... I'm not sure what you
>mean by that... my figure was for "one data sample
>output" - meaning - the input parameters are: dly
>(delay line of input samples), h (vector of
>coefitients), nh (number of coeficients)... and as a
>result I get one output sample (from calculating
>dotprod as you wish)
>
> > The MAC on every DSP can process one tap in one
> > clock cycle.
>
>hm... that's what I remember from my DSP classes and
>labs where we were using C5x Starter Kits...
>
> > The C6x
> > processors should be able to fetch two operands,
> > perform a multiply and add
> > into the accumulator on each clock cycle.
>
>hm... well - from what I understand so far C6711 has
>this LDW instruction which can fetch a float number
>but it takes 5 cycles to do so... then the MPYSP takes
>4 cycles to multiply two float operands and then the
>ADDSP takes 4 cycles to accumulate the result...
>
>but then... I'm not sure myself whether the ADDSP can
>directly follow MPYSP - from what I figured so far
>from single-stepping some sample code - it cannot
>because only after 3 delay slots the results are ready
>from MPYSP... but then - I also took a look at the
>assembly code published by TI for block FIR filter
>where the "main loop" includes 4 cycles with many
>parallel instructions in each - I didn't really
>analyze all that though...
>
>well - as I said - I'm not an expert here in assembly
>coding - when J G gave referenced me to that
>Integrated-DSP code I analyzed that and since it's
>much more straightforward I stack with it and
>optimized it by almost 50% - I thought it's OK and I
>should move on to some other stuff I have on my to do
>list... but now you're telling me it sucks? ;-)
>
> > This would give you some fixed
> > number of startup clock cycles plus N cycles where N
> > is the number of taps.
>
>well... as I said - I would expect that on C5x, but
>I'm affraid you're wrong here about C67x and
>floating-point operations... or is it really me who is
>missing something?
>
> > Instead of working with the clearly inefficient fira
> > code, perhaps you
> > should start with some other optimized code to see
> > how to do it right. The
> > entire filter loop should be a single parallel
> > operation. Of course you
> > will have a bit of messy setup and exit code, but
> > the core loop should be
> > one set of parallel ops.
>
>well... I'm really not buying it for now - have you
>actually done some assembly coding on C6x?
>
> > Am I out to lunch on this? If you can't get to k +
> > N cycles, it might be
> > faster running on an MCU!
>
>well... I'm not sure if you're "out to lunch" or not -
>for me you are - but maybe someone else could be the
>judge?
>
>Wojciech Rewers (still looking for a job)
>
>__________________________________
>
Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX



At 12:28 PM 9/18/2003, Wojciech Rewers wrote:

>--- Arius - Rick Collins <> wrote:
>and one more thing... could someone please explain me
>why the profiler shows different clock cycles count
>than obtained from the equations I formulated? (11+13n
>and 16+7n)? I guessed it could have something to do
>with cache operation, but I'm certainly not an expert
>in that field, so I didn't investigate that... but I'm
>just curious... anybody knows an explanation?


Which way are the results off? If the profiler is giving very high numbers
it is likely because your code is in external memory and it takes multiple
clock cycles to load each instruction, and multiple instructions are needed
for each execution cycle. Internal memory runs in one clock cycle and
works with a 256 bit wide bus to allow maximum execution speeds.

Cache is ok if you are running large code which is not optimized, but if
your code fits the 90/10 rule (90% of your execution time is spent in 10%
of your code) an internal memory model will work better. Also the fetches
of data can slow the execution if they are from external
memory. Instructions will be cached after the first time through the loop,
but the data will always be slow. So put the data in internal memory. If the profiler is giving low numbers, you may be seeing optimization by
the software tools. It has been a long time since I have coded in assembly
for this chip, but I seem to recall that there were two forms; one for ease
of coding where the tool would do as much parallelism as it could figure
out, and one where you had to do all the work, but the tool would not
change anything you coded. Which are you working with? Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX



Wojciech,

Below is TI's dot product asm code, which is close to what you want.
Big problem with it is that the LDDW operation requires both arrays to be DWORD
aligned. This would preclude stepping by a single sample. Anyway, you get the
drift of what is happening here.

- Andrew E. *===============================================================================
*
* TEXAS INSTRUMENTS, INC.
*
* C67X FLOATING POINT DOT PRODUCT (HAND-CODED ASSEMBLY VERSION)
*
* Revision Date: 03/11/98
*
* USAGE This routine is C Callable and can be called as:
*
* float dotp(const float *x, const float *y, const int count);
*
* x is pointer to array holding the first floating point vector
* y is pointer to array holding the second floating point vector
* count is the number of values in the x & y vectors
*
* If the routine is not to be used as a C callable function,
* then you need to initialize values for all of the parameters
* passed to the function since these are assumed to be in
* registers as defined by the calling convention of the
* compiler, (refer to the TMS320C6x Optimizing C Compiler
* User's Guide).
*
* C CODE
* This is the C equivalent for the assembly code. Note that
* the assembly code is hand optimized and restrictions may
* apply.
*
* float dotp(const float *x, const float *y, const int count)
* {
* int i;
* float sum = 0;
*
* for (i=0; i < count; i++)
* {
* sum += x[i] * y[i];
* }
* return sum;
* }
*
* DESCRIPTION
*
* This routine calculates the dot product of 2 vectors.
*
* TECHNIQUES
*
* 1. LDDW instructions are used to load two SP floating point
* values at a time for the x and y arrays.
* 2. The loop is unrolled once and software pipelined.
* 3. Since the ADDSP and MPYSP instructions take 4 cycles,
* A8, B8, A0, and B0 multiplex different variables to save
* on register usage.
* This multiple assignment is possible since the variables
* are always read just once on the first cycle that they
* are availble.
* 4. The loop is primed to reduce the prolog by 4 cycles
* (14 words) with no increase in cycle time.
* 5. The load counter is used as the loop counter which
* requires a 3 cycle (6 word) epilog to finish the
* calculations. This does not increase the cycle time.
* The advantage is that the routine does not perform
* extraneous loads.
*
* ASSUMPTIONS
*
* 1. Little Endian is assumed for LDDW instructions.
* 2. The value of count must be greater than or equal
* to 10 and even (10, 12, 14, ...).
* 3. Since single assignment of registers is not used,
* interrupts should be disabled before this function is
* called.
*
* MEMORY NOTE
*
* The arrays x and y should be aligned on opposite
* (even and odd) double word boundaries to avoid memory
* bank hits. This routine does not perform extraneous loads.
*
* ARGUMENTS PASSED
*
* x -> A4 = ptr_x
* y -> B4 = ptr_y
* count -> A6
*
* CYCLES
*
* N/2 + 24 with C overhead
* N/2 + 24 without C overhead
*
*===============================================================================
.global _dotp
.text
_dotp:

*** BEGIN Benchmark Timing ***
B_START:
* Prolog Begins ****************************************************************
ZERO .L1 A5 ; prod0 = 0
|| ZERO .L2 B5 ; prod1 = 0
|| LDDW .D1 *A4++[1],A7:A6 ; load x1:x0 from memory
|| LDDW .D2 *B4++[1],B7:B6 ; load y1:y0 from memory
|| B .S1 LOOP ; branch to loop

ZERO .L1 A8 ; sum0 = 0
|| ZERO .L2 B8 ; sum1 = 0
|| LDDW .D1 *A4++[1],A7:A6 ; @ load x1:x0 from memory
|| LDDW .D2 *B4++[1],B7:B6 ; @ load y1:y0 from memory
|| SUB .S2X A6,10,B0 ; lcntr = count - 10
|| B .S1 LOOP ; branch to loop

LDDW .D1 *A4++[1],A7:A6 ; @@ load x1:x0 from memory
|| LDDW .D2 *B4++[1],B7:B6 ; @@ load y1:y0 from memory
|| B .S1 LOOP ; branch to loop

LDDW .D1 *A4++[1],A7:A6 ; @@@ load x1:x0 from memory
|| LDDW .D2 *B4++[1],B7:B6 ; @@@ load y1:y0 from memory
|| B .S1 LOOP ; branch to loop

LDDW .D1 *A4++[1],A7:A6 ; @@@@ load x1:x0 from memory
|| LDDW .D2 *B4++[1],B7:B6 ; @@@@ load y1:y0 from memory
|| B .S1 LOOP ; branch to loop
****** Loop Begins *************************************************************
LOOP:
[B0] LDDW .D1 *A4++[1],A7:A6 ; @@@@@ if(lcntr) load x1:x0 from memory
||[B0] LDDW .D2 *B4++[1],B7:B6 ; @@@@@ if(lcntr) load y1:y0 from memory
|| MPYSP .M1X A6,B6,A5 ; prod0 = x0 * y0
|| MPYSP .M2X A7,B7,B5 ; prod1 = x1 * y1
|| ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0
|| ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1
||[B0] B .S1 LOOP ; if(lcntr) branch to loop
||[B0] SUB .S2 B0,2,B0 ; if(lcntr) lcntr -= 2
****** Epilog Begins ***********************************************************
ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0
|| ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1

ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0
|| ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1

ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0
|| ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1
* Epilog Ends ******************************************************************
ADDSP .L1X A8,B8,A0 ; A0 = sum0 + sum1
ADDSP .L2X A8,B8,B0 ; B0 = sum0 + sum1
ADDSP .L1X A8,B8,A0 ; A0 = sum0 + sum1
ADDSP .L2X A8,B8,B0 ; B0 = sum0 + sum1
NOP ; wait for B0
ADDSP .L1X A0,B0,A5 ; A5 = A0 + B0
NOP ; wait for next B0
ADDSP .L2X A0,B0,B5 ; B5 = A0 + B0
NOP ; wait for B5
B .S2 B3 ; return from function
NOP ; wait for B5
ADDSP .L1X A5,B5,A4 ; return A4
NOP 3 ; wait for A4 and branch
B_END:
*** END Benchmark Timing *** At 09:23 AM 9/18/2003 -0700, Wojciech Rewers wrote:

>--- Arius - Rick Collins <> wrote:
>> Perhaps I am missing something.
>
>clearly one os us is ;-)
>
>> But it is a poor algorithm indeed that
>> can't process a FIR filter with 1 clock per tap!
>
>hm...
>
>> I have not taken the time
>> to look at the code since I am packing up now to
>> travel closer to Isabel's path.
>
>I though the stream of people is flowing the other
>way?
>
>> I am thinking there is a disconnect in how
>> you are counting your
>> clock cycles. Are your figures for just one data
>> sample input?
>
>? one data sample input? hm... I'm not sure what you
>mean by that... my figure was for "one data sample
>output" - meaning - the input parameters are: dly
>(delay line of input samples), h (vector of
>coefitients), nh (number of coeficients)... and as a
>result I get one output sample (from calculating
>dotprod as you wish)
>
>> The MAC on every DSP can process one tap in one
>> clock cycle.
>
>hm... that's what I remember from my DSP classes and
>labs where we were using C5x Starter Kits...
>
>> The C6x
>> processors should be able to fetch two operands,
>> perform a multiply and add
>> into the accumulator on each clock cycle.
>
>hm... well - from what I understand so far C6711 has
>this LDW instruction which can fetch a float number
>but it takes 5 cycles to do so... then the MPYSP takes
>4 cycles to multiply two float operands and then the
>ADDSP takes 4 cycles to accumulate the result...
>
>but then... I'm not sure myself whether the ADDSP can
>directly follow MPYSP - from what I figured so far
>from single-stepping some sample code - it cannot
>because only after 3 delay slots the results are ready
>from MPYSP... but then - I also took a look at the
>assembly code published by TI for block FIR filter
>where the "main loop" includes 4 cycles with many
>parallel instructions in each - I didn't really
>analyze all that though...
>
>well - as I said - I'm not an expert here in assembly
>coding - when J G gave referenced me to that
>Integrated-DSP code I analyzed that and since it's
>much more straightforward I stack with it and
>optimized it by almost 50% - I thought it's OK and I
>should move on to some other stuff I have on my to do
>list... but now you're telling me it sucks? ;-)
>
>> This would give you some fixed
>> number of startup clock cycles plus N cycles where N
>> is the number of taps.
>
>well... as I said - I would expect that on C5x, but
>I'm affraid you're wrong here about C67x and
>floating-point operations... or is it really me who is
>missing something?
>
>> Instead of working with the clearly inefficient fira
>> code, perhaps you
>> should start with some other optimized code to see
>> how to do it right. The
>> entire filter loop should be a single parallel
>> operation. Of course you
>> will have a bit of messy setup and exit code, but
>> the core loop should be
>> one set of parallel ops.
>
>well... I'm really not buying it for now - have you
>actually done some assembly coding on C6x?
>
>> Am I out to lunch on this? If you can't get to k +
>> N cycles, it might be
>> faster running on an MCU!
>
>well... I'm not sure if you're "out to lunch" or not -
>for me you are - but maybe someone else could be the
>judge?
>
>Wojciech Rewers (still looking for a job)
>
>__________________________________ >_____________________________________
>Note: If you do a simple "reply" with your email client, only the author of
this message will receive your answer. You need to do a "reply all" if you want
your answer to be distributed to the entire group.
>
>_____________________________________
>About this discussion group:
>
>To Join: Send an email to
>
>To Post: Send an email to
>
>To Leave: Send an email to
>
>Archives: http://www.yahoogroups.com/group/c6x
>
>Other Groups: http://www.dsprelated.com >">http://docs.yahoo.com/info/terms/




> Even though it may take more than one clock to
complete each of these steps, a new operation can be
started on each clock cycle. You don't have to wait
for one multiply to finish before you start the next.
This is the same on the C6x and the C5x and all other
high speed DSPs, they are all pipelined, which means
it takes more than one clock to complete an operation,
but you can start a new one on every clock.

well... all that you wrote was also my general
knowledge... sure I've heard what the pipeline is -
but never actually played with it in details... and
now I tried... unfortunately - it's not as you are
describing it... I wrote the code:

.global _MAC
.text
_MAC:
MPYSP.m1 A4, A6, A5
NOP 3
ADDSP.l1 A5, B4, A7
NOP 2
B .s2 b3
MV .l1 a7,a4
NOP 4

and I'm calling it like this:

float MAC(float, float, float);
a = 2; b = 45; c = 3;

d = MAC(a, b, c);

and it only gives the correct result when there are at
least 3 NOPs after MPYSP instruction (because MPYSP
has 3 delay slots)... so it's not as you said that you
can start a new instruction before waiting for the
previous to finish - in this example MPYSP has 3 delay
slots and only after those 3 NOPS the result is ready
for the ADDSP instruction... then I thought - maybe
it's because those commands are not executed by the
same units? so I changed the original m1x to m1 (to be
at least on the same side without any crosspaths) -
but then - the MPYSP instruction can only operate on
either M1 or M2 units and the ADDSP can only operate
on either L1 or L2 unites - so in the end - I really
don't know how ADDSP can directly follow MPYSP (unless
they are operating on exclusive sets of registers)...
or am I missing something here? anyone? please...

> The TI code for a dot product should be a good place
to start, assuming that they achieve the k + N timing
value. I would be surprised if they don't. This code
only needs to be modified for circular buffering to be
what you need.

right... well - I was really surprised because I
started with the integratedDSP's code and there were
comments in that code as though it was based on TI's
dotprod function - well - it certainly is not! unless
they meant some other function than DSPF_sp_dotprod()
function from C67x DSPLIB... I briefly examined this
one and it's really "complex" with lots of parallelism
- nothing even slightly similar to IntegratedDSP's
code...

> my comment about processing one input sample meant
one *new* input sample. The idea is that you want a
new output sample every time you get a new input
sample, rather than waiting until you have a new block
of input samples to yield a new block of output
samples. Unless you are decimating or interpolating,
input and output size will be the same.

right - we're synchronized on this one... as I said in
the topic - I need a "sample-by-sample" (as the
opposite of block FIR) algorithm...

now about the profiler:

> Which way are the results off?

well - to begin with - the profiler is giving absurd
results - hundreds of thousands of cycles in the
"average" collumn but then 0 cycles in minimum and
maximum collumns - I really don't know how to
interpret those... but as I said - I have never really
worked with the profiler...

the values I published before were from the observing
the "clock count" while running the code from
breakpoint to breakpoint - there are those options of:
enable clock/view clock in the profiler menu... so -
that's what I used...

> If the profiler is giving very high numbers it is
likely because your code is in external memory and it
takes
multiple clock cycles to load each instruction, and
multiple instructions are needed for each execution
cycle. Internal memory runs in one clock cycle and
works with a 256 bit wide bus to allow maximum
execution speeds.

right... well - I've heard all that before... I mean -
those "general theoretical explanations" - but then -
getting down with bits - I still don't understand the
whole process... right now I don't have the time to
investigate that, but I will when I have some free
time...

> Cache is ok if you are running large code which is
not optimized, but if your code fits the 90/10 rule
(90% of your execution time is spent in 10% of your
code) an internal memory model will work better. Also
the fetches of data can slow the execution if they are
from external memory. Instructions will be cached
after the first time through the
loop, but the data will always be slow. So put the
data in internal memory.

right... well - I will do some reading on that as soon
as I find some more time...

> If the profiler is giving low numbers, you may be
seeing optimization by the software tools. It has
been a long time since I have coded in assembly for
this chip, but I seem to recall that there were two
forms; one for ease of coding where the tool would do
as much parallelism as it could figure out, and one
where you had to do all the work, but the tool would
not change anything you coded. Which are you working
with?

well - you lost me here again - I'm writing assembly -
not linear assembly which is "optimized or not" - I
believe what I write is the exact instruction that the
device is executing - and that's what I see in the
debugger while single-stepping...

anyway - the posts got long ;-) but hey - everyone -
if anybody sees something incorect in what I wrote -
please do let me know...

so - how is Isabelle treating you guys? ;-)

Wojciech Rewers

__________________________________



Let's start with the important stuff, Isabel... I headed to my lake house
in VA since that was right on the storm track. Turns out the winds were
not all that bad and I had no damage, but the power went off Thursday night
and had not come back on Sunday, so I bailed. I am back in Frederick
where I can use my computer again. Seems a lot of power lines in VA have
tall trees right next to them and when one topples, it can take down two or
three poles and sections of line. It takes a long time to fix all that. At 04:48 PM 9/18/2003, you wrote:
>well... all that you wrote was also my general
>knowledge... sure I've heard what the pipeline is -
>but never actually played with it in details... and
>now I tried... unfortunately - it's not as you are
>describing it... I wrote the code:
>
> .global _MAC
> .text
>_MAC:
> MPYSP.m1 A4, A6, A5
> NOP 3
> ADDSP.l1 A5, B4, A7
> NOP 2
> B .s2 b3
> MV .l1 a7,a4
> NOP 4
>
>and I'm calling it like this:
>
>float MAC(float, float, float);
>a = 2; b = 45; c = 3;
>
>d = MAC(a, b, c);
>
>and it only gives the correct result when there are at
>least 3 NOPs after MPYSP instruction (because MPYSP
>has 3 delay slots)... so it's not as you said that you
>can start a new instruction before waiting for the
>previous to finish - in this example MPYSP has 3 delay
>slots and only after those 3 NOPS the result is ready
>for the ADDSP instruction... then I thought - maybe
>it's because those commands are not executed by the
>same units? so I changed the original m1x to m1 (to be
>at least on the same side without any crosspaths) -
>but then - the MPYSP instruction can only operate on
>either M1 or M2 units and the ADDSP can only operate
>on either L1 or L2 unites - so in the end - I really
>don't know how ADDSP can directly follow MPYSP (unless
>they are operating on exclusive sets of registers)...
>or am I missing something here? anyone? please...

That is because the result from one operation is not out for several
clocks. Only the core of the loop can do all the ops in one clock
cycle. You have to lead into it with code that starts each part at the
right time. This is sort of like singing a "round", like "row, row, row
your boat" if you know what that is.

I took a look at the code Andrew Elder posted and I think I understand what
they are doing. They use a "prolog" to start the fetches from
memory. This must take 5 clocks since they start 5 of them before they
reach the loop. In the loop, they start the multiply and also the
add. The add should not be started until the mpy results are ready, but it
works out ok since the mpy result registers are zeroed in the
beginning. So the extra adds in the loop do nothing. The epilog runs more
adds to handle the outputs of the multiplies and then adds the two sums (I
think they were loading double words so they could keep two MACs busy, I
think this only works from internal memory). The final adds account for
the fact that since the pipeline in the mpy and add are 4 clocks, there are
really four MACs going on in each register, so they have to be summed to
get the final result. The NOPs are for timing.

As you can see, it takes a lot to write code like this. It is much better
to use library code if possible. So this code will do what you want if you
can change the addressing for a circular buffer. Can you figure out how to
do that? > > The TI code for a dot product should be a good place
>to start, assuming that they achieve the k + N timing
>value. I would be surprised if they don't. This code
>only needs to be modified for circular buffering to be
>what you need.
>
>right... well - I was really surprised because I
>started with the integratedDSP's code and there were
>comments in that code as though it was based on TI's
>dotprod function - well - it certainly is not! unless
>they meant some other function than DSPF_sp_dotprod()
>function from C67x DSPLIB... I briefly examined this
>one and it's really "complex" with lots of parallelism
>- nothing even slightly similar to IntegratedDSP's
>code...

The parallelism is how you get a MAC (or two) on every clock! Other DSPs
hide a lot of the details from you and have shorter pipelines. If you want
to write optimized assembly for the C6x, you will need to master this.

I can't help you a lot on the profiler since I have never used it. Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX



--- Arius - Rick Collins <> wrote:
> I took a look at the code Andrew Elder posted and I
> think I understand what they are doing.

yeah... I actually realized I should digest that code
even before Andrew advised that... so I did... at
first I couldn't understand any of it, but after a
while I realized what's going on ;-) god it's simple!
and very neat too...

> I think this only works from internal memory).

and why would that be?

> The final adds account for
> the fact that since the pipeline in the mpy and add
> are 4 clocks, there are
> really four MACs going on in each register, so they
> have to be summed to
> get the final result.

yeah... understanding most of that algorithm was easy
- the ADDSPs was the most difficult part as I couldn't
figure out how you can accumulate into just one
register if the ADDSP takes 4 cycles... in fact - even
though only one register is used - still there are
founr accumulation going on - each for every fourth
cycle - and the results are kept somewhere in the
pipeline I guess - so then those four sums for each
register have to be added together... the spru198
explains the whole process quite well... so - you know
how it is - if all possibilities fail - read the
manual ;-)

> As you can see, it takes a lot to write code like
> this.

well - again I was proved, there are no difficult
things in life - only new ones... at first when I
looked at that dotp function - it was all black magic
to me - now when I look at it all so simple and neat -
I think that functions shows excellently the beauty
and power of C6x devices...

> It is much better to use library code if possible.

well - nobody likes to reinvent the wheel...
especially when the deadline is close...

> So this code will do what you want if you
> can change the addressing for a circular buffer.
> Can you figure out how to do that?

and what do you think I was doing when you were making
out with Isabelle? ;-)

I had to give up half the speed though... since I need
"sample-by-sample" algotithm - I can't have any
constraints on the location of samples in the memory -
and using LDDW imposes on you that the samples are
aligned on odd-even addresses or something like
that...

so what I did is that I changed LDDW into LDW, skipped
the parallel MPYSPs and ADDSPs plus I use circular
buffer for data... if anyone is interested - the code
is here:

http://www.wrewers.karolin.pl/firc.asm

maybe somebody can even find a bug or something -
please do let me know...

well - I started from thousands of cycles first to
filter the data (using library code) then to shift all
the samples in the buffer by one - now I have a code
that is performing "the same" in around 500 cycles -
it's enough for me as for now - but the interesting
aspect is that the same thing can be done for so many
different ways ;-) the beauty of DSPs I guess...

anyway - I have that FIR now - now I will only need a
LMS to adapt FIR's coeficients - but I think I'll
figure something out...

Wojciech Rewers (still looking for a job)

__________________________________


One final suggestion and I will let you get on with it. You can optimize
the code to use the LDDW if you copy your input data into *two* buffers,
one odd aligned and one even aligned. Alternate between them on each
sample and the LDDW should be in the right place each time.
At 03:15 AM 9/22/2003, you wrote:
>--- Arius - Rick Collins <> wrote:
> > I took a look at the code Andrew Elder posted and I
> > think I understand what they are doing.
>
>yeah... I actually realized I should digest that code
>even before Andrew advised that... so I did... at
>first I couldn't understand any of it, but after a
>while I realized what's going on ;-) god it's simple!
>and very neat too...
>
> > I think this only works from internal memory).
>
>and why would that be?
>
> > The final adds account for
> > the fact that since the pipeline in the mpy and add
> > are 4 clocks, there are
> > really four MACs going on in each register, so they
> > have to be summed to
> > get the final result.
>
>yeah... understanding most of that algorithm was easy
>- the ADDSPs was the most difficult part as I couldn't
>figure out how you can accumulate into just one
>register if the ADDSP takes 4 cycles... in fact - even
>though only one register is used - still there are
>founr accumulation going on - each for every fourth
>cycle - and the results are kept somewhere in the
>pipeline I guess - so then those four sums for each
>register have to be added together... the spru198
>explains the whole process quite well... so - you know
>how it is - if all possibilities fail - read the
>manual ;-)
>
> > As you can see, it takes a lot to write code like
> > this.
>
>well - again I was proved, there are no difficult
>things in life - only new ones... at first when I
>looked at that dotp function - it was all black magic
>to me - now when I look at it all so simple and neat -
>I think that functions shows excellently the beauty
>and power of C6x devices...
>
> > It is much better to use library code if possible.
>
>well - nobody likes to reinvent the wheel...
>especially when the deadline is close...
>
> > So this code will do what you want if you
> > can change the addressing for a circular buffer.
> > Can you figure out how to do that?
>
>and what do you think I was doing when you were making
>out with Isabelle? ;-)
>
>I had to give up half the speed though... since I need
>"sample-by-sample" algotithm - I can't have any
>constraints on the location of samples in the memory -
>and using LDDW imposes on you that the samples are
>aligned on odd-even addresses or something like
>that...
>
>so what I did is that I changed LDDW into LDW, skipped
>the parallel MPYSPs and ADDSPs plus I use circular
>buffer for data... if anyone is interested - the code
>is here:
>
>http://www.wrewers.karolin.pl/firc.asm
>
>maybe somebody can even find a bug or something -
>please do let me know...
>
>well - I started from thousands of cycles first to
>filter the data (using library code) then to shift all
>the samples in the buffer by one - now I have a code
>that is performing "the same" in around 500 cycles -
>it's enough for me as for now - but the interesting
>aspect is that the same thing can be done for so many
>different ways ;-) the beauty of DSPs I guess...
>
>anyway - I have that FIR now - now I will only need a
>LMS to adapt FIR's coeficients - but I think I'll
>figure something out...
>
>Wojciech Rewers (still looking for a job)
>
>__________________________________
>
Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX