Reply by Keith E. Larson, July 12, 2002
Hello Andreas

The answer to your first question, 'how fast is memory filling', depends on a
few things. The details are in the C3x User's Guide (device guide), but since
you are still learning, here are some answers that you can compare against as
you go along.

External memory writes take a minimum of 2 cycles each (0 ws) and cannot be
parallelized since there is only one bus! Look at the timing diagrams in
the UG or data sheet. You may want to ask why reads are 1 cycle and writes
are 2 cycles. Hint: Writes to the bus are 'posted', which allows the CPU to
continue running internally, so as long as there are no conflicts there is no
speed penalty. A 2-cycle write also makes for an easy /CE-controlled write
interface.

Internal writes (reads or write/read) can be parallelized achieving 2 writes
per cycle, or 4x that of external 0-ws memory. I try to point this out as
much as I can since this can be a huge advantage compared to external memory!
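
The figures in the two paragraphs above can be turned into a rough estimate. The following is only a sketch in plain C, meant to run on a PC rather than on the DSP. The function names, the enum and the cycle time parameter are made up for illustration, and the model ignores loop setup, code fetches and DMA conflicts. The one-cycle figure for a single internal write is taken from the internal/internal rows of the cycle table further down.

```c
#include <assert.h>

/* Where the destination of the fill lives (hypothetical names). */
enum mem_kind { MEM_INTERNAL, MEM_EXTERNAL_0WS };

/*
 * Rough cycle count for writing `words` words, using only the figures
 * quoted in this thread: 2 cycles per external 0-ws write, 1 cycle per
 * internal write, and 2 internal writes per cycle when issued in parallel.
 * Assumption: a leftover odd word costs one extra cycle.
 */
unsigned long estimate_fill_cycles(unsigned long words, enum mem_kind dest,
                                   int parallel_pairs)
{
    assert(dest == MEM_INTERNAL || dest == MEM_EXTERNAL_0WS);
    if (dest == MEM_EXTERNAL_0WS)
        return 2UL * words;          /* one bus, writes cannot overlap */
    if (parallel_pairs)
        return (words + 1UL) / 2UL;  /* two writes per cycle */
    return words;                    /* one internal write per cycle */
}

/*
 * Convert cycles to nanoseconds. cycle_ns is an assumed parameter:
 * take the instruction cycle time from the data sheet for your part.
 */
unsigned long cycles_to_ns(unsigned long cycles, unsigned long cycle_ns)
{
    return cycles * cycle_ns;
}
```

For a 100-word fill this gives 200 cycles to external 0-ws memory, 100 cycles for single internal writes and 50 cycles for paired internal writes. Multiply by the cycle time of your device to get the execution time.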

If your loop counter is global, static, or not fully optimized into a repeat
block count, it will likely be kept in memory and updated on each pass through
the loop (i.e. slow).

The compiler often needs some help in setting up parallel writes. In
particular, consider how the code you have shown will be looked at by the
compiler. Basically, you did not provide many hints. You need to remember
that the compiler is trying to minimize resources as well as cycles. In
addition, some things are simply not that easy to teach the compiler. For
example:

- Parallel writes only come in pairs. If the repeat count is not even,
you can't expect the compiler to generate parallel writes.

Instead, consider the following which does achieve parallel write code
generation (using -o3 -mr and -ou optimizations). Excluding setup, and
assuming the destination(s) is on chip and there are no other code or dma
conflicts, this will take 50 cycles. However, if these conditions are not
met, your mileage could vary considerably!

Cycle counts for the 100-word fill:

Program        Data           My code  Your code  Notes
internal       internal        50      100        Different memory spaces
internal       internal        50      100        Code/data in same memory space, RPTS
internal       internal       100      200        Code/data in same memory space, RPTB
external(0ws)  internal        50      100
internal       external(0ws)  200      200
external(0ws)  external(0ws)  200      200        RPTS loop
external(0ws)  external(0ws)  300      300        RPTB loop (must also fetch code)

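main()
{
    int *p1, *p2;
    int x;

    p1 = (int *) 0x809800;         /* start of the 100-word block */
    p2 = (int *) 0x809800 + 50;    /* second half, 50 words further on */
    for (x = 0; x < 100/2; x++)    /* 50 passes, two writes per pass */
    {
        *p1++ = 0;
        *p2++ = 0;
    }
}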
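
Note that the loop in the original question runs from n=0 to n<=100, which is 101 elements, an odd count. One way to keep the writes in pairs is to fill in pairs and pick up the leftover element with a single write afterwards. The following is a minimal sketch of that idea in portable C. The function name is made up for illustration, and whether a given compiler actually turns the paired loop into parallel stores is not guaranteed, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Zero `count` ints starting at `dst`.
 * The paired loop mirrors the two-pointer fill above; the trailing
 * single write picks up the last element when `count` is odd.
 */
void zero_fill_paired(int *dst, size_t count)
{
    size_t half = count / 2;   /* number of write pairs */
    int *p1;
    int *p2;
    size_t i;

    assert(dst != NULL);
    p1 = dst;                  /* walks the first half */
    p2 = dst + half;           /* walks the second half */

    for (i = 0; i < half; i++) {
        *p1++ = 0;
        *p2++ = 0;
    }
    if (count & 1)
        dst[count - 1] = 0;    /* leftover element for odd counts */
}
```

With count = 101 the loop makes 50 passes covering elements 0 to 99, and the final statement clears element 100.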

Code Size
---------
Generate a map file (an output from the linker) and open it as a text
document in CC. You may be surprised how small many programs can be.
Again, just like the memory fill example, knowing how to minimize things can
help considerably. It's not that hard to do once you have completed a few
simple programs.

long doubles
------------
'float' and 'double' are both treated by the compiler as 32-bit floats.
Internally the C3x registers are 40 bits, with the bottom 8 extended-precision
bits not normally being saved. Additionally, 32-bit inputs from memory have
only 32-bit precision.

When 'long double' is selected a two word structure is created which fully
saves the 40 bit registers. This option is great for higher precision needs
such as when computing coefficients, but should be avoided for high speed
DSP work. If you feel that you absolutely need the precision, it often
turns out that there are ways to get that precision by rewriting your code.
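
To answer the array question with numbers, here is a small sketch that does the arithmetic in words, using the sizes described above (one word for float and double, two words for long double). It is meant to run on a PC, where sizeof(long double) would give a different answer, so the sizes are written out as constants. The names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Sizes in 32-bit words as described in this reply for the C3x compiler
   (these are not the sizes your PC compiler uses). */
enum { WORDS_PER_FLOAT = 1, WORDS_PER_LONG_DOUBLE = 2 };

/* Words of memory needed for an array of `elements` items. */
unsigned long array_words(unsigned long elements,
                          unsigned long words_per_element)
{
    assert(words_per_element != 0);
    assert(elements <= ULONG_MAX / words_per_element);  /* no overflow */
    return elements * words_per_element;
}
```

So long double a[50] comes to array_words(50, WORDS_PER_LONG_DOUBLE) = 100 words, against 50 words for float a[50]. The map file will confirm what the linker actually reserved.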

Taking advantage of floating point
----------------------------------
My best example of how to take advantage of floating point is using
differential compression on audio, video and other data (TI and I have a
patent on this). What I found out is that floating point is inherently the
same as ADPCM when constrained to log base 2, but with the additional
advantage of having a huge dynamic range (fixed point does not have an
exponent). A simple differential (first derivative) at the front end and an
integration at the back end is all that is needed, and you can process the
data even when it's 'compressed' since the data is merely a pre-filtered
floating point data stream.
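
The differentiate-then-integrate pairing described above can be sketched in a few lines of C. This is not the patented scheme and not the PAR_EQ.C code. It leaves out the mantissa truncation entirely and only shows that a first difference at the front end is undone by a running sum at the back end. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Front end: first difference, d[n] = x[n] - x[n-1], with x[-1] taken as 0. */
void diff_encode(const float *x, float *d, size_t n)
{
    float prev = 0.0f;
    size_t i;

    assert(x != NULL && d != NULL);
    for (i = 0; i < n; i++) {
        d[i] = x[i] - prev;
        prev = x[i];
    }
}

/* Back end: running sum, y[n] = y[n-1] + d[n], which undoes the difference. */
void integrate_decode(const float *d, float *y, size_t n)
{
    float acc = 0.0f;
    size_t i;

    assert(d != NULL && y != NULL);
    for (i = 0; i < n; i++) {
        acc += d[i];
        y[i] = acc;
    }
}
```

Any processing placed between the two stages works on the differenced stream. With real audio and reduced mantissa bits the reconstruction is no longer exact, which is the trade-off the text describes.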

The PAR_EQ.C example that now ships with the DSK software shows just how
effective this can be. It is set up to LDC compress the incoming audio data,
chop off most of the mantissa bits, and then pass that data stream through
10 stages of IIR filtering. Normally this would make almost any data anomaly
surface with ease, but most people can't hear the effect even when there are
NO mantissa bits, so imagine what happens when you don't chop them all out.

Float/Int/Long Double Connection in hardware
--------------------------------------------
Have a look at how the floating point extended precision registers are set
up in the register file. You should quickly notice that these 40 bit
registers are accessed as integers using the lower 32 bits and as floats
using the upper 32 bits. The 'fix' and 'float' assembler codes are then
used to convert from one format to another.

Hope this helps
Best regards,
Keith Larson
+-----------+
|Keith Larson |
|Member Group Technical Staff |
|Texas Instruments Incorporated |
| |
| 281-274-3288 |
| |
| www.micro.ti.com/~klarson |
|-----------+
| TMS320C3x/C4x/VC33 Applications |
| |
| TMS320VC33 |
| The lowest cost and lowest power 500 uw/mflop |
| floating point DSP on the planet! |
+-----------+


Reply by andreas2002now, July 11, 2002
Hi,

I am using the C31 DSK and the Code Composer and as I am quite
new in the DSP field, I have "newbie" questions.

How can I calculate the execution speed of the C31?
What would be the execution time of the following part of code:
for (n=0; n <= (100); n++)
{
S[n]=0;
}
Is there a way at all to predict the execution time?
In which TI document can I find more about this topic?

As the DSK has only 2K RAM I'd like to know how much my C code uses.
Is there a possibility to check this within CodeComposer?
If I define an array: long double a[50]; how much memory will be
reserved for it?
Where is the connection between the float/double/long double
and the short/single-precision/extended precision floating point
format mentioned in the TMS320C3x Users guide?

regards,

Andreas