Reply by Jerry Avins July 31, 2004
Jaime Andres Aranguren Cardona wrote:

   ...

> Jerry: did you maybe mean that there are efficient data structures to
> implement circular buffers in software if the compiler doesn't support
> it directly, inferring from the underlying hardware?

Yes. There are interesting ways to use the same data laid down more than once, end to end, so that each pass is contiguous and checking for wrap-around on each access isn't needed. Think about what has to be done when the stride is greater than one!

Jerry
--
Engineering is the art of making what you want from things you can get.
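
One way to read the remark about data "laid down more than once" is the doubled delay line sketched here: each new sample is stored twice, 16 slots apart, so the newest 16 samples always sit in one contiguous run and the inner loop needs no wrap test. This is only a sketch under that interpretation; the names DelayLine, dl_push and dl_fir are invented for illustration and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>

#define NTAPS 16  /* assumed filter length, matching the 16-tap code in this thread */

/* Hypothetical doubled delay line: every sample is written twice, NTAPS apart. */
typedef struct {
    double buf[2 * NTAPS];
    size_t head;            /* index of the newest sample in the first half */
} DelayLine;

/* Store a new sample. Only the index moves; no data is shifted. */
static void dl_push(DelayLine *d, double x)
{
    assert(d != NULL);
    d->head = (d->head == 0) ? NTAPS - 1 : d->head - 1;
    d->buf[d->head] = x;
    d->buf[d->head + NTAPS] = x;   /* mirror copy keeps the window contiguous */
}

/* Dot product of the newest NTAPS samples (newest first) with the weights. */
static double dl_fir(const DelayLine *d, const double *w)
{
    const double *x = &d->buf[d->head];  /* contiguous run of NTAPS samples */
    double acc = 0.0;
    size_t j;
    for (j = 0; j < NTAPS; j++)
        acc += x[j] * w[j];
    return acc;
}
```

The cost is twice the memory and one extra store per sample; in exchange the multiply-accumulate loop walks a plain linear array.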
Reply by Jaime Andres Aranguren Cardona July 31, 2004
Jerry Avins <jya@ieee.org> wrote in message ...
> The task being
> programmed can best be done with a circular buffer. There are efficient
> data structures to implement it in software if the hardware doesn't
> support it directly.

I'd rather say that what happens more often is that the software (C code + compiler) doesn't support circular buffers, but the hardware DOES. Doesn't the 64x have circular buffers?

Jerry: did you maybe mean that there are efficient data structures to implement circular buffers in software if the compiler doesn't support it directly, inferring from the underlying hardware?

JaaC
Reply by Jaime Andres Aranguren Cardona July 31, 2004
> * How expensive are floats on that processor? Doubles? Some processors
> are significantly faster with ordinary floats.

Unfortunately the 6416 is a fixed point DSP (16 bit). However, I wouldn't expect that to be the only reason for increasing the CPU usage that much! The 64x family is by far the most capable in terms of computational power among TI's DSPs, and it surpasses many DSPs from other vendors.

> * You calculate norm, then divide by it. Multiplication is usually much
> faster.
>
> * Shifting FIR[j]=FIR[j-1] is way expensive -- make this a circular buffer.

Sure it is! Do your best to make it circular. Unfortunately C compilers do not always (almost never!) understand that the DSPs have circular buffers.

> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
>
> Ideally you'd take this whole section of code and do it by hand in
> assembly. I don't know if you have the time, but fercrissakes it's a
> DSP chip! C compilers don't understand these things, and if it was easy
> there wouldn't be any mystique!

JaaC
Reply by Jay Mullen July 30, 2004
marlo_ti@yahoo.com (Marlo Flores) wrote in
news:624176e6.0407280228.11a43cac@posting.google.com:

> I made the same project two years ago with the C5402 and my code was
> written completely in C. It worked quite well. The solution was
> optimizing the compiler to Level 3.
>
> Tim Wescott <tim@wescottnospamdesign.com> wrote in message
> news:<10g60psgpnghm37@corp.supernews.com>...
>> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen
>> Code Composter recognize this. Check your assembly, if you're not
>> getting a MAC instruction here then _this_ is the code to hand-do in
>> assembly.

Can you please elaborate a little more on optimizing the compiler to level 3?
Reply by Jerry Avins July 29, 2004
Robert Sherry wrote:
> Jay,
>
> Here are some ideas on how you might get better code performance:
>
> 1) Consider the for loop:
> for (j=15; j>0; j--)
> {
>     FIR1[j]=FIR1[j-1];
> }
> I suspect that this for loop can be replaced with a call to memcpy.
> The memcpy supplied with your compiler should be written in assembly
> language and therefore
> it takes full advantage of any special instructions provided by the
> hardware. I am thinking of a zero overhead repeat loop.

No matter how efficiently you might move 16 data elements, readjusting two pointers (which can be in registers) will be faster. The task being programmed can best be done with a circular buffer. There are efficient data structures to implement it in software if the hardware doesn't support it directly.

Jerry
--
Engineering is the art of making what you want from things you can get.
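
A plain software circular buffer of the kind referred to above might look like the following sketch. It assumes the length is a power of two so that wrapping is a single AND; the names hist, newest, hist_push and hist_get are hypothetical and only serve the illustration.

```c
#include <assert.h>

#define NTAPS 16            /* power of two, so wrapping is a single AND */
#define MASK  (NTAPS - 1)

static double hist[NTAPS];  /* sample history (hypothetical name) */
static unsigned newest = 0; /* index of the most recent sample */

/* Insert a sample by moving the index instead of shifting 15 elements. */
static void hist_push(double x)
{
    newest = (newest - 1u) & MASK;   /* unsigned wrap: 0 - 1 becomes 15 */
    hist[newest] = x;
}

/* k = 0 returns the newest sample, k = NTAPS-1 the oldest. */
static double hist_get(unsigned k)
{
    assert(k < NTAPS);
    return hist[(newest + k) & MASK];
}
```

Each access still pays for one AND, which is the per-access overhead that hardware circular addressing removes.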
Reply by Michael S July 29, 2004
Do you realize that the C64 (or any fixed- or floating-point DSP, for that
matter) is a pretty poor platform for double-precision floating-point
calculations? As a rule of thumb it's 100 to 1000 times slower than
your average PC.

Consider converting to more natural data types, preferably 16-bit
fixed-point values for samples and coefficients and 32-bit fixed-point
values for accumulators.
The transversal LMS structure that you have currently implemented probably
doesn't have sufficient numerical stability for a fixed-point
implementation. You would have to change your filter to a more stable
structure, preferably lattice/ladder.
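
To make the suggested data types concrete, here is a rough sketch of the filter's inner product in Q15 format: 16-bit samples and coefficients, with a 32-bit accumulator. The function name fir_q15 is made up for this example. It assumes the absolute values of the weights sum to less than 1.0 in Q15 so that the accumulator cannot overflow, and it does not address the numerical stability concern raised above.

```c
#include <assert.h>
#include <stdint.h>

#define NTAPS 16

/* Q15 dot product: 16-bit samples and coefficients, 32-bit accumulator.
   Assumption: sum of |w[j]| is below 32768, so acc stays within 32 bits. */
static int16_t fir_q15(const int16_t *x, const int16_t *w)
{
    int32_t acc = 0;
    int j;

    assert(x != 0 && w != 0);

    for (j = 0; j < NTAPS; j++)
        acc += (int32_t)x[j] * (int32_t)w[j];   /* Q15 * Q15 = Q30 */

    /* Back to Q15. Right-shifting a negative value is implementation-defined
       in C; an arithmetic shift is assumed here. */
    acc >>= 15;

    if (acc > INT16_MAX) acc = INT16_MAX;       /* saturate instead of wrapping */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

In this format 16384 represents 0.5, so a single tap of 0.5 applied to a sample of 0.5 yields 8192, that is 0.25.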
Reply by Robert Sherry July 29, 2004
Jay,

    Here are some ideas on how you might get better code performance:

1) Consider the for loop:
 for (j=15; j>0; j--)
    {
        FIR1[j]=FIR1[j-1];
     }
I suspect that this for loop can be replaced with a call to memcpy.
The memcpy supplied with your compiler should be written in assembly
language, and therefore it takes full advantage of any special
instructions provided by the hardware. I am thinking of a zero-overhead
repeat loop.
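
A caveat for anyone trying this: the source range FIR1[0..14] and the destination range FIR1[1..15] overlap, and standard C leaves memcpy undefined for overlapping objects, so memmove is the portable call. A minimal sketch, with shift_in as a made-up helper name:

```c
#include <assert.h>
#include <string.h>

#define NTAPS 16

/* Shift the history up by one element and insert the new sample at index 0.
   memmove is used because the source and destination ranges overlap. */
static void shift_in(double *fir, double x)
{
    assert(fir != NULL);
    memmove(&fir[1], &fir[0], (NTAPS - 1) * sizeof fir[0]);
    fir[0] = x;
}
```

Whether the library routine beats the hand-written loop on a given DSP is something only a profile of the generated code can show.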

2) You have these two for loops:
    for (j=0; j<16; j++)
    {
        Z += FIR1[j]*weight_array[j];
    }
    outp = RCnorm - Z;

    for (j=0; j<16; j++)
    {
        weight_array[j] += 2*0.01*outp*FIR1[j];
    }

I suspect that you can turn these two for loops into one loop. Not
sure how much this is going to help. You can also consider using
pointer arithmetic rather than indexing. On some machines this can be
a big win. Some compilers will do this for you automatically but many
do not. You can also consider turning each of the above for loops into
16 assignment statements. I am not sure it is worth the extra code
space.
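
As a sketch of the pointer-arithmetic idea, the two loops could be written as below. Note that the weight update needs outp, which is only known once the whole sum Z is complete, so in this sketch the loops stay separate rather than being merged. The function name lms_step and its parameter names are invented for the example; the step size 2*0.01 is taken from the original post.

```c
#include <assert.h>

#define NTAPS 16

/* One LMS update using pointer walks instead of indexing.
   x: sample history (newest first), w: weights, d: desired sample.
   Returns the error, called outp in the original post. */
static double lms_step(const double *x, double *w, double d)
{
    const double *xp = x;
    const double *wp = w;
    const double *xend = x + NTAPS;
    double z = 0.0, err, g;
    double *wu;

    assert(x != 0 && w != 0);

    while (xp < xend)             /* filter output: sum of x[j]*w[j] */
        z += *xp++ * *wp++;

    err = d - z;
    g = 2 * 0.01 * err;           /* hoisted: the original recomputed this per tap */

    for (xp = x, wu = w; xp < xend; )
        *wu++ += g * *xp++;       /* weight update */

    return err;
}
```

Hoisting the product 2*0.01*outp out of the second loop saves two multiplies per tap regardless of how the compiler handles the indexing.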

3) If a variable is heavily used in a program, a compiler should
allocate it to a fast register. However, some compilers fail to
allocate the right variables to fast registers. As a result, your
program runs slowly. By using the keyword register, you can suggest to
the compiler that this variable should be allocated to a fast
register.

4) You wrote:
        double norm = pow (2,15);
    If the above statement is inside a loop, then I would compute 2^15
at compile time. By the way, the statement:
            norm = 1.0;
might run faster than the statement:
            norm = 1;
because some compilers will do the conversion from int to double at
run time. Most are not that bad.
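
A small sketch of the compile-time constant, combined with the multiply-instead-of-divide advice given elsewhere in this thread. The macro and function names are made up for the illustration, and 16-bit samples are assumed.

```c
#include <assert.h>

/* 2^15 as a compile-time constant; no call to pow() at run time. */
#define NORM      32768.0
#define INV_NORM  (1.0 / 32768.0)   /* a power of two, so exact in binary floating point */

/* Scale a 16-bit sample into [-1, 1) by multiplying instead of dividing. */
static double normalize(short s)
{
    return s * INV_NORM;
}

/* Scale back to 16 bits; the caller is assumed to keep v in [-1, 1). */
static short denormalize(double v)
{
    assert(v >= -1.0 && v < 1.0);
    return (short)(v * NORM);
}
```

Because the scale factor is a power of two, replacing the division with a multiplication by the reciprocal does not change the numerical result.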

I hope this helps.

                        Bob Sherry


"Jay" <cdragon@cogeco.ca> wrote in message
news:Xns9530CBE8594D2cdragoncogecoca@216.221.81.119...
> We are fourth year electrical engineering students involved in a Final
> Design Project course. (Approaching deadline date) We are using a
> TMS320C6416 DSK DSP by Texas Instruments to perform some adaptive noise
> cancellation. Unfortunately we have run into some serious unexpected
> CPU Usage problems. We think that our C code is relatively simple.
>
> Our code consists of a simple LMS algorithm, which we have implemented
> using Reference Framework 3 (provided by TI eXpressDSP tutorial).
>
> The code of our noise cancellation algorithm is as follows:
>
> //Variable Declaration (Global)
>
> static double FIR1[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
> static double FIR2[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>
> static double weight_array [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
> static double weight_array2 [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
> static double tempweights [5000][16];
>
> Sample *srcLeft, *dst, *srcRight, *dst2, ttemp, ttemp2;
> Int size; /* in samples */
> Int chan;
> Int i,j;
> double RCnorm, LCnorm;
> Int icount = 0;
>
> double outp;
> double norm = pow (2,15);
> double Z = 0.0;
> double peakpower = 0.0;
>
> //Assign variables to input buffers
> srcRight = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn );
> srcLeft = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn2);
> /* get the size in samples (the function below returns it in words) */
> size = sizeInSamples( PIP_getReaderSize( thrAudioproc[chan].pipIn ) );
>
> /* get the empty buffer from the out-pipe */
> PIP_alloc( thrAudioproc[chan].pipOut );
> PIP_alloc( thrAudioproc[chan].pipOut2 );
> //Declare output buffers
> dst = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut );
> dst2 = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut2 );
>
> // ***********BASIC LMS (NOISE CANCELLATION) ALGORITHM STARTS HERE *************************************
>
> for ( i= 0; i < FRAMELEN; i++)
> {
>     RCnorm = srcRight[i]/norm;
>     LCnorm =srcLeft[i]/norm;
>
>     for (j=15; j>0; j--)
>     {
>         FIR1[j]=FIR1[j-1];
>     }
>
>     FIR1[0] = LCnorm;
>
>     Z=0.0;
>
>     for (j =0; j<16; j++)
>     {
>         Z+= FIR1[j]*weight_array[j];
>     }
>     outp= RCnorm - Z;
>
>     for (j=0; j<16; j++)
>     {
>         weight_array[j] += 2*0.01*outp*FIR1[j];
>     }
>
>     ttemp=outp*norm;
>
>     //dst[i] = srcRight[i];//(Short)(norm*RCnorm[i]); /* real stereo/N-ch.
>     //dst2[i] = srcRight[i];//(Short)(norm*LCnorm[i]);
>     dst[i] = ttemp;
>     dst2[i] = ttemp;
> }
>
> If anyone has any ideas as to whether the format of our code could be
> contributing to a high DSP CPU Usage (92.8%!!) it would be appreciated
> if they could post some suggestions. Some of our ideas thus far
> include:
>
> - inefficient variable declarations/definitions
> - inefficient code structure
> - 'for loop' problems????
>
> Any help would be greatly appreciated. Thank you very much.
Reply by Marlo Flores July 28, 2004
I made the same project two years ago with the C5402 and my code was
written completely in C. It worked quite well. The solution was
optimizing the compiler to Level 3.

Tim Wescott <tim@wescottnospamdesign.com> wrote in message
news:<10g60psgpnghm37@corp.supernews.com>...
> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
Reply by Jay Mullen July 25, 2004

Thanks for such quick replies everyone.  I'll discuss this problem further 
with our group with the suggestions I've received so far.

I'll keep you posted if anything happens

Reply by Andor July 25, 2004
Jay wrote:
...
> for (j=15; j>0; j--)
> {
>     FIR1[j]=FIR1[j-1];
> }
>
> FIR1[0] = LCnorm;
>
> Z=0.0;
>
> for (j =0; j<16; j++)
> {
>     Z+= FIR1[j]*weight_array[j];
> }
> outp= RCnorm - Z;

Try to use the FIR routine that comes with the manufacturer's library for this processor. DSPs are made to compute FIR filters very efficiently, and the library is guaranteed to have the most efficient implementation. If your own code works as expected but is too slow, then keep your own code to show your supervisor that you have understood the principle of FIR, but use the library code for further work.

Regards,
Andor