TMS320C6416 CPU USAGE PROBLEM (Help needed ASAP!!)

Started by Jay July 24, 2004
We are fourth-year electrical engineering students in a Final Design
Project course, and our deadline is approaching. We are using a
TMS320C6416 DSK from Texas Instruments to perform adaptive noise
cancellation. Unfortunately, we have run into serious, unexpected CPU
usage problems, even though we think our C code is relatively simple.

Our code consists of a simple LMS algorithm, which we have implemented 
using Reference Framework 3 (provided by TI eXpressDSP tutorial).  

The code of our noise cancellation algorithm is as follows:

//Variable Declaration (Global)

static double FIR1[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double FIR2[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

static double weight_array [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double weight_array2 [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double tempweights [5000][16];

Sample *srcLeft, *dst, *srcRight, *dst2, ttemp, ttemp2;
    Int     size;       /* in samples */
    Int     chan;
    Int i,j;
    double RCnorm, LCnorm;
	Int icount = 0;


	double outp;
    double norm = pow (2,15);
    double Z = 0.0;
    double peakpower = 0.0;
    

//Assign variables to input buffers
    srcRight = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn );
    srcLeft = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn2 );
    /* get the size in samples (the function below returns it in words) */
    size = sizeInSamples( PIP_getReaderSize( thrAudioproc[chan].pipIn ) );

    /* get the empty buffer from the out-pipe */
    PIP_alloc( thrAudioproc[chan].pipOut );
    PIP_alloc( thrAudioproc[chan].pipOut2 );
//Declare output buffers
    dst = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut );
    dst2 = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut2 );

       
        
        
    // *********** BASIC LMS (NOISE CANCELLATION) ALGORITHM STARTS HERE ***********

    for (i = 0; i < FRAMELEN; i++)
    {
        RCnorm = srcRight[i] / norm;
        LCnorm = srcLeft[i] / norm;

        // shift the delay line and insert the newest reference sample
        for (j = 15; j > 0; j--)
        {
            FIR1[j] = FIR1[j-1];
        }
        FIR1[0] = LCnorm;

        // filter output
        Z = 0.0;
        for (j = 0; j < 16; j++)
        {
            Z += FIR1[j] * weight_array[j];
        }
        outp = RCnorm - Z;

        // weight update
        for (j = 0; j < 16; j++)
        {
            weight_array[j] += 2*0.01*outp*FIR1[j];
        }

        ttemp = outp * norm;

        //dst[i] = srcRight[i];//(Short)(norm*RCnorm[i]); /* real stereo/N-ch.
        //dst2[i] = srcRight[i];//(Short)(norm*LCnorm[i]);
        dst[i] = ttemp;
        dst2[i] = ttemp;
    }
    

If anyone has any ideas as to whether the format of our code could be 
contributing to a high DSP CPU Usage (92.8%!!) it would be appreciated 
if they could post some suggestions.  Some of our ideas thus far 
include:

-	inefficient variable declarations/definitions
-	inefficient code structure
-	'for loop' problems????


Any help would be greatly appreciated.  Thank you very much.
* How expensive are floats on that processor?  Doubles?  Some processors
  are significantly faster with ordinary floats.

* You calculate norm, then divide by it.  Multiplication is usually much
  faster.

* Shifting FIR[j]=FIR[j-1] is way expensive -- make this a circular buffer.

* Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
  Composter recognize this.  Check your assembly, if you're not getting a
  MAC instruction here then _this_ is the code to hand-do in assembly.

Ideally you'd take this whole section of code and do it by hand in
assembly.  I don't know if you have the time, but fercrissakes it's a
DSP chip!  C compilers don't understand these things, and if it was easy
there wouldn't be any mystique!

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
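The "multiply instead of divide" point can be sketched as follows. This is only an illustration, not code from the original project: `NORM`, `INV_NORM` and `sample_to_float` are hypothetical names, and the sample type is assumed to be a 16-bit `short`. Because 2^15 is known at compile time, the reciprocal can be a constant and the per-sample division (and the call to `pow`) disappears. Note that the C6416 is a fixed-point device, so float arithmetic is still done in software there; the sketch only removes the division.

```c
#include <assert.h>

#define NORM      32768.0f         /* 2^15, known at compile time          */
#define INV_NORM  (1.0f / NORM)    /* constant expression, no runtime divide */

/* Hypothetical helper: scale a 16-bit sample to the range [-1, 1)
 * with one multiplication instead of a division. */
static float sample_to_float(short s)
{
    return (float)s * INV_NORM;
}
```

In the loop of the original post this would replace `srcRight[i]/norm` with `sample_to_float(srcRight[i])`, assuming `Sample` is a 16-bit integer type.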
Jay wrote:
...
> for (j=15; j>0; j--)
> {
>     FIR1[j]=FIR1[j-1];
> }
>
> FIR1[0] = LCnorm;
>
> Z=0.0;
>
> for (j =0; j<16; j++)
> {
>     Z+= FIR1[j]*weight_array[j];
> }
> outp= RCnorm - Z;
Try to use the FIR routine that comes with the manufacturer's library
for this processor. DSPs are made to compute FIR filters very
efficiently, and the library is guaranteed to have the most efficient
implementation. If your own code works as expected but is too slow, then
keep your own code to show your supervisor that you have understood the
principle of FIR, but use the library code for further work.

Regards,
Andor

Thanks for such quick replies everyone.  I'll discuss this problem further 
with our group with the suggestions I've received so far.

I'll keep you posted if anything happens.

I did the same project two years ago with the C5402, and my code was
written completely in C. It worked quite well. The solution was to set
the compiler's optimization level to 3.

Tim Wescott <tim@wescottnospamdesign.com> wrote in message news:<10g60psgpnghm37@corp.supernews.com>...
> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
Jay,

    Here are some ideas on how you might get better code performance:

1) Consider the for loop:
    for (j=15; j>0; j--)
    {
        FIR1[j]=FIR1[j-1];
    }
I suspect that this for loop can be replaced with a call to memcpy
(strictly, memmove, because the source and destination ranges overlap).
The version supplied with your compiler should be written in assembly
language and should therefore take full advantage of any special
instructions provided by the hardware. I am thinking of a zero-overhead
repeat loop.

2) You have these two for loops:
for (j =0; j<16; j++)
            {
            Z+= FIR1[j]*weight_array[j];
            }
            outp= RCnorm - Z;


             for (j=0; j<16; j++)
             {
             weight_array[j] += 2*0.01*outp*FIR1[j];
             }

I suspect that you can turn these two for loops into one loop. Not
sure how much this is going to help. You can also consider using
pointer arithmetic rather than indexing. On some machines this can be
a big win. Some compilers will do this for you automatically but many
do not. You can also consider turning each of the above for loops into
16 assignment statements. I am not sure it is worth the extra code
space.

3) If a variable is heavily used in a program, a compiler should
allocate it to a fast register. However, some compilers fail to
allocate the right variables to fast registers. As a result, your
program runs slowly. By using the keyword register, you can suggest to
the compiler that this variable should be allocated to a fast
register.

4) You wrote:
        double norm = pow (2,15);
    If the above statement is inside a loop, then I would compute 2^15
at compile time. By the way, the statement:
            norm = 1.0;
might run faster than the statement:
            norm = 1;
because some compilers will not do the conversion from int to double at
compile time. Most are not that bad.

I hope this helps.

                        Bob Sherry
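A minimal sketch of the library-call shift from item 1, assuming the 16-element `double` delay line of the original post. The helper name `shift_delay_line` is hypothetical. Because elements 0..14 are copied onto elements 1..15, the two regions overlap, which is why the sketch calls `memmove`; `memcpy` has undefined behaviour for overlapping regions in standard C.

```c
#include <assert.h>
#include <string.h>

#define NTAPS 16

/* Hypothetical helper: move every element up by one slot and store the
 * newest sample at index 0. Source and destination overlap, so memmove
 * is used rather than memcpy. */
static void shift_delay_line(double *line, double newest)
{
    memmove(&line[1], &line[0], (NTAPS - 1) * sizeof line[0]);
    line[0] = newest;
}
```

This still moves 15 values per input sample, so it only reduces loop overhead; it does not remove the copying itself.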


Do you realize that the C64 (or any fixed- or floating-point DSP, for
that matter) is a pretty poor platform for double-precision
floating-point calculations? As a rule of thumb, it's 100 to 1000 times
slower than your average PC.

Consider converting to more natural data types, preferably 16-bit
fixed-point values for samples and coefficients and 32-bit fixed-point
values for accumulators.

The transversal LMS structure that you have currently implemented
probably doesn't have sufficient numerical stability for a fixed-point
implementation. You would have to change your filter to a more stable
structure, preferably lattice/ladder.
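To illustrate the suggested data types, here is a small sketch of the filter's inner product only, with 16-bit Q15 samples and weights and a 32-bit accumulator. The name `dot_q15` is hypothetical, and the sketch rests on two stated assumptions: the weights are scaled so that the sum of their absolute values is below 1.0 in Q15 (otherwise the 32-bit accumulator can overflow), and the compiler performs an arithmetic right shift on negative values (C leaves this implementation-defined). It does not address the stability concern raised above.

```c
#include <assert.h>
#include <stdint.h>

#define NTAPS 16

/* Hypothetical helper: Q15 dot product with a 32-bit accumulator.
 * ASSUMPTION: sum of |w[j]| <= 32767, so the Q30 sum fits in int32_t
 * and the shifted result fits in int16_t without saturation.
 * ASSUMPTION: >> on a negative int32_t is an arithmetic shift. */
static int16_t dot_q15(const int16_t *x, const int16_t *w)
{
    int32_t acc = 0;
    int j;

    for (j = 0; j < NTAPS; j++)
        acc += (int32_t)x[j] * w[j];    /* Q15 * Q15 -> Q30 */

    return (int16_t)(acc >> 15);        /* Q30 -> Q15 */
}
```

For example, a single tap with x = 0.5 (16384) and w = 0.5 (16384) gives 0.25, which is 8192 in Q15.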
Robert Sherry wrote:
> Jay,
>
>     Here are some ideas on how you might get better code performance:
>
> 1) Consider the for loop:
>  for (j=15; j>0; j--)
>     {
>         FIR1[j]=FIR1[j-1];
>      }
> I suspect that this for loop can be replaced with a call to memcpy.
> The memcpy supplied with your compiler should be written in assembly
> language and therefore
> it takes full advantage of any special instructions provided by the
> hardware. I am thinking of a zero overhead repeat loop.
No matter how efficiently you might move 16 data elements, readjusting
two pointers (which can be in registers) will be faster. The task being
programmed can best be done with a circular buffer. There are efficient
data structures to implement it in software if the hardware doesn't
support it directly.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
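A software circular buffer for the LMS loop might look like the sketch below. It is an illustration under assumptions, not the original project's code: the `Lms` struct, `lms_step` and the parameter `mu2` (standing in for the `2*0.01` step factor of the original post) are hypothetical names, single-precision floats are used, and the tap count is assumed to be a power of two so that wrap-around is a single AND with a mask. Instead of shifting 15 values per sample, only the head index moves.

```c
#include <assert.h>

#define NTAPS 16              /* must be a power of two for the mask */
#define MASK  (NTAPS - 1u)

typedef struct {
    float    hist[NTAPS];     /* delay line, used as a ring          */
    float    w[NTAPS];        /* adaptive weights                    */
    unsigned head;            /* index of the newest sample in hist  */
} Lms;

/* One LMS step. ref is the noise reference, primary is signal plus
 * noise. Returns the error, which is the noise-cancelled output. */
static float lms_step(Lms *s, float ref, float primary, float mu2)
{
    float z = 0.0f, e;
    unsigned j, k;

    /* Step the head back one slot (with wrap) instead of shifting data. */
    s->head = (s->head - 1u) & MASK;
    s->hist[s->head] = ref;

    /* Filter output: w[0] pairs with the newest sample, w[1] with the
     * one before it, and so on around the ring. */
    for (j = 0, k = s->head; j < NTAPS; j++, k = (k + 1u) & MASK)
        z += s->hist[k] * s->w[j];

    e = primary - z;

    /* Weight update, walking the ring in the same order. */
    for (j = 0, k = s->head; j < NTAPS; j++, k = (k + 1u) & MASK)
        s->w[j] += mu2 * e * s->hist[k];

    return e;
}
```

With a zero-initialised `Lms`, calling `lms_step(&s, LCnorm, RCnorm, 0.02f)` once per sample would correspond to the body of the original loop, with `hist[(head + j) & MASK]` playing the role of `FIR1[j]`.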
marlo_ti@yahoo.com (Marlo Flores) wrote in
news:624176e6.0407280228.11a43cac@posting.google.com: 

> I made the same project two years ago with the C5402 and my code was
> written completely in C. It worked quite well. The solution was
> optimizing the compiler to Level 3.
>
> Tim Wescott <tim@wescottnospamdesign.com> wrote in message
> news:<10g60psgpnghm37@corp.supernews.com>...
>> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen
>> Code Composter recognize this. Check your assembly, if you're not
>> getting a MAC instruction here then _this_ is the code to hand-do in
>> assembly.
Can you please elaborate a little more on optimizing the compiler to level 3?
> * How expensive are floats on that processor? Doubles? Some processors
> are significantly faster with ordinary floats.

Unfortunately, the 6416 is a fixed-point DSP (16 bit). However, I
wouldn't expect that to be the only reason for increasing the CPU usage
that much! The 64x family is by far the most capable in terms of
computational power among TI's DSPs, and it surpasses many DSPs from
other vendors.

> * You calculate norm, then divide by it. Multiplication is usually much
> faster.
>
> * Shifting FIR[j]=FIR[j-1] is way expensive -- make this a circular buffer.

Sure it is! Do your best to make it circular. Unfortunately, C compilers
do not always (almost never!) understand that DSPs have circular
buffers.

> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
>
> Ideally you'd take this whole section of code and do it by hand in
> assembly. I don't know if you have the time, but fercrissakes it's a
> DSP chip! C compilers don't understand these things, and if it was easy
> there wouldn't be any mystique!
JaaC