We are fourth year electrical engineering students involved in a Final Design Project course. (Approaching deadline date) We are using a TMS320C6416 DSK DSP by Texas Instruments to perform some adaptive noise cancellation. Unfortunately we have run into some serious unexpected CPU Usage problems. We think that our C code is relatively simple.

Our code consists of a simple LMS algorithm, which we have implemented using Reference Framework 3 (provided by TI eXpressDSP tutorial).

The code of our noise cancellation algorithm is as follows:

//Variable Declaration (Global)

static double FIR1[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double FIR2[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

static double weight_array [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double weight_array2 [16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static double tempweights [5000][16];

Sample *srcLeft, *dst, *srcRight, *dst2, ttemp, ttemp2;
Int size; /* in samples */
Int chan;
Int i,j;
double RCnorm, LCnorm;
Int icount = 0;

double outp;
double norm = pow (2,15);
double Z = 0.0;
double peakpower = 0.0;

//Assign variables to input buffers
srcRight = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn );
srcLeft = (Sample *)PIP_getReaderAddr( thrAudioproc[chan].pipIn2);
/* get the size in samples (the function below returns it in words) */
size = sizeInSamples( PIP_getReaderSize( thrAudioproc[chan].pipIn ) );

/* get the empty buffer from the out-pipe */
PIP_alloc( thrAudioproc[chan].pipOut );
PIP_alloc( thrAudioproc[chan].pipOut2 );
//Declare output buffers
dst = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut );
dst2 = (Sample *)PIP_getWriterAddr( thrAudioproc[chan].pipOut2 );

// ***********BASIC LMS (NOISE CANCELLATION) ALGORITHM STARTS HERE *************************************

for ( i= 0; i < FRAMELEN; i++)
{
    RCnorm = srcRight[i]/norm;
    LCnorm =srcLeft[i]/norm;

    for (j=15; j>0; j--)
    {
        FIR1[j]=FIR1[j-1];
    }

    FIR1[0] = LCnorm;

    Z=0.0;

    for (j =0; j<16; j++)
    {
        Z+= FIR1[j]*weight_array[j];
    }
    outp= RCnorm - Z;

    for (j=0; j<16; j++)
    {
        weight_array[j] += 2*0.01*outp*FIR1[j];
    }

    ttemp=outp*norm;

    //dst[i] = srcRight[i];//(Short)(norm*RCnorm[i]); /* real stereo/N-ch.
    //dst2[i] = srcRight[i];//(Short)(norm*LCnorm[i]);
    dst[i] = ttemp;
    dst2[i] = ttemp;
}

If anyone has any ideas as to whether the format of our code could be contributing to a high DSP CPU Usage (92.8%!!) it would be appreciated if they could post some suggestions. Some of our ideas thus far include:

- inefficient variable declarations/definitions
- inefficient code structure
- 'for loop' problems????

Any help would be greatly appreciated. Thank you very much.
TMS320C6416 CPU USAGE PROBLEM (Help needed ASAP!!)
Started by ●July 24, 2004
Reply by ●July 24, 2004
Jay wrote:
...
> double norm = pow (2,15);
...
> for ( i= 0; i < FRAMELEN; i++)
> {
>     RCnorm = srcRight[i]/norm;
>     LCnorm =srcLeft[i]/norm;
>
>     for (j=15; j>0; j--)
>     {
>         FIR1[j]=FIR1[j-1];
>     }
>
>     FIR1[0] = LCnorm;
>
>     Z=0.0;
>
>     for (j =0; j<16; j++)
>     {
>         Z+= FIR1[j]*weight_array[j];
>     }
>     outp= RCnorm - Z;
...

* How expensive are floats on that processor? Doubles? Some processors are significantly faster with ordinary floats.

* You calculate norm, then divide by it. Multiplication is usually much faster.

* Shifting FIR[j]=FIR[j-1] is way expensive -- make this a circular buffer.

* Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code Composter recognize this. Check your assembly, if you're not getting a MAC instruction here then _this_ is the code to hand-do in assembly.

Ideally you'd take this whole section of code and do it by hand in assembly. I don't know if you have the time, but fercrissakes it's a DSP chip! C compilers don't understand these things, and if it was easy there wouldn't be any mystique!

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
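Two of the points in this reply (single precision instead of double, and multiplying by a precomputed reciprocal instead of dividing) can be sketched as follows. This is only an illustration under assumptions, not the original poster's code: `INV_NORM` and `to_unit_range` are names made up here, and `int16_t` stands in for the `Sample` type, which the thread does not define.

```c
#include <stdint.h>

/* 1/2^15 as a compile-time constant: no pow() call and no run-time divide. */
#define INV_NORM (1.0f / 32768.0f)

/* Hypothetical helper: scale a 16-bit sample into [-1, 1) in single precision. */
static float to_unit_range(int16_t sample)
{
    return (float)sample * INV_NORM;
}
```

Because 32768 is a power of two, the multiplication gives exactly the same result as the division would; whether it is measurably faster depends on how the target handles floating point.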
Reply by ●July 25, 2004
Jay wrote:
...
> for (j=15; j>0; j--)
> {
>     FIR1[j]=FIR1[j-1];
> }
>
> FIR1[0] = LCnorm;
>
> Z=0.0;
>
> for (j =0; j<16; j++)
> {
>     Z+= FIR1[j]*weight_array[j];
> }
> outp= RCnorm - Z;

Try to use the FIR routine that comes with the manufacturer's library for this processor. DSPs are made to compute FIR filters very efficiently and the library is guaranteed to have the most efficient implementation.

If your own code works as expected but is too slow, then keep your own code to show your supervisor that you have understood the principle of FIR, but use the library code for further work.

Regards,
Andor
Reply by ●July 25, 2004
Thanks for such quick replies, everyone. I'll discuss this problem further with our group, using the suggestions I've received so far. I'll keep you posted if anything happens.
Reply by ●July 28, 2004
I made the same project two years ago with the C5402 and my code was written completely in C. It worked quite well. The solution was optimizing the compiler to Level 3.

Tim Wescott <tim@wescottnospamdesign.com> wrote in message news:<10g60psgpnghm37@corp.supernews.com>...
> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
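"Level 3" presumably means the compiler's highest optimization setting; on TI's command-line compilers that is normally the -o3 option, though this is stated from general knowledge of the tools and not from the thread, so check the build options of your Code Composer Studio version. An optimizer can only do so much with what the source tells it. A minimal sketch, assuming single-precision data and the hypothetical name `dot16`, of a multiply-accumulate loop written so the compiler knows the trip count and knows the arrays do not overlap:

```c
#define NTAPS 16   /* fixed trip count, known at compile time */

/* restrict promises the compiler that x and w never alias each other,
 * which leaves it free to reorder and overlap the loads and multiplies. */
static float dot16(const float *restrict x, const float *restrict w)
{
    float acc = 0.0f;
    int j;
    for (j = 0; j < NTAPS; j++) {
        acc += x[j] * w[j];
    }
    return acc;
}
```

Whether this is enough to get an efficient loop is something only the generated assembly can confirm, as suggested in the quoted text.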
Reply by ●July 29, 2004
Jay,

Here are some ideas on how you might get better code performance:

1) Consider the for loop:

    for (j=15; j>0; j--)
    {
        FIR1[j]=FIR1[j-1];
    }

I suspect that this for loop can be replaced with a call to memcpy. The memcpy supplied with your compiler should be written in assembly language and therefore it takes full advantage of any special instructions provided by the hardware. I am thinking of a zero overhead repeat loop.

2) You have these two for loops:

    for (j =0; j<16; j++)
    {
        Z+= FIR1[j]*weight_array[j];
    }
    outp= RCnorm - Z;

    for (j=0; j<16; j++)
    {
        weight_array[j] += 2*0.01*outp*FIR1[j];
    }

I suspect that you can turn these two for loops into one loop. Not sure how much this is going to help. You can also consider using pointer arithmetic rather than indexing. On some machines this can be a big win. Some compilers will do this for you automatically but many do not. You can also consider turning each of the above for loops into 16 assignment statements. I am not sure it is worth the extra code space.

3) If a variable is heavily used in a program, a compiler should allocate it to a fast register. However, some compilers fail to allocate the right variables to fast registers. As a result, your program runs slowly. By using the keyword register, you can suggest to the compiler that this variable should be allocated to a fast register.

4) You wrote:

    double norm = pow (2,15);

If the above statement is inside a loop, then I would compute 2^15 at compile time. By the way, the statement:

    norm = 1.0;

might run faster than the statement:

    norm = 1;

because some compilers will do the conversion from int to double at run time rather than at compile time. Most are not that bad.

I hope this helps.

Bob Sherry

"Jay" <cdragon@cogeco.ca> wrote in message news:Xns9530CBE8594D2cdragoncogecoca@216.221.81.119...
> We are fourth year electrical engineering students involved in a Final
> Design Project course.
...
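One caution on suggestion 1: the source and destination of the shift overlap, and standard C leaves memcpy undefined for overlapping regions, so the library call that matches the loop is memmove. A minimal sketch, assuming a 16-element double delay line as in the original post; `push_sample` is a hypothetical name:

```c
#include <string.h>

#define NTAPS 16

/* Shift the delay line one place towards the end and store the newest
 * sample at index 0. The two regions overlap, so memmove is used;
 * memcpy is undefined for overlapping source and destination. */
static void push_sample(double line[NTAPS], double newest)
{
    memmove(&line[1], &line[0], (NTAPS - 1) * sizeof line[0]);
    line[0] = newest;
}
```

This still copies 15 elements per sample, so it only changes who performs the copy, not the amount of data moved.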
Reply by ●July 29, 2004
Do you realize that the C64 (or any fixed- or floating-point DSP, for that matter) is a pretty poor platform for double-precision floating-point calculations? As a rule of thumb, it's 100 to 1000 times slower than your average PC.

Consider converting to more natural data types, preferably 16-bit fixed-point values for samples and coefficients and 32-bit fixed-point values for accumulators.

The transversal LMS structure that you have currently implemented probably doesn't have sufficient numerical stability for a fixed-point implementation. You would have to change your filter to a more stable structure, preferably lattice/ladder.
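The data types suggested here can be sketched for the filter's dot product as follows. This is an assumption-laden illustration of Q15 samples and coefficients with a 32-bit accumulator only; it does not implement the lattice/ladder structure, and `sat16` and `dot_q15` are hypothetical names.

```c
#include <stdint.h>

#define NTAPS 16

/* Saturate a 32-bit value to the 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Q15 dot product. Each Q30 product is divided by NTAPS before it is
 * added, so 16 worst-case products (2^30 each) still fit in 32 bits. */
static int16_t dot_q15(const int16_t x[NTAPS], const int16_t w[NTAPS])
{
    int32_t acc = 0;            /* Q26 after the per-product scaling */
    int j;
    for (j = 0; j < NTAPS; j++) {
        acc += ((int32_t)x[j] * (int32_t)w[j]) / NTAPS;
    }
    return sat16(acc / 2048);   /* Q26 -> Q15: divide by 2^11 */
}
```

Scaling each product down costs four bits of precision in exchange for a guarantee against accumulator overflow; a design that knows its signal levels might choose a different trade-off.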
Reply by ●July 29, 2004
Robert Sherry wrote:
> Jay,
>
> Here are some ideas on how you might get better code performance:
>
> 1) Consider the for loop:
> for (j=15; j>0; j--)
> {
>     FIR1[j]=FIR1[j-1];
> }
> I suspect that this for loop can be replaced with a call to memcpy.
> The memcpy supplied with your compiler should be written in assembly
> language and therefore
> it takes full advantage of any special instructions provided by the
> hardware. I am thinking of a zero overhead repeat loop.

No matter how efficiently you might move 16 data elements, readjusting two pointers (which can be in registers) will be faster. The task being programmed can best be done with a circular buffer. There are efficient data structures to implement it in software if the hardware doesn't support it directly.

Jerry
--
Engineering is the art of making what you want from things you can get.
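A software circular buffer of the kind described here can be sketched as below. It is one possible layout, not the only one: it uses an index with a power-of-two mask where the reply speaks of pointers, it assumes single-precision data, and `delay_line`, `dl_push` and `dl_dot` are hypothetical names.

```c
#define NTAPS 16            /* must be a power of two for the mask trick */
#define MASK  (NTAPS - 1)

typedef struct {
    float data[NTAPS];
    unsigned head;          /* index of the newest sample */
} delay_line;

/* Store a new sample by moving the head index; nothing is copied. */
static void dl_push(delay_line *d, float sample)
{
    d->head = (d->head - 1u) & MASK;   /* unsigned wrap-around: 0 -> 15 */
    d->data[d->head] = sample;
}

/* Dot product of the weights with the samples, newest sample first,
 * so w[0] pairs with the same sample that FIR1[0] held in the original. */
static float dl_dot(const delay_line *d, const float *w)
{
    float acc = 0.0f;
    unsigned j;
    for (j = 0; j < NTAPS; j++) {
        acc += w[j] * d->data[(d->head + j) & MASK];
    }
    return acc;
}
```

Each new sample now costs one index update and one store in place of fifteen element moves; the price is the masked index inside the dot product loop.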
Reply by ●July 30, 2004
marlo_ti@yahoo.com (Marlo Flores) wrote in news:624176e6.0407280228.11a43cac@posting.google.com:
> I made the same project two years ago with the C5402 and my code was
> written completely in C. It worked quite well. The solution was
> optimizing the compiler to Level 3.
>
> Tim Wescott <tim@wescottnospamdesign.com> wrote in message
> news:<10g60psgpnghm37@corp.supernews.com>...
>> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen
>> Code Composter recognize this. Check your assembly, if you're not
>> getting a MAC instruction here then _this_ is the code to hand-do in
>> assembly.

Can you please elaborate a little more on optimizing the compiler to level 3?
Reply by ●July 31, 2004
> * How expensive are floats on that processor? Doubles? Some processors
> are significantly faster with ordinary floats.

Unfortunately the 6416 is a fixed point DSP (16 bit). However, I wouldn't expect that to be the only reason for increasing the CPU usage that much! The 64x family is by far the most capable in terms of computational power among TI's DSPs, and it surpasses many DSPs from other vendors.

> * You calculate norm, then divide by it. Multiplication is usually much
> faster.
>
> * Shifting FIR[j]=FIR[j-1] is way expensive -- make this a circular buffer.

Sure it is! Do your best to make it circular. Unfortunately, C compilers do not always (almost never!) understand that DSPs have circular buffers.

> * Z+= FIR1[j]*weight_array[j] implements a MAC, but I've never seen Code
> Composter recognize this. Check your assembly, if you're not getting a
> MAC instruction here then _this_ is the code to hand-do in assembly.
>
> Ideally you'd take this whole section of code and do it by hand in
> assembly. I don't know if you have the time, but fercrissakes it's a
> DSP chip! C compilers don't understand these things, and if it was easy
> there wouldn't be any mystique!

JaaC