
cost of LDW instruction

Started by stino_rides February 26, 2008
Hi All!

here's a situation sketch: we're doing a typical AD - Process - DA
application on C6713 and C6416 boards. Everything is 'pluggable': any
processing step can be enabled/disabled and signal routing is
configurable too. This however adds only a small CPU overhead in
comparison with the actual processing (which happens in blocks of
8*100 samples @ 20 kHz, i.e. every 5 ms). There are also some separate
tasks used for control/display over TCP/IP.

The Process step mainly consists of methods with this prototype:

void Proc( tProc* pObj, short* pSamples, const unsigned nSamples );

some tasks performed by these methods: amplification, FIR filters,
peak detection, template matching.
Basically, inside the method there's always the same principle,
namely looping over the samples and doing something with them.
Suppose a simple amplifier (without overflow checking):

void Amp( tAmp* pObj, short* pSamples, const unsigned nSamples )
{
    unsigned i = 0;
    for( ; i < nSamples ; ++i )
        pSamples[ i ] *= pObj->nAmplification;
}
Now, this project has been in development for about 3 years, and
new features are being added constantly. The problem now arising is
that we run out of CPU time when a lot of processing is enabled.
Before switching to faster hardware, we'd like to have a look and
see if it's possible to optimize some bottlenecks. We're however not
familiar with assembly, so before digging into this we did some
simple experiments.
For example, if we change the Amp method to this:

void Amp( tAmp* pObj, short* pSamples, const unsigned nSamples )
{
    const short nAmp = pObj->nAmplification;
    unsigned i = 0;
    for( ; i < nSamples ; ++i )
        pSamples[ i ] *= nAmp;
}

there's one extra LDW instruction to initialize nAmp, but within the
loop there's one LDW instruction less. For 100 samples, this means 99
fewer LDW instructions.
Amp is just a simple example, but now we're wondering: is it worth
rewriting the methods and putting all constants on the stack instead of
getting them via the pObj pointer each iteration? Will there be a
gain in execution time, or are there other things we should address
first?

Thanks in advance!
Stijn

Hello Stijn,

On Tue, Feb 26, 2008 at 3:20 AM, stino_rides wrote:

> Hi All!
>
> here's a situation sketch: we're doing a typical AD - Process - DA
> application on C6713 and C6416 boards. Everything is 'pluggable': any
> processing step can be enabled/disabled and signal routing is
> configurable too. This however adds only a small CPU overhead in
> comparison with the actual processing (which happens in blocks of
> 8*100 samples @ 20 kHz, i.e. every 5 ms). There are also some separate
> tasks used for control/display over TCP/IP.
>
> The Process step mainly consists of methods with this prototype:
>
> void Proc( tProc* pObj, short* pSamples, const unsigned nSamples );
>
> some tasks performed by these methods: amplification, FIR filters,
> peak detection, template matching.
> Basically, inside the method there's always the same principle,
> namely looping over the samples and doing something with them.
> Suppose a simple amplifier (without overflow checking):
>
> void Amp( tAmp* pObj, short* pSamples, const unsigned nSamples )
> {
>     unsigned i = 0;
>     for( ; i < nSamples ; ++i )
>         pSamples[ i ] *= pObj->nAmplification;
> }
>
> Now, this project has been in development for about 3 years, and
> new features are being added constantly. The problem now arising is
> that we run out of CPU time when a lot of processing is enabled.
> Before switching to faster hardware, we'd like to have a look and
> see if it's possible to optimize some bottlenecks. We're however not
> familiar with assembly, so before digging into this we did some
> simple experiments.
> For example, if we change the Amp method to this:
>
> void Amp( tAmp* pObj, short* pSamples, const unsigned nSamples )
> {
>     const short nAmp = pObj->nAmplification;
>     unsigned i = 0;
>     for( ; i < nSamples ; ++i )
>         pSamples[ i ] *= nAmp;
> }
>
> there's one extra LDW instruction to initialize nAmp, but within the
> loop there's one LDW instruction less. For 100 samples, this means 99
> fewer LDW instructions.

This is certainly a good practice for any embedded code. You do not
want to perform any 'extra' memory accesses [or computation] within a
loop.
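
To push that a bit further, here is a minimal sketch (assuming the TI
C6000 compiler, and that your block sizes are always a multiple of 8
samples - adjust the pragma to your real sizes) of the hoisted version
with two hints that help the compiler software-pipeline the loop:
'restrict' promises that pSamples does not alias anything else, and
MUST_ITERATE gives the compiler the trip-count properties it otherwise
has to guess.

void Amp( tAmp* pObj, short* restrict pSamples, const unsigned nSamples )
{
    const short nAmp = pObj->nAmplification;
    unsigned i;

    /* assumption: nSamples is at least 8 and a multiple of 8 */
    #pragma MUST_ITERATE( 8, , 8 )
    for( i = 0; i < nSamples; ++i )
        pSamples[ i ] *= nAmp;
}

With optimization on (-o2/-o3), the software-pipelining feedback in the
generated .asm file will tell you whether the loop actually pipelined.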

> Amp is just a simple example, but now we're wondering: is it worth
> rewriting the methods and putting all constants on the stack instead of
> getting them via the pObj pointer each iteration? Will there be a
> gain in execution time, or are there other things we should address
> first?

I cannot make a blanket statement about your code and architecture,
but 'off the top of my head' I would suggest the following steps:
1. Measure the performance of your code [idle time or whatever
measures you use].
2. Review and refactor your code to minimize work done inside of the
loops. Measure again.
3. Review the memory layout of your code [internal vs. external memory
-if you use external memory]. Make adjustments if needed and measure
again.
4. If you still need more headroom, you will need to locate the more
complex, time-consuming functions. Ideally you would be able to profile
your code to get the data - I realize that this can be very difficult
[or extremely tedious] on some realtime systems. See if you can
improve any of these. If so, measure again.
5. If you capture your data in external memory, you might see if you
can put the buffers in internal memory. If so, measure again.
6. If your code runs in external memory, you may be able to get a
boost by running some of your 'leaf' functions as 'overlays' in
internal memory. This takes some careful work, can be implemented in
different ways, and can provide significant performance increases for
functions that consist primarily of loops.
7. If you need to go the assembly code route, start with the simple
functions. For the example, you could load 4 shorts at once [LDDW],
perform your multiplies... This could reduce the number of memory
cycles. I like to use the original code as a comment to the asm
functions.
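
Before dropping all the way to hand-written assembly, the C64x
intrinsics can get much of that benefit from C. Here is a rough sketch
of the Amp example for the C6416 (the C6713 does not have the packed
16-bit instructions), assuming pSamples is word-aligned and nSamples is
even; it loads two shorts per LDW, does both multiplies with a single
MPY2, and keeps the low 16 bits of each product, matching the 'no
overflow checking' behaviour of the original:

void Amp2( tAmp* pObj, short* pSamples, const unsigned nSamples )
{
    /* duplicate the gain into both halves of a 32-bit word */
    const int nAmpPacked = _pack2( pObj->nAmplification, pObj->nAmplification );
    unsigned* p32 = (unsigned*)pSamples;   /* assumes word alignment */
    unsigned i;

    for( i = 0; i < nSamples / 2; ++i )
    {
        /* two signed 16x16 multiplies in one instruction */
        double prod = _mpy2( p32[ i ], nAmpPacked );
        /* repack the low 16 bits of each 32-bit product */
        p32[ i ] = _pack2( _hi( prod ), _lo( prod ) );
    }
}

An LDDW-based version doubles that again to four shorts per load. Note
also that with 'restrict' and trip-count pragmas the compiler will often
generate this kind of code on its own at -o2/-o3, so check the generated
.asm before rewriting anything by hand.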

mikedunn
>
> Thanks in Advace!
> Stijn
>
>

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
Stijn,

On Wed, Feb 27, 2008 at 2:05 AM, stino_rides wrote:
> Hello Mike,
>
> thanks for your quick answer!

> > This is certainly a good practice for any embedded code. You do not
> > want to perform any 'extra' memory accesses [or computation] within a
> > loop.
>
> Ok, thanks for clearing that out.

> > I cannot make a blanket statement about your code and architecture,
> > but 'off the top of my head' I would suggest the following steps:
> > 1. Measure the performance of your code [idle time or whatever
> > measures you use].
> > 2. Review and refactor your code to minimize work done inside of the
> > loops. Measure again.
> > 3. Review the memory layout of your code [internal vs. external memory
> > -if you use external memory]. Make adjustments if needed and measure
> > again.
>
> Most of the objects, the .const section and the AD samples are already
> in internal memory.
>
> > 4. If you still need more headroom, you will need to locate the more
> > complex, time-consuming functions. Ideally you would be able to profile
> > your code to get the data - I realize that this can be very difficult
> > [or extremely tedious] on some realtime systems. See if you can
> > improve any of these. If so, measure again.
>
> How would you go about profiling the code? Using CCS's functionality?
> Or will something simple like using a timer at a high rate do the
> trick?
1.
If your system will exhibit 'normal program flow' [possibly with
garbage data], you could use CCS's profiling [don't try to 'profile
the world' all at once]. CCS uses sw breakpoints, and profiling will
slow down the system. Since your system appears to use a model that
'gets samples, then executes n functions, then repeats', you
may be able to set a breakpoint after you acquire the samples and then
profile the functions [or some of the functions].
2.
You can use a free-running timer as long as you 'get enough ticks per
function' to be useful. You can save the data in an array and print it
when you have your data. Don't use any printf's during critical
execution - they are very slow. Keep in mind that this will give you
relative numbers - reading the timer and storing the result to memory
twice per function will add some cycles.
3.
The third method is to use IO pins and a scope. This can be a bit
tedious if you have a lot of functions. The ideal setup is a logic
analyzer and access to an unused CEn signal [or an unused address decode,
or lots of unused IO pins]. With this, you can do a single write with
a different pattern for each function entry and exit.
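
A tiny sketch of that idea (the address is purely hypothetical - it must
be something your board decodes but nothing responds to, for example in
an otherwise unused CE space):

#define PROBE_ADDR        ( (volatile unsigned *)0xB0000000 )  /* hypothetical */
#define FUNC_ENTER( id )  ( *PROBE_ADDR = ((unsigned)(id) << 1) | 1u )
#define FUNC_EXIT( id )   ( *PROBE_ADDR = ((unsigned)(id) << 1) )

The logic analyzer triggers on writes to that address, and the captured
patterns decode directly into timestamped function entries and exits.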

If you use #2 or #3, you can implement the extra code so that it is
added or removed with a build option.
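
As a rough sketch of option #2 on a DSP/BIOS based system (PROFILE_BUILD,
MAX_RECORDS and profTicks are only illustrative names), using the
DSP/BIOS high-resolution clock and compiled out of release builds
through a build option:

#include <std.h>
#include <clk.h>

#ifdef PROFILE_BUILD
#define MAX_RECORDS 1024                  /* illustrative buffer size     */
static LgUns profTicks[ MAX_RECORDS ];    /* per-call cost, printed later */
static unsigned profIdx = 0;
#endif

void AmpProfiled( tAmp* pObj, short* pSamples, const unsigned nSamples )
{
#ifdef PROFILE_BUILD
    LgUns t0 = CLK_gethtime();            /* free-running high-res count  */
#endif
    Amp( pObj, pSamples, nSamples );
#ifdef PROFILE_BUILD
    if( profIdx < MAX_RECORDS )
        profTicks[ profIdx++ ] = CLK_gethtime() - t0;   /* relative cost  */
#endif
}

Inspect or print profTicks only after the run, never during critical
execution.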
> > 5. If you capture your data in external memory, you might see if you
> > can put the buffers in internal memory. If so, measure again.
> > 6. If your code runs in external memory, you may be able to get a
> > boost by running some of your 'leaf' functions as 'overlays' in
> > internal memory. This takes some careful work, can be implemented in
> > different ways, and can provide significant performance increases for
> > functions that consist primarily of loops.
>
> Do you mean putting the code in internal memory, just like the data?
> That shouldn't be too hard; the code in the processing loops is not
> very large and there is still a lot of free space in internal memory.

Yes. I assumed that you might not have much internal memory to spare. Now
that you know the size of the code and the amount of free internal
memory, you could create a section in internal memory for some of the
routines. You will probably want to experiment [that is, putting 2 small
routines vs. 1 large one, to see which gives the most improvement]. This
will provide a significant improvement.
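
One way to do that with the TI tools is the CODE_SECTION pragma (the
section and memory names below are only examples and must match your own
linker command file), placed in the C file that defines each routine,
before its definition:

#pragma CODE_SECTION( Amp,  ".proc_fast" )
#pragma CODE_SECTION( Proc, ".proc_fast" )

and then, in the linker .cmd file, direct that output section to
internal RAM, for example:

    .proc_fast > IRAM

That keeps the placement decision in one place and lets you move
individual routines in and out of internal memory while you experiment.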

Actually, what I was referring to was something like this:
1. Reserve 'n' words of internal memory for code. For example, 100 words.
2. Before calling any function smaller than 100 words, copy it into
internal memory, then execute it.
3. Before executing the next function, copy it to the same location and call it.
This obviously requires some overhead, and possibly some other
restrictions to keep that overhead down and to deal with cache issues.
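
In outline, and purely as a hypothetical sketch (the symbol and function
names are made up; in practice they would come from your linker command
file, e.g. via LOAD_START()/LOAD_END()/RUN_START() on a section that is
loaded to external memory but linked to run in the reserved internal
block):

#include <string.h>

void Proc( tProc* pObj, short* pSamples, const unsigned nSamples ); /* prototype from above */

extern unsigned char overlay_load_start;   /* code image in external memory */
extern unsigned char overlay_load_end;
extern unsigned char overlay_run_start;    /* reserved internal block       */

static void LoadOverlay( void )
{
    /* step 2: copy the function's code image into internal memory */
    memcpy( &overlay_run_start, &overlay_load_start,
            (size_t)( &overlay_load_end - &overlay_load_start ) );
    /* after copying code, the program cache covering this range must be
       invalidated - these are the 'cache issues' mentioned above */
}

void RunProcOverlay( tProc* pObj, short* pSamples, unsigned nSamples )
{
    LoadOverlay();
    /* step 3: the function was linked to run at the internal address,
       so a normal call now executes the freshly copied code */
    Proc( pObj, pSamples, nSamples );
}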

Good luck.

Please post inquiries to the group.

mikedunn
>
> > 7. If you need to go the assembly code route, start with the simple
> > functions. For the example, you could load 4 shorts at once [LDDW],
> > perform your multiplies... This could reduce the number of memory
> > cycles. I like to use the original code as a comment to the asm
> > functions.
>
> That seems interesting as well, thanks!
> Stijn
>

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
Stijn,

On Thu, Feb 28, 2008 at 9:19 AM, stino_rides wrote:
> > 1.
> > If your system will exhibit 'normal program flow' [possibly with
> > garbage data], you could use CCS's profiling [don't try to 'profile
> > the world' all at once]. CCS uses sw breakpoints, and profiling will
> > slow down the system. Since your system appears to use a model that
> > 'gets samples, then executes n functions, then repeats', you
> > may be able to set a breakpoint after you acquire the samples and then
> > profile the functions [or some of the functions].
> > 2.
> > You can use a free-running timer as long as you 'get enough ticks per
> > function' to be useful. You can save the data in an array and print it
> > when you have your data. Don't use any printf's during critical
> > execution - they are very slow. Keep in mind that this will give you
> > relative numbers - reading the timer and storing the result to memory
> > twice per function will add some cycles.
>
> For now I've been using a timer with 10 µs precision.
>
> > 3.
> > The third method is to use IO pins and a scope. This can be a bit
> > tedious if you have a lot of functions. The ideal setup is a logic
> > analyzer and access to an unused CEn signal [or an unused address decode,
> > or lots of unused IO pins]. With this, you can do a single write with
> > a different pattern for each function entry and exit.
>
> interesting! never thought of it that way.
>
> > Yes. I assumed that you might not have much internal memory to spare.
> > Now that you know the size of the code and the amount of free internal
> > memory, you could create a section in internal memory for some of the
> > routines. You will probably want to experiment [that is, putting 2 small
> > routines vs. 1 large one, to see which gives the most improvement]. This
> > will provide a significant improvement.
> >
> > Actually, what I was referring to was something like this:
> > 1. Reserve 'n' words of internal memory for code. For example, 100 words.
> > 2. Before calling any function smaller than 100 words, copy it into
> > internal memory, then execute it.
> > 3. Before executing the next function, copy it to the same location and call it.
> > This obviously requires some overhead, and possibly some other
> > restrictions to keep that overhead down and to deal with cache issues.
>
> Well, I didn't have time to look into your suggestion, but after checking
> the free space left in IRAM, it seemed there was still more than
> enough room to place all processing functions, as well as the
> stack of the task executing them, into IRAM; I just put all .text
> sections of the objects of concern into IRAM and modified the
> TSK_create call for the process task to use IRAM as well. One
> weirdness here: TSK_stat says the used stack size is 4095 while
> 4096 is allocated; however, I know the actual size used is only about
> 800 (by looking at the memory, and when the stack is allocated
> automatically it does report 800). CCS bug, I guess?
>
> Anyway, there are results already! Here's the average time spent
> in the main processing loop, over 10 runs, each having the loop
> executed 2^16 times:
> original: 3.5 ms
> IRAM: 3.1 ms
> That's definitely quite an improvement, taking into account that I
> didn't change a line of code to achieve this.
>
> Thanks again for all your suggestions!

Congratulations!!! That gives you 400,000 ns to play with.

mikedunn
>
> Stijn

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php