Forums

DSP optimizations - limits?

Started by Michael Schoeberl April 25, 2005
Hi ...

I'm starting to work on DSPs and wondering about speed optimizations and 
the (reasonable) limits of this. (I got a TI TMSC320C6416 demo-board 
with the Code Composer Studio)



- first there are compiler optimizations - loop unrolling, pipelining 
and all this ... ti has good documents and examples about that

- then the TI documents describe the cache analysis with the simulator 
... I think I also understood the problem - some bad conditions can 
render the cache useless

- but finally I came across another limit ... the image I'm processing 
is located in the external SDRAM and currently the image is read/written 
4 times (and this takes most of the time) ... and no, it's to big for 
the ISRAM/L2 cache


I think I shoud try interleaving of the processing? how much could I gain?

are there any other bottlenecks? how do you optimize in general? are 
there any tricks? (urls? books?) or is it just experience ..?



Up to now I've done a few projects on microcontrollers and a few years 
of part-time FPGA stuff (and studying of course). I'm used to cyle 
accurate datarate estimations - these cpu stalls and cache issues seem 
quite unpredictable to me :-/


bye,
Michael
Michael Schoeberl wrote:
> Hi ... > > I'm starting to work on DSPs and wondering about speed optimizations and > the (reasonable) limits of this. (I got a TI TMSC320C6416 demo-board > with the Code Composer Studio) > > > > - first there are compiler optimizations - loop unrolling, pipelining > and all this ... ti has good documents and examples about that > > - then the TI documents describe the cache analysis with the simulator > ... I think I also understood the problem - some bad conditions can > render the cache useless > > - but finally I came across another limit ... the image I'm processing > is located in the external SDRAM and currently the image is read/written > 4 times (and this takes most of the time) ... and no, it's too big for > the ISRAM/L2 cache >
In image processing applications that I've seen things have ended up in FPGA applications nearly every time. In general the speed issues revolve around memory bandwidth issues more often than processor bandwidth issues. When you're using an FPGA it's much easier to add another bank of RAM and ping-pong between the two, to hand-optimize the RAM accesses, and to exploit internal RAM effectively. In your particular case unless you can do the work one segment of the image at a time you're stuck with the capability of your DSP hardware to get the memory in and out of the RAM.
> > I think I shoud try interleaving of the processing? how much could I gain?
Do you mean between two processors? The theoretical bound for speedup from using N processors is N times. I once had a class in multiprocessing that included a project to investigate this. The two endpoints were (a) N = 2, custom hardware, speedup = 1.99 or so and (b) N = 4, VAX cluster, speedup < 1/48 (30 minutes on a single processor vs. more than 24 hours on all four, with serious communications stalls). My experience has been between the two, but all the multiprocessor work that I've done has been more for the purpose of getting individual fully functional modules rather than for generating raw speed.
> > are there any other bottlenecks? how do you optimize in general?
The general approach is to understand what your system is and where the bottlenecks are, then try to open up the bottlenecks. It sounds like you're already doing this.
> are there any tricks? (urls? books?) or is it just experience ..?
Any tricks will be specific to the hardware you're using, which boils down to the product of experience and cleverness.
> > Up to now I've done a few projects on microcontrollers and a few years > of part-time FPGA stuff (and studying of course). I'm used to cyle > accurate datarate estimations - these cpu stalls and cache issues seem > quite unpredictable to me :-/
Yup.
> >
-- Tim Wescott Wescott Design Services http://www.wescottdesign.com
> In image processing applications that I've seen things have ended up in > FPGA applications nearly every time.
Okay ... this project won't ;-) up to now the department has done image-algorithms on intel platforms (P4 with MMX or IPP lib optimizations) and now we want to find out how fast we can get with a DSP ... the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper bound ...
>> I think I shoud try interleaving of the processing? how much could I >> gain?
> Do you mean between two processors? The theoretical bound for speedup > from using N processors is N times.
okay - in this sense, my algorithm would scale quite well ... but I thought about something else: I'm playing with a pyramid decomposition (load, filter, downsample, store, load, upsample-hor, store, load, upsample-vert, store, load, subtract, store) ... first aproach: filtering over the whole image results in a better usage of the cpu/pipeline ... (but then there is the sdram bottleneck) next idea: do each operation only on one single line of the image ... could avoid the load/store bottleneck - but is it worth the effort? for an FPGA design I would first take a piece of paper and calculate the datarates for my buses and find predictable solution ...
> The general approach is to understand what your system is and where the > bottlenecks are, then try to open up the bottlenecks. It sounds like > you're already doing this.
yea ... learning the hard way - trial & error ... ;-) okay thanks - if everybody else is doing this then I should get used to this :-) bye, Michael
> > > for an FPGA design I would first take a piece of paper and calculate the > datarates for my buses and find predictable solution ... > >
So why try and short-cut just because it's now a mixture of hardware and software? You still need to analyse the problem.
> > > for an FPGA design I would first take a piece of paper and calculate the > datarates for my buses and find predictable solution ... >
You pretty much do the same thing with DSP software. On a general purpose computer, you can only generalize your software performance as the computer is tending to different things while your code runs. On any embedded system (which is most applications of DSP chips), your CPU is only running the code you told it to, resulting in a much more predictable situation.
>>for an FPGA design I would first take a piece of paper and calculate the >>datarates for my buses and find predictable solution ...
> On any embedded system (which is most applications of DSP chips), your > CPU is only running the code you told it to, resulting in a much more > predictable situation.
It still looks quite unpredictable to me ... I've got some algorithm (e.g. the pyramid decomposition of an image) and I can count the number of operations: 3M load/store (or less if implemented different) 1.7M add 0.6M shift 1.5M mult Where should I end up with medium effort of tuning? - mult 5 cycles, others 1 cycle would give 13M cycles for plain linear execution ... - if I assume maxium parallelism (unrealistic 8 inst/cycle) I could calculate 0.8M cycles somewhere in between? or even below? (there is still no load-doubleword, add2 and all this used) bye, Michael
"Michael Schoeberl" <nospam-newsgroups@michael-schoeberl.de> wrote in 
message news:426e7c31_1@news.eticomm.net...
>>>for an FPGA design I would first take a piece of paper and calculate the >>>datarates for my buses and find predictable solution ... > >> On any embedded system (which is most applications of DSP chips), your >> CPU is only running the code you told it to, resulting in a much more >> predictable situation. > > It still looks quite unpredictable to me ... I've got some algorithm > (e.g. the pyramid decomposition of an image) and I can count the number of > operations: > > > 3M load/store (or less if implemented different) > 1.7M add > 0.6M shift > 1.5M mult > > > Where should I end up with medium effort of tuning? > > > - mult 5 cycles, others 1 cycle would give 13M cycles > for plain linear execution ... > > - if I assume maxium parallelism (unrealistic 8 inst/cycle) I could > calculate 0.8M cycles > > > somewhere in between? or even below? (there is still no load-doubleword, > add2 and all this used) > > > bye, > Michael
Generally if this is the level at which you're estimating, then there is the good news and the bad news: - the good news is that you've done a pretty good analytical job of counting the number of operations ... well, except it sounds like you left out some operations like load-dbl - the bade news is that it's difficult to know just where in the code you will have bottlenecks just in general. Often the bottlenecks are more important than than snippet of code where you *think* the time will be used. You have to instrument for these. Some practitioners believe it's better to not optimize code up front. Rather, fix the bottlenecks after the code is written. Fred
Well, this is not a new problem for image/video applications  and the
most obvious and common solution is to split up the processing
linewise/columnwise with background "pre-fetching" the next line of
data to be processed using DMA into the dsp SRAM. You end up with a
ping-pong buffer scheme. What you need to ensure is that the data
pre-fetch of the next line completes by the time you finish processing
one line of data.  If implemented correctly, this should significantly
reduce the overhead incurred due to sdram accesses.

Unfortunately there is no simple well defined way to predict the
overhead incurred.

You have to rely on TI Simulator profiling data for various events (
CPU Stalls, Cachem misses etc )to identify the bottlenecks.

Michael Schoeberl wrote:
> > In image processing applications that I've seen things have ended
up in
> > FPGA applications nearly every time. > > Okay ... this project won't ;-) > > up to now the department has done image-algorithms on intel platforms
> (P4 with MMX or IPP lib optimizations) and now we want to find out
how
> fast we can get with a DSP ... > > the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper
> bound ... > > > >> I think I shoud try interleaving of the processing? how much could
I
> >> gain? > > > Do you mean between two processors? The theoretical bound for
speedup
> > from using N processors is N times. > > okay - in this sense, my algorithm would scale quite well ... but I > thought about something else: > > > I'm playing with a pyramid decomposition (load, filter, downsample, > store, load, upsample-hor, store, load, upsample-vert, store, load, > subtract, store) ... > > first aproach: filtering over the whole image results in a better
usage
> of the cpu/pipeline ... (but then there is the sdram bottleneck) > > next idea: do each operation only on one single line of the image ... > could avoid the load/store bottleneck - but is it worth the effort? > > > for an FPGA design I would first take a piece of paper and calculate
the
> datarates for my buses and find predictable solution ... > > > > The general approach is to understand what your system is and where
the
> > bottlenecks are, then try to open up the bottlenecks. It sounds
like
> > you're already doing this. > > yea ... learning the hard way - trial & error ... ;-) > > okay thanks - if everybody else is doing this then I should get used
to
> this :-) > > > bye, > Michael