comp.dsp | DSP optimizations

Hi ...

I'm starting to work on DSPs and wondering about speed optimizations and 
the (reasonable) limits of this. (I got a TI TMSC320C6416 demo-board 
with the Code Composer Studio)



- first there are compiler optimizations - loop unrolling, pipelining 
and all this ... ti has good documents and examples about that

- then the TI documents describe the cache analysis with the simulator 
... I think I also understood the problem - some bad conditions can 
render the cache useless

- but finally I came across another limit ... the image I'm processing 
is located in the external SDRAM and currently the image is read/written 
4 times (and this takes most of the time) ... and no, it's to big for 
the ISRAM/L2 cache


I think I shoud try interleaving of the processing? how much could I gain?

are there any other bottlenecks? how do you optimize in general? are 
there any tricks? (urls? books?) or is it just experience ..?



Up to now I've done a few projects on microcontrollers and a few years 
of part-time FPGA stuff (and studying of course). I'm used to cyle 
accurate datarate estimations - these cpu stalls and cache issues seem 
quite unpredictable to me :-/


bye,
Michael

Reply by Tim Wescott ●April 25, 20052005-04-25

Michael Schoeberl wrote:
> Hi ...
> 
> I'm starting to work on DSPs and wondering about speed optimizations and 
> the (reasonable) limits of this. (I got a TI TMSC320C6416 demo-board 
> with the Code Composer Studio)
> 
> 
> 
> - first there are compiler optimizations - loop unrolling, pipelining 
> and all this ... ti has good documents and examples about that
> 
> - then the TI documents describe the cache analysis with the simulator 
> ... I think I also understood the problem - some bad conditions can 
> render the cache useless
> 
> - but finally I came across another limit ... the image I'm processing 
> is located in the external SDRAM and currently the image is read/written 
> 4 times (and this takes most of the time) ... and no, it's too big for 
> the ISRAM/L2 cache
> 
In image processing applications that I've seen things have ended up in 
FPGA applications nearly every time.  In general the speed issues 
revolve around memory bandwidth issues more often than processor 
bandwidth issues.  When you're using an FPGA it's much easier to add 
another bank of RAM and ping-pong between the two, to hand-optimize the 
RAM accesses, and to exploit internal RAM effectively.  In your 
particular case unless you can do the work one segment of the image at a 
time you're stuck with the capability of your DSP hardware to get the 
memory in and out of the RAM.
> 
> I think I shoud try interleaving of the processing? how much could I gain?

Do you mean between two processors?  The theoretical bound for speedup 
from using N processors is N times.  I once had a class in 
multiprocessing that included a project to investigate this.  The two 
endpoints were (a) N = 2, custom hardware, speedup = 1.99 or so and (b) 
N = 4, VAX cluster, speedup < 1/48 (30 minutes on a single processor vs. 
more than 24 hours on all four, with serious communications stalls).  My 
experience has been between the two, but all the multiprocessor work 
that I've done has been more for the purpose of getting individual fully 
functional modules rather than for generating raw speed.
> 
> are there any other bottlenecks? how do you optimize in general? 

The general approach is to understand what your system is and where the 
bottlenecks are, then try to open up the bottlenecks.  It sounds like 
you're already doing this.

> are there any tricks? (urls? books?) or is it just experience ..?

Any tricks will be specific to the hardware you're using, which boils 
down to the product of experience and cleverness.
> 
> Up to now I've done a few projects on microcontrollers and a few years 
> of part-time FPGA stuff (and studying of course). I'm used to cyle 
> accurate datarate estimations - these cpu stalls and cache issues seem 
> quite unpredictable to me :-/

Yup.
> 
> 

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Reply by Michael Schoeberl ●April 25, 20052005-04-25

> In image processing applications that I've seen things have ended up in 
> FPGA applications nearly every time.

Okay ... this project won't ;-)

up to now the department has done image-algorithms on intel platforms 
(P4 with MMX or IPP lib optimizations) and now we want to find out how 
fast we can get with a DSP ...

the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper 
bound ...


>> I think I shoud try interleaving of the processing? how much could I 
>> gain?

> Do you mean between two processors?  The theoretical bound for speedup 
> from using N processors is N times.

okay - in this sense, my algorithm would scale quite well ... but I 
thought about something else:


I'm playing with a pyramid decomposition (load, filter, downsample, 
store, load, upsample-hor, store, load, upsample-vert, store, load, 
subtract, store) ...

first aproach: filtering over the whole image results in a better usage 
of the cpu/pipeline ... (but then there is the sdram bottleneck)

next idea: do each operation only on one single line of the image ...
could avoid the load/store bottleneck - but is it worth the effort?


for an FPGA design I would first take a piece of paper and calculate the 
datarates for my buses and find predictable solution ...


> The general approach is to understand what your system is and where the 
> bottlenecks are, then try to open up the bottlenecks.  It sounds like 
> you're already doing this.

yea ... learning the hard way - trial & error ... ;-)

okay thanks - if everybody else is doing this then I should get used to 
this :-)


bye,
Michael

Reply by John Smith ●April 26, 20052005-04-26

>
>
> for an FPGA design I would first take a piece of paper and calculate the
> datarates for my buses and find predictable solution ...
>
>

So why try and short-cut just because it's now a mixture of hardware and
software? You still need to analyse the problem.

Reply by Charles Krug ●April 26, 20052005-04-26

>
>
> for an FPGA design I would first take a piece of paper and calculate the
> datarates for my buses and find predictable solution ...
>

You pretty much do the same thing with DSP software.

On a general purpose computer, you can only generalize your software
performance as the computer is tending to different things while your
code runs.

On any embedded system (which is most applications of DSP chips), your
CPU is only running the code you told it to, resulting in a much more
predictable situation.

Reply by Michael Schoeberl ●April 26, 20052005-04-26

>>for an FPGA design I would first take a piece of paper and calculate the
>>datarates for my buses and find predictable solution ...

> On any embedded system (which is most applications of DSP chips), your
> CPU is only running the code you told it to, resulting in a much more
> predictable situation.

It still looks quite unpredictable to me ... I've got some algorithm
(e.g. the pyramid decomposition of an image) and I can count the number 
of operations:


3M load/store (or less if implemented different)
1.7M add
0.6M shift
1.5M mult


Where should I end up with medium effort of tuning?


- mult 5 cycles, others 1 cycle would give 13M cycles
for plain linear execution ...

- if I assume maxium parallelism (unrealistic 8 inst/cycle) I could 
calculate 0.8M cycles


somewhere in between? or even below? (there is still no load-doubleword, 
add2 and all this used)


bye,
Michael

Reply by Fred Marshall ●April 26, 20052005-04-26

"Michael Schoeberl" <nospam-newsgroups@michael-schoeberl.de> wrote in 
message news:426e7c31_1@news.eticomm.net...
>>>for an FPGA design I would first take a piece of paper and calculate the
>>>datarates for my buses and find predictable solution ...
>
>> On any embedded system (which is most applications of DSP chips), your
>> CPU is only running the code you told it to, resulting in a much more
>> predictable situation.
>
> It still looks quite unpredictable to me ... I've got some algorithm
> (e.g. the pyramid decomposition of an image) and I can count the number of 
> operations:
>
>
> 3M load/store (or less if implemented different)
> 1.7M add
> 0.6M shift
> 1.5M mult
>
>
> Where should I end up with medium effort of tuning?
>
>
> - mult 5 cycles, others 1 cycle would give 13M cycles
> for plain linear execution ...
>
> - if I assume maxium parallelism (unrealistic 8 inst/cycle) I could 
> calculate 0.8M cycles
>
>
> somewhere in between? or even below? (there is still no load-doubleword, 
> add2 and all this used)
>
>
> bye,
> Michael

Generally if this is the level at which you're estimating, then there is the 
good news and the bad news:
- the good news is that you've done a pretty good analytical job of counting 
the number of operations ... well, except it sounds like you left out some 
operations like load-dbl
- the bade news is that it's difficult to know just where in the code you 
will have bottlenecks just in general.
Often the bottlenecks are more important than than snippet of code where you 
*think* the time will be used.  You have to instrument for these.  Some 
practitioners believe it's better to not optimize code up front.  Rather, 
fix the bottlenecks after the code is written.

Fred

Reply by ●April 30, 20052005-04-30

Well, this is not a new problem for image/video applications  and the
most obvious and common solution is to split up the processing
linewise/columnwise with background "pre-fetching" the next line of
data to be processed using DMA into the dsp SRAM. You end up with a
ping-pong buffer scheme. What you need to ensure is that the data
pre-fetch of the next line completes by the time you finish processing
one line of data.  If implemented correctly, this should significantly
reduce the overhead incurred due to sdram accesses.

Unfortunately there is no simple well defined way to predict the
overhead incurred.

You have to rely on TI Simulator profiling data for various events (
CPU Stalls, Cachem misses etc )to identify the bottlenecks.

Michael Schoeberl wrote:
> > In image processing applications that I've seen things have ended
up in
> > FPGA applications nearly every time.
>
> Okay ... this project won't ;-)
>
> up to now the department has done image-algorithms on intel platforms

> (P4 with MMX or IPP lib optimizations) and now we want to find out
how
> fast we can get with a DSP ...
>
> the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper

> bound ...
>
>
> >> I think I shoud try interleaving of the processing? how much could
I
> >> gain?
>
> > Do you mean between two processors?  The theoretical bound for
speedup
> > from using N processors is N times.
>
> okay - in this sense, my algorithm would scale quite well ... but I
> thought about something else:
>
>
> I'm playing with a pyramid decomposition (load, filter, downsample,
> store, load, upsample-hor, store, load, upsample-vert, store, load,
> subtract, store) ...
>
> first aproach: filtering over the whole image results in a better
usage
> of the cpu/pipeline ... (but then there is the sdram bottleneck)
>
> next idea: do each operation only on one single line of the image ...
> could avoid the load/store bottleneck - but is it worth the effort?
>
>
> for an FPGA design I would first take a piece of paper and calculate
the
> datarates for my buses and find predictable solution ...
>
>
> > The general approach is to understand what your system is and where
the
> > bottlenecks are, then try to open up the bottlenecks.  It sounds
like
> > you're already doing this.
>
> yea ... learning the hard way - trial & error ... ;-)
>
> okay thanks - if everybody else is doing this then I should get used
to 
> this :-)
> 
> 
> bye,
> Michael

DSP optimizations - limits?

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group