Hi ...

I'm starting to work on DSPs and wondering about speed optimizations and the (reasonable) limits of this. (I got a TI TMS320C6416 demo board with Code Composer Studio.)

- first there are compiler optimizations - loop unrolling, pipelining and all this ... TI has good documents and examples about that

- then the TI documents describe cache analysis with the simulator ... I think I also understood the problem - some bad conditions can render the cache useless

- but finally I came across another limit ... the image I'm processing is located in the external SDRAM and currently the image is read/written 4 times (and this takes most of the time) ... and no, it's too big for the ISRAM/L2 cache

I think I should try interleaving of the processing? how much could I gain?

are there any other bottlenecks? how do you optimize in general? are there any tricks? (URLs? books?) or is it just experience ..?

Up to now I've done a few projects on microcontrollers and a few years of part-time FPGA stuff (and studying, of course). I'm used to cycle-accurate data-rate estimations - these CPU stalls and cache issues seem quite unpredictable to me :-/

bye, Michael
DSP optimizations - limits?
Started by ●April 25, 2005
Reply by ●April 25, 2005
Michael Schoeberl wrote:

> Hi ...
>
> I'm starting to work on DSPs and wondering about speed optimizations and
> the (reasonable) limits of this. (I got a TI TMS320C6416 demo board
> with Code Composer Studio.)
>
> - first there are compiler optimizations - loop unrolling, pipelining
> and all this ... TI has good documents and examples about that
>
> - then the TI documents describe cache analysis with the simulator
> ... I think I also understood the problem - some bad conditions can
> render the cache useless
>
> - but finally I came across another limit ... the image I'm processing
> is located in the external SDRAM and currently the image is read/written
> 4 times (and this takes most of the time) ... and no, it's too big for
> the ISRAM/L2 cache

In the image processing applications I've seen, things have ended up in FPGAs nearly every time. In general the speed issues revolve around memory bandwidth more often than processor bandwidth. With an FPGA it's much easier to add another bank of RAM and ping-pong between the two, to hand-optimize the RAM accesses, and to exploit internal RAM effectively. In your particular case, unless you can do the work one segment of the image at a time, you're stuck with the capability of your DSP hardware to get the data in and out of RAM.

> I think I should try interleaving of the processing? how much could I gain?

Do you mean between two processors? The theoretical bound for speedup from using N processors is N times. I once had a class in multiprocessing that included a project to investigate this. The two endpoints were (a) N = 2, custom hardware, speedup = 1.99 or so, and (b) N = 4, VAX cluster, speedup < 1/48 (30 minutes on a single processor vs. more than 24 hours on all four, with serious communication stalls).
My experience has been between the two, but all the multiprocessor work that I've done has been more for the purpose of getting individual fully functional modules than for generating raw speed.

> are there any other bottlenecks? how do you optimize in general?

The general approach is to understand what your system is and where the bottlenecks are, then try to open up the bottlenecks. It sounds like you're already doing this.

> are there any tricks? (URLs? books?) or is it just experience ..?

Any tricks will be specific to the hardware you're using, which boils down to the product of experience and cleverness.

> Up to now I've done a few projects on microcontrollers and a few years
> of part-time FPGA stuff (and studying, of course). I'm used to cycle-
> accurate data-rate estimations - these CPU stalls and cache issues seem
> quite unpredictable to me :-/

Yup.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by ●April 25, 2005
> In the image processing applications I've seen, things have ended up in
> FPGAs nearly every time.

Okay ... this project won't ;-)

Up to now the department has done image algorithms on Intel platforms (P4 with MMX or IPP library optimizations) and now we want to find out how fast we can get with a DSP ... the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper bound ...

>> I think I should try interleaving of the processing? how much could I
>> gain?

> Do you mean between two processors? The theoretical bound for speedup
> from using N processors is N times.

Okay - in this sense, my algorithm would scale quite well ... but I thought about something else:

I'm playing with a pyramid decomposition (load, filter, downsample, store, load, upsample-hor, store, load, upsample-vert, store, load, subtract, store) ...

First approach: filtering over the whole image results in better usage of the CPU/pipeline ... (but then there is the SDRAM bottleneck)

Next idea: do each operation on only one single line of the image ... could avoid the load/store bottleneck - but is it worth the effort?

For an FPGA design I would first take a piece of paper, calculate the data rates for my buses and find a predictable solution ...

> The general approach is to understand what your system is and where the
> bottlenecks are, then try to open up the bottlenecks. It sounds like
> you're already doing this.

Yea ... learning the hard way - trial & error ... ;-)

Okay, thanks - if everybody else is doing this then I should get used to it :-)

bye, Michael
Reply by ●April 26, 2005
> for an FPGA design I would first take a piece of paper, calculate the
> data rates for my buses and find a predictable solution ...

So why try to short-cut just because it's now a mixture of hardware and software? You still need to analyse the problem.
Reply by ●April 26, 2005
> for an FPGA design I would first take a piece of paper, calculate the
> data rates for my buses and find a predictable solution ...

You pretty much do the same thing with DSP software. On a general-purpose computer you can only generalize about your software's performance, because the computer is tending to other things while your code runs. On an embedded system (which is most applications of DSP chips), your CPU is only running the code you told it to, resulting in a much more predictable situation.
Reply by ●April 26, 2005
>> for an FPGA design I would first take a piece of paper, calculate the
>> data rates for my buses and find a predictable solution ...

> On any embedded system (which is most applications of DSP chips), your
> CPU is only running the code you told it to, resulting in a much more
> predictable situation.

It still looks quite unpredictable to me ... I've got some algorithm (e.g. the pyramid decomposition of an image) and I can count the number of operations:

3M load/store (or less if implemented differently)
1.7M add
0.6M shift
1.5M mult

Where should I end up with a medium effort of tuning?

- mult 5 cycles, others 1 cycle would give 13M cycles for plain linear execution ...

- if I assume maximum parallelism (unrealistic 8 instructions/cycle) I could calculate 0.8M cycles

Somewhere in between? Or even below? (there is still no load-doubleword, ADD2 and all this used)

bye, Michael
Reply by ●April 26, 2005
"Michael Schoeberl" <nospam-newsgroups@michael-schoeberl.de> wrote in message news:426e7c31_1@news.eticomm.net...>>>for an FPGA design I would first take a piece of paper and calculate the >>>datarates for my buses and find predictable solution ... > >> On any embedded system (which is most applications of DSP chips), your >> CPU is only running the code you told it to, resulting in a much more >> predictable situation. > > It still looks quite unpredictable to me ... I've got some algorithm > (e.g. the pyramid decomposition of an image) and I can count the number of > operations: > > > 3M load/store (or less if implemented different) > 1.7M add > 0.6M shift > 1.5M mult > > > Where should I end up with medium effort of tuning? > > > - mult 5 cycles, others 1 cycle would give 13M cycles > for plain linear execution ... > > - if I assume maxium parallelism (unrealistic 8 inst/cycle) I could > calculate 0.8M cycles > > > somewhere in between? or even below? (there is still no load-doubleword, > add2 and all this used) > > > bye, > MichaelGenerally if this is the level at which you're estimating, then there is the good news and the bad news: - the good news is that you've done a pretty good analytical job of counting the number of operations ... well, except it sounds like you left out some operations like load-dbl - the bade news is that it's difficult to know just where in the code you will have bottlenecks just in general. Often the bottlenecks are more important than than snippet of code where you *think* the time will be used. You have to instrument for these. Some practitioners believe it's better to not optimize code up front. Rather, fix the bottlenecks after the code is written. Fred
Reply by ●April 30, 2005
Well, this is not a new problem for image/video applications, and the most obvious and common solution is to split up the processing line-wise/column-wise, with background pre-fetching of the next line of data into the DSP SRAM using DMA. You end up with a ping-pong buffer scheme. What you need to ensure is that the pre-fetch of the next line completes by the time you finish processing the current line. If implemented correctly, this should significantly reduce the overhead incurred by SDRAM accesses.

Unfortunately there is no simple, well-defined way to predict the overhead. You have to rely on the TI simulator's profiling data for various events (CPU stalls, cache misses etc.) to identify the bottlenecks.

Michael Schoeberl wrote:

> up to now the department has done image algorithms on Intel platforms
> (P4 with MMX or IPP library optimizations) and now we want to find out
> how fast we can get with a DSP ...
>
> the XXXX MIPs from the DSP datasheet are clearly an unrealistic upper
> bound ...
>
> I'm playing with a pyramid decomposition (load, filter, downsample,
> store, load, upsample-hor, store, load, upsample-vert, store, load,
> subtract, store) ...
>
> first approach: filtering over the whole image results in better usage
> of the CPU/pipeline ... (but then there is the SDRAM bottleneck)
>
> next idea: do each operation on only one single line of the image ...
> could avoid the load/store bottleneck - but is it worth the effort?
>
> for an FPGA design I would first take a piece of paper, calculate the
> data rates for my buses and find a predictable solution ...
>
> yea ... learning the hard way - trial & error ... ;-)
>
> bye, Michael