Simple Floating-Pt DSP chip question
Started by ●April 27, 2006

Hi Guys,

I was reading a report the other day where the author stated that modern-day floating-point DSP chips can perform division almost as fast (measured in CPU clock cycles) as they can perform multiplication.

Is that true?

(I remember some years ago our pal here Dr. Mike Rosing--whatever happened to Dr. Mike?--told me that division took roughly 30 clock cycles to achieve a relatively accurate quotient.)

Thanks guys,
[-Rick-]
Reply by ●April 27, 2006

Rick Lyons wrote:
> I was reading a report the other day where the
> author stated that modern-day floating-point DSP chips
> can perform division almost as fast (measured in CPU clock
> cycles) as they can perform multiplication.
>
> Is that true?

On the SHARCs, a division takes 14 times longer than a multiplication. There is a single-cycle inverse-seed instruction, but it takes some time to polish it all the way to 32-bit accuracy.

> (I remember some years ago our pal here Dr. Mike
> Rosing--whatever happened to Dr. Mike?--told me
> that division took roughly 30 clock cycles to achieve a
> relatively accurate quotient.)

Dr. Mike is very active in the ADSP e-mail group. It's a pity he doesn't show up here anymore.

Regards,
Andor
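The "polishing" Andor describes is typically a Newton-Raphson iteration, which roughly doubles the number of correct bits per step, so a low-precision seed needs several iterations to reach full 32-bit accuracy. A minimal numerical sketch (plain Python, not SHARC code; the seed value here is illustrative):

```python
def refined_reciprocal(d: float, seed: float, iterations: int = 4) -> float:
    # Newton-Raphson for 1/d: x_{n+1} = x_n * (2 - d * x_n).
    # Each iteration roughly squares the relative error, so a seed
    # good to a few bits converges to full precision in a few steps.
    x = seed
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

# Crude seed (relative error ~10%), refined to full double precision:
approx = refined_reciprocal(3.0, 0.3)
print(abs(approx - 1.0 / 3.0) < 1e-12)  # True
```

On a DSP each iteration costs a couple of multiply/subtract cycles, which is where the "14 times longer than a multiplication" figure comes from.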
Reply by ●April 28, 2006

Hi Rick,

> I was reading a report the other day where the
> author stated that modern-day floating-point DSP chips
> can perform division almost as fast (measured in CPU clock
> cycles) as they can perform multiplication.
>
> Is that true?
>
> (I remember some years ago our pal here Dr. Mike
> Rosing--whatever happened to Dr. Mike?--told me
> that division took roughly 30 clock cycles to achieve a
> relatively accurate quotient.)

My knowledge is really only of TI's devices, where the TMS320C67xx provides both a single-cycle reciprocal and a single-cycle reciprocal square root, documented in SPRA516, but these are only accurate to eight bits of mantissa. In the real world these can be quite useful! I have a satellite transponder project that calculates the noise floor for AGC purposes, and changing the code to use the eight-bit approximations rather than standard 32-bit precision improved speed tenfold with no perceived degradation of result quality.

Cheers,
Howard
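To get a feel for what "eight bits of mantissa" buys, here is a rough model of such an estimate: compute the true reciprocal, then truncate its mantissa to 8 bits. The real instruction's rounding behaviour will differ; this sketch only illustrates the error level Howard is trading for speed:

```python
import math

def reciprocal_8bit(d: float) -> float:
    # Model an 8-bit-accurate reciprocal estimate: take 1/d and keep
    # only 8 bits of mantissa (a stand-in for the hardware estimate,
    # not the chip's actual rounding).
    m, e = math.frexp(1.0 / d)         # m in [0.5, 1); value = m * 2**e
    m = math.floor(m * 256.0) / 256.0  # truncate mantissa to 8 bits
    return math.ldexp(m, e)

x = reciprocal_8bit(7.0)
rel_err = abs(x - 1.0 / 7.0) * 7.0
print(rel_err < 2.0 ** -8)  # True: relative error within about 2^-8
```

For quantities like a noise-floor estimate, a relative error of a part in a few hundred is often swamped by measurement noise anyway, which is why the approximation costs nothing in perceived quality.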
Reply by ●April 30, 2006

On Thu, 27 Apr 2006 22:33:58 GMT, R.Lyons@_BOGUS_ieee.org (Rick Lyons) wrote:
> Hi Guys,
>
> I was reading a report the other day where the
> author stated that modern-day floating-point DSP chips
> can perform division almost as fast (measured in CPU clock
> cycles) as they can perform multiplication.
>
> Is that true?
>
> (I remember some years ago our pal here Dr. Mike
> Rosing--whatever happened to Dr. Mike?--told me
> that division took roughly 30 clock cycles to achieve a
> relatively accurate quotient.)
>
> Thanks guys,
> [-Rick-]

Hi Andor & Howard,

Thanks for your replies. They're super interesting to me.

See Ya',
[-Rick-]
Reply by ●April 30, 2006

Rick Lyons wrote:
> I was reading a report the other day where the
> author stated that modern-day floating-point DSP chips
> can perform division almost as fast (measured in CPU clock
> cycles) as they can perform multiplication.
>
> Is that true?

It seems unlikely, since I think you need to go through many cycles (20 or 30) to get an accurate result. It may refer to the seed operation, which can be one cycle, but taking just that would mean accepting an effectively smaller accuracy, and it could probably be done better by strength reduction (replacing the divide with shifts and adds, etc.).

By the way, since multiplies are as fast as adds now, is the FFT's advantage (reducing multiplications) still what it is always quoted to be, or should we take account of adds too? I never took the time to work this out properly.

Chris
=================
Chris Bore
www.bores.com
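Strength reduction in this sense replaces a runtime divide with cheaper operations when the divisor is known in advance. A minimal sketch (the constants are illustrative):

```python
def div_by_8(x: int) -> int:
    # Division by a power of two becomes a right shift
    # (exact for non-negative integers).
    return x >> 3

INV_C = 1.0 / 5.0  # precomputed once, at "compile time"

def div_by_const(x: float) -> float:
    # For a fixed floating-point divisor, multiply by the
    # precomputed reciprocal: one multiply replaces each divide.
    return x * INV_C

print(div_by_8(40))        # 5
print(div_by_const(40.0))  # 8.0
```

The catch Chris notes still applies: a single multiply by a stored reciprocal carries the reciprocal's rounding error, so this only matches a true divide's accuracy for "nice" divisors.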
Reply by ●April 30, 2006

Rick Lyons wrote:
> I was reading a report the other day where the
> author stated that modern-day floating-point DSP chips
> can perform division almost as fast (measured in CPU clock
> cycles) as they can perform multiplication.
>
> Is that true?

If by "almost" he means an order of magnitude slower, then maybe. On all the DSPs I have used, division is much slower, but maybe the author knows about a specialized DSP.
Reply by ●April 30, 20062006-04-30
Chris Bore wrote:> Rick Lyons wrote: > >>Hi Guys, >> >> I was reading a report the other day where the >>author stated that modern-day floating-point DSP chips >>can perform division almost as fast (measured in CPU clocks >>cycles) as they can perform multiplication. >> >>Is that true? > > > It seems unlikely, since I think you need to go through many cyles (20 > or 30) to get an accurate result. It may refer to the seed operation > which can be one cycle but to take just that would be accepting an > effectively smaller accuracy, and could probably be done better by > strength reduction (replacing the divide by shifts and adds etc). > > By the way, since multiplies are as fast as adds now, is the FFT's > advantage (reducing multiplications) still what it is always quoted to > be or should we take account of adds too? I never took the time to work > this out properly yet. > >Chris, To answer your last question, I would say 'yes' and 'yes.' The FFT still has the advantage that its execution time is k1*N*log(N) but to find the value of k1 we now have to include the add (or subtract) time that is associated with each multiply operation, as well as some load and store operations. The DFT's execution time is still k2*N^2 and because the additions, loads and stores are part of a single-cycle MAC operation they are not included in calculating k2, and so k2 is smaller than k1. We are still comparing execution times of k1*N*log(N) for the FFT with execution times of k2*N^2 for the DFT. Although k2 is smaller than k1, there must always be a value of N past which the FFT has a shorter execution time than the DFT, but this break-even value gets moved up a little. Regards, John
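John's break-even point can be illustrated numerically: find the smallest power-of-two N at which k1*N*log2(N) drops below k2*N^2. The k values below are illustrative only (the second pair exaggerates k1 to show the break-even moving up):

```python
def break_even(k1: float, k2: float, max_log2n: int = 20) -> int:
    # Smallest power-of-two N at which the FFT cost k1*N*log2(N)
    # beats the direct-DFT cost k2*N^2; -1 if not reached.
    for p in range(1, max_log2n + 1):
        n = 1 << p
        if k1 * n * p < k2 * n * n:
            return n
    return -1

print(break_even(1.01, 1.0))  # 2: with k1 ~ k2 the FFT wins immediately
print(break_even(10.0, 1.0))  # 64: a much larger k1 pushes break-even up
```

Since k1/k2 is a small constant in practice, the break-even N stays tiny; the FFT's asymptotic advantage is untouched, as John says.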
Reply by ●April 30, 20062006-04-30
Chris Bore wrote:> Rick Lyons wrote: > >>Hi Guys, >> >> I was reading a report the other day where the >>author stated that modern-day floating-point DSP chips >>can perform division almost as fast (measured in CPU clocks >>cycles) as they can perform multiplication. >> >>Is that true? > > > It seems unlikely, since I think you need to go through many cyles (20 > or 30) to get an accurate result. It may refer to the seed operation > which can be one cycle but to take just that would be accepting an > effectively smaller accuracy, and could probably be done better by > strength reduction (replacing the divide by shifts and adds etc). > > By the way, since multiplies are as fast as adds now, is the FFT's > advantage (reducing multiplications) still what it is always quoted to > be or should we take account of adds too? I never took the time to work > this out properly yet.A multiply may be as fast as an add, but the energy taken to execute it is much greater. Since most of the world's DSP is battery powered, a multiply will always suck. Steve
Reply by ●May 1, 2006

> By the way, since multiplies are as fast as adds now, is the FFT's
> advantage (reducing multiplications) still what it is always quoted to
> be or should we take account of adds too?

An FFT reduces both the multiplies and the adds compared to the naive DFT formula (which has O(N^2) of both). In fact, arguably the O(N log N) additions have been the limiting factor for decades now, since it is actually possible to perform an FFT with only O(N) multiplications (thanks to results by Winograd in the 70's).

(On general-purpose CPUs, simple operation counts ceased to have much predictive value years ago. Two FFT implementations can easily have nearly the same arithmetic operation count and differ in performance by an order of magnitude.)

Cordially,
Steven G. Johnson
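Steven's point that the FFT cuts both operation counts can be made concrete with the textbook radix-2 tallies: a direct DFT needs on the order of N^2 complex multiplies and adds, while a radix-2 FFT needs about (N/2)*log2(N) complex multiplies and N*log2(N) complex adds. A quick counting sketch (standard estimates, ignoring trivial twiddle factors):

```python
import math

def dft_ops(n: int) -> tuple:
    # Direct DFT: roughly N^2 complex multiplies and N^2 complex adds.
    return n * n, n * n

def radix2_fft_ops(n: int) -> tuple:
    # Radix-2 FFT: (N/2)*log2(N) butterflies, each costing one complex
    # multiply and two complex adds.
    p = int(math.log2(n))
    butterflies = (n // 2) * p
    return butterflies, 2 * butterflies

print(dft_ops(1024))         # (1048576, 1048576)
print(radix2_fft_ops(1024))  # (5120, 10240)
```

Both counts drop by two orders of magnitude at N = 1024, so the FFT wins even if adds dominated the cost entirely.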
Reply by ●May 2, 2006

John Monro wrote:
...
> The FFT still has the advantage that its execution time is k1*N*log(N)
> but to find the value of k1 we now have to include the add (or subtract)
> time that is associated with each multiply operation, as well as some
> load and store operations.
>
> The DFT's execution time is still k2*N^2 and because the additions,
> loads and stores are part of a single-cycle MAC operation they are not
> included in calculating k2, and so k2 is smaller than k1.

Just for interest's sake, on a first-generation SHARC, k2 = 1 and k1 = 1.01... (for real 2^n-point FFTs, using a DIT algorithm). The reason for k1 being so small is the availability of the MAC-equivalent single-cycle instruction for FFTs on the SHARC (multiply / add-and-subtract). On second- and third-generation SHARCs, k2 = 0.5 and k1 = 0.62... (due to SIMD), using the same instruction set as the first-generation SHARCs. On TigerSHARCs, k1 is even closer to 0.5.

These numbers are valid only for FFTs that fit into the internal memory of the DSP.

Regards,
Andor