DSPRelated.com
Forums

Optimal square calculation?

Started by Chris Bore October 1, 2008
I have an application that involves many sums of squares. For
instance:

for (i = 0; i < N; i++)
  sum += x[i]*x[i];

Is any clever optimization possible when doing sums of squares?

(The application is calculating statistics on very large 3D arrays of
16-bit integer data - typically 3 Gsamples. There are also many cases
of sums of 4th power, and squares of differences.)

It is close enough to classic MAC but I feel there may be some benefit
from the fact that only one input argument needs to be loaded, or a
trick that avoids the multiply?

At the moment this is implemented on a PC workstation (multi-processor
with multi-threading), which is the required platform, although at
some time it may be possible to review this.

Thanks,

Chris
===================
Chris Bore
chris@bores.com
www.bores.com
Chris Bore wrote:
> I have an application that involves many sums of squares. For
> instance:
>
> for (i = 0; i < N; i++)
>   sum += x[i]*x[i];
>
> Is any clever optimization possible when doing sums of squares?
>
> (snip)
Depending on your CPU and its pipeline lengths etc., you may well get
some benefit from unrolling the loop and using a small number of
partial sums, e.g.:

for (i = 0; i < N; i += 4)  // NB: assuming N is a multiple of 4 here
{
  sum0 += x[i] * x[i];
  sum1 += x[i + 1] * x[i + 1];
  sum2 += x[i + 2] * x[i + 2];
  sum3 += x[i + 3] * x[i + 3];
}
sum = sum0 + sum1 + sum2 + sum3;

If your CPU has SIMD capability then you should also consider
leveraging that.

Paul
Chris Bore wrote:
(snip)

> for (i = 0; i < N; i++)
>   sum += x[i]*x[i];
> Is any clever optimization possible when doing sums of squares?
I don't know of any. One might hope the compiler would figure out not
to do the subscript calculation twice. In Fortran you can use x(i)**2,
which is slightly more likely to be optimized. (Don't use x(i)**2.
though: with a real exponent it will be computed with log and exp.)
> (The application is calculating statistics on very large 3D arrays of
> 16-bit integer data - typically 3 Gsamples. There are also many cases
> of sums of 4th power, and squares of differences.)
The only one I can think of is the one used for standard deviation:
instead of computing the mean and then the sum of the squares of the
deviations from the mean, one computes the sum and the sum of squares.
Mostly it saves storage, in not needing to keep the data around.

For the fourth power, you want the square of the square (two
multiplies) instead of the more obvious three.

-- glen
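glen's fourth-power remark as code (a trivial sketch; the helper name is mine, not from the thread):

```c
/* x^4 via the square of the square: two multiplies, versus the three
   in x*x*x*x evaluated left to right. */
long long pow4(long long x)
{
    long long x2 = x * x;  /* first multiply */
    return x2 * x2;        /* second multiply */
}
```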
On Oct 1, 2:59 am, Chris Bore <chris.b...@gmail.com> wrote:
> I have an application that involves many sums of squares. For
> instance:
>
> for (i = 0; i < N; i++)
>   sum += x[i]*x[i];
>
> (snip)
If by "PC workstation" you mean an x86 platform, you might look into
the various SSE instruction sets. They allow for 128-bit wide vector
operations, so you can do 8 16-bit element-wise multiplications per
instruction. Some compilers (e.g. gcc) even provide C-callable
wrappers (intrinsics) for the instructions so you don't have to write
any assembly yourself.

Also, as Paul said, unrolling your loops can help to mitigate some of
the overhead of looping if you know a priori that the number of
samples will be a multiple of some number.

If this is really performance-critical, check the assembly that the
compiler generates; if you think you can do better, go for it. The
loop you gave is pretty simple, so if this is truly the critical path
(I'd use a profiler to make sure), you could pretty easily write an
optimized version in asm.

Jason
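A sketch of the intrinsics approach Jason describes (my illustration, not his code; the function name is made up). PMADDWD (`_mm_madd_epi16`) multiplies eight 16-bit lanes pairwise and adds adjacent 32-bit products, so squaring a vector against itself yields four partial sums of squares per instruction:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum of squares of 16-bit samples using SSE2.
   Assumes n is a multiple of 8; production code would add a scalar
   tail loop and widen the 32-bit lane accumulators periodically,
   since they can overflow on long runs of large samples. */
long long sumsq_sse2(const short *x, long n)
{
    __m128i acc = _mm_setzero_si128();
    for (long i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(x + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(v, v));
    }
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);  /* spill the four 32-bit lanes */
    return (long long)lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The same inner operation is what Vladimir's "do SSE in assembler" suggestion below would hand-code directly.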
cincydsp@gmail.com writes:

> On Oct 1, 2:59 am, Chris Bore <chris.b...@gmail.com> wrote:
> > (snip)
>
> If by "PC workstation" you mean an x86 platform, you might look into
> the various SSE instruction sets.
>
> (snip)
I'll see your bid, Jason, and raise you a routine:

http://www.digitalsignallabs.com/firmac.asm

I assembled this using yasm (a wonderful assembler, in my view).

I'm not of the "do your assembly in the compiler" ilk. If you're going
to do assembly, just break down and do it.

By the way, Agner Fog has a fantastic series of guides on writing
optimized x86 assembly. Highly recommended reading if you're doing
this sort of thing.

-- 
%  Randy Yates            % "She's sweet on Wagner-I think she'd die for Beethoven.
%% Fuquay-Varina, NC      %  She love the way Puccini lays down a tune, and
%%% 919-577-9882          %  Verdi's always creepin' from her room."
%%%% <yates@ieee.org>     % "Rockaria", *A New World Record*, ELO
http://www.digitalsignallabs.com
In hardware it is often advantageous to implement a squaring
function by the sequence "leading zero/one count, shift,
square lookup table, shift back".   This is smaller than using
a multiplier, unless you have enough clock cycles to do a
serialized multiplier (serial-parallel, or even bit-serial).

In software, I would be kind of surprised if, on any processor
around today, such techniques make sense.  As suggested, loop 
unrolling, "register" keywords and compiler optimizer flags should be 
all you need.  Dropping into assembler would seem a last resort and
a sign you may need to select the next speed level of CPU/DSP.

Steve

Chris Bore wrote:

> I have an application that involves many sums of squares. For
> instance:
>
> for (i = 0; i < N; i++)
>   sum += x[i]*x[i];
>
> Is any clever optimization possible when doing sums of squares?
i = N;
ptr = x;

while(i--)
{
  xtmp = *ptr++;
  sum += xtmp*xtmp;
}

This code will be somewhat more efficient.
> (The application is calculating statistics on very large 3D arrays of
> 16-bit integer data - typically 3 Gsamples.
Then the speed is limited by memory bandwidth.
> There are also many cases of sums of 4th power, and squares of
> differences.)
Then do SSE in assembler.
> It is close enough to classic MAC but I feel there may be some benefit
> from the fact that only one input argument needs to be loaded, or a
> trick that avoids the multiply?
>
> At the moment this is implemented on PC workstation (multi-processor
> with multi-threading) which is the required platform.
Then split the one long computing loop into two loops running as
different threads.
> time it may be possible to review this.
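Vladimir's thread-splitting suggestion can be sketched with an OpenMP reduction (my illustration, not his code; the function name is made up):

```c
/* Split the accumulation loop across threads. Compiled with -fopenmp
   the iterations are divided among threads and the per-thread partial
   sums are combined at the end; without the flag the pragma is
   ignored and the loop simply runs serially with the same result. */
long long sumsq_parallel(const short *x, long n)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += (long long)x[i] * x[i];
    return sum;
}
```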
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
spope33@speedymail.org (Steve Pope) writes:
> [...] Dropping into assembler would seem a last resort and a sign you
> may need to select the next speed level of CPU/DSP.
If size, cost, or power are not important, I agree. For another 50
bucks you could probably get a CPU that can handle the task using the
compiler.

-- 
%  Randy Yates            % "Remember the good old 1980's, when
%% Fuquay-Varina, NC      %  things were so uncomplicated?"
%%% 919-577-9882          % 'Ticket To The Moon'
%%%% <yates@ieee.org>     % *Time*, Electric Light Orchestra
http://www.digitalsignallabs.com
On 1 Oct, 07:59, Chris Bore <chris.b...@gmail.com> wrote:
> I have an application that involves many sums of squares. For
> instance:
>
> for (i = 0; i < N; i++)
>   sum += x[i]*x[i];
>
> (snip)
Thanks for the helpful replies.

The distance approximation seems to be about twice as fast as
calculating the sqrt of the sum of squares. That is:

magnitude = max(abs(I), abs(Q)) + 0.5 * min(abs(I), abs(Q));

which has an error of up to 12%. (This is based on changing the nature
of the question to 'how can I efficiently estimate the magnitude of a
vector whose cartesian coordinates are I and Q?' That is, by avoiding
the squares and square roots and replacing them with a numerical
approximation to the desired end result.)

Can I extend the scope of the question? :-)

Many of these statistical calculations involve sums of squares etc.
They are standard operations like variance, second moments and so on.
Following the distance approximation, are there optimized routines (in
C..) that estimate things like variance or sum of squares, to some
desired (lack of) accuracy?

I have started looking at statistics libraries, but some seem far from
optimized in two senses: (1) they are actually slower than simply
writing out a simple looped expression for the desired calculation;
(2) they do not estimate, they calculate according to a formula - I
feel that there must be numerical methods to get close to these
standard measures.

To draw a parallel with DSP (with which I am more familiar than with
this statistical work) I would expect to find libraries optimized for
speed, memory, etc. - as I do, for instance, for the FFT and derived
transforms. The same need for very fast statistics must also exist; I
am hoping to find its products. :-)

Thanks,

Chris
===============
Chris Bore
BORES Signal Processing
www.bores.com
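Chris's approximation in C (a sketch; the function name is mine). This is the classic "alpha max plus beta min" magnitude estimate with alpha = 1 and beta = 1/2; it avoids both the squares and the square root, at the cost of up to roughly 12% error:

```c
#include <math.h>

/* magnitude = max(|I|, |Q|) + 0.5 * min(|I|, |Q|), as in the post above. */
double mag_estimate(double i, double q)
{
    double a = fabs(i), b = fabs(q);
    return (a > b) ? a + 0.5 * b : b + 0.5 * a;
}
```

For example, mag_estimate(3, 4) gives 5.5 against the true magnitude of 5, a 10% overestimate.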
On 1 Oct, 15:29, Vladimir Vassilevsky <antispam_bo...@hotmail.com>
wrote:

> Chris Bore wrote:
>
> (snip)
>
> > (The application is calculating statistics on very large 3D arrays of
> > 16-bit integer data - typically 3 Gsamples.
>
> Then the speed is limited by memory bandwidth.
Not entirely, although this is a problem that we have addressed. The data are partitioned and processed as the stream is acquired, or read from disk. My current aim includes being able to do some processing 'on the fly' as the acquisition proceeds. Eventually the processing may be built into the acquisition system on an embedded platform, but that is some time off.
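The on-the-fly, partitioned processing described here pairs naturally with glen's sum-and-sum-of-squares idea: each chunk updates running accumulators as it is acquired, so the raw data never needs to be held. A hedged sketch (struct and names are illustrative, not from the actual code; 64-bit accumulators are assumed to be wide enough for the 16-bit data):

```c
/* Running accumulators for streaming statistics over chunked data. */
typedef struct {
    long long n;     /* samples seen so far */
    long long sum;   /* running sum */
    long long sum2;  /* running sum of squares */
} RunningStats;

void stats_add_chunk(RunningStats *st, const short *chunk, long len)
{
    for (long i = 0; i < len; i++) {
        long long v = chunk[i];
        st->n    += 1;
        st->sum  += v;
        st->sum2 += v * v;
    }
}

double stats_variance(const RunningStats *st)
{
    double mean = (double)st->sum / st->n;
    return (double)st->sum2 / st->n - mean * mean;  /* E[x^2] - E[x]^2 */
}
```

Note that the E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread; for 16-bit integer data accumulated exactly in 64-bit sums, as here, the subtraction in double is the only rounding step.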
> (snip)
>
> Vladimir Vassilevsky
> DSP and Mixed Signal Design Consultant
> http://www.abvolt.com