FFTW multi-threaded routines, what's use case for them?

Started by all4dsp November 30, 2011
I'm creating a web application that accepts client requests to crunch math
on a server. The application will receive multiple client requests
simultaneously. Was FFTW's multi-threaded routines intended to:

(A) process all client requests serially through FFTW, but for the current
client process being executed by FFTW, have it use multiple cores (e.g.
threads) to increase speed of execution for that one client process (assume
here the problem is sufficiently large for multiple threads to increase
speed of execution)?

(B) process client requests in parallel up to capability of # of cores
available? That is, one client's request is assigned one core, and if there
are 3 (or more) client requests pending and 3 available cores, the three
separate client requests can be run through FFTW simultaneously (e.g. like
having three separate machines each with their own FFTW)? FFTW treats each
independently, although the RAM on the server is of course shared.

(C) both cases above?

My interest is performing (B).

On 30 Nov, 19:58, "all4dsp" <all4dsp@n_o_s_p_a_m.comcast.net> wrote:
> (original post snipped)
Libraries are usually designed for case A.

Rune
>> (original post snipped)
>
>Libraries are usually designed for case A.
>
>Rune
Thanks Rune. Is there any way to achieve (B) outside of FFTW, in order to scale an application that depends on FFTW? I can think of adding a second server and putting a load balancer in front of the two machines, but that approach seems like overkill. Or is the best one can do to implement (A) and speed up each serial process as much as possible?
On Wed, 30 Nov 2011 15:00:12 -0600, all4dsp wrote:

>>> (original post snipped)
>>
>>Libraries are usually designed for case A.
>>
>>Rune
>
> Thanks Rune. Is there any way to achieve (B) outside of FFTW, in order
> to scale an application dependent on FFTW? I can think of getting a
> second server and using a load balancer in front of the two machines,
> but this approach seems overkill. Or, is the best one can do is
> implement (A) to speed up each serial process as much as possible?
I would go with case (A) because that'll get you up and running the quickest. Then, in the fullness of time, look at your use statistics and decide if it's worthwhile modifying the code for case (B).

-- 
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
On 30 Nov, 22:00, "all4dsp" <all4dsp@n_o_s_p_a_m.comcast.net> wrote:
> (earlier discussion snipped)
>
> Thanks Rune. Is there any way to achieve (B) outside of FFTW, in order to
> scale an application dependent on FFTW? I can think of getting a second
> server and using a load balancer in front of the two machines, but this
> approach seems overkill. Or, is the best one can do is implement (A) to
> speed up each serial process as much as possible?
It seems you are looking to implement a server application. That's a rather specialized programming task, far away from DSP. You might want to ask about this in a programming forum.

Rune
>> (earlier discussion snipped)
>
>It seems you are looking to implement a server application.
>That's a rather specialized programming application far
>away from DSP. You might want to ask about this in a
>programming forum.
>
>Rune
Thanks Rune, Tim. OK, now I know the use case for FFTW's multi-threaded routines is case (A), and anything related to case (B) must be done outside FFTW. This helps a lot. Thanks.
all4dsp <all4dsp@n_o_s_p_a_m.comcast.net> wrote:

(snip)
>>> (A) process all client requests serially through FFTW, but for the
>>> current client process being executed by FFTW, have it use
>>> multiple cores

>>> (B) process client requests in parallel up to capability of # of
>>> cores available? That is, one client's request is assigned
>>> one core, and if there are 3 (or more) client requests
>>> pending and 3 available cores, the three separate client
>>> requests can be run through FFTW simultaneously
(snip)
> Thanks Rune, Tim, OK. Now I know the use case for FFTW multi-threads is
> case (A), and anything related to case (B) must be done outside FFTW.
> This helps a lot. Thanks
It seems to me that if you sequentially do (A), and (A) is well implemented, then you get everything through about as fast as possible.

Often it is easier to implement (B) if you start with serial code, and with sufficient requests it should work well.

(C) is somewhat harder, especially if you need to change while processing a request.

-- glen
On 11/30/2011 1:20 PM, glen herrmannsfeldt wrote:

I dunno.... It seems to me that no one has addressed multiple instantiations of the program. Not that I know exactly how to do it, but it seems a reasonable approach.

After all, "the program" is simply some code waiting to be executed until it's instantiated - at which point there is an active state of it. So it seems to me that there could be multiple executions going on at the same time. Even a very brute-force method might be envisioned (compile FFTW1, FFTW2, etc., where the "program" exists under different names, and instantiate them with a scheduler of some kind for each client). Something like a round-robin scheduler would seem to do the trick there. But somehow I doubt one would actually do it that way, because I would imagine that multiple instantiations are doable without going to this extreme. I just don't know.

I'd surely be concerned (because I'm ignorant of the details) that .dlls might get in the act and slow things down, so I'd either get some assurance or, as a brute-force approach, build instances of the program as above that are not only separately compiled but also use their own ".dll" versions. This is wild conjecture.

The point was made about getting things done as fast as possible with a multi-threaded version. That's a good point, if only as an entry to a discussion. The question is: can you "share time" without "sharing CPU capacity"? That is, can you take advantage of otherwise unusable CPU capacity?

It's true that, in an ideal world, if the CPU cores or CPUs are taken up 100% by a multi-threaded app, then doing one execution after another is likely the fastest one can do. It's then a global throughput issue. But is that necessarily a good assumption? As above, you might compile FFTW with a new name and then execute one in a loop, and then two in individual loops, to see how the performance and CPU loading change. It's obviously going to be computer- and OS-dependent.

Assuming the ability to share time as above, I should think that you'd also need to be aware of the interaction between doing one large instance quickly and doing a small instance "quickly" at the same time.

Scenario: assume that the large instance will take at least 10X the time of the small one when each is run alone.
- Start a "large" instance.
- Thereafter, start a "small" instance.
How much longer will the small instance take to complete compared to when it runs alone?

This seems a good measure to help learn whether serial executions are just fine by themselves and parallel execution isn't even necessary (i.e., if the small task takes about as long, when run in parallel with the large task, as the large task takes anyway, then parallel execution doesn't look like it pays off all that much).

I'm not current on parallel processing compilers such as The Portland 
Group's Fortran offerings and have no idea.  So, I looked it up and 
found GNU Parallel, Nvidia CUDA and so forth.  Perhaps there's something 
there to help.  I'm sure you're up to speed on all those tools.

So, yeah, this isn't much of a DSP math or app oriented question (or 
answer) but things like this ARE often of interest to DSP folks .. who 
live in the real world.  It's a DSP *system* question.  Fair game, eh?

My 2-CPU Pentium MMX NT machine died this year.  It was intended to do 
parallel image processing in competition with image processing boards of 
the 1997 variety from Precision Digital Images and the like with 
TMS320Cxx processors on PCI.  Today that's common and those boards are 
in the dim past.  This suggests that you look to where the toughest 
problems are and how they are being dealt with.  So, I'd look close at 
Nvidia / games / etc.

Fred
On Nov 30, 1:58 pm, "all4dsp" <all4dsp@n_o_s_p_a_m.comcast.net> wrote:
> (original post snipped)
Threads and cores aren't one-to-one programming abstractions. If a thread blocks, the core will execute an unblocked thread. A parallel compiler will schedule I/O (caches) at compile time so that cores are kept busy. That works well for problems like dense linear algebra. If data locations are only known at run time, as in sparse linear algebra, threads run out of data and block. There are typically many more threads than cores.