
How to fft huge data arrays?

Started by Andreas Besting September 26, 2005
Hi!

For my little project I need to FFT large data arrays that do not fit 
into memory. So far I know about overlap-add, and for some problems that 
works fine, but I wondered if it's possible to transform the whole data 
at once.
I tried to code a special memory management for the FFT, but due to bit 
reversed addressing I get so many page faults that it's really not 
efficient...  I also tried to reconstruct the large FFT from several 
windowed transforms, but it got me very strange results (is this even 
possible??? I'm really not an expert in DSP). So, any help would be nice!

Andreas
How large?

Try the author of "mammut", Øyvind Hammer:

http://www.notam02.no/notam02/prod-prg-mammuthelp.html

It does a single huge FFT on a soundfile. The link to the software itself seems 
to be broken or just not available.

One way or another, if your data is on disk rather than in memory you will be 
swapping pages (one assumes that the OS memory mapping support will be 
reasonably efficient!), so there must by definition be a limit on just how 
efficient it can be.
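
As a rough illustration (not Mammut's code - just a sketch assuming numpy, 
with a placeholder file name and raw float32 samples), letting the OS's 
memory mapping do the paging looks like this:

import numpy as np

# Sketch only: lean on the OS pager via a memory-mapped file.
# "signal.raw" and the float32 layout are placeholder assumptions.
x = np.memmap("signal.raw", dtype=np.float32, mode="r")

chunk = 1 << 20                    # touch ~1M samples (4 MB) at a time
energy = 0.0
for start in range(0, len(x), chunk):
    block = x[start:start + chunk]          # only these pages get faulted in
    energy += float(np.dot(block, block))   # e.g. accumulate signal energy

print("energy:", energy)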



Richard Dobson

Hi!

Thanks for the link!

Richard Dobson wrote:
> How large?
That depends - sometimes up to 100MB per signal, but often in my 
application I have opened many signals, and then it gets critical. But 
you're right, I have to swap pages. The main problem is that I can't do 
it during the transform, or after the transform when I'm restoring the 
original bit ordering: I would have to read single values almost randomly 
from the file, and that's too slow. The best way for me would be to use 
my current FFT implementation on small segments and put it all together 
after that using demand paging, but I don't know how this works. I first 
tried to take the real and imaginary samples from each transformed 
segment and append them one after another in a large complex array, but 
when I took the inverse transform it didn't look like the original 
signal :-(
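
For what it's worth, the reason plain concatenation fails can be seen in a 
tiny numpy sketch (N=8, purely illustrative): the small spectra have to be 
combined with twiddle factors, as in the textbook Cooley-Tukey split, not 
just appended.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Concatenating the FFTs of the two halves is NOT the FFT of the whole:
wrong = np.concatenate([np.fft.fft(x[:4]), np.fft.fft(x[4:])])
print(np.allclose(wrong, np.fft.fft(x)))        # False

# Combining small FFTs correctly needs the Cooley-Tukey split with
# twiddle factors (decimation in time: even/odd samples, not halves):
even, odd = np.fft.fft(x[0::2]), np.fft.fft(x[1::2])
tw = np.exp(-2j * np.pi * np.arange(4) / 8) * odd
right = np.concatenate([even + tw, even - tw])
print(np.allclose(right, np.fft.fft(x)))        # True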
Andreas Besting wrote:
> Hi!
>
> Thanks for the link!
>
> Richard Dobson wrote:
>> How large?
>
> That depends - sometimes up to 100MB per signal, but often in my
> application I have opened many signals, and then it gets critical.
What is the application? Why do you have to work with 100 MB segments?
If you are implementing a filter, could you make do with smaller
segments, say 5-10 MB, and use an overlap-add/overlap-save technique?

In my experience, the problems with handling sheer data volumes can
become overwhelming for even the most trivial operations, when the data
volumes become comparable with memory size. I find that it surprisingly
often pays off to use a smallish buffer and loop through the data file a
number of times. The data become easier to handle, and the performance
stays reasonable.

Rune
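
A minimal sketch of that kind of loop, assuming numpy (block size, names
and the power-of-two padding are only illustrative):

import numpy as np

def overlap_add_filter(x, h, block=4096):
    """FFT filtering with a small buffer: overlap-add, one block at a time."""
    nfft = 1
    while nfft < block + len(h) - 1:   # room for the linear convolution
        nfft *= 2
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        out = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        y[start:start + len(seg) + len(h) - 1] += out[:len(seg) + len(h) - 1]
    return y

# usage sketch: y = overlap_add_filter(x, h) for any 1-D signal x and FIR h;
# in practice each block of x would be read from disk rather than held in RAM.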
Andreas Besting wrote:
> For my little project I need to FFT large data arrays that do not fit
> into memory.
I think the way they did very long FFTs in the old days was to use a
multipass algorithm, writing out the array to tape in a different sort
order on each near-linear pass, so as not to thrash memory or any cache.
Might have required more than two 9-track tape drives, and/or more than
one pass per address bit in the reversed address. Did these algorithms
have a name?

IMHO. YMMV.
--
rhn A.T nicholson d.O.t C-o-M
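
They do have a name: this is essentially Bailey's "four-step" (or
"six-step") FFT for external/hierarchical memory, which replaces
bit-reversed access with small FFTs, twiddle multiplies and transposes -
all near-linear passes. An in-memory numpy sketch of the structure (the
disk I/O is omitted here):

import numpy as np

def four_step_fft(x, N1, N2):
    """Four-step FFT of length N = N1*N2 using only length-N1 and
    length-N2 FFTs plus twiddles and transposes (out-of-core friendly)."""
    assert len(x) == N1 * N2
    A = x.reshape(N2, N1).T                # A[n1, n2] = x[n1 + N1*n2]
    B = np.fft.fft(A, axis=1)              # step 1: N1 FFTs of length N2
    n1 = np.arange(N1).reshape(-1, 1)
    k2 = np.arange(N2).reshape(1, -1)
    C = B * np.exp(-2j * np.pi * n1 * k2 / (N1 * N2))   # step 2: twiddles
    D = np.fft.fft(C, axis=0)              # step 3: N2 FFTs of length N1
    return D.reshape(-1)                   # step 4: X[k2 + N2*k1] = D[k1, k2]

x = np.random.default_rng(1).standard_normal(32)
print(np.allclose(four_step_fft(x, 4, 8), np.fft.fft(x)))   # True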
Andreas Besting wrote:
> For my little project I need to FFT large data arrays that do not fit
> into memory.
I don't know for sure, but assume your "little project" involves
applying some sort of filter to a sound file.

The Fast Fourier Transform, FFT, is a fast way to implement the
Discrete Fourier Transform, DFT. The DFT can naively be implemented in
a way that requires O(N^2) FLOPs for transforming an N-point sequence,
while the FFT requires only O(N*log(N)) FLOPs for doing the same job.
So the FFT uses a fraction log(N)/N of the time of the naive algorithm.
For N>1 we see that it is a very good idea to use the FFT to compute
the DFT, when disregarding various forms of program run-time overhead.

Now, the operation of filtering a data sequence can be expressed in
terms of the DFTs of the input signal and filter transfer function, so
a filtering operation can be implemented in terms of FFTs. The question
is whether that is a smart thing to do. Implementing the filter in
terms of DFTs requires computing the FFT of the input signal, one or
two FFTs to find the filter transfer function, multiplying the spectra
of the input signal and the transfer function, followed by an IFFT to
find the output signal. All of this takes place in complex arithmetic,
so it turns out that quite a number of FLOPs are involved.

The net result is that if the number of coefficients in your filter is
comparable to log(N), N being the number of data points, you might want
to consider the straightforward time-domain implementation of your
filter. The run-time penalty will be small compared to the FFT/IFFT
algorithm (for very small filters the time-domain method may even be
faster), and it will be way simpler to implement. Depending on the
compiler you use, you might even be able to exploit certain fancy
MMX instructions in your CPU that can make a very fast implementation
of the algorithm.

Now, you need very precise knowledge of the number of FLOPs of an FFT
algorithm to evaluate where the limit is for switching to time domain,
but a very naive argument says that time domain is the place to work
with 100000000 samples and a filter with 25 or fewer coefficients. So
at first glance, this type of argument only works for small IIR filters
or very small FIR filters. However, the O(N*log(N)) notation does away
with numerical constants, as in O(N*log(N)) -> "the FFT takes
28*N*log(30*N) FLOPs", so depending on the numerical constants that go
into the exact expression, there may be room for somewhat larger
filters in time domain. But you need to consult somebody who knows the
FLOP counts in the FFT algorithms to find out the exact details.

Rune
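
As a back-of-envelope version of that argument (the constants below are
made-up placeholders, exactly the kind of numbers one would have to pin
down), together with what a plain time-domain FIR looks like:

import numpy as np

def fir_direct(x, h):
    """Plain time-domain FIR filter: y[n] = sum over m of h[m]*x[n-m]."""
    y = np.zeros(len(x) + len(h) - 1)
    for m, hm in enumerate(h):
        y[m:m + len(x)] += hm * x      # one multiply-add pass per tap
    return y

# Rough cost comparison for N samples and M taps (constants are guesses):
N, M = 100_000_000, 25
time_domain = 2.0 * N * M                      # ~2 FLOPs per tap per sample
fft_based = 3 * 5.0 * N * np.log2(N) + 6 * N   # ~3 big FFTs + spectral product
print(f"time domain {time_domain:.1e} vs FFT-based {fft_based:.1e} FLOPs")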
Rune Allnor wrote:

> I don't know for sure, but assume your "little project" involves
> applying some sort of filter to a sound file.
Currently I'm working with biomedical time series, and I wrote a small viewer/editor to process the data with some filters and other functions. It's also a project for me to learn a bit more about DSP.
> The net result is that if the number of coefficients in your filter is
> comparable to log(N), N being the number of data points, you might want
> to consider the straightforward time-domain implementation of your
> filter.
>
> The run-time penalty will be small compared to the FFT/IFFT algorithm
> (for very small filters the time-domain method may even be faster), and
> it will be way simpler to implement. Depending on the compiler you use,
> you might even be able to exploit certain fancy MMX instructions in
> your CPU that can make a very fast implementation of the algorithm.
That's interesting, I can't even imagine how filtering in the time
domain works. When I started on the project, I read some tutorials
about the FFT and after some time (months...) I managed to implement
it. I wrote a simple rectangular filter that worked like that: FFT,
zero frequency bins, IFFT. Then I realized that for large signals my
heap space was too small and the program crashed. I read about
overlap-add and coded a better filter. That worked fine, but due to the
overlapping the performance was much worse than before.

(BTW, I still have a question about filters: when I convolve the signal
with the filter, I understand that the resulting signal is longer than
the original one (+ filter size). Is it OK when I cut the beginning and
the end to get a filtered signal of equal length? I did it and the
result looks quite good.)

After that I implemented cross-correlation using the FFT and had the
same problem that my heap space was too small. But now my "filter" was
another large signal and I couldn't use overlap-add (or can I?). So I
thought, if I had an FFT that worked also on very large signals, many
other problems (such as cross-correlation) would be much easier to
realize. I tried to write an FFT using on-demand paging, but it was
terribly slow because the bit-reversed addressing caused too many page
faults. My last attempt was to code a kind of "buffered" FFT that could
somehow merge many small spectra into a large one, but as I said, this
didn't work at all.

I just found a paper that looks very interesting, it's called
"Performing Out-of-Core FFTs on Parallel Disk Systems",
http://historical.ncstrl.org/tr/pdf/icase/TR-96-70.pdf
I haven't read it but it seems to address my problem.

So far, thank you for your help!
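
On the cross-correlation question: the FFT trick itself still applies to
two signals of any length, as long as you zero-pad to at least the
combined length so the circular wrap-around does not fold into the
result. A small numpy sketch (in-memory; making it out-of-core is a
separate problem):

import numpy as np

def xcorr_fft(a, b):
    """Linear cross-correlation r[k] = sum over n of a[n+k]*b[n] via FFTs."""
    n = len(a) + len(b) - 1
    nfft = 1
    while nfft < n:                    # pad to kill circular wrap-around
        nfft *= 2
    spec = np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft))
    c = np.fft.irfft(spec, nfft)
    # reorder so index 0 is lag -(len(b)-1) and index n-1 is lag len(a)-1
    return np.roll(c, len(b) - 1)[:n]

a, b = np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0])
print(np.allclose(xcorr_fft(a, b), np.correlate(a, b, mode="full")))  # True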
Another approach is to use a series of zoom FFTs on sub-bands of data,
building up the big FFT in chunks. This method takes N linear passes
through the data, where N is the number of sub-bands. Recently I chose
this technique to do a 524288-point real FFT on a '54XX DSP chip using
a 4096-point complex FFT routine and a decimation factor of 128
(4096*128 == 524288).

John
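
A sketch of one such pass, transcribed into numpy rather than DSP
assembly (the windowed-sinc decimating filter and the scaling are
illustrative; accuracy, especially near the band edges, depends entirely
on the real filter design):

import numpy as np

def zoom_fft_band(x, band, D, taps=257):
    """One linear pass of a zoom FFT: approximates bins band*K ... band*K+K-1
    of the length-N spectrum, where K = N // D (ideal-filter assumption)."""
    N = len(x)
    K = N // D
    n = np.arange(N)
    # 1) mix the centre of the wanted sub-band down to 0 Hz
    f0 = (band * K + K // 2) / N
    mixed = x * np.exp(-2j * np.pi * f0 * n)
    # 2) low-pass (windowed sinc, cutoff ~1/(2D)) and decimate by D;
    #    the filter quality limits accuracy, especially at the band edges
    m = np.arange(taps) - (taps - 1) / 2
    h = (np.sinc(m / D) / D) * np.hamming(taps)
    low = np.convolve(mixed, h, mode="same")[::D]
    # 3) small K-point FFT; fftshift restores ascending bin order, and the
    #    factor D roughly compensates the decimation (filter-gain dependent)
    return D * np.fft.fftshift(np.fft.fft(low[:K]))

# the big spectrum is then built chunk by chunk, one sub-band per pass:
# X = np.concatenate([zoom_fft_band(x, b, D) for b in range(D)])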

Andreas Besting wrote:

> That's interesting, I can't even imagine how filtering in the time
> domain works.
Then you definitely should read this book (free download!): http://www.dspguide.com/
> (BTW, I still have a question about filters: when I convolve the signal
> with the filter, I understand that the resulting signal is longer than
> the original one (+ filter size). Is it OK when I cut the beginning and
> the end to get a filtered signal of equal length? I did it and the
> result looks quite good.)
I don't think so - but it's all in the book.

bye
Andreas
--
Andreas Hünnebeck | email: ah@despammed.com
----- privat ---- | www  : http://www.huennebeck-online.de
Fax/Anrufbeantworter: 0721/151-284301
GPG-Key: http://www.huennebeck-online.de/public_keys/andreas.asc
PGP-Key: http://www.huennebeck-online.de/public_keys/pgp_andreas.asc
john wrote:
> Another approach is to use a series of zoom FFTs on sub-bands of data,
> building up the big FFT in chunks. This method takes N linear passes
> through the data, where N is the number of sub-bands.
Yes!! That's exactly what I need! So it does work! But how do I create
the large spectrum from the chunks?

Andreas