Reply by February 8, 20172017-02-08
On Wed, 8 Feb 2017 14:59:27 +0100, Christian Gollwitzer
<auriocus@gmx.de> wrote:

>Am 07.02.17 um 00:22 schrieb Tim Wescott: >> On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote: >> >>> On 06.02.2017 23:32, Tim Wescott wrote: >>>> I have a DSP book that was handed to me in the 1980's, and it was old >>>> then. In the chapter on the DFT, they discuss a bunch of different FFT >>>> algorithms, including one that has the order of operations arranged so >>>> that you can store the input, the intermediate results, and the output >>>> all on one (or maybe two) tapes. > >That sounds interesting, however, as guessed by Evgeny, I am not very >keen to implement my own FFT algorithm.
Most of the crazy variants of the DFT/FFT have already been implemented in some available library, somewhere, like FFTW or similar libraries. I suspect searching around on terms related to FIFO FFT may turn up something. The FIFO-oriented FFT algorithms have been around for many decades, so I'd be surprised if you had to code your own. --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
Reply by Christian Gollwitzer February 8, 20172017-02-08
Am 07.02.17 um 00:22 schrieb Tim Wescott:
> On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote: > >> On 06.02.2017 23:32, Tim Wescott wrote: >>> I have a DSP book that was handed to me in the 1980's, and it was old >>> then. In the chapter on the DFT, they discuss a bunch of different FFT >>> algorithms, including one that has the order of operations arranged so >>> that you can store the input, the intermediate results, and the output >>> all on one (or maybe two) tapes.
That sounds interesting, however, as guessed by Evgeny, I am not very keen to implement my own FFT algorithm.
>> The fun part of GPU computing is the ease of getting started. If you >> have a not-very-old GPU and a Matlab, you can play with it using >> literally a couple lines of code. There's a single function to put an >> array of data into a GPU device, and a single function to retrieve it. >> Matlab overloads a lot of common functions to perform calculations on >> the device. You don't have to use Matlab, of course; it's just the >> easiest way to get started:
I am using OpenCL to program the GPU. I haven't tried the Matlab toolbox so far, but I guess, for the kind of algorithms I'm after, using straight OpenCL is not just the fastest, but also relatively easy to implement.
> I would kind of expect that SOMEONE would have a FFT algorithm that's > optimized for the GPU.
Yes, of course. nVidia has cuFFT in their CUDA-Toolbox, and for OpenCL there are two serious imlementations, Apple FFT and clFFT (a project from AMD). clFFT can do 2D FFTs, and they also do a bag of tricks that a mere mortal like me would not come up with. For instance, after they transform the rows, they perform a transpose operation with caching in the local shared memory, to reduce the stride to 1, and then transpose back. This is much faster then performing a FFT with a stride of 8192 to transform the columns. Preliminary testing show approximately the same speed on the Intel Iris GPU as the CPU in my MacBook, hopefully I can gain something when I manage to compile this on my Windows machine at work equipped with a big nVidia GPU. Thanks for you ideas, Christian
Reply by Tim Wescott February 6, 20172017-02-06
On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote:

> On 06.02.2017 23:32, Tim Wescott wrote: >> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote: >> >>> Am 06.02.17 um 04:35 schrieb Tim Wescott: >>>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote: >>>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase- >>>> userguide.html#tth_sEc1.3 >>>> >>>> I don't think you can get there from here unless you're willing to >>>> trade off circular symmetry for speed. >>>> >>>> You could implement a filter with transfer function 1/((1 + Au^2)(1 + >>>> Av^2)) by filtering each row individually, then each column -- but I >>>> don't think that's really what you're looking for. >>> >>> Thanks! I can see it now, that the circular symmetry is impossible to >>> achieve with simple 1D filtering. The Gaussian is a very special case, >>> because it can be factored into two 1D-Gaussians (and there the IIR >>> filter beats the FFT method easily). I can't see a way how to factor >>> the Lorentzian into two 1D functions, probably it's not possible. >>> >>> The thing is that the rest of the tomographic reconstruction runs on >>> the GPU - using OpenCL I could get enormous speed ups and outperform a >>> machine with 36 CPU cores by a factor of 2. This phase contrast >>> filter, however, will be tough. FFTs on the GPU are not much faster >>> than on the CPU according to some references, due to the irregular >>> memory access pattern. Therefore I was hoping to get an improvement on >>> that end. I'll have to do the FFT on the GPU to see how long it really >>> takes. >> >> Something just popped out of my way-back machine and into my head: >> >> I have a DSP book that was handed to me in the 1980's, and it was old >> then. In the chapter on the DFT, they discuss a bunch of different FFT >> algorithms, including one that has the order of operations arranged so >> that you can store the input, the intermediate results, and the output >> all on one (or maybe two) tapes. >> >> Yes, tapes. I'm not so old to have fond memories of them, but I'm just >> old enough to have seen the glow in my slightly-older peers. >> >> It occurs to me that -- assuming you want to pursue things that hard, >> and if memory accesses in a line is indeed the issue -- you might be >> able to use the same concept in a GPU. >> >> "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer, >> Prentice-Hall, 1975, ISBN 0-13-214635-5: >> >> Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the >> same geometry for each stage, thereby permitting sequential data >> accessing and storage." >> >> If you can't find any references shoot me an email and I'll see if I >> can copy the relevant pages for you. >> >> Hope this helps... >> >> > IDK, but the OP would likely just use one of the standard FFT packages. > > The fun part of GPU computing is the ease of getting started. If you > have a not-very-old GPU and a Matlab, you can play with it using > literally a couple lines of code. There's a single function to put an > array of data into a GPU device, and a single function to retrieve it. > Matlab overloads a lot of common functions to perform calculations on > the device. You don't have to use Matlab, of course; it's just the > easiest way to get started: > > http://blogs.mathworks.com/loren/2012/02/06/using-gpus-in-matlab/
I would kind of expect that SOMEONE would have a FFT algorithm that's optimized for the GPU. Assuming that it's even possible, if you just compiled code for a "regular" FFT then you might not get the best results. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
Reply by Evgeny Filatov February 6, 20172017-02-06
On 06.02.2017 23:32, Tim Wescott wrote:
> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote: > >> Am 06.02.17 um 04:35 schrieb Tim Wescott: >>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote: >>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase- >>> userguide.html#tth_sEc1.3 >>> >>> I don't think you can get there from here unless you're willing to >>> trade off circular symmetry for speed. >>> >>> You could implement a filter with transfer function 1/((1 + Au^2)(1 + >>> Av^2)) by filtering each row individually, then each column -- but I >>> don't think that's really what you're looking for. >> >> Thanks! I can see it now, that the circular symmetry is impossible to >> achieve with simple 1D filtering. The Gaussian is a very special case, >> because it can be factored into two 1D-Gaussians (and there the IIR >> filter beats the FFT method easily). I can't see a way how to factor the >> Lorentzian into two 1D functions, probably it's not possible. >> >> The thing is that the rest of the tomographic reconstruction runs on the >> GPU - using OpenCL I could get enormous speed ups and outperform a >> machine with 36 CPU cores by a factor of 2. This phase contrast filter, >> however, will be tough. FFTs on the GPU are not much faster than on the >> CPU according to some references, due to the irregular memory access >> pattern. Therefore I was hoping to get an improvement on that end. I'll >> have to do the FFT on the GPU to see how long it really takes. > > Something just popped out of my way-back machine and into my head: > > I have a DSP book that was handed to me in the 1980's, and it was old > then. In the chapter on the DFT, they discuss a bunch of different FFT > algorithms, including one that has the order of operations arranged so > that you can store the input, the intermediate results, and the output > all on one (or maybe two) tapes. > > Yes, tapes. I'm not so old to have fond memories of them, but I'm just > old enough to have seen the glow in my slightly-older peers. > > It occurs to me that -- assuming you want to pursue things that hard, and > if memory accesses in a line is indeed the issue -- you might be able to > use the same concept in a GPU. > > "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer, > Prentice-Hall, 1975, ISBN 0-13-214635-5: > > Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the > same geometry for each stage, thereby permitting sequential data > accessing and storage." > > If you can't find any references shoot me an email and I'll see if I can > copy the relevant pages for you. > > Hope this helps... >
IDK, but the OP would likely just use one of the standard FFT packages. The fun part of GPU computing is the ease of getting started. If you have a not-very-old GPU and a Matlab, you can play with it using literally a couple lines of code. There's a single function to put an array of data into a GPU device, and a single function to retrieve it. Matlab overloads a lot of common functions to perform calculations on the device. You don't have to use Matlab, of course; it's just the easiest way to get started: http://blogs.mathworks.com/loren/2012/02/06/using-gpus-in-matlab/ Gene
Reply by Tim Wescott February 6, 20172017-02-06
On Mon, 06 Feb 2017 14:32:46 -0600, Tim Wescott wrote:

> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote: > >> Am 06.02.17 um 04:35 schrieb Tim Wescott: >>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote: >>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase- >>> userguide.html#tth_sEc1.3 >>> >>> I don't think you can get there from here unless you're willing to >>> trade off circular symmetry for speed. >>> >>> You could implement a filter with transfer function 1/((1 + Au^2)(1 + >>> Av^2)) by filtering each row individually, then each column -- but I >>> don't think that's really what you're looking for. >> >> Thanks! I can see it now, that the circular symmetry is impossible to >> achieve with simple 1D filtering. The Gaussian is a very special case, >> because it can be factored into two 1D-Gaussians (and there the IIR >> filter beats the FFT method easily). I can't see a way how to factor >> the Lorentzian into two 1D functions, probably it's not possible. >> >> The thing is that the rest of the tomographic reconstruction runs on >> the GPU - using OpenCL I could get enormous speed ups and outperform a >> machine with 36 CPU cores by a factor of 2. This phase contrast filter, >> however, will be tough. FFTs on the GPU are not much faster than on the >> CPU according to some references, due to the irregular memory access >> pattern. Therefore I was hoping to get an improvement on that end. I'll >> have to do the FFT on the GPU to see how long it really takes. > > Something just popped out of my way-back machine and into my head: > > I have a DSP book that was handed to me in the 1980's, and it was old > then. In the chapter on the DFT, they discuss a bunch of different FFT > algorithms, including one that has the order of operations arranged so > that you can store the input, the intermediate results, and the output > all on one (or maybe two) tapes. > > Yes, tapes. I'm not so old to have fond memories of them, but I'm just > old enough to have seen the glow in my slightly-older peers. > > It occurs to me that -- assuming you want to pursue things that hard, > and if memory accesses in a line is indeed the issue -- you might be > able to use the same concept in a GPU. > > "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer, > Prentice-Hall, 1975, ISBN 0-13-214635-5: > > Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the > same geometry for each stage, thereby permitting sequential data > accessing and storage." > > If you can't find any references shoot me an email and I'll see if I can > copy the relevant pages for you. > > Hope this helps...
BTW, I think the last time I read that page was when I was an undergrad, in the mid 1980's. I found it by flipping through the chapter and glancing at the figures until I saw the one I remembered. I can't remember where I leave my car keys -- but some bit of mathemagical trivia that I've never even used? SURE! -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
Reply by Tim Wescott February 6, 20172017-02-06
On Mon, 06 Feb 2017 14:02:09 +0300, Evgeny Filatov wrote:

> On 06.02.2017 10:47, Christian Gollwitzer wrote: >> The thing is that the rest of the tomographic reconstruction runs on >> the GPU - using OpenCL I could get enormous speed ups and outperform a >> machine with 36 CPU cores by a factor of 2. This phase contrast filter, >> however, will be tough. FFTs on the GPU are not much faster than on the >> CPU according to some references, due to the irregular memory access >> pattern. Therefore I was hoping to get an improvement on that end. I'll >> have to do the FFT on the GPU to see how long it really takes. >> >> Best regards, >> >> Christian > > With the current generation of GPUs the one real bottleneck is getting > the data in and out of the GPU memory. If the entire processing is done > on the GPU, you should also get some tangible gains with FFTs: > > https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench > > Regardless of the hardware, it should be possible to speed up your > algorithm if you curtail the FFT size by reducing the margins. The exact > size might depend on the filter parameters, but quadrupling the FFT size > (as you suggested in the OP) might be unnecessary. > > Gene
Maybe do an overlap-and-add? It should be just as possible with 2D FFT as one, I think. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
Reply by Tim Wescott February 6, 20172017-02-06
On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote:

> Am 06.02.17 um 04:35 schrieb Tim Wescott: >> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote: >>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase- >> userguide.html#tth_sEc1.3 >> >> I don't think you can get there from here unless you're willing to >> trade off circular symmetry for speed. >> >> You could implement a filter with transfer function 1/((1 + Au^2)(1 + >> Av^2)) by filtering each row individually, then each column -- but I >> don't think that's really what you're looking for. > > Thanks! I can see it now, that the circular symmetry is impossible to > achieve with simple 1D filtering. The Gaussian is a very special case, > because it can be factored into two 1D-Gaussians (and there the IIR > filter beats the FFT method easily). I can't see a way how to factor the > Lorentzian into two 1D functions, probably it's not possible. > > The thing is that the rest of the tomographic reconstruction runs on the > GPU - using OpenCL I could get enormous speed ups and outperform a > machine with 36 CPU cores by a factor of 2. This phase contrast filter, > however, will be tough. FFTs on the GPU are not much faster than on the > CPU according to some references, due to the irregular memory access > pattern. Therefore I was hoping to get an improvement on that end. I'll > have to do the FFT on the GPU to see how long it really takes.
Something just popped out of my way-back machine and into my head: I have a DSP book that was handed to me in the 1980's, and it was old then. In the chapter on the DFT, they discuss a bunch of different FFT algorithms, including one that has the order of operations arranged so that you can store the input, the intermediate results, and the output all on one (or maybe two) tapes. Yes, tapes. I'm not so old to have fond memories of them, but I'm just old enough to have seen the glow in my slightly-older peers. It occurs to me that -- assuming you want to pursue things that hard, and if memory accesses in a line is indeed the issue -- you might be able to use the same concept in a GPU. "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer, Prentice-Hall, 1975, ISBN 0-13-214635-5: Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the same geometry for each stage, thereby permitting sequential data accessing and storage." If you can't find any references shoot me an email and I'll see if I can copy the relevant pages for you. Hope this helps... -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
Reply by Evgeny Filatov February 6, 20172017-02-06
On 06.02.2017 10:47, Christian Gollwitzer wrote:
> The thing is that the rest of the tomographic reconstruction runs on the > GPU - using OpenCL I could get enormous speed ups and outperform a > machine with 36 CPU cores by a factor of 2. This phase contrast filter, > however, will be tough. FFTs on the GPU are not much faster than on the > CPU according to some references, due to the irregular memory access > pattern. Therefore I was hoping to get an improvement on that end. I'll > have to do the FFT on the GPU to see how long it really takes. > > Best regards, > > Christian
With the current generation of GPUs the one real bottleneck is getting the data in and out of the GPU memory. If the entire processing is done on the GPU, you should also get some tangible gains with FFTs: https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench Regardless of the hardware, it should be possible to speed up your algorithm if you curtail the FFT size by reducing the margins. The exact size might depend on the filter parameters, but quadrupling the FFT size (as you suggested in the OP) might be unnecessary. Gene
Reply by Christian Gollwitzer February 6, 20172017-02-06
Am 06.02.17 um 04:35 schrieb Tim Wescott:
> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote: >> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase- > userguide.html#tth_sEc1.3 > > I don't think you can get there from here unless you're willing to trade > off circular symmetry for speed. > > You could implement a filter with transfer function > 1/((1 + Au^2)(1 + Av^2)) by filtering each row individually, then each > column -- but I don't think that's really what you're looking for.
Thanks! I can see it now, that the circular symmetry is impossible to achieve with simple 1D filtering. The Gaussian is a very special case, because it can be factored into two 1D-Gaussians (and there the IIR filter beats the FFT method easily). I can't see a way how to factor the Lorentzian into two 1D functions, probably it's not possible. The thing is that the rest of the tomographic reconstruction runs on the GPU - using OpenCL I could get enormous speed ups and outperform a machine with 36 CPU cores by a factor of 2. This phase contrast filter, however, will be tough. FFTs on the GPU are not much faster than on the CPU according to some references, due to the irregular memory access pattern. Therefore I was hoping to get an improvement on that end. I'll have to do the FFT on the GPU to see how long it really takes. Best regards, Christian
Reply by Tim Wescott February 5, 20172017-02-05
On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:

> Hi all, > > I'm looking for a way to speed up a special image reconstruction > algorithm. [*] > > Actually, the filter is very simple linear filter. In essence, the image > is Fourier transformed, multiplied by > > 1/(1+A*(u^2+v^2)) > > (u, v are the conjugate coordinates of x,y) and transformed back. > Realistic images can be as large as 4096x4096, so with padding an 8kx8k > FFT is performed on several thousand images, which takes hours even on > powerful machines. > > I am relatively ignorant of classic filter design, but looking at this > transfer functions, I have the feeling that it should be possible to do > using some kind of recursive filter. For instance, the 1D-Fourier > transform of a Lorientzian is an exponential exp(-|x|) - therefore, in > the 1D case, I think this would be a simple exponential smoothing filter > - isn't it? Is there any way to design an IIR filter that operates > similarly in 2D, and is rotationally symmetric? > > Best regards, > > Christian > > > > [*] The algorithm is used to correct X-ray images taken at synchrotrons > where the X-rays are deflected from the straight line due to refraction > and the details can be found here: > > https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
userguide.html#tth_sEc1.3 I don't think you can get there from here unless you're willing to trade off circular symmetry for speed. You could implement a filter with transfer function 1/((1 + Au^2)(1 + Av^2)) by filtering each row individually, then each column -- but I don't think that's really what you're looking for. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!