On Wed, 8 Feb 2017 14:59:27 +0100, Christian Gollwitzer
<auriocus@gmx.de> wrote:
>Am 07.02.17 um 00:22 schrieb Tim Wescott:
>> On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote:
>>
>>> On 06.02.2017 23:32, Tim Wescott wrote:
>>>> I have a DSP book that was handed to me in the 1980's, and it was old
>>>> then. In the chapter on the DFT, they discuss a bunch of different FFT
>>>> algorithms, including one that has the order of operations arranged so
>>>> that you can store the input, the intermediate results, and the output
>>>> all on one (or maybe two) tapes.
>
>That sounds interesting, however, as guessed by Evgeny, I am not very
>keen to implement my own FFT algorithm.
Most of the crazy variants of the DFT/FFT have already been
implemented in some available library, somewhere, like FFTW or similar
libraries. I suspect searching around on terms related to FIFO FFT
may turn up something. The FIFO-oriented FFT algorithms have been
around for many decades, so I'd be surprised if you had to code your
own.
---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
Reply by Christian Gollwitzer●February 8, 20172017-02-08
Am 07.02.17 um 00:22 schrieb Tim Wescott:
> On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote:
>
>> On 06.02.2017 23:32, Tim Wescott wrote:
>>> I have a DSP book that was handed to me in the 1980's, and it was old
>>> then. In the chapter on the DFT, they discuss a bunch of different FFT
>>> algorithms, including one that has the order of operations arranged so
>>> that you can store the input, the intermediate results, and the output
>>> all on one (or maybe two) tapes.
That sounds interesting, however, as guessed by Evgeny, I am not very
keen to implement my own FFT algorithm.
>> The fun part of GPU computing is the ease of getting started. If you
>> have a not-very-old GPU and a Matlab, you can play with it using
>> literally a couple lines of code. There's a single function to put an
>> array of data into a GPU device, and a single function to retrieve it.
>> Matlab overloads a lot of common functions to perform calculations on
>> the device. You don't have to use Matlab, of course; it's just the
>> easiest way to get started:
I am using OpenCL to program the GPU. I haven't tried the Matlab toolbox
so far, but I guess, for the kind of algorithms I'm after, using
straight OpenCL is not just the fastest, but also relatively easy to
implement.
> I would kind of expect that SOMEONE would have a FFT algorithm that's
> optimized for the GPU.
Yes, of course. nVidia has cuFFT in their CUDA-Toolbox, and for OpenCL
there are two serious imlementations, Apple FFT and clFFT (a project
from AMD). clFFT can do 2D FFTs, and they also do a bag of tricks that a
mere mortal like me would not come up with. For instance, after they
transform the rows, they perform a transpose operation with caching in
the local shared memory, to reduce the stride to 1, and then transpose
back. This is much faster then performing a FFT with a stride of 8192 to
transform the columns. Preliminary testing show approximately the same
speed on the Intel Iris GPU as the CPU in my MacBook, hopefully I can
gain something when I manage to compile this on my Windows machine at
work equipped with a big nVidia GPU.
Thanks for you ideas,
Christian
Reply by Tim Wescott●February 6, 20172017-02-06
On Tue, 07 Feb 2017 01:03:30 +0300, Evgeny Filatov wrote:
> On 06.02.2017 23:32, Tim Wescott wrote:
>> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote:
>>
>>> Am 06.02.17 um 04:35 schrieb Tim Wescott:
>>>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
>>>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
>>>> userguide.html#tth_sEc1.3
>>>>
>>>> I don't think you can get there from here unless you're willing to
>>>> trade off circular symmetry for speed.
>>>>
>>>> You could implement a filter with transfer function 1/((1 + Au^2)(1 +
>>>> Av^2)) by filtering each row individually, then each column -- but I
>>>> don't think that's really what you're looking for.
>>>
>>> Thanks! I can see it now, that the circular symmetry is impossible to
>>> achieve with simple 1D filtering. The Gaussian is a very special case,
>>> because it can be factored into two 1D-Gaussians (and there the IIR
>>> filter beats the FFT method easily). I can't see a way how to factor
>>> the Lorentzian into two 1D functions, probably it's not possible.
>>>
>>> The thing is that the rest of the tomographic reconstruction runs on
>>> the GPU - using OpenCL I could get enormous speed ups and outperform a
>>> machine with 36 CPU cores by a factor of 2. This phase contrast
>>> filter, however, will be tough. FFTs on the GPU are not much faster
>>> than on the CPU according to some references, due to the irregular
>>> memory access pattern. Therefore I was hoping to get an improvement on
>>> that end. I'll have to do the FFT on the GPU to see how long it really
>>> takes.
>>
>> Something just popped out of my way-back machine and into my head:
>>
>> I have a DSP book that was handed to me in the 1980's, and it was old
>> then. In the chapter on the DFT, they discuss a bunch of different FFT
>> algorithms, including one that has the order of operations arranged so
>> that you can store the input, the intermediate results, and the output
>> all on one (or maybe two) tapes.
>>
>> Yes, tapes. I'm not so old to have fond memories of them, but I'm just
>> old enough to have seen the glow in my slightly-older peers.
>>
>> It occurs to me that -- assuming you want to pursue things that hard,
>> and if memory accesses in a line is indeed the issue -- you might be
>> able to use the same concept in a GPU.
>>
>> "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer,
>> Prentice-Hall, 1975, ISBN 0-13-214635-5:
>>
>> Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the
>> same geometry for each stage, thereby permitting sequential data
>> accessing and storage."
>>
>> If you can't find any references shoot me an email and I'll see if I
>> can copy the relevant pages for you.
>>
>> Hope this helps...
>>
>>
> IDK, but the OP would likely just use one of the standard FFT packages.
>
> The fun part of GPU computing is the ease of getting started. If you
> have a not-very-old GPU and a Matlab, you can play with it using
> literally a couple lines of code. There's a single function to put an
> array of data into a GPU device, and a single function to retrieve it.
> Matlab overloads a lot of common functions to perform calculations on
> the device. You don't have to use Matlab, of course; it's just the
> easiest way to get started:
>
> http://blogs.mathworks.com/loren/2012/02/06/using-gpus-in-matlab/
I would kind of expect that SOMEONE would have a FFT algorithm that's
optimized for the GPU.
Assuming that it's even possible, if you just compiled code for a
"regular" FFT then you might not get the best results.
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by Evgeny Filatov●February 6, 20172017-02-06
On 06.02.2017 23:32, Tim Wescott wrote:
> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote:
>
>> Am 06.02.17 um 04:35 schrieb Tim Wescott:
>>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
>>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
>>> userguide.html#tth_sEc1.3
>>>
>>> I don't think you can get there from here unless you're willing to
>>> trade off circular symmetry for speed.
>>>
>>> You could implement a filter with transfer function 1/((1 + Au^2)(1 +
>>> Av^2)) by filtering each row individually, then each column -- but I
>>> don't think that's really what you're looking for.
>>
>> Thanks! I can see it now, that the circular symmetry is impossible to
>> achieve with simple 1D filtering. The Gaussian is a very special case,
>> because it can be factored into two 1D-Gaussians (and there the IIR
>> filter beats the FFT method easily). I can't see a way how to factor the
>> Lorentzian into two 1D functions, probably it's not possible.
>>
>> The thing is that the rest of the tomographic reconstruction runs on the
>> GPU - using OpenCL I could get enormous speed ups and outperform a
>> machine with 36 CPU cores by a factor of 2. This phase contrast filter,
>> however, will be tough. FFTs on the GPU are not much faster than on the
>> CPU according to some references, due to the irregular memory access
>> pattern. Therefore I was hoping to get an improvement on that end. I'll
>> have to do the FFT on the GPU to see how long it really takes.
>
> Something just popped out of my way-back machine and into my head:
>
> I have a DSP book that was handed to me in the 1980's, and it was old
> then. In the chapter on the DFT, they discuss a bunch of different FFT
> algorithms, including one that has the order of operations arranged so
> that you can store the input, the intermediate results, and the output
> all on one (or maybe two) tapes.
>
> Yes, tapes. I'm not so old to have fond memories of them, but I'm just
> old enough to have seen the glow in my slightly-older peers.
>
> It occurs to me that -- assuming you want to pursue things that hard, and
> if memory accesses in a line is indeed the issue -- you might be able to
> use the same concept in a GPU.
>
> "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer,
> Prentice-Hall, 1975, ISBN 0-13-214635-5:
>
> Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the
> same geometry for each stage, thereby permitting sequential data
> accessing and storage."
>
> If you can't find any references shoot me an email and I'll see if I can
> copy the relevant pages for you.
>
> Hope this helps...
>
IDK, but the OP would likely just use one of the standard FFT packages.
The fun part of GPU computing is the ease of getting started. If you
have a not-very-old GPU and a Matlab, you can play with it using
literally a couple lines of code. There's a single function to put an
array of data into a GPU device, and a single function to retrieve it.
Matlab overloads a lot of common functions to perform calculations on
the device. You don't have to use Matlab, of course; it's just the
easiest way to get started:
http://blogs.mathworks.com/loren/2012/02/06/using-gpus-in-matlab/
Gene
Reply by Tim Wescott●February 6, 20172017-02-06
On Mon, 06 Feb 2017 14:32:46 -0600, Tim Wescott wrote:
> On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote:
>
>> Am 06.02.17 um 04:35 schrieb Tim Wescott:
>>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
>>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
>>> userguide.html#tth_sEc1.3
>>>
>>> I don't think you can get there from here unless you're willing to
>>> trade off circular symmetry for speed.
>>>
>>> You could implement a filter with transfer function 1/((1 + Au^2)(1 +
>>> Av^2)) by filtering each row individually, then each column -- but I
>>> don't think that's really what you're looking for.
>>
>> Thanks! I can see it now, that the circular symmetry is impossible to
>> achieve with simple 1D filtering. The Gaussian is a very special case,
>> because it can be factored into two 1D-Gaussians (and there the IIR
>> filter beats the FFT method easily). I can't see a way how to factor
>> the Lorentzian into two 1D functions, probably it's not possible.
>>
>> The thing is that the rest of the tomographic reconstruction runs on
>> the GPU - using OpenCL I could get enormous speed ups and outperform a
>> machine with 36 CPU cores by a factor of 2. This phase contrast filter,
>> however, will be tough. FFTs on the GPU are not much faster than on the
>> CPU according to some references, due to the irregular memory access
>> pattern. Therefore I was hoping to get an improvement on that end. I'll
>> have to do the FFT on the GPU to see how long it really takes.
>
> Something just popped out of my way-back machine and into my head:
>
> I have a DSP book that was handed to me in the 1980's, and it was old
> then. In the chapter on the DFT, they discuss a bunch of different FFT
> algorithms, including one that has the order of operations arranged so
> that you can store the input, the intermediate results, and the output
> all on one (or maybe two) tapes.
>
> Yes, tapes. I'm not so old to have fond memories of them, but I'm just
> old enough to have seen the glow in my slightly-older peers.
>
> It occurs to me that -- assuming you want to pursue things that hard,
> and if memory accesses in a line is indeed the issue -- you might be
> able to use the same concept in a GPU.
>
> "Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer,
> Prentice-Hall, 1975, ISBN 0-13-214635-5:
>
> Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the
> same geometry for each stage, thereby permitting sequential data
> accessing and storage."
>
> If you can't find any references shoot me an email and I'll see if I can
> copy the relevant pages for you.
>
> Hope this helps...
BTW, I think the last time I read that page was when I was an undergrad,
in the mid 1980's. I found it by flipping through the chapter and
glancing at the figures until I saw the one I remembered.
I can't remember where I leave my car keys -- but some bit of mathemagical
trivia that I've never even used? SURE!
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by Tim Wescott●February 6, 20172017-02-06
On Mon, 06 Feb 2017 14:02:09 +0300, Evgeny Filatov wrote:
> On 06.02.2017 10:47, Christian Gollwitzer wrote:
>> The thing is that the rest of the tomographic reconstruction runs on
>> the GPU - using OpenCL I could get enormous speed ups and outperform a
>> machine with 36 CPU cores by a factor of 2. This phase contrast filter,
>> however, will be tough. FFTs on the GPU are not much faster than on the
>> CPU according to some references, due to the irregular memory access
>> pattern. Therefore I was hoping to get an improvement on that end. I'll
>> have to do the FFT on the GPU to see how long it really takes.
>>
>> Best regards,
>>
>> Christian
>
> With the current generation of GPUs the one real bottleneck is getting
> the data in and out of the GPU memory. If the entire processing is done
> on the GPU, you should also get some tangible gains with FFTs:
>
> https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench
>
> Regardless of the hardware, it should be possible to speed up your
> algorithm if you curtail the FFT size by reducing the margins. The exact
> size might depend on the filter parameters, but quadrupling the FFT size
> (as you suggested in the OP) might be unnecessary.
>
> Gene
Maybe do an overlap-and-add? It should be just as possible with 2D FFT
as one, I think.
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by Tim Wescott●February 6, 20172017-02-06
On Mon, 06 Feb 2017 08:47:47 +0100, Christian Gollwitzer wrote:
> Am 06.02.17 um 04:35 schrieb Tim Wescott:
>> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
>>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
>> userguide.html#tth_sEc1.3
>>
>> I don't think you can get there from here unless you're willing to
>> trade off circular symmetry for speed.
>>
>> You could implement a filter with transfer function 1/((1 + Au^2)(1 +
>> Av^2)) by filtering each row individually, then each column -- but I
>> don't think that's really what you're looking for.
>
> Thanks! I can see it now, that the circular symmetry is impossible to
> achieve with simple 1D filtering. The Gaussian is a very special case,
> because it can be factored into two 1D-Gaussians (and there the IIR
> filter beats the FFT method easily). I can't see a way how to factor the
> Lorentzian into two 1D functions, probably it's not possible.
>
> The thing is that the rest of the tomographic reconstruction runs on the
> GPU - using OpenCL I could get enormous speed ups and outperform a
> machine with 36 CPU cores by a factor of 2. This phase contrast filter,
> however, will be tough. FFTs on the GPU are not much faster than on the
> CPU according to some references, due to the irregular memory access
> pattern. Therefore I was hoping to get an improvement on that end. I'll
> have to do the FFT on the GPU to see how long it really takes.
Something just popped out of my way-back machine and into my head:
I have a DSP book that was handed to me in the 1980's, and it was old
then. In the chapter on the DFT, they discuss a bunch of different FFT
algorithms, including one that has the order of operations arranged so
that you can store the input, the intermediate results, and the output
all on one (or maybe two) tapes.
Yes, tapes. I'm not so old to have fond memories of them, but I'm just
old enough to have seen the glow in my slightly-older peers.
It occurs to me that -- assuming you want to pursue things that hard, and
if memory accesses in a line is indeed the issue -- you might be able to
use the same concept in a GPU.
"Digital Signal Processing", Alan V. Oppenheim and Ronald W. Shafer,
Prentice-Hall, 1975, ISBN 0-13-214635-5:
Figure 6.14 on page 301, titled "Rearrangement of fig 6.10 having the
same geometry for each stage, thereby permitting sequential data
accessing and storage."
If you can't find any references shoot me an email and I'll see if I can
copy the relevant pages for you.
Hope this helps...
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by Evgeny Filatov●February 6, 20172017-02-06
On 06.02.2017 10:47, Christian Gollwitzer wrote:
> The thing is that the rest of the tomographic reconstruction runs on the
> GPU - using OpenCL I could get enormous speed ups and outperform a
> machine with 36 CPU cores by a factor of 2. This phase contrast filter,
> however, will be tough. FFTs on the GPU are not much faster than on the
> CPU according to some references, due to the irregular memory access
> pattern. Therefore I was hoping to get an improvement on that end. I'll
> have to do the FFT on the GPU to see how long it really takes.
>
> Best regards,
>
> Christian
With the current generation of GPUs the one real bottleneck is getting
the data in and out of the GPU memory. If the entire processing is done
on the GPU, you should also get some tangible gains with FFTs:
https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench
Regardless of the hardware, it should be possible to speed up your
algorithm if you curtail the FFT size by reducing the margins. The exact
size might depend on the filter parameters, but quadrupling the FFT size
(as you suggested in the OP) might be unnecessary.
Gene
Reply by Christian Gollwitzer●February 6, 20172017-02-06
Am 06.02.17 um 04:35 schrieb Tim Wescott:
> On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
>> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
> userguide.html#tth_sEc1.3
>
> I don't think you can get there from here unless you're willing to trade
> off circular symmetry for speed.
>
> You could implement a filter with transfer function
> 1/((1 + Au^2)(1 + Av^2)) by filtering each row individually, then each
> column -- but I don't think that's really what you're looking for.
Thanks! I can see it now, that the circular symmetry is impossible to
achieve with simple 1D filtering. The Gaussian is a very special case,
because it can be factored into two 1D-Gaussians (and there the IIR
filter beats the FFT method easily). I can't see a way how to factor the
Lorentzian into two 1D functions, probably it's not possible.
The thing is that the rest of the tomographic reconstruction runs on the
GPU - using OpenCL I could get enormous speed ups and outperform a
machine with 36 CPU cores by a factor of 2. This phase contrast filter,
however, will be tough. FFTs on the GPU are not much faster than on the
CPU according to some references, due to the irregular memory access
pattern. Therefore I was hoping to get an improvement on that end. I'll
have to do the FFT on the GPU to see how long it really takes.
Best regards,
Christian
Reply by Tim Wescott●February 5, 20172017-02-05
On Sat, 04 Feb 2017 13:21:21 +0100, Christian Gollwitzer wrote:
> Hi all,
>
> I'm looking for a way to speed up a special image reconstruction
> algorithm. [*]
>
> Actually, the filter is very simple linear filter. In essence, the image
> is Fourier transformed, multiplied by
>
> 1/(1+A*(u^2+v^2))
>
> (u, v are the conjugate coordinates of x,y) and transformed back.
> Realistic images can be as large as 4096x4096, so with padding an 8kx8k
> FFT is performed on several thousand images, which takes hours even on
> powerful machines.
>
> I am relatively ignorant of classic filter design, but looking at this
> transfer functions, I have the feeling that it should be possible to do
> using some kind of recursive filter. For instance, the 1D-Fourier
> transform of a Lorientzian is an exponential exp(-|x|) - therefore, in
> the 1D case, I think this would be a simple exponential smoothing filter
> - isn't it? Is there any way to design an IIR filter that operates
> similarly in 2D, and is rotationally symmetric?
>
> Best regards,
>
> Christian
>
>
>
> [*] The algorithm is used to correct X-ray images taken at synchrotrons
> where the X-rays are deflected from the straight line due to refraction
> and the details can be found here:
>
> https://imagej.nih.gov/ij/plugins/ankaphase/ankaphase-
userguide.html#tth_sEc1.3
I don't think you can get there from here unless you're willing to trade
off circular symmetry for speed.
You could implement a filter with transfer function
1/((1 + Au^2)(1 + Av^2)) by filtering each row individually, then each
column -- but I don't think that's really what you're looking for.
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!