## Best Real Time Convolution Algorithm?

Started by 4 years ago●8 replies●latest reply 4 years ago●1970 viewsWhat is the best current algorithm computationally for implementing a zero delay (partitioned) #Convolution of an audio rate signal with a fixed windowed kernel?

Is there an implementation of that algorithm available anywhere for sharing?

Hi,

I bet someone here will help you if you can more clearly describe what you want. Your terminology is puzzling. What, exactly, is a "window kernel"? What's the meaning of the word "partition"? By "zero delay convolution" do you mean implementing a convolutional filter that has zero phase shift between its input and output?

That should have been "windowed kernel" in case there is any kind of windowing involved in speeding it up. Fixed.

Yes, using partitioned methods latency can be reduced to zero by partitioning the DFT appropriately. I would like this feature. Who wouldn't. :-)

I'm especially interested in whether there are any new FFT algorithms specific to the task of a sliding convolution of a signal with a kernel.

But, thinking about it just a bit more I realize that a zero delay cascade implementation of the FFT can be easily provided by any system with a GPU and what system doesn't have a GPU today.

So I guess the question comes down to finding the best GPU code for zero delay GPU based FFT on the architecture of my choice, the Raspberry Pi 3. Anybody got a pointer to that? I don't think it is even patentable. Even I came up with the cascade structure 45 years ago. :-)

I assume that by "zero delay" you really mean "within one sampling interval", since you can't transfer from one register to another without *some* delay.

I'm sitting here visualizing what you'd have to do with all the butterflys in an FFT to get a new answer each sample, and I'm wondering if, even ignoring the extensive housekeeping you'd have to do, whether you would be doing less MAC operations than just implementing a convolution the old fashioned way.

Jez thinkin. No pencil ever touched paper, so take it as worth the effort that went into it...

The technique for saving operations is well known using an FFT IFFT process. And, the use of the largest possible arrays of samples helps. I found a reference:

Efficient Convolution Without Latency
William G. Gardner
Perceptual Computing Group
MIT Media Lab E15-401B
20 Ames St
Cambridge MA 02139-4307
Internet: billg@media.mit.edu
November 11, 1993

I wasn't aware of this and haven't fully comprehended the "zero delay" aspect. It seems counterintuitive so I'd be careful with definitions as Rick suggested. But, at least, the "partitioned" idea is clear. ....

OH! OK, the approach isn't zero delay at all. The objective is "zero" latency. That is, samples come out as soon as (or nearly as soon as) samples come in. But the FIR filter delay remains what you'd expect.

Yes, I see that now too. With a cascade implementation of convolution no transform is needed. That looks like my best bet. Thanks for clarifying my thinking.

Well, as before, using an FFT IFFT implementation of circular convolution (the proper term for it) *can be way *** faste**r if the arrays are large enough. So maybe you would still want to use the transforms. It sounds like you could have some fairly large arrays / sequences.

Also, in doing this, the filters only need to be FFT'd one time. Then they are kept in the frequency domain once and for all. The only FFTs thereafter are for the data and, of course, require an IFFT after the multiply.

There is nothing like zero latency and zero delay in reality...

Check your kernel and window size. In terms of "conceptual" computational complexity, fft/ifft will beat direct convolution from a certain number of samples on. That would be about 128 samples according to:

https://ccrma.stanford.edu/~jos/ReviewFourier/FFT_...

Yeah you might be able to cascade your convolutions. That would speed up things at one point, and potentially slow them down a bit at another point.

In reality, things are a bit more complex and depend very much on your choice of architecture, if you're going to max out all its available features (e.g.using well designed assembly), if you're going to write plain C and how you gonna write that code specifically. Are you writing ISO/ANSI code, are you using any statements to help your compiler optimize it... Or even: are you going to use a ready-made (hopefully well optimized and well working) library.

Your best bet is to check for available ready made solutions for both, convolution and fft/ifft, write your own code too. Concerning the "from scratch" method: so often it reveals that most simple is best; because you control and understand everything, so you can optimize it to the max (or even your compiler might be able to do a decent task). Test and benchmark everything. Choose what performs best.

Is it time consuming to do all of this? Maybe it takes the same time as to exchange opinions and make (potentially wrong) decisions based on speculations and incomplete knowledge. But for sure you'll gain experience, insight, and the best solution you can come up with for now.

Cheers

Andy

"There is nothing like zero latency and zero delay in reality..."

Oh yes, you always can do this - regardless of the method - but for a certain price.

Imagine you calculate a filter kernel (a set of coefficients) for a usual FIR filter. Now you cut it in half and discard one half of it. so the former center-coefficient comes directly as the very first.

So you got what you asked for - but DO NOT look on the filter characteristics... this is the price you must pay.