Reply by robert bristow-johnson February 25, 2007
On Feb 25, 2:37 pm, "Andor" <andor.bari...@gmail.com> wrote:

> You have way too many letters for me to work this out on the back of
> an envelope with a glass of wine ... as I said, I'll look next
> Tuesday.
it's the same thing as before, the cost of doing one frame's worth of
fast convolution with a basic radix-2 FFT is gonna likely be of the form

   A*N*log2(N) + B*N + C*log2(N) + D

where N = 2^p and p is a positive integer.  the number of butterflies is
proportional to N*log2(N), the number of groups per pass adds up to N/2,
the number of passes is log2(N), the constant setup time is D, and the
number of complex multiplications to do in the frequency domain is N or
N/2.  this allows your frame to hop ahead N - (L-1) samples, so the cost
per sample is

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

now for a given L, you pick your N=2^p to minimize that expression.  even
if you took p as continuous, taking the derivative of the cost per sample
w.r.t. p (and setting it to zero to find the min) is a bitch.  but the
real issue is: for a particular L, do you want to use a 1024 point FFT or
a 2048 point FFT (let's assume the choice is between those two, for
illustration)?  so, given that L, which does better, a fast convolver with
an FFT of size N or an FFT of size 2N?  there will be some L (perhaps not
an integer value) that's right on the crossover where the N size and 2N
size perform equally well.  that's when

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

is equal to

   (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)

you can solve for L (or more easily, L-1, which you can call the "order"
of your FIR - that's the number you give to MATLAB's Parks-McClellan
utility, poorly named "remez").  i think, assuming you know your relative
cost coefs A, B, C, D and given an N=2^p, the value of L-1 that sits on
the crossover point is

   (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1
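
as a sanity check on that algebra, here is a minimal Python sketch; the
coefficient values A, B, C, D below are made-up placeholders (not measured
numbers for any particular DSP), and the point is only that the closed-form
L-1 really does land where the two per-sample costs agree:

    from math import log2

    # hypothetical cost coefficients -- placeholders only, not benchmarks
    A, B, C, D = 1.0, 4.0, 10.0, 100.0

    def cost_per_sample(N, L):
        # per-sample cost of one frame of fast convolution with an N-point FFT
        return (A*N*log2(N) + B*N + C*log2(N) + D) / (N - L + 1)

    def crossover_L(N):
        # L at which the N-point and 2N-point convolvers cost the same
        return 1 + (2*A*N - C*log2(N/2) - D) / (A*log2(4*N) + B + C/N)

    N = 1024
    L = crossover_L(N)
    print(L, cost_per_sample(N, L), cost_per_sample(2*N, L))
    # the last two numbers should come out (essentially) equal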
> > anyway, it's apparent that when your FIR gets long, it really is worth
> > it to do fast convolution since, say in the case of L=400, all but 399
> > samples of your circular convolution (7793) may be saved, you only
> > need to toss 399 samples.  it gets (relative to TD processing) better
> > and better, the longer your FIR gets.
>
> Well, no.  For any N, there is a second break-even point, close to N,
> for which the TD method is faster again (mentioned in the second table
> I posted).
are you saying that if L increases and gets close to N, it's better to do
it the straight time-domain method?  i don't think so.  as L gets bigger,
when a size N FFT is less optimal, then there's the size 2N.  once the FIR
is long enough that a small FFT convolver beats the time-domain O(N)
method, that's it.  any longer L means it's better to do in the
frequency-domain with OLA or OLS.
> Another thing that needs to be looked at is at which points
> multiple partitions are faster than a single partition of the impulse
> response (for large FFTs that have the data in SDRAM, it might be
> faster to compute several smaller FFTs instead).
that's a complication i wasn't trying to tackle. i've been assuming that each size had sufficient memory resources.
> Also, optimal
> partitioning is another topic, especially if different partition sizes
> are required (to reduce input/output latency).  I think this has been
> solved and is the central issue of the MS patent, but not really sure.
i'd be interested in what M$ patent you are referring to.  i know this
optimal FFT size thing (in the straightforward fast convolver) was done
in the early 80's or late 70's by an interesting Northwestern University
prof i knew when i was in grad school (this guy has a strange political
persona - a Holocaust denier, wrote a book defending his beliefs, student
protests, etc).  i don't think that M$ can patent the basic idea of that
optimization.

r b-j
Reply by Andor February 25, 2007
On 24 Feb., 04:58, "robert bristow-johnson"
<r...@audioimagination.com> wrote:
> On Feb 23, 9:50 pm, "Andor" <andor.bari...@gmail.com> wrote:
>
> > Since I can't sleep I might as well do that.  It turns out that the
> > crossover-point is very low.  You get:
> >
> >  N     | Range of L (FIR length) where this N (FFT size) is optimal
> >  ------|-----------------------------------------------------------
> >  TD    |    1  -   31
> >  512   |   32  -   37
> >  1024  |   38  -   91
> >  2048  |   92  -  197
> >  4096  |  198  -  399
> >  8192  |  400  -  ...
> >
> > Interesting that it is only worth going to frequency domain for FFT
> > sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
> > algorithms compute FIRs the name FIR to differentiate time- from
> > frequency-domain methods is misleading.
>
> for that particular DSP, this is very useful information (assuming
> it's reliable, which i do not doubt).  someone (at ADI) should thank
> you for it (i don't use the 21161N, but i'll thank you for it
> anyway).
It's pretty reliable.  I might have screwed up some of the constants,
I'll check on Tuesday when I'm back at work, where I have the code.
BTW, these numbers are valid for all SIMD SHARCs, up to the youngest
generation (21367/8/9, 2127xx).  The code didn't change, just the
peripherals and the clock speed.
> i thought about it, too, and given a general cost function
> (per sample) for the FFT, iFFT, and frequency domain multiplication:
>
>    (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>
> the crossover points would be when this equality occurs:
>
>    (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)
>        =  (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>
> N = 2^p is an integer power of 2.
>
> and, if i didn't fuck it up, that would be when
>
>    (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1
>
> if the terms with C and D are negligible, then this becomes
>
>    (2N)/(log2(N) + 2 + B/A)  =  L-1
>
> as N gets large and we neglect B/A, it's
>
>    (2N)/(log2(N) + 2)  =  L-1
>
> let's look at N = 4096 = 2^12, we get L = 586, a bit bigger than 400
> so i'm wondering if i screwed up or if you did or i'm making too many
> assumptions/approximations.  someone else should look at this and
> blast it out.
>
> maybe B/A can't be neglected.  using your numbers, it looks like it's
> about B/A = 6.5 in the case of N=4096 and L=399.  when i do the same
> for N=2048 and L=197, i get B/A to be about 7.8, so i am a little
> dubious that both of us did this right.  BTW, since A is about the
> cost of a radix-2 butterfly (there are N/2*log2(N) butterflies in the
> FFT and the same number in the iFFT) and B is about the cost of a
> frequency domain multiplication (for N points), i would expect B/A to
> be smaller than 1, not larger.  maybe Steven Johnson can pipe in.
You have way too many letters for me to work this out on the back of an envelope with a glass of wine ... as I said, I'll look next Tuesday.
>
> anyway, it's apparent that when your FIR gets long, it really is worth
> it to do fast convolution since, say in the case of L=400, all but 399
> samples of your circular convolution (7793) may be saved, you only
> need to toss 399 samples.  it gets (relative to TD processing) better
> and better, the longer your FIR gets.
Well, no.  For any N, there is a second break-even point, close to N,
for which the TD method is faster again (mentioned in the second table
I posted).

Another thing that needs to be looked at is at which points multiple
partitions are faster than a single partition of the impulse response
(for large FFTs that have the data in SDRAM, it might be faster to
compute several smaller FFTs instead).  Also, optimal partitioning is
another topic, especially if different partition sizes are required (to
reduce input/output latency).  I think this has been solved and is the
central issue of the MS patent, but not really sure.
> an exercise in fiddling.
Another definition of engineering :-) ?
> > r b-j
Regards, Andor
Reply by February 24, 2007
On Feb 22, 11:33 pm, huang...@gmail.com wrote:
> Hi,
>   I was reading up on older threads, and I was wondering if everyone
> could post their experiences with fast convolution crossover points? I
> know that theoretically, if you have a FIR filter with 30-60 taps
> (depending on the source), it is more efficient resource wise to
> perform convolution in the frequency domain using overlap add/save.
> However, can people post what they have learned in the real world?
> e.g., after how many coefficients would it be more efficient to
> perform filtering in the frequency domain for DSPs, FPGAs:
>
> Fixed Point DSPs:
> DSP name:
> Coefficient range:
>
> Floating Point DSPs:
> DSP name (TI C6xxx, etc.):
> Coefficient range: 32-128 from previous posts?
>
> FPGA:
> FPGA name (Xilinx, Altera, etc.):
> Coefficient range: 250-1000 from previous posts?
>
> The more recent the experience, the better! (e.g. numbers for the
> latest DSPs, FPGAs).
>
> thanks a bunch!!!
>
> Huang
One thing I've learned is that the crossover point moves right when the
FIR interpolates.  The time-domain convolution can be easily optimized to
remove the zero multiplies, but the larger, zero-padded output FFT will
have to process the zeros.  It is also the case that the crossover point
moves left when the FIR decimates by a nice number.  This allows a
smaller output FFT size to be used.

John
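
a rough back-of-the-envelope sketch of that effect (my own Python, with a
made-up FFT cost model rather than measured cycle counts): for
interpolation by R, a polyphase time-domain filter only touches the L/R
non-zero taps per output sample, while the fast convolver's per-output
cost is unchanged, so the crossover moves right:

    from math import log2

    def fd_cost(N, L):
        # overlap-save cost per output sample; the FFT cost model is a placeholder
        fft = 2 * (N // 2) * log2(N // 2)
        return (2*(fft + 2*N) + N) / (N - L + 1)

    def td_cost(L, R=1):
        # polyphase time-domain cost per output: only L/R taps see non-zero inputs
        return (L / R) / 2

    N, L = 4096, 800
    for R in (1, 2, 4):
        print(R, td_cost(L, R), round(fd_cost(N, L), 1))
    # as R grows, the time-domain cost shrinks while the FFT cost stays put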
Reply by robert bristow-johnson February 23, 2007
On Feb 23, 9:50 pm, "Andor" <andor.bari...@gmail.com> wrote:
>
> Since I can't sleep I might as well do that.  It turns out that the
> crossover-point is very low.  You get:
>
>  N     | Range of L (FIR length) where this N (FFT size) is optimal
>  ------|-----------------------------------------------------------
>  TD    |    1  -   31
>  512   |   32  -   37
>  1024  |   38  -   91
>  2048  |   92  -  197
>  4096  |  198  -  399
>  8192  |  400  -  ...
>
> Interesting that it is only worth going to frequency domain for FFT
> sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
> algorithms compute FIRs the name FIR to differentiate time- from
> frequency-domain methods is misleading.
for that particular DSP, this is very useful information (assuming it's
reliable, which i do not doubt).  someone (at ADI) should thank you for
it (i don't use the 21161N, but i'll thank you for it anyway).

i thought about it, too, and given a general cost function (per sample)
for the FFT, iFFT, and frequency domain multiplication:

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

the crossover points would be when this equality occurs:

   (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)
       =  (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

where N = 2^p is an integer power of 2.  and, if i didn't fuck it up,
that would be when

   (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1

if the terms with C and D are negligible, then this becomes

   (2N)/(log2(N) + 2 + B/A)  =  L-1

and as N gets large and we neglect B/A, it's

   (2N)/(log2(N) + 2)  =  L-1

let's look at N = 4096 = 2^12: we get L = 586, a bit bigger than 400, so
i'm wondering if i screwed up or if you did or if i'm making too many
assumptions/approximations.  someone else should look at this and blast
it out.

maybe B/A can't be neglected.  using your numbers, it looks like it's
about B/A = 6.5 in the case of N=4096 and L=399.  when i do the same for
N=2048 and L=197, i get B/A to be about 7.8, so i am a little dubious
that both of us did this right.  BTW, since A is about the cost of a
radix-2 butterfly (there are N/2*log2(N) butterflies in the FFT and the
same number in the iFFT) and B is about the cost of a frequency domain
multiplication (for N points), i would expect B/A to be smaller than 1,
not larger.  maybe Steven Johnson can pipe in.

anyway, it's apparent that when your FIR gets long, it really is worth
it to do fast convolution since, say in the case of L=400, all but 399
samples of your circular convolution (7793) may be saved; you only need
to toss 399 samples.  it gets (relative to TD processing) better and
better, the longer your FIR gets.

an exercise in fiddling.

r b-j
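
for what it's worth, a quick Python check of those back-solved B/A
numbers, reading the crossover L values straight off the table above and
neglecting the C and D terms:

    from math import log2

    # (2N)/(log2(N) + 2 + B/A) = L-1  =>  B/A = 2N/(L-1) - log2(N) - 2
    for N, L in [(2048, 198), (4096, 400)]:   # first L where the next FFT size wins
        print(f"N = {N}: implied B/A = {2*N/(L - 1) - log2(N) - 2:.1f}")
    # prints roughly 7.8 and 6.5, matching the values quoted above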
Reply by Andor February 23, 2007
robert bristow-johnson wrote:
> Andor,
>
> i think a table that might look like this (for a particular engine):
>
>  N     | Range of L (FIR length) where this N (FFT size) is optimal
>  ------|-----------------------------------------------------------
>  FIR   |    1  -  L1-1
>  64    |   L1  -  L2-1
>  128   |   L2  -  L3-1
>  256   |   L3  -  L4-1
>  512   |   ...
>  1024  |   ...
>  2048  |
>  4096  |   ...
>  8192  |
>
> ... would be the most useful so once the minimum L is determined to
> get a particular job done, then you look up what size FFT to use.
Since I can't sleep I might as well do that.  It turns out that the
crossover-point is very low.  You get:

 N     | Range of L (FIR length) where this N (FFT size) is optimal
 ------|-----------------------------------------------------------
 TD    |    1  -   31
 512   |   32  -   37
 1024  |   38  -   91
 2048  |   92  -  197
 4096  |  198  -  399
 8192  |  400  -  ...

Interesting that it is only worth going to the frequency domain for FFT
sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
algorithms compute FIRs, the name "FIR" to differentiate time- from
frequency-domain methods is misleading.
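
a sketch of how such a lookup table can be generated automatically (my
own Python; the cycle-count models below are rough placeholders rather
than the SHARC benchmark numbers, so the break points will not reproduce
the table above exactly; the per-sample cost expressions are the ones
given further down the thread):

    from math import log2

    def fft_cost(M):
        return 2 * M * log2(M)              # hypothetical M-point complex FFT cycles

    def fd_cost(N, L):
        # overlap-save cost per output sample
        return (2*(fft_cost(N//2) + 2*N) + N) / (N - L + 1)

    def td_cost(L):
        return L / 2                        # SIMD time-domain FIR: ~L/2 cycles/sample

    sizes = [2**p for p in range(6, 14)]    # candidate FFT sizes 64 .. 8192
    max_L = 8000

    best = {}
    for L in range(1, max_L):
        options = [("TD", td_cost(L))] + [(N, fd_cost(N, L)) for N in sizes if L < N]
        best[L] = min(options, key=lambda o: o[1])[0]

    # collapse into ranges of L per optimal size
    start = 1
    for L in range(2, max_L):
        if best[L] != best[L - 1]:
            print(f"{best[L - 1]:>5} | {start} - {L - 1}")
            start = L
    print(f"{best[max_L - 1]:>5} | {start} - ...")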
> because even if a particular L is not the highest efficiency (however
> you're defining that) for a particular N,
The highest efficiency L has maximum processing speed-up over the
time-domain method.
> the per sample processing
> costs always goes up as L increases.  that is because, with N fixed,
> the denominator of
>
>    (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)
>
> gets smaller as L gets larger.  so, for a fixed N the cost per sample
> continues to rise as L gets larger.
>
> then, as L gets larger you get a crossover point where
>
>    (2 (FFTCost((2N)/2) + 2 (2N)) + (2N))/((2N)-L+1)
>
> is virtually equal to the previous expression, and that's the
> break-even point for bumping N up to the next power of 2.
>
> that's my spin on it.
Regards, Andor
Reply by robert bristow-johnson February 23, 2007

Andor,

i think a table that might look like this (for a particular engine):

 N     | Range of L (FIR length) where this N (FFT size) is optimal
 ------|-----------------------------------------------------------
 FIR   |    1  -  L1-1
 64    |   L1  -  L2-1
 128   |   L2  -  L3-1
 256   |   L3  -  L4-1
 512   |   ...
 1024  |   ...
 2048  |
 4096  |   ...
 8192  |


... would be the most useful so once the minimum L is determined to
get a particular job done, then you look up what size FFT to use.
because even if a particular L is not the highest efficiency (however
you're defining that) for a particular N, the per-sample processing
cost always goes up as L increases.  that is because, with N fixed,
the denominator of

   (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)

gets smaller as L gets larger.  so, for a fixed N, the cost per sample
continues to rise as L gets larger.

then, as L gets larger you get a crossover point where

   (2 (FFTCost((2N)/2) + 2 (2N)) + (2N))/((2N)-L+1)

is virtually equal to the previous expression, and that's the
break-even point for bumping N up to the next power of 2.

that's my spin on it.
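
a tiny numeric illustration of that behaviour, using a placeholder
FFTCost model rather than real benchmarks: for a fixed N the per-sample
cost climbs with L, and past some L the 2N-point convolver becomes the
cheaper of the two:

    from math import log2

    def FFTCost(M):
        return 2 * M * log2(M)          # hypothetical cycles for an M-point complex FFT

    def cost(N, L):
        # per-sample cost of overlap-save, per the expression above
        return (2*(FFTCost(N//2) + 2*N) + N) / (N - L + 1)

    N = 1024
    for L in (100, 300, 500, 700, 900):
        print(L, round(cost(N, L), 1), round(cost(2*N, L), 1))
    # cost(N, L) rises with L; once it exceeds cost(2N, L), bump N up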

r b-j



Reply by Andor February 23, 2007
I wrote:
...
>  N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
>  ------|-----------------|-----------------------|------------------
>  128   |       47        |          67           |      1.082
>  256   |       34        |         170           |      1.96
>  512   |       32        |         389           |      3.142
>  1024  |       32        |         847           |      4.744
>  2048  |       34        |        1790           |      6.93
>  4096  |       35        |        3722           |      9.926
>  8192  |       36        |        7649           |     14.05
The upper limit could also be of interest:

 N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
 ------|-----------------|-----------------------|------------------
 128   |     47/82       |          67           |      1.082
 256   |     34/223      |         170           |      1.96
 512   |     32/481      |         389           |      3.142
 1024  |     32/992      |         847           |      4.744
 2048  |     34/2015     |        1790           |      6.93
 4096  |     35/4062     |        3722           |      9.926
 8192  |     36/8156     |        7649           |     14.05
Reply by Andor February 23, 2007
On 23 Feb., 11:48, "Andor" <andor.bari...@gmail.com> wrote:
...
> If I find the time today, I can give an example for the ADSP-21161N DSP.
Let's say the length of the impulse response is L and the size of the
FFT is N (L < N).  Using overlap-save, you can compute M = N - L + 1
output samples by transforming a block of N input samples, multiplying
by the (zero-padded) N-point FFT of the impulse response, and applying
the inverse FFT.  The nice thing about using frequency-domain filtering
is that you can increase the efficiency by increasing the FFT size
(there are bounds because of memory sizes, of course).  The steps you
have to take are:

1. Compute the N/2-point complex FFT of the real N-point input, with
   linear (in N) pre-processing.
2. Multiply the two complex N/2-point spectra together.
3. Compute the N/2-point complex inverse FFT (again with linear
   post-processing).

The costs for the FFT you can get from the vendor site [1].  The pre-
and post-processing is a complex mixing stage costing about 2 N DSP
cycles.  The complex multiply of two N/2-point spectra costs about N DSP
cycles.  This cost is to compute M samples, so the relative cost per
sample is

   CostOfFrequencyDomainFiltering(L,N) = (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)

The cost of simple time-domain convolution (impulse response length = L)
is the cost of the convolution sum per sample, i.e.

   CostOfTimeDomainFiltering(L) = 1/2 L

The factor 1/2 comes from the fact that the 21161N is a SIMD processor
and executes two MACs per instruction.

As you can see above, the larger N, the smaller the relative cost of the
frequency-domain method.  You should also note that there are two
break-even points.  For small L and for L close to N, time-domain
filtering is more efficient than frequency-domain filtering!  Here is a
table with the break-even and maximum-efficiency points for some N:

 N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
 ------|-----------------|-----------------------|------------------
 128   |       47        |          67           |      1.082
 256   |       34        |         170           |      1.96
 512   |       32        |         389           |      3.142
 1024  |       32        |         847           |      4.744
 2048  |       34        |        1790           |      6.93
 4096  |       35        |        3722           |      9.926
 8192  |       36        |        7649           |     14.05

You can read this table as follows: "On an ADSP-21161N and using an
N=1024 point FFT, the break-even point between time-domain and
frequency-domain filtering is for impulse responses of size L=32.  The
maximum efficiency is achieved for L=847, at which point the
frequency-domain method is 4.744 times faster than the time-domain
method."

Regards,
Andor

[1] I got the FFT benchmarks for the 21161N from this site:
http://www.analog.com/processors/sharc/technicalLibrary/codeExamples/2116x_simd_code.html
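
a minimal Python sketch of that procedure; FFTCost below is a stand-in
for the vendor benchmark numbers, so the resulting L values and speed-up
factors will not exactly match the table above:

    from math import log2

    def FFTCost(M):
        return 2 * M * log2(M)              # placeholder for the benchmarked M-point FFT cycles

    def fd_cost(N, L):
        # CostOfFrequencyDomainFiltering(L, N)
        return (2*(FFTCost(N//2) + 2*N) + N) / (N - L + 1)

    def td_cost(L):
        return L / 2                        # two MACs per cycle (SIMD)

    def analyse(N):
        # speed-up of the frequency-domain method over time-domain, for every L < N
        speedup = {L: td_cost(L) / fd_cost(N, L) for L in range(1, N)}
        hits = [L for L in speedup if speedup[L] >= 1.0]
        break_even = min(hits) if hits else None
        best_L = max(speedup, key=speedup.get)
        return break_even, best_L, round(speedup[best_L], 3)

    for N in (128, 256, 512, 1024, 2048, 4096, 8192):
        print(N, analyse(N))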
Reply by Andor February 23, 2007
On 23 Feb., 05:33, huang...@gmail.com wrote:
> Hi,
>   I was reading up on older threads, and I was wondering if everyone
> could post their experiences with fast convolution crossover points? I
> know that theoretically, if you have a FIR filter with 30-60 taps
> (depending on the source), it is more efficient resource wise to
> perform convolution in the frequency domain using overlap add/save.
Those crossover values are not theoretical but empirical.  In theory,
you can compute the crossover point yourself.  You'll need the
computation cost for an FFT, a complex multiply/add and a real
multiply/add on your DSP.  Also, you have to know how many samples of
convolution you want to generate per FFT block (there is an optimal
value, also depending on the FFT cost, see for example
http://groups.google.com/group/comp.dsp/msg/f4db8236019f395d).

If I find the time today, I can give an example for the ADSP-21161N DSP.

Regards,
Andor
Reply by February 23, 2007
Hi,
  I was reading up on older threads, and I was wondering if everyone
could post their experiences with fast convolution crossover points? I
know that theoretically, if you have a FIR filter with 30-60 taps
(depending on the source), it is more efficient resource wise to
perform convolution in the frequency domain using overlap add/save.
However, can people post what they have learned in the real world?
e.g., after how many coefficients would it be more efficient to
perform filtering in the frequency domain for DSPs, FPGAs:

Fixed Point DSPs:
DSP name:
Coefficient range:

Floating Point DSPs:
DSP name (TI C6xxx, etc.):
Coefficient range: 32-128 from previous posts?

FPGA:
FPGA name (Xilinx, Altera, etc.):
Coefficient range: 250-1000 from previous posts?

The more recent the experience, the better! (e.g. numbers for the
latest DSPs, FPGAs).

thanks a bunch!!!

Huang