Reply by robert bristow-johnson February 25, 2007
On Feb 25, 2:37 pm, "Andor" <andor.bari...@gmail.com> wrote:

> You have way too many letters for me to work this out on the back of
> an envelope with a glass of wine ... as I said, I'll look next
> Tuesday.
it's the same thing as before, the cost of doing one frame's worth of
fast convolution with a basic radix-2 FFT is gonna likely be of the form

   A*N*log2(N) + B*N + C*log2(N) + D

where N = 2^p and p is a positive integer.  the number of butterflies is
proportional to N*log2(N), the number of groups per pass adds up to N/2,
the number of passes is log2(N), the constant setup time is D, and the
number of complex multiplications to do in the frequency domain is N or
N/2.  this allows your frame to hop ahead N - (L-1) samples, so the cost
per sample is

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

now for a given L, you pick your N=2^p to minimize that expression.  even
if you took p as continuous, taking the derivative of the cost per sample
w.r.t. p (and setting it to zero to find the min) is a bitch.  but the
real issue is: for a particular L, do you want to use a 1024 point FFT or
a 2048 point FFT (let's assume the choice is between those two, for
illustration)?  so, given that L, which does better, a fast convolver with
an FFT of size N or an FFT of size 2N?  there will be some L (perhaps not
an integer value) that's right on the crossover where the N size and 2N
size perform equally well.  that's when

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

is equal to

   (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)

you can solve for L (or more easily, L-1, which you can call the "order"
of your FIR - that's the number you give to MATLAB's Parks-McClellan
utility, poorly named "remez").  i think, assuming you know your relative
cost coefs A, B, C, D and given an N=2^p, the value of L-1 that sits on
the crossover point is

   (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1
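
as a sanity check on that algebra, here is a minimal Python sketch; the
coefficient values A, B, C, D below are made-up placeholders (not measured
numbers for any particular DSP), and the point is only that the closed-form
L-1 really does land where the two per-sample costs agree:

    from math import log2

    # hypothetical cost coefficients -- placeholders only, not benchmarks
    A, B, C, D = 1.0, 4.0, 10.0, 100.0

    def cost_per_sample(N, L):
        # per-sample cost of one frame of fast convolution with an N-point FFT
        return (A*N*log2(N) + B*N + C*log2(N) + D) / (N - L + 1)

    def crossover_L(N):
        # L at which the N-point and 2N-point convolvers cost the same
        return 1 + (2*A*N - C*log2(N/2) - D) / (A*log2(4*N) + B + C/N)

    N = 1024
    L = crossover_L(N)
    print(L, cost_per_sample(N, L), cost_per_sample(2*N, L))
    # the last two numbers should come out (essentially) equal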
> > anyway, it's apparent that when your FIR gets long, it really is worth
> > it to do fast convolution since, say in the case of L=400, all but 399
> > samples of your circular convolution (7793) may be saved, you only
> > need to toss 399 samples.  it gets (relative to TD processing) better
> > and better, the longer your FIR gets.
>
> Well, no.  For any N, there is a second break-even point, close to N,
> for which the TD method is faster again (mentioned in the second table
> I posted).
are you saying that if L increases and gets close to N, it's better to do
it the straight time-domain method?  i don't think so.  as L gets bigger,
when a size N FFT is less optimal, then there's the size 2N.  once the FIR
is long enough that a small FFT convolver beats the time-domain O(N)
method, that's it.  any longer L means it's better to do in the
frequency-domain with OLA or OLS.
> Another thing that needs to be looked at is at which points
> multiple partitions are faster than a single partition of the impulse
> response (for large FFTs that have the data in SDRAM, it might be
> faster to compute several smaller FFTs instead).
that's a complication i wasn't trying to tackle. i've been assuming that each size had sufficient memory resources.
> Also, optimal
> partitioning is another topic, especially if different partition sizes
> are required (to reduce input/output latency).  I think this has been
> solved and is the central issue of the MS patent, but not really sure.
i'd be interested in what M$ patent you are referring to.  i know this
optimal FFT size thing (in the straightforward fast convolver) was done
in the early 80's or late 70's by an interesting Northwestern University
prof i knew when i was in grad school (this guy has a strange political
persona - a Holocaust denier, wrote a book defending his beliefs, student
protests, etc).  i don't think that M$ can patent the basic idea of that
optimization.

r b-j
Reply by Andor February 25, 2007
On 24 Feb., 04:58, "robert bristow-johnson"
<r...@audioimagination.com> wrote:
> On Feb 23, 9:50 pm, "Andor" <andor.bari...@gmail.com> wrote:
>
> > Since I can't sleep I might as well do that.  It turns out that the
> > crossover-point is very low.  You get:
> >
> >  N     | Range of L (FIR length) where this N (FFT size) is optimal
> >  ------|-----------------------------------------------------------
> >  TD    |    1  -   31
> >  512   |   32  -   37
> >  1024  |   38  -   91
> >  2048  |   92  -  197
> >  4096  |  198  -  399
> >  8192  |  400  -  ...
> >
> > Interesting that it is only worth going to frequency domain for FFT
> > sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
> > algorithms compute FIRs the name FIR to differentiate time- from
> > frequency-domain methods is misleading.
>
> for that particular DSP, this is very useful information (assuming
> it's reliable, which i do not doubt).  someone (at ADI) should thank
> you for it (i don't use the 21161N, but i'll thank you for it
> anyway).
It's pretty reliable.  I might have screwed up some of the constants,
I'll check on Tuesday when I'm back at work, where I have the code.
BTW, these numbers are valid for all SIMD SHARCs, up to the youngest
generation (21367/8/9, 2127xx).  The code didn't change, just the
peripherals and the clock speed.
> i thought about it, too, and given a general cost function
> (per sample) for the FFT, iFFT, and frequency domain multiplication:
>
>    (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>
> the crossover points would be when this equality occurs:
>
>    (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)
>        =  (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>
> N = 2^p is an integer power of 2.
>
> and, if i didn't fuck it up, that would be when
>
>    (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1
>
> if the terms with C and D are negligible, then this becomes
>
>    (2N)/(log2(N) + 2 + B/A)  =  L-1
>
> as N gets large and we neglect B/A, it's
>
>    (2N)/(log2(N) + 2)  =  L-1
>
> let's look at N = 4096 = 2^12, we get L = 586, a bit bigger than 400
> so i'm wondering if i screwed up or if you did or i'm making too many
> assumptions/approximations.  someone else should look at this and
> blast it out.
>
> maybe B/A can't be neglected.  using your numbers, it looks like it's
> about B/A = 6.5 in the case of N=4096 and L=399.  when i do the same
> for N=2048 and L=197, i get B/A to be about 7.8, so i am a little
> dubious that both of us did this right.  BTW, since A is about the
> cost of a radix-2 butterfly (there are N/2*log2(N) butterflies in the
> FFT and the same number in the iFFT) and B is about the cost of a
> frequency domain multiplication (for N points), i would expect B/A to
> be smaller than 1, not larger.  maybe Steven Johnson can pipe in.
You have way too many letters for me to work this out on the back of an envelope with a glass of wine ... as I said, I'll look next Tuesday.
>
> anyway, it's apparent that when your FIR gets long, it really is worth
> it to do fast convolution since, say in the case of L=400, all but 399
> samples of your circular convolution (7793) may be saved, you only
> need to toss 399 samples.  it gets (relative to TD processing) better
> and better, the longer your FIR gets.
Well, no.  For any N, there is a second break-even point, close to N,
for which the TD method is faster again (mentioned in the second table
I posted).

Another thing that needs to be looked at is at which points multiple
partitions are faster than a single partition of the impulse response
(for large FFTs that have the data in SDRAM, it might be faster to
compute several smaller FFTs instead).  Also, optimal partitioning is
another topic, especially if different partition sizes are required (to
reduce input/output latency).  I think this has been solved and is the
central issue of the MS patent, but not really sure.
> an exercise in fiddling.
Another definition of engineering :-) ?
> > r b-j
Regards, Andor
Reply by February 24, 2007
On Feb 22, 11:33 pm, huang...@gmail.com wrote:
> Hi,
>   I was reading up on older threads, and I was wondering if everyone
> could post their experiences with fast convolution crossover points? I
> know that theoretically, if you have a FIR filter with 30-60 taps
> (depending on the source), it is more efficient resource wise to
> perform convolution in the frequency domain using overlap add/save.
> However, can people post what they have learned in the real world?
> e.g., after how many coefficients would it be more efficient to
> perform filtering in the frequency domain for DSPs, FPGAs:
>
> Fixed Point DSPs:
> DSP name:
> Coefficient range:
>
> Floating Point DSPs:
> DSP name (TI C6xxx, etc.):
> Coefficient range: 32-128 from previous posts?
>
> FPGA:
> FPGA name (Xilinx, Altera, etc.):
> Coefficient range: 250-1000 from previous posts?
>
> The more recent the experience, the better! (e.g. numbers for the
> latest DSPs, FPGAs).
>
> thanks a bunch!!!
>
> Huang
One thing I've learned is that the crossover point moves right when the
FIR interpolates.  The time-domain convolution can be easily optimized to
remove the zero multiplies, but the larger, zero-padded output FFT will
have to process the zeros.  It is also the case that the crossover point
moves left when the FIR decimates by a nice number.  This allows a
smaller output FFT size to be used.

John
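
a rough back-of-the-envelope sketch of that effect (my own Python, with a
made-up FFT cost model rather than measured cycle counts): for
interpolation by R, a polyphase time-domain filter only touches the L/R
non-zero taps per output sample, while the fast convolver's per-output
cost is unchanged, so the crossover moves right:

    from math import log2

    def fd_cost(N, L):
        # overlap-save cost per output sample; the FFT cost model is a placeholder
        fft = 2 * (N // 2) * log2(N // 2)
        return (2*(fft + 2*N) + N) / (N - L + 1)

    def td_cost(L, R=1):
        # polyphase time-domain cost per output: only L/R taps see non-zero inputs
        return (L / R) / 2

    N, L = 4096, 800
    for R in (1, 2, 4):
        print(R, td_cost(L, R), round(fd_cost(N, L), 1))
    # as R grows, the time-domain cost shrinks while the FFT cost stays put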
Reply by robert bristow-johnson February 23, 2007
On Feb 23, 9:50 pm, "Andor" <andor.bari...@gmail.com> wrote:
>
> Since I can't sleep I might as well do that.  It turns out that the
> crossover-point is very low.  You get:
>
>  N     | Range of L (FIR length) where this N (FFT size) is optimal
>  ------|-----------------------------------------------------------
>  TD    |    1  -   31
>  512   |   32  -   37
>  1024  |   38  -   91
>  2048  |   92  -  197
>  4096  |  198  -  399
>  8192  |  400  -  ...
>
> Interesting that it is only worth going to frequency domain for FFT
> sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
> algorithms compute FIRs the name FIR to differentiate time- from
> frequency-domain methods is misleading.
for that particular DSP, this is very useful information (assuming it's
reliable, which i do not doubt).  someone (at ADI) should thank you for
it (i don't use the 21161N, but i'll thank you for it anyway).

i thought about it, too, and given a general cost function (per sample)
for the FFT, iFFT, and frequency domain multiplication:

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

the crossover points would be when this equality occurs:

   (A*2N*log2(2N) + B*2N + C*log2(2N) + D)/(2N-L+1)
       =  (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)

where N = 2^p is an integer power of 2.  and, if i didn't fuck it up,
that would be when

   (2*A*N - C*log2(N/2) - D)/(A*log2(4N) + B + C/N)  =  L-1

if the terms with C and D are negligible, then this becomes

   (2N)/(log2(N) + 2 + B/A)  =  L-1

and as N gets large and we neglect B/A, it's

   (2N)/(log2(N) + 2)  =  L-1

let's look at N = 4096 = 2^12: we get L = 586, a bit bigger than 400, so
i'm wondering if i screwed up or if you did or if i'm making too many
assumptions/approximations.  someone else should look at this and blast
it out.

maybe B/A can't be neglected.  using your numbers, it looks like it's
about B/A = 6.5 in the case of N=4096 and L=399.  when i do the same for
N=2048 and L=197, i get B/A to be about 7.8, so i am a little dubious
that both of us did this right.  BTW, since A is about the cost of a
radix-2 butterfly (there are N/2*log2(N) butterflies in the FFT and the
same number in the iFFT) and B is about the cost of a frequency domain
multiplication (for N points), i would expect B/A to be smaller than 1,
not larger.  maybe Steven Johnson can pipe in.

anyway, it's apparent that when your FIR gets long, it really is worth
it to do fast convolution since, say in the case of L=400, all but 399
samples of your circular convolution (7793) may be saved; you only need
to toss 399 samples.  it gets (relative to TD processing) better and
better, the longer your FIR gets.

an exercise in fiddling.

r b-j
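
for what it's worth, a quick Python check of those back-solved B/A
numbers, reading the crossover L values straight off the table above and
neglecting the C and D terms:

    from math import log2

    # (2N)/(log2(N) + 2 + B/A) = L-1  =>  B/A = 2N/(L-1) - log2(N) - 2
    for N, L in [(2048, 198), (4096, 400)]:   # first L where the next FFT size wins
        print(f"N = {N}: implied B/A = {2*N/(L - 1) - log2(N) - 2:.1f}")
    # prints roughly 7.8 and 6.5, matching the values quoted above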
Reply by Andor February 23, 2007
robert bristow-johnson wrote:
> Andor,
>
> i think a table that might look like this (for a particular engine):
>
>  N     | Range of L (FIR length) where this N (FFT size) is optimal
>  ------|-----------------------------------------------------------
>  FIR   |    1  -  L1-1
>  64    |   L1  -  L2-1
>  128   |   L2  -  L3-1
>  256   |   L3  -  L4-1
>  512   |   ...
>  1024  |   ...
>  2048  |
>  4096  |   ...
>  8192  |
>
> ... would be the most useful so once the minimum L is determined to
> get a particular job done, then you look up what size FFT to use.
Since I can't sleep I might as well do that.  It turns out that the
crossover-point is very low.  You get:

 N     | Range of L (FIR length) where this N (FFT size) is optimal
 ------|-----------------------------------------------------------
 TD    |    1  -   31
 512   |   32  -   37
 1024  |   38  -   91
 2048  |   92  -  197
 4096  |  198  -  399
 8192  |  400  -  ...

Interesting that it is only worth going to the frequency domain for FFT
sizes of 512 and larger.  Note how I changed your FIR into TD.  As all
algorithms compute FIRs, the name "FIR" to differentiate time- from
frequency-domain methods is misleading.
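
a sketch of how such a lookup table can be generated automatically (my
own Python; the cycle-count models below are rough placeholders rather
than the SHARC benchmark numbers, so the break points will not reproduce
the table above exactly; the per-sample cost expressions are the ones
given further down the thread):

    from math import log2

    def fft_cost(M):
        return 2 * M * log2(M)              # hypothetical M-point complex FFT cycles

    def fd_cost(N, L):
        # overlap-save cost per output sample
        return (2*(fft_cost(N//2) + 2*N) + N) / (N - L + 1)

    def td_cost(L):
        return L / 2                        # SIMD time-domain FIR: ~L/2 cycles/sample

    sizes = [2**p for p in range(6, 14)]    # candidate FFT sizes 64 .. 8192
    max_L = 8000

    best = {}
    for L in range(1, max_L):
        options = [("TD", td_cost(L))] + [(N, fd_cost(N, L)) for N in sizes if L < N]
        best[L] = min(options, key=lambda o: o[1])[0]

    # collapse into ranges of L per optimal size
    start = 1
    for L in range(2, max_L):
        if best[L] != best[L - 1]:
            print(f"{best[L - 1]:>5} | {start} - {L - 1}")
            start = L
    print(f"{best[max_L - 1]:>5} | {start} - ...")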
> because even if a particular L is not the highest efficiency (however
> you're defining that) for a particular N,
The highest efficiency L has maximum processing speed-up over the
time-domain method.
> the per sample processing
> costs always goes up as L increases.  that is because, with N fixed,
> the denominator of
>
>    (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)
>
> gets smaller as L gets larger.  so, for a fixed N the cost per sample
> continues to rise as L gets larger.
>
> then, as L gets larger you get a crossover point where
>
>    (2 (FFTCost((2N)/2) + 2 (2N)) + (2N))/((2N)-L+1)
>
> is virtually equal to the previous expression, and that's the
> break-even point for bumping N up to the next power of 2.
>
> that's my spin on it.
Regards, Andor
Reply by robert bristow-johnson February 23, 2007

Andor,

i think a table that might look like this (for a particular engine):

 N     | Range of L (FIR length) where this N (FFT size) is optimal
 ------|-----------------------------------------------------------
 FIR   |    1  -  L1-1
 64    |   L1  -  L2-1
 128   |   L2  -  L3-1
 256   |   L3  -  L4-1
 512   |   ...
 1024  |   ...
 2048  |
 4096  |   ...
 8192  |


... would be the most useful so once the minimum L is determined to
get a particular job done, then you look up what size FFT to use.
because even if a particular L is not the highest efficiency (however
you're defining that) for a particular N, the per-sample processing
cost always goes up as L increases.  that is because, with N fixed,
the denominator of

   (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)

gets smaller as L gets larger.  so, for a fixed N, the cost per sample
continues to rise as L gets larger.

then, as L gets larger you get a crossover point where

   (2 (FFTCost((2N)/2) + 2 (2N)) + (2N))/((2N)-L+1)

is virtually equal to the previous expression, and that's the
break-even point for bumping N up to the next power of 2.

that's my spin on it.
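
a tiny numeric illustration of that behaviour, using a placeholder
FFTCost model rather than real benchmarks: for a fixed N the per-sample
cost climbs with L, and past some L the 2N-point convolver becomes the
cheaper of the two:

    from math import log2

    def FFTCost(M):
        return 2 * M * log2(M)          # hypothetical cycles for an M-point complex FFT

    def cost(N, L):
        # per-sample cost of overlap-save, per the expression above
        return (2*(FFTCost(N//2) + 2*N) + N) / (N - L + 1)

    N = 1024
    for L in (100, 300, 500, 700, 900):
        print(L, round(cost(N, L), 1), round(cost(2*N, L), 1))
    # cost(N, L) rises with L; once it exceeds cost(2N, L), bump N up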

r b-j



Reply by Andor February 23, 2007
I wrote:
...
>  N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
>  ------|-----------------|-----------------------|------------------
>  128   |       47        |          67           |      1.082
>  256   |       34        |         170           |      1.96
>  512   |       32        |         389           |      3.142
>  1024  |       32        |         847           |      4.744
>  2048  |       34        |        1790           |      6.93
>  4096  |       35        |        3722           |      9.926
>  8192  |       36        |        7649           |     14.05
The upper limit could also be of interest:

 N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
 ------|-----------------|-----------------------|------------------
 128   |     47/82       |          67           |      1.082
 256   |     34/223      |         170           |      1.96
 512   |     32/481      |         389           |      3.142
 1024  |     32/992      |         847           |      4.744
 2048  |     34/2015     |        1790           |      6.93
 4096  |     35/4062     |        3722           |      9.926
 8192  |     36/8156     |        7649           |     14.05
Reply by Andor February 23, 2007
On 23 Feb., 11:48, "Andor" <andor.bari...@gmail.com> wrote:
...
> If I find the time today, I can give an example for the ADSP-21161N DSP.
Let's say the length of the impulse response is L and the size of the
FFT is N (L < N).  Using overlap-save, you can compute M = N - L + 1
output samples by transforming a block of N input samples, multiplying
by the (zero-padded) N-point FFT of the impulse response, and applying
the inverse FFT.  The nice thing about using frequency-domain filtering
is that you can increase the efficiency by increasing the FFT size
(there are bounds because of memory sizes, of course).  The steps you
have to take are:

1. Compute the N/2-point complex FFT of the real N-point input, with
   linear (in N) pre-processing.
2. Multiply the two complex N/2-point spectra together.
3. Compute the N/2-point complex inverse FFT (again with linear
   post-processing).

The costs for the FFT you can get from the vendor site [1].  The pre-
and post-processing is a complex mixing stage costing about 2 N DSP
cycles.  The complex multiply of two N/2-point spectra costs about N DSP
cycles.  This cost is to compute M samples, so the relative cost per
sample is

   CostOfFrequencyDomainFiltering(L,N) = (2 (FFTCost(N/2) + 2 N) + N)/(N-L+1)

The cost of simple time-domain convolution (impulse response length = L)
is the cost of the convolution sum per sample, i.e.

   CostOfTimeDomainFiltering(L) = 1/2 L

The factor 1/2 comes from the fact that the 21161N is a SIMD processor
and executes two MACs per instruction.

As you can see above, the larger N, the smaller the relative cost of the
frequency-domain method.  You should also note that there are two
break-even points.  For small L and for L close to N, time-domain
filtering is more efficient than frequency-domain filtering!  Here is a
table with the break-even and maximum-efficiency points for some N:

 N     | Break-Even L    | Maximum Efficiency L  | Efficiency Factor
 ------|-----------------|-----------------------|------------------
 128   |       47        |          67           |      1.082
 256   |       34        |         170           |      1.96
 512   |       32        |         389           |      3.142
 1024  |       32        |         847           |      4.744
 2048  |       34        |        1790           |      6.93
 4096  |       35        |        3722           |      9.926
 8192  |       36        |        7649           |     14.05

You can read this table as follows: "On an ADSP-21161N and using an
N=1024 point FFT, the break-even point between time-domain and
frequency-domain filtering is for impulse responses of size L=32.  The
maximum efficiency is achieved for L=847, at which point the
frequency-domain method is 4.744 times faster than the time-domain
method."

Regards,
Andor

[1] I got the FFT benchmarks for the 21161N from this site:
http://www.analog.com/processors/sharc/technicalLibrary/codeExamples/2116x_simd_code.html
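
a minimal Python sketch of that procedure; FFTCost below is a stand-in
for the vendor benchmark numbers, so the resulting L values and speed-up
factors will not exactly match the table above:

    from math import log2

    def FFTCost(M):
        return 2 * M * log2(M)              # placeholder for the benchmarked M-point FFT cycles

    def fd_cost(N, L):
        # CostOfFrequencyDomainFiltering(L, N)
        return (2*(FFTCost(N//2) + 2*N) + N) / (N - L + 1)

    def td_cost(L):
        return L / 2                        # two MACs per cycle (SIMD)

    def analyse(N):
        # speed-up of the frequency-domain method over time-domain, for every L < N
        speedup = {L: td_cost(L) / fd_cost(N, L) for L in range(1, N)}
        hits = [L for L in speedup if speedup[L] >= 1.0]
        break_even = min(hits) if hits else None
        best_L = max(speedup, key=speedup.get)
        return break_even, best_L, round(speedup[best_L], 3)

    for N in (128, 256, 512, 1024, 2048, 4096, 8192):
        print(N, analyse(N))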
Reply by Andor February 23, 2007
On 23 Feb., 05:33, huang...@gmail.com wrote:
> Hi,
>   I was reading up on older threads, and I was wondering if everyone
> could post their experiences with fast convolution crossover points? I
> know that theoretically, if you have a FIR filter with 30-60 taps
> (depending on the source), it is more efficient resource wise to
> perform convolution in the frequency domain using overlap add/save.
Those crossover values are not theoretical but empirical.  In theory,
you can compute the crossover point yourself.  You'll need the
computation cost for an FFT, a complex multiply/add and a real
multiply/add on your DSP.  Also, you have to know how many samples of
convolution you want to generate per FFT block (there is an optimal
value, also depending on the FFT cost, see for example
http://groups.google.com/group/comp.dsp/msg/f4db8236019f395d).

If I find the time today, I can give an example for the ADSP-21161N DSP.

Regards,
Andor
Reply by February 23, 2007
Hi,
  I was reading up on older threads, and I was wondering if everyone
could post their experiences with fast convolution crossover points? I
know that theoretically, if you have a FIR filter with 30-60 taps
(depending on the source), it is more efficient resource wise to
perform convolution in the frequency domain using overlap add/save.
However, can people post what they have learned in the real world?
e.g., after how many coefficients would it be more efficient to
perform filtering in the frequency domain for DSPs, FPGAs:

Fixed Point DSPs:
DSP name:
Coefficient range:

Floating Point DSPs:
DSP name (TI C6xxx, etc.):
Coefficient range: 32-128 from previous posts?

FPGA:
FPGA name (Xilinx, Altera, etc.):
Coefficient range: 250-1000 from previous posts?

The more recent the experience, the better! (e.g. numbers for the
latest DSPs, FPGAs).

thanks a bunch!!!

Huang