Reply by SteveSmith December 26, 2007
Hi Robert,
I agree with your assessment.  The &ldquo;FFT&rdquo; curve in my Fig. 18-3 would be
about 20% lower if longer FFTs had been used for the measured data.   

I used your general method on FFT lengths from 512 to 16,384.  If one is
satisfied to be within about 10% of optimal operation, the FFT length
needs to be at least 3-4 times the length of the filter kernel.  If you
want to be within a few percent, it's more like 6-8 times the length of the
filter kernel.  These are much longer than I expected -- thanks for the
good data.
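(For illustration, a quick Octave/MATLAB sketch that roughly reproduces this rule of thumb from the idealized N*log2(N) cost model discussed further down the thread; the kernel length L = 1024 is just an example, not one of the measured points.)

L = 1024;                           % filter kernel length (illustrative only)
p = ceil(log2(L)) + (0:6);          % candidate FFT sizes N = 2^p, all >= L
N = 2.^p;
cost = N.*p ./ (N - (L-1));         % per-sample cost, keeping only the N*log2(N) term
excess = 100*(cost/min(cost) - 1);  % percent above the best FFT size in the list
disp([N(:) excess(:)])              % here N = 4*L is within ~8%, N = 8*L is the minimum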

Regards,
Steve Smith
Steve.Smith1@SpectrumSDI.com    
Reply by robert bristow-johnson December 26, 2007
On Dec 25, 4:40 am, Andor <andor.bari...@gmail.com> wrote:
> SteveSmith wrote:
> > Keep in mind that it has been over 15 years since I prepared Fig 18-3, so my explanation may have some gaps.  Here is the link to the figure:
> >
> > http://www.dspguide.com/ch18/3.htm
> >
> > This graph is based on several measured values with smooth lines drawn between.  In retrospect, this graph should have shown the data markers for these measurement points.  (In particular, the curve for the FFT graph is not this smooth, since the FFT length jumps by powers of two.)  The measured data were taken on a 100 MHz Pentium processor, running under DOS, in compiled QuickBasic.  I know that one of the measured points was for a 400 point filter kernel with 1024 point FFTs being used.  My guess is that the other points were powers of two of this, e.g. an 800 point filter kernel with 2048 point FFTs.
> >
> > I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.  Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
>
> I calculated this for the SHARC DSP (2nd generation with SIMD) in this post:
>
> http://groups.google.com/group/comp.dsp/browse_frm/thread/0985ad0ca9d...
i remember that. i was also singing the same tired old tune then.
> The numbers are highly hard- and software dependent.
but i'd bet that the cost/sample function still fits the form:

   cost(L,N) = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-(L-1))

other than independently hand-coded FFTs (for every N=2^p), i would think that using a vanilla flavored, radix-2, general N=2^p FFT program, the fast convolution would fit that cost function for constant A, B, C, and D.

r b-j
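(That cost form, written as a quick Octave/MATLAB one-liner for anyone who wants to play with it; the coefficients A, B, C, D are placeholders to be fit to a particular FFT implementation.)

cost = @(L, N, A, B, C, D) (A*N.*log2(N) + B*N + C*log2(N) + D) ./ (N - (L-1));
cost(600, 2.^(10:14), 1, 0, 0, 0)   % scan N = 1024..16384 for L = 600; the minimum is at N = 8192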
Reply by Andor December 25, 2007
SteveSmith wrote:
> Keep in mind that it has been over 15 years since I prepared Fig 18-3, so my explanation may have some gaps.  Here is the link to the figure:
>
> http://www.dspguide.com/ch18/3.htm
>
> This graph is based on several measured values with smooth lines drawn between.  In retrospect, this graph should have shown the data markers for these measurement points.  (In particular, the curve for the FFT graph is not this smooth, since the FFT length jumps by powers of two.)  The measured data were taken on a 100 MHz Pentium processor, running under DOS, in compiled QuickBasic.  I know that one of the measured points was for a 400 point filter kernel with 1024 point FFTs being used.  My guess is that the other points were powers of two of this, e.g. an 800 point filter kernel with 2048 point FFTs.
>
> I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.  Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
I calculated this for the SHARC DSP (2nd generation with SIMD) in this post:

http://groups.google.com/group/comp.dsp/browse_frm/thread/0985ad0ca9d10309/e416169ba33926a0?#e416169ba33926a0

The numbers are highly hard- and software dependent.  The ADI library for complex vector MAC uses 6 cycles per MAC*), and I coded it in 5 cycles per MAC, etc.

I once worked on another system with shared SDRAM, where the bottleneck was the memory interface.  That system was also more complex, in that the impulse response was partitioned into multiple blocks of different size (the Gardner concept), the same thing that the OP has in mind.  In such systems, calculating the break-even point isn't that simple, but the idea remains the same.

Regards,
Andor

*) see here:
http://www.analog.com/processors/sharc/technicalLibrary/codeExamples/2116x_simd_code.html
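(To make the partitioning idea concrete, here is a minimal Octave/MATLAB sketch of *uniformly* partitioned overlap-save convolution; note that the Gardner scheme mentioned above uses blocks of different sizes, which this sketch does not attempt.  The function name upols_fir and its structure are purely illustrative.)

function y = upols_fir(x, h, B)
    % kernel split into P equal blocks of B samples, FFT size N = 2*B,
    % frequency-domain delay line (FDL) of the spectra of past input blocks
    L = length(h);
    P = ceil(L/B);                               % number of kernel partitions
    N = 2*B;                                     % FFT size per partition
    h = [h(:); zeros(P*B - L, 1)];               % pad kernel to a whole number of partitions
    H = zeros(N, P);
    for q = 1:P
        H(:,q) = fft(h((q-1)*B + (1:B)), N);     % spectrum of partition q
    end
    nblk = ceil(length(x)/B);
    x = [x(:); zeros(nblk*B - length(x), 1)];
    FDL = zeros(N, P);                           % newest input-block spectrum in column 1
    buf = zeros(N, 1);                           % sliding 2*B-sample input buffer
    y = zeros(nblk*B, 1);
    for b = 1:nblk
        buf = [buf(B+1:end); x((b-1)*B + (1:B))];  % shift in B new input samples
        FDL = [fft(buf), FDL(:, 1:P-1)];           % push its spectrum, drop the oldest
        yb  = ifft(sum(FDL .* H, 2));              % accumulate all partitions, one IFFT
        y((b-1)*B + (1:B)) = real(yb(B+1:end));    % overlap-save: keep the last B samples
    end
end

(For example, upols_fir(x, h, 64) convolves x with h in 64-sample blocks, so the block delay stays at 64 samples regardless of how long h is.)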
Reply by robert bristow-johnson December 24, 2007
On Dec 23, 9:59 am, "peterworth" <peterwo...@gmail.com> wrote:
> > it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
> >
> > r b-j
>
> overlap-add, yes.  but this is the difficult part it seems - i can do the FFT side of things, it's just a case of working out what to overlap with what and what to sum etc.  if overlap-save is any simpler i'd give that a go..
Peter, it's a lot of work, but is what you're asking for an explanation of how to turn circular convolution (that's the only kind of convolution you can do directly with the DFT or FFT) into linear convolution (which is what an FIR filter does)?

it *is* an issue of partitioning the signal into frames or blocks (same thing, different semantic) with N-(L-1) fresh new samples in each block.  in OLA you pad the end of those N-(L-1) samples with L-1 zeros, and with OLS you pad the beginning of the N-(L-1) samples with the last L-1 samples from the previous block (you're reusing L-1 samples).  for the very first block of OLS, you pad the beginning of the first N-(L-1) samples drawn from the beginning of your input with L-1 zeros.

now, in both OLA and OLS, you have *pre*-computed the DFT of the FIR impulse response h[n], which is defined for 0 <= n <= L-1 and is zero-padded for the remaining N-L samples (i.e. h[n] = 0 for L <= n <= N-1).  so this H[k] = DFT{h[n]} is computed *once* and the cost of that single computation is not included in counting the cost of "fast convolution".

when you have two sequences of numbers, x[n] and h[n], both of length N (but N-L samples of h[n] are zero, as well as L-1 samples of x[n] in OLA), if you DFT both x[n] and h[n], point-wise multiply X[k] and H[k] together, and inverse DFT the product sequence, y[n] will be the *circular* convolution of x[n] and h[n].

   H[k] = DFT{ h[n] }
   X[k] = DFT{ x[n] }
   Y[k] = DFT{ y[n] }    ( or y[n] = iDFT{ Y[k] } )

   Y[k] = H[k]*X[k]      (here "*" means multiplication, not convolution)

the result from fiddling with the DFT is

           N-1
   y[n] = SUM{ h[m]*x[n-m] }
           m=0

or

           N-1
   y[n] = SUM{ x[m]*h[n-m] }
           m=0

where x[n] and h[n] (as well as y[n]) are understood to be periodically extended with period N.  that is

   x[n+N] = x[n]    for all n
   h[n+N] = h[n]    for all n

so in the above summations, if n-m < 0, then you must understand that x[n-m] = x[n-m+N] or h[n-m] = h[n-m+N] on the right-hand side.  if n-m < 0 and m < N, then 0 <= n-m+N < N, so x[n-m] or h[n-m] are well defined.  do you understand this circular convolution and can compare it to linear convolution (where there is no wrap-around)?

now, in Overlap-Save (OLS), you have all N samples of x[n] represent nice contiguous samples (no zeros are padded here), but if you perform the circular convolution by DFTing, multiplying, and inverse DFTing, you will see that the first L-1 samples of y[n] are *bad* (not in agreement with regular convolution) because (using the second summation above) you can see that for 0 <= n < L-1 the h[n-m] will have some non-zero samples wrap around and straddle the summation between the x[N-1] (which is in the position of x[-1]) and x[0].  so you can't use those L-1 samples of y[n] and must throw them away.  but you SAVE the N-(L-1) samples of y[n] that are good.  in the next frame or block you fetch the following N-(L-1) samples, append them to the end of the last L-1 samples of the present block, and do it all again.  the first "good" sample of y[n] (i think that would be y[L-1]) in the current frame would pick up right where the last "good" sample of y[n] (or y[N-1]) of the previous frame left off.

in Overlap-Add (OLA), which is a little more complicated, there are no "bad" samples per se, but there are L-1 "incomplete" samples in the output of each frame.
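(Here is a minimal Octave/MATLAB sketch of the OLS procedure just described; the function name ols_fir is made up, and it assumes N > L-1 and real-valued x and h.  The OLA description continues below.)

function y = ols_fir(x, h, N)
    L = length(h);
    M = N - (L-1);                     % fresh input samples per block
    H = fft(h(:), N);                  % DFT of the zero-padded kernel, computed once
    x = [zeros(L-1,1); x(:)];          % very first block: pad the front with L-1 zeros
    nblk = ceil((length(x) - (L-1)) / M);
    x = [x; zeros(nblk*M + (L-1) - length(x), 1)];   % pad the tail out to whole blocks
    y = zeros(nblk*M, 1);
    for b = 0:nblk-1
        xb = x(b*M + (1:N));           % L-1 samples reused from the previous block + M new ones
        yb = ifft(fft(xb) .* H);       % length-N circular convolution
        y(b*M + (1:M)) = yb(L:N);      % throw away the first L-1 "bad" samples, SAVE the rest
    end
    y = real(y);                       % real input and kernel: drop the round-off imaginary part
end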
you can imagine that your signal is partitioned into blocks with N-(L-1) non-zero samples with an infinite number of zeros appended on each side of the block, and each of those sequences is aligned so that exactly when the sequence for block #B transitions from non-zero samples to the first zero-pad sample on the right (at x[N-(L-1)]) is when the sequence for block #B+1 transitions on the left (at x[0]) from the left zero-pad sample to the first non-zero sample.  so if you were to add up all of these aligned sequences, you would get your original x[n] without missing a sample nor doubling up on any samples.  (remember x[0] of block #B+1 is the sample that would be in the position of x[N-(L-1)] of block #B.)

now, because an FIR filter is LTI (linear and time-invariant), the output of your filter given the sum of these zero-padded and aligned sequences (which is just the original unpartitioned input) is the sum of the filter outputs of each zero-padded sequence when those outputs are also aligned as the inputs are.  that's basic LTI theory, right?

so, in OLA, there are no "bad" samples because the non-zero portion of h[n-m] does not straddle any non-zero discontinuities of x[n] in the convolution.  but it *is* straddling that transition from zero at x[N-1] to non-zero at x[0].  during that straddling, the output y[n] will be "ramping up" as fewer zero-pad samples (from x[N-(L-1)] to x[N-1]) are included in the region where h[n-m] is non-zero.  so the first L-1 samples of y[n] will be ramping up (when 0 <= n <= L-2).  likewise the last L-1 samples of y[n] will be ramping down, since h[n-m] will be straddling the transition between the last non-zero sample x[N-L] and the first zero-pad at x[N-(L-1)] (when N-(L-1) <= n <= N-1).

the last L-1 samples of the previous frame (that are ramping down) are overlapped (due to this alignment previously alluded to) and ADDed to the first L-1 samples of the current frame (that are ramping up), thus "completing" those L-1 samples.  but also for OLA, you are still computing only N-(L-1) "complete" output samples per frame or block.

if this doesn't answer your question, all i can recommend is get a good book.  Steve's or Rick's or O&S or something.

r b-j
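(And the corresponding minimal OLA sketch, matching the description above; again, the name ola_fir and the structure are only illustrative.)

function y = ola_fir(x, h, N)
    L = length(h);
    M = N - (L-1);                     % non-zero input samples per block
    H = fft(h(:), N);                  % DFT of the zero-padded kernel, computed once
    nblk = ceil(length(x) / M);
    x = [x(:); zeros(nblk*M - length(x), 1)];
    y = zeros(nblk*M + (L-1), 1);      % room for the full linear convolution
    for b = 0:nblk-1
        xb = x(b*M + (1:M));           % M fresh samples; fft(xb, N) zero-pads them to N
        yb = ifft(fft(xb, N) .* H);    % block output, "ramping" over its first and last L-1 samples
        y(b*M + (1:N)) = y(b*M + (1:N)) + yb;   % overlap the L-1 tail samples and ADD
    end
    y = real(y);
end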
Reply by robert bristow-johnson December 24, 2007
On Dec 24, 1:06 pm, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.
but it wouldn't be optimum. the MATLAB program i posted indicates that for L=600, the FFT should be 2^13 or 8192 points, to minimize the per sample computational cost. maybe if "B" was much bigger than "A", that would change.
> Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
it's in the comments of the MATLAB program that I posted.  i'll restate it a little differently (without the p = log2(N) symbol).

given this cost function for the cost of computing one entire frame or block (that results in N-(L-1) new samples of output):

   Cost per block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-(L-1))

the "E" weighted cost of moving around N-(L-1) samples will come out in the wash; no matter what your block size is, that cost is always E per sample, so i'm leaving it out.  this is also assuming a radix-2 FFT.  i would guess that the cost related to the "A" term (which is proportional to the total number of butterflies) is the most significant, which is one reason i set the other 3 cost coefficients to zero as defaults in the MATLAB program.

so, given that cost per block, and your payoff for that cost is N-(L-1) processed output samples, the net cost per sample is

   Cost per sample = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-(L-1)) + E

now you can see, for a *fixed* N, that the larger L gets, the fewer "good" samples per block you are getting out of OLS (same for OLA) for the cost of the block.  you are dividing by a smaller number N-(L-1) and thus the cost per processed sample increases (as it would linearly for the straight-forward FIR computation).

so the result that i got (i tried to spell out how i got it) is that the optimal value of N, assuming it's an integer power of 2, is such an N that the following two inequalities are satisfied:

   L-1 >= (A*N - C*(log2(N)-2) - D)/(A*(log2(N)+1) + B + C*2/N)

and

   L-1 <= (A*2*N - C*(log2(N)-1) - D)/(A*(log2(N)+2) + B + C/N)

if we ignore the costs that are proportional to N, log2(N), and 1 (keeping only the cost proportional to N*log2(N)), these inequalities become

   N/(log2(N)+1) <= L-1 <= (2*N)/(log2(N)+2)

now that directly defines the range of optimal L, given a power of 2, N, but you can turn it around and fix L and search for the N=2^p so that the L you chose lies in the range indicated by the inequalities above.  this is what the MATLAB program shows.

if you or anyone runs it, i might suggest using the "axis" command to zoom in on the graph somewhere, and then pick a segment of the red curve (a single arc) and connect that to the blue curve of the segment immediately to the right of it, and then connect that blue segment to a green segment immediately to the right of that.  what you will see is a *single* arc indicating a single particular value for the FFT size, N.  for that fixed N, it is the cost/sample of processing as L increases (and note that the cost is increasing with L).  for the red segment of the arc, that is the region where N is too large to be optimal (there is another blue curve for a different N just below it), but as L gets larger the curve becomes blue, indicating that this particular N is optimal.  but then as L gets even larger, the arc becomes green, indicating that this particular N is too small for those values of L.

now, if you were to relate this cost to the cost of implementing the FIR directly (which is proportional to L), you would have to come up with a ratio of the cost of doing a butterfly (with all of its complex math), which is "A", to the cost of doing a single tap (multiply-accumulate, increment pointers) of the conventional FIR.  just looking at the old 56K benchmarks, that cost ratio might be 6 to 1.  with that dumb benchmark (and assuming the cost of a complex multiply is 4 times the cost of a real MAC), i set A=6 and B=(4/2+2) and ran the program.
it looks like the efficiency of the fast convolution overtakes the efficiency of the straight-forward method somewhere around L = 65 taps, which seems to be in very close agreement with your information.  given the numbers for A and B that i guessed at, an FIR with, say, L=70 taps (about as small as you would want to use the FFT method) would use an FFT of N=512 points to be most efficient.

r b-j
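(A small Octave/MATLAB sketch of that search; the helper name optimal_fft_size is made up, and it keeps the same A, B, C, D cost coefficients while dropping E, which cancels out.  With the A-only defaults, optimal_fft_size(600, 1, 0, 0, 0) comes out to 8192, and with the guessed A=6, B=4 above, optimal_fft_size(70, 6, 4, 0, 0) comes out to 512; both agree with the numbers quoted in this thread.)

function N = optimal_fft_size(L, A, B, C, D)
    % per-sample cost of OLS/OLA fast convolution with FFT size N (E dropped)
    cost = @(N) (A*N*log2(N) + B*N + C*log2(N) + D) / (N - (L-1));
    N = 2^nextpow2(L);                 % smallest power of two with N - (L-1) > 0
    while cost(2*N) < cost(N)          % there is a single minimum, so stop at the first increase
        N = 2*N;
    end
end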
Reply by SteveSmith December 24, 2007
Keep in mind that it has been over 15 years since I prepared Fig 18-3, so
my explanation may have some gaps.  Here is the link to the figure: 

http://www.dspguide.com/ch18/3.htm

This graph is based on several measured values with smooth lines drawn
between.  In retrospect, this graph should have shown the data markers for
these measurement points. (In particular, the curve for the FFT graph is
not this smooth, since the FFT length jumps by powers of two.)   The
measured data  were taken on a 100 MHz Pentium processor, running under
DOS, in compiled QuickBasic.  I know that one of the measured points was
for a 400 point filter kernel with 1024 point FFTs being used.   My guess
is that the other points were powers of two of this, e.g. an 800 point
filter kernel with 2048 point FFTs.

I didn't investigate how the length of the FFT affects the execution
speed.  For instance, I simply state that a 600 point filter kernel could
use an FFT of 1024 or 2048 points. Robert, it seems you looked at this
issue, but I couldn't see the answer in your post.  For a filter kernel
of length, L, what is the optimum FFT size, N?

Regards,
Steve Smith
Steve.Smith@SpectrumSDI.com   


Reply by robert bristow-johnson December 23, 2007
hey guys!  i gotta sorta holiday DSP present for you. Steve Smith's
plot on Figure 18-3 represented some interesting data regarding the
cost of fast convolution (which is something that i messed around with
theoretically a little bit).  i had some old notes on how to approach
the optimization problem (given an FIR length, L, how to choose the
FFT size, N, so that the number of operations per processed sample is
minimum).  so i took the notes and coded it in MATLAB (i'm sure that
Octave will work) and it's at the bottom of this post.  have fun with
it.  i think it's pretty conclusive.

anyway, it *does* show that if you optimize the FFT size, the cost
of fast convolution increases linearly with the logarithm of the FIR
length L.  i.e., if your FIR length gets big enough, doubling the FIR
length, L, increases the time cost of processing by a constant unit of
time.
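(A quick numerical check of that claim, using the hypothetical optimal_fft_size sketch that appears above, with the A-only cost model.)

Ls = 1000 * 2.^(0:4);                  % FIR lengths 1000, 2000, 4000, 8000, 16000
c  = zeros(size(Ls));
for k = 1:length(Ls)
    N = optimal_fft_size(Ls(k), 1, 0, 0, 0);
    c(k) = N*log2(N) / (N - (Ls(k)-1));   % per-sample cost at the optimal FFT size
end
diff(c)                                % increments are roughly constant (about 1.1 here),
                                       % i.e. the cost grows linearly in log2(L)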

Melly Clistmist everyone,

r b-j


On Dec 22, 6:22 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> > Now, I have a question for Steve Smith:
> >
> > this has to do with Figure 3-18 on page 183 of your book,
i meant to say Figure 18-3 on page 318.
> http://www.dspguide.com/CH18.PDF .  it's displaying the net execution time for the fast convolution in ms/L (L is the FIR length, or FIR order+1).  can you tell me, as a function of the FIR length L, what size FFT (N) did you use?  then we know, per block, that the total cost of processing is divided by (N-L+1) to get the cost of processing per sample.
>
> i've posted to this newsgroup a couple of times a methodology for picking the optimal value of N as a function of L, the desired FIR length (that is usually determined by the minimum performance needs in the filter spec as it gets designed by P-McC or similar).
>
> so, for comparison, the value of N i would choose would be from this:
>
> the cost of the FFT would likely be of this form:
>
>    A1*N*log2(N) + B1*N + C1*log2(N) + D1
>
> for some constants, A1, B1, C1, and D1, given the nature of the hardware and software that implements the FFT.  we expect the cost of multiplication in the frequency domain to be
>
>    B2*N + D2
>
> and there might be more overhead regarding moving samples around.  in OLS, i would expect that cost to be proportional to the number of samples processed per block
>
>    E*(N-L+1)
>
> where E is the cost per sample of moving data.  anyway, given those cost functions, the total cost to process one frame of samples would be
>
>    Cost/Block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-L+1)
>
> where the A, B, C, D, and E coefficients are easily determined from the cost coefficients above (remember there are 2 FFTs).  then, the cost per sample, the performance measure you have plotted in your book, Steve, is a function that should look like:
>
>    Cost/Samp = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1) + E
>
> Now, given an arbitrary (but fixed) L, you can come up with a recursion equation that compares the Cost/Samp for N that are adjacent powers of 2 (we'll call them N and N/2).  That difference is:
>
>    (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>      - (A*N/2*log2(N/2) + B*N/2 + C*log2(N/2) + D)/(N/2-L+1)
>
> (note the E's kill each other off).
>
> It initially looks like you can simplify that cost delta function, but because the two denominators are different, you can't do too much to it (oh, maybe you can put it on a common denominator) except evaluate it, given A, B, C, D, and L, for increasing powers of 2 for N.  you can also look at the same cost function, but in terms of log2(N) and with that variable as continuous (so we're essentially sampling that function at every integer value of log2(N), for radix-2 FFTs).  when you do that, you can show by derivative that there is only one value of log2(N) that minimizes the Cost/Samp function.  so that means you can take the above cost difference equation (with L fixed) and repeatedly double N until it first becomes negative.  that would be one power of two beyond your optimal power of 2 for N.
>
> anyway, what i would like to know, Steve, is how you chose your FFT size, N, as a function of FIR length, L?  and then assuming you pick that N, is that the cost/L function that you have plotted in Figure 18-3?
%
% An investigation in optimal radix-2 FFT size, N = 2^p,
% for fast convolution with FIR length of L
%
% Cost function:
%
%    cost/sample = (cost/block) / (num_processed_samples/block)
%
%                = (A*p*N + B*N + C*p + D) / (N - (L-1))
%
%    N = 2^p     = size of radix-2 FFT
%    p = log2(N) = number of passes in radix-2 FFT
%    L           = FIR length ( L-1 is FIR order )
%    N - (L-1)   = number of samples processed in each block
%
%    p*N/2 = number of FFT butterflies in one radix-2 FFT
%    N-1   = number of FFT groups in one radix-2 FFT
%    N/2   = number of complex multiplies needed for scaling spectrum
%    p     = number of FFT passes in one radix-2 FFT
%
%    A   = cost of a single FFT butterfly
%    B   = twice the overhead cost for setting up each FFT group
%          + half the cost of one complex multiply in transfer function
%    C   = twice the overhead cost for setting up each FFT pass
%    D+B = twice the overhead cost for setting up FFT
%          + other constant costs per block
%
% Fixing the FIR length to L, the optimal power, p,
% of two for N = 2^p is such a p that
%
%    L-1 >= (A*2^p - C*(p-2) - D)/(A*(p+1) + B + C*2^(1-p))
%
% and
%
%    L-1 <= (A*2^(p+1) - C*(p-1) - D)/(A*(p+2) + B + C*2^(-p))
%
% So, instead of solving for p as a function of L, for every
% integer value of p (or for every power of two, N), we delimit
% the range of L that has that particular integer power of p
% optimal for the range of L.  If, as L increases, it exceeds the
% optimal range, p is incremented by 1 (which changes the range so
% that L is now inside the range for which p is optimal).
%
% The blue graph is the cost function for increasing L using the
% optimal value for p or N, while the red and green graphs are the
% cost functions for FFT size N twice as large as the optimal
% and half as large as optimal N, respectively.  Note that the red
% and green graphs never get less than the blue (optimal) graph
% and touch it only at points where the FIR length L is about to
% have the optimal N double.
%
% From Robert Bristow-Johnson, (c) 2007
%

if ~exist('maxP')
    maxP = 32
end
if ~exist('L_per_oct')
    L_per_oct = 32
end
if ~exist('A')
    A = 1
end
if ~exist('B')
    B = 0
end
if ~exist('C')
    C = 0
end
if ~exist('D')
    D = 0
end

p = linspace(1, 40, 40);
L_optimal = (A*2.^p - C*(p-2) - D)./(A*(p+1) + B + C*2.^(-p+1)) + 1;
clear p;

LL = round(2.^linspace(1, maxP, maxP*L_per_oct+1));
cost = zeros(size(LL));

p = 1;
for k=1:maxP*L_per_oct+1
    L = LL(k);
    if (L >= L_optimal(p+1))
        p = p+1;
    end;
    N = 2^p;                                        % optimal FFT size N for FIR length L
    cost(k) = (A*p*N + B*N + C*p + D)/(N - (L-1));  % cost function
end;

semilogx(LL, cost, 'b');

if 1    % set to 1 for comparative plots
    hold on;

    p = 1;
    for k=1:maxP*L_per_oct+1
        L = LL(k);
        if (L >= L_optimal(p+1))
            p = p+1;
        end;
        N = 2^(p-1);                                % half of optimal FFT size N for FIR length L
        cost(k) = (A*(p-1)*N + B*N + C*(p-1) + D)/(N - (L-1));
    end;
    semilogx(LL, cost, 'g');

    p = 1;
    for k=1:maxP*L_per_oct+1
        L = LL(k);
        if (L > L_optimal(p+1))
            p = p+1;
        end;
        N = 2^(p+1);                                % twice optimal FFT size N for FIR length L
        cost(k) = (A*(p+1)*N + B*N + C*(p+1) + D)/(N - (L-1));
    end;
    semilogx(LL, cost, 'r');

    hold off;
end;
Reply by peterworth December 23, 2007
> Sorry for the bad link -- here is the correct one:
>
> http://www.dspguide.com/CH18.PDF
an introduction to the basics is just what i need (and couldn't find), cheers.
Reply by peterworth December 23, 2007
> it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
>
> r b-j
overlap-add, yes. but this is the difficult part it seems - i can do the FFT side of things, it's just a case of working out what to overlap with what and what to sum etc. if overlap-save is any simpler i'd give that a go..
Reply by robert bristow-johnson December 22, 2007
On Dec 22, 9:32 am, Andor <andor.bari...@gmail.com> wrote:
> On 21 Dez., 01:05, robert bristow-johnson <r...@audioimagination.com> wrote:
> > On Dec 19, 11:35 am, "peterworth" <peterwo...@gmail.com> wrote:
> > > >> y(n) = h(n) * x(n) = sum_{k=0}^{N-1} h(k) x(n-k)
> > > >>
> > > >>      = sum_{k=0}^{N1-1} h(k) x(n-k) + sum_{k=N1}^{N-1} h(k) x(n-k)
> > >
> > > i'm not sure i quite get this.  i am trying to relate it to OLA...
>
> In OLA or OLS, the input signal is partitioned into blocks, and each block is convolved with an impulse response.  However, if the impulse response is long, it also has to be partitioned into blocks (or if the delay of the convolution is to be smaller than the block size).  The above sum split was just to show how partitioning of the impulse response into two blocks leads to two parallel convolutions.
>
> > > sorry to ask lots of questions..
> >
> > it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
>
> OLA is, in my eyes, really just an academic technique.  It has no advantages over OLS, whereas OLS has several advantages over OLA.  Don't you agree?
OLA has the added cost of adding the tails of adjacent blocks.  but in OLS, you have to make sure you take the right samples (and discard the L-1 bad ones), but i don't see a cost to that, whereas there are L-1 additions that OLA must make that OLS need not.

i wonder if, in the numerical scheme of things, the rougher edge of the discontinuity in the OLS data is worser than the zero-padding of L-1 samples in OLA.  dunno, but that is the only possible advantage that OLA *might* have over OLS, but i wouldn't count on it and i'm too lazy to investigate it very far.

Now, I have a question for Steve Smith:

this has to do with Figure 3-18 on page 183 of your book, http://www.dspguide.com/CH18.PDF .  it's displaying the net execution time for the fast convolution in ms/L (L is the FIR length, or FIR order+1).  can you tell me, as a function of the FIR length L, what size FFT (N) did you use?  then we know, per block, that the total cost of processing is divided by (N-L+1) to get the cost of processing per sample.

i've posted to this newsgroup a couple of times a methodology for picking the optimal value of N as a function of L, the desired FIR length (that is usually determined by the minimum performance needs in the filter spec as it gets designed by P-McC or similar).

so, for comparison, the value of N i would choose would be from this:

the cost of the FFT would likely be of this form:

   A1*N*log2(N) + B1*N + C1*log2(N) + D1

for some constants, A1, B1, C1, and D1, given the nature of the hardware and software that implements the FFT.  we expect the cost of multiplication in the frequency domain to be

   B2*N + D2

and there might be more overhead regarding moving samples around.  in OLS, i would expect that cost to be proportional to the number of samples processed per block

   E*(N-L+1)

where E is the cost per sample of moving data.  anyway, given those cost functions, the total cost to process one frame of samples would be

   Cost/Block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-L+1)

where the A, B, C, D, and E coefficients are easily determined from the cost coefficients above (remember there are 2 FFTs).  then, the cost per sample, the performance measure you have plotted in your book, Steve, is a function that should look like:

   Cost/Samp = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1) + E

Now, given an arbitrary (but fixed) L, you can come up with a recursion equation that compares the Cost/Samp for N that are adjacent powers of 2 (we'll call them N and N/2).  That difference is:

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
     - (A*N/2*log2(N/2) + B*N/2 + C*log2(N/2) + D)/(N/2-L+1)

(note the E's kill each other off).

It initially looks like you can simplify that cost delta function, but because the two denominators are different, you can't do too much to it (oh, maybe you can put it on a common denominator) except evaluate it, given A, B, C, D, and L, for increasing powers of 2 for N.  you can also look at the same cost function, but in terms of log2(N) and with that variable as continuous (so we're essentially sampling that function at every integer value of log2(N), for radix-2 FFTs).  when you do that, you can show by derivative that there is only one value of log2(N) that minimizes the Cost/Samp function.  so that means you can take the above cost difference equation (with L fixed) and repeatedly double N until it first becomes negative.  that would be one power of two beyond your optimal power of 2 for N.

anyway, what i would like to know, Steve, is how you chose your FFT size, N, as a function of FIR length, L.
and then assuming you pick that N, is that the cost/L function that you have plotted in Figure 18-3?

it's a sorta cliche' that "the devil is in the details", but in cases like these, i actually think the opposite.

Melly Clistmast everyone,

r b-j
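(As a sanity check, the hypothetical ols_fir and ola_fir sketches above reproduce plain linear convolution to within round-off:)

x = randn(1000,1);
h = randn(70,1);
y_ref = conv(x, h);                          % plain linear convolution as the reference
y_ols = ols_fir(x, h, 512);
y_ola = ola_fir(x, h, 512);
max(abs(y_ols(1:length(y_ref)) - y_ref))     % ~1e-13, round-off only
max(abs(y_ola(1:length(y_ref)) - y_ref))     % likewise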