Reply by SteveSmith December 26, 2007
Hi Robert,
I agree with your assessment.  The &ldquo;FFT&rdquo; curve in my Fig. 18-3 would be
about 20% lower if longer FFTs had been used for the measured data.   

I used your general method on FFT lengths from 512 to 16,384.  If one is
satisfied to be within about 10% of optimal operation, the FFT length
needs to be at least 3-4 times the length of the filter kernel.  If you
want to be within a few percent, it's more like 6-8 times the length of the
filter kernel.  These are much longer than I expected -- thanks for the
good data.
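(For illustration, a quick Octave/MATLAB sketch that roughly reproduces this rule of thumb from the idealized N*log2(N) cost model discussed further down the thread; the kernel length L = 1024 is just an example, not one of the measured points.)

L = 1024;                           % filter kernel length (illustrative only)
p = ceil(log2(L)) + (0:6);          % candidate FFT sizes N = 2^p, all >= L
N = 2.^p;
cost = N.*p ./ (N - (L-1));         % per-sample cost, keeping only the N*log2(N) term
excess = 100*(cost/min(cost) - 1);  % percent above the best FFT size in the list
disp([N(:) excess(:)])              % here N = 4*L is within ~8%, N = 8*L is the minimum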

Regards,
Steve Smith
Steve.Smith1@SpectrumSDI.com    
Reply by robert bristow-johnson December 26, 2007
On Dec 25, 4:40 am, Andor <andor.bari...@gmail.com> wrote:
> SteveSmith wrote:
> > Keep in mind that it has been over 15 years since I prepared Fig 18-3, so my explanation may have some gaps.  Here is the link to the figure:
> >
> > http://www.dspguide.com/ch18/3.htm
> >
> > This graph is based on several measured values with smooth lines drawn between.  In retrospect, this graph should have shown the data markers for these measurement points.  (In particular, the curve for the FFT graph is not this smooth, since the FFT length jumps by powers of two.)  The measured data were taken on a 100 MHz Pentium processor, running under DOS, in compiled QuickBasic.  I know that one of the measured points was for a 400 point filter kernel with 1024 point FFTs being used.  My guess is that the other points were powers of two of this, e.g. an 800 point filter kernel with 2048 point FFTs.
> >
> > I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.  Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
>
> I calculated this for the SHARC DSP (2nd generation with SIMD) in this post:
>
> http://groups.google.com/group/comp.dsp/browse_frm/thread/0985ad0ca9d...
i remember that. i was also singing the same tired old tune then.
> The numbers are highly hard- and software dependent.
but i'd bet that the cost/sample function still fits the form:

   cost(L,N) = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-(L-1))

other than independently hand-coded FFTs (for every N=2^p), i would think that using a vanilla flavored, radix-2, general N=2^p FFT program, the fast convolution would fit that cost function for constant A, B, C, and D.

r b-j
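(That cost form, written as a quick Octave/MATLAB one-liner for anyone who wants to play with it; the coefficients A, B, C, D are placeholders to be fit to a particular FFT implementation.)

cost = @(L, N, A, B, C, D) (A*N.*log2(N) + B*N + C*log2(N) + D) ./ (N - (L-1));
cost(600, 2.^(10:14), 1, 0, 0, 0)   % scan N = 1024..16384 for L = 600; the minimum is at N = 8192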
Reply by Andor December 25, 2007
SteveSmith wrote:
> Keep in mind that it has been over 15 years since I prepared Fig 18-3, so my explanation may have some gaps.  Here is the link to the figure:
>
> http://www.dspguide.com/ch18/3.htm
>
> This graph is based on several measured values with smooth lines drawn between.  In retrospect, this graph should have shown the data markers for these measurement points.  (In particular, the curve for the FFT graph is not this smooth, since the FFT length jumps by powers of two.)  The measured data were taken on a 100 MHz Pentium processor, running under DOS, in compiled QuickBasic.  I know that one of the measured points was for a 400 point filter kernel with 1024 point FFTs being used.  My guess is that the other points were powers of two of this, e.g. an 800 point filter kernel with 2048 point FFTs.
>
> I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.  Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
I calculated this for the SHARC DSP (2nd generation with SIMD) in this post:

http://groups.google.com/group/comp.dsp/browse_frm/thread/0985ad0ca9d10309/e416169ba33926a0?#e416169ba33926a0

The numbers are highly hard- and software dependent.  The ADI library for complex vector MAC uses 6 cycles per MAC*), and I coded it in 5 cycles per MAC, etc.

I once worked on another system with shared SDRAM, where the bottleneck was the memory interface.  That system was also more complex, in that the impulse response was partitioned into multiple blocks of different size (the Gardner concept), the same thing that the OP has in mind.  In such systems, calculating the break-even point isn't that simple, but the idea remains the same.

Regards,
Andor

*) see here:
http://www.analog.com/processors/sharc/technicalLibrary/codeExamples/2116x_simd_code.html
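(To make the partitioning idea concrete, here is a minimal Octave/MATLAB sketch of *uniformly* partitioned overlap-save convolution; note that the Gardner scheme mentioned above uses blocks of different sizes, which this sketch does not attempt.  The function name upols_fir and its structure are purely illustrative.)

function y = upols_fir(x, h, B)
    % kernel split into P equal blocks of B samples, FFT size N = 2*B,
    % frequency-domain delay line (FDL) of the spectra of past input blocks
    L = length(h);
    P = ceil(L/B);                               % number of kernel partitions
    N = 2*B;                                     % FFT size per partition
    h = [h(:); zeros(P*B - L, 1)];               % pad kernel to a whole number of partitions
    H = zeros(N, P);
    for q = 1:P
        H(:,q) = fft(h((q-1)*B + (1:B)), N);     % spectrum of partition q
    end
    nblk = ceil(length(x)/B);
    x = [x(:); zeros(nblk*B - length(x), 1)];
    FDL = zeros(N, P);                           % newest input-block spectrum in column 1
    buf = zeros(N, 1);                           % sliding 2*B-sample input buffer
    y = zeros(nblk*B, 1);
    for b = 1:nblk
        buf = [buf(B+1:end); x((b-1)*B + (1:B))];  % shift in B new input samples
        FDL = [fft(buf), FDL(:, 1:P-1)];           % push its spectrum, drop the oldest
        yb  = ifft(sum(FDL .* H, 2));              % accumulate all partitions, one IFFT
        y((b-1)*B + (1:B)) = real(yb(B+1:end));    % overlap-save: keep the last B samples
    end
end

(For example, upols_fir(x, h, 64) convolves x with h in 64-sample blocks, so the block delay stays at 64 samples regardless of how long h is.)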
Reply by robert bristow-johnson December 24, 2007
On Dec 23, 9:59 am, "peterworth" <peterwo...@gmail.com> wrote:
> > it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
> >
> > r b-j
>
> overlap-add, yes.  but this is the difficult part it seems - i can do the FFT side of things, it's just a case of working out what to overlap with what and what to sum etc.  if overlap-save is any simpler i'd give that a go..
Peter, it's a lot of work, but is what you're asking for an explanation of how to turn circular convolution (that's the only kind of convolution you can do directly with the DFT or FFT) into linear convolution (which is what an FIR filter does)?

it *is* an issue of partitioning the signal into frames or blocks (same thing, different semantic) with N-(L-1) fresh new samples in each block.  in OLA you pad the end of those N-(L-1) samples with L-1 zeros, and with OLS you pad the beginning of the N-(L-1) samples with the last L-1 samples from the previous block (you're reusing L-1 samples).  for the very first block of OLS, you pad the beginning of the first N-(L-1) samples drawn from the beginning of your input with L-1 zeros.

now, in both OLA and OLS, you have *pre*-computed the DFT of the FIR impulse response h[n], which is defined for 0 <= n <= L-1 and is zero-padded for the remaining N-L samples (i.e. h[n] = 0 for L <= n <= N-1).  so this H[k] = DFT{h[n]} is computed *once* and the cost of that single computation is not included in counting the cost of "fast convolution".

when you have two sequences of numbers, x[n] and h[n], both of length N (but N-L samples of h[n] are zero, as well as L-1 samples of x[n] in OLA), if you DFT both x[n] and h[n], point-wise multiply X[k] and H[k] together, and inverse DFT the product sequence, y[n] will be the *circular* convolution of x[n] and h[n].

   H[k] = DFT{ h[n] }
   X[k] = DFT{ x[n] }
   Y[k] = DFT{ y[n] }    ( or y[n] = iDFT{ Y[k] } )

   Y[k] = H[k]*X[k]      (here "*" means multiplication, not convolution)

the result from fiddling with the DFT is

           N-1
   y[n] = SUM{ h[m]*x[n-m] }
           m=0

or

           N-1
   y[n] = SUM{ x[m]*h[n-m] }
           m=0

where x[n] and h[n] (as well as y[n]) are understood to be periodically extended with period N.  that is

   x[n+N] = x[n]    for all n
   h[n+N] = h[n]    for all n

so in the above summations, if n-m < 0, then you must understand that x[n-m] = x[n-m+N] or h[n-m] = h[n-m+N] on the right-hand side.  if n-m < 0 and m < N, then 0 <= n-m+N < N, so x[n-m] or h[n-m] are well defined.  do you understand this circular convolution and can compare it to linear convolution (where there is no wrap-around)?

now, in Overlap-Save (OLS), you have all N samples of x[n] represent nice contiguous samples (no zeros are padded here), but if you perform the circular convolution by DFTing, multiplying, and inverse DFTing, you will see that the first L-1 samples of y[n] are *bad* (not in agreement with regular convolution) because (using the second summation above) you can see that for 0 <= n < L-1 the h[n-m] will have some non-zero samples wrap around and straddle the summation between the x[N-1] (which is in the position of x[-1]) and x[0].  so you can't use those L-1 samples of y[n] and must throw them away.  but you SAVE the N-(L-1) samples of y[n] that are good.  in the next frame or block you fetch the following N-(L-1) samples, append them to the end of the last L-1 samples of the present block, and do it all again.  the first "good" sample of y[n] (i think that would be y[L-1]) in the current frame would pick up right where the last "good" sample of y[n] (or y[N-1]) of the previous frame left off.

in Overlap-Add (OLA), which is a little more complicated, there are no "bad" samples per se, but there are L-1 "incomplete" samples in the output of each frame.
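(Here is a minimal Octave/MATLAB sketch of the OLS procedure just described; the function name ols_fir is made up, and it assumes N > L-1 and real-valued x and h.  The OLA description continues below.)

function y = ols_fir(x, h, N)
    L = length(h);
    M = N - (L-1);                     % fresh input samples per block
    H = fft(h(:), N);                  % DFT of the zero-padded kernel, computed once
    x = [zeros(L-1,1); x(:)];          % very first block: pad the front with L-1 zeros
    nblk = ceil((length(x) - (L-1)) / M);
    x = [x; zeros(nblk*M + (L-1) - length(x), 1)];   % pad the tail out to whole blocks
    y = zeros(nblk*M, 1);
    for b = 0:nblk-1
        xb = x(b*M + (1:N));           % L-1 samples reused from the previous block + M new ones
        yb = ifft(fft(xb) .* H);       % length-N circular convolution
        y(b*M + (1:M)) = yb(L:N);      % throw away the first L-1 "bad" samples, SAVE the rest
    end
    y = real(y);                       % real input and kernel: drop the round-off imaginary part
end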
you can imagine that your signal is partitioned into blocks with N-(L-1) non-zero samples with an infinite number of zeros appended on each side of the block, and each of those sequences is aligned so that exactly when the sequence for block #B transitions from non-zero samples to the first zero-pad sample on the right (at x[N-(L-1)]) is when the sequence for block #B+1 transitions on the left (at x[0]) from the left zero-pad sample to the first non-zero sample.  so if you were to add up all of these aligned sequences, you would get your original x[n] without missing a sample nor doubling up on any samples.  (remember x[0] of block #B+1 is the sample that would be in the position of x[N-(L-1)] of block #B.)

now, because an FIR filter is LTI (linear and time-invariant), the output of your filter given the sum of these zero-padded and aligned sequences (which is just the original unpartitioned input) is the sum of the filter outputs of each zero-padded sequence when those outputs are also aligned as the inputs are.  that's basic LTI theory, right?

so, in OLA, there are no "bad" samples because the non-zero portion of h[n-m] does not straddle any non-zero discontinuities of x[n] in the convolution.  but it *is* straddling that transition from zero at x[N-1] to non-zero at x[0].  during that straddling, the output y[n] will be "ramping up" as fewer zero-pad samples (from x[N-(L-1)] to x[N-1]) are included in the region where h[n-m] is non-zero.  so the first L-1 samples of y[n] will be ramping up (when 0 <= n <= L-2).  likewise the last L-1 samples of y[n] will be ramping down, since h[n-m] will be straddling the transition between the last non-zero sample x[N-L] and the first zero-pad at x[N-(L-1)] (when N-(L-1) <= n <= N-1).

the last L-1 samples of the previous frame (that are ramping down) are overlapped (due to this alignment previously alluded to) and ADDed to the first L-1 samples of the current frame (that are ramping up), thus "completing" those L-1 samples.  but also for OLA, you are still computing only N-(L-1) "complete" output samples per frame or block.

if this doesn't answer your question, all i can recommend is get a good book.  Steve's or Rick's or O&S or something.

r b-j
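(And the corresponding minimal OLA sketch, matching the description above; again, the name ola_fir and the structure are only illustrative.)

function y = ola_fir(x, h, N)
    L = length(h);
    M = N - (L-1);                     % non-zero input samples per block
    H = fft(h(:), N);                  % DFT of the zero-padded kernel, computed once
    nblk = ceil(length(x) / M);
    x = [x(:); zeros(nblk*M - length(x), 1)];
    y = zeros(nblk*M + (L-1), 1);      % room for the full linear convolution
    for b = 0:nblk-1
        xb = x(b*M + (1:M));           % M fresh samples; fft(xb, N) zero-pads them to N
        yb = ifft(fft(xb, N) .* H);    % block output, "ramping" over its first and last L-1 samples
        y(b*M + (1:N)) = y(b*M + (1:N)) + yb;   % overlap the L-1 tail samples and ADD
    end
    y = real(y);
end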
Reply by robert bristow-johnson December 24, 2007
On Dec 24, 1:06 pm, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> I didn't investigate how the length of the FFT affects the execution speed.  For instance, I simply state that a 600 point filter kernel could use an FFT of 1024 or 2048 points.
but it wouldn't be optimum. the MATLAB program i posted indicates that for L=600, the FFT should be 2^13 or 8192 points, to minimize the per sample computational cost. maybe if "B" was much bigger than "A", that would change.
> Robert, it seems you looked at this issue, but I couldn't see the answer in your post.  For a filter kernel of length, L, what is the optimum FFT size, N?
it's in the comments of the MATLAB program that I posted.  i'll restate it a little differently (without the p = log2(N) symbol).

given this cost function for the cost of computing one entire frame or block (that results in N-(L-1) new samples of output):

   Cost per block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-(L-1))

the "E" weighted cost of moving around N-(L-1) samples will come out in the wash; no matter what your block size is, that cost is always E per sample, so i'm leaving it out.  this is also assuming a radix-2 FFT.  i would guess that the cost related to the "A" term (which is proportional to the total number of butterflies) is the most significant, which is one reason i set the other 3 cost coefficients to zero as defaults in the MATLAB program.

so, given that cost per block, and your payoff for that cost is N-(L-1) processed output samples, the net cost per sample is

   Cost per sample = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-(L-1)) + E

now you can see, for a *fixed* N, that the larger L gets, the fewer "good" samples per block you are getting out of OLS (same for OLA) for the cost of the block.  you are dividing by a smaller number N-(L-1) and thus the cost per processed sample increases (as it would linearly for the straight-forward FIR computation).

so the result that i got (i tried to spell out how i got it) is that the optimal value of N, assuming it's an integer power of 2, is such an N that the following two inequalities are satisfied:

   L-1 >= (A*N - C*(log2(N)-2) - D)/(A*(log2(N)+1) + B + C*2/N)

and

   L-1 <= (A*2*N - C*(log2(N)-1) - D)/(A*(log2(N)+2) + B + C/N)

if we ignore the costs that are proportional to N, log2(N), and 1 (keeping only the cost proportional to N*log2(N)), these inequalities become

   N/(log2(N)+1) <= L-1 <= (2*N)/(log2(N)+2)

now that directly defines the range of optimal L, given a power of 2, N, but you can turn it around and fix L and search for the N=2^p so that the L you chose lies in the range indicated by the inequalities above.  this is what the MATLAB program shows.

if you or anyone runs it, i might suggest using the "axis" command to zoom in on the graph somewhere, and then pick a segment of the red curve (a single arc) and connect that to the blue curve of the segment immediately to the right of it, and then connect that blue segment to a green segment immediately to the right of that.  what you will see is a *single* arc indicating a single particular value for the FFT size, N.  for that fixed N, it is the cost/sample of processing as L increases (and note that the cost is increasing with L).  for the red segment of the arc, that is the region where N is too large to be optimal (there is another blue curve for a different N just below it), but as L gets larger the curve becomes blue, indicating that this particular N is optimal.  but then as L gets even larger, the arc becomes green, indicating that this particular N is too small for those values of L.

now, if you were to relate this cost to the cost of implementing the FIR directly (which is proportional to L), you would have to come up with a ratio of the cost of doing a butterfly (with all of its complex math), which is "A", to the cost of doing a single tap (multiply-accumulate, increment pointers) of the conventional FIR.  just looking at the old 56K benchmarks, that cost ratio might be 6 to 1.  with that dumb benchmark (and assuming the cost of a complex multiply is 4 times the cost of a real MAC), i set A=6 and B=(4/2+2) and ran the program.
it looks like the efficiency of the fast convolution overtakes the efficiency of the straight-forward method somewhere around L = 65 taps, which seems to be in very close agreement with your information.  given the numbers for A and B that i guessed at, an FIR with, say, L=70 taps (about as small as you would want to use the FFT method) would use an FFT of N=512 points to be most efficient.

r b-j
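(A small Octave/MATLAB sketch of that search; the helper name optimal_fft_size is made up, and it keeps the same A, B, C, D cost coefficients while dropping E, which cancels out.  With the A-only defaults, optimal_fft_size(600, 1, 0, 0, 0) comes out to 8192, and with the guessed A=6, B=4 above, optimal_fft_size(70, 6, 4, 0, 0) comes out to 512; both agree with the numbers quoted in this thread.)

function N = optimal_fft_size(L, A, B, C, D)
    % per-sample cost of OLS/OLA fast convolution with FFT size N (E dropped)
    cost = @(N) (A*N*log2(N) + B*N + C*log2(N) + D) / (N - (L-1));
    N = 2^nextpow2(L);                 % smallest power of two with N - (L-1) > 0
    while cost(2*N) < cost(N)          % there is a single minimum, so stop at the first increase
        N = 2*N;
    end
end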
Reply by SteveSmith December 24, 2007
Keep in mind that it has been over 15 years since I prepared Fig 18-3, so
my explanation may have some gaps.  Here is the link to the figure: 

http://www.dspguide.com/ch18/3.htm

This graph is based on several measured values with smooth lines drawn
between.  In retrospect, this graph should have shown the data markers for
these measurement points. (In particular, the curve for the FFT graph is
not this smooth, since the FFT length jumps by powers of two.)   The
measured data  were taken on a 100 MHz Pentium processor, running under
DOS, in compiled QuickBasic.  I know that one of the measured points was
for a 400 point filter kernel with 1024 point FFTs being used.   My guess
is that the other points were powers of two of this, e.g. an 800 point
filter kernel with 2048 point FFTs.

I didn't investigate how the length of the FFT affects the execution
speed.  For instance, I simply state that a 600 point filter kernel could
use an FFT of 1024 or 2048 points. Robert, it seems you looked at this
issue, but I couldn't see the answer in your post.  For a filter kernel
of length, L, what is the optimum FFT size, N?

Regards,
Steve Smith
Steve.Smith@SpectrumSDI.com   


Reply by robert bristow-johnson December 23, 2007
hey guys!  i gotta sorta holiday DSP present for you. Steve Smith's
plot on Figure 18-3 represented some interesting data regarding the
cost of fast convolution (which is something that i messed around with
theoretically a little bit).  i had some old notes on how to approach
the optimization problem (given an FIR length, L, how to choose the
FFT size, N, so that the number of operations per processed sample is
minimum).  so i took the notes and coded it in MATLAB (i'm sure that
Octave will work) and it's at the bottom of this post.  have fun with
it.  i think it's pretty conclusive.

anyway, it *does* show that if you optimize the FFT size, the cost
of fast convolution increases linearly with the logarithm of the FIR
length L.  i.e., if your FIR length gets big enough, doubling the FIR
length, L, increases the time cost of processing by a constant unit of
time.
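(A quick numerical check of that claim, using the hypothetical optimal_fft_size sketch that appears above, with the A-only cost model.)

Ls = 1000 * 2.^(0:4);                  % FIR lengths 1000, 2000, 4000, 8000, 16000
c  = zeros(size(Ls));
for k = 1:length(Ls)
    N = optimal_fft_size(Ls(k), 1, 0, 0, 0);
    c(k) = N*log2(N) / (N - (Ls(k)-1));   % per-sample cost at the optimal FFT size
end
diff(c)                                % increments are roughly constant (about 1.1 here),
                                       % i.e. the cost grows linearly in log2(L)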

Melly Clistmist everyone,

r b-j


On Dec 22, 6:22 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> > Now, I have a question for Steve Smith:
> >
> > this has to do with Figure 3-18 on page 183 of your book,
i meant to say Figure 18-3 on page 318.
> http://www.dspguide.com/CH18.PDF .  it's displaying the net execution time for the fast convolution in ms/L (L is the FIR length, or FIR order+1).  can you tell me, as a function of the FIR length L, what size FFT (N) did you use?  then we know, per block, that the total cost of processing is divided by (N-L+1) to get the cost of processing per sample.
>
> i've posted to this newsgroup a couple of times a methodology for picking the optimal value of N as a function of L, the desired FIR length (that is usually determined by the minimum performance needs in the filter spec as it gets designed by P-McC or similar).
>
> so, for comparison, the value of N i would choose would be from this:
>
> the cost of the FFT would likely be of this form:
>
>    A1*N*log2(N) + B1*N + C1*log2(N) + D1
>
> for some constants, A1, B1, C1, and D1, given the nature of the hardware and software that implements the FFT.  we expect the cost of multiplication in the frequency domain to be
>
>    B2*N + D2
>
> and there might be more overhead regarding moving samples around.  in OLS, i would expect that cost to be proportional to the number of samples processed per block
>
>    E*(N-L+1)
>
> where E is the cost per sample of moving data.  anyway, given those cost functions, the total cost to process one frame of samples would be
>
>    Cost/Block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-L+1)
>
> where the A, B, C, D, and E coefficients are easily determined from the cost coefficients above (remember there are 2 FFTs).  then, the cost per sample, the performance measure you have plotted in your book, Steve, is a function that should look like:
>
>    Cost/Samp = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1) + E
>
> Now, given an arbitrary (but fixed) L, you can come up with a recursion equation that compares the Cost/Samp for N that are adjacent powers of 2 (we'll call them N and N/2).  That difference is:
>
>    (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
>      - (A*N/2*log2(N/2) + B*N/2 + C*log2(N/2) + D)/(N/2-L+1)
>
> (note the E's kill each other off).
>
> It initially looks like you can simplify that cost delta function, but because the two denominators are different, you can't do too much to it (oh, maybe you can put it on a common denominator) except evaluate it, given A, B, C, D, and L, for increasing powers of 2 for N.  you can also look at the same cost function, but in terms of log2(N) and with that variable as continuous (so we're essentially sampling that function at every integer value of log2(N), for radix-2 FFTs).  when you do that, you can show by derivative that there is only one value of log2(N) that minimizes the Cost/Samp function.  so that means you can take the above cost difference equation (with L fixed) and repeatedly double N until it first becomes negative.  that would be one power of two beyond your optimal power of 2 for N.
>
> anyway, what i would like to know, Steve, is how you chose your FFT size, N, as a function of FIR length, L?  and then assuming you pick that N, is that the cost/L function that you have plotted in Figure 18-3?
%
% An investigation in optimal radix-2 FFT size, N = 2^p,
% for fast convolution with FIR length of L
%
% Cost function:
%
%    cost/sample = (cost/block) / (num_processed_samples/block)
%
%                = (A*p*N + B*N + C*p + D) / (N - (L-1))
%
%    N = 2^p     = size of radix-2 FFT
%    p = log2(N) = number of passes in radix-2 FFT
%    L           = FIR length ( L-1 is FIR order )
%    N - (L-1)   = number of samples processed in each block
%
%    p*N/2 = number of FFT butterflies in one radix-2 FFT
%    N-1   = number of FFT groups in one radix-2 FFT
%    N/2   = number of complex multiplies needed for scaling spectrum
%    p     = number of FFT passes in one radix-2 FFT
%
%    A   = cost of a single FFT butterfly
%    B   = twice the overhead cost for setting up each FFT group
%          + half the cost of one complex multiply in transfer function
%    C   = twice the overhead cost for setting up each FFT pass
%    D+B = twice the overhead cost for setting up FFT
%          + other constant costs per block
%
% Fixing the FIR length to L, the optimal power, p,
% of two for N = 2^p is such a p that
%
%    L-1 >= (A*2^p - C*(p-2) - D)/(A*(p+1) + B + C*2^(1-p))
%
% and
%
%    L-1 <= (A*2^(p+1) - C*(p-1) - D)/(A*(p+2) + B + C*2^(-p))
%
% So, instead of solving for p as a function of L, for every
% integer value of p (or for every power of two, N), we delimit
% the range of L that has that particular integer power of p
% optimal for the range of L.  If, as L increases, it exceeds the
% optimal range, p is incremented by 1 (which changes the range so
% that L is now inside the range for which p is optimal).
%
% The blue graph is the cost function for increasing L using the
% optimal value for p or N, while the red and green graphs are the
% cost functions for FFT size N twice as large as the optimal
% and half as large as optimal N, respectively.  Note that the red
% and green graphs never get less than the blue (optimal) graph
% and touch it only at points where the FIR length L is about to
% have the optimal N double.
%
% From Robert Bristow-Johnson, (c) 2007
%

if ~exist('maxP')
    maxP = 32
end
if ~exist('L_per_oct')
    L_per_oct = 32
end
if ~exist('A')
    A = 1
end
if ~exist('B')
    B = 0
end
if ~exist('C')
    C = 0
end
if ~exist('D')
    D = 0
end

p = linspace(1, 40, 40);
L_optimal = (A*2.^p - C*(p-2) - D)./(A*(p+1) + B + C*2.^(-p+1)) + 1;
clear p;

LL = round(2.^linspace(1, maxP, maxP*L_per_oct+1));
cost = zeros(size(LL));

p = 1;
for k=1:maxP*L_per_oct+1
    L = LL(k);
    if (L >= L_optimal(p+1))
        p = p+1;
    end;
    N = 2^p;                                        % optimal FFT size N for FIR length L
    cost(k) = (A*p*N + B*N + C*p + D)/(N - (L-1));  % cost function
end;

semilogx(LL, cost, 'b');

if 1    % set to 1 for comparative plots
    hold on;

    p = 1;
    for k=1:maxP*L_per_oct+1
        L = LL(k);
        if (L >= L_optimal(p+1))
            p = p+1;
        end;
        N = 2^(p-1);                                % half of optimal FFT size N for FIR length L
        cost(k) = (A*(p-1)*N + B*N + C*(p-1) + D)/(N - (L-1));
    end;
    semilogx(LL, cost, 'g');

    p = 1;
    for k=1:maxP*L_per_oct+1
        L = LL(k);
        if (L > L_optimal(p+1))
            p = p+1;
        end;
        N = 2^(p+1);                                % twice optimal FFT size N for FIR length L
        cost(k) = (A*(p+1)*N + B*N + C*(p+1) + D)/(N - (L-1));
    end;
    semilogx(LL, cost, 'r');

    hold off;
end;
Reply by peterworth December 23, 2007
> Sorry for the bad link -- here is the correct one:
>
> http://www.dspguide.com/CH18.PDF
an introduction to the basics is just what i need (and couldn't find), cheers.
Reply by peterworth December 23, 2007
> it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
>
> r b-j
overlap-add, yes. but this is the difficult part it seems - i can do the FFT side of things, it's just a case of working out what to overlap with what and what to sum etc. if overlap-save is any simpler i'd give that a go..
Reply by robert bristow-johnson December 22, 2007
On Dec 22, 9:32 am, Andor <andor.bari...@gmail.com> wrote:
> On 21 Dez., 01:05, robert bristow-johnson <r...@audioimagination.com> wrote:
> > On Dec 19, 11:35 am, "peterworth" <peterwo...@gmail.com> wrote:
> > > >> y(n) = h(n) * x(n) = sum_{k=0}^{N-1} h(k) x(n-k)
> > > >>
> > > >>      = sum_{k=0}^{N1-1} h(k) x(n-k) + sum_{k=N1}^{N-1} h(k) x(n-k)
> > >
> > > i'm not sure i quite get this.  i am trying to relate it to OLA...
>
> In OLA or OLS, the input signal is partitioned into blocks, and each block is convolved with an impulse response.  However, if the impulse response is long, it also has to be partitioned into blocks (or if the delay of the convolution is to be smaller than the block size).  The above sum split was just to show how partitioning of the impulse response into two blocks leads to two parallel convolutions.
>
> > > sorry to ask lots of questions..
> >
> > it's fine.  now you're doing overlap-ADD fast convolution, right?  or is it overlap-SAVE?
>
> OLA is, in my eyes, really just an academic technique.  It has no advantages over OLS, whereas OLS has several advantages over OLA.  Don't you agree?
OLA has the added cost of adding the tails of adjacent blocks.  but in OLS, you have to make sure you take the right samples (and discard the L-1 bad ones), but i don't see a cost to that, whereas there are L-1 additions that OLA must make that OLS need not.

i wonder if, in the numerical scheme of things, the rougher edge of the discontinuity in the OLS data is worser than the zero-padding of L-1 samples in OLA.  dunno, but that is the only possible advantage that OLA *might* have over OLS, but i wouldn't count on it and i'm too lazy to investigate it very far.

Now, I have a question for Steve Smith:

this has to do with Figure 3-18 on page 183 of your book, http://www.dspguide.com/CH18.PDF .  it's displaying the net execution time for the fast convolution in ms/L (L is the FIR length, or FIR order+1).  can you tell me, as a function of the FIR length L, what size FFT (N) did you use?  then we know, per block, that the total cost of processing is divided by (N-L+1) to get the cost of processing per sample.

i've posted to this newsgroup a couple of times a methodology for picking the optimal value of N as a function of L, the desired FIR length (that is usually determined by the minimum performance needs in the filter spec as it gets designed by P-McC or similar).

so, for comparison, the value of N i would choose would be from this:

the cost of the FFT would likely be of this form:

   A1*N*log2(N) + B1*N + C1*log2(N) + D1

for some constants, A1, B1, C1, and D1, given the nature of the hardware and software that implements the FFT.  we expect the cost of multiplication in the frequency domain to be

   B2*N + D2

and there might be more overhead regarding moving samples around.  in OLS, i would expect that cost to be proportional to the number of samples processed per block

   E*(N-L+1)

where E is the cost per sample of moving data.  anyway, given those cost functions, the total cost to process one frame of samples would be

   Cost/Block = A*N*log2(N) + B*N + C*log2(N) + D + E*(N-L+1)

where the A, B, C, D, and E coefficients are easily determined from the cost coefficients above (remember there are 2 FFTs).  then, the cost per sample, the performance measure you have plotted in your book, Steve, is a function that should look like:

   Cost/Samp = (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1) + E

Now, given an arbitrary (but fixed) L, you can come up with a recursion equation that compares the Cost/Samp for N that are adjacent powers of 2 (we'll call them N and N/2).  That difference is:

   (A*N*log2(N) + B*N + C*log2(N) + D)/(N-L+1)
     - (A*N/2*log2(N/2) + B*N/2 + C*log2(N/2) + D)/(N/2-L+1)

(note the E's kill each other off).

It initially looks like you can simplify that cost delta function, but because the two denominators are different, you can't do too much to it (oh, maybe you can put it on a common denominator) except evaluate it, given A, B, C, D, and L, for increasing powers of 2 for N.  you can also look at the same cost function, but in terms of log2(N) and with that variable as continuous (so we're essentially sampling that function at every integer value of log2(N), for radix-2 FFTs).  when you do that, you can show by derivative that there is only one value of log2(N) that minimizes the Cost/Samp function.  so that means you can take the above cost difference equation (with L fixed) and repeatedly double N until it first becomes negative.  that would be one power of two beyond your optimal power of 2 for N.

anyway, what i would like to know, Steve, is how you chose your FFT size, N, as a function of FIR length, L.
and then assuming you pick that N, is that the cost/L function that you have plotted in Figure 18-3?

it's a sorta cliche' that "the devil is in the details", but in cases like these, i actually think the opposite.

Melly Clistmast everyone,

r b-j
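(As a sanity check, the hypothetical ols_fir and ola_fir sketches above reproduce plain linear convolution to within round-off:)

x = randn(1000,1);
h = randn(70,1);
y_ref = conv(x, h);                          % plain linear convolution as the reference
y_ols = ols_fir(x, h, 512);
y_ola = ola_fir(x, h, 512);
max(abs(y_ols(1:length(y_ref)) - y_ref))     % ~1e-13, round-off only
max(abs(y_ola(1:length(y_ref)) - y_ref))     % likewise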