Overlap-Add (OLA) STFT Processing

This chapter discusses use of the Short-Time Fourier Transform (STFT) to implement linear filtering in the frequency domain. Due to the speed of FFT convolution, the STFT provides the most efficient single-CPU implementation engine for most FIR filters encountered in audio signal processing.

Recall from §7.1 the STFT:

$\displaystyle X_m(\omega)$ $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^{\infty} x(n) w(n-mR) e^{-j\omega n}$  
  $\displaystyle =$ $\displaystyle \hbox{\sc DTFT}_\omega(x\cdot\hbox{\sc Shift}_{mR}(w))
\protect$ (9.1)


x(n) &=& \hbox{input signal at time $n$}\\
w(n) &=& \hbox{window function (\textit{e.g.}, Hamming)}\\
X_m(\omega) &=& \hbox{DTFT of windowed data centered about time $mR$}\\
R &=& \hbox{hop size, in samples, between successive windows}\\

We noted that if the window $ w(n)$ has the constant overlap-add property at hop-size $ R$ ,

$\displaystyle \sum_{m=-\infty}^{\infty} w(n-mR) \eqsp 1, \;\forall n\in{\bf Z} \quad \hbox{($w\in\hbox{\sc Cola}(R)$)},$ (9.2)

then the sum of the successive DTFTs over time equals the DTFT of the whole signal $ X(\omega)$ :

$\displaystyle \sum_{m=-\infty}^\infty X_m(\omega) \eqsp X(\omega) \eqsp \hbox{\sc DTFT}_\omega(x)$ (9.3)

Consequently, the inverse-STFT is simply the inverse-DTFT of this sum:

x(n) &=& \frac{1}{2\pi}\int_{-\pi}^\pi \sum_{m=-\infty}^\infty X_m(\omega)
e^{j\omega n} d\omega
\eqsp \sum_{m=-\infty}^\infty \frac{1}{2\pi}\int_{-\pi}^\pi X_m(\omega)
e^{j\omega n} d\omega\\
&=& \sum_{m=-\infty}^\infty x_m(n)

We may now introduce spectral modifications by multiplying each spectral frame $ X_m(\omega)$ by some filter frequency response $ H_m(\omega)$ to get

$\displaystyle Y_m(\omega) \eqsp H_m(\omega)X_m(\omega).$ (9.4)

Note that $ H_m$ can be different for each frame, giving a certain class of time-varying filters. The filtered output signal spectrum is then

$\displaystyle Y(\omega) \eqsp \sum_{m=-\infty}^\infty Y_m(\omega)$ (9.5)

so that

$\displaystyle y(n) \eqsp \sum_{m=-\infty}^\infty y_m(n)$ (9.6)


$\displaystyle y_m(n) \eqsp \hbox{\sc IDTFT}_n(Y_m) = x_m\ast h_m.$ (9.7)

This chapter discusses practical implementation of the above relations using a Fast Fourier Transform (FFT). In particular, we use an FFT to compute efficiently what may be regarded as a sampled DTFT. We will look at how sampling density must be increased along the unit circle when spectral modifications are to be performed, and we will discuss further the COLA condition on the analysis window and hop-size. In the end, our practical FFT-convolution engine will look as follows:

$\displaystyle y \eqsp \sum_{m=-\infty}^\infty \hbox{\sc Shift}_{mR} \left( \hbox{\sc FFT}_N^{-1} \left\{ H_m \cdot \hbox{\sc FFT}_N\left[\hbox{\sc Shift}_{-mR}(x)\cdot w_M \right]\right\}\right)$ (9.8)

The sum over $ m$ may be interpreted as adding separately filtered frames $ y_m=x_m\ast h_m$ . Due to this filtering, the frames must overlap, even when the rectangular window is used. As a result, the overall system is often called an overlap-add FFT processor, or ``OLA processor'' for short. It is regarded as a sequence of FFTs which may be modified, inverse-transformed, and summed. This ``hopping transform'' view of the STFT is the Fourier dual of the ``filter-bank'' interpretation to be discussed in Chapter 9.

Convolution of Short Signals

Figure: System diagram for filtering an input signal $ x(n)$ by filter $ h(n)$ to produce output $ y(n)$ as the convolution of $ x$ and $ h$ .

Figure 8.1 illustrates the conceptual operation of filtering an input signal $ x(n)$ by a filter with impulse-response $ h(n)$ to produce an output signal $ y(n)$ . By the convolution theorem for DTFTs2.3.5),

$\displaystyle (h*x) \;\longleftrightarrow\;H \cdot X$ (9.9)


$\displaystyle \hbox{\sc DTFT}_\omega(h*x)\eqsp H(\omega)X(\omega)$ (9.10)

where $ h$ and $ x$ are arbitrary real or complex sequences, and $ H$ and $ X$ are the DTFTs of $ h$ and $ x$ , respectively. The convolution of $ x$ and $ h$ is defined by

$\displaystyle y(n) \eqsp (x*h)(n) \isdefs \sum_{m=-\infty}^{\infty} x(m)h(n-m).$ (9.11)

In practice, we always use the DFT (preferably an FFT) in place of the DTFT, in which case we may write

$\displaystyle \hbox{\sc DFT}_k(h*x)\eqsp H(\omega_k)X(\omega_k)$ (9.12)

where now $ h,x,H,X\in {\bf C}^N$ (length $ N$ complex sequences). It is important to remember that the specific form of convolution implied in the DFT case is cyclic (also called circular) convolution [264]:

$\displaystyle y(n) \eqsp (x*h)(n) \isdefs \sum_{m=0}^{N-1} x(m)h(n-m)_N \protect$ (9.13)

where $ (n-m)_N$ means ``$ (n-m)$ modulo $ N$ .''

Another way to look at convolution is as the inner product of $ x$ , and $ \hbox{\sc Shift}_n[\hbox{\sc Flip}(h)]$ , where $ \hbox{\sc Flip}_n(h)\isdeftext h(-n)=h(N-n)$ , i.e.,

$\displaystyle y(n) \eqsp \langle x, \hbox{\sc Shift}_n[\hbox{\sc Flip}(h)] \rangle. % \qquad\hbox{($h$ real)}
$ (9.14)

This form describes graphical convolution in which the output sample at time $ n$ is computed as an inner product of the impulse response after flipping it about time 0 and shifting time 0 to time $ n$ . See [264, p. 105] for an illustration of graphical convolution.

Cyclic FFT Convolution

Thanks to the convolution theorem, we have two alternate ways to perform cyclic convolution in practice:

  1. Direct calculation in the time domain using (8.13)
  2. Frequency-domain convolution:
    1. Fourier Transform both signals
    2. Perform term by term multiplication of the transformed signals
    3. Inverse transform the result to get back to the time domain
For short convolutions (less than a hundred samples or so), method 1 is usually faster. However, for longer convolutions, method 2 is ultimately faster. This is because the computational complexity of direct cyclic convolution of two $ N$ -point signals is $ {\cal O}(N^2)$ , while that of FFT convolution is $ {\cal O}(N \lg N)$ . More precisely, direct cyclic convolution requires $ N^2$ multiplies and $ N(N-1)$ additions, while the exact FFT numbers depend on the particular FFT algorithm used [80,66,224,277]. Some specific cases are compared in §8.1.4 below.

Acyclic FFT Convolution

If we add enough trailing zeros to the signals being convolved, we can obtain acyclic convolution embedded within a cyclic convolution. How many zeros do we need to add? Suppose the signal $ x(n)$ consists of $ N_x$ contiguous nonzero samples at times 0 to $ N_x-1$ , preceded and followed by zeros, and suppose $ h(n)$ is nonzero only over a block of $ N_h$ samples starting at time 0. Then the acyclic convolution of $ x$ with $ h$ reduces to

$\displaystyle (x\ast h)(n) \isdefs \sum_{m=-\infty}^\infty x(m)h(n-m) \eqsp \sum_{m=0}^n x(m)h(n-m)$ (9.15)

which is zero for $ n<0$ and $ n>(N_x+N_h-1)-1$ . Thus,
$\textstyle \parbox{0.8\textwidth}{\emph{the acyclic convolution of $N_x$\ samples with $N_h$\ samples produces at most $N_x+N_h-1$\ nonzero samples.}}$
The number $ N_x+N_h-1$ is easily checked for signals of length 1 since $ \delta\ast \delta = \delta$ , where $ \delta $ is 1 at time zero and 0 at all other times. Similarly,

$\displaystyle [\delta+\hbox{\sc Shift}_1(\delta)] \ast [\delta+\hbox{\sc Shift}_1(\delta)] \eqsp \delta + 2\hbox{\sc Shift}_1(\delta) + \hbox{\sc Shift}_2(\delta)$ (9.16)

and so on.

When $ N_x$ or $ N_h$ is infinity, the convolution result can be as small as 1. For example, consider $ x=[1,r,r^2,r^3,\ldots]$ , with $ \left\vert r\right\vert<1$ , and $ h=[1,-r,0,0,\ldots]$ . Then $ x\ast h = [1, 0, 0,
\ldots]$ . This is an example of what is called deconvolution. In the frequency domain, deconvolution always involves a pole-zero cancellation. Therefore, it is only possible when $ N_x$ or $ N_h$ is infinite. In practice, deconvolution can sometimes be accomplished approximately, particularly within narrow frequency bands [119].

We thus conclude that, to embed acyclic convolution within a cyclic convolution (as provided by an FFT), we need to zero-pad both operands out to length $ N$ , where $ N$ is at least the sum of the operand lengths (minus one).

Acyclic Convolution in Matlab

In Matlab or Octave, the conv function implements acyclic convolution:

octave:1> conv([1 2],[3 4])
ans =
   3  10   8
Note that it returns an output vector which is long enough to accommodate the entire result of the convolution, unlike the filter primitive, which always returns an output signal equal in length to the input signal:
octave:2> filter([1 2],1,[3 4])
ans =
   3  10
octave:3> filter([1 2],1,[3 4 0])
ans =
   3  10   8

Pictorial View of Acyclic Convolution

Figure 8.2: Schematic depiction of the acyclic convolution of two signals.
\includegraphics[width=\textwidth ]{eps/convwaves}

Figure 8.2 shows schematically the result of convolving two zero-padded signals $ x$ and $ h$ . In this case, the signal $ x(n)$ starts some time after $ n=0$ , say at $ n=n_x$ . Since $ h(n)$ begins at time 0 , the output starts promptly at time $ n_x$ , but it takes some time to ``ramp up'' to full amplitude. (This is the transient response of the FIR filter $ h$ .) If the length of $ h$ is $ N_h$ , then the transient response is finished at time $ n=n_x+N_h-1$ . Next, when the input signal goes to zero at time $ n_x+N_x$ , the output reaches zero $ N_h-1$ samples later (after the filter ``decay time''), or time $ n_x+N_x+N_h-1$ . Thus, the total number of nonzero output samples is $ N_x+N_h-1$ .

If we don't add enough zeros, some of our convolution terms ``wrap around'' and add back upon others (due to modulo indexing). This can be called time-domain aliasing. Zero-padding in the time domain results in more samples (closer spacing) in the frequency domain, i.e., a higher `sampling rate' in the frequency domain. If we have a high enough spectral sampling rate, we can avoid time aliasing.

The motivation for implementing acyclic convolution using a zero-padded cyclic convolution is that we can use a Cooley-Tukey Fast Fourier Transform (FFT) to implement cyclic convolution when its length $ N$ is a power of 2.

Acyclic FFT Convolution in Matlab

The following example illustrates the implementation of acyclic convolution using a Cooley-Tukey FFT in matlab:

x = [1 2 3 4];
h = [1 1 1];

nx = length(x);
nh = length(h);
nfft = 2^nextpow2(nx+nh-1)
xzp = [x, zeros(1,nfft-nx)];
hzp = [h, zeros(1,nfft-nh)];
X = fft(xzp);
H = fft(hzp);

Y = H .* X;
format bank;
y = real(ifft(Y)) % zero-padded result
yt = y(1:nx+nh-1) % trim and print
yc = conv(x,h)    % for comparison
Program output:

nfft = 8
y =
  1.00  3.00  6.00  9.00  7.00  4.00  0.00  0.00
yt =
  1.00  3.00  6.00  9.00  7.00  4.00
yc =
     1     3     6     9     7     4

FFT versus Direct Convolution

Using the Matlab test program in [264],9.1FFT convolution was found to be faster than direct convolution starting at length $ N=2^6=64$ (looking only at powers of 2 for the length $ N$ ).9.2 FFT convolution was also never significantly slower at shorter lengths for which ``calling overhead'' dominates.

Running the same test program in 2011,9.3 FFT convolution using the fft function was found to be faster than conv for all (power-of-2) lengths. The speed of FFT convolution divided by that of direct convolution started out at 14 for $ N=2$ , fell to a minimum of $ 11$ at $ N=2^7=128$ , above which it started to climb as expected, reaching $ 3,160$ at $ N=2^{16}=65,536$ . Note that this comparison is unfair because the Octave fft function is a dynamically linked, separately compiled module, while conv is written in the matlab language and thus suffers more overhead from the matlab interpreter.

An analysis reported in Strum and Kirk [279, p. 521], based on the number of real multiplies, predicts that the fft is faster starting at length $ 2^7=128$ , and that direct convolution is significantly faster for very short convolutions (e.g., 16 operations for a direct length-4 convolution, versus 176 for the fft function).

See [264]9.4for further discussion of FFT algorithms and their applications.

In digital audio, FIR filters are often hundreds of taps long. For such filters, the FFT method is much faster than direct convolution in the time domain on single CPUs. On GPUs, FFT convolution is faster than direct convolution only for much longer FIR-filter lengths (in the thousands of taps [242]); this is because massively parallel hardware can perform an $ {\cal O}(N^2)$ algorithm (direct convolution) faster than a single CPU can perform an $ {\cal O}(N\,\lg N)$ algorithm (FFT convolution).

Audio FIR Filters

FIR filters shorter than the ear's ``integration time'' can generally be characterized by their magnitude frequency response (no perceivable ``delay effects''). The nominal ``integration time'' of the ear can be defined as the reciprocal of a critical bandwidth of hearing. Using Zwicker's definition of critical bandwidth [305], the smallest critical bandwidth of hearing is approximately 100 Hz (below 500 Hz). Thus, the nominal integration time of the ear is 10ms below 500 Hz. (Using the equivalent-rectangular-bandwidth (ERB) definition of critical bandwidth [179,269], longer values are obtained). At a 50 kHz sampling rate, this is 500 samples. Therefore, FIR filters shorter than the ear's ``integration time,'' i.e., perceptually ``instantaneous,'' can easily be hundreds of taps long (as discussed in the next section). FFT convolution is consequently an important implementation tool for FIR filters in digital audio applications.

Example 1: Low-Pass Filtering by FFT Convolution

In this example, we design and implement a length $ L=257$ FIR lowpass filter having a cut-off frequency at $ f_c = 600$ Hz. The filter is tested on an input signal $ x(n)$ consisting of a sum of sinusoidal components at frequencies $ (440, 880, 1000, 2000)$ Hz. We'll filter a single input frame of length $ M=256$ , which allows the FFT to be $ N=512$ samples (no wasted zero-padding).

% Signal parameters:
f = [ 440 880 1000 2000 ];      % frequencies
M = 256;                        % signal length
Fs = 5000;                      % sampling rate

% Generate a signal by adding up sinusoids:
x = zeros(1,M); % pre-allocate 'accumulator'
n = 0:(M-1);    % discrete-time grid
for fk = f;
    x = x + sin(2*pi*n*fk/Fs);

Next we design the lowpass filter using the window method:

% Filter parameters:
L = 257;    % filter length
fc = 600;   % cutoff frequency

% Design the filter using the window method:
hsupp = (-(L-1)/2:(L-1)/2);
hideal = (2*fc/Fs)*sinc(2*fc*hsupp/Fs);
h = hamming(L)' .* hideal; % h is our filter

Figure 8.3: FIR filter impulse response (top) and amplitude response (bottom).

Figure 8.3 plots the impulse response and amplitude response of our FIR filter designed by the window method. Next, the signal frame and filter impulse response are zero-padded out to the FFT size and transformed:

% Choose the next power of 2 greater than L+M-1
Nfft = 2^(ceil(log2(L+M-1))); % or 2^nextpow2(L+M-1)

% Zero pad the signal and impulse response:
xzp = [ x zeros(1,Nfft-M) ];
hzp = [ h zeros(1,Nfft-L) ];

X = fft(xzp); % signal
H = fft(hzp); % filter

Figure 8.4 shows the input signal spectrum and the filter amplitude response overlaid. We see that only one sinusoidal component falls within the pass-band.

Figure 8.4: Overlay of input signal spectrum and desired lowpass filter pass-band.

Figure 8.5: Output signal magnitude spectrum = magnitude of input spectrum times filter frequency response.

Now we perform cyclic convolution in the time domain using pointwise multiplication in the frequency domain:

Y = X .* H;
The modified spectrum is shown in Fig.8.5.

The final acyclic convolution is the inverse transform of the pointwise product in the frequency domain. The imaginary part is not quite zero as it should be due to finite numerical precision:

y = ifft(Y);
relrmserr = norm(imag(y))/norm(y) % check... should be zero
y = real(y);

Figure 8.6: Filtered output signal, with close-up showing the filter start-up transient (``pre-ring'').

Figure 8.6 shows the filter output signal in the time domain. As expected, it looks like a pure tone in steady state. Note the equal amounts of ``pre-ringing'' and ``post-ringing'' due to the use of a linear-phase FIR filter.9.5

For an input signal approximately $ 4000$ samples long, this example is 2-3 times faster than the conv function in Matlab (which is precompiled C code implementing time-domain convolution).

Example 2: Time Domain Aliasing

Figure 8.7 shows the effect of insufficient zero padding, which can be thought of as undersampling in the frequency domain. We will see aliasing in the time domain results.

The lowpass filter length is $ L= 65$ and the input signal consists of an impulse at times $ 10$ and $ M-(L-1)/4 = 85$ , where the data frame length is $ M=100$ . To avoid time aliasing (i.e., to implement acyclic convolution using an FFT), we must use an FFT size $ N$ at least as large as $ 85+65-1=149$ . In the figure, the FFT sizes $ 116$ , $ 132$ , and $ 165$ are used. Thus, the first case is heavily time aliased, the second only slightly time aliased (involving only some of the filter's ``ringing'' after the second pulse), and the third is free of time aliasing altogether.

Figure 8.7: Illustration of FFT convolution with insufficient zero padding. From the top: (1) Input signal (two impulses) and lowpass-filter impulse response; (2) heavily time-aliased convolution in which the second filter impulse has wrapped around to low times; (3) slightly time-aliased result in which some of the filter ``post-ring'' from the second pulse wraps around; (4) result with no time aliasing.

Convolving with Long Signals

We saw that we can perform efficient convolution of two finite-length sequences using a Fast Fourier Transform (FFT). There are some situations, however, in which it is impractical to use a single FFT for each convolution operand:

  • One or both of the signals being convolved is very long.
  • The filter must operate in real time. (We can't wait until the input signal ends before providing an output signal.)
Direct convolution does not have these problems. For example, given a causal finite-impulse response (FIR) $ h$ of length $ L$ , we need only store the past $ L-1$ samples of the input signal $ x$ to calculate the next output sample, since

y(n) &=& (h\ast x)(n) = \sum_{m=0}^n h(m)x(n-m)\\
&=& h(0)x(n) + h(1)x(n-1)
+\cdots+ h(L-1) x(n-L+1)

Thus, at every time $ n$ , the output $ y(n)$ can be computed as a linear combination of the current input sample $ x(n)$ and the current filter state $ \{x(n-1),\ldots,x(n-L+1)\}$ .

To obtain the benefit of high-speed FFT convolution when the input signal is very long, we simply chop up the input signal $ x$ into blocks, and perform convolution on each block separately. The output is then the sum of the separately filtered blocks. The blocks overlap because of the ``ringing'' of the filter. For a zero-phase filter, each block overlaps with both of its neighboring blocks. For causal filters, each block overlaps only with its neighbor to the right (the next block in time). The fact that signal blocks overlap and must be added together (instead of simply abutted) is the source of the name overlap-add method for FFT convolution of long sequences [7,9].

The idea of processing input blocks separately can be extended also to both operands of a convolution (both $ x$ and $ h$ in $ x\ast h$ ). The details are a straightforward extension of the single-block-signal case discussed below.

When simple FFT convolution is being performed between a signal $ x$ and FIR filter $ h$ , there is no reason to use a non-rectangular window function on each input block. A rectangular window length of $ M$ samples may advance $ M$ samples for each successive frame (hop size $ =M$ samples). In this case, the input blocks do not overlap, while the output blocks overlap by the FIR filter length (minus one sample). On the other hand, if nonlinear and/or time-varying spectral modifications to be performed, then there are good reasons to use a non-rectangular window function and a smaller hop size, as we will develop below.

Overlap-Add Decomposition

Consider breaking an input signal $ x$ into frames using a finite, zero-phase, length $ M$ window $ w$ . Then we may express the $ m$ th windowed data frame as

$\displaystyle x_m(n) \isdefs x(n)w(n-mR),\quad n \in (-\infty, +\infty)$ (9.17)


$\displaystyle x_m \isdefs x \cdot \hbox{\sc Shift}_{mR}(w)$ (9.18)


R &\isdef & \hbox{frame step (hop size)}\\
m &\isdef & \hbox{frame index}

The hop size is the number of samples between the begin-times of adjacent frames. Specifically, it is the number of samples by which we advance each successive window.

Figure 8.8 shows an input signal (top) and three successive windowed data frames using a length $ M=128$ causal Hamming window and 50% overlap ($ R=M/2=64$ ).

Figure 8.8: Input signal (top) and three successive windowed data frames.
\includegraphics[width=\textwidth ]{eps/windsig}

For frame-by-frame spectral processing to work, we must be able to reconstruct $ x$ from the individual overlapping frames, ideally by simply summing them in their original time positions. This can be written as

x(n) &=& \sum_{m=-\infty}^{\infty} x_m(n) \\
&=& \sum_{m=-\infty}^{\infty} x(n) w(n-mR) \\
&=& x(n) \sum_{m=-\infty}^{\infty} w(n-mR)

Hence, $ x= \sum_{m} x_m$ if and only if

$\displaystyle \zbox {\sum_{m\in{\bf Z}} w(n-mR) = 1, \,\forall n\in{\bf Z}.}$ (9.19)

This is the constant-overlap-add (COLA)9.6 constraint for the FFT analysis window $ w$ . It has also been called the partition of unity property.

Figure 8.9: Example of 50% overlap-add for the Bartlett (triangular) window

Figure 8.9 illustrates the appearance of 50% overlap-add for the Bartlett (triangular) window. The Bartlett window is clearly COLA for a wide variety of hop sizes, such as $ M/2$ , $ M/3$ , and so on, provided $ M/k$ is an integer (otherwise the underlying continuous triangular window must be resampled). However, when using windows defined in a library, the COLA condition should be carefully checked. For example, the following Matlab/Octave script shows that there is a problem with the standard Hamming window:

M = 33;          % window length
R = (M-1)/2;     % hop size
N = 3*M;         % overlap-add span
w = hamming(M);  % window
z = zeros(N,1);  plot(z,'-k');  hold on;  s = z;
for so=0:R:N-M
  ndx = so+1:so+M;        % current window location
  s(ndx) = s(ndx) + w;    % window overlap-add
  wzp = z; wzp(ndx) = w;  % for plot only
  plot(wzp,'--ok');       % plot just this window
plot(s,'ok');  hold off;  % plot window overlap-add
The figure produced by this matlab code is shown in Fig.8.10. As can be seen, the equal end-points sum to form an impulse in each frame of the overlap-add.

Figure 8.10: Overlap-add example for the default Hamming window in Matlab.

The Matlab window functions (such as hamming) have an optional second argument which can be either 'symmetric' (the default), or 'periodic'. The periodic case is equivalent to

w = hamming(M+1); % symmetric case
w = w(1:M);       % delete last sample for periodic case
The periodic variant solves the non-constant overlap-add problem for even $ M$ and $ R=M/2$ , but not for odd $ M$ . The problem can be solved for odd $ M$ and $ R=(M-1)/2$ while retaining symmetry as follows:
w = hamming(M); % symmetric case
w(1) = w(1)/2;  % repair constant-overlap-add for R=(M-1)/2
w(M) = w(M)/2;
Since different window types may add or subtract 1 to/from $ M$ internally, it is best to check the result using test code as above to make sure the window is COLA at the desired hop size. E.g., in Matlab:
  • hamming(M) $ \isdef $
    .54 - .46*cos(2*pi*(0:M-1)'/(M-1));
    gives constant overlap-add for $ R=(M-1)/2$ , $ (M-1)/4$ , etc., when endpoints are divided by 2 or one endpoint is zeroed
  • hanning(M) $ \isdef $
    .5*(1 - cos(2*pi*(1:M)'/(M+1)));
    does not give constant overlap-add for $ R=(M-1)/2$ , but does for $ R=(M+1)/2$

  • blackman(M) $ \isdef $
    .42 - .5*cos(2*pi*m)' + .08*cos(4*pi*m)';
    where m = (0:M-1)/(M-1), gives constant overlap-add for $ R=(M-1)/3$ when $ M$ is odd and $ R$ is an integer, and $ R=M/3$ when $ M$ is even and $ R$ is integer.

In summary, all windows obeying the constant-overlap-add constraint will yield perfect reconstruction of the original signal $ x$ from the data frames $ x_m = x\cdot\hbox{\sc Shift}_{mR}(w)$ by overlap-add (OLA). There is no constraint on window type, only that the window overlap-adds to a constant for the hop size used. In particular, $ R=1$ always yields a constant overlap-add for any window function. We will learn later (§8.3.1) that there is also a simple frequency-domain test on the window transform for the constant overlap-add property.

To emphasize an earlier point, if simple time-invariant FIR filtering is being implemented, and we don't need to work with the intermediate STFT, it is most efficient to use the rectangular window with hop size $ R=M$ , and to set $ M=N-L+1$ , where $ L$ is the length of the filter $ h$ and $ N$ is a convenient FFT size. The optimum $ M$ for a given $ L$ is an interesting exercise to work out.

COLA Examples

So far we've seen the following constant-overlap-add examples:

  • Rectangular window at 0% overlap (hop size $ R$ = window size $ M$ )
  • Bartlett window at 50% overlap ( $ R\approx M/2$ ) (Since normally $ M$ is odd, `` $ R\approx M/2$ '' means ``R=(M-1)/2,'' etc.)
  • Hamming window at 50% overlap ( $ R\approx M/2$ )
In addition, we can mention the following cases (referring to window types discussed in Chapter 3):
  • Rectangular window at 50% overlap ( $ R\approx M/2$ )
  • Hamming window at 75% overlap ($ R=M/4=25$ % hop size)
  • Any member of the Generalized Hamming family at 50% overlap
  • Any member of the Blackman family at 2/3 overlap (1/3 hop size); e.g., blackman(33,'periodic'), $ R=11$
  • Any member of the $ L$ -term Blackman-Harris family with $ R\approx M/L$ .
  • Any window with R=1 (``sliding FFT'')
Recall from §3.2.6, that many audio coders use the MLT sine window. The window is applied twice: once before the FFT (the ``analysis window'') and secondly after the inverse FFT prior to reconstruction by overlap-add (the so-called ``synthesis window''). Since the window is effectively squared, it functions as a Hann window for overlap-add purposes (a member of the Generalized Hamming family). As such, it can be used with 50% overlap. More generally, any positive COLA window can be split into an analysis and synthesis window pair by taking its square root.

STFT of COLA Decomposition

To represent practical FFT implementations, it is preferable to shift the $ m^{th}$ frame back to the time origin:

$\displaystyle {\tilde x}_m(n) \isdef x_m(n+mR) \eqsp \hbox{\sc Shift}_{-mR,n}(x_m)$ (9.20)

This is summarized in Fig.8.11. Zero-based frames are needed because the leftmost input sample is assigned to time zero by FFT algorithms. In other words, a hopping FFT effectively redefines time zero on each hop. Thus, a practical STFT is a sequence of FFTs of the zero-based frames $ {\tilde x}_0, {\tilde x}_1, \ldots$ . On the other hand, papers in the literature (such as [7,9]) work with the fixed time-origin case ( $ x_0, x_1,
\ldots$ ). Since they differ only by a time shift, it is not hard to translate back and forth.

% latex2html id marker 21934\psfrag{x}{$x$}\psfrag{Zero-centered 3rd frame x_3: M = 64, R = M/2}%
{\normalsize Zero-centered 3rd frame $x_3$: $M = 64$, $R = M/2$}\psfrag{x_3}{$x_3$} % doesn't work\psfrag{xtilde_3}{${\tilde x}_3$}\begin{figure}[htbp]
\caption{Input signal $x$\ (top), third frame
$x_3$\ in its natural time location (middle), and the third frame
shifted to time 0, ${\tilde x}_3$\ (bottom).}

Note that we may sample the DTFT of both $ x_m$ and $ {\tilde x}_m$ , because both are time-limited to $ M$ nonzero samples. The minimum information-preserving sampling interval along the unit circle in both cases is $ \Omega_M \isdeftext 2\pi/M$ . In practice, we often oversample to some extent, using $ \Omega_N$ with $ N>M$ instead. For $ {\tilde x}_m$ , we get

$\displaystyle {\tilde X}_m(\omega_k)$ $\displaystyle \isdef$ $\displaystyle \hbox{\sc Sample}_{\Omega_N,k}\left(\hbox{\sc DTFT}\left({\tilde x}_m\right)\right)$  
  $\displaystyle =$ $\displaystyle \hbox{\sc DFT}_{N,k}({\tilde x}_m),
\protect$ (9.21)

where $ \omega_k \isdef 2\pi k/N = k\Omega_N$ . For $ x_m$ we have

X_m(\omega_k) &\isdef & \hbox{\sc Sample}_{\Omega_N,k}\left(\hbox{\sc DTFT}(x_m)\right)\\
&\longleftrightarrow& \hbox{\sc Alias}_N(x_m)\\
&\neq& {\tilde x}_m \; \hbox{(in general).}

Since $ {\tilde x}_m = \hbox{\sc Shift}_{-mR}(x_m)$ , their transforms are related by the shift theorem:

{\tilde X}_m(\omega_k) &=& e^{jmR\omega_k} X_m(\omega_k) \\
{\tilde x}_m(n) &=& \hbox{\sc Alias}_{N,n+mR}(x_m)\\
&=& x_m(n+mR)_N

where $ (n+mR)_N$ denotes modulo $ N$ indexing (appropriate since the DTFTs have been sampled at intervals of $ \Omega_N = 2\pi/N$ ).

Acyclic Convolution

Getting back to acyclic convolution, we may write it as

y(n) &=& (x * h)(n), \quad n\in{\bf Z}\\
&=& \left ( \left( \sum_m x_m \right) * h \right)(n) \\
&=& \sum_m (x_m * h)(n) \qquad \hbox{(by linearity of convolution)}\\
&=& \sum_m (( \hbox{\sc Shift}_{mR}({\tilde x}_m)) * h)(n) \\
&=& \sum_m \sum_l \hbox{\sc Shift}_{mR,l}({\tilde x}_m)h(n-l) \\
&=& \sum_m \sum_l {\tilde x}_m(l-mR)h(n-l) \\
&=& \sum_m \sum_{l^{\prime}} {\tilde x}_m(l^{\prime})h(n-mR-l^{\prime})) \\
&=& \sum_m \hbox{\sc Shift}_{mR,n}( {\tilde x}_m * h) \\
&=& \sum_m \hbox{\sc Shift}_{mR} ( \hbox{\sc DTFT}^{-1}( {\tilde X}_m \cdot H) ) \\

Since $ {\tilde x}_m$ is time limited to $ [0,\ldots,M-1]$ (or $ [-(M-1)/2,(M-1)/2]$ ), $ {\tilde X}_m$ can be sampled at intervals of $ \Omega_M = 2\pi/M$ without time aliasing. If $ h$ is time-limited to $ [0,L-1]$ , then $ {\tilde x}_m * h$ will be time limited to $ M+L-1$ . Therefore, we may sample $ {\tilde X}_m\cdot H$ at intervals of

$\displaystyle \Omega_{M+L-1} = \frac{2 \pi}{L+M-1}$ (9.22)

or less along the unit circle. This is the dual of the usual sampling theorem.

We conclude that practical FFT acyclic convolution may be carried out using an FFT of any length $ N$ satisfying

$\displaystyle N \ge M+L-1,$ (9.23)

where $ M$ is the frame size and $ L$ is the filter length. Our final expression for $ y(n)$ is

y(n) &=&
\sum_m \hbox{\sc Shift}_{mR,n} \left[\frac{1}{N} \sum_{k=0}^{N-1}
{\tilde H}(\omega_k) {\tilde X}_m(\omega_k) e^{j\omega_k n T}\right]\\
\sum_m \hbox{\sc Shift}_{mR,n}\left\{ \hbox{\sc IFFT}_N[\hbox{\sc FFT}_N({\tilde x}_m)\cdot \hbox{\sc FFT}_N(h)]\right\},

where $ {\tilde X}_m$ is the length $ N$ DFT of the zero-padded $ m^{\mbox{th}}$ frame $ {\tilde x}_m$ , and $ {\tilde H}$ is the length $ N$ DFT of $ h$ , also zero-padded out to length $ N$ , with $ N\geq L+M-1$ .

Note that the terms in the outer sum overlap when $ R<M$ even if $ H(\omega_k)\equiv1$ . In general, an LTI filtering by $ H$ increases the amount of overlap among the frames.

This completes our derivation of FFT convolution between an indefinitely long signal $ x(n)$ and a reasonably short FIR filter $ h(n)$ (short enough that its zero-padded DFT can be practically computed using one FFT).

The fast-convolution processor we have derived is a special case of the Overlap-Add (OLA) method for short-time Fourier analysis, modification, and resynthesis. See [7,9] for more details.

Example of Overlap-Add Convolution

Let's look now at a specific example of FFT convolution:

We will work through the matlab for this example and display the results. First, the simulation parameters:

L = 31;         % FIR filter length in taps
fc = 600;       % lowpass cutoff frequency in Hz
fs = 4000;      % sampling rate in Hz

Nsig = 150;     % signal length in samples
period = round(L/3); % signal period in samples
FFT processing parameters:
M = L;                  % nominal window length
Nfft = 2^(ceil(log2(M+L-1))); % FFT Length
M = Nfft-L+1            % efficient window length
R = M;                  % hop size for rectangular window
Nframes = 1+floor((Nsig-M)/R);  % no. complete frames
Generate the impulse-train test signal:
sig = zeros(1,Nsig);
sig(1:period:Nsig) = ones(size(1:period:Nsig));
Design the lowpass filter using the window method:
epsilon = .0001;     % avoids 0 / 0
nfilt = (-(L-1)/2:(L-1)/2) + epsilon;
hideal = sin(2*pi*fc*nfilt/fs) ./ (pi*nfilt);
w = hamming(L); % FIR filter design by window method
h = w' .* hideal; % window the ideal impulse response

hzp = [h zeros(1,Nfft-L)];  % zero-pad h to FFT size
H = fft(hzp);               % filter frequency response
Carry out the overlap-add FFT processing:
y = zeros(1,Nsig + Nfft); % allocate output+'ringing' vector
for m = 0:(Nframes-1)
    index = m*R+1:min(m*R+M,Nsig); % indices for the mth frame
    xm = sig(index);  % windowed mth frame (rectangular window)
    xmzp = [xm zeros(1,Nfft-length(xm))]; % zero pad the signal
    Xm = fft(xmzp);
    Ym = Xm .* H;               % freq domain multiplication
    ym = real(ifft(Ym))         % inverse transform
    outindex = m*R+1:(m*R+Nfft);
    y(outindex) = y(outindex) + ym; % overlap add

The time waveforms for the first three frames ($ m=0,1,2$ ) are shown in Figures 8.12 through 8.14. Notice how the causal linear-phase filtering results in an overall signal delay by half the filter length. Also, note how frames 0 and 2 contain four impulses, while frame 1 only happens to catch three; this causes no difficulty, and the filtered result remains correct by superposition.

Figure 8.12: OLA Example, Frame 0.
Figure 8.13: OLA Example, Frame 1.
Figure 8.14: OLA Example, Frame 2.

Summary of Overlap-Add FFT Processing

Overlap-add FFT processors provide efficient implementations for FIR filters longer than 100 or so taps on single CPUs. Specifically, we ended up with:

$\displaystyle y = \sum_{m=-\infty}^\infty \hbox{\sc Shift}_{mR} \left( \hbox{\sc DFT}_N^{-1} \left\{ H \cdot \hbox{\sc DFT}_N\left[\hbox{\sc Shift}_{-mR}(x)\cdot w \right]\right\}\right)$ (9.24)

where $ \hbox{\sc Shift}()$ is acyclic in this context. Stated as a procedure, we have the following steps in an overlap-add FFT processor:
Extract the $ m$ th length $ M$ frame of data at time $ mR$ .
Shift it to the base time interval $ [0,M-1]$ (or $ [-(M-1)/2,(M-1)/2]$ ).
Optionally apply a length $ M$ analysis window $ w$ (causal or zero phase, as preferred). For simple LTI filtering, the rectangular window is fine.
Zero-pad the windowed data out to the FFT size $ N$ (a power of 2), such that $ N\geq M+L-1$ , where $ L$ is the FIR filter length.
Take the $ N$ -point FFT.
Apply the filter frequency-response $ H=\hbox{\sc FFT}_N(h)$ as a windowing operation in the frequency domain.
Take the $ N$ -point inverse FFT.
Shift the origin of the $ N$ -point result out to sample $ mR$ where it belongs.
Sum into the output buffer containing the results from prior frames (OLA step).
The condition $ N\geq M+L-1$ is necessary to avoid time aliasing, i.e., to implement acyclic convolution using an FFT; this condition is equivalent to a minimum sampling-rate requirement in the frequency domain.

A second condition is that the analysis window be COLA at the hop size used:

$\displaystyle \sum_m w(n-mR) = 1, \, \forall n\in{\bf Z}.$ (9.25)

Dual of Constant Overlap-Add

In this section, we will derive the Fourier dual of the Constant OverLap-Add (COLA) condition for STFT analysis windows (discussed in §7.1). Recall that for perfect reconstruction using a hop-size of $ R$ samples, the window must be $ \hbox{\sc Cola}(R)$ . We will find that the equivalent frequency-domain condition is that the window transform must have spectral zeros at all frequencies which are a nonzero multiple of $ 2\pi/R$ . Following established nomenclature for filter banks, we will say that such a window transform is $ \hbox{\sc Nyquist}(2\pi/R)$ .

Poisson Summation Formula

Consider the summation of N complex sinusoids having frequencies uniformly spaced around the unit circle [264]:

x(n) &\mathrel{\stackrel{\Delta}{=}}& \frac{1}{N} \sum_{k=0}^{N-1}e^{j\omega_kn} =
1 & n=0 \quad (\hbox{\sc mod}\ N) \\
0 & \mbox{elsewhere} \\
\end{array} \right. \\ [5pt]
&=& \hbox{\sc IDFT}_n(1 \cdots 1)

where $ \omega_k \mathrel{\stackrel{\Delta}{=}}2\pi k / N$ .

% latex2html id marker 22700\psfrag{d(n)}{\normalsize $\displaystyle\normalsize \sum_l\delta(n-lN)$\ }\psfrag{\makebox[0pt][l]{2}N}{\normalsize $-2N$}\psfrag{\makebox[0pt][l]{N}}{\normalsize $-N$}\psfrag{0}{\normalsize $0$}\psfrag{N}{\normalsize $N$}\psfrag{n}{\normalsize $n$}\psfrag{2N}{\normalsize $2N$}\psfrag{1}{\normalsize $1$}\begin{figure}[htbp]
\caption{Discrete-time impulse train created
by a sum of sampled complex sinusoids.}

Setting $ N=R$ (the FFT hop size) gives

$\displaystyle \zbox {\sum_m \delta(n-mR) = \frac{1}{R} \sum_{k=0}^{R-1}e^{j\omega_kn}}$ (9.26)

where $ \omega_k \mathrel{\stackrel{\Delta}{=}}2\pi k/R$ (harmonics of the frame rate).

Let us now consider these equivalent signals as inputs to an LTI system, with an impulse response given by $ w(n)$ , and frequency response equal to $ W(\omega)$ .

% latex2html id marker 22729\psfrag{w(n)}{\normalsize $w(n)$\ }\psfrag{W(w)}{\normalsize $W(\omega)$\ }\psfrag{timesum}{\normalsize $\displaystyle\sum_l\delta(n-lR)$\ }\psfrag{freqsum}{\normalsize $\frac{1}{R}\displaystyle\sum_k e^{j\omega_kn}$\ }\psfrag{t}{\normalsize $\displaystyle\sum_l w(n-lR)$\ }\psfrag{f}{\normalsize $\frac{1}{R}\displaystyle\sum_k W(\omega_k)e^{j\omega_kn}$\ }\psfrag{time}{\normalsize Time}\psfrag{freq}{\normalsize Frequency}\begin{figure}[htbp]
\caption{Linear systems theory proof of the Poisson summation formula.}

Looking across the top of Fig.8.16, for the case of input signal $ \sum_m \delta(n-mR)$ we have

$\displaystyle y(n) = \sum_m w(n-mR).$ (9.27)

Looking across the bottom of the figure, for the case of input signal

$\displaystyle x(n) = \frac{1}{R} \sum_{k=0}^{R-1}e^{j\omega_kn},$ (9.28)

we have the output signal

$\displaystyle y(n) = \frac{1}{R} \sum_{k=0}^{R-1} W(\omega_k)e^{j\omega_kn}.$ (9.29)

This second form follows from the fact that complex sinusoids $ e^{j\omega_kn}$ are eigenfunctions of linear systems--a basic result from linear systems theory [264,263].

Since the inputs were equal, the corresponding outputs must be equal too. This derives the Poisson Summation Formula (PSF):

$\displaystyle \zbox {\underbrace{\sum_m w(n-mR)}_{\hbox{\sc Alias}_R(w)} = \underbrace{\frac{1}{R}\sum_{k=0}^{R-1} W(\omega_k)e^{j\omega_k n}}_{\hbox{\sc DFT}_R^{-1} \left[\hbox{\sc Sample}_{\frac{2\pi}{R}}(W)\right]}} \quad \omega_k \isdef \frac{2\pi k}{R} \protect$ (9.30)

Note that the PSF is the Fourier dual of the sampling theorem [270], [264, Appendix G].

The continuous-time PSF is derived in §B.15.

Frequency-Domain COLA Constraints

Recall that for error-free OLA processing, we required the constant-overlap-add (COLA) window constraint:

$\displaystyle \sum_m w(n-mR) = \hbox{constant}$ (9.31)

Thanks to the PSF, we may now express the COLA constraint in the frequency domain:

$\displaystyle \zbox {w\in\hbox{\sc Cola}(R) \;\Leftrightarrow\; W(\omega_k) = 0, \quad \vert k\vert = 1,2, \dots, R-1}$ (9.32)

In other terms,
$\textstyle \parbox{0.8\textwidth}{A window $w$\ gives constant overlap-add at
hop-size $R$\ if and only if the window transform $W$\ is \emph{zero at
all harmonics of the frame rate} $2\pi/R$.}$


$\displaystyle \zbox {w \in \hbox{\sc Cola}(R) \quad \Leftrightarrow \quad W \in \hbox{\sc Nyquist}(2\pi/R)} \protect$ (9.33)

The ``Nyquist($ \Omega_R$ )'' property for a function $ W$ simply means that $ W$ is zero at all nonzero multiples of $ \Omega_R$ (all harmonics of the frame rate here).

We may also refer to (8.33) as the ``weak COLA constraint'' in the frequency domain. It gives necessary and sufficient conditions for perfect reconstruction in overlap-add FFT processors. However, when the short-time spectrum is being modified, these conditions no longer apply, and a stronger COLA constraint is preferable.

Strong COLA

An overly strong (but sufficient) condition is to require that the window transform $ W(\omega)$ be bandlimited consistent with downsampling by $ R$ :

$\displaystyle W(\omega) = 0, \quad \vert\omega\vert \geq \pi/R
\quad\quad\hbox{(sufficient for COLA)}

This condition is sufficient, but not necessary, for perfect OLA reconstruction. Strong COLA implies weak COLA, but it cannot be achieved exactly by finite-duration window functions.

When either of the strong or weak COLA conditions are met, we have

$\displaystyle \zbox {\sum_m w(n-mR) = \frac{1}{R} W(0)}$ (9.34)

That is, the overlap-add of the window $ w$ at hop-size $ R$ is equal numerically to the dc gain of the window divided by $ R$ .

PSF Dual and Graphical Equalizers

Above, we used the Poisson Summation Formula to show that the constant-overlap-add of a window in the time domain is equivalent to the condition that the window transform have zero-crossings at all harmonics of the frame rate. In this section, we look briefly at the dual case: If the window transform is COLA in the frequency domain, what is the corresponding property of the window in the time domain? As one should expect, being COLA in the frequency domain corresponds to having specific uniform zero-crossings in the time domain.

Bandpass filters that sum to a constant provides an ideal basis for a graphic equalizer. In such a filter bank, when all the ``sliders'' of the equalizer are set to the same level, the filter bank reduces to no filtering at all, as desired.

Let $ N$ denote the number of (complex) filters in our filter bank, with pass-bands uniformly distributed around the unit circle. (We will be using an FFT to implement such a filter bank.) Denote the frequency response of the ``dc channel'' by $ W(e^{j\omega T})$ . Then the constant overlap-add property of the $ N$ -channel filter bank can be expressed as

$\displaystyle W \in \hbox{\sc Cola}(2\pi/N)$ (9.35)

which means

$\displaystyle S(\omega) = \sum_{k=0}^{N-1} W\left(e^{j(\omega-\omega_k)T}\right) = \hbox{constant}$ (9.36)

where $ \omega_k T\isdef k\cdot 2\pi/N$ as usual. By the dual of the Poisson summation formula, we have

$\displaystyle \zbox {W \in \hbox{\sc Cola}(2\pi/N) \quad \Leftrightarrow \quad w \in \hbox{\sc Nyquist}(N)} \protect$ (9.37)

where $ w\in \hbox{\sc Nyquist}(N)$ means that $ w$ is zero at all nonzero integer multiples of $ N$ , i.e.,

$\displaystyle w(n)=0, \quad n=\pm N,\pm 2N, \pm 3N, \ldots\,.$ (9.38)

Thus, using the dual of the PSF, we have found that a good $ N$ -channel equalizer filter bank can be made using bandpass filters which have zero-crossings at multiples of $ N$ samples, because that property guarantees that the filter bank sums to a constant frequency response when all channel gains are equal.

The duality introduced in this section is the basis of the Filter-Bank Summation (FBS) interpretation of the short-time Fourier transform, and it is precisely the Fourier dual of the OverLap-Add (OLA) interpretation [9]. The FBS interpretation of the STFT is the subject of Chapter 9.

PSF and Weighted Overlap Add

Using ``square-root windows'' $ \sqrt{w}$ in the WOLA context, the valid hop sizes $ R$ are identical to those for $ w$ in the OLA case. More generally, given any window $ w(n)$ for use in a WOLA system, it is of interest to determine the hop sizes which yield perfect reconstruction.

Recall that, by the Poisson Summation Formula (PSF),

$\displaystyle \zbox {\underbrace{\sum_m w(n-mR)}_{\hbox{\sc Alias}_R(w)} = \underbrace{\frac{1}{R}\sum_{k=0}^{R-1} W(\omega_k)e^{j\omega_k n}}_{\hbox{\sc DFT}_R^{-1} \left[\hbox{\sc Sample}_{\frac{2\pi}{R}}(W)\right]}} \quad \omega_k \isdef \frac{2\pi k}{R} \protect$ (9.39)

For WOLA, this is easily modified to become

$\displaystyle \zbox {\underbrace{\sum_m w(n-mR)f(n-mR)}_{\hbox{\sc Alias}_R(w\cdot f)} = \underbrace{\frac{1}{R}\sum_{k=0}^{R-1} (W\ast F)(\omega_k)e^{j\omega_k n}}_{\hbox{\sc DFT}_R^{-1} \left[\hbox{\sc Sample}_{\frac{2\pi}{R}}(W\ast F)\right]}} \quad \omega_k \isdef \frac{2\pi k}{R}$ (9.40)

where $ w(n)$ is the analysis window and $ f(n)$ is the synthesis window.

When $ w=f$ , this becomes

$\displaystyle \underbrace{\sum_m w^2(n-mR)}_{\hbox{\sc Alias}_R(w^2)} = \underbrace{\frac{1}{R}\sum_{k=0}^{R-1} (W\ast W)(\omega_k)e^{j\omega_k n}}_{\hbox{\sc DFT}_R^{-1} \left[\hbox{\sc Sample}_{\frac{2\pi}{R}}(W\ast W)\right]}, \quad \omega_k \isdef \frac{2\pi k}{R}$ (9.41)

Example COLA Windows for WOLA

In a weighted overlap-add system, the following windows can be used to satisfy the constant-overlap-add condition:

  • For the rectangular window, $ w^2(n)=w(n)$ , and $ W\ast W = W$ (since $ W(\omega_k)$ is a sinc function which reduces to $ \delta(\omega_k)$ when $ \omega_k = 2\pi k / M$ , and $ \delta\ast \delta = \delta$ .

  • For the Hamming window, the critically sampled window transform has three nonzero samples (where the rectangular-window transform has one). Therefore, $ W\ast W$ has $ 3+3-1=5$ nonzero samples at critical sampling. Measuring main-lobe width from zero-crossing to zero-crossing as usual, we get $ 6\cdot 2\pi/M$ radians per sample, or ``6 side lobes'', for the width of $ W\ast W$ .

  • The squared-Blackman window transform width is $ (5+5-1)+1=10$ .

  • The square of a length $ M$ $ L$ -term Blackman-Harris-family window (where rect is $ L=1$ , Hann is $ L=2$ , etc.) has a main lobe of width $ (2(2L-1)-1)+1=4L-2$ , measured from zero-crossing to zero-crossing in ``side-lobe units'' ($ 2\pi/M$ ). This is up from $ (2L-1)+1=2L$ for the original $ L$ -term window.

  • The width of the main lobe can be used to determine the hop size in the STFT, as will be discussed further in Chapter 9.

Note that we need only find the first zero-crossing in the window transform for any member of the Blackman-Harris window family (Chapter 3), since nulls at all harmonics of that frequency will always be present (at multiples of $ 2\pi/M$ ).

Overlap-Save Method

The classical overlap-save method [198,277], unlike OLA, uses no zero padding to prevent time aliasing. Instead, it

discards output samples corrupted by time aliasing each frame, and
overlaps the input frames by the same amount.

More specifically:

  • If the input frame size is $ N$ and the filter length is $ L<N$ , then a length $ N$ FFT and IFFT are used.
  • As a result, $ L-1$ samples of the output are invalid due to time aliasing.
  • The overlap-save method writes out the good $ R=N-(L-1)$ samples and uses a hop size of $ R$ , thus recomputing the time-aliased output samples in the previous frame.
The name ``overlap-save'' comes from the fact that $ L-1$ samples of the previous frame are ``saved'' for computing the next frame.

Time Varying OLA Modifications

In the preceding sections, we assumed that the spectral modification $ H$ did not vary over time. We will now examine the implications of time-varying spectral modifications. The derivation below follows [9], except that we'll keep our previous notation:

X_m(\omega_k) &=& \hbox{sampled DTFT (FFT) of $m$th input frame, $k=0,1,\ldots,N-1$}\\
H_m(\omega_k) &=& \hbox{time varying spectral modification (new each frame)}\\
Y_m(\omega_k) &=& \hbox{$X_m(\omega_k) H_m(\omega_k) = m$th output spectrum}\\
\omega_k &=&\hbox{ $2\pi k / N$\ = $k$th spectral sample}\\
N &=& \hbox{FFT length}\\
M &=& \hbox{window $w$\ length: $x_m(n) = x(n)w(n-m)$}\\
L &=& \hbox{\emph{maximum} length of FIR filter $h_m$\ applied to each frame}\\
N &\ge& \hbox{ $M+L-1$\ to avoid time aliasing in $y_m$}

Using $ H_m$ in our OLA formulation with a hop size $ R=1$ results in

y(n) &=& \sum_{m=-\infty}^\infty y_m(n) \\
&=& \sum_{m=-\infty}^\infty \frac{1}{N}\sum_{k=0}^{N-1} X_m(\omega_k) H_m(\omega_k) e^{j\omega_kn} \\
&=& \sum_{m=-\infty}^\infty \frac{1}{N}\sum_{k=0}^{N-1}
\left[ \sum_{l=-\infty}^\infty x(l) w(l-m)e^{-j\omega_kl} \right]
H_m(\omega_k) e^{j\omega_kn} \\
&=& \sum_{l=-\infty}^\infty x(l) \sum_{m=-\infty}^\infty w(l-m)
\frac{1}{N}\sum_{k=0}^{N-1} H_m(\omega_k)
e^{j\omega_k(n-l)} \\
&=& \sum_{l=-\infty}^\infty x(l)
\sum_{m=-\infty}^\infty w(l-m) h_m(n-l) \\

Define $ r \mathrel{\stackrel{\Delta}{=}}n-l \;\Rightarrow\; l = n-r$ to get

$\displaystyle y(n)=\sum_{r=-\infty}^\infty x(n-r) \sum_{m=-\infty}^\infty h_m(r) w(n-r-m).$ (9.42)

Let's examine the term $ \displaystyle\sum_{m=-\infty}^\infty h_m(r) w(
n-r-m )$ in more detail:
  • $ h_m(r)$ describes the time variation of the $ r^{th}$ tap.
  • $ \sum_{m=-\infty}^\infty h_m(r) w[(n-r)-m] = [h_{(\cdot)}(r) \ast w](n-r)$ is a filtered version of the $ r^{th}$ tap $ h_m(r)$ . It is lowpass-filtered by w and delayed by $ r$ samples.
  • Denote the $ r$ th time-varying, lowpass-filtered, delayed-by-$ r$ filter tap by $ {\hat h}_{n-r}(r)$ . This can be interpreted as the weighting in the output at time $ r$ of an impulse entering the time-varying filter at time $ n-r$ .
Using this, we get

y(n) &=& \sum_{r=-\infty}^\infty x(n-r) {\hat h}_{n-r}(r) \\
&=& x(n) {\hat h}_n(0) \\
& & + x(n-1) {\hat h}_{n-1}(1) + x(n-2) {\hat h}_{n-2}(2) + \cdots \\
& & + x(n+1) {\hat h}_{n+1}(-1) + x(n+2) {\hat h}_{n+2}(-2) + \cdots

This is a superposition sum for an arbitrary linear, time-varying filter $ {\hat h}_{n-r}(r) = [h_{(\cdot)}(r) \ast w](n-r)$ .

Block Diagram Interpretation of Time-Varying STFT Modifications

Assuming $ {\hat h}$ is causal gives

y(n) &=& \sum_{r=0}^\infty x(n-r) {\hat h}_{n-r}(r) \\
&=& x(n) {\hat h}_n(0) + x(n-1) {\hat h}_{n-1}(1) + x(n-2) {\hat h}_{n-2}(2) + \cdots

This is depicted in Fig.8.17.

% latex2html id marker 23334\psfrag{zm1}{\large $z^{-1}$\ }\psfrag{h(0,n)}{\large$ h_n(0) $}\psfrag{h(1,n)}{\large$ h_{n-1}(1) $}\psfrag{h(2,n)}{\large$ h_{n-L+1}(L-1) $}\psfrag{+}{\large$\Sigma$}\psfrag{w(n)}{\large$ w $}\psfrag{y(n)}{\large$ y(n) $}\begin{figure}[htbp]
\caption{System diagram giving
an interpretation of the bandlimited time-varying filter coefficients
in the overlap-add STFT processor with a new filter each frame.}

The term $ h_n(k)$ can be interpreted as the FIR filter tap $ k$ at time $ n$ . Note how each tap is lowpass filtered by the FFT window $ w$ . The window thus enforces bandlimiting each filter tap to the bandwidth of the window's main lobe. For an $ L$ -term length-$ M$ Blackman-Harris window, for example, the main-lobe reaches zero at frequency $ L\Omega_M=2\pi L/M$ (see Table 5.2 in §5.5.2 for other examples). This bandlimiting places a limit on the bandwidth expansion caused by time-variation of the filter coefficients, which in turn places a limit on the maximum STFT hop-size that can be used without frequency-domain aliasing. See Allen and Rabiner 1977 [9] for further details on the bandlimiting property.

Length L FIR Frame Filters

To avoid time aliasing, we restrict the filter length to a maximum of $ L$ samples. Since $ H_m(\omega_k)$ is an arbitrary multiplicative weighting of the $ m$ th spectral frame, the frame filter need not be causal. For odd $ L$ , the filter impulse response indices may run from $ -L_h$ to $ L_h$ , where

$\displaystyle L_h \isdef \frac{L-1}{2}$ (9.43)

This gives

y(n) &=& \sum_{r=-L_h}^{L_h} x(n-r) {\hat h}_{n-r}(r) \\
&=& x(n) {\hat h}_n(0) \\
& & + x(n-1) {\hat h}_{n-1}(1) + \cdots + x(n-L_h) {\hat h}_{n-L_h}(L_h) \\
& & + x(n+1) {\hat h}_{n+1}(-1) + \cdots + x(n+L_h) {\hat h}_{n+L_h}(-L_h)

This is the general length $ L$ time-varying FIR filter convolution sum for time $ n$ , when $ L$ is odd.

Weighted Overlap Add

In the weighted overlap add (WOLA) method, we apply a second window after the inverse DFT [49] and prior to the final overlap-add to create the output signal. Such a window can be called a ``synthesis window,'' ``postwindow,'' or simply ``output window.''

Output windows are important in audio compression applications for minimizing ``blocking effects.'' The synthesis window ``fades out'' any spectral coding error at the frame boundaries, thereby suppressing audible discontinuities.

Output windows are not used in simple FFT convolution processors because the input frames are supposed to be expanded by the convolution, and a synthesis window would ``pinch off'' the ``filter ringing'' from each block, yielding incorrect results. Output windows can always be used in conjunction with spectral modifications made by means of the ``filter bank summation'' (FBS) method, which is the subject of the next chapter.

The WOLA method is most useful for nonlinear ``instantaneous'' FFT processors such as

In these and other nonlinear signal processing operations, the output window helps to suppress artifacts caused by nonlinear spectral modifications. Another common factor in these applications is that filtering effects are not desired, so no provision for ``ringing'' in the time domain is necessary. In other words, WOLA is good for ``instantaneous nonlinear spectral processing.''

WOLA Processing Steps

The sequence of operations in a WOLA processor can be expressed as follows:

  1. Extract the $ m$ th windowed frame of data $ x_m(n)=x(n)w(n-mR)$ , $ n=mR,\ldots,mR+N-1$ (assuming a length $ M\leq N$ causal window $ w$ and hop size $ R$ ).

  2. Take an FFT of the $ m$ th frame translated to time zero,
    $ {\tilde x}_m(n)=x_m(n+mR)$ , to produce the $ m$ th spectral frame
    $ {\tilde X}_m(\omega_k)$ , $ k=0,\ldots,N-1$ .

  3. Process $ {\tilde X}_m(\omega_k)$ as desired to produce $ {\tilde Y}_m(\omega_k)$ .

  4. Inverse FFT $ {\tilde Y}_m$ to produce $ {\tilde y}_m(n)$ , $ n=0,\ldots,N-1$ .

  5. Apply a synthesis window $ f(n)$ to $ {\tilde y}_m(n)$ to yield a weighted output frame $ {\tilde y}^f_m(n) = {\tilde y}_m(n)f(n)$ , $ n=0,\ldots,N-1$ .

  6. Translate the $ m$ th output frame to time $ mR$ as $ y^f_m(n) =
{\tilde y}^f_m(n-mR)$ and add to the accumulated output signal $ y(n)$ .

(The overlap-add method discussed previously is obtained from the above procedure by deleting step 5.)

To obtain perfect reconstruction in the absence of spectral modifications, we require

x(n) &=& \sum_{m=-\infty}^{\infty} x(n) w(n-mR)f(n-mR) \\
&=& x(n) \sum_{m=-\infty}^{\infty} w(n-mR)f(n-mR),

which is true if and only if

$\displaystyle \zbox {\sum_m w(n-mR)f(n-mR) = 1, \,\forall n\in{\bf Z}.}$ (9.44)

Choice of WOLA Window

The synthesis (output) window in weighted overlap-add is typically chosen to be the same as the analysis (input) window, in which case the COLA constraint becomes

$\displaystyle \zbox {\sum_m w^2(n-mR) = \hbox{constant}\,\forall n\in{\bf Z}.}$ (9.45)

We can say that $ R$ -shifts of the window $ w$ in the time domain are power complementary, whereas for OLA they were amplitude complementary.

A trivial way to construct useful windows for WOLA is to take the square root of any good OLA window. This works for all non-negative OLA windows (which covers essentially all windows in Chapter 3 other than Portnoff windows). For example, the ``root-Hann window'' can be defined for odd $ M$ by

w(n) &=& w_R(n) \sqrt{\frac{1}{2} + \frac{1}{2} \cos( 2\pi n/M) } \\
&=& w_R(n) \cos(\pi n/M),
\; n= -\frac{M-1}{2},\ldots,\frac{M-1}{2}

Notice that the root-Hann window is the same thing as the ``MLT Sine Window'' described in §3.2.6. We can similarly define the ``root-Hamming'', ``root-Blackman'', and so on, all of which give perfect reconstruction in the weighted overlap-add context.

Review of Zero Padding

Expanding on the discussion in §2.5, zero-padding is used for

Next Section:
The Filter Bank Summation (FBS) Interpretation of the Short Time Fourier Transform (STFT)
Previous Section:
Time-Frequency Displays