The Filter Bank Summation (FBS) Interpretation of the Short Time Fourier Transform (STFT)
Dual Views of the Short Time Fourier Transform (STFT)
In the overlapadd formulation of Chapter 8, we used a hopping window to extract timelimited signals to which we applied the DFT. Assuming for the moment that the hop size (the ``sliding DFT''), we have
This is the usual definition of the ShortTime Fourier Transform (STFT) (§7.1). In this chapter, we will look at the STFT from two different points of view: the OverLapAdd (OLA) and FilterBank Summation (FBS) points of view. We will show that one is the Fourier dual of the other [9]. Next we will explore some implications of the filterbank point of view and obtain some useful insights. Finally, some applications are considered.
OverlapAdd (OLA) Interpretation of the STFT
In the OLA interpretation of the STFT, we apply a timeshifted window to our signal , selecting data near time , and compute the Fouriertransform to obtain the spectrum of the th frame. As shown in Fig.9.1, the STFT is viewed as a timeordered sequence of spectra, one per frame, with the frames overlapping in time.FilterBank Summation (FBS) Interpretation of the STFT
We can group the terms in the STFT definition differently to obtain the filterbank interpretation:As will be explained further below (and illustrated further in Figures 9.3, 9.4, and 9.5), under the filterbank interpretation, the spectrum of is first rotated along the unit circle in the plane so as to shift frequency down to 0 (via modulation by in the time domain), thus forming the heterodyned signal . Next, the heterodyned signal is lowpassfiltered to a narrow band about frequency 0 (via convolving with the timereversed window ). The STFT is thus interpreted as a frequencyordered collection of narrowband timedomain signals, as depicted in Fig.9.2. In other words, the STFT can be seen as a uniform filter bank in which the input signal is converted to a set of timedomain output signals , , one for each channel of the channel filter bank. Expanding on the previous paragraph, the STFT (9.2) is computed by the following operations:
 Frequencyshift by to get .
 Convolve
with
to get
:
(10.3)
FBS and Perfect Reconstruction
An important property of the STFT established in Chapter 8 is that it is exactly invertible when the analysis window satisfies the constantoverlapadd constraint. That is, neglecting numerical roundoff error, the inverse STFT reproduces the original input signal exactly. This is called the perfect reconstruction property of the STFT, and modern filter banks are usually designed with this property in mind [287]. In the OLA processors of Chapter 8, perfect reconstruction was assured by using FFT analysis windows having the ConstantOverlapAdd (COLA) property at the particular hopsize used (see §8.2.1). In the Filter Bank Summation (FBS) interpretation of the STFT (Eq. (9.1)), it is the analysis filterbank frequency responses that are constrained to be COLA. We will take a look at this more closely below.STFT Filter Bank
Each channel of an STFT filter bank implements the processing shown in Fig.9.4. The same processing is shown in the frequency domain in Fig.9.5. Note that the window transform is complexconjugated because the window is flipped in the time domain, i.e., when is real [264]. 
Computational Examples in Matlab
In this section, we will take a look at some STFT filterbank output signals when the input signal is a ``chirp.'' A chirp signal is generally defined as a sinusoid having a linearly changing frequency over time:N=10; % number of filters = DFT length fs=1000; % sampling frequency (arbitrary) D=1; % duration in seconds L = ceil(fs*D)+1; % signal duration (samples) n = 0:L1; % discretetime axis (samples) t = n/fs; % discretetime axis (sec) x = chirp(t,0,D,fs/2); % sine sweep from 0 Hz to fs/2 Hz %x = echirp(t,0,D,fs/2); % for complex "analytic" chirp x = x(1:L); % trim trailing zeros at end h = ones(1,N); % Simple DFT lowpass = rectangular window %h = hamming(N); % Better DFT lowpass = Hamming window X = zeros(N,L); % X will be the filter bank output for k=1:N % Loop over channels wk = 2*pi*(k1)/N; xk = exp(j*wk*n).* x; % Modulation by complex exponential X(k,:) = filter(h,1,xk); endFigure 9.6 shows the input and outputsignal real parts for a tenchannel DFT filter bank based on the rectangular window as derived above. The imaginary parts of the channelfilter output signals are similar so they're not shown. Notice how the amplitude envelope in each channel follows closely the amplitude response of the runningsum lowpass filter. This is more clearly seen when the absolute values of the output signals are viewed, as shown in Fig.9.7.
function x = chirp(t,f0,t1,f1); beta = (f1f0)./t1; x = cos(2*pi * ( 0.5* beta .* (t.^2) + f0*t));We can replace this real chirp with a complex ``analytic'' chirp by replacing the last line above by the following:
x = exp(j*(2*pi * ( 0.5* beta .* (t.^2) + f0*t)));Since the analytic chirp does not contain a negativefrequency component which beats with the positivefrequency component, we obtain the cleaner looking output moduli shown in Fig.9.9. Since our chirp frequency goes from zero to half the sampling rate, we are no longer exciting the negativefrequency channels. (To fully traverse the frequency axis with a complex chirp, we would need to sweep it from to .) We see in Fig.9.9 that there is indeed relatively little response in the ``negativefrequency channels'' for which , but there is some noticeable ``leakage'' from channel 0 into channel , and channel 5 similarly leaks into channel 6. Since the channel passbands overlap approximately 75%, this is not unexpected. The automatic vertical scaling in the channel 7 and 8 plots shows clearly the sidelobe structure of the Hamming window. Finally, notice also that the length startup transient is visible in each channel output just after time 0.
The DFT Filter Bank
To obtain insight into the operation of filter banks implemented using an FFT, this section will derive the details of the DFT Filter Bank. More general STFT filter banks are obtained by using different windows and hop sizes, but otherwise are no different from the basic DFT filter bank. The Discrete Fourier Transform (DFT) is defined by [264](10.4) 
where is the input signal at time , and . In this section, we will show how the DFT can be computed exactly from a bank of FIR bandpass filters, where each bandpass filter is implemented as a demodulator followed by a lowpass filter. We will then find that the inverse DFT is computed by remodulating and summing the output of this filter bank. In this way, the DFT filter bank is shown to be a perfectreconstruction filter bank. The STFT is then an extension of the DFT filter bank to include nonrectangular analysis windows and a downsampling factor .
The RunningSum Lowpass Filter
Perhaps the simplest FIR lowpass filter is the socalled runningsum lowpass filter [175]. The impulse response of the length runningsum lowpass filter is given byFigure 9.10 depicts the generic operation of filtering by to produce , where is the impulse response of the filter. The output signal is given by the convolution of and :
(10.6) 
so that its frequency response is
(10.7) 
previously in Chapter 5 (§3.1.2) and elsewhere as the Fourier transform (DTFT) of a sampled rectangular pulse (or rectangular window). Note that the dc gain of the length running sum filter is . We could use a moving average instead of a running sum ( ) to obtain unity dc gain. Figure 9.11 shows the amplitude response of the runningsum lowpass filter for length . The gain at dc is , and nulls occur at and . These nulls occur at the sinusoidal frequencies having respectively one and two periods under the 5sample ``rectangular window''. (Three periods would need at least samples, so doesn't ``fit''.) Since the passband about dc is not flat, it is better to call this a ``dcpass filter'' rather than a ``lowpass filter.'' We could also call it a dc sampling filter.^{10.1}
Modulation by a Complex Sinusoid
Figure 9.12 shows the system diagram for complex demodulation.^{10.2}The input signal is multiplied by a complex sinusoid to produce the frequencyshifted result(10.8) 
Given a signal expressed as a sum of sinusoids,
(10.9) 
then the demodulation produces
(10.10) 
We see that frequency is downshifted to . In particular, frequency (the ``center frequency'') is downshifted to dc.
Making a Bandpass Filter from a Lowpass Filter

Uniform RunningSum Filter Banks
Using a length runningsum filter, let's make bandpass filters tuned to center frequencies(10.11) 
Since the bandwidths, as defined, are , the filter passbands overlap by 50%. A superposition of the bandpass frequency responses for is shown in Fig.9.14. Also shown is the frequencyresponse sum, which we will show to be exactly constant and equal to . This gives our filter bank the perfect reconstruction property. We can simply add the outputs of the filters in the filter bank to recreate our input signal exactly. This is the source of the name FilterBank Summation (FBS).
System Diagram of the RunningSum Filter Bank
Figure 9.15 shows the system diagram of the complete channel filter bank constructed using length FIR runningsum lowpass filters. The th channel computes:DFT Filter Bank
Recall that the Length Discrete Fourier Transform (DFT) is defined as(10.13) 
Comparing this to (9.12), we see that the filterbank output , , is precisely the DFT of the input signal when , i.e.,
(10.14) 
In other words, the filterbank output at time (the set of samples for ), equals the DFT of the first samples of ( , ). That is, taking a snapshot of all filterbank channels at time yields the DFT of the input data from time 0 through . More generally, for all , we will call Fig.9.15 the DFT filter bank. The DFT filter bank is the special case of the STFT for which a rectangular window and hop size are used. The sliding DFT is obtained by advancing successive DFTs by one sample:
(10.15) 
When for any integer , the Sliding DFT coincides with the DFT filter bank. At other times, they differ by a linear phase term. (Exercise: find the linear phase term.) The Sliding DFT redefines the time origin every sampling period (each modulation term within the DFT starts at time 0 for each transform), while the DFT Filter Bank does not redefine the time origin (modulation terms are ``free running'' as they would be in an analog filter bank). Since ``DFT time'' repeats every samples, the two treatments coincide every samples (i.e., for every integer ). When is a power of 2, the DFT can be implemented using a CooleyTukey Fast Fourier Transform (FFT) using only operations per transform. By keeping track of the linear phase term (an modification), a DFT Filter Bank can be implemented efficiently using an FFT. Uniform FIR filter banks are very often implemented in practice using FFT software such as fftw. Note that the channel bandwidths are narrow compared with half the sampling rate (especially for large ), so that the filter bank output signals are oversampled, in general. We will later look at downsampling the channel signals to obtain a ``hopping FFT'' filter bank. ``Sliding'' and ``hopping'' FFTs are special cases of the discretetime Short Time Fourier Transform (STFT). The STFT normally also uses a window function other than the rectangular window used in this development (the runningsum lowpass filter).
Inverse DFT and the DFT Filter Bank Sum
The Length inverse DFT is given by [264](10.16) 
This suggests that the DFT Filter Bank can be inverted by simply remodulating the baseband filterbank signals , summing over , and dividing by for proper normalization. That is, we are led to conjecture that
(10.17) 
This is in fact true, as we will later see. (It is straightforward to show as an exercise.)
FBS Window Constraints for R=1
Recall that in overlapadd (Chapter 8), perfect reconstruction required only that the analysis window meet a constant overlapadd (COLA) constraint:(10.18) 
where is any constant (always true for ).^{10.3} The Filter Bank Summation (FBS) is interpreted as a demodulation (frequency shift by ) and subsequent lowpass filtering by . Therefore, to resynthesize our original signal, we need to remodulate each baseband signal and sum up the channels. For (no downsampling), this sum is given by [9]
We have thus derived the following sufficient condition for perfect reconstruction [213]:
Since normally our windows are shorter than , this condition holds trivially. In the overlapadd context, we also had guaranteed perfect reconstruction in this case ( ), because every window overlapadds to a constant at displacement .
Nyquist(N) Windows
In (9.19) of the previous section, we derived that the FBS reconstruction sum gives(10.20) 
where . From this we see that if (where is the window length and is the DFT size), as is normally the case, then for . This is the Fourier dual of the ``strong COLA constraint'' for OLA (see §8.3.2). When it holds, we have
(10.21) 
This is simply a gain term, and so we are able to recover the original signal exactly. (Zerophase windows are appropriate here.) If the window length is larger than the number of analysis frequencies ( ), we can still obtain perfect reconstruction provided
(10.22) 
When this holds, we say the window is . (This is the dual of the weak COLA constraint introduced in §8.3.1.) Portnoff windows, discussed in §9.7, make use of this result; they are longer than the DFT size and therefore must be used in timealiased form [62]. An advantage of Portnoff windows is that they give reduced overlap among the channel filter passbands. In the limit, a Portnoff window can approach a sinc function having its zerocrossings at all nonzero multiples of samples, thereby yielding an ideal channel filter with bandwidth . Figure 9.16 compares example Hamming and Portnoff windows.
Duality of COLA and Nyquist Conditions
Let denote constant overlapadd using hop size . Then we have (by the Poisson summation formula Eq. (8.30))Specific Windows
 Recall that the rectangular window transform is , implying the rectangular window itself is , which is obvious.
 The window transform for the Hamming family is , implying that Hamming windows are , which we also knew.
 The rectangular window transform is also for any integer , implying that all hop sizes given by for are COLA.
 Because its side lobes are the same width as the sinc side lobes, the Hamming window transform is also ,for any integer , implying hop sizes are good, for . Thus, the available hop sizes for the Hamming window family include all of those for the rectangular window except one ( ).
The Nyquist Property on the Unit Circle
As a degenerate case, note that is COLA for any window, while no window transform is except the zero window. (since it would have to be zero at dc, and we do not consider such windows). Did the theory break down for ? Intuitively, the condition on the window transform ensures that all nonzero multiples of the timedomainframerate will be zeroed out over the interval along the frequency axis. When the framerate equals the sampling rate ( ), there are no framerate multiples in the range . (The range gives the same result.) When , there is exactly one framerate multiple at . When , there are two at . When , they are at and , and so on. We can cleanly handle the special case of by defining all functions over the unit circle as being when there are no framerate multiples in the range . Thus, a discretetime spectrum is said to be if , for all , where (the ``floor function'') denotes the greatest integer less than or equal to .Portnoff Windows
In 1976 [212], Portnoff observed that any window of the formsinc  (10.23) 
being by construction, will obey the weak constraint, where is the number of spectral samples. In this result, is any window function whatsoever, and the sinc function is defined as usual by
sinc  (10.24) 
(the unitamplitude sinc function with zeros at all nonzero integers). Portnoff suggested that, in practical usage, windowed data segments longer than the FFT size should be timealiased about length prior to performing an FFT. This result is readily derived from the definition of the timenormalized STFT introduced in Eq. (8.21):
Downsampled STFT Filter Banks
We now look at STFT filter banks which are downsampled by the factor . The downsampling factor corresponds to a hop size of samples in the overlapadd view of the STFT. From the filterbank point of view, the impact of is aliasing in the channel signals when the lowpass filter (analysis window) is less than ideal. When the conditions for perfect reconstruction are met, this aliasing will be canceled in the reconstruction (when the filterbank channel signals are remodulated and summed).Downsampled STFT Filter Bank
So far we have considered only (the ``sliding'' DFT) in our filterbank interpretation of the STFT. For we obtain a downsampled version of :(10.25) 
i.e., is simply evaluated at every sample, as shown in Fig.9.17.
Note that this can be considered an implementation of a phase vocoder filter bank [212]. (See §G.5 for an introduction to the vocoder.)
Filter Bank Reconstruction
Since the channel signals are downsampled, we generally need interpolation in the reconstruction. Figure 9.18 indicates how we might pursue this. From studying the overlapadd framework, we know that the inverse STFT is exact when the window is , that is, when is constant. In only these cases can the STFT be considered a perfect reconstruction filter bank. From the Poisson Summation Formula in §8.3.1, we know that a condition equivalent to the COLA condition is that the window transform have notches at all harmonics of the frame rate, i.e., for . In the present context (filterbank point of view), perfect reconstruction appears impossible for , because for ideal reconstruction after downsampling, the channel antialiasing filter ( ) and interpolation filter ( ) have to be ideal lowpass filters. This is a true conclusion in any single channel, but not for the filter bank as a whole. We know, for example, from the overlapadd interpretation of the STFT that perfect reconstruction occurs for hopsizes greater than 1 as long as the COLA condition is met. This is an interesting paradox to which we will return shortly. What we would expect in the filterbank context is that the reconstruction can be made arbitrarily accurate given better and better lowpass filters and which cut off at (the folding frequency associated with downsampling by ). This is the right way to think about the STFT when spectral modifications are involved. In Chapter 11 we will develop the general topic of perfect reconstruction filter banks, and derive various STFT processors as special cases.
Downsampling with AntiAliasing
In OLA, the hop size is governed by the COLA constraint(10.26) 
In FBS, is the downsampling factor in each of the filterbank channels, and thus the window serves as the antialiasing filter (see Fig.9.19). We see that to avoid aliasing, must be bandlimited to , as illustrated schematically in Fig.9.20.
Properly AntiAliasing Window Transforms
For simplicity, define windowtransform bandlimits at first zerocrossings about the main lobe. Given the first zero of at , we obtain(10.27) 
The following table gives maximum hop sizes for various window types in the BlackmanHarris family, where is both the number of constantpluscosine terms in the window definition (§3.3) and the halfmainlobe width in units of sidelobe widths . Also shown in the table is the maximum COLA hop size we determined in Chapter 8.
L  Window Type (Length )  
1  Rectangular  M/2  M 
2  Generalized Hamming  M/4  M/2 
3  Blackman Family  M/6  M/3 
L  term BlackmanHarris  M/2L  M/L 
Hop Sizes for WOLA
In the weighted overlapadd method, with the synthesis (output) window equal to the analysis (input) window, we have the following modification of the recommended maximum hopsize table:L  In and Out Window (Length )  
1  Rectangular ( )  M/2  M 
2  Generalized Hamming ( )  M/6  M/3 
3  Blackman Family ( )  M/10  M/5 
L  term BlackmanHarris  M/(4L2)  M/(2L1) 
 is equal to divided by the mainlobe width in ``side lobes'', while
 is divided by the first notch frequency in the window transform (lowest available frame rate at which all framerate harmonics are notched).
 For windows in the BlackmanHarris families, and with mainlobe widths defined from zerocrossing to zerocrossing, .
ConstantOverlapAdd (COLA) Cases
 Weak COLA: Window transform has zeros at framerate harmonics:
 Perfect OLA reconstruction
 Relies on aliasing cancellation in frequency domain
 Aliasing cancellation is disturbed by spectral modifications
 See Portnoff for further details
 Strong COLA: Window transform is bandlimited consistent with
downsampling by the frame rate:
 Perfect OLA reconstruction
 No aliasing
 better for spectral modifications
 Timedomain window infinitely long in ideal case
Hamming OverlapAdd Example
Matlab code:M = 33; % window length w = hamming(M); R = (M1)/2; % maximum hop size w(M) = 0; % 'periodic Hamming' (for COLA) %w(M) = w(M)/2; % another solution, %w(1) = w(1)/2; % interesting to compare
PeriodicHamming OLA from Poisson Summation Formula
Matlab code:ff = 1/R; % frame rate (fs=1) N = 6*M; % no. samples to look at OLA sp = ones(N,1)*sum(w)/R; % dc term (COLA term) ubound = sp(1); % try easytocompute upper bound lbound = ubound; % and lower bound n = (0:N1)'; for (k=1:R1) % traverse framerate harmonics f=ff*k; csin = exp(j*2*pi*f*n); % framerate harmonic % find exact window transform at frequency f Wf = w' * conj(csin(1:M)); hum = Wf*csin; % contribution to OLA "hum" sp = sp + hum/R; % "Poisson summation" into OLA % Update lower and upper bounds: Wfb = abs(Wf); ubound = ubound + Wfb/R; % build upper bound lbound = lbound  Wfb/R; % build lower bound endIn this example, the overlapadd is theoretically a perfect constant (equal to ) because the frame rate and all its harmonics coincide with nulls in the window transform (see Fig.9.24). A plot of the steadystate overlapadd and that computed using the Poisson Summation Formula (not shown) is constant to within numerical precision. The difference between the actual overlapadd and that computed using the PSF is shown in Fig.9.23. We verify that the difference is on the order of , which is close enough to zero in doubleprecision (64bit) floatingpoint computations. We thus verify that the overlapadd of a length Hamming window using a hop size of samples is constant to within machine precision. Figure 9.24 shows the zeropadded DFT of the modified Hamming window we're using ( ) with the framerate harmonics marked. In this example ( ), the upper half of the main lobe aliases into the lower half of the main lobe. (In fact, all energy above the folding frequency aliases into the lower half of the main lobe.) While this window and hop size still give perfect reconstruction under the STFT, spectral modifications will disturb the aliasing cancellation during reconstruction. This ``undersampled'' configuration is suitable as a basis for compression applications. Note that if we were to cut in half to , then the folding frequency in Fig.9.24 would coincide with the first null in the window transform. Since the frame rate and all its harmonics continue to land on nulls in the window transform, overlapadd is still exact. At this reduced hop size, however, the STFT becomes much more robust to spectral modifications, because all aliasing in the effective downsampled filter bank is now weighted by the side lobes of the window transform, with no aliasing components coming from within the main lobe. This is the central result of [9].
Kaiser OverlapAdd Example
Matlab code:M = 33; % Window length beta = 8; w = kaiser(M,beta); R = floor(1.7*(M1)/(beta+1)); % ROUGH estimate (gives R=6)Figure 9.25 plots the overlapadded Kaiser windows, and Fig.9.26 shows the steadystate overlapadd (a time segment sometime after the first 30 samples). The ``predicted'' OLA is computed using the Poisson Summation Formula using the same matlab code as before. Note that the Poisson summation formula gives exact results to within numerical precision. The upper (lower) bound was computed by summing (subtracting) the windowtransform magnitudes at all framerate harmonics to (from) the dc gain of the window. This is one example of how the PSF can be used to estimate upper and lower bounds on OLA error. The difference between measured steadystate overlapadd and that computed using the Poisson summation formula is shown in Fig.9.27. Again the two methods agree to within numerical precision. Finally, Fig.9.28 shows the Kaiser window transform, with marks indicating the folding frequency at the chosen hop size , as well as the framerate and twice the frame rate. We see that the frame rate (hop size) has been well chosen for this window, as the folding frequency lies very close to what would be called the ``stop band'' of the Kaiser window transform. The ``stopband rejection'' can be seen to be approximately dB (height of highest side lobe in Fig.9.28). We conclude that this examplea length 33 Kaiser window with and hopsize  represents a reasonably highquality audio STFT that will be robust in the presence of spectral modifications. We expect such robustness whenever the folding frequency lies above the main lobe of the window transform. Remember that, for robustness in the presence of spectral modifications, the frame rate should be more than twice the highest mainlobe frequency.
STFT with Modifications
FBS Fixed Modifications
Consider applying a fixed (timeinvariant) filter to each before resynthesizing the signal:(10.28) 
where, is the sampled frequency response of a filter with impulse response
(10.29) 
Let's examine the result this has on the signal in the time domain:
(10.30) 
while OLA gives (for )
(10.31) 
 In FBS, the analysis window smooths the filter frequency response by timelimiting the corresponding impulse response.
 In OLA, the analysis window can only affect scaling.
Time Varying Modifications in FBS
Consider now applying a time varying modification.(10.32) 
where
(10.33) 
refers to the tap of the FIR filter at time .
Points to Note
 We saw that in OLA with time varying modifications and (a ``sliding'' DFT), the window served as a lowpass filter on each individual tap of the FIR filter being implemented.
 In the more typical case in which is the window length divided by a small integer like  , we may think of the window as specifying a type of crossfade from the LTI filter for one frame to the LTI filter for the next frame.
 Using a Bartlett (triangular) window with % overlap, ( ), the sequence of FIR filters used is obtained simply by linearly interpolating the LTI filter for one frame to the LTI filter for the next.
 In FBS, there is no limitation on how fast the filter may vary with time, but its length is limited to that of the window .
 In OLA, there is no limit on length (just add more zeropadding), but the filter taps are bandlimited to the spectral width of the window.
 FBS filters are timelimited by , while OLA filters are bandlimited by (another dual relation).
 Recall for comparison that each frame in the OLA method is filtered
according to
(10.34)
where denotes .  Timevarying FBS filters are instantly in ``steady state''
 FBS filters must be changed very slowly to avoid clicks and pops (discontinuity distortion is likely when the filter changes)
STFT Summary and Conclusions
The ShortTime Fourier Transform (STFT) may be viewed either as an OverLapAdd (OLA) processor, or as a FilterBank Sum (FBS). We derived two conditions for perfect reconstruction which are Fourier duals of each other: For OLA, the window must overlapadd to a constant in the time domain. By the Poisson summation formula, this is equivalent to having window transform nulls at all nonzero multiples of the frame rate .
 For FBS, the window transform must overlapadd to a constant in the frequency domain, and this is equivalent to having window nulls in the time domain at all nonzero multiples of the transform size .
Next Section:
Applications of the STFT
Previous Section:
OverlapAdd (OLA) STFT Processing