In the overlap-add formulation of Chapter 8, we used a hopping window to extract time-limited signals to which we applied the DFT. Assuming for the moment that the hop size (the ``sliding DFT''), we have
This is the usual definition of the Short-Time Fourier Transform (STFT) (§7.1). In this chapter, we will look at the STFT from two different points of view: the OverLap-Add (OLA) and Filter-Bank Summation (FBS) points of view. We will show that one is the Fourier dual of the other . Next we will explore some implications of the filter-bank point of view and obtain some useful insights. Finally, some applications are considered.
Overlap-Add (OLA) Interpretation of the STFTIn the OLA interpretation of the STFT, we apply a time-shifted window to our signal , selecting data near time , and compute the Fourier-transform to obtain the spectrum of the th frame. As shown in Fig.9.1, the STFT is viewed as a time-ordered sequence of spectra, one per frame, with the frames overlapping in time.
As will be explained further below (and illustrated further in Figures 9.3, 9.4, and 9.5), under the filter-bank interpretation, the spectrum of is first rotated along the unit circle in the plane so as to shift frequency down to 0 (via modulation by in the time domain), thus forming the heterodyned signal . Next, the heterodyned signal is lowpass-filtered to a narrow band about frequency 0 (via convolving with the time-reversed window ). The STFT is thus interpreted as a frequency-ordered collection of narrow-band time-domain signals, as depicted in Fig.9.2. In other words, the STFT can be seen as a uniform filter bank in which the input signal is converted to a set of time-domain output signals , , one for each channel of the -channel filter bank. 9.2) is computed by the following operations:
- Frequency-shift by to get .
STFT established in Chapter 8 is that it is exactly invertible when the analysis window satisfies the constant-overlap-add constraint. That is, neglecting numerical round-off error, the inverse STFT reproduces the original input signal exactly. This is called the perfect reconstruction property of the STFT, and modern filter banks are usually designed with this property in mind . In the OLA processors of Chapter 8, perfect reconstruction was assured by using FFT analysis windows having the Constant-Overlap-Add (COLA) property at the particular hop-size used (see §8.2.1). In the Filter Bank Summation (FBS) interpretation of the STFT (Eq. (9.1)), it is the analysis filter-bank frequency responses that are constrained to be COLA. We will take a look at this more closely below.
STFT filter bank implements the processing shown in Fig.9.4. The same processing is shown in the frequency domain in Fig.9.5. Note that the window transform is complex-conjugated because the window is flipped in the time domain, i.e., when is real .
N=10; % number of filters = DFT length fs=1000; % sampling frequency (arbitrary) D=1; % duration in seconds L = ceil(fs*D)+1; % signal duration (samples) n = 0:L-1; % discrete-time axis (samples) t = n/fs; % discrete-time axis (sec) x = chirp(t,0,D,fs/2); % sine sweep from 0 Hz to fs/2 Hz %x = echirp(t,0,D,fs/2); % for complex "analytic" chirp x = x(1:L); % trim trailing zeros at end h = ones(1,N); % Simple DFT lowpass = rectangular window %h = hamming(N); % Better DFT lowpass = Hamming window X = zeros(N,L); % X will be the filter bank output for k=1:N % Loop over channels wk = 2*pi*(k-1)/N; xk = exp(-j*wk*n).* x; % Modulation by complex exponential X(k,:) = filter(h,1,xk); endFigure 9.6 shows the input and output-signal real parts for a ten-channel DFT filter bank based on the rectangular window as derived above. The imaginary parts of the channel-filter output signals are similar so they're not shown. Notice how the amplitude envelope in each channel follows closely the amplitude response of the running-sum lowpass filter. This is more clearly seen when the absolute values of the output signals are viewed, as shown in Fig.9.7.
function x = chirp(t,f0,t1,f1); beta = (f1-f0)./t1; x = cos(2*pi * ( 0.5* beta .* (t.^2) + f0*t));We can replace this real chirp with a complex ``analytic'' chirp by replacing the last line above by the following:
x = exp(j*(2*pi * ( 0.5* beta .* (t.^2) + f0*t)));Since the analytic chirp does not contain a negative-frequency component which beats with the positive-frequency component, we obtain the cleaner looking output moduli shown in Fig.9.9. Since our chirp frequency goes from zero to half the sampling rate, we are no longer exciting the negative-frequency channels. (To fully traverse the frequency axis with a complex chirp, we would need to sweep it from to .) We see in Fig.9.9 that there is indeed relatively little response in the ``negative-frequency channels'' for which , but there is some noticeable ``leakage'' from channel 0 into channel , and channel 5 similarly leaks into channel 6. Since the channel pass-bands overlap approximately 75%, this is not unexpected. The automatic vertical scaling in the channel 7 and 8 plots shows clearly the side-lobe structure of the Hamming window. Finally, notice also that the length start-up transient is visible in each channel output just after time 0.
filter banks implemented using an FFT, this section will derive the details of the DFT Filter Bank. More general STFT filter banks are obtained by using different windows and hop sizes, but otherwise are no different from the basic DFT filter bank. The Discrete Fourier Transform (DFT) is defined by 
where is the input signal at time , and . In this section, we will show how the DFT can be computed exactly from a bank of FIR bandpass filters, where each bandpass filter is implemented as a demodulator followed by a lowpass filter. We will then find that the inverse DFT is computed by remodulating and summing the output of this filter bank. In this way, the DFT filter bank is shown to be a perfect-reconstruction filter bank. The STFT is then an extension of the DFT filter bank to include non-rectangular analysis windows and a downsampling factor . filter is the so-called running-sum lowpass filter . The impulse response of the length running-sum lowpass filter is given by
9.10 depicts the generic operation of filtering by to produce , where is the impulse response of the filter. The output signal is given by the convolution of and :
so that its frequency response is
previously in Chapter 5 (§3.1.2) and elsewhere as the Fourier transform (DTFT) of a sampled rectangular pulse (or rectangular window). Note that the dc gain of the length running sum filter is . We could use a moving average instead of a running sum ( ) to obtain unity dc gain.
Modulation by a Complex Sinusoid
Given a signal expressed as a sum of sinusoids,
then the demodulation produces
We see that frequency is down-shifted to . In particular, frequency (the ``center frequency'') is down-shifted to dc.
filter, let's make bandpass filters tuned to center frequencies
Since the bandwidths, as defined, are , the filter pass-bands overlap by 50%. A superposition of the bandpass frequency responses for is shown in Fig.9.14. Also shown is the frequency-response sum, which we will show to be exactly constant and equal to . This gives our filter bank the perfect reconstruction property. We can simply add the outputs of the filters in the filter bank to recreate our input signal exactly. This is the source of the name Filter-Bank Summation (FBS). 9.15 shows the system diagram of the complete -channel filter bank constructed using length FIR running-sum lowpass filters. The th channel computes:
Fourier Transform (DFT) is defined as
Comparing this to (9.12), we see that the filter-bank output , , is precisely the DFT of the input signal when , i.e.,
In other words, the filter-bank output at time (the set of samples for ), equals the DFT of the first samples of ( , ). That is, taking a snapshot of all filter-bank channels at time yields the DFT of the input data from time 0 through . More generally, for all , we will call Fig.9.15 the DFT filter bank. The DFT filter bank is the special case of the STFT for which a rectangular window and hop size are used. The sliding DFT is obtained by advancing successive DFTs by one sample:
When for any integer , the Sliding DFT coincides with the DFT filter bank. At other times, they differ by a linear phase term. (Exercise: find the linear phase term.) The Sliding DFT redefines the time origin every sampling period (each modulation term within the DFT starts at time 0 for each transform), while the DFT Filter Bank does not redefine the time origin (modulation terms are ``free running'' as they would be in an analog filter bank). Since ``DFT time'' repeats every samples, the two treatments coincide every samples (i.e., for every integer ). When is a power of 2, the DFT can be implemented using a Cooley-Tukey Fast Fourier Transform (FFT) using only operations per transform. By keeping track of the linear phase term (an modification), a DFT Filter Bank can be implemented efficiently using an FFT. Uniform FIR filter banks are very often implemented in practice using FFT software such as fftw. Note that the channel bandwidths are narrow compared with half the sampling rate (especially for large ), so that the filter bank output signals are oversampled, in general. We will later look at downsampling the channel signals to obtain a ``hopping FFT'' filter bank. ``Sliding'' and ``hopping'' FFTs are special cases of the discrete-time Short Time Fourier Transform (STFT). The STFT normally also uses a window function other than the rectangular window used in this development (the running-sum lowpass filter).
DFT is given by 
This suggests that the DFT Filter Bank can be inverted by simply remodulating the baseband filter-bank signals , summing over , and dividing by for proper normalization. That is, we are led to conjecture that
This is in fact true, as we will later see. (It is straightforward to show as an exercise.)
8), perfect reconstruction required only that the analysis window meet a constant overlap-add (COLA) constraint:
where is any constant (always true for ).10.3 The Filter Bank Summation (FBS) is interpreted as a demodulation (frequency shift by ) and subsequent lowpass filtering by . Therefore, to resynthesize our original signal, we need to remodulate each baseband signal and sum up the channels. For (no downsampling), this sum is given by 
We have thus derived the following sufficient condition for perfect reconstruction :
Since normally our windows are shorter than , this condition holds trivially. In the overlap-add context, we also had guaranteed perfect reconstruction in this case ( ), because every window overlap-adds to a constant at displacement .
9.19) of the previous section, we derived that the FBS reconstruction sum gives
where . From this we see that if (where is the window length and is the DFT size), as is normally the case, then for . This is the Fourier dual of the ``strong COLA constraint'' for OLA (see §8.3.2). When it holds, we have
This is simply a gain term, and so we are able to recover the original signal exactly. (Zero-phase windows are appropriate here.) If the window length is larger than the number of analysis frequencies ( ), we can still obtain perfect reconstruction provided
When this holds, we say the window is . (This is the dual of the weak COLA constraint introduced in §8.3.1.) Portnoff windows, discussed in §9.7, make use of this result; they are longer than the DFT size and therefore must be used in time-aliased form . An advantage of Portnoff windows is that they give reduced overlap among the channel filter pass-bands. In the limit, a Portnoff window can approach a sinc function having its zero-crossings at all nonzero multiples of samples, thereby yielding an ideal channel filter with bandwidth . Figure 9.16 compares example Hamming and Portnoff windows.
constant overlap-add using hop size . Then we have (by the Poisson summation formula Eq. (8.30))
- Recall that the rectangular window transform is , implying the rectangular window itself is , which is obvious.
- The window transform for the Hamming family is , implying that Hamming windows are , which we also knew.
- The rectangular window transform is also for any integer , implying that all hop sizes given by for are COLA.
- Because its side lobes are the same width as the sinc side lobes, the Hamming window transform is also ,for any integer , implying hop sizes are good, for . Thus, the available hop sizes for the Hamming window family include all of those for the rectangular window except one ( ).
COLA for any window, while no window transform is except the zero window. (since it would have to be zero at dc, and we do not consider such windows). Did the theory break down for ? Intuitively, the condition on the window transform ensures that all nonzero multiples of the time-domain-frame-rate will be zeroed out over the interval along the frequency axis. When the frame-rate equals the sampling rate ( ), there are no frame-rate multiples in the range . (The range gives the same result.) When , there is exactly one frame-rate multiple at . When , there are two at . When , they are at and , and so on. We can cleanly handle the special case of by defining all functions over the unit circle as being when there are no frame-rate multiples in the range . Thus, a discrete-time spectrum is said to be if , for all , where (the ``floor function'') denotes the greatest integer less than or equal to .
212], Portnoff observed that any window of the form
being by construction, will obey the weak constraint, where is the number of spectral samples. In this result, is any window function whatsoever, and the sinc function is defined as usual by
(the unit-amplitude sinc function with zeros at all nonzero integers). Portnoff suggested that, in practical usage, windowed data segments longer than the FFT size should be time-aliased about length prior to performing an FFT. This result is readily derived from the definition of the time-normalized STFT introduced in Eq. (8.21):
Downsampled STFT Filter BanksWe now look at STFT filter banks which are downsampled by the factor . The downsampling factor corresponds to a hop size of samples in the overlap-add view of the STFT. From the filter-bank point of view, the impact of is aliasing in the channel signals when the lowpass filter (analysis window) is less than ideal. When the conditions for perfect reconstruction are met, this aliasing will be canceled in the reconstruction (when the filter-bank channel signals are remodulated and summed).
Downsampled STFT Filter BankSo far we have considered only (the ``sliding'' DFT) in our filter-bank interpretation of the STFT. For we obtain a downsampled version of :
i.e., is simply evaluated at every sample, as shown in Fig.9.17.
Note that this can be considered an implementation of a phase vocoder filter bank . (See §G.5 for an introduction to the vocoder.)
Since the channel signals are downsampled, we generally need interpolation in the reconstruction. Figure 9.18 indicates how we might pursue this. From studying the overlap-add framework, we know that the inverse STFT is exact when the window is , that is, when is constant. In only these cases can the STFT be considered a perfect reconstruction filter bank. From the Poisson Summation Formula in §8.3.1, we know that a condition equivalent to the COLA condition is that the window transform have notches at all harmonics of the frame rate, i.e., for . In the present context (filter-bank point of view), perfect reconstruction appears impossible for , because for ideal reconstruction after downsampling, the channel anti-aliasing filter ( ) and interpolation filter ( ) have to be ideal lowpass filters. This is a true conclusion in any single channel, but not for the filter bank as a whole. We know, for example, from the overlap-add interpretation of the STFT that perfect reconstruction occurs for hop-sizes greater than 1 as long as the COLA condition is met. This is an interesting paradox to which we will return shortly. What we would expect in the filter-bank context is that the reconstruction can be made arbitrarily accurate given better and better lowpass filters and which cut off at (the folding frequency associated with down-sampling by ). This is the right way to think about the STFT when spectral modifications are involved. In Chapter 11 we will develop the general topic of perfect reconstruction filter banks, and derive various STFT processors as special cases.
In FBS, is the downsampling factor in each of the filter-bank channels, and thus the window serves as the anti-aliasing filter (see Fig.9.19). We see that to avoid aliasing, must be bandlimited to , as illustrated schematically in Fig.9.20. main lobe. Given the first zero of at , we obtain
The following table gives maximum hop sizes for various window types in the Blackman-Harris family, where is both the number of constant-plus-cosine terms in the window definition (§3.3) and the half-main-lobe width in units of side-lobe widths . Also shown in the table is the maximum COLA hop size we determined in Chapter 8.
|L||Window Type (Length )|
|L||In and Out Window (Length )|
|1||Rectangular ( )||M/2||M|
|2||Generalized Hamming ( )||M/6||M/3|
|3||Blackman Family ( )||M/10||M/5|
- is equal to divided by the main-lobe width in ``side lobes'', while
- is divided by the first notch frequency in the window transform (lowest available frame rate at which all frame-rate harmonics are notched).
- For windows in the Blackman-Harris families, and with main-lobe widths defined from zero-crossing to zero-crossing, .
- Weak COLA: Window transform has zeros at frame-rate harmonics:
- Strong COLA: Window transform is bandlimited consistent with
downsampling by the frame rate:
- Perfect OLA reconstruction
- No aliasing
- better for spectral modifications
- Time-domain window infinitely long in ideal case
M = 33; % window length w = hamming(M); R = (M-1)/2; % maximum hop size w(M) = 0; % 'periodic Hamming' (for COLA) %w(M) = w(M)/2; % another solution, %w(1) = w(1)/2; % interesting to compare
ff = 1/R; % frame rate (fs=1) N = 6*M; % no. samples to look at OLA sp = ones(N,1)*sum(w)/R; % dc term (COLA term) ubound = sp(1); % try easy-to-compute upper bound lbound = ubound; % and lower bound n = (0:N-1)'; for (k=1:R-1) % traverse frame-rate harmonics f=ff*k; csin = exp(j*2*pi*f*n); % frame-rate harmonic % find exact window transform at frequency f Wf = w' * conj(csin(1:M)); hum = Wf*csin; % contribution to OLA "hum" sp = sp + hum/R; % "Poisson summation" into OLA % Update lower and upper bounds: Wfb = abs(Wf); ubound = ubound + Wfb/R; % build upper bound lbound = lbound - Wfb/R; % build lower bound endIn this example, the overlap-add is theoretically a perfect constant (equal to ) because the frame rate and all its harmonics coincide with nulls in the window transform (see Fig.9.24). A plot of the steady-state overlap-add and that computed using the Poisson Summation Formula (not shown) is constant to within numerical precision. The difference between the actual overlap-add and that computed using the PSF is shown in Fig.9.23. We verify that the difference is on the order of , which is close enough to zero in double-precision (64-bit) floating-point computations. We thus verify that the overlap-add of a length Hamming window using a hop size of samples is constant to within machine precision.
M = 33; % Window length beta = 8; w = kaiser(M,beta); R = floor(1.7*(M-1)/(beta+1)); % ROUGH estimate (gives R=6)Figure 9.25 plots the overlap-added Kaiser windows, and Fig.9.26 shows the steady-state overlap-add (a time segment sometime after the first 30 samples). The ``predicted'' OLA is computed using the Poisson Summation Formula using the same matlab code as before. Note that the Poisson summation formula gives exact results to within numerical precision. The upper (lower) bound was computed by summing (subtracting) the window-transform magnitudes at all frame-rate harmonics to (from) the dc gain of the window. This is one example of how the PSF can be used to estimate upper and lower bounds on OLA error. 9.27. Again the two methods agree to within numerical precision. 9.28 shows the Kaiser window transform, with marks indicating the folding frequency at the chosen hop size , as well as the frame-rate and twice the frame rate. We see that the frame rate (hop size) has been well chosen for this window, as the folding frequency lies very close to what would be called the ``stop band'' of the Kaiser window transform. The ``stop-band rejection'' can be seen to be approximately dB (height of highest side lobe in Fig.9.28). We conclude that this example--a length 33 Kaiser window with and hop-size -- represents a reasonably high-quality audio STFT that will be robust in the presence of spectral modifications. We expect such robustness whenever the folding frequency lies above the main lobe of the window transform.
STFT with Modifications
filter to each before resynthesizing the signal:
where, is the sampled frequency response of a filter with impulse response
Let's examine the result this has on the signal in the time domain:
while OLA gives (for )
- In FBS, the analysis window smooths the filter frequency response by time-limiting the corresponding impulse response.
- In OLA, the analysis window can only affect scaling.
refers to the tap of the FIR filter at time .
- We saw that in OLA with time varying modifications and (a ``sliding'' DFT), the window served as a lowpass filter on each individual tap of the FIR filter being implemented.
- In the more typical case in which is the window length divided by a small integer like - , we may think of the window as specifying a type of cross-fade from the LTI filter for one frame to the LTI filter for the next frame.
- Using a Bartlett (triangular) window with % overlap, ( ), the sequence of FIR filters used is obtained simply by linearly interpolating the LTI filter for one frame to the LTI filter for the next.
- In FBS, there is no limitation on how fast the filter may vary with time, but its length is limited to that of the window .
- In OLA, there is no limit on length (just add more zero-padding), but the filter taps are band-limited to the spectral width of the window.
- FBS filters are time-limited by , while OLA filters are band-limited by (another dual relation).
- Recall for comparison that each frame in the OLA method is filtered
where denotes .
- Time-varying FBS filters are instantly in ``steady state''
- FBS filters must be changed very slowly to avoid clicks and pops (discontinuity distortion is likely when the filter changes)
Short-Time Fourier Transform (STFT) may be viewed either as an OverLap-Add (OLA) processor, or as a Filter-Bank Sum (FBS). We derived two conditions for perfect reconstruction which are Fourier duals of each other:
- For OLA, the window must overlap-add to a constant in the time domain. By the Poisson summation formula, this is equivalent to having window transform nulls at all nonzero multiples of the frame rate .
- For FBS, the window transform must overlap-add to a constant in the frequency domain, and this is equivalent to having window nulls in the time domain at all nonzero multiples of the transform size .
Applications of the STFT
Overlap-Add (OLA) STFT Processing