The Filter Bank Summation (FBS) Interpretation of the Short Time Fourier Transform (STFT)

Dual Views of the Short Time Fourier Transform (STFT)

In the overlap-add formulation of Chapter 8, we used a hopping window to extract time-limited signals to which we applied the DFT. Assuming for the moment that the hop size $ R=1$ (the ``sliding DFT''), we have

$\displaystyle \zbox {X_m(\omega_k) = \sum_{n=-\infty}^\infty [w(n-m) x(n)] e^{-j\omega_k n}.} \protect$ (10.1)

This is the usual definition of the Short-Time Fourier Transform (STFT) (§7.1). In this chapter, we will look at the STFT from two different points of view: the OverLap-Add (OLA) and Filter-Bank Summation (FBS) points of view. We will show that one is the Fourier dual of the other [9]. Next we will explore some implications of the filter-bank point of view and obtain some useful insights. Finally, some applications are considered.

Overlap-Add (OLA) Interpretation of the STFT

In the OLA interpretation of the STFT, we apply a time-shifted window $ w(n-m)$ to our signal $ x(n)$ , selecting data near time $ m$ , and compute the Fourier-transform to obtain the spectrum of the $ m$ th frame. As shown in Fig.9.1, the STFT is viewed as a time-ordered sequence of spectra, one per frame, with the frames overlapping in time.

Figure 9.1: Overlap-Add (OLA) view of the STFT
\includegraphics{eps/ola}


Filter-Bank Summation (FBS) Interpretation of the STFT

We can group the terms in the STFT definition differently to obtain the filter-bank interpretation:

$\displaystyle X_m(\omega_k)$ $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^\infty \underbrace{[ x(n)e^{-j\omega_k n}]}_{x_k(n)} w(n-m)$  
  $\displaystyle =$ $\displaystyle \left[x_k \ast \hbox{\sc Flip}(w)\right](m)
\protect$ (10.2)

As will be explained further below (and illustrated further in Figures 9.3, 9.4, and 9.5), under the filter-bank interpretation, the spectrum of $ x$ is first rotated along the unit circle in the $ z$ plane so as to shift frequency $ \omega_k$ down to 0 (via modulation by $ e^{-j\omega_k n}$ in the time domain), thus forming the heterodyned signal $ x_k(n)\isdeftext x(n)\exp(-j\omega_k
n)$ . Next, the heterodyned signal $ x_k(n)$ is lowpass-filtered to a narrow band about frequency 0 (via convolving with the time-reversed window $ \hbox{\sc Flip}(w)$ ). The STFT is thus interpreted as a frequency-ordered collection of narrow-band time-domain signals, as depicted in Fig.9.2. In other words, the STFT can be seen as a uniform filter bank in which the input signal $ x(n)$ is converted to a set of $ N$ time-domain output signals $ X_n(\omega_k)$ , $ k=0,1,\ldots,N-1$ , one for each channel of the $ N$ -channel filter bank.

Figure 9.2: Filter Bank Summation (FBS) view of the STFT
\includegraphics{eps/fbs}

Expanding on the previous paragraph, the STFT (9.2) is computed by the following operations:

  • Frequency-shift $ x(n)$ by $ -\omega_k$ to get $ x_k(n) \mathrel{\stackrel{\Delta}{=}}e^{-j\omega_k n}x(n)$ .
  • Convolve $ x_k(n)$ with $ {\tilde w}\mathrel{\stackrel{\Delta}{=}}\hbox{\sc Flip}(w)$ to get $ X_m(\omega_k)$ :

    $\displaystyle X_m(\omega_k) = \sum_{n=-\infty}^\infty x_k(n){\tilde w}(m-n) = (x_k * {\tilde w})(m)$ (10.3)

The STFT output signal $ X_m(\omega_k)$ is regarded as a time-domain signal (time index $ m$ ) coming out of the $ k$ th channel of an $ N$ -channel filter bank. The center frequency of the $ k$ th channel filter is $ \omega_k =
2\pi k/N$ , $ k=0,1,\ldots,N-1$ . Each channel output signal is a baseband signal; that is, it is centered about dc, with the ``carrier term'' $ e^{j\omega_k m}$ taken out by ``demodulation'' (frequency-shifting). In particular, the $ k$ th channel signal is constant whenever the input signal happens to be a sinusoid tuned to frequency $ \omega_k$ exactly.

Note that the STFT analysis window $ w$ is now interpreted as (the flip of) a lowpass-filter impulse response. Since the analysis window $ w$ in the STFT is typically symmetric, we usually have $ \hbox{\sc Flip}(w)=w$ . This filter is effectively frequency-shifted to provide each channel bandpass filter. If the cut-off frequency of the window transform is $ \omega_c$ (typically half a main-lobe width), then each channel signal can be downsampled significantly. This downsampling factor is the FBS counterpart of the hop size $ R$ in the OLA context.

Figure 9.3 illustrates the filter-bank interpretation for $ R=1$ (the ``sliding STFT''). The input signal $ x(n)$ is frequency-shifted by a different amount for each channel and lowpass filtered by the (flipped) window.

\begin{psfrags}
% latex2html id marker 23871\psfrag{w}{\Large$\protect\hbox{\sc Flip}(w)$}\psfrag{x(n)}{\LARGE$x(n)$}\psfrag{X0}{\LARGE$X_n(\omega_{\scriptscriptstyle 0}$)}\psfrag{X1}{\LARGE$X_n(\omega_{\scriptscriptstyle 1}$)}\psfrag{XNm1}{\LARGE$X_n(\omega_{\scriptscriptstyle {N}-1})$}\psfrag{ejw0}{\huge$e^{-j\omega_{\scriptscriptstyle 0}n}$}\psfrag{ejw1}{\huge$e^{-j\omega_{\scriptscriptstyle 1}n}$}\psfrag{ejwNm1}{\huge$e^{-j\omega_{\scriptscriptstyle {N-1}}n}$}\psfrag{dR}{\LARGE$\downarrow R$}\psfrag{X}{\LARGE$\times$}\begin{figure}[htbp]
\includegraphics[width=3in]{eps/fbs1}
\caption{Sliding STFT analysis filter bank.
The $k$th channel of the filter bank computes
$X_n(\omega_k)=(x_k\ast \hbox{\sc Flip}{w})(n)$, where $x_k(n)\isdeftext
x(n)\exp(-j\omega_k n)$.
}
\end{figure}
\end{psfrags}


FBS and Perfect Reconstruction

An important property of the STFT established in Chapter 8 is that it is exactly invertible when the analysis window satisfies the constant-overlap-add constraint. That is, neglecting numerical round-off error, the inverse STFT reproduces the original input signal exactly. This is called the perfect reconstruction property of the STFT, and modern filter banks are usually designed with this property in mind [287].

In the OLA processors of Chapter 8, perfect reconstruction was assured by using FFT analysis windows $ w$ having the Constant-Overlap-Add (COLA) property at the particular hop-size $ R$ used (see §8.2.1).

In the Filter Bank Summation (FBS) interpretation of the STFT (Eq.$ \,$ (9.1)), it is the analysis filter-bank frequency responses $ W(\omega-\omega_k)$ that are constrained to be COLA. We will take a look at this more closely below.


STFT Filter Bank

Each channel of an STFT filter bank implements the processing shown in Fig.9.4. The same processing is shown in the frequency domain in Fig.9.5. Note that the window transform $ W(\omega)$ is complex-conjugated because the window $ w$ is flipped in the time domain, i.e., $ w(-n)\;\leftrightarrow\;\overline{W(\omega)}$ when $ w$ is real [264].

Figure: One channel of the STFT filter bank computing $ X_n(\omega_k)=(x_k\ast \protect{\tilde w})(n)$ , where $ x_k(n)\isdeftext x(n)\exp(-j\omega_k
n)$ , and $ \protect{\tilde w}\protect\isdeftext \protect\hbox{\sc Flip}(w)$ .
\includegraphics{eps/fbs-chan}

Figure: One channel of the STFT filter bank in the frequency domain ( $ \overline {W}$ denotes the complex conjugate of $ W$ ).
\includegraphics{eps/fbs-chan-fd}

These channels are then arranged in parallel to form a filter bank, as shown in Fig.9.3. In practice, we need to know under what conditions the channel filters $ w$ will yield perfect reconstruction when the channel signals are remodulated and summed. (A sufficient condition for the sliding STFT is that the channel frequency responses overlap-add to a constant over the unit circle in the frequency domain.) Furthermore, since the channel signals are heavily oversampled, particularly when the chosen window $ w$ has low side-lobe levels, we would like to be able to downsample the channel signals without loss of information. It is indeed possible to downsample the channel signals while retaining the perfect reconstruction property, as we will see in §9.8.1.

Computational Examples in Matlab

In this section, we will take a look at some STFT filter-bank output signals when the input signal is a ``chirp.'' A chirp signal is generally defined as a sinusoid having a linearly changing frequency over time:

\begin{eqnarray*}
x(t) &\isdef & \cos(\theta_t+\phi)\\
\frac{d\theta_t}{dt} &=& \alpha t + \omega_0
\end{eqnarray*}

The matlab code is as follows:

N=10;           % number of filters = DFT length
fs=1000;        % sampling frequency (arbitrary)
D=1;            % duration in seconds

L = ceil(fs*D)+1; % signal duration (samples)
n = 0:L-1;        % discrete-time axis (samples)
t = n/fs;         % discrete-time axis (sec)
x = chirp(t,0,D,fs/2);   % sine sweep from 0 Hz to fs/2 Hz
%x = echirp(t,0,D,fs/2); % for complex "analytic" chirp
x = x(1:L);       % trim trailing zeros at end
h = ones(1,N);    % Simple DFT lowpass = rectangular window
%h = hamming(N);  % Better DFT lowpass = Hamming window
X = zeros(N,L);   % X will be the filter bank output
for k=1:N         % Loop over channels
  wk = 2*pi*(k-1)/N;
  xk = exp(-j*wk*n).* x;  % Modulation by complex exponential
  X(k,:) = filter(h,1,xk);
end

Figure 9.6 shows the input and output-signal real parts for a ten-channel DFT filter bank based on the rectangular window as derived above. The imaginary parts of the channel-filter output signals are similar so they're not shown. Notice how the amplitude envelope in each channel follows closely the amplitude response of the running-sum lowpass filter. This is more clearly seen when the absolute values of the output signals are viewed, as shown in Fig.9.7.

Figure 9.6: 10-Channel DFT filter bank real-chirp response. Output real part versus time in each channel (rectangular window).
\includegraphics[width=\twidth,height=6.5in]{eps/dcrrrp}

Figure 9.7: 10-Channel DFT filter bank real-chirp response. Output magnitude versus time in each channel (rectangular window).
\includegraphics[width=\twidth,height=6.5in]{eps/dcrr}

Replacing the rectangular window with the Hamming window gives much improved channel isolation at the cost of doubling the channel bandwidth, as shown in Fig.9.8. Now the window-transform side lobes (lowpass filter stop-band response) are not really visible to the eye. The intense ``beating'' near dc and half the sampling rate is caused by the fact that we used a real chirp. The matlab for this chirp boils down to the following:

function x = chirp(t,f0,t1,f1);
beta = (f1-f0)./t1;
x = cos(2*pi * ( 0.5* beta .* (t.^2) + f0*t));
We can replace this real chirp with a complex ``analytic'' chirp by replacing the last line above by the following:
x = exp(j*(2*pi * ( 0.5* beta .* (t.^2) + f0*t)));
Since the analytic chirp does not contain a negative-frequency component which beats with the positive-frequency component, we obtain the cleaner looking output moduli shown in Fig.9.9.

Since our chirp frequency goes from zero to half the sampling rate, we are no longer exciting the negative-frequency channels. (To fully traverse the frequency axis with a complex chirp, we would need to sweep it from $ -\pi$ to $ \pi$ .) We see in Fig.9.9 that there is indeed relatively little response in the ``negative-frequency channels'' for which $ k=6,7,8,9$ , but there is some noticeable ``leakage'' from channel 0 into channel $ 9$ , and channel 5 similarly leaks into channel 6. Since the channel pass-bands overlap approximately 75%, this is not unexpected. The automatic vertical scaling in the channel 7 and 8 plots shows clearly the side-lobe structure of the Hamming window. Finally, notice also that the length $ N-1=9$ start-up transient is visible in each channel output just after time 0.

Figure 9.8: Hamming-window DFT filter bank real-chirp response. Output magnitude versus time in each channel.
\includegraphics[width=\twidth,height=6.5in]{eps/dcrh}

Figure 9.9: Hamming-window DFT filter bank analytic chirp response. Output magnitude versus time in each channel.
\includegraphics[width=\twidth,height=6.5in]{eps/dacrh}


The DFT Filter Bank

To obtain insight into the operation of filter banks implemented using an FFT, this section will derive the details of the DFT Filter Bank. More general STFT filter banks are obtained by using different windows and hop sizes, but otherwise are no different from the basic DFT filter bank.

The Discrete Fourier Transform (DFT) is defined by [264]

$\displaystyle X(\omega_k) = \sum_{n=0}^{N-1} x(n) e^{-j\omega_k n}$ (10.4)

where $ x(n)$ is the input signal at time $ n$ , and $ \omega_k\isdef 2\pi k/N$ . In this section, we will show how the DFT can be computed exactly from a bank of $ N$ FIR bandpass filters, where each bandpass filter is implemented as a demodulator followed by a lowpass filter. We will then find that the inverse DFT is computed by remodulating and summing the output of this filter bank. In this way, the DFT filter bank is shown to be a perfect-reconstruction filter bank. The STFT is then an extension of the DFT filter bank to include non-rectangular analysis windows $ w$ and a downsampling factor $ R$ .

The Running-Sum Lowpass Filter

Perhaps the simplest FIR lowpass filter is the so-called running-sum lowpass filter [175]. The impulse response of the length $ N$ running-sum lowpass filter is given by

$\displaystyle h(n) \isdef \left\{\begin{array}{ll} 1, & n=0,1,2,...,N-1 \\ [5pt] 0, & \hbox{otherwise.} \\ \end{array} \right. \protect$ (10.5)

Figure 9.10: Black-box view of application of the running-sum lowpass filter.
\includegraphics{eps/blackbox}

Figure 9.10 depicts the generic operation of filtering $ x(n)$ by $ h(n)$ to produce $ y(n)$ , where $ h(n)$ is the impulse response of the filter. The output signal is given by the convolution of $ x$ and $ h$ :

\begin{eqnarray*}
y(n) &=& (h\ast x)(n)
\isdef \sum_{m=-\infty}^{\infty} h(m) x(n-m)
= \sum_{m=0}^{N-1} x(n-m)\\
&=& x(n) + x(n-1) + \cdots + x(n-N+1).
\end{eqnarray*}

In this form, it is clear why the filter (9.5) is called ``running sum'' filter. Dividing it by $ N$ , it becomes a ``moving average'' filter, averaging the most recent $ N$ input samples.

The transfer function of the running-sum filter is given by [263]

$\displaystyle H(z) = 1 + z^{-1}+ \cdots + z^{-N+1} = \frac{1-z^{-N}}{1-z^{-1}},$ (10.6)

so that its frequency response is

\begin{eqnarray*}
H(e^{j\omega}) &=& \frac{1-e^{-j\omega N}}{1-e^{-j\omega }}
= \frac{e^{-j\omega N/2}}{e^{-j\omega /2}}
\frac{\sin(\omega N/2)}{\sin(\omega /2)}\\ [10pt]
&\isdef &
Ne^{-j\omega(N-1)/2} \hbox{asinc}_N(\omega ).
\end{eqnarray*}

Recall that the term $ e^{-j\omega(N-1)/2}$ is a linear phase term corresponding to a delay of $ (N-1)/2$ samples (half of the FIR filter order). This arises because we defined the running-sum lowpass filter as a causal, linear phase filter.

We encountered the ``aliased sinc function''

$\displaystyle \hbox{asinc}_N(\omega ) \isdef \frac{\sin(\omega N/2)}{N\cdot\sin(\omega /2)}$ (10.7)

previously in Chapter 53.1.2) and elsewhere as the Fourier transform (DTFT) of a sampled rectangular pulse (or rectangular window).

Note that the dc gain of the length $ N$ running sum filter is $ N$ . We could use a moving average instead of a running sum ( $ h
\leftarrow h/N$ ) to obtain unity dc gain.

Figure: Running-sum amplitude response for $ N=5$
\includegraphics[width=4in]{eps/sincabs}

Figure 9.11 shows the amplitude response of the running-sum lowpass filter for length $ N=5$ . The gain at dc is $ N=5$ , and nulls occur at $ \omega = \pm2\pi/5$ and $ \pm4\pi/5$ . These nulls occur at the sinusoidal frequencies having respectively one and two periods under the 5-sample ``rectangular window''. (Three periods would need at least $ 2\cdot 3 = 6$ samples, so $ 6\pi/5$ doesn't ``fit''.) Since the pass-band about dc is not flat, it is better to call this a ``dc-pass filter'' rather than a ``lowpass filter.'' We could also call it a dc sampling filter.10.1


Modulation by a Complex Sinusoid

Figure: System diagram for complex demodulation (frequency-shifting) by $ -\omega _c$ .
\includegraphics{eps/modulation}

Figure 9.12 shows the system diagram for complex demodulation.10.2The input signal $ x(n)$ is multiplied by a complex sinusoid to produce the frequency-shifted result

$\displaystyle x_c(n) = e^{-j\omega_c n} x(n).$ (10.8)

Given a signal expressed as a sum of sinusoids,

$\displaystyle x(n) = \sum_{k=1}^{N_x} a_k e^{j\omega_k n}, \quad a_k\in{\bf C},$ (10.9)

then the demodulation produces

$\displaystyle x_c(n) \isdef x(n) e^{-j\omega_c n} = \sum_{k=1}^{N_x} a_k e^{j(\omega_k -\omega_c) n}.$ (10.10)

We see that frequency $ \omega_k$ is down-shifted to $ \omega_k-\omega_c$ . In particular, frequency $ \omega_c$ (the ``center frequency'') is down-shifted to dc.


Making a Bandpass Filter from a Lowpass Filter

Figure 9.13: Construction of a bandpass filter using demodulation, lowpass filtering, and remodulation.
\includegraphics{eps/BPF}

Figure 9.13 shows how a bandpass filter can be made using a lowpass filter together with modulation. The input spectrum is frequency-shifted by $ -\omega _c$ , lowpass filtered, then frequency-shifted by $ +\omega_c$ , thereby creating a bandpass filter centered at frequency $ \omega_c$ . From our experience with rectangular-window transforms (Fig.9.11 being one example), we can say that the bandpass-filter bandwidth is equal to the main-lobe width of the aliased sinc function, or $ 4\pi/N$ radians per sample (measured from zero-crossing to zero-crossing).


Uniform Running-Sum Filter Banks

Using a length $ N$ running-sum filter, let's make $ N$ bandpass filters tuned to center frequencies

$\displaystyle \omega_k\isdef k\frac{2\pi}{N}, \quad k=0,1,2,\ldots,N-1.$ (10.11)

Since the bandwidths, as defined, are $ 4\pi/N$ , the filter pass-bands overlap by 50%. A superposition of the bandpass frequency responses for $ N=5$ is shown in Fig.9.14. Also shown is the frequency-response sum, which we will show to be exactly constant and equal to $ N$ . This gives our filter bank the perfect reconstruction property. We can simply add the outputs of the filters in the filter bank to recreate our input signal exactly. This is the source of the name Filter-Bank Summation (FBS).

Figure: Example filter-bank channel frequency responses for $ N=5$
\includegraphics[width=3in]{eps/sincbank}

System Diagram of the Running-Sum Filter Bank

Figure 9.15: DFT Filter Bank.
\includegraphics{eps/BPFB}

Figure 9.15 shows the system diagram of the complete $ N$ -channel filter bank constructed using length $ N$ FIR running-sum lowpass filters. The $ k$ th channel computes:

$\displaystyle y_k(n)$ $\displaystyle =$ $\displaystyle (h\ast x_k)(n) = \sum_{m=0}^{N-1}h(m)x_k(n-m)$  
  $\displaystyle =$ $\displaystyle (x_k\ast h)(n) = \sum_{m=n-(N-1)}^{n}x_k(m)h(n-m)$  
  $\displaystyle =$ $\displaystyle \sum_{m=n-(N-1)}^{n}x(m)e^{-j\omega_k m }\hbox{\sc Shift}_{n,m}(\hbox{\sc Flip}(h))$  
  $\displaystyle =$ $\displaystyle \sum_{m=n-(N-1)}^{n}x(m)e^{-j\omega_k m }
\protect$ (10.12)


DFT Filter Bank

Recall that the Length $ N$ Discrete Fourier Transform (DFT) is defined as

$\displaystyle X(k) \isdef \sum_{n=0}^{N-1} x(n) e^{-j2\pi nk/N}$ (10.13)

Comparing this to (9.12), we see that the filter-bank output $ y_k(n)$ , $ k=0,1,\ldots,N-1$ , is precisely the DFT of the input signal when $ n=N-1$ , i.e.,

$\displaystyle \zbox {X(k) = y_k(N-1)}.$ (10.14)

In other words, the filter-bank output at time $ n=N-1$ (the set of $ N$ samples $ y_k(N-1)$ for $ k=0,1,2,\ldots,N-1$ ), equals the DFT of the first $ N$ samples of $ x$ ($ x(n)$ , $ n=0,\ldots,N-1$ ). That is, taking a snapshot of all filter-bank channels at time $ N-1$ yields the DFT of the input data from time 0 through $ N-1$ .

More generally, for all $ n$ , we will call Fig.9.15 the DFT filter bank. The DFT filter bank is the special case of the STFT for which a rectangular window and hop size $ R=1$ are used.

The sliding DFT is obtained by advancing successive DFTs by one sample:

$\displaystyle X_n(k) \isdef \sum_{m=0}^{N-1} x(n+m) e^{-j2\pi mk/N}$ (10.15)

When $ n=LN-1$ for any integer $ L$ , the Sliding DFT coincides with the DFT filter bank. At other times, they differ by a linear phase term. (Exercise: find the linear phase term.) The Sliding DFT redefines the time origin every sampling period (each modulation term within the DFT starts at time 0 for each transform), while the DFT Filter Bank does not redefine the time origin (modulation terms are ``free running'' as they would be in an analog filter bank). Since ``DFT time'' repeats every $ N$ samples, the two treatments coincide every $ N$ samples (i.e., $ e^{j\omega_k(n+LN)}=e^{j\omega_kn}$ for every integer $ L$ ).

When $ N$ is a power of 2, the DFT can be implemented using a Cooley-Tukey Fast Fourier Transform (FFT) using only $ {\cal O}(N\log_2(N))$ operations per transform. By keeping track of the linear phase term (an $ {\cal O}(N)$ modification), a DFT Filter Bank can be implemented efficiently using an FFT. Uniform FIR filter banks are very often implemented in practice using FFT software such as fftw.

Note that the channel bandwidths are narrow compared with half the sampling rate (especially for large $ N$ ), so that the filter bank output signals $ y_k(n)$ are oversampled, in general. We will later look at downsampling the channel signals $ y_k(n)$ to obtain a ``hopping FFT'' filter bank. ``Sliding'' and ``hopping'' FFTs are special cases of the discrete-time Short Time Fourier Transform (STFT). The STFT normally also uses a window function other than the rectangular window used in this development (the running-sum lowpass filter).


Inverse DFT and the DFT Filter Bank Sum

The Length $ N$ inverse DFT is given by [264]

$\displaystyle x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k) e^{j2\pi nk/N}, \quad n=0,1,2,\ldots,N-1.$ (10.16)

This suggests that the DFT Filter Bank can be inverted by simply remodulating the baseband filter-bank signals $ y_k(n)$ , summing over $ k$ , and dividing by $ N$ for proper normalization. That is, we are led to conjecture that

$\displaystyle x(n-N+1) = \frac{1}{N}\sum_{k=0}^{N-1} y_k(n) e^{j2\pi nk/N}, \quad n=0,1,2,\ldots\,.$ (10.17)

This is in fact true, as we will later see. (It is straightforward to show as an exercise.)


FBS Window Constraints for R=1

Recall that in overlap-add (Chapter 8), perfect reconstruction required only that the analysis window $ w$ meet a constant overlap-add (COLA) constraint:

$\displaystyle \sum_{m=-\infty}^\infty w(n-mR) = c$ (10.18)

where $ c\neq 0$ is any constant (always true for $ R=1$ ).10.3

The Filter Bank Summation (FBS) is interpreted as a demodulation (frequency shift by $ -\omega_k$ ) and subsequent lowpass filtering by $ w$ . Therefore, to resynthesize our original signal, we need to remodulate each baseband signal and sum up the channels. For $ R=1$ (no downsampling), this sum is given by [9]

$\displaystyle \hat{x}(m)$ $\displaystyle =$ $\displaystyle \sum_{k=0}^{N-1} X_m(\omega_k)e^{j\omega_km}$  
  $\displaystyle =$ $\displaystyle \sum_{k=0}^{N-1} \left[ \sum_{n=-\infty}^\infty
x(n)w(n-m)e^{-j\omega_kn} \right] e^{j\omega_km}$  
  $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^\infty x(n)w(n-m) \sum_{k=0}^{N-1} e^{j\omega_k(m-n)}$  
  $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^\infty x(n)w(\underbrace{n-m}_{-rN})
N\sum_{r=-\infty}^\infty \delta(m-n-rN)$  
  $\displaystyle =$ $\displaystyle N\sum_{r=-\infty}^\infty {\tilde w}(rN)x(m-rN)$  
  $\displaystyle =$ $\displaystyle Nw(0) x(m) \qquad \hbox{if $w(rN)=0,\,\forall r\neq 0$}
\protect$ (10.19)

We have thus derived the following sufficient condition for perfect reconstruction [213]:
$\textstyle \parbox{0.8\textwidth}{Perfect
reconstruction is assured in the sliding STFT remodulated filter-bank
sum provided the analysis window is zero at all nonzero multiples of
$N$.}$
Since normally our windows are shorter than $ N$ , this condition holds trivially. In the overlap-add context, we also had guaranteed perfect reconstruction in this case ($ R=1$ ), because every window $ w$ overlap-adds to a constant at displacement $ R=1$ .


Nyquist(N) Windows

In (9.19) of the previous section, we derived that the FBS reconstruction sum gives

$\displaystyle \hat{x}(n) = N\sum_{r=-\infty}^\infty {\tilde w}(rN)x(n-rN)$ (10.20)

where $ {\tilde w}=\hbox{\sc Flip}(w)$ . From this we see that if $ M<N$ (where $ M$ is the window length and $ N$ is the DFT size), as is normally the case, then $ w(rN) = 0$ for $ \vert r\vert = 1,2, \dots\,$ . This is the Fourier dual of the ``strong COLA constraint'' for OLA (see §8.3.2). When it holds, we have

$\displaystyle \hat{x}(n) = N w(0) x(n).$ (10.21)

This is simply a gain term, and so we are able to recover the original signal exactly. (Zero-phase windows are appropriate here.)

If the window length is larger than the number of analysis frequencies ($ M>N$ ), we can still obtain perfect reconstruction provided

$\displaystyle w(rN) = 0, \hspace{1cm} \vert r\vert=1,2,\dots \qquad\hbox{[$w$\ is Nyquist($N$)]}$ (10.22)

When this holds, we say the window is $ \hbox{\sc Nyquist}(N)$ . (This is the dual of the weak COLA constraint introduced in §8.3.1.) Portnoff windows, discussed in §9.7, make use of this result; they are longer than the DFT size and therefore must be used in time-aliased form [62]. An advantage of Portnoff windows is that they give reduced overlap among the channel filter pass-bands. In the limit, a Portnoff window can approach a sinc function having its zero-crossings at all nonzero multiples of $ N$ samples, thereby yielding an ideal channel filter with bandwidth $ 2\pi/N$ . Figure 9.16 compares example Hamming and Portnoff windows.

Figure 3.35:
\includegraphics[width=\twidth]{eps/colawin}


Duality of COLA and Nyquist Conditions

Let $ \hbox{\sc Cola}(N)$ denote constant overlap-add using hop size $ N$ . Then we have (by the Poisson summation formula Eq.$ \,$ (8.30))

\begin{eqnarray*}
w &\in& \hbox{\sc Nyquist}(N) \Leftrightarrow W \in \hbox{\sc Cola}(2\pi/N) \qquad \hbox{(FBS)} \\ [10pt]
w &\in& \hbox{\sc Cola}(R) \Leftrightarrow W \in \hbox{\sc Nyquist}(2\pi/R) \qquad \hbox{(OLA)}
\end{eqnarray*}

Specific Windows

  • Recall that the rectangular window transform is $ \hbox{\sc Nyquist}(2\pi/M)$ , implying the rectangular window itself is $ \hbox{\sc Cola}(M)$ , which is obvious.

  • The window transform for the Hamming family is $ \hbox{\sc Nyquist}(4\pi/M)$ , implying that Hamming windows are $ \hbox{\sc Cola}(M/2)$ , which we also knew.

  • The rectangular window transform is also $ \hbox{\sc Nyquist}(K2\pi/M)$ for any integer $ 1\leq K\leq M/2$ , implying that all hop sizes given by $ R=M/K$ for $ K=1,2,\ldots,M/2$ are COLA.

  • Because its side lobes are the same width as the sinc side lobes, the Hamming window transform is also $ \hbox{\sc Nyquist}(K2\pi/M)$ ,for any integer $ 2\leq K\leq M/2$ , implying hop sizes $ R=M/K$ are good, for $ K=2,\ldots,M/2$ . Thus, the available hop sizes for the Hamming window family include all of those for the rectangular window except one ($ R=M$ ).


The Nyquist Property on the Unit Circle

As a degenerate case, note that $ R=1$ is COLA for any window, while no window transform is $ \hbox{\sc Nyquist}(2\pi)$ except the zero window. (since it would have to be zero at dc, and we do not consider such windows). Did the theory break down for $ R=1$ ?

Intuitively, the $ \hbox{\sc Nyquist}(2\pi/R)$ condition on the window transform $ W(\omega)$ ensures that all nonzero multiples of the time-domain-frame-rate $ 2\pi/R$ will be zeroed out over the interval $ [-\pi,\pi)$ along the frequency axis. When the frame-rate equals the sampling rate ($ R=1$ ), there are no frame-rate multiples in the range $ [-\pi,\pi)$ . (The range $ [0,2\pi)$ gives the same result.) When $ R=2$ , there is exactly one frame-rate multiple at $ -\pi$ . When $ R=3$ , there are two at $ \pm 2\pi/3$ . When $ R=4$ , they are at $ -\pi$ and $ \pm\pi/2$ , and so on.

We can cleanly handle the special case of $ R=1$ by defining all functions over the unit circle as being $ \hbox{\sc Nyquist}(2\pi)$ when there are no frame-rate multiples in the range $ [-\pi,\pi)$ . Thus, a discrete-time spectrum $ W(\omega), \omega\in[-\pi,\pi)$ is said to be $ \hbox{\sc Nyquist}(2\pi/K)$ if $ W(r 2\pi/K)=0$ , for all $ \vert r\vert=1,2,\ldots,\left\lfloor K/2\right\rfloor $ , where $ \left\lfloor x\right\rfloor $ (the ``floor function'') denotes the greatest integer less than or equal to $ x$ .


Portnoff Windows

In 1976 [212], Portnoff observed that any window $ w$ of the form

$\displaystyle w(n) = w_0(n)$   sinc$\displaystyle (n/N),$ (10.23)

being $ \hbox{\sc Nyquist}(N)$ by construction, will obey the weak $ \hbox{\sc Cola}(2\pi/N)$ constraint, where $ N$ is the number of spectral samples. In this result, $ w_0(n)$ is any window function whatsoever, and the sinc function is defined as usual by

sinc$\displaystyle (n) \mathrel{\stackrel{\Delta}{=}}\frac{\sin(\pi n)}{\pi n}$ (10.24)

(the unit-amplitude sinc function with zeros at all nonzero integers).

Portnoff suggested that, in practical usage, windowed data segments longer than the FFT size should be time-aliased about length $ N$ prior to performing an FFT. This result is readily derived from the definition of the time-normalized STFT introduced in Eq.$ \,$ (8.21):

\begin{eqnarray*}
{\tilde X}_m(\omega_k)
&\isdef & \hbox{\sc Sample}_{\Omega_N,k}\left(\hbox{\sc DTFT}\left({\tilde x}_m\right)\right) \\
&\isdef & \hbox{\sc Sample}_{\Omega_N,k}\left(\hbox{\sc DTFT}\left(\hbox{\sc Shift}_{-m}(x)\cdot w\right)\right) \\
&=& \sum_{n=-\infty}^\infty x(n+m)w(n)e^{-j\omega_k n}\quad\hbox{(now let $n\isdef lN+i$)}\\
&=& \sum_{l=-\infty}^\infty \sum_{i=0}^{N-1}x(lN+i+m)w(lN+i)
\underbrace{e^{-j\omega_k (lN+i)}}_{e^{-j\omega_k i}}\\
&=& \sum_{i=0}^{N-1}\left[\sum_{l=-\infty}^\infty x(lN+i+m)w(lN+i)\right]
e^{-j\omega_k i}\\
&=& \sum_{i=0}^{N-1}\hbox{\sc Alias}_{N,i}[\hbox{\sc Shift}_{-m}(x)\cdot w] e^{-j\omega_k i}\\
&\isdef & \hbox{\sc DFT}_{N,k}\{\hbox{\sc Alias}_N[\hbox{\sc Shift}_{-m}(x)\cdot w]\},
\end{eqnarray*}

where $ \omega_k \isdef 2\pi k/N \isdef k\Omega_N$ as usual.

Choosing $ M\gg N$ allows multiple side lobes of the sinc function to alias in on the main lobe. This gives channel filters in the frequency domain which are sharper bandpass filters while remaining COLA. I.e., there is less channel cross-talk in the frequency domain. However, the time-aliasing corresponds to undersampling in the frequency domain, implying less robustness to spectral modifications, since such modifications can disturb the time-domain aliasing cancellation. Since the hop size needs to be less than $ N$ , the overall filter bank based on a Portnoff window remains oversampled in the time domain.


Downsampled STFT Filter Banks

We now look at STFT filter banks which are downsampled by the factor $ R>1$ . The downsampling factor $ R$ corresponds to a hop size of $ R$ samples in the overlap-add view of the STFT. From the filter-bank point of view, the impact of $ R>1$ is aliasing in the channel signals when the lowpass filter (analysis window) is less than ideal. When the conditions for perfect reconstruction are met, this aliasing will be canceled in the reconstruction (when the filter-bank channel signals are remodulated and summed).

Downsampled STFT Filter Bank

So far we have considered only $ R=1$ (the ``sliding'' DFT) in our filter-bank interpretation of the STFT. For $ R>1$ we obtain a downsampled version of $ X_m(\omega_k)$ :

\begin{eqnarray*}
X_{mR}(\omega_k) &=& \sum_{n=-\infty}^\infty [x(n)e^{-j\omega_kn}]\tilde{w}(mR-n)
\hspace{1.2cm} (\tilde{w} \mathrel{\stackrel{\Delta}{=}}\hbox{\sc Flip}(w)) \\
&=& (x_k \ast {\tilde w})(mR)
\end{eqnarray*}

Let us define the downsampled time index as $ \tilde{m} \mathrel{\stackrel{\Delta}{=}}mR$ so that

$\displaystyle X_{\tilde{m}}(\omega_k) = \sum_{n=-\infty}^\infty [x(n)e^{-j\omega_kn}]\tilde{w}(\tilde{m}-n) \mathrel{\stackrel{\Delta}{=}}\left(x_k \ast {\tilde w}\right)(\tilde{m})$ (10.25)

i.e., $ X_{\tilde{m}}$ is simply $ X_m$ evaluated at every $ R^{th}$ sample, as shown in Fig.9.17.

\begin{psfrags}
% latex2html id marker 25320\psfrag{w}{{\Large $\protect\hbox{\sc Flip}(w)$\ }}\psfrag{x(n)}{\Large $x(n)$\ }\psfrag{Xm}{\Large $X_m$\ }\psfrag{Xmt}{\Large $X_{\tilde{m}}$\ }\psfrag{X0}{\Large $X_{\tilde{m}}(\omega_0)$\ }\psfrag{X1}{\Large $X_{\tilde{m}}(\omega_1)$\ }\psfrag{XNm1}{\Large $X_{\tilde{m}}(\omega_{N-1})$\ }\psfrag{ejw0}{\Large $e^{-j\omega_0n}$\ }\psfrag{ejw1}{\Large $e^{-j\omega_1n}$\ }\psfrag{ejwNm1}{\Large $e^{-j\omega_{N-1}n}$\ }\psfrag{dR}{\Large $\downarrow R$\ }\begin{figure}[htbp]
\includegraphics[width=\twidth]{eps/fbs2}
\caption{Downsampled STFT filter bank.}
\end{figure}
\end{psfrags}

Note that this can be considered an implementation of a phase vocoder filter bank [212]. (See §G.5 for an introduction to the vocoder.)

Filter Bank Reconstruction

\begin{psfrags}
% latex2html id marker 25351\psfrag{w}{{\Large $f$\ }} % should fix source (.draw file)\begin{figure}[htbp]
\includegraphics[width=\twidth]{eps/FBSreconstruct}
\caption{Interpolated, remodulated, filter-bank sum.}
\end{figure}
\end{psfrags}

Since the channel signals are downsampled, we generally need interpolation in the reconstruction. Figure 9.18 indicates how we might pursue this. From studying the overlap-add framework, we know that the inverse STFT is exact when the window $ w(n)$ is $ \hbox{\sc Cola}(R)$ , that is, when $ \hbox{\sc Alias}_R(w)$ is constant. In only these cases can the STFT be considered a perfect reconstruction filter bank. From the Poisson Summation Formula in §8.3.1, we know that a condition equivalent to the COLA condition is that the window transform $ W(\omega)$ have notches at all harmonics of the frame rate, i.e., $ W(2\pi k/R)=0$ for $ k=1,2,3,R-1$ . In the present context (filter-bank point of view), perfect reconstruction appears impossible for $ R>1$ , because for ideal reconstruction after downsampling, the channel anti-aliasing filter ($ w$ ) and interpolation filter ($ f$ ) have to be ideal lowpass filters. This is a true conclusion in any single channel, but not for the filter bank as a whole. We know, for example, from the overlap-add interpretation of the STFT that perfect reconstruction occurs for hop-sizes greater than 1 as long as the COLA condition is met. This is an interesting paradox to which we will return shortly.

What we would expect in the filter-bank context is that the reconstruction can be made arbitrarily accurate given better and better lowpass filters $ w$ and $ f$ which cut off at $ \omega_c = \pi/R$ (the folding frequency associated with down-sampling by $ R$ ). This is the right way to think about the STFT when spectral modifications are involved.

In Chapter 11 we will develop the general topic of perfect reconstruction filter banks, and derive various STFT processors as special cases.


Downsampling with Anti-Aliasing

Figure 9.19: Processing in one filter-bank analysis channel.
\includegraphics[width=3in]{eps/FBSonechan}

In OLA, the hop size $ R$ is governed by the COLA constraint

$\displaystyle \sum_{m=-\infty}^\infty w(n+mR) = \hbox{constant}$ (10.26)

In FBS, $ R$ is the downsampling factor in each of the filter-bank channels, and thus the window $ w$ serves as the anti-aliasing filter (see Fig.9.19). We see that to avoid aliasing, $ W(\omega)$ must be bandlimited to $ (-\pi/R, \pi/R)$ , as illustrated schematically in Fig.9.20.

Figure 9.20: Schematic illustration of a window transform that suppresses all aliasing.
\includegraphics[width=3in]{eps/IdealWindowFD}

Properly Anti-Aliasing Window Transforms

For simplicity, define window-transform bandlimits at first zero-crossings about the main lobe. Given the first zero of $ W(\omega)$ at $ L \frac{2\pi}{M} \leq \frac{\pi}{R}$ , we obtain

$\displaystyle \zbox {R_{\hbox{max}}= \frac{M}{2L}}$ (10.27)

The following table gives maximum hop sizes for various window types in the Blackman-Harris family, where $ L$ is both the number of constant-plus-cosine terms in the window definition (§3.3) and the half-main-lobe width in units of side-lobe widths $ 2\pi/M$ . Also shown in the table is the maximum COLA hop size we determined in Chapter 8.
L Window Type (Length $ M$ ) $ R_{\hbox{max}}$ $ R_{\hbox{\hbox{\sc Cola}}}$
1 Rectangular M/2 M
2 Generalized Hamming M/4 M/2
3 Blackman Family M/6 M/3
L $ L$ -term Blackman-Harris M/2L M/L
In the table, any $ R\leq R_{\hbox{max}}$ suppresses aliasing well.

It is interesting to note that the maximum COLA hop size is double the maximum downsampling factor which avoids aliasing of the main lobe of the window transform in FFT-bin signals $ X_{\tilde
m}(\omega_k)$ . Since the COLA constraint is a sufficient condition for perfect reconstruction, this aliasing is quite heavy (see Fig.9.21), yet it is all canceled in the reconstruction. The general theory of aliasing cancellation in perfect reconstruction filter banks will be taken up in Chapter 11.

Figure 9.21: Illustration of main-lobe aliasing intervals.
\includegraphics[width=3in]{eps/WindowAliasingFD}

It is important to realize that aliasing cancellation is disturbed by FBS spectral modifications.10.4For robustness in the presence of spectral modifications, it is advisable to keep $ R\leq R_{\hbox{max}}= M/(2L)$ . For compression, it is common to use $ R = 2 R_{\hbox{max}}= R_{\hbox{\hbox{\sc Cola}}} = M/L$ together with a ``synthesis window'' in a weighted overlap-add (WOLA) scheme (§8.6).


Hop Sizes for WOLA

In the weighted overlap-add method, with the synthesis (output) window equal to the analysis (input) window, we have the following modification of the recommended maximum hop-size table:

L In and Out Window (Length $ M$ ) $ R_{\hbox{max}}$ $ R_{\hbox{\hbox{\sc Cola}}}$
1 Rectangular ($ L=1$ ) M/2 M
2 Generalized Hamming ($ L=2$ ) M/6 M/3
3 Blackman Family ($ L=3$ ) M/10 M/5
L $ L$ -term Blackman-Harris M/(4L-2) M/(2L-1)
Note that the following properties hold as before in the OLA case:
  • $ R_{\hbox{max}}$ is equal to $ 2\pi$ divided by the main-lobe width in ``side lobes'', while

  • $ R_{\hbox{\hbox{\sc Cola}}}$ is $ 2\pi$ divided by the first notch frequency in the window transform (lowest available frame rate at which all frame-rate harmonics are notched).

  • For windows in the Blackman-Harris families, and with main-lobe widths defined from zero-crossing to zero-crossing, $ R_{\hbox{\hbox{\sc Cola}}} = 2 R_{\hbox{max}}$ .


Constant-Overlap-Add (COLA) Cases

  • Weak COLA: Window transform has zeros at frame-rate harmonics:

    $\displaystyle W(\omega_k) = 0, \quad k = 1,2, \dots, R-1,
\quad \omega_k \isdef \frac{2\pi k}{R} $

  • Strong COLA: Window transform is bandlimited consistent with downsampling by the frame rate:

    $\displaystyle W(\omega) = 0, \quad \vert\omega\vert \geq \pi/R $

    • Perfect OLA reconstruction
    • No aliasing
    • better for spectral modifications
    • Time-domain window infinitely long in ideal case

Hamming Overlap-Add Example

Matlab code:

M = 33;         % window length
w = hamming(M);
R = (M-1)/2;    % maximum hop size
w(M) = 0;       % 'periodic Hamming' (for COLA)
%w(M) = w(M)/2; % another solution,
%w(1) = w(1)/2; %  interesting to compare

Figure 9.22: Periodic-Hamming OLA waveforms.
\includegraphics[width=\textwidth ]{eps/olaHammingC}


Periodic-Hamming OLA from Poisson Summation Formula

Matlab code:

ff = 1/R; % frame rate (fs=1)
N = 6*M;  % no. samples to look at OLA
sp = ones(N,1)*sum(w)/R; % dc term (COLA term)
ubound = sp(1);  % try easy-to-compute upper bound
lbound = ubound; % and lower bound
n = (0:N-1)';
for (k=1:R-1) % traverse frame-rate harmonics
  f=ff*k;
  csin = exp(j*2*pi*f*n); % frame-rate harmonic
  % find exact window transform at frequency f
  Wf = w' * conj(csin(1:M));
  hum = Wf*csin;   % contribution to OLA "hum"
  sp = sp + hum/R; % "Poisson summation" into OLA
  % Update lower and upper bounds:
  Wfb = abs(Wf);
  ubound = ubound + Wfb/R; % build upper bound
  lbound = lbound - Wfb/R; % build lower bound
end

In this example, the overlap-add is theoretically a perfect constant (equal to $ 1.08$ ) because the frame rate and all its harmonics coincide with nulls in the window transform (see Fig.9.24). A plot of the steady-state overlap-add and that computed using the Poisson Summation Formula (not shown) is constant to within numerical precision. The difference between the actual overlap-add and that computed using the PSF is shown in Fig.9.23. We verify that the difference is on the order of $ 10^{-15}$ , which is close enough to zero in double-precision (64-bit) floating-point computations. We thus verify that the overlap-add of a length $ 33$ Hamming window using a hop size of $ R = (M-1)/2 = 16$ samples is constant to within machine precision.

Figure 9.23: Periodic-Hamming Poisson summation formula test.
\includegraphics[width=\textwidth ]{eps/olassmmpHammingC}

Figure 9.24 shows the zero-padded DFT of the modified Hamming window we're using ( $ w(M)\leftarrow 0$ ) with the frame-rate harmonics marked. In this example ($ R=M/2$ ), the upper half of the main lobe aliases into the lower half of the main lobe. (In fact, all energy above the folding frequency $ \pi/R$ aliases into the lower half of the main lobe.) While this window and hop size still give perfect reconstruction under the STFT, spectral modifications will disturb the aliasing cancellation during reconstruction. This ``undersampled'' configuration is suitable as a basis for compression applications.

Figure 9.24: Hamming window transform and frame-rate.
\includegraphics[width=\textwidth ]{eps/windowTransformHammingC}

Note that if we were to cut $ R$ in half to $ R=M/4$ , then the folding frequency in Fig.9.24 would coincide with the first null in the window transform. Since the frame rate and all its harmonics continue to land on nulls in the window transform, overlap-add is still exact. At this reduced hop size, however, the STFT becomes much more robust to spectral modifications, because all aliasing in the effective downsampled filter bank is now weighted by the side lobes of the window transform, with no aliasing components coming from within the main lobe. This is the central result of [9].


Kaiser Overlap-Add Example

Matlab code:

M = 33;    % Window length
beta = 8;
w = kaiser(M,beta);
R = floor(1.7*(M-1)/(beta+1)); % ROUGH estimate (gives R=6)

Figure 9.25 plots the overlap-added Kaiser windows, and Fig.9.26 shows the steady-state overlap-add (a time segment sometime after the first 30 samples). The ``predicted'' OLA is computed using the Poisson Summation Formula using the same matlab code as before. Note that the Poisson summation formula gives exact results to within numerical precision. The upper (lower) bound was computed by summing (subtracting) the window-transform magnitudes at all frame-rate harmonics to (from) the dc gain of the window. This is one example of how the PSF can be used to estimate upper and lower bounds on OLA error.

Figure 9.25: Kaiser OLA waveforms.
\includegraphics[width=\textwidth ]{eps/olakaiserC}

Figure 9.26: Kaiser OLA, steady state.
\includegraphics[width=\textwidth ]{eps/olasskaiserC}

The difference between measured steady-state overlap-add and that computed using the Poisson summation formula is shown in Fig.9.27. Again the two methods agree to within numerical precision.

Figure 9.27: Kaiser OLA from Poisson summation formula minus computed OLA.
\includegraphics[width=\textwidth ]{eps/olassmmpkaiserC}

Finally, Fig.9.28 shows the Kaiser window transform, with marks indicating the folding frequency at the chosen hop size $ R$ , as well as the frame-rate and twice the frame rate. We see that the frame rate (hop size) has been well chosen for this window, as the folding frequency lies very close to what would be called the ``stop band'' of the Kaiser window transform. The ``stop-band rejection'' can be seen to be approximately $ 58$ dB (height of highest side lobe in Fig.9.28). We conclude that this example--a length 33 Kaiser window with $ \beta=8$ and hop-size $ R=6$ -- represents a reasonably high-quality audio STFT that will be robust in the presence of spectral modifications. We expect such robustness whenever the folding frequency lies above the main lobe of the window transform.

Figure 9.28: Kaiser window transform and frame-rate.
\includegraphics[width=\textwidth ]{eps/windowTransformkaiserC}

Remember that, for robustness in the presence of spectral modifications, the frame rate should be more than twice the highest main-lobe frequency.


STFT with Modifications


FBS Fixed Modifications

Consider applying a fixed (time-invariant) filter $ H(\omega_k)$ to each $ X_m(\omega_k)$ before resynthesizing the signal:

$\displaystyle Y_m(\omega_k) = X_m(\omega_k)H(\omega_k)$ (10.28)

where, $ H(\omega_k)$ is the sampled frequency response of a filter with impulse response

$\displaystyle h(n) = \frac{1}{N} \sum_{k=0}^{N-1} H(\omega_k) e^{j\omega_kn}, \quad n=0,\ldots,N-1$ (10.29)

Let's examine the result this has on the signal in the time domain:

\begin{eqnarray*}
y(m) &=& \frac{1}{N} \sum_{k=0}^{N-1} Y_m(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{k=0}^{N-1} X_m(\omega_k)H(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{k=0}^{N-1} \left\{ \sum_{n=-\infty}^\infty x(n)w(n-m)e^{-j\omega_kn} \right\} H(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{n=-\infty}^\infty x(n)w(n-m) \sum_{k=0}^{N-1} H(\omega_k) e^{j\omega_k(m-n)} \\
&=& \sum_{n=-\infty}^\infty x(n) [ w(n-m) h(m-n)] \\
&=& \sum_{n=-\infty}^\infty x(n) [\tilde{w}(m-n)h(m-n)] \\
&=& (x*[\tilde{w} \cdot h])(m) \\
\end{eqnarray*}

We see that the result is $ x$ convolved with a windowed version of the impulse response $ h$ . This is in contrast to the OLA technique where the result gave us a windowed $ x$ filtered by $ h$ without the window having any effect on the filter, provided it obeys the COLA constraint and sufficient zero padding is used to avoid time aliasing.

In other words, FBS gives

$\displaystyle y = x * [\tilde{w} \cdot h] \;\longleftrightarrow\;X \cdot [{\tilde W}\ast H]$ (10.30)

while OLA gives (for $ R=1$ )

$\displaystyle y = x * [W(0)\cdot h] \;\longleftrightarrow\;X \cdot [W(0)\cdot H]$ (10.31)

  • In FBS, the analysis window $ w$ smooths the filter frequency response by time-limiting the corresponding impulse response.

  • In OLA, the analysis window can only affect scaling.

For these reasons, FFT implementations of FIR filters normally use the Overlap-Add method.


Time Varying Modifications in FBS

Consider now applying a time varying modification.

$\displaystyle Y_m(\omega_k) = X_m(\omega_k)H_m(\omega_k) \qquad \hbox{($R=1$)}$ (10.32)

where

$\displaystyle H_m(\omega_k) \;\longleftrightarrow\;h_m(n) = \frac{1}{N} \sum_{k=0}^{N-1} H_m(\omega_k) e^{j\omega_kn}$ (10.33)

$ h_m(n)$ refers to the $ n^{th}$ tap of the FIR filter at time $ m$ .

\begin{eqnarray*}
y(m) &=& \frac{1}{N} \sum_{k=0}^{N-1} Y_m(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{k=0}^{N-1} X_m(\omega_k)H_m(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{k=0}^{N-1} \left\{ \sum_{n=-\infty}^\infty x(n)w(n-m)e^{-j\omega_kn} \right\} H_m(\omega_k) e^{j\omega_k m} \\
&=& \frac{1}{N} \sum_{n=-\infty}^\infty x(n)w(n-m) \sum_{k=0}^{N-1} H_m(\omega_k) e^{j\omega_k(m-n)} \\
&=& \sum_{n=-\infty}^\infty x(n) [ w(n-m) h_m(m-n)] \\
&=& \sum_{n=-\infty}^\infty x(n) [\tilde{w}(m-n)h_m(m-n)] \\
&=& (x*[\tilde{w} \cdot h_m])(m) \\
\end{eqnarray*}

Hence, the result is the convolution of $ x$ with the windowed $ h_m$ .

Points to Note

  • We saw that in OLA with time varying modifications and $ R=1$ (a ``sliding'' DFT), the window served as a lowpass filter on each individual tap of the FIR filter being implemented.

  • In the more typical case in which $ R$ is the window length $ M$ divided by a small integer like $ 2$ -$ 10$ , we may think of the window as specifying a type of cross-fade from the LTI filter for one frame to the LTI filter for the next frame.

  • Using a Bartlett (triangular) window with $ 50$ % overlap, ($ R=2$ ), the sequence of FIR filters used is obtained simply by linearly interpolating the LTI filter for one frame to the LTI filter for the next.

  • In FBS, there is no limitation on how fast the filter $ h_m$ may vary with time, but its length is limited to that of the window $ w$ .

  • In OLA, there is no limit on length (just add more zero-padding), but the filter taps are band-limited to the spectral width of the window.

  • FBS filters are time-limited by $ w$ , while OLA filters are band-limited by $ w$ (another dual relation).

  • Recall for comparison that each frame in the OLA method is filtered according to

    $\displaystyle Y_m = X_m \cdot H_m = [X*W_m] \cdot H_m \;\longleftrightarrow\; \underbrace{[x \cdot w_m]}_{x_m} * h_m$ (10.34)

    where $ w_m$ denotes $ \hbox{\sc Shift}_{mR}(w)$ .
  • Time-varying FBS filters are instantly in ``steady state''
  • FBS filters must be changed very slowly to avoid clicks and pops (discontinuity distortion is likely when the filter changes)
For more details, see [9].


STFT Summary and Conclusions

The Short-Time Fourier Transform (STFT) may be viewed either as an OverLap-Add (OLA) processor, or as a Filter-Bank Sum (FBS). We derived two conditions for perfect reconstruction which are Fourier duals of each other:

  1. For OLA, the window must overlap-add to a constant in the time domain. By the Poisson summation formula, this is equivalent to having window transform nulls at all nonzero multiples of the frame rate $ 2\pi/R$ .

  2. For FBS, the window transform must overlap-add to a constant in the frequency domain, and this is equivalent to having window nulls in the time domain at all nonzero multiples of the transform size $ N$ .

In general, STFT filter banks are oversampled except when using the rectangular window of length $ M=N$ and a hop size $ R=N$ . Critical sampling is desirable for compression systems, but this can be problematic when spectral modifications are contemplated (adjacent-channel aliasing no longer canceled).

STFT filter banks are uniform filter banks, as opposed ``constant Q''. In some audio applications, it is preferable to use non-uniform filter banks which approximate the auditory filter bank. Approximate constant-Q filter banks are easily synthesized from STFT filter banks by summing adjacent frequency channels, as detailed in §10.7 below. Additional pointers can be found in Appendix E. We will look at a particular octave filter bank when we talk about wavelet filter banks (§11.9).


Next Section:
Applications of the STFT
Previous Section:
Overlap-Add (OLA) STFT Processing