The Short-Time Fourier Transform

The Short-Time Fourier Transform (STFT) (or short-term Fourier transform) is a powerful general-purpose tool for audio signal processing [7,9,8]. It defines a particularly useful class of time-frequency distributions [43] which specify complex amplitude versus time and frequency for any signal. We are primarily concerned here with tuning the STFT parameters for the following applications:

  1. Approximating the time-frequency analysis performed by the ear for purposes of spectral display.
  2. Measuring model parameters in a short-time spectrum.
In the first case, applications of audio spectral display go beyond merely looking at the spectrum. They also provide a basis for audio signal processing tasks intended to imitate human perception, such as auditory scene recognition [26,209] or automatic transcription of music [125].

Examples of the second case include estimating the decay-time-versus-frequency for vibrating strings [288] and body resonances [119], or measuring as precisely as possible the fundamental frequency of a periodic signal [106] based on tracking its many harmonics in the STFT [64].

An interesting example for which cases 1 and 2 normally coincide is pitch detection (case 1) and fundamental frequency estimation (case 2). Here, ``fundamental frequency'' is defined as the lowest frequency present in a series of harmonic overtones, while ``pitch'' is defined as the perceived fundamental frequency; perceived pitch can be measured, for example, by comparing to a harmonic reference tone such as a sawtooth waveform. (Thus, by definition, the pitch of a sawtooth waveform is its fundamental frequency.) When harmonics are stretched so that they become slightly inharmonic, pitch perception corresponds to a (possibly non-existent) compromise fundamental frequency, the harmonics of which ``best fit'' the most audible overtones in some sense. The topic of ``pitch detection'' in the signal processing literature is often really about fundamental frequency estimation [106], and this distinction is lost. This is not a problem for strictly periodic signals.

Mathematical Definition of the STFT

The usual mathematical definition of the STFT is [9]

$\displaystyle X_m(\omega)$ $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^{\infty} x(n) w(n-mR) e^{-j\omega n}$  
  $\displaystyle =$ $\displaystyle \hbox{\sc DTFT}_\omega(x\cdot\hbox{\sc Shift}_{mR}(w)),
\protect$ (8.1)

where

\begin{eqnarray*}
x(n) &=& \hbox{input signal at time $n$}\\
w(n) &=& \hbox{length $M$\ window function (\textit{e.g.}, Hamming)}\\
X_m(\omega) &=& \hbox{DTFT of windowed data centered about time $mR$}\\
R &=& \hbox{hop size, in samples, between successive DTFTs.}\\
\end{eqnarray*}

If the window $ w(n)$ has the Constant OverLap-Add (COLA) property at hop-size $ R$ , i.e., if

$\displaystyle \zbox {\sum_{m=-\infty}^{\infty} w(n-mR) = 1, \; \forall n\in{\bf Z}} \quad\mbox{($w\in\hbox{\sc Cola}(R)$)} \protect$ (8.2)

then the sum of the successive DTFTs over time equals the DTFT of the whole signal $ X(\omega)$ :

\begin{eqnarray*}
\sum_{m=-\infty}^\infty X_m(\omega)
&\isdef &
\sum_{m=-\infty}^\infty\sum_{n=-\infty}^{\infty} x(n) w(n-mR) e^{-j\omega n}\\
&=& \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}
\underbrace{\sum_{m=-\infty}^\infty w(n-mR)}_{1\hbox{ if }w\in\hbox{\sc Cola}(R)}
\\
&=& \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n} \\
&\isdef & \hbox{\sc DTFT}_\omega(x) = X(\omega).
\end{eqnarray*}

We will say that windows satisfying $ \sum_m w(n-mR) = 1$ (or some constant) for all $ n\in{\bf Z}$ are said to be $ \hbox{\sc Cola}(R)$ . For example, the length $ M$ rectangular window is clearly $ \hbox{\sc Cola}(M)$ (no overlap). The Bartlett window and all windows in the generalized Hamming family (Chapter 3) are $ \hbox{\sc Cola}(M/2)$ (50% overlap), when the endpoints are handled correctly.8.1 A $ \hbox{\sc Cola}(M/2)$ example is depicted in Fig.8.9. Any window that is $ \hbox{\sc Cola}(R)$ is also $ \hbox{\sc Cola}(R/k)$ , for $ k=2,3,4,\ldots,R$ , provided $ R/k$ is an integer.8.2 We will explore COLA windows more completely in Chapter 8.

When using the short-time Fourier transform for signal processing, as taken up in Chapter 8, the COLA requirement is important for avoiding artifacts. For usage as a spectrum analyzer for measurement and display, the COLA requirement can often be relaxed, as doing so only means we are not weighting all information equally in our analysis. Nothing disastrous happens, for example, if we use 50% overlap with the Blackman window in a short-time spectrum analysis over time--the results look fine; however, in such a case, data falling near the edges of the window will have a slightly muted impact on the results relative to data falling near the window center, because the Blackman window is not COLA at 50% overlap.


Practical Computation of the STFT

While the definition of the STFT in (7.1) is useful for theoretical work, it is not really a specification of a practical STFT. In practice, the STFT is computed as a succession of FFTs of windowed data frames, where the window ``slides'' or ``hops'' forward through time. We now derive such an implementation of the STFT from its mathematical definition.

The STFT in (7.1) can be rewritten, adding $ mR$ to $ n$ , as

$\displaystyle X_m(\omega)$ $\displaystyle =$ $\displaystyle \sum_{n=-\infty}^{\infty} x(n+mR) w(n) e^{-j\omega (n+mR)}$  
  $\displaystyle =$ $\displaystyle e^{-j\omega mR} \sum_{n=-\infty}^{\infty} x(n+mR) w(n) e^{-j\omega n}$  
  $\displaystyle =$ $\displaystyle e^{-j\omega mR}\hbox{\sc DTFT}_\omega(\hbox{\sc Shift}_{-mR}(x) \cdot w).
\protect$ (8.3)

In this form, the data centered about time $ mR$ are translated to time 0, multiplied by the (let's assume zero-phase) window $ w$ , and then the DTFT is performed. Since the nonzero portion of the windowed data is centered on time zero, the DTFT can be replaced by the DFT (or FFT). This effectively samples the DTFT in frequency. This sampling will not cause (time) aliasing if the number of samples around the unit circle is greater than the width (in samples) of the time interval including all nonzero datapoints. In other words, sampling the frequency axis is information-preserving when the signal is properly time limited.8.3Let $ M$ denote the window length (typically an odd number) and $ N\geq M$ be the DFT length (typically a power of 2). Then sampling (7.3) at $ \omega=\omega_k= 2\pi k/N$ , $ k=0,1,2,\ldots,N-1$ , and using the fact that the window $ w(n)$ is time-limited to less than $ N$ samples centered about time zero, yields
$\displaystyle X_m(\omega_k)$ $\displaystyle =$ $\displaystyle e^{-j\omega_k mR}\sum_{n=-N/2}^{N/2-1} x(n+mR) w(n) e^{-j\omega_k n}$  
  $\displaystyle =$ $\displaystyle e^{-j\omega_k mR}\hbox{\sc DFT}_{N,\omega_k}(\hbox{\sc Shift}_{-mR}(x) \cdot w).
\protect$ (8.4)

Since indexing in the DFT is modulo $ N$ , the sum over $ n$ can be ``rotated'' to a sum from 0 to $ N-1$ as is conventionally implemented for the DFT. In practice, this means that the right half of the windowed data frame goes at the beginning of the FFT input buffer, and the left half of the windowed frame goes at the end, with zero-padding in the middle (see Fig.2.6b on page [*] for an illustration).


Summary of STFT Computation Using FFTs

  1. Read $ M$ samples of the input signal $ x$ into a local buffer of length $ N\geq M$ which is initially zeroed

    $\displaystyle \tilde{x}_m(n) \isdef x(n+mR), \qquad n=-M_h,
\ldots\,,-1,0,1,\ldots\,,
M_h
$

    We call $ x_m$ the $ m$ th frame of the input signal, and $ \tilde{x}_m=x(n+mR)$ the $ m$ th time normalized input frame (time-normalized by translating it to time zero). The frame length is $ M\isdef 2M_h+1$ , which we assume to be odd for reasons to be discussed later. The time advance $ R$ (in samples) from one frame to the next is called the hop size or step size.

  2. Multiply the data frame pointwise by a length $ M$ spectrum analysis window $ w(n), n=-M_h,\ldots\,,M_h$ to obtain the $ m$ th windowed data frame (time normalized):

    $\displaystyle \tilde{x}_m^w(n) \isdef \tilde{x}_m(n) w(n), \qquad n=-M_h,\ldots\,,M_h
$

  3. Extend $ \tilde{x}_m^w$ with zeros on both sides to obtain a zero-padded frame:

    $\displaystyle \tilde{x}_m^{w,z}(n) \isdef \left\{\begin{array}{ll} \tilde{x}_m^w(n), & \left\vert n\right\vert\leq M_h\isdef {\frac{M-1}{2}} \\ [5pt] 0, & M_h< n \leq {\frac{N}{2}}-1 \\ [5pt] 0, & -{\frac{N}{2}}\leq n < -M_h \\ \end{array} \right.$ (8.5)

    where $ N$ is chosen to be a power of two larger than $ M$ . The number $ N/M$ is the zero-padding factor. As discussed in §2.5.3, the zero-padding factor is the interpolation factor for the spectrum, i.e., each FFT bin is replaced by $ N/M$ bins, interpolating the spectrum using ideal bandlimited interpolation [264], where the ``band'' in this case is the $ M$ -sample nonzero duration of $ \tilde{x}_m^w$ in the time domain.

  4. Take a length $ N$ FFT of $ \tilde{x}_m^{w,z}$ to obtain the time-normalized, frequency-sampled STFT at time $ m$ :

    $\displaystyle \tilde{X}_m^{w,z}(\omega_k )=\sum _{n=-N/2}^{N/2-1} \tilde{x}_m^{w,z}(n) e^{-j\omega_k n T}$ (8.6)

    where $ \omega_k = 2\pi k f_s / N $ , and $ f_s=1/T$ is the sampling rate in Hz. As in any FFT, we call $ k$ the bin number.

  5. If needed, time normalization may be removed using a linear phase term to yield the sampled STFT:

    $\displaystyle X^{w,z}_m(\omega_k) = e^{-j\omega_k mR}\tilde{X}_m^{w,z}(\omega_k )$ (8.7)

    The (continuous-frequency) STFT may be approached arbitrarily closely by using more zero padding and/or other interpolation methods.

    Note that there is no irreversible time-aliasing when the STFT frequency axis $ \omega$ is sampled to the points $ \omega_k$ , provided the FFT size $ N$ is greater than or equal to the window length $ M$ .


Two Dual Interpretations of the STFT

The STFT $ \tilde{X}_m^{w,z}(\omega_k )$ can be viewed as a function of either frame-time $ m$ or bin-frequency $ k$ . We will develop both points of view in this book.

At each frame time $ m$ , the STFT can be regarded as producing a Fourier transform centered around that time. As $ m$ advances, a sequence of spectral transforms is obtained. This is depicted graphically in Fig.9.1, and it forms the basis of the overlap-add method for Fourier analysis, modification, and resynthesis [9]. It is also the basis for transform coders [16,284].

In an exact Fourier duality, each bin $ \tilde{X}_m^{w,z}(\omega_k )$ of the STFT can be regarded as a sample of the complex signal at the output of a lowpass filter whose input is $ \tilde{x}_m^{w,z}(n) e^{-j\omega_k m T}$ . As discussed in §9.1.2, this signal is obtained from $ \tilde{x}_m^{w,z}(n)$ by frequency-shifting it so that frequency $ \omega_k$ is translated down to 0 Hz. For each value of $ k$ , the time-domain signal $ \tilde{X}_m^{w,z}(\omega_k )$ , for $ m=\ldots,-2,-1,0,1,2,\ldots$ , is the output of the $ k$ th ``filter bank channel,'' for $ k=0,1,\ldots,N-1$ . In this ``filter bank'' interpretation, the hop size $ R$ can be interpreted as the downsampling factor applied to each bin-filter output, and the analysis window $ w(\,\cdot\,)$ is seen as the impulse response of the anti-aliasing filter used prior to downsampling. The window transform $ W(\omega)$ is also the frequency response of each channel filter (translated to dc). This point of view is depicted graphically in Fig.9.2 and elaborated further in Chapter 9.


The STFT as a Time-Frequency Distribution

The Short Time Fourier Transform (STFT) $ X_m(\omega_k)$ is a function of both time (frame number $ m$ ) and frequency ( $ \omega_k =
2\pi k/N$ ). It is therefore an example of a time-frequency distribution. Others include

The uniform and rectangular nature of the STFT time-frequency tiling is illustrated in Fig.7.1. The window length is proportional to the resolution cell in time, indicated by the vertical lines in Fig.7.1. The width of the main-lobe of the window-transform is proportional to the resolution cell in frequency, indicated by the horizontal lines in Fig.7.1. As detailed in Chapter 3, choosing a window length $ M$ and window type (Hamming, Blackman, etc.) chooses the ``aspect ratio'' and total area of the time-frequency resolution cells (rectangles in Fig.7.1). For an example of a non-uniform time-frequency tiling, see Fig.10.14.

Figure: Example time-frequency tiling for the STFT. Vertical line spacing indicates time resolution, and horizontal line spacing indicates frequency resolution (both fixed by window length and type). The area of the rectangular cells are bounded below by the minimum time-bandwidth product (see §B.17.1 for one definition).
\includegraphics[width=0.8\twidth]{eps/timefreq}


STFT in Matlab

The following matlab segment illustrates the above processing steps:

Xtwz = zeros(N,nframes); % pre-allocate STFT output array
M = length(w);           % M = window length, N = FFT length
zp = zeros(N-M,1);       % zero padding (to be inserted)
xoff = 0;                % current offset in input signal x
Mo2 = (M-1)/2;           % Assume M odd for simplicity here
for m=1:nframes
  xt = x(xoff+1:xoff+M); % extract frame of input data
  xtw = w .* xt;         % apply window to current frame
  xtwz = [xtw(Mo2+1:M); zp; xtw(1:Mo2)]; % windowed, zero padded
  Xtwz(:,m) = fft(xtwz); % STFT for frame m
  xoff = xoff + R;       % advance in-pointer by hop-size R
end

Notes

  • The window w is implemented in zero-centered (``zero-phase'') form (see, e.g., §2.5.4 for discussion).

  • The signal x should have at least Mo2 leading zeros for this (simplified) implementation.

  • See §F.3 for a more detailed implementation.


Next Section:
Classic Spectrograms
Previous Section:
The Panning Problem