The Short-Time Fourier Transform
The Short-Time Fourier Transform (STFT) (or short-term Fourier transform) is a powerful general-purpose tool for audio signal processing [7,9,8]. It defines a particularly useful class of time-frequency distributions [43] which specify complex amplitude versus time and frequency for any signal. We are primarily concerned here with tuning the STFT parameters for the following applications:
- Approximating the time-frequency analysis performed by the ear for purposes of spectral display.
- Measuring model parameters in a short-time spectrum.
Examples of the second case include estimating the decay-time-versus-frequency for vibrating strings [288] and body resonances [119], or measuring as precisely as possible the fundamental frequency of a periodic signal [106] based on tracking its many harmonics in the STFT [64].

An interesting example in which cases 1 and 2 normally coincide is pitch detection (case 1) versus fundamental frequency estimation (case 2). Here, ``fundamental frequency'' is defined as the lowest frequency present in a series of harmonic overtones, while ``pitch'' is defined as the perceived fundamental frequency; perceived pitch can be measured, for example, by comparing to a harmonic reference tone such as a sawtooth waveform. (Thus, by definition, the pitch of a sawtooth waveform is its fundamental frequency.) When harmonics are stretched so that they become slightly inharmonic, pitch perception corresponds to a (possibly non-existent) compromise fundamental frequency, the harmonics of which ``best fit'' the most audible overtones in some sense. The topic of ``pitch detection'' in the signal processing literature is often really about fundamental frequency estimation [106], so the distinction between the two is lost. This is not a problem for strictly periodic signals, for which the two normally coincide.
Mathematical Definition of the STFT
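The usual mathematical definition of the STFT is [9]

$$X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n)\, w(n-mR)\, e^{-j\omega n} \tag{8.1}$$

where

- $x(n)$ = input signal at time $n$
- $w(n)$ = length $M$ window function (e.g., Hamming)
- $X_m(\omega)$ = DTFT of windowed data centered about time $mR$
- $R$ = hop size, in samples, between successive DTFTs.

If the window $w(n)$ has the constant overlap-add (COLA) property at hop size $R$, i.e., if

$$\sum_{m=-\infty}^{\infty} w(n-mR) = 1, \quad \forall n \in \mathbb{Z}, \tag{8.2}$$

then the sum of the successive DTFTs over time equals the DTFT of the whole signal $X(\omega)$:

$$\sum_{m=-\infty}^{\infty} X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \sum_{m=-\infty}^{\infty} w(n-mR) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} = X(\omega). \tag{8.3}$$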
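The claim that the frame spectra sum to the spectrum of the whole signal can be checked numerically. The following Matlab sketch is only an illustration under stated assumptions: it uses a periodic Hann window of length M = 8 at hop size R = M/2 (its hopped copies sum to one), a finite test signal padded with zeros so that every nonzero sample is covered by two windows, and a DFT of the padded length in place of the DTFT. The names `xp`, `ola`, `S`, `err`, and `colaerr` are hypothetical.

```matlab
% Sketch: sum of frame spectra equals the spectrum of the whole signal
% when the hopped windows sum to one (all sizes below are assumed values).
M = 8; R = M/2;                     % window length and hop size
n = (0:M-1)';
w = 0.5 - 0.5*cos(2*pi*n/M);        % periodic Hann: w(n) + w(n+M/2) = 1
L = 64;                             % test-signal length (multiple of R)
x = cos(0.3*(0:L-1)');              % arbitrary test signal
xp = [zeros(M,1); x; zeros(M,1)];   % pad so the partially covered edges are zero
Lp = length(xp);
S = zeros(Lp,1);                    % running sum of frame spectra
ola = zeros(Lp,1);                  % overlap-add of the shifted windows
for off = 0:R:Lp-M
  frame = zeros(Lp,1);
  frame(off+1:off+M) = xp(off+1:off+M) .* w;   % frame kept at its absolute time
  S = S + fft(frame);
  ola(off+1:off+M) = ola(off+1:off+M) + w;
end
err = max(abs(S - fft(xp)));             % should be near machine precision
colaerr = max(abs(ola(M+1:Lp-M) - 1));   % window sum over the signal support
```

Only the first and last R samples of the padded buffer are covered by a single window, and the signal is zero there, so the overlap-added windows equal one wherever the signal is nonzero.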













Practical Computation of the STFT
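While the definition of the STFT in (8.1) is useful for theoretical work, it is not really a specification of a practical STFT. In practice, the STFT is computed as a succession of FFTs of windowed data frames, where the window ``slides'' or ``hops'' forward through time. We now derive such an implementation of the STFT from its mathematical definition.

The STFT in (8.1) can be rewritten, adding $mR$ to $n$ (where $m$ is the frame index and $R$ is the hop size in samples), as

$$X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n+mR)\, w(n)\, e^{-j\omega (n+mR)} = e^{-j\omega mR} \sum_{n=-\infty}^{\infty} x(n+mR)\, w(n)\, e^{-j\omega n}. \tag{8.4}$$

In this form, the data centered about time $mR$ is translated to time $0$, multiplied by the zero-centered window $w(n)$, and transformed by the DTFT; the linear phase term $e^{-j\omega mR}$ then restores the original time origin.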
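The relation between a frame kept at its original time and the same frame translated to time zero can be illustrated with a length-N DFT, where a shift by mR samples becomes the linear phase factor exp(-j 2 pi k mR / N). The following Matlab sketch assumes small illustrative sizes and a zero-centered Hamming window; the names `xt`, `xa`, and `err` are hypothetical.

```matlab
% Sketch: the spectrum of a frame at absolute time m*R equals a linear
% phase term times the spectrum of the time-normalized frame.
N = 64; M = 9; Mh = (M-1)/2;        % DFT length and (odd) window length, assumed
m = 3; R = 4;                       % frame index and hop size, assumed
n = (-Mh:Mh)';                      % zero-centered time indices
w = 0.54 + 0.46*cos(2*pi*n/(M-1));  % zero-centered Hamming window
x = cos(0.7*(0:N-1)');              % test signal x(0), ..., x(N-1)
seg = x(m*R + n + 1) .* w;          % x(n+mR) w(n) for n = -Mh, ..., Mh
xt = zeros(N,1);
xt(mod(n,N) + 1) = seg;             % time-normalized: time n stored modulo N
xa = zeros(N,1);
xa(m*R + n + 1) = seg;              % same samples at their absolute times
k = (0:N-1)';                       % bin numbers
err = max(abs(fft(xa) - exp(-1j*2*pi*k*m*R/N) .* fft(xt)));
```

Here `err` should be on the order of machine precision, since `xa` is simply `xt` delayed circularly by m*R samples.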








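Since indexing in the DFT is modulo the DFT length $N$, the negative-time samples of a zero-centered frame, $n = -M_h, \ldots, -1$ with $M_h = (M-1)/2$ for odd frame length $M$, may be stored at buffer locations $N-M_h, \ldots, N-1$, with any zero padding placed between the positive-time and negative-time samples.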



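The effect of modulo-N indexing on a zero-centered frame can be seen by transforming a symmetric window in two buffer layouts. The following Matlab sketch assumes a length 7 zero-centered Hamming window and an FFT size of 16; the names `wzp`, `wc`, `imag_zero_phase`, and `imag_causal` are hypothetical.

```matlab
% Sketch: wrapping negative-time samples to the end of the FFT buffer
% gives a real ("zero-phase") transform for a symmetric window.
M = 7; N = 16; Mo2 = (M-1)/2;                 % assumed sizes, M odd
n = (-Mo2:Mo2)';
w = 0.54 + 0.46*cos(2*pi*n/(M-1));            % symmetric about n = 0
wzp = [w(Mo2+1:M); zeros(N-M,1); w(1:Mo2)];   % n >= 0 first, n < 0 at the end
wc  = [w; zeros(N-M,1)];                      % causal layout, for comparison
imag_zero_phase = max(abs(imag(fft(wzp))));   % essentially zero
imag_causal     = max(abs(imag(fft(wc))));    % clearly nonzero (linear phase)
```

The causal layout differs from the wrapped one only by a circular delay of Mo2 samples, which shows up as a linear phase term in its transform.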
Summary of STFT Computation Using FFTs
- Read $M$ samples of the input signal $x$ into a local buffer of length $N \ge M$ which is initially zeroed:
  $$\tilde{x}_m(n) \triangleq x(n+mR), \qquad n = -M_h, \ldots, -1, 0, 1, \ldots, M_h$$
  We call $x_m$ the $m$th frame of the input signal, and $\tilde{x}_m$ the $m$th time-normalized input frame (time-normalized by translating it to time zero). The frame length is $M \triangleq 2M_h+1$, which we assume to be odd for reasons to be discussed later. The time advance $R$ (in samples) from one frame to the next is called the hop size or step size.
- Multiply the data frame pointwise by a length $M$ spectrum analysis window $w(n)$ to obtain the $m$th windowed data frame (time normalized):
  $$\tilde{x}^w_m(n) \triangleq \tilde{x}_m(n)\, w(n), \qquad n = -M_h, \ldots, M_h$$
- Extend $\tilde{x}^w_m$ with zeros on both sides to obtain a zero-padded frame:
  $$\tilde{x}^{w,z}_m(n) \triangleq \begin{cases} \tilde{x}^w_m(n), & |n| \le M_h \\ 0, & M_h < n \le N/2-1 \\ 0, & -N/2 \le n < -M_h \end{cases} \tag{8.5}$$
  where $N$ is chosen to be a power of two larger than $M$. The number $N/M$ is the zero-padding factor. As discussed in §2.5.3, the zero-padding factor is the interpolation factor for the spectrum, i.e., each FFT bin is replaced by $N/M$ bins, interpolating the spectrum using ideal bandlimited interpolation [264], where the ``band'' in this case is the $M$-sample nonzero duration of $\tilde{x}^w_m$ in the time domain.
- Take a length $N$ FFT of $\tilde{x}^{w,z}_m$ to obtain the time-normalized, frequency-sampled STFT at time $m$:
  $$\tilde{X}^{w,z}_m(\omega_k) = \sum_{n=-N/2}^{N/2-1} \tilde{x}^{w,z}_m(n)\, e^{-j\omega_k nT} \tag{8.6}$$
  where $\omega_k \triangleq 2\pi k f_s/N$, and $f_s = 1/T$ is the sampling rate in Hz. As in any FFT, we call $k$ the bin number.
- If needed, time normalization may be removed using a linear phase term to yield the sampled STFT:
  $$X^{w,z}_m(\omega_k) = e^{-j\omega_k mRT}\, \tilde{X}^{w,z}_m(\omega_k) \tag{8.7}$$

The (continuous-frequency) STFT may be approached arbitrarily closely by using more zero padding and/or other interpolation methods. Note that there is no irreversible time-aliasing when the STFT frequency axis $\omega$ is sampled to the points $\omega_k$, provided the FFT size $N$ is greater than or equal to the window length $M$.
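The role of the zero-padding factor as a spectral interpolation factor can be checked with a short experiment. The following Matlab sketch makes two simplifying assumptions that differ from the steps above: the frame length is even (M = 8) so that N/M = 4 is an integer, and the zeros are appended at the end of the frame (via the second argument of `fft`) rather than inserted around a zero-centered frame. The names `xw`, `X1`, `X4`, and `err` are hypothetical.

```matlab
% Sketch: zero padding by the factor N/M interpolates the spectrum, so
% every (N/M)-th bin of the padded FFT coincides with the unpadded FFT.
M = 8; N = 32;                            % assumed sizes; N/M = 4
n = (0:M-1)';
xw = (0.5 - 0.5*cos(2*pi*n/M)) .* cos(2*pi*0.2*n);   % some windowed frame
X1 = fft(xw);                             % M-point spectrum
X4 = fft(xw, N);                          % frame zero padded to N samples
err = max(abs(X4(1:N/M:N) - X1));         % original bins are preserved
```

The remaining bins of `X4` are the interpolated spectral samples lying between the original M bins.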
Two Dual Interpretations of the STFT
The STFT
















The STFT as a Time-Frequency Distribution
The Short Time Fourier Transform (STFT)



STFT in Matlab
The following Matlab segment illustrates the above processing steps:

    Xtwz = zeros(N,nframes); % pre-allocate STFT output array
    M = length(w);           % M = window length, N = FFT length
    zp = zeros(N-M,1);       % zero padding (to be inserted)
    xoff = 0;                % current offset in input signal x
    Mo2 = (M-1)/2;           % Assume M odd for simplicity here
    for m=1:nframes
      xt = x(xoff+1:xoff+M); % extract frame of input data
      xtw = w .* xt;         % apply window to current frame
      xtwz = [xtw(Mo2+1:M); zp; xtw(1:Mo2)]; % windowed, zero padded
      Xtwz(:,m) = fft(xtwz); % STFT for frame m
      xoff = xoff + R;       % advance in-pointer by hop-size R
    end
Notes
- The window w is implemented in zero-centered (``zero-phase'') form (see, e.g., §2.5.4 for discussion).
- The signal x should have at least Mo2 leading zeros for this (simplified) implementation.
- See §F.3 for a more detailed implementation.
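For experimentation, the segment above can be wrapped in a small self-contained driver. The following is only a sketch: the parameter values (M = 31, N = 64, R = 8), the zero-centered Hamming window formula, the test sinusoid centered on bin `k0 = 10`, and the names `nx`, `k0`, `kpk`, and `peakbin` are all assumptions made for illustration, and `x` and `w` are taken to be column vectors.

```matlab
% Sketch of a driver for the STFT loop (all parameter values assumed).
M = 31; N = 64; R = 8;                 % window length (odd), FFT size, hop size
Mo2 = (M-1)/2;
n = (-Mo2:Mo2)';
w = 0.54 + 0.46*cos(2*pi*n/(M-1));     % zero-centered Hamming window (column)
k0 = 10; nx = 256;                     % test sinusoid centered on FFT bin k0
x = [zeros(Mo2,1); cos(2*pi*k0*(0:nx-1)'/N)];   % Mo2 leading zeros, per the notes
nframes = floor((length(x) - M)/R) + 1;         % number of complete frames
Xtwz = zeros(N,nframes);               % pre-allocate STFT output array
zp = zeros(N-M,1);                     % zero padding (to be inserted)
xoff = 0;                              % current offset in input signal x
for m = 1:nframes
  xt = x(xoff+1:xoff+M);               % extract frame of input data
  xtw = w .* xt;                       % apply window to current frame
  xtwz = [xtw(Mo2+1:M); zp; xtw(1:Mo2)];   % windowed, zero padded
  Xtwz(:,m) = fft(xtwz);               % STFT for frame m
  xoff = xoff + R;                     % advance in-pointer by hop size R
end
[pk, kpk] = max(abs(Xtwz(1:N/2+1, nframes)));   % nonnegative-frequency bins only
peakbin = kpk - 1;                     % convert Matlab index to zero-based bin
```

With these settings the magnitude peak of the last frame is expected at `peakbin == k0`, which gives a quick sanity check on the frame extraction and buffer layout.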