The Short-Time Fourier Transform

The Short-Time Fourier Transform (STFT) (or short-term Fourier transform) is a powerful general-purpose tool for audio signal processing [7,9,8]. It defines a particularly useful class of time-frequency distributions which specify complex amplitude versus time and frequency for any signal. We are primarily concerned here with tuning the STFT parameters for the following applications:
- Approximating the time-frequency analysis performed by the ear for purposes of spectral display.
- Measuring model parameters in a short-time spectrum.
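Examples of the second case include estimating the decay-time-versus-frequency for vibrating strings and body resonances, or measuring as precisely as possible the fundamental frequency of a periodic signal based on tracking its many harmonics in the STFT.

An interesting example for which cases 1 and 2 normally coincide is pitch detection (case 1) and fundamental frequency estimation (case 2). Here, ``fundamental frequency'' is defined as the lowest frequency present in a series of harmonic overtones, while ``pitch'' is defined as the perceived fundamental frequency; perceived pitch can be measured, for example, by comparing to a harmonic reference tone such as a sawtooth waveform. (Thus, by definition, the pitch of a sawtooth waveform is its fundamental frequency.) When harmonics are stretched so that they become slightly inharmonic, pitch perception corresponds to a (possibly non-existent) compromise fundamental frequency, the harmonics of which ``best fit'' the most audible overtones in some sense. The topic of ``pitch detection'' in the signal processing literature is often really about fundamental frequency estimation, and this distinction is lost. This is not a problem for strictly periodic signals.

The usual mathematical definition of the STFT is

$$X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n)\, w(n-mR)\, e^{-j\omega n}, \tag{7.1}$$

where $x(n)$ is the input signal at time $n$, $w(n)$ is the length-$M$ window function, $R$ is the hop size in samples between successive frames, and $X_m(\omega)$ is the DTFT of the windowed data centered about time $mR$. If the window has the constant-overlap-add property at hop size $R$, i.e., if $\sum_{m} w(n-mR) = 1$ for all $n$, then the sum of the successive DTFTs over time equals the DTFT of the whole signal $X(\omega)$:

$$\sum_{m=-\infty}^{\infty} X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \sum_{m=-\infty}^{\infty} w(n-mR) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} = X(\omega).$$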
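The statement that the frame DTFTs sum to the DTFT of the whole signal can be checked numerically. The following Matlab sketch is only an illustration under assumed values: a periodic Hann window of length M = 64 with hop size R = M/2 (chosen because its shifted copies sum to one), indexed causally for simplicity, and a random test signal padded with R zeros at each end so that every nonzero sample is covered by two windows. The names `colaErr` and `dftErr` are hypothetical.

```matlab
% Sketch: sum of frame DFTs equals the DFT of the whole signal when the
% shifted windows overlap-add to one.
% Assumed (illustrative) choices: periodic Hann window, M = 64, R = M/2.
M = 64; R = M/2;                           % window length and hop size
w = 0.5 - 0.5*cos(2*pi*(0:M-1)'/M);        % periodic Hann: w(n) + w(n+R) = 1
L = 8*R;                                   % length of the nonzero test signal
x = [zeros(R,1); randn(L,1); zeros(R,1)];  % pad so nonzero samples are fully overlapped
Nx = length(x);
nframes = (Nx - M)/R + 1;                  % frames exactly tile the padded signal
Xsum = zeros(Nx,1);                        % accumulates frame DFTs on a common grid
wsum = zeros(Nx,1);                        % accumulates the shifted windows
for m = 0:nframes-1
  idx = m*R + (1:M)';                      % sample indices covered by frame m
  frame = zeros(Nx,1);
  frame(idx) = x(idx) .* w;                % windowed frame left at its original position
  Xsum = Xsum + fft(frame);                % DFT of x(n) w(n - mR), no time normalization
  wsum(idx) = wsum(idx) + w;
end
colaErr = max(abs(wsum(R+1:Nx-R) - 1));    % deviation of overlapped windows from 1
dftErr  = max(abs(Xsum - fft(x)));         % sum of frame DFTs vs. whole-signal DFT
```

Both error measures should be at the level of floating-point rounding. Changing R to a value for which the windows do not sum to a constant (for example R = 24 here) makes `dftErr` clearly nonzero.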
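Practical Computation of the STFT

While the definition of the STFT in (7.1) is useful for theoretical work, it is not really a specification of a practical STFT. In practice, the STFT is computed as a succession of FFTs of windowed data frames, where the window ``slides'' or ``hops'' forward through time. We now derive such an implementation of the STFT from its mathematical definition. With $x$ the input signal, $w$ the analysis window, $m$ the frame index, and $R$ the hop size, the STFT in (7.1) can be rewritten, adding $-mR$ to $n$ (i.e., shifting the summation index by $mR$), as

$$X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n+mR)\, w(n)\, e^{-j\omega (n+mR)} = e^{-j\omega mR} \sum_{n=-\infty}^{\infty} x(n+mR)\, w(n)\, e^{-j\omega n}. \tag{7.3}$$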
In this form, the data centered about time $mR$ are translated to time 0, multiplied by the (let's assume zero-phase) window $w$, and then the DTFT is performed. Since the nonzero portion of the windowed data is centered on time zero, the DTFT can be replaced by the DFT (or FFT). This effectively samples the DTFT in frequency. This sampling will not cause (time) aliasing if the number of samples around the unit circle is greater than the width (in samples) of the time interval including all nonzero data points. In other words, sampling the frequency axis is information-preserving when the signal is properly time limited.

Let $M$ denote the window length (typically an odd number) and $N \ge M$ be the DFT length (typically a power of 2). Then sampling (7.3) at $\omega_k = 2\pi k/N$, $k = 0, 1, \ldots, N-1$, and using the fact that the window $w(n)$ is time-limited to less than $N$ samples centered about time zero, yields

$$X_m(\omega_k) = e^{-j\omega_k mR} \sum_{n=-N/2}^{N/2-1} x(n+mR)\, w(n)\, e^{-j\omega_k n}.$$
Since indexing in the DFT is modulo $N$, the sum over $n$ can be ``rotated'' to a sum from 0 to $N-1$, as is conventionally implemented for the DFT. In practice, this means that the right half of the windowed data frame goes at the beginning of the FFT input buffer, and the left half of the windowed frame goes at the end, with zero-padding in the middle (see Fig.2.6b for an illustration).
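The buffer rotation can be made concrete with a short Matlab sketch. It assumes illustrative sizes (M = 31, N = 64) and a symmetric Hamming window written in zero-centered form; the variable names are hypothetical. It loads the window into the FFT buffer modulo N, confirms that this is the same as the ``right half, zeros, left half'' layout, and shows that the resulting DFT is real, whereas the causal layout differs from it by a linear phase term.

```matlab
% Sketch: loading a zero-centered window into an FFT buffer "modulo N".
% Assumed (illustrative) sizes: M = 31 (odd), N = 64.
M = 31; N = 64; Mo2 = (M-1)/2;
n = (-Mo2:Mo2)';                          % zero-centered time indices
w = 0.54 + 0.46*cos(2*pi*n/(M-1));        % symmetric Hamming window, w(-n) = w(n)
buf = zeros(N,1);
buf(mod(n,N) + 1) = w;                    % time n goes to buffer index (n mod N) + 1
% Equivalent explicit layout: right half, zero padding, left half
buf2 = [w(Mo2+1:M); zeros(N-M,1); w(1:Mo2)];
layoutErr = max(abs(buf - buf2));         % the two layouts coincide
W = fft(buf);                             % samples of the zero-phase window transform
imagPart = max(abs(imag(W)));             % ~0: a real, even sequence has a real DFT
Wc = fft([w; zeros(N-M,1)]);              % causal layout: delayed by Mo2 samples
k = (0:N-1)';
phaseErr = max(abs(Wc - exp(-1j*2*pi*k*Mo2/N) .* W));  % differs only by linear phase
```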
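The FFT implementation of the STFT may be described as follows:

- Read $M$ samples of the input signal $x$ into a local buffer of length $N \ge M$ which is initially zeroed:

$$\tilde{x}_m(n) = x(n+mR), \qquad n = -\tfrac{M-1}{2}, \ldots, \tfrac{M-1}{2}.$$

- Multiply the data frame pointwise by a length $M$ spectrum analysis window $w(n)$ to obtain the $m$th windowed data frame (time normalized):

$$\tilde{x}^w_m(n) = \tilde{x}_m(n)\, w(n), \qquad n = -\tfrac{M-1}{2}, \ldots, \tfrac{M-1}{2}.$$

- Extend $\tilde{x}^w_m$ with zeros on both sides to obtain a zero-padded frame:

$$\tilde{x}^{w\prime}_m(n) = \begin{cases} \tilde{x}^w_m(n), & |n| \le \frac{M-1}{2} \\ 0, & \frac{M-1}{2} < n \le \frac{N}{2}-1 \\ 0, & -\frac{N}{2} \le n < -\frac{M-1}{2} \end{cases}$$

where $N$ is chosen to be a power of two larger than $M$. The number $N/M$ is the zero-padding factor. As discussed in §2.5.3, the zero-padding factor is the interpolation factor for the spectrum, i.e., each FFT bin is replaced by $N/M$ bins, interpolating the spectrum using ideal bandlimited interpolation, where the ``band'' in this case is the $M$-sample nonzero duration of $\tilde{x}^w_m$ in the time domain.

- Take a length $N$ FFT of $\tilde{x}^{w\prime}_m$ to obtain the time-normalized, frequency-sampled STFT at time $m$:

$$\tilde{X}^{w\prime}_m(\omega_k) = \sum_{n=-N/2}^{N/2-1} \tilde{x}^{w\prime}_m(n)\, e^{-j\omega_k nT}$$

where $\omega_k = 2\pi k f_s/N$ (so that $\omega_k T = 2\pi k/N$), and $f_s = 1/T$ is the sampling rate in Hz. As in any FFT, we call $k$ the bin number.

- If needed, time normalization may be removed using a linear phase term to yield the sampled STFT:

$$X^{w\prime}_m(\omega_k) = e^{-j\omega_k mRT}\, \tilde{X}^{w\prime}_m(\omega_k).$$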
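As a check on the last step, the following Matlab sketch computes a single frame by the FFT procedure, applies the linear phase term, and compares the result against a direct evaluation of the defining sum $\sum_n x(n)\,w(n-mR)\,e^{-j\omega_k n}$. All numerical values (M = 31, N = 64, R = 10, frame m = 5, a random test signal) are illustrative assumptions, and frequency is normalized so that $T = 1$.

```matlab
% Sketch: one STFT frame via the FFT steps, then time normalization removed.
% Assumed (illustrative) values: M = 31, N = 64, R = 10, frame m = 5, T = 1.
M = 31; N = 64; R = 10; m = 5; Mo2 = (M-1)/2;
n = (-Mo2:Mo2)';                         % zero-centered window support
w = 0.54 + 0.46*cos(2*pi*n/(M-1));       % zero-phase Hamming window
x = randn(200,1);                        % test signal; x(i) holds time n = i-1
xt  = x(m*R + n + 1);                    % step 1: frame centered at time mR
xtw = w .* xt;                           % step 2: apply window
xtwz = [xtw(Mo2+1:M); zeros(N-M,1); xtw(1:Mo2)];  % step 3: zero-pad, rotated layout
Xt = fft(xtwz);                          % step 4: time-normalized STFT frame
k = (0:N-1)';
X = exp(-1j*2*pi*k*m*R/N) .* Xt;         % step 5: linear phase restores absolute time
% Reference: evaluate the defining sum directly at absolute times n + mR
nabs = (m*R + n)';                       % row vector of absolute sample times
Xref = exp(-1j*2*pi*k*nabs/N) * xtw;     % N-by-M DFT matrix times windowed frame
err = max(abs(X - Xref));                % should be at rounding level
```

Skipping step 5 leaves the magnitude spectrum unchanged; only the phase differs, which matters when phases are compared across frames.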
The (continuous-frequency) STFT may be approached arbitrarily closely by using more zero padding and/or other interpolation methods. Note that there is no irreversible time-aliasing when the STFT frequency axis is sampled to the points $\omega_k$, provided the FFT size $N$ is greater than or equal to the window length $M$.
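The condition on the FFT size can be illustrated with a small Matlab sketch. The sizes are assumptions chosen for illustration (a frame of length 31, with 32 and then 16 frequency samples), and a random vector stands in for a windowed data frame. Note that `fft(xtw,N2)` with `N2` smaller than the frame length would truncate the frame rather than sample its DTFT, so the undersampled case is computed with an explicit DFT matrix.

```matlab
% Sketch: sampling the frequency axis is invertible when N >= M and
% time-aliases when N < M. Sizes below are illustrative assumptions.
M = 31;                                   % length of the windowed frame
xtw = randn(M,1);                         % stands in for a windowed data frame
N1 = 32;                                  % N >= M: no time aliasing
X1 = fft(xtw, N1);                        % N1 samples of the frame's DTFT
y1 = real(ifft(X1));
errOK = max(abs(y1(1:M) - xtw));          % frame recovered (to rounding)
N2 = 16;                                  % N < M: too few frequency samples
k = (0:N2-1)'; nn = 0:M-1;
X2 = exp(-1j*2*pi*k*nn/N2) * xtw;         % DTFT sampled at only N2 frequencies
y2 = real(ifft(X2));                      % inverse DFT gives a time-aliased frame
alias = xtw(1:N2) + [xtw(N2+1:M); 0];     % frame wrapped modulo N2 (31 = 16 + 15)
errAlias = max(abs(y2 - alias));          % ~0: y2 is the wrapped frame, not the original
```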
The STFT $X_m(\omega_k)$ can be viewed as a function of either frame-time $m$ or bin-frequency $k$. We will develop both points of view in this book.

At each frame time $m$, the STFT can be regarded as producing a Fourier transform centered around that time. As $m$ advances, a sequence of spectral transforms is obtained. This is depicted graphically in Fig.9.1, and it forms the basis of the overlap-add method for Fourier analysis, modification, and resynthesis. It is also the basis for transform coders [16,284].

In an exact Fourier duality, each bin $X_m(\omega_k)$ of the STFT can be regarded as a sample of the complex signal at the output of a lowpass filter whose input is $x(n)e^{-j\omega_k n}$. As discussed in §9.1.2, this signal is obtained from $x(n)$ by frequency-shifting it so that frequency $\omega_k$ is translated down to 0 Hz. For each value of $k$, the time-domain signal $X_m(\omega_k)$, for $m = \ldots, -2, -1, 0, 1, 2, \ldots$, is the output of the $k$th ``filter bank channel,'' for $k = 0, 1, \ldots, N-1$. In this ``filter bank'' interpretation, the hop size $R$ can be interpreted as the downsampling factor applied to each bin-filter output, and the analysis window $w$ is seen as the impulse response of the anti-aliasing filter used prior to downsampling. The window transform $W$ is also the frequency response of each channel filter (translated to dc). This point of view is depicted graphically in Fig.9.2 and elaborated further in Chapter 9.
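The equivalence of the two interpretations can be checked numerically. The Matlab sketch below is an illustration under assumed values (M = 31, N = 64, R = 8, bin k = 5, a random test signal); the variable names are hypothetical. It forms one filter-bank channel by frequency-shifting the signal, filtering with the window, and downsampling by R, and compares the result with the same bin taken from the sliding-FFT computation with time normalization removed.

```matlab
% Sketch: bin k of the STFT as one filter-bank channel
% (shift by exp(-j w_k n), lowpass filter with the window, downsample by R).
% Assumed (illustrative) values: M = 31, N = 64, R = 8, bin k = 5.
M = 31; N = 64; R = 8; k = 5; Mo2 = (M-1)/2;
nw = (-Mo2:Mo2)';
w = 0.54 + 0.46*cos(2*pi*nw/(M-1));      % zero-phase Hamming window (symmetric)
Lx = 256; x = randn(Lx,1);               % test signal; x(i) holds time n = i-1
n = (0:Lx-1)';
xk = x .* exp(-1j*2*pi*k*n/N);           % translate frequency w_k down to dc
yk = conv(xk, flipud(w));                % lowpass filter with impulse response w(-n)
% conv output index i corresponds to a window centered at time (i-1) - Mo2
mvals = (2:20)';                         % frames whose windows lie inside the signal
chan = yk(mvals*R + Mo2 + 1);            % downsample by R: channel output at times mR
% Reference: the same bin from the sliding-FFT STFT (not time-normalized)
ref = zeros(size(mvals));
for i = 1:length(mvals)
  m = mvals(i);
  xtw = w .* x(m*R + nw + 1);            % windowed frame centered at time mR
  Xt = fft([xtw(Mo2+1:M); zeros(N-M,1); xtw(1:Mo2)]);
  ref(i) = exp(-1j*2*pi*k*m*R/N) * Xt(k+1);   % undo time normalization
end
err = max(abs(chan - ref));              % should be at rounding level
```

Computing the channel this way is far more expensive than the FFT when all N bins are needed, which is one reason the sliding-FFT form is the usual implementation.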
The Short Time Fourier Transform (STFT) $X_m(\omega_k)$ is a function of both time (frame number $m$) and frequency ($\omega_k = 2\pi k/N$). It is therefore an example of a time-frequency distribution. Other such distributions exist; the STFT is distinguished by its uniform, rectangular tiling of the time-frequency plane, shown in Fig.7.1. The window length is proportional to the resolution cell in time, indicated by the vertical lines in Fig.7.1. The width of the main-lobe of the window-transform is proportional to the resolution cell in frequency, indicated by the horizontal lines in Fig.7.1. As detailed in Chapter 3, choosing a window length $M$ and window type (Hamming, Blackman, etc.) chooses the ``aspect ratio'' and total area of the time-frequency resolution cells (rectangles in Fig.7.1). For an example of a non-uniform time-frequency tiling, see Fig.10.14.
The following matlab segment illustrates the above processing steps:
    Xtwz = zeros(N,nframes); % pre-allocate STFT output array
    M = length(w);           % M = window length, N = FFT length
    zp = zeros(N-M,1);       % zero padding (to be inserted)
    xoff = 0;                % current offset in input signal x
    Mo2 = (M-1)/2;           % Assume M odd for simplicity here
    for m=1:nframes
      xt = x(xoff+1:xoff+M); % extract frame of input data
      xtw = w .* xt;         % apply window to current frame
      xtwz = [xtw(Mo2+1:M); zp; xtw(1:Mo2)]; % windowed, zero padded
      Xtwz(:,m) = fft(xtwz); % STFT for frame m
      xoff = xoff + R;       % advance in-pointer by hop-size R
    end
- The window w is implemented in zero-centered (``zero-phase'') form (see, e.g., §2.5.4 for discussion).
- The signal x should have at least Mo2 leading zeros for this (simplified) implementation.
- See §F.3 for a more detailed implementation.
The Panning Problem