A Note on Hop Size
Before Step 2 above, the FFT hop size within the MRSTFT of Step 1 would typically be determined by the shortest window length used and by the window type. However, after the non-uniform downsampling in Step 2, the effective window lengths (and shapes) have been modified. If the spectrum is not undersampled by this operation, the effective duration of the time-domain window at each frequency will always be shorter than that of the original FFT window. In principle, the effective time-domain window becomes the product of the original FFT window used in the MRSTFT and the ``auditory window,'' which is given by the inverse Fourier transform of the auditory filter frequency response (spectral interpolation kernel) translated to zero center-frequency. (This is only approximately true when the auditory filter frequency response spans multiple frequency ranges for which FFTs were performed at different resolutions.)
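As a rough numerical illustration of the product relation above, the following sketch assumes a Gaussian smoothing kernel. This shape is chosen only because its inverse Fourier transform is again a Gaussian in closed form; it is not one of the auditory filter shapes used in the text. The sampling rate, window length, kernel width, and all function names are hypothetical.

```python
import math

def hann(M):
    """Symmetric Hann window of length M."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (M - 1)) for n in range(M)]

def gaussian_auditory_window(M, fs, sigma_f):
    """Time-domain window for an ASSUMED Gaussian smoothing kernel.

    A kernel exp(-f^2 / (2 sigma_f^2)) centered at 0 Hz has, up to scale,
    the inverse Fourier transform exp(-t^2 / (2 sigma_t^2)) with
    sigma_t = 1 / (2 pi sigma_f). It is sampled here at rate fs and
    centered on the middle of an M-sample frame.
    """
    sigma_t = 1.0 / (2.0 * math.pi * sigma_f)
    c = (M - 1) / 2.0
    return [math.exp(-0.5 * (((n - c) / fs) / sigma_t) ** 2) for n in range(M)]

def effective_window(w, a):
    """Pointwise product of the FFT window w and the auditory window a."""
    return [x * y for x, y in zip(w, a)]

def rms_duration(w, fs):
    """RMS width, in seconds, of the energy w[n]^2 about its centroid."""
    e = [x * x for x in w]
    total = sum(e)
    mean = sum(n * x for n, x in enumerate(e)) / total
    var = sum((n - mean) ** 2 * x for n, x in enumerate(e)) / total
    return math.sqrt(var) / fs

# Hypothetical settings: 44.1 kHz, 2048-sample Hann, 100 Hz kernel width.
fs, M, sigma_f = 44100.0, 2048, 100.0
w = hann(M)
a = gaussian_auditory_window(M, fs, sigma_f)
w_eff = effective_window(w, a)
print("original  RMS duration (s):", rms_duration(w, fs))
print("effective RMS duration (s):", rms_duration(w_eff, fs))
```

Because the assumed auditory window decays away from the frame center, the product is more concentrated in time than the original window, which is the shortening described above.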
Since the time-domain window durations are shortened by the spectral
smoothing inherent in Step 2, the proper step size from frame to frame
is something less than that dictated by the MRSTFT windows. One
reliable method for determining the maximum allowable hop size for
each FFT in the MRSTFT is to study the inverse Fourier transform of
the widest (highest-frequency) auditory filter shape (translated to 0
Hz center-frequency) used as a smoothing kernel in that FFT. This new
window can be multiplied by the original window and overlapped and
added to itself, as in Eq. (7.2), at various increasing hop sizes $R$
(starting with $R=1$, which is always valid), until the overlap-add
begins to show ripple at the frame rate $f_s/R$.
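The overlap-add test just described can be sketched as follows. This is a minimal illustration, not a definitive procedure: the ripple tolerance, the toy effective window (a Hann window times a Gaussian whose width is given in samples), and all function names are assumptions introduced here.

```python
import math

def toy_effective_window(M, sigma):
    """HYPOTHETICAL effective window: symmetric Hann times a Gaussian
    'auditory window' with standard deviation sigma (in samples)."""
    c = (M - 1) / 2.0
    return [(0.5 - 0.5 * math.cos(2.0 * math.pi * n / (M - 1)))
            * math.exp(-0.5 * ((n - c) / sigma) ** 2) for n in range(M)]

def ola_ripple(w, R):
    """Relative peak-to-peak ripple of the overlap-add of w at hop size R.

    In steady state the sum over frames m of w[n - m R] is periodic in n
    with period R, so one period is enough: s[r] = sum_k w[r + k R].
    """
    s = [sum(w[r::R]) for r in range(R)]
    mean = sum(s) / R
    return (max(s) - min(s)) / mean

def max_hop(w, tol=1e-3):
    """Increase R from 1 and return the last hop size reached before the
    overlap-add ripple first exceeds tol (an assumed threshold)."""
    R = 1  # R = 1 sums every sample of w at each n, so there is no ripple
    while R < len(w) and ola_ripple(w, R + 1) <= tol:
        R += 1
    return R

# Hypothetical example: 2048-sample frame, 70-sample Gaussian width.
w_eff = toy_effective_window(2048, 70.0)
R_max = max_hop(w_eff, tol=1e-3)
print("largest hop size within tolerance:", R_max)
print("ripple at that hop size:", ola_ripple(w_eff, R_max))
```

The scan stops at the first hop size whose ripple exceeds the tolerance, mirroring the "increase until ripple appears" procedure; for windows whose ripple is not monotonic in the hop size, larger isolated hop sizes may also pass, and this sketch deliberately ignores them.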
Alternatively, the bandwidth of the highest-frequency auditory filter
can be used to determine the appropriate hop size in the time domain,
as elaborated in Chapter 9 (especially §9.8.1).