Summing STFT Bins
In the Short-Time Fourier Transform, which implements a uniform FIR filter bank (Chapter 9), each FFT bin can be regarded as one sample of the filter-bank output in one channel. It is elementary that summing adjacent filter-bank signals sums the corresponding pass-bands to create a wider pass-band. Summing adjacent FFT bins in the STFT therefore synthesizes one sample from a wider pass-band implemented using an FFT. This is essentially how a constant-Q transform is created from an FFT (using a different frequency weighting, or ``smoothing kernel''). However, when making a filter bank, as opposed to only a transform used for spectrographic purposes, we must be able to step the FFT through time and compute properly sampled time-domain filter-bank signals.
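As a minimal numerical sketch of the bin-summing idea (not taken from the text), the following Python/NumPy fragment checks that the sum of adjacent FFT bins in one STFT frame equals the inner product of that frame with a single ``wide-band'' kernel: the analysis window times the sum of the corresponding DFT sinusoids. The function names, the Hann window, the frame length, and the bin range are all illustrative assumptions.

```python
import numpy as np

def stft_frame(x, w, m):
    """DFT of the length-N frame of x starting at sample m, windowed by w."""
    N = len(w)
    return np.fft.fft(x[m:m + N] * w)

def summed_band_kernel(w, k_lo, k_hi):
    """Window times the sum of the DFT sinusoids for bins k_lo..k_hi (inclusive).

    Hypothetical helper: taking the inner product of a frame with this kernel
    gives one output sample of the wider-band channel.
    """
    N = len(w)
    n = np.arange(N)
    k = np.arange(k_lo, k_hi + 1)[:, None]          # column of bin indices
    return w * np.exp(-2j * np.pi * k * n / N).sum(axis=0)

# Illustrative values only (assumptions, not from the text).
rng = np.random.default_rng(0)
N = 64                       # frame / FFT length
x = rng.standard_normal(256) # test signal
w = np.hanning(N)            # analysis window
k_lo, k_hi = 8, 15           # adjacent bins to sum
m = 40                       # frame start time

X = stft_frame(x, w, m)
band_from_bins = X[k_lo:k_hi + 1].sum()
band_from_kernel = np.dot(x[m:m + N], summed_band_kernel(w, k_lo, k_hi))

print(np.allclose(band_from_bins, band_from_kernel))
```

The agreement follows from linearity of the DFT. Stepping m through time then produces successive samples of the wider-band channel signal.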
The wider pass-band created by adjacent-channel summing requires a higher sampling rate in the time domain to avoid aliasing. As a result, the maximum STFT ``hop size'' is limited by the widest pass-band in the filter bank. For audio filter banks, low-frequency channels have narrow bandwidths, while high-frequency channels are wider, so the wide high-frequency channels force a small hop size for the entire STFT. This means that the low-frequency channels are heavily oversampled when the high-frequency channels are merely adequately sampled (in time) [30,88]. In an octave filter bank, for example, the top octave, occupying the entire upper half of the spectrum, requires a time-domain step-size of no more than two samples, if aliasing of the band is to be avoided. Each octave down is then oversampled (in time) by an additional factor of 2.
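The octave example can be tabulated with a short Python sketch. It rests on an idealized assumption not stated in the text: a real band signal built from B adjacent positive-frequency bins of a length-N FFT (plus their mirror-image negative-frequency bins) occupies a fraction 2B/N of the spectrum, so its hop size can be at most N/(2B) samples. The window's transition bands are ignored, so a practical design would likely need smaller hops. The function name and the FFT size are hypothetical.

```python
def octave_band_hops(n_fft, n_octaves):
    """List (lo_bin, hi_bin, max_hop) for octave bands, top octave first.

    Idealized assumption: a real band made from n_bins positive-frequency
    bins tolerates a hop of at most n_fft / (2 * n_bins) samples.
    """
    bands = []
    hi = n_fft // 2                  # bin index at half the sampling rate
    for _ in range(n_octaves):
        lo = hi // 2                 # one octave below the upper band edge
        n_bins = hi - lo
        max_hop = n_fft // (2 * n_bins)
        bands.append((lo, hi, max_hop))
        hi = lo
    return bands

bands = octave_band_hops(1024, 4)    # illustrative FFT size and octave count

# A single STFT must use the smallest hop, set by the widest (top) band.
common_hop = min(hop for _, _, hop in bands)

# Time-domain oversampling of each band relative to its own maximum hop.
oversampling = [hop // common_hop for _, _, hop in bands]

for (lo, hi, hop), over in zip(bands, oversampling):
    print(f"bins {lo:4d}..{hi:4d}: max hop {hop:3d}, oversampled by {over}")
```

Under these assumptions the top octave allows a hop of 2 samples, and each octave below it is oversampled by a further factor of 2 when the whole STFT runs at that common hop.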
Inverse Transforming STFT Bin Groups