## FFT Filter Banks

This section, based on [265], describes how to make practical*audio filter banks*using the Short Time Fourier Transform (STFT). This material bridges the filter-bank interpretation of the STFT in Chapter 9 and the discussion of multirate filter banks in Chapter 11. The filter banks of this section are based entirely on the STFT, with consideration of the basic Fourier theorems introduced in Chapter 2. In Chapter 11, filter banks such as used in audio compression are addressed. However, when

*not*doing compression,

*i.e.*, when each channel of the filter bank does not have to be critically sampled, the methods of this section may suffice in many applications.

### Audio Filter Banks

It is well known that the frequency resolution of human hearing decreases with frequency [71,276]. As a result, any ``auditory filter bank'' must be a*nonuniform*filter bank in which the channel bandwidths increase with frequency over most of the spectrum. A classic approximate example is the

*third-octave filter bank*.

^{11.14}A simpler (cruder) approximation is the

*octave filter bank*,

^{11.15}also called a

*dyadic filter bank*when implemented using a binary tree structure [287]. Both are examples of

*constant-Q filter banks*[29,30,244], in which the bandwidth of each filter-bank channel is proportional to center frequency [263]. Approximate auditory filter banks, such as constant-Q filter banks, have extensive applications in computer music, audio engineering, and basic hearing research. If the output signals from all channels of a constant-Q filter bank are all

*sampled*at a particular time, we obtain what may be called a constant-Q

*transform*[29]. A constant-Q transform can be efficiently implemented by smoothing the output of a Fast Fourier Transform (FFT) [30]. More generally, a

*multiresolution spectrogram*can be implemented by combining FFTs of different lengths and advancing the FFTs forward through time (§7.3). Such nonuniform filter banks can also be implemented based on the Goetzel algorithm [33]. While the topic of

*filter banks*is well developed in the literature, including constant-Q, nonuniform FFT-based, and wavelet filter banks, the simple, robust methods presented in this section appear to be new [265]. In particular, classic nonuniform FFT filter banks as described in [226] have not offered the

*perfect reconstruction*property [287] in which the filter-bank sum yields the input signal exactly (to within a delay and/or scale factor) when the filter-band signals are not modified. The voluminous literature on perfect-reconstruction filter banks [287] addresses nonuniform filter banks, such as dyadic filter banks designed based on pseudo quadrature mirror filter designs, but simpler STFT methods do not yet appear to be incorporated. In the cosine-modulated filter-bank domain, subband DCTs have been used in a related way [302], but apparently without consideration for the possibility of a common time domain across multiple channels.

^{11.16}This section can be viewed as an extension of [30] to the FFT filter-bank case. Alternatively, it may be viewed as a novel method for nonuniform FIR filter-bank design and implementation, based on STFT methodology, with arbitrarily accurate reconstruction and controlled aliasing in the downsampled case. While we consider only auditory (approximately constant-Q) filter banks, the method works equally well for arbitrary nonuniform spectral partitions and

*overlap-add decompositions*in the frequency domain.

###

Basic Idea

The basic idea is to partition FFT bins into the desired nonuniform
bands, and perform smaller inverse FFTs on each subband to synthesize
downsampled time-domain signals in each band. A simple example for a
length 8 FFT octave filter bank is shown in Fig.10.30. We'll next
extend this example to a practically useful level via a series of
tutorially oriented matlab examples.
###

Summing STFT Bins

In the Short-Time Fourier Transform, which implements a *uniform*FIR filter bank (Chapter 9), each FFT bin can be regarded as one sample of the filter-bank output in one channel. It is elementary that summing adjacent filter-bank signals sums the corresponding pass-bands to create a wider pass-band. Summing adjacent FFT bins in the STFT, therefore, synthesizes one sample from a wider pass-band implemented using an FFT. This is essentially how a constant-Q transform is created from an FFT in [30] (using a different frequency-weighting, or ``smoothing kernel''). However, when making a filter bank, as opposed to only a transform used for spectrographic purposes, we must be able to step the FFT through time and compute properly sampled time-domain filter-bank signals. The wider pass-band created by adjacent-channel summing requires a higher sampling rate in the time domain to avoid aliasing. As a result, the maximum STFT ``hop size'' is limited by the widest pass-band in the filter bank. For audio filter banks, low-frequency channels have narrow bandwidths, while high-frequency channels are wider, thereby forcing a smaller hop size for the STFT. This means that the low-frequency channels are heavily oversampled when the high-frequency channels are merely adequately sampled (in time) [30,88]. In an octave filter-bank, for example, the top octave, occupying the entire upper half of the spectrum, requires a time-domain step-size of no more than two samples, if aliasing of the band is to be avoided. Each octave down is then oversampled (in time) by an additional factor of 2.

###

Inverse Transforming STFT Bin Groups

The solution proposed in [265] is to compute
*multiple*time samples for each high-frequency channel, so that one hop of the FFT produces all needed samples in each band. That way all channels can use the same hop-size without redundancy. If the narrowest band gets one sample per hop, then a band times as wide must produce samples per hop. A fast way to compute multiple time samples from the frequency-samples (FFT bins) of a given band is an inverse FFT, as shown in Fig.10.30. In matlab notation, let

`X(1:N)`denote the FFT (length

`N`) of the current frame of data in the STFT, and denote the lower and upper spectral samples for band

`k`by

`lsn(k)`and

`hsn(k)`, respectively. Then we may specify the matlab computation of the full-rate time-domain signal corresponding to band

`k`as follows:

BandK = X(lsn(k):hsn(k)); z1 = zeros(1,lsn(k)-1); z2 = zeros(1,N-hsn(k)); BandKzp = [z1, BandK, z2]; % (1) x(k,:) = ifft(BandKzp);where

`x(k,:)`denotes the output signal vector (length

`N`) for the

`k`th filter-bank channel for the current time-domain STFT window. Let denote the number of FFT bins in band . When is a power of 2, we can apply an inverse FFT only to the nonzero samples in the band:

xd{k} = ifft(BandK);where

`xd{k}`now denotes a

*cell array*for holding the ``downsampled'' signal vectors (since the downsampling factor is typically different for different bands). We may relate

`x(k,:)`and

`xd{k}`by noting that, when

`Lk = N/Nk`is an integer, we have that the relation

BandK == alias(BandKzp,Lk) % (2)is true, for each element, where

`alias(x,K)`denotes aliasing of

`K`equal partitions of the vector

`x`(§2.3.12):

^{11.17}

function y = alias(x,L) Nx = length(x); Np = Nx/L; % aliasing-partition length x = x(:); % ensure column vector y = zeros(Np,1); for i=1:L y = y + x(1+(i-1)*Np : i*Np); endBy the

*aliasing theorem*(

*downsampling theorem*) for Discrete Fourier Transforms (DFTs) (§2.3.12), the relation

`(2)`in the frequency domain corresponds to

xd{k} == Lk * x(k,1:Lk:end)in the time domain,

*i.e.*,

`xd{k}`is obtained from

`x(k,:)`by means of

*downsampling*by the factor

`Lk`. This produces

`N/Lk == Nk`samples. That is, for a band that is

`Nk`bins wide, we obtain

`Nk`time-domain samples for each STFT frame, when critically sampled. (At the full rate, we obtain

`N`samples from each channel for each frame.) We see that taking an inverse FFT of only the bins in a given channel computes the critically downsampled signal for that channel. Downsampling factors between 1 and

`Lk`can be implemented by choosing a zero-padding factor for the band somewhere between 1 and

`Lk`samples. Note that this filter bank has the

*perfect reconstruction*(PR) property [287]. That is, the original input signal can be exactly reconstructed (to within a delay and possible scale factor) from the channel signals. The PR property follows immediately from the exact invertibility of the discrete Fourier transform. Specifically, we can recover the original signal frame by taking an FFT of each channel-signal frame and abutting the spectral bins so computed to form the original frame spectrum. Of course, the underlying STFT (uniform filter bank) must be PR as well, as it routinely is in practice.

###

Improving the Channel Filters

Recall that each FFT bin can be viewed as a sample from a bandpass
filter whose frequency response is a frequency-shift of the FFT-window
Fourier transform (§9.3). Therefore, the frequency
response of a channel filter obtained by summing `Nk`adjacent FFT bins is given by the sum of

`Nk`window transforms, one for each FFT bin in the sum. As a result, the stop-band of the channel-filter frequency response is a sum of

`Nk`window side lobes, and by controlling window side-lobe level, we may control the stop-band gain of the channel filters. The

*transition width*from pass-band to stop-band, on the other hand, is given by the main-lobe width of the window transform (§5.5.1). In the previous subsection, by zero-padding the band (line

`(1)`above), we implicitly assumed a transition width of one bin. Only the length

`N`rectangular window can be reasonably said to have a one-bin transition from pass-band to stop-band. Since the first side lobe of a rectangular window transform is only about 13 dB below the main lobe, the rectangular window gives poor stop-band performance, as illustrated in Fig.10.33. Moreover, we often need FFT data windows to be shorter than the FFT size

`N`(

*i.e.*, we often need zero-padding in the time domain) so that the frame spectrum will be oversampled, enabling various spectral processing such as linear filtering (Chapter 8). One might wonder how the length

`N`rectangular window can be all that bad when it gives the perfect reconstruction property, as demonstrated in the previous subsection. The answer is that there is a lot of aliasing in the channel signals, when downsampled, but this aliasing is exactly canceled in the reconstruction, provided the channel signals were not modified in any way. Going back to §10.7.3, we need to replace the zero-padded band

`(1)`by a proper filtering operation in the frequency domain (a ``spectral window''):

BandK2 = Hk .* X; x(k,:) = ifft(BandK2); % full rate BandK2a = alias(BandK2,Nk); xd{k} = ifft(BandK2a); % crit sampwhere the channel filter frequency response

`Hk`may be prepared in advance as the appropriate weighted sum of FFT-window transforms:

Hideal = [z1,ones(1,Nk),z2]; Hk = cconvr(W,Hideal); % circ. conv.where

`z1`and

`z2`are the same zero vectors defined in §10.7.3, and

`cconvr(W,H)`denotes the

*circular convolution*of two real vectors having the same length [264]:

function [Y] = cconvr(W,X) wc=fft(W); xc=fft(X); yc = wc .* xc; Y = real(ifft(yc));Note that in this more practical case, the perfect reconstruction property no longer holds, since the operation

BandK2a = alias(Hk .* X, Nk);is not exactly invertible in general.

^{11.18}However, we may approach perfect reconstruction arbitrarily closely by only aliasing stop-band intervals onto the pass-band, and by increasing the stop-band attenuation of

`Hk`as desired. In contrast to the PR case, we do not rely on aliasing cancellation, which is valuable when the channel signals are to be modified. The band filters

`Hk`can be said to have been designed by the

*window method*for FIR filter design [224]. (See functions

`fir1`and

`fir2`in Octave and/or the Matlab Signal Processing Toolbox.)

###

Fast Octave Filter Banks

Let's now apply the above technique to the design of an octave filter
bank.^{11.19}At first sight, this appears to be a natural fit, because it is immediately easy to partition a power of 2 (typical for the FFT size

`N`) into octaves, each having a width in bins that is a power of 2. For example, when

`N == 8`, we have the following stack of frequency-response vectors for the rectangular-window, no-zero-padding, complex-signal case:

H = [ ... 0 0 0 0 1 1 1 1 ; ... 0 0 1 1 0 0 0 0 ; ... 0 1 0 0 0 0 0 0 ; ... 1 0 0 0 0 0 0 0 ];Figure 10.31 depicts the resulting filter bank schematically in the frequency domain.

^{11.20}Thus,

`H(1,:)`is the frequency-response for the top (first) octave,

`H(2,:)`is the frequency-response for the next-to-top (second) octave,

`H(3,:)`is the next octave down, and

`H(4,:)`is the ``remainder'' of the spectrum. (In every octave filter bank, there is a final low-frequency band. A schematic of the filter-bank implementation using FFTs is shown in Fig.10.30. The division by

`N`called for by

`ifft(N)`is often spread out among the

`ifft`butterfly stages as divisions by 2 (one-bit right-shifts in fixed-point arithmetic) after each of them. For conservation of dynamic range, half of the right-shifts can be spread out among every other forward-FFT butterfly stage [28] as well. The simple spectral partitioning of Fig.10.31 is appropriate for

*complex*signals--those for which the spectrum is regarded as spanning 0 Hz to the sampling rate. For real signals, we need a spectral partition more like that in Fig.10.32. Unfortunately, the number of spectral samples in Fig.10.32 is 14--not a power of 2. (The previous length 8 complex case maps to length 14 real case because the dc and Nyquist-limit samples do not have complex-conjugate counterparts.) Discarding the sample at the Nyquist limit--number 8 in Fig.10.32--does not help, since that gives 13 samples--still not a power of 2. In summary, there is no obvious way to octave-partition the spectral samples of a real signal while maintaining the power-of-2 condition for each band and symmetrically partitioning positive and negative frequencies. A real signal can of course be converted to its corresponding ``analytic signal'' by filtering out the negative-frequency components. This is normally done by designing a

*Hilbert transform filter*[133,224]. However, such filters are large-order FIR filters, exactly like we are trying to design! If we design our filter bank properly, we can use

*it*to zero the negative-frequency components.

###

Spectral Rotation of Real Signals

Note that if we *rotate*the spectrum of a real signal by half a bin, we obtain positive-frequency samples and negative-frequency samples, with no sample at dc or at the Nyquist limit. This is typically desirable for audio signals because dc is inaudible and the Nyquist limit is a degenerate point of the spectrum that, for example, cannot have a phase other than 0 or . If is a power of 2, then so is , and the octave-band partitioning of the previous subsection can be applied separately to each half of the spectrum:

N = 32; x = randn(1,N); % Specific example LN = round(log2(N)); % number of octave bands shifter = exp(-j*pi*[0:N-1]/N); % half-bin xs = x .* shifter; % apply spectral shift X = fft(xs,N); % project xs onto rotated basis XP = X(1:N/2); % positive-frequency components XN = X(N:-1:N/2+1); % neg.-frequency components YP = dcells(XP); % partition to octave bands YN = dcells(XN); % ditto for neg. frequencies YPe = dcells2spec(YP); % unpack "dyadic cells" YNe = dcells2spec(YN); % unpack neg. freqs YNeflr = fliplr(YNe); % undo former flip ys = ifft([YPe,YNeflr],N,2); y = real(ones(LN,1)*conj(shifter) .* ys); % = octave filter-bank signals (real) yr = sum(y); % filter-bank sum (should equal x) yrerr = x - yr; disp(sprintf(... 'Total filter-bank sum L2 error = %0.2f %%',... 100*norm(yrerr)/norm(x)));In the above listing, the function

`dcells`

^{11.21}simply forms a cell array in matlab containing the spectral partition (``dyadic cells''). The function

`dcells2spec`

^{11.22}is the inverse of

`dcells`, taking a spectral partition and unpacking it to produce a usual spectrum vector.

###

Improving the Octave Band Filters

To see the true filter-bank frequency response corresponding to
Fig.10.31, we may feed an impulse to the filter bank and
take a long FFT (for zero padding) of the channel signals to
produce the interpolated
response shown in Fig.10.33.
The horizontal line along 0 dB in Fig.10.33 was
obtained by summing the channel responses, indicating that it is a
perfect-reconstruction filter-bank, as expected. However, the
stop-band performance of the channels is quite poor, being comparable
to the side-lobes of a rectangular window transform (an aliased sinc function).
In fact, the stop-band is identical to the rectangular-window side-lobes for the
two lowest bands. Notice that the original eight samples of
Fig.10.31 still lie along the 0-dB line, and there are
still zeros in each channel response beneath the unity-gain point of
every other channel's response, so Fig.10.31 can be
obtained by sampling Fig.10.33 at those eight
points. However, the interpolated response shows that the filter bank
is quite poor by audio standards.
*Dolph-Chebyshev window*was used (using the matlab function call

`chebwin(127,80)`). The FFT size is about twice the filter length, thereby allowing for a data frame of equal length (to the filter) without incurring time aliasing, in the usual way for STFTs (Chapter 8). The data window is chosen to overlap-add to a constant, as typical in STFTs, so its choice does not affect the quality of the filter-bank output channel signals. We therefore may continue to use a rectangular data window that advances by its full length each frame. Choosing an odd filter length facilitates zero-phase offline STFT processing, in which the middle FIR sample is treated as occurring at time zero, so that there is no delay in any filter-bank channel [264]. The filter bank is PR in the full-rate case because the underlying STFT is PR, in the absence of modifications, and because the channel filter-bank is constant-overlap-add in the frequency domain (again in the absence of modifications) according to STFT theory (Chapter 8).

###

Aliasing on Downsampling

While the filter bank of
Fig.10.34 gives good
stop-band rejection, there is still a significant amount of
*aliasing*when the bands are critically sampled. This happens because the

*transition bands*are aliased about their midpoints. This can be seen in Fig.10.34 by noting that aliasing ``folding frequencies'' lie at the crossover point between each pair of bands. An overlay of the spectra of the downsampled filter-bank outputs, for an impulse input, is shown in Fig.10.35.

*step*input (same filter bank). (This type of plot looks ideal for an impulse input signal because the spectrum is constant, so the aliased bands are also constant.) Note the large slice of dc energy that has aliased from near the sampling rate to near half the sampling rate in the top octave band. The signal and error spectra are shown overlaid in Fig.10.37. In this case, the aliasing causes significant error in the reconstruction.

###

Restricting Aliasing to Stop-Bands

To eliminate the relatively heavy transition-band aliasing (when
critically sampling the channel signals), we can define
*overlapping bands*such that each band encompasses the transition bands on either side. However, unless a full oversampling is provided for each band (which is one easy solution), the bandwidth (in bins) is no longer a power of two, thereby thwarting use of radix-2 inverse-FFTs to compute the time-domain band signals. To keep the channel bandwidths at powers of two while restricting aliasing to stop-band energy, the IFFT bands can be widened to

*include*transition bands on either side. That is, the desired pass-band

*plus*the two transition bands span a power-of-two bins. This results in overlapping channel IFFTs. Figure 10.38 shows how the example of Fig.10.34 is modified by this strategy.

^{11.23}The band should roll off to reach its stop-band at the edge of the wider encompassing band. It is fine to have extra space in the wider band, and this may be filled with a continuation of the enclosed band's stop-band response (or some tapering of it--since we assume stop-band energy is negligible, the difference should be inconsequential). The desired bands may overlap each other by any amount, and may have any desired shape. The encompassing bands then overlap further to reach the next power of two (in width) for each overlapping extended band. (See the gammatone and gammachirp filter banks for examples of heavily overlapping bands in an audio filter bank [111].) In this approach, pass-bands of arbitrary width are embedded in overlapping IFFT bands that are a power-of-2 wide. As a result of this flexibility, the frequency-rotation trick of §10.7.7 is no longer needed for real filter banks. Instead, we simply allocate any desired bands between dc and half the sampling rate, and then conjugate-symmetry dictates the rest. In addition to a left-over ``dc-Nyquist'' band, there is a similar residual ``Nyquist-limit'' band (a typically negligible band about half the sampling rate). In other words, since the pass-bands may be any width and the encompassing IFFT bands may overlap by any amount, they do not have to ``pack'' conveniently as power-of-two blocks. The minimum channel bandwidth is defined as two transition bands plus one bin (

*i.e.*, the minimum pass-band width is zero, corresponding to one bin, or one spectral sample). For the Dolph-Chebyshev window, the transition bandwidth is known in closed form [155]. In our examples, we have a length 127 window with 80 dB stop-band attenuation in the lowpass prototype [

`chebwin(127,80)`], corresponding to a transition width of 6.35 bins in a length 256 FFT, which was rounded up to 7 bins in software for simplicity of band allocation. Therefore, our minimum channel bandwidth is 15 bins (two transition bands plus one sample for the band center). The next highest power of two is 16, so that is our minimum encompassing IFFT length for any band. The dc and Nyquist channels are combined into a single channel containing the left-over residual filter-bank response consisting of a low transition down from dc and a high-frequency transition up to the sampling rate (in the complex-signal case). When

`N`is sufficiently large so that these bands contain no audible energy, they may be discarded. We include them in all examples here so as to preserve the (near) perfect reconstruction property of the filter bank. Thus, the 7-bin dc channel is combined with the 7-bin Nyquist channel to form a single 16-bin encompassing residual band that may be discarded in many audio applications (when the initial FFT size is sufficiently large for the sampling rate used). In the example of Fig.10.38, the initial FFT size is 256, and the channel bandwidths (pass-bands only, excluding transitions), from top to bottom, are 121, 64, 32, 16, and 8 bins. The top band is reduced by 7 bins to leave a transition band to the sampling rate. Similarly, the lowest band lies above a transition band consisting of bins 0-6. The encompassing IFFTs (containing transitions) are lengths 256, 128, 64, 32, 32, for the interior bands, and a length 32 IFFT handles the dc and Nyquist bands (which are combined into a single 14-bin band about dc, which occupies 28 bins when the transition bands are appended). Letting [lo,hi] denote a band by its lower and upper bin limit, the non-overlapping adjacent pass-band edges in ``spectral samples''

^{11.24}of the interior bands are [8, 15], [16, 31], [32, 63], [64, 127], and [128, 248]; the overlapping encompassing IFFT band edges are then [1, 32], [9, 40], [25, 88], [57, 184], [1, 256],

*i.e.*, they each contain a pass-band and two transition bands, and have a power-of-2 length. The downsampling factor for each channel can be computed as the initial FFT size (256) divided by the IFFT size ( , , , , or ) for the channel. Figure 10.39 shows the counterpart of Fig.10.35 for this example. In this case, the aliased signal energy comes only from channel-filter stop-bands. For narrow bands, the aliasing is suppressed by at least 80 dB (the side-lobe level of the chosen Dolph-Chebyshev window transform). For bands significantly wider than one bin (the minimum bandwidth in this example is the dc-Nyquist band at 14 bins), the stop-band consists of a sum of shifts of the window-transform side lobes, and these are found to be more than 80 dB down due to cancellation (more than 90 dB down in most bands of this example).

####

Tightening the IFFTs

In this example the top band is not downsampled at all, and
the interior bands are oversampled by approximately 2. This is because
the desired pass-band widths started out at a power of 2, so that the
addition of transition bands forced the next higher power of 2 for the
IFFT size. Narrowing the width of the top band from 121 bins to
bins would enable use of a length 128 IFFT for
the top band, and similarly for the lower bands. In other words, when
the desired spectral partition is that of an ideal octave filter bank,
as sketched in Fig.10.31, narrowing each octave-band by
twice the transition width of the lowpass prototype filter (and
``covering down'' to keep them adjacent) will produce a relatively
``tight'' FFT filterbank design in which the IFFT sizes remain the
same length as in the heavily aliased case discussed above
(Fig.10.35).
When applied to the octave filter bank, the pass-bands become a little
narrower than one octave. We may call this a *quasi octave filter bank*.

### Real Filter Bank Example

Finally, Fig.10.40 shows the appearance of the octave filter bank for real signals. In this case, bands are constructed between dc and*half*the sampling rate, and conjugate symmetry is used to automatically construct the desired bands between half the sampling rate and the sampling rate. The band allocation algorithm therefore needs no modification for the real-signal case.

### Optimal Band Filters

In the filter-bank literature, one class of filter banks is called ``cosine modulated'' filter banks. DFT filter banks are similar. The lowpass-filter prototype in such filter banks can be used in place of the Dolph-Chebyshev window used here. Therefore, any result on optimal design of cosine-modulated filter banks can be adapted to this context. See, for example, [253,302]. Note, however, that in principle a separate optimization is needed for each different channel bandwidth. An optimal lowpass prototype only optimizes channels having a one-bin pass-band, since the prototype frequency-response is merely shifted in frequency (cosine-modulated in time) to create the channel frequency response. Wider channels are made by summing such channel responses, which alters the stop-bands. In practice, the Dolph-Chebyshev window, used in the examples of this section, typically yields a filter bank magnitude frequency response that is optimal in the Chebyshev sense, when at least one channel is minimum width, because- there can be only one lowpass prototype filter in any modulated filter bank (such as the DFT filter bank),
- the prototype itself is the optimal Chebyshev lowpass filter of minimum bandwidth, and
- summing shifted copies of the prototype frequency response
(to synthesize a wider pass-band)
generally
*improves*the stop-band rejection over that of the prototype, thereby meeting the Chebyshev optimality requirement for the filter-bank as a whole (keeping below the worst-case deviation of the prototype).

**Next Section:**

FFT Filter-Bank Summary and Fourier Duality with OLA

**Previous Section:**

Gaussian Windowed Chirps (Chirplets)