Fast Fourier Transform (FFT) Algorithms

The term fast Fourier transform (FFT) refers to an efficient implementation of the discrete Fourier transform (DFT) for highly composite transform lengths $ N$. When computing the DFT as a set of $ N$ inner products of length $ N$ each, the computational complexity is $ {\cal O}(N^2)$. When $ N$ is an integer power of 2, a Cooley-Tukey FFT algorithm delivers complexity $ {\cal O}(N\lg N)$, where $ \lg N$ denotes the log-base-2 of $ N$, and $ {\cal O}(x)$ means ``on the order of $ x$''. Such FFT algorithms were evidently first used by Gauss in 1805 [30] and rediscovered in the 1960s by Cooley and Tukey [16].

In this appendix, a brief introduction is given for various FFT algorithms. A tutorial review (1990) is given in [22]. Additionally, there are some excellent FFT ``home pages'' on the Web.

Pointers to FFT software are given in §A.7.

Mixed-Radix Cooley-Tukey FFT

When the desired DFT length $ N$ can be expressed as a product of smaller integers, the Cooley-Tukey decomposition provides what is called a mixed-radix Cooley-Tukey FFT algorithm.

Two basic varieties of Cooley-Tukey FFT are decimation in time (DIT) and its Fourier dual, decimation in frequency (DIF). The next section illustrates decimation in time.

Decimation in Time

The DFT is defined by

$\displaystyle X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}, \quad k=0,1,2,\ldots,N-1,
$

where $ x(n)$ is the input signal amplitude at time $ n$, and

$\displaystyle W_N \isdef e^{-j\frac{2\pi}{N}}.\quad \hbox{(primitive $N$th root of unity)}
$

Note that $ W_N^N=1$.

When $ N$ is even, the DFT summation can be split into sums over the odd and even indexes of the input signal:

\begin{eqnarray*}
X(\omega_k) &\isdef& \hbox{\sc DFT}_{N,k}\{x\} \isdef \sum_{n=0}^{N-1} x(n) e^{-j\omega_k n T},
\quad \omega_k \isdef \frac{2\pi k}{NT}\\
&=& \sum_{\stackrel{n=0}{\mbox{\tiny $n$\ even}}}^{N-2} x(n) e^{-j\omega_k n T}
+ \sum_{\stackrel{n=1}{\mbox{\tiny $n$\ odd}}}^{N-1} x(n) e^{-j\omega_k n T}\\
&=& \sum_{n=0}^{\frac{N}{2}-1} x(2n) e^{-j2\pi \frac{k}{N/2} n}
+ e^{-j2\pi\frac{k}{N}}\sum_{n=0}^{\frac{N}{2}-1} x(2n+1) e^{-j2\pi \frac{k}{N/2} n}\\
&=& \sum_{n=0}^{\frac{N}{2}-1} x_e(n) W_{N/2}^{kn}
+ W_N^k \sum_{n=0}^{\frac{N}{2}-1} x_o(n) W_{N/2}^{kn}\\
&\isdef& \hbox{\sc DFT}_{\frac{N}{2},k}\{\hbox{\sc Downsample}_2(x)\}
+ W_N^k\cdot\hbox{\sc DFT}_{\frac{N}{2},k}\{\hbox{\sc Downsample}_2[\hbox{\sc Shift}_1(x)]\}
\qquad\mbox{(A.1)}
\end{eqnarray*}

where $ x_e(n)\isdef x(2n)$ and $ x_o(n)\isdef x(2n+1)$ denote the even- and odd-indexed samples from $ x$. Thus, the length $ N$ DFT is computable using two length $ N/2$ DFTs. The complex factors $ W_N^k=e^{-j\omega_k}=\exp(-j2\pi k/N)$ are called twiddle factors. The splitting into sums over even and odd time indexes is called decimation in time. (For decimation in frequency, the inverse DFT of the spectrum $ X(\omega_k)$ is split into sums over even and odd bin numbers $ k$.)


Radix 2 FFT

When $ N$ is a power of $ 2$, say $ N=2^K$ where $ K>1$ is an integer, then the above DIT decomposition can be performed $ K-1$ times, until each DFT is length $ 2$. A length $ 2$ DFT requires no multiplies. The overall result is called a radix 2 FFT. A different radix 2 FFT is derived by performing decimation in frequency.
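The DIT recursion of Eq.$ \,$(A.1) is short enough to sketch directly in code. The following Python sketch (illustrative and unoptimized; the function names are ours, not from any standard library) computes a radix-2 DIT FFT recursively and provides a direct $ {\cal O}(N^2)$ DFT for checking:

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    X_even = fft_dit(x[0::2])           # DFT of x_e(n) = x(2n)
    X_odd = fft_dit(x[1::2])            # DFT of x_o(n) = x(2n+1)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * X_odd[k]   # twiddle W_N^k
        X[k] = X_even[k] + t
        X[k + N // 2] = X_even[k] - t   # uses W_N^{k+N/2} = -W_N^k
    return X

def dft_direct(x):
    """Direct O(N^2) DFT, for comparison."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]
```

Each recursion level performs $ N/2$ butterflies, each costing one complex (twiddle) multiply, which is the source of the $ {\cal O}(N\lg N)$ count discussed below.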

A split-radix FFT is theoretically more efficient than a pure radix-2 algorithm [73,31] because it minimizes real arithmetic operations. The term ``split radix'' refers to a DIT decomposition that combines portions of one radix-2 and two radix-4 FFTs [22]. On modern general-purpose processors, however, computation time is often not minimized by minimizing the arithmetic operation count (see §A.7 below).

Radix 2 FFT Complexity is N Log N

Putting together the length-$ N$ DFT from the $ N/2$ length-$ 2$ DFTs in a radix-2 FFT, the only multiplies needed are those used to combine two small DFTs to make a DFT twice as long, as in Eq.$ \,$(A.1). Since there are approximately $ N$ (complex) multiplies needed for each stage of the DIT decomposition, and only $ \lg N$ stages of DIT, the total number of multiplies for a length-$ N$ DFT is reduced from $ {\cal O}(N^2)$ to $ {\cal O}(N\lg N)$. More precisely, a complexity of $ {\cal O}(N\lg N)$ means that given any implementation of a length-$ N$ radix-2 FFT, there exist a constant $ C$ and an integer $ M$ such that the computational complexity $ {\cal C}(N)$ satisfies

$\displaystyle {\cal C}(N) \leq C N \lg N
$

for all $ N>M$. In summary, the complexity of the radix-2 FFT is said to be ``N log N'', or $ {\cal O}(N\lg N)$.
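The multiply count can be tabulated exactly from the recurrence $ M(1)=0$, $ M(N) = 2M(N/2) + N/2$ (one twiddle multiply per butterfly), whose closed form is $ (N/2)\lg N$. A small illustrative Python sketch comparing it against the direct DFT's $ N^2$:

```python
import math

def fft_multiplies(N):
    """Complex-multiply count of a radix-2 DIT FFT:
    M(1) = 0,  M(N) = 2 M(N/2) + N/2  (one twiddle multiply per butterfly)."""
    return 0 if N == 1 else 2 * fft_multiplies(N // 2) + N // 2

for N in (64, 1024, 65536):
    # columns: N, recurrence count, (N/2) lg N, and N^2 for the direct DFT
    print(N, fft_multiplies(N), (N // 2) * int(math.log2(N)), N * N)
```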


Fixed-Point FFTs and NFFTs

Recall (e.g., from Eq.$ \,$(6.1)) that the inverse DFT requires a division by $ N$ that the forward DFT does not. In fixed-point arithmetic (Appendix G), and when $ N$ is a power of 2, dividing by $ N$ may be carried out by right-shifting $ \log_2(N)$ places in the binary word. Fixed-point implementations of the inverse FFT (Appendix A) typically right-shift one place after each butterfly stage. However, superior overall numerical performance may be obtained by right-shifting after every other butterfly stage [8], which corresponds to dividing both the forward and inverse FFT by $ \sqrt{N}$ (i.e., $ \sqrt{N}$ is implemented by half as many right shifts as dividing by $ N$). Thus, in fixed-point, numerical performance can be improved at no extra cost, with the normalization work (right-shifting) spread equally between the forward and inverse transforms instead of concentrated entirely in the inverse transform. This normalized DFT (NDFT) is therefore quite attractive for fixed-point implementations.

Because signal amplitude can grow by a factor of 2 from one butterfly stage to the next, an extra guard bit is needed for each pair of subsequent NDFT butterfly stages. Also note that if the DFT length $ N=2^K$ is not a power of $ 4$, the number of right-shifts in the forward and reverse transform must differ by one (because $ K=\log_2(N)$ is odd instead of even).


Prime Factor Algorithm (PFA)

By the prime factorization theorem, every integer $ N$ can be uniquely factored into a product of prime numbers $ p_i$ raised to an integer power $ m_i\ge 1$:

$\displaystyle N = \prod_{i=1}^{n_p} p_i^{m_i}
$

As discussed above, a mixed-radix Cooley-Tukey FFT can be used to implement a length-$ N$ DFT using DFTs of length $ p_i$. However, for factors of $ N$ that are mutually prime (such as $ p_i^{m_i}$ and $ p_j^{m_j}$ for $ i\ne j$), a more efficient prime factor algorithm (PFA), also called the Good-Thomas FFT algorithm, can be used [26,80,35,43,10,83]. The Chinese Remainder Theorem is used to re-index either the input or output samples for the PFA. Since the PFA is applicable only to mutually prime factors of $ N$, it is ideally combined with a mixed-radix Cooley-Tukey FFT, which works for any integer factors.
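As a sketch of the re-indexing idea for two coprime factors (function names are ours; the short DFTs are done directly here), Good's ``Ruritanian'' input map and the CRT output map turn the length-$ N_1 N_2$ DFT into row and column DFTs with no twiddle factors in between:

```python
import cmath

def dft(x):
    """Direct DFT (used here for the short length-N1 and length-N2 transforms)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def pfa_dft(x, N1, N2):
    """Length-(N1*N2) DFT via the prime factor algorithm; requires gcd(N1, N2) = 1."""
    N = N1 * N2
    # Ruritanian (Good) input map: n = (N2*n1 + N1*n2) mod N
    grid = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    rows = [dft(row) for row in grid]                       # length-N2 DFTs
    cols = [dft([rows[n1][k2] for n1 in range(N1)])         # length-N1 DFTs
            for k2 in range(N2)]
    # CRT output map: k = (N2*(N2^{-1} mod N1)*k1 + N1*(N1^{-1} mod N2)*k2) mod N
    a, b = pow(N2, -1, N1), pow(N1, -1, N2)
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(N2 * a * k1 + N1 * b * k2) % N] = cols[k2][k1]
    return X
```

With these index maps the cross terms in the exponent $ nk/N$ become integers and vanish, which is precisely why no twiddle multiplies appear between the two stages.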

It is interesting to note that the PFA actually predates the Cooley-Tukey FFT paper of 1965 [16], with Good's 1958 work on the PFA being cited in that paper [83].

The PFA and Winograd transform [43] are closely related, with the PFA being somewhat faster [9].


Rader's FFT Algorithm for Prime Lengths

Rader's FFT algorithm can be used to compute DFTs of length $ N$ in $ {\cal O}(N\lg N)$ operations when $ N$ is a prime number. For an introduction, see the Wikipedia page for Rader's FFT Algorithm:

http://en.wikipedia.org/wiki/Rader's_FFT_algorithm
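As a rough illustration of the idea (the names and structure here are ours, and the length-$ (p-1)$ cyclic convolution is done naively rather than by FFT, which is where the $ {\cal O}(N\lg N)$ cost would come from in practice), Rader's re-indexing by a primitive root $ g$ modulo the prime $ p$ can be sketched as:

```python
import cmath

def primitive_root(p):
    """Brute-force search for a generator g of the nonzero integers mod prime p."""
    for g in range(2, p):
        if len({pow(g, q, p) for q in range(p - 1)}) == p - 1:
            return g
    raise ValueError("p must be prime")

def rader_dft(x):
    """DFT of prime length p via one length-(p-1) cyclic convolution.
    With n = g^{-q} and k = g^m (mod p), the sum over nonzero n becomes
    a cyclic convolution of a_q = x(g^{-q}) with b_q = W^{g^q}."""
    p = len(x)
    g = primitive_root(p)
    ginv = pow(g, p - 2, p)                     # g^{-1} mod p
    W = lambda e: cmath.exp(-2j * cmath.pi * e / p)
    a = [x[pow(ginv, q, p)] for q in range(p - 1)]
    b = [W(pow(g, q, p)) for q in range(p - 1)]
    X = [0j] * p
    X[0] = sum(x)                               # DC bin handled separately
    for m in range(p - 1):                      # naive cyclic convolution
        conv = sum(a[q] * b[(m - q) % (p - 1)] for q in range(p - 1))
        X[pow(g, m, p)] = x[0] + conv
    return X
```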


Bluestein's FFT Algorithm

Like Rader's FFT, Bluestein's FFT algorithm (also known as the chirp $ z$-transform algorithm) can be used to compute prime-length DFTs in $ {\cal O}(N\lg N)$ operations [24, pp. 213-215]. However, unlike Rader's FFT, Bluestein's algorithm is not restricted to prime lengths, and it can compute other kinds of transforms, as discussed further below.

Beginning with the DFT

$\displaystyle X(k) = \sum_{n=0}^{N-1} x(n) W^{-kn}, \quad k=0,1,2,\ldots,N-1,
$

where $ W \isdeftext \exp(j2\pi/N)$ denotes a primitive $ N$th root of unity, we multiply and divide by $ W^{(n^2+k^2)/2}$ to obtain

\begin{eqnarray*}
X(k)
&=& \sum_{n=0}^{N-1}x(n)W^{-kn}\\
&=& \sum_{n=0}^{N-1}x(n)W^{-\frac{1}{2}n^2}W^{-\frac{1}{2}k^2}W^{\frac{1}{2}(k-n)^2}\\
&=& W^{-\frac{1}{2}k^2}\sum_{n=0}^{N-1}\left[x(n)W^{-\frac{1}{2}n^2}\right]W^{\frac{1}{2}(k-n)^2}\\
&=& W^{-\frac{1}{2}k^2}\,(x_q \ast w_q)_k,
\end{eqnarray*}

where `$ \ast $' denotes convolution (§7.2.4), and the sequences $ x_q$ and $ w_q$ are defined by

\begin{eqnarray*}
x_q(n) &\isdef& x(n)W^{-\frac{1}{2}n^2}, \quad n=0,1,2,\ldots,N-1,\\
w_q(n) &\isdef& W^{\frac{1}{2}n^2}, \quad n=-N+1,-N+2,\ldots,-1,0,1,2,\ldots,N-1,
\end{eqnarray*}

where the ranges of $ n$ given are those actually required by the convolution sum above. Beyond these required minimum ranges for $ n$, the sequences may be extended by zeros. As a result, we may implement this convolution (which is cyclic for even $ N$ and ``negacyclic'' for odd $ N$) using zero-padding and a larger cyclic convolution, as mentioned in §7.2.4. In particular, the larger cyclic convolution size $ N_2\ge 2N-1$ may be chosen to be a power of 2, which need not be larger than $ 4N-3$. Within this larger cyclic convolution, the negative-$ n$ indexes map to $ N_2-N+1:N_2-1$ in the usual way.

Note that the sequence $ x_q(n)$ above consists of the original data sequence $ x(n)$ multiplied by a signal $ \exp(j\pi n^2/N)$ which can be interpreted as a sampled complex sinusoid with instantaneous normalized radian frequency $ 2\pi n/N$, i.e., an instantaneous frequency that increases linearly with time. Such signals are called chirp signals. For this reason, Bluestein's algorithm is also called the chirp $ z$-transform algorithm [59].

In summary, Bluestein's FFT algorithm provides complexity $ N \lg N$ for any positive integer DFT-length $ N$ whatsoever, even when $ N$ is prime.
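A compact, unoptimized Python sketch of the algorithm follows, assuming only the chirp/convolution structure derived above (function names are ours); the power-of-2 FFT helper is the standard radix-2 recursion:

```python
import cmath

def fft_pow2(x):
    """Radix-2 DIT FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    E, O = fft_pow2(x[0::2]), fft_pow2(x[1::2])
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * O[k]
        X[k], X[k + N // 2] = E[k] + t, E[k] - t
    return X

def ifft_pow2(X):
    """Inverse FFT via conjugation: IDFT(X) = conj(DFT(conj(X))) / N."""
    N = len(X)
    return [v.conjugate() / N for v in fft_pow2([u.conjugate() for u in X])]

def bluestein_dft(x):
    """DFT of arbitrary length N via chirp modulation plus one
    power-of-2 cyclic convolution (Bluestein / chirp z-transform)."""
    N = len(x)
    Wneg = lambda e: cmath.exp(-2j * cmath.pi * e / N)  # = W^{-e} in the text's notation
    N2 = 1
    while N2 < 2 * N - 1:               # cyclic convolution size: power of 2, >= 2N-1
        N2 *= 2
    # x_q(n) = x(n) W^{-n^2/2}, zero-padded to length N2
    xq = [x[n] * Wneg(n * n / 2) for n in range(N)] + [0j] * (N2 - N)
    # chirp w_q(n) = W^{+n^2/2} for n = -(N-1)..(N-1), wrapped cyclically
    wq = [0j] * N2
    for n in range(-(N - 1), N):
        wq[n % N2] = Wneg(-n * n / 2)
    # cyclic convolution via FFTs, then demodulation: X(k) = W^{-k^2/2}(x_q * w_q)_k
    y = ifft_pow2([u * v for u, v in zip(fft_pow2(xq), fft_pow2(wq))])
    return [Wneg(k * k / 2) * y[k] for k in range(N)]
```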

Other adaptations of the Bluestein FFT algorithm can be used to compute a contiguous subset of DFT frequency samples (any uniformly spaced set of samples along the unit circle), with $ N \lg N$ complexity. It can similarly compute samples of the $ z$ transform along a sampled spiral of the form $ z^k$, where $ z$ is any complex number, and $ k=k_0,k_0+1,\ldots,k_0+N-1$, again with complexity $ N \lg N$ [24].


Fast Transforms in Audio DSP

Since most audio signal processing applications benefit from zero padding (see §8.1), in which case we can always choose the FFT length to be a power of 2, there is almost never a need in practice for more ``exotic'' FFT algorithms than the basic ``pruned'' power-of-2 algorithms. (Here ``pruned'' means elimination of all unnecessary operations, such as when the input signal is real [74,21].)

An exception is when processing exactly periodic signals whose period is known to be an exact integer number of samples in length. In such a case, the DFT of one period of the waveform can be interpreted as a Fourier series (§B.3) of the periodic waveform, and, unlike virtually all other practical spectrum-analysis scenarios, spectral interpolation is not needed (or wanted). In the exactly periodic case, the spectrum is truly zero between adjacent harmonic frequencies, and the DFT of one period provides spectral samples only at the harmonic frequencies.

Adaptive FFT software (see §A.7 below) will choose the fastest algorithm available for any desired DFT length. Due to modern processor architectures, execution time is not normally minimized by minimizing arithmetic complexity [23].


Related Transforms

There are alternatives to the Cooley-Tukey FFT which can serve the same or related purposes and which can have advantages in certain situations [10]. Examples include the fast discrete cosine transform (DCT) [5], discrete Hartley transform [21], and number theoretic transform [2].

The DCT, used extensively in image coding, is described in §A.6.1 below. The Hartley transform, optimized for processing real signals, does not appear to have any advantages over a ``pruned real-only FFT'' [74]. The number theoretic transform has special applicability for large-scale, high-precision calculations (§A.6.2 below).

The Discrete Cosine Transform (DCT)

In image coding (such as MPEG and JPEG) and many audio coding algorithms (MPEG), the discrete cosine transform (DCT) is used because of its nearly optimal asymptotic theoretical coding gain. For 1D signals, one of several DCT definitions (the one called DCT-II) is given by

\begin{eqnarray*}
\hbox{\sc DCT}_k(x) &=& 2\sum_{n=0}^{N-1} x(n) \cos\left[\frac{\pi k}{2N}(2n+1)\right],
\quad k=0,1,2,\ldots,N-1\\
&=& 2\sum_{n=0}^{N-1} x(n) \cos\left[\omega_k\left(n+\frac{1}{2}\right)\right]
\qquad\mbox{(A.2)}
\end{eqnarray*}

where

$\displaystyle \omega_k \isdef \frac{2\pi}{2N}k.
$

Note that $ \omega_k$ is the DFT frequency for a length $ 2N$ DFT (as opposed to $ N$).

For real signals, the real part of the DFT is a kind of DCT:

\begin{eqnarray*}
\mbox{re}\left\{\hbox{\sc DFT}_k(x)\right\}
&\isdef& \mbox{re}\left\{\sum_{n=0}^{N-1} x(n) e^{-j\omega_k n}\right\}\\
&=& \sum_{n=0}^{N-1} x(n) \cos\left(\omega_k n\right)
\end{eqnarray*}

Thus, the real part of a double-length FFT is the same as the DCT except for the half-sample phase shift in the sinusoidal basis functions $ s_k(n) = e^{j\omega_k n}$ (and a scaling by 2 which is unimportant).

In practice, the DCT is normally implemented using the same basic efficiency techniques as in FFT algorithms. In Matlab and Octave (Octave-Forge), the functions dct and dct2 are available for the 1D and 2D cases, respectively.

Exercise: Using Euler's identity, expand the cosine in the DCT defined by Eq.$ \,$(A.2) above into a sum of complex sinusoids, and show that the DCT can be rewritten as the sum of two phase-modulated DFTs:

$\displaystyle \hbox{\sc DCT}_{N,k} =
e^{j\omega_k\frac{1}{2}} \overline{X_{2N}(\omega_k)}
+e^{-j\omega_k\frac{1}{2}} X_{2N}(\omega_k)
$

where $ X_{2N}(\omega_k)=\hbox{\sc DFT}_{2N,k}(x)$ denotes the length $ 2N$ DFT of $ x$.
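For real signals, the identity in the exercise can at least be checked numerically. The following sketch (with hypothetical function names of our choosing) computes the DCT-II directly from Eq.$ \,$(A.2) and via the two phase-modulated length-$ 2N$ DFT terms:

```python
import cmath, math

def dct2_direct(x):
    """DCT-II per Eq.(A.2): 2 * sum_n x(n) cos(pi k (2n+1) / (2N))."""
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                    for n in range(N)) for k in range(N)]

def dct2_via_dft(x):
    """DCT-II via the exercise identity (real x assumed):
    DCT_{N,k} = e^{j w_k/2} conj(X_2N(w_k)) + e^{-j w_k/2} X_2N(w_k),
    where X_2N is the length-2N DFT of x zero-padded to 2N samples."""
    N = len(x)
    x2 = list(x) + [0.0] * N            # zero-pad to length 2N
    out = []
    for k in range(N):
        wk = 2 * math.pi * k / (2 * N)  # length-2N DFT frequency
        X2N = sum(x2[n] * cmath.exp(-1j * wk * n) for n in range(2 * N))
        out.append((cmath.exp(1j * wk / 2) * X2N.conjugate()
                    + cmath.exp(-1j * wk / 2) * X2N).real)
    return out
```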


Number Theoretic Transform

The number theoretic transform is based on generalizing the $ N$th primitive root of unity (see §3.12) to a ``quotient ring'' instead of the usual field of complex numbers. Let $ W_N$ denote a primitive $ N$th root of unity. We have been using $ W_N = \exp(-j2\pi/N)$ in the field of complex numbers, and it of course satisfies $ W_N^N=1$, making it a root of unity; it also has the property that $ W_N^k$ visits all of the ``DFT frequency points'' on the unit circle in the $ z$ plane as $ k$ goes from 0 to $ N-1$.

In a number theoretic transform, $ W_N$ is an integer which satisfies

$\displaystyle W_N^N = 1\left(\mbox{mod}\;p\right)
$

where $ p$ is a prime integer. From number theory, for each prime number $ p$ there exists at least one primitive root $ r$ such that $ r^n$ (modulo $ p$) visits all of the numbers $ 1$ through $ p-1$ in some order as $ n$ goes from $ 1$ to $ p-1$. Since $ m^{p-1}=1\left(\mbox{mod}\;p\right)$ for all integers $ m$ not divisible by $ p$ (Fermat's little theorem), $ r$ is also an $ N$th root of unity, where $ N=p-1$ is the transform size. (More generally, $ N$ can be any divisor of $ p-1$, say $ p-1=LN$, in which case we use $ W_N=r^L$ as the generator of the numbers participating in the transform.)

When the number of elements in the transform is composite, a ``fast number theoretic transform'' may be constructed in the same manner as a fast Fourier transform is constructed from the DFT, or as the prime factor algorithm (or Winograd transform) is constructed for products of small mutually prime factors [43].

Unlike the DFT, the number theoretic transform does not transform to a meaningful ``frequency domain''. However, it has analogous theorems, such as the convolution theorem, enabling it to be used for fast convolutions and correlations like the various FFT algorithms.

An interesting feature of the number theory transform is that all computations are exact (integer multiplication and addition modulo a prime integer). There is no round-off error. This feature has been used to do fast convolutions to multiply extremely large numbers, such as are needed when computing $ \pi $ to millions of digits of precision.
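The following Python sketch (illustrative only; it uses a naive $ {\cal O}(N^2)$ transform rather than a fast one) demonstrates the exact, round-off-free convolution theorem with $ p=97$ and $ N=16$ (which divides $ p-1=96$). Note that exact recovery of a convolution requires $ p$ to exceed the largest true convolution value:

```python
def primitive_root(p):
    """Smallest generator of the multiplicative group mod prime p (brute force)."""
    for r in range(2, p):
        if len({pow(r, n, p) for n in range(1, p)}) == p - 1:
            return r
    raise ValueError("p must be prime")

def ntt(x, p, W):
    """Number theoretic transform: X(k) = sum_n x(n) W^{nk} (mod p)."""
    N = len(x)
    return [sum(x[n] * pow(W, k * n, p) for n in range(N)) % p for k in range(N)]

def intt(X, p, W):
    """Inverse NTT: x(n) = N^{-1} sum_k X(k) W^{-nk} (mod p)."""
    N = len(X)
    Winv = pow(W, p - 2, p)   # W^{-1} mod p (Fermat)
    Ninv = pow(N, p - 2, p)   # N^{-1} mod p
    return [Ninv * sum(X[k] * pow(Winv, k * n, p) for k in range(N)) % p
            for n in range(N)]

# Exact cyclic convolution mod p = 97 with transform length N = 16 (16 | 96)
p = 97
r = primitive_root(p)
N = 16
W = pow(r, (p - 1) // N, p)            # element of order exactly N mod p
x = [1, 2, 3, 4] + [0] * (N - 4)
h = [5, 6, 7] + [0] * (N - 3)
# convolution theorem: pointwise product in the transform domain
y = intt([a * b % p for a, b in zip(ntt(x, p, W), ntt(h, p, W))], p, W)
# y is the exact convolution (cyclic = linear here), since all values < p
```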


FFT Software

For those of us in signal processing research, the built-in fft function in Matlab (or Octave) is what we use almost all the time. It is adaptive in that it will choose the best algorithm available for the desired transform size.

For C or C++ applications, there are several highly optimized FFT variants in the FFTW package (``Fastest Fourier Transform in the West'') [23]. FFTW is free for non-commercial or free-software applications under the terms of the GNU General Public License.

For embedded DSP applications (software running on special purpose DSP chips), consult your vendor's software libraries and support website for FFT algorithms written in optimized assembly language for your DSP hardware platform. Nearly all DSP chip vendors supply free FFT software (and other signal processing utilities) as a way of promoting their hardware.

