Fast Fourier Transform (FFT) Algorithms
The term
fast Fourier transform (
FFT) refers to an efficient
implementation of the discrete
Fourier transform (
DFT) for highly
composite
A.1 transform
lengths

. When computing the DFT as a set of
inner products of
length

each, the computational complexity is

. When

is an integer power of 2, a
Cooley-Tukey FFT algorithm delivers
complexity

, where

denotes the log-base-2 of

, and

means ``on the order of

''.
Such FFT algorithms were evidently first used by Gauss in 1805
[
30] and rediscovered in the 1960s by Cooley and Tukey
[
16].

In this appendix, a brief introduction is given for various FFT
algorithms. A tutorial review (1990) is given in
[
22].
Additionally, there are some excellent FFT ``home pages'':
Pointers to FFT software are given in §
A.7.
When the desired
DFT length

can be expressed as a product of
smaller integers, the Cooley-Tukey decomposition provides what is
called a
mixed radix Cooley-Tukey
FFT algorithm.
A.2
Two basic varieties of Cooley-Tukey FFT are
decimation in time
(DIT) and its Fourier dual,
decimation in frequency (DIF). The
next section illustrates
decimation in time.
The
DFT is defined by
where

is the input
signal amplitude at time

, and
Note that

.
When

is even, the DFT summation can be split into sums over the
odd and even indexes of the input signal:
where

and

denote the even-
and odd-indexed samples from

. Thus, the length

DFT is
computable using two length

DFTs. The complex factors

are called
twiddle factors. The splitting
into sums over even and odd time indexes is called
decimation in
time. (For
decimation in frequency, the inverse DFT of the
spectrum

is split into sums over even and odd
bin
numbers 
.)
Radix 2 FFT
When

is a power of

, say

where

is an integer,
then the above DIT decomposition can be performed

times, until
each
DFT is length

. A length

DFT requires no multiplies. The
overall result is called a
radix 2 FFT. A different radix 2
FFT is derived by performing
decimation in frequency.
A
split radix FFT is theoretically more efficient than a pure
radix 2 algorithm [
73,
31] because it
minimizes real arithmetic operations. The term ``split radix'' refers
to a DIT decomposition that combines portions of one radix 2 and two
radix 4 FFTs [
22].
A.3On modern general-purpose
processors, however, computation time is often not minimized by
minimizing the arithmetic operation count (see §
A.7 below).
Radix 2 FFT Complexity is N Log N
Putting together the length
DFT from the

length-

DFTs in
a radix-2 FFT, the only multiplies needed are those used to combine
two small DFTs to make a DFT twice as long, as in Eq.

(
A.1).
Since there are approximately

(complex) multiplies needed for each
stage of the DIT decomposition, and only

stages of DIT (where

denotes the log-base-2 of

), we see that the total number
of multiplies for a length

DFT is reduced from

to

, where

means ``on the order of

''. More
precisely, a complexity of

means that given any
implementation of a length-

radix-2 FFT, there exist a constant

and integer

such that the computational complexity

satisfies
for all

. In summary, the complexity of the radix-2 FFT is said
to be ``N log N'', or

.
Recall (
e.g., from Eq.

(
6.1)) that the inverse
DFT requires a
division by

that the forward DFT does not. In
fixed-point
arithmetic (Appendix
G), and when

is a power of 2,
dividing by

may be carried out by right-shifting

places in the binary word. Fixed-point implementations of the inverse
Fast Fourier Transforms (FFT) (Appendix
A) typically right-shift one
place after each Butterfly stage. However, superior overall numerical
performance may be obtained by right-shifting after every
other
butterfly stage [
8], which corresponds to dividing both the
forward and inverse FFT by

(
i.e.,

is implemented
by
half as many right shifts as dividing by

). Thus, in
fixed-point, numerical performance can be improved, no extra work is
required, and the normalization work (right-shifting) is spread
equally between the forward and reverse transform, instead of
concentrating all

right-shifts in the inverse transform. The NDFT
is therefore quite attractive for fixed-point implementations.
Because
signal amplitude can grow by a factor of 2 from one butterfly
stage to the next, an extra guard bit is needed for each pair of
subsequent NDFT butterfly stages. Also note that if the DFT length

is not a power of

, the number of right-shifts in the
forward and reverse transform must differ by one (because

is odd instead of even).
Prime Factor Algorithm
(PFA)
By the prime factorization theorem, every integer

can be uniquely
factored into a product of
prime numbers 
raised to an
integer power

:
As discussed above, a mixed-radix
Cooley Tukey FFT can be used to
implement a length
DFT using DFTs of length

. However, for
factors of

that are mutually prime (such as

and

for

), a more efficient
prime factor
algorithm (PFA), also called the
Good-Thomas FFT algorithm,
can be used [
26,
80,
35,
43,
10,
83].
A.4
The
Chinese Remainder
Theorem
is used to
re-index either the input or output samples for the PFA.
A.5Since the PFA is only applicable to mutually prime factors of

, it
is ideally combined with a mixed-radix Cooley-Tukey FFT, which works
for any integer factors.
It is interesting to note that the PFA actually predates the
Cooley-Tukey FFT paper of 1965 [
16], with Good's
1958 work on the PFA being cited in that paper [
83].
The PFA and
Winograd transform [
43] are
closely related, with the PFA being somewhat faster [
9].
Rader's FFT algorithm can be used to compute
DFTs of length

in

operations when

is a
prime number.
For an introduction, see the
Wikipedia page for Rader's
FFT Algorithm:
http://en.wikipedia.org/wiki/Rader's_FFT_algorithm
Like
Rader's FFT,
Bluestein's FFT algorithm (also known as the
chirp
-transform algorithm), can be used to compute
prime-length
DFTs in

operations
[
24, pp. 213-215].
A.6 However,
unlike Rader's
FFT, Bluestein's algorithm is not restricted to prime
lengths, and it can compute other kinds of transforms, as discussed
further below.
Beginning with the DFT
where

denotes a primitive

th
root of
unity,
A.7 we multiply and divide by

to
obtain
where `

' denotes
convolution (§
7.2.4), and
the sequences

and

are defined by
where the ranges of

given are those actually required by the
convolution sum above. Beyond these required minimum ranges for

,
the sequences may be extended by zeros. As a result, we may implement
this convolution (which is cyclic for even

and ``negacyclic'' for
odd

) using
zero-padding and a larger
cyclic convolution, as
mentioned in §
7.2.4. In particular, the larger cyclic
convolution size

may be chosen a power of 2, which need
not be larger than

. Within this larger cyclic convolution, the
negative-

indexes map to

in the usual way.
Note that the sequence

above consists of the original data
sequence

multiplied by a
signal

which can be
interpreted as a sampled complex
sinusoid with instantaneous
normalized radian frequency

,
i.e., an instantaneous
frequency that increases linearly with time. Such signals are called
chirp signals. For this reason, Bluestein's algorithm is also
called the chirp

-transform algorithm [
59].
In summary, Bluestein's FFT algorithm provides complexity

for any positive integer DFT-length

whatsoever, even when

is
prime.
Other adaptations of the Bluestein FFT algorithm can be used to
compute a contiguous subset of DFT frequency samples (any uniformly
spaced set of samples along the unit circle), with

complexity. It can similarly compute samples of the

transform
along a sampled spiral of the form

, where

is any
complex
number, and

, again with complexity

[
24].
Fast Transforms in Audio DSP
Since most audio
signal processing applications benefit from
zero padding (see §
8.1), in which case we can always
choose the
FFT length to be a power of 2, there is almost never a need
in practice for more ``exotic'' FFT algorithms than the basic
``pruned'' power-of-2 algorithms. (Here ``pruned'' means elimination
of all unnecessary operations, such as when the input signal is real
[
74,
21].)
An exception is when processing exactly
periodic signals where the
period is known to be an exact integer number of samples in
length.
A.8 In such a case, the
DFT of one period of the waveform can be
interpreted as a
Fourier series (§
B.3)
of the
periodic waveform, and
unlike virtually all other practical
spectrum analysis scenarios,
spectral interpolation is not needed (or wanted). In the exactly
periodic case, the
spectrum is truly zero between adjacent
harmonic
frequencies, and the DFT of one period provides spectral samples only
at the
harmonic frequencies.
Adaptive FFT software (see §
A.7 below) will choose the fastest
algorithm available for any desired DFT length. Due to modern
processor architectures, execution time is not normally minimized by
minimizing arithmetic complexity [
23].
There are alternatives to the
Cooley-Tukey FFT which can serve the
same or related purposes and which can have advantages in certain
situations [
10]. Examples include the fast
discrete cosine transform (
DCT) [
5], discrete
Hartley transform [
21], and
number theoretic
transform [
2].
The DCT, used extensively in image coding, is described in §
A.6.1
below. The Hartley transform, optimized for processing real
signals,
does not appear to have any advantages over a ``pruned real-only
FFT''
[
74]. The number theoretic transform has special
applicability for large-scale, high-precision calculations (§
A.6.2
below).
In image coding (such as
MPEG and JPEG), and many audio coding
algorithms (MPEG), the
discrete cosine transform (DCT) is used
because of its nearly optimal asymptotic theoretical
coding gain.
A.9For 1D
signals, one of several DCT definitions (the one called
DCT-II)
A.10is given by
where
Note that

is the
DFT frequency for a length

DFT (as opposed to

).
For real signals, the real part of the DFT is a kind of DCT:
Thus, the real part of a double-length
FFT is the same as the DCT
except for the half-sample phase shift in the
sinusoidal basis
functions

(and a scaling by 2 which is
unimportant).
In practice, the DCT is normally implemented using the same basic
efficiency techniques as in FFT algorithms. In
Matlab and Octave
(
Octave-Forge), the functions
dct and
dct2 are
available for the 1D and 2D cases, respectively.
Exercise: Using
Euler's identity, expand the cosine
in the DCT defined by Eq.

(
A.2) above into a sum of complex
sinusoids, and show that the DCT can be rewritten as the sum of two
phase-modulated DFTs:
where

denotes the length

DFT of

.
The
number theoretic transform is based on generalizing the

th
primitive root of unity (see §
3.12) to a
``quotient ring'' instead of the usual field of
complex numbers. Let

denote a primitive

th
root of unity. We have been using

in the field of complex numbers, and it of course
satisfies

, making it a root of unity; it also has the
property that

visits all of the ``
DFT frequency points'' on
the unit circle in the

plane, as

goes from 0 to

.
In a
number theory transform,

is an
integer which
satisfies
where

is a prime integer. From number theory, for each
prime
number 
there exists at least one primitive root

such that

(modulo

) visits all of the numbers

through
in some order, as

goes from

to

. Since

for all integers

(another result from number
theory),

is also an

th root of unity, where

is the
transform size. (More generally,

can be any integer divisor

of

, in which case we use

as the generator of the
numbers participating in the transform.)
When the number of elements in the transform is composite, a ``fast
number theoretic transform'' may be constructed in the same manner as
a
fast Fourier transform is constructed from the DFT, or as the
prime factor algorithm (or Winograd transform) is constructed for
products of small mutually prime factors [
43].
Unlike the DFT, the number theoretic transform does not transform to a
meaningful ``
frequency domain''. However, it has analogous theorems,
such as the
convolution theorem, enabling it to be used for fast
convolutions and
correlations like the various
FFT algorithms.
An interesting feature of the number theory transform is that all
computations are
exact (integer multiplication and addition
modulo a prime integer). There is no round-off error. This feature
has been used to do fast convolutions to multiply extremely large
numbers, such as are needed when computing

to millions of digits
of precision.
FFT Software
For those of us in
signal processing research, the built-in
fft function in
Matlab (or Octave) is what we use almost all
the time. It is adaptive in that it will choose the best algorithm
available for the desired transform size.
For C or C++ applications, there are several highly optimized FFT
variants in the
FFTW
package
(``Fastest
Fourier Transform in the West'') [
23].
FFTW is free for non-commercial or free-software applications
under the terms of the
GNU General Public License.
For embedded
DSP applications (software running on special purpose DSP
chips), consult your vendor's software libraries and support website
for FFT algorithms written in optimized assembly language for your DSP
hardware platform. Nearly all DSP chip vendors supply free FFT
software (and other signal processing utilities) as a way of promoting
their hardware.
Next Section: Fourier Transforms for Continuous/Discrete Time/FrequencyPrevious Section: Example Applications of the DFT