The Fast Fourier Transform (FFT) is a family of algorithms that compute the Discrete Fourier Transform (DFT) of a sequence in O(N log N) time rather than the O(N²) time required by a naive DFT implementation. It decomposes an N-point transform into smaller sub-transforms, dramatically reducing the number of multiplications and additions required.
In practice
In embedded systems, FFTs appear wherever frequency-domain analysis is needed: audio processing, vibration monitoring, motor control (detecting harmonics or fault signatures), power quality measurement, and software-defined radio. On MCUs with sufficient RAM and a hardware multiplier or FPU, fixed-point or floating-point FFTs are practical in real time. ARM Cortex-M4 (on variants that include an FPU) and Cortex-M7 cores are common targets because their DSP instructions and FPU make the butterfly operations efficient; CMSIS-DSP provides tested, optimized FFT implementations for these and other supported Cortex-M cores. Smaller 8- or 16-bit MCUs can run FFTs but are typically limited to short transforms (64 or 128 points) at modest sample rates.
Among the most widely implemented variants is the Cooley-Tukey radix-2 algorithm, which requires N to be a power of two. Radix-4, split-radix, and mixed-radix variants are also widely used and can be faster or more flexible for certain transform lengths, including lengths that are not powers of two. Twiddle factors (the complex exponential coefficients used in each butterfly stage) are usually precomputed and stored in flash to avoid expensive trigonometric calculations at runtime. The blog post "Computing FFT Twiddle Factors" and the series "How the Cooley-Tukey FFT Algorithm Works | Part 4 - Twiddle Factors" cover this in detail. When RAM is scarce, in-place FFT algorithms reuse the input buffer for output, though this destroys the original time-domain data.
A persistent pitfall is misunderstanding the output. The FFT of a real N-point sequence produces N complex bins, but because a real-valued input has conjugate-symmetric frequency-domain representation, only the first N/2+1 bins are independent (the remaining bins mirror them). Frequency resolution equals the sample rate divided by N, so increasing resolution requires either a longer transform or a lower sample rate. Spectral leakage occurs when the signal frequency does not fall exactly on a bin; windowing functions (Hann, Hamming, Blackman) mitigate leakage at the cost of frequency resolution. "FFT Interpolation Based on FFT Samples: A Detective Story With a Surprise Ending" addresses how to recover more accurate frequency estimates from bin data.
Memory is often the binding constraint on small MCUs. A 1024-point complex floating-point FFT requires 8 KB just for the data buffer (1024 x 2 x 4 bytes), plus twiddle-factor tables. Fixed-point implementations using 16-bit or 32-bit integers reduce memory and improve throughput at the cost of dynamic range and the need to manage scaling to avoid overflow. When the required transform length exceeds available memory or the FFT IP available on a device, it is sometimes possible to compute large DFTs by combining smaller FFTs, as described in "Computing Large DFTs Using Small FFTs."
Discussed on DSPRelated
Frequently asked
What transform length should I choose for my FFT?
For radix-2 Cooley-Tukey FFTs, N must be a power of two (64, 128, 256, ..., 4096 are common on MCUs). Choose N based on two competing requirements: frequency resolution (sample_rate / N) and time resolution (N / sample_rate). Larger N gives finer frequency bins but longer acquisition windows and more RAM. On memory-constrained parts, 256 to 1024 points is a typical practical range.
Should I use floating-point or fixed-point FFT on my MCU?
Floating-point (typically 32-bit float) is easier to implement correctly and avoids overflow management, but requires an FPU for acceptable speed. ARM Cortex-M4F and Cortex-M7 cores handle 32-bit float FFTs efficiently via CMSIS-DSP. Fixed-point (
Q15 or Q31) is faster on cores without an FPU, uses less memory, and is suitable for 16-bit DSP instruction sets, but requires careful attention to scaling at each butterfly stage to prevent overflow or excessive quantization noise.
Why does my FFT output look wrong or noisy?
Common causes include: (1) spectral leakage from a signal frequency that falls between bins -- apply a
window function before the FFT; (2)
aliasing from sampling below the
Nyquist rate -- ensure your anti-aliasing filter cutoff is below half the
sample rate; (3) DC offset producing a large spike in bin 0 -- subtract the mean from the input; (4) overflow in fixed-point implementations -- check your scaling strategy. Also verify that your input buffer is filled with exactly N fresh samples and that any index bit-reversal required by your FFT library is being applied.
Can I run an FFT and inverse FFT at the same time to save compute?
Yes, for real-valued inputs there is a well-known technique to compute a forward FFT and an
inverse FFT simultaneously using a single FFT pass by packing two real sequences into the real and imaginary parts of one complex transform. The blog post 'Simultaneously Computing a Forward FFT and an Inverse FFT Using a Single FFT' covers this approach in detail. This roughly halves the computation cost when both transforms are needed concurrently, which is useful in systems like audio codecs or overlap-add filter banks.
Is there a hardware FFT accelerator on common MCUs?
Dedicated FFT hardware is uncommon in general-purpose MCUs but appears in some DSP-oriented SoCs and FPGAs. Some MCU vendors provide DSP co-processors or peripherals that can assist in signal-processing pipelines (for example, some STM32 parts include the DFSDM peripheral, which is a digital filter interface targeted at audio acquisition, not an FFT accelerator). On FPGA targets, FFT IP cores are standard library components. For most Cortex-M4 and Cortex-M7 MCUs, the CMSIS-DSP software library with DSP instruction optimization is the practical path.
Differentiators vs similar concepts
The FFT is an algorithm for computing the
DFT, not a different transform. The DFT is the mathematical operation; the FFT is the efficient implementation. Using 'FFT' and 'DFT' interchangeably is common but imprecise: the DFT defines what is computed, the FFT defines how. Separately, the FFT is sometimes confused with the DTFT (Discrete-Time
Fourier Transform), which operates on infinite sequences and produces a continuous frequency spectrum -- the DFT/FFT operates on finite, sampled sequences and produces a discrete set of frequency bins. The FFT is also unrelated to the FWT (Fast Walsh-Hadamard Transform), which is a different transform used in spread-spectrum and some image-processing applications.