DSPRelated.com

Fixed-Point Arithmetic

Category: Numerics | Also known as: fixed point, Q-format, Q15, Q31

Fixed-point arithmetic is a method of representing and computing with fractional numbers using standard integer hardware, by implicitly treating a fixed number of the integer's bits as fractional (sub-unity) bits. Unlike floating-point, the position of the binary point never moves, so the representable range and resolution are both constant across all values in the stored scaled domain (though practical computations can still lose precision through rounding, truncation, and overflow).

In practice

Fixed-point is a common approach to fractional math on MCUs and DSPs that lack a hardware floating-point unit (FPU), which still describes a large share of production embedded targets: 8-bit AVR/PIC, most MSP430 variants, Cortex-M0/M0+, Cortex-M3 (which has no FPU option), and low-cost Cortex-M4 parts built without the optional FPU. (Some firmware uses lookup tables or software floating-point instead.) Even on parts with an FPU, fixed-point is often chosen in tight ISR loops where the integer pipeline offers higher throughput or lower power than the FPU pipeline, and it is the native arithmetic of fixed-point DSP cores such as the C55x and Blackfin families.

One widely used notation is Q-format, written Qm.n or just Qn, though conventions differ across communities and tools in whether the sign bit is counted among the m integer bits. A 16-bit signed word with 15 fractional bits is called Q15; it is written Q1.15 under the convention that counts the sign bit in m, and Q0.15 under the convention that keeps the sign bit separate. Either way the range is [-1, +1) with a resolution of 2^-15 (about 30.5 µV on a 1 V full scale, for example). Q31 similarly uses 31 fractional bits in a 32-bit signed word, giving the same [-1, +1) range at a resolution of 2^-31. CMSIS-DSP, ARM's open-source signal-processing library for Cortex-M, uses Q7, Q15, and Q31 throughout and provides intrinsic-optimized kernels for filters, FFTs, and matrix operations in those formats. When two Qn numbers are multiplied, the product has 2n fractional bits; on a 16x16->32-bit multiplier (available on Cortex-M4 and many DSP cores) two Q15 operands yield a Q30 product, which you shift right by 15, with rounding and saturation as needed, to return to Q15. With mixed operand formats, the shift is whatever brings the sum of the operands' fractional bits back to the target format.
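As a concrete illustration of the Q15 scaling described above, the following is a minimal sketch of converting between real values and Q15 on a host machine. The `q15_t` alias and both function names are hypothetical local definitions for this example (the alias mirrors CMSIS-DSP naming but nothing here calls the library), and the rounding mode (half away from zero) is an assumption rather than a requirement of the format.

```c
#include <stdint.h>

/* Hypothetical local alias: 16-bit signed word, 15 fractional bits. */
typedef int16_t q15_t;

/* Convert a real value to Q15: scale by 2^15, round to nearest
 * (half away from zero), and saturate to the representable range.
 * NaN inputs are not handled in this sketch. */
static q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;                 /* 2^15 */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;      /* rounding offset */
    if (scaled >= 32768.0)  return INT16_MAX;    /* +1.0 is not representable */
    if (scaled <= -32769.0) return INT16_MIN;
    return (q15_t)scaled;                        /* cast truncates toward zero */
}

/* Convert a Q15 value back to a real number in [-1, +1). */
static double q15_to_double(q15_t q)
{
    return (double)q / 32768.0;
}
```

With these definitions, 0.5 maps to the integer 16384, and an input of exactly 1.0 saturates to 32767 because +1.0 lies just outside the Q15 range.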

Overflow is the most common pitfall. Because the binary point is fixed, any intermediate result that exceeds the representable range can wrap or saturate depending on the hardware and how the code is written; unlike floating-point, there is no exponent to absorb dynamic range. Accumulators are typically widened (e.g., 32- or 40-bit) to hold partial sums before final rounding. Saturation arithmetic, available as a hardware instruction on ARMv7-M cores such as Cortex-M3/M4 (e.g., SSAT, USAT, and QADD where the DSP extension is present) and many DSP cores, is strongly preferred over wrapping when processing audio, control, or communication signals: a wrapped value jumps to the opposite extreme of the range, whereas a saturated value merely clips. The EmbeddedRelated post "A Fixed-Point Introduction by Example" walks through these scaling decisions concretely.
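The widened-accumulator and saturation ideas can be sketched in portable C as a Q15 dot product. The function names `sat16` and `q15_dot_sat` are hypothetical, and the sketch uses a 64-bit accumulator purely for host-side convenience where a DSP would typically use a 32- or 40-bit register. It also assumes that right-shifting a negative signed integer is an arithmetic shift, which is implementation-defined in C but holds on common compilers.

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp a wide value to the int16_t range (software equivalent of a
 * 16-bit saturating instruction). */
static int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Dot product of two Q15 vectors, returned as a saturated Q15 value. */
static int16_t q15_dot_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;                           /* Q30 partial sums, widened */
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];  /* Q15 x Q15 -> Q30 */
    }
    acc += (int64_t)1 << 14;                   /* half a Q15 LSB, for rounding */
    return sat16(acc >> 15);                   /* Q30 -> Q15, then clamp */
}
```

Rounding and saturation happen once, after the loop, so the partial sums keep their full Q30 precision and only the final result is clipped if it leaves the Q15 range.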

Frequently asked

How do I choose the right Q format for my application?
Identify the maximum magnitude your signal can reach (the range requirement) and the smallest difference you need to distinguish (the resolution requirement). A Qm.n signed format in a W-bit word gives a range of [-2^m, 2^m) and a resolution of 2^-n, where m + n = W - 1 (the remaining bit is the sign). For example, a signal bounded by ±5 in a 16-bit word needs m = 3 (range [-8, 8)), which leaves n = 12 and a resolution of 2^-12. Start with the widest word your hardware multiplier can handle efficiently (typically 16 or 32 bits), reserve enough integer bits to prevent overflow in intermediate calculations, and assign the rest to fractional bits. The EmbeddedRelated posts "A Fixed-Point Introduction by Example" and "Simple Concepts Explained: Fixed-Point" both cover this scaling process step by step.
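The range rule above can be turned into a small helper, sketched here under the convention that the sign bit is not counted in m. The `qformat_t` type and `choose_qformat` function are hypothetical names for illustration, and the sketch only covers the range of the stored signal; headroom for intermediate calculations still has to be added by hand.

```c
/* Hypothetical description of a signed Qm.n format (sign bit separate). */
typedef struct {
    int m;  /* integer bits    */
    int n;  /* fractional bits */
} qformat_t;

/* Pick the smallest m such that max_abs < 2^m, then give every
 * remaining bit of the word to the fraction: n = word_bits - 1 - m.
 * If max_abs is too large for the word, m is capped and n becomes 0,
 * which the caller should treat as "does not fit". */
static qformat_t choose_qformat(double max_abs, int word_bits)
{
    qformat_t f = { 0, 0 };
    double limit = 1.0;                      /* 2^m, starting at m = 0 */
    while (max_abs >= limit && f.m < word_bits - 1) {
        limit *= 2.0;
        f.m++;
    }
    f.n = word_bits - 1 - f.m;
    return f;
}
```

The comparison uses `>=` because the upper end of the range is open: a signal that can reach exactly 1.0 needs one integer bit, while a signal strictly below 1.0 fits in Q0.15.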
What happens when I multiply two fixed-point numbers?
Multiplying a Qn value by a Qn value produces a Q(2n) result with twice as many fractional bits. On a 16x16->32-bit multiplier, two Q15 values produce a Q30 product in 32 bits (or Q31 if the product is shifted left by one bit, as the fractional-multiply modes of some DSPs do). You must shift right by n and round or truncate to return to Qn before storing or feeding the result into the next stage. Forgetting this step is a common fixed-point programming error: the result is left scaled by 2^n, so it is grossly wrong or overflows as soon as it is stored in the narrower word. Note also that -1 x -1 yields +1, which Q15 cannot represent, so that one product must be saturated.
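A minimal sketch of a Q15 multiply with rounding and saturation is shown below. The name `q15_mul` is hypothetical and the code is plain C rather than any library's kernel; it assumes that right-shifting a negative signed integer is an arithmetic shift, which is implementation-defined in C but holds on common compilers.

```c
#include <stdint.h>

/* Multiply two Q15 values and return a Q15 result. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product, fits in 32 bits */
    p += (int32_t)1 << 14;                /* add half a Q15 LSB to round */
    p >>= 15;                             /* Q30 -> Q15 */
    if (p > INT16_MAX) {
        p = INT16_MAX;                    /* only -1 x -1 can land here */
    }
    return (int16_t)p;
}
```

For instance, 0.5 x 0.5 is `q15_mul(16384, 16384)`, which returns 8192 (0.25 in Q15), while `q15_mul(-32768, -32768)` saturates to 32767 because +1.0 is out of range.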
Can I mix Q formats in one calculation?
Yes, but you must track the binary-point position of every operand. Adding a Q15 value to a Q13 value requires aligning the binary points first (shift the Q13 left by 2 or the Q15 right by 2), then adding; shifting left risks overflow in the original word size, while shifting right discards the two lowest bits. Many teams use a spreadsheet or a type-annotation convention (e.g., a C typedef or a comment on every variable declaration) to track format through a signal chain. A misaligned addition weights one operand by the wrong power of two, an error that is easy to miss in testing.
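The Q15-plus-Q13 case can be sketched as follows. The function name `add_q15_q13` is hypothetical, and returning the sum in Q13 is a design choice made for this example (the sum of a Q15 and a Q13 value can exceed the Q15 range, so the wider-range format is the safer target). The alignment is done in a 32-bit intermediate so that no bits are lost before the final rounding; the right shift of a possibly negative value assumes an arithmetic shift.

```c
#include <stdint.h>

/* Add a Q15 value and a Q13 value (both in 16-bit words) and return
 * the sum in Q13, rounded and saturated. */
static int16_t add_q15_q13(int16_t a_q15, int16_t b_q13)
{
    /* Align binary points in 32 bits: Q13 -> Q15 is a scale by 2^2.
     * Multiplying by 4 avoids left-shifting a negative value. */
    int32_t sum_q15 = (int32_t)a_q15 + (int32_t)b_q13 * 4;

    /* Q15 -> Q13 with rounding (add half a Q13 LSB, then shift). */
    int32_t sum_q13 = (sum_q15 + 2) >> 2;

    /* The true sum can lie outside [-4, 4), so clamp to 16 bits. */
    if (sum_q13 > INT16_MAX) sum_q13 = INT16_MAX;
    if (sum_q13 < INT16_MIN) sum_q13 = INT16_MIN;
    return (int16_t)sum_q13;
}
```

As a check, 0.5 in Q15 (16384) plus 1.5 in Q13 (12288) returns 16384, which is 2.0 in Q13.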
When is floating-point a better choice than fixed-point?
When the signal has a wide dynamic range that is hard to predict at design time, when algorithm development time outweighs the runtime cost, or when the target already has a well-pipelined FPU (e.g., Cortex-M4F, Cortex-M7, Cortex-A-class cores with NEON). On those cores, single-precision float is often competitive with or faster than carefully written 32-bit fixed-point code. Fixed-point remains advantageous in tight memory budgets (16-bit Q15 data is half the storage of single-precision float), in cores without any FPU, and in algorithm kernels where SIMD integer instructions (e.g., the dual 16-bit multiply-accumulate SMLAD on Cortex-M4, which processes two Q15 samples per instruction) give throughput that a scalar FPU often cannot match.
How can I prototype or verify a fixed-point algorithm without dedicated tools?
GNU Octave (free) supports integer arithmetic and bitwise operations well enough to simulate fixed-point signal chains. The EmbeddedRelated post 'Fixed-Point Simulation in GNU Octave—Without MATLAB' demonstrates this approach. For C code, the CMSIS-DSP library's Q7/Q15/Q31 functions can be compiled and tested on the host with a standard C toolchain before deployment, making it easy to compare fixed-point output against a floating-point reference.
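A host-side comparison against a floating-point reference might look like the sketch below. It does not call CMSIS-DSP: `mul_q15_under_test` is a hypothetical stand-in for whatever fixed-point kernel is being verified, and the sweep step of 127 is an arbitrary choice to keep the run short. The right shift of a negative product assumes an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical kernel under test: rounded, saturated Q15 multiply. */
static int16_t mul_q15_under_test(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b + ((int32_t)1 << 14)) >> 15;
    return (int16_t)(p > INT16_MAX ? INT16_MAX : p);
}

/* Sweep a grid of Q15 operand pairs and return the largest absolute
 * difference between the fixed-point result and a double reference. */
static double max_abs_error_q15_mul(void)
{
    double worst = 0.0;
    for (int a = -32768; a <= 32767; a += 127) {
        for (int b = -32768; b <= 32767; b += 127) {
            double ref = (a / 32768.0) * (b / 32768.0);
            double got = mul_q15_under_test((int16_t)a, (int16_t)b) / 32768.0;
            double err = (got > ref) ? (got - ref) : (ref - got);
            if (err > worst) {
                worst = err;
            }
        }
    }
    return worst;
}
```

For a correctly rounded and saturated multiply, the reported worst-case error should not exceed one Q15 LSB (2^-15); a missing shift or a wrapped overflow shows up as a far larger value.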

Differentiators vs similar concepts

Fixed-point is often contrasted with floating-point, but the more subtle comparison is with integer arithmetic: plain integer math implies no fractional part, whereas fixed-point uses the same integer hardware but adds a software convention that a fixed number of bits are below the binary point. Fixed-point is also distinct from block floating-point, where a single shared exponent scales an entire block of integer values, giving some of floating-point's dynamic range at lower cost than per-sample floats. Q-format is simply a widely used notation for specifying fixed-point formats; Q15 and Q31 are specific instances (15 and 31 fractional bits in a 16- and 32-bit signed word, respectively) rather than separate arithmetic systems.