Fixed Point vs Floating Point

Started by nelsona 6 years ago · 8 replies · latest reply 6 years ago · 1091 views


I had a few easy questions regarding fixed point vs floating point digital signals I was hoping someone could answer:

1. Is it safe to assume that all digital signals using a bit rate below 32 bits use a fixed point representation?

2. If not all, can we assume that most digital signals using a bit rate below 32 bits use a fixed point representation?

3. Is it safe to assume that all fixed point digital signals use exactly 2 decimal places of precision (ex. 16 bit = 96.33 dB SNR)?

4. If not all, can we assume that most fixed point digital signals use only 2 decimal places of precision?

5. Is there a limit to the number of decimal places / precision one could use for a floating point digital signal, or is it arbitrary?

For example, would it be possible to represent the following 32 bit number:


...as either:

0.0110101010000010100001100110110 = 0.41605415474623441696


001101.01010000010100001100110110 = 13.31373295187950134277


001101010100.00010100001100110110 = 852.0789089202880859375

6. If so, how would we calculate the SNR (signal-to-noise) ratio (in dB) for any of the 3 numbers above (see question 5)?

Thank you,

Reply by oliviert, December 1, 2018

When the signal reaches the input pin of the ADC, it is just an electrical signal whose voltage lies in [-Vmax; +Vmax].

This voltage is converted to an integer of 10, 12, 16, 24 or 32 bits, depending on the ADC. You can interpret this binary number as an integer or as a fixed-point number; this does not change the SNR of your signal at all.

If your target is an FPGA programmed in C++ (typically with Vivado HLS for Xilinx FPGAs), this 10/12/.../32-bit bus can be reinterpreted as you wish in any fixed-point number representation. After that you can do any computation (add/sub, mul/div, convolution, ...), and Vivado HLS will follow the representation you chose for your numbers.
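As a rough illustration of that reinterpretation, here is a small Python sketch (not HLS code) that reads one 32-bit pattern with three different binary-point positions. The bit string is the digit sequence shared by the three examples in the question; the helper name `as_fixed` is hypothetical, and treating the value as unsigned is an assumption.

```python
def as_fixed(bits: str, frac_bits: int) -> float:
    """Interpret an unsigned bit string as fixed point with frac_bits fractional bits."""
    return int(bits, 2) / (1 << frac_bits)

# Digit sequence common to the three examples in the question (32 bits).
BITS = "00110101010000010100001100110110"

# Same stored bits, three different positions of the binary point.
for frac_bits in (31, 26, 20):
    print(frac_bits, as_fixed(BITS, frac_bits))
```

The stored bits never change; only the scale factor applied when reading them does, which is why the choice of representation does not affect the SNR.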

When I was using DSP processors, I used numbers in [-1; 1[ (that is, -1 included and 1 excluded), so that the result of a multiplication was still between -1 and 1. In general, after a multiplication you keep the MSBs, and after an addition you keep the LSBs (after a right shift by 1 bit if necessary).
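A minimal sketch of that fractional multiply, assuming a 16-bit Q15 format (1 sign bit, 15 fractional bits). The post does not name a format, so Q15 and the helper names below are assumptions chosen for illustration.

```python
Q = 15  # fractional bits; Q15 is an assumption, the post names no format

def to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to a signed 16-bit Q15 integer, saturating at the ends."""
    return max(-32768, min(32767, round(x * (1 << Q))))

def from_q15(n: int) -> float:
    """Convert a Q15 integer back to a real value."""
    return n / (1 << Q)

def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: the full product has 30 fractional bits,
    so shifting right by 15 keeps the most significant part."""
    return (a * b) >> Q

a, b = to_q15(0.5), to_q15(-0.25)
print(from_q15(q15_mul(a, b)))  # -0.125
```

One corner case in this sketch: -1 times -1 gives +1, which is just outside [-1; 1[, so a real implementation would need to saturate or otherwise handle it.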

Floating-point is different. There exist a few standardized formats: half/single/double/quad, each with its own mantissa/exponent size.

When you work on a Xilinx device, you can also choose any mantissa/exponent size that fits in a maximum of 64 bits. In that case you have to make a trade-off between more precision and more dynamic range.
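To make that trade-off concrete, here is a small sketch that assumes an IEEE-754-style layout (1 sign bit, a biased exponent, and an implicit leading 1 in the mantissa). Whether a given vendor core follows exactly this layout is not stated above, and `float_format` and the "custom" split are hypothetical.

```python
def float_format(exp_bits: int, man_bits: int):
    """Return (relative step, largest binary exponent) for an IEEE-754-style format."""
    rel_step = 2.0 ** -man_bits          # spacing between neighbours, relative to 1.0
    max_exp = 2 ** (exp_bits - 1) - 1    # largest exponent of a finite value
    return rel_step, max_exp

# Three standard formats, then an assumed custom 32-bit split (1 + 11 + 20 bits).
formats = [("half", 5, 10), ("single", 8, 23), ("double", 11, 52), ("custom", 11, 20)]
for name, exp_bits, man_bits in formats:
    print(name, float_format(exp_bits, man_bits))
```

Moving bits from the mantissa to the exponent, as in the "custom" row, widens the range of representable magnitudes while making the relative step coarser.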


Reply by nelsona, December 1, 2018

Hello Olivier,

A big thanks for your detailed explanation.


Reply by bholzmayer, December 1, 2018

Imagine your digital signal stored in a series of flip-flops (a register).
Picturing this register answers many of your questions.

  • First: bit rate has nothing to do with fixed point or with number representation; it has to do with timing.
    "Rate" would most likely mean how often the signal in the register is refreshed, by asserting the CE (clock enable) of all FFs (flip-flops). Each refresh moves on to the next sample, which is stored in the same register.
  • The register cannot have an infinite number of FFs, so it can only hold a finite number of bits. Let's assume we have 3 bits.
  • One of the most obvious interpretations is an integer in the range 0..7, where the FFs hold the bits of the binary representation (6 = '110'). This is probably the most common representation.
  • The next issue is SNR.
    Assume you have a 3-bit register but feed it from a signal source that is only 2 bits wide, leaving 1 bit unchanged. This is still a 3-bit signal, and your SNR depends on which bit is the stable one.
    Instead of being stable, this free bit might carry arbitrary noise. So the SNR cannot be derived from the number of bits, nor from the signal values themselves.

Resulting from this, my answers to your questions would be:

  1. no
  2. no
  3. no
  4. no
  5. no, it's arbitrary, or it may depend on what your hardware provides.
  6. impossible, since we don't know how many of the bits are noise and how many are reliable signal values.

In your 3 examples: suppose you think of SNR as the noise that comes from the resolution of the numbers. Then the numbers on the left obviously have a resolution of 1 bit (LSB) in the binary system. But what about the decimal numbers on the right? Is the SNR meant to be the resolution effect of the decimal system? In that case the left and right numbers would have different SNRs.

Reply by nelsona, December 1, 2018

Hello Bholzmayer,

A big thank you for the detailed explanation.


Reply by Y(J)S, December 1, 2018

DSP signals come in all sizes, and the number of bits per sample needs to be chosen based on what you want to do with the signal.

Linearly sampled speech signals sound noisy if they are sampled with fewer than around 14 bits (which is why linear 16 bits is often used), but with logarithmic sampling 8 bits is sufficient. Orchestral music needs more bits with linear sampling, since its dynamic range can be larger.
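One widely used form of logarithmic sampling is mu-law companding. The post does not name a scheme, so picking mu-law with mu = 255 is an assumption, and the function names are illustrative. The sketch compares the error on a quiet sample when 8 bits are spent linearly versus after companding.

```python
import math

MU = 255.0  # assumed companding constant

def mu_compress(x: float) -> float:
    """Map x in [-1, 1] onto [-1, 1] using a logarithmic curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(y: float, bits: int) -> float:
    """Round y to a uniform grid with 2**(bits - 1) steps per unit (no clamping)."""
    levels = 2 ** (bits - 1)
    return round(y * levels) / levels

x = 0.01  # a quiet sample, far below full scale
linear = quantize(x, 8)
companded = mu_expand(quantize(mu_compress(x), 8))
print("linear error   :", abs(linear - x))
print("companded error:", abs(companded - x))
```

For quiet samples like this one, the companded path leaves a much smaller error than plain 8-bit linear quantization, at the cost of coarser steps near full scale.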

The number of bits can be traded off with sampling rate. Ever heard of "one bit" (AKA sigma delta) sampling? By sampling at a high rate you can get away with a single bit, halving that rate you need 2 bits, etc.
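For a feel of how a single bit can carry a signal, here is a minimal first-order sigma-delta modulator, sketched under simple assumptions (a constant input, no decimation filter, hypothetical function name). The one-bit output stream takes only the values +1 and -1, yet its average tracks the input.

```python
def sigma_delta(samples):
    """First-order sigma-delta modulator: returns a list of +1.0 / -1.0 values."""
    integrator = 0.0
    feedback = 0.0
    out = []
    for x in samples:
        integrator += x - feedback            # accumulate input minus previous output
        feedback = 1.0 if integrator >= 0 else -1.0
        out.append(feedback)
    return out

stream = sigma_delta([0.3] * 10000)           # heavily oversampled constant input
print(sum(stream) / len(stream))              # close to 0.3
```

In a real converter the averaging is done by a digital low-pass (decimation) filter, and the achievable resolution depends on the oversampling ratio and the modulator order.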

There are DSP processors that give you 56 bits or more of precision. And even if the DSP's registers are 32 bits, if you intend to do processing (e.g., performing convolutions) you will need more bits in the accumulators (adding two 32-bit numbers requires a 33-bit register, adding four such numbers requires 34 bits).
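The accumulator sizing rule in that parenthesis generalizes to one extra bit per doubling of the number of terms. A small sketch, with a hypothetical helper name:

```python
def accumulator_bits(word_bits: int, n_terms: int) -> int:
    """Bits needed to sum n_terms signed values of word_bits each without overflow."""
    # (n_terms - 1).bit_length() equals ceil(log2(n_terms)) for n_terms >= 1
    return word_bits + (n_terms - 1).bit_length()

print(accumulator_bits(32, 2))     # 33
print(accumulator_bits(32, 4))     # 34
print(accumulator_bits(16, 1024))  # 26
```

This covers additions only; a multiply-accumulate loop also has to hold the full-width products, which is one reason DSP accumulators are wider than the data registers.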

What do you mean by the number of bits after the decimal point for a fixed-point number? Fixed-point numbers are integers with an implied conversion factor, and that factor is chosen according to the precision needed and the chance of overflow. To convert bits to dB of SNR, multiply by 6 (approximately).
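The "multiply by 6" rule comes from 20*log10(2), which is about 6.02 dB per bit. A short sketch (the function name is illustrative):

```python
import math

def bits_to_db(bits: int) -> float:
    """Ratio between full scale and one LSB, expressed in dB."""
    return 20 * math.log10(2 ** bits)

for bits in (8, 12, 16, 24):
    print(bits, round(bits_to_db(bits), 2))
```

For 16 bits this gives 96.33 dB, the figure quoted in the question. That suggests the two decimal places there come from rounding a dB value, not from anything in the number format itself.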

So the answers to all your questions are no.


Reply by nelsona, December 1, 2018

Hello Y(J)S,

Thank you for your response.


Reply by kaz, December 1, 2018

Most ADCs that I have heard of are fixed point, but floating-point ones may exist. At the ADC level the number of bits does have an effect on quantisation noise, but any external noise added to the analogue signal is sampled as part of the signal at the ADC.

Quantisation SNR depends on signal bandwidth as well as on dynamic range and bit width, and for a single tone it is about +6 dB per bit. Thus for n bits and a full-swing tone it is about n*6 dB.

The use of 8/10/12/16-bit ADCs is common, as such devices are manufactured for common applications and standards, though what is actually specified has more to do with SNR, sensitivity, power, etc.

Within the FPGA domain, floating point (32 or 64 bits) is sometimes preferred for critical applications. A 32-bit floating-point format is superior to 32-bit fixed point in that its resolution is not biased towards high values. A common example is a floating-point FFT, yet the trend is to work in fixed point in the rest of the system, converting to floating point solely for the FFT and then back to fixed point.
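To illustrate the point about resolution, the sketch below compares the step size of IEEE-754 single precision with that of a 32-bit signed fraction. Using Q31 (values in [-1, 1) with 31 fractional bits) as the fixed-point format is an assumption, and `float32_step` is a hypothetical helper valid only for normal, nonzero values.

```python
import math

def float32_step(x: float) -> float:
    """Spacing between adjacent single-precision values near x (normal range only)."""
    _, e = math.frexp(abs(x))    # abs(x) = m * 2**e with 0.5 <= m < 1
    return 2.0 ** (e - 24)       # 24 significant bits, including the implicit one

Q31_STEP = 2.0 ** -31            # constant step of the assumed Q31 format

for x in (0.9, 0.001, 0.000001):
    print(x, "float32 relative step:", float32_step(x) / x,
          "Q31 relative step:", Q31_STEP / x)
```

Near full scale the fixed-point step is actually finer, but the float's relative step stays roughly constant as the value shrinks, while the fixed-point relative step grows. That is the sense in which fixed-point resolution is biased towards high values.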

Reply by nelsona, December 1, 2018

Hello Kaz,

Thanks once again for your response.