## Fixed Point vs Floating Point

Started 4 years ago, 8 replies, latest reply 4 years ago, 984 views

Hello,

I had a few easy questions regarding fixed point vs floating point digital signals I was hoping someone could answer:

1. Is it safe to assume that all digital signals using a bit rate below 32 bits use a fixed point representation?

2. If not all, can we assume that most digital signals using a bit rate below 32 bits use a fixed point representation?

3. Is it safe to assume that all fixed point digital signals use exactly 2 decimal places of precision (e.g. 16 bit = 96.33 dB SNR)?

4. If not all, can we assume that most fixed point digital signals use only 2 decimal places of precision?

5. Is there a limit to the number of decimal places / precision one could use for a floating point digital signal, or is it arbitrary?

For example, would it be possible to represent the following 32 bit number:

00110101010000010100001100110110

...as either:

0.0110101010000010100001100110110 = 0.41605415474623441696

...or

001101.01010000010100001100110110 = 13.31373295187950134277

...or

001101010100.00010100001100110110 = 852.0789089202880859375

6. If so, how would we calculate the signal-to-noise ratio (SNR, in dB) for any of the 3 numbers above (see question 5)?

Thank you,
Nelson

[ - ]

When the signal arrives at the input pin of the ADC, it is just an electrical signal whose voltage lies in [-Vmax, +Vmax].

The ADC converts this voltage into an integer of 10, 12, 16, 24 or 32 bits, depending on the device. You can interpret this binary number as an integer or as a fixed-point number; the choice does not change the SNR of your signal at all.
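To make the reinterpretation concrete, here is a minimal Python sketch using the 32-bit pattern from question 5. The function name `fixed_point_value` is made up for this illustration, and the pattern is treated as unsigned. The stored integer is identical in every case; only the implied position of the binary point (the scale factor) changes.

```python
def fixed_point_value(bits: str, frac_bits: int) -> float:
    """Interpret an unsigned bit string as fixed point with `frac_bits` fractional bits."""
    raw = int(bits, 2)              # the stored integer never changes
    return raw / (1 << frac_bits)   # only the implied scale factor 2**-frac_bits changes


pattern = "00110101010000010100001100110110"

# Same 32 bits, four different positions of the binary point.
for frac_bits in (31, 26, 20, 0):
    value = fixed_point_value(pattern, frac_bits)
    print(f"{frac_bits:2d} fractional bits -> {value!r}")
```

Because every interpretation is the same integer multiplied by a constant power of two, the signal and its quantization error are scaled by the same factor, so their ratio is unaffected.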

If your target is an FPGA programmed in C++ (typically Vivado HLS for Xilinx FPGAs), this 10/12/.../32 bit bus can be reinterpreted in any fixed-point representation you wish. After that you can do any computation (add/sub, mul/div, convolution, ...) and Vivado HLS will follow the representation you chose.

When I was using DSP processors, I worked with numbers in [-1, 1), so that the result of a multiplication stayed between -1 and 1. Broadly, after a multiplication you keep the MSBs of the double-width product, and after an addition you keep the LSBs (after a right shift by 1 if necessary to avoid overflow).
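The "keep the MSBs after a multiplication" rule can be sketched in Python as follows. This assumes a 16-bit Q15 format (1 sign bit, 15 fractional bits), which the post does not specify; the helper names are hypothetical, and the product is truncated rather than rounded.

```python
Q = 15  # fractional bits; a 16-bit Q15 word is an assumption for this sketch


def to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to a Q15 integer, saturating at the limits."""
    return max(-32768, min(32767, round(x * (1 << Q))))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 integers: the product is Q30, so keep its upper bits."""
    product = a * b                          # Q30 result, needs up to 32 bits
    result = product >> Q                    # drop the 15 low bits to get back to Q15
    return max(-32768, min(32767, result))   # only (-1) * (-1) can overflow


def q15_add_halved(a: int, b: int) -> int:
    """Add two Q15 integers, then shift right by 1 so the sum cannot overflow."""
    return (a + b) >> 1


half = to_q15(0.5)
print(q15_mul(half, half) / (1 << Q))         # product of 0.5 and 0.5, as a float
print(q15_add_halved(half, half) / (1 << Q))  # (0.5 + 0.5) / 2, as a float
```

Note that the halved addition returns the sum scaled by 1/2, so the caller has to track that extra factor.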

Floating point is different. There are a few standardized formats (IEEE 754 half, single, double and quad precision), each with its own mantissa and exponent sizes.

When you work on a Xilinx device, you can also choose any mantissa/exponent split that fits in a maximum of 64 bits; in that case you have to trade off precision (mantissa bits) against dynamic range (exponent bits).
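The precision versus dynamic range trade-off can be put into numbers with a short Python sketch. It assumes an IEEE-754-style layout (biased exponent, implicit leading 1, normal numbers only); the function name is hypothetical, and custom vendor formats may differ in detail.

```python
import math


def float_format_summary(exp_bits: int, mant_bits: int) -> tuple:
    """Return (relative step, dynamic range in dB) for an IEEE-754-style layout."""
    bias = (1 << (exp_bits - 1)) - 1
    e_max, e_min = bias, 1 - bias        # exponent range of normal numbers
    rel_step = 2.0 ** -mant_bits         # spacing relative to the value's power of two
    # Ratio of largest to smallest normal number, computed in the log domain
    # so that wide formats do not overflow a Python float.
    log2_ratio = (e_max - e_min) + math.log2(2.0 - rel_step)
    return rel_step, 20.0 * math.log10(2.0) * log2_ratio


formats = {"half": (5, 10), "single": (8, 23), "double": (11, 52), "quad": (15, 112)}
for name, (e, m) in formats.items():
    step, dr_db = float_format_summary(e, m)
    print(f"{name:6s} relative step 2^-{m:<3d} dynamic range {dr_db:9.1f} dB")
```

Moving bits from the mantissa to the exponent in a word of fixed total width increases the dynamic range and coarsens the relative step, and vice versa.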

Olivier

[ - ]

Hello Olivier,

A big thanks for your detailed explanation.

Nelson

[ - ]

Imagine your digital signal stored in a series of flip-flops (a register).
Picturing this register will tell us a lot that may answer your questions.

• First: bit rate has nothing to do with fixed point or with number representation. Instead it has to do with timing.
"Rate" would probably mean how often the signal in the register is refreshed/reloaded by triggering the CE (clock enable) of all the FFs (flip-flops). Each reload moves on to the next sample, which is stored in the same register.
• The register cannot have an infinite number of FFs, so it can only hold a finite number of bits. Let's assume we have 3 bits.
• One of the most obvious interpretations would be an integer in the range 0..7, where the FFs hold the bits of the binary representation (6 = '110'). This is probably the most common representation.
• The next issue is SNR.
Assume you have a 3-bit register but feed it from a signal source that is only 2 bits wide, leaving 1 bit unchanged. This is still a 3-bit signal, and your SNR depends on which bit is the stable one.
Instead of being stable, this free bit might carry arbitrary noise. So the SNR cannot be derived from the number of bits, nor from the signal values themselves.

1. no
2. no
3. no
4. no
5. no, it's arbitrary, or maybe depends on what your hardware provides.
6. impossible, since we don't know how many of the bits are noise and how many are reliable signal values.

In your 3 examples: if you think of SNR as the noise that comes from the resolution of the numbers, then the numbers on the left obviously have a resolution of 1 bit (the LSB) in the binary system. But what about the decimal numbers on the right? Is the SNR meant to be the resolution effect of the decimal system? Then the left and right numbers would have different SNRs.

[ - ]

Hello Bholzmayer,

A big thank you for the detailed explanation.

Nelson

[ - ]
DSP signals come in all sizes, and the number of bits needs to be chosen based on what you want to do with the signal.

Linearly quantized speech signals sound noisy if they are sampled with fewer than around 14 bits (which is why linear 16 bits is often used), but with logarithmic quantization (companding, as in μ-law or A-law) 8 bits is sufficient. Orchestral music needs more bits with linear quantization, since it can have a larger dynamic range.
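Why 8 logarithmic bits can stand in for many more linear bits is easiest to see on a quiet sample. The sketch below uses the continuous μ-law formula with μ = 255; this is an assumption made for illustration and is not the segmented encoding that telephony codecs actually use. The quantizer is a simple uniform one that ignores clipping of the top code, and all function names are hypothetical.

```python
import math

MU = 255.0  # mu-law parameter (assumed; the post does not name a specific law)


def mu_compress(x: float) -> float:
    """Map x in [-1, 1] to [-1, 1] with logarithmic spacing near zero."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(x: float, bits: int) -> float:
    """Uniform quantizer with 2**(bits - 1) steps per side (no clipping)."""
    levels = 1 << (bits - 1)
    return round(x * levels) / levels


quiet = 0.003  # a low-level sample, about 50 dB below full scale
linear_8 = quantize(quiet, 8)
log_8 = mu_expand(quantize(mu_compress(quiet), 8))
print(f"8-bit linear error:    {abs(linear_8 - quiet):.2e}")
print(f"8-bit companded error: {abs(log_8 - quiet):.2e}")
```

The linear quantizer rounds the quiet sample to zero, while the companded one keeps it with a small relative error. The price is a coarser step near full scale.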

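The number of bits can be traded off against sampling rate. Ever heard of "one bit" (AKA sigma-delta) sampling? By sampling at a high enough rate you can get away with a single bit; at lower rates you need more bits. As a rough guide, plain oversampling buys about one bit of resolution per 4x increase in sampling rate, and noise shaping improves on that.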

There are DSP processors that give you 56 bits or more of precision. And even if the DSP's registers are 32 bits, if you intend to do processing (e.g., convolutions) you will need more bits in the accumulators (adding two 32-bit numbers requires a 33-bit register, adding four such numbers requires 34 bits).
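The bit-growth rule in the parentheses generalizes to ceil(log2(N)) extra guard bits for a sum of N terms. A small Python sketch, with a hypothetical function name and assuming signed two's-complement words:

```python
def accumulator_bits(word_bits: int, num_terms: int) -> int:
    """Bits needed to add `num_terms` signed words of `word_bits` bits without overflow."""
    if num_terms < 1:
        raise ValueError("num_terms must be at least 1")
    guard_bits = (num_terms - 1).bit_length()  # equals ceil(log2(num_terms))
    return word_bits + guard_bits


for n in (2, 4, 256):
    print(f"sum of {n:3d} 32-bit words needs {accumulator_bits(32, n)} bits")
```

For a multiply-accumulate loop the same rule applies to the double-width products, so the accumulator needs the product width plus the guard bits.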

What do you mean by the number of bits after the decimal point for a fixed-point number? Fixed-point numbers are stored as integers, and the conversion (scale) factor is chosen according to the precision needed and the risk of overflow. To convert bits to dB of SNR, multiply by 6 (approximately).

Y(J)S

[ - ]

Hello Y(J)S,

Nelson

[ - ]

Most ADCs that I have heard of are fixed point, but floating-point ones may exist. At the ADC level the number of bits does affect quantisation noise, but any external noise added to the analogue signal is sampled as part of the signal at the ADC.

Quantisation SNR depends on signal bandwidth as well as dynamic range and bit width; for a single tone it is about +6 dB per bit. Thus for n bits and a full-swing tone it is roughly n*6 dB (more precisely, 6.02n + 1.76 dB for a full-scale sine).
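The roughly 6 dB per bit figure can be checked numerically by quantizing a full-scale sine and measuring the error power. This is a minimal sketch with a hypothetical function name; the tone frequency and sample count are arbitrary choices, and only quantization noise over the full bandwidth is counted (no oversampling, no external noise).

```python
import math


def sine_quantization_snr_db(bits: int, num_samples: int = 1 << 14) -> float:
    """Measure the SNR of a full-scale sine after rounding to `bits`-bit integers."""
    amplitude = (1 << (bits - 1)) - 1   # largest positive code, in LSBs
    signal_power = 0.0
    noise_power = 0.0
    for n in range(num_samples):
        # Frequency (in cycles per sample) picked arbitrarily so that the
        # samples land on many different phases of the sine.
        x = amplitude * math.sin(2.0 * math.pi * 0.1234567 * n)
        error = round(x) - x            # quantization error in LSBs
        signal_power += x * x
        noise_power += error * error
    return 10.0 * math.log10(signal_power / noise_power)


for bits in (8, 12, 16):
    measured = sine_quantization_snr_db(bits)
    rule = 6.02 * bits + 1.76
    print(f"{bits:2d} bits: measured {measured:6.2f} dB, rule of thumb {rule:6.2f} dB")
```

The measured values should land close to the rule of thumb, with small deviations because the quantization error of a sine is only approximately uniform.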

The use of 8/10/12/16-bit ADCs is common, as such devices are manufactured for common applications and standards, though what is specified has more to do with SNR, sensitivity, power, etc.

Within the FPGA domain, floating point (32 or 64 bit) is sometimes preferred for critical applications. A 32-bit floating-point number is superior to 32-bit fixed point in that its resolution is not biased towards high values: the relative precision stays roughly constant across magnitudes. A common example is the floating-point FFT, yet the trend is to work in fixed point in the system and convert to floating point solely for the FFT, then back to fixed point afterwards.
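The difference in how resolution is distributed can be seen by comparing the step size of IEEE 754 single precision with that of a 32-bit fixed-point format at several magnitudes. The sketch assumes a Q31 format (values in [-1, 1) with a constant step of 2^-31) for the fixed-point side, which is not stated in the post; `float32_spacing` is a hypothetical helper valid for positive finite inputs.

```python
import struct

Q31_STEP = 2.0 ** -31  # constant absolute step of the assumed Q31 format


def float32_spacing(x: float) -> float:
    """Gap between x (rounded to float32) and the next larger float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here


for x in (0.75, 0.001, 0.000001):
    f32 = float32_spacing(x)
    print(f"x = {x:<8g} float32 step {f32:.3e} (relative {f32 / x:.1e}), "
          f"Q31 step {Q31_STEP:.3e} (relative {Q31_STEP / x:.1e})")
```

Near full scale the fixed-point step is actually the finer one; the floating-point advantage shows up for small values, where the fixed-point relative error grows while the floating-point relative error stays about the same.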
