# Simple Concepts Explained: Fixed-Point

# Introduction

Most signal-processing-intensive applications on FPGAs still rely on integer or fixed-point arithmetic. Yet it is not easy to find the key ideas on quantization, fixed-point and integer arithmetic explained in one place. In a series of articles, I aim to clarify some concepts and add examples of how things are done in real life. The ideas covered are the result of my professional experience and hands-on projects.

In this article I will present the most fundamental question you need to ask to start building experience in this domain: what is fixed-point? The article won't go through complicated mathematical equations. The intention is to build intuition from simple concepts we already know. I hope it's useful for some DSP lovers out there!!

# Fixed-Point

What is fixed-point after all? Our everyday life has nothing to do with fixed-point. When we go to the supermarket we ask for 1.2 pounds of bread, or we pay 2.4 dollars for something. We never say "here you have your 134532523 dollars, but remember it's Q3.15". Life is real, and so are the numbers that represent it in our mind.

Real numbers are a mathematical entity and are full of meaning for us. Integer numbers are also a mathematical entity, and we can understand them well. What are fixed-point numbers after all? Well, to put it simply, fixed-point is a path, a way to go back and forth between real numbers and integer ones. This path is necessary for a reason: designing filters, solving equations and pretty much everything we do will happen in the world of real numbers. FPGA or ASIC implementations, even with very powerful devices, will most likely use integer operations.

Let's dive into the details of fixed-point, starting with a simple example. Let's say we have an ADC which gives 16-bit, signed numbers at its output. These numbers go all the way from -32768 to 32767, which is nothing but $-2^{15}$ to $2^{15}-1.$ Now we want to implement a digital circuit in our FPGA, which is going to be a gain. This gain can be either positive or negative. Let's describe in words what we want, and then with some luck we will end up understanding how to manipulate the numbers to make it work.

I want my gain block to be configurable from $-1$ to $+1.$ Everybody understands this. No more words needed. A gain of $-1$ will make the signal flip (think of a scope capture), and a gain of $+1$ will leave the signal unchanged. Additionally, the client says "I want to have at least 50 different settings for my gain". Things become a bit more complicated. I need to split my entire range into at least 50 parts or sections. As engineers we are close relatives of powers of two, so we say "you know what, for the same price you will get 64 steps". Of course, that is because we know 64 is $2^6.$ What we are saying here is that the gain will be a 6-bit number.

So far, so good. We have our 16-bit number coming in, and the 6-bit gain that will be applied. The mapping seems obvious for the gain, but we can still write it down to make it clear:

- $-1$: will be mapped to -32, the most negative number we can represent.
- $+1$: will be mapped to 31, the most positive number we can represent.
- Step: $2/64=0.03125$, which is the range divided by the number of steps.

A few details here. We cannot represent $+1.$ Instead, the largest gain we have is $0.03125\times 31$, which is $0.96875.$ The step speaks for itself: it is the smallest difference we can resolve between two consecutive codes.
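The mapping above can be sketched in a few lines of Python. The names `GAIN_BITS`, `GAIN_STEP` and `gain_code_to_real` are made up for this illustration; they only restate the 6-bit signed mapping described in the text.

```python
# Illustrative names only: a 6-bit signed gain code covering [-1, +1).
GAIN_BITS = 6
GAIN_STEP = 2 / 2**GAIN_BITS             # range / number of steps = 0.03125
GAIN_MIN_CODE = -(2**(GAIN_BITS - 1))    # -32
GAIN_MAX_CODE = 2**(GAIN_BITS - 1) - 1   # 31


def gain_code_to_real(code: int) -> float:
    """Convert a signed 6-bit gain code to the real gain it represents."""
    if not GAIN_MIN_CODE <= code <= GAIN_MAX_CODE:
        raise ValueError(f"code {code} does not fit in {GAIN_BITS} signed bits")
    return code * GAIN_STEP


print(gain_code_to_real(-32))  # -1.0
print(gain_code_to_real(31))   # 0.96875, the closest we get to +1
```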

# Completing the example

We have already built the tools to understand most of the concepts, believe it or not. So let's put some numbers on it and see if it works. I told you that fixed-point is just a path to move back and forth between real and integer numbers. Let's see this powerful idea in action.

Let's say the ADC voltage range is $\pm 1$ V. Now suppose the actual input is $0.234$ V and the desired gain is $0.3.$ We can do an easy calculation to obtain the output voltage after the gain, which is $0.234\times 0.3=0.0702.$ This is what we should get, so let's try to get the same result by applying our gain block.

We calculated the step to be $0.03125$ for the gain, which means that we can now compute how many steps we need to approach $0.3,$ the desired gain. This is $0.3/0.03125=9.6.$ Unfortunately we won't be able to use $9.6$ steps: we either use 9 or 10. Let's say we chose to use 10 steps, this will give a quantized gain of $0.3125.$ As you can see some errors show up, and that's the way it is.
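As a rough sketch of this quantization step, the snippet below converts a desired gain into a 6-bit code and back. The helper `quantize_gain` and its `rounding` parameter are assumptions made for the example; the choice between rounding down, up, or to nearest is a design decision, not something fixed by the format.

```python
import math

GAIN_BITS = 6
GAIN_STEP = 2 / 2**GAIN_BITS  # 0.03125


def quantize_gain(gain, rounding=round):
    """Return (code, quantized_gain) for a desired real gain.

    `rounding` is any function mapping a float to an integer,
    e.g. round, math.floor or math.ceil.
    """
    steps = gain / GAIN_STEP               # 9.6 steps for a gain of 0.3
    code = int(rounding(steps))
    code = max(-32, min(31, code))         # stay inside the 6-bit signed range
    return code, code * GAIN_STEP


print(quantize_gain(0.3, math.floor))  # (9, 0.28125)
print(quantize_gain(0.3, math.ceil))   # (10, 0.3125)
print(quantize_gain(0.3))              # (10, 0.3125)
```

Either neighbour is a legal code; picking 10 simply gives the smaller error here.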

Here we need a bit of theory. Let's build the path from real numbers to integer numbers by using fixed-point for the ADC. We said the voltage range is $\pm 1$ V, and the digital range is -32768 to 32767. Going from 1 to 32768 means multiplying by $2^{15},$ and that scale factor is all there is to the fixed-point representation of the integer number. Let's put this in words. We have an integer number which uses 16 bits, and we divide it by some power of 2 to end up with a real number, which we know how to handle. The magic number is $2^{15}$ in this case. The theory will call this "1 bit for the integer part, and 15 bits for the fractional part". Some will call this Q1.15 or something fancier. Now that you understand what is going on, we don't need names.
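Here is a minimal sketch of that path for the ADC, in both directions. The function names `real_to_q15` and `q15_to_real` are invented for the example, and rounding to the nearest code with clamping at the ends is an assumption about how the conversion behaves.

```python
ADC_FRAC_BITS = 15
ADC_MIN_CODE = -(2**15)     # -32768
ADC_MAX_CODE = 2**15 - 1    # 32767


def real_to_q15(x: float) -> int:
    """Real value in [-1, +1) -> 16-bit signed integer code."""
    code = round(x * 2**ADC_FRAC_BITS)
    return max(ADC_MIN_CODE, min(ADC_MAX_CODE, code))  # clamp to 16 bits


def q15_to_real(code: int) -> float:
    """16-bit signed integer code -> real value."""
    return code / 2**ADC_FRAC_BITS


print(real_to_q15(0.234))   # 7668
print(q15_to_real(7668))    # 0.2340087890625
print(real_to_q15(1.0))     # 32767, exact +1 is not representable
```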

The same idea can be applied to the gain, where we used 5 bits for the fraction and 1 bit for the integer part, so $\pm 1$ maps to -32 to 31 (remember the exact $+1$ is left outside). Back to the math: we fixed the ADC input to $0.234,$ which translates to 7668 (do the math by calculating the step as in the gain example: $0.234\times 2^{15}=7667.712,$ which rounds to 7668). The calculation is as follows:

$7668\times 10=76680,$

where 7668 is the voltage in ADC steps and 10 is the gain, in gain steps. What can we do with this number? Google around 5 minutes and you will find the rules for fixed-point multiplication: the number of bits are added, everywhere. We originally had 1 bit for the integer part for both the ADC and the gain, and we have 15 fractional bits for the ADC and 5 for the gain. So the result must have 2 bits for the integer part and 20 bits for the fractional part. Let's try it. The path from integer to real given by fixed-point says we need to divide our result by $2^{20}.$ This should lead us back to the real world. I'll let you do the math, but you can easily verify that the result we get is $0.07312775.$ This does not quite match our real-life calculation, which gave $0.0702,$ but we know the gain was carrying some quantization error ($0.3125$ instead of $0.3$), as well as the ADC data.
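The whole example fits in a short script. This is only a sketch of the arithmetic in the text, with variable names chosen for the illustration; the integer product is the only operation the hardware would perform, everything else is bookkeeping to get back to real numbers.

```python
ADC_FRAC_BITS = 15
GAIN_FRAC_BITS = 5

# Real world -> integer world
adc_code = round(0.234 * 2**ADC_FRAC_BITS)    # 7668
gain_code = round(0.3 * 2**GAIN_FRAC_BITS)    # 10

# Integer world: this is what the multiplier computes
product_code = adc_code * gain_code           # 76680

# Fractional bits add up: 15 + 5 = 20
product_frac_bits = ADC_FRAC_BITS + GAIN_FRAC_BITS

# Integer world -> real world
result = product_code / 2**product_frac_bits
ideal = 0.234 * 0.3

print(product_code)                # 76680
print(f"{result:.8f}")             # 0.07312775
print(f"{result - ideal:.8f}")     # 0.00292775, the quantization error
```

Almost all of the error comes from using a gain of $0.3125$ instead of $0.3$; the ADC rounding contributes far less.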

As a last word, I will give you a little bit of homework. What if the ADC range is $\pm 5$ V but it's still 16-bit? Do you think the gain block will change at all? Well the answer is no, and I'm pretty sure you already know why.
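One way to convince yourself is the sketch below, where the integer gain block never sees volts and only the final interpretation of the codes depends on the ADC full scale. The functions `gain_block` and `code_to_volts` and the parameter `full_scale_volts` are hypothetical names for this illustration.

```python
PRODUCT_FRAC_BITS = 20  # 15 fractional bits (ADC) + 5 (gain)


def gain_block(adc_code: int, gain_code: int) -> int:
    """The integer-only operation implemented in hardware."""
    return adc_code * gain_code


def code_to_volts(product_code: int, full_scale_volts: float) -> float:
    """Interpret the product for an ADC whose range is +/- full_scale_volts."""
    return product_code / 2**PRODUCT_FRAC_BITS * full_scale_volts


code = gain_block(7668, 10)                 # same integers in both cases
print(f"{code_to_volts(code, 1.0):.6f}")    # 0.073128 for a +/-1 V ADC
print(f"{code_to_volts(code, 5.0):.6f}")    # 0.365639 for a +/-5 V ADC
```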

# Conclusion

The intention of this article is to walk through numbers and define fixed-point as a way of going from real to integer numbers. The world is real, and so are the numbers that represent it. Computation and signal processing blocks are implemented using integer arithmetic, especially in FPGA and ASIC implementations. Understanding how to go from real numbers to integers is crucial when implementing solutions on these platforms.

Stay tuned!! This is the first article of a series where I will cover some other fun ideas and concepts to make complex things work in real life.

# Comments

There is some confusion in this article between real numbers and floating-point representation. They are very different things. In this context, real numbers are analog and floating-point numbers are digital. Therefore, the way you talk about "real" in the Conclusion is correct, but everywhere else it is not. Floating-point numbers have finite precision and aren't so different from fixed-point numbers (the main difference being that their exponent is explicit and can therefore "float" around, instead of staying "fixed" to an implicit value). There are plenty of other important differences (particularly if we're talking about IEEE754), but that's the key one here.

If you want your gain to go from -1 to +1 with at least 50 steps, then you need 7 bits (signed two's complement): -1.0 = "11.00000" and +1.0 = "01.00000". You may think that 0.96875 seems quite close to 1.0, but your specification was 1.0. So either the specification was wrong, or the implementation is.

Your 5 minutes of Googling regarding fixed-point multiplication has led you to the wrong answer. This is not always true: "*the number of bits are added, everywhere*". This can be confirmed analytically, but it's not straightforward. Therefore, I would encourage you to exhaustively multiply pairs of fixed-point numbers across all possible values for each number, for a range of different fixed-point representations. (Use small bit-widths to keep computation time manageable). Then check the necessary and sufficient number of bits to represent the result of the multiplication in each case. You may be amazed to see how many tricky edge cases there are that break the "rule" you found by Googling.

One very trivial example is if you multiply two 1-bit signed integers together. Do you need 1+1=2 bits to represent the result? No. The result can only be 0 or 1, which only needs 1 bit (unsigned) to represent.
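The exhaustive check suggested above could be sketched as follows for signed two's complement integers. The helper names (`signed_range`, `bits_needed`, `product_width`) are invented for this illustration, and only integer widths are explored, on the assumption that fractional bits just rescale the result without changing how many bits it needs.

```python
from itertools import product


def signed_range(bits):
    """All values of a two's complement integer with the given width."""
    return range(-(1 << (bits - 1)), 1 << (bits - 1))


def bits_needed(lo, hi):
    """Smallest width holding every integer in [lo, hi].

    Returns (bits, is_signed); an unsigned format is used when lo >= 0.
    """
    if lo >= 0:
        return max(1, hi.bit_length()), False
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits, True


def product_width(a_bits, b_bits):
    """Exhaustively multiply all pairs and size the result."""
    results = [a * b for a, b in product(signed_range(a_bits), signed_range(b_bits))]
    return bits_needed(min(results), max(results))


print(product_width(1, 1))  # (1, False): the result is only ever 0 or 1
print(product_width(2, 2))  # (4, True)
print(product_width(4, 4))  # (8, True)
```

Extending `signed_range` to unsigned or mixed inputs is left out here to keep the sketch short.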

When I first read these comments, I thought you were being overly harsh and discouraging to Leandro with the "5 minutes of googling" remarks, but I see that is exactly what he said in the post. You are forgiven :)

Nice write-up Leandro and looking forward to your further posts on your experiences. Helpful comments here by Weetabix. I believe the general rule that does always hold without confusion of corner cases is when we multiply **full scale** signed integers together, the precision needed at the output is given by the sum of precisions but only counting the sign bit once. This would then also include the simple case of 1 bit signed integers and is a useful takeaway. Do you concur? If we know for a fact that a signal will never reach full scale, or we are willing to accept overflow either by truncation or modulo-wrapping, then we can reduce that output precision accordingly.

Hi Dan,

Cannot agree more. Sometimes I just want to simplify the discussion to allow readers to focus on simple things, but it is true sometimes over-simplifying can be bad, too.

I have implemented more than 100 IPs for RFSoCs and I hope I can keep writing to share some of that!!

Thanks!!

In the case of multiplication, the edge cases are very obscure indeed. They seem so obscure that they would never be useful, until they are:

- If **either** input is 1-bit unsigned, then the number of integer bits can be reduced by 1 (compared to the normal rule).
- If **both** inputs are 1-bit signed, then the sign bit can be dropped.

I think it's fairly clear why those truncations are permissible. For me, it's not so obvious to see that they are the __only__ exceptions to the normal rule.

Addition and subtraction have more interesting edge cases (especially when handling inputs whose amount of integer bits or fractional bits is negative), which I'm not sure can be expressed so compactly. I will look into that, actually.

I have never seen any of these edge cases discussed anywhere, which I think is a real shame because they're both fascinating and potentially useful.

Let's please not be so critical as to be discouraging, particularly to a new blogger. I look forward to his next entry and added my other comment as a supportive link to prerequisite material for those readers who may not be familiar with the binary representation of numbers.

How about you write up an article on this leaf level topic? It may be of interest to others who are more expert in this area.

Thanks for the comment!! I'm actually thinking about an interesting project I did which includes 2 DDSs and a pole-zero pair for implementing a frequency-modulated resonator, which we use for emulating actual resonators (superconducting detectors).

This may be interesting, especially the fixed-point implementation of the IIR section.

I'll try to do it when I find some time!!

Hi there,

Thanks for your comments!! Honestly, I meant to write an example, very introductory, for those that don't have much experience in the field. I don't know why you introduced floating-point, as I never mentioned that in the article.

Regarding your multiplication fact, I can tell you that if you always keep that extra MSB in real-life applications, you will keep losing a ton of precision in the long term. Multiplying -1 x -1 sure needs one extra bit for representing +1, but good practice won't keep this extra bit and will invest it better for added precision.
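To make this concrete, here is a sketch of one common way to drop the extra MSB: multiply two Q1.15 values, return to 15 fractional bits, and saturate the single case that does not fit. The function `q15_mul` and the choice of truncation plus saturation are assumptions for this example, not necessarily what any particular design does.

```python
Q15_MIN = -(2**15)     # -32768, represents -1.0
Q15_MAX = 2**15 - 1    # 32767, represents +0.999969...


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q1.15 codes and return a Q1.15 code."""
    full = a * b                  # 2 integer bits, 30 fractional bits
    result = full >> 15           # keep 15 fractional bits (rounds toward -inf)
    # Only -1.0 * -1.0 lands outside the Q1.15 range, so clamp it.
    return max(Q15_MIN, min(Q15_MAX, result))


print(q15_mul(16384, 16384))     # 8192: 0.5 * 0.5 = 0.25
print(q15_mul(-32768, 16384))    # -16384: -1.0 * 0.5 = -0.5
print(q15_mul(-32768, -32768))   # 32767: +1.0 saturates to the largest code
```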

As for the +1/-1 example, yes, it is true you cannot reach +1. This example comes from a real-life case, where the ADC output, which is my actual input, does not reach the most positive voltage either. So when I say the range is +1 V to -1 V, that is not totally true, I know. It is common practice to say "I have x bits and my range is -1/+1", when you actually mean -1 to +0.999 or whatever precision you have. You will never pay for that extra bit only to leave it unused 99 % of the time.

Again, the article is not intended to be exhaustive, just an example for people without experience that may get confused easily. And also to stress the fact that you can implement very complex signal processing things without having to use expensive floating-point operators!!

I hope I can write more with hands-on examples on FPGA!

Looking forward to your further posts! I think Weetabix meant literally a 1-bit signed number, which can only be 0, which maps to 0, or 1, which maps to -1 as a signed 2's complement number. So the possible outputs of this product can also only be 0 or 1 (one bit). See my response to his comment for what I think is the more robust rule, so that we needn't worry about corner cases (and note that we use it specifically when our concern is having numbers that can reach full scale). Thanks for posting!

Here is a link to a post I wrote on another forum that is closely related. The forum is on a site for a language called Gambas, which is only available on Linux. It is a byte-code type IDE modeled after VisualBasic 6.0, before it went to the .net framework.

It is a great platform for prototyping algorithms as it is really easy to make a user interface for parameter setting and graphical output. It is a very rich language, on par with Python for being a good first language for learning programming. It also has support for calling shared libraries written in C.

The code examples in the post may or may not interest anyone here, but the output and the subject matter is about how bits work in bytes and is thus, IMO, a good supplement for this article.

The reason I wrote the article was to warn folks of a little gotcha quirk that Gambas has due to its author wanting to respect VB conventions.

Exposing bytes for what they really are
