FIR sideways (interpolator polyphase decomposition)

Markus NentwigSeptember 12, 20129 comments

An efficient implementation of a symmetric-FIR polyphase 1:3 interpolator that doesn't follow the usual tapped delay line-paradigm.
The example exploits the impulse response symmetry and avoids four multiplications out of 10.
keywords: symmetric polyphase FIR filter implementation ASIC

Matlab / Octave implementation


An interpolating FIR filter can be implemented with a single tapped delay line, possibly going forwards and backwards for a symmetric impulse response. To optimize it further, I need to understand only a few things, namely "cascade" and "Nobel" identities, transposed filter structures and how to spell "polyphase decompostition".
Or maybe I forget everything about theory and just look at the implementation, which seems rather straightforward. Even obvious, after staring at it for an hour (do not blink).


Figure 1 shows a 1:3 interpolating FIR filter at a conceptual level.
After every input sample x, two zero samples are inserted.
z-1 refers to a delay at the high (output) rate.
The delay line contents in the middle of the picture illustrate, how the non-zero input samples and inserted zeros move through the delay line at three consecutive time instants.
For interpolation-by-three, the filter cycles through three 'phases', where different taps hold the non-zero samples.

Figure 1: FIR interpolation and polyphase equations

Clearly there is no point in performing expensive multiplications with (non)-samples that are known to be zero (BUT: this is perfectly fine for Matlab coding, as long as it helps to keep things 'simple and stupid'. Usually it does).
The equations at the bottom of Figure 1 show only remaining terms, if products with zeros are omitted.
The three output samples y1, y2 and y3 are calculated for each input sample (the x before zero insertion), as the interpolating FIR produces three output samples per input sample.

Textbook polyphase FIR interpolator

Instead of inserting zeros and then carefully multiplying around them, the structure can be rewritten using "slower" delays z-3 at the input rate (=> zzz => the sample goes to sleep for a while. Ahem.).
Now three output samples are generated for each step of the delay line, as shown in Figure 2:

Figure 2: Conventional tapped delay-line implementation\

'FIR sideways':An alternative implementation

So far so good.
The implementation in Figure 2 is maybe the most common way to implement such a filter.
But, there is an alternative:

Figure 3: Alternative implementation

Figure 3 uses individual delay lines for each coefficient. The structure can always be implemented using one multiplication per independent coefficient and input sample.
The number of multiplications has been reduced from 10 to 6. And if the center tap can be fixed to 1, as is common, it improves from 9 to 5, which is even better.


This article shows a polyphase FIR interpolator for symmetric impulse responses that uses fewer multiplications than the textbook 'single-delay-line' implementation, in the example only two per output sample.
Does it make sense? Possibly "yes", justified by the application-dependent cost of multiplications.
In real life, "cost" is rarely as straightforward, but it may be worth having a look at.
For example, [1] (Fig 2.4) shows an application where a "non-canonical" polyphase structure can be beneficial.
Possible savings differ, depending on interpolation factor and filter length.

Matlab demo

The Matlab / Octave example calculates the filter response using the three methods outlined in this article.
The resulting plot (Fig. 4) shows the output, three identical and thus overlapping traces.

Figure 4: Matlab output


[1] Improved multirate polyphase-based interpolation

[ - ]
Comment by kazSeptember 15, 2012
Hi Markus, For an interpolator (usually polyphase structure), the number of multipliers = ceil(taps/I/folding factor). Where I = interpolation factor and folding factor is system floor(clock/sample rate). Thus in your case I expect 10/3/1 = 4 or even 3 if peak tap is power of 2. With the polyphase structure, we only need one delay line equal to taps/I and the symmetry is lost but it stays as most efficient approach. Kadhiem
[ - ]
Comment by mnentwigSeptember 15, 2012
Hello, thanks for your comment. Now I'm not sure if I understand it correctly. Please clarify, if I got it wrong: A conventional polyphase would use four multiplications per -output- sample, which is (interpolating by 3) twelve multiplications per input sample. As my example impulse response is a bit shorter, eleven remain, and one is duplicate => ten multiplications per input sample. My proposal uses 6 multiplications per -input- sample. As I get three output samples, I need only two multiplications per -output- sample, instead of four. -markus
[ - ]
Comment by kazSeptember 15, 2012
That seems ok. The point you are raising is multiplications per output (or input). The point I am raising is multipliers per output(or input). It depends on perspective. Mine is FPGA and we focus on the latter. May you are concerned about speed or power consumtion. Thanks. By the way, your index of coeffs is not right (possibly typing error) in Fig 2 Kadhiem
[ - ]
Comment by mnentwigSeptember 15, 2012
Hello, in the end, it boils down to how many multiplications per second I need for a given signal. Compared to a "canonical" implementation, the above proposal should need about half.
[ - ]
Comment by mnentwigSeptember 15, 2012
BTW, where is the mistake in Fig. 2? I can't see it, but typos are usually good at "not being seen"...
[ - ]
Comment by kazSeptember 15, 2012
Oops it is the way you indexed for symmetry, a bit confusing, I would normally use h1 ~ h11 rather than repeat the second half with same index of first half. Sorry.
[ - ]
Comment by mnentwigSeptember 15, 2012
Hello, the whole exercise works only if the impulse response is symmetric, because I exploit c7 = c5 etc. I'll point out the need for symmetric IR more clearly in the introduction. Fortunately, I'd expect that the typical polyphase FIR impulse response is symmetric: If I need asymmetry to equalize group delay, I could do that more efficiently at the low rate.
[ - ]
Comment by kazSeptember 15, 2012
Hi Markus, Just to confirm what I said about FPGA perspective. Given the choice of 4 multipliers for entire fir design or 6 multipliers, any fpga designer will certainly choose 4. The 3 outputs will take turns to use same 4 multipliers. It seems that the perspective of dsp (software as opposed to fpga dsp) is different. I can imagine that in terms of power consumption, yes 6 multiplications per input may be favoured over 4x3 per input. Kadhiem
[ - ]
Comment by mnentwigSeptember 15, 2012
I think most DSPs actually handle a multiplication in one clock cycle. If so, I'd expect that multiplications don't matter at all. But on an ASIC; for example, an algorithm that uses fewer multiplications for a given processing job can save quite a bit of silicon area, as that's where the money goes.

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Please login (on the right) if you already have an account on this platform.

Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers: