# Interpolation and decimation

Started by January 13, 2004
```In article bufn81\$h6vvj\$1@ID-210375.news.uni-berlin.de, Jon Harris at
goldentully@hotmail.com wrote on 01/18/2004 23:45:

> Ronald H. Nicholson Jr. <rhn@mauve.rahul.net> wrote in message
> news:bufdfe\$f3d\$1@blue.rahul.net...
...
>>
>> Or maybe use the interpolation FIR filter on it's own coefficients!

one place where that is useful is in designing the FIR coefs to start with
if the upsample ratio is so high that MATLAB's remez() chokes on it.  so you
do it for fewer phases, and then use those very coefs to interpolate a
denser table with more phases.

> This is usually overkill.  Unless your available table memory is so small that
> you can only store a handfull of phases, the filter coefficients are
> essentially highly oversampled data (or low pass data as r b-j called it).

it's not what i meant regarding "low pass data", Jon.  hmmmm.  maybe it *is*
what i meant, but from a different POV.

what i meant is if the audio data is highly oversampled (which lot's of
phases in the FIR table would do to it), then it appears, from the POV of
the high sampling rate, very "low pass data".  now if that's the case, the
*images* are narrow little blips at every integer multiple of the high Fs
and a sinc^2 frequency response (which is what you get when you convolute
with the triangular pulse which is what happens when you do linear
interpolation between neighboring samples which is essentially the same
thing you are doing if you are interpolating between coefficients of
neighboring phases) will kill those little blips quite nicely.  higher order
B-splines (which has frequency response of sinc^(N-1) for an Nth order
B-spline) will kill those blips even better.

> Hence you can get away a pretty simple interpolation scheme since there
> is very little high-frequency content.
>
>> Hmmm...  I wonder how many levels of this it takes before diminishing
>> returns sets in?  Everybody seems to assume 1.5 levels in enough:
>> 1 or 2 taps (select or linear interpolation inside the phase table
>> for the FIR coefficients) plus N taps (for the data).
>
> This is usually adequate for the reasons stated above (and see below too).

if you have 512 different phases of your polyphase FIR filter (the same as
oversampling by a factor of 512), then linear interpolation will be adequate
for an S/N of 120 dB, if i recall correctly.  if you do a cubic B-spline,
you will need only 16 or 32 different phases (or oversampling by 16 or 32)
to get the same S/N.  if you're doing *no* interpolation (sometimes called
"drop-sample") between coefficients (or oversampled samples), then you need
512K phases (or oversample by 512K.

>> Given a fixed number of MACs per sample and a fixed number of table
>> entries (to fit in x% of the dcache for instance), it looks like there
>> are several possiblities.  0 MACs for the coeffs and N MACs for the data.
>> N/2 MACs for 2 taps of coeff generation (and with a table now with twice
>> as many phases in the same number of bytes) and N/2 taps for the data.
>> Or maybe even more MACs for coefficient generation and an even shorter
>> FIR filter.  An interesting optimization and error bounding problem.
>
> I think it would usually be pretty easy to find the optimial answer.  For
> example, if you decide to do linear coeficient interpolation, you immediately
> have to use a shorter filter which means you can store more phases in the same
> memory, which means there is a less critical need for the linear coef
> interpolation.  You can see why higher order interpolation of the coefs
> quickly becomes a losing battle.  The only excpeption might be if memory was
> *extremely* tight and you had tons of MACs available.

the tradeoff is strictly between memory usage and computational complexity.
if you have little memory and lotsa MIPs to burn, you will want to use a
higher order interpolation between the few coefficients you have.  if you
have 16 megawords laying around, you can save MIPs and do *no* interpolation
of coefficients.

now, if you are *upsampling*, then you can do the linear (or whatever kind)
interpolation on the resulting output samples from the two (or more)
neighboring phases.  that is mathematically equivalent to interpolating the
coefficients and usually much simpler to code.  this doesn't work as well
for downsampling because you need to perform some low-pass filtering of the
signal anyway to avoid aliasing and one way to do that is to pick off your
sinc() table coefficients at a different spacing.  that means more taps on
your FIR filter (per sample) but you are doing fewer output samples per

r b-j

```
```In article <bufn81\$h6vvj\$1@ID-210375.news.uni-berlin.de>,
Jon Harris <goldentully@hotmail.com> wrote:
>Ronald H. Nicholson Jr. <rhn@mauve.rahul.net> wrote in message
>news:bufdfe\$f3d\$1@blue.rahul.net...
>> >> Polyphase is just an implementation detail for a known filter - so I
>> >>choose to leave that out as much as possible.
>>
>> I sense recursion here.
>>
>> The problem is not having continuous data.  So we use equally spaced
>> sampled data, and we select, iterpolate or multi-rate FIR filter to
>> get (estimated) values in between the ones sampled.
>>
>> But since no CPU has a one-cycle sinc() generating instruction (etc.), we
>> don't have a continuous FIR filter coefficients.  So we use eqully spaced
>> "phases" of the FIR coefficients, and we select, interpolate, or...
>>
>> Or maybe use the interpolation FIR filter on it's own coefficients!
>
>This is usually overkill.  Unless your available table memory is so
>small that you can only store a handfull of phases

Actually, that's exactly the case about which I'm thinking.  Consider
a DSP/CPU with a very small data cache, and a latency for table memory
access which is many cycles long (e.g. even longer than doing "several"
multiply accumulates).  If one is only interpolating a short signal
segment with a non-repeating phase step, a large phase table is almost
guaranteed to not be cache resident, so every table lookup incurs a
cache miss penalty.

How much can the phase table be compressed by using higher orders, or
even recursive, interpolation schemes?  At the limit, one could even do
a series expansion of the sin and cosine terms contained in a windowed
sinc with zero table lookups.  In the case of common CPUs with a 1 cycle
fp MAC but cache miss penalty of over 200 cycles (2+ GHz CPU with a 100
nS memory system), this almost makes sense.  Some DSPs with much smaller
caches are getting near this point also.

So, is there any good way to interpolate or approximate a windowed sinc
using only a tiny number of "phases"/samples, e.g. such that the entire
table fits in only a few cache lines?

Would using a linear interpolation on a very sparse windowed sinc
table on a few more points of the sparse table itself work?

IMHO. YMMV.
--
Ron Nicholson   rhn AT nicholson DOT com   http://www.nicholson.com/rhn/
#include <canonical.disclaimer>        // only my own opinions, etc.
```
```"Ronald H. Nicholson Jr." <rhn@mauve.rahul.net> wrote in message
news:bukaio\$h2t\$2@blue.rahul.net...
> In article <bufn81\$h6vvj\$1@ID-210375.news.uni-berlin.de>,
> Jon Harris <goldentully@hotmail.com> wrote:
> >Ronald H. Nicholson Jr. <rhn@mauve.rahul.net> wrote in message
> >news:bufdfe\$f3d\$1@blue.rahul.net...
> >> >> Polyphase is just an implementation detail for a known filter - so I
> >> >>choose to leave that out as much as possible.
> >>
> >> I sense recursion here.
> >>
> >> The problem is not having continuous data.  So we use equally spaced
> >> sampled data, and we select, iterpolate or multi-rate FIR filter to
> >> get (estimated) values in between the ones sampled.
> >>
> >> But since no CPU has a one-cycle sinc() generating instruction (etc.),
we
> >> don't have a continuous FIR filter coefficients.  So we use eqully
spaced
> >> "phases" of the FIR coefficients, and we select, interpolate, or...
> >>
> >> Or maybe use the interpolation FIR filter on it's own coefficients!
> >
> >This is usually overkill.  Unless your available table memory is so
> >small that you can only store a handfull of phases
>
> Actually, that's exactly the case about which I'm thinking.  Consider
> a DSP/CPU with a very small data cache, and a latency for table memory
> access which is many cycles long (e.g. even longer than doing "several"
> multiply accumulates).  If one is only interpolating a short signal
> segment with a non-repeating phase step, a large phase table is almost
> guaranteed to not be cache resident, so every table lookup incurs a
> cache miss penalty.

OK, I see where you're going now.  I am guilty of only looking at things in
my little world, where the memory is plentiful but the MACs are limited.

> How much can the phase table be compressed by using higher orders, or
> even recursive, interpolation schemes?  At the limit, one could even do
> a series expansion of the sin and cosine terms contained in a windowed
> sinc with zero table lookups.  In the case of common CPUs with a 1 cycle
> fp MAC but cache miss penalty of over 200 cycles (2+ GHz CPU with a 100
> nS memory system), this almost makes sense.  Some DSPs with much smaller
> caches are getting near this point also.

Don't these CPUs tend to have pretty decent-sized and often multi-level
caches?
Also, does a 2 GHz CPU really have a single-cycle (32-bit) floating-point
MAC?  I haven't studied the latest from Intel, etc. but that seems
unrealistic considering the best 32-bit floating point DSPs are an order of
magnitude slower.  I was under the impression that a 2 GHz couldn't really
do much of anything in a single clock cycle.

> So, is there any good way to interpolate or approximate a windowed sinc
> using only a tiny number of "phases"/samples, e.g. such that the entire
> table fits in only a few cache lines?

Well, you could generate the sine part of it from your favorite trig library
or an appropriate approximation, then just store the the window and 1/x part
combined in a table.  I would think you could get away with a pretty small
table in this case, since the 1/x*window would be very smooth (monotonicly
decreasing).  Also, if your sinc-based low-pass filter is designed so it is
exactly at the Nyquist frequency, the sine part of the filter will only need
to be calculated twice (I think) for all frequency coefficients.  That might
result in a very efficient implementation for some architectures.

Don't forget about the "free" 2:1 compression you can take advantage of by
storing only one half of the symmetric filter coefs.  There is a small
penalty in implementation overhead for doing this, but it's usually not too

> Would using a linear interpolation on a very sparse windowed sinc
> table on a few more points of the sparse table itself work?

As r b-j has discussed, if you have the MACs available, you're probably
better off with something better than linear interpolation.  In his paper,
he shows that not only are higher order interpolation schemes better that
linear overall, but they get "more better faster" than linear, i.e. as the
number of phases in the table increases, the improvement is greater.

>
> IMHO. YMMV.
> --
> Ron Nicholson   rhn AT nicholson DOT com   http://www.nicholson.com/rhn/
> #include <canonical.disclaimer>        // only my own opinions, etc.

```
```In article <buke59\$ipht4\$1@ID-210375.news.uni-berlin.de>,
Jon Harris <goldentully@hotmail.com> wrote:
>> Actually, that's exactly the case about which I'm thinking.  Consider
>> a DSP/CPU with a very small data cache, and a latency for table memory
>> access which is many cycles long (e.g. even longer than doing "several"
>> multiply accumulates).  If one is only interpolating a short signal
>> segment with a non-repeating phase step, a large phase table is almost
>> guaranteed to not be cache resident, so every table lookup incurs a
>> cache miss penalty.
>
>OK, I see where you're going now.  I am guilty of only looking at things in
>my little world, where the memory is plentiful but the MACs are limited.
>
>> How much can the phase table be compressed by using higher orders, or
>> even recursive, interpolation schemes?  At the limit, one could even do
>> a series expansion of the sin and cosine terms contained in a windowed
>> sinc with zero table lookups.  In the case of common CPUs with a 1 cycle
>> fp MAC but cache miss penalty of over 200 cycles (2+ GHz CPU with a 100
>> nS memory system), this almost makes sense.  Some DSPs with much smaller
>> caches are getting near this point also.
>
>Don't these CPUs tend to have pretty decent-sized and often multi-level
>caches?
>Also, does a 2 GHz CPU really have a single-cycle (32-bit) floating-point
>MAC?

Some (G5/970) can issue and retire (start or finish) even more than
one fp MAC per clock cycle, however the latency can be several cycles.
So for recursive IIR filters, where there are data dependancies on
the previous fp calculation, you won't get single cycle performance.
But for FIR filters, where several taps can be computed independantly
and thus in parallel, one can get near one MAC per clock cycle with
even fairly simple coding.  Most of these CPUs can even do the indexing
and data loads in parallel with the fp MACs.  I believe both the G5 and
Itanium have Fortran Linpack benchmark numbers of well over 1 flop per Hz.

The caches are huge on the server CPU's, but I doubt they will be as
large on the lower power GHz+ DSP chips in development.  In any
case, the first level caches are usually not large compared to a
large multi-tap multi-phase table, and there is still a multi-cycle
penalty going to second level cache, more than enough time for
a few more flop MACs.

IMHO. YMMV.
--
Ron Nicholson   rhn AT nicholson DOT com   http://www.nicholson.com/rhn/
#include <canonical.disclaimer>        // only my own opinions, etc.
```
```robert bristow-johnson <rbj@surfglobal.net> wrote in message
news:BC31AF92.7DC0%rbj@surfglobal.net...
> In article bufn81\$h6vvj\$1@ID-210375.news.uni-berlin.de, Jon Harris at
> goldentully@hotmail.com wrote on 01/18/2004 23:45:
>
> > Ronald H. Nicholson Jr. <rhn@mauve.rahul.net> wrote in message
> > news:bufdfe\$f3d\$1@blue.rahul.net...
> ...
> one place where that is useful is in designing the FIR coefs to start with
> if the upsample ratio is so high that MATLAB's remez() chokes on it.  so you
> do it for fewer phases, and then use those very coefs to interpolate a
> denser table with more phases.
>
> > This is usually overkill.  Unless your available table memory is so small
that
> > you can only store a handfull of phases, the filter coefficients are
> > essentially highly oversampled data (or low pass data as r b-j called it).
>
> it's not what i meant regarding "low pass data", Jon.  hmmmm.  maybe it *is*
> what i meant, but from a different POV.
>
> what i meant is if the audio data is highly oversampled (which lot's of
> phases in the FIR table would do to it), then it appears, from the POV of
> the high sampling rate, very "low pass data".  now if that's the case, the
> *images* are narrow little blips at every integer multiple of the high Fs
> and a sinc^2 frequency response (which is what you get when you convolute
> with the triangular pulse which is what happens when you do linear
> interpolation between neighboring samples which is essentially the same
> thing you are doing if you are interpolating between coefficients of
> neighboring phases) will kill those little blips quite nicely.  higher order
> B-splines (which has frequency response of sinc^(N-1) for an Nth order
> B-spline) will kill those blips even better.

A big AHA! for me a while back was that interpolating between the coefficients
and then using them to interpolate the audio is the same as calculating several
results using the "stright" coefficients and then interpolating between them to
get the final result.  (I know I'm not saying that very clearly, but hopefully
you know what I'm trying to get at.)  If you look at the math, it's pretty clear
that they are equivalent, but it wasn't intuitively obvious to me.  Either way,
the data is "low pass", it's just that in one case the audio is low-pass while
in the other case the coefficients (phases) are low pass.

> > Hence you can get away a pretty simple interpolation scheme since there
> > is very little high-frequency content.
> >
> >> Hmmm...  I wonder how many levels of this it takes before diminishing
> >> returns sets in?  Everybody seems to assume 1.5 levels in enough:
> >> 1 or 2 taps (select or linear interpolation inside the phase table
> >> for the FIR coefficients) plus N taps (for the data).
> >
> > This is usually adequate for the reasons stated above (and see below too).
>
> if you have 512 different phases of your polyphase FIR filter (the same as
> oversampling by a factor of 512), then linear interpolation will be adequate
> for an S/N of 120 dB, if i recall correctly.  if you do a cubic B-spline,
> you will need only 16 or 32 different phases (or oversampling by 16 or 32)
> to get the same S/N.  if you're doing *no* interpolation (sometimes called
> "drop-sample") between coefficients (or oversampled samples), then you need
> 512K phases (or oversample by 512K.

Sounds reasonable.  I know those early Analog Devices chips used linear coef.
interpolation and got 120dB.  I read in a paper once how many phases they kept,
but don't remember anymore.

> >> Given a fixed number of MACs per sample and a fixed number of table
> >> entries (to fit in x% of the dcache for instance), it looks like there
> >> are several possiblities.  0 MACs for the coeffs and N MACs for the data.
> >> N/2 MACs for 2 taps of coeff generation (and with a table now with twice
> >> as many phases in the same number of bytes) and N/2 taps for the data.
> >> Or maybe even more MACs for coefficient generation and an even shorter
> >> FIR filter.  An interesting optimization and error bounding problem.
> >
> > I think it would usually be pretty easy to find the optimial answer.  For
> > example, if you decide to do linear coeficient interpolation, you
immediately
> > have to use a shorter filter which means you can store more phases in the
same
> > memory, which means there is a less critical need for the linear coef
> > interpolation.  You can see why higher order interpolation of the coefs
> > quickly becomes a losing battle.  The only excpeption might be if memory was
> > *extremely* tight and you had tons of MACs available.
>
> now, if you are *upsampling*, then you can do the linear (or whatever kind)
> interpolation on the resulting output samples from the two (or more)
> neighboring phases.  that is mathematically equivalent to interpolating the
> coefficients and usually much simpler to code.

Regarding which method is best, if you need to do the same interpolation on more
than one data stream (e.g. stereo audio), then you're better off doing the
linear (or whatever kind) of interpolation on the coefs and then applying the
results to all the audio channels.  Otherwise, like you said, it's equivalent,
so pick the simplest method.

> <snip>

```