comp.dsp | Polyphase FIR implementation

Hi,

does anyone know if there is any reference code which implements
a generic polyphase FIR resampler (up/down scaler), using Altivec
instruction set?

I searched the web (as usual), but I could not find anything
really usefull, actully I did not find anything at all.

Also non-obvious search hints will be appreciated.

Thanks a lot in advance,

bye,

-- 

piergiorgio

Reply by Erik de Castro Lopo ●June 28, 20052005-06-28

Piergiorgio Sartor wrote:
> 
> Hi,
> 
> does anyone know if there is any reference code which implements
> a generic polyphase FIR resampler (up/down scaler),


    http://www.mega-nerd.com/SRC/

> using Altivec instruction set?

Sorry, no altivec. In fact, I tried vectorizing the algorithm and
it doesn't vectorize. All attempts, even handcoding the inner loop
in SSE (x86 equivalent of Altivec) was slower than the C code compiled
for the FPU.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"I consider C++ the most significant technical hazard to the survival
of your project and do so without apologies." -- Alistair Cockburn

Reply by Piergiorgio Sartor ●June 28, 20052005-06-28

Erik de Castro Lopo wrote:

>     http://www.mega-nerd.com/SRC/

Thanks!

> Sorry, no altivec. In fact, I tried vectorizing the algorithm and
> it doesn't vectorize. All attempts, even handcoding the inner loop
> in SSE (x86 equivalent of Altivec) was slower than the C code compiled
> for the FPU.

Uhm, could you please explain me were did you find the
problems in vectorizing it?

It seems to me that the data access is the real bottleneck.
Do you agree?

bye,

-- 

piergiorgio

Reply by Erik de Castro Lopo ●June 28, 20052005-06-28

Piergiorgio Sartor wrote:
> 
> Uhm, could you please explain me were did you find the
> problems in vectorizing it?

The algorithm used in Secret Rabbit Code is due to Julius
O. Smith. There is a link somewhere on the SRC web page to 
the Smith paper describing the algorithm.

> It seems to me that the data access is the real bottleneck.
> Do you agree?

Exactly. The Julius O. Smith algorithm calculates the
FIR filter coefficients on the fly using linear 
interpolation on a much larger table. Its this linear 
interpolation on the larger table which makes vectorised
versions of this algorithm slower than un-vectorised
versions.

I am currently working on a different algorithm which 
is specifically designed for high performance on CPUs
with Altivec or SSE instruction sets. When it works
it will become part of Secret Rabbit Code.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Java is, in many ways, C++--." -- Michael Feldman

Reply by Jon Harris ●June 29, 20052005-06-29

"Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message 
news:42C20C24.D9E6B0B5@mega-nerd.com...
> Piergiorgio Sartor wrote:
>>
>> It seems to me that the data access is the real bottleneck.
>> Do you agree?
>
> Exactly. The Julius O. Smith algorithm calculates the
> FIR filter coefficients on the fly using linear
> interpolation on a much larger table. Its this linear
> interpolation on the larger table which makes vectorised
> versions of this algorithm slower than un-vectorised
> versions.

With the large amounts of memory available on PCs today, does it make sense to 
just store the whole table in RAM?  Or maybe that causes problems with the 
cache.

Reply by robert bristow-johnson ●June 29, 20052005-06-29

in article V_owe.9690$Xr6.7335@trnddc07, Jon Harris at
jon99_harris7@hotmail.com wrote on 06/28/2005 23:57:

> "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message
> news:42C20C24.D9E6B0B5@mega-nerd.com...
>> Piergiorgio Sartor wrote:
>>> 
>>> It seems to me that the data access is the real bottleneck.
>>> Do you agree?
>> 
>> Exactly. The Julius O. Smith algorithm calculates the
>> FIR filter coefficients on the fly using linear
>> interpolation on a much larger table. Its this linear
>> interpolation on the larger table which makes vectorised
>> versions of this algorithm slower than un-vectorised
>> versions.
> 
> With the large amounts of memory available on PCs today, does it make sense to
> just store the whole table in RAM?  Or maybe that causes problems with the
> cache. 

maybe.  i have calculated that, for 120 dB S/N (the N due to aliasing of the
images not completely killed), you need about 512 phases or fractional
delays (each with their own FIR coefficient set) if linear interpolation is
used in between those phases to get an arbitrarily precise fractional delay.
if no interpolation is used (strictly speaking, it's zero-order hold or
"drop-sample" interpolation), you would need 512K phases.  if each phase
gets a 32 tap FIR, that would be 16 Meg word (and if symmetry is taken
advantage of, then it's 8 Meg).  that's still seems to be a pretty wasteful
use of memory if you can reduce that by a factor of 1000 just by using
linear interpolation.

-- 

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."

Reply by Jon Harris ●June 29, 20052005-06-29

"robert bristow-johnson" <rbj@audioimagination.com> wrote in message 
news:BEE7A025.89FE%rbj@audioimagination.com...
> in article V_owe.9690$Xr6.7335@trnddc07, Jon Harris at
> jon99_harris7@hotmail.com wrote on 06/28/2005 23:57:
>
>> "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message
>> news:42C20C24.D9E6B0B5@mega-nerd.com...
>>> Piergiorgio Sartor wrote:
>>>>
>>>> It seems to me that the data access is the real bottleneck.
>>>> Do you agree?
>>>
>>> Exactly. The Julius O. Smith algorithm calculates the
>>> FIR filter coefficients on the fly using linear
>>> interpolation on a much larger table. Its this linear
>>> interpolation on the larger table which makes vectorised
>>> versions of this algorithm slower than un-vectorised
>>> versions.
>>
>> With the large amounts of memory available on PCs today, does it make sense 
>> to
>> just store the whole table in RAM?  Or maybe that causes problems with the
>> cache.
>
> maybe.  i have calculated that, for 120 dB S/N (the N due to aliasing of the
> images not completely killed), you need about 512 phases or fractional
> delays (each with their own FIR coefficient set) if linear interpolation is
> used in between those phases to get an arbitrarily precise fractional delay.
> if no interpolation is used (strictly speaking, it's zero-order hold or
> "drop-sample" interpolation), you would need 512K phases.  if each phase
> gets a 32 tap FIR, that would be 16 Meg word (and if symmetry is taken
> advantage of, then it's 8 Meg).  that's still seems to be a pretty wasteful
> use of memory if you can reduce that by a factor of 1000 just by using
> linear interpolation.

Thanks for jumping in with some actual numbers.  My PC has 1.5GB of memory so 
8-16 MegaWords doesn't seem too bad, but I see your point.  If the memory access 
truly is the limiting factor, eliminating linear interpolation cuts that in 
half, which is a good thing.

Reply by Erik de Castro Lopo ●June 29, 20052005-06-29

Jon Harris wrote:
> 
> "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message
> news:42C20C24.D9E6B0B5@mega-nerd.com...
> > Piergiorgio Sartor wrote:
> >>
> >> It seems to me that the data access is the real bottleneck.
> >> Do you agree?
> >
> > Exactly. The Julius O. Smith algorithm calculates the
> > FIR filter coefficients on the fly using linear
> > interpolation on a much larger table. Its this linear
> > interpolation on the larger table which makes vectorised
> > versions of this algorithm slower than un-vectorised
> > versions.
> 
> With the large amounts of memory available on PCs today, does it make sense to
> just store the whole table in RAM?  Or maybe that causes problems with the
> cache.

The Secret Rabbit Code does store the whole table in RAM.

The problem that there is no way of arranging that table so 
that an CPU with vector instructions (ie Altivec or SSE) can
benefit.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"C++ is the only current language making COBOL look good."
-- Bertrand Meyer

Reply by Ronald H. Nicholson Jr. ●June 30, 20052005-06-30

In article <42C250BA.66967C36@mega-nerd.com>,
Erik de Castro Lopo  <nospam@mega-nerd.com> wrote:
>... store the whole table in RAM.
>
>The problem that there is no way of arranging that table so 
>that an CPU with vector instructions (ie Altivec or SSE) can
>benefit.

"that table" implies you are using a single large in-order table.

I've had luck recomputing and re-striping windowed-sinc tables for the
frequency ratio in use.  Once the step size through the reconstruction
function is known, one can recompute or pre-interpolate the tap
coefficients, and move them contiguously per phase for better cache
locality into a new temporary table (as well as new per phase contiguous
difference values for delta phase interpolation).  Depending on cache size
versus table size, you might even want to keep multiple shifted copies
of each phase coefficient group so that the data, the coefficient and
the difference tables can always be fetched from identically aligned
cache lines.

But when rescaling only one channel on a fast PC these days, I sometimes
don't even bother with any precomputed tables; the math lib on a 2+ GHz
PPC can calculate new sin() and cos() values fast enough for a real-time
multi-tap raised-cosine sinc resampling filter.

IMHO. YMMV.
-- 
Ron Nicholson   rhn AT nicholson DOT com   http://www.nicholson.com/rhn/ 
#include <canonical.disclaimer>        // only my own opinions, etc.

Reply by Erik de Castro Lopo ●June 30, 20052005-06-30

"Ronald H. Nicholson Jr." wrote:
> 
> In article <42C250BA.66967C36@mega-nerd.com>,
> Erik de Castro Lopo  <nospam@mega-nerd.com> wrote:
> >... store the whole table in RAM.
> >
> >The problem that there is no way of arranging that table so
> >that an CPU with vector instructions (ie Altivec or SSE) can
> >benefit.
> 
> "that table" implies you are using a single large in-order table.
> 
> I've had luck recomputing and re-striping windowed-sinc tables for the
> frequency ratio in use.

Secret Rabbit Code is designed to allow a varying ratio between input
and outpu sample rate.

>  Once the step size through the reconstruction
> function is known, one can recompute or pre-interpolate the tap
> coefficients, and move them contiguously per phase for better cache
> locality into a new temporary table (as well as new per phase contiguous
> difference values for delta phase interpolation).  Depending on cache size
> versus table size, you might even want to keep multiple shifted copies
> of each phase coefficient group so that the data, the coefficient and
> the difference tables can always be fetched from identically aligned
> cache lines.

This works for many specific cases, but not for the general case where
the input and output sample rates may vary with time.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
Java sucks. C sucks slightly less so, but only because it makes 
no pretense at all about being a high level language.

Polyphase FIR implementation

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group