Polyphase FIR implementation

Started by Piergiorgio Sartor June 28, 2005
Hi,

does anyone know if there is any reference code which implements
a generic polyphase FIR resampler (up/down scaler), using Altivec
instruction set?

I searched the web (as usual), but I could not find anything
really usefull, actully I did not find anything at all.

Also non-obvious search hints will be appreciated.

Thanks a lot in advance,

bye,

-- 

piergiorgio
Piergiorgio Sartor wrote:
> > Hi, > > does anyone know if there is any reference code which implements > a generic polyphase FIR resampler (up/down scaler),
http://www.mega-nerd.com/SRC/
> using Altivec instruction set?
Sorry, no altivec. In fact, I tried vectorizing the algorithm and it doesn't vectorize. All attempts, even handcoding the inner loop in SSE (x86 equivalent of Altivec) was slower than the C code compiled for the FPU. Erik -- +-----------------------------------------------------------+ Erik de Castro Lopo nospam@mega-nerd.com (Yes it's valid) +-----------------------------------------------------------+ "I consider C++ the most significant technical hazard to the survival of your project and do so without apologies." -- Alistair Cockburn
Erik de Castro Lopo wrote:

> http://www.mega-nerd.com/SRC/
Thanks!
> Sorry, no altivec. In fact, I tried vectorizing the algorithm and > it doesn't vectorize. All attempts, even handcoding the inner loop > in SSE (x86 equivalent of Altivec) was slower than the C code compiled > for the FPU.
Uhm, could you please explain me were did you find the problems in vectorizing it? It seems to me that the data access is the real bottleneck. Do you agree? bye, -- piergiorgio
Piergiorgio Sartor wrote:
> > Uhm, could you please explain me were did you find the > problems in vectorizing it?
The algorithm used in Secret Rabbit Code is due to Julius O. Smith. There is a link somewhere on the SRC web page to the Smith paper describing the algorithm.
> It seems to me that the data access is the real bottleneck. > Do you agree?
Exactly. The Julius O. Smith algorithm calculates the FIR filter coefficients on the fly using linear interpolation on a much larger table. Its this linear interpolation on the larger table which makes vectorised versions of this algorithm slower than un-vectorised versions. I am currently working on a different algorithm which is specifically designed for high performance on CPUs with Altivec or SSE instruction sets. When it works it will become part of Secret Rabbit Code. Erik -- +-----------------------------------------------------------+ Erik de Castro Lopo nospam@mega-nerd.com (Yes it's valid) +-----------------------------------------------------------+ "Java is, in many ways, C++--." -- Michael Feldman
"Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message 
news:42C20C24.D9E6B0B5@mega-nerd.com...
> Piergiorgio Sartor wrote: >> >> It seems to me that the data access is the real bottleneck. >> Do you agree? > > Exactly. The Julius O. Smith algorithm calculates the > FIR filter coefficients on the fly using linear > interpolation on a much larger table. Its this linear > interpolation on the larger table which makes vectorised > versions of this algorithm slower than un-vectorised > versions.
With the large amounts of memory available on PCs today, does it make sense to just store the whole table in RAM? Or maybe that causes problems with the cache.
in article V_owe.9690$Xr6.7335@trnddc07, Jon Harris at
jon99_harris7@hotmail.com wrote on 06/28/2005 23:57:

> "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message > news:42C20C24.D9E6B0B5@mega-nerd.com... >> Piergiorgio Sartor wrote: >>> >>> It seems to me that the data access is the real bottleneck. >>> Do you agree? >> >> Exactly. The Julius O. Smith algorithm calculates the >> FIR filter coefficients on the fly using linear >> interpolation on a much larger table. Its this linear >> interpolation on the larger table which makes vectorised >> versions of this algorithm slower than un-vectorised >> versions. > > With the large amounts of memory available on PCs today, does it make sense to > just store the whole table in RAM? Or maybe that causes problems with the > cache.
maybe. i have calculated that, for 120 dB S/N (the N due to aliasing of the images not completely killed), you need about 512 phases or fractional delays (each with their own FIR coefficient set) if linear interpolation is used in between those phases to get an arbitrarily precise fractional delay. if no interpolation is used (strictly speaking, it's zero-order hold or "drop-sample" interpolation), you would need 512K phases. if each phase gets a 32 tap FIR, that would be 16 Meg word (and if symmetry is taken advantage of, then it's 8 Meg). that's still seems to be a pretty wasteful use of memory if you can reduce that by a factor of 1000 just by using linear interpolation. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
"robert bristow-johnson" <rbj@audioimagination.com> wrote in message 
news:BEE7A025.89FE%rbj@audioimagination.com...
> in article V_owe.9690$Xr6.7335@trnddc07, Jon Harris at > jon99_harris7@hotmail.com wrote on 06/28/2005 23:57: > >> "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message >> news:42C20C24.D9E6B0B5@mega-nerd.com... >>> Piergiorgio Sartor wrote: >>>> >>>> It seems to me that the data access is the real bottleneck. >>>> Do you agree? >>> >>> Exactly. The Julius O. Smith algorithm calculates the >>> FIR filter coefficients on the fly using linear >>> interpolation on a much larger table. Its this linear >>> interpolation on the larger table which makes vectorised >>> versions of this algorithm slower than un-vectorised >>> versions. >> >> With the large amounts of memory available on PCs today, does it make sense >> to >> just store the whole table in RAM? Or maybe that causes problems with the >> cache. > > maybe. i have calculated that, for 120 dB S/N (the N due to aliasing of the > images not completely killed), you need about 512 phases or fractional > delays (each with their own FIR coefficient set) if linear interpolation is > used in between those phases to get an arbitrarily precise fractional delay. > if no interpolation is used (strictly speaking, it's zero-order hold or > "drop-sample" interpolation), you would need 512K phases. if each phase > gets a 32 tap FIR, that would be 16 Meg word (and if symmetry is taken > advantage of, then it's 8 Meg). that's still seems to be a pretty wasteful > use of memory if you can reduce that by a factor of 1000 just by using > linear interpolation.
Thanks for jumping in with some actual numbers. My PC has 1.5GB of memory so 8-16 MegaWords doesn't seem too bad, but I see your point. If the memory access truly is the limiting factor, eliminating linear interpolation cuts that in half, which is a good thing.
Jon Harris wrote:
> > "Erik de Castro Lopo" <nospam@mega-nerd.com> wrote in message > news:42C20C24.D9E6B0B5@mega-nerd.com... > > Piergiorgio Sartor wrote: > >> > >> It seems to me that the data access is the real bottleneck. > >> Do you agree? > > > > Exactly. The Julius O. Smith algorithm calculates the > > FIR filter coefficients on the fly using linear > > interpolation on a much larger table. Its this linear > > interpolation on the larger table which makes vectorised > > versions of this algorithm slower than un-vectorised > > versions. > > With the large amounts of memory available on PCs today, does it make sense to > just store the whole table in RAM? Or maybe that causes problems with the > cache.
The Secret Rabbit Code does store the whole table in RAM. The problem that there is no way of arranging that table so that an CPU with vector instructions (ie Altivec or SSE) can benefit. Erik -- +-----------------------------------------------------------+ Erik de Castro Lopo nospam@mega-nerd.com (Yes it's valid) +-----------------------------------------------------------+ "C++ is the only current language making COBOL look good." -- Bertrand Meyer
In article <42C250BA.66967C36@mega-nerd.com>,
Erik de Castro Lopo  <nospam@mega-nerd.com> wrote:
>... store the whole table in RAM. > >The problem that there is no way of arranging that table so >that an CPU with vector instructions (ie Altivec or SSE) can >benefit.
"that table" implies you are using a single large in-order table. I've had luck recomputing and re-striping windowed-sinc tables for the frequency ratio in use. Once the step size through the reconstruction function is known, one can recompute or pre-interpolate the tap coefficients, and move them contiguously per phase for better cache locality into a new temporary table (as well as new per phase contiguous difference values for delta phase interpolation). Depending on cache size versus table size, you might even want to keep multiple shifted copies of each phase coefficient group so that the data, the coefficient and the difference tables can always be fetched from identically aligned cache lines. But when rescaling only one channel on a fast PC these days, I sometimes don't even bother with any precomputed tables; the math lib on a 2+ GHz PPC can calculate new sin() and cos() values fast enough for a real-time multi-tap raised-cosine sinc resampling filter. IMHO. YMMV. -- Ron Nicholson rhn AT nicholson DOT com http://www.nicholson.com/rhn/ #include <canonical.disclaimer> // only my own opinions, etc.
"Ronald H. Nicholson Jr." wrote:
> > In article <42C250BA.66967C36@mega-nerd.com>, > Erik de Castro Lopo <nospam@mega-nerd.com> wrote: > >... store the whole table in RAM. > > > >The problem that there is no way of arranging that table so > >that an CPU with vector instructions (ie Altivec or SSE) can > >benefit. > > "that table" implies you are using a single large in-order table. > > I've had luck recomputing and re-striping windowed-sinc tables for the > frequency ratio in use.
Secret Rabbit Code is designed to allow a varying ratio between input and outpu sample rate.
> Once the step size through the reconstruction > function is known, one can recompute or pre-interpolate the tap > coefficients, and move them contiguously per phase for better cache > locality into a new temporary table (as well as new per phase contiguous > difference values for delta phase interpolation). Depending on cache size > versus table size, you might even want to keep multiple shifted copies > of each phase coefficient group so that the data, the coefficient and > the difference tables can always be fetched from identically aligned > cache lines.
This works for many specific cases, but not for the general case where the input and output sample rates may vary with time. Erik -- +-----------------------------------------------------------+ Erik de Castro Lopo nospam@mega-nerd.com (Yes it's valid) +-----------------------------------------------------------+ Java sucks. C sucks slightly less so, but only because it makes no pretense at all about being a high level language.