Most of you, I'm sure, will be familiar with the "Folded FIR" technique for reducing the computational load of FIR filters. For instance, in the example below, 8 multipliers and 8 adders are replaced by 4 multipliers and 7 adders.
I have a series of 5 million signal samples, and I'm looking for an computationally efficient way to implement this technique in Matlab or Octave.
If the standard FIR uses : output = filter (coeffs, 1, input_samples) then does anyone have an idea how the code could be changed to a folded FIR ?
To speed the process, I'm trying to utilize vector arithmetic rather than low-level loops.
You didn't tell us some other important things. Is the BW small compared to the sample rate? If it is you can partition the filter to an M-path filter operating at a reduced output rate and then interpolate its low clock rate back to the initial clock rate... the workload reduction is proportional to the resample rate. If the BW is not small compared to the clock rate, you can still do this with a polyphase input analysis channelizer and followed by a polyphase output synthesis channelizer. The workload will be reduced by more than a factor of 10. This is really mind blowing because it runs counter to intuition. I have done a number of these systems to implement a 1400 tap complex weight filter processing complex input to a workload of 100 real ops per input sample. I market the technology as green as opposed to clever... it sells better... Take a look at attached PDF... related example pages 25 through 29.... a 1051 tap filter implemented with 76 ops/input sample
I have many examples I can distribute but they were too big to attach... an e-mail requesting some will work... email@example.com
Good read, as always.
I've been working on channelizers this year and your various papers have saved my bacon. Last year I worked with Dragan at SpaceX. Darned small world, isn't it?
Thanks! Looks like a new book which I'll be happy to pick up when its available. The last SDR book stopped at chapter 5. Is there a set of supplements for that book too?
I'm back in Atlanta. Working in Seattle was a good experience but it stinks to be that far from home. I need to take another good Comm class. If you can suggest a good one out in your neck of the woods....
Question for you regarding the filter banks. I notice in various papers that the analysis bank either has an FFT or IFFT. I can implement an IFFT from an FFT (ignoring the scaling factor) by taking the complex conjugate of the input and then taking the complex conjugate of the outputs. This effectively just reverses the spectrum of the input and then reverses the spectrum of each output channel so they again "match" the input. The effect I notice is that it changes on which "side" of the output an adjacent alias appears. Otherwise seems to be the same. For "non-maximally decimated" type processing I can see this might matter. Anything else I'm missing?
Another question, if you are designing a filter for a channelizer are there any good tricks for the prototype filter? To get one that has a cut off at the channel edge and a pass band that occupies 60% of the channel requires a ludicrous number of taps.
Thanks to all for your interesting replies.
I should have said that yes, the impulse response is symmetric, and I'm interested in filter lengths from 8 to 100 taps. Also, I'm looking at floating-point operations on a PC, rather than dedicated FPGA or DSP processor.
However, it is interesting to see the results for different situations.
oliviert, thanks for your code. It seems to confirm what I feared, which is that the looped pre-add operations can easily negate any gains from the folded fir architecture.
I'll continue to experiment with code (I use Octave which is usually somewhat slower than Matlab), and report anything interesting.
At the moment I'm investigating code rather than general computational efficiency, but having said that, I found Fred's polyphase technique a good read.
This architecture is valid just because your filter is symmetric.
You gain a lot of time and resource when using specific architectures having a hardware Mult-Acc. On MATLAB running on your poor X86 you will loose a lot of time manipulating the data.
I tried just for the exercise, here is what I get:
L = 50000;
x = randn(1,L);
taps = [ 1 2 3 4 4 3 2 1 ];
N = length(taps);
% Standard way
tstart = tic;
y = conv(x,taps);
T1 = toc(tstart);
disp([ 'Time duration for standard way ' num2str(T1)]);
% Symmetry usage
tstart = tic;
Y = zeros(N/2,N+L-1);
Y(i,i:i+L-1) = x;
Y(N-i+1,i:i+L-1) = Y(N-i+1,i:i+L-1) + x;
y1 = taps(1:N/2)*Y;
T2 = toc(tstart);
disp([ 'Time duration for symmetry ' num2str(T2)]);
disp([ 'Maximum Error : ' num2str(max(abs(y-y1)))]);
Above 50000 samples, the symmetric way is slower.
You mean pre-add structure. Nothing folded about, just misnomer.
matlab filter function is much faster than loops of convolution as it uses vector computations. so I don't get it really. The results should be identical if adder is given full bit growth.
Many of the matlab/octave filter functions use fast convolution once the length of the filter impulse response gets very large. You don't mention how long your filter impulse response is, but that will make a difference in what technique might give you the best speed advantage.
As mentioned, the folding technique only works for a symmetric impulse response, but that's pretty common and I'm assuming yours is symmetric.
I think the bottom line is that if you really want to use a folded filter with vector instructions you'll have to code that yourself. Maybe make a new vector or matrix where all the pre-adding is done and then convolve the half-impulse-response with that or something.
In addition to the responses already given, I believe that MatLab is likely to have a very optimized compiled routine that is called for filter() or conv(), that also makes use of any pipelining in the underlying machine architecture.
It seems highly unlikely that you can beat their implementation by manipulating the data.
In a real DSP situation, with a true Multiply-Accumulate (MAC), you can view that either the multiply or the add comes for free, and the total number of instructions (clock tics) will b 8.