Low memory footprint decimation

Started by Piotr Wyderski September 7, 2017

So I finally have some time to return to the problem of the
multichannel decimation on PSOC5LP. The situation is as follows:
there are 8 channels of 12 bits@100kHz each and a single digital 
quadrature mixer running at 310kHz, also 12 bits. The hardware
is an 80MHz ARM Cortex-M3 equipped with a coprocessor called DFB,
running at the same speed, with single-cycle 24x24->48-bit MAC
and 256 24-bit memory cells. The final processing will be handled
by the ARM, but since all the input data streams are heavily
oversampled (by a factor of ~100), I'd like to do as much
preprocessing as I can on the DFB in order not to swamp the ARM
with a massive amount of redundant data.

Since the 310kHz I/Q data stream is in fact composed of two
155kHz independent streams with exactly the same filtering
requirements and 155kHz is, by pure accident, close to the
remaining 100kHz streams, it effectively boils down to a 10
channel decimation by as much as possible using the same
structure. There are enough MIPS, but the RAM capacity is
effectively 256/10=25 cells per stream. This immediately
wipes out all the polyphase FIR techniques you told me about
previously. The only option on the table is a cascade of
decimating IIR filters. Question #1: what would be appropriate
here? It can be the CIC, but it leaves the MAC unit idle -- maybe
it could be used somehow?

But if not, then what should the CIC structure look like?
Each order K filter decimating by M requires 2K cells for
the delay storage + one cell for the decimation counter.
The input data stream is 12 bits wide and the word length
is 24 bits, which does not allow high K. 4 or 5 is the
absolute maximum. The required headroom for the integrators
grows only logarithmically with M, so a high value of M
is tempting. OTOH, the higher M is, the closer I slide to
the left on the main CIC lobe, decreasing the antialiasing
attenuation, so it seems to be pointless to push M very high.
It also makes no sense to require the stopband attenuation to
be higher than the max height of the second lobe, and since
the main and the second lobe's attenuations are equal at about
1/5th of the normalized frequency, the max. useful decimation
factor per stage is also about 5. Question #2: is this
reasoning acceptably correct?
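
To make the cell count concrete, here is a minimal Python sketch of a
single order-K CIC stage decimating by M. It assumes the 24-bit
accumulators wrap around (two's complement) rather than saturate. It is
only a behavioural model, not DFB code, and the names CicDecimator, wrap
and push are made up for illustration.

```python
WORD_BITS = 24                      # assumed accumulator width
MASK = (1 << WORD_BITS) - 1


def wrap(x):
    """Wrap an integer to signed 24-bit two's complement."""
    x &= MASK
    return x - (1 << WORD_BITS) if x & (1 << (WORD_BITS - 1)) else x


class CicDecimator:
    """Order-K CIC decimating by M: K integrators, K combs, one counter."""

    def __init__(self, order, ratio):
        self.order = order
        self.ratio = ratio
        self.integrators = [0] * order   # K cells, updated at the input rate
        self.combs = [0] * order         # K cells, updated at the output rate
        self.counter = 0                 # 1 cell

    def cells(self):
        """Storage used by this stage: 2K delay cells plus the counter."""
        return 2 * self.order + 1

    def push(self, sample):
        """Feed one input sample; return an output on every M-th call, else None."""
        acc = sample
        for i in range(self.order):
            acc = wrap(self.integrators[i] + acc)
            self.integrators[i] = acc
        self.counter += 1
        if self.counter < self.ratio:
            return None
        self.counter = 0
        for i in range(self.order):
            previous = self.combs[i]
            self.combs[i] = acc
            acc = wrap(acc - previous)
        return acc


if __name__ == "__main__":
    stage = CicDecimator(order=4, ratio=4)
    outputs = [y for y in (stage.push(2047) for _ in range(4000)) if y is not None]
    print(stage.cells(), outputs[-1])
```

With wrap-around arithmetic the integrators may overflow freely; the
output is still exact as long as the final result itself fits in the
word, which is why only the total gain M^K matters for the headroom.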

If yes, then a quick calculation shows that the max.
available M is 20 for K=3 (23.97 bits), 9 for K=4 (23.68 bits)
and 6 for K=5 (23.93 bits). OTOH, the SNR grows by 1 bit
for every 4-fold decrease in the sampling rate, so having
that low integrator headroom it looks wise to use that bit
to its full capacity, i.e. making the decimation factor
equal 4^N. This + all the above means that M should be 4.
Question #3: is this reasoning acceptably correct?
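
The headroom figures can be checked with a few lines of Python. This is
a sketch under one assumption: the word length needed is taken as
input_bits + K*log2(M), and the limits quoted above (20, 9 and 6) come
out when the input is counted as 11 magnitude bits, i.e. with the sign
bit excluded. The function names are hypothetical.

```python
import math


def bits_needed(input_bits, order, ratio):
    """Word length needed by an order-K CIC decimating by M."""
    return input_bits + order * math.log2(ratio)


def max_ratio(input_bits, word_bits, order):
    """Largest M with M**order <= 2**(word_bits - input_bits), in exact integers."""
    budget = 1 << (word_bits - input_bits)
    m = 1
    while (m + 1) ** order <= budget:
        m += 1
    return m


if __name__ == "__main__":
    for k in (3, 4, 5):
        # input_bits=11 is an assumption about how the quoted limits were derived
        m = max_ratio(input_bits=11, word_bits=24, order=k)
        print(f"K={k}: max M={m}, bits needed={bits_needed(11, k, m):.2f}")
```

Counting the full 12-bit signed word instead (input_bits=12) gives the
smaller limits 16, 8 and 5, so M=4 fits either way for K up to 5.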

If yes, then a cascade of 3 such CICs should do the job,
providing a decimation factor of 64. For K=4 they would
require (2*4+1)*3=27 storage cells per channel, which
doesn't fit in the chip. But if I somehow combine all
the decimation counters into one cell it makes (2*4)*3+1=25,
which barely fits. The second saving may come from the
fact that the last comb section runs at exactly the
final frequency and so can be calculated by the ARM.
It removes 4 cells per channel, so the total memory
footprint would be 21. Then I could make the last stage
5th order, i.e. use 22 cells.
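
The cell counts above can be tallied mechanically. The following Python
sketch only restates the bookkeeping of this paragraph (K integrator
cells and K comb cells per stage, plus counters); the function name and
its flags are hypothetical.

```python
def cic_cells(orders, shared_counter=False, last_combs_on_host=False):
    """Count DFB storage cells per channel for a cascade of CIC stages.

    orders             -- CIC order K of each stage, first stage first
    shared_counter     -- one decimation counter for the whole cascade
    last_combs_on_host -- comb section of the last stage computed by the ARM
    """
    cells = 0
    for index, k in enumerate(orders):
        is_last = index == len(orders) - 1
        cells += k                          # integrator delays
        if not (is_last and last_combs_on_host):
            cells += k                      # comb delays
    cells += 1 if shared_counter else len(orders)
    return cells


if __name__ == "__main__":
    print(cic_cells([4, 4, 4]))                                # 27
    print(cic_cells([4, 4, 4], shared_counter=True))           # 25
    print(cic_cells([4, 4, 4], True, last_combs_on_host=True)) # 21
    print(cic_cells([4, 4, 5], True, last_combs_on_host=True)) # 22
```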

Sincere thanks to anyone who managed to read this far,
but I wanted to make my reasoning explicit in order to make
it easy for the experts to spot and correct the mistakes in my reasoning.

I am also extremely curious if there are better, MAC-based
factor ~60 decimators which would fit within 25 cells.

	Best regards, Piotr