Reply by Randy Yates May 17, 2017
rickman <gnuarm@gmail.com> writes:

> On 5/16/2017 3:40 PM, Randy Yates wrote:
>> Randy Yates <randyy@garnerundergroundinc.com> writes:
>>
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/16/2017 1:11 PM, Randy Yates wrote:
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>
>>>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>>>
>>>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>>>> rickman wrote:
>>>>>>>>>
>>>>>>>>>> Are the 248 words for data or also program?
>>>>>>>>>
>>>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>>>> in parallel (they take both paths simultaneously in order to
>>>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>>>> But there is no such thing as a program counter, the code
>>>>>>>>> memories store instructions at consecutive addresses without
>>>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>>>> is also a third control memory which stores the information
>>>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>>>> which adds fun.
>>>>>>>>>
>>>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>>>
>>>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>>>> the past data values?
>>>>>>>>
>>>>>>>> Like I said, I wrote this a long time ago, so I am not the
>>>>>>>> resource to be asking. I can't picture how it works, but I
>>>>>>>> specifically remember NOT needing to store previous data
>>>>>>>> inputs. I am thinking you store outputs, which will be fewer
>>>>>>>> because of the decimation. As the input data comes in, the
>>>>>>>> output samples can be built up, and when one is complete it is
>>>>>>>> output and replaced by a new one.
>>>>>>>>
>>>>>>>> But read a proper reference. If I wasn't busy today I'd dig
>>>>>>>> this up for you.
>>>>>>>
>>>>>>> Yes you do need to store N-1 previous inputs (plus 1 current
>>>>>>> input) for a length N polyphase FIR decimation by M. Just think
>>>>>>> first principles: for any output y[k], you need N inputs to
>>>>>>> compute it, even if k = n*M.
>>>>>>
>>>>>> I am pretty sure that conclusion is not correct. When I was asked
>>>>>> to implement a polyphase filter, that was the part I had trouble
>>>>>> getting my head around but finally understood. At the time this
>>>>>> was being done on a very early DSP chip with very limited memory,
>>>>>> so storing the N-1 previous samples was not an option. The guy
>>>>>> presenting the problem to me was the classic engineer who could
>>>>>> read and understand things, but couldn't explain anything to
>>>>>> other people. So I had to take the paper he gave me and finally
>>>>>> figure it out for myself.
>>>>>>
>>>>>> I will dig around when I get a chance and figure out exactly how
>>>>>> this worked. But the basic idea is that instead of saving
>>>>>> multiple inputs and calculating the outputs one at a time
>>>>>> (iterating over the stored inputs), as each input comes in it is
>>>>>> iterated over the stored outputs, and each output is ... well,
>>>>>> output, when it has been fully calculated. Since there are fewer
>>>>>> outputs than inputs (by the decimation factor of M), you store
>>>>>> fewer outputs than you would inputs.
>>>>>>
>>>>>> This is not hard to understand if you just stop thinking it *has*
>>>>>> to be the way you are currently seeing it. (Shown here with
>>>>>> simplified notation since it is too messy to show the indexes
>>>>>> with decimation.)
>>>>>>
>>>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>>>
>>>>>> Instead of thinking you have to calculate all the terms of y(k)
>>>>>> at one time (and so store the last N*M values of x), consider
>>>>>> that when x(i) arrives, you can calculate the x(i) terms and add
>>>>>> them into y(k) (which is then output) and y(k+1), y(k+2)... This
>>>>>> allows the storage of fewer values by a factor of M. It does
>>>>>> require more memory fetches than the standard MAC calculation,
>>>>>> since you have to fetch not only the coefficient and the input,
>>>>>> but also the intermediate output value being calculated on each
>>>>>> MAC operation.
>>>>>>
>>>>>> Maybe I won't have to dig it up. I'm pretty sure this is how it
>>>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>>>> minimum number of calculations on arrival of each input, since
>>>>>> not all the coefficients are used on each input. So the
>>>>>> coefficients are divided into "phases" (N/M coefficients) with
>>>>>> only one phase used for any given input sample.
>>>>>
>>>>> Counter-example: Consider M = 8, N = 8 (8 coefficients and
>>>>> decimating by 8). Then each output y(k) depends on x(k*M + n),
>>>>> 0 <= n <= M - 1, and thus the sets of input values for different
>>>>> outputs are disjoint.
>>>>
>>>> Yes, so what? You still don't need to store any inputs to calculate
>>>> the output.
>>>
>>> Where would they come from if they were not stored?
>>
>> OK I see what you mean now. You can do a running MAC as they come in.
>> That is certainly true.
>
> Ok, glad we got past that.
Yeah, thanks for hanging with me, Rick. For some reason I kept on
overlooking the key concept, which you plainly stated:

  Instead of thinking you have to calculate all the terms of y(k) AT
  ONE TIME (and so store the last N*M values of x), consider that when
  x(i) ...

Mea culpa.
--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
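[Editor's note: the following sketch is mine, not code from the thread. It illustrates the accumulate-into-outputs scheme Rick describes, under the assumed output alignment y[k] = sum_n a[n]*x[k*M + M - 1 - n] with zero initial history; the function name and index bookkeeping are my own.]

```python
def polyphase_decimate(x, a, M):
    """Decimate-by-M FIR using output accumulators instead of an input
    delay line.  Each arriving sample is multiplied by the one polyphase
    branch of coefficients that involves it and added into the partial
    outputs; a finished output is emitted and its accumulator recycled,
    so at most ceil(N/M) partial outputs are stored at any time."""
    N = len(a)
    acc = {}                        # partial outputs, keyed by output index k
    out = []
    for i, s in enumerate(x):
        # Coefficient indices pairing sample i with output k satisfy
        # n = k*M + (M - 1) - i, so they step through one phase: n0, n0+M, ...
        n0 = (M - 1 - i) % M
        for n in range(n0, N, M):
            k = (i + n - (M - 1)) // M
            if k >= 0:              # guard against pre-history outputs
                acc[k] = acc.get(k, 0.0) + a[n] * s
        # Output k is complete once its newest sample, i = k*M + M - 1, arrives.
        if (i + 1) % M == 0:
            out.append(acc.pop((i + 1) // M - 1, 0.0))
    return out
```

Note that `acc` never holds more than ceil(N/M) entries, versus the N stored inputs of the usual delay-line form, which is the storage saving the thread converges on.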
Reply by rickman May 16, 2017
On 5/16/2017 3:40 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>> rickman <gnuarm@gmail.com> writes:
>>
>>> Yes, so what? You still don't need to store any inputs to calculate
>>> the output.
>>
>> Where would they come from if they were not stored?
>
> OK I see what you mean now. You can do a running MAC as they come in.
> That is certainly true.
Ok, glad we got past that.
--
Rick C
Reply by rickman May 16, 2017
On 5/16/2017 3:50 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>> Randy Yates <randyy@garnerundergroundinc.com> writes:
>>> eric.jacobsen@ieee.org writes:
>>>
>>>> How each input will be used for each output is known when that
>>>> input arrives, so if all possible outputs for that input have their
>>>> own partial product accumulator, the input can be multiplied by the
>>>> various coefficients for each partial product and accumulated in
>>>> each, and then forgotten.
>>>>
>>>> It's just another way to do it, but it can save a lot of hardware
>>>> in a hardware implementation, especially in an FPGA where the
>>>> multiply-accumulators are already built and just laying around,
>>>> anyway.
>>>
>>> So are you asserting this is not a counter-example?
>>>
>>> To state that an N-tap FIR decimator does not require all N inputs
>>> for an output at time k is, in general, not true, as I have shown.
>>>
>>> I believe what you are claiming is that IN CERTAIN CASES you can get
>>> away with not using all N inputs at output time k. That I can agree
>>> with. However, neither you nor Rick put a constraint on your
>>> assertion.
>>>
>>> And I can see that, in certain situations, this can be mo' better,
>>> namely when you have one or more MACs sitting around not being used
>>> (as you said) and N > M. But in general it hardly seems a storage
>>> saver, as a MAC is a storage element.
>>
>> In light of the light coming on due to Rick's post that came about
>> the same time as I was writing this, I'll have to retract.
>
> This is a type of systolic array?
I've not seen the two things connected. This method of calculating a
polyphase filter does not require multiple processors, which is the
only context in which I've seen the term "systolic array" used. I
believe that is just a collection of processing elements that each pump
data between them at a regular rate and in a run-time fixed pattern.
Certainly any FIR filter would be a good candidate for a systolic
array.

I think systolic arrays are not so much used these days. Processing
speeds have increased so much that there are a lot more things that can
be done on a single processing element without breaking the work down
into pieces. I suppose radar work will always find a need for more
processing, and perhaps sonar. I think the cell phone industry has been
a real boon to many other fields in terms of pushing the state of the
art in DSP hardware.

--
Rick C
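[Editor's note: to connect the two ideas, here is a sketch of mine, not from the thread, of a transposed-form FIR. This is the software analogue of the structure that maps most directly onto a systolic/pipelined array: the current input is broadcast to every coefficient, and partial sums ripple one register stage per sample toward the output, playing the same role as the per-output accumulators discussed in the thread. Assumes at least two taps.]

```python
def transposed_fir(x, a):
    """Transposed direct-form FIR: y[i] = sum_n a[n] * x[i-n].
    Each sample is broadcast to all N multipliers and the products are
    added into a chain of N-1 partial-sum registers that shift one stage
    toward the output per sample (no input delay line is kept)."""
    N = len(a)
    z = [0.0] * (N - 1)             # partial-sum register chain
    out = []
    for s in x:
        out.append(a[0] * s + z[0])            # output stage completes
        for j in range(N - 2):                 # shift partial sums forward
            z[j] = a[j + 1] * s + z[j + 1]
        z[N - 2] = a[N - 1] * s                # newest partial sum enters
    return out
```

In hardware, each (multiplier, adder, register) triple becomes one cell of the array, with only the input sample and the partial sum crossing cell boundaries, which is why this form pipelines so well in an FPGA.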
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> Randy Yates <randyy@garnerundergroundinc.com> writes:
>
>> And I can see that, in certain situations, this can be mo' better,
>> namely when you have one or more MACs sitting around not being used
>> (as you said) and N > M. But in general it hardly seems a storage
>> saver, as a MAC is a storage element.
>
> In light of the light coming on due to Rick's post that came about the
> same time as I was writing this, I'll have to retract.
This is a type of systolic array?
--
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> eric.jacobsen@ieee.org writes:
>
>> How each input will be used for each output is known when that input
>> arrives, so if all possible outputs for that input have their own
>> partial product accumulator, the input can be multiplied by the
>> various coefficients for each partial product and accumulated in
>> each, and then forgotten.
>>
>> It's just another way to do it, but it can save a lot of hardware in
>> a hardware implementation, especially in an FPGA where the
>> multiply-accumulators are already built and just laying around,
>> anyway.
>
> So are you asserting this is not a counter-example?
>
> To state that an N-tap FIR decimator does not require all N inputs for
> an output at time k is, in general, not true, as I have shown.
>
> I believe what you are claiming is that IN CERTAIN CASES you can get
> away with not using all N inputs at output time k. That I can agree
> with. However, neither you nor Rick put a constraint on your
> assertion.
>
> And I can see that, in certain situations, this can be mo' better,
> namely when you have one or more MACs sitting around not being used
> (as you said) and N > M. But in general it hardly seems a storage
> saver, as a MAC is a storage element.
In light of the light coming on due to Rick's post that came about the
same time as I was writing this, I'll have to retract.
--
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> rickman <gnuarm@gmail.com> writes: > >> On 5/16/2017 1:11 PM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>>> rickman <gnuarm@gmail.com> writes: >>>>> >>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>>> rickman wrote: >>>>>>> >>>>>>>> Are the 248 words for data or also program? >>>>>>> >>>>>>> Data only, there are dedicated code memories. The control part >>>>>>> is totally crazy: there are two separate code memories working >>>>>>> in parallel (they take both paths simultaneosly in order to >>>>>>> provide instructions with no delay in the case of a branch). >>>>>>> But there is no such thing as a program counter, the code >>>>>>> memories store instructions at consecutive addresses without >>>>>>> any deeper structure. A run of instructions is called a block >>>>>>> and is ended with the presence of a jump instruction. There >>>>>>> is also a third control memory which stores the information >>>>>>> where the i-th block begins and to which j-th state the FSM >>>>>>> should go in the case of a branch. In short, hardware basic >>>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>>> which adds fun. >>>>>>> >>>>>>>> That should be plenty of room for coefficients. >>>>>>> >>>>>>> But doesn't the FIR require a lot of cells for >>>>>>> the past data values? >>>>>> >>>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>>> be asking. I can't picture how it works, but I specifically remember >>>>>> NOT needing to store previous data inputs. I am thinking you store >>>>>> outputs which will be fewer because of the decimation. As the input >>>>>> data comes in the output samples can be built up and when one is >>>>>> complete it is outputted and replaced by a new one. >>>>>> >>>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>>> for you. 
>>>>> >>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>>> a length N polyphase FIR decimation by M. Just think first principles: >>>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>>> >>>> I am pretty sure that conclusion is not correct. When I was asked to >>>> implement a polyphase filter that was the part I had trouble getting >>>> my head around but finally understood how it worked. At the time this >>>> was being done on a very early DSP chip with very limited memory, so >>>> storing the N-1 previous samples was not an option. The guy >>>> presenting the problem to me was the classic engineer who could read >>>> and understand things, but couldn't explain anything to other people. >>>> So I had to take the paper he gave me and finally figure it out for >>>> myself. >>>> >>>> I will dig around when I get a chance and figure out exactly how this >>>> worked. But the basic idea is that instead of saving multiple inputs >>>> and calculating the outputs one at a time (iterating over the stored >>>> inputs) - as each input comes in it is iterated over the stored >>>> outputs and each output is ... well output, when it has been fully >>>> calculated. Since there are fewer outputs than inputs (by the >>>> decimation factor of M) you store fewer outputs than you would inputs. >>>> >>>> This is not hard to understand if you just stop thinking it *has* to >>>> be the way you are currently seeing it. (shown here with simplified >>>> notation since it is too messy to show the indexes with decimation) >>>> >>>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... 
>>>> >>>> Instead of thinking you have to calculate all the terms of y(k) at one >>>> time (and so store the last N*M values of x), consider that when x(i) >>>> arrives, you can calculate the x(i) terms and add them into y(k) >>>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>>> of fewer values by a factor of M. It does require more memory fetches >>>> than the standard MAC calculation since you have to fetch not only the >>>> coefficient and the input, but also the intermediate output value >>>> being calculated on each MAC operation. >>>> >>>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>>> worked. It ties in nicely with the "polyphase" aspect to allow a >>>> minimum number of calculations on arrival of each input since not all >>>> the coefficients are used on each input. So the coefficients are >>>> divided into "phases" (N/M coefficients) with only one phase used for >>>> any given input sample. >>> >>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >>> the sets of input values for different outputs are disjoint. >> >> Yes, so what? You still don't need to store any inputs to calculate >> the output. > > Where would they come from if they were not stored?
OK, I see what you mean now. You can do a running MAC on the inputs as they come in. That is certainly true. -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
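The "running MAC" conceded above can be sketched for the counter-example's special case N = M, where the input sets for successive outputs are disjoint. This is a hypothetical illustration (the function and variable names are mine, not from the thread): a single accumulator suffices, and no input history is ever stored.

```python
def decimate_running_mac(samples, coeffs):
    """Decimate-by-M FIR with N == M taps using one running accumulator.

    Each arriving input is multiplied by the coefficient for its position
    within the current block of M inputs (newest input gets a[0], oldest
    gets a[M-1]) and added into the accumulator; every M-th input
    completes an output, which is emitted and the accumulator cleared.
    """
    M = len(coeffs)
    acc = 0.0
    outputs = []
    for i, x in enumerate(samples):
        acc += coeffs[M - 1 - (i % M)] * x  # one MAC per input, no history stored
        if i % M == M - 1:                  # M inputs consumed: output complete
            outputs.append(acc)
            acc = 0.0
    return outputs
```

Since only one coefficient is touched per input, this also matches the "one phase per input sample" observation made earlier in the thread.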
Reply by Randy Yates May 16, 2017
rickman <gnuarm@gmail.com> writes:

> On 5/16/2017 1:11 PM, Randy Yates wrote: >> rickman <gnuarm@gmail.com> writes: >> >>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>> rickman <gnuarm@gmail.com> writes: >>>> >>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>> rickman wrote: >>>>>> >>>>>>> Are the 248 words for data or also program? >>>>>> >>>>>> Data only, there are dedicated code memories. The control part >>>>>> is totally crazy: there are two separate code memories working >>>>>> in parallel (they take both paths simultaneosly in order to >>>>>> provide instructions with no delay in the case of a branch). >>>>>> But there is no such thing as a program counter, the code >>>>>> memories store instructions at consecutive addresses without >>>>>> any deeper structure. A run of instructions is called a block >>>>>> and is ended with the presence of a jump instruction. There >>>>>> is also a third control memory which stores the information >>>>>> where the i-th block begins and to which j-th state the FSM >>>>>> should go in the case of a branch. In short, hardware basic >>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>> which adds fun. >>>>>> >>>>>>> That should be plenty of room for coefficients. >>>>>> >>>>>> But doesn't the FIR require a lot of cells for >>>>>> the past data values? >>>>> >>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>> be asking. I can't picture how it works, but I specifically remember >>>>> NOT needing to store previous data inputs. I am thinking you store >>>>> outputs which will be fewer because of the decimation. As the input >>>>> data comes in the output samples can be built up and when one is >>>>> complete it is outputted and replaced by a new one. >>>>> >>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>> for you. >>>> >>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>> a length N polyphase FIR decimation by M. 
Just think first principles: >>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>> >>> I am pretty sure that conclusion is not correct. When I was asked to >>> implement a polyphase filter that was the part I had trouble getting >>> my head around but finally understood how it worked. At the time this >>> was being done on a very early DSP chip with very limited memory, so >>> storing the N-1 previous samples was not an option. The guy >>> presenting the problem to me was the classic engineer who could read >>> and understand things, but couldn't explain anything to other people. >>> So I had to take the paper he gave me and finally figure it out for >>> myself. >>> >>> I will dig around when I get a chance and figure out exactly how this >>> worked. But the basic idea is that instead of saving multiple inputs >>> and calculating the outputs one at a time (iterating over the stored >>> inputs) - as each input comes in it is iterated over the stored >>> outputs and each output is ... well output, when it has been fully >>> calculated. Since there are fewer outputs than inputs (by the >>> decimation factor of M) you store fewer outputs than you would inputs. >>> >>> This is not hard to understand if you just stop thinking it *has* to >>> be the way you are currently seeing it. (shown here with simplified >>> notation since it is too messy to show the indexes with decimation) >>> >>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >>> >>> Instead of thinking you have to calculate all the terms of y(k) at one >>> time (and so store the last N*M values of x), consider that when x(i) >>> arrives, you can calculate the x(i) terms and add them into y(k) >>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>> of fewer values by a factor of M. 
It does require more memory fetches >>> than the standard MAC calculation since you have to fetch not only the >>> coefficient and the input, but also the intermediate output value >>> being calculated on each MAC operation. >>> >>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>> worked. It ties in nicely with the "polyphase" aspect to allow a >>> minimum number of calculations on arrival of each input since not all >>> the coefficients are used on each input. So the coefficients are >>> divided into "phases" (N/M coefficients) with only one phase used for >>> any given input sample. >> >> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >> the sets of input values for different outputs are disjoint. > > Yes, so what? You still don't need to store any inputs to calculate > the output.
Where would they come from if they were not stored? -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
Reply by rickman May 16, 2017
On 5/16/2017 11:59 AM, eric.jacobsen@ieee.org wrote:
> On Tue, 16 May 2017 02:59:51 -0400, rickman <gnuarm@gmail.com> wrote: > >> On 5/15/2017 11:49 AM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>> rickman wrote: >>>>> >>>>>> Are the 248 words for data or also program? >>>>> >>>>> Data only, there are dedicated code memories. The control part >>>>> is totally crazy: there are two separate code memories working >>>>> in parallel (they take both paths simultaneosly in order to >>>>> provide instructions with no delay in the case of a branch). >>>>> But there is no such thing as a program counter, the code >>>>> memories store instructions at consecutive addresses without >>>>> any deeper structure. A run of instructions is called a block >>>>> and is ended with the presence of a jump instruction. There >>>>> is also a third control memory which stores the information >>>>> where the i-th block begins and to which j-th state the FSM >>>>> should go in the case of a branch. In short, hardware basic >>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>> which adds fun. >>>>> >>>>>> That should be plenty of room for coefficients. >>>>> >>>>> But doesn't the FIR require a lot of cells for >>>>> the past data values? >>>> >>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>> be asking. I can't picture how it works, but I specifically remember >>>> NOT needing to store previous data inputs. I am thinking you store >>>> outputs which will be fewer because of the decimation. As the input >>>> data comes in the output samples can be built up and when one is >>>> complete it is outputted and replaced by a new one. >>>> >>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>> for you. >>> >>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>> a length N polyphase FIR decimation by M. 
Just think first principles: >>> for any output y[k], you need N inputs to compute it, even if k = n*M. >> >> I am pretty sure that conclusion is not correct. When I was asked to >> implement a polyphase filter that was the part I had trouble getting my >> head around but finally understood how it worked. At the time this was >> being done on a very early DSP chip with very limited memory, so storing >> the N-1 previous samples was not an option. The guy presenting the >> problem to me was the classic engineer who could read and understand >> things, but couldn't explain anything to other people. So I had to take >> the paper he gave me and finally figure it out for myself. >> >> I will dig around when I get a chance and figure out exactly how this >> worked. But the basic idea is that instead of saving multiple inputs >> and calculating the outputs one at a time (iterating over the stored >> inputs) - as each input comes in it is iterated over the stored outputs >> and each output is ... well output, when it has been fully calculated. >> Since there are fewer outputs than inputs (by the decimation factor of >> M) you store fewer outputs than you would inputs. >> >> This is not hard to understand if you just stop thinking it *has* to be >> the way you are currently seeing it. (shown here with simplified >> notation since it is too messy to show the indexes with decimation) >> >> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >> >> Instead of thinking you have to calculate all the terms of y(k) at one >> time (and so store the last N*M values of x), consider that when x(i) >> arrives, you can calculate the x(i) terms and add them into y(k) (which >> is then output) and y(k+1), y(k+2)... This allows the storage of fewer >> values by a factor of M. 
It does require more memory fetches than the >> standard MAC calculation since you have to fetch not only the >> coefficient and the input, but also the intermediate output value being >> calculated on each MAC operation. >> >> Maybe I won't have to dig it up. I'm pretty sure this is how it worked. >> It ties in nicely with the "polyphase" aspect to allow a minimum >> number of calculations on arrival of each input since not all the >> coefficients are used on each input. So the coefficients are divided >> into "phases" (N/M coefficients) with only one phase used for any given >> input sample. >> >> -- >> >> Rick C > > You either need to store N-1 previous input samples or you need enough > running partial product accumulators with the proper coefficients > selected at each input. I usually do the latter, but either works.
Exactly. But if you are decimating, storing partially computed output samples takes less storage than storing input values. -- Rick C
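Rick's scheme of accumulating each arriving input into the pending partial outputs can be sketched as below. This is a hedged illustration under my own naming and indexing conventions (essentially a "transposed" polyphase decimator with zero initial history assumed): only ceil(N/M) partial output accumulators are stored, instead of N past inputs.

```python
from collections import deque

def polyphase_decimate(samples, coeffs, M):
    """Decimate-by-M FIR storing ceil(N/M) partial outputs, not N inputs."""
    N = len(coeffs)
    P = -(-N // M)                  # ceil(N/M): number of pending partial outputs
    acc = deque([0.0] * P)          # acc[0] is the output that completes soonest
    outputs = []
    for i, x in enumerate(samples):
        phase = i % M               # position of x within its block of M inputs
        for j in range(P):
            # tap index this input feeds in the output j blocks ahead
            n = j * M + (M - 1 - phase)
            if n < N:
                acc[j] += coeffs[n] * x
        if phase == M - 1:          # block complete: acc[0] is a finished output
            outputs.append(acc.popleft())
            acc.append(0.0)
    return outputs
```

The storage saving is the factor-of-M reduction discussed above: for N = 32 taps and M = 8, four accumulators replace a 32-sample delay line, at the cost of up to P read-modify-write accumulator accesses per input.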
Reply by rickman May 16, 2017
On 5/16/2017 1:11 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes: > >> On 5/15/2017 11:49 AM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>> rickman wrote: >>>>> >>>>>> Are the 248 words for data or also program? >>>>> >>>>> Data only, there are dedicated code memories. The control part >>>>> is totally crazy: there are two separate code memories working >>>>> in parallel (they take both paths simultaneosly in order to >>>>> provide instructions with no delay in the case of a branch). >>>>> But there is no such thing as a program counter, the code >>>>> memories store instructions at consecutive addresses without >>>>> any deeper structure. A run of instructions is called a block >>>>> and is ended with the presence of a jump instruction. There >>>>> is also a third control memory which stores the information >>>>> where the i-th block begins and to which j-th state the FSM >>>>> should go in the case of a branch. In short, hardware basic >>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>> which adds fun. >>>>> >>>>>> That should be plenty of room for coefficients. >>>>> >>>>> But doesn't the FIR require a lot of cells for >>>>> the past data values? >>>> >>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>> be asking. I can't picture how it works, but I specifically remember >>>> NOT needing to store previous data inputs. I am thinking you store >>>> outputs which will be fewer because of the decimation. As the input >>>> data comes in the output samples can be built up and when one is >>>> complete it is outputted and replaced by a new one. >>>> >>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>> for you. >>> >>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>> a length N polyphase FIR decimation by M. Just think first principles: >>> for any output y[k], you need N inputs to compute it, even if k = n*M. 
>> >> I am pretty sure that conclusion is not correct. When I was asked to >> implement a polyphase filter that was the part I had trouble getting >> my head around but finally understood how it worked. At the time this >> was being done on a very early DSP chip with very limited memory, so >> storing the N-1 previous samples was not an option. The guy >> presenting the problem to me was the classic engineer who could read >> and understand things, but couldn't explain anything to other people. >> So I had to take the paper he gave me and finally figure it out for >> myself. >> >> I will dig around when I get a chance and figure out exactly how this >> worked. But the basic idea is that instead of saving multiple inputs >> and calculating the outputs one at a time (iterating over the stored >> inputs) - as each input comes in it is iterated over the stored >> outputs and each output is ... well output, when it has been fully >> calculated. Since there are fewer outputs than inputs (by the >> decimation factor of M) you store fewer outputs than you would inputs. >> >> This is not hard to understand if you just stop thinking it *has* to >> be the way you are currently seeing it. (shown here with simplified >> notation since it is too messy to show the indexes with decimation) >> >> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >> >> Instead of thinking you have to calculate all the terms of y(k) at one >> time (and so store the last N*M values of x), consider that when x(i) >> arrives, you can calculate the x(i) terms and add them into y(k) >> (which is then output) and y(k+1), y(k+2)... This allows the storage >> of fewer values by a factor of M. 
It does require more memory fetches >> than the standard MAC calculation since you have to fetch not only the >> coefficient and the input, but also the intermediate output value >> being calculated on each MAC operation. >> >> Maybe I won't have to dig it up. I'm pretty sure this is how it >> worked. It ties in nicely with the "polyphase" aspect to allow a >> minimum number of calculations on arrival of each input since not all >> the coefficients are used on each input. So the coefficients are >> divided into "phases" (N/M coefficients) with only one phase used for >> any given input sample. > > Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by > 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus > the sets of input values for different outputs are disjoint.
Yes, so what? You still don't need to store any inputs to calculate the output. -- Rick C
Reply by Randy Yates May 16, 2017
eric.jacobsen@ieee.org writes:

> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates > <randyy@garnerundergroundinc.com> wrote: > >>rickman <gnuarm@gmail.com> writes: >> >>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>> rickman <gnuarm@gmail.com> writes: >>>> >>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>> rickman wrote: >>>>>> >>>>>>> Are the 248 words for data or also program? >>>>>> >>>>>> Data only, there are dedicated code memories. The control part >>>>>> is totally crazy: there are two separate code memories working >>>>>> in parallel (they take both paths simultaneosly in order to >>>>>> provide instructions with no delay in the case of a branch). >>>>>> But there is no such thing as a program counter, the code >>>>>> memories store instructions at consecutive addresses without >>>>>> any deeper structure. A run of instructions is called a block >>>>>> and is ended with the presence of a jump instruction. There >>>>>> is also a third control memory which stores the information >>>>>> where the i-th block begins and to which j-th state the FSM >>>>>> should go in the case of a branch. In short, hardware basic >>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>> which adds fun. >>>>>> >>>>>>> That should be plenty of room for coefficients. >>>>>> >>>>>> But doesn't the FIR require a lot of cells for >>>>>> the past data values? >>>>> >>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>> be asking. I can't picture how it works, but I specifically remember >>>>> NOT needing to store previous data inputs. I am thinking you store >>>>> outputs which will be fewer because of the decimation. As the input >>>>> data comes in the output samples can be built up and when one is >>>>> complete it is outputted and replaced by a new one. >>>>> >>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>> for you. 
>>>> >>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>> a length N polyphase FIR decimation by M. Just think first principles: >>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>> >>> I am pretty sure that conclusion is not correct. When I was asked to >>> implement a polyphase filter that was the part I had trouble getting >>> my head around but finally understood how it worked. At the time this >>> was being done on a very early DSP chip with very limited memory, so >>> storing the N-1 previous samples was not an option. The guy >>> presenting the problem to me was the classic engineer who could read >>> and understand things, but couldn't explain anything to other people. >>> So I had to take the paper he gave me and finally figure it out for >>> myself. >>> >>> I will dig around when I get a chance and figure out exactly how this >>> worked. But the basic idea is that instead of saving multiple inputs >>> and calculating the outputs one at a time (iterating over the stored >>> inputs) - as each input comes in it is iterated over the stored >>> outputs and each output is ... well output, when it has been fully >>> calculated. Since there are fewer outputs than inputs (by the >>> decimation factor of M) you store fewer outputs than you would inputs. >>> >>> This is not hard to understand if you just stop thinking it *has* to >>> be the way you are currently seeing it. (shown here with simplified >>> notation since it is too messy to show the indexes with decimation) >>> >>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... 
>>> >>> Instead of thinking you have to calculate all the terms of y(k) at one >>> time (and so store the last N*M values of x), consider that when x(i) >>> arrives, you can calculate the x(i) terms and add them into y(k) >>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>> of fewer values by a factor of M. It does require more memory fetches >>> than the standard MAC calculation since you have to fetch not only the >>> coefficient and the input, but also the intermediate output value >>> being calculated on each MAC operation. >>> >>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>> worked. It ties in nicely with the "polyphase" aspect to allow a >>> minimum number of calculations on arrival of each input since not all >>> the coefficients are used on each input. So the coefficients are >>> divided into "phases" (N/M coefficients) with only one phase used for >>> any given input sample. >> >>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >>the sets of input values for different outputs are disjoint. > > How each input will be used for each output is known when that input > arrives, so if all possible ouputs for that input have their own > partial product accumulator, the input can be multiplied by the > various coefficients for each partial product and accumulated in each, > and then forgotten. > > It's just another way to do it, but it can save a lot of hardware in a > hardware implementation, especially in an FPGA where the > multiply-accumulators are already built and just laying around, > anyway.
So are you asserting this is not a counter-example? To state that an N-tap FIR decimator does not require all N inputs for an output at time k is, in general, not true, as I have shown. I believe what you are claiming is that IN CERTAIN CASES you can get away with not using all N inputs at output time k. That I can agree with. However, neither you nor Rick put a constraint on your assertion. And I can see that, in certain situations, this can be mo' better, namely when you have one or more MACs sitting around not being used (as you said) and N > M. But in general it hardly seems a storage saver, as a MAC is a storage element plus arithmetic. -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
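For comparison, the conventional direct-form decimator that the storage argument above assumes, buffering the last N inputs in a delay line, might look like this (a minimal sketch with my own names, zero initial history assumed):

```python
def decimate_direct(samples, coeffs, M):
    """Direct-form decimate-by-M FIR: stores the last N input samples."""
    N = len(coeffs)
    hist = [0.0] * N                   # delay line; hist[0] holds the newest input
    outputs = []
    for i, x in enumerate(samples):
        hist = [x] + hist[:-1]         # shift the new input into the delay line
        if i % M == M - 1:             # every M-th input, compute one full output
            outputs.append(sum(a * h for a, h in zip(coeffs, hist)))
    return outputs
```

Either organization produces the same output sequence; the difference the thread debates is only where the state lives: N input registers here, versus ceil(N/M) output accumulators in the accumulate-as-inputs-arrive form.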