comp.dsp | How to downsample this signal?| page 3

Reply by ●May 16, 20172017-05-16

On Tue, 16 May 2017 13:11:25 -0400, Randy Yates
<randyy@garnerundergroundinc.com> wrote:

>rickman <gnuarm@gmail.com> writes:
>
>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>> rickman wrote:
>>>>>
>>>>>> Are the 248 words for data or also program?
>>>>>
>>>>> Data only, there are dedicated code memories. The control part
>>>>> is totally crazy: there are two separate code memories working
>>>>> in parallel (they take both paths simultaneosly in order to
>>>>> provide instructions with no delay in the case of a branch).
>>>>> But there is no such thing as a program counter, the code
>>>>> memories store instructions at consecutive addresses without
>>>>> any deeper structure. A run of instructions is called a block
>>>>> and is ended with the presence of a jump instruction. There
>>>>> is also a third control memory which stores the information
>>>>> where the i-th block begins and to which j-th state the FSM
>>>>> should go in the case of a branch. In short, hardware basic
>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>> which adds fun.
>>>>>
>>>>>> That should be plenty of room for coefficients.
>>>>>
>>>>> But doesn't the FIR require a lot of cells for
>>>>> the past data values?
>>>>
>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>> be asking.  I can't picture how it works, but I specifically remember
>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>> outputs which will be fewer because of the decimation.  As the input
>>>> data comes in the output samples can be built up and when one is
>>>> complete it is outputted and replaced by a new one.
>>>>
>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>> for you.
>>>
>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>> a length N polyphase FIR decimation by M. Just think first principles:
>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>
>> I am pretty sure that conclusion is not correct.  When I was asked to
>> implement a polyphase filter that was the part I had trouble getting
>> my head around but finally understood how it worked.  At the time this
>> was being done on a very early DSP chip with very limited memory, so
>> storing the N-1 previous samples was not an option.  The guy
>> presenting the problem to me was the classic engineer who could read
>> and understand things, but couldn't explain anything to other people.
>> So I had to take the paper he gave me and finally figure it out for
>> myself.
>>
>> I will dig around when I get a chance and figure out exactly how this
>> worked.  But the basic idea is that instead of saving multiple inputs
>> and calculating the outputs one at a time (iterating over the stored
>> inputs) - as each input comes in it is iterated over the stored
>> outputs and each output is ... well output, when it has been fully
>> calculated. Since there are fewer outputs than inputs (by the
>> decimation factor of M) you store fewer outputs than you would inputs.
>>
>> This is not hard to understand if you just stop thinking it *has* to
>> be the way you are currently seeing it.  (shown here with simplified
>> notation since it is too messy to show the indexes with decimation)
>>
>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>
>> Instead of thinking you have to calculate all the terms of y(k) at one
>> time (and so store the last N*M values of x), consider that when x(i)
>> arrives, you can calculate the x(i) terms and add them into y(k)
>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>> of fewer values by a factor of M.  It does require more memory fetches
>> than the standard MAC calculation since you have to fetch not only the
>> coefficient and the input, but also the intermediate output value
>> being calculated on each MAC operation.
>>
>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>> worked. It ties in nicely with the "polyphase" aspect to allow a
>> minimum number of calculations on arrival of each input since not all
>> the coefficients are used on each input.  So the coefficients are
>> divided into "phases" (N/M coefficients) with only one phase used for
>> any given input sample.
>
>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>the sets of input values for different outputs are disjoint.

How each input will be used for each output is known when that input
arrives, so if all possible ouputs for that input have their own
partial product accumulator, the input can be multiplied by the
various coefficients for each partial product and accumulated in each,
and then forgotten.

It's just another way to do it, but it can save a lot of hardware in a
hardware implementation, especially in an FPGA where the
multiply-accumulators are already built and just laying around,
anyway.



---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

Reply by rickman ●May 16, 20172017-05-16

On 5/16/2017 1:11 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes:
>
>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>> rickman wrote:
>>>>>
>>>>>> Are the 248 words for data or also program?
>>>>>
>>>>> Data only, there are dedicated code memories. The control part
>>>>> is totally crazy: there are two separate code memories working
>>>>> in parallel (they take both paths simultaneosly in order to
>>>>> provide instructions with no delay in the case of a branch).
>>>>> But there is no such thing as a program counter, the code
>>>>> memories store instructions at consecutive addresses without
>>>>> any deeper structure. A run of instructions is called a block
>>>>> and is ended with the presence of a jump instruction. There
>>>>> is also a third control memory which stores the information
>>>>> where the i-th block begins and to which j-th state the FSM
>>>>> should go in the case of a branch. In short, hardware basic
>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>> which adds fun.
>>>>>
>>>>>> That should be plenty of room for coefficients.
>>>>>
>>>>> But doesn't the FIR require a lot of cells for
>>>>> the past data values?
>>>>
>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>> be asking.  I can't picture how it works, but I specifically remember
>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>> outputs which will be fewer because of the decimation.  As the input
>>>> data comes in the output samples can be built up and when one is
>>>> complete it is outputted and replaced by a new one.
>>>>
>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>> for you.
>>>
>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>> a length N polyphase FIR decimation by M. Just think first principles:
>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>
>> I am pretty sure that conclusion is not correct.  When I was asked to
>> implement a polyphase filter that was the part I had trouble getting
>> my head around but finally understood how it worked.  At the time this
>> was being done on a very early DSP chip with very limited memory, so
>> storing the N-1 previous samples was not an option.  The guy
>> presenting the problem to me was the classic engineer who could read
>> and understand things, but couldn't explain anything to other people.
>> So I had to take the paper he gave me and finally figure it out for
>> myself.
>>
>> I will dig around when I get a chance and figure out exactly how this
>> worked.  But the basic idea is that instead of saving multiple inputs
>> and calculating the outputs one at a time (iterating over the stored
>> inputs) - as each input comes in it is iterated over the stored
>> outputs and each output is ... well output, when it has been fully
>> calculated. Since there are fewer outputs than inputs (by the
>> decimation factor of M) you store fewer outputs than you would inputs.
>>
>> This is not hard to understand if you just stop thinking it *has* to
>> be the way you are currently seeing it.  (shown here with simplified
>> notation since it is too messy to show the indexes with decimation)
>>
>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>
>> Instead of thinking you have to calculate all the terms of y(k) at one
>> time (and so store the last N*M values of x), consider that when x(i)
>> arrives, you can calculate the x(i) terms and add them into y(k)
>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>> of fewer values by a factor of M.  It does require more memory fetches
>> than the standard MAC calculation since you have to fetch not only the
>> coefficient and the input, but also the intermediate output value
>> being calculated on each MAC operation.
>>
>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>> worked. It ties in nicely with the "polyphase" aspect to allow a
>> minimum number of calculations on arrival of each input since not all
>> the coefficients are used on each input.  So the coefficients are
>> divided into "phases" (N/M coefficients) with only one phase used for
>> any given input sample.
>
> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
> the sets of input values for different outputs are disjoint.

Yes, so what?  You still don't need to store any inputs to calculate the 
output.

-- 

Rick C

Reply by Randy Yates ●May 16, 20172017-05-16

eric.jacobsen@ieee.org writes:

> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates
> <randyy@garnerundergroundinc.com> wrote:
>
>>rickman <gnuarm@gmail.com> writes:
>>
>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>> rickman <gnuarm@gmail.com> writes:
>>>>
>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>> rickman wrote:
>>>>>>
>>>>>>> Are the 248 words for data or also program?
>>>>>>
>>>>>> Data only, there are dedicated code memories. The control part
>>>>>> is totally crazy: there are two separate code memories working
>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>> provide instructions with no delay in the case of a branch).
>>>>>> But there is no such thing as a program counter, the code
>>>>>> memories store instructions at consecutive addresses without
>>>>>> any deeper structure. A run of instructions is called a block
>>>>>> and is ended with the presence of a jump instruction. There
>>>>>> is also a third control memory which stores the information
>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>> which adds fun.
>>>>>>
>>>>>>> That should be plenty of room for coefficients.
>>>>>>
>>>>>> But doesn't the FIR require a lot of cells for
>>>>>> the past data values?
>>>>>
>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>> data comes in the output samples can be built up and when one is
>>>>> complete it is outputted and replaced by a new one.
>>>>>
>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>> for you.
>>>>
>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>
>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>> implement a polyphase filter that was the part I had trouble getting
>>> my head around but finally understood how it worked.  At the time this
>>> was being done on a very early DSP chip with very limited memory, so
>>> storing the N-1 previous samples was not an option.  The guy
>>> presenting the problem to me was the classic engineer who could read
>>> and understand things, but couldn't explain anything to other people.
>>> So I had to take the paper he gave me and finally figure it out for
>>> myself.
>>>
>>> I will dig around when I get a chance and figure out exactly how this
>>> worked.  But the basic idea is that instead of saving multiple inputs
>>> and calculating the outputs one at a time (iterating over the stored
>>> inputs) - as each input comes in it is iterated over the stored
>>> outputs and each output is ... well output, when it has been fully
>>> calculated. Since there are fewer outputs than inputs (by the
>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>
>>> This is not hard to understand if you just stop thinking it *has* to
>>> be the way you are currently seeing it.  (shown here with simplified
>>> notation since it is too messy to show the indexes with decimation)
>>>
>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>
>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>> time (and so store the last N*M values of x), consider that when x(i)
>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>> of fewer values by a factor of M.  It does require more memory fetches
>>> than the standard MAC calculation since you have to fetch not only the
>>> coefficient and the input, but also the intermediate output value
>>> being calculated on each MAC operation.
>>>
>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>> minimum number of calculations on arrival of each input since not all
>>> the coefficients are used on each input.  So the coefficients are
>>> divided into "phases" (N/M coefficients) with only one phase used for
>>> any given input sample.
>>
>>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>the sets of input values for different outputs are disjoint.
>
> How each input will be used for each output is known when that input
> arrives, so if all possible ouputs for that input have their own
> partial product accumulator, the input can be multiplied by the
> various coefficients for each partial product and accumulated in each,
> and then forgotten.
>
> It's just another way to do it, but it can save a lot of hardware in a
> hardware implementation, especially in an FPGA where the
> multiply-accumulators are already built and just laying around,
> anyway.

So are you asserting this is not a counter-example? 

To state that an N-tap FIR decimator does not require all N inputs for
an output at time k is, in general, not true, as I have shown. 

I believe what you are claiming is that IN CERTAIN CASES you can get
away with not using all N inputs at output time k. That I can agree
with. However, neither you nor Rick put a constraint on your assertion.

And I can see that, in certain situations, this can be mo' better,
namely when you have one or more MACs sitting around not being used (as
you said) and N > M. But in general it hardly seems a storage saver as a
MAC is a storage element+.
-- 
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com

Reply by Randy Yates ●May 16, 20172017-05-16

rickman <gnuarm@gmail.com> writes:

> On 5/16/2017 1:11 PM, Randy Yates wrote:
>> rickman <gnuarm@gmail.com> writes:
>>
>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>> rickman <gnuarm@gmail.com> writes:
>>>>
>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>> rickman wrote:
>>>>>>
>>>>>>> Are the 248 words for data or also program?
>>>>>>
>>>>>> Data only, there are dedicated code memories. The control part
>>>>>> is totally crazy: there are two separate code memories working
>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>> provide instructions with no delay in the case of a branch).
>>>>>> But there is no such thing as a program counter, the code
>>>>>> memories store instructions at consecutive addresses without
>>>>>> any deeper structure. A run of instructions is called a block
>>>>>> and is ended with the presence of a jump instruction. There
>>>>>> is also a third control memory which stores the information
>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>> which adds fun.
>>>>>>
>>>>>>> That should be plenty of room for coefficients.
>>>>>>
>>>>>> But doesn't the FIR require a lot of cells for
>>>>>> the past data values?
>>>>>
>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>> data comes in the output samples can be built up and when one is
>>>>> complete it is outputted and replaced by a new one.
>>>>>
>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>> for you.
>>>>
>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>
>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>> implement a polyphase filter that was the part I had trouble getting
>>> my head around but finally understood how it worked.  At the time this
>>> was being done on a very early DSP chip with very limited memory, so
>>> storing the N-1 previous samples was not an option.  The guy
>>> presenting the problem to me was the classic engineer who could read
>>> and understand things, but couldn't explain anything to other people.
>>> So I had to take the paper he gave me and finally figure it out for
>>> myself.
>>>
>>> I will dig around when I get a chance and figure out exactly how this
>>> worked.  But the basic idea is that instead of saving multiple inputs
>>> and calculating the outputs one at a time (iterating over the stored
>>> inputs) - as each input comes in it is iterated over the stored
>>> outputs and each output is ... well output, when it has been fully
>>> calculated. Since there are fewer outputs than inputs (by the
>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>
>>> This is not hard to understand if you just stop thinking it *has* to
>>> be the way you are currently seeing it.  (shown here with simplified
>>> notation since it is too messy to show the indexes with decimation)
>>>
>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>
>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>> time (and so store the last N*M values of x), consider that when x(i)
>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>> of fewer values by a factor of M.  It does require more memory fetches
>>> than the standard MAC calculation since you have to fetch not only the
>>> coefficient and the input, but also the intermediate output value
>>> being calculated on each MAC operation.
>>>
>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>> minimum number of calculations on arrival of each input since not all
>>> the coefficients are used on each input.  So the coefficients are
>>> divided into "phases" (N/M coefficients) with only one phase used for
>>> any given input sample.
>>
>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>> the sets of input values for different outputs are disjoint.
>
> Yes, so what?  You still don't need to store any inputs to calculate
> the output.

Where would they come from if they were not stored?
-- 
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com

Reply by rickman ●May 16, 20172017-05-16

On 5/16/2017 11:59 AM, eric.jacobsen@ieee.org wrote:
> On Tue, 16 May 2017 02:59:51 -0400, rickman <gnuarm@gmail.com> wrote:
>
>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>> rickman wrote:
>>>>>
>>>>>> Are the 248 words for data or also program?
>>>>>
>>>>> Data only, there are dedicated code memories. The control part
>>>>> is totally crazy: there are two separate code memories working
>>>>> in parallel (they take both paths simultaneosly in order to
>>>>> provide instructions with no delay in the case of a branch).
>>>>> But there is no such thing as a program counter, the code
>>>>> memories store instructions at consecutive addresses without
>>>>> any deeper structure. A run of instructions is called a block
>>>>> and is ended with the presence of a jump instruction. There
>>>>> is also a third control memory which stores the information
>>>>> where the i-th block begins and to which j-th state the FSM
>>>>> should go in the case of a branch. In short, hardware basic
>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>> which adds fun.
>>>>>
>>>>>> That should be plenty of room for coefficients.
>>>>>
>>>>> But doesn't the FIR require a lot of cells for
>>>>> the past data values?
>>>>
>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>> be asking.  I can't picture how it works, but I specifically remember
>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>> outputs which will be fewer because of the decimation.  As the input
>>>> data comes in the output samples can be built up and when one is
>>>> complete it is outputted and replaced by a new one.
>>>>
>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>> for you.
>>>
>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>> a length N polyphase FIR decimation by M. Just think first principles:
>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>
>> I am pretty sure that conclusion is not correct.  When I was asked to
>> implement a polyphase filter that was the part I had trouble getting my
>> head around but finally understood how it worked.  At the time this was
>> being done on a very early DSP chip with very limited memory, so storing
>> the N-1 previous samples was not an option.  The guy presenting the
>> problem to me was the classic engineer who could read and understand
>> things, but couldn't explain anything to other people.  So I had to take
>> the paper he gave me and finally figure it out for myself.
>>
>> I will dig around when I get a chance and figure out exactly how this
>> worked.  But the basic idea is that instead of saving multiple inputs
>> and calculating the outputs one at a time (iterating over the stored
>> inputs) - as each input comes in it is iterated over the stored outputs
>> and each output is ... well output, when it has been fully calculated.
>> Since there are fewer outputs than inputs (by the decimation factor of
>> M) you store fewer outputs than you would inputs.
>>
>> This is not hard to understand if you just stop thinking it *has* to be
>> the way you are currently seeing it.  (shown here with simplified
>> notation since it is too messy to show the indexes with decimation)
>>
>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>
>> Instead of thinking you have to calculate all the terms of y(k) at one
>> time (and so store the last N*M values of x), consider that when x(i)
>> arrives, you can calculate the x(i) terms and add them into y(k) (which
>> is then output) and y(k+1), y(k+2)...  This allows the storage of fewer
>> values by a factor of M.  It does require more memory fetches than the
>> standard MAC calculation since you have to fetch not only the
>> coefficient and the input, but also the intermediate output value being
>> calculated on each MAC operation.
>>
>> Maybe I won't have to dig it up.  I'm pretty sure this is how it worked.
>>  It ties in nicely with the "polyphase" aspect to allow a minimum
>> number of calculations on arrival of each input since not all the
>> coefficients are used on each input.  So the coefficients are divided
>> into "phases" (N/M coefficients) with only one phase used for any given
>> input sample.
>>
>> --
>>
>> Rick C
>
> You either need to store N-1 previous input samples or you need enough
> running partial product accumulators with the proper coefficients
> selected at each input.   I usually do the latter, but either works.

Exactly.  But if you are decimating, storing intermediate output sample 
calculations takes less storage than storing input values.

-- 

Rick C

Reply by Randy Yates ●May 16, 20172017-05-16

Randy Yates <randyy@garnerundergroundinc.com> writes:

> rickman <gnuarm@gmail.com> writes:
>
>> On 5/16/2017 1:11 PM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>
>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>> rickman wrote:
>>>>>>>
>>>>>>>> Are the 248 words for data or also program?
>>>>>>>
>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>> But there is no such thing as a program counter, the code
>>>>>>> memories store instructions at consecutive addresses without
>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>> is also a third control memory which stores the information
>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>> which adds fun.
>>>>>>>
>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>
>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>> the past data values?
>>>>>>
>>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>>> data comes in the output samples can be built up and when one is
>>>>>> complete it is outputted and replaced by a new one.
>>>>>>
>>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>>> for you.
>>>>>
>>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>>
>>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>>> implement a polyphase filter that was the part I had trouble getting
>>>> my head around but finally understood how it worked.  At the time this
>>>> was being done on a very early DSP chip with very limited memory, so
>>>> storing the N-1 previous samples was not an option.  The guy
>>>> presenting the problem to me was the classic engineer who could read
>>>> and understand things, but couldn't explain anything to other people.
>>>> So I had to take the paper he gave me and finally figure it out for
>>>> myself.
>>>>
>>>> I will dig around when I get a chance and figure out exactly how this
>>>> worked.  But the basic idea is that instead of saving multiple inputs
>>>> and calculating the outputs one at a time (iterating over the stored
>>>> inputs) - as each input comes in it is iterated over the stored
>>>> outputs and each output is ... well output, when it has been fully
>>>> calculated. Since there are fewer outputs than inputs (by the
>>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>>
>>>> This is not hard to understand if you just stop thinking it *has* to
>>>> be the way you are currently seeing it.  (shown here with simplified
>>>> notation since it is too messy to show the indexes with decimation)
>>>>
>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>
>>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>>> time (and so store the last N*M values of x), consider that when x(i)
>>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>>> of fewer values by a factor of M.  It does require more memory fetches
>>>> than the standard MAC calculation since you have to fetch not only the
>>>> coefficient and the input, but also the intermediate output value
>>>> being calculated on each MAC operation.
>>>>
>>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>> minimum number of calculations on arrival of each input since not all
>>>> the coefficients are used on each input.  So the coefficients are
>>>> divided into "phases" (N/M coefficients) with only one phase used for
>>>> any given input sample.
>>>
>>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>> the sets of input values for different outputs are disjoint.
>>
>> Yes, so what?  You still don't need to store any inputs to calculate
>> the output.
>
> Where would they come from if they were not stored?

OK I see what you mean now. You can do a running MAC as they come in.
That is certainly true.
-- 
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com

Reply by Randy Yates ●May 16, 20172017-05-16

Randy Yates <randyy@garnerundergroundinc.com> writes:

> eric.jacobsen@ieee.org writes:
>
>> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates
>> <randyy@garnerundergroundinc.com> wrote:
>>
>>>rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>
>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>> rickman wrote:
>>>>>>>
>>>>>>>> Are the 248 words for data or also program?
>>>>>>>
>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>> But there is no such thing as a program counter, the code
>>>>>>> memories store instructions at consecutive addresses without
>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>> is also a third control memory which stores the information
>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>> which adds fun.
>>>>>>>
>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>
>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>> the past data values?
>>>>>>
>>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>>> data comes in the output samples can be built up and when one is
>>>>>> complete it is outputted and replaced by a new one.
>>>>>>
>>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>>> for you.
>>>>>
>>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>>
>>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>>> implement a polyphase filter that was the part I had trouble getting
>>>> my head around but finally understood how it worked.  At the time this
>>>> was being done on a very early DSP chip with very limited memory, so
>>>> storing the N-1 previous samples was not an option.  The guy
>>>> presenting the problem to me was the classic engineer who could read
>>>> and understand things, but couldn't explain anything to other people.
>>>> So I had to take the paper he gave me and finally figure it out for
>>>> myself.
>>>>
>>>> I will dig around when I get a chance and figure out exactly how this
>>>> worked.  But the basic idea is that instead of saving multiple inputs
>>>> and calculating the outputs one at a time (iterating over the stored
>>>> inputs) - as each input comes in it is iterated over the stored
>>>> outputs and each output is ... well output, when it has been fully
>>>> calculated. Since there are fewer outputs than inputs (by the
>>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>>
>>>> This is not hard to understand if you just stop thinking it *has* to
>>>> be the way you are currently seeing it.  (shown here with simplified
>>>> notation since it is too messy to show the indexes with decimation)
>>>>
>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>
>>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>>> time (and so store the last N*M values of x), consider that when x(i)
>>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>>> of fewer values by a factor of M.  It does require more memory fetches
>>>> than the standard MAC calculation since you have to fetch not only the
>>>> coefficient and the input, but also the intermediate output value
>>>> being calculated on each MAC operation.
>>>>
>>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>> minimum number of calculations on arrival of each input since not all
>>>> the coefficients are used on each input.  So the coefficients are
>>>> divided into "phases" (N/M coefficients) with only one phase used for
>>>> any given input sample.
>>>
>>>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>>the sets of input values for different outputs are disjoint.
>>
>> How each input will be used for each output is known when that input
>> arrives, so if all possible ouputs for that input have their own
>> partial product accumulator, the input can be multiplied by the
>> various coefficients for each partial product and accumulated in each,
>> and then forgotten.
>>
>> It's just another way to do it, but it can save a lot of hardware in a
>> hardware implementation, especially in an FPGA where the
>> multiply-accumulators are already built and just laying around,
>> anyway.
>
> So are you asserting this is not a counter-example? 
>
> To state that an N-tap FIR decimator does not require all N inputs for
> an output at time k is, in general, not true, as I have shown. 
>
> I believe what you are claiming is that IN CERTAIN CASES you can get
> away with not using all N inputs at output time k. That I can agree
> with. However, neither you nor Rick put a constraint on your assertion.
>
> And I can see that, in certain situations, this can be mo' better,
> namely when you have one or more MACs sitting around not being used (as
> you said) and N > M. But in general it hardly seems a storage saver as a
> MAC is a storage element+.

In light of the light coming on due to Rick's post that came about the
same time as I was writing this, I'll have to retract. 
-- 
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com

Reply by Randy Yates ●May 16, 20172017-05-16

Randy Yates <randyy@garnerundergroundinc.com> writes:

> Randy Yates <randyy@garnerundergroundinc.com> writes:
>
>> eric.jacobsen@ieee.org writes:
>>
>>> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates
>>> <randyy@garnerundergroundinc.com> wrote:
>>>
>>>>rickman <gnuarm@gmail.com> writes:
>>>>
>>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>>
>>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>>> rickman wrote:
>>>>>>>>
>>>>>>>>> Are the 248 words for data or also program?
>>>>>>>>
>>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>>> But there is no such thing as a program counter, the code
>>>>>>>> memories store instructions at consecutive addresses without
>>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>>> is also a third control memory which stores the information
>>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>>> which adds fun.
>>>>>>>>
>>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>>
>>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>>> the past data values?
>>>>>>>
>>>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>>>> data comes in the output samples can be built up and when one is
>>>>>>> complete it is outputted and replaced by a new one.
>>>>>>>
>>>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>>>> for you.
>>>>>>
>>>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>>>
>>>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>>>> implement a polyphase filter that was the part I had trouble getting
>>>>> my head around but finally understood how it worked.  At the time this
>>>>> was being done on a very early DSP chip with very limited memory, so
>>>>> storing the N-1 previous samples was not an option.  The guy
>>>>> presenting the problem to me was the classic engineer who could read
>>>>> and understand things, but couldn't explain anything to other people.
>>>>> So I had to take the paper he gave me and finally figure it out for
>>>>> myself.
>>>>>
>>>>> I will dig around when I get a chance and figure out exactly how this
>>>>> worked.  But the basic idea is that instead of saving multiple inputs
>>>>> and calculating the outputs one at a time (iterating over the stored
>>>>> inputs) - as each input comes in it is iterated over the stored
>>>>> outputs and each output is ... well output, when it has been fully
>>>>> calculated. Since there are fewer outputs than inputs (by the
>>>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>>>
>>>>> This is not hard to understand if you just stop thinking it *has* to
>>>>> be the way you are currently seeing it.  (shown here with simplified
>>>>> notation since it is too messy to show the indexes with decimation)
>>>>>
>>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>>
>>>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>>>> time (and so store the last N*M values of x), consider that when x(i)
>>>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>>>> of fewer values by a factor of M.  It does require more memory fetches
>>>>> than the standard MAC calculation since you have to fetch not only the
>>>>> coefficient and the input, but also the intermediate output value
>>>>> being calculated on each MAC operation.
>>>>>
>>>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>>> minimum number of calculations on arrival of each input since not all
>>>>> the coefficients are used on each input.  So the coefficients are
>>>>> divided into "phases" (N/M coefficients) with only one phase used for
>>>>> any given input sample.
>>>>
>>>>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>>>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>>>the sets of input values for different outputs are disjoint.
>>>
>>> How each input will be used for each output is known when that input
>>> arrives, so if all possible ouputs for that input have their own
>>> partial product accumulator, the input can be multiplied by the
>>> various coefficients for each partial product and accumulated in each,
>>> and then forgotten.
>>>
>>> It's just another way to do it, but it can save a lot of hardware in a
>>> hardware implementation, especially in an FPGA where the
>>> multiply-accumulators are already built and just laying around,
>>> anyway.
>>
>> So are you asserting this is not a counter-example? 
>>
>> To state that an N-tap FIR decimator does not require all N inputs for
>> an output at time k is, in general, not true, as I have shown. 
>>
>> I believe what you are claiming is that IN CERTAIN CASES you can get
>> away with not using all N inputs at output time k. That I can agree
>> with. However, neither you nor Rick put a constraint on your assertion.
>>
>> And I can see that, in certain situations, this can be mo' better,
>> namely when you have one or more MACs sitting around not being used (as
>> you said) and N > M. But in general it hardly seems a storage saver as a
>> MAC is a storage element+.
>
> In light of the light coming on due to Rick's post that came about the
> same time as I was writing this, I'll have to retract. 

This is a type of systolic array?
-- 
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com

Reply by rickman ●May 16, 20172017-05-16

On 5/16/2017 3:50 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>
>> Randy Yates <randyy@garnerundergroundinc.com> writes:
>>
>>> eric.jacobsen@ieee.org writes:
>>>
>>>> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates
>>>> <randyy@garnerundergroundinc.com> wrote:
>>>>
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>
>>>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>>>
>>>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>>>> rickman wrote:
>>>>>>>>>
>>>>>>>>>> Are the 248 words for data or also program?
>>>>>>>>>
>>>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>>>> But there is no such thing as a program counter, the code
>>>>>>>>> memories store instructions at consecutive addresses without
>>>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>>>> is also a third control memory which stores the information
>>>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>>>> which adds fun.
>>>>>>>>>
>>>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>>>
>>>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>>>> the past data values?
>>>>>>>>
>>>>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>>>>> data comes in the output samples can be built up and when one is
>>>>>>>> complete it is outputted and replaced by a new one.
>>>>>>>>
>>>>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>>>>> for you.
>>>>>>>
>>>>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>>>>
>>>>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>>>>> implement a polyphase filter that was the part I had trouble getting
>>>>>> my head around but finally understood how it worked.  At the time this
>>>>>> was being done on a very early DSP chip with very limited memory, so
>>>>>> storing the N-1 previous samples was not an option.  The guy
>>>>>> presenting the problem to me was the classic engineer who could read
>>>>>> and understand things, but couldn't explain anything to other people.
>>>>>> So I had to take the paper he gave me and finally figure it out for
>>>>>> myself.
>>>>>>
>>>>>> I will dig around when I get a chance and figure out exactly how this
>>>>>> worked.  But the basic idea is that instead of saving multiple inputs
>>>>>> and calculating the outputs one at a time (iterating over the stored
>>>>>> inputs) - as each input comes in it is iterated over the stored
>>>>>> outputs and each output is ... well output, when it has been fully
>>>>>> calculated. Since there are fewer outputs than inputs (by the
>>>>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>>>>
>>>>>> This is not hard to understand if you just stop thinking it *has* to
>>>>>> be the way you are currently seeing it.  (shown here with simplified
>>>>>> notation since it is too messy to show the indexes with decimation)
>>>>>>
>>>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>>>
>>>>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>>>>> time (and so store the last N*M values of x), consider that when x(i)
>>>>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>>>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>>>>> of fewer values by a factor of M.  It does require more memory fetches
>>>>>> than the standard MAC calculation since you have to fetch not only the
>>>>>> coefficient and the input, but also the intermediate output value
>>>>>> being calculated on each MAC operation.
>>>>>>
>>>>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>>>> minimum number of calculations on arrival of each input since not all
>>>>>> the coefficients are used on each input.  So the coefficients are
>>>>>> divided into "phases" (N/M coefficients) with only one phase used for
>>>>>> any given input sample.
>>>>>
>>>>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>>>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>>>> the sets of input values for different outputs are disjoint.
>>>>
>>>> How each input will be used for each output is known when that input
>>>> arrives, so if all possible ouputs for that input have their own
>>>> partial product accumulator, the input can be multiplied by the
>>>> various coefficients for each partial product and accumulated in each,
>>>> and then forgotten.
>>>>
>>>> It's just another way to do it, but it can save a lot of hardware in a
>>>> hardware implementation, especially in an FPGA where the
>>>> multiply-accumulators are already built and just laying around,
>>>> anyway.
>>>
>>> So are you asserting this is not a counter-example?
>>>
>>> To state that an N-tap FIR decimator does not require all N inputs for
>>> an output at time k is, in general, not true, as I have shown.
>>>
>>> I believe what you are claiming is that IN CERTAIN CASES you can get
>>> away with not using all N inputs at output time k. That I can agree
>>> with. However, neither you nor Rick put a constraint on your assertion.
>>>
>>> And I can see that, in certain situations, this can be mo' better,
>>> namely when you have one or more MACs sitting around not being used (as
>>> you said) and N > M. But in general it hardly seems a storage saver as a
>>> MAC is a storage element+.
>>
>> In light of the light coming on due to Rick's post that came about the
>> same time as I was writing this, I'll have to retract.
>
> This is a type of systolic array?

I've not seen the two things connected.  This method of calculating a 
polyphase filter does not require multiple processors which is the only 
context I've seen the term "systolic array" used.  I believe that is 
just a collection of processing elements that each pump data between 
them at a regular rate and a run-time fixed pattern.  Certainly any FIR 
filter would be a good candidate for a systolic array.

I think systolic arrays are not so much used these days.  Processing 
speeds have increased so much that there are a lot more things that can 
be done on a single processing element without breaking it down into 
pieces.  I suppose radar work will always find a need for more 
processing and perhaps sonar.  I think the cell phone industry has been 
a real boon to many other fields in terms of pushing the state of the 
art in DSP hardware.

-- 

Rick C

Reply by rickman ●May 16, 20172017-05-16

On 5/16/2017 3:40 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>
>> rickman <gnuarm@gmail.com> writes:
>>
>>> On 5/16/2017 1:11 PM, Randy Yates wrote:
>>>> rickman <gnuarm@gmail.com> writes:
>>>>
>>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>>
>>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>>> rickman wrote:
>>>>>>>>
>>>>>>>>> Are the 248 words for data or also program?
>>>>>>>>
>>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>>> in parallel (they take both paths simultaneosly in order to
>>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>>> But there is no such thing as a program counter, the code
>>>>>>>> memories store instructions at consecutive addresses without
>>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>>> is also a third control memory which stores the information
>>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>>> which adds fun.
>>>>>>>>
>>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>>
>>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>>> the past data values?
>>>>>>>
>>>>>>> Like I said, I wrote this a long time ago, so I am not the resource to
>>>>>>> be asking.  I can't picture how it works, but I specifically remember
>>>>>>> NOT needing to store previous data inputs.  I am thinking you store
>>>>>>> outputs which will be fewer because of the decimation.  As the input
>>>>>>> data comes in the output samples can be built up and when one is
>>>>>>> complete it is outputted and replaced by a new one.
>>>>>>>
>>>>>>> But read a proper reference.  If I wasn't busy today I'd dig this up
>>>>>>> for you.
>>>>>>
>>>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for
>>>>>> a length N polyphase FIR decimation by M. Just think first principles:
>>>>>> for any output y[k], you need N inputs to compute it, even if k = n*M.
>>>>>
>>>>> I am pretty sure that conclusion is not correct.  When I was asked to
>>>>> implement a polyphase filter that was the part I had trouble getting
>>>>> my head around but finally understood how it worked.  At the time this
>>>>> was being done on a very early DSP chip with very limited memory, so
>>>>> storing the N-1 previous samples was not an option.  The guy
>>>>> presenting the problem to me was the classic engineer who could read
>>>>> and understand things, but couldn't explain anything to other people.
>>>>> So I had to take the paper he gave me and finally figure it out for
>>>>> myself.
>>>>>
>>>>> I will dig around when I get a chance and figure out exactly how this
>>>>> worked.  But the basic idea is that instead of saving multiple inputs
>>>>> and calculating the outputs one at a time (iterating over the stored
>>>>> inputs) - as each input comes in it is iterated over the stored
>>>>> outputs and each output is ... well output, when it has been fully
>>>>> calculated. Since there are fewer outputs than inputs (by the
>>>>> decimation factor of M) you store fewer outputs than you would inputs.
>>>>>
>>>>> This is not hard to understand if you just stop thinking it *has* to
>>>>> be the way you are currently seeing it.  (shown here with simplified
>>>>> notation since it is too messy to show the indexes with decimation)
>>>>>
>>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>>
>>>>> Instead of thinking you have to calculate all the terms of y(k) at one
>>>>> time (and so store the last N*M values of x), consider that when x(i)
>>>>> arrives, you can calculate the x(i) terms and add them into y(k)
>>>>> (which is then output) and y(k+1), y(k+2)...  This allows the storage
>>>>> of fewer values by a factor of M.  It does require more memory fetches
>>>>> than the standard MAC calculation since you have to fetch not only the
>>>>> coefficient and the input, but also the intermediate output value
>>>>> being calculated on each MAC operation.
>>>>>
>>>>> Maybe I won't have to dig it up.  I'm pretty sure this is how it
>>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>>> minimum number of calculations on arrival of each input since not all
>>>>> the coefficients are used on each input.  So the coefficients are
>>>>> divided into "phases" (N/M coefficients) with only one phase used for
>>>>> any given input sample.
>>>>
>>>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by
>>>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus
>>>> the sets of input values for different outputs are disjoint.
>>>
>>> Yes, so what?  You still don't need to store any inputs to calculate
>>> the output.
>>
>> Where would they come from if they were not stored?
>
> OK I see what you mean now. You can do a running MAC as they come in.
> That is certainly true.

Ok, glad we got past that.

-- 

Rick C

Previous 1 234 Next

How to downsample this signal?

Sign in

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group