Reply by Randy Yates May 17, 2017
rickman <gnuarm@gmail.com> writes:

> On 5/16/2017 3:40 PM, Randy Yates wrote:
>> Randy Yates <randyy@garnerundergroundinc.com> writes:
>>
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/16/2017 1:11 PM, Randy Yates wrote:
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>
>>>>>> On 5/15/2017 11:49 AM, Randy Yates wrote:
>>>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>>>
>>>>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote:
>>>>>>>>> rickman wrote:
>>>>>>>>>
>>>>>>>>>> Are the 248 words for data or also program?
>>>>>>>>>
>>>>>>>>> Data only, there are dedicated code memories. The control part
>>>>>>>>> is totally crazy: there are two separate code memories working
>>>>>>>>> in parallel (they take both paths simultaneously in order to
>>>>>>>>> provide instructions with no delay in the case of a branch).
>>>>>>>>> But there is no such thing as a program counter, the code
>>>>>>>>> memories store instructions at consecutive addresses without
>>>>>>>>> any deeper structure. A run of instructions is called a block
>>>>>>>>> and is ended with the presence of a jump instruction. There
>>>>>>>>> is also a third control memory which stores the information
>>>>>>>>> where the i-th block begins and to which j-th state the FSM
>>>>>>>>> should go in the case of a branch. In short, hardware basic
>>>>>>>>> blocks. The data path is a VLIW with exposed pipelining,
>>>>>>>>> which adds fun.
>>>>>>>>>
>>>>>>>>>> That should be plenty of room for coefficients.
>>>>>>>>>
>>>>>>>>> But doesn't the FIR require a lot of cells for
>>>>>>>>> the past data values?
>>>>>>>>
>>>>>>>> Like I said, I wrote this a long time ago, so I am not the
>>>>>>>> resource to be asking. I can't picture how it works, but I
>>>>>>>> specifically remember NOT needing to store previous data
>>>>>>>> inputs. I am thinking you store outputs, which will be fewer
>>>>>>>> because of the decimation. As the input data comes in, the
>>>>>>>> output samples can be built up, and when one is complete it is
>>>>>>>> output and replaced by a new one.
>>>>>>>>
>>>>>>>> But read a proper reference. If I wasn't busy today I'd dig
>>>>>>>> this up for you.
>>>>>>>
>>>>>>> Yes you do need to store N-1 previous inputs (plus 1 current
>>>>>>> input) for a length N polyphase FIR decimation by M. Just think
>>>>>>> first principles: for any output y[k], you need N inputs to
>>>>>>> compute it, even if k = n*M.
>>>>>>
>>>>>> I am pretty sure that conclusion is not correct. When I was asked
>>>>>> to implement a polyphase filter, that was the part I had trouble
>>>>>> getting my head around but finally understood. At the time this
>>>>>> was being done on a very early DSP chip with very limited memory,
>>>>>> so storing the N-1 previous samples was not an option. The guy
>>>>>> presenting the problem to me was the classic engineer who could
>>>>>> read and understand things, but couldn't explain anything to
>>>>>> other people. So I had to take the paper he gave me and finally
>>>>>> figure it out for myself.
>>>>>>
>>>>>> I will dig around when I get a chance and figure out exactly how
>>>>>> this worked. But the basic idea is that instead of saving
>>>>>> multiple inputs and calculating the outputs one at a time
>>>>>> (iterating over the stored inputs), as each input comes in it is
>>>>>> iterated over the stored outputs, and each output is ... well,
>>>>>> output, when it has been fully calculated. Since there are fewer
>>>>>> outputs than inputs (by the decimation factor of M), you store
>>>>>> fewer outputs than you would inputs.
>>>>>>
>>>>>> This is not hard to understand if you just stop thinking it *has*
>>>>>> to be the way you are currently seeing it. (Shown here with
>>>>>> simplified notation since it is too messy to show the indexes
>>>>>> with decimation.)
>>>>>>
>>>>>> y(k)   = a(0)x(i)   + a(1)x(i-1) + a(2)x(i-2) + ...
>>>>>> y(k+1) = a(0)x(i+1) + a(1)x(i)   + a(2)x(i-1) + ...
>>>>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i)   + ...
>>>>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ...
>>>>>>
>>>>>> Instead of thinking you have to calculate all the terms of y(k)
>>>>>> at one time (and so store the last N*M values of x), consider
>>>>>> that when x(i) arrives, you can calculate the x(i) terms and add
>>>>>> them into y(k) (which is then output) and y(k+1), y(k+2)... This
>>>>>> allows the storage of fewer values by a factor of M. It does
>>>>>> require more memory fetches than the standard MAC calculation,
>>>>>> since you have to fetch not only the coefficient and the input,
>>>>>> but also the intermediate output value being calculated on each
>>>>>> MAC operation.
>>>>>>
>>>>>> Maybe I won't have to dig it up. I'm pretty sure this is how it
>>>>>> worked. It ties in nicely with the "polyphase" aspect to allow a
>>>>>> minimum number of calculations on arrival of each input, since
>>>>>> not all the coefficients are used on each input. So the
>>>>>> coefficients are divided into "phases" (N/M coefficients) with
>>>>>> only one phase used for any given input sample.
>>>>>
>>>>> Counter-example: Consider M = 8, N = 8 (8 coefficients and
>>>>> decimating by 8). Then each output y(k) depends on x(k*M + n),
>>>>> 0 <= n <= M - 1, and thus the sets of input values for different
>>>>> outputs are disjoint.
>>>>
>>>> Yes, so what? You still don't need to store any inputs to calculate
>>>> the output.
>>>
>>> Where would they come from if they were not stored?
>>
>> OK I see what you mean now. You can do a running MAC as they come in.
>> That is certainly true.
>
> Ok, glad we got past that.
Yeah, thanks for hanging with me, Rick. For some reason I kept on
overlooking the key concept, which you plainly stated:

  Instead of thinking you have to calculate all the terms of y(k) AT
  ONE TIME (and so store the last N*M values of x), consider that when
  x(i) ...

Mea culpa.
--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
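[Editor's note: the following sketch is mine, not code from the thread. It illustrates the accumulate-into-outputs scheme Rick describes, under the assumed output alignment y[k] = sum_n a[n]*x[k*M + M - 1 - n] with zero initial history; the function name and index bookkeeping are my own.]

```python
def polyphase_decimate(x, a, M):
    """Decimate-by-M FIR using output accumulators instead of an input
    delay line.  Each arriving sample is multiplied by the one polyphase
    branch of coefficients that involves it and added into the partial
    outputs; a finished output is emitted and its accumulator recycled,
    so at most ceil(N/M) partial outputs are stored at any time."""
    N = len(a)
    acc = {}                        # partial outputs, keyed by output index k
    out = []
    for i, s in enumerate(x):
        # Coefficient indices pairing sample i with output k satisfy
        # n = k*M + (M - 1) - i, so they step through one phase: n0, n0+M, ...
        n0 = (M - 1 - i) % M
        for n in range(n0, N, M):
            k = (i + n - (M - 1)) // M
            if k >= 0:              # guard against pre-history outputs
                acc[k] = acc.get(k, 0.0) + a[n] * s
        # Output k is complete once its newest sample, i = k*M + M - 1, arrives.
        if (i + 1) % M == 0:
            out.append(acc.pop((i + 1) // M - 1, 0.0))
    return out
```

Note that `acc` never holds more than ceil(N/M) entries, versus the N stored inputs of the usual delay-line form, which is the storage saving the thread converges on.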
Reply by rickman May 16, 2017
On 5/16/2017 3:40 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>> rickman <gnuarm@gmail.com> writes:
>>
>>> Yes, so what? You still don't need to store any inputs to calculate
>>> the output.
>>
>> Where would they come from if they were not stored?
>
> OK I see what you mean now. You can do a running MAC as they come in.
> That is certainly true.
Ok, glad we got past that.
--
Rick C
Reply by rickman May 16, 2017
On 5/16/2017 3:50 PM, Randy Yates wrote:
> Randy Yates <randyy@garnerundergroundinc.com> writes:
>> Randy Yates <randyy@garnerundergroundinc.com> writes:
>>> eric.jacobsen@ieee.org writes:
>>>
>>>> How each input will be used for each output is known when that
>>>> input arrives, so if all possible outputs for that input have their
>>>> own partial product accumulator, the input can be multiplied by the
>>>> various coefficients for each partial product and accumulated in
>>>> each, and then forgotten.
>>>>
>>>> It's just another way to do it, but it can save a lot of hardware
>>>> in a hardware implementation, especially in an FPGA where the
>>>> multiply-accumulators are already built and just laying around,
>>>> anyway.
>>>
>>> So are you asserting this is not a counter-example?
>>>
>>> To state that an N-tap FIR decimator does not require all N inputs
>>> for an output at time k is, in general, not true, as I have shown.
>>>
>>> I believe what you are claiming is that IN CERTAIN CASES you can get
>>> away with not using all N inputs at output time k. That I can agree
>>> with. However, neither you nor Rick put a constraint on your
>>> assertion.
>>>
>>> And I can see that, in certain situations, this can be mo' better,
>>> namely when you have one or more MACs sitting around not being used
>>> (as you said) and N > M. But in general it hardly seems a storage
>>> saver, as a MAC is a storage element.
>>
>> In light of the light coming on due to Rick's post that came about
>> the same time as I was writing this, I'll have to retract.
>
> This is a type of systolic array?
I've not seen the two things connected. This method of calculating a
polyphase filter does not require multiple processors, which is the
only context in which I've seen the term "systolic array" used. I
believe that is just a collection of processing elements that each pump
data between them at a regular rate and in a run-time fixed pattern.
Certainly any FIR filter would be a good candidate for a systolic
array.

I think systolic arrays are not so much used these days. Processing
speeds have increased so much that there are a lot more things that can
be done on a single processing element without breaking the work down
into pieces. I suppose radar work will always find a need for more
processing, and perhaps sonar. I think the cell phone industry has been
a real boon to many other fields in terms of pushing the state of the
art in DSP hardware.

--
Rick C
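[Editor's note: to connect the two ideas, here is a sketch of mine, not from the thread, of a transposed-form FIR. This is the software analogue of the structure that maps most directly onto a systolic/pipelined array: the current input is broadcast to every coefficient, and partial sums ripple one register stage per sample toward the output, playing the same role as the per-output accumulators discussed in the thread. Assumes at least two taps.]

```python
def transposed_fir(x, a):
    """Transposed direct-form FIR: y[i] = sum_n a[n] * x[i-n].
    Each sample is broadcast to all N multipliers and the products are
    added into a chain of N-1 partial-sum registers that shift one stage
    toward the output per sample (no input delay line is kept)."""
    N = len(a)
    z = [0.0] * (N - 1)             # partial-sum register chain
    out = []
    for s in x:
        out.append(a[0] * s + z[0])            # output stage completes
        for j in range(N - 2):                 # shift partial sums forward
            z[j] = a[j + 1] * s + z[j + 1]
        z[N - 2] = a[N - 1] * s                # newest partial sum enters
    return out
```

In hardware, each (multiplier, adder, register) triple becomes one cell of the array, with only the input sample and the partial sum crossing cell boundaries, which is why this form pipelines so well in an FPGA.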
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> Randy Yates <randyy@garnerundergroundinc.com> writes:
>
>> And I can see that, in certain situations, this can be mo' better,
>> namely when you have one or more MACs sitting around not being used
>> (as you said) and N > M. But in general it hardly seems a storage
>> saver, as a MAC is a storage element.
>
> In light of the light coming on due to Rick's post that came about the
> same time as I was writing this, I'll have to retract.
This is a type of systolic array?
--
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> eric.jacobsen@ieee.org writes:
>
>> How each input will be used for each output is known when that input
>> arrives, so if all possible outputs for that input have their own
>> partial product accumulator, the input can be multiplied by the
>> various coefficients for each partial product and accumulated in
>> each, and then forgotten.
>>
>> It's just another way to do it, but it can save a lot of hardware in
>> a hardware implementation, especially in an FPGA where the
>> multiply-accumulators are already built and just laying around,
>> anyway.
>
> So are you asserting this is not a counter-example?
>
> To state that an N-tap FIR decimator does not require all N inputs for
> an output at time k is, in general, not true, as I have shown.
>
> I believe what you are claiming is that IN CERTAIN CASES you can get
> away with not using all N inputs at output time k. That I can agree
> with. However, neither you nor Rick put a constraint on your
> assertion.
>
> And I can see that, in certain situations, this can be mo' better,
> namely when you have one or more MACs sitting around not being used
> (as you said) and N > M. But in general it hardly seems a storage
> saver, as a MAC is a storage element.
In light of the light coming on due to Rick's post that came about the
same time as I was writing this, I'll have to retract.
--
Randy Yates, Embedded Firmware Developer
Garner Underground, Inc.
866-260-9040, x3901
http://www.garnerundergroundinc.com
Reply by Randy Yates May 16, 2017
Randy Yates <randyy@garnerundergroundinc.com> writes:

> rickman <gnuarm@gmail.com> writes: > >> On 5/16/2017 1:11 PM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>>> rickman <gnuarm@gmail.com> writes: >>>>> >>>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>>> rickman wrote: >>>>>>> >>>>>>>> Are the 248 words for data or also program? >>>>>>> >>>>>>> Data only, there are dedicated code memories. The control part >>>>>>> is totally crazy: there are two separate code memories working >>>>>>> in parallel (they take both paths simultaneosly in order to >>>>>>> provide instructions with no delay in the case of a branch). >>>>>>> But there is no such thing as a program counter, the code >>>>>>> memories store instructions at consecutive addresses without >>>>>>> any deeper structure. A run of instructions is called a block >>>>>>> and is ended with the presence of a jump instruction. There >>>>>>> is also a third control memory which stores the information >>>>>>> where the i-th block begins and to which j-th state the FSM >>>>>>> should go in the case of a branch. In short, hardware basic >>>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>>> which adds fun. >>>>>>> >>>>>>>> That should be plenty of room for coefficients. >>>>>>> >>>>>>> But doesn't the FIR require a lot of cells for >>>>>>> the past data values? >>>>>> >>>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>>> be asking. I can't picture how it works, but I specifically remember >>>>>> NOT needing to store previous data inputs. I am thinking you store >>>>>> outputs which will be fewer because of the decimation. As the input >>>>>> data comes in the output samples can be built up and when one is >>>>>> complete it is outputted and replaced by a new one. >>>>>> >>>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>>> for you. 
>>>>> >>>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>>> a length N polyphase FIR decimation by M. Just think first principles: >>>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>>> >>>> I am pretty sure that conclusion is not correct. When I was asked to >>>> implement a polyphase filter that was the part I had trouble getting >>>> my head around but finally understood how it worked. At the time this >>>> was being done on a very early DSP chip with very limited memory, so >>>> storing the N-1 previous samples was not an option. The guy >>>> presenting the problem to me was the classic engineer who could read >>>> and understand things, but couldn't explain anything to other people. >>>> So I had to take the paper he gave me and finally figure it out for >>>> myself. >>>> >>>> I will dig around when I get a chance and figure out exactly how this >>>> worked. But the basic idea is that instead of saving multiple inputs >>>> and calculating the outputs one at a time (iterating over the stored >>>> inputs) - as each input comes in it is iterated over the stored >>>> outputs and each output is ... well output, when it has been fully >>>> calculated. Since there are fewer outputs than inputs (by the >>>> decimation factor of M) you store fewer outputs than you would inputs. >>>> >>>> This is not hard to understand if you just stop thinking it *has* to >>>> be the way you are currently seeing it. (shown here with simplified >>>> notation since it is too messy to show the indexes with decimation) >>>> >>>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... 
>>>> >>>> Instead of thinking you have to calculate all the terms of y(k) at one >>>> time (and so store the last N*M values of x), consider that when x(i) >>>> arrives, you can calculate the x(i) terms and add them into y(k) >>>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>>> of fewer values by a factor of M. It does require more memory fetches >>>> than the standard MAC calculation since you have to fetch not only the >>>> coefficient and the input, but also the intermediate output value >>>> being calculated on each MAC operation. >>>> >>>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>>> worked. It ties in nicely with the "polyphase" aspect to allow a >>>> minimum number of calculations on arrival of each input since not all >>>> the coefficients are used on each input. So the coefficients are >>>> divided into "phases" (N/M coefficients) with only one phase used for >>>> any given input sample. >>> >>> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >>> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >>> the sets of input values for different outputs are disjoint. >> >> Yes, so what? You still don't need to store any inputs to calculate >> the output. > > Where would they come from if they were not stored?
OK, I see what you mean now. You can do a running MAC on the inputs as they come in. That is certainly true. -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
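The "running MAC" conceded above can be sketched for the counter-example's special case N = M, where the input sets for successive outputs are disjoint. This is a hypothetical illustration (the function and variable names are mine, not from the thread): a single accumulator suffices, and no input history is ever stored.

```python
def decimate_running_mac(samples, coeffs):
    """Decimate-by-M FIR with N == M taps using one running accumulator.

    Each arriving input is multiplied by the coefficient for its position
    within the current block of M inputs (newest input gets a[0], oldest
    gets a[M-1]) and added into the accumulator; every M-th input
    completes an output, which is emitted and the accumulator cleared.
    """
    M = len(coeffs)
    acc = 0.0
    outputs = []
    for i, x in enumerate(samples):
        acc += coeffs[M - 1 - (i % M)] * x  # one MAC per input, no history stored
        if i % M == M - 1:                  # M inputs consumed: output complete
            outputs.append(acc)
            acc = 0.0
    return outputs
```

Since only one coefficient is touched per input, this also matches the "one phase per input sample" observation made earlier in the thread.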
Reply by Randy Yates May 16, 2017
rickman <gnuarm@gmail.com> writes:

> On 5/16/2017 1:11 PM, Randy Yates wrote: >> rickman <gnuarm@gmail.com> writes: >> >>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>> rickman <gnuarm@gmail.com> writes: >>>> >>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>> rickman wrote: >>>>>> >>>>>>> Are the 248 words for data or also program? >>>>>> >>>>>> Data only, there are dedicated code memories. The control part >>>>>> is totally crazy: there are two separate code memories working >>>>>> in parallel (they take both paths simultaneosly in order to >>>>>> provide instructions with no delay in the case of a branch). >>>>>> But there is no such thing as a program counter, the code >>>>>> memories store instructions at consecutive addresses without >>>>>> any deeper structure. A run of instructions is called a block >>>>>> and is ended with the presence of a jump instruction. There >>>>>> is also a third control memory which stores the information >>>>>> where the i-th block begins and to which j-th state the FSM >>>>>> should go in the case of a branch. In short, hardware basic >>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>> which adds fun. >>>>>> >>>>>>> That should be plenty of room for coefficients. >>>>>> >>>>>> But doesn't the FIR require a lot of cells for >>>>>> the past data values? >>>>> >>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>> be asking. I can't picture how it works, but I specifically remember >>>>> NOT needing to store previous data inputs. I am thinking you store >>>>> outputs which will be fewer because of the decimation. As the input >>>>> data comes in the output samples can be built up and when one is >>>>> complete it is outputted and replaced by a new one. >>>>> >>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>> for you. >>>> >>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>> a length N polyphase FIR decimation by M. 
Just think first principles: >>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>> >>> I am pretty sure that conclusion is not correct. When I was asked to >>> implement a polyphase filter that was the part I had trouble getting >>> my head around but finally understood how it worked. At the time this >>> was being done on a very early DSP chip with very limited memory, so >>> storing the N-1 previous samples was not an option. The guy >>> presenting the problem to me was the classic engineer who could read >>> and understand things, but couldn't explain anything to other people. >>> So I had to take the paper he gave me and finally figure it out for >>> myself. >>> >>> I will dig around when I get a chance and figure out exactly how this >>> worked. But the basic idea is that instead of saving multiple inputs >>> and calculating the outputs one at a time (iterating over the stored >>> inputs) - as each input comes in it is iterated over the stored >>> outputs and each output is ... well output, when it has been fully >>> calculated. Since there are fewer outputs than inputs (by the >>> decimation factor of M) you store fewer outputs than you would inputs. >>> >>> This is not hard to understand if you just stop thinking it *has* to >>> be the way you are currently seeing it. (shown here with simplified >>> notation since it is too messy to show the indexes with decimation) >>> >>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >>> >>> Instead of thinking you have to calculate all the terms of y(k) at one >>> time (and so store the last N*M values of x), consider that when x(i) >>> arrives, you can calculate the x(i) terms and add them into y(k) >>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>> of fewer values by a factor of M. 
It does require more memory fetches >>> than the standard MAC calculation since you have to fetch not only the >>> coefficient and the input, but also the intermediate output value >>> being calculated on each MAC operation. >>> >>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>> worked. It ties in nicely with the "polyphase" aspect to allow a >>> minimum number of calculations on arrival of each input since not all >>> the coefficients are used on each input. So the coefficients are >>> divided into "phases" (N/M coefficients) with only one phase used for >>> any given input sample. >> >> Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >> 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >> the sets of input values for different outputs are disjoint. > > Yes, so what? You still don't need to store any inputs to calculate > the output.
Where would they come from if they were not stored? -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
Reply by rickman May 16, 2017
On 5/16/2017 11:59 AM, eric.jacobsen@ieee.org wrote:
> On Tue, 16 May 2017 02:59:51 -0400, rickman <gnuarm@gmail.com> wrote: > >> On 5/15/2017 11:49 AM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>> rickman wrote: >>>>> >>>>>> Are the 248 words for data or also program? >>>>> >>>>> Data only, there are dedicated code memories. The control part >>>>> is totally crazy: there are two separate code memories working >>>>> in parallel (they take both paths simultaneosly in order to >>>>> provide instructions with no delay in the case of a branch). >>>>> But there is no such thing as a program counter, the code >>>>> memories store instructions at consecutive addresses without >>>>> any deeper structure. A run of instructions is called a block >>>>> and is ended with the presence of a jump instruction. There >>>>> is also a third control memory which stores the information >>>>> where the i-th block begins and to which j-th state the FSM >>>>> should go in the case of a branch. In short, hardware basic >>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>> which adds fun. >>>>> >>>>>> That should be plenty of room for coefficients. >>>>> >>>>> But doesn't the FIR require a lot of cells for >>>>> the past data values? >>>> >>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>> be asking. I can't picture how it works, but I specifically remember >>>> NOT needing to store previous data inputs. I am thinking you store >>>> outputs which will be fewer because of the decimation. As the input >>>> data comes in the output samples can be built up and when one is >>>> complete it is outputted and replaced by a new one. >>>> >>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>> for you. >>> >>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>> a length N polyphase FIR decimation by M. 
Just think first principles: >>> for any output y[k], you need N inputs to compute it, even if k = n*M. >> >> I am pretty sure that conclusion is not correct. When I was asked to >> implement a polyphase filter that was the part I had trouble getting my >> head around but finally understood how it worked. At the time this was >> being done on a very early DSP chip with very limited memory, so storing >> the N-1 previous samples was not an option. The guy presenting the >> problem to me was the classic engineer who could read and understand >> things, but couldn't explain anything to other people. So I had to take >> the paper he gave me and finally figure it out for myself. >> >> I will dig around when I get a chance and figure out exactly how this >> worked. But the basic idea is that instead of saving multiple inputs >> and calculating the outputs one at a time (iterating over the stored >> inputs) - as each input comes in it is iterated over the stored outputs >> and each output is ... well output, when it has been fully calculated. >> Since there are fewer outputs than inputs (by the decimation factor of >> M) you store fewer outputs than you would inputs. >> >> This is not hard to understand if you just stop thinking it *has* to be >> the way you are currently seeing it. (shown here with simplified >> notation since it is too messy to show the indexes with decimation) >> >> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >> >> Instead of thinking you have to calculate all the terms of y(k) at one >> time (and so store the last N*M values of x), consider that when x(i) >> arrives, you can calculate the x(i) terms and add them into y(k) (which >> is then output) and y(k+1), y(k+2)... This allows the storage of fewer >> values by a factor of M. 
It does require more memory fetches than the >> standard MAC calculation since you have to fetch not only the >> coefficient and the input, but also the intermediate output value being >> calculated on each MAC operation. >> >> Maybe I won't have to dig it up. I'm pretty sure this is how it worked. >> It ties in nicely with the "polyphase" aspect to allow a minimum >> number of calculations on arrival of each input since not all the >> coefficients are used on each input. So the coefficients are divided >> into "phases" (N/M coefficients) with only one phase used for any given >> input sample. >> >> -- >> >> Rick C > > You either need to store N-1 previous input samples or you need enough > running partial product accumulators with the proper coefficients > selected at each input. I usually do the latter, but either works.
Exactly. But if you are decimating, storing partially computed output samples takes less storage than storing input values. -- Rick C
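Rick's scheme of accumulating each arriving input into the pending partial outputs can be sketched as below. This is a hedged illustration under my own naming and indexing conventions (essentially a "transposed" polyphase decimator with zero initial history assumed): only ceil(N/M) partial output accumulators are stored, instead of N past inputs.

```python
from collections import deque

def polyphase_decimate(samples, coeffs, M):
    """Decimate-by-M FIR storing ceil(N/M) partial outputs, not N inputs."""
    N = len(coeffs)
    P = -(-N // M)                  # ceil(N/M): number of pending partial outputs
    acc = deque([0.0] * P)          # acc[0] is the output that completes soonest
    outputs = []
    for i, x in enumerate(samples):
        phase = i % M               # position of x within its block of M inputs
        for j in range(P):
            # tap index this input feeds in the output j blocks ahead
            n = j * M + (M - 1 - phase)
            if n < N:
                acc[j] += coeffs[n] * x
        if phase == M - 1:          # block complete: acc[0] is a finished output
            outputs.append(acc.popleft())
            acc.append(0.0)
    return outputs
```

The storage saving is the factor-of-M reduction discussed above: for N = 32 taps and M = 8, four accumulators replace a 32-sample delay line, at the cost of up to P read-modify-write accumulator accesses per input.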
Reply by rickman May 16, 2017
On 5/16/2017 1:11 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes: > >> On 5/15/2017 11:49 AM, Randy Yates wrote: >>> rickman <gnuarm@gmail.com> writes: >>> >>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>> rickman wrote: >>>>> >>>>>> Are the 248 words for data or also program? >>>>> >>>>> Data only, there are dedicated code memories. The control part >>>>> is totally crazy: there are two separate code memories working >>>>> in parallel (they take both paths simultaneosly in order to >>>>> provide instructions with no delay in the case of a branch). >>>>> But there is no such thing as a program counter, the code >>>>> memories store instructions at consecutive addresses without >>>>> any deeper structure. A run of instructions is called a block >>>>> and is ended with the presence of a jump instruction. There >>>>> is also a third control memory which stores the information >>>>> where the i-th block begins and to which j-th state the FSM >>>>> should go in the case of a branch. In short, hardware basic >>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>> which adds fun. >>>>> >>>>>> That should be plenty of room for coefficients. >>>>> >>>>> But doesn't the FIR require a lot of cells for >>>>> the past data values? >>>> >>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>> be asking. I can't picture how it works, but I specifically remember >>>> NOT needing to store previous data inputs. I am thinking you store >>>> outputs which will be fewer because of the decimation. As the input >>>> data comes in the output samples can be built up and when one is >>>> complete it is outputted and replaced by a new one. >>>> >>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>> for you. >>> >>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>> a length N polyphase FIR decimation by M. Just think first principles: >>> for any output y[k], you need N inputs to compute it, even if k = n*M. 
>> >> I am pretty sure that conclusion is not correct. When I was asked to >> implement a polyphase filter that was the part I had trouble getting >> my head around but finally understood how it worked. At the time this >> was being done on a very early DSP chip with very limited memory, so >> storing the N-1 previous samples was not an option. The guy >> presenting the problem to me was the classic engineer who could read >> and understand things, but couldn't explain anything to other people. >> So I had to take the paper he gave me and finally figure it out for >> myself. >> >> I will dig around when I get a chance and figure out exactly how this >> worked. But the basic idea is that instead of saving multiple inputs >> and calculating the outputs one at a time (iterating over the stored >> inputs) - as each input comes in it is iterated over the stored >> outputs and each output is ... well output, when it has been fully >> calculated. Since there are fewer outputs than inputs (by the >> decimation factor of M) you store fewer outputs than you would inputs. >> >> This is not hard to understand if you just stop thinking it *has* to >> be the way you are currently seeing it. (shown here with simplified >> notation since it is too messy to show the indexes with decimation) >> >> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... >> >> Instead of thinking you have to calculate all the terms of y(k) at one >> time (and so store the last N*M values of x), consider that when x(i) >> arrives, you can calculate the x(i) terms and add them into y(k) >> (which is then output) and y(k+1), y(k+2)... This allows the storage >> of fewer values by a factor of M. 
It does require more memory fetches >> than the standard MAC calculation since you have to fetch not only the >> coefficient and the input, but also the intermediate output value >> being calculated on each MAC operation. >> >> Maybe I won't have to dig it up. I'm pretty sure this is how it >> worked. It ties in nicely with the "polyphase" aspect to allow a >> minimum number of calculations on arrival of each input since not all >> the coefficients are used on each input. So the coefficients are >> divided into "phases" (N/M coefficients) with only one phase used for >> any given input sample. > > Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by > 8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus > the sets of input values for different outputs are disjoint.
Yes, so what? You still don't need to store any inputs to calculate the output. -- Rick C
Reply by Randy Yates May 16, 2017
eric.jacobsen@ieee.org writes:

> On Tue, 16 May 2017 13:11:25 -0400, Randy Yates > <randyy@garnerundergroundinc.com> wrote: > >>rickman <gnuarm@gmail.com> writes: >> >>> On 5/15/2017 11:49 AM, Randy Yates wrote: >>>> rickman <gnuarm@gmail.com> writes: >>>> >>>>> On 5/15/2017 1:21 AM, Piotr Wyderski wrote: >>>>>> rickman wrote: >>>>>> >>>>>>> Are the 248 words for data or also program? >>>>>> >>>>>> Data only, there are dedicated code memories. The control part >>>>>> is totally crazy: there are two separate code memories working >>>>>> in parallel (they take both paths simultaneosly in order to >>>>>> provide instructions with no delay in the case of a branch). >>>>>> But there is no such thing as a program counter, the code >>>>>> memories store instructions at consecutive addresses without >>>>>> any deeper structure. A run of instructions is called a block >>>>>> and is ended with the presence of a jump instruction. There >>>>>> is also a third control memory which stores the information >>>>>> where the i-th block begins and to which j-th state the FSM >>>>>> should go in the case of a branch. In short, hardware basic >>>>>> blocks. The data path is a VLIW with exposed pipelining, >>>>>> which adds fun. >>>>>> >>>>>>> That should be plenty of room for coefficients. >>>>>> >>>>>> But doesn't the FIR require a lot of cells for >>>>>> the past data values? >>>>> >>>>> Like I said, I wrote this a long time ago, so I am not the resource to >>>>> be asking. I can't picture how it works, but I specifically remember >>>>> NOT needing to store previous data inputs. I am thinking you store >>>>> outputs which will be fewer because of the decimation. As the input >>>>> data comes in the output samples can be built up and when one is >>>>> complete it is outputted and replaced by a new one. >>>>> >>>>> But read a proper reference. If I wasn't busy today I'd dig this up >>>>> for you. 
>>>> >>>> Yes you do need to store N-1 previous inputs (plus 1 current input) for >>>> a length N polyphase FIR decimation by M. Just think first principles: >>>> for any output y[k], you need N inputs to compute it, even if k = n*M. >>> >>> I am pretty sure that conclusion is not correct. When I was asked to >>> implement a polyphase filter that was the part I had trouble getting >>> my head around but finally understood how it worked. At the time this >>> was being done on a very early DSP chip with very limited memory, so >>> storing the N-1 previous samples was not an option. The guy >>> presenting the problem to me was the classic engineer who could read >>> and understand things, but couldn't explain anything to other people. >>> So I had to take the paper he gave me and finally figure it out for >>> myself. >>> >>> I will dig around when I get a chance and figure out exactly how this >>> worked. But the basic idea is that instead of saving multiple inputs >>> and calculating the outputs one at a time (iterating over the stored >>> inputs) - as each input comes in it is iterated over the stored >>> outputs and each output is ... well output, when it has been fully >>> calculated. Since there are fewer outputs than inputs (by the >>> decimation factor of M) you store fewer outputs than you would inputs. >>> >>> This is not hard to understand if you just stop thinking it *has* to >>> be the way you are currently seeing it. (shown here with simplified >>> notation since it is too messy to show the indexes with decimation) >>> >>> y(k) = a(0)x(i) + a(1)x(i-1) + a(2)x(i-2) + ... >>> y(k+1) = a(0)x(i+1) + a(1)x(i) + a(2)x(i-1) + ... >>> y(k+2) = a(0)x(i+2) + a(1)x(i+1) + a(2)x(i) + ... >>> y(k+3) = a(0)x(i+3) + a(1)x(i+2) + a(2)x(i+1) + ... 
>>> >>> Instead of thinking you have to calculate all the terms of y(k) at one >>> time (and so store the last N*M values of x), consider that when x(i) >>> arrives, you can calculate the x(i) terms and add them into y(k) >>> (which is then output) and y(k+1), y(k+2)... This allows the storage >>> of fewer values by a factor of M. It does require more memory fetches >>> than the standard MAC calculation since you have to fetch not only the >>> coefficient and the input, but also the intermediate output value >>> being calculated on each MAC operation. >>> >>> Maybe I won't have to dig it up. I'm pretty sure this is how it >>> worked. It ties in nicely with the "polyphase" aspect to allow a >>> minimum number of calculations on arrival of each input since not all >>> the coefficients are used on each input. So the coefficients are >>> divided into "phases" (N/M coefficients) with only one phase used for >>> any given input sample. >> >>Counter-example: Consider M = 8, N = 8 (8 coefficients and decimating by >>8). Then each output y(k) depends on x(k*M + n), 0 < n < M - 1, and thus >>the sets of input values for different outputs are disjoint. > > How each input will be used for each output is known when that input > arrives, so if all possible ouputs for that input have their own > partial product accumulator, the input can be multiplied by the > various coefficients for each partial product and accumulated in each, > and then forgotten. > > It's just another way to do it, but it can save a lot of hardware in a > hardware implementation, especially in an FPGA where the > multiply-accumulators are already built and just laying around, > anyway.
So are you asserting this is not a counter-example? To state that an N-tap FIR decimator does not require all N inputs for an output at time k is, in general, not true, as I have shown. I believe what you are claiming is that IN CERTAIN CASES you can get away with not using all N inputs at output time k. That I can agree with. However, neither you nor Rick put a constraint on your assertion. And I can see that, in certain situations, this can be mo' better, namely when you have one or more MACs sitting around not being used (as you said) and N > M. But in general it hardly seems a storage saver, as a MAC is a storage element plus arithmetic. -- Randy Yates, Embedded Firmware Developer Garner Underground, Inc. 866-260-9040, x3901 http://www.garnerundergroundinc.com
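For comparison, the conventional direct-form decimator that the storage argument above assumes, buffering the last N inputs in a delay line, might look like this (a minimal sketch with my own names, zero initial history assumed):

```python
def decimate_direct(samples, coeffs, M):
    """Direct-form decimate-by-M FIR: stores the last N input samples."""
    N = len(coeffs)
    hist = [0.0] * N                   # delay line; hist[0] holds the newest input
    outputs = []
    for i, x in enumerate(samples):
        hist = [x] + hist[:-1]         # shift the new input into the delay line
        if i % M == M - 1:             # every M-th input, compute one full output
            outputs.append(sum(a * h for a, h in zip(coeffs, hist)))
    return outputs
```

Either organization produces the same output sequence; the difference the thread debates is only where the state lives: N input registers here, versus ceil(N/M) output accumulators in the accumulate-as-inputs-arrive form.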