Reply by Tim Wescott October 7, 2011
On Fri, 07 Oct 2011 04:26:49 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.com> wrote:
>
> (snip, I wrote)
>>> I haven't done that much head scratching yet, but that is the idea.
>
>> I think you could do this pipelined, but I'd have to actually strain my brain, which I'm not inclined to do at the moment. (But you're welcome to take my example above and run with it).
>
>>>> * Each one will be the sum of the residual response of the filter specified by y[n-m] and y[n], and the impulse response of the filter to the samples up to the current one.
>
>>> (extension from two cycle delay to ten)
>>>>> If you can do that, then try to refactor for more delay before you see the output terms.
>
>>>> Oh, do I have to do the work twice???
>
>>> Maybe you don't. I found it easier to think of the two clock delay version first, then extend to 10.
>
>>>> Note that for OutputLogic, the above algorithm more or less applies, except that instead of taking multiple clocks to do the pipelined work, he just executes his 32 "FIR" filters all in one clock tick. The work will be a lot like doing it as FIR filters from scratch, the only difference is that the FIR filters are (maybe) shorter, and at the end he keeps two of the outputs for the next clock tick.
>
>>> Yes, shorter is always good. If a short FIR takes less logic than an IIR, just do that. In some cases, though, the FIR will be very long. The refactored IIR might be long, too.
>
>> I was unclear in my terminology. The example I'm using is to implement the IIR filter as a "FIR" filter (which is kind of what you do anyway); the math for doing either the IIR or a hypothetical FIR filter is going to be similar, but one may need fewer taps to do it with IIR.
>
> OK, what I was trying to describe, and what I thought you were agreeing with, is a partially unrolled IIR, so not all the way to FIR, but enough that the feedback path has a longer delay.
Uh, that _is_ what I was describing. This needs a math paper.

-- www.wescottdesign.com
Reply by glen herrmannsfeldt October 7, 2011
Tim Wescott <tim@seemywebsite.com> wrote:

(snip, I wrote)
>> I haven't done that much head scratching yet, but that is the idea.
> I think you could do this pipelined, but I'd have to actually strain my brain, which I'm not inclined to do at the moment. (But you're welcome to take my example above and run with it).
>>> * Each one will be the sum of the residual response of the filter specified by y[n-m] and y[n], and the impulse response of the filter to the samples up to the current one.
>> (extension from two cycle delay to ten)
>>>> If you can do that, then try to refactor for more delay before you see the output terms.
>>> Oh, do I have to do the work twice???
>> Maybe you don't. I found it easier to think of the two clock delay version first, then extend to 10.
>>> Note that for OutputLogic, the above algorithm more or less applies, except that instead of taking multiple clocks to do the pipelined work, he just executes his 32 "FIR" filters all in one clock tick. The work will be a lot like doing it as FIR filters from scratch, the only difference is that the FIR filters are (maybe) shorter, and at the end he keeps two of the outputs for the next clock tick.
>> Yes, shorter is always good. If a short FIR takes less logic than an IIR, just do that. In some cases, though, the FIR will be very long. The refactored IIR might be long, too.
> I was unclear in my terminology. The example I'm using is to implement the IIR filter as a "FIR" filter (which is kind of what you do anyway); the math for doing either the IIR or a hypothetical FIR filter is going to be similar, but one may need fewer taps to do it with IIR.
OK, what I was trying to describe, and what I thought you were agreeing with, is a partially unrolled IIR, so not all the way to FIR, but enough that the feedback path has a longer delay.

-- glen
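A minimal numerical sketch of the partial unrolling being discussed, assuming a first-order filter y[n] = a*y[n-1] + b*x[n] (the coefficients, block length M, and variable names are invented for illustration, not taken from the thread). Unrolling the recursion M times gives y[n] = a^M*y[n-M] + b*(x[n] + a*x[n-1] + ... + a^(M-1)*x[n-M+1]), so the feedback path only needs a value that is M samples old, while the rest is a short FIR on the inputs:

import numpy as np

# Assumed first-order IIR for illustration: y[n] = a*y[n-1] + b*x[n]
a, b = 0.95, 0.05
M = 10                                   # clocks of delay the feedback path must tolerate

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

# Reference: direct recursion, feedback delay of one sample
y_ref = np.zeros_like(x)
y_ref[0] = b * x[0]
for n in range(1, len(x)):
    y_ref[n] = a * y_ref[n - 1] + b * x[n]

# Partially unrolled: y[n] = a^M * y[n-M] + sum_{k=0}^{M-1} (b*a^k) * x[n-k]
# The recursive part now only needs y[n-M]; the rest is a short FIR on the inputs.
fir = b * a ** np.arange(M)
y_unr = np.zeros_like(x)
y_unr[:M] = y_ref[:M]                    # warm-up: first M outputs from the direct form
for n in range(M, len(x)):
    y_unr[n] = a**M * y_unr[n - M] + fir @ x[n:n - M:-1]

print(np.allclose(y_ref, y_unr))         # True: same filter, ten-sample feedback delay

Both loops compute the same outputs; the second could keep a deep multiplier pipeline busy because nothing in it depends on the immediately preceding output.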
Reply by Tim Wescott October 6, 2011
On Thu, 06 Oct 2011 22:09:29 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.com> wrote:
>
> (snip, I wrote)
>>> OK, say you want to build a filter, and the system you have to do it takes 10 sample periods to do a multiply. (That is, input sample times coefficient). You can do many multiplies in parallel, but you can't get the result for the first one out any faster.
>
> (snip FIR note)
>> Let's say that you have y[n-m], y[n], and x[n-10] through x[n] as your input data, with 0 < m <= 10. Then y[n+1] through y[n+10] are each going to be a weighted sum of the above 13 values*. You launch your massively pipelined arithmetic, and at the end of it you have y[n+1] through y[n+10]. Now you reset n to n+10, and you start over again.
>
> I had thought of it pipelined, one result per clock, but yes.
>
>> There's probably a way to do this where you re-index n at each step, but the details escape me at the moment.
>
>> Note that for a lot of filters it may be more accurate to maintain a filter state that's _not_ the output of the filter at any given time -- but then you'll have a tradeoff between more arithmetic and (possibly) reduced precision needs, for which you'd need to do some head scratching to resolve.
>
> I haven't done that much head scratching yet, but that is the idea.
I think you could do this pipelined, but I'd have to actually strain my brain, which I'm not inclined to do at the moment. (But you're welcome to take my example above and run with it).
>> * Each one will be the sum of the residual response of the filter specified by y[n-m] and y[n], and the impulse response of the filter to the samples up to the current one.
>
> (extension from two cycle delay to ten)
>>> If you can do that, then try to refactor for more delay before you see the output terms.
>
>> Oh, do I have to do the work twice???
>
> Maybe you don't. I found it easier to think of the two clock delay version first, then extend to 10.
>
>> Note that for OutputLogic, the above algorithm more or less applies, except that instead of taking multiple clocks to do the pipelined work, he just executes his 32 "FIR" filters all in one clock tick. The work will be a lot like doing it as FIR filters from scratch, the only difference is that the FIR filters are (maybe) shorter, and at the end he keeps two of the outputs for the next clock tick.
>
> Yes, shorter is always good. If a short FIR takes less logic than an IIR, just do that. In some cases, though, the FIR will be very long. The refactored IIR might be long, too.
>
> -- glen
I was unclear in my terminology. The example I'm using is to implement the IIR filter as a "FIR" filter (which is kind of what you do anyway); the math for doing either the IIR or a hypothetical FIR filter is going to be similar, but one may need fewer taps to do it with IIR.

-- www.wescottdesign.com
Reply by glen herrmannsfeldt October 6, 2011
Tim Wescott <tim@seemywebsite.com> wrote:

(snip, I wrote)
>> OK, say you want to build a filter, and the system you have to do it takes 10 sample periods to do a multiply. (That is, input sample times coefficient). You can do many multiplies in parallel, but you can't get the result for the first one out any faster.
(snip FIR note)
> Let's say that you have y[n-m], y[n], and x[n-10] through x[n] as your input data, with 0 < m <= 10. Then y[n+1] through y[n+10] are each going to be a weighted sum of the above 13 values*. You launch your massively pipelined arithmetic, and at the end of it you have y[n+1] through y[n+10]. Now you reset n to n+10, and you start over again.
I had thought of it pipelined, one result per clock, but yes.
> There's probably a way to do this where you re-index n at each step, but the details escape me at the moment.
> Note that for a lot of filters it may be more accurate to maintain a filter state that's _not_ the output of the filter at any given time -- but then you'll have a tradeoff between more arithmetic and (possibly) reduced precision needs, for which you'd need to do some head scratching to resolve.
I haven't done that much head scratching yet, but that is the idea.
> * Each one will be the sum of the residual response of the filter specified by y[n-m] and y[n], and the impulse response of the filter to the samples up to the current one.
(extension from two cycle delay to ten)
>> If you can do that, then try to refactor for more delay before you see the output terms.
> Oh, do I have to do the work twice???
Maybe you don't. I found it easier to think of the two clock delay version first, then extend to 10.
> Note that for OutputLogic, the above algorithm more or less applies, except that instead of taking multiple clocks to do the pipelined work, he just executes his 32 "FIR" filters all in one clock tick. The work will be a lot like doing it as FIR filters from scratch, the only difference is that the FIR filters are (maybe) shorter, and at the end he keeps two of the outputs for the next clock tick.
Yes, shorter is always good. If a short FIR takes less logic than an IIR, just do that. In some cases, though, the FIR will be very long. The refactored IIR might be long, too.

-- glen
Reply by Jerry Avins October 6, 2011
On 10/6/2011 4:32 PM, Tim Wescott wrote:
> On Thu, 06 Oct 2011 13:22:10 -0700, OutputLogic wrote:
>
>> @Glen,
>>
>> Ok, I see your point and what you're asking. The system is working at the max speed the FPGA fabric can provide. The latency has to be exactly one clock. The system is using built-in DSP modules in the FPGA that do multiplication in a single clock. Because routing delays to those DSPs are significant, which reduces the timing budget, in some places multiplications are done using a "shift+add" approach (aka CSD: canonic signed digit).
>>
>> Your suggestion might work if I take advantage of the "data interleaving" technique. For example, the FPGA fabric runs at 300 MHz, but an embedded DSP can run at 600. So theoretically that DSP can do 2 multiplications in one clock. Practically it's hard because of the additional logic overhead due to muxing and demuxing the data.
>
> Do you need every point of the filter output?
>
> In other words, you're taking in x[n], x[n+1], ... x[n+31]. Do you need to calculate y[n], y[n+1], ... y[n+31]? Or do you just need y[n+31], and the rest don't matter?
So you discard the first 30 outputs. Then what?

Jerry

--
Engineering is the art of making what you want from things you can get.
Reply by Tim Wescott October 6, 2011
On Thu, 06 Oct 2011 19:41:44 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.com> wrote:
>
> (snip on IIR vs FIR filters)
>> The real answer depends a _lot_ on how long of an FIR filter you'd need, and to some extent on the pole positions of the IIR (which is related to the FIR filter length, to be sure).
>
>> If you want to implement an IIR filter, the value of the output at y[n+m] only depends on y[n-1], y[n], and inputs at [n+1] through [n+m] (actually it would depend on y[n-optimal_lag], y[n], and the various inputs).
>
> OK, say you want to build a filter, and the system you have to do it takes 10 sample periods to do a multiply. (That is, input sample times coefficient). You can do many multiplies in parallel, but you can't get the result for the first one out any faster.
>
> Note that it is very easy to build an FIR filter with that restriction, though it might take a lot of logic. All you need to do is start 10 multiplies for each tap on each input (sample) clock, and add as appropriate. (Assume the add is fast enough not to limit it.)
Let's say that you have y[n-m], y[n], and x[n-10] through x[n] as your input data, with 0 < m <= 10. Then y[n+1] through y[n+10] are each going to be a weighted sum of the above 13 values*. You launch your massively pipelined arithmetic, and at the end of it you have y[n+1] through y[n+10]. Now you reset n to n+10, and you start over again.

There's probably a way to do this where you re-index n at each step, but the details escape me at the moment.

Note that for a lot of filters it may be more accurate to maintain a filter state that's _not_ the output of the filter at any given time -- but then you'll have a tradeoff between more arithmetic and (possibly) reduced precision needs, for which you'd need to do some head scratching to resolve.

* Each one will be the sum of the residual response of the filter specified by y[n-m] and y[n], and the impulse response of the filter to the samples up to the current one.
> Now, assume (temporarily) that a multiply only takes two clocks. Can you refactor an IIR filter such that, even with a y[n-1] term in the equation, you don't need the y[n-1] as input to the y[n] computation? (Even if it takes more multiplies to do so.) My first thought is that you can, though I haven't tried to write it out.
>
> If you can do that, then try to refactor for more delay before you see the output terms.
Oh, do I have to do the work twice???

Note that for OutputLogic, the above algorithm more or less applies, except that instead of taking multiple clocks to do the pipelined work, he just executes his 32 "FIR" filters all in one clock tick. The work will be a lot like doing it as FIR filters from scratch, the only difference is that the FIR filters are (maybe) shorter, and at the end he keeps two of the outputs for the next clock tick.

-- www.wescottdesign.com
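A sketch of the block computation described above (produce a block of outputs in one go, each as a single weighted sum of the carried outputs and the new inputs). It assumes an all-pole second-order filter to keep the bookkeeping small; the coefficients, block size, and names are invented for illustration, and a filter with feedforward taps would also have to carry its last couple of inputs across the block boundary:

import numpy as np

# Assumed all-pole 2nd-order IIR: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
a1, a2 = -1.2, 0.52          # poles at 0.6 +/- 0.4j (stable); made-up values
B = 10                       # block size: outputs produced per "launch"

# By linearity, each block output is a fixed weighted sum of the two carried
# outputs (y[n], y[n-1]) and the block inputs x[n+1..n+B].  Find the weights
# once by running the recursion on unit initial conditions / unit impulses.
Wy = np.zeros((B, 2))        # weights on (y[n], y[n-1])
Wx = np.zeros((B, B))        # weights on (x[n+1], ..., x[n+B])
for src in range(2 + B):
    y1, y2 = (1.0 if src == 0 else 0.0), (1.0 if src == 1 else 0.0)
    for k in range(B):
        xk = 1.0 if src - 2 == k else 0.0
        y = xk - a1 * y1 - a2 * y2
        if src < 2:
            Wy[k, src] = y
        else:
            Wx[k, src - 2] = y
        y1, y2 = y, y1

# Reference: plain sample-by-sample recursion
rng = np.random.default_rng(1)
x = rng.standard_normal(20 * B)
y_ref = np.zeros_like(x)
y_ref[0] = x[0]
y_ref[1] = x[1] - a1 * y_ref[0]
for n in range(2, len(x)):
    y_ref[n] = x[n] - a1 * y_ref[n - 1] - a2 * y_ref[n - 2]

# Block version: every output in the block is an independent dot product, so
# all B of them can be pushed into a deep multiplier pipeline (or, in
# OutputLogic's case, into parallel logic) at the same time.
y_blk = np.zeros_like(x)
y1 = y2 = 0.0                # carried state from the previous block
for n0 in range(0, len(x), B):
    yb = Wy @ np.array([y1, y2]) + Wx @ x[n0:n0 + B]
    y_blk[n0:n0 + B] = yb
    y1, y2 = yb[-1], yb[-2]

print(np.allclose(y_ref, y_blk))   # True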
Reply by OutputLogic October 6, 2011
> Do you need every point of the filter output?
>
> In other words, you're taking in x[n], x[n+1], ... x[n+31]. Do you need to calculate y[n], y[n+1], ... y[n+31]? Or do you just need y[n+31], and the rest don't matter?
Yes, I do need all the filter outputs. I understand that the implementation complexity grows from output to output, probably exponentially, when doing the "loop unfolding": y[n] is the simplest, y[n+31] is the most complex.

Thanks,
Evgeni
Reply by Tim Wescott October 6, 2011
On Thu, 06 Oct 2011 13:22:10 -0700, OutputLogic wrote:

> @Glen,
>
> Ok, I see your point and what you're asking. The system is working at the max speed the FPGA fabric can provide. The latency has to be exactly one clock. The system is using built-in DSP modules in the FPGA that do multiplication in a single clock. Because routing delays to those DSPs are significant, which reduces the timing budget, in some places multiplications are done using a "shift+add" approach (aka CSD: canonic signed digit).
>
> Your suggestion might work if I take advantage of the "data interleaving" technique. For example, the FPGA fabric runs at 300 MHz, but an embedded DSP can run at 600. So theoretically that DSP can do 2 multiplications in one clock. Practically it's hard because of the additional logic overhead due to muxing and demuxing the data.
Do you need every point of the filter output?

In other words, you're taking in x[n], x[n+1], ... x[n+31]. Do you need to calculate y[n], y[n+1], ... y[n+31]? Or do you just need y[n+31], and the rest don't matter?

-- www.wescottdesign.com
Reply by OutputLogic October 6, 2011
@Glen,

Ok, I see your point and what you're asking.
The system is working at the max speed the FPGA fabric can provide.
The latency has to be exactly one clock.
The system is using built-in DSP modules in the FPGA that do multiplication in a single clock. Because routing delays to those DSPs are significant, which reduces the timing budget, in some places multiplications are done using a "shift+add" approach (aka CSD: canonic signed digit).

Your suggestion might work if I take advantage of the "data interleaving" technique. For example, the FPGA fabric runs at 300 MHz, but an embedded DSP can run at 600. So theoretically that DSP can do 2 multiplications in one clock. Practically it's hard because of the additional logic overhead due to muxing and demuxing the data.

Thanks,
Evgeni
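As a toy illustration of the shift+add (CSD) multiplication mentioned above, here is a made-up integer coefficient worked through; nothing in it is from the actual design. The constant 57 has the canonic-signed-digit form 64 - 8 + 1, so the multiply reduces to two shifts and two add/subtract operations:

# Constant multiply by 57 using its CSD form 57 = 2**6 - 2**3 + 2**0.
# In an FPGA the shifts are just wiring, so this costs two adders and no DSP block.
def mul57_csd(x: int) -> int:
    return (x << 6) - (x << 3) + x

# Quick check against an ordinary multiply
print(all(mul57_csd(v) == 57 * v for v in range(-256, 256)))   # True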
Reply by glen herrmannsfeldt October 6, 2011
Tim Wescott <tim@seemywebsite.com> wrote:

(snip on IIR vs FIR filters)
> The real answer depends a _lot_ on how long of an FIR filter you'd need, and to some extent on the pole positions of the IIR (which is related to the FIR filter length, to be sure).
> If you want to implement an IIR filter, the value of the output at y[n+m] only depends on y[n-1], y[n], and inputs at [n+1] through [n+m] (actually it would depend on y[n-optimal_lag], y[n], and the various inputs).
OK, say you want to build a filter, and the system you have to do it takes 10 sample periods to do a multiply. (That is, input sample times coefficient). You can do many multiplies in parallel, but you can't get the result for the first one out any faster.

Note that it is very easy to build an FIR filter with that restriction, though it might take a lot of logic. All you need to do is start 10 multiplies for each tap on each input (sample) clock, and add as appropriate. (Assume the add is fast enough not to limit it.)

Now, assume (temporarily) that a multiply only takes two clocks. Can you refactor an IIR filter such that, even with a y[n-1] term in the equation, you don't need the y[n-1] as input to the y[n] computation? (Even if it takes more multiplies to do so.) My first thought is that you can, though I haven't tried to write it out.

If you can do that, then try to refactor for more delay before you see the output terms.
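A quick numeric check of the two-clock refactoring asked about here, assuming a first-order filter y[n] = a*y[n-1] + x[n] (the coefficient and data are made up). Substituting the recursion into itself once gives y[n] = a^2*y[n-2] + a*x[n-1] + x[n], so computing y[n] no longer needs y[n-1]:

# Assumed first-order IIR: y[n] = a*y[n-1] + x[n]
a = 0.9
x = [0.3, -1.1, 0.7, 0.2, -0.5, 1.4]

# Direct form: each output needs the immediately preceding output.
y = [x[0], a * x[0] + x[1]]
for n in range(2, len(x)):
    y.append(a * y[n - 1] + x[n])

# Refactored: y[n] = a^2*y[n-2] + a*x[n-1] + x[n] -- only y[n-2] feeds back,
# so a two-clock multiplier in the feedback path is no longer a problem.
y2 = [x[0], a * x[0] + x[1]]
for n in range(2, len(x)):
    y2.append(a * a * y2[n - 2] + a * x[n - 1] + x[n])

print(all(abs(u - v) < 1e-12 for u, v in zip(y, y2)))   # True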
> Any necessary recursion can be handled prior to implementing the filter. So the IIR filter shouldn't be more onerous than (say) a 16-tap FIR filter. Even the need for high precision may be relaxed: the effective poles of the IIR filter become your original pole positions raised to the 32nd power; this pulls them away from 1 and reduces the need for high precision.
-- glen
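A quick check of the pole-contraction remark quoted above: a pole of radius r, seen by a recursion that only feeds back once every 32 samples, behaves like a pole of radius r^32, which sits much farther inside the unit circle. The radius below is an arbitrary example value:

# A pole near the unit circle, fed back only every 32 samples, contracts sharply.
r = 0.995
print(r ** 32)        # ~0.852: the effective pole radius, noticeably farther from 1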