IIR implementation on ARM

Started by luminous March 9, 2011
Hi Comp.DSP.

I have a question about how a biquad is best implemented on an ARM
processor supporting the "ARM DSP-enhanced instructions", giving as low
round-off noise as possible.

I have received support questions about this, but since I'm not an ARM
programmer I'm asking here.

A biquad needs multiply-and-accumulate instructions so I have looked at the
"ARM DSP-enhanced instructions" and found the SMLAWy instruction, which
takes one 16-bit and one 32-bit argument, multiplies them and accumulates
to a 32-bit result.

Is this the instruction to use?

But I don't understand what is gained by multiplying a 16-bit number by a
32-bit number, which yields a 48-bit number, and then truncating it to 32
bits.

What is then gained over multiplying two 16-bit numbers, which also yields
a 32-bit number?

Your help is much appreciated,

Best regards
Viktor
On 03/09/2011 09:26 AM, luminous wrote:
> Hi Comp.DSP.
>
> I have a question about how a biquad is best implemented on an ARM
> processor supporting the "ARM DSP-enhanced instructions", giving as low
> round-off noise as possible.
>
> I have received support issues about this but since I'm not an ARM
> programmer I'm asking here.
>
> A biquad needs multiply-and-accumulate instructions so I have looked at the
> "ARM DSP-enhanced instructions" and found the SMLAWy instruction, which
> takes one 16-bit and one 32-bit argument, multiplies them and accumulates
> to a 32-bit result.
>
> Is this the instruction to use?
>
> But I don't understand what is gained by multiplying a 16-bit number by a
> 32-bit number, which yields a 48-bit number, then truncate it to 32 bits?
>
> What is then gained over multiplying two 16 bit numbers which also yields a
> 32 bit number?
>
> Your help is much appreciated,

Pay close attention to the implied radix point of the multiplication,
and double-check to see if the accumulator really is a 32-bit result, or
if it is extended (it's not always called out directly -- TI calls their
accumulator extension an "overflow count").

Doing a MAC operation on a vector of 16-bit fractional coefficients
against a vector of 32-bit fractional data gives you the full precision
of the data. If you can do a shift as part of the vector multiply (a
'real' fixed-point DSP chip will do this, one way or another) then you
get loads more precision in the data than you do with 16-bit inputs.

Pick through the instruction set _carefully_. The normal 'real DSP' way
to do this (as done on the Motorola 56000, ADI 2100, and TI TMS320F28xx
processors) goes more or less:

* Set up your loop
* zip through a bunch of MAC instructions, one per clock cycle
* shift the extended accumulator as necessary
  (the TI part does this as part of the MAC)
* Test the accumulator extension for overflow and saturate
* get on with life.

Everyone has a different way of letting you accomplish this -- but all
the fixed point DSP chips that I've ever evaluated let you do _all_ of
this. ARM should provide this functionality, too, but.

but.

but.

There's two meanings to the word "should". One is "can be reasonably
expected to". The other is "is morally obliged to". I fear in this
case that the operative meaning is the second -- you have to dig to see
if the chip lives up to the first.

I hope this -- even the cynical parts -- helps.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html
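
The loop described in the list above (MAC, shift, test for overflow, saturate) can be sketched in portable C. This is only an illustration, not code from the thread: the names `mac_q15_q31` and `saturate32` are made up for the example, the Q15 coefficient and Q31 data formats are assumptions, and a plain `int64_t` stands in for a DSP's extended accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide accumulator to the signed 32-bit range. */
static int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/*
 * Hypothetical helper: dot product of Q15 coefficients and Q31 data.
 * Each product has 15 + 31 = 46 fractional bits. The int64_t accumulator
 * plays the role of an extended accumulator, so partial sums may exceed
 * full scale without wrapping (safe for up to 65536 taps).
 */
int32_t mac_q15_q31(const int16_t *coef, const int32_t *data, size_t n)
{
    assert(coef != NULL && data != NULL);

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)coef[i] * data[i];   /* one MAC per tap */
    }

    /* Rescale from 46 to 31 fractional bits. Right-shifting a negative
     * value is implementation-defined in C; mainstream compilers use an
     * arithmetic shift, which is what this sketch assumes. */
    acc >>= 15;

    return saturate32(acc);                  /* test for overflow, saturate */
}
```

Saturation happens once, after the loop, which is the point of the extended accumulator: intermediate overflow is harmless as long as the final sum is in range or gets clamped.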
On 03/09/2011 09:26 AM, luminous wrote:
> Hi Comp.DSP.
>
> I have a question about how a biquad is best implemented on an ARM
> processor supporting the "ARM DSP-enhanced instructions", giving as low
> round-off noise as possible.
>
> I have received support issues about this but since I'm not an ARM
> programmer I'm asking here.
>
> A biquad needs multiply-and-accumulate instructions so I have looked at the
> "ARM DSP-enhanced instructions" and found the SMLAWy instruction, which
> takes one 16-bit and one 32-bit argument, multiplies them and accumulates
> to a 32-bit result.
>
> Is this the instruction to use?
>
> But I don't understand what is gained by multiplying a 16-bit number by a
> 32-bit number, which yields a 48-bit number, then truncate it to 32 bits?
>
> What is then gained over multiplying two 16 bit numbers which also yields a
> 32 bit number?
>
> Your help is much appreciated,

What processor are you using? I'm looking at the ARMv7 architecture
manual, and it has a 64bit = 64bit + 32bit * 32bit instruction, and a
32bit = 32bit + 32bit * 32bit instruction, but not the one that you're
calling out.

If you treat the coefficients right, you should be able to use the ARMv7
SMLAL instruction to get all the precision you'll ever need.

Note that this is _not_ a 'true DSP' MAC -- it lacks hardware looping,
and the extended accumulator. But it gets you maybe half way there from
a plain-jane general-purpose processor, and may be all you need.

--

Tim Wescott
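
A rough C sketch of what a biquad built on a 64bit = 64bit + 32bit * 32bit accumulate might look like follows. It is an illustration under assumptions, not anyone's posted code: the direct form I structure, the Q2.30 coefficient format, the Q1.31 data format, and the names `smlal`, `biquad64` and `biquad64_step` are all chosen for the example, and `smlal` is a portable stand-in rather than the instruction itself.

```c
#include <stdint.h>

/* Portable stand-in for SMLAL: 64-bit accumulator += 32-bit * 32-bit. */
static inline int64_t smlal(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* Clamp a wide value to the signed 32-bit range. */
static int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

typedef struct {
    int32_t b0, b1, b2;   /* feed-forward coefficients, Q2.30 (assumed) */
    int32_t na1, na2;     /* negated feedback coefficients -a1, -a2, Q2.30 */
    int32_t x1, x2;       /* previous two inputs, Q1.31 */
    int32_t y1, y2;       /* previous two outputs, Q1.31 */
} biquad64;

/*
 * One sample of a direct form I biquad:
 *   y = b0*x + b1*x1 + b2*x2 - a1*y1 - a2*y2
 * Q2.30 * Q1.31 products are Q3.61 and fill all 64 bits, so there are no
 * guard bits beyond that: the sum of coefficient magnitudes is assumed to
 * stay below 4.0, otherwise the accumulator itself can overflow.
 */
int32_t biquad64_step(biquad64 *s, int32_t x)
{
    int64_t acc = 0;
    acc = smlal(acc, s->b0,  x);
    acc = smlal(acc, s->b1,  s->x1);
    acc = smlal(acc, s->b2,  s->x2);
    acc = smlal(acc, s->na1, s->y1);
    acc = smlal(acc, s->na2, s->y2);

    /* Drop 30 fractional bits to return to Q1.31 (arithmetic shift assumed). */
    int32_t y = saturate32(acc >> 30);

    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}
```

All five products are summed at full width and quantized once per output sample, which is where the precision comes from.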
On Wed, 09 Mar 2011 11:26:47 -0600, luminous wrote:

> Hi Comp.DSP.
>
> I have a question about how a biquad is best implemented on an ARM
> processor supporting the "ARM DSP-enhanced instructions", giving as low
> round-off noise as possible.
>
> I have recieved support issues about this but since I'm not an ARM
> programmer I'm asking here.
>
> A biquad needs multiply-and-accumulate instructions so I have looked at
> the "ARM DSP-enhanced instructions" and found the SMLAWy instruction,
> which takes one 16-bit and one 32-bit argument, multiplies them and
> accumulates to a 32-bit result.
>
> Is this the instruction to use?

It is useful for many things. If you can scale your data so that you
don't overflow in the add part, it is all you need (ARM doesn't have the
"extension" bits that regular DSPs do). This is the preferred approach,
because these instructions have a one-cycle throughput (maybe a couple
of cycles latency).

If you need to operate left-aligned and guard against overflow with
continuous saturation, you might prefer to use sequences of
SMULWB;QADD... so that the addition part saturates. This is slower, of
course.
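
To make the difference between the two approaches concrete, here is a small C model of the three instructions mentioned. These are portable stand-ins written for illustration, not compiler intrinsics; the function names simply mirror the mnemonics. The behaviour modelled is that SMULWB keeps the top 32 bits of the 48-bit product, SMLAWB adds that to an accumulator that wraps on overflow, and QADD saturates.

```c
#include <stdint.h>

/* SMULWB stand-in: signed 32x16 multiply, keep the top 32 bits of the
 * 48-bit product (arithmetic right shift assumed for negative values). */
static inline int32_t smulwb(int32_t x, int16_t c)
{
    return (int32_t)(((int64_t)x * c) >> 16);
}

/* SMLAWB stand-in: multiply as above, then accumulate. The add wraps
 * modulo 2^32 on overflow (the hardware only sets the Q flag). The
 * conversion back to int32_t assumes two's-complement wrapping. */
static inline int32_t smlawb(int32_t x, int16_t c, int32_t acc)
{
    return (int32_t)((uint32_t)acc + (uint32_t)smulwb(x, c));
}

/* QADD stand-in: 32-bit add that clamps instead of wrapping. */
static inline int32_t qadd(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}
```

With data scaled so the sum never leaves the 32-bit range, `smlawb(x, c, acc)` is all that is needed per tap. When that cannot be guaranteed, `qadd(acc, smulwb(x, c))` gives the saturating behaviour at the cost of a second operation per tap.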
> But I don't understand what is gained by multiplying a 16-bit number by
> a 32-bit number, which yields a 48-bit number, then truncate it to 32
> bits?

Because your registers are all 32 bits, and it takes two cycles to write
a 64-bit result into two of them. There *are* instructions that multiply
and multiply-accumulate into 64 bits, but they're slower than the SMLAWB
ones (but faster than doing the same thing long-hand).

> What is then gained over multiplying two 16 bit numbers which also
> yields a 32 bit number?

You get to use data that is wider than 16 bits, with round-off (actually truncation, which is worse, but only by 3dB) noise injected 180-ish dB below full scale, which is often fine.
> Your help is much appreciated,
Whether you can get away with SMLAWB and friends depends quite a bit on
the exact nature of your IIR filter, including both code structure and
filter coefficients. There are many situations where I've found that you
really do want the 64-bit results for the feedback paths. This is slower
than just using 32-bit values, but faster than other work-arounds like
error-feedback structures. At least you have the choice. I've met DSP
processors that had no extended precision support whatsoever.

ARM CPUs are *not* DSPs. The scheduling of loads, stores and operations
requires a significantly different approach. They aren't bad, though,
and they can certainly be persuaded to perform comparably to many DSPs.
(At least within the same order of magnitude...)

Cheers,

--
Andrew
On Wed, 09 Mar 2011 11:32:50 -0800, Tim Wescott wrote:

> What processor are you using? I'm looking at the ARMv7 architecture
> manual, and it has a 64bit = 64bit + 32bit * 32bit instruction, and a
> 32bit = 32bit + 32bit * 32bit instruction, but not the one that you're
> calling out.

ARMv7 certainly has all of the ARMv5TE DSP instructions (SMLAWB and
friends), as well as the two-element "SIMD" variants added in ARMv6.
What did surprise me is that the same integer/fixed-point model was not
extended into the NEON SIMD instruction set. There the available
instructions are as you describe, I think.

Cheers,

--
Andrew
Thank you for the answers! I'll have to think about this a bit before I ask
more...

/Viktor 
>Pay close attention to the implied radix point of the multiplication,
>and double-check to see if the accumulator really is a 32-bit result, or
>if it is extended (it's not always called out directly -- TI calls their
>accumulator extension an "overflow count").
>
>Doing a MAC operation on a vector of 16-bit fractional coefficients
>against a vector of 32-bit fractional data gives you the full precision
>of the data. If you can do a shift as part of the vector multiply (a
>'real' fixed-point DSP chip will do this, one way or another) then you
>get loads more precision in the data than you do with 16-bit inputs.
>
>Pick through the instruction set _carefully_. The normal 'real DSP' way
>to do this (as done on the Motorola 56000, ADI 2100, and TI TMS320F28xx
>processors) goes more or less:
>
>* Set up your loop
>* zip through a bunch of MAC instructions, one per clock cycle
>* shift the extended accumulator as necessary
>  (the TI part does this as part of the MAC)
>* Test the accumulator extension for overflow and saturate
>* get on with life.
>
>Everyone has a different way of letting you accomplish this -- but all
>the fixed point DSP chips that I've ever evaluated let you do _all_ of
>this. ARM should provide this functionality, too, but.
>
>but.
>
>but.
>
>There's two meanings to the word "should". One is "can be reasonably
>expected to". The other is "is morally obliged to". I fear in this
>case that the operative meaning is the second -- you have to dig to see
>if the chip lives up to the first.
>
>I hope this -- even the cynical parts -- helps.
>
>--
>
>Tim Wescott

Hi Tim, thanks for the explanation.

The SMLAWy instruction that I refer to multiplies a 16-bit number with a
32-bit number and then selects the most significant 32 bits of the 48-bit
result, so I guess you can call it extended precision. You get the
precision of a 48-bit accumulator and a 16-bit shift for free; I'm
beginning to grasp the benefits of this.

>On Wed, 09 Mar 2011 11:32:50 -0800, Tim Wescott wrote:
>
>> What processor are you using? I'm looking at the ARMv7 architecture
>> manual, and it has a 64bit = 64bit + 32bit * 32bit instruction, and a
>> 32bit = 32bit + 32bit * 32bit instruction, but not the one that you're
>> calling out.
>
>ARMv7 certainly has all of the ARMv5TE DSP instructions (SMLAWB and
>friends), as well as the two-element "SIMD" variants added in ARMv6.
>What did surprise me is that the same integer/fixed-point model was not
>extended into the NEON SIMD instruction set. There the available
>instructions are as you describe, I think.
>
>Cheers,
>
>--
>Andrew

I'm interested in a biquad implementation on the 5E version of the ARM
architecture. I have drawn a picture of how I imagine that a biquad
implementation can be done with the SMLAWy instruction. You can find it
here:

http://dl.dropbox.com/u/10432980/biqARM.pdf

A shift of 16 bits is inherent in the accumulator and not drawn as a
separate block. I have indicated the scaling in different parts of the
diagram with a QI.F notation, where I is the number of integer bits and F
the number of fractional bits. The shift of 2 bits to the left is done to
keep the integer part from growing indefinitely due to the
multiplications in the feedback path.

I would be very glad for any comments on this. Is there a smarter or more
traditional way to do it so that the shifter isn't needed?

/Viktor
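
One possible reading of the scheme described above, written as portable C. The linked diagram is not reproduced here, so everything specific in this sketch is an assumption: Q2.14 coefficients, Q1.31 data and state, a direct form I structure, and the names `biquad_w`, `biquad_w_step` and `shl2_sat`. With those formats the 48-bit product is Q3.45, the inherent 16-bit shift leaves Q3.29, and the 2-bit left shift restores Q1.31.

```c
#include <stdint.h>

/* SMLAWB stand-in: acc + top 32 bits of the 48-bit product x * c.
 * The add wraps on overflow, as the instruction does. */
static inline int32_t smlawb(int32_t x, int16_t c, int32_t acc)
{
    int32_t p = (int32_t)(((int64_t)x * c) >> 16);
    return (int32_t)((uint32_t)acc + (uint32_t)p);
}

/* Shift left by 2 with saturation: Q3.29 -> Q1.31. */
static int32_t shl2_sat(int32_t v)
{
    if (v > INT32_MAX / 4) return INT32_MAX;
    if (v < INT32_MIN / 4) return INT32_MIN;
    return v * 4;
}

typedef struct {
    int16_t b0, b1, b2;       /* feed-forward coefficients, Q2.14 (assumed) */
    int16_t na1, na2;         /* negated feedback coefficients, Q2.14 */
    int32_t x1, x2, y1, y2;   /* delay line, Q1.31 */
} biquad_w;

/* One output sample. The accumulator is Q3.29, which gives two integer
 * bits of headroom while the five terms are summed. */
int32_t biquad_w_step(biquad_w *s, int32_t x)
{
    int32_t acc = 0;
    acc = smlawb(x,     s->b0,  acc);
    acc = smlawb(s->x1, s->b1,  acc);
    acc = smlawb(s->x2, s->b2,  acc);
    acc = smlawb(s->y1, s->na1, acc);
    acc = smlawb(s->y2, s->na2, acc);

    int32_t y = shl2_sat(acc);   /* the 2-bit left shift from the post */

    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}
```

In this reading the left shift leaves the two low bits of every output at zero, and each of the five products is truncated separately before it is added, so the result carries a little more quantization noise than a single wide accumulation would.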
i know this is old, but i hadn't paid too much attention to the
thread...

On Mar 9, 3:32 pm, Tim Wescott <t...@seemywebsite.com> wrote:
> On 03/09/2011 09:26 AM, luminous wrote:
>
> > I have a question about how a biquad is best implemented on an ARM
> > processor supporting the "ARM DSP-enhanced instructions", giving as low
> > round-off noise as possible.
>
> What processor are you using? I'm looking at the ARMv7 architecture
> manual, and it has a 64bit = 64bit + 32bit * 32bit instruction, ...
>
> If you treat the coefficients right, you should be able to use the ARMv7
> SMLAL instruction to get all the precision you'll ever need.
>
> Note that this is _not_ a 'true DSP' MAC -- it lacks hardware looping,
> and the extended accumulator.

Tim, how is a 32x32 --> 64 bit MAC not an extended accumulator? do
you mean guard bits to the left?

if you treat these all as integers, you can adjust the coefficients so
that you trade off bits on the right (of the coefs) for guard bits on
the left. to get the properly scaled 32 bit result, you still need to
make use of a barrel shift, but it wouldn't necessarily be a right shift
of 31 bits. it all depends on how many bits to the left of the binary
point you give your fixed-point coefficients.

i dunno, never used an ARM in my life, but i'm not opposed to the
idea.

r b-j
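
The trade-off between coefficient fraction bits and guard bits can be shown with a short C sketch. The helper names `coef_to_fixed` and `mac_scaled` are invented for this example and Q1.31 data is assumed. The only thing that changes between coefficient formats is the number of fractional bits, and the final right shift follows it.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize a real coefficient to 32-bit fixed point with frac_bits
 * fractional bits (round to nearest). Hypothetical helper. */
int32_t coef_to_fixed(double c, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 31);
    double scaled = c * (double)(INT64_C(1) << frac_bits);
    assert(scaled >= -2147483648.0 && scaled < 2147483647.5);
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/*
 * Sum of coef[i] * x[i] with Q1.31 data and a 64-bit accumulator.
 * With frac_bits = 31 the accumulator only spans -2..+2; every bit moved
 * from the coefficient fraction to its integer part doubles that range,
 * e.g. frac_bits = 28 spans -16..+16. The final shift equals frac_bits
 * (arithmetic right shift assumed for negative sums).
 */
int32_t mac_scaled(const int32_t *coef, const int32_t *x, int n, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 31);

    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int64_t)coef[i] * x[i];
    }

    int64_t y = acc >> frac_bits;   /* back to Q1.31 */
    if (y > INT32_MAX) return INT32_MAX;
    if (y < INT32_MIN) return INT32_MIN;
    return (int32_t)y;
}
```

For example, a coefficient of 0.5 applied to a Q1.31 input gives the same output whether it is stored with 31 or 28 fractional bits; only the shift count differs. A coefficient such as -1.5 does not fit at all with 31 fractional bits, which is one practical reason to give up a few bits on the right.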
On 03/25/2011 12:45 PM, robert bristow-johnson wrote:
> i know this is old, but i hadn't paid too much attention to the
> thread...
>
> On Mar 9, 3:32 pm, Tim Wescott<t...@seemywebsite.com> wrote:
>> On 03/09/2011 09:26 AM, luminous wrote:
>>
>>> I have a question about how a biquad is best implemented on an ARM
>>> processor supporting the "ARM DSP-enhanced instructions", giving as low
>>> round-off noise as possible.
>>
>> What processor are you using? I'm looking at the ARMv7 architecture
>> manual, and it has a 64bit = 64bit + 32bit * 32bit instruction, ...
>>
>> If you treat the coefficients right, you should be able to use the ARMv7
>> SMLAL instruction to get all the precision you'll ever need.
>>
>> Note that this is _not_ a 'true DSP' MAC -- it lacks hardware looping,
>> and the extended accumulator.
>
> Tim, how is a 32x32 --> 64 bit MAC not an extended accumulator? do
> you mean guard bits to the left?

Yes, I mean guard bits to the left. The integer DSP chips that I've seen all provide them. Granted, this is much less necessary when the multiplication is 32x32 into 64, as opposed to 16x16 into 32.
> if you treat these all as integers,
> you can adjust the coefficients so that you trade off bits on the
> right (of the coefs) for guard bits on the left. to get the properly
> scaled 32 bit result, you still need to make use of a barrel shift,
> but it wouldn't necessarily be a right shift of 31 bits. it all
> depends on how many bits to the left of the binary point you give your
> fixed-point coefficients.
>
> i dunno, never used an ARM in my life, but i'm not opposed to the
> idea.

I think it's neat that they have the instructions. It's still a long
ways off from what you can do with a (relatively) cheap integer DSP
chip, but it's not a bad thing at all.

--

Tim Wescott