128-taps signed FIR ---> 128-taps float FIR -- (ASM) -- Please help!

Started by nino...@yahoo.it April 24, 2006
Hi all,
I'm completing my thesis, working with a TMS320C6713 DSK and
studying from Chassaing's book. I'm filtering the signals from 3 magnetic field
sensors (x, y, z) with a 100-tap FIR filter (flat-top windowed),
and that works fine, but the CPU load graph shows about 50% (perhaps normal when filtering 3 signals independently), so I want to optimize my FIR function, at least
by using ASM FIR code and circular buffering via the AMR.
Chassaing's book provides FIRcircfunc.asm, but the problem is that
I work with 32-bit float samples, while this .asm function uses the 16-bit short type.
I have tried, but I haven't managed to modify the .asm code yet.
I'm still learning to program in C, and ASM code is still difficult for me, although I'm sure that adapting the code is not hard for someone who knows ASM programming.
The code is below; if someone can modify it from a short-type 128-tap FIR
to a float-type 128-tap FIR, I'll be very grateful.
Thanks a lot, bye all.

Nino
;FIRcircfunc.asm ASM function called from C using circular addressing
;A4=newest sample, B4=coefficient address, A6=filter order
;Delay samples organized: x[n-(N-1)]...x[n]; coeff as h(0)...h[N-1]

.def _fircircfunc
.def last_addr
.def delays
.sect "circdata" ;circular data section
.align 256 ;align delay buffer 256-byte boundary
delays .space 256 ;init 256-byte buffer with 0's
last_addr .int last_addr-1 ;point to bottom of delays buffer
.text ;code section
_fircircfunc: ;FIR function using circ addr

MV A6,A1 ;setup loop count
MPY A6,2,A6 ;since dly buffer data as byte
ZERO A8 ;init A8 for accumulation
ADD A6,B4,B4 ;since coeff buffer data as bytes
SUB B4,1,B4 ;B4=bottom coeff array h[N-1]

MVKL 0x00070040,B6 ;select A7 as pointer and BK0
MVKH 0x00070040,B6 ;BK0 for 256 bytes (128 shorts)

MVC B6,AMR ;set address mode register AMR

MVKL last_addr,A9 ;A9=last circ addr(lower 16 bits)
MVKH last_addr,A9 ;last circ addr (higher 16 bits)
LDW *A9,A7 ;A7=last circ addr
NOP 4
STH A4,*A7++ ;newest sample-->last address

loop: ;begin FIR loop
LDH *A7++,A2 ;A2=x[n-(N-1)+i] i=0,1,...,N-1
|| LDH *B4--,B2 ;B2=h[N-1-i] i=0,1,...,N-1
SUB A1,1,A1 ;decrement count
[A1] B loop ;branch to loop if count # 0
NOP 2
MPY A2,B2,A6 ;A6=x[n-(N-1)+i]*h[N-1-i]
NOP
ADD A6,A8,A8 ;accumulate in A8

STW A7,*A9 ;store last circ addr to last_addr
B B3 ;return addr to calling routine
MV A8,A4 ;result returned in A4
NOP 4
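For readers who want something simpler to check the assembly against, the same idea (a circular delay line, so that no samples are shifted on each call) can be modelled in portable C with float data. This is only a sketch under stated assumptions, not a translation of the routine above: the names `fir_circ_t`, `fir_circ_step`, `FIR_BUF_LEN` and `FIR_BUF_MASK` are hypothetical, the buffer length is assumed to be a power of two so that wrapping is a bit mask, and the hardware AMR is replaced by explicit index arithmetic.

```c
#define FIR_BUF_LEN  128u                  /* power of two, like the 128-entry delay buffer */
#define FIR_BUF_MASK (FIR_BUF_LEN - 1u)

typedef struct {
    float    delay[FIR_BUF_LEN];  /* circular delay line, must start zeroed */
    unsigned pos;                 /* index where the next sample is written */
} fir_circ_t;

/* Filter one sample. h[0..ntaps-1] are the coefficients, ntaps <= FIR_BUF_LEN. */
float fir_circ_step(fir_circ_t *f, const float *h, unsigned ntaps, float x)
{
    float    acc = 0.0f;
    unsigned idx;
    unsigned k;

    f->delay[f->pos] = x;                   /* newest sample overwrites the oldest slot */
    idx = f->pos;
    for (k = 0; k < ntaps; k++) {
        acc += h[k] * f->delay[idx];        /* h[k] * x[n-k] */
        idx = (idx - 1u) & FIR_BUF_MASK;    /* step back one sample, wrapping around */
    }
    f->pos = (f->pos + 1u) & FIR_BUF_MASK;  /* advance the write position */
    return acc;
}
```

Feeding a single 1.0f followed by zeros should return the coefficients h[0], h[1], ... in order, which gives a quick sanity check before comparing against any assembly version.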
Try the attached FIR filtering code. Design the FIR filter using FDATool in
MATLAB and generate the header "fir_coeff.h" with this tool. You can then
use the coefficients in the main code after some modifications to the header file.

Rifat

Hi,
thank you, Rifat Edizkan. The code you sent me is similar
to the code I'm using now, and it's not very efficient
because it uses a for loop to shift the delayed samples.
I want to use circular buffering with the AMR, so that I move
pointers instead and achieve the same result with less DSP effort.
The ASM code I provided in my first post does that,
but it filters with samples and coefficients in short
format.
I could export the FDATool coefficients as shorts, but I don't
want to, because I need more precision when calculating
the filtered signals.
I tried to adapt the .asm code from a short-type 128-tap FIR
to a float-type 128-tap FIR, but this ASM is new to me. My
knowledge is of Microchip ASM code, and it's not the same at all!

Below is the code I modified to filter float samples
with float coefficients. I have added comments in the parts where I don't
know what to do.
Thanks to all for the help, bye all!
;FIRcircfunc.asm ASM function called from C using circular addressing
;A4=newest sample, B4=coefficient address, A6=filter order
;Delay samples organized: x[n-(N-1)]...x[n]; coeff as h(0)...h[N-1]

.def _fircircfunc
.def last_addr
.def delays
.sect "circdata" ;circular data section
.align 512 ;align delay buffer 512-byte boundary
delays .space 512 ;init 512-byte buffer with 0's
last_addr .int last_addr-1 ;point to bottom of delays buffer
;(last_addr-1 or last_addr-3 ?)
.text ;code section
_fircircfunc: ;FIR function using circ addr

MV A6,A1 ;setup loop count
MPY A6,4,A6 ;since dly buffer data as byte (N=4xN)
ZERO A8 ;init A8 for accumulation
ADD A6,B4,B4 ;since coeff buffer data as bytes
SUB B4,1,B4 ;B4=bottom coeff array h[N-1]
;(SUB B4,1,B4 or SUB B4,3,B4 ?)

MVKL 0x00080040,B6 ;select A7 as pointer and BK0
MVKH 0x00080040,B6 ;BK0 for 512 bytes (128 floats)

MVC B6,AMR ;set address mode register AMR

MVKL last_addr,A9 ;A9=last circ addr(lower 16 bits)
MVKH last_addr,A9 ;last circ addr (higher 16 bits)
LDW *A9,A7 ;A7=last circ addr
NOP 4
STW A4,*A7++ ;newest sample-->last address
;(in my code A4 is a float!!
; so increment 1 or 3 ?)

loop: ;begin FIR loop
LDW *A7++,A2 ;A2=x[n-(N-1)+i] i=0,1,...,N-1 (??)
|| LDW *B4--,B2 ;B2=h[N-1-i] i=0,1,...,N-1 (??)
SUB A1,1,A1 ;decrement count
[A1] B loop ;branch to loop if count # 0
NOP 2
MPYSP A2,B2,A6 ;A6=x[n-(N-1)+i]*h[N-1-i] (??)
;(samples and coeffs are floats !
; is it a 32x32 mpy in fixed-point
; format ?)
NOP
ADDSP A6,A8,A8 ;accumulate in A8
;(is it a 32-bit add in fixed-point
; format ?)

STW A7,*A9 ;store last circ addr to last_addr (??)
B B3 ;return addr to calling routine
MV A8,A4 ;result returned in A4 (??)
NOP 4
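Several of the questions in the comments above come down to byte arithmetic: the AMR block size and the pointer offsets are counted in bytes, and a float occupies 4 bytes where a short occupies 2. The sketch below reproduces the two AMR constants used in this thread and computes the byte offset of the last coefficient. It assumes the C6000 AMR layout as I understand it (A7 mode in bits 7:6, BK0 in bits 20:16, block size 2^(BK0+1) bytes), which is worth checking against TI's CPU reference guide; the helper names `amr_a7_bk0` and `last_elem_offset` are hypothetical.

```c
#include <stdint.h>

/* AMR value selecting A7 as a circular pointer that uses BK0.
   Assumed layout: A7 mode field in bits 7:6 (01 = circular with BK0),
   BK0 field in bits 20:16, block size = 2^(BK0+1) bytes.
   block_bytes must be a power of two and at least 2. */
uint32_t amr_a7_bk0(uint32_t block_bytes)
{
    uint32_t bk0 = 0;

    while ((2u << bk0) < block_bytes)   /* find BK0 with 2^(BK0+1) == block_bytes */
        bk0++;
    return (bk0 << 16) | (1u << 6);
}

/* Byte offset of element [ntaps-1] from the start of an array
   whose elements are elem_bytes wide. */
uint32_t last_elem_offset(uint32_t ntaps, uint32_t elem_bytes)
{
    return (ntaps - 1u) * elem_bytes;
}
```

With these, `amr_a7_bk0(256)` gives 0x00070040 (128 shorts) and `amr_a7_bk0(512)` gives 0x00080040 (128 floats), matching the constants in the two listings, while `last_elem_offset(128, 4)` gives 508, the word-aligned offset of h[N-1] for float coefficients. On the instruction questions: MPYSP and ADDSP are the C67x single-precision floating-point multiply and add, not fixed-point operations. As far as I recall they have longer latencies than MPY and ADD, so the NOP counts in the loop would need revisiting against the instruction set guide.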
You shouldn't need to use asm at all to get optimal code.
The tricks are to:
1) Make the compiler use LDDW for loading the signal and coefficients. To do this,
both the data array and the filter coefficients have to be 8-byte aligned. They also
need to be referenced as doubles, with something like the macros below
used to extract the upper or lower float.
#define FDHI(a,b) _itof(_hi(a[b]))
#define FDLO(a,b) _itof(_lo(a[b]))
2) Use more than one accumulator.

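The inner loop looks something like:
for(i=0;i<nPairs;i++) { /* nPairs: hypothetical name for taps/2; the bound was lost in the original post */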
sum0 += FDLO(dfInput,i) * FDLO(dfFilterCoef,i);
sum1 += FDHI(dfInput,i) * FDHI(dfFilterCoef,i);
}
sum = sum0 + sum1;
That should be it.

- Andrew E.
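The `_itof`, `_hi` and `_lo` intrinsics above are specific to TI's compiler, so the loop cannot be tried on a PC as written. A portable sketch of just the second trick, two independent accumulators, might look like the following. The function name `dot2acc` is hypothetical, `n` is assumed to be even, and the arrays are plain linear float arrays, so neither the LDDW packing nor the circular buffer is modelled here.

```c
#include <stddef.h>

/* Dot product of two float arrays using two independent accumulators,
   so that consecutive multiply-adds do not depend on each other.
   n must be even. */
float dot2acc(const float *x, const float *h, size_t n)
{
    float  sum0 = 0.0f;   /* accumulates even-indexed products */
    float  sum1 = 0.0f;   /* accumulates odd-indexed products */
    size_t i;

    for (i = 0; i < n; i += 2) {
        sum0 += x[i]     * h[i];
        sum1 += x[i + 1] * h[i + 1];
    }
    return sum0 + sum1;   /* combine the partial sums once at the end */
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop. For filter outputs that is usually acceptable, but it is worth knowing when comparing against a reference.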
