c6x | sample-by-sample FIR optimized| page 2

Reply by Wojciech Rewers ●September 22, 20032003-09-22

--- Arius - Rick Collins <> wrote: > One final suggestion and I will let you get on with > it. You can optimize > the code to use the LDDW if you copy your input data > into *two* buffers, > one odd aligned and one even aligned. Alternate > between them on each > sample and the LDDW should be in the right place > each time. hm... well... I'm affraid it's not gonna work as I'm using circular addressing - so there data block has to be aligned to the block size boundary... __________________________________

Reply by Arius - Rick Collins ●September 22, 20032003-09-22

At 03:31 AM 9/22/2003, you wrote: >--- Arius - Rick Collins <> wrote: > > One final suggestion and I will let you get on with > > it. You can optimize > > the code to use the LDDW if you copy your input data > > into *two* buffers, > > one odd aligned and one even aligned. Alternate > > between them on each > > sample and the LDDW should be in the right place > > each time. > >hm... well... I'm affraid it's not gonna work as I'm >using circular addressing - so there data block has to >be aligned to the block size boundary... You need to stop looking at the limitations and start looking at the ways to make it work. Circular addressing does not need to match the buffer size to the block size. The block size just needs to be as large or larger. If you are tight on buffer space, use 254 taps instead of 256; or use 512 words for your blocks. Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Jeff Brower ●September 22, 20032003-09-22

Rick- > One final suggestion and I will let you get on with it. You can optimize > the code to use the LDDW if you copy your input data into *two* buffers, > one odd aligned and one even aligned. Alternate between them on each > sample and the LDDW should be in the right place each time. Yeah, that's a good idea -- the function could still pass one buffer, and the "copy split" could be done at function start, transparently to the caller. Especially if Wojciech makes the algorithm do more per tap (LMS) then the extra copy loop would become less significant. -Jeff > At 03:15 AM 9/22/2003, you wrote: > >--- Arius - Rick Collins <> wrote: > > > I took a look at the code Andrew Elder posted and I > > > think I understand what they are doing. > > > >yeah... I actually realized I should digest that code > >even before Andrew advised that... so I did... at > >first I couldn't understand any of it, but after a > >while I realized what's going on ;-) god it's simple! > >and very neat too... > > > > > I think this only works from internal memory). > > > >and why would that be? > > > > > The final adds account for > > > the fact that since the pipeline in the mpy and add > > > are 4 clocks, there are > > > really four MACs going on in each register, so they > > > have to be summed to > > > get the final result. > > > >yeah... understanding most of that algorithm was easy > >- the ADDSPs was the most difficult part as I couldn't > >figure out how you can accumulate into just one > >register if the ADDSP takes 4 cycles... in fact - even > >though only one register is used - still there are > >founr accumulation going on - each for every fourth > >cycle - and the results are kept somewhere in the > >pipeline I guess - so then those four sums for each > >register have to be added together... the spru198 > >explains the whole process quite well... so - you know > >how it is - if all possibilities fail - read the > >manual ;-) > > > > > As you can see, it takes a lot to write code like > > > this. > > > >well - again I was proved, there are no difficult > >things in life - only new ones... at first when I > >looked at that dotp function - it was all black magic > >to me - now when I look at it all so simple and neat - > >I think that functions shows excellently the beauty > >and power of C6x devices... > > > > > It is much better to use library code if possible. > > > >well - nobody likes to reinvent the wheel... > >especially when the deadline is close... > > > > > So this code will do what you want if you > > > can change the addressing for a circular buffer. > > > Can you figure out how to do that? > > > >and what do you think I was doing when you were making > >out with Isabelle? ;-) > > > >I had to give up half the speed though... since I need > >"sample-by-sample" algotithm - I can't have any > >constraints on the location of samples in the memory - > >and using LDDW imposes on you that the samples are > >aligned on odd-even addresses or something like > >that... > > > >so what I did is that I changed LDDW into LDW, skipped > >the parallel MPYSPs and ADDSPs plus I use circular > >buffer for data... if anyone is interested - the code > >is here: > > > >http://www.wrewers.karolin.pl/firc.asm > > > >maybe somebody can even find a bug or something - > >please do let me know... > > > >well - I started from thousands of cycles first to > >filter the data (using library code) then to shift all > >the samples in the buffer by one - now I have a code > >that is performing "the same" in around 500 cycles - > >it's enough for me as for now - but the interesting > >aspect is that the same thing can be done for so many > >different ways ;-) the beauty of DSPs I guess... > > > >anyway - I have that FIR now - now I will only need a > >LMS to adapt FIR's coeficients - but I think I'll > >figure something out... > > > >Wojciech Rewers (still looking for a job) > > > >__________________________________ > > > Rick Collins > > Arius - A Signal Processing Solutions Company > Specializing in DSP and FPGA design http://www.arius.com > 4 King Ave 301-682-7772 Voice > Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Wojciech Rewers ●September 22, 20032003-09-22

--- Arius - Rick Collins <> wrote: > You need to stop looking at the limitations and > start looking at the ways to make it work. well... as I said before I'm fine with the results I obrained... sure there is some space for optimization, but it's not gonna help my application to spend _even_ _more_ time waiting... or is it? so - instead of spending my time for further optimizations - I moved on and I'm working on other critical parts of the project... if I finish before my deadline - I may even come back to that optimization problem - we'll see... > Circular addressing does not need to match the buffer size to the block size. I know that, however - my algorithm is filtering each coming sample 4 times with 4 different filters - the first has 256 coefs, the second and third are 32 taps long and the last one is 64 coefs - they are all power of two length - that's all I need... and with the code I have I'm able to apply all those filters plus some other minor calculations on every single sample taken with 16kHz rate - that's all I need... Wojciech Rewers __________________________________

Reply by Jeff Brower ●September 22, 20032003-09-22

Wojciech Rewers wrote: Hey nice work. Every so often there is a vigorous debate on the group about whether CCS linear assembly can produce code as efficient as hand-optimized. If you search the C6x archive for the thread "C-Coding" in the Nov/Dec 2002 time-frame, you will see some interesting discussion. There was one TI person in particular, Jagadeesh Sankaran, who argued strongly that hand-optimization should be avoided. He made some excellent suggestions for optimization within CCS. But when I see some excellent work like yours and the dramatic speed increase, it tends to reinforce the idea that humans can still be pretty useful. -Jeff > --- Arius - Rick Collins <> wrote: > > I took a look at the code Andrew Elder posted and I > > think I understand what they are doing. > > yeah... I actually realized I should digest that code > even before Andrew advised that... so I did... at > first I couldn't understand any of it, but after a > while I realized what's going on ;-) god it's simple! > and very neat too... > > > I think this only works from internal memory). > > and why would that be? > > > The final adds account for > > the fact that since the pipeline in the mpy and add > > are 4 clocks, there are > > really four MACs going on in each register, so they > > have to be summed to > > get the final result. > > yeah... understanding most of that algorithm was easy > - the ADDSPs was the most difficult part as I couldn't > figure out how you can accumulate into just one > register if the ADDSP takes 4 cycles... in fact - even > though only one register is used - still there are > founr accumulation going on - each for every fourth > cycle - and the results are kept somewhere in the > pipeline I guess - so then those four sums for each > register have to be added together... the spru198 > explains the whole process quite well... so - you know > how it is - if all possibilities fail - read the > manual ;-) > > > As you can see, it takes a lot to write code like > > this. > > well - again I was proved, there are no difficult > things in life - only new ones... at first when I > looked at that dotp function - it was all black magic > to me - now when I look at it all so simple and neat - > I think that functions shows excellently the beauty > and power of C6x devices... > > > It is much better to use library code if possible. > > well - nobody likes to reinvent the wheel... > especially when the deadline is close... > > > So this code will do what you want if you > > can change the addressing for a circular buffer. > > Can you figure out how to do that? > > and what do you think I was doing when you were making > out with Isabelle? ;-) > > I had to give up half the speed though... since I need > "sample-by-sample" algotithm - I can't have any > constraints on the location of samples in the memory - > and using LDDW imposes on you that the samples are > aligned on odd-even addresses or something like > that... > > so what I did is that I changed LDDW into LDW, skipped > the parallel MPYSPs and ADDSPs plus I use circular > buffer for data... if anyone is interested - the code > is here: > > http://www.wrewers.karolin.pl/firc.asm > > maybe somebody can even find a bug or something - > please do let me know... > > well - I started from thousands of cycles first to > filter the data (using library code) then to shift all > the samples in the buffer by one - now I have a code > that is performing "the same" in around 500 cycles - > it's enough for me as for now - but the interesting > aspect is that the same thing can be done for so many > different ways ;-) the beauty of DSPs I guess... > > anyway - I have that FIR now - now I will only need a > LMS to adapt FIR's coeficients - but I think I'll > figure something out... > > Wojciech Rewers (still looking for a job)

Reply by Jeff Brower ●September 22, 20032003-09-22

Jagadeesh- Thanks! I had a feeling you were still there, listening to all of this :-) -Jeff Jagadeesh Sankaran wrote: > > First of all, I would still strongly advise folks! not to > give up hopes on tools and automatic code generation in a > flash. I will illustrate the perils, based on the code that > has been developed at: > > http://www.wrewers.karolin.pl/firc.asm > > This code has several issues. I am not trying to nit-pick. > Developing hand-optimized VLIW code has its perils. Let the > tools do their job. I will list some of the bugs that immediately > catch my eye. I cannot vouch that I have caught all of them. > BTW all these bugs can be avoided by using tools, so that > one does not have to become intricately familiar to do code > development of a simple FIR. > > a. The stack pointer needs to be double word aligned, at all > times. The stores to the stack need to be done by pre-decrementing > the stack frame and leaving it double word aligned. > > b. The save on entry registers A10-A15 and B10-B15 need to be saved > for sure, upon entry to a function. > > c. This code is not single register assignment, and hence you need to > turn off interrupts while you are in this code. > > d. This single cycle loop as shown could have been achieved with the > tools, > without a doubt. Further, if you allow at least two output samples to be > computed in parallel, you can get 100% multiplier utilization. > > e. Most fir implementations are written to perform block processing, > which is why the serial ports are buffered McBSP, with the B for > Buffering. > > f. Also the delay line is implemented once as a memcpy, by moving the > block of samples required for overlap, without explictly doing it in > the kernel. This removes the need for circular buffering. > > g. Take a look at codec_edma.c under the DSK directory for an example of > block based interaction with the serial port. > > By the way I have not seen anything spectacular to give up on the tools > yet! > > Regards > Jagadeesh Sankaran

Reply by Jagadeesh Sankaran ●September 22, 20032003-09-22

First of all, I would still strongly advise folks! not to give up hopes on tools and automatic code generation in a flash. I will illustrate the perils, based on the code that has been developed at: http://www.wrewers.karolin.pl/firc.asm This code has several issues. I am not trying to nit-pick. Developing hand-optimized VLIW code has its perils. Let the tools do their job. I will list some of the bugs that immediately catch my eye. I cannot vouch that I have caught all of them. BTW all these bugs can be avoided by using tools, so that one does not have to become intricately familiar to do code development of a simple FIR. a. The stack pointer needs to be double word aligned, at all times. The stores to the stack need to be done by pre-decrementing the stack frame and leaving it double word aligned. b. The save on entry registers A10-A15 and B10-B15 need to be saved for sure, upon entry to a function. c. This code is not single register assignment, and hence you need to turn off interrupts while you are in this code. d. This single cycle loop as shown could have been achieved with the tools, without a doubt. Further, if you allow at least two output samples to be computed in parallel, you can get 100% multiplier utilization. e. Most fir implementations are written to perform block processing, which is why the serial ports are buffered McBSP, with the B for Buffering. f. Also the delay line is implemented once as a memcpy, by moving the block of samples required for overlap, without explictly doing it in the kernel. This removes the need for circular buffering. g. Take a look at codec_edma.c under the DSK directory for an example of block based interaction with the serial port. By the way I have not seen anything spectacular to give up on the tools yet! Regards Jagadeesh Sankaran

Reply by Jeff Brower ●September 22, 20032003-09-22

Wojciech- > but what do you mean that the stack pointer needs to > be double word aligned? isn't the stack pointer B15? > how can that be double word aligned? or do you mean > that I should push/pop always double words? hm... is > it even possible? could you please elaborate on that? Sub 8 and AND with 0xfffffff8 before doing any additional push/pop. Watch out for alignment upon return. -Jeff

Reply by Jagadeesh Sankaran ●September 22, 20032003-09-22

A lot of my experience has been with fixed point DSP's C62x and more so C64x DSP. However I wanted to write out some code and prove to myself that the tools are indeed the best way to get there, that too for an example as simple as an FIR. Here is my first pass stab at a single sample FIR C code. This code models merely the FIR part of the dot-product, it does not model the circular buffer. I shall address this in the latter part of my e-mail. #include <stdio.h> #include <stdlib.h> float fir(float *restrict x, float *restrict h, int N) { int i; float sum; /*-----------*/ /* Initialize FIR accumulator to zero, prior */ /* to the start of the computation. */ /*-----------*/ sum = 0; /*-----------*/ /* If this is a C6000 build, then inform the */ /* compiler about any safe assumptions that */ /* can be made. In this case we assume that */ /* input array is word aligned and filter */ /* arrays is double word aligned. */ /* word aligned. I am assuming that your */ /* filter has at least 16 taps and is a */ /* multiple of 8. */ /*-----------*/ #ifdef TMS320C6X _nassert((int)(x)%4 == 0); _nassert((int)(h)%8 == 0); _nassert((int)(N)%8 == 0); _nassert((int)(N >= 16)); #endif /*-----------*/ /* The following loop iterates over N filter */ /* taps computing the FIR sum, one tap at a */ /* time. */ /*-----------*/ for ( i = 0; i < N; i++) { /*--------*/ /* Compute sum of products over all filter */ /* taps accumulating the result into sum. */ /*--------*/ sum += x[i] * h[i]; } /*-----------*/ /* Return accumulated single sample FIR. */ /*-----------*/ return sum; } I compiled using the current shipping 4.31 tools. I used the following flags for my compile: cl6x -k -o2 -mwtx -mv6700 -mh -dTMS320C6X fir.c The compiler produced the following output in which two multiplies are issued every cycle. Although the code is written in a straight forward way, tthe odd and even taps in parallel and accumulates them into seperate accumulators. It finally adds the seperate accumulators prior to returning. I will now reproduce the assembler output, which you can reproduce as well {Isnt that nice ?}. .sect ".text" .global _fir ;****************************************************************************** ;* FUNCTION NAME: _fir * ;* * ;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,* ;* B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP * ;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A14,B0,B1,B2,B3,B4,* ;* B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP * ;* Local Frame Size : 0 Args + 0 Auto + 24 Save = 24 byte * ;****************************************************************************** _fir: ;** --* MV .S1X SP,A9 ; |5| || STW .D2T1 A10,*SP--(24) ; |5| STW .D2T2 B13,*+SP(20) MVK .S2 32,B5 || STW .D2T2 B12,*+SP(16) SUB .L2X A4,B5,B5 || STW .D2T2 B11,*+SP(12) MV .S1X DP,A10 ; save dp || STW .D1T1 A14,*-A9(20) || MV .L2 B4,B11 || STW .D2T2 B10,*+SP(8) || MVC .S2 CSR,B4 ADD .L2 4,B5,B12 || LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <0,1> || MV .S1X B4,A8 || AND .S2 -2,B4,B4 SHR .S1 A6,3,A4 ; |47| || MV .L1X B5,A6 || MVC .S2 B4,CSR ; interrupts off || LDW .D2T2 *++B12(32),DP ; |47| (P) <0,0> LDW .D1T1 *++A6(32),A3 ; |47| (P) <0,2> || LDDW .D2T2 *-B11(24),B5:B4 ; |47| (P) <0,2> ;*----* ;* SOFTWARE PIPELINE INFORMATION ;* ;* Loop source line : 40 ;* Loop opening brace source line : 41 ;* Loop closing brace source line : 48 ;* Loop Unroll Multiple : 8x ;* Known Minimum Trip Count : 2 ;* Known Max Trip Count Factor : 1 ;* Loop Carried Dependency Bound(^) : 4 ;* Unpartitioned Resource Bound : 6 ;* Partitioned Resource Bound(*) : 6 ;* Resource Partition: ;* A-side B-side ;* .L units 2 6* ;* .S units 1 0 ;* .D units 6* 6* ;* .M units 2 6* ;* .X cross paths 2 4 ;* .T address paths 6* 6* ;* Long read paths 0 0 ;* Long write paths 0 4 ;* Logical ops (.LS) 0 0 (.L or .S unit) ;* Addition ops (.LSD) 1 0 (.L or .S or .D unit) ;* Bound(.L .S .LS) 2 3 ;* Bound(.L .S .D .LS .LSD) 4 4 ;* ;* Searching for software pipeline schedule at ... ;* ii = 6 Schedule found with 4 iterations in parallel ;* ;* Register Usage Table: ;* +---------------------------------+ ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB| ;* |0000000000111111|0000000000111111| ;* |0123456789012345|0123456789012345| ;* |----------------+----------------| ;* 0: |**** * | ** **** *** | ;* 1: |***** ** | ****** *** | ;* 2: |******** | *** ******* | ;* 3: | ******* |* *** ***** * | ;* 4: | **** * |** **** **** * | ;* 5: | *** * |*** ***** ** * | ;* +---------------------------------+ ;* ;* Done ;* ;* Epilog not entirely removed ;* Collapsed epilog stages : 2 ;* ;* Prolog not entirely removed ;* Collapsed prolog stages : 1 ;* ;* Minimum required memory pad : 64 bytes ;* ;* Minimum safe trip count : 1 (after unrolling) ;*----* ;* SETUP CODE ;* ;* MV A6,B12 ;* ADD 4,B12,B12 ;* ;* SINGLE SCHEDULED ITERATION ;* ;* C23: ;* 0 LDW .D2T2 *++B12(32),DP ; |47| ;* 1 LDDW .D2T2 *B11++(32),B7:B6 ; |47| ;* 2 LDW .D1T1 *++A6(32),A3 ; |47| ;* || LDDW .D2T2 *-B11(24),B5:B4 ; |47| ;* 3 LDW .D2T2 *+B12(4),B6 ; |47| ;* || LDW .D1T1 *+A6(12),A3 ; |47| ;* 4 LDDW .D2T2 *-B11(16),B7:B6 ; |47| ;* || LDW .D1T1 *+A6(16),A4 ; |47| ;* 5 LDW .D1T1 *+A6(20),A4 ; |47| ;* || LDDW .D2T2 *-B11(8),B9:B8 ; |47| ;* 6 LDW .D1T1 *+A6(24),A5 ; |47| ;* 7 MPYSP .M1X B6,A3,A4 ; |47| ;* || MPYSP .M2 B7,DP,B4 ; |47| ;* || LDW .D1T1 *+A6(28),A4 ; |47| ;* 8 MPYSP .M2 B4,B6,B4 ; |47| ;* 9 MPYSP .M2X B6,A4,B8 ; |47| ;* 10 MPYSP .M2X B7,A4,B7 ; |47| ;* 11 ADDSP .L1 A4,A7,A7 ; |47| ^ ;* || ADDSP .L2 B4,B3,B3 ; |47| ^ ;* || MPYSP .M2X B8,A5,B4 ; |47| ;* 12 ADDSP .L2 B4,B10,B10 ; |47| ^ ;* || MPYSP .M2X B5,A3,B4 ; |47| ;* || MPYSP .M1X B9,A4,A5 ; |47| ;* 13 ADDSP .L2 B8,B0,B0 ; |47| ^ ;* || [ A1] SUB .S1 A1,1,A1 ; |48| ;* 14 ADDSP .L2 B7,B1,B1 ; |47| ^ ;* || [ A1] B .S1 C23 ; |48| ;* 15 ADDSP .L2 B4,B2,B2 ; |47| ^ ;* 16 ADDSP .L2 B4,B13,B13 ; |47| ^ ;* || ADDSP .L1 A5,A0,A0 ; |47| ^ ;* 17 NOP 3 ;* ; BRANCH OCCURS ; |48| ;*----* L1: ; PIPED LOOP PROLOG LDW .D1T1 *+A6(12),A3 ; |47| (P) <0,3> || LDW .D2T2 *+B12(4),B6 ; |47| (P) <0,3> LDW .D1T1 *+A6(16),A4 ; |47| (P) <0,4> || LDDW .D2T2 *-B11(16),B7:B6 ; |47| (P) <0,4> ZERO .S2 B0 ; |47| || ZERO .L2 B2 ; |47| || ZERO .S1 A7 ; |47| || LDDW .D2T2 *-B11(8),B9:B8 ; |47| (P) <0,5> || LDW .D1T1 *+A6(20),A4 ; |47| (P) <0,5> ZERO .S2 B1 ; |47| || ZERO .L1 A0 ; |47| || ZERO .L2 B13 ; |47| || MV .S1X B3,A14 || LDW .D2T2 *++B12(32),DP ; |47| (P) <1,0> || LDW .D1T1 *+A6(24),A5 ; |47| (P) <0,6> MVK .S1 0x1,A2 ; init prolog collapse predicate || SUB .L1 A4,1,A1 || ZERO .S2 B3 ; |47| || ZERO .L2 B10 ; |47| || MPYSP .M1X B6,A3,A4 ; |47| (P) <0,7> || MPYSP .M2 B7,DP,B4 ; |47| (P) <0,7> || LDDW .D2T2 *B11++(32),B7:B6 ; |47| (P) <1,1> || LDW .D1T1 *+A6(28),A4 ; |47| (P) <0,7> ;** --* L2: ; PIPED LOOP KERNEL [ A1] B .S1 L2 ; |48| <0,14> || [!A2] ADDSP .L2 B7,B1,B1 ; |47| <0,14> ^ || MPYSP .M2 B4,B6,B4 ; |47| <1,8> || LDDW .D2T2 *-B11(24),B5:B4 ; |47| <2,2> || LDW .D1T1 *++A6(32),A3 ; |47| <2,2> [!A2] ADDSP .L2 B4,B2,B2 ; |47| <0,15> ^ || MPYSP .M2X B6,A4,B8 ; |47| <1,9> || LDW .D1T1 *+A6(12),A3 ; |47| <2,3> || LDW .D2T2 *+B12(4),B6 ; |47| <2,3> [!A2] ADDSP .L2 B4,B13,B13 ; |47| <0,16> ^ || [!A2] ADDSP .L1 A5,A0,A0 ; |47| <0,16> ^ || MPYSP .M2X B7,A4,B7 ; |47| <1,10> || LDDW .D2T2 *-B11(16),B7:B6 ; |47| <2,4> || LDW .D1T1 *+A6(16),A4 ; |47| <2,4> ADDSP .L1 A4,A7,A7 ; |47| <1,11> ^ || MPYSP .M2X B8,A5,B4 ; |47| <1,11> || ADDSP .L2 B4,B3,B3 ; |47| <1,11> ^ || LDDW .D2T2 *-B11(8),B9:B8 ; |47| <2,5> || LDW .D1T1 *+A6(20),A4 ; |47| <2,5> [ A2] SUB .S1 A2,1,A2 ; <0,18> || MPYSP .M2X B5,A3,B4 ; |47| <1,12> || MPYSP .M1X B9,A4,A5 ; |47| <1,12> || ADDSP .L2 B4,B10,B10 ; |47| <1,12> ^ || LDW .D1T1 *+A6(24),A5 ; |47| <2,6> || LDW .D2T2 *++B12(32),DP ; |47| <3,0> [ A1] SUB .S1 A1,1,A1 ; |48| <1,13> || ADDSP .L2 B8,B0,B0 ; |47| <1,13> ^ || MPYSP .M1X B6,A3,A4 ; |47| <2,7> || LDW .D1T1 *+A6(28),A4 ; |47| <2,7> || MPYSP .M2 B7,DP,B4 ; |47| <2,7> || LDDW .D2T2 *B11++(32),B7:B6 ; |47| <3,1> ;** --* L3: ; PIPED LOOP EPILOG MV .S1X SP,A9 ; |55| || ADDSP .L2 B7,B1,B1 ; |47| (E) <3,14> ^ ADDSP .L2 B4,B2,B2 ; |47| (E) <3,15> ^ ADDSP .L1 A5,A0,A0 ; |47| (E) <3,16> ^ || ADDSP .L2 B4,B13,B13 ; |47| (E) <3,16> ^ LDDW .D2T2 *+SP(8),B11:B10 ; |55| || MV .S1X B10,A8 || MV .S2X A8,B4 MV .S2X A10,DP ; restore dp MVC .S2 B4,CSR ; interrupts on LDDW .D2T2 *+SP(16),B13:B12 ; |55| || ADDSP .L1X A8,B13,A6 ; |54| NOP 3 LDW .D1T1 *+A9(4),A14 ; |55| || MV .S2X A14,B3 ; |55| || ADDSP .L1X B3,A6,A3 ; |54| LDW .D2T1 *++SP(24),A10 ; |55| NOP 2 ADDSP .L1 A7,A3,A3 ; |54| NOP 3 ADDSP .L1 A0,A3,A0 ; |54| NOP 3 ADDSP .L1X B2,A0,A0 ; |54| NOP 3 ADDSP .L1X B1,A0,A0 ; |54| NOP 1 RET .S2 B3 ; |55| NOP 1 ADDSP .L1X B0,A0,A4 ; |54| NOP 3 ; BRANCH OCCURS ; |55| Extra Comments --------------- a. This code decrements the stack frame by 16-bytes to store 3 words A10, B10, B11. Even though 12 bytes of stack storage would have been adequate, it needs to decrement 16 bytes, in order to leave the incoming double word aligned stack frame double word aligned at the end of the transaction. b. You only need to save A10-A15 and B10-B15 if these are being modified by your code. You need not worry about other registers you are modifying. c. Also notice the comments "interrupts off". This is where the compiler truns off the "GIE" bit off CSR to guarantee that interrupts dont mess you up, now that you are no longer in single register assignment mode. d. Also notice how two prolog and epilog stages have been collapsed to achieve code-size reductions. e. The compiler finds a 6 cycle loop in which 8 multiplies are performed to achieve 1.3 multiplies/cycle. The reason, this is not a 4 cycle loop, is because the input array can only be assumed to be word-aligned, as only one output fir sample is computed at a time. the filter array can be assumed to be double word aligned. f. Notice from the compiler feedback {-mw} flag, how all units are maxed out. g. Also, if one could compute even two fir output samples at a time. In the C code, specifying _nassert((int)(x)%8 == 0); and computing two output samples in parallel would give a better multiplier utilization. Circular buffer --------------- The delay line is modeled by keeping an array of input samples, in the input array of size KN and copying the last N -1 samples, after computing (K-1)N output samples, to the head of the array. Since the memcpy is done once, for every (K-1)N output samples, it is inexpensive, as opposed to maintaining the delay line manually. This avoids explictly having to incorporate circular buffering in your code. X: <-N input samples->|<N input samples->|..............|<N input samples>| If you still want to use circular buffering you could use serial assembly to do so. Regards Jagadeesh Sankaran

Reply by Jagadeesh Sankaran ●September 23, 20032003-09-23

There is a way to get 1.6 multiplies/cycle for single sample FIR. This involves carrying two versions, one for the even output samples where both the input and the filter arrays are double word aligned, and one for the odd output samples where the input is word aligned and filter array is double word aligned. In this case you compute even output samples at the rate of 2 multiplies/cycle and odd output samples at the rate of 1.2 multiplies/cycle. Since both these versions are going to be called for an equal number of times, in steady state you will get an averaging effect of computing at the rate of 1.6 multiplies/cycle. double -word aligned: are addresses that end in 0x0 and 0x8. word-aligned : are addresses that end in 0x0, 0x4, 0x8, 0xC All double-word addresses are word_aligned. Not all word aligned addresses are double word aligned. You can request alignment from C for the linker to use by saying: #pragma DATA_ALIGN(x, 8) float x[100]; These statements align the start of array x, or x itself to be double-word aligned from C. Regards Jagadeesh Sankaran

123 Next

sample-by-sample FIR optimized

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group