c6x | sample-by-sample FIR optimized

Two days ago I issued a question here about sample-by-sample FIR implementation... J G referenced me to this place: http://www.integrated-dsp.com/software/c6xdsk/C6711/fira/ at first I thought "great - I have what I was looking for" - but then - I looked closer to the code and figured, that it's not only producing incorrect results, but also it's not optimized that well... so I fixed both problems... from 11+13n I got 16+7n cycles where n is the number of taps calculated... while checking the clock counts for 256 taps long FIR I get results: fira: 3941 counts (and incorrect results!) fir_my: 2215 counts (and correct results) that is almost 50% time reduction! however - my profiler is giving me absurd results... I have never really worked with the profiler so maybe I'm doing something wrong - are there any "gotchas" I'm not aware of? anyway - here is the code I wrote if anybody needs it: www.wrewers.karolin.pl/fira_my.asm Wojciech Rewers PS. I'm a young (26) male with MSEE degree (specialized in DSP) with 15 monts of work experience as embedded systems engineer looking for a job in DSP/embedded field - ready to relocate anywhere. __________________________________

Reply by Arius - Rick Collins ●September 18, 20032003-09-18

Perhaps I am missing something. But it is a poor algorithm indeed that can't process a FIR filter with 1 clock per tap! I have not taken the time to look at the code since I am packing up now to travel closer to Isabel's path. I am thinking there is a disconnect in how you are counting your clock cycles. Are your figures for just one data sample input? The MAC on every DSP can process one tap in one clock cycle. The C6x processors should be able to fetch two operands, perform a multiply and add into the accumulator on each clock cycle. This would give you some fixed number of startup clock cycles plus N cycles where N is the number of taps. There should be no coefficient for the N unless you perform two inputs at once in which case the coefficient could be 0.5. Instead of working with the clearly inefficient fira code, perhaps you should start with some other optimized code to see how to do it right. The entire filter loop should be a single parallel operation. Of course you will have a bit of messy setup and exit code, but the core loop should be one set of parallel ops. 0) Setup pointers, counts... 1) fetch two inputs (one coeff, one data) 2) fetch two inputs (one coeff, one data) || multiply operands 3) fetch two inputs (one coeff, one data) || multiply operands || add to accumulator || loop on count to (3) 4) multiply operands || add to accumulator 5) add to accumulator 6) done! This is the form your code should have. Am I out to lunch on this? If you can't get to k + N cycles, it might be faster running on an MCU! At 05:34 PM 9/17/2003, Wojciech Rewers wrote: >Two days ago I issued a question here about >sample-by-sample FIR implementation... > >J G referenced me to this place: >http://www.integrated-dsp.com/software/c6xdsk/C6711/fira/ > >at first I thought "great - I have what I was looking >for" - but then - I looked closer to the code and >figured, that it's not only producing incorrect >results, but also it's not optimized that well... > >so I fixed both problems... from 11+13n I got 16+7n >cycles where n is the number of taps calculated... > >while checking the clock counts for 256 taps long FIR >I get results: >fira: 3941 counts (and incorrect results!) >fir_my: 2215 counts (and correct results) >that is almost 50% time reduction! > >however - my profiler is giving me absurd results... I >have never really worked with the profiler so maybe >I'm doing something wrong - are there any "gotchas" >I'm not aware of? > >anyway - here is the code I wrote if anybody needs it: >www.wrewers.karolin.pl/fira_my.asm Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Wojciech Rewers ●September 18, 20032003-09-18

--- Arius - Rick Collins <> wrote: > Perhaps I am missing something. clearly one os us is ;-) > But it is a poor algorithm indeed that > can't process a FIR filter with 1 clock per tap! hm... > I have not taken the time > to look at the code since I am packing up now to > travel closer to Isabel's path. I though the stream of people is flowing the other way? > I am thinking there is a disconnect in how > you are counting your > clock cycles. Are your figures for just one data > sample input? ? one data sample input? hm... I'm not sure what you mean by that... my figure was for "one data sample output" - meaning - the input parameters are: dly (delay line of input samples), h (vector of coefitients), nh (number of coeficients)... and as a result I get one output sample (from calculating dotprod as you wish) > The MAC on every DSP can process one tap in one > clock cycle. hm... that's what I remember from my DSP classes and labs where we were using C5x Starter Kits... > The C6x > processors should be able to fetch two operands, > perform a multiply and add > into the accumulator on each clock cycle. hm... well - from what I understand so far C6711 has this LDW instruction which can fetch a float number but it takes 5 cycles to do so... then the MPYSP takes 4 cycles to multiply two float operands and then the ADDSP takes 4 cycles to accumulate the result... but then... I'm not sure myself whether the ADDSP can directly follow MPYSP - from what I figured so far from single-stepping some sample code - it cannot because only after 3 delay slots the results are ready from MPYSP... but then - I also took a look at the assembly code published by TI for block FIR filter where the "main loop" includes 4 cycles with many parallel instructions in each - I didn't really analyze all that though... well - as I said - I'm not an expert here in assembly coding - when J G gave referenced me to that Integrated-DSP code I analyzed that and since it's much more straightforward I stack with it and optimized it by almost 50% - I thought it's OK and I should move on to some other stuff I have on my to do list... but now you're telling me it sucks? ;-) > This would give you some fixed > number of startup clock cycles plus N cycles where N > is the number of taps. well... as I said - I would expect that on C5x, but I'm affraid you're wrong here about C67x and floating-point operations... or is it really me who is missing something? > Instead of working with the clearly inefficient fira > code, perhaps you > should start with some other optimized code to see > how to do it right. The > entire filter loop should be a single parallel > operation. Of course you > will have a bit of messy setup and exit code, but > the core loop should be > one set of parallel ops. well... I'm really not buying it for now - have you actually done some assembly coding on C6x? > Am I out to lunch on this? If you can't get to k + > N cycles, it might be > faster running on an MCU! well... I'm not sure if you're "out to lunch" or not - for me you are - but maybe someone else could be the judge? Wojciech Rewers (still looking for a job) __________________________________

Reply by Arius - Rick Collins ●September 18, 20032003-09-18

Yes, I see that neither of us are experts in C6x assembly programming. ;) I know that you can get to 1 clock per tap, but to do so requires that you interleave all the pipelined instructions so that you are feeding a new pair of operands in on every clock, starting a new multiply on every clock and starting a new add on every clock. Even though it may take more than one clock to complete each of these steps, a new operation can be started on each clock cycle. You don't have to wait for one multiply to finish before you start the next. This is the same on the C6x and the C5x and all other high speed DSPs, they are all pipelined, which means it takes more than one clock to complete an operation, but you can start a new one on every clock. The TI code for a dot product should be a good place to start, assuming that they achieve the k + N timing value. I would be surprised if they don't. This code only needs to be modified for circular buffering to be what you need. BTW, my comment about processing one input sample meant one *new* input sample. The idea is that you want a new output sample every time you get a new input sample, rather than waiting until you have a new block of input samples to yield a new block of output samples. Unless you are decimating or interpolating, input and output size will be the same. At 12:23 PM 9/18/2003, you wrote: >--- Arius - Rick Collins <> wrote: > > Perhaps I am missing something. > >clearly one os us is ;-) > > > But it is a poor algorithm indeed that > > can't process a FIR filter with 1 clock per tap! > >hm... > > > I have not taken the time > > to look at the code since I am packing up now to > > travel closer to Isabel's path. > >I though the stream of people is flowing the other >way? > > > I am thinking there is a disconnect in how > > you are counting your > > clock cycles. Are your figures for just one data > > sample input? > >? one data sample input? hm... I'm not sure what you >mean by that... my figure was for "one data sample >output" - meaning - the input parameters are: dly >(delay line of input samples), h (vector of >coefitients), nh (number of coeficients)... and as a >result I get one output sample (from calculating >dotprod as you wish) > > > The MAC on every DSP can process one tap in one > > clock cycle. > >hm... that's what I remember from my DSP classes and >labs where we were using C5x Starter Kits... > > > The C6x > > processors should be able to fetch two operands, > > perform a multiply and add > > into the accumulator on each clock cycle. > >hm... well - from what I understand so far C6711 has >this LDW instruction which can fetch a float number >but it takes 5 cycles to do so... then the MPYSP takes >4 cycles to multiply two float operands and then the >ADDSP takes 4 cycles to accumulate the result... > >but then... I'm not sure myself whether the ADDSP can >directly follow MPYSP - from what I figured so far >from single-stepping some sample code - it cannot >because only after 3 delay slots the results are ready >from MPYSP... but then - I also took a look at the >assembly code published by TI for block FIR filter >where the "main loop" includes 4 cycles with many >parallel instructions in each - I didn't really >analyze all that though... > >well - as I said - I'm not an expert here in assembly >coding - when J G gave referenced me to that >Integrated-DSP code I analyzed that and since it's >much more straightforward I stack with it and >optimized it by almost 50% - I thought it's OK and I >should move on to some other stuff I have on my to do >list... but now you're telling me it sucks? ;-) > > > This would give you some fixed > > number of startup clock cycles plus N cycles where N > > is the number of taps. > >well... as I said - I would expect that on C5x, but >I'm affraid you're wrong here about C67x and >floating-point operations... or is it really me who is >missing something? > > > Instead of working with the clearly inefficient fira > > code, perhaps you > > should start with some other optimized code to see > > how to do it right. The > > entire filter loop should be a single parallel > > operation. Of course you > > will have a bit of messy setup and exit code, but > > the core loop should be > > one set of parallel ops. > >well... I'm really not buying it for now - have you >actually done some assembly coding on C6x? > > > Am I out to lunch on this? If you can't get to k + > > N cycles, it might be > > faster running on an MCU! > >well... I'm not sure if you're "out to lunch" or not - >for me you are - but maybe someone else could be the >judge? > >Wojciech Rewers (still looking for a job) > >__________________________________ > Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Arius - Rick Collins ●September 18, 20032003-09-18

At 12:28 PM 9/18/2003, Wojciech Rewers wrote: >--- Arius - Rick Collins <> wrote: >and one more thing... could someone please explain me >why the profiler shows different clock cycles count >than obtained from the equations I formulated? (11+13n >and 16+7n)? I guessed it could have something to do >with cache operation, but I'm certainly not an expert >in that field, so I didn't investigate that... but I'm >just curious... anybody knows an explanation? Which way are the results off? If the profiler is giving very high numbers it is likely because your code is in external memory and it takes multiple clock cycles to load each instruction, and multiple instructions are needed for each execution cycle. Internal memory runs in one clock cycle and works with a 256 bit wide bus to allow maximum execution speeds. Cache is ok if you are running large code which is not optimized, but if your code fits the 90/10 rule (90% of your execution time is spent in 10% of your code) an internal memory model will work better. Also the fetches of data can slow the execution if they are from external memory. Instructions will be cached after the first time through the loop, but the data will always be slow. So put the data in internal memory. If the profiler is giving low numbers, you may be seeing optimization by the software tools. It has been a long time since I have coded in assembly for this chip, but I seem to recall that there were two forms; one for ease of coding where the tool would do as much parallelism as it could figure out, and one where you had to do all the work, but the tool would not change anything you coded. Which are you working with? Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Andrew Elder ●September 18, 20032003-09-18

Wojciech, Below is TI's dot product asm code, which is close to what you want. Big problem with it is that the LDDW operation requires both arrays to be DWORD aligned. This would preclude stepping by a single sample. Anyway, you get the drift of what is happening here. - Andrew E. *=============================================================================== * * TEXAS INSTRUMENTS, INC. * * C67X FLOATING POINT DOT PRODUCT (HAND-CODED ASSEMBLY VERSION) * * Revision Date: 03/11/98 * * USAGE This routine is C Callable and can be called as: * * float dotp(const float *x, const float *y, const int count); * * x is pointer to array holding the first floating point vector * y is pointer to array holding the second floating point vector * count is the number of values in the x & y vectors * * If the routine is not to be used as a C callable function, * then you need to initialize values for all of the parameters * passed to the function since these are assumed to be in * registers as defined by the calling convention of the * compiler, (refer to the TMS320C6x Optimizing C Compiler * User's Guide). * * C CODE * This is the C equivalent for the assembly code. Note that * the assembly code is hand optimized and restrictions may * apply. * * float dotp(const float *x, const float *y, const int count) * { * int i; * float sum = 0; * * for (i=0; i < count; i++) * { * sum += x[i] * y[i]; * } * return sum; * } * * DESCRIPTION * * This routine calculates the dot product of 2 vectors. * * TECHNIQUES * * 1. LDDW instructions are used to load two SP floating point * values at a time for the x and y arrays. * 2. The loop is unrolled once and software pipelined. * 3. Since the ADDSP and MPYSP instructions take 4 cycles, * A8, B8, A0, and B0 multiplex different variables to save * on register usage. * This multiple assignment is possible since the variables * are always read just once on the first cycle that they * are availble. * 4. The loop is primed to reduce the prolog by 4 cycles * (14 words) with no increase in cycle time. * 5. The load counter is used as the loop counter which * requires a 3 cycle (6 word) epilog to finish the * calculations. This does not increase the cycle time. * The advantage is that the routine does not perform * extraneous loads. * * ASSUMPTIONS * * 1. Little Endian is assumed for LDDW instructions. * 2. The value of count must be greater than or equal * to 10 and even (10, 12, 14, ...). * 3. Since single assignment of registers is not used, * interrupts should be disabled before this function is * called. * * MEMORY NOTE * * The arrays x and y should be aligned on opposite * (even and odd) double word boundaries to avoid memory * bank hits. This routine does not perform extraneous loads. * * ARGUMENTS PASSED * * x -> A4 = ptr_x * y -> B4 = ptr_y * count -> A6 * * CYCLES * * N/2 + 24 with C overhead * N/2 + 24 without C overhead * *=============================================================================== .global _dotp .text _dotp: *** BEGIN Benchmark Timing *** B_START: * Prolog Begins **************************************************************** ZERO .L1 A5 ; prod0 = 0 || ZERO .L2 B5 ; prod1 = 0 || LDDW .D1 *A4++[1],A7:A6 ; load x1:x0 from memory || LDDW .D2 *B4++[1],B7:B6 ; load y1:y0 from memory || B .S1 LOOP ; branch to loop ZERO .L1 A8 ; sum0 = 0 || ZERO .L2 B8 ; sum1 = 0 || LDDW .D1 *A4++[1],A7:A6 ; @ load x1:x0 from memory || LDDW .D2 *B4++[1],B7:B6 ; @ load y1:y0 from memory || SUB .S2X A6,10,B0 ; lcntr = count - 10 || B .S1 LOOP ; branch to loop LDDW .D1 *A4++[1],A7:A6 ; @@ load x1:x0 from memory || LDDW .D2 *B4++[1],B7:B6 ; @@ load y1:y0 from memory || B .S1 LOOP ; branch to loop LDDW .D1 *A4++[1],A7:A6 ; @@@ load x1:x0 from memory || LDDW .D2 *B4++[1],B7:B6 ; @@@ load y1:y0 from memory || B .S1 LOOP ; branch to loop LDDW .D1 *A4++[1],A7:A6 ; @@@@ load x1:x0 from memory || LDDW .D2 *B4++[1],B7:B6 ; @@@@ load y1:y0 from memory || B .S1 LOOP ; branch to loop ****** Loop Begins ************************************************************* LOOP: [B0] LDDW .D1 *A4++[1],A7:A6 ; @@@@@ if(lcntr) load x1:x0 from memory ||[B0] LDDW .D2 *B4++[1],B7:B6 ; @@@@@ if(lcntr) load y1:y0 from memory || MPYSP .M1X A6,B6,A5 ; prod0 = x0 * y0 || MPYSP .M2X A7,B7,B5 ; prod1 = x1 * y1 || ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0 || ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1 ||[B0] B .S1 LOOP ; if(lcntr) branch to loop ||[B0] SUB .S2 B0,2,B0 ; if(lcntr) lcntr -= 2 ****** Epilog Begins *********************************************************** ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0 || ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1 ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0 || ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1 ADDSP .L1 A5,A8,A8 ; sum0 = prod0 + sum0 || ADDSP .L2 B5,B8,B8 ; sum1 = prod1 + sum1 * Epilog Ends ****************************************************************** ADDSP .L1X A8,B8,A0 ; A0 = sum0 + sum1 ADDSP .L2X A8,B8,B0 ; B0 = sum0 + sum1 ADDSP .L1X A8,B8,A0 ; A0 = sum0 + sum1 ADDSP .L2X A8,B8,B0 ; B0 = sum0 + sum1 NOP ; wait for B0 ADDSP .L1X A0,B0,A5 ; A5 = A0 + B0 NOP ; wait for next B0 ADDSP .L2X A0,B0,B5 ; B5 = A0 + B0 NOP ; wait for B5 B .S2 B3 ; return from function NOP ; wait for B5 ADDSP .L1X A5,B5,A4 ; return A4 NOP 3 ; wait for A4 and branch B_END: *** END Benchmark Timing *** At 09:23 AM 9/18/2003 -0700, Wojciech Rewers wrote: >--- Arius - Rick Collins <> wrote: >> Perhaps I am missing something. > >clearly one os us is ;-) > >> But it is a poor algorithm indeed that >> can't process a FIR filter with 1 clock per tap! > >hm... > >> I have not taken the time >> to look at the code since I am packing up now to >> travel closer to Isabel's path. > >I though the stream of people is flowing the other >way? > >> I am thinking there is a disconnect in how >> you are counting your >> clock cycles. Are your figures for just one data >> sample input? > >? one data sample input? hm... I'm not sure what you >mean by that... my figure was for "one data sample >output" - meaning - the input parameters are: dly >(delay line of input samples), h (vector of >coefitients), nh (number of coeficients)... and as a >result I get one output sample (from calculating >dotprod as you wish) > >> The MAC on every DSP can process one tap in one >> clock cycle. > >hm... that's what I remember from my DSP classes and >labs where we were using C5x Starter Kits... > >> The C6x >> processors should be able to fetch two operands, >> perform a multiply and add >> into the accumulator on each clock cycle. > >hm... well - from what I understand so far C6711 has >this LDW instruction which can fetch a float number >but it takes 5 cycles to do so... then the MPYSP takes >4 cycles to multiply two float operands and then the >ADDSP takes 4 cycles to accumulate the result... > >but then... I'm not sure myself whether the ADDSP can >directly follow MPYSP - from what I figured so far >from single-stepping some sample code - it cannot >because only after 3 delay slots the results are ready >from MPYSP... but then - I also took a look at the >assembly code published by TI for block FIR filter >where the "main loop" includes 4 cycles with many >parallel instructions in each - I didn't really >analyze all that though... > >well - as I said - I'm not an expert here in assembly >coding - when J G gave referenced me to that >Integrated-DSP code I analyzed that and since it's >much more straightforward I stack with it and >optimized it by almost 50% - I thought it's OK and I >should move on to some other stuff I have on my to do >list... but now you're telling me it sucks? ;-) > >> This would give you some fixed >> number of startup clock cycles plus N cycles where N >> is the number of taps. > >well... as I said - I would expect that on C5x, but >I'm affraid you're wrong here about C67x and >floating-point operations... or is it really me who is >missing something? > >> Instead of working with the clearly inefficient fira >> code, perhaps you >> should start with some other optimized code to see >> how to do it right. The >> entire filter loop should be a single parallel >> operation. Of course you >> will have a bit of messy setup and exit code, but >> the core loop should be >> one set of parallel ops. > >well... I'm really not buying it for now - have you >actually done some assembly coding on C6x? > >> Am I out to lunch on this? If you can't get to k + >> N cycles, it might be >> faster running on an MCU! > >well... I'm not sure if you're "out to lunch" or not - >for me you are - but maybe someone else could be the >judge? > >Wojciech Rewers (still looking for a job) > >__________________________________ >_____________________________________ >Note: If you do a simple "reply" with your email client, only the author of this message will receive your answer. You need to do a "reply all" if you want your answer to be distributed to the entire group. > >_____________________________________ >About this discussion group: > >To Join: Send an email to > >To Post: Send an email to > >To Leave: Send an email to > >Archives: http://www.yahoogroups.com/group/c6x > >Other Groups: http://www.dsprelated.com >">http://docs.yahoo.com/info/terms/

Reply by Wojciech Rewers ●September 18, 20032003-09-18

> Even though it may take more than one clock to complete each of these steps, a new operation can be started on each clock cycle. You don't have to wait for one multiply to finish before you start the next. This is the same on the C6x and the C5x and all other high speed DSPs, they are all pipelined, which means it takes more than one clock to complete an operation, but you can start a new one on every clock. well... all that you wrote was also my general knowledge... sure I've heard what the pipeline is - but never actually played with it in details... and now I tried... unfortunately - it's not as you are describing it... I wrote the code: .global _MAC .text _MAC: MPYSP.m1 A4, A6, A5 NOP 3 ADDSP.l1 A5, B4, A7 NOP 2 B .s2 b3 MV .l1 a7,a4 NOP 4 and I'm calling it like this: float MAC(float, float, float); a = 2; b = 45; c = 3; d = MAC(a, b, c); and it only gives the correct result when there are at least 3 NOPs after MPYSP instruction (because MPYSP has 3 delay slots)... so it's not as you said that you can start a new instruction before waiting for the previous to finish - in this example MPYSP has 3 delay slots and only after those 3 NOPS the result is ready for the ADDSP instruction... then I thought - maybe it's because those commands are not executed by the same units? so I changed the original m1x to m1 (to be at least on the same side without any crosspaths) - but then - the MPYSP instruction can only operate on either M1 or M2 units and the ADDSP can only operate on either L1 or L2 unites - so in the end - I really don't know how ADDSP can directly follow MPYSP (unless they are operating on exclusive sets of registers)... or am I missing something here? anyone? please... > The TI code for a dot product should be a good place to start, assuming that they achieve the k + N timing value. I would be surprised if they don't. This code only needs to be modified for circular buffering to be what you need. right... well - I was really surprised because I started with the integratedDSP's code and there were comments in that code as though it was based on TI's dotprod function - well - it certainly is not! unless they meant some other function than DSPF_sp_dotprod() function from C67x DSPLIB... I briefly examined this one and it's really "complex" with lots of parallelism - nothing even slightly similar to IntegratedDSP's code... > my comment about processing one input sample meant one *new* input sample. The idea is that you want a new output sample every time you get a new input sample, rather than waiting until you have a new block of input samples to yield a new block of output samples. Unless you are decimating or interpolating, input and output size will be the same. right - we're synchronized on this one... as I said in the topic - I need a "sample-by-sample" (as the opposite of block FIR) algorithm... now about the profiler: > Which way are the results off? well - to begin with - the profiler is giving absurd results - hundreds of thousands of cycles in the "average" collumn but then 0 cycles in minimum and maximum collumns - I really don't know how to interpret those... but as I said - I have never really worked with the profiler... the values I published before were from the observing the "clock count" while running the code from breakpoint to breakpoint - there are those options of: enable clock/view clock in the profiler menu... so - that's what I used... > If the profiler is giving very high numbers it is likely because your code is in external memory and it takes multiple clock cycles to load each instruction, and multiple instructions are needed for each execution cycle. Internal memory runs in one clock cycle and works with a 256 bit wide bus to allow maximum execution speeds. right... well - I've heard all that before... I mean - those "general theoretical explanations" - but then - getting down with bits - I still don't understand the whole process... right now I don't have the time to investigate that, but I will when I have some free time... > Cache is ok if you are running large code which is not optimized, but if your code fits the 90/10 rule (90% of your execution time is spent in 10% of your code) an internal memory model will work better. Also the fetches of data can slow the execution if they are from external memory. Instructions will be cached after the first time through the loop, but the data will always be slow. So put the data in internal memory. right... well - I will do some reading on that as soon as I find some more time... > If the profiler is giving low numbers, you may be seeing optimization by the software tools. It has been a long time since I have coded in assembly for this chip, but I seem to recall that there were two forms; one for ease of coding where the tool would do as much parallelism as it could figure out, and one where you had to do all the work, but the tool would not change anything you coded. Which are you working with? well - you lost me here again - I'm writing assembly - not linear assembly which is "optimized or not" - I believe what I write is the exact instruction that the device is executing - and that's what I see in the debugger while single-stepping... anyway - the posts got long ;-) but hey - everyone - if anybody sees something incorect in what I wrote - please do let me know... so - how is Isabelle treating you guys? ;-) Wojciech Rewers __________________________________

Reply by Arius - Rick Collins ●September 22, 20032003-09-22

Let's start with the important stuff, Isabel... I headed to my lake house in VA since that was right on the storm track. Turns out the winds were not all that bad and I had no damage, but the power went off Thursday night and had not come back on Sunday, so I bailed. I am back in Frederick where I can use my computer again. Seems a lot of power lines in VA have tall trees right next to them and when one topples, it can take down two or three poles and sections of line. It takes a long time to fix all that. At 04:48 PM 9/18/2003, you wrote: >well... all that you wrote was also my general >knowledge... sure I've heard what the pipeline is - >but never actually played with it in details... and >now I tried... unfortunately - it's not as you are >describing it... I wrote the code: > > .global _MAC > .text >_MAC: > MPYSP.m1 A4, A6, A5 > NOP 3 > ADDSP.l1 A5, B4, A7 > NOP 2 > B .s2 b3 > MV .l1 a7,a4 > NOP 4 > >and I'm calling it like this: > >float MAC(float, float, float); >a = 2; b = 45; c = 3; > >d = MAC(a, b, c); > >and it only gives the correct result when there are at >least 3 NOPs after MPYSP instruction (because MPYSP >has 3 delay slots)... so it's not as you said that you >can start a new instruction before waiting for the >previous to finish - in this example MPYSP has 3 delay >slots and only after those 3 NOPS the result is ready >for the ADDSP instruction... then I thought - maybe >it's because those commands are not executed by the >same units? so I changed the original m1x to m1 (to be >at least on the same side without any crosspaths) - >but then - the MPYSP instruction can only operate on >either M1 or M2 units and the ADDSP can only operate >on either L1 or L2 unites - so in the end - I really >don't know how ADDSP can directly follow MPYSP (unless >they are operating on exclusive sets of registers)... >or am I missing something here? anyone? please... That is because the result from one operation is not out for several clocks. Only the core of the loop can do all the ops in one clock cycle. You have to lead into it with code that starts each part at the right time. This is sort of like singing a "round", like "row, row, row your boat" if you know what that is. I took a look at the code Andrew Elder posted and I think I understand what they are doing. They use a "prolog" to start the fetches from memory. This must take 5 clocks since they start 5 of them before they reach the loop. In the loop, they start the multiply and also the add. The add should not be started until the mpy results are ready, but it works out ok since the mpy result registers are zeroed in the beginning. So the extra adds in the loop do nothing. The epilog runs more adds to handle the outputs of the multiplies and then adds the two sums (I think they were loading double words so they could keep two MACs busy, I think this only works from internal memory). The final adds account for the fact that since the pipeline in the mpy and add are 4 clocks, there are really four MACs going on in each register, so they have to be summed to get the final result. The NOPs are for timing. As you can see, it takes a lot to write code like this. It is much better to use library code if possible. So this code will do what you want if you can change the addressing for a circular buffer. Can you figure out how to do that? > > The TI code for a dot product should be a good place >to start, assuming that they achieve the k + N timing >value. I would be surprised if they don't. This code >only needs to be modified for circular buffering to be >what you need. > >right... well - I was really surprised because I >started with the integratedDSP's code and there were >comments in that code as though it was based on TI's >dotprod function - well - it certainly is not! unless >they meant some other function than DSPF_sp_dotprod() >function from C67x DSPLIB... I briefly examined this >one and it's really "complex" with lots of parallelism >- nothing even slightly similar to IntegratedDSP's >code... The parallelism is how you get a MAC (or two) on every clock! Other DSPs hide a lot of the details from you and have shorter pipelines. If you want to write optimized assembly for the C6x, you will need to master this. I can't help you a lot on the profiler since I have never used it. Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Reply by Wojciech Rewers ●September 22, 20032003-09-22

--- Arius - Rick Collins <> wrote: > I took a look at the code Andrew Elder posted and I > think I understand what they are doing. yeah... I actually realized I should digest that code even before Andrew advised that... so I did... at first I couldn't understand any of it, but after a while I realized what's going on ;-) god it's simple! and very neat too... > I think this only works from internal memory). and why would that be? > The final adds account for > the fact that since the pipeline in the mpy and add > are 4 clocks, there are > really four MACs going on in each register, so they > have to be summed to > get the final result. yeah... understanding most of that algorithm was easy - the ADDSPs was the most difficult part as I couldn't figure out how you can accumulate into just one register if the ADDSP takes 4 cycles... in fact - even though only one register is used - still there are founr accumulation going on - each for every fourth cycle - and the results are kept somewhere in the pipeline I guess - so then those four sums for each register have to be added together... the spru198 explains the whole process quite well... so - you know how it is - if all possibilities fail - read the manual ;-) > As you can see, it takes a lot to write code like > this. well - again I was proved, there are no difficult things in life - only new ones... at first when I looked at that dotp function - it was all black magic to me - now when I look at it all so simple and neat - I think that functions shows excellently the beauty and power of C6x devices... > It is much better to use library code if possible. well - nobody likes to reinvent the wheel... especially when the deadline is close... > So this code will do what you want if you > can change the addressing for a circular buffer. > Can you figure out how to do that? and what do you think I was doing when you were making out with Isabelle? ;-) I had to give up half the speed though... since I need "sample-by-sample" algotithm - I can't have any constraints on the location of samples in the memory - and using LDDW imposes on you that the samples are aligned on odd-even addresses or something like that... so what I did is that I changed LDDW into LDW, skipped the parallel MPYSPs and ADDSPs plus I use circular buffer for data... if anyone is interested - the code is here: http://www.wrewers.karolin.pl/firc.asm maybe somebody can even find a bug or something - please do let me know... well - I started from thousands of cycles first to filter the data (using library code) then to shift all the samples in the buffer by one - now I have a code that is performing "the same" in around 500 cycles - it's enough for me as for now - but the interesting aspect is that the same thing can be done for so many different ways ;-) the beauty of DSPs I guess... anyway - I have that FIR now - now I will only need a LMS to adapt FIR's coeficients - but I think I'll figure something out... Wojciech Rewers (still looking for a job) __________________________________

Reply by Arius - Rick Collins ●September 22, 20032003-09-22

One final suggestion and I will let you get on with it. You can optimize the code to use the LDDW if you copy your input data into *two* buffers, one odd aligned and one even aligned. Alternate between them on each sample and the LDDW should be in the right place each time. At 03:15 AM 9/22/2003, you wrote: >--- Arius - Rick Collins <> wrote: > > I took a look at the code Andrew Elder posted and I > > think I understand what they are doing. > >yeah... I actually realized I should digest that code >even before Andrew advised that... so I did... at >first I couldn't understand any of it, but after a >while I realized what's going on ;-) god it's simple! >and very neat too... > > > I think this only works from internal memory). > >and why would that be? > > > The final adds account for > > the fact that since the pipeline in the mpy and add > > are 4 clocks, there are > > really four MACs going on in each register, so they > > have to be summed to > > get the final result. > >yeah... understanding most of that algorithm was easy >- the ADDSPs was the most difficult part as I couldn't >figure out how you can accumulate into just one >register if the ADDSP takes 4 cycles... in fact - even >though only one register is used - still there are >founr accumulation going on - each for every fourth >cycle - and the results are kept somewhere in the >pipeline I guess - so then those four sums for each >register have to be added together... the spru198 >explains the whole process quite well... so - you know >how it is - if all possibilities fail - read the >manual ;-) > > > As you can see, it takes a lot to write code like > > this. > >well - again I was proved, there are no difficult >things in life - only new ones... at first when I >looked at that dotp function - it was all black magic >to me - now when I look at it all so simple and neat - >I think that functions shows excellently the beauty >and power of C6x devices... > > > It is much better to use library code if possible. > >well - nobody likes to reinvent the wheel... >especially when the deadline is close... > > > So this code will do what you want if you > > can change the addressing for a circular buffer. > > Can you figure out how to do that? > >and what do you think I was doing when you were making >out with Isabelle? ;-) > >I had to give up half the speed though... since I need >"sample-by-sample" algotithm - I can't have any >constraints on the location of samples in the memory - >and using LDDW imposes on you that the samples are >aligned on odd-even addresses or something like >that... > >so what I did is that I changed LDDW into LDW, skipped >the parallel MPYSPs and ADDSPs plus I use circular >buffer for data... if anyone is interested - the code >is here: > >http://www.wrewers.karolin.pl/firc.asm > >maybe somebody can even find a bug or something - >please do let me know... > >well - I started from thousands of cycles first to >filter the data (using library code) then to shift all >the samples in the buffer by one - now I have a code >that is performing "the same" in around 500 cycles - >it's enough for me as for now - but the interesting >aspect is that the same thing can be done for so many >different ways ;-) the beauty of DSPs I guess... > >anyway - I have that FIR now - now I will only need a >LMS to adapt FIR's coeficients - but I think I'll >figure something out... > >Wojciech Rewers (still looking for a job) > >__________________________________ > Rick Collins Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Previous12 3 Next

sample-by-sample FIR optimized

Sign in

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group