DSPRelated.com
Forums

Slow EMIF transfer

Started by d.st...@yahoo.com June 23, 2009
Hello Dominic,

On Wed, Sep 2, 2009 at 8:13 AM, d.stuartnl wrote:
> Dear Mike,
>
> I have been away on holiday; I hope the people reading this had a
> holiday too (and a good one). Right, back to work it is! I was
> wondering about your earlier idea of longer reads to force the compiler
> to use the LDDW instruction. Now my question is: how do I do this?


I probably did not express my point well [I sometimes answer these
posts while I am on the telephone or during a break from another
task]. *IF* the hardware had been designed with knowledge of the
software, the first register would have been on a 64 bit boundary
[address ends in 0 or 8].
Ignore that thought for your case.
More comments near the end.

>
> the 2 reads are as follows:
>
> tmpRead1 = *(volatile int*) 0x90300004;
> tmpRead2 = *(volatile int*) 0x90300008;
>
> these are both 32 bit reads but I only need 48 bits in total (32 bits from
> tmpRead1, and the 16 least significant bits of tmpRead2). Furthermore, these 2
> reads represent 6 (8-bit) channels:
>
> read: tmpRead2 tmpRead1
> M L M L
> S S S S
> B B B B
> bit :******************************** ********************************
> use :----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6
>
> (forgive my primitive ASCII art :P)
>
> In my fetchData routine (which is pipelining!!) I fetch the data into these
> 2 variables and then distribute the data in 6 channels. My code is as
> follows:
>
> unsigned int Calculator_FetchData(Bool curvature)
> {
>     unsigned int tmpRead1 = 0;
>     unsigned int tmpRead2;
>
>     unsigned int sampleCount;
>     float * restrict pCH1;
>     float * restrict pCH2;
>     float * restrict pCH3;
>     char * restrict pBinData3;
>     float * restrict pCH4;
>     float * restrict pCH5;
>     float * restrict pCH6;
>
>     const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
>     const int termValue = 0x84825131;
>
>     pCH1 = &CH1.deloggedData[0];
>     pCH2 = &CH2.deloggedData[0];
>     pCH3 = &CH3.deloggedData[0];
>     pBinData3 = &binData3[0];
>     pCH4 = &CH4.deloggedData[0];
>     pCH5 = &CH5.deloggedData[0];
>     pCH6 = &CH6.deloggedData[0];
>
>     #pragma MUST_ITERATE(16,4096,2);
>     while(tmpRead1 != termValue)
>     {
>         tmpRead1 = *(volatile int*) 0x90300004;
>         tmpRead2 = *(volatile int*) 0x90300008;
>
>         // CHANNEL 1
>         *pCH1 = LUT0[(tmpRead1 & 0xFF0000) >> 16];
>
>         // CHANNEL 2
>         *pCH2 = LUT0[(tmpRead1 & 0xFF000000) >> 24];
>
>         if(curvature)
>         {
>             *pCH1 += *pCH2;
>             if(*pCH1 > 5000)
>             {
>                 *pCH1 = 5000;
>             }
>         }
>
>         // CHANNEL 5
>         *pCH5 = LUT1[(tmpRead1 & 0xFF00) >> 8];
>
>         // CHANNEL 6
>         *pCH6 = LUT1[tmpRead1 & 0xFF];
>
>         // CHANNEL 3: always read, for particle matching on this channel
>         *pBinData3 = tmpRead2 & 0xFF;
>         *pCH3 = LUT0[*pBinData3];
>
>         // CHANNEL 4
>         *pCH4 = LUT0[(tmpRead2 & 0xFF00) >> 8];
>
>         pCH1++;
>         pCH2++;
>         pCH3++;
>         pBinData3++;
>         pCH4++;
>         pCH5++;
>         pCH6++;
>
>         if(pCH1 > endCH1) // Check for sample overflow (4096 samples max)
>         {
>             tmpRead1 = termValue;
>         }
>     }
>     sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 2;
>     Screen_updateSamples(sampleCount);
>     return sampleCount;
> }
> At the moment I'm getting the following pipeline information in the ASM file:
>
> _Calculator_FetchData:
> ;** ----------------------*
> ;*----------------------*
> ;* SOFTWARE PIPELINE INFORMATION
> ;*
> ;* Loop source line : 219
> ;* Loop opening brace source line : 220
> ;* Loop closing brace source line : 266
> ;* Known Minimum Trip Count : 16
> ;* Known Maximum Trip Count : 4096
> ;* Known Max Trip Count Factor : 2
> ;* Loop Carried Dependency Bound(^) : 7
> ;* Unpartitioned Resource Bound : 9
> ;* Partitioned Resource Bound(*) : 9
> ;* Resource Partition:
> ;* A-side B-side
> ;* .L units 2 1
> ;* .S units 4 4
> ;* .D units 8 9*
> ;* .M units 0 0
> ;* .X cross paths 1 1
> ;* .T address paths 9* 8
> ;* Long read paths 5 4
> ;* Long write paths 0 0
> ;* Logical ops (.LS) 1 1 (.L or .S unit)
> ;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
> ;* Bound(.L .S .LS) 4 3
> ;* Bound(.L .S .D .LS .LSD) 6 6
> ;*
> ;* Searching for software pipeline schedule at ...
> ;* ii = 9 Unsafe schedule for irregular loop
> ;* ii = 9 Did not find schedule
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Did not find schedule
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Did not find schedule
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Did not find schedule
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Did not find schedule
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Did not find schedule
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Did not find schedule
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Did not find schedule
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Did not find schedule
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Did not find schedule
> ;* ii = 19 Unsafe schedule for irregular loop
> ;* ii = 19 Schedule found with 1 iterations in parallel
> ;*
> ;* Register Usage Table:
> ;* +---------------------------------+
> ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> ;* |0000000000111111|0000000000111111|
> ;* |0123456789012345|0123456789012345|
> ;* |----------------+----------------|
> ;* 0: |* * ********* | ***** * *** |
> ;* 1: |* * ********** | ***** * *** |
> ;* 2: |* * ********** | ***** * *** |
> ;* 3: |* * ********** | ***** * *** |
> ;* 4: |* * ********** | ***** * *** |
> ;* 5: |* * ********** | ***** * *** |
> ;* 6: |* ************* | ***** * *** |
> ;* 7: |* **************| ***** * *** |
> ;* 8: |* **************| ***** ** *** |
> ;* 9: |* **************| ******** *** |
> ;* 10: |* **************|********* *** |
> ;* 11: |****************| ***** ****** |
> ;* 12: |****************| ***** ****** |
> ;* 13: |****************| ************ |
> ;* 14: |* **************| ***** ****** |
> ;* 15: |* **************| ***** * **** |
> ;* 16: |*************** | ******* *** |
> ;* 17: |*** * ********* | ******* *** |
> ;* 18: |*** * ********* | ******* *** |
> ;* +---------------------------------+
> ;*
> ;* Done
> ;*
> ;* Loop is interruptible
> ;* Collapsed epilog stages : 0
> ;* Collapsed prolog stages : 0
> ;*
> ;* Minimum safe trip count : 1
>
> Looking at the register usage it looks like it's using quite a lot of
> registers, and I thought that maybe LDDW would relieve some registers.
> Also I was wondering if I can force-align arrays in memory? If that's
> possible I can use 1 pointer to access all the channels (by using an offset
> when addressing):
> ----------------------
> IDEA:
> //CHANNEL 1
> *pCH1= LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> // CHANNEL 2
> *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
> // OTHER CHANNELS addressed using bigger offsets...
> ----------------------
>
> I don't know if this idea is feasible, but if it is I think it would take
> some more pressure off the register usage.
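One hypothetical way to get the guaranteed layout this offset idea needs is to put the six per-channel buffers into a single 2-D array, so the channel stride is a compile-time constant and a single base pointer plus an offset reaches every channel. The names `channelData`, `MAXSAMPLES` and `store_sample` are illustrative, not from the original code; with TI's tools the array could additionally be aligned with `#pragma DATA_ALIGN` for wide stores.

```c
#include <stddef.h>

#define MAXSAMPLES 4096
#define NCHANNELS  6

/* All channels in one block: channel ch starts at a fixed,
 * compile-time offset of ch * MAXSAMPLES floats from the base. */
static float channelData[NCHANNELS][MAXSAMPLES];

static void store_sample(float *base, size_t i, int ch, float value)
{
    /* base points at channelData[0][0]; one pointer serves all channels */
    base[(size_t)ch * MAXSAMPLES + i] = value;
}
```

Whether this actually reduces register pressure depends on how the compiler partitions the loop, so it is worth checking the register usage table in the .asm listing before and after.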
>
> Anyone's ideas/comments are welcome. At the moment the code is running 33%
> too slow. If I offer 4000 samples @ 4 MHz, the data will take 1000 us to load
> into my FIFO's. It takes the DSP 1500 us to run the FetchData routine. In an
> ideal situation I would like to complete the FetchData routine in 1000 us
> (not any shorter or I would read faster than data is being written :P).
>

I took a quick look at your code and comments above. You only mention
acquisition time - nothing about processing time.
If you are only interested in acquisition time, you could do something like:
1.
unsigned int aMyIntArray[numOfSamples];
unsigned short aMyShortArray[numOfSamples];
2.
Just grab the raw data [this would allow a future EDMA implementation]
3.
Process the data from the array.
4.
You will now have a new problem - it runs too fast.
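A minimal sketch of what steps 1-3 above could look like, assuming a termination word marks the end of a burst as in Dominic's loop. The function name, `NUM_SAMPLES`, and the FIFO pointer parameters are illustrative stand-ins for the real hardware addresses:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_SAMPLES 4096

static uint32_t aMyIntArray[NUM_SAMPLES];    /* raw 32-bit FIFO words */
static uint16_t aMyShortArray[NUM_SAMPLES];  /* raw 16-bit FIFO words */

/* Phase 1: acquisition only - a tight copy loop that does nothing but
 * drain the FIFOs into memory (and which EDMA could later replace).
 * Returns the number of sample pairs captured, including the
 * terminator word. */
size_t capture_raw(const volatile uint32_t *fifo32,
                   const volatile uint16_t *fifo16,
                   uint32_t termValue)
{
    size_t n = 0;
    while (n < NUM_SAMPLES) {
        uint32_t w = *fifo32;        /* repeated read of one address: FIFO */
        aMyIntArray[n]   = w;
        aMyShortArray[n] = *fifo16;
        n++;
        if (w == termValue)
            break;
    }
    return n;
}
```

Phase 2 (the LUT unpack into per-channel float arrays) would then run over `aMyIntArray`/`aMyShortArray` as an ordinary, easily pipelined loop.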

mikedunn
> With kind regards,
>
> Dominic Stuart
> --- In c..., Michael Dunn wrote:
>>
>> Dominic,
>>
>> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl wrote:
>> >
>> >
>> > Thanks Mike,
>> >
>> > I'm starting to enjoy this "tweaking" and am trying to push it as far as
>> > I
>> > can because every microsecond that i gain means the DSP can handle more
>> > particles/second. I've applied the tips I've gotten on this forum on the
>> > rest of my source code as well (the actual loops in my program that do
>> > the
>> > calculations on the data) and those are pipelining aswell now. Compared
>> > to
>> > the initial source total improvement is over 900%! Amazing (looks like I
>> > was
>> > using the DSP as a glorified MCU) but the true power of the DSP is
>> > starting
>> > to show! I thank you for your input but it raises some questions if you
>> > don't mind:
>> >
>> > --- In c..., Michael Dunn wrote:
>> >>
>> >> Congratulations, Dominic!!
>> >>
>> >> I'll top post this minor comment wrt 16/32 bit memory accesses and
>> >> speed.
>> >>
>> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
>> >> 16, and 8 bit accesses will be the same speed.
>> >
>> > What do you mean with aligned exactly?
>>
>> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
>> == 0' then it is aligned.
>> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
>> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
>> are aligned [byte addresses are always aligned].
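Mike's divisibility rule can be written as a one-line helper. This is a host-side sketch (the `uintptr_t` cast is for portability; on the C6x an address is simply a 32-bit integer):

```c
#include <stdint.h>

/* An access is aligned when the address is evenly divisible by the
 * access size in bytes - i.e. myAddress % myAccessSize == 0. */
static int is_aligned(uintptr_t address, unsigned accessSize)
{
    return (address % accessSize) == 0;
}
```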
>>
>> >
>> >> Only if your external memory is 8 or 16 bits wide would there be any
>> >> potential advantage in performing 16 bit accesses instead of 32 bit
>> >> accesses.
>> >> Also, there would be an advantage in fetching 32 bits at a time if you
>> >> access an entire array of 8 or 16 bit values.
>> >>
>> >
>> > I'm reading from 3 (16-bit) FIFO's. I've hooked them up so tempRead1
>> > reads the first two together (logic tied together so they "act" like 1
>> > 32-bit wide FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the
>> > FIFO).
>> >
>> >> I haven't looked at the details of your code, but if you always fetch
>> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
>> >> that your hardware addresses are preventing you from picking up some
>> >> additional speed. *If* the input addresses began on a 64 bit boundary
>> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
>> >> any memory fetch would coerce the compiler to performing an 'LDDW' [64
>> >> bit read].
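A sketch of the LDDW idea, under the assumption Mike states: the two FIFO words would have to sit at a 64-bit-aligned address (which they do not in Dominic's hardware, where the pair starts at 0x90300004). One 64-bit volatile read replaces the two 32-bit reads, and the halves are peeled off with shifts. Little-endian byte order is assumed here; on a big-endian C6x configuration the halves would swap:

```c
#include <stdint.h>

/* Fetch 48 useful bits via one 64-bit read.  On the C6000, a 64-bit
 * load from an aligned address can compile to a single LDDW. */
static void read48(const volatile uint64_t *fifoPair,
                   uint32_t *lo32, uint16_t *lo16)
{
    uint64_t v = *fifoPair;          /* one wide read instead of two */
    *lo32 = (uint32_t)v;             /* first 32-bit word            */
    *lo16 = (uint16_t)(v >> 32);     /* low half of the second word  */
}
```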
>> >
>> > I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by
>> > telling my compiler to fetch a 64 bit read? (I mean, this still has
>> > to be split into 2 read cycles somehow?)
>>
>> First of all, I wrote this before I had the idea of using a single
>> pointer. Your code has 2 pointers that load data - this means that
>> you are using 4 processor registers. Changing to a single 64 bit read
>> [32 x 2] would result in requiring only 3 registers. If your routine
>> has a lot of register pressure [utilization] where it is loading and
>> unloading CPU registers, then a 'register reduction change' would help
>> performance.
>>
>> As I finished writing about the double read, I thought of 'plan B' -
>> just use one pointer with an offset. When you look at the asm
>> listing, it should give you some register usage info. If you are
>> getting 'spills' then definitely try this.
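The 'plan B' read, sketched below. One base pointer covers both FIFO addresses, since they differ by a fixed 4 bytes; note that on an `int` pointer the offset is +1 element, not +4. The name `pFifo` and the wrapper function are illustrative:

```c
#include <stdint.h>

/* Both hardware reads through a single base pointer, freeing the
 * register pair the second address pointer would otherwise occupy. */
static void fetch_pair(const volatile uint32_t *pFifo,
                       uint32_t *tmpRead1, uint32_t *tmpRead2)
{
    *tmpRead1 = pFifo[0];   /* 0x90300004                          */
    *tmpRead2 = pFifo[1];   /* 0x90300008: base + 1 int = +4 bytes */
}
```

In the real loop `pFifo` would be initialized as `(const volatile uint32_t *)0x90300004`.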
>>
>> >
>> >>
>> >> Since your hardware addresses are fixed, you only need 1 pointer. You
>> >> could use
>> tmpRead2 = *(read1 + 1); /* int pointer: +1 element = +4 bytes */
>> >> This would free up one register and, depending on register
>> >> utilization, could improve the performance.
>> >>
>> >
>> > Improve performance, thats what I like to hear ;) I hope my questions
>> > aren't
>> > too "basic".
>>
>> Most active members of this group are willing to help someone who
>> wants to learn. As long as your questions are informed and you show a
>> willingness to participate, most of us will help if we can. We come
>> from a variety of backgrounds and each of us end up learning something
>> from time to time.
>>
>> As you are learning, 'performance improvement' is not something that
>> has a single solution. Rather, it is a journey with many stops along
>> the way.
>>
>> mikedunn
>> >
>> > Dominic
>> >
>> >> mikedunn
>> >>
>> >>
>> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
>> >> >
>> >> >
>> >> > d.stuartnl,
>> >> >
>> >> > my comments in-line and prefixed with
>> >> >
>> >> > R. Williams
>> >> >
>> >> > ---------- Original Message -----------
>> >> > From: "d.stuartnl"
>> >> > To: c...
>> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
>> >> > Subject: [c6x] Re: Slow EMIF transfer
>> >> >
>> >> >> R.Williams,
>> >> >>
>> >> >> SUCCESS! Looptime has almost halved! Software pipelining is working
>> >> >> now thanks to your tips:
>> >> >
>> >> > congratulations!!
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0])
>> >> >> -1;
>> >> >> is working fine as it is. Don't know why though.
>> >> >
>> >> > two reasons:
>> >> > 1) the data size of a float is the same as the address data size
>> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
>> >> > loop
>> >> > to point 1 past the last location used.
>> >> >
>> >> >
>> >> >> >
>> >> >> still have them in a single loop and it's pipelining. Do you think
>> >> >> it's worth considering splitting it into two loops and check if
>> >> >> there's (an even better) speed increase?
>> >> >
>> >> > you could experiment, but it looks like it is not necessary to
>> >> > separate
>> >> > the code into two loops.
>> >> >
>> >> >
>> >> >> My new and improved function:
>> >> >
>> >> >
>> >> >> // CHANNEL 3 this channel is always read for particle matching
>> >> >> // on this channel
>> >> >> *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>> >> >> *pBinData3 = tmpRead2 & 0xFF;
>> >> >> // CHANNEL 4
>> >> >> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>> >> >
>> >> > there seems to be a problem in the editing of the above 4 lines
>> >> > It looks like pCH3 is not being used; however, pCH3 is still being
>> >> > initialized
>> >> > and incremented in the code.
>> >> > Also when testing for execution speed, adding new operations
>> >> > (pBinData3)
>> >> > makes
>> >> > it very difficult to make timing comparisons.
>> >> >
>> >> >
>> >> >>
>> >> >> As you might have seen in my code the second read (tempRead2) is a
>> >> >> 32 bits int but I'm only interested in the first 16 bits (where
>> >> >> channel 3 and 4 reside), is there a way I can inform the compiler
>> >> >
>> >> > the natural size of an operation is 32 bits; changing to a 16 bit
>> >> > operation would slow the code execution.
>> >> >
>> >> >>
>> >> >> I had to leave pFifo12 and pFifo3 volatile because when I removed
>> >> >> these keywords the software pipelining was disabled again (Cannot
>> >> >> find schedule).
>> >> >
>> >> > the 'volatile' is needed for the two parameters because they DO
>> >> > change
>> >> > between reads. I had suggested to remove the 'volatile' from the
>> >> > variables,
>> >> > not
>> >> > the parameters.
>> >> >
>> >> >
>> >> >>
>> >> >> With kind regards,
>> >> >>
>> >> >> Dominic
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>> >>
>> >
>> >
>>
>> --
>> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>>

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________
Hi Mike,

--- In c..., Michael Dunn wrote:
>
> Hello Dominic,
>
>
> I took a quick look at your code and comments above. You only mention
> acquisition time - nothing about processing time.
> If you are only interested in acquisition time, you could do something like:
> 1.
> unsigned int aMyIntArray[numOfSamples];
> unsigned short aMyShortArray[numOfSamples];
> 2.
> Just grab the raw data [this would allow a future EDMA implementation]
> 3.
> Process the data from the array.
> 4.
> You will now have a new problem - it runs too fast.

With acquisition of the data I mean acquiring the true data. I'll try to explain:

The raw data is 8-bit compressed, packed into 1 32-bit read and 1 16-bit read. To be able to "work" with the data I need to separate and decompress it (with a LUT, which results in floats) so every channel ends up in its own float array.

I start my "fetching" algorithm after the first byte has been written into the FIFO. Fetching the raw data faster than it is being written is of no use, because I need to process the (separated and decompressed) data ASAP. So in the 1000 us it takes to write all the samples (let's say 4000) to my FIFO, I want to not only acquire the compressed data, but also decompress it (using the LUT) and separate the channels. This way I can start processing the decompressed data ASAP.

In an ideal world (read: not very likely) I would like to do some of the calculations on the decompressed & separated data while I am fetching it. This would give me a huge boost, because it would make it possible to decide, based on those calculations, to ignore the current datastream and wait for the next one, or to continue processing it.

I think some calculations will be possible if I can tweak this fetching routine somewhat.
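The unpack step described above, isolated into one function for clarity: six compressed 8-bit samples sit in 48 bits, and each byte indexes a delog LUT to produce a float. The channel-to-byte mapping follows the ASCII diagram earlier in the thread; `LUT0`/`LUT1` here are empty stand-ins for the real delog tables:

```c
#include <stdint.h>

static float LUT0[256], LUT1[256];   /* stand-ins for the delog tables */

/* Split one 32-bit + one 16-bit FIFO word into six delogged channels. */
static void unpack48(uint32_t tmpRead1, uint32_t tmpRead2, float ch[6])
{
    ch[0] = LUT0[(tmpRead1 >> 16) & 0xFF];  /* CHANNEL 1 */
    ch[1] = LUT0[(tmpRead1 >> 24) & 0xFF];  /* CHANNEL 2 */
    ch[2] = LUT0[ tmpRead2        & 0xFF];  /* CHANNEL 3 */
    ch[3] = LUT0[(tmpRead2 >>  8) & 0xFF];  /* CHANNEL 4 */
    ch[4] = LUT1[(tmpRead1 >>  8) & 0xFF];  /* CHANNEL 5 */
    ch[5] = LUT1[ tmpRead1        & 0xFF];  /* CHANNEL 6 */
}
```

Keeping the unpack in its own function like this also makes it easier to benchmark acquisition and decompression separately.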

With kind regards,

Dominic
> >> >> >>
> >> >> >> Dominic
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
> >>
> >
> > --
> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>

_____________________________________
Dominic-

> I have been away on holiday, I hope the people who are
> reading this had a holiday too (and a good one).
> Right, back to work it is! I was wondering about your earlier
> idea about longer reads to force the compiler to use the
> LDDW instruction. Now my question is: how do I do this?
>
> the 2 reads are as follows:
>
> tmpRead1 = *(volatile int*) 0x90300004;
> tmpRead2 = *(volatile int*) 0x90300008;
>
> these are both 32 bit reads but I only need 48 bits in total
> (32 bits from tmpRead1, and the 16 least significant bits
> of tmpRead2).

One general comment I might make...

Normally it's a good idea to put some FPGA logic between a FIFO and the DSP, so if you're required to collate FIFO
values into larger words -- or split them apart, or arrange data into chunks that are natural size for the DSP's DMA
engine, whatever is needed -- it's do-able *after* the hardware design process, with minimal effort required for
trial-and-error.

I know in this case that you have a direct EMIF interface to the FIFO, without intermediate FPGA logic... and I'm not
trying to say anything negative about your method, which seems to be working fine. My point is, it might be better
not to spend too much time worrying about how you split 6 channels out in software and how to optimize the last 10 to
20% of your performance, but instead just call it a "phase 1" or proof-of-concept. If increased performance is
needed, then recommend to your managers that the hardware design should be modified and flexibility added.

Otherwise, you could work on this for many months, writing increasingly specific, non-general C code that fits only one
type of hardware model, and still not hit the MByte/sec performance level that you really need.

Again, just a general comment... comes from many years of experience at this stuff.

-Jeff

> Furthermore, these 2 reads represent 6 (8-bit) channels:
>
> read: tmpRead2 tmpRead1
> M L M L
> S S S S
> B B B B
> bit :******************************** ********************************
> use :----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6
>
> (forgive my primitive ASCII art :P)
>
> In my fetchData routine (which is pipelining!!) I fetch the data into these 2 variables and then distribute the data
> in 6 channels. My code is as follows:
>
> unsigned int Calculator_FetchData(Bool curvature)
> {
> unsigned int tmpRead1 =0;
> unsigned int tmpRead2;
> unsigned int sampleCount;
> float * restrict pCH1;
> float * restrict pCH2;
> float * restrict pCH3;
> char * restrict pBinData3;
> float * restrict pCH4;
> float * restrict pCH5;
> float * restrict pCH6;
>
> const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
> const int termValue = 0x84825131;
>
> pCH1 = &CH1.deloggedData[0];
> pCH2 = &CH2.deloggedData[0];
> pCH3 = &CH3.deloggedData[0];
> pBinData3 = &binData3[0];
> pCH4 = &CH4.deloggedData[0];
> pCH5 = &CH5.deloggedData[0];
> pCH6 = &CH6.deloggedData[0];
>
> #pragma MUST_ITERATE(16,4096,2);
> while(tmpRead1 != termValue)
> {
> tmpRead1 = *(volatile int*) 0x90300004;
> tmpRead2 = *(volatile int*) 0x90300008;
>
> //CHANNEL 1
> *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>
> // CHANNEL 2
> *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
> if(curvature)
> {
> *pCH1 += *pCH2;
> if(*pCH1 > 5000)
> {
> *pCH1 = 5000;
> }
> }
>
> //CHANNEL 5
> *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>
> // CHANNEL 6
> *pCH6 = LUT1[tmpRead1 & 0xFF];
>
> // CHANNEL 3 this channel is always read for particle matching on this channel
> *pBinData3 = tmpRead2 & 0xFF;
> *pCH3 = LUT0[*pBinData3];
>
> // CHANNEL 4
> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
> pCH1++;
> pCH2++;
> pCH3++;
> pBinData3++;
> pCH4++;
> pCH5++;
> pCH6++;
>
> if(pCH1 > endCH1)//Check for sample overflow (4096 samples max)
> {
> tmpRead1 = termValue;
> }
> }
> sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
> Screen_updateSamples(sampleCount);
> return sampleCount;
> }
> At the moment I'm getting the following pipeline information in the ASM file:
>
> _Calculator_FetchData:
> ;** --*
> ;*----*
> ;* SOFTWARE PIPELINE INFORMATION
> ;*
> ;* Loop source line : 219
> ;* Loop opening brace source line : 220
> ;* Loop closing brace source line : 266
> ;* Known Minimum Trip Count : 16
> ;* Known Maximum Trip Count : 4096
> ;* Known Max Trip Count Factor : 2
> ;* Loop Carried Dependency Bound(^) : 7
> ;* Unpartitioned Resource Bound : 9
> ;* Partitioned Resource Bound(*) : 9
> ;* Resource Partition:
> ;* A-side B-side
> ;* .L units 2 1
> ;* .S units 4 4
> ;* .D units 8 9*
> ;* .M units 0 0
> ;* .X cross paths 1 1
> ;* .T address paths 9* 8
> ;* Long read paths 5 4
> ;* Long write paths 0 0
> ;* Logical ops (.LS) 1 1 (.L or .S unit)
> ;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
> ;* Bound(.L .S .LS) 4 3
> ;* Bound(.L .S .D .LS .LSD) 6 6
> ;*
> ;* Searching for software pipeline schedule at ...
> ;* ii = 9 Unsafe schedule for irregular loop
> ;* ii = 9 Did not find schedule
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Did not find schedule
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Did not find schedule
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Did not find schedule
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Did not find schedule
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Did not find schedule
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Did not find schedule
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Did not find schedule
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Did not find schedule
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Did not find schedule
> ;* ii = 19 Unsafe schedule for irregular loop
> ;* ii = 19 Schedule found with 1 iterations in parallel
> ;*
> ;* Register Usage Table:
> ;* +---------------------------------+
> ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> ;* |0000000000111111|0000000000111111|
> ;* |0123456789012345|0123456789012345|
> ;* |----------------+----------------|
> ;* 0: |* * ********* | ***** * *** |
> ;* 1: |* * ********** | ***** * *** |
> ;* 2: |* * ********** | ***** * *** |
> ;* 3: |* * ********** | ***** * *** |
> ;* 4: |* * ********** | ***** * *** |
> ;* 5: |* * ********** | ***** * *** |
> ;* 6: |* ************* | ***** * *** |
> ;* 7: |* **************| ***** * *** |
> ;* 8: |* **************| ***** ** *** |
> ;* 9: |* **************| ******** *** |
> ;* 10: |* **************|********* *** |
> ;* 11: |****************| ***** ****** |
> ;* 12: |****************| ***** ****** |
> ;* 13: |****************| ************ |
> ;* 14: |* **************| ***** ****** |
> ;* 15: |* **************| ***** * **** |
> ;* 16: |*************** | ******* *** |
> ;* 17: |*** * ********* | ******* *** |
> ;* 18: |*** * ********* | ******* *** |
> ;* +---------------------------------+
> ;*
> ;* Done
> ;*
> ;* Loop is interruptible
> ;* Collapsed epilog stages : 0
> ;* Collapsed prolog stages : 0
> ;*
> ;* Minimum safe trip count : 1
>
> Looking at the register usage it looks like it's using quite a lot of registers and I thought that maybe the LDDW
> would relieve some registers. Also I was wondering if I can force-align arrays in memory? If that's possible I can
> use 1 pointer to access all the channels (by using an offset when addressing):
> ----------------------------
> IDEA:
> //CHANNEL 1
> *pCH1= LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> // CHANNEL 2
> *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
> // OTHER CHANNELS addressed using bigger offsets...
> -------------------------------
>
> I don't know if this idea is feasible but if it is I think it would take some more pressure off the register usage.
>
> Anyone's ideas/comments are welcome. At the moment the code is running 33% too slow. If I offer 4000 samples @ 4MHz
> (the data takes 1000 us to load into my FIFO's), it takes the DSP 1500 us to run the FetchData routine. In an ideal
> situation I would like to complete the FetchData routine in 1000 us (not any shorter or I would read faster than data
> is being written :P).
>
> With kind regards,
>
> Dominic Stuart
> --- In c..., Michael Dunn wrote:
>>
>> Dominic,
>>
>> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl wrote:
>> >
>> >
>> > Thanks Mike,
>> >
>> > I'm starting to enjoy this "tweaking" and am trying to push it as far as I
>> > can because every microsecond that i gain means the DSP can handle more
>> > particles/second. I've applied the tips I've gotten on this forum on the
>> > rest of my source code as well (the actual loops in my program that do the
>> > calculations on the data) and those are pipelining as well now. Compared to
>> > the initial source total improvement is over 900%! Amazing (looks like I was
>> > using the DSP as a glorified MCU) but the true power of the DSP is starting
>> > to show! I thank you for your input but it raises some questions if you
>> > don't mind:
>> >
>> > --- In c..., Michael Dunn wrote:
>> >>
>> >> Congratulations, Dominic!!
>> >>
>> >> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
>> >>
>> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
>> >> 16, and 8 bit accesses will be the same speed.
>> >
>> > What do you mean with aligned exactly?
>>
>> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
>> == 0' then it is aligned.
>> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
>> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
>> are aligned [byte addresses are always aligned].
>>
>> >
>> >> Only if your external memory is 8 or 16 bits wide would there be any
>> >> potential advantage in performing 16 bit accesses instead of 32 bit
>> >> accesses.
>> >> Also, there would be an advantage in fetching 32 bits at a time if you
>> >> have an entire array of 8 or 16 bit values.
>> >>
>> >
>> > I'm reading from 3 (16-bit) FIFO's. I've hooked them up so tempRead1 reads
>> > the first two together (logic tied together so they "act" like one 32-bit wide
>> > FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
>> >
>> >> I haven't looked at the details of your code, but if you always fetch
>> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
>> >> that your hardware addresses are preventing you from picking up some
>> >> additional speed. *If* the input addresses began on a 64 bit boundary
>> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
>> >> any memory fetch would coerce the compiler to performing an 'LDDW' [64
>> >> bit read].
>> >
>> > I do always fetch 48 bits (1x 32, 1x 16) but what would I gain by telling my
>> > compiler to do a 64 bit read (I mean this still has to be split
>> > into 2 read cycles somehow?)
>>
>> First of all, I wrote this before I had the idea of using a single
>> pointer. Your code has 2 pointers that load data - this means that
>> you are using 4 processor registers. Changing to a single 64 bit read
>> [32 x 2] would result in requiring only 3 registers. If your routine
>> has a lot of register pressure [utilization] where it is loading and
>> unloading CPU registers, then a 'register reduction change' would help
>> performance.
>>
>> As I finished writing about the double read, I thought of 'plan B' -
>> just use one pointer with an offset. When you look at the asm
>> listing, it should give you some register usage info. If you are
>> getting 'spills' then definitely try this.
>>
>> >
>> >>
>> >> Since your hardware addresses are fixed, you only need 1 pointer. You
>> >> could use
>> >> tmpRead2 = *(read1 + 4);
>> >> This would free up one register and, depending on register
>> >> utilization, could improve the performance.
>> >>
>> >
>> > Improve performance, that's what I like to hear ;) I hope my questions aren't
>> > too "basic".
>>
>> Most active members of this group are willing to help someone who
>> wants to learn. As long as your questions are informed and you show a
>> willingness to participate, most of us will help if we can. We come
>> from a variety of backgrounds and each of us end up learning something
>> from time to time.
>>
>> As you are learning, 'performance improvement' is not something that
>> has a single solution. Rather, it is a journey with many stops along
>> the way.
>>
>> mikedunn
>> >
>> > Dominic
>> >
>> >> mikedunn
>> >>
>> >>
>> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
>> >> >
>> >> >
>> >> > d.stuartnl,
>> >> >
>> >> > my comments in-line and prefixed with
>> >> >
>> >> > R. Williams
>> >> >
>> >> > ---------- Original Message -----------
>> >> > From: "d.stuartnl"
>> >> > To: c...
>> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
>> >> > Subject: [c6x] Re: Slow EMIF transfer
>> >> >
>> >> >> R.Williams,
>> >> >>
>> >> >> SUCCESS! Looptime has almost halved! Software pipelining is working
>> >> >> now thanks to your tips:
>> >> >
>> >> > congratulations!!
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
>> >> >> is working fine as it is. Don't know why though.
>> >> >
>> >> > two reasons:
>> >> > 1) the data size of a float is the same as the address data size
>> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
>> >> > loop
>> >> > to point 1 past the last location used.
>> >> >
>> >> >
>> >> >> >
>> >> >> still have them in a single loop and it's pipelining. Do you think
>> >> >> it's worth considering splitting it into two loops and check if
>> >> >> there's (an even better) speed increase?
>> >> >
>> >> > you could experiment, but it looks like it is not necessary to
>> >> > separate
>> >> > the code into two loops.
>> >> >
>> >> >
>> >> >> My new and improved function:
>> >> >
>> >> >
>> >> >> // CHANNEL 3 this channel is always read for particle matching
>> >> >> on this channel
>> >> >> *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>> >> >> *pBinData3 = tmpRead2 & 0xFF;
>> >> >> // CHANNEL 4
>> >> >> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>> >> >
>> >> > there seems to be a problem in the editing of the above 4 lines
>> >> > It looks like pCH3 is not being used; however, pCH3 is still being
>> >> > initialized
>> >> > and incremented in the code.
>> >> > Also when testing for execution speed, adding new operations (pBinData3)
>> >> > makes
>> >> > it very difficult to make timing comparisons.
>> >> >
>> >> >
>> >> >>
>> >> >> As you might have seen in my code the second read (tempRead2) is a 32
>> >> >> bits int but I'm only interested in the first 16 bits (where channel
>> >> >> 3 and 4 reside), is there a way I can inform the compiler
>> >> >
>> >> > the natural size of an operation is 32 bits, changing to a 16 bit
>> >> > operation
>> >> > would slow the code execution.
>> >> >
>> >> >>
>> >> >> I had to leave pFifo12 and pFifo3 volatile because when I removed
>> >> >> these keywords the software pipelining was disabled again (Cannot find
>> >> >> schedule).
>> >> >
>> >> > the 'volatile' is needed for the two parameters because they DO
>> >> > change
>> >> > between reads. I had suggested to remove the 'volatile' from the
>> >> > variables,
>> >> > not
>> >> > the parameters.
>> >> >
>> >> >
>> >> >>
>> >> >> With kind regards,
>> >> >>
>> >> >> Dominic
>> >> >

_____________________________________
Jeff,

thanks for the feedback, it is appreciated, and you are totally right that it would have been a good idea to add additional FPGA logic between the FIFO and the DSP ;). I will take your idea into consideration when a re-design/hardware update takes place. But for now I am "stuck" with my current design, and I was not intending to spend months on this particular problem; I was just wondering whether some more fine-tuning is possible. Not only would that benefit the current design, it would also benefit me, because I am learning a lot from the tips you guys are providing.

With kind regards,

Dominic

--- In c..., "Jeff Brower" wrote:
>
> Dominic-
>
> > I have been away on holiday, I hope (for the people who are
> > reading this they had a holiday too (and a good one).
> > Right, back to work it is! I was wondering about your earlier
> > idea about longer reads to force the compiler to use the
> > LDDW instruction. Now is my question, how do I do this?
> >
> > the 2 reads are as follows:
> >
> > tmpRead1 = *(volatile int*) 0x90300004;
> > tmpRead2 = *(volatile int*) 0x90300008;
> >
> > these are both 32 bit reads but i only need 48 bits in total
> > (32 bits from tmpRead1, and 16 (least significant) bits
> > of tmpRead2.
>
> One general comment I might make...
>
> Normally it's a good idea to put some FPGA logic between a FIFO and the DSP, so if you're required to collate FIFO
> values into larger words -- or split them apart, or arrange data into chunks that are natural size for the DSP's DMA
> engine, whatever is needed -- it's do-able *after* the hardware design process, with minimal effort required for
> trial-and-error.
>
> I know in this case that you have a direct EMIF interface to the FIFO, without intermediate FPGA logic... and I'm not
> trying to say anything negative about your method, which seems to be working fine. My point is, it might be better to
> not to spend too much time worrying about how you split 6 channels out in software and how to optimize the last 10 to
> 20% of your performance, but instead just call it a "phase 1" or proof-of-concept. If increased performance is
> needed, then recommend to your managers that the hardware design should be modified and flexibility added.
>
> Otherwise, you could work on this many months, and write increasingly specific, non-general C code that fits only one
> type of hardware model, and still not hit the MByte/sec performance level that you really need.
>
> Again, just a general comment... comes from many years of experience at this stuff.
>
> -Jeff
>
> > Further more these 2 reads represent 6 (8-bit) channels:
> >
> > read: tmpRead2 tmpRead1
> > M L M L
> > S S S S
> > B B B B
> > bit :******************************** ********************************
> > use :----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6
> >
> > (forgive my primitive ASCII art :P)
> >
> > In my fetchData routine (which is pipelining!!) I fetch the data into these 2 variables and then distribute the data
> > in 6 channels. My code is as follows:
> >
> > unsigned int Calculator_FetchData(Bool curvature)
> > {
> > unsigned int tmpRead1 =0;
> > unsigned int tmpRead2;
> > unsigned int sampleCount;
> > float * restrict pCH1;
> > float * restrict pCH2;
> > float * restrict pCH3;
> > char * restrict pBinData3;
> > float * restrict pCH4;
> > float * restrict pCH5;
> > float * restrict pCH6;
> >
> > const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
> > const int termValue = 0x84825131;
> >
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > pCH3 = &CH3.deloggedData[0];
> > pBinData3 = &binData3[0];
> > pCH4 = &CH4.deloggedData[0];
> > pCH5 = &CH5.deloggedData[0];
> > pCH6 = &CH6.deloggedData[0];
> >
> > #pragma MUST_ITERATE(16,4096,2);
> > while(tmpRead1 != termValue)
> > {
> > tmpRead1 = *(volatile int*) 0x90300004;
> > tmpRead2 = *(volatile int*) 0x90300008;
> >
> > //CHANNEL 1
> > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> >
> > // CHANNEL 2
> > *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> > if(curvature)
> > {
> > *pCH1 += *pCH2;
> > if(*pCH1 > 5000)
> > {
> > *pCH1 = 5000;
> > }
> > }
> >
> > //CHANNEL 5
> > *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> >
> > // CHANNEL 6
> > *pCH6 = LUT1[tmpRead1 & 0xFF];
> >
> > // CHANNEL 3 this channel is always read for particle matching on this channel
> > *pBinData3 = tmpRead2 & 0xFF;
> > *pCH3 = LUT0[*pBinData3];
> >
> > // CHANNEL 4
> > *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >
> > pCH1++;
> > pCH2++;
> > pCH3++;
> > pBinData3++;
> > pCH4++;
> > pCH5++;
> > pCH6++;
> >
> > if(pCH1 > endCH1)//Check for sample overflow (4096 samples max)
> > {
> > tmpRead1 = termValue;
> > }
> > }
> > sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
> > Screen_updateSamples(sampleCount);
> > return sampleCount;
> > }
> > At the moment I'm getting the folowing pipeline information is the ASM file:
> >
> > _Calculator_FetchData:
> > ;** --*
> > ;*----*
> > ;* SOFTWARE PIPELINE INFORMATION
> > ;*
> > ;* Loop source line : 219
> > ;* Loop opening brace source line : 220
> > ;* Loop closing brace source line : 266
> > ;* Known Minimum Trip Count : 16
> > ;* Known Maximum Trip Count : 4096
> > ;* Known Max Trip Count Factor : 2
> > ;* Loop Carried Dependency Bound(^) : 7
> > ;* Unpartitioned Resource Bound : 9
> > ;* Partitioned Resource Bound(*) : 9
> > ;* Resource Partition:
> > ;* A-side B-side
> > ;* .L units 2 1
> > ;* .S units 4 4
> > ;* .D units 8 9*
> > ;* .M units 0 0
> > ;* .X cross paths 1 1
> > ;* .T address paths 9* 8
> > ;* Long read paths 5 4
> > ;* Long write paths 0 0
> > ;* Logical ops (.LS) 1 1 (.L or .S unit)
> > ;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
> > ;* Bound(.L .S .LS) 4 3
> > ;* Bound(.L .S .D .LS .LSD) 6 6
> > ;*
> > ;* Searching for software pipeline schedule at ...
> > ;* ii = 9 Unsafe schedule for irregular loop
> > ;* ii = 9 Did not find schedule
> > ;* ii = 10 Unsafe schedule for irregular loop
> > ;* ii = 10 Unsafe schedule for irregular loop
> > ;* ii = 10 Did not find schedule
> > ;* ii = 11 Unsafe schedule for irregular loop
> > ;* ii = 11 Unsafe schedule for irregular loop
> > ;* ii = 11 Did not find schedule
> > ;* ii = 12 Unsafe schedule for irregular loop
> > ;* ii = 12 Unsafe schedule for irregular loop
> > ;* ii = 12 Did not find schedule
> > ;* ii = 13 Unsafe schedule for irregular loop
> > ;* ii = 13 Unsafe schedule for irregular loop
> > ;* ii = 13 Unsafe schedule for irregular loop
> > ;* ii = 13 Did not find schedule
> > ;* ii = 14 Unsafe schedule for irregular loop
> > ;* ii = 14 Unsafe schedule for irregular loop
> > ;* ii = 14 Unsafe schedule for irregular loop
> > ;* ii = 14 Did not find schedule
> > ;* ii = 15 Unsafe schedule for irregular loop
> > ;* ii = 15 Unsafe schedule for irregular loop
> > ;* ii = 15 Unsafe schedule for irregular loop
> > ;* ii = 15 Did not find schedule
> > ;* ii = 16 Unsafe schedule for irregular loop
> > ;* ii = 16 Unsafe schedule for irregular loop
> > ;* ii = 16 Unsafe schedule for irregular loop
> > ;* ii = 16 Did not find schedule
> > ;* ii = 17 Unsafe schedule for irregular loop
> > ;* ii = 17 Unsafe schedule for irregular loop
> > ;* ii = 17 Unsafe schedule for irregular loop
> > ;* ii = 17 Did not find schedule
> > ;* ii = 18 Unsafe schedule for irregular loop
> > ;* ii = 18 Unsafe schedule for irregular loop
> > ;* ii = 18 Unsafe schedule for irregular loop
> > ;* ii = 18 Did not find schedule
> > ;* ii = 19 Unsafe schedule for irregular loop
> > ;* ii = 19 Schedule found with 1 iterations in parallel
> > ;*
> > ;* Register Usage Table:
> > ;* +---------------------------------+
> > ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> > ;* |0000000000111111|0000000000111111|
> > ;* |0123456789012345|0123456789012345|
> > ;* |----------------+----------------|
> > ;* 0: |* * ********* | ***** * *** |
> > ;* 1: |* * ********** | ***** * *** |
> > ;* 2: |* * ********** | ***** * *** |
> > ;* 3: |* * ********** | ***** * *** |
> > ;* 4: |* * ********** | ***** * *** |
> > ;* 5: |* * ********** | ***** * *** |
> > ;* 6: |* ************* | ***** * *** |
> > ;* 7: |* **************| ***** * *** |
> > ;* 8: |* **************| ***** ** *** |
> > ;* 9: |* **************| ******** *** |
> > ;* 10: |* **************|********* *** |
> > ;* 11: |****************| ***** ****** |
> > ;* 12: |****************| ***** ****** |
> > ;* 13: |****************| ************ |
> > ;* 14: |* **************| ***** ****** |
> > ;* 15: |* **************| ***** * **** |
> > ;* 16: |*************** | ******* *** |
> > ;* 17: |*** * ********* | ******* *** |
> > ;* 18: |*** * ********* | ******* *** |
> > ;* +---------------------------------+
> > ;*
> > ;* Done
> > ;*
> > ;* Loop is interruptible
> > ;* Collapsed epilog stages : 0
> > ;* Collapsed prolog stages : 0
> > ;*
> > ;* Minimum safe trip count : 1
> >
> > Looking at the register usage it looks like it's using quite a lot of registers and I thought that maybe the LDDW
> > would relieve some registers. Also i was wondering if i can force-allign arrays in memory? If that's possible I can
> > use 1 pointer to access all the channels (by using an offset when addressing:
> > ----------------------------
> > IDEA:
> > //CHANNEL 1
> > *pCH1= LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> >
> >
> > // CHANNEL 2
> > *pCH1 + offset = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> > // OTHER CHANNELS addressed using bigger offsets...
> > -------------------------------
> >
> > I don't know if this idea is feasible but if it is i think it would relieve some more pressure of the register usage.
> >
> > Anyone's idea's/comments are welcome. At the moment the code is running 33% to slow. If I offer 4000 samples @ 4MHz
> > (data will take 1000 us to load into my FIFO's). It takes the DSP 1500 us to run the FetchData routine. In an ideal
> > situation i would like to complete the FetchData routine in 1000 us (not any shorter or I would read faster then data
> > is being written :P).
> >
> > With kind regards,
> >
> > Dominic Stuart
> >
> >
> > --- In c..., Michael Dunn wrote:
> >>
> >> Dominic,
> >>
> >> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl wrote:
> >> >
> >> >
> >> > Thanks Mike,
> >> >
> >> > I'm starting to enjoy this "tweaking" and am trying to push it as far as I
> >> > can because every microsecond that i gain means the DSP can handle more
> >> > particles/second. I've applied the tips I've gotten on this forum on the
> >> > rest of my source code as well (the actual loops in my program that do the
> >> > calculations on the data) and those are pipelining aswell now. Compared to
> >> > the initial source total improvement is over 900%! Amazing (looks like I was
> >> > using the DSP as a glorified MCU) but the true power of the DSP is starting
> >> > to show! I thank you for your input but it raises some questions if you
> >> > don't mind:
> >> >
> >> > --- In c..., Michael Dunn wrote:
> >> >>
> >> >> Congratulations, Dominic!!
> >> >>
> >> >> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
> >> >>
> >> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
> >> >> 16, and 8 bit accesses will be the same speed.
> >> >
> >> > What do you mean with aligned exactly?
> >>
> >> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
> >> == 0' then it is aligned.
> >> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
> >> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
> >> are aligned [byte addresses are always aligned].
> >>
> >> >
> >> >> Only if your external memory is 8 or 16 bits wide would there be any
> >> >> potential advantage in performing 16 bit accesses instead of 32 bit
> >> >> accesses.
> >> >> Also, there would be an advantage in fetching 32 bits at a time if you
> >> >> an entire array of 8 or 16 bit values.
> >> >>
> >> >
> >> > I'm reading from 3 (16 bits) FIFO's. I've hooked them up so tempRead1 reads
> >> > the first two together (logic tied together so they "act" like 1 32bits wide
> >> > FIFO. tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
> >> >
> >> >> I haven't looked at the details of your code, but if you always fetch
> >> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
> >> >> that your hardware addresses are preventing you from picking up some
> >> >> additional speed. *If* the input addresses began on a 64 bit boundary
> >> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
> >> >> any memory fetch would coerce the compiler to performing an 'LDDW' [64
> >> >> bit read].
> >> >
> >> > I do always fetch 48 bits (1x 32, 1x 16) but what would i gain by telling my
> >> > compiler to fetch a 64 bit read (I mean this still has to be split somehow
> >> > in 2 read cycles somehow?)
> >>
> >> First of all, I wrote this before I had the idea of using a single
> >> pointer. Your code has 2 pointers that load data - this means that
> >> you are using 4 processor registers. Changing to a single 64 bit read
> >> [32 x 2] would result in requiring only 3 registers. If your routine
> >> has a lot of register pressure [utilization] where it is loading and
> >> unloading CPU registers, then a 'register reduction change' would help
> >> performance.
> >>
> >> As I finished writing about the double read, I thought of 'plan B' -
> >> just use one pointer with an offset. When you look at the asm
> >> listing, it should give you some register usage info. If you are
> >> getting 'spills' then definitely try this.
> >>
> >> >
> >> >>
> >> >> Since your hardware addresses are fixed, you only need 1 pointer. You
> >> >> could use
> >> >> tmpRead2 = *(read1 + 1); /* int pointer: +1 advances 4 bytes */
> >> >> This would free up one register and, depending on register
> >> >> utilization, could improve the performance.
> >> >>
> >> >
> >> > Improve performance, that's what I like to hear ;) I hope my questions aren't
> >> > too "basic".
> >>
> >> Most active members of this group are willing to help someone who
> >> wants to learn. As long as your questions are informed and you show a
> >> willingness to participate, most of us will help if we can. We come
> >> from a variety of backgrounds and each of us ends up learning something
> >> from time to time.
> >>
> >> As you are learning, 'performance improvement' is not something that
> >> has a single solution. Rather, it is a journey with many stops along
> >> the way.
> >>
> >> mikedunn
> >> >
> >> > Dominic
> >> >
> >> >> mikedunn
> >> >>
> >> >>
> >> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
> >> >> >
> >> >> >
> >> >> > d.stuartnl,
> >> >> >
> >> >> > my comments in-line and prefixed with
> >> >> >
> >> >> > R. Williams
> >> >> >
> >> >> > ---------- Original Message -----------
> >> >> > From: "d.stuartnl"
> >> >> > To: c...
> >> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
> >> >> > Subject: [c6x] Re: Slow EMIF transfer
> >> >> >
> >> >> >> R.Williams,
> >> >> >>
> >> >> >> SUCCESS! Looptime has almost halved! Software pipelining is working
> >> >> >> now thanks to your tips:
> >> >> >
> >> >> > congratulations!!
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 1;
> >> >> >> is working fine as it is. Don't know why, though.
> >> >> >
> >> >> > two reasons:
> >> >> > 1) pointer subtraction in C yields a count of elements, not bytes, so
> >> >> > the float difference needs no scaling before the cast to int
> >> >> > 2) the '-1' is because the pCH1 pointer is incremented at the end of the
> >> >> > loop to point 1 past the last location used.
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> still have them in a single loop and it's pipelining. Do you think
> >> >> >> it's worth considering splitting it into two loops and check if
> >> >> >> there's (an even better) speed increase?
> >> >> >
> >> >> > you could experiment, but it looks like it is not necessary to
> >> >> > separate
> >> >> > the code into two loops.
> >> >> >
> >> >> >
> >> >> >> My new and improved function:
> >> >> >
> >> >> >
> >> >> >> // CHANNEL 3 this channel is always read for particle matching
> >> >> >> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> >> >> >> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4
> >> >> >> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >> >> >
> >> >> > there seems to be a problem in the editing of the above 4 lines.
> >> >> > It looks like pCH3 is not being used; however, pCH3 is still being
> >> >> > initialized
> >> >> > and incremented in the code.
> >> >> > Also when testing for execution speed, adding new operations (pBinData3)
> >> >> > makes
> >> >> > it very difficult to make timing comparisons.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> As you might have seen in my code, the second read (tmpRead2) is a
> >> >> >> 32-bit int but I'm only interested in the first 16 bits (where channels
> >> >> >> 3 and 4 reside); is there a way I can inform the compiler of this?
> >> >> >
> >> >> > the natural size of an operation is 32 bits; changing to a 16 bit
> >> >> > operation
> >> >> > would slow the code execution.
> >> >> >
> >> >> >>
> >> >> >> I had to leave pFifo12 and pFifo3 volatile because when i removed
> >> >> >> these keywords the software pipelining was disabled again (Cannot find
> >> >> >> schedule).
> >> >> >
> >> >> > the 'volatile' is needed for the two FIFO pointers because the data
> >> >> > behind them DOES change
> >> >> > between reads. I had suggested removing the 'volatile' from the local
> >> >> > variables,
> >> >> > not the pointers.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> With kind regards,
> >> >> >>
> >> >> >> Dominic
> >> >> >

_____________________________________