DSPRelated.com
Forums

Re: Re: Slow EMIF transfer

Started by Jeff Brower September 3, 2009
Dominic-

> thanks for the feedback, it is appreciated, and you are totally right that
> it would have been a good idea to add additional FPGA logic between the FIFO
> and the DSP ;). I will take your idea into consideration when a
> re-design/hardware update takes place. But for now I am "stuck" with my
> current design and I was not intending to spend months on this particular
> problem; I was just wondering if there is some more fine-tuning possible.
> Not only would that benefit the current design, but it would also benefit me
> because I am learning a lot from the tips you guys are providing.
I understand and agree with your approach. More fine-tuning is always possible...
the suggestions you're getting from Mike and Richard are excellent.

-Jeff

> --- In c..., "Jeff Brower" wrote:
> >
> > Dominic-
> >
> > > I have been away on holiday; I hope the people who are
> > > reading this had a holiday too (and a good one).
> > > Right, back to work it is! I was wondering about your earlier
> > > idea of using longer reads to force the compiler to use the
> > > LDDW instruction. My question is: how do I do this?
> > >
> > > the 2 reads are as follows:
> > >
> > > tmpRead1 = *(volatile int*) 0x90300004;
> > > tmpRead2 = *(volatile int*) 0x90300008;
> > >
> > > These are both 32-bit reads, but I only need 48 bits in total
> > > (32 bits from tmpRead1, and the 16 least significant bits
> > > of tmpRead2).
> >
> > One general comment I might make...
> >
> > Normally it's a good idea to put some FPGA logic between a FIFO and the DSP, so if you're required to collate FIFO
> > values into larger words -- or split them apart, or arrange data into chunks that are a natural size for the DSP's DMA
> > engine, whatever is needed -- it's do-able *after* the hardware design process, with minimal effort required for
> > trial-and-error.
> >
> > I know in this case that you have a direct EMIF interface to the FIFO, without intermediate FPGA logic... and I'm not
> > trying to say anything negative about your method, which seems to be working fine. My point is, it might be better
> > not to spend too much time worrying about how you split 6 channels out in software and how to optimize the last 10 to
> > 20% of your performance, but instead just call it a "phase 1" or proof-of-concept. If increased performance is
> > needed, then recommend to your managers that the hardware design should be modified and flexibility added.
> >
> > Otherwise, you could work on this for many months, and write increasingly specific, non-general C code that fits only one
> > type of hardware model, and still not hit the MByte/sec performance level that you really need.
> >
> > Again, just a general comment... comes from many years of experience at this stuff.
> >
> > -Jeff
> >
> > > Furthermore, these 2 reads represent 6 (8-bit) channels:
> > >
> > > read: tmpRead2                          tmpRead1
> > >       M                              L  M                              L
> > >       S                              S  S                              S
> > >       B                              B  B                              B
> > > bit : ********************************  ********************************
> > > use : ----------------CHANNEL4CHANNEL3  CHANNEL2CHANNEL1CHANNEL5CHANNEL6
> > >
> > > (forgive my primitive ASCII art :P)
> > >
> > > In my fetchData routine (which is pipelining!!) I fetch the data into these 2 variables and then distribute the data
> > > over 6 channels. My code is as follows:
> > >
> > > unsigned int Calculator_FetchData(Bool curvature)
> > > {
> > >     unsigned int tmpRead1 = 0;
> > >     unsigned int tmpRead2;
> > >     unsigned int sampleCount;
> > >     float * restrict pCH1;
> > >     float * restrict pCH2;
> > >     float * restrict pCH3;
> > >     char  * restrict pBinData3;
> > >     float * restrict pCH4;
> > >     float * restrict pCH5;
> > >     float * restrict pCH6;
> > >
> > >     const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
> > >     const int termValue = 0x84825131;
> > >
> > >     pCH1 = &CH1.deloggedData[0];
> > >     pCH2 = &CH2.deloggedData[0];
> > >     pCH3 = &CH3.deloggedData[0];
> > >     pBinData3 = &binData3[0];
> > >     pCH4 = &CH4.deloggedData[0];
> > >     pCH5 = &CH5.deloggedData[0];
> > >     pCH6 = &CH6.deloggedData[0];
> > >
> > >     #pragma MUST_ITERATE(16,4096,2);
> > >     while(tmpRead1 != termValue)
> > >     {
> > >         tmpRead1 = *(volatile int*) 0x90300004;
> > >         tmpRead2 = *(volatile int*) 0x90300008;
> > >
> > >         // CHANNEL 1
> > >         *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > >
> > >         // CHANNEL 2
> > >         *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > >
> > >         if(curvature)
> > >         {
> > >             *pCH1 += *pCH2;
> > >             if(*pCH1 > 5000)
> > >             {
> > >                 *pCH1 = 5000;
> > >             }
> > >         }
> > >
> > >         // CHANNEL 5
> > >         *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > >
> > >         // CHANNEL 6
> > >         *pCH6 = LUT1[tmpRead1 & 0xFF];
> > >
> > >         // CHANNEL 3 this channel is always read for particle matching on this channel
> > >         *pBinData3 = tmpRead2 & 0xFF;
> > >         *pCH3 = LUT0[*pBinData3];
> > >
> > >         // CHANNEL 4
> > >         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > >
> > >         pCH1++;
> > >         pCH2++;
> > >         pCH3++;
> > >         pBinData3++;
> > >         pCH4++;
> > >         pCH5++;
> > >         pCH6++;
> > >
> > >         if(pCH1 > endCH1) // Check for sample overflow (4096 samples max)
> > >         {
> > >             tmpRead1 = termValue;
> > >         }
> > >     }
> > >     sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 2;
> > >     Screen_updateSamples(sampleCount);
> > >     return sampleCount;
> > > }
> > > At the moment I'm getting the following pipeline information in the ASM file:
> > >
> > > _Calculator_FetchData:
> > > ;** --*
> > > ;*----*
> > > ;* SOFTWARE PIPELINE INFORMATION
> > > ;*
> > > ;* Loop source line : 219
> > > ;* Loop opening brace source line : 220
> > > ;* Loop closing brace source line : 266
> > > ;* Known Minimum Trip Count : 16
> > > ;* Known Maximum Trip Count : 4096
> > > ;* Known Max Trip Count Factor : 2
> > > ;* Loop Carried Dependency Bound(^) : 7
> > > ;* Unpartitioned Resource Bound : 9
> > > ;* Partitioned Resource Bound(*) : 9
> > > ;* Resource Partition:
> > > ;* A-side B-side
> > > ;* .L units 2 1
> > > ;* .S units 4 4
> > > ;* .D units 8 9*
> > > ;* .M units 0 0
> > > ;* .X cross paths 1 1
> > > ;* .T address paths 9* 8
> > > ;* Long read paths 5 4
> > > ;* Long write paths 0 0
> > > ;* Logical ops (.LS) 1 1 (.L or .S unit)
> > > ;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
> > > ;* Bound(.L .S .LS) 4 3
> > > ;* Bound(.L .S .D .LS .LSD) 6 6
> > > ;*
> > > ;* Searching for software pipeline schedule at ...
> > > ;* ii = 9 Unsafe schedule for irregular loop
> > > ;* ii = 9 Did not find schedule
> > > ;* ii = 10 Unsafe schedule for irregular loop
> > > ;* ii = 10 Unsafe schedule for irregular loop
> > > ;* ii = 10 Did not find schedule
> > > ;* ii = 11 Unsafe schedule for irregular loop
> > > ;* ii = 11 Unsafe schedule for irregular loop
> > > ;* ii = 11 Did not find schedule
> > > ;* ii = 12 Unsafe schedule for irregular loop
> > > ;* ii = 12 Unsafe schedule for irregular loop
> > > ;* ii = 12 Did not find schedule
> > > ;* ii = 13 Unsafe schedule for irregular loop
> > > ;* ii = 13 Unsafe schedule for irregular loop
> > > ;* ii = 13 Unsafe schedule for irregular loop
> > > ;* ii = 13 Did not find schedule
> > > ;* ii = 14 Unsafe schedule for irregular loop
> > > ;* ii = 14 Unsafe schedule for irregular loop
> > > ;* ii = 14 Unsafe schedule for irregular loop
> > > ;* ii = 14 Did not find schedule
> > > ;* ii = 15 Unsafe schedule for irregular loop
> > > ;* ii = 15 Unsafe schedule for irregular loop
> > > ;* ii = 15 Unsafe schedule for irregular loop
> > > ;* ii = 15 Did not find schedule
> > > ;* ii = 16 Unsafe schedule for irregular loop
> > > ;* ii = 16 Unsafe schedule for irregular loop
> > > ;* ii = 16 Unsafe schedule for irregular loop
> > > ;* ii = 16 Did not find schedule
> > > ;* ii = 17 Unsafe schedule for irregular loop
> > > ;* ii = 17 Unsafe schedule for irregular loop
> > > ;* ii = 17 Unsafe schedule for irregular loop
> > > ;* ii = 17 Did not find schedule
> > > ;* ii = 18 Unsafe schedule for irregular loop
> > > ;* ii = 18 Unsafe schedule for irregular loop
> > > ;* ii = 18 Unsafe schedule for irregular loop
> > > ;* ii = 18 Did not find schedule
> > > ;* ii = 19 Unsafe schedule for irregular loop
> > > ;* ii = 19 Schedule found with 1 iterations in parallel
> > > ;*
> > > ;* Register Usage Table:
> > > ;* +---------------------------------+
> > > ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> > > ;* |0000000000111111|0000000000111111|
> > > ;* |0123456789012345|0123456789012345|
> > > ;* |----------------+----------------|
> > > ;* 0: |* * ********* | ***** * *** |
> > > ;* 1: |* * ********** | ***** * *** |
> > > ;* 2: |* * ********** | ***** * *** |
> > > ;* 3: |* * ********** | ***** * *** |
> > > ;* 4: |* * ********** | ***** * *** |
> > > ;* 5: |* * ********** | ***** * *** |
> > > ;* 6: |* ************* | ***** * *** |
> > > ;* 7: |* **************| ***** * *** |
> > > ;* 8: |* **************| ***** ** *** |
> > > ;* 9: |* **************| ******** *** |
> > > ;* 10: |* **************|********* *** |
> > > ;* 11: |****************| ***** ****** |
> > > ;* 12: |****************| ***** ****** |
> > > ;* 13: |****************| ************ |
> > > ;* 14: |* **************| ***** ****** |
> > > ;* 15: |* **************| ***** * **** |
> > > ;* 16: |*************** | ******* *** |
> > > ;* 17: |*** * ********* | ******* *** |
> > > ;* 18: |*** * ********* | ******* *** |
> > > ;* +---------------------------------+
> > > ;*
> > > ;* Done
> > > ;*
> > > ;* Loop is interruptible
> > > ;* Collapsed epilog stages : 0
> > > ;* Collapsed prolog stages : 0
> > > ;*
> > > ;* Minimum safe trip count : 1
> > >
> > > Looking at the register usage, it looks like it's using quite a lot of registers and I thought that maybe the LDDW
> > > would relieve some registers. Also I was wondering if I can force-align arrays in memory? If that's possible, I can
> > > use 1 pointer to access all the channels (by using an offset when addressing):
> > > ----------------------------
> > > IDEA:
> > > // CHANNEL 1
> > > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > >
> > > // CHANNEL 2
> > > *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > >
> > > // OTHER CHANNELS addressed using bigger offsets...
> > > -------------------------------
> > >
> > > I don't know if this idea is feasible, but if it is, I think it would relieve some more pressure on the register usage.
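> > >
> > > To make the idea concrete, this is roughly what I have in mind (untested
> > > sketch; chanData and its layout are just placeholders, it assumes I can
> > > pack all 6 channel buffers into one contiguous block, and I hope I have
> > > the TI alignment pragma right):
> > >
> > > #pragma DATA_ALIGN(chanData, 8)       // force-align the combined buffer
> > > float chanData[6 * 4096];             // hypothetical flat replacement for CH1..CH6
> > >
> > > float * restrict pBase = &chanData[0];
> > > const int offset = 4096;              // element distance between channels
> > >
> > > // inside the fetch loop:
> > > pBase[0]      = LUT0[((tmpRead1 & 0xFF0000) >> 16)];    // CHANNEL 1
> > > pBase[offset] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];  // CHANNEL 2
> > > // ...other channels at 2*offset, 3*offset, etc...
> > > pBase++;                              // one increment advances all channels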
> > >
> > > Anyone's ideas/comments are welcome. At the moment the code is running 33% too slow: if I offer 4000 samples @ 4 MHz,
> > > the data will take 1000 us to load into my FIFOs, but it takes the DSP 1500 us to run the FetchData routine. In an ideal
> > > situation I would like to complete the FetchData routine in 1000 us (not any shorter, or I would read faster than data
> > > is being written :P).
> > >
> > > With kind regards,
> > >
> > > Dominic Stuart
> > >
> > >
> > > --- In c..., Michael Dunn wrote:
> > >>
> > >> Dominic,
> > >>
> > >> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl wrote:
> > >> >
> > >> >
> > >> > Thanks Mike,
> > >> >
> > >> > I'm starting to enjoy this "tweaking" and am trying to push it as far as I
> > >> > can, because every microsecond that I gain means the DSP can handle more
> > >> > particles/second. I've applied the tips I've gotten on this forum to the
> > >> > rest of my source code as well (the actual loops in my program that do the
> > >> > calculations on the data) and those are pipelining as well now. Compared to
> > >> > the initial source, the total improvement is over 900%! Amazing (looks like I
> > >> > was using the DSP as a glorified MCU), but the true power of the DSP is
> > >> > starting to show! I thank you for your input, but it raises some questions
> > >> > if you don't mind:
> > >> >
> > >> > --- In c..., Michael Dunn wrote:
> > >> >>
> > >> >> Congratulations, Dominic!!
> > >> >>
> > >> >> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
> > >> >>
> > >> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
> > >> >> 16, and 8 bit accesses will be the same speed.
> > >> >
> > >> > What do you mean by aligned exactly?
> > >>
> > >> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
> > >> == 0' then it is aligned.
> > >> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
> > >> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
> > >> are aligned [byte addresses are always aligned].
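> > >>
> > >> In code form, tied to your actual addresses (a trivial sketch):
> > >>
> > >> /* 0x90300004 % 4 == 0 -> aligned for a 32 bit access     */
> > >> /* 0x90300004 % 8 == 4 -> NOT aligned for a 64 bit access */
> > >> int isAligned = ((myAddress % myAccessSize) == 0);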
> > >>
> > >> >
> > >> >> Only if your external memory is 8 or 16 bits wide would there be any
> > >> >> potential advantage in performing 16 bit accesses instead of 32 bit
> > >> >> accesses.
> > >> >> Also, there would be an advantage in fetching 32 bits at a time if you
> > >> >> are reading an entire array of 8 or 16 bit values.
> > >> >>
> > >> >
> > >> > I'm reading from 3 (16-bit) FIFOs. I've hooked them up so tmpRead1 reads
> > >> > the first two together (logic tied together so they "act" like one 32-bit
> > >> > wide FIFO). tmpRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
> > >> >
> > >> >> I haven't looked at the details of your code, but if you always fetch
> > >> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
> > >> >> that your hardware addresses are preventing you from picking up some
> > >> >> additional speed. *If* the input addresses began on a 64 bit boundary
> > >> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
> > >> >> any memory fetch would coerce the compiler into performing an 'LDDW'
> > >> >> [64 bit read].
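> > >> >>
> > >> >> Something like this, as an untested sketch (the 0x90300000 base is the
> > >> >> hypothetical aligned case; which 32 bit half lands in which variable
> > >> >> depends on endianness, so check that on the real hardware):
> > >> >>
> > >> >> volatile long long * restrict pFifo64 = (volatile long long *) 0x90300000;
> > >> >> long long raw = *pFifo64;                        /* should become one LDDW */
> > >> >> unsigned int word0 = (unsigned int) raw;         /* low 32 bits  */
> > >> >> unsigned int word1 = (unsigned int) (raw >> 32); /* high 32 bits */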
> > >> >
> > >> > I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by telling my
> > >> > compiler to perform a 64 bit read (I mean, this still has to be split
> > >> > into 2 read cycles somehow?)
> > >>
> > >> First of all, I wrote this before I had the idea of using a single
> > >> pointer. Your code has 2 pointers that load data - this means that
> > >> you are using 4 processor registers. Changing to a single 64 bit read
> > >> [32 x 2] would result in requiring only 3 registers. If your routine
> > >> has a lot of register pressure [utilization] where it is loading and
> > >> unloading CPU registers, then a 'register reduction change' would help
> > >> performance.
> > >>
> > >> As I finished writing about the double read, I thought of 'plan B' -
> > >> just use one pointer with an offset. When you look at the asm
> > >> listing, it should give you some register usage info. If you are
> > >> getting 'spills' then definitely try this.
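> > >>
> > >> A minimal sketch of 'plan B' (untested; note that with an int pointer the
> > >> offset is in 4-byte elements, so +1 steps from 0x90300004 to 0x90300008):
> > >>
> > >> volatile unsigned int * restrict pFifo = (volatile unsigned int *) 0x90300004;
> > >> tmpRead1 = pFifo[0];   /* 0x90300004 */
> > >> tmpRead2 = pFifo[1];   /* 0x90300008 - one base register, constant offset */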
> > >>
> > >> >
> > >> >>
> > >> >> Since your hardware addresses are fixed, you only need 1 pointer. You
> > >> >> could use
> > >> >> tmpRead2 = *(read1 + 4);
> > >> >> This would free up one register and, depending on register
> > >> >> utilization, could improve the performance.
> > >> >>
> > >> >
> > >> > Improve performance, that's what I like to hear ;) I hope my questions aren't
> > >> > too "basic".
> > >>
> > >> Most active members of this group are willing to help someone who
> > >> wants to learn. As long as your questions are informed and you show a
> > >> willingness to participate, most of us will help if we can. We come
> > >> from a variety of backgrounds and each of us ends up learning something
> > >> from time to time.
> > >>
> > >> As you are learning, 'performance improvement' is not something that
> > >> has a single solution. Rather, it is a journey with many stops along
> > >> the way.
> > >>
> > >> mikedunn
> > >> >
> > >> > Dominic
> > >> >
> > >> >> mikedunn
> > >> >>
> > >> >>
> > >> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
> > >> >> >
> > >> >> >
> > >> >> > d.stuartnl,
> > >> >> >
> > >> >> > my comments in-line and prefixed with
> > >> >> >
> > >> >> > R. Williams
> > >> >> >
> > >> >> > ---------- Original Message -----------
> > >> >> > From: "d.stuartnl"
> > >> >> > To: c...
> > >> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
> > >> >> > Subject: [c6x] Re: Slow EMIF transfer
> > >> >> >
> > >> >> >> R.Williams,
> > >> >> >>
> > >> >> >> SUCCESS! Loop time has almost halved! Software pipelining is working
> > >> >> >> now thanks to your tips:
> > >> >> >
> > >> >> > congratulations!!
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> > >> >> >> is working fine as it is. Don't know why though.
> > >> >> >
> > >> >> > two reasons:
> > >> >> > 1) the data size of a float is the same as the address data size
> > >> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
> > >> >> > loop to point 1 past the last location used.
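> > >> >> >
> > >> >> > Spelled out with the units, the same line annotated:
> > >> >> >
> > >> >> > /* the pointer difference is already a count of samples (floats),  */
> > >> >> > /* so no divide is needed; the -1 removes the final post-increment */
> > >> >> > /* past the last sample written                                    */
> > >> >> > sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 1;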
> > >> >> >
> > >> >> >
> > >> >> >> >
> > >> >> >> still have them in a single loop and it's pipelining. Do you think
> > >> >> >> it's worth considering splitting it into two loops and checking if
> > >> >> >> there's (an even better) speed increase?
> > >> >> >
> > >> >> > you could experiment, but it looks like it is not necessary to
> > >> >> > separate the code into two loops.
> > >> >> >
> > >> >> >
> > >> >> >> My new and improved function:
> > >> >> >
> > >> >> >
> > >> >> >> // CHANNEL 3 this channel is always read for particle matching
> > >> >> >> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > >> >> >> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 > > >> >> >> LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > >> >> >
> > >> >> > there seems to be a problem in the editing of the above 4 lines.
> > >> >> > It looks like pCH3 is not being used; however, pCH3 is still being
> > >> >> > initialized and incremented in the code.
> > >> >> > Also, when testing for execution speed, adding new operations (pBinData3)
> > >> >> > makes it very difficult to make timing comparisons.
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> As you might have seen in my code, the second read (tmpRead2) is a 32
> > >> >> >> bit int, but I'm only interested in the first 16 bits (where channels
> > >> >> >> 3 and 4 reside). Is there a way I can inform the compiler
> > >> >> >
> > >> >> > the natural size of an operation is 32 bits; changing to a 16 bit
> > >> >> > operation would slow the code execution.
> > >> >> >
> > >> >> >>
> > >> >> >> I had to leave pFifo12 and pFifo3 volatile because when I removed
> > >> >> >> these keywords the software pipelining was disabled again (Cannot find
> > >> >> >> schedule).
> > >> >> >
> > >> >> > the 'volatile' is needed for the two parameters because they DO
> > >> >> > change between reads. I had suggested removing the 'volatile' from the
> > >> >> > variables, not the parameters.
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> With kind regards,
> > >> >> >>
> > >> >> >> Dominic
> > >> >> >
