DSPRelated.com
Forums

Slow EMIF transfer

Started by d.st...@yahoo.com June 23, 2009
d.stuartnl,

The reason the time is quicker, even though there is more code, is that the
code produced to do:
CH1.deloggedData[x]
includes quite a lot of math; calculating an address into an array is slow
compared to incrementing a pointer.
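To see where that math goes, here is a minimal stand-alone sketch (illustrative buffer names, not the actual channel structs): the indexed store has to form base + x*sizeof(float) before every access, while the pointer version just bumps an address held in a register.

```c
#include <stddef.h>

#define N 16
static float dst[N];
static float src[N];

/* Indexed form: each access computes dst + x (a shift and an add on
   most machines) before the load/store can issue. */
static void copy_indexed(void)
{
    size_t x;
    for (x = 0; x < N; x++)
        dst[x] = src[x];
}

/* Pointer form: the address lives in a register and is incremented by
   one element per iteration (e.g. the C6x *A5++ addressing mode). */
static void copy_pointer(void)
{
    float *d = dst;
    const float *s = src;
    const float * const end = dst + N;
    while (d < end)
        *d++ = *s++;
}
```

A modern optimizer will often do this strength reduction itself, but only when nothing (volatile accesses, possible aliasing) prevents it.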

R. Williams
---------- Original Message -----------
From: "d.stuartnl"
To: c...
Sent: Fri, 17 Jul 2009 17:06:32 -0000
Subject: [c6x] Re: Slow EMIF transfer

> Dear R.Williams,
>
> I changed my code to your suggestion:
>
> void Calculator_FetchData()
> {
>     volatile float * pCH1;
>     volatile float * pCH2;
>     volatile float * pCH3;
>     volatile float * pCH4;
>     volatile float * pCH5;
>     volatile float * pCH6;
>
>     const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
>     const termValue = 0x84825131;
>
>     pCH1 = &CH1.deloggedData[0];
>     pCH2 = &CH2.deloggedData[0];
>     pCH3 = &CH3.deloggedData[0];
>     pCH4 = &CH4.deloggedData[0];
>     pCH5 = &CH5.deloggedData[0];
>     pCH6 = &CH6.deloggedData[0];
>
>     tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
>
>     while(*pCH1 < endCH1)
>     {
>         tmpRead1 = *read1;
>         if(tmpRead1 == termValue) break;
>         // CHANNEL 1
>         *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>         // CHANNEL 2
>         *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>         if(LRneeded == 1)
>         {
>             *pCH1 += *pCH2;
>             if(*pCH1 > 5000)
>             {
>                 *pCH1 = 5000;
>             }
>         }
>         // CHANNEL 5
>         *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>
>         // CHANNEL 6
>         *pCH6 = LUT1[tmpRead1 & 0xFF];
>
>         tmpRead2 = *read2;
>
>         // CHANNEL 3 this channel is always read for particle matching on this channel
>         *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>         // CHANNEL 4
>         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
>         pCH1++;
>         pCH2++;
>         pCH3++;
>         pCH4++;
>         pCH5++;
>         pCH6++;
>         x++;
>     }
>     if((TIMER(1)->cnt - tmpprocessTime) > 0) //detect overflow
>     {
>         processTime = TIMER(1)->cnt - tmpprocessTime;
>     }
> }
>
> On my testrig I'm offering particles with a fixed length of 985. My
> previous code could read 985 samples for 6 channels in 681us. Your
> suggestion cut that time down to 601us!!! My first reaction was WOW
> :P. I have a couple of questions though, if you can forgive my
> ignorance. The big question is WHY? Because it looks like it's
> calculating more (6 pointers instead of 1 "x"). I still left in the
> x++; because I need to know how many samples have been read.
>
> With kind regards,
>
> Dominic Stuart
>
> --- In c..., "Richard Williams" wrote:
> >
> > d.stuartnl,
> >
> > I notice that the code, during the first loop, checks for the termination value
> > then throws away the first read values (by reading from read1 and read2 again).
> > Is that what you wanted to do?
> >
> > Execution could be made much faster, by eliminating the calculations related to
> > 'x' by using pointers to:
> > CH1.deloggedData,
> > CH2.deloggedData,
> > CH3.deloggedData,
> > CH4.deloggedData,
> > CH5.deloggedData,
> > CH6.deloggedData.
> > Initialize the pointers before the loop and increment them at the end of the
> > loop.
> > Also, eliminate 'x' and related calculation by precalculating the end address
> > for the loop as:
> > const endCH1 = &CH1.deloggedData[0x1000];
> > const termValue = 0x84825131;
> >
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > --- // rest of initialization
> > while( pCH1 < endCH1 )
> > {
> > ---// processing
> > pCH1++;
> > pCH2++;
> > ...// rest of incrementing
> > } // end while()
> >
> > to avoid processing the termination value from *read1
> > and to exit when the termination value is read:
> > The first code within the 'while' loop would be:
> > tmpRead1 = *read1;
> > if (tmpRead1 == termValue ) break;
> > tmpRead2 = *read2;
> >
> > R. Williams
> >
> >
> > ---------- Original Message -----------
> > From: "d.stuartnl"
> > To: c...
> > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > Subject: [c6x] Re: Slow EMIF transfer
> >
> > > R. Williams,
> >
> > >
> > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > I've been analyzing the compiler's feedback and it's stating that it
> > > cannot implement software pipelining because there's a function call
> > > (AddSample()) in the loop. I've removed the AddSample() function and
> > > put the code from the function directly into the loop (see source);
> > > there are still some problems (Disqualified loop: Loop carried
> > > dependency bound too large). But I'm working on it :) I've also found
> > > out that pipelining is not being used in a lot of my loops, so I'm
> > > guessing that if I adjust my C-code so that software pipelining is
> > > possible I will notice an increase in performance.
> > >
> > > Source:
> > >
> > > read1 = (int*) 0x90300004;
> > > read2 = (int*) 0x90300008;
> > >
> > > tmpRead1 = *read1;
> > > tmpRead2 = *read2;
> > > x = 0;
> > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > {
> > > tmpRead1 = *read1;
> > > tmpRead2 = *read2;
> > >
> > > CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > // FWS R+L Add
> > > if(LRneeded == 1)
> > > {
> > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > if(CH1.deloggedData[x] > 5000)
> > > {
> > > CH1.deloggedData[x] = 5000;
> > > }
> > > }
> > > CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > binData[x] = (tmpRead2 & 0xFF);
> > > CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > x++;
> > > }
> > >
> > > With kind regards,
> > >
> > > Dominic
> > >
> > > >
> > > > However, your idea of just using the read operation, since it is much longer
> > > > than a write, is a good one.
> > > >
> > > > R. Williams
> > > >
> > > >
> > > >
> > > > ---------- Original Message -----------
> > > > From: Jeff Brower
> > > > To: Dominic Stuart
> > > > Cc: c...
> > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > Subject: [c6x] Re: Slow EMIF transfer
> > > >
> > > > > Dominic-
> > > > >
> > > > > > > I am indeed trying to avoid delay in processing flow. The data needs to be
> > > > > > > decompressed asap. When that is done the DSP performs calculations on the
> > > > > > > data and based on the outcome of those calculations the DSP generates a
> > > > > > > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > > > > > > always takes longer than a write, I don't have to poll the Empty Flag and
> > > > > > > can just read the data through a loop like so:
> > > > > >
> > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > {
> > > > > > Calculator_AddSample();
> > > > > > }
> > > > >
> > > > > Ok, so what you're saying is that once you see a "not empty" flag,
> > > > > then you know the agent on the other side of the FIFO is writing a
> > > > > known block size, and will write it faster than you can read, so your
> > > > > code just needs to read.
> > > > >
> > > > > > > I've tested this and it did improve the performance but nothing shocking;
> > > > > > > it seems the decompressing via the LookUp Table is creating the bottleneck.
> > > > > > > I've already split the two dimensional LUT into 2 one dimensional arrays.
> > > > > > > This also helped a bit.
> > > > >
> > > > > One thing you might try is hand-optimized asm code just for the read /
> > > > > look-up sequence, using techniques that Richard was describing. If
> > > > > you take advantage of the pipeline, you can improve performance. For
> > > > > example you can read sample N, then in the next 4 instructions process
> > > > > the lookup on N-1, waiting for N to become valid. It sounds to me
> > > > > like it wouldn't be that much code in your loop, maybe a dozen or less
> > > > > asm instructions.
> > > > >
> > > > > -Jeff
> > > > >
> > > > > PS. Please post to the group, not to me. Thanks.
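Jeff's read-N / process-N-1 overlap can be sketched in C (hypothetical names and table; on the C6x the real gain comes from filling the LDW delay slots in hand-written asm):

```c
#define NS 8
static float lut[256];
static float out[NS];

/* Two-stage manual pipeline: issue the read of sample N, then do the
   table lookup for sample N-1 while that read is still in flight. */
static void pipelined_lookup(const unsigned *fifo)
{
    unsigned cur = fifo[0];            /* prime the pipeline: read sample 0 */
    int i;
    for (i = 1; i < NS; i++)
    {
        unsigned next = fifo[i];       /* start reading sample N ...        */
        out[i - 1] = lut[cur & 0xFF];  /* ... while delogging sample N-1    */
        cur = next;
    }
    out[NS - 1] = lut[cur & 0xFF];     /* drain the last sample */
}
```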
> > > > >
> > > > > > --- In c..., Jeff Brower wrote:
> > > > > > >
> > > > > > > Dominic-
> > > > > > >
> > > > > > > > Thanks for the information, I think I will refrain from using block
> > > > > > > > transfers because I want to process the data as the DSP receives it.
> > > > > > > .
> > > > > > > .
> > > > > > > .
> > > > > > >
> > > > > > > > At the moment I am starting this "prefetch" function when a burst
> > > > > > > > starts and execute this function every time there is data available
> > > > > > > > in the FIFO's (polling the Empty Flag). I'm prefetching 27.6% of
> > > > > > > > the data before the burst ends. All variables are in IRAM.
> > > > > > >
> > > > > > > The typical reason for doing it that way is to avoid delay (latency) in
> > > > > > > your signal processing flow, relative to some output (DAC, GPIO line,
> > > > > > > digital transmission, etc). Is that the case? If not then a block based
> > > > > > > method would be better, otherwise you will waste a lot of time polling for
> > > > > > > each element. You don't have to implement DMA as a first step to get that
> > > > > > > working, you could use a code loop. Then implement DMA in order to further
> > > > > > > improve performance.
> > > > > > >
> > > > > > > -Jeff
> > > > > > >
> > > > > > > > My function looks like this:
> > > > > > > >
> > > > > > > > void Calculator_AddSample()
> > > > > > > > {
> > > > > > > > x++;
> > > > > > > >
> > > > > > > > read1 = (int*) 0x90300004;
> > > > > > > > read2 = (int*) 0x90300008;
> > > > > > > >
> > > > > > > > tmpRead1 = *read1;
> > > > > > > > tmpRead2 = *read2;
> > > > > > > >
> > > > > > > > // CHANNEL 1
> > > > > > > > CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > // CHANNEL 2
> > > > > > > > CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > > > // FWS R+L Add
> > > > > > > > if(LRneeded == 1)
> > > > > > > > {
> > > > > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > if(CH1.deloggedData[x] > 5000)
> > > > > > > > {
> > > > > > > > CH1.deloggedData[x] = 5000;
> > > > > > > > }
> > > > > > > > }
> > > > > > > > // CHANNEL 3 this channel is always read for particle matching on this channel
> > > > > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > >
> > > > > > > > // CHANNEL 4
> > > > > > > > CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > // CHANNEL 5
> > > > > > > > CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > // CHANNEL 6
> > > > > > > > CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > }
> > > > > > > > This function executes 2 reads from 2 different FIFO's and then
> > > > > > > > separates the different data channels and decompresses the values
> > > > > > > > with a LookUp Table.
> > > > > > > >
> > > > > > > > I am trying to streamline this function so it can keep up with the
> > > > > > > > incoming data. The data is written to the FIFO's at 4MHz. The data
> > > > > > > > consists of small burst packets ranging from 3 to 4096 bytes per channel.
> > > > > > > >
> > > > > > > > At the moment I am starting this "prefetch" function when a burst starts
> > > > > > > > and execute this function every time there is data available in the FIFO's
> > > > > > > > (polling the Empty Flag). I'm prefetching 27.6% of the data before the
> > > > > > > > burst ends. All variables are in IRAM.
> > > > > > > >
> > > > > > > > I think I made an error in suspecting the EMIF transfer speed and I now
> > > > > > > > suspect that there may be some overhead in the polling scheme I use for
> > > > > > > > calling this function that results in the slow transfer speed. I will look
> > > > > > > > into this. I would like to thank everyone for their input.
> > > > > > > >
> > > > > > > > With kind regards,
> > > > > > > >
> > > > > > > > Dominic
> > > > > > > >
> > > > > > > > --- In c..., Adolf Klemenz wrote:
> > > > > > > > >
> > > > > > > > > Dear Dominic,
> > > > > > > > >
> > > > > > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > > > > > >as I understand DMA, I would need to work in "blocks" of data but that
> > > > > > > > > >would be very tricky in my application since I do not know how big the
> > > > > > > > > >datastream is gonna be. Or is it possible to use DMA for single byte transfers?
> > > > > > > > >
> > > > > > > > > using DMA makes sense for block transfers only. Typical Fifo
> > > > > > > > > applications will use the Fifo's half-full flag (or a similar signal)
> > > > > > > > > to trigger a DMA block read.
> > > > > > > > > You may use element-synchronized DMA (each trigger transfers only one
> > > > > > > > > data word), but there will be no speed improvement: It takes about
> > > > > > > > > 100ns from the EDMA sync event to the actual data transfer on a C6713.
> > > > > > > > >
> > > > > > > > > Attached is a scope screenshot generated by this test program
> > > > > > > > >
> > > > > > > > > // compiled with -o2 and without debug info:
> > > > > > > > >
> > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > // optimizer from code removal
> > > > > > > > > for (;;)
> > > > > > > > > {
> > > > > > > > > buffer = *(volatile int*)0x90300000;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > The screenshot shows chip select and read signal with the expected
> > > > > > > > > timings (20ns strobe width). The gap between successive reads is caused
> > > > > > > > > by the DSP architecture. Here it is 200ns because a 225MHz DSP was used,
> > > > > > > > > which should translate to 150ns on a 300MHz device.
> > > > > > > > >
> > > > > > > > > If this isn't fast enough, you must use block transfers.
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Adolf Klemenz, D.SignT
> > > > > > >
> > > > ------- End of Original Message -------
> > > >
> > ------- End of Original Message -------
> >
------- End of Original Message -------

_____________________________________
Hi all,

I'm trying to further optimize my code, but for some reason I cannot get pipelining to work. I've checked several documents (SPRU425, the Optimizing C Compiler Tutorial, SPRA666 "Hand-Tuning Loops and Control Code"). These documents primarily focus on improving pipelining, but in my .asm file it keeps stating "Unsafe schedule for irregular loop". It produces the following Software Pipeline Information:

;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 362
;* Loop opening brace source line : 363
;* Loop closing brace source line : 397
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 110
;* Unpartitioned Resource Bound : 16
;* Partitioned Resource Bound(*) : 16
;* Resource Partition:
;* A-side B-side
;* .L units 2 1
;* .S units 4 5
;* .D units 15 16*
;* .M units 0 0
;* .X cross paths 0 0
;* .T address paths 15 16*
;* Long read paths 4 6
;* Long write paths 0 0
;* Logical ops (.LS) 2 0 (.L or .S unit)
;* Addition ops (.LSD) 0 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 4 3
;* Bound(.L .S .D .LS .LSD) 8 9
;*
;* Searching for software pipeline schedule at ...
;* ii = 110 Unsafe schedule for irregular loop
;* ii = 110 Unsafe schedule for irregular loop
;* ii = 110 Unsafe schedule for irregular loop
;* ii = 110 Did not find schedule
;* ii = 111 Unsafe schedule for irregular loop
;* ii = 111 Unsafe schedule for irregular loop
;* ii = 111 Unsafe schedule for irregular loop
;* ii = 111 Did not find schedule
;* ii = 113 Unsafe schedule for irregular loop
;* ii = 113 Unsafe schedule for irregular loop
;* ii = 113 Unsafe schedule for irregular loop
;* ii = 113 Did not find schedule
;* ii = 117 Unsafe schedule for irregular loop
;* ii = 117 Unsafe schedule for irregular loop
;* ii = 117 Unsafe schedule for irregular loop
;* ii = 117 Did not find schedule
;* Disqualified loop: Did not find schedule
;*----*
My code is as follows:

void Calculator_FetchData(volatile int * restrict p1, volatile int * restrict p2)
{
    volatile int tmpRead1;
    volatile int tmpRead2;
    volatile int tmpStore1;
    volatile int tmpStore2;
    volatile float * restrict pCH1;
    volatile float * restrict pCH2;
    volatile float * restrict pCH3;
    volatile float * restrict pCH4;
    volatile float * restrict pCH5;
    volatile float * restrict pCH6;

    const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
    const termValue = 0x84825131;

    pCH1 = &CH1.deloggedData[0];
    pCH2 = &CH2.deloggedData[0];
    pCH3 = &CH3.deloggedData[0];
    pCH4 = &CH4.deloggedData[0];
    pCH5 = &CH5.deloggedData[0];
    pCH6 = &CH6.deloggedData[0];

    while((*pCH1 < endCH1) & (tmpRead1 != termValue))
    {
        tmpRead1 = *p1;

        // CHANNEL 1
        *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
        // CHANNEL 2
        *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
        if(LRneeded == 1)
        {
            *pCH1 += *pCH2;
            if(*pCH1 > 5000)
            {
                *pCH1 = 5000;
            }
        }
        // CHANNEL 5
        *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

        // CHANNEL 6
        *pCH6 = LUT1[tmpRead1 & 0xFF];

        tmpRead2 = *p2;

        // CHANNEL 3 this channel is always read for particle matching on this channel
        *pCH3 = LUT0[((tmpRead2 & 0xFF))];
        // CHANNEL 4
        *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

        pCH1++;
        pCH2++;
        pCH3++;
        pCH4++;
        pCH5++;
        pCH6++;
    }
    x = (int) (pCH1 - &CH1.deloggedData[0]);
}

Is there a way I can change my C-code so the DSP can pipeline it? I think it should be possible to have at least 2 iterations in parallel:

1st       2nd
read1
delog1
read2     read1
delog2    delog1
etc..
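One way to hand the compiler a loop it can overlap like this is to make the trip count explicit and keep the working values out of volatile storage. A cut-down sketch under those assumptions (stand-in table and buffer names, not the real CH structs; only the FIFO access stays volatile):

```c
#define NSAMP 8
static float LUTa[256];
static float ch1[NSAMP], ch2[NSAMP];

/* Counted loop with plain temporaries: each iteration depends on the
   previous one only through the counter, so the compiler is free to
   overlap the loads, lookups and stores of successive iterations. */
static int unpack(const volatile unsigned *fifo, unsigned termValue)
{
    int i;
    for (i = 0; i < NSAMP; i++)
    {
        unsigned w = fifo[i];          /* the only volatile access */
        if (w == termValue)
            break;                     /* note: an early exit can still
                                          disqualify an irregular loop */
        ch1[i] = LUTa[(w >> 16) & 0xFF];
        ch2[i] = LUTa[(w >> 24) & 0xFF];
    }
    return i;                          /* samples actually unpacked */
}
```

In the real code fifo would be the fixed FIFO address read repeatedly; indexing an array here just keeps the sketch self-contained and testable.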

I've already used the "restrict" keyword on the pointers I use, since these pointers do not overlap. I'm using the following compiler options:

-k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug" -d"CHIP_6713" -d"DEBUG" -mt -mw -mh -mr1 -mv6710 --mem_model:data --consultant

Any advice or pointers to documents regarding how to enable pipelining, other than the ones mentioned above, would be helpful.

With kind regards,

Dominic Stuart

PS: I don't know if it's useful, but the following asm code is being produced:

C$L6:
$C$DW$L$_Calculator_FetchData$2$B:
.dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
;** -----------------------g3:
;** 364 ----------------------- tmpRead1 = *p1;
;** 367 ----------------------- *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
;** 369 ----------------------- *(++pCH2) = K$32[((unsigned)tmpRead1>>22>>2)];
;** 370 ----------------------- if ( LRneeded != 1 ) goto g6;
;** 372 ----------------------- *pCH1 = *pCH1+*pCH2;
;** 373 ----------------------- if ( *pCH1 <= K$36 ) goto g6;
;** 375 ----------------------- *pCH1 = K$36;
;** -----------------------g6:
;** 379 ----------------------- *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
;** 382 ----------------------- *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
;** 384 ----------------------- tmpRead2 = *p2;
;** 387 ----------------------- *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
;** 389 ----------------------- *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
;** 397 ----------------------- if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
LDW .D1T2 *A4,B4 ; |364|
ZERO .L1 A1
NOP 3
STW .D2T2 B4,*+SP(4) ; |364|
LDW .D2T2 *+SP(4),B4 ; |367|
NOP 4
EXTU .S2 B4,8,24,B4 ; |367|
LDW .D2T1 *+B7[B4],A9 ; |367|
NOP 4
STW .D1T1 A9,*A8 ; |367|
LDW .D2T2 *+SP(4),B4 ; |369|
NOP 4
SHRU .S2 B4,24,B4 ; |369|
LDW .D2T2 *+B7[B4],B4 ; |369|
NOP 4
STW .D2T2 B4,*++B5 ; |369|
LDHU .D1T2 *A11,B4 ; |370|
NOP 4
CMPEQ .L2 B4,1,B0 ; |370|

[ B0] LDW .D1T1 *A8,A9 ; |372|
|| [ B0] LDW .D2T2 *B5,B4 ; |372|

NOP 4
[ B0] ADDSP .L1X B4,A9,A9 ; |372|
NOP 3
[ B0] STW .D1T1 A9,*A8 ; |372|
[ B0] LDW .D1T1 *A8,A9 ; |373|
NOP 4
[ B0] CMPGTSP .S1 A9,A2,A9 ; |373|
[ B0] MV .L1 A9,A1
[ A1] STW .D1T1 A2,*A8 ; |375|
LDW .D2T1 *+SP(4),A9 ; |379|
NOP 4
EXTU .S1 A9,16,24,A9 ; |379|
LDW .D1T1 *+A10[A9],A9 ; |379|
NOP 4
STW .D1T1 A9,*A5++ ; |379|
LDW .D2T1 *+SP(4),A9 ; |382|
NOP 4
EXTU .S1 A9,24,24,A9 ; |382|
LDW .D1T1 *+A10[A9],A9 ; |382|
NOP 4
STW .D1T1 A9,*A3++ ; |382|
LDW .D1T2 *A0,B4 ; |384|
NOP 4
STW .D2T2 B4,*+SP(8) ; |384|
LDW .D2T2 *+SP(8),B4 ; |387|
NOP 4
EXTU .S2 B4,24,24,B4 ; |387|
LDW .D2T1 *+B7[B4],A9 ; |387|
NOP 4
STW .D1T1 A9,*A6++ ; |387|
LDW .D2T2 *+SP(8),B4 ; |389|
NOP 4
EXTU .S2 B4,16,24,B4 ; |389|
LDW .D2T1 *+B7[B4],A9 ; |389|
NOP 4
STW .D1T1 A9,*A7++ ; |389|
LDW .D2T2 *+SP(12),B4 ; |397|

LDW .D1T1 *++A8,A9 ; |397|
|| LDW .D2T2 *+SP(4),B8 ; |397|

NOP 4

CMPEQ .L2 B8,B6,B8 ; |397|
|| CMPLTSP .S2X A9,B4,B4 ; |397|

XOR .L2 1,B8,B8 ; |397|
AND .L2 B8,B4,B0 ; |397|
[ B0] B .S1 $C$L6 ; |397|
.dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
NOP 5
; BRANCHCC OCCURS {$C$L6} ; |397|
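The STW B4,*+SP(4) / LDW *+SP(4),B4 pairs in that listing are the volatile locals (tmpRead1, tmpRead2) being forced out to the stack and reloaded on every use, each reload followed by NOP 4. Only the FIFO pointer target needs to be volatile; the value captured from it does not. A minimal sketch of the distinction (hypothetical names):

```c
/* 'fifo' points at hardware, so the read itself must go through a
   volatile-qualified pointer.  The captured value 'v' is an ordinary
   register variable: reusing it below does not re-read the FIFO and
   generates no stack traffic. */
static unsigned capture(const volatile unsigned *fifo)
{
    unsigned v = *fifo;                /* one volatile hardware read */
    return ((v >> 16) & 0xFF) + ((v >> 24) & 0xFF);
}
```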

--- In c..., "Richard Williams" wrote:
> d.stuartnl,
>
> The reason the time is quicker, even though there is more code, is because the
> code produced to do:
> CH1.deloggedData[x]
> includes quite a lot of math, calculation of an address in a array is slow
> compared to incrementing a pointer
>
> R. Williams
> ---------- Original Message -----------
> From: "d.stuartnl"
> To: c...
> Sent: Fri, 17 Jul 2009 17:06:32 -0000
> Subject: [c6x] Re: Slow EMIF transfer
>
> > Dear R.Williams,
> >
> > I changed my code to your suggestion:
> >
> > void Calculator_FetchData()
> > {
> > volatile float * pCH1;
> > volatile float * pCH2;
> > volatile float * pCH3;
> > volatile float * pCH4;
> > volatile float * pCH5;
> > volatile float * pCH6;
> >
> > const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > const termValue = 0x84825131;
> >
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > pCH3 = &CH3.deloggedData[0];
> > pCH4 = &CH4.deloggedData[0];
> > pCH5 = &CH5.deloggedData[0];
> > pCH6 = &CH6.deloggedData[0];
> >
> >
> >
> > tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> >
> > while(*pCH1 < endCH1)
> > {
> > tmpRead1 = *read1;
> > if(tmpRead1 == termValue) break;
> > //CHANNEL 1
> > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > // CHANNEL 2
> > *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > if(LRneeded == 1)
> > {
> > *pCH1 += *pCH2;
> > if(*pCH1 > 5000)
> > {
> > *pCH1 = 5000;
> > }
> > }
> > // CHANNEL 5
> > *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> >
> > // CHANNEL 6
> > *pCH6 = LUT1[tmpRead1 & 0xFF];
> >
> > tmpRead2 = *read2;
> >
> > // CHANNEL 3 this channel is always read for particle matching on
> > this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))]; // CHANNEL 4
> > *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >
> > pCH1++;
> > pCH2++;
> > pCH3++;
> > pCH4++;
> > pCH5++;
> > pCH6++;
> > x++;
> > }
> > if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> > {
> > processTime = TIMER(1)->cnt - tmpprocessTime;
> > }
> > }
> >
> > On my testrig I'm offering particles with a fixed lenght of 985. My
> > previous code could read 985 samples for 6 channels in 681us. Your
> > suggestion cut that time down to 601us!!! My first reaction was WOW
> > :P. I have a couple of questions though if you can forgive my
> > ignorance. The big question is WHY? Because it looks like it's
> > calculating more (6 pointers instead of 1 "x"). I still left in the
> > x++; because I need to know how many samples have been read.
> >
> > With kind regards,
> >
> > Dominic Stuart
> >
> > --- In c..., "Richard Williams" wrote:
> > >
> > > d.stuartnl,
> > >
> > > I notice that the code, during the first loop, checks for the termination value
> > > then throws away the first read values (by reading from read1 and read2 again).
> > > is that you wanted to do?
> > >
> > > Execution could be made much faster, by eliminating the calculations related to
> > > 'x' by using pointers to:
> > > CH1.deloggedData,
> > > CH2.deloggedData,
> > > CH3.deloggedData,
> > > CH4.deloggedData,
> > > CH5.deloggedData,
> > > CH6.deloggedData.
> > > Initialize the pointers before the loop and increment them at the end of the
> loop.
> > > Also, eliminate 'x' and related calculation by precalculating the end address
> > > for the loop as:
> > > const endCH1 = &CH1.deloggedData[0x1000];
> > > const termValue = 0x84825131;
> > >
> > > pCH1 = &CH1.deloggedData[0];
> > > pCH2 = &CH2.deloggedData[0];
> > > --- // rest of initialization
> > > while( pCH1 < endCH1 )
> > > {
> > > ---// processing
> > > pCH1++;
> > > pCh2++;
> > > ...// rest of incrementing
> > > } // end while()
> > >
> > > to avoid processing the termination value from *read1
> > > and to exit when the termination value is read:
> > > The first code within the 'while' loop would be:
> > > tmpRead1 = *read1;
> > > if (tmpRead1 == termValue ) break;
> > > tmpRead2 = *read2;
> > >
> > > R. Williams
> > >
> > >
> > > ---------- Original Message -----------
> > > From: "d.stuartnl"
> > > To: c...
> > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > Subject: [c6x] Re: Slow EMIF transfer
> > >
> > > > R. Williams,
> > >
> > > >
> > > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > > I've been analyzing the compilers feedback and it's stating that it
> > > > cannot implement software pipelining because there's a function call
> > > > (AddSample()) in the loop. I've removed the AddSample() function and
> > > > put the code from the function directly into the loop (see source),
> > > > there's still some problems (Disqualified loop: Loop carried
> > > > dependency bound too large). But I'm working on it :) I've also found
> > > > out that pipelining is not being used in a lot of my loops so I'm
> > > > guessing if I adjust my C-code so that software pipelining will be
> > > > possible I will notice an increase in performance.
> > > >
> > > > Source:
> > > >
> > > > read1 = (int*) 0x90300004;
> > > > read2 = (int*) 0x90300008;
> > > >
> > > > tmpRead1 = *read1;
> > > > tmpRead2 = *read2;
> > > > x = 0;
> > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > {
> > > > tmpRead1 = *read1;
> > > > tmpRead2 = *read2;YouTube - Dilbert - The Knack
> > > >
> > > > CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > // FWS R+L Add
> > > > if(LRneeded == 1)
> > > > {
> > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > if(CH1.deloggedData[x] > 5000)
> > > > {
> > > > CH1.deloggedData[x] = 5000;
> > > > }
> > > > }
> > > > CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > > binData[x] = (tmpRead2 & 0xFF);
> > > > CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > > CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > > CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > > x++;
> > > > }
> > > >
> > > > With kind regards,
> > > >
> > > > Dominic
> > > >
> > > > >
> > > > > However, your idea of just using the read operation, since it is much longer
> > > > > than a write, is a good one.
> > > > >
> > > > > R. Williams
> > > > >
> > > > >
> > > > >
> > > > > ---------- Original Message -----------
> > > > > From: Jeff Brower
> > > > > To: Dominic Stuart
> > > > > Cc: c...
> > > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > >
> > > > > > Dominic-
> > > > > >
> > > > > > > I am indeed trying to avoid delay in processing flow. The data needs
> to be
> > > > > > > decompressed asap. When that is done the DSP performs calculations
> on the
> > > > > > > data and based on the outcome of those calculations the DSP generates a
> > > > > > > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > > > > > > always takes longer than a write, I don't have to pull the Empty
> Flag and
> > > > > > > can just read the data through a loop like so:
> > > > > > >
> > > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > > {
> > > > > > > Calculator_AddSample();
> > > > > > > }
> > > > > >
> > > > > > Ok, so what you're saying is that once you see a "not empty" flag,
> > > > > > then you know the agent on the other side of the FIFO is writing a
> > > > > > known block size, and will write it faster than you can read, so your
> > > > > > code just needs to read.
> > > > > >
> > > > > > > I've tested this and it did improve the performance but nothing
> shocking,
> > > > > > > it seems the decompressing via the LookUp Table is creating the bottle
> > > > > > > neck. I've already split the two dimensional LUT into 2 one dimensional
> > > > > > > array's. This also helped a bit.
> > > > > >
> > > > > > One thing you might try is hand-optimized asm code just for the read /
> > > > > > look-up sequence, using techniques that Richard was describing. If
> > > > > > you take advantage of the pipeline, you can improve performance. For
> > > > > > example you can read sample N, then in the next 4 instructions process
> > > > > > the lookup on N-1, waiting for N to become valid. It sounds to me
> > > > > > like it wouldn't be that much code in your loop, maybe a dozen or less
> > > > > > asm instructions.
> > > > > >
> > > > > > -Jeff
> > > > > >
> > > > > > PS. Please post to the group, not to me. Thanks.
> > > > > >
> > > > > > > --- In c..., Jeff Brower wrote:
> > > > > > > >
> > > > > > > > Dominic-
> > > > > > > >
> > > > > > > > > Thanks for the information, I think I will refrain from using block
> > > > > > > > > transfers because I want to process the data as the DSP receives it.
> > > > > > > > .
> > > > > > > > .
> > > > > > > > .
> > > > > > > >
> > > > > > > > > At the moment I am starting this "prefetch" function when a burst
> > > > > > > > > starts and execute this function every time there is data available
> > > > > > > > > in the FIFO's (polling the Empty Flag). I'm prefeteching 27.6% of
> > > > > > > > > the data before the burst ends. All variables are in IRAM.
> > > > > > > >
> > > > > > > > The typical reason for doing it that way is to avoid delay
> (latency) in
> > > > > your signal
> > > > > > > > processing flow, relative to some output (DAC, GPIO line, digital
> > > > > transmission,
> > > > > > > > etc). Is that the case? If not then a block based method would be
> > > > > better, otherwise
> > > > > > > > you will waste a lot of time polling for each element. You don't
> have to
> > > > > implement
> > > > > > > > DMA as a first step to get that working, you could use a code
> loop. Then
> > > > > implement
> > > > > > > > DMA in order to further improve performance.
> > > > > > > >
> > > > > > > > -Jeff
> > > > > > > >
> > > > > > > > > My function looks like this:
> > > > > > > > >
> > > > > > > > > void Calculator_AddSample()
> > > > > > > > > {
> > > > > > > > > x++;
> > > > > > > > >
> > > > > > > > > read1 = (int*) 0x90300004;
> > > > > > > > > read2 = (int*) 0x90300008;
> > > > > > > > >
> > > > > > > > > tmpRead1 = *read1;
> > > > > > > > > tmpRead2 = *read2;
> > > > > > > > >
> > > > > > > > > // CHANNEL 1
> > > > > > > > > CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > > // CHANNEL 2
> > > > > > > > > CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > > > > // FWS R+L Add
> > > > > > > > > if(LRneeded == 1)
> > > > > > > > > {
> > > > > > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > > if(CH1.deloggedData[x] > 5000)
> > > > > > > > > {
> > > > > > > > > CH1.deloggedData[x] = 5000;
> > > > > > > > > }
> > > > > > > > > }
> > > > > > > > > // CHANNEL 3 this channel is always read for particle matching on
> > > > > this channel
> > > > > > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > > CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > > >
> > > > > > > > > // CHANNEL 4
> > > > > > > > > CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > > // CHANNEL 5
> > > > > > > > > CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > > // CHANNEL 6
> > > > > > > > > CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > > }
> > > > > > > > > This function executes 2 reads from 2 different FIFO's and then
> > > > > seperates the different datachannels and decompresses the value's with a
> LookUp
> > > > > Table.
> > > > > > > > >
> > > > > > > > > I am trying to streamline this function so it can keep up with the
> > > > > incoming data. The data is written to the FIFOs at 4 MHz. The data
> consists of
> > > > > small burst packets ranging from 3 to 4096 bytes per channel.
> > > > > > > > >
> > > > > > > > > At the moment I am starting this "prefetch" function when a
> burst starts
> > > > > and execute this function every time there is data available in the FIFO's
> > > > > (polling the Empty Flag). I'm prefetching 27.6% of the data before the
> burst
> > > > > ends. All variables are in IRAM.
> > > > > > > > >
> > > > > > > > > I think I made an error in suspecting the EMIF transfer speed
> and I now
> > > > > suspect that there may be some overhead in the polling scheme I use for
> calling
> > > > > this function that results in the slow transfer speed. I will look into
> this. I
> > > > > would like to thank everyone for their input.
> > > > > > > > >
> > > > > > > > > With kind regards,
> > > > > > > > >
> > > > > > > > > Dominic
> > > > > > > > >
> > > > > > > > > --- In c..., Adolf Klemenz wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Dominic,
> > > > > > > > > >
> > > > > > > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > > > > > > >as I understand DMA, I would need to work in "blocks" of data but
> > > that
> > > > > > > > > > >would be very tricky in my application since I do not know how
> > > big the
> > > > > > > > > > >datastream is gonna be. Or is it possible to use DMA for
> single byte
> > > > > transfers?
> > > > > > > > > >
> > > > > > > > > > using DMA makes sense for block transfers only. Typical Fifo
> > > applications
> > > > > > > > > > will use the Fifo's half-full flag (or a similar signal) to
> > > trigger a DMA
> > > > > > > > > > block read.
> > > > > > > > > > You may use element-synchronized DMA (each trigger transfers only
> > > one data
> > > > > > > > > > word), but there will be no speed improvement: It takes about
> > > 100ns from
> > > > > > > > > > the EDMA sync event to the actual data transfer on a C6713.
> > > > > > > > > >
> > > > > > > > > > Attached is a scope screenshot generated by this test program
> > > > > > > > > >
> > > > > > > > > > // compiled with -o2 and without debug info:
> > > > > > > > > >
> > > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > > // optimizer from code removal
> > > > > > > > > > for (;;)
> > > > > > > > > > {
> > > > > > > > > > buffer = *(volatile int*)0x90300000;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > The screenshot shows chip select and read signal with the expected
> > > timings
> > > > > > > > > > (20ns strobe width). The gap between successive reads is caused by
> > > the DSP
> > > > > > > > > > architecture. Here it is 200ns because a 225MHz DSP was used,
> > > which should
> > > > > > > > > > translate to 150ns on a 300MHz device.
> > > > > > > > > >
> > > > > > > > > > If this isn't fast enough, you must use block transfers.
> > > > > > > > > >
> > > > > > > > > > Best Regards,
> > > > > > > > > > Adolf Klemenz, D.SignT
> > > > > > > >
> > > > > ------- End of Original Message -------
> > > > >
> > > ------- End of Original Message -------
> > >
> ------- End of Original Message -------
>

_____________________________________
d.stuartnl,

under the assumption that the included code is actually the code being compiled...
The line: x = (int) (pCH1 - &CH1.deloggedData[0]);
will give the number of addresses rather than the number of entries,
Therefore it should be:
x = ((int) (pCH1 - &CH1.deloggedData[0]) / sizeof(float));
The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
is using a local variable tmpRead1 before it is set,
so garbage is being used in the comparison.
The lines:
volatile int tmpStore1;
volatile int tmpStore2;
describe two local variables that are not being used, so should be deleted.
The lines:
> volatile int * restrict p1,
> volatile int * restrict p2)
have the parameter names p1 and p2.
These names convey no useful information about the targets of the pointers,
so they should be renamed to something meaningful.
The local variable tmpRead2, loaded by
tmpRead2 = *p2;
is used immediately on the next line.
It takes some 4 cycles for the load to complete, so the
load should be issued several cycles/lines earlier in the source.
The line: if(LRneeded == 1)
is referencing a global variable.
This makes for maintenance problems and pipelining problems.
It would be better passed in as one of the parameters (and have a
local/parameter name).
We have previously discussed the use of the tab character and the problems it
produces.
I replaced the tab characters with spaces in the copied code.
The term 'volatile' is not needed in any of the variables as the code does not
have repeating lines and the variables will not (unexpectedly) change during the
execution of the code.
Removing the 'volatile' will speed up the code because the values, once into a
CPU register, will not have to be re-read at each usage of the value.
The line: const termValue = 0x84825131
is missing the 'type' for the constant.
I would suggest adding 'int' after the 'const'.
The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
is performing a bit-wise 'and' between two logical conditions.
It should be: while( (*pCH1 < endCH1) && (tmpRead1 != termValue) )
so it performs a logical 'and' between two logical conditions.
The use of the global variable 'x' is a maintenance problem.
I would suggest a modification to have a local variable 'x' and return the value
of 'x' rather than returning 'void';
let the caller assign the returned value to the global variable 'x'.
In general, for a loop to be pipelined, the loop must be relatively simple.
Therefore, I would suggest making this two loops,
one for tmpRead1 reading and calculations
one for tmpRead2 reading and calculations
However, for parallel operation, the tmpRead1 and tmpRead2 operations could be
(somewhat) merged in a single loop.
To help absorb the needed CPU cycles after a read of p1 and p2, I would put the
first read(s) before the 'while' statement(s) and read again at the end of the
loop, just before the incrementing of the pCHx pointers.
R. Williams

---------- Original Message -----------
From: "d.stuartnl"
To: c...
Sent: Wed, 22 Jul 2009 13:43:32 -0000
Subject: [c6x] Re: Slow EMIF transfer

> Hi all,
>
> I'm trying to further optimize my code but for some reason I cannot
> get pipelining to work. I've checked several documents (SPRU425,
> Optimizing C Compiler Tutorial, SPRA666, Hand Tuning Loops and Control
> Code). These documents primarily focus on improving pipelines but in
> my .asm file it keeps stating "Unsafe schedule for irregular loop". It
> produces the following Software Pipeline Information:
>


> -------* My code is as follows:
>
> void Calculator_FetchData(
> volatile int * restrict p1,
> volatile int * restrict p2)
{
> volatile int tmpRead1;
> volatile int tmpRead2;
> volatile int tmpStore1;
> volatile int tmpStore2;
> volatile float * restrict pCH1;
> volatile float * restrict pCH2;
> volatile float * restrict pCH3;
> volatile float * restrict pCH4;
> volatile float * restrict pCH5;
> volatile float * restrict pCH6;
>
> const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> const termValue = 0x84825131;
>
> pCH1 = &CH1.deloggedData[0];
> pCH2 = &CH2.deloggedData[0];
> pCH3 = &CH3.deloggedData[0];
> pCH4 = &CH4.deloggedData[0];
> pCH5 = &CH5.deloggedData[0];
> pCH6 = &CH6.deloggedData[0];
>
> while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> {
> tmpRead1 = *p1;
>
> //CHANNEL 1
> *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> // CHANNEL 2
> *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
> if(LRneeded == 1)
> {
> *pCH1 += *pCH2;
>
> if(*pCH1 > 5000)
> {
> *pCH1 = 5000;
> }
> }
> //CHANNEL 5
> *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>
> // CHANNEL 6
> *pCH6 = LUT1[tmpRead1 & 0xFF];
>
> tmpRead2 = *p2;
>
> // CHANNEL 3 this channel is always read for particle matching on
> this channel
> *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> // CHANNEL 4
> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
> pCH1++;
> pCH2++;
> pCH3++;
> pCH4++;
> pCH5++;
> pCH6++;
> }
>
> x = (int) (pCH1 - &CH1.deloggedData[0]);
> }
>
> Is there a way I can change my C-code so the DSP can Pipeline? I think
> it should be possible to have at least 2 iterations in parallel:
>
> 1st 2nd
> read1
> delog1
> read2 read1
> delog2 delog1
> etc..
>
> I've already used the "restrict" keyword on the pointers I use since
> these pointers do not overlap. I'm using the following compiler options:
>
> -k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug" -d"CHIP_6713"
> -d"DEBUG" -mt -mw -mh -mr1 -mv6710 --mem_model:data --consultant
>
> Any advice or pointers to documents regarding how to enable pipelining
> other than the ones mentioned above would be helpful.
>
> With kind regards,
>
> Dominic Stuart
>
> PS: I don't know if it's useful but the following asm code is being produced:
>
> $C$L6:
> $C$DW$L$_Calculator_FetchData$2$B:
> .dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
> ;** -----------------------g3:
> ;** 364 ----------------------- tmpRead1 = *p1;
> ;** 367 ----------------------- *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
> ;** 369 ----------------------- *(++pCH2) = K$32[((unsigned)tmpRead1>>24)];
> ;** 370 ----------------------- if ( LRneeded != 1 ) goto g6;
> ;** 372 ----------------------- *pCH1 = *pCH1+*pCH2;
> ;** 373 ----------------------- if ( *pCH1 <= K$36 ) goto g6;
> ;** 375 ----------------------- *pCH1 = K$36;
> ;** -----------------------g6:
> ;** 379 ----------------------- *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
> ;** 382 ----------------------- *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
> ;** 384 ----------------------- tmpRead2 = *p2;
> ;** 387 ----------------------- *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
> ;** 389 ----------------------- *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
> ;** 397 ----------------------- if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
> LDW .D1T2 *A4,B4 ; |364|
> ZERO .L1 A1
> NOP 3
> STW .D2T2 B4,*+SP(4) ; |364|
> LDW .D2T2 *+SP(4),B4 ; |367|
> NOP 4
> EXTU .S2 B4,8,24,B4 ; |367|
> LDW .D2T1 *+B7[B4],A9 ; |367|
> NOP 4
> STW .D1T1 A9,*A8 ; |367|
> LDW .D2T2 *+SP(4),B4 ; |369|
> NOP 4
> SHRU .S2 B4,24,B4 ; |369|
> LDW .D2T2 *+B7[B4],B4 ; |369|
> NOP 4
> STW .D2T2 B4,*++B5 ; |369|
> LDHU .D1T2 *A11,B4 ; |370|
> NOP 4
> CMPEQ .L2 B4,1,B0 ; |370|
>
> [ B0] LDW .D1T1 *A8,A9 ; |372|
> || [ B0] LDW .D2T2 *B5,B4 ; |372|
>
> NOP 4
> [ B0] ADDSP .L1X B4,A9,A9 ; |372|
> NOP 3
> [ B0] STW .D1T1 A9,*A8 ; |372|
> [ B0] LDW .D1T1 *A8,A9 ; |373|
> NOP 4
> [ B0] CMPGTSP .S1 A9,A2,A9 ; |373|
> [ B0] MV .L1 A9,A1
> [ A1] STW .D1T1 A2,*A8 ; |375|
> LDW .D2T1 *+SP(4),A9 ; |379|
> NOP 4
> EXTU .S1 A9,16,24,A9 ; |379|
> LDW .D1T1 *+A10[A9],A9 ; |379|
> NOP 4
> STW .D1T1 A9,*A5++ ; |379|
> LDW .D2T1 *+SP(4),A9 ; |382|
> NOP 4
> EXTU .S1 A9,24,24,A9 ; |382|
> LDW .D1T1 *+A10[A9],A9 ; |382|
> NOP 4
> STW .D1T1 A9,*A3++ ; |382|
> LDW .D1T2 *A0,B4 ; |384|
> NOP 4
> STW .D2T2 B4,*+SP(8) ; |384|
> LDW .D2T2 *+SP(8),B4 ; |387|
> NOP 4
> EXTU .S2 B4,24,24,B4 ; |387|
> LDW .D2T1 *+B7[B4],A9 ; |387|
> NOP 4
> STW .D1T1 A9,*A6++ ; |387|
> LDW .D2T2 *+SP(8),B4 ; |389|
> NOP 4
> EXTU .S2 B4,16,24,B4 ; |389|
> LDW .D2T1 *+B7[B4],A9 ; |389|
> NOP 4
> STW .D1T1 A9,*A7++ ; |389|
> LDW .D2T2 *+SP(12),B4 ; |397|
>
> LDW .D1T1 *++A8,A9 ; |397|
> || LDW .D2T2 *+SP(4),B8 ; |397|
>
> NOP 4
>
> CMPEQ .L2 B8,B6,B8 ; |397|
> || CMPLTSP .S2X A9,B4,B4 ; |397|
>
> XOR .L2 1,B8,B8 ; |397|
> AND .L2 B8,B4,B0 ; |397|
> [ B0] B .S1 $C$L6 ; |397|
> .dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
> NOP 5
> ; BRANCHCC OCCURS {$C$L6} ; |397|
>
> --- In c..., "Richard Williams" wrote:
> >
> >
> > d.stuartnl,
> >
> > The reason the time is quicker, even though there is more code, is because the
> > code produced to do:
> > CH1.deloggedData[x]
> > includes quite a lot of math, calculation of an address in a array is slow
> > compared to incrementing a pointer
> >
> > R. Williams
> >
> >
> > ---------- Original Message -----------
> > From: "d.stuartnl"
> > To: c...
> > Sent: Fri, 17 Jul 2009 17:06:32 -0000
> > Subject: [c6x] Re: Slow EMIF transfer
> >
> > > Dear R.Williams,
> > >
> > > I changed my code to your suggestion:
> > >
> > > void Calculator_FetchData()
> > > {
> > > volatile float * pCH1;
> > > volatile float * pCH2;
> > > volatile float * pCH3;
> > > volatile float * pCH4;
> > > volatile float * pCH5;
> > > volatile float * pCH6;
> > >
> > > const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > > const termValue = 0x84825131;
> > >
> > > pCH1 = &CH1.deloggedData[0];
> > > pCH2 = &CH2.deloggedData[0];
> > > pCH3 = &CH3.deloggedData[0];
> > > pCH4 = &CH4.deloggedData[0];
> > > pCH5 = &CH5.deloggedData[0];
> > > pCH6 = &CH6.deloggedData[0];
> > >
> > >
> > >
> > > tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> > >
> > > while(*pCH1 < endCH1)
> > > {
> > > tmpRead1 = *read1;
> > > if(tmpRead1 == termValue) break;
> > > //CHANNEL 1
> > > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > // CHANNEL 2
> > > *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > if(LRneeded == 1)
> > > {
> > > *pCH1 += *pCH2;
> > > if(*pCH1 > 5000)
> > > {
> > > *pCH1 = 5000;
> > > }
> > > }
> > > // CHANNEL 5
> > > *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > >
> > > // CHANNEL 6
> > > *pCH6 = LUT1[tmpRead1 & 0xFF];
> > >
> > > tmpRead2 = *read2;
> > >
> > > // CHANNEL 3 this channel is always read for particle matching on
> > > this channel
> > > *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > > // CHANNEL 4
> > > *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > >
> > > pCH1++;
> > > pCH2++;
> > > pCH3++;
> > > pCH4++;
> > > pCH5++;
> > > pCH6++;
> > > x++;
> > > }
> > > if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> > > {
> > > processTime = TIMER(1)->cnt - tmpprocessTime;
> > > }
> > > }
> > >
> > > On my testrig I'm offering particles with a fixed length of 985. My
> > > previous code could read 985 samples for 6 channels in 681us. Your
> > > suggestion cut that time down to 601us!!! My first reaction was WOW
> > > :P. I have a couple of questions though if you can forgive my
> > > ignorance. The big question is WHY? Because it looks like it's
> > > calculating more (6 pointers instead of 1 "x"). I still left in the
> > > x++; because I need to know how many samples have been read.
> > >
> > > With kind regards,
> > >
> > > Dominic Stuart
> > >
> > > --- In c..., "Richard Williams" wrote:
> > > >
> > > > d.stuartnl,
> > > >
> > > > I notice that the code, during the first loop, checks for the
termination value
> > > > then throws away the first read values (by reading from read1 and read2
again).
> > > > is that what you wanted to do?
> > > >
> > > > Execution could be made much faster, by eliminating the calculations
related to
> > > > 'x' by using pointers to:
> > > > CH1.deloggedData,
> > > > CH2.deloggedData,
> > > > CH3.deloggedData,
> > > > CH4.deloggedData,
> > > > CH5.deloggedData,
> > > > CH6.deloggedData.
> > > > Initialize the pointers before the loop and increment them at the end of the
> > loop.
> > > > Also, eliminate 'x' and related calculation by precalculating the end
address
> > > > for the loop as:
> > > > const endCH1 = &CH1.deloggedData[0x1000];
> > > > const termValue = 0x84825131;
> > > >
> > > > pCH1 = &CH1.deloggedData[0];
> > > > pCH2 = &CH2.deloggedData[0];
> > > > --- // rest of initialization
> > > > while( pCH1 < endCH1 )
> > > > {
> > > > ---// processing
> > > > pCH1++;
> > > > pCh2++;
> > > > ...// rest of incrementing
> > > > } // end while()
> > > >
> > > > to avoid processing the termination value from *read1
> > > > and to exit when the termination value is read:
> > > > The first code within the 'while' loop would be:
> > > > tmpRead1 = *read1;
> > > > if (tmpRead1 == termValue ) break;
> > > > tmpRead2 = *read2;
> > > >
> > > > R. Williams
> > > >
> > > >
> > > > ---------- Original Message -----------
> > > > From: "d.stuartnl"
> > > > To: c...
> > > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > > Subject: [c6x] Re: Slow EMIF transfer
> > > >
> > > > > R. Williams,
> > > >
> > > > >
> > > > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > > > I've been analyzing the compilers feedback and it's stating that it
> > > > > cannot implement software pipelining because there's a function call
> > > > > (AddSample()) in the loop. I've removed the AddSample() function and
> > > > > put the code from the function directly into the loop (see source),
> > > > > there's still some problems (Disqualified loop: Loop carried
> > > > > dependency bound too large). But I'm working on it :) I've also found
> > > > > out that pipelining is not being used in a lot of my loops so I'm
> > > > > guessing if I adjust my C-code so that software pipelining will be
> > > > > possible I will notice an increase in performance.
> > > > >
> > > > > Source:
> > > > >
> > > > > read1 = (int*) 0x90300004;
> > > > > read2 = (int*) 0x90300008;
> > > > >
> > > > > tmpRead1 = *read1;
> > > > > tmpRead2 = *read2;
> > > > > x = 0;
> > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > {
> > > > > tmpRead1 = *read1;
> > > > > tmpRead2 = *read2;
> > > > >
> > > > > CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > > CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > > // FWS R+L Add
> > > > > if(LRneeded == 1)
> > > > > {
> > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > if(CH1.deloggedData[x] > 5000)
> > > > > {
> > > > > CH1.deloggedData[x] = 5000;
> > > > > }
> > > > > }
> > > > > CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > > > CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > > > CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > > > x++;
> > > > > }
> > > > >
> > > > > With kind regards,
> > > > >
> > > > > Dominic
> > > > >
> > > > > >
> > > > > > However, your idea of just using the read operation, since it is
much longer
> > > > > > than a write, is a good one.
> > > > > >
> > > > > > R. Williams
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------- Original Message -----------
> > > > > > From: Jeff Brower
> > > > > > To: Dominic Stuart
> > > > > > Cc: c...
> > > > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > > >
> > > > > > > Dominic-
> > > > > > >
> > > > > > > > I am indeed trying to avoid delay in processing flow. The data needs
> > to be
> > > > > > > > decompressed asap. When that is done the DSP performs calculations
> > on the
> > > > > > > > data and based on the outcome of those calculations the DSP
generates a
> > > > > > > > trigger (GPIO). Your idea of a code loop got me thinking... If a
read
> > > > > > > > always takes longer than a write, I don't have to poll the Empty
> > Flag and
> > > > > > > > can just read the data through a loop like so:
> > > > > > > >
> > > > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > > > {
> > > > > > > > Calculator_AddSample();
> > > > > > > > }
> > > > > > >
> > > > > > > Ok, so what you're saying is that once you see a "not empty" flag,
> > > > > > > then you know the agent on the other side of the FIFO is writing a
> > > > > > > known block size, and will write it faster than you can read, so your
> > > > > > > code just needs to read.
> > > > > > >
> > > > > > > > I've tested this and it did improve the performance, but nothing
> > > > > > > > shocking; it seems the decompressing via the LookUp Table is creating
> > > > > > > > the bottleneck. I've already split the two-dimensional LUT into 2
> > > > > > > > one-dimensional arrays. This also helped a bit.
> > > > > > >
> > > > > > > One thing you might try is hand-optimized asm code just for the
read /
> > > > > > > look-up sequence, using techniques that Richard was describing. If
> > > > > > > you take advantage of the pipeline, you can improve performance. For
> > > > > > > example you can read sample N, then in the next 4 instructions
process
> > > > > > > the lookup on N-1, waiting for N to become valid. It sounds to me
> > > > > > > like it wouldn't be that much code in your loop, maybe a dozen or
less
> > > > > > > asm instructions.
> > > > > > >
> > > > > > > -Jeff
> > > > > > >
> > > > > > > PS. Please post to the group, not to me. Thanks.
> > > > > > >
> > > > > > > > --- In c..., Jeff Brower wrote:
> > > > > > > > >
> > > > > > > > > Dominic-
> > > > > > > > >
> > > > > > > > > > Thanks for the information, I think I will refrain from
using block
> > > > > > > > > > transfers because I want to process the data as the DSP
receives it.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > >
> > > > > > > > > > At the moment I am starting this "prefetch" function when a
burst
> > > > > > > > > > starts and execute this function every time there is data
available
> > > > > > > > > > in the FIFO's (polling the Empty Flag). I'm prefetching
27.6% of
> > > > > > > > > > the data before the burst ends. All variables are in IRAM.
> > > > > > > > >
> > > > > > > > > The typical reason for doing it that way is to avoid delay
> > (latency) in
> > > > > > your signal
> > > > > > > > > processing flow, relative to some output (DAC, GPIO line, digital
> > > > > > transmission,
> > > > > > > > > etc). Is that the case? If not then a block based method
would be
> > > > > > better, otherwise
> > > > > > > > > you will waste a lot of time polling for each element. You don't
> > have to
> > > > > > implement
> > > > > > > > > DMA as a first step to get that working, you could use a code
> > loop. Then
> > > > > > implement
> > > > > > > > > DMA in order to further improve performance.
> > > > > > > > >
> > > > > > > > > -Jeff
> > > > > > > > >
> > > > > > > > > > My function looks like this:
> > > > > > > > > >
> > > > > > > > > > void Calculator_AddSample()
> > > > > > > > > > {
> > > > > > > > > > x++;
> > > > > > > > > >
> > > > > > > > > > read1 = (int*) 0x90300004;
> > > > > > > > > > read2 = (int*) 0x90300008;
> > > > > > > > > >
> > > > > > > > > > tmpRead1 = *read1;
> > > > > > > > > > tmpRead2 = *read2;
> > > > > > > > > >
> > > > > > > > > > // CHANNEL 1
> > > > > > > > > > CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > > > // CHANNEL 2
> > > > > > > > > > CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >>
24)];
> > > > > > > > > > // FWS R+L Add
> > > > > > > > > > if(LRneeded == 1)
> > > > > > > > > > {
> > > > > > > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > > > if(CH1.deloggedData[x] > 5000)
> > > > > > > > > > {
> > > > > > > > > > CH1.deloggedData[x] = 5000;
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > > // CHANNEL 3 this channel is always read for particle
matching on
> > > > > > this channel
> > > > > > > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > > > CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > > > >
> > > > > > > > > > // CHANNEL 4
> > > > > > > > > > CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > > > // CHANNEL 5
> > > > > > > > > > CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > > > // CHANNEL 6
> > > > > > > > > > CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > > > }
> > > > > > > > > > This function executes 2 reads from 2 different FIFO's and then
> > > > > > separates the different data channels and decompresses the values with a
> > LookUp
> > > > > > Table.
> > > > > > > > > >
> > > > > > > > > > I am trying to streamline this function so it can keep up
with the
> > > > > > incoming data. The data is written to the FIFOs at 4 MHz. The data
> > consists of
> > > > > > small burst packets ranging from 3 to 4096 bytes per channel.
> > > > > > > > > >
> > > > > > > > > > At the moment I am starting this "prefetch" function when a
> > burst starts
> > > > > > and execute this function every time there is data available in the
FIFO's
> > > > > > (polling the Empty Flag). I'm prefetching 27.6% of the data before the
> > burst
> > > > > > ends. All variables are in IRAM.
> > > > > > > > > >
> > > > > > > > > > I think I made an error in suspecting the EMIF transfer speed
> > and I now
> > > > > > suspect that there may be some overhead in the polling scheme I use for
> > calling
> > > > > > this function that results in the slow transfer speed. I will look into
> > this. I
> > > > > > would like to thank everyone for their input.
> > > > > > > > > >
> > > > > > > > > > With kind regards,
> > > > > > > > > >
> > > > > > > > > > Dominic
> > > > > > > > > >
> > > > > > > > > > --- In c..., Adolf Klemenz
wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Dominic,
> > > > > > > > > > >
> > > > > > > > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > > > > > > > >as I understand DMA, I would need to work in "blocks" of
data but
> > > > that
> > > > > > > > > > > >would be very tricky in my application since I do not
know how
> > > > big the
> > > > > > > > > > > >datastream is gonna be. Or is it possible to use DMA for
> > single byte
> > > > > > transfers?
> > > > > > > > > > >
> > > > > > > > > > > using DMA makes sense for block transfers only. Typical Fifo
> > > > applications
> > > > > > > > > > > will use the Fifo's half-full flag (or a similar signal) to
> > > > trigger a DMA
> > > > > > > > > > > block read.
> > > > > > > > > > > You may use element-synchronized DMA (each trigger
transfers only
> > > > one data
> > > > > > > > > > > word), but there will be no speed improvement: It takes about
> > > > 100ns from
> > > > > > > > > > > the EDMA sync event to the actual data transfer on a C6713.
> > > > > > > > > > >
> > > > > > > > > > > Attached is a scope screenshot generated by this test program
> > > > > > > > > > >
> > > > > > > > > > > // compiled with -o2 and without debug info:
> > > > > > > > > > >
> > > > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > > > // optimizer from code removal
> > > > > > > > > > > for (;;)
> > > > > > > > > > > {
> > > > > > > > > > > buffer = *(volatile int*)0x90300000;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > The screenshot shows chip select and read signal with the
expected
> > > > timings
> > > > > > > > > > > (20ns strobe width). The gap between successive reads is
caused by
> > > > the DSP
> > > > > > > > > > > architecture. Here it is 200ns because a 225MHz DSP was used,
> > > > which should
> > > > > > > > > > > translate to 150ns on a 300MHz device.
> > > > > > > > > > >
> > > > > > > > > > > If this isn't fast enough, you must use block transfers.
> > > > > > > > > > >
> > > > > > > > > > > Best Regards,
> > > > > > > > > > > Adolf Klemenz, D.SignT
> > > > > > > > >
> > > > > > ------- End of Original Message -------
> > > > > >
> > > > ------- End of Original Message -------
> > > >
> > ------- End of Original Message -------
> >
------- End of Original Message -------

_____________________________________
R.Williams,

SUCCESS! Loop time has almost halved! Software pipelining is working now, thanks to your tips:

--- In c..., "Richard Williams" wrote:
>
> d.stuartnl,
>
> under the assumption that the included code is actually the code being compiled...
> The line: x = (int) (pCH1 - &CH1.deloggedData[0]);
> will give the number of addresses rather than the number of entries,
> Therefore it should be:
> x = ((int) (pCH1 - &CH1.deloggedData[0]) / sizeof(float));
>

For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 1;
is working fine as it is. Don't know why, though.

>
> The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> is using a local variable tmpRead1 before it is set,
> so garbage is being used in the comparison.
I've now added a tmpRead1 = 0; initializer, so the variable is set every time the function is called.
> The lines:
> volatile int tmpStore1;
> volatile int tmpStore2;
> describe two local variables that are not being used, so should be deleted.
>
check
>
> The lines:
> > volatile int * restrict p1,
> > volatile int * restrict p2)
> have the parameter names p1 and p2.
> This yields no useful information about the target of the pointers,
> so should be renamed to something meaningful.
>
check, renamed them.
>
> The loading of the local variable tmpRead2
> tmpRead2 = *p2;
> is being immediately used in the next line.
> it takes some 4 cycles for the load to complete, so the
> loading should be several cycles/lines earlier in the source.
>
check, moved the read operations to the top of my while loop.
>
> The line: if(LRneeded == 1)
> is referencing a global variable.
> This makes for maintenance problems and pipelining problems.
> It would be better passed in as one of the parameters (and have a
> local/parameter name).
>
check
>
> We have previously discussed the use of the character and the problems it
> produces.
> I replaced the characters with spaces in the copied code.
>
sorry, will do that from now on...
>
> The term 'volatile' is not needed in any of the variables as the code does not
> have repeating lines and the variables will not (unexpectedly) change during the
> execution of the code.
> Removing the 'volatile' will speed up the code because the values, once into a
> CPU register, will not have to be re-read at each usage of the value.
>
removed volatile keyword.
>
> The line: const termValue = 0x84825131
> is missing the 'type' for the constant.
> I would suggest adding 'int' after the 'const'.
>
check
>
> The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> is performing a bit-wise 'and' between two logical conditions.
> It should be: while( (*pCH1 < endCH1) && (tmpRead1 != termValue) )
> so it performs a logical 'and' between two logical conditions.
>
check
>
> The use of the global variable 'x' is a maintenance problem.
> I would suggest a modification to have a local variable 'x' and return the value
> of 'x' rather than returning 'void'
> let the caller assign the returned value to the global variable 'x'.
>
check
>
> In general, for a loop to be pipelined, the loop must be relatively simple.
> Therefore, I would suggest making this two loops,
> one for tmpRead1 reading and calculations
> one for tmpRead2 reading and calculations
> However, for parallel operation, the tmpRead1 and tmpRead2 operations could be
> (somewhat) merged in a single loop.
>
still have them in a single loop and it's pipelining. Do you think it's worth splitting it into two loops to check whether there's an even better speed increase?
>
> To help absorb the needed CPU cycles after a read of p1 and p2, I would put the
> first read(s) before the 'while' statement(s) and read again at the end of the
> loop, just before the incrementing of the pCHx pointers.
>
for some reason, when I move the read operations as you suggest, software pipelining is not possible (Cannot find schedule).

My new and improved function:

unsigned int Calculator_FetchData(volatile int * restrict pFifo12, volatile int * restrict pFifo3, Bool curvature)
{
unsigned int tmpRead1 = 0;
unsigned int tmpRead2 = 0;
unsigned int sampleCount;
float * restrict pCH1;
float * restrict pCH2;
float * restrict pCH3;
char * restrict pBinData3;
float * restrict pCH4;
float * restrict pCH5;
float * restrict pCH6;

const float * endCH1 = &CH1.deloggedData[0x1000];
const int termValue = 0x84825131;

pCH1 = &CH1.deloggedData[0];
pCH2 = &CH2.deloggedData[0];
pCH3 = &CH3.deloggedData[0];
pBinData3 = &binData3[0];
pCH4 = &CH4.deloggedData[0];
pCH5 = &CH5.deloggedData[0];
pCH6 = &CH6.deloggedData[0];

while((pCH1 < endCH1) && (tmpRead1 != termValue))//(*pCH1 < endCH1) & (tmpRead1 != termValue))
{
tmpRead1 = *pFifo12;
tmpRead2 = *pFifo3;

//CHANNEL 1
*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
// CHANNEL 2
*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
if(curvature)
{
*pCH1 += *pCH2;
if(*pCH1 > 5000)
{
*pCH1 = 5000;
}
}
//CHANNEL 5
*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
// CHANNEL 6
*pCH6 = LUT1[tmpRead1 & 0xFF];
// CHANNEL 3 this channel is always read for particle matching on this channel
*pCH3 = LUT0[((tmpRead2 & 0xFF))];
*pBinData3 = tmpRead2 & 0xFF;
// CHANNEL 4
*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

pCH1++;
pCH2++;
pCH3++;
pBinData3++;
pCH4++;
pCH5++;
pCH6++;
}
sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;

return sampleCount;
}

I would like to thank you very much for helping me improve this code; I'm learning more and more :)

As you might have seen in my code, the second read (tmpRead2) is a 32-bit int, but I'm only interested in the lower 16 bits (where channels 3 and 4 reside). Is there a way I can inform the compiler to ignore the other 16 bits, for maybe more efficient performance?

I had to leave pFifo12 and pFifo3 volatile, because when I removed these keywords the software pipelining was disabled again (Cannot find schedule).

With kind regards,

Dominic

>
> R. Williams
>
> ---------- Original Message -----------
> From: "d.stuartnl"
> To: c...
> Sent: Wed, 22 Jul 2009 13:43:32 -0000
> Subject: [c6x] Re: Slow EMIF transfer
>
> > Hi all,
> >
> > I'm trying to further optimize my code but for some reason I cannot
> > get pipelining to work. I've checked several documents (SPRU425,
> > Optimizing C Compiler Tutorial, SPRA666, Hand Tuning Loops and Control
> > Code). These documents primarily focus on improving pipelines but in
> > my .asm file it keeps stating "Unsafe schedule for irregular loop". It
> > produces the folowing Software Pipeline Information:
> >
> > -------* My code is as follows:
> >
> > void Calculator_FetchData(
> > volatile int * restrict p1,
> > volatile int * restrict p2)
> {
> > volatile int tmpRead1;
> > volatile int tmpRead2;
> > volatile int tmpStore1;
> > volatile int tmpStore2;
> > volatile float * restrict pCH1;
> > volatile float * restrict pCH2;
> > volatile float * restrict pCH3;
> > volatile float * restrict pCH4;
> > volatile float * restrict pCH5;
> > volatile float * restrict pCH6;
> >
> > const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > const termValue = 0x84825131;
> >
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > pCH3 = &CH3.deloggedData[0];
> > pCH4 = &CH4.deloggedData[0];
> > pCH5 = &CH5.deloggedData[0];
> > pCH6 = &CH6.deloggedData[0];
> >
> > while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> > {
> > tmpRead1 = *p1;
> >
> > //CHANNEL 1
> > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > // CHANNEL 2
> > *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> > if(LRneeded == 1)
> > {
> > *pCH1 += *pCH2;
> >
> > if(*pCH1 > 5000)
> > {
> > *pCH1 = 5000;
> > }
> > }
> > //CHANNEL 5
> > *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> >
> > // CHANNEL 6
> > *pCH6 = LUT1[tmpRead1 & 0xFF];
> >
> > tmpRead2 = *p2;
> >
> > // CHANNEL 3 this channel is always read for particle matching on
> > this channel
> > *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > // CHANNEL 4
> > *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >
> > pCH1++;
> > pCH2++;
> > pCH3++;
> > pCH4++;
> > pCH5++;
> > pCH6++;
> > }
> >
> > x = (int) (pCH1 - &CH1.deloggedData[0]);
> > }
> >
> > Is there a way I can change my C-code so the DSP can pipeline? I think
> > it should be possible to have at least 2 iterations in parallel:
> >
> > 1st 2nd
> > read1
> > delog1
> > read2 read1
> > delog2 delog1
> > etc..
> >
> > I've already used the "restrict" keyword on the pointers I use since
> > these pointers do not overlap. I'm using the following compiler options:
> >
> > -k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug" -d"CHIP_6713"
> > -d"DEBUG" -mt -mw -mh -mr1 -mv6710 --mem_model:data --consultant
> >
> > Any advice or pointers to documents regarding how to enable pipelining
> > other than the ones mentioned above would be helpful.
> >
> > With kind regards,
> >
> > Dominic Stuart
> >
> > PS: I don't know if it's useful, but the following asm code is being produced:
> >
> > $C$L6:
> > $C$DW$L$_Calculator_FetchData$2$B:
> > .dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
> > ;** -----------------------g3:
> > ;** 364 ----------------------- tmpRead1 = *p1;
> > ;** 367 ----------------------- *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
> > ;** 369 ----------------------- *(++pCH2) = K$32[((unsigned)tmpRead1>>22>>2)];
> > ;** 370 ----------------------- if ( LRneeded != 1 ) goto g6;
> > ;** 372 ----------------------- *pCH1 = *pCH1+*pCH2;
> > ;** 373 ----------------------- if ( *pCH1 <= K$36 ) goto g6;
> > ;** 375 ----------------------- *pCH1 = K$36;
> > ;** -----------------------g6:
> > ;** 379 ----------------------- *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
> > ;** 382 ----------------------- *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
> > ;** 384 ----------------------- tmpRead2 = *p2;
> > ;** 387 ----------------------- *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
> > ;** 389 ----------------------- *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
> > ;** 397 ----------------------- if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
> > LDW .D1T2 *A4,B4 ; |364|
> > ZERO .L1 A1
> > NOP 3
> > STW .D2T2 B4,*+SP(4) ; |364|
> > LDW .D2T2 *+SP(4),B4 ; |367|
> > NOP 4
> > EXTU .S2 B4,8,24,B4 ; |367|
> > LDW .D2T1 *+B7[B4],A9 ; |367|
> > NOP 4
> > STW .D1T1 A9,*A8 ; |367|
> > LDW .D2T2 *+SP(4),B4 ; |369|
> > NOP 4
> > SHRU .S2 B4,24,B4 ; |369|
> > LDW .D2T2 *+B7[B4],B4 ; |369|
> > NOP 4
> > STW .D2T2 B4,*++B5 ; |369|
> > LDHU .D1T2 *A11,B4 ; |370|
> > NOP 4
> > CMPEQ .L2 B4,1,B0 ; |370|
> >
> > [ B0] LDW .D1T1 *A8,A9 ; |372|
> > || [ B0] LDW .D2T2 *B5,B4 ; |372|
> >
> > NOP 4
> > [ B0] ADDSP .L1X B4,A9,A9 ; |372|
> > NOP 3
> > [ B0] STW .D1T1 A9,*A8 ; |372|
> > [ B0] LDW .D1T1 *A8,A9 ; |373|
> > NOP 4
> > [ B0] CMPGTSP .S1 A9,A2,A9 ; |373|
> > [ B0] MV .L1 A9,A1
> > [ A1] STW .D1T1 A2,*A8 ; |375|
> > LDW .D2T1 *+SP(4),A9 ; |379|
> > NOP 4
> > EXTU .S1 A9,16,24,A9 ; |379|
> > LDW .D1T1 *+A10[A9],A9 ; |379|
> > NOP 4
> > STW .D1T1 A9,*A5++ ; |379|
> > LDW .D2T1 *+SP(4),A9 ; |382|
> > NOP 4
> > EXTU .S1 A9,24,24,A9 ; |382|
> > LDW .D1T1 *+A10[A9],A9 ; |382|
> > NOP 4
> > STW .D1T1 A9,*A3++ ; |382|
> > LDW .D1T2 *A0,B4 ; |384|
> > NOP 4
> > STW .D2T2 B4,*+SP(8) ; |384|
> > LDW .D2T2 *+SP(8),B4 ; |387|
> > NOP 4
> > EXTU .S2 B4,24,24,B4 ; |387|
> > LDW .D2T1 *+B7[B4],A9 ; |387|
> > NOP 4
> > STW .D1T1 A9,*A6++ ; |387|
> > LDW .D2T2 *+SP(8),B4 ; |389|
> > NOP 4
> > EXTU .S2 B4,16,24,B4 ; |389|
> > LDW .D2T1 *+B7[B4],A9 ; |389|
> > NOP 4
> > STW .D1T1 A9,*A7++ ; |389|
> > LDW .D2T2 *+SP(12),B4 ; |397|
> >
> > LDW .D1T1 *++A8,A9 ; |397|
> > || LDW .D2T2 *+SP(4),B8 ; |397|
> >
> > NOP 4
> >
> > CMPEQ .L2 B8,B6,B8 ; |397|
> > || CMPLTSP .S2X A9,B4,B4 ; |397|
> >
> > XOR .L2 1,B8,B8 ; |397|
> > AND .L2 B8,B4,B0 ; |397|
> > [ B0] B .S1 $C$L6 ; |397|
> > .dwpsn file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
> > NOP 5
> > ; BRANCHCC OCCURS {$C$L6} ; |397|
> >
> > --- In c..., "Richard Williams" wrote:
> > >
> > >
> > > d.stuartnl,
> > >
> > > The reason the time is quicker, even though there is more code, is because the
> > > code produced to do:
> > > CH1.deloggedData[x]
> > > includes quite a lot of math; calculation of an address in an array is slow
> > > compared to incrementing a pointer.
> > >
> > > R. Williams
> > >
> > >
> > > ---------- Original Message -----------
> > > From: "d.stuartnl"
> > > To: c...
> > > Sent: Fri, 17 Jul 2009 17:06:32 -0000
> > > Subject: [c6x] Re: Slow EMIF transfer
> > >
> > > > Dear R.Williams,
> > > >
> > > > I changed my code to your suggestion:
> > > >
> > > > void Calculator_FetchData()
> > > > {
> > > > volatile float * pCH1;
> > > > volatile float * pCH2;
> > > > volatile float * pCH3;
> > > > volatile float * pCH4;
> > > > volatile float * pCH5;
> > > > volatile float * pCH6;
> > > >
> > > > const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > > > const termValue = 0x84825131;
> > > >
> > > > pCH1 = &CH1.deloggedData[0];
> > > > pCH2 = &CH2.deloggedData[0];
> > > > pCH3 = &CH3.deloggedData[0];
> > > > pCH4 = &CH4.deloggedData[0];
> > > > pCH5 = &CH5.deloggedData[0];
> > > > pCH6 = &CH6.deloggedData[0];
> > > >
> > > >
> > > >
> > > > tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> > > >
> > > > while(*pCH1 < endCH1)
> > > > {
> > > > tmpRead1 = *read1;
> > > > if(tmpRead1 == termValue) break;
> > > > //CHANNEL 1
> > > > *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > // CHANNEL 2
> > > > *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > if(LRneeded == 1)
> > > > {
> > > > *pCH1 += *pCH2;
> > > > if(*pCH1 > 5000)
> > > > {
> > > > *pCH1 = 5000;
> > > > }
> > > > }
> > > > // CHANNEL 5
> > > > *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > >
> > > > // CHANNEL 6
> > > > *pCH6 = LUT1[tmpRead1 & 0xFF];
> > > >
> > > > tmpRead2 = *read2;
> > > >
> > > > // CHANNEL 3 this channel is always read for particle matching on this channel
> > > > *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > > > // CHANNEL 4
> > > > *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > >
> > > > pCH1++;
> > > > pCH2++;
> > > > pCH3++;
> > > > pCH4++;
> > > > pCH5++;
> > > > pCH6++;
> > > > x++;
> > > > }
> > > > if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> > > > {
> > > > processTime = TIMER(1)->cnt - tmpprocessTime;
> > > > }
> > > > }
> > > >
> > > > On my testrig I'm offering particles with a fixed length of 985. My
> > > > previous code could read 985 samples for 6 channels in 681us. Your
> > > > suggestion cut that time down to 601us!!! My first reaction was WOW
> > > > :P. I have a couple of questions though if you can forgive my
> > > > ignorance. The big question is WHY? Because it looks like it's
> > > > calculating more (6 pointers instead of 1 "x"). I still left in the
> > > > x++; because I need to know how many samples have been read.
> > > >
> > > > With kind regards,
> > > >
> > > > Dominic Stuart
> > > >
> > > > --- In c..., "Richard Williams" wrote:
> > > > >
> > > > > d.stuartnl,
> > > > >
> > > > > I notice that the code, during the first loop, checks for the
> termination value
> > > > > then throws away the first read values (by reading from read1 and read2
> again).
> > > > > is that what you wanted to do?
> > > > >
> > > > > Execution could be made much faster, by eliminating the calculations
> related to
> > > > > 'x' by using pointers to:
> > > > > CH1.deloggedData,
> > > > > CH2.deloggedData,
> > > > > CH3.deloggedData,
> > > > > CH4.deloggedData,
> > > > > CH5.deloggedData,
> > > > > CH6.deloggedData.
> > > > > Initialize the pointers before the loop and increment them at the end of the
> > > loop.
> > > > > Also, eliminate 'x' and related calculation by precalculating the end
> address
> > > > > for the loop as:
> > > > > const endCH1 = &CH1.deloggedData[0x1000];
> > > > > const termValue = 0x84825131;
> > > > >
> > > > > pCH1 = &CH1.deloggedData[0];
> > > > > pCH2 = &CH2.deloggedData[0];
> > > > > --- // rest of initialization
> > > > > while( pCH1 < endCH1 )
> > > > > {
> > > > > ---// processing
> > > > > pCH1++;
> > > > > pCh2++;
> > > > > ...// rest of incrementing
> > > > > } // end while()
> > > > >
> > > > > to avoid processing the termination value from *read1
> > > > > and to exit when the termination value is read:
> > > > > The first code within the 'while' loop would be:
> > > > > tmpRead1 = *read1;
> > > > > if (tmpRead1 == termValue ) break;
> > > > > tmpRead2 = *read2;
> > > > >
> > > > > R. Williams
> > > > >
> > > > >
> > > > > ---------- Original Message -----------
> > > > > From: "d.stuartnl"
> > > > > To: c...
> > > > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > >
> > > > > > R. Williams,
> > > > >
> > > > > >
> > > > > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > > > > I've been analyzing the compilers feedback and it's stating that it
> > > > > > cannot implement software pipelining because there's a function call
> > > > > > (AddSample()) in the loop. I've removed the AddSample() function and
> > > > > > put the code from the function directly into the loop (see source),
> > > > > > there's still some problems (Disqualified loop: Loop carried
> > > > > > dependency bound too large). But I'm working on it :) I've also found
> > > > > > out that pipelining is not being used in a lot of my loops so I'm
> > > > > > guessing if I adjust my C-code so that software pipelining will be
> > > > > > possible I will notice an increase in performance.
> > > > > >
> > > > > > Source:
> > > > > >
> > > > > > read1 = (int*) 0x90300004;
> > > > > > read2 = (int*) 0x90300008;
> > > > > >
> > > > > > tmpRead1 = *read1;
> > > > > > tmpRead2 = *read2;
> > > > > > x = 0;
> > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > {
> > > > > > tmpRead1 = *read1;
> > > > > > tmpRead2 = *read2;
> > > > > >
> > > > > > CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > // FWS R+L Add
> > > > > > if(LRneeded == 1)
> > > > > > {
> > > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > if(CH1.deloggedData[x] > 5000)
> > > > > > {
> > > > > > CH1.deloggedData[x] = 5000;
> > > > > > }
> > > > > > }
> > > > > > CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > > CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > > > > CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > > > > CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > > > > x++;
> > > > > > }
> > > > > >
> > > > > > With kind regards,
> > > > > >
> > > > > > Dominic
> > > > > >
> > > > > > >
> > > > > > > However, your idea of just using the read operation, since it is
> much longer
> > > > > > > than a write, is a good one.
> > > > > > >
> > > > > > > R. Williams
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------- Original Message -----------
> > > > > > > From: Jeff Brower
> > > > > > > To: Dominic Stuart
> > > > > > > Cc: c...
> > > > > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > > > >
> > > > > > > > Dominic-
> > > > > > > >
> > > > > > > > > I am indeed trying to avoid delay in processing flow. The data needs
> > > to be
> > > > > > > > > decompressed asap. When that is done the DSP performs calculations
> > > on the
> > > > > > > > > data and based on the outcome of those calculations the DSP
> generates a
> > > > > > > > > trigger (GPIO). Your idea of a code loop got me thinking... If a
> read
> > > > > > > > > always takes longer than a write, I don't have to pull the Empty
> > > Flag and
> > > > > > > > > can just read the data through a loop like so:
> > > > > > > > >
> > > > > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > > > > {
> > > > > > > > > Calculator_AddSample();
> > > > > > > > > }
> > > > > > > >
> > > > > > > > Ok, so what you're saying is that once you see a "not empty" flag,
> > > > > > > > then you know the agent on the other side of the FIFO is writing a
> > > > > > > > known block size, and will write it faster than you can read, so your
> > > > > > > > code just needs to read.
> > > > > > > >
> > > > > > > > > I've tested this and it did improve the performance but nothing
> > > shocking,
> > > > > > > > > it seems the decompressing via the LookUp Table is creating the
> bottle
> > > > > > > > > neck. I've already split the two dimensional LUT into 2 one
> dimensional
> > > > > > > > > array's. This also helped a bit.
> > > > > > > >
> > > > > > > > One thing you might try is hand-optimized asm code just for the
> read /
> > > > > > > > look-up sequence, using techniques that Richard was describing. If
> > > > > > > > you take advantage of the pipeline, you can improve performance. For
> > > > > > > > example you can read sample N, then in the next 4 instructions
> process
> > > > > > > > the lookup on N-1, waiting for N to become valid. It sounds to me
> > > > > > > > like it wouldn't be that much code in your loop, maybe a dozen or
> less
> > > > > > > > asm instructions.
> > > > > > > >
> > > > > > > > -Jeff
> > > > > > > >
> > > > > > > > PS. Please post to the group, not to me. Thanks.
> > > > > > > >
> > > > > > > > > --- In c..., Jeff Brower wrote:
> > > > > > > > > >
> > > > > > > > > > Dominic-
> > > > > > > > > >
> > > > > > > > > > > Thanks for the information, I think I will refrain from
> using block
> > > > > > > > > > > transfers because I want to process the data as the DSP
> receives it.
> > > > > > > > > > .
> > > > > > > > > > .
> > > > > > > > > > .
> > > > > > > > > >
> > > > > > > > > > > At the moment I am starting this "prefetch" function when a
> burst
> > > > > > > > > > > starts and execute this function every time there is data
> available
> > > > > > > > > > > in the FIFO's (polling the Empty Flag). I'm prefeteching
> 27.6% of
> > > > > > > > > > > the data before the burst ends. All variables are in IRAM.
> > > > > > > > > >
> > > > > > > > > > The typical reason for doing it that way is to avoid delay
> > > (latency) in
> > > > > > > your signal
> > > > > > > > > > processing flow, relative to some output (DAC, GPIO line, digital
> > > > > > > transmission,
> > > > > > > > > > etc). Is that the case? If not then a block based method
> would be
> > > > > > > better, otherwise
> > > > > > > > > > you will waste a lot of time polling for each element. You don't
> > > have to
> > > > > > > implement
> > > > > > > > > > DMA as a first step to get that working, you could use a code
> > > loop. Then
> > > > > > > implement
> > > > > > > > > > DMA in order to further improve performance.
> > > > > > > > > >
> > > > > > > > > > -Jeff
> > > > > > > > > >
> > > > > > > > > > > My function looks like this:
> > > > > > > > > > >
> > > > > > > > > > > void Calculator_AddSample()
> > > > > > > > > > > {
> > > > > > > > > > > x++;
> > > > > > > > > > >
> > > > > > > > > > > read1 = (int*) 0x90300004;
> > > > > > > > > > > read2 = (int*) 0x90300008;
> > > > > > > > > > >
> > > > > > > > > > > tmpRead1 = *read1;
> > > > > > > > > > > tmpRead2 = *read2;
> > > > > > > > > > >
> > > > > > > > > > > // CHANNEL 1
> > > > > > > > > > > CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > > > > // CHANNEL 2
> > > > > > > > > > > CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >>
> 24)];
> > > > > > > > > > > // FWS R+L Add
> > > > > > > > > > > if(LRneeded == 1)
> > > > > > > > > > > {
> > > > > > > > > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > > > > if(CH1.deloggedData[x] > 5000)
> > > > > > > > > > > {
> > > > > > > > > > > CH1.deloggedData[x] = 5000;
> > > > > > > > > > > }
> > > > > > > > > > > }
> > > > > > > > > > > // CHANNEL 3 this channel is always read for particle
> matching on
> > > > > > > this channel
> > > > > > > > > > > binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > > > > CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > > > > >
> > > > > > > > > > > // CHANNEL 4
> > > > > > > > > > > CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > > > > // CHANNEL 5
> > > > > > > > > > > CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > > > > // CHANNEL 6
> > > > > > > > > > > CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > > > > }
> > > > > > > > > > > This function executes 2 reads from 2 different FIFO's and then
> > > > > > > seperates the different datachannels and decompresses the value's with a
> > > LookUp
> > > > > > > Table.
> > > > > > > > > > >
> > > > > > > > > > > I am trying to streamline this function so it can keep up
> with the
> > > > > > > incoming data. The data is written to the FIFO's with 4MHz. The data
> > > consists of
> > > > > > > small burst packets ranging from 3 to 4096 bytes per channel.
> > > > > > > > > > >
> > > > > > > > > > > At the moment I am starting this "prefetch" function when a
> > > burst starts
> > > > > > > and execute this function every time there is data available in the
> FIFO's
> > > > > > > (polling the Empty Flag). I'm prefeteching 27.6% of the data before the
> > > burst
> > > > > > > ends. All variables are in IRAM.
> > > > > > > > > > >
> > > > > > > > > > > I think I made an error in suspecting the EMIF transfer speed
> > > and I now
> > > > > > > suspect that there may be some overhead in the polling scheme I use for
> > > calling
> > > > > > > this function that results in the slow transfer speed. I will look into
> > > this. I
> > > > > > > would like to thank everyone for there input.
> > > > > > > > > > >
> > > > > > > > > > > With kind regards,
> > > > > > > > > > >
> > > > > > > > > > > Dominic
> > > > > > > > > > >
> > > > > > > > > > > --- In c..., Adolf Klemenz
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Dominic,
> > > > > > > > > > > >
> > > > > > > > > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > > > > > > > > >as I understand DMA, I would need to work in "blocks" of
> data but
> > > > > that
> > > > > > > > > > > > >would be very tricky in my application since I do not
> know how
> > > > > big the
> > > > > > > > > > > > >datastream is gonna be. Or is it possible to use DMA for
> > > single byte
> > > > > > > transfers?
> > > > > > > > > > > >
> > > > > > > > > > > > using DMA makes sense for block transfers only. Typical Fifo
> > > > > applications
> > > > > > > > > > > > will use the Fifo's half-full flag (or a similar signal) to
> > > > > trigger a DMA
> > > > > > > > > > > > block read.
> > > > > > > > > > > > You may use element-synchronized DMA (each trigger
> transfers only
> > > > > one data
> > > > > > > > > > > > word), but there will be no speed improvement: It takes about
> > > > > 100ns from
> > > > > > > > > > > > the EDMA sync event to the actual data transfer on a C6713.
> > > > > > > > > > > >
> > > > > > > > > > > > Attached is a scope screenshot generated by this test program
> > > > > > > > > > > >
> > > > > > > > > > > > // compiled with -o2 and without debug info:
> > > > > > > > > > > >
> > > > > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > > > > // optimizer from code removal
> > > > > > > > > > > > for (;;)
> > > > > > > > > > > > {
> > > > > > > > > > > > buffer = *(volatile int*)0x90300000;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > The screenshot shows chip select and read signal with the
> expected
> > > > > timings
> > > > > > > > > > > > (20ns strobe width). The gap between sucessive reads is
> caused by
> > > > > the DSP
> > > > > > > > > > > > architecture. Here it is 200ns because a 225MHz DSP was used,
> > > > > which should
> > > > > > > > > > > > translate to 150ns on a 300MHz device.
> > > > > > > > > > > >
> > > > > > > > > > > > If this isn't fast enough, you must use block transfers.
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Adolf Klemenz, D.SignT
> > > > > > > > > >
> > > > > > > ------- End of Original Message -------
> > > > > > >
> > > > > ------- End of Original Message -------
> > > > >
> > > ------- End of Original Message -------
> > >
> ------- End of Original Message -------
>

_____________________________________
d.stuartnl,

my comments in-line and prefixed with

R. Williams
---------- Original Message -----------
From: "d.stuartnl"
To: c...
Sent: Fri, 24 Jul 2009 09:26:55 -0000
Subject: [c6x] Re: Slow EMIF transfer

> R.Williams,
>
> SUCCESS! Looptime has almost halved! Software pipelining is working
> now thanks to your tips:

congratulations!!



>
> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> is working fine as it is. Don't know why, though.

two reasons:
1) the data size of a float is the same as the address data size
2) the '-1' because the pCH1 pointer is incremented at the end of the loop
to point 1 past the last location used.
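Richard's first point can be seen directly in C: pointer subtraction is defined in elements of the pointed-to type, not bytes, so (pCH1 - &CH1.deloggedData[0]) already yields a sample count with no division by sizeof(float). A small stand-alone illustration with a hypothetical buffer (names and the 985-sample count are just for the demo):

```c
#include <assert.h>

/* Pointer difference counts elements, not bytes. */
int count_processed(void)
{
    static float data[0x1000];   /* stands in for CH1.deloggedData */
    float *p = &data[0];
    int i;

    for (i = 0; i < 985; i++)    /* pretend 985 samples were stored */
        *p++ = 0.0f;

    /* p now points one past the last element written; the difference
       is already in floats, so no scaling by sizeof(float) is needed */
    return (int)(p - &data[0]);
}
```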


> >
> still have them in a single loop and it's pipelining. Do you think
> it's worth considering splitting it into two loops and check if
> there's (an even better) speed increase?

you could experiment, but it looks like it is not necessary to separate
the code into two loops.


> My new and improved function:


> // CHANNEL 3 this channel is always read for particle matching
> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 > LUT0[((tmpRead2 & 0xFF00) >> 8)];

There seems to be a problem in the editing of the above 4 lines.
It looks like pCH3 is not being used; however, pCH3 is still being initialized
and incremented in the code.
Also, when testing for execution speed, adding new operations (pBinData3) makes
it very difficult to make timing comparisons.


>
> As you might have seen in my code the second read (tempRead2) is a 32
> bits int but I'm only interrested in the first 16 bits (where channel
> 3 and 4 reside), is there a way i can inform the compiler

the natural size of an operation is 32 bits; changing to a 16-bit operation
would slow the code execution.

>
> I had to leave pFifo12 and pFifo3 volatile because when i removed
> these keywords the software pipelining was disabled again (Cannot find
> schedule).

the 'volatile' is needed for the two parameters because they DO change
between reads. I had suggested to remove the 'volatile' from the variables, not
the parameters.


>
> With kind regards,
>
> Dominic


_____________________________________
Congratulations, Dominic!!

I'll top post this minor comment wrt 16/32 bit memory accesses and speed.

Assuming that you have 32 bit wide memory with aligned accesses, 32,
16, and 8 bit accesses will be the same speed.
Only if your external memory is 8 or 16 bits wide would there be any
potential advantage in performing 16 bit accesses instead of 32 bit
accesses.
Also, there would be an advantage in fetching 32 bits at a time if you fetch
an entire array of 8- or 16-bit values.

I haven't looked at the details of your code, but if you always fetch
48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
that your hardware addresses are preventing you from picking up some
additional speed. *If* the input addresses began on a 64 bit boundary
[0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
any memory fetch would coerce the compiler to performing an 'LDDW' [64
bit read].
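The split after such a 64-bit fetch is just shifts and masks. A host-runnable sketch (not DSP code): on the C6713 the compiler could emit LDDW for an aligned long long load, and which FIFO word lands in which half depends on the device's endian configuration; this sketch assumes little-endian, so the low half is the word at the lower address.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 64-bit fetch into the two 32-bit words it covers.
   Assumes little-endian: the low half is the word at the lower address. */
void split64(uint64_t both, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(both & 0xFFFFFFFFu);   /* word at the lower address */
    *hi = (uint32_t)(both >> 32);           /* word at the higher address */
}
```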

Since your hardware addresses are fixed, you only need 1 pointer. You could use
tmpRead2 = *(read1 + 1); /* read1 is an int*, so +1 advances 4 bytes, to 0x90300008 */
This would free up one register and, depending on register
utilization, could improve the performance.
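One detail to keep in mind with the single-pointer trick: pointer arithmetic in C is scaled by the pointed-to type, so for an int* the step from 0x90300004 to 0x90300008 (4 bytes apart) is an element offset of 1, not 4. A quick host-side check with a stand-in array (the names and values are invented for the demo):

```c
#include <assert.h>

/* For an int*, p + 1 advances sizeof(int) bytes, i.e. the next 32-bit word. */
int neighbour_read(void)
{
    int fifo_image[4] = { 10, 20, 30, 40 };  /* stand-in for the EMIF window */
    int *read1 = &fifo_image[1];             /* plays the role of 0x90300004 */
    return *(read1 + 1);                     /* the word 4 bytes later */
}
```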

mikedunn
On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
> d.stuartnl,
>
> my comments in-line and prefixed with R. Williams
>
> ---------- Original Message -----------
> From: "d.stuartnl"
> To: c...
> Sent: Fri, 24 Jul 2009 09:26:55 -0000
> Subject: [c6x] Re: Slow EMIF transfer
>
>> R.Williams,
>>
>> SUCCESS! Looptime has almost halved! Software pipelining is working
>> now thanks to your tips:
>
> congratulations!!
>
> >
>> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
>> is working fine as it is. Dont know why though.
>
> two reasons:
> 1) the data size of a float is the same as the address data size
> 2) the '-1' because the pCH1 pointer is incremented at the end of the loop
> to point 1 past the last location used.
> > >
>> still have them in a single loop and it's pipelining. Do you think
>> it's worth considering splitting it into two loops and check if
>> there's (an even better) speed increase?
>
> you could experiment, but it looks like it is not necessary to
> separate
> the code into two loops.
> > My new and improved function:
> > // CHANNEL 3 this channel is always read for particle matching
>> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 >> LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
> there seems to be a problem in the editing of the above 4 lines
> It looks like pCH3 is not being used; however, pCH3 is still being
> initialized
> and incremented in the code.
> Also when testing for execution speed, adding new operations (pBinData3)
> makes
> it very difficult to make timing comparisons.
> >
>> As you might have seen in my code the second read (tempRead2) is a 32
>> bits int but I'm only interrested in the first 16 bits (where channel
>> 3 and 4 reside), is there a way i can inform the compiler
>
> the natural size of a operation is 32bits, changing to a 16 bit
> operation
> would slow the code execution.
>
>>
>> I had to leave pFifo12 and pFifo3 volatile because when i removed
>> these keywords the software pipelining was disabled again (Cannot find
>> schedule).
>
> the 'volatile' is needed for the two parameters because they DO change
> between reads. I had suggested to remove the 'volatile' from the variables,
> not
> the parameters.
> >
>> With kind regards,
>>
>> Dominic
>
--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________
Thanks Mike,

I'm starting to enjoy this "tweaking" and am trying to push it as far as I can, because every microsecond I gain means the DSP can handle more particles/second. I've applied the tips I've gotten on this forum to the rest of my source code as well (the actual loops in my program that do the calculations on the data), and those are pipelining as well now. Compared to the initial source, the total improvement is over 900%! Amazing (looks like I was using the DSP as a glorified MCU), but the true power of the DSP is starting to show! I thank you for your input, but it raises some questions if you don't mind:

--- In c..., Michael Dunn wrote:
>
> Congratulations, Dominic!!
>
> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
>
> Assuming that you have 32 bit wide memory with aligned accesses, 32,
> 16, and 8 bit accesses will be the same speed.

What exactly do you mean by "aligned"?
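For context, "aligned" generally means the address is a multiple of the access size: a 32-bit load wants the low two address bits clear, a 64-bit load the low three. The EMIF addresses in this thread (0x90300004, 0x90300008) are both 4-byte aligned, but only 0x90300008 sits on an 8-byte boundary. A tiny check:

```c
#include <assert.h>
#include <stdint.h>

/* An address is aligned for a given access size when it is
   an exact multiple of that size (size in bytes). */
int is_aligned(uintptr_t addr, unsigned size)
{
    return (addr % size) == 0;
}
```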

> Only if your external memory is 8 or 16 bits wide would there be any
> potential advantage in performing 16 bit accesses instead of 32 bit
> accesses.
> Also, there would be an advantage in fetching 32 bits at a time if you
> an entire array of 8 or 16 bit values.
>

I'm reading from 3 (16-bit) FIFOs. I've hooked them up so tmpRead1 reads the first two together (their logic is tied together so they "act" like one 32-bit-wide FIFO). tmpRead2 reads the 3rd FIFO (the first 16 bits of the read).

> I haven't looked at the details of your code, but if you always fetch
> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
> that your hardware addresses are preventing you from picking up some
> additional speed. *If* the input addresses began on a 64 bit boundary
> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
> any memory fetch would coerce the compiler to performing an 'LDDW' [64
> bit read].

I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by telling my compiler to do a 64-bit read? (I mean, this still has to be split into 2 read cycles somehow?)
>
> Since your hardware addresses are fixed, you only need 1 pointer. You could use
> tmpRead2 = *(read1 + 1);
> This would free up one register and, depending on register
> utilization, could improve the performance.
>

Improve performance, that's what I like to hear ;) I hope my questions aren't too "basic".

Dominic

> mikedunn
> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams wrote:
> >
> >
> > d.stuartnl,
> >
> > my comments in-line and prefixed with
> >
> > R. Williams
> >
> > ---------- Original Message -----------
> > From: "d.stuartnl"
> > To: c...
> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
> > Subject: [c6x] Re: Slow EMIF transfer
> >
> >> R.Williams,
> >>
> >> SUCCESS! Looptime has almost halved! Software pipelining is working
> >> now thanks to your tips:
> >
> > congratulations!!
> >
> >
> >
> >>
> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> >> is working fine as it is. Don't know why though.
> >
> > two reasons:
> > 1) the data size of a float is the same as the address data size
> > 2) the '-1' because the pCH1 pointer is incremented at the end of the loop
> > to point 1 past the last location used.
> >
> >
> >> >
> >> still have them in a single loop and it's pipelining. Do you think
> >> it's worth considering splitting it into two loops and check if
> >> there's (an even better) speed increase?
> >
> > you could experiment, but it looks like it is not necessary to
> > separate
> > the code into two loops.
> >
> >
> >> My new and improved function:
> >
> >
> >> // CHANNEL 3 this channel is always read for particle matching on this channel
> >> *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> >> *pBinData3 = tmpRead2 & 0xFF;
> >> // CHANNEL 4
> >> *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >
> > there seems to be a problem in the editing of the above 4 lines
> > It looks like pCH3 is not being used; however, pCH3 is still being
> > initialized
> > and incremented in the code.
> > Also when testing for execution speed, adding new operations (pBinData3)
> > makes
> > it very difficult to make timing comparisons.
> >
> >
> >>
> >> As you might have seen in my code the second read (tempRead2) is a 32
> >> bits int but I'm only interrested in the first 16 bits (where channel
> >> 3 and 4 reside), is there a way i can inform the compiler
> >
> > the natural size of an operation is 32 bits; changing to a 16-bit
> > operation would slow the code execution.
> >
> >>
> >> I had to leave pFifo12 and pFifo3 volatile because when i removed
> >> these keywords the software pipelining was disabled again (Cannot find
> >> schedule).
> >
> > the 'volatile' is needed for the two parameters because they DO change
> > between reads. I had suggested to remove the 'volatile' from the variables,
> > not
> > the parameters.
> >
> >
> >>
> >> With kind regards,
> >>
> >> Dominic
> >
> > --
> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>

_____________________________________
I just wanted to mention that I've enjoyed following this thread, because it
started with somewhat general code and has allowed me to follow along as you've
learned a bunch of optimization tricks. I've gone through several of these
stages over the past couple of years, but that doesn't mean that I remember
to use the tricks all the time. Seeing them again has been helpful.

Wim.

_____________________________________
Dominic,

On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl wrote:
> --- In c..., Michael Dunn wrote:
>>
>> Congratulations, Dominic!!
>>
>> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
>>
>> Assuming that you have 32 bit wide memory with aligned accesses, 32,
>> 16, and 8 bit accesses will be the same speed.
>
> What do you mean by "aligned", exactly?

'Evenly divisible by the access size' or if 'myAddress % myAccessSize
== 0' then it is aligned.
For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
are aligned [byte addresses are always aligned].
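
Mike's rule is just a modulo test in C. A throwaway sketch (nothing C6x-specific about it; the addresses in the comments are the FIFO addresses from earlier in the thread):

```c
#include <stdint.h>

/* An address is aligned for a given access size when it is evenly
   divisible by that size, i.e. myAddress % myAccessSize == 0. */
static int is_aligned(uintptr_t address, unsigned access_size)
{
    return (address % access_size) == 0;
}
```

So 0x90300004 is aligned for a 32-bit (4-byte) access, but not for a 64-bit (8-byte) one.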

>
>> Only if your external memory is 8 or 16 bits wide would there be any
>> potential advantage in performing 16 bit accesses instead of 32 bit
>> accesses.
>> Also, there would be an advantage in fetching 32 bits at a time if you read
>> an entire array of 8 or 16 bit values.
> I'm reading from three 16-bit FIFOs. I've hooked them up so tmpRead1 reads
> the first two together (logic tied so they "act" like one 32-bit-wide
> FIFO). tmpRead2 reads the third FIFO (the lower 16 bits).
>
>> I haven't looked at the details of your code, but if you always fetch
>> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
>> that your hardware addresses are preventing you from picking up some
>> additional speed. *If* the input addresses began on a 64 bit boundary
>> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
>> any memory fetch would coerce the compiler to performing an 'LDDW' [64
>> bit read].
>
> I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by telling my
> compiler to do a 64-bit read? (I mean, this still has to be split into 2 read
> cycles somehow?)

First of all, I wrote this before I had the idea of using a single
pointer. Your code has 2 pointers that load data - this means that
you are using 4 processor registers. Changing to a single 64 bit read
[32 x 2] would result in requiring only 3 registers. If your routine
has a lot of register pressure [utilization] where it is loading and
unloading CPU registers, then a 'register reduction change' would help
performance.

As I finished writing about the double read, I thought of 'plan B' -
just use one pointer with an offset. When you look at the asm
listing, it should give you some register usage info. If you are
getting 'spills' then definitely try this.
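
A hypothetical sketch of the one-pointer idea. One caveat on the `*(read1 + 4)` suggestion below: with a `uint32_t` pointer, the two FIFO words (0x90300004 and 0x90300008 on the real hardware) are one *element* apart, so the element offset is 1 even though the byte offset is 4. The array in the test stands in for the memory-mapped FIFOs:

```c
#include <stdint.h>

/* Both FIFO words read through a single base pointer, freeing one
   register pair. fifo_base would point at the lower FIFO address. */
static void fetch_pair(const volatile uint32_t *fifo_base,
                       uint32_t *read1, uint32_t *read2)
{
    *read1 = fifo_base[0];  /* e.g. 0x90300004: channels 2,1,5,6 */
    *read2 = fifo_base[1];  /* e.g. 0x90300008: channels 4,3 in low 16 bits */
}
```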

>
>>
>> Since your hardware addresses are fixed, you only need 1 pointer. You
>> could use
>> tmpRead2 = *(read1 + 4);
>> This would free up one register and, depending on register
>> utilization, could improve the performance.
> Improve performance, that's what I like to hear ;) I hope my questions aren't
> too "basic".

Most active members of this group are willing to help someone who
wants to learn. As long as your questions are informed and you show a
willingness to participate, most of us will help if we can. We come
from a variety of backgrounds, and each of us ends up learning something
from time to time.

As you are learning, 'performance improvement' is not something that
has a single solution. Rather, it is a journey with many stops along
the way.

mikedunn

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________
Dear Mike,

I have been away on holiday; I hope the people who are reading this had a holiday too (and a good one). Right, back to work it is! I was wondering about your earlier idea about longer reads to force the compiler to use the LDDW instruction. So my question is: how do I do this?

The 2 reads are as follows:

tmpRead1 = *(volatile int*) 0x90300004;
tmpRead2 = *(volatile int*) 0x90300008;

These are both 32-bit reads, but I only need 48 bits in total (32 bits from tmpRead1, and the 16 least significant bits of tmpRead2). Furthermore, these 2 reads represent 6 (8-bit) channels:

read:            tmpRead2                            tmpRead1
      MSB                          LSB   MSB                          LSB
bit : ********************************   ********************************
use : ----------------CHANNEL4CHANNEL3   CHANNEL2CHANNEL1CHANNEL5CHANNEL6

(forgive my primitive ASCII art :P)
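
Given that layout, one sketch of what a 64-bit fetch could unpack to. This is hypothetical: whether the C6x compiler actually emits LDDW depends on the pointer being 8-byte aligned, and the real 0x90300004 address is not (the FIFO pair would have to be re-mapped to an 8-byte boundary such as 0x90300000, as Mike noted):

```c
#include <stdint.h>

/* Unpack the six 8-bit channels from one 64-bit value, following the
   diagram above: low word = tmpRead1 (CH2|CH1|CH5|CH6), high word =
   tmpRead2 (unused|unused|CH4|CH3). */
static void unpack_channels(uint64_t raw, uint8_t ch[6])
{
    uint32_t read1 = (uint32_t)raw;         /* tmpRead1 */
    uint32_t read2 = (uint32_t)(raw >> 32); /* tmpRead2 */

    ch[0] = (read1 >> 16) & 0xFF; /* CHANNEL 1 */
    ch[1] = (read1 >> 24) & 0xFF; /* CHANNEL 2 */
    ch[2] =  read2        & 0xFF; /* CHANNEL 3 */
    ch[3] = (read2 >> 8)  & 0xFF; /* CHANNEL 4 */
    ch[4] = (read1 >> 8)  & 0xFF; /* CHANNEL 5 */
    ch[5] =  read1        & 0xFF; /* CHANNEL 6 */
}
```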

In my fetchData routine (which is pipelining!!) I fetch the data into these 2 variables and then distribute the data over the 6 channels. My code is as follows:

unsigned int Calculator_FetchData(Bool curvature)
{
    unsigned int tmpRead1 = 0;
    unsigned int tmpRead2;
    unsigned int sampleCount;
    float * restrict pCH1;
    float * restrict pCH2;
    float * restrict pCH3;
    char * restrict pBinData3;
    float * restrict pCH4;
    float * restrict pCH5;
    float * restrict pCH6;

    const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
    const int termValue = 0x84825131;

    pCH1 = &CH1.deloggedData[0];
    pCH2 = &CH2.deloggedData[0];
    pCH3 = &CH3.deloggedData[0];
    pBinData3 = &binData3[0];
    pCH4 = &CH4.deloggedData[0];
    pCH5 = &CH5.deloggedData[0];
    pCH6 = &CH6.deloggedData[0];

    #pragma MUST_ITERATE(16,4096,2);
    while(tmpRead1 != termValue)
    {
        tmpRead1 = *(volatile int*) 0x90300004;
        tmpRead2 = *(volatile int*) 0x90300008;

        // CHANNEL 1
        *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];

        // CHANNEL 2
        *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];

        if(curvature)
        {
            *pCH1 += *pCH2;
            if(*pCH1 > 5000)
            {
                *pCH1 = 5000;
            }
        }

        // CHANNEL 5
        *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

        // CHANNEL 6
        *pCH6 = LUT1[tmpRead1 & 0xFF];

        // CHANNEL 3: always read, for particle matching on this channel
        *pBinData3 = tmpRead2 & 0xFF;
        *pCH3 = LUT0[*pBinData3];

        // CHANNEL 4
        *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

        pCH1++;
        pCH2++;
        pCH3++;
        pBinData3++;
        pCH4++;
        pCH5++;
        pCH6++;

        if(pCH1 > endCH1) // Check for sample overflow (4096 samples max)
        {
            tmpRead1 = termValue;
        }
    }
    sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) - 2;
    Screen_updateSamples(sampleCount);
    return sampleCount;
}
At the moment I'm getting the following pipeline information in the ASM file:

_Calculator_FetchData:
;** --*
;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 219
;* Loop opening brace source line : 220
;* Loop closing brace source line : 266
;* Known Minimum Trip Count : 16
;* Known Maximum Trip Count : 4096
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 7
;* Unpartitioned Resource Bound : 9
;* Partitioned Resource Bound(*) : 9
;* Resource Partition:
;* A-side B-side
;* .L units 2 1
;* .S units 4 4
;* .D units 8 9*
;* .M units 0 0
;* .X cross paths 1 1
;* .T address paths 9* 8
;* Long read paths 5 4
;* Long write paths 0 0
;* Logical ops (.LS) 1 1 (.L or .S unit)
;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 4 3
;* Bound(.L .S .D .LS .LSD) 6 6
;*
;* Searching for software pipeline schedule at ...
;* ii = 9 Unsafe schedule for irregular loop
;* ii = 9 Did not find schedule
;* ii = 10 Unsafe schedule for irregular loop
;* ii = 10 Unsafe schedule for irregular loop
;* ii = 10 Did not find schedule
;* ii = 11 Unsafe schedule for irregular loop
;* ii = 11 Unsafe schedule for irregular loop
;* ii = 11 Did not find schedule
;* ii = 12 Unsafe schedule for irregular loop
;* ii = 12 Unsafe schedule for irregular loop
;* ii = 12 Did not find schedule
;* ii = 13 Unsafe schedule for irregular loop
;* ii = 13 Unsafe schedule for irregular loop
;* ii = 13 Unsafe schedule for irregular loop
;* ii = 13 Did not find schedule
;* ii = 14 Unsafe schedule for irregular loop
;* ii = 14 Unsafe schedule for irregular loop
;* ii = 14 Unsafe schedule for irregular loop
;* ii = 14 Did not find schedule
;* ii = 15 Unsafe schedule for irregular loop
;* ii = 15 Unsafe schedule for irregular loop
;* ii = 15 Unsafe schedule for irregular loop
;* ii = 15 Did not find schedule
;* ii = 16 Unsafe schedule for irregular loop
;* ii = 16 Unsafe schedule for irregular loop
;* ii = 16 Unsafe schedule for irregular loop
;* ii = 16 Did not find schedule
;* ii = 17 Unsafe schedule for irregular loop
;* ii = 17 Unsafe schedule for irregular loop
;* ii = 17 Unsafe schedule for irregular loop
;* ii = 17 Did not find schedule
;* ii = 18 Unsafe schedule for irregular loop
;* ii = 18 Unsafe schedule for irregular loop
;* ii = 18 Unsafe schedule for irregular loop
;* ii = 18 Did not find schedule
;* ii = 19 Unsafe schedule for irregular loop
;* ii = 19 Schedule found with 1 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |* * ********* | ***** * *** |
;* 1: |* * ********** | ***** * *** |
;* 2: |* * ********** | ***** * *** |
;* 3: |* * ********** | ***** * *** |
;* 4: |* * ********** | ***** * *** |
;* 5: |* * ********** | ***** * *** |
;* 6: |* ************* | ***** * *** |
;* 7: |* **************| ***** * *** |
;* 8: |* **************| ***** ** *** |
;* 9: |* **************| ******** *** |
;* 10: |* **************|********* *** |
;* 11: |****************| ***** ****** |
;* 12: |****************| ***** ****** |
;* 13: |****************| ************ |
;* 14: |* **************| ***** ****** |
;* 15: |* **************| ***** * **** |
;* 16: |*************** | ******* *** |
;* 17: |*** * ********* | ******* *** |
;* 18: |*** * ********* | ******* *** |
;* +---------------------------------+
;*
;* Done
;*
;* Loop is interruptible
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;*
;* Minimum safe trip count : 1

Looking at the register usage, it looks like it's using quite a lot of registers, and I thought that maybe LDDW would relieve some registers. Also, I was wondering if I can force-align arrays in memory? If that's possible, I can use 1 pointer to access all the channels (by using an offset when addressing):
----------------------------
IDEA:
// CHANNEL 1
*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];

// CHANNEL 2
*(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];

// OTHER CHANNELS addressed using bigger offsets...
-------------------------------

I don't know if this idea is feasible, but if it is, I think it would take some more pressure off the register usage.
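
A sketch of what that idea could look like if the six channel buffers were laid out back to back (CH_STRIDE, allChannels, and store_sample are made-up names for illustration; on a TI compiler, a single array or struct, possibly with #pragma DATA_ALIGN, would be the usual way to force the placement):

```c
#define CH_STRIDE 0x1000  /* samples per channel buffer, as in deloggedData */

/* All six channel buffers contiguous, so one base pointer plus a
   compile-time offset reaches any channel's current sample. */
static float allChannels[6 * CH_STRIDE];

static void store_sample(float *base, int channel, int x, float value)
{
    *(base + channel * CH_STRIDE + x) = value;
}
```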

Anyone's ideas/comments are welcome. At the moment the code is running 33% too slow. If I offer 4000 samples @ 4 MHz, the data takes 1000 us to load into my FIFOs, but it takes the DSP 1500 us to run the FetchData routine. Ideally I would like to complete the FetchData routine in 1000 us (not any shorter, or I would read faster than data is being written :P).

With kind regards,

Dominic Stuart

_____________________________________