DSPRelated.com
Forums

Slow EMIF transfer

Started by d.st...@yahoo.com June 23, 2009
Thanks for the information. I think I will refrain from using block transfers because I want to process the data as the DSP receives it. My function looks like this:

void Calculator_AddSample()
{
    x++;

    read1 = (int *) 0x90300004;
    read2 = (int *) 0x90300008;

    /* volatile casts force a real EMIF read on every call */
    tmpRead1 = *(volatile int *) read1;
    tmpRead2 = *(volatile int *) read2;

    // CHANNEL 1
    CH1.deloggedData[x] = LUT[0][(tmpRead1 & 0xFF0000) >> 16];
    // CHANNEL 2
    CH2.deloggedData[x] = LUT[0][(tmpRead1 & 0xFF000000) >> 24];
    // FWS R+L add
    if (LRneeded == 1)
    {
        CH1.deloggedData[x] += CH2.deloggedData[x];
        if (CH1.deloggedData[x] > 5000)
        {
            CH1.deloggedData[x] = 5000;
        }
    }
    // CHANNEL 3: always read; used for particle matching
    binData[x] = tmpRead2 & 0xFF;
    CH3.deloggedData[x] = LUT[0][tmpRead2 & 0xFF];

    // CHANNEL 4
    CH4.deloggedData[x] = LUT[0][(tmpRead2 & 0xFF00) >> 8];
    // CHANNEL 5
    CH5.deloggedData[x] = LUT[1][(tmpRead1 & 0xFF00) >> 8];
    // CHANNEL 6
    CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
}
This function executes two reads from two different FIFOs, then separates the different data channels and decompresses the values with a look-up table.

I am trying to streamline this function so it can keep up with the incoming data. The data is written to the FIFOs at 4 MHz, in small burst packets ranging from 3 to 4096 bytes per channel.

At the moment I start this "prefetch" function when a burst starts, and execute it every time there is data available in the FIFOs (polling the empty flag). I am only prefetching 27.6% of the data before the burst ends. All variables are in IRAM.

I think I was wrong to suspect the EMIF transfer speed; I now suspect that overhead in the polling scheme I use for calling this function is what makes the transfer slow. I will look into this. I would like to thank everyone for their input.

With kind regards,

Dominic
--- In c..., Adolf Klemenz wrote:
>
> Dear Dominic,
>
> At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> >as I understand DMA, I would need to work in "blocks" of data but that
> >would be very tricky in my application since I do not know how big the
> >datastream is gonna be. Or is it possible to use DMA for single byte transfers?
>
> Using DMA makes sense for block transfers only. Typical FIFO applications
> will use the FIFO's half-full flag (or a similar signal) to trigger a DMA
> block read.
> You may use element-synchronized DMA (each trigger transfers only one data
> word), but there will be no speed improvement: It takes about 100ns from
> the EDMA sync event to the actual data transfer on a C6713.
>
> Attached is a scope screenshot generated by this test program
>
> // compiled with -o2 and without debug info:
>
> volatile int buffer;   // must be volatile to prevent
>                        // optimizer from code removal
> for (;;)
> {
>     buffer = *(volatile int*)0x90300000;
> }
>
> The screenshot shows chip select and read signals with the expected timings
> (20 ns strobe width). The gap between successive reads is caused by the DSP
> architecture. Here it is 200 ns because a 225 MHz DSP was used, which should
> translate to 150 ns on a 300 MHz device.
>
> If this isn't fast enough, you must use block transfers.
>
> Best Regards,
> Adolf Klemenz, D.SignT
>

_____________________________________
Dominic-

> Thanks for the information, I think I will refrain from using block
> transfers because I want to process the data as the DSP receives it.
.
.
.

> At the moment I am starting this "prefetch" function when a burst
> starts and execute this function every time there is data available
> in the FIFO's (polling the Empty Flag). I'm prefeteching 27.6% of
> the data before the burst ends. All variables are in IRAM.

The typical reason for doing it that way is to avoid delay (latency) in your signal
processing flow, relative to some output (DAC, GPIO line, digital transmission,
etc). Is that the case? If not then a block based method would be better, otherwise
you will waste a lot of time polling for each element. You don't have to implement
DMA as a first step to get that working, you could use a code loop. Then implement
DMA in order to further improve performance.

-Jeff


_____________________________________
Dominic-

> I am indeed trying to avoid delay in processing flow. The data needs to be
> decompressed asap. When that is done the DSP performs calculations on the
> data and based on the outcome of those calculations the DSP generates a
> trigger (GPIO). Your idea of a code loop got me thinking... If a read
> always takes longer than a write, I don't have to pull the Empty Flag and
> can just read the data through a loop like so:
>
> while ((tmpRead1 != 0x84825131) && (x <= 0x1000))
> {
>     Calculator_AddSample();
> }

Ok, so what you're saying is that once you see a "not empty" flag, then you know the
agent on the other side of the FIFO is writing a known block size, and will write it
faster than you can read, so your code just needs to read.

> I've tested this and it did improve the performance, but nothing shocking;
> it seems the decompression via the look-up table is creating the
> bottleneck. I've already split the two-dimensional LUT into two
> one-dimensional arrays. This also helped a bit.

One thing you might try is hand-optimized asm code just for the read/look-up
sequence, using the techniques Richard was describing. If you take advantage of
the pipeline, you can improve performance: for example, you can read sample N,
then in the next 4 instructions process the look-up on N-1 while waiting for N
to become valid. It sounds to me like it wouldn't be much code in your loop,
maybe a dozen or fewer asm instructions.

-Jeff

PS. Please post to the group, not to me. Thanks.


_____________________________________
Hi Jeff,

I am indeed trying to create as little delay as possible. When the DSP has all the data (and has decompressed it), it needs to perform various calculations on it, and based on certain outcomes of those calculations it needs to generate a trigger (GPIO). So the faster I have my data, the faster I can start processing it. I think that if I work in blocks I actually lose time, because at the moment I can start decompressing from the first byte, whereas with blocks I could only start decompressing after a whole block has arrived.

Your code-loop suggestion got me thinking: if a read always takes more time than a write, I shouldn't have to poll the empty flag at all. I think I should be able to get the data like this:

Calculator_AddSample();
while ((tmpRead1 != 0x84825131) && (x <= 0x1000))
{
    Calculator_AddSample();
}

This should give me a bit more speed, since checking the empty flag is itself a read over the EMIF bus.

I am also thinking of moving some of the calculations into the fetching process, because the DSP already has the values in its registers while it is storing the data. I assume this would improve speed as well, considering the current code fetches the stored data and does the calculations afterwards. It may even be possible to perform the calculations in real time, so that I don't have to store the data at all and only store the outcomes of the calculations. If I implement this, I know for certain that fetching and calculating one read will take longer than the data being written into the FIFO.

With kind regards,

Dominic Stuart


_____________________________________
Dominic,

There are a couple of problems with the displayed code.
--'x' is not being incremented
--tmpRead1 is not being updated

However, your idea of just using the read operation, since it is much longer
than a write, is a good one.

R. Williams

---------- Original Message -----------
From: Jeff Brower
To: Dominic Stuart
Cc: c...
Sent: Wed, 15 Jul 2009 11:07:55 -0500
Subject: [c6x] Re: Slow EMIF transfer

> Dominic-
>
> > I am indeed trying to avoid delay in processing flow. The data needs to be
> > decompressed asap. When that is done the DSP performs calculations on the
> > data and based on the outcome of those calculations the DSP generates a
> > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > always takes longer than a write, I don't have to pull the Empty Flag and
> > can just read the data through a loop like so:
> >
> > while ((tmpRead1 != 0x84825131) && (x <= 0x1000))
> > {
> >     Calculator_AddSample();
> > }
>
> Ok, so what you're saying is that once you see a "not empty" flag,
> then you know the agent on the other side of the FIFO is writing a
> known block size, and will write it faster than you can read, so your
> code just needs to read.
>
> > I've tested this and it did improve the performance but nothing shocking,
> > it seems the decompressing via the LookUp Table is creating the bottle
> > neck. I've already split the two dimensional LUT into 2 one dimensional
> > array's. This also helped a bit.
>
> One thing you might try is hand-optimized asm code just for the read /
> look-up sequence, using techniques that Richard was describing. If
> you take advantage of the pipeline, you can improve performance. For
> example you can read sample N, then in the next 4 instructions process
> the lookup on N-1, waiting for N to become valid. It sounds to me
> like it wouldn't be that much code in your loop, maybe a dozen or less
> asm instructions.
>
> -Jeff
>
> PS. Please post to the group, not to me. Thanks.
>
> > --- In c..., Jeff Brower wrote:
> > >
> > > Dominic-
> > >
> > > > Thanks for the information, I think I will refrain from using block
> > > > transfers because I want to process the data as the DSP receives it.
> > > .
> > > .
> > > .
> > >
> > > > At the moment I am starting this "prefetch" function when a burst
> > > > starts and execute this function every time there is data available
> > > > in the FIFO's (polling the Empty Flag). I'm prefeteching 27.6% of
> > > > the data before the burst ends. All variables are in IRAM.
> > >
> > > The typical reason for doing it that way is to avoid delay (latency) in
your signal
> > > processing flow, relative to some output (DAC, GPIO line, digital
transmission,
> > > etc). Is that the case? If not then a block based method would be
better, otherwise
> > > you will waste a lot of time polling for each element. You don't have to
implement
> > > DMA as a first step to get that working, you could use a code loop. Then
implement
> > > DMA in order to further improve performance.
> > >
> > > -Jeff
> > >
> > > > My function looks like this:
> > > >
> > > > void Calculator_AddSample()
> > > > {
> > > > x++;
> > > >
> > > > read1 = (int*) 0x90300004;
> > > > read2 = (int*) 0x90300008;
> > > >
> > > > tmpRead1 = *read1;
> > > > tmpRead2 = *read2;
> > > >
> > > > // CHANNEL 1
> > > > CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > // CHANNEL 2
> > > > CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > // FWS R+L Add
> > > > if(LRneeded == 1)
> > > > {
> > > > CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > if(CH1.deloggedData[x] > 5000)
> > > > {
> > > > CH1.deloggedData[x] = 5000;
> > > > }
> > > > }
> > > > // CHANNEL 3 this channel is always read for particle matching on
this channel
> > > > binData[x] = (tmpRead2 & 0xFF);
> > > > CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > >
> > > > // CHANNEL 4
> > > > CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > // CHANNEL 5
> > > > CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > // CHANNEL 6
> > > > CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > }
> > > > This function executes 2 reads from 2 different FIFO's and then
seperates the different datachannels and decompresses the value's with a LookUp
Table.
> > > >
> > > > I am trying to streamline this function so it can keep up with the
incoming data. The data is written to the FIFO's with 4MHz. The data consists of
small burst packets ranging from 3 to 4096 bytes per channel.
> > > >
> > > > At the moment I am starting this "prefetch" function when a burst starts
and execute this function every time there is data available in the FIFO's
(polling the Empty Flag). I'm prefeteching 27.6% of the data before the burst
ends. All variables are in IRAM.
> > > >
> > > > I think I made an error in suspecting the EMIF transfer speed and I now
suspect that there may be some overhead in the polling scheme I use for calling
this function that results in the slow transfer speed. I will look into this. I
would like to thank everyone for there input.
> > > >
> > > > With kind regards,
> > > >
> > > > Dominic
> > > >
> > > > --- In c..., Adolf Klemenz wrote:
> > > > >
> > > > > Dear Dominic,
> > > > >
> > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > >as I understand DMA, I would need to work in "blocks" of data but that
> > > > > >would be very tricky in my application since I do not know how big the
> > > > > >datastream is gonna be. Or is it possible to use DMA for single byte
transfers?
> > > > >
> > > > > using DMA makes sense for block transfers only. Typical Fifo applications
> > > > > will use the Fifo's half-full flag (or a similar signal) to trigger a DMA
> > > > > block read.
> > > > > You may use element-synchronized DMA (each trigger transfers only one data
> > > > > word), but there will be no speed improvement: It takes about 100ns from
> > > > > the EDMA sync event to the actual data transfer on a C6713.
> > > > >
> > > > > Attached is a scope screenshot generated by this test program
> > > > >
> > > > > // compiled with -o2 and without debug info:
> > > > >
> > > > > volatile int buffer; // must be volatile to prevent
> > > > > // optimizer from code removal
> > > > > for (;;)
> > > > > {
> > > > > buffer = *(volatile int*)0x90300000;
> > > > > }
> > > > >
> > > > > The screenshot shows chip select and read signal with the expected timings
> > > > > (20ns strobe width). The gap between sucessive reads is caused by the DSP
> > > > > architecture. Here it is 200ns because a 225MHz DSP was used, which should
> > > > > translate to 150ns on a 300MHz device.
> > > > >
> > > > > If this isn't fast enough, you must use block transfers.
> > > > >
> > > > > Best Regards,
> > > > > Adolf Klemenz, D.SignT
> > >
------- End of Original Message -------

_____________________________________
Jeff,

You're partially correct: the agent on the other side is not writing a known block size, but it closes the "block" with a trailer value, so I can tell when the block is over by checking the last read value against the known trailer value.

Writing in assembly is a step I hope to postpone; I haven't coded in assembly for many, many moons ;)

I think I'll invest my time in optimizing my C code first. I am currently reading the "Optimizing C Compiler Tutorial" from the TI website; there's a lot of info in there. Once I'm comfortable with my C code, I will check whether I can rewrite certain algorithms in assembly to further optimize the system.

I would like to thank everyone for reading/replying to this post!

--- In c..., Jeff Brower wrote:
>
> Dominic-
>
> > I am indeed trying to avoid delay in processing flow. The data needs to be
> > decompressed asap. When that is done the DSP performs calculations on the
> > data and based on the outcome of those calculations the DSP generates a
> > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > always takes longer than a write, I don't have to pull the Empty Flag and
> > can just read the data through a loop like so:
> >
> > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > {
> > Calculator_AddSample();
> > }
>
> Ok, so what you're saying is that once you see a "not empty" flag, then you know the
> agent on the other side of the FIFO is writing a known block size, and will write it
> faster than you can read, so your code just needs to read.
>
> > I've tested this and it did improve the performance but nothing shocking,
> > it seems the decompressing via the LookUp Table is creating the bottle
> > neck. I've already split the two dimensional LUT into 2 one dimensional
> > array's. This also helped a bit.
>
> One thing you might try is hand-optimized asm code just for the read / look-up
> sequence, using techniques that Richard was describing. If you take advantage of the
> pipeline, you can improve performance. For example you can read sample N, then in
> the next 4 instructions process the lookup on N-1, waiting for N to become valid. It
> sounds to me like it wouldn't be that much code in your loop, maybe a dozen or less
> asm instructions.
>
> -Jeff
>
> PS. Please post to the group, not to me. Thanks.
>
> > --- In c..., Jeff Brower wrote:
> > >
> > > Dominic-
> > >
> > > > Thanks for the information, I think I will refrain from using block
> > > > transfers because I want to process the data as the DSP receives it.
> > > .
> > > .
> > > .
> > >
> > > > At the moment I am starting this "prefetch" function when a burst
> > > > starts and execute this function every time there is data available
> > > > in the FIFO's (polling the Empty Flag). I'm prefeteching 27.6% of
> > > > the data before the burst ends. All variables are in IRAM.
> > >
> > > The typical reason for doing it that way is to avoid delay (latency) in your signal
> > > processing flow, relative to some output (DAC, GPIO line, digital transmission,
> > > etc). Is that the case? If not then a block based method would be better, otherwise
> > > you will waste a lot of time polling for each element. You don't have to implement
> > > DMA as a first step to get that working, you could use a code loop. Then implement
> > > DMA in order to further improve performance.
> > >
> > > -Jeff
> > >
> > > > My function looks like this:
> > > >
> > > > >
> > > > > Dear Dominic,
> > > > >
> > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > >as I understand DMA, I would need to work in "blocks" of data but that
> > > > > >would be very tricky in my application since I do not know how big the
> > > > > >datastream is gonna be. Or is it possible to use DMA for single byte transfers?
> > > > >
> > > > > using DMA makes sense for block transfers only. Typical Fifo applications
> > > > > will use the Fifo's half-full flag (or a similar signal) to trigger a DMA
> > > > > block read.
> > > > > You may use element-synchronized DMA (each trigger transfers only one data
> > > > > word), but there will be no speed improvement: It takes about 100ns from
> > > > > the EDMA sync event to the actual data transfer on a C6713.
> > > > >
> > > > > Attached is a scope screenshot generated by this test program
> > > > >
> > > > > // compiled with -o2 and without debug info:
> > > > >
> > > > > volatile int buffer; // must be volatile to prevent
> > > > > // optimizer from code removal
> > > > > for (;;)
> > > > > {
> > > > > buffer = *(volatile int*)0x90300000;
> > > > > }
> > > > >
> > > > > The screenshot shows chip select and read signal with the expected timings
> > > > > (20ns strobe width). The gap between successive reads is caused by the DSP
> > > > > architecture. Here it is 200ns because a 225MHz DSP was used, which should
> > > > > translate to 150ns on a 300MHz device.
> > > > >
> > > > > If this isn't fast enough, you must use block transfers.
> > > > >
> > > > > Best Regards,
> > > > > Adolf Klemenz, D.SignT
> >

_____________________________________
R. Williams,

--- In c..., "Richard Williams" wrote:
> Dominic,
>
> There are a couple of problems with the displayed code.
> --'x' is not being incremented
> --tmpRead1 is not being updated

x and tmpRead1 are updated in the AddSample() routine. Furthermore, I've been analyzing the compiler's feedback and it states that it cannot implement software pipelining because there is a function call (AddSample()) in the loop. I've removed the AddSample() function and put its code directly into the loop (see source). There are still some problems ("Disqualified loop: Loop carried dependency bound too large"), but I'm working on it :) I've also found out that pipelining is not being used in a lot of my loops, so I'm guessing that if I adjust my C code so that software pipelining becomes possible, I will notice an increase in performance.

Source:

read1 = (int*) 0x90300004;
read2 = (int*) 0x90300008;

tmpRead1 = *read1;
tmpRead2 = *read2;
x = 0;
while((tmpRead1 != 0x84825131) && (x <= 0x1000))
{
tmpRead1 = *read1;
tmpRead2 = *read2;

CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
// FWS R+L Add
if(LRneeded == 1)
{
CH1.deloggedData[x] += CH2.deloggedData[x];
if(CH1.deloggedData[x] > 5000)
{
CH1.deloggedData[x] = 5000;
}
}
CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
binData[x] = (tmpRead2 & 0xFF);
CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
x++;
}
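The "Loop carried dependency bound too large" message often comes from the compiler having to assume that stores through one array may alias the globals (tmpRead1, x, the CHn structures) read elsewhere in the loop. A minimal sketch of the same unpack loop rewritten with locals and restrict-qualified parameters so the compiler can prove independence; all names are illustrative, and the in-memory `src` array stands in for the FIFO (a real FIFO read is a repeated volatile read of one address, which itself limits pipelining):

```c
#include <stddef.h>

/* Sketch: unpack one burst of 32-bit FIFO words into per-channel
   arrays via two look-up tables, stopping at a termination marker.
   Locals and restrict let the compiler software-pipeline the loop. */
static size_t unpack_burst(const unsigned int *src, size_t max,
                           unsigned int term,
                           const float *restrict lut0,
                           const float *restrict lut1,
                           float *restrict ch1, float *restrict ch2,
                           float *restrict ch5, float *restrict ch6)
{
    size_t n = 0;
    while (n < max) {
        unsigned int w = src[n];          /* one FIFO word */
        if (w == term)                    /* termination marker */
            break;
        ch1[n] = lut0[(w >> 16) & 0xFF];  /* channel 1: bits 16..23 */
        ch2[n] = lut0[(w >> 24) & 0xFF];  /* channel 2: bits 24..31 */
        ch5[n] = lut1[(w >> 8)  & 0xFF];  /* channel 5: bits 8..15  */
        ch6[n] = lut1[ w        & 0xFF];  /* channel 6: bits 0..7   */
        n++;
    }
    return n;                             /* samples actually stored */
}
```

The LRneeded clamp is left out here on purpose: hoisting that invariant test out of the loop (two loop versions, selected once) is another way to keep the loop body branch-free for the pipeliner.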

With kind regards,

Dominic

>
> However, your idea of just using the read operation, since it is much longer
> than a write, is a good one.
>
> R. Williams
>
> ---------- Original Message -----------
> From: Jeff Brower
> To: Dominic Stuart
> Cc: c...
> Sent: Wed, 15 Jul 2009 11:07:55 -0500
> Subject: [c6x] Re: Slow EMIF transfer
>
> > Dominic-
> >
> > > I am indeed trying to avoid delay in processing flow. The data needs to be
> > > decompressed asap. When that is done the DSP performs calculations on the
> > > data and based on the outcome of those calculations the DSP generates a
> > > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > > > always takes longer than a write, I don't have to poll the Empty Flag and
> > > can just read the data through a loop like so:
> > >
> > > > while((tmpRead1 != 0x84825131) && (x <= 0x1000))
> > > {
> > > Calculator_AddSample();
> > > }
> >
> > Ok, so what you're saying is that once you see a "not empty" flag,
> > then you know the agent on the other side of the FIFO is writing a
> > known block size, and will write it faster than you can read, so your
> > code just needs to read.
> >
> > > > I've tested this and it did improve the performance, but nothing shocking;
> > > > it seems the decompression via the look-up table is creating the
> > > > bottleneck. I've already split the two-dimensional LUT into 2 one-dimensional
> > > > arrays. This also helped a bit.
> >
> > One thing you might try is hand-optimized asm code just for the read /
> > look-up sequence, using techniques that Richard was describing. If
> > you take advantage of the pipeline, you can improve performance. For
> > example you can read sample N, then in the next 4 instructions process
> > the lookup on N-1, waiting for N to become valid. It sounds to me
> > like it wouldn't be that much code in your loop, maybe a dozen or less
> > asm instructions.
> >
> > -Jeff
> >
> > PS. Please post to the group, not to me. Thanks.
> >
> > > --- In c..., Jeff Brower wrote:
> > > >
> > > > Dominic-
> > > >
> > > > > Thanks for the information, I think I will refrain from using block
> > > > > transfers because I want to process the data as the DSP receives it.
> > > > .
> > > > .
> > > > .
> > > >
> > > > > At the moment I am starting this "prefetch" function when a burst
> > > > > starts and execute this function every time there is data available
> > > > > in the FIFO's (polling the Empty Flag). I'm prefeteching 27.6% of
> > > > > the data before the burst ends. All variables are in IRAM.
> > > >
> > > > The typical reason for doing it that way is to avoid delay (latency) in
> your signal
> > > > processing flow, relative to some output (DAC, GPIO line, digital
> transmission,
> > > > etc). Is that the case? If not then a block based method would be
> better, otherwise
> > > > you will waste a lot of time polling for each element. You don't have to
> implement
> > > > DMA as a first step to get that working, you could use a code loop. Then
> implement
> > > > DMA in order to further improve performance.
> > > >
> > > > -Jeff
> ------- End of Original Message -------
>

_____________________________________
d.stuartnl,

I notice that the code, during the first loop, checks for the termination value
and then throws away the first read values (by reading from read1 and read2 again).
Is that what you wanted to do?

Execution could be made much faster by eliminating the calculations related to
'x', using pointers to:
CH1.deloggedData,
CH2.deloggedData,
CH3.deloggedData,
CH4.deloggedData,
CH5.deloggedData,
CH6.deloggedData.
Initialize the pointers before the loop and increment them at the end of the loop.
Also, eliminate 'x' and related calculation by precalculating the end address
for the loop as:
const endCH1 = &CH1.deloggedData[0x1000];
const termValue = 0x84825131;

pCH1 = &CH1.deloggedData[0];
pCH2 = &CH2.deloggedData[0];
--- // rest of initialization
while( pCH1 < endCH1 )
{
---// processing
pCH1++;
pCh2++;
...// rest of incrementing
} // end while()

to avoid processing the termination value from *read1
and to exit when the termination value is read:
The first code within the 'while' loop would be:
tmpRead1 = *read1;
if (tmpRead1 == termValue ) break;
tmpRead2 = *read2;

R. Williams
---------- Original Message -----------
From: "d.stuartnl"
To: c...
Sent: Fri, 17 Jul 2009 10:11:36 -0000
Subject: [c6x] Re: Slow EMIF transfer

> R. Williams,
> x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> I've been analyzing the compilers feedback and it's stating that it
> cannot implement software pipelining because there's a function call
> (AddSample()) in the loop. I've removed the AddSample() function and
> put the code from the function directly into the loop (see source),
> there's still some problems (Disqualified loop: Loop carried
> dependency bound too large). But I'm working on it :) I've also found
> out that pipelining is not being used in a lot of my loops so I'm
> guessing if I adjust my C-code so that software pipelining will be
> possible I will notice an increase in performance.
>
> Source:
>
> read1 = (int*) 0x90300004;
> read2 = (int*) 0x90300008;
>
> tmpRead1 = *read1;
> tmpRead2 = *read2;
> x = 0;
> while((tmpRead1 != 0x84825131) && (x <= 0x1000))
> {
> tmpRead1 = *read1;
> tmpRead2 = *read2;
>
> CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> // FWS R+L Add
> if(LRneeded == 1)
> {
> CH1.deloggedData[x] += CH2.deloggedData[x];
> if(CH1.deloggedData[x] > 5000)
> {
> CH1.deloggedData[x] = 5000;
> }
> }
> CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> binData[x] = (tmpRead2 & 0xFF);
> CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> x++;
> }
>
> With kind regards,
>
> Dominic
>
------- End of Original Message -------

_____________________________________
Dear R.Williams,

I changed my code to your suggestion:

void Calculator_FetchData()
{
volatile float * pCH1;
volatile float * pCH2;
volatile float * pCH3;
volatile float * pCH4;
volatile float * pCH5;
volatile float * pCH6;

const float * const endCH1 = &CH1.deloggedData[0x1000];
const unsigned int termValue = 0x84825131;

pCH1 = &CH1.deloggedData[0];
pCH2 = &CH2.deloggedData[0];
pCH3 = &CH3.deloggedData[0];
pCH4 = &CH4.deloggedData[0];
pCH5 = &CH5.deloggedData[0];
pCH6 = &CH6.deloggedData[0];

tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...

while(pCH1 < endCH1)
{
tmpRead1 = *read1;
if(tmpRead1 == termValue) break;
//CHANNEL 1
*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
// CHANNEL 2
*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
if(LRneeded == 1)
{
*pCH1 += *pCH2;
if(*pCH1 > 5000)
{
*pCH1 = 5000;
}
}
// CHANNEL 5
*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

// CHANNEL 6
*pCH6 = LUT1[tmpRead1 & 0xFF];

tmpRead2 = *read2;

// CHANNEL 3 this channel is always read for particle matching on this channel
*pCH3 = LUT0[((tmpRead2 & 0xFF))];
// CHANNEL 4
*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

pCH1++;
pCH2++;
pCH3++;
pCH4++;
pCH5++;
pCH6++;
x++;
}
if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
{
processTime = TIMER(1)->cnt - tmpprocessTime;
}
}
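An aside on the measurement code above: if TIMER(1)->cnt is a free-running unsigned 32-bit counter, plain unsigned subtraction already yields the correct elapsed count across a wraparound, so the `> 0` overflow guard is not needed (it only rejects a zero difference). A sketch of the idea, with a hypothetical timer_elapsed helper:

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 32-bit counter.
   Unsigned subtraction is modulo 2^32, so the result is correct even
   when `now` has wrapped past zero, provided the measured interval is
   shorter than one full timer period. */
static uint32_t timer_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}
```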

On my test rig I'm offering particles with a fixed length of 985 samples. My previous code could read 985 samples for 6 channels in 681 us. Your suggestion cut that time down to 601 us!!! My first reaction was WOW :P. I have a couple of questions though, if you can forgive my ignorance. The big question is WHY? Because it looks like it's calculating more (6 pointers instead of 1 "x"). I still left in the x++; because I need to know how many samples have been read.
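On the WHY, a plausible explanation: with `CH1.deloggedData[x]` the compiler must form `base + 4*x` for six different base addresses every iteration, and because x and the CHn structures are globals it may also reload them from memory each pass (any store could alias them). Six local pointers each live in a register and advance by a constant stride, which maps directly onto the C6000's auto-increment addressing modes. A minimal illustration of the two forms (illustrative names, one channel instead of six):

```c
#include <stddef.h>

/* The same copy written two ways.  With an index, every access
   recomputes base + 4*x; with pointers, each address sits in a
   register and is bumped by one stride per iteration. */
static void copy_indexed(float *dst, const float *src, size_t n)
{
    for (size_t x = 0; x < n; x++)
        dst[x] = src[x];              /* address = dst + 4*x each pass */
}

static void copy_pointer(float *dst, const float *src, size_t n)
{
    const float *end = dst + n;       /* precomputed bound, like endCH1 */
    while (dst < end)
        *dst++ = *src++;              /* one register increment each */
}
```

Both functions produce identical results; only the address arithmetic the compiler has to emit differs.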

With kind regards,

Dominic Stuart

--- In c..., "Richard Williams" wrote:
>
> d.stuartnl,
>
> I notice that the code, during the first loop, checks for the termination value
> then throws away the first read values (by reading from read1 and read2 again).
> is that you wanted to do?
>
> Execution could be made much faster, by eliminating the calculations related to
> 'x' by using pointers to:
> CH1.deloggedData,
> CH2.deloggedData,
> CH3.deloggedData,
> CH4.deloggedData,
> CH5.deloggedData,
> CH6.deloggedData.
> Initialize the pointers before the loop and increment them at the end of the loop.
> Also, eliminate 'x' and related calculation by precalculating the end address
> for the loop as:
> const endCH1 = &CH1.deloggedData[0x1000];
> const termValue = 0x84825131;
>
> pCH1 = &CH1.deloggedData[0];
> pCH2 = &CH2.deloggedData[0];
> --- // rest of initialization
> while( pCH1 < endCH1 )
> {
> ---// processing
> pCH1++;
> pCh2++;
> ...// rest of incrementing
> } // end while()
>
> to avoid processing the termination value from *read1
> and to exit when the termination value is read:
> The first code within the 'while' loop would be:
> tmpRead1 = *read1;
> if (tmpRead1 == termValue ) break;
> tmpRead2 = *read2;
>
> R. Williams

_____________________________________