Technical discussions about the TI C6000 DSPs (including the c62x, c64x and c67x DSPs).

  

Slow EMIF transfer - d.st...@yahoo.com - Jun 23 9:26:35 2009

Hi all,

I am a fairly new embedded programmer and this is my first post on this forum. I
am working with a 6713 DSP, and in my current project I am reading data from some
FIFOs connected to the EMIF bus. My problem is the performance of the EMIF: I
have measured the time it takes to read from the EMIF, and I have confirmed these
findings with the simulator.

I'm executing the following code:

   x++;                       // 10 clocks  - 0.033 us
   read1 = (int*) 0x90300004; // 3 clocks   - 0.010 us
   read2 = (int*) 0x90300008; // 3 clocks   - 0.010 us
   tmpRead1 = *read1;         // 177 clocks - 0.590 us
   tmpRead2 = *read2;         // 176 clocks - 0.586 us

I've annotated each line with the clock counts reported by the simulator. 177 clocks
for one read seems a bit much. Am I overlooking something? How can I achieve a
higher transfer speed?

With kind regards,

Dominic Stuart

_____________________________________

______________________________
New Code Sharing Section now Live on DSPRelated.com. Learn about the Reward Program for Contributors here.



(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: Slow EMIF transfer - Jeff Brower - Jun 23 11:48:41 2009

Dominic-

I assume your design, when it takes shape, will treat the FIFO interface as async,
like an SRAM.  What are your EMIF register settings for the CEn space containing the
FIFO?  Have you set setup and hold times to match the FIFO data sheet?  Will you be
connecting ARDY to a pin on the FIFO, and if so are you able to simulate that?

-Jeff

Re: Slow EMIF transfer - Jeff Brower - Jun 23 12:32:51 2009

Dominic-

> Thank you for your response. The design actually is already implemented:
> I designed the hardware for the system and my ex-colleague designed
> the software.

Ok... in that case I would avoid using the simulator.  The only way to really know
what's going on with hardware is to measure it in action.  How are you measuring
clock cycles?  Digital scope or logic analyzer connected to the FIFO signals?

> I am currently modifying and adding functionality to the
> software. Where can I find these EMIF register settings or can you
> point me to a document where I can read up on this interface?

Maybe these would help...

C6x EMIF Reference Guide:

  http://focus.ti.com/lit/ug/spru266e/spru266e.pdf

TMS320C6000 EMIF to External FIFO Interface App Note:

  http://focus.ti.com/lit/an/spra543/spra543.pdf

-Jeff

PS. Please post to the group, not to me.  Also please don't cut text from previous
posts in the thread; I had to put your text back.  I enjoy trying to help and
answer questions, but not if I have to spend time formatting.

Re: Slow EMIF transfer - Jeff Brower - Jun 23 13:01:15 2009

Dominic-

Did you see my PS?  If you just missed that and you can re-post then I'm Ok to
continue to help.

Otherwise I hope someone else might help you...

-Jeff

"d.stuartnl" wrote:
>
> I started out measuring the performance of the system with a hardware timer
> and printing the result to the RS232 channel. I found that I could only read
> 30 bytes in 67.7 microseconds, which means I can read 1 byte every 2.256 us.
> I know the FIFO is filled by a 4 MHz data clock, so I'm sure there are bytes
> available every 0.25 us. When I discovered this problem I decided to run the
> code in the CCS simulator. I found that the timing the simulator reports
> closely matches the values I measured with the hardware timer.
>
> It all boils down to this: why is the read from the EMIF so slow (in the
> simulator and in the real world)? I will read the EMIF reference guide you
> provided in your last response and hope to find some answers.
>
> With kind regards,
>
> Dominic Stuart
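
The arithmetic in the message above can be checked with a few lines of integer math. This is only a sketch: the helper names are made up for illustration, and the inputs are the figures quoted in the message (30 bytes in 67.7 us, a 4 MHz FIFO fill clock).

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any TI library. */

/* Average time per item in nanoseconds (truncating integer division). */
static uint32_t per_item_ns(uint32_t total_ns, uint32_t items)
{
    return total_ns / items;
}

/* Period of a clock in nanoseconds, given its frequency in hertz. */
static uint32_t clock_period_ns(uint32_t hz)
{
    return 1000000000u / hz;
}

/* How many times slower the reader is than the writer (truncated). */
static uint32_t slowdown_factor(uint32_t read_ns, uint32_t fill_ns)
{
    return read_ns / fill_ns;
}
```

With the quoted figures, per_item_ns(67700, 30) gives 2256 ns per byte and clock_period_ns(4000000) gives 250 ns, so the reads are roughly nine times slower than the rate at which the FIFO fills.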

Re: Slow EMIF transfer - d.st...@yahoo.com - Jun 23 13:54:15 2009

Hi Jeff,

Thank you for your response. The design has already taken shape: I actually
designed the hardware and someone else designed the software, and I am
currently modifying the software to add functionality. Where can I inspect these
EMIF register settings, or can you point me to a document where I can read up on
them?

With kind regards,

Dominic Stuart

Re: Slow EMIF transfer - "d.stuartnl" - Jun 23 13:54:36 2009

Hi Jeff,

I'm sorry I missed your PS. As I said, I am new to posting in forums, my
apologies. I think I'm posting to the group now, or am I still only replying
to you?

With kind regards,

Dominic

Re: Re: Slow EMIF transfer - Jeff Brower - Jun 23 14:14:46 2009

Dominic-

> I'm sorry I missed your PS. As I said, I am new to posting in forums,
> my apologies. I think I'm posting to the group now, or am I still
> only replying to you?

Your post is showing on the group now...

> I started out measuring the performance of the system with a hardware
> timer and printing the result to the RS232 channel.

That's a good way.

> I found that I could only read 30 bytes in 67.7 microseconds, which
> means I can read 1 byte every 2.256 us. I know the FIFO is filled by
> a 4 MHz data clock, so I'm sure there are bytes available every 0.25 us.

Can your FIFO operate synchronously?  If so, that might help.  I know that C64x series
DSPs can be set up specifically for sync FIFO; I'm not sure about C671x.

> When I discovered this problem I decided to run the code in the CCS
> simulator. I found that the timing the simulator reports closely
> matches the values I measured with the hardware timer.

Well, that might be one reason to be suspicious.  The simulator normally defaults to
"slowest possible" values, if I recall correctly, like 31 wait-states for an async
read, or something like that.  So if you can optimize your EMIF register settings to
match the FIFO data sheet, hopefully you would see some improvement.

> It all boils down to this: why is the read from the EMIF so slow (in the
> simulator and in the real world)? I will read the EMIF reference guide you
> provided in your last response and hope to find some answers.

-Jeff
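
Matching the EMIF settings to a FIFO data sheet, as suggested above, comes down to rounding each data-sheet time up to a whole number of EMIF clock cycles. The sketch below shows only that rounding step. The function names are hypothetical, and the example times in the note after the code (5 ns setup, 15 ns strobe, 3 ns hold) are placeholders, not values from any real FIFO data sheet. The 100 MHz EMIF clock is the figure given later in this thread.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any TI library. */

/* Period of the EMIF clock in picoseconds, e.g. 100 MHz -> 10000 ps. */
static uint32_t emif_period_ps(uint32_t emif_hz)
{
    return (uint32_t)(1000000000000ull / emif_hz);
}

/* Smallest whole number of EMIF cycles that covers time_ns (rounds up). */
static uint32_t cycles_for_ns(uint32_t time_ns, uint32_t emif_hz)
{
    uint64_t time_ps = (uint64_t)time_ns * 1000u;
    uint32_t period = emif_period_ps(emif_hz);
    return (uint32_t)((time_ps + period - 1u) / period);
}
```

At 100 MHz the placeholder times round up to 1 setup, 2 strobe and 1 hold cycle. The resulting counts still have to fit the register fields and respect any minimum values the EMIF reference guide specifies, which this sketch does not check.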

Re: Slow EMIF transfer - Richard Williams - Jun 23 16:40:52 2009

D.S,

It looks, on first examination, like the memory at 0x90300000 has about 10 wait
states.

Have you examined the actual source code?
I would expect a max of 4 instructions to perform the 'tmpRead1 = *read1':
fetch read1 (source address)
fetch @ read1 (contents)
fetch tmpRead1 (destination address)
store @ tmpRead1 (contents)

R. Williams

Re: Re: Slow EMIF transfer - William C Bonner - Jun 23 16:41:09 2009

http://focus.ti.com/lit/ug/spru266e/spru266e.pdf is the document that was
mentioned earlier; it has the details of how the EMIF registers are
arranged and what the bits mean.

The boot BIOS on my particular platform sets the EMIF correctly. (I've got
two platforms that are very similar but have different RAM, so different
EMIF timing.)

I've also got the values for each of my platforms listed in my .GEL
file for CCS, since the emulator needs to know the details to make things
work properly.

Re: Slow EMIF transfer - "d.stuartnl" - Jun 24 8:47:01 2009

Thanks for all of your responses. I've checked the .gel file and found that
the EMIF_CE registers for the FIFOs I'm reading from are configured as 32-bit
asynchronous. The FIFOs are capable of synchronous data transfer, so I will check
the SPRU document and program the registers correctly. I will post a final
message if that fixes my problem.

Re: Slow EMIF transfer - "d.stuartnl" - Jul 13 9:50:39 2009

Hi all,

I've examined my code and hardware. The FIFOs I'm accessing are configured as
asynchronous.

The address space is configured as:
*0x1800004 = 0x10914221;    /* CE1 = async 32 */

If I understand it correctly, this should be 1 setup, 2 strobe and 1 hold cycle.
The EMIF runs at 100 MHz, so one read should take 0.04 us.

When I measure the performance, one read takes 0.333 us. Where does this delay come
from?

The code is as follows:

read1 = (int*) 0x90300004;
tmpRead1 = *read1; // this line takes 0.333 us.

tmpRead1 is defined as a volatile int and resides in IRAM (address 0x0001d100
according to the .map file).

I saw that R. Williams suggested that there may be 10 wait states. What are
these, and how can I verify them?

If I look at the assembly code for the tmpRead1 = *read1; line, it shows:

MV.L2X   A5,B4
LDW.D2T2 *+B4[0],B4
MVK.S1   0xffffd100,A6
MVKH.S1  0x10000,A6
NOP      2
STW.D1T2 B4,*+A6[0]
NOP

I hope someone can shed some light on this or point me in the right direction to
debug this problem.

With kind regards,

Dominic Stuart
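
The reading of the CE1 control word in the post above can be reproduced by pulling the read-side fields out of 0x10914221. This is a sketch under one stated assumption: the bit positions below are taken to be the C621x/C671x CE space control register layout described in SPRU266, and they should be checked against that guide. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Read-side fields of a CE space control word.
 * ASSUMPTION: bit positions follow the C621x/C671x layout in SPRU266. */
struct ce_read_timing {
    uint32_t rd_setup;  /* bits 19-16, in EMIF cycles */
    uint32_t rd_strobe; /* bits 13-8,  in EMIF cycles */
    uint32_t rd_hold;   /* bits 2-0,   in EMIF cycles */
    uint32_t mtype;     /* bits 7-4,   memory type code */
};

/* Extract the read timing fields and memory type from a control word. */
static struct ce_read_timing decode_cectl(uint32_t word)
{
    struct ce_read_timing t;
    t.rd_setup  = (word >> 16) & 0xFu;
    t.rd_strobe = (word >> 8)  & 0x3Fu;
    t.rd_hold   =  word        & 0x7u;
    t.mtype     = (word >> 4)  & 0xFu;
    return t;
}

/* Nominal async read length in ns: (setup + strobe + hold) EMIF cycles. */
static uint32_t nominal_read_ns(struct ce_read_timing t, uint32_t emif_period_ns)
{
    return (t.rd_setup + t.rd_strobe + t.rd_hold) * emif_period_ns;
}
```

For 0x10914221 this yields setup 1, strobe 2, hold 1 and a memory type code of 2, and with a 10 ns EMIF period a nominal read of 40 ns, which agrees with the 0.04 us worked out in the post. The programmed setup, strobe and hold counts therefore do not by themselves account for the measured 0.333 us. The nominal figure leaves out any extra cycles the EMIF or an ARDY signal may add.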

Re: Re: Slow EMIF transfer - Richard Williams - Jul 13 10:24:16 2009

d.stuartnl,

I see 10 instructions in the code:

MV.L2X   A5,B4         <<-- copy source address to working register
LDW.D2T2 *+B4[0],B4    <<-- read value at source
MVK.S1   0xffffd100,A6 <<-- set up low portion of destination address in A6
MVKH.S1  0x10000,A6    <<-- set up high portion of destination address in A6
NOP      2             <<-- wait two NOP instruction times
STW.D1T2 B4,*+A6[0]    <<-- write value to destination address
NOP                    <<-- wait 1 NOP instruction time

So the "tmpRead1 = *read1;" takes ~10 instructions.
(I do not have the H/W details at hand, so I cannot supply the specifics on cycle
times for each instruction.  I also do not know the specific processor well
enough to predict the amount of pipeline stalls, etc.)

In general, the operation tmpRead1 = *read1; will take much longer than the time
needed to read the value from the source address.

Finally, the time measuring tool (JTAG?) probably imposes some delays for
setup/communication/etc.

R. Williams
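
The point above, that the C statement covers more than the bus access itself, suggests keeping the per-read work around the load as small as possible. A minimal sketch, assuming the FIFO is read through a single memory-mapped address: the FIFO pointer is declared volatile so every iteration performs a real read, while the destination is an ordinary buffer. The function name is hypothetical, and whether this shortens the measured time on the 6713 is untested here. It does not change the EMIF timing of each access.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: drain n words from a memory-mapped FIFO into dst.
 * fifo points at the FIFO's read address; volatile forces one bus read
 * per loop iteration. */
static void fifo_read_block(volatile const uint32_t *fifo, uint32_t *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = *fifo; /* one external read, stored to ordinary memory */
    }
}
```

On the target the first argument would be something like (volatile const uint32_t *)0x90300004, the address used in this thread.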

---------- Original Message -----------
From: "d.stuartnl" <d...@yahoo.com>
To: c...@yahoogroups.com
Sent: Mon, 13 Jul 2009 13:14:04 -0000
Subject: [c6x] Re: Slow EMIF transfer

> Hi all,
> 
> I've examined my code and hardware, the FIFO's I'm accessing are 
> configured as Asynchronous.
> 
> The address space is configured as:
> *0x1800004 = 0x10914221;    /* CE1 = async 32 */
> 
> If I understand it correctly this should be 1 setup, 2 strobe and 1 
> hold cycle. The EMIF runs at 100 MHz, so one read should take 0.04 us.
> 
> When I measure the performance, 1 read takes 0.333 us. Where does this 
> delay come from?
> 
> The code is as follows:
> 
> read1 = (int*) 0x90300004;
> tmpRead1 = *read1; // this line takes 0.333 us.
> 
> tmpRead1 is defined as a volatile int and resides in IRAM (address: 
> 0x0001d100 according to the .map file).
> 
> I saw that R. Williams suggested that there may be 10 wait states; 
> what are these and how can I verify this?
> 
> If I look at the assembly code for the tmpRead1 = *read1; line it states:
> 
> MV.L2X   A5,B4
> LDW.D2T2 *+B4[0],B4
> MVK.S1   0xffffd100,A6
> MVKH.S1  0x10000,A6
> NOP      2
> STW.D1T2 B4,*+A6[0]
> NOP
> 
> I hope anyone can shed some light on this or point me in the right 
> direction to debug this problem.
> 
> With kind regards,
> 
> Dominic Stuart
> 
> --- In c...@yahoogroups.com, "d.stuartnl" <d.stuartnl@...> wrote:
> >
> > Thanks for all of your responses. I've checked the .gel file and found that
> > the EMIF_CE registers for the FIFOs I'm reading from are configured as 32-bit
> > asynchronous. The FIFOs are capable of synchronous data transfer, so I will
> > check the SPRU document and program the registers correctly. I will post a
> > final message if that fixes my problem.
> > 
> > --- In c...@yahoogroups.com, "Richard Williams" <rkwill@> wrote:
> > >
> > > D.S,
> > > 
> > > it looks, on first examination, like the memory at 0x90300000 has about 10
> > > wait states.
> > > 
> > > Have you examined the actual source code?
> > > I would expect a max of 4 instructions to perform the 'tmpRead1 = *read1'
> > > fetch read1 (source address)
> > > fetch @ read1 (contents)
> > > fetch tmpRead1 (destination address)
> > > store @ tmpRead1 (contents)
> > > 
> > > R. Williams
------- End of Original Message -------

_____________________________________


Re: Slow EMIF transfer - "d.stuartnl" - Jul 13 10:39:44 2009

Thanks for your reply R.Williams,

I am measuring the instruction with a hardware timer, which itself adds a
0.1665 us delay. This means that the read instruction still takes
0.333 - 0.1665 = 0.1665 us. That seems very slow (roughly 6 MHz) for a
300 MHz CPU connected to a 100 MHz bus.
Is there any way to speed this up?
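
For reference, the arithmetic behind those figures as a small C sketch. It only restates the numbers given above (0.333 us raw, 0.1665 us timer overhead, 300 MHz CPU). The function names are illustrative, and times are kept in integer picoseconds to avoid floating point.

```c
#include <stdint.h>

/* Net duration after removing the timer's own overhead.
 * Returns 0 if the overhead exceeds the raw measurement. */
uint32_t net_time_ps(uint32_t measured_ps, uint32_t overhead_ps)
{
    return (measured_ps > overhead_ps) ? (measured_ps - overhead_ps) : 0u;
}

/* Equivalent access rate in kHz for a per-access time in picoseconds. */
uint32_t access_rate_khz(uint32_t period_ps)
{
    return (period_ps != 0u) ? (1000000000u / period_ps) : 0u;
}

/* The same period expressed in CPU cycles, rounded to nearest. */
uint32_t period_in_cpu_cycles(uint32_t period_ps, uint32_t cpu_mhz)
{
    return (uint32_t)(((uint64_t)period_ps * cpu_mhz + 500000u) / 1000000u);
}
```

With 333000 ps measured and 166500 ps overhead this gives 166500 ps per read, about 6006 kHz, or roughly 50 CPU cycles at 300 MHz.
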
With kind regards,

Dominic


_____________________________________






Re: Re: Slow EMIF transfer - Jeff Brower - Jul 13 11:02:40 2009

Dominic-

> Thanks for your reply R.Williams,
>
> I am measuring the instruction with a hardware timer (which creates
> 0.1665 us delay). This means that the read
> instruction still takes (0.333 - 0.1665) 0.1665 us. This seems very
> slow (+/-6MHz) for a 300 MHz CPU connected to a
> 100MHz bus. Is there any way to speed this up?

Any "internal" technique you use to measure the duration of something in the
nanosecond range is not going to be accurate. As you have found, reading
hardware timer registers has some inherent delay, and as Richard mentions, a
JTAG and/or RTDX based method would take so much time that the actual memory
cycle would end up a tiny fraction of the measurement.

The only way to accurately measure a single memory cycle time is externally
(digital scope or logic analyzer). My suggestion to get a worst-case figure
would be to make three accesses: one to your mem, one to another mem (to force
a change in the CEn lines), and a third one to your mem, and watch this on the
scope. Then you would know both the cycle duration and the amount of time the
compiler is adding for your line of C code.

-Jeff


_____________________________________


Re: Re: Slow EMIF transfer - Richard Williams - Jul 13 12:31:02 2009

D.stuartnl,

If I were coding it in ASM.  I would do the following:

MVK.S1 0xffffd100,A6  <<-- set up low portion of destination address in A6
MVKH.S1 0x10000,A6    <<-- set up high portion of destination address in A6
LDW.D1T2 *+A5[0],B4   <<-- read value at source (.D1, since the address is in A5)
NOP 4                 <<-- wait out the 4 load delay slots before B4 is valid
STW.D1T2 B4,*+A6[0]   <<-- write value to destination address
NOP                   <<-- wait 1 NOP instruction time

However, this may have some deficiencies in the use of the pipeline and assumes
A5 can be used for accessing the RAM.

R. Williams


_____________________________________


Re: Re: Slow EMIF transfer - Michael Dunn - Jul 13 12:31:32 2009

Dominic,

On Mon, Jul 13, 2009 at 10:15 AM, Jeff Brower <j...@signalogic.com> wrote:
> Dominic-
>
> > Thanks for your reply R.Williams,
> >
> > I am measuring the instruction with a hardware timer (which creates
> > 0.1665 us delay). This means that the read
> > instruction still takes (0.333 - 0.1665) 0.1665 us. This seems very
> > slow (+/-6MHz) for a 300 MHz CPU connected to a
> > 100MHz bus. Is there any way to speed this up?

<mld>
Excuse me if I missed this in previous postings...

1. Is the code executing in IRAM or SDRAM?
2. Is ClkOut2 == 150 MHz?
3. Have you checked that the EMIF clock input is 100 MHz?

Now that we [or I :-) ] are calibrated....
>
> Any "internal" technique you use to measure the duration of something in
> the nsec range is not going to be accurate. As you have found, reading
> hardware timer registers has some inherent delay, and as Richard mentions,
> a JTAG and/or RTDX based method would take so much time the actual memory
> cycle would end up a tiny fraction.
>
> The only way to accurately measure a single memory cycle time is externally
> (dig scope or LA). My suggestion to get a worst-case figure would be to make
> three accesses: one to your mem, one to another mem (to force a change in CEn
> lines), and a third one to your mem. And watch this on the scope. Then you
> would know both the cycle duration and the amount of time the compiler is
> adding for your line of C code.

<mld>
I pretty much agree with Jeff.
If you are writing your test case in C, keep it very simple, keep
everything in main, and always check the asm code that was generated.
I prefer a loop because some days it takes me a few tries to get set up
correctly and "sync'd". When doing the reads, look at the CEx line and
AOE to determine the read cycle time.

Start:
Read from 0xA0000000 [CE2].
Read from 0x90300004 [CE1].
Read from 0xA0000000 [CE2].
Read from 0x90300004 [CE1].
goto Start.

Once you get a handle on the measurements, you can insert 2 CE2
accesses around your "code of interest" to measure the time in your
app.  If you 'register' your CE2 address, you can get very accurate
numbers.
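
A minimal C sketch of the loop and of the "bracket with CE2 accesses" idea described above. The two addresses are the ones from the listing. Whether anything actually responds at 0xA0000000 on a given board is an assumption, and the function names are made up for illustration. The pointers are passed in as parameters so the same code can be exercised on a host with ordinary variables.

```c
#include <stdint.h>

/* Addresses from the listing above; only meaningful on the target. */
#define CE1_FIFO_ADDR 0x90300004u
#define CE2_MARK_ADDR 0xA0000000u

/* Alternate CE2 and CE1 reads so the CEx/AOE lines toggle on the scope.
 * The volatile qualifiers keep the compiler from dropping or merging
 * the reads. The sum is returned only so every read value is used. */
uint32_t ce_toggle_reads(volatile const uint32_t *ce1,
                         volatile const uint32_t *ce2,
                         unsigned iterations)
{
    uint32_t acc = 0;
    unsigned i;
    for (i = 0; i < iterations; ++i) {
        acc += *ce2;   /* marker access on CE2 */
        acc += *ce1;   /* access being measured on CE1 */
    }
    return acc;
}

/* Bracket one access of interest with two marker reads on CE2. */
uint32_t bracketed_read(volatile const uint32_t *marker,
                        volatile const uint32_t *target)
{
    uint32_t value;
    (void)*marker;     /* start marker */
    value = *target;   /* code of interest */
    (void)*marker;     /* end marker */
    return value;
}
```

On the target the call would look something like `ce_toggle_reads((volatile const uint32_t *)CE1_FIFO_ADDR, (volatile const uint32_t *)CE2_MARK_ADDR, 1000)`. Keep in mind that every CE1 read pops a word from the FIFO, and check the generated asm as suggested above.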

mikedunn

>
> -Jeff
>
--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________






Re: Re: Slow EMIF transfer - Adolf Klemenz - Jul 13 12:34:51 2009

Dear Dominic,

   C6x CPU reads from the EMIF are always significantly slower than 
expected. This is caused by pipeline and synchronization penalties: the 
data has to cross different clock domains. Also, it takes 4 "delay slots" 
from a read instruction until the data is available in the register file 
and ready to be stored.

You can speed up performance by interleaving multiple read instructions, 
but this requires low-level assembler programming.
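
As a C-level approximation of the same idea (a sketch only, not a substitute for hand-scheduled assembler): issue several reads back to back into locals before storing any of them, so the optimizer at least has the chance to overlap the load delay slots. Whether it actually does so depends on the compiler and options and has to be checked in the generated asm. `fifo_read_block` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Read `count` words from a FIFO mapped at a single address into `dst`.
 * The main loop groups four loads ahead of the four stores. */
void fifo_read_block(volatile const uint32_t *fifo, uint32_t *dst, size_t count)
{
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        uint32_t w0 = *fifo;   /* four consecutive FIFO reads */
        uint32_t w1 = *fifo;
        uint32_t w2 = *fifo;
        uint32_t w3 = *fifo;
        dst[i]     = w0;       /* stores only after all loads are issued */
        dst[i + 1] = w1;
        dst[i + 2] = w2;
        dst[i + 3] = w3;
    }
    for (; i < count; ++i)     /* leftover words, one at a time */
        dst[i] = *fifo;
}
```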

I recommend using EDMA or QDMA to read your FIFO. With DMA you will get 
the expected performance (40 ns read cycle time). Make sure the DMA 
destination is in internal L2RAM - if in external memory (SDRAM for 
example), the EMIF must switch from asynchronous to synchronous mode with 
every transfer, which will dramatically slow down the transfer.
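
The 40 ns figure can be cross-checked against the CE1 value posted earlier in this thread (0x10914221). The sketch below decodes it in C. The bit positions are my recollection of the CE space control register layout for the C621x/C671x EMIF and should be verified against TI's EMIF reference guide. The struct and function names are illustrative.

```c
#include <stdint.h>

/* Fields of a CE space control register (assumed C621x/C671x layout). */
struct cectl_fields {
    unsigned wrsetup, wrstrb, wrhld;   /* write setup / strobe / hold  */
    unsigned rdsetup, rdstrb, rdhld;   /* read setup / strobe / hold   */
    unsigned ta;                       /* turnaround time              */
    unsigned mtype;                    /* memory type                  */
};

struct cectl_fields cectl_decode(uint32_t v)
{
    struct cectl_fields f;
    f.wrsetup = (v >> 28) & 0xFu;
    f.wrstrb  = (v >> 22) & 0x3Fu;
    f.wrhld   = (v >> 20) & 0x3u;
    f.rdsetup = (v >> 16) & 0xFu;
    f.ta      = (v >> 14) & 0x3u;
    f.rdstrb  = (v >> 8)  & 0x3Fu;
    f.mtype   = (v >> 4)  & 0xFu;
    f.rdhld   = v & 0x7u;
    return f;
}

/* Nominal async read cycle: setup + strobe + hold EMIF clocks, in ns. */
unsigned read_cycle_ns(const struct cectl_fields *f, unsigned eclk_mhz)
{
    return (f->rdsetup + f->rdstrb + f->rdhld) * 1000u / eclk_mhz;
}
```

With that layout, 0x10914221 decodes to 1 setup, 2 strobe and 1 hold cycle for reads, which at a 100 MHz EMIF clock is 4 x 10 ns = 40 ns per 32-bit word.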

   Best Regards,
   Adolf Klemenz, D.SignT

At 14:31 13.07.2009 +0000, d.stuartnl wrote:
>Thanks for your reply R.Williams,
>
>I am measuring the instruction with a hardware timer (which creates 0.1665 
>us delay). This means that the read instruction still takes (0.333 - 
>0.1665) 0.1665 us. This seems very slow (+/-6MHz) for a 300 MHz CPU 
>connected to a 100MHz bus. Is there any way to speed this up?

_____________________________________


Re: Slow EMIF transfer - "d.stuartnl" - Jul 13 13:16:23 2009

Hi Jeff,

thanks for your response. I think you're right that measuring such small times
in a crude way leaves a huge error margin. But it's not about the measurements
as such. I first noticed the problem when I collected 1000 bytes: dividing the
measured time by the number of bytes read, I concluded that the reads are very
slow. I then measured single reads with the same (crude) technique (hardware
timer), subtracted the delay the timer imposes, and found that the figure is
fairly consistent: a read takes about 0.16 us. Is this normal behaviour, or am
I overlooking something?

I am using a D.Module.C6713 in combination with a TI FIFO (SN74V215). There is
some glue logic involved (programmed in the onboard CPLD).

I suspect it is a software/configuration problem because I am getting valid
data, only slowly. I would think it was hardware-related if I had no data or
the data was corrupt, but since the data is fine I am at a loss as to why the
transfer speed is so slow.
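
One way to rule out the configuration is to decode the CE1 setting reported in this thread (0x10914221) field by field. The sketch below assumes the CE space control register layout as I read it from TI's EMIF reference guide for the C671x devices; the struct and function names are hypothetical, and the bit positions should be checked against the guide for your exact part.

```c
#include <stdint.h>

/* Timing fields of a CE space control register, in EMIF clock cycles.
   Bit positions are an assumption taken from the EMIF reference guide. */
typedef struct {
    unsigned wr_setup, wr_strobe, wr_hold;
    unsigned rd_setup, rd_strobe, rd_hold;
    unsigned turnaround;   /* TA field */
    unsigned mtype;        /* memory type code */
} cectl_fields;

static cectl_fields decode_cectl(uint32_t v)
{
    cectl_fields f;
    f.wr_setup   = (v >> 28) & 0xFu;            /* bits 31-28 */
    f.wr_strobe  = (v >> 22) & 0x3Fu;           /* bits 27-22 */
    f.wr_hold    = ((v >> 20) & 0x3u)           /* bits 21-20 */
                 | (((v >> 3) & 0x1u) << 2);    /* bit 3 = write hold MSB */
    f.rd_setup   = (v >> 16) & 0xFu;            /* bits 19-16 */
    f.turnaround = (v >> 14) & 0x3u;            /* bits 15-14 */
    f.rd_strobe  = (v >> 8) & 0x3Fu;            /* bits 13-8  */
    f.mtype      = (v >> 4) & 0xFu;             /* bits 7-4   */
    f.rd_hold    = v & 0x7u;                    /* bits 2-0   */
    return f;
}
```

With this layout, 0x10914221 decodes to read setup 1, read strobe 2 and read hold 1, with memory type code 2, which agrees with the "1 setup, 2 strobe and 1 hold" and "async 32" description given in the thread. That would suggest the register value itself is not what makes the reads slow.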

With kind regards,

Dominic

--- In c...@yahoogroups.com, "Jeff Brower" <jbrower@...> wrote:
>
> Dominic-
> 
> > Thanks for your reply R.Williams,
> >
> > I am measuring the instruction with a hardware timer (which creates
> > 0.1665 us delay). This means that the read
> > instruction still takes (0.333 - 0.1665) 0.1665 us. This seems very
> > slow (+/-6MHz) for a 300 MHz CPU connected to a
> > 100MHz bus. Is there any way to speed this up?
> 
> Any "internal" technique you use to measure the duration of something in the
> nsec range is not going to be accurate. As you have found, reading hardware
> timer registers has some inherent delay, and as Richard mentions, a JTAG and/or
> RTDX based method would take so much time the actual memory cycle would end up
> a tiny fraction.
> 
> The only way to accurately measure a single memory cycle time is externally
> (dig scope or LA).  My suggestion to get a worst-case figure would be to make
> three accesses: one to your mem, one to another mem (to force a change in CEn
> lines), and a third one to your mem.  And watch this on the scope.  Then you
> would know both the cycle duration and the amount of time the compiler is
> adding for your line of C code.
> 
> -Jeff
> 
> > --- In c...@yahoogroups.com, "Richard Williams" <rkwill@> wrote:
> >>
> >> d.stuartnl,
> >>
> >> I see 10 instructions in the code:
> >>
> >> MV.L2X A5,B4          <<-- copy source address to working register
> >> LDW.D2T2 *+B4[0],B4   <<-- read value at source
> >> MVK.S1 0xffffd100,A6  <<-- set up low portion of destination address in A6
> >> MVKH.S1 0x10000,A6    <<-- set up high portion of destination address in A6
> >> NOP 2                 <<-- wait two NOP instruction times
> >> STW.D1T2 B4,*+A6[0]   <<-- write value to destination address
> >> NOP                   <<-- wait 1 NOP instruction time
> >>
> >> So the "tmpRead1 = *read1;" takes ~10 instructions.
> >> ( I do not have the H/W details at hand, so cannot supply the specifics on
> >> cycle times for each instruction.  I also do not know the specific processor
> >> well enough to predict the amount of pipeline stalls, etc )
> >>
> >> In general, the operation tmpRead1 = *read1; will take much longer than the
> >> time needed to read the value from the source address.
> >>
> >> Finally, the time measuring tool (?JTAG?) probably imposes some delays for
> >> setup/communication/etc.
> >>
> >> R. Williams
> >>
> >>
> >>
> >> ---------- Original Message -----------
> >> From: "d.stuartnl" <d.stuartnl@>
> >> To: c...@yahoogroups.com
> >> Sent: Mon, 13 Jul 2009 13:14:04 -0000
> >> Subject: [c6x] Re: Slow EMIF transfer
> >>
> >> > Hi all,
> >> >
> >> > I've examined my code and hardware, the FIFO's I'm accessing are
> >> > configured as Asynchronous.
> >> >
> >> > The address space is configured as:
> >> > *0x1800004 = 0x10914221;    /* CE1 = async 32 */
> >> >
> >> > If I understand it correctly this should be 1 setup, 2 strobe and 1
> >> > hold cycle. The EMIF is 100 MHz so one read should take 0.04 us.
> >> >
> >> > When I measure the performance 1 read takes 0.333 us. Where does this
> >> > delay come from?
> >> >
> >> > The code is as follows:
> >> >
> >> > read1 = (int*) 0x90300004;
> >> > tmpRead1 = *read1; // this line takes 0.333 us.
> >> >
> >> > tmpRead1 is defined as a volatile int and resides in IRAM (address:
> >> > 0x0001d100 according to the .map file).
> >> >
> >> > I saw that R. Williams suggested that there may be 10 wait states,
> >> > what are these and how can i verify?
> >> >
> >> > If I look at the assembly code for the tmpRead1 = *read1; line it states:
> >> >
> >> > MV.L2X   A5,B4
> >> > LDW.D2T2 *+B4[0],B4
> >> > MVK.S1   0xffffd100,A6
> >> > MVKH.S1  0x10000,A6
> >> > NOP      2
> >> > STW.D1T2 B4,*+A6[0]
> >> > NOP
> >> >
> >> > I hope anyone can shed some light on this or point me in the right
> >> > direction to debug this problem.
> >> >
> >> > With kind regards,
> >> >
> >> > Dominic Stuart
> >> >
> >> > --- In c...@yahoogroups.com, "d.stuartnl" <d.stuartnl@> wrote:
> >> > >
> >> > > Thanks for all of your responses, I've checked the .gel file and found
> >> > > that the EMIF_CE registers for the FIFO's I'm reading from are configured
> >> > > as 32-bit asynchronous. The FIFO's are capable of Synchronous datatransfer
> >> > > so I will check the SPRU document and program the registers correctfully.
> >> > > I will post a final message if that fixes my problem.
> >> > >
> >> > > --- In c...@yahoogroups.com, "Richard Williams" <rkwill@> wrote:
> >> > > >
> >> > > > D.S,
> >> > > >
> >> > > > it looks, on first examination, like the memory at 0x90300000 has about
> >> > > > 10 wait states.
> >> > > >
> >> > > > Have you examined the actual source code?
> >> > > > I would expect a max of 4 instructions to perform the 'tmpRead1 = *read2'
> >> > > > fetch read1 (source address)
> >> > > > fetch @ read1 (contents)
> >> > > > fetch tmpRead1 (destination address)
> >> > > > store @ tmpRead1 (contents)
> >> > > > R. Williams
> >> > > >
> >> > >
> >> ------- End of Original Message -------
>

_____________________________________






Re: Slow EMIF transfer - "d.stuartnl" - Jul 13 13:16:39 2009

Dear Adolf,

as I understand DMA, I would need to work in "blocks" of data but that
would be very tricky in my application since I do not know how big the
datastream is gonna be. Or is it possible to use DMA for single byte transfers?

With kind regards,

Dominic


_____________________________________


Re: Re: Slow EMIF transfer - Jeff Brower - Jul 13 13:35:33 2009

Dominic-

> thanks for your response. I think you're right in stating that when
> measuring small times in a crude way there is a
> huge error margin. But it's not about the measurements. I first found
> the problem when I collected 1000 bytes when I
> divide those measured results with the bytes read I concluded that
> the reads are very slow. When I measured the single
> reads with the same (crude) technique (hardware timer). I substracted
> the delay the timer imposed and found out that
> the reading is fairly correct, somehow a read takes 0.16 us is this
> normal behaviour or am I overlooking something somehow?
>
> I am using a D.Module.C6713 in combination with a TI FIFO (SN74V215).
> There is some glue logic involved (programmed in the onboard CPLD).
>
> I am suspecting it is a software/configuration problem because I am
> getting valid data only it's slow. I would think
> it was hardware related if I had no data or the data was corrupt but
> since the data is fine I am at a loss why the transfer speed is so slow.

I answered based on single-cycle access time since that's what you asked about.
If it now turns out you're actually concerned about block transfer rate (in your
comments above, 1000 bytes), then I suggest following Adolf's advice regarding DMA.

-Jeff


_____________________________________


Re: Re: Slow EMIF transfer [1 Attachment] - Adolf Klemenz - Jul 14 9:41:57 2009



Re: Slow EMIF transfer - "d.stuartnl" - Jul 14 11:49:39 2009

Thanks for the information. I think I will refrain from using block
transfers because I want to process the data as the DSP receives it. My
function looks like this:

void Calculator_AddSample()
{
   x++;

   read1 = (int*) 0x90300004;
   read2 = (int*) 0x90300008;

   tmpRead1 = *read1;
   tmpRead2 = *read2;
		
   // CHANNEL 1
   CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
   // CHANNEL 2
   CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
   // FWS R+L Add
   if(LRneeded == 1)
   {
      CH1.deloggedData[x] += 	CH2.deloggedData[x];
      if(CH1.deloggedData[x] > 5000)
      {
         CH1.deloggedData[x] = 5000;
      }
   }
   // CHANNEL 3: this channel is always read for particle matching on this channel
   binData[x] = (tmpRead2 & 0xFF);
   CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];

   // CHANNEL 4
   CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
   // CHANNEL 5
   CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
   // CHANNEL 6
   CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
}

This function executes 2 reads from 2 different FIFOs, then separates the
different data channels and decompresses the values with a lookup table.
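
The channel separation described above can be written as a small helper that pulls individual bytes out of the two FIFO words. This is only a sketch: the byte-to-channel mapping is copied from the function above, while the names `byte_of`, `channel_codes` and `unpack_channels` are made up for illustration, and the lookup-table step is left out.

```c
#include <stdint.h>

/* Extract byte n (0 = least significant) from a 32-bit FIFO word. */
static inline unsigned byte_of(uint32_t word, unsigned n)
{
    return (unsigned)((word >> (8u * n)) & 0xFFu);
}

/* Raw 8-bit codes for channels 1..6, stored in ch[0]..ch[5]. */
typedef struct {
    unsigned ch[6];
} channel_codes;

/* Split the two FIFO words into per-channel codes, using the same
   bit fields as Calculator_AddSample(). */
static channel_codes unpack_channels(uint32_t w1, uint32_t w2)
{
    channel_codes c;
    c.ch[0] = byte_of(w1, 2);   /* CHANNEL 1: w1 bits 23..16 */
    c.ch[1] = byte_of(w1, 3);   /* CHANNEL 2: w1 bits 31..24 */
    c.ch[2] = byte_of(w2, 0);   /* CHANNEL 3: w2 bits  7..0  */
    c.ch[3] = byte_of(w2, 1);   /* CHANNEL 4: w2 bits 15..8  */
    c.ch[4] = byte_of(w1, 1);   /* CHANNEL 5: w1 bits 15..8  */
    c.ch[5] = byte_of(w1, 0);   /* CHANNEL 6: w1 bits  7..0  */
    return c;
}
```

Using unsigned words avoids any doubt about how the top byte is shifted, and keeping the codes in locals means each FIFO word is read from the bus only once.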

I am trying to streamline this function so it can keep up with the incoming
data. The data is written to the FIFOs at 4 MHz. The data consists of small
burst packets ranging from 3 to 4096 bytes per channel.

At the moment I start this "prefetch" function when a burst starts and
execute it every time there is data available in the FIFOs (polling the
Empty Flag). I'm prefetching 27.6% of the data before the burst ends. All
variables are in IRAM.

I think I made an error in suspecting the EMIF transfer speed, and I now suspect
that there may be some overhead in the polling scheme I use for calling this
function that results in the slow transfer speed. I will look into this. I would
like to thank everyone for their input.

With kind regards,

Dominic
--- In c...@yahoogroups.com, Adolf Klemenz <adolf.klemenz@...> wrote:
>
> Dear Dominic,
> 
> At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> >as I understand DMA, I would need to work in "blocks" of data but that 
> >would be very tricky in my application since I do not know how big the
> >datastream is gonna be. Or is it possible to use DMA for single byte
> >transfers?
> 
> using DMA makes sense for block transfers only. Typical Fifo applications 
> will use the Fifo's half-full flag (or a similar signal) to trigger a DMA 
> block read.
> You may use element-synchronized DMA (each trigger transfers only one data
> word), but there will be no speed improvement: It takes about 100ns from 
> the EDMA sync event to the actual data transfer on a C6713.
> 
> Attached is a scope screenshot generated by this test program
> 
> // compiled with -o2 and without debug info:
> 
> volatile int buffer; // must be volatile to prevent
>                       // optimizer from code removal
> for (;;)
> {
>      buffer = *(volatile int*)0x90300000;
> }
> 
> The screenshot shows chip select and read signal with the expected timings
> (20ns strobe width). The gap between successive reads is caused by the DSP 
> architecture. Here it is 200ns because a 225MHz DSP was used, which should
> translate to 150ns on a 300MHz device.
> 
> If this isn't fast enough, you must use block transfers.
> 
>    Best Regards,
>    Adolf Klemenz, D.SignT
>

_____________________________________






Re: Re: Slow EMIF transfer - Jeff Brower - Jul 14 12:06:26 2009

Dominic-

> Thanks for the information, I think I will refrain from using block
> transfers because I want to process the data as the DSP receives it.
.
.
.

> At the moment I am starting this "prefetch" function when a burst
> starts and execute this function every time there is data available
> in the FIFO's (polling the Empty Flag). I'm prefetching 27.6% of
> the data before the burst ends. All variables are in IRAM.

The typical reason for doing it that way is to avoid delay (latency) in your
signal processing flow, relative to some output (DAC, GPIO line, digital
transmission, etc).  Is that the case?  If not then a block based method would
be better, otherwise you will waste a lot of time polling for each element.  You
don't have to implement DMA as a first step to get that working, you could use a
code loop.  Then implement DMA in order to further improve performance.
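
A code loop of the kind mentioned above might look roughly like this. It is a sketch under assumptions: the function name is made up, the caller is assumed to have checked some FIFO flag (half-full or similar) once so that `count` words are known to be available, and on the target the `fifo` argument would be a fixed EMIF address such as the 0x90300004 used in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 'count' words from a single FIFO address into 'dst'.
   'fifo' is volatile so that every iteration performs a real bus read
   and the compiler cannot collapse the loop into one access. */
static void fifo_read_block(volatile const uint32_t *fifo,
                            uint32_t *dst, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        dst[i] = *fifo;   /* the source address stays fixed, only dst advances */
    }
}
```

The point of the block form is that the flag is polled once per block rather than once per word; replacing the loop body with a DMA transfer later leaves the calling code unchanged.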

-Jeff


_____________________________________


Re: Slow EMIF transfer - Jeff Brower - Jul 15 12:11:09 2009

Dominic-

> I am indeed trying to avoid delay in processing flow. The data needs to be
> decompressed asap. When that is done the DSP performs calculations on the
> data and based on the outcome of those calculations the DSP generates a
> trigger (GPIO). Your idea of a code loop got me thinking... If a read
> always takes longer than a write, I don't have to poll the Empty Flag and
> can just read the data through a loop like so:
> 
> while(tmpRead1 != 0x84825131 && (x <= 0x1000))
> {
>    Calculator_AddSample();
> }

Ok, so what you're saying is that once you see a "not empty" flag, then you know
the agent on the other side of the FIFO is writing a known block size, and will
write it faster than you can read, so your code just needs to read.

> I've tested this and it did improve the performance but nothing shocking,
> it seems the decompressing via the LookUp Table is creating the bottle
> neck. I've already split the two dimensional LUT into 2 one dimensional
> array's. This also helped a bit.

One thing you might try is hand-optimized asm code just for the read / look-up
sequence, using techniques that Richard was describing.  If you take advantage
of the pipeline, you can improve performance.  For example you can read sample
N, then in the next 4 instructions process the lookup on N-1, waiting for N to
become valid.  It sounds to me like it wouldn't be that much code in your loop,
maybe a dozen or less asm instructions.
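
The "read N, then process N-1" ordering can at least be expressed in C, even though hand-written asm gives direct control over the delay slots. The following is a sketch only: the function name, the one-dimensional 256-entry table and its 16-bit element type are assumptions, and whether the compiler really overlaps the load with the lookup depends on the optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 'count' words from 'fifo' and store the table lookup of the low
   byte of each word in 'out'. The read of sample i is issued before the
   lookup of sample i-1, so the lookup can fill the load's delay slots. */
static void read_and_lookup(volatile const uint32_t *fifo,
                            const uint16_t lut[256],
                            uint16_t *out, size_t count)
{
    uint32_t prev, next;
    size_t i;

    if (count == 0) {
        return;
    }
    prev = *fifo;                        /* prime: read sample 0 */
    for (i = 1; i < count; i++) {
        next = *fifo;                    /* start the read of sample i */
        out[i - 1] = lut[prev & 0xFFu];  /* meanwhile look up sample i-1 */
        prev = next;
    }
    out[count - 1] = lut[prev & 0xFFu];  /* drain: look up the last sample */
}
```

Checking the generated assembly is the only way to confirm that the load and the lookup actually overlap; if they do not, the same structure maps directly onto the asm loop described above.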

-Jeff

PS. Please post to the group, not to me.  Thanks.


_____________________________________




(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: Slow EMIF transfer - "d.stuartnl" - Jul 15 18:21:40 2009

Hi Jeff,

I am indeed trying to create as little delay as possible. When the DSP has all
the data (and has decompressed it), it needs to perform various calculations on
the data, and based on the outcome of those calculations it needs to generate a
trigger (GPIO). So the faster I have my data, the faster I can start processing
it. I think that if I work in blocks I actually lose time: at the moment I can
start decompressing from the first byte, whereas with blocks I can only start
decompressing after a complete block, correct?

Your code loop suggestion got me thinking. If a read always takes more time than
a write, I shouldn't have to poll the Empty Flag. I think I should be able to get
the data like this:

Calculator_AddSample(); 
while(tmpRead1 != 0x84825131 & (x <= 0x1000))
{
  Calculator_AddSample();
}

This should give me a bit more speed since checking the Empty Flag is also a
read to the EMIF bus.
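
A rough sanity check of that assumption, using only figures quoted in this thread (FIFO writes at 4 MHz, about 177 CPU clocks per EMIF read on a 300 MHz device, two reads per sample). The function names are made up for illustration, and the lookup work is assumed to cost nothing:

```c
#include <assert.h>

/* Time taken by a number of CPU clocks, in nanoseconds, at a clock in MHz. */
double clocks_to_ns(double clocks, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return clocks * 1000.0 / cpu_mhz;
}

/* Period of one FIFO write, in nanoseconds, for a write rate in MHz. */
double write_period_ns(double write_mhz)
{
    assert(write_mhz > 0.0);
    return 1000.0 / write_mhz;
}

/* Returns 1 when fetching one sample (reads_per_sample EMIF reads) takes
   longer than the writer needs to produce it, i.e. the FIFO cannot run
   dry in the middle of a burst. Returns 0 otherwise. */
int reader_is_slower(double clocks_per_read, int reads_per_sample,
                     double cpu_mhz, double write_mhz)
{
    double fetch_ns = clocks_to_ns(clocks_per_read * reads_per_sample, cpu_mhz);
    return fetch_ns > write_period_ns(write_mhz);
}
```

With the measured 177 clocks per read, two reads take about 1180 ns against a 250 ns write period, so the reader falls behind and the FIFO fills up rather than running empty. With the roughly 150 ns per read that Adolf estimated for a 300 MHz device, a single read would be faster than the writer and the Empty Flag would still matter.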

I am also thinking of moving some of the calculations into the fetching process,
because the DSP already has the values in its registers when it is storing the
data. I assume this would improve speed as well, since the current code
fetches the stored data again and does the calculations afterwards. It may even
be possible to perform the calculations in real time, so that I don't have to
store the data at all and only store the outcomes of the calculations. If I
implement this, I know for certain that fetching and calculating one read
will take longer than the data takes to be written into the FIFO.
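
A minimal sketch of folding a calculation into the fetch, so that only results are stored. The actual calculations are not shown in this thread, so the running peak and sum below are placeholders, and RunningStats, stats_init and stats_add are hypothetical names. The lookup table is assumed to hold int values indexed by one byte:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running results. The real calculations are not given in
   the thread, so peak and sum only stand in for them. */
typedef struct {
    int    peak;   /* largest decompressed value seen so far */
    long   sum;    /* running sum of decompressed values */
    size_t count;  /* number of samples folded in */
} RunningStats;

void stats_init(RunningStats *s)
{
    s->peak = 0;
    s->sum = 0;
    s->count = 0;
}

/* Fold one raw byte into the running results right after it is fetched,
   while the decompressed value is still held in a local variable. */
void stats_add(RunningStats *s, const int *lut, unsigned raw)
{
    int v;
    assert(raw <= 0xFFu);   /* raw must be a single channel byte */
    v = lut[raw];
    if (v > s->peak)
        s->peak = v;
    s->sum += v;
    s->count++;
}
```

In the read loop, stats_add would be called once per channel byte instead of writing to deloggedData. Whether this is actually faster depends on how much work the real calculations need per sample.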

With kind regards,

Dominic Stuart

--- In c...@yahoogroups.com, Jeff Brower <jbrower@...> wrote:
>
> Dominic-
> 
> > Thanks for the information, I think I will refrain from using block
> > transfers because I want to process the data as the DSP receives it.
> .
> .
> .
> 
> > At the moment I am starting this "prefetch" function when a burst
> > starts and execute this function every time there is data available
> > in the FIFO's (polling the Empty Flag). I'm prefetching 27.6% of
> > the data before the burst ends. All variables are in IRAM.
> 
> The typical reason for doing it that way is to avoid delay (latency) in your signal
> processing flow, relative to some output (DAC, GPIO line, digital transmission,
> etc).  Is that the case?  If not then a block based method would be better, otherwise
> you will waste a lot of time polling for each element.  You don't have to implement
> DMA as a first step to get that working, you could use a code loop.  Then implement
> DMA in order to further improve performance.
> 
> -Jeff
> 
> > My function looks like this:
> > 
> > void Calculator_AddSample()
> > {
> >    x++;
> > 
> >    read1 = (int*) 0x90300004;
> >    read2 = (int*) 0x90300008;
> > 
> >    tmpRead1 = *read1;
> >    tmpRead2 = *read2;
> > 
> >    // CHANNEL 1
> >    CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> >    // CHANNEL 2
> >    CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> >    // FWS R+L Add
> >    if(LRneeded == 1)
> >    {
> >       CH1.deloggedData[x] += CH2.deloggedData[x];
> >       if(CH1.deloggedData[x] > 5000)
> >       {
> >          CH1.deloggedData[x] = 5000;
> >       }
> >    }
> >    // CHANNEL 3 this channel is always read for particle matching on this channel
> >    binData[x] = (tmpRead2 & 0xFF);
> >    CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > 
> >    // CHANNEL 4
> >    CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> >    // CHANNEL 5
> >    CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> >    // CHANNEL 6
> >    CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > }

_____________________________________






Re: Re: Slow EMIF transfer - Richard Williams - Jul 15 18:21:47 2009

Dominic,

There are a couple of problems with the displayed code.
--'x' is not being incremented
--tmpRead1 is not being updated

However, your idea of just using the read operation, since it is much longer
than a write, is a good one.

R. Williams
 
---------- Original Message -----------
From: Jeff Brower <j...@signalogic.com>
To: Dominic Stuart <d...@yahoo.com>
Cc: c...@yahoogroups.com
Sent: Wed, 15 Jul 2009 11:07:55 -0500
Subject: [c6x] Re: Slow EMIF transfer

> Dominic-
> 
> > I am indeed trying to avoid delay in processing flow. The data needs to be
> > decompressed asap. When that is done the DSP performs calculations on the
> > data and based on the outcome of those calculations the DSP generates a
> > trigger (GPIO). Your idea of a code loop got me thinking... If a read
> > always takes longer than a write, I don't have to poll the Empty Flag and
> > can just read the data through a loop like so:
> > 
> > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > {
> >    Calculator_AddSample();
> > }
> 
> Ok, so what you're saying is that once you see a "not empty" flag,
> then you know the agent on the other side of the FIFO is writing a
> known block size, and will write it faster than you can read, so your
> code just needs to read.
> 
> > I've tested this and it did improve the performance but nothing shocking,
> > it seems the decompressing via the LookUp Table is creating the bottleneck.
> > I've already split the two dimensional LUT into 2 one dimensional
> > arrays. This also helped a bit.
> 
> One thing you might try is hand-optimized asm code just for the read / 
> look-up sequence, using techniques that Richard was describing.  If 
> you take advantage of the pipeline, you can improve performance.  For 
> example you can read sample N, then in the next 4 instructions process 
> the lookup on N-1, waiting for N to become valid.  It sounds to me 
> like it wouldn't be that much code in your loop, maybe a dozen or less 
> asm instructions.
> 
> -Jeff
> 
> PS. Please post to the group, not to me.  Thanks.
> 
------- End of Original Message -------

_____________________________________






Re: Slow EMIF transfer - "d.stuartnl" - Jul 15 18:22:16 2009

Jeff,

You're partially correct. The agent on the other side is not writing a known
block size, but it closes the "block" with a trailer value, so I can
determine when the block is over by checking the last read value against the
known trailer value.

Writing in assembly is a step I hope to postpone; I haven't coded in assembly
for many, many moons ;)

I think I'll invest my time in optimizing my C code first. I am currently
reading the "Optimizing C Compiler Tutorial" from the TI website;
there's a lot of info in there. Once I'm comfortable with my C code I will check
whether I can write certain algorithms in assembly to further optimize my system.
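
Jeff's idea of reading sample N while the lookup for sample N-1 is processed can be tried in C before going to assembly. The sketch below is only that, a sketch: drain_overlapped, sim_fifo and FIFO_READ are hypothetical names, the FIFO is simulated with an array so the code can run on a PC, and a single channel byte per word is assumed. On the target FIFO_READ() would be a volatile read such as *(volatile unsigned *)0x90300004. Whether the reordering gains anything depends on the compiler and on how the CPU stalls during external reads, so it would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for the FIFO: each read returns the next array word. */
const unsigned *sim_fifo;
#define FIFO_READ() (*sim_fifo++)

/* Reads words until the trailer is seen or max_samples values are stored.
   The read of sample N is issued before the lookup of sample N-1, so the
   two are independent and can be scheduled to overlap.
   Returns the number of values written to out.
   Note: if the limit is reached first, the last word read is discarded. */
size_t drain_overlapped(const int *lut, int *out,
                        size_t max_samples, unsigned trailer)
{
    size_t n = 0;
    unsigned prev;
    unsigned next;

    assert(max_samples > 0);
    prev = FIFO_READ();                 /* prime the loop with sample 0 */
    while (prev != trailer && n < max_samples) {
        next = FIFO_READ();             /* start fetching sample N */
        out[n++] = lut[prev & 0xFFu];   /* meanwhile look up sample N-1 */
        prev = next;
    }
    return n;
}
```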

I would like to thank everyone for reading/replying to this post!


_____________________________________






Re: Slow EMIF transfer - "d.stuartnl" - Jul 17 9:03:22 2009

R. Williams,

--- In c...@yahoogroups.com, "Richard Williams" <rkwill@...>
wrote:
> Dominic,
> 
> There are a couple of problems with the displayed code.
> --'x' is not being incremented
> --tmpRead1 is not being updated

x and tmpRead1 are updated in the AddSample() routine. Furthermore, I've been
analyzing the compiler's feedback, and it states that it cannot implement
software pipelining because there's a function call (AddSample()) in the loop.
I've removed the AddSample() function and put its code directly into the loop
(see source). There are still some problems (Disqualified loop: Loop carried
dependency bound too large), but I'm working on it :) I've also found out that
pipelining is not being used in a lot of my loops, so I'm guessing that if I
adjust my C code so that software pipelining becomes possible, I will notice an
increase in performance.

Source:

read1 = (int*) 0x90300004;
read2 = (int*) 0x90300008;

tmpRead1 = *read1;
tmpRead2 = *read2;
x = 0;
while(tmpRead1 != 0x84825131 & (x <= 0x1000))
{
   tmpRead1 = *read1;
   tmpRead2 = *read2;

   CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
   CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
   // FWS R+L Add
   if(LRneeded == 1)
   {
      CH1.deloggedData[x] += CH2.deloggedData[x];
      if(CH1.deloggedData[x] > 5000)
      {
         CH1.deloggedData[x] = 5000;
      }
   }
   CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
   binData[x] = (tmpRead2 & 0xFF);
   CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
   CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
   CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
   x++;
}
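
A frequent cause of the "Loop carried dependency bound too large" message is that the compiler cannot rule out aliasing between the stores to the output arrays and the loads in the next iteration. Whether that is the cause here has not been checked. The sketch below shows two changes that may help: keeping intermediate values in locals so each result is stored once, and marking the arrays with restrict. It covers only channels 1 to 3, simulates both FIFOs with arrays so it can run on a PC, and assumes int samples and a 256-entry table. delog_burst, sim_fifo1, sim_fifo2 and the READ_FIFO macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TRAILER   0x84825131u
#define CLAMP_MAX 5000

/* Host-side stand-ins for the two FIFO reads. On the target these would be
   the volatile reads of 0x90300004 and 0x90300008 from the post. */
const unsigned *sim_fifo1;
const unsigned *sim_fifo2;
#define READ_FIFO1() (*sim_fifo1++)
#define READ_FIFO2() (*sim_fifo2++)

/* Decompresses one burst into ch1..ch3 and returns the samples stored.
   The restrict qualifiers promise that the arrays do not overlap. */
size_t delog_burst(int *restrict ch1, int *restrict ch2, int *restrict ch3,
                   const int *restrict lut0, int lr_needed, size_t max_samples)
{
    size_t n = 0;

    assert(max_samples > 0);
    while (n < max_samples) {
        unsigned r1 = READ_FIFO1();
        unsigned r2;
        int v1;
        int v2;

        if (r1 == TRAILER)
            break;                        /* trailer closes the burst */
        r2 = READ_FIFO2();

        v1 = lut0[(r1 >> 16) & 0xFFu];    /* channel 1, kept in a local */
        v2 = lut0[(r1 >> 24) & 0xFFu];    /* channel 2, kept in a local */
        if (lr_needed == 1) {             /* FWS R+L add with clamp */
            v1 += v2;
            if (v1 > CLAMP_MAX)
                v1 = CLAMP_MAX;
        }
        ch1[n] = v1;                      /* each result is stored once */
        ch2[n] = v2;
        ch3[n] = lut0[r2 & 0xFFu];        /* channel 3 */
        n++;
    }
    return n;
}
```

The remaining channels would follow the same pattern as channel 3.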

With kind regards,

Dominic


_____________________________________





Re: Re: Slow EMIF transfer - Richard Williams - Jul 17 10:36:04 2009

d.stuartnl,

I notice that the code, on the first pass through the loop, checks for the
termination value and then throws away the first read values (by reading from
read1 and read2 again). Is that what you wanted to do?

Execution could be made much faster by eliminating the calculations related to
'x' and using pointers to:
CH1.deloggedData,
CH2.deloggedData,
CH3.deloggedData,
CH4.deloggedData,
CH5.deloggedData,
CH6.deloggedData.
Initialize the pointers before the loop and increment them at the end of the
loop.
Also, eliminate 'x' and the related calculation by precalculating the end
address for the loop as:
const endCH1 = &CH1.deloggedData[0x1000];
const termValue = 0x84825131;

pCH1 = &CH1.deloggedData[0];
pCH2 = &CH2.deloggedData[0];
// ... rest of initialization
while( pCH1 < endCH1 )
{
// ... processing
pCH1++;
pCH2++;
// ... rest of incrementing
} // end while()

To avoid processing the termination value from *read1,
and to exit when the termination value is read,
the first code within the 'while' loop would be:
tmpRead1 = *read1;
if (tmpRead1 == termValue ) break;
tmpRead2 = *read2;
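
Put together, the two ideas (walk a pointer up to a precomputed end address,
and test for the termination value before storing anything) could look roughly
like the sketch below. It is only an illustration: the function name
fetch_until_term, the src_step parameter and the uint32_t element type are
assumptions made for the example, not names from the real project. On the DSP,
src would be the EMIF address and src_step would be 0, because a FIFO is read
from the same address every time; a step of 1 lets the same code be tried on a
PC against an ordinary array.

```c
#include <stddef.h>
#include <stdint.h>

#define TERM_VALUE 0x84825131u   /* end-of-burst marker quoted in the thread */

/* Hypothetical helper: stores words read from src into dst until dst is
 * full or the termination value is read. The termination value itself is
 * never stored. Returns the number of words stored. */
size_t fetch_until_term(const volatile uint32_t *src, size_t src_step,
                        uint32_t *dst, size_t capacity)
{
    uint32_t *p = dst;
    uint32_t *const end = dst + capacity;   /* end address computed once */

    while (p < end) {
        uint32_t word = *src;               /* exactly one read per pass */
        if (word == TERM_VALUE)             /* test before storing */
            break;
        *p++ = word;                        /* store, then bump the pointer */
        src += src_step;                    /* 0 for a FIFO, 1 for an array */
    }
    return (size_t)(p - dst);
}
```

Because the read happens once at the top of the loop, no sample is read and
then discarded, which was the concern with the first pass of the original loop.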

R. Williams
---------- Original Message -----------
From: "d.stuartnl" <d...@yahoo.com>
To: c...@yahoogroups.com
Sent: Fri, 17 Jul 2009 10:11:36 -0000
Subject: [c6x] Re: Slow EMIF transfer

> R. Williams,
> <snip>
> 
> x and tmpRead1 are updated in the AddSample() routine. Furthermore,
>  I've been analyzing the compilers feedback and it's stating that it 
> cannot implement software pipelining because there's a function call 
> (AddSample()) in the loop. I've removed the AddSample() function and 
> put the code from the function directly into the loop (see source),
>  there's still some problems (Disqualified loop: Loop carried 
> dependency bound too large). But I'm working on it :) I've also found 
> out that pipelining is not being used in a lot of my loops so I'm 
> guessing if I adjust my C-code so that software pipelining will be 
> possible I will notice an increase in performance.
> 
> Source:
> 
> read1 = (int*) 0x90300004;
> read2 = (int*) 0x90300008;
> 
> tmpRead1 = *read1;
> tmpRead2 = *read2;
> x = 0;
> while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> {
>    tmpRead1 = *read1;
>    tmpRead2 = *read2;
> 
>    CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>    CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>    // FWS R+L Add
>    if(LRneeded == 1)
>    {
>       CH1.deloggedData[x] += CH2.deloggedData[x];
>       if(CH1.deloggedData[x] > 5000)
>       {
>          CH1.deloggedData[x] = 5000;
>       }
>    }
>    CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
>    binData[x] = (tmpRead2 & 0xFF);
>    CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>    CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>    CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
>    x++;
> }
> 
> With kind regards,
> 
> Dominic
> 
> > 
> > However, your idea of just using the read operation, since it is much
longer
> > than a write, is a good one.
> > 
> > R. Williams
> >  
> > 
> > 
> > ---------- Original Message -----------
> > From: Jeff Brower <jbrower@...>
> > To: Dominic Stuart <d.stuartnl@...>
> > Cc: c...@yahoogroups.com
> > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > Subject: [c6x] Re: Slow EMIF transfer
> > 
> > > Dominic-
> > > 
> > > > I am indeed trying to avoid delay in processing flow. The
data needs to be
> > > > decompressed asap. When that is done the DSP performs
calculations on the
> > > > data and based on the outcome of those calculations the DSP
generates a
> > > > trigger (GPIO). Your idea of a code loop got me thinking...
If a read
> > > > always takes longer than a write, I don't have to pull the
Empty Flag and
> > > > can just read the data through a loop like so:
> > > > 
> > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > {
> > > >    Calculator_AddSample();
> > > > }
> > > 
> > > Ok, so what you're saying is that once you see a "not
empty" flag, 
> > > then you know the agent on the other side of the FIFO is writing
a 
> > > known block size, and will write it faster than you can read, so
your 
> > > code just needs to read.
> > > 
> > > > I've tested this and it did improve the performance but
nothing shocking,
> > > > it seems the decompressing via the LookUp Table is creating
the bottle
> > > > neck. I've already split the two dimensional LUT into 2 one
dimensional
> > > > array's. This also helped a bit.
> > > 
> > > One thing you might try is hand-optimized asm code just for the
read / 
> > > look-up sequence, using techniques that Richard was describing. 
If 
> > > you take advantage of the pipeline, you can improve performance. 
For 
> > > example you can read sample N, then in the next 4 instructions
process 
> > > the lookup on N-1, waiting for N to become valid.  It sounds to
me 
> > > like it wouldn't be that much code in your loop, maybe a dozen or
less 
> > > asm instructions.
> > > 
> > > -Jeff
> > > 
> > > PS. Please post to the group, not to me.  Thanks.
> > > 
> > > > --- In c...@yahoogroups.com, Jeff Brower <jbrower@>
wrote:
> > > > >
> > > > > Dominic-
> > > > >
> > > > > > Thanks for the information, I think I will refrain
from using block
> > > > > > transfers because I want to process the data as
the DSP receives it.
> > > > > .
> > > > > .
> > > > > .
> > > > >
> > > > > > At the moment I am starting this
"prefetch" function when a burst
> > > > > > starts and execute this function every time there
is data available
> > > > > > in the FIFO's (polling the Empty Flag). I'm
prefeteching 27.6% of
> > > > > > the data before the burst ends. All variables are
in IRAM.
> > > > >
> > > > > The typical reason for doing it that way is to avoid
delay (latency) in
> > your signal
> > > > > processing flow, relative to some output (DAC, GPIO
line, digital
> > transmission,
> > > > > etc).  Is that the case?  If not then a block based
method would be
> > better, otherwise
> > > > > you will waste a lot of time polling for each element. 
You don't have to
> > implement
> > > > > DMA as a first step to get that working, you could use
a code loop.  Then
> > implement
> > > > > DMA in order to further improve performance.
> > > > >
> > > > > -Jeff
> > > > >
> > > > > > My function looks like this:
> > > > > >
> > > > > > void Calculator_AddSample()
> > > > > > {
> > > > > >    x++;
> > > > > >
> > > > > >    read1 = (int*) 0x90300004;
> > > > > >    read2 = (int*) 0x90300008;
> > > > > >
> > > > > >    tmpRead1 = *read1;
> > > > > >    tmpRead2 = *read2;
> > > > > >
> > > > > >    // CHANNEL 1
> > > > > >    CH1.deloggedData[x] = LUT[0][((tmpRead1 &
0xFF0000) >> 16)];
> > > > > >    // CHANNEL 2
> > > > > >    CH2.deloggedData[x] = LUT[0][((tmpRead1 &
0xFF000000) >> 24)];
> > > > > >    // FWS R+L Add
> > > > > >    if(LRneeded == 1)
> > > > > >    {
> > > > > >       CH1.deloggedData[x] +=   
CH2.deloggedData[x];
> > > > > >       if(CH1.deloggedData[x] > 5000)
> > > > > >       {
> > > > > >          CH1.deloggedData[x] = 5000;
> > > > > >       }
> > > > > >    }
> > > > > >    // CHANNEL 3 this channel is always read for
particle matching on
> > this channel
> > > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > > >    CH3.deloggedData[x] = LUT[0][((tmpRead2 &
0xFF))];
> > > > > >
> > > > > >    // CHANNEL 4
> > > > > >    CH4.deloggedData[x] = LUT[0][((tmpRead2 &
0xFF00) >> 8)];
> > > > > >    // CHANNEL 5
> > > > > >    CH5.deloggedData[x] = LUT[1][((tmpRead1 &
0xFF00) >> 8)];
> > > > > >    // CHANNEL 6
> > > > > >    CH6.deloggedData[x] = LUT[1][tmpRead1 &
0xFF];
> > > > > > }
> > > > > > This function executes 2 reads from 2 different
FIFO's and then
> > seperates the different datachannels and decompresses the value's with
a LookUp
> > Table.
> > > > > >
> > > > > > I am trying to streamline this function so it can
keep up with the
> > incoming data. The data is written to the FIFO's with 4MHz. The data
consists of
> > small burst packets ranging from 3 to 4096 bytes per channel.
> > > > > >
> > > > > > At the moment I am starting this
"prefetch" function when a burst starts
> > and execute this function every time there is data available in the
FIFO's
> > (polling the Empty Flag). I'm prefeteching 27.6% of the data before
the burst
> > ends. All variables are in IRAM.
> > > > > >
> > > > > > I think I made an error in suspecting the EMIF
transfer speed and I now
> > suspect that there may be some overhead in the polling scheme I use
for calling
> > this function that results in the slow transfer speed. I will look
into this. I
> > would like to thank everyone for there input.
> > > > > >
> > > > > > With kind regards,
> > > > > >
> > > > > > Dominic
> > > > > >
> > > > > > --- In c...@yahoogroups.com, Adolf Klemenz
<adolf.klemenz@> wrote:
> > > > > > >
> > > > > > > Dear Dominic,
> > > > > > >
> > > > > > > At 16:45 13.07.2009 +0000, d.stuartnl wrote:
> > > > > > > >as I understand DMA, I would need to work
in "blocks" of data but
that
> > > > > > > >would be very tricky in my application
since I do not know how
big the
> > > > > > > >datastream is gonna be. Or is it possible
to use DMA for single byte
> > transfers?
> > > > > > >
> > > > > > > using DMA makes sense for block transfers
only. Typical Fifo
applications
> > > > > > > will use the Fifo's half-full flag (or a
similar signal) to
trigger a DMA
> > > > > > > block read.
> > > > > > > You may use element-synchronized DMA (each
trigger transfers only
one data
> > > > > > > word), but there will be no speed
improvement: It takes about
100ns from
> > > > > > > the EDMA sync event to the actual data
transfer on a C6713.
> > > > > > >
> > > > > > > Attached is a scope screenshot generated by
this test program
> > > > > > >
> > > > > > > // compiled with -o2 and without debug info:
> > > > > > >
> > > > > > > volatile int buffer; // must be volatile to
prevent
> > > > > > >                       // optimizer from code
removal
> > > > > > > for (;;)
> > > > > > > {
> > > > > > >      buffer = *(volatile int*)0x90300000;
> > > > > > > }
> > > > > > >
> > > > > > > The screenshot shows chip select and read
signal with the expected
timings
> > > > > > > (20ns strobe width). The gap between
sucessive reads is caused by
the DSP
> > > > > > > architecture. Here it is 200ns because a
225MHz DSP was used,
which should
> > > > > > > translate to 150ns on a 300MHz device.
> > > > > > >
> > > > > > > If this isn't fast enough, you must use block
transfers.
> > > > > > >
> > > > > > >    Best Regards,
> > > > > > >    Adolf Klemenz, D.SignT
> > > > >
> > ------- End of Original Message -------
> >
------- End of Original Message -------

_____________________________________


Re: Slow EMIF transfer - "d.stuartnl" - Jul 17 15:56:19 2009

Dear R.Williams,

I changed my code to your suggestion:

void Calculator_FetchData()
{
	volatile float * pCH1;
	volatile float * pCH2;
	volatile float * pCH3;
	volatile float * pCH4;
	volatile float * pCH5;
	volatile float * pCH6;

	const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
	const termValue = 0x84825131;

	pCH1 = &CH1.deloggedData[0];
	pCH2 = &CH2.deloggedData[0];
	pCH3 = &CH3.deloggedData[0];
	pCH4 = &CH4.deloggedData[0];
	pCH5 = &CH5.deloggedData[0];
	pCH6 = &CH6.deloggedData[0];

	tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...

	while(*pCH1 < endCH1)
	{
		tmpRead1 = *read1;
		if(tmpRead1 == termValue) break;
		//CHANNEL 1
		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
		// CHANNEL 2
		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
		if(LRneeded == 1)
		{
			*pCH1 += *pCH2;
			if(*pCH1 > 5000)
			{
				*pCH1 = 5000;
			}
		}
		// CHANNEL 5
		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

		// CHANNEL 6
		*pCH6 = LUT1[tmpRead1 & 0xFF];

		tmpRead2 = *read2;
		
		// CHANNEL 3 this channel is always read for particle matching on this channel
		*pCH3 = LUT0[((tmpRead2 & 0xFF))];	
		// CHANNEL 4
		*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

		pCH1++;
		pCH2++;
		pCH3++;
		pCH4++;
		pCH5++;
		pCH6++;
		x++;
	}
	if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
	{
		processTime = 	TIMER(1)->cnt - tmpprocessTime;
	}
}
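
One detail in the loop condition above may be worth a second look: as written,
*pCH1 < endCH1 compares the float stored at pCH1 with endCH1, which is itself a
float holding a converted address, so the loop is not really bounded by the end
of the array. The sketch below shows the pointer-to-pointer comparison that the
end-address idea relies on, and also how the sample count can be recovered from
the pointer difference instead of a separate x++. The Channel struct, N_SAMPLES
and store_samples are stand-ins invented for the example; the real CH1..CH6
layout is not shown in the thread.

```c
#include <stddef.h>

#define N_SAMPLES 0x1000u

/* Stand-in for the CH1..CH6 structs; the real layout is an assumption. */
typedef struct {
    float deloggedData[N_SAMPLES];
} Channel;

/* Hypothetical helper: copies up to n_values samples into the channel
 * buffer and returns how many were stored. */
size_t store_samples(Channel *ch, const float *values, size_t n_values)
{
    float *p = &ch->deloggedData[0];
    /* The end marker is a pointer (one past the last element), not a float. */
    float *const end = &ch->deloggedData[N_SAMPLES];
    size_t i = 0;

    while (p < end && i < n_values)   /* compares addresses, no dereference */
        *p++ = values[i++];

    /* Number of samples stored, derived from the pointers themselves. */
    return (size_t)(p - &ch->deloggedData[0]);
}
```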

On my test rig I'm offering particles with a fixed length of 985. My previous
code could read 985 samples for 6 channels in 681us. Your suggestion cut that
time down to 601us!!! My first reaction was WOW :P. I have a couple of questions,
though, if you can forgive my ignorance. The big question is WHY? It looks like
the new code is calculating more (6 pointers instead of 1 "x"). I
still left in the x++; because I need to know how many samples have been read.
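
To put those two timings next to the incoming data rate, the per-sample cost can
be worked out as in the small sketch below. It assumes a 300 MHz CPU clock
(which matches the 177 clocks = 0.590 us figure from the first post) and that
4 MHz means one sample per FIFO every 0.25 us; the function names are made up
for the example.

```c
/* Average time per sample, in microseconds, for a burst of n samples. */
double us_per_sample(double total_us, unsigned n)
{
    return total_us / (double)n;
}

/* Converts a time in microseconds to CPU cycles for a clock given in MHz. */
double cycles(double us, double cpu_mhz)
{
    return us * cpu_mhz;
}
```

With the numbers above, 681 us / 985 is about 0.69 us per sample (roughly 207
cycles) and 601 us / 985 is about 0.61 us per sample (roughly 183 cycles). At
one sample every 0.25 us the budget would be about 75 cycles, so under these
assumptions the loop is still more than twice as slow as the incoming data.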

With kind regards,

Dominic Stuart


_____________________________________


Re: Re: Slow EMIF transfer - Richard Williams - Jul 17 22:40:54 2009

d.stuartnl,

The reason the time is quicker, even though there is more code, is that the
code produced for:
CH1.deloggedData[x]
includes quite a lot of math. Calculating an address in an array (base address
plus index times element size) is slow compared to incrementing a pointer.
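
As a rough illustration of the difference, the two functions below do the same
work, once with an index and once with pointers. The names and the float/LUT
types are assumptions for the example only. How much is gained in practice
depends on the compiler, the optimization level, and whether the optimizer can
prove enough about 'x' to do this transformation itself (a global or volatile
index makes that harder), so treat it as a sketch rather than a guarantee.

```c
#include <stddef.h>

/* Indexed form: every access computes base + x * sizeof(element). */
void decode_indexed(float *dst, const float *lut,
                    const unsigned *words, size_t n)
{
    size_t x;
    for (x = 0; x < n; x++)
        dst[x] = lut[(words[x] & 0xFF0000u) >> 16];
}

/* Pointer form: the addresses are carried from one pass to the next and
 * only incremented; the end address is computed once before the loop. */
void decode_pointer(float *dst, const float *lut,
                    const unsigned *words, size_t n)
{
    const unsigned *const end = words + n;
    while (words < end)
        *dst++ = lut[(*words++ & 0xFF0000u) >> 16];
}
```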

R. Williams
---------- Original Message -----------
From: "d.stuartnl" <d...@yahoo.com>
To: c...@yahoogroups.com
Sent: Fri, 17 Jul 2009 17:06:32 -0000
Subject: [c6x] Re: Slow EMIF transfer

> Dear R.Williams,
> 
> I changed my code to your suggestion:
> 
> void Calculator_FetchData()
> {
> 	volatile float * pCH1;
> 	volatile float * pCH2;
> 	volatile float * pCH3;
> 	volatile float * pCH4;
> 	volatile float * pCH5;
> 	volatile float * pCH6;
> 
> 	const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> 	const termValue = 0x84825131;
> 
> 	pCH1 = &CH1.deloggedData[0];
> 	pCH2 = &CH2.deloggedData[0];
> 	pCH3 = &CH3.deloggedData[0];
> 	pCH4 = &CH4.deloggedData[0];
> 	pCH5 = &CH5.deloggedData[0];
> 	pCH6 = &CH6.deloggedData[0];
> 
> 	
> 
> 	tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> 
> 	while(*pCH1 < endCH1)
> 	{
> 		tmpRead1 = *read1;
> 		if(tmpRead1 == termValue) break;
> 		//CHANNEL 1
> 		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> 		// CHANNEL 2
> 		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> 		if(LRneeded == 1)
> 		{
> 			*pCH1 += *pCH2;
> 			if(*pCH1 > 5000)
> 			{
> 				*pCH1 = 5000;
> 			}
> 		}
> 		// CHANNEL 5
> 		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> 
> 		// CHANNEL 6
> 		*pCH6 = LUT1[tmpRead1 & 0xFF];
> 
> 		tmpRead2 = *read2;
> 		
> 		// CHANNEL 3 this channel is always read for particle matching on this channel
> 		*pCH3 = LUT0[((tmpRead2 & 0xFF))];
> 		// CHANNEL 4
> 		*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> 
> 		pCH1++;
> 		pCH2++;
> 		pCH3++;
> 		pCH4++;
> 		pCH5++;
> 		pCH6++;
> 		x++;
> 	}
> 	if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> 	{
> 		processTime = 	TIMER(1)->cnt - tmpprocessTime;
> 	}
> }
> 
> On my testrig I'm offering particles with a fixed lenght of 985. My 
> previous code could read 985 samples for 6 channels in 681us. Your 
> suggestion cut that time down to 601us!!! My first reaction was WOW 
> :P. I have a couple of questions though if you can forgive my 
> ignorance. The big question is WHY? Because it looks like it's 
> calculating more (6 pointers instead of 1 "x"). I still left in
the 
> x++; because I need to know how many samples have been read.
> 
> With kind regards,
> 
> Dominic Stuart
> 
> --- In c...@yahoogroups.com, "Richard Williams"
<rkwill@...> wrote:
> >
> > d.stuartnl,
> > 
> > I notice that the code, during the first loop, checks for the
termination value
> > then throws away the first read values (by reading from read1 and
read2 again).
> > is that you wanted to do?
> > 
> > Execution could be made much faster, by eliminating the calculations
related to
> > 'x' by using pointers to:
> > CH1.deloggedData, 
> > CH2.deloggedData, 
> > CH3.deloggedData, 
> > CH4.deloggedData, 
> > CH5.deloggedData, 
> > CH6.deloggedData.  
> > Initialize the pointers before the loop and increment them at the end
of the
loop.
> > Also, eliminate 'x' and related calculation by precalculating the end
address
> > for the loop as: 
> > const endCH1 = &CH1.deloggedData[0x1000];
> > const termValue = 0x84825131;
> > 
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > --- // rest of initialization
> > while( pCH1 < endCH1 )
> > {
> > ---// processing
> > pCH1++;
> > pCh2++;
> > ...// rest of incrementing
> > } // end while()
> > 
> > to avoid processing the termination value from *read1
> > and to exit when the termination value is read:
> > The first code within the 'while' loop would be:
> > tmpRead1 = *read1;
> > if (tmpRead1 == termValue ) break;
> > tmpRead2 = *read2;
> > 
> > R. Williams
> > 
> > 
> > ---------- Original Message -----------
> > From: "d.stuartnl" <d.stuartnl@...>
> > To: c...@yahoogroups.com
> > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > Subject: [c6x] Re: Slow EMIF transfer
> > 
> > > R. Williams,
> > <snip>
> > > 
> > > x and tmpRead1 are updated in the AddSample() routine.
Furthermore,
> > >  I've been analyzing the compilers feedback and it's stating that
it 
> > > cannot implement software pipelining because there's a function
call 
> > > (AddSample()) in the loop. I've removed the AddSample() function
and 
> > > put the code from the function directly into the loop (see
source),
> > >  there's still some problems (Disqualified loop: Loop carried 
> > > dependency bound too large). But I'm working on it :) I've also
found 
> > > out that pipelining is not being used in a lot of my loops so I'm

> > > guessing if I adjust my C-code so that software pipelining will
be 
> > > possible I will notice an increase in performance.
> > > 
> > > Source:
> > > 
> > > read1 = (int*) 0x90300004;
> > > read2 = (int*) 0x90300008;
> > > 
> > > tmpRead1 = *read1;
> > > tmpRead2 = *read2;
> > > x = 0;
> > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > {
> > >    tmpRead1 = *read1;
> > >    tmpRead2 = *read2;
> > > 
> > >    CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >>
16)];
> > >    CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000)
>> 24)];
> > >    // FWS R+L Add
> > >    if(LRneeded == 1)
> > >    {
> > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > >       if(CH1.deloggedData[x] > 5000)
> > >       {
> > >          CH1.deloggedData[x] = 5000;
> > >       }
> > >    }
> > >    CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > >    binData[x] = (tmpRead2 & 0xFF);
> > >    CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >>
8)];
> > >    CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >>
8)];
> > >    CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > >    x++;
> > > }
> > > 
> > > With kind regards,
> > > 
> > > Dominic
> > > 
> > > > 
> > > > However, your idea of just using the read operation, since
it is much longer
> > > > than a write, is a good one.
> > > > 
> > > > R. Williams
> > > >  
> > > > 
> > > > 
> > > > ---------- Original Message -----------
> > > > From: Jeff Brower <jbrower@>
> > > > To: Dominic Stuart <d.stuartnl@>
> > > > Cc: c...@yahoogroups.com
> > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > 
> > > > > Dominic-
> > > > > 
> > > > > > I am indeed trying to avoid delay in processing
flow. The data needs
to be
> > > > > > decompressed asap. When that is done the DSP
performs calculations
on the
> > > > > > data and based on the outcome of those
calculations the DSP generates a
> > > > > > trigger (GPIO). Your idea of a code loop got me
thinking... If a read
> > > > > > always takes longer than a write, I don't have to
pull the Empty
Flag and
> > > > > > can just read the data through a loop like so:
> > > > > > 
> > > > > > while(tmpRead1 != 0x84825131 & (x <=
0x1000))
> > > > > > {
> > > > > >    Calculator_AddSample();
> > > > > > }
> > > > > 
> > > > > Ok, so what you're saying is that once you see a
"not empty" flag, 
> > > > > then you know the agent on the other side of the FIFO
is writing a 
> > > > > known block size, and will write it faster than you can
read, so your 
> > > > > code just needs to read.
> > > > > 
> > > > > > I've tested this and it did improve the
performance but nothing
shocking,
> > > > > > it seems the decompressing via the LookUp Table is
creating the bottle
> > > > > > neck. I've already split the two dimensional LUT
into 2 one dimensional
> > > > > > array's. This also helped a bit.
> > > > > 
> > > > > One thing you might try is hand-optimized asm code just
for the read / 
> > > > > look-up sequence, using techniques that Richard was
describing.  If 
> > > > > you take advantage of the pipeline, you can improve
performance.  For 
> > > > > example you can read sample N, then in the next 4
instructions process 
> > > > > the lookup on N-1, waiting for N to become valid.  It
sounds to me 
> > > > > like it wouldn't be that much code in your loop, maybe
a dozen or less 
> > > > > asm instructions.
> > > > > 
> > > > > -Jeff
> > > > > 
> > > > > PS. Please post to the group, not to me.  Thanks.
> > > > > 
> > > > > > --- In c...@yahoogroups.com, Jeff Brower
<jbrower@> wrote:
> > > > > > >
> > > > > > > Dominic-
> > > > > > >
> > > > > > > > Thanks for the information, I think I
will refrain from using block
> > > > > > > > transfers because I want to process the
data as the DSP receives it.
> > > > > > > .
> > > > > > > .
> > > > > > > .
> > > > > > >
> > > > > > > > At the moment I am starting this
"prefetch" function when a burst
> > > > > > > > starts and execute this function every
time there is data available
> > > > > > > > in the FIFO's (polling the Empty Flag).
I'm prefeteching 27.6% of
> > > > > > > > the data before the burst ends. All
variables are in IRAM.
> > > > > > >
> > > > > > > The typical reason for doing it that way is
to avoid delay
(latency) in
> > > > your signal
> > > > > > > processing flow, relative to some output
(DAC, GPIO line, digital
> > > > transmission,
> > > > > > > etc).  Is that the case?  If not then a block
based method would be
> > > > better, otherwise
> > > > > > > you will waste a lot of time polling for each
element.  You don't
have to
> > > > implement
> > > > > > > DMA as a first step to get that working, you
could use a code
loop.  Then
> > > > implement
> > > > > > > DMA in order to further improve performance.
> > > > > > >
> > > > > > > -Jeff
> > > > > > >
> > > > > > > > My function looks like this:
> > > > > > > >
> > > > > > > > void Calculator_AddSample()
> > > > > > > > {
> > > > > > > >    x++;
> > > > > > > >
> > > > > > > >    read1 = (int*) 0x90300004;
> > > > > > > >    read2 = (int*) 0x90300008;
> > > > > > > >
> > > > > > > >    tmpRead1 = *read1;
> > > > > > > >    tmpRead2 = *read2;
> > > > > > > >
> > > > > > > >    // CHANNEL 1
> > > > > > > >    CH1.deloggedData[x] =
LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > >    // CHANNEL 2
> > > > > > > >    CH2.deloggedData[x] =
LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > > >    // FWS R+L Add
> > > > > > > >    if(LRneeded == 1)
> > > > > > > >    {
> > > > > > > >       CH1.deloggedData[x] +=   
CH2.deloggedData[x];
> > > > > > > >       if(CH1.deloggedData[x] > 5000)
> > > > > > > >       {
> > > > > > > >          CH1.deloggedData[x] = 5000;
> > > > > > > >       }
> > > > > > > >    }
> > > > > > > >    // CHANNEL 3 this channel is always
read for particle matching on
> > > > this channel
> > > > > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > > > > >    CH3.deloggedData[x] =
LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > >
> > > > > > > >    // CHANNEL 4
> > > > > > > >    CH4.deloggedData[x] =
LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > >    // CHANNEL 5
> > > > > > > >    CH5.deloggedData[x] =
LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > >    // CHANNEL 6
> > > > > > > >    CH6.deloggedData[x] = LUT[1][tmpRead1
& 0xFF];
> > > > > > > > }
> > > > > > > > This function executes 2 reads from 2
different FIFO's and then
> > > > seperates the different datachannels and decompresses the
value's with a
LookUp
> > > > Table.
> > > > > > > >
> > > > > > > > I am trying to streamline this function
so it can keep up with the
> > > > incoming data. The data is written to the FIFO's with 4MHz.
The data
consists of
> > > > small burst packets ranging from 3 to 4096 bytes per
channel.
> > > > > > > >
> > > > > > > > At the moment I am starting this
"prefetch" function when a
burst starts
> > > > and execute this function every time there is data available
in the FIFO's
> > > > (polling the Empty Flag). I'm prefeteching 27.6% of the data
before the
burst
> > > > ends. All variables are in IRAM.
> > > > > > > >
> > > > > > > > I think I made an error in suspecting
the EMIF transfer speed
and I now
> > > > suspect that there may be some overhead in the polling
scheme I use for
calling
> > > > this function that results in the slow transfer speed. I
will look into
this. I
> > > > would like to thank everyone for there input.
> > > > > > > >
> > > > > > > > With kind regards,
> > > > > > > >
> > > > > > > > Dominic
> > > > > > > >
> > > > > > > > --- In c...@yahoogroups.com, Adolf
Klemenz <adolf.klemenz@> wrote:
> > > > > > > > >
> > > > > > > > > Dear Dominic,
> > > > > > > > >
> > > > > > > > > At 16:45 13.07.2009 +0000,
d.stuartnl wrote:
> > > > > > > > > >as I understand DMA, I would
need to work in "blocks" of data but
> > that
> > > > > > > > > >would be very tricky in my
application since I do not know how
> > big the
> > > > > > > > > >datastream is gonna be. Or is
it possible to use DMA for
single byte
> > > > transfers?
> > > > > > > > >
> > > > > > > > > using DMA makes sense for block
transfers only. Typical Fifo
> > applications
> > > > > > > > > will use the Fifo's half-full flag
(or a similar signal) to
> > trigger a DMA
> > > > > > > > > block read.
> > > > > > > > > You may use element-synchronized
DMA (each trigger transfers only
> > one data
> > > > > > > > > word), but there will be no speed
improvement: It takes about
> > 100ns from
> > > > > > > > > the EDMA sync event to the actual
data transfer on a C6713.
> > > > > > > > >
> > > > > > > > > Attached is a scope screenshot
generated by this test program
> > > > > > > > >
> > > > > > > > > // compiled with -o2 and without
debug info:
> > > > > > > > >
> > > > > > > > > volatile int buffer; // must be
volatile to prevent
> > > > > > > > >                       // optimizer
from code removal
> > > > > > > > > for (;;)
> > > > > > > > > {
> > > > > > > > >      buffer = *(volatile
int*)0x90300000;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > The screenshot shows chip select
and read signal with the expected
> > timings
> > > > > > > > > (20ns strobe width). The gap
between sucessive reads is caused by
> > the DSP
> > > > > > > > > architecture. Here it is 200ns
because a 225MHz DSP was used,
> > which should
> > > > > > > > > translate to 150ns on a 300MHz
device.
> > > > > > > > >
> > > > > > > > > If this isn't fast enough, you must
use block transfers.
> > > > > > > > >
> > > > > > > > >    Best Regards,
> > > > > > > > >    Adolf Klemenz, D.SignT
> > > > > > >
> > > > ------- End of Original Message -------
> > > >
> > ------- End of Original Message -------
> >
------- End of Original Message -------

_____________________________________





(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: Slow EMIF transfer - "d.stuartnl" - Jul 22 10:38:14 2009

Hi all,

I'm trying to further optimize my code, but for some reason I cannot get
software pipelining to work. I've checked several documents (SPRU425, Optimizing
C Compiler Tutorial, SPRA666, Hand Tuning Loops and Control Code). These
documents primarily focus on improving pipelined loops, but in my .asm file the
compiler keeps stating "Unsafe schedule for irregular loop". It produces the
following Software Pipeline Information:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 362
;*      Loop opening brace source line   : 363
;*      Loop closing brace source line   : 397
;*      Known Minimum Trip Count         : 1                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 110
;*      Unpartitioned Resource Bound     : 16
;*      Partitioned Resource Bound(*)    : 16
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     2        1     
;*      .S units                     4        5     
;*      .D units                    15       16*    
;*      .M units                     0        0     
;*      .X cross paths               0        0     
;*      .T address paths            15       16*    
;*      Long read paths              4        6     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           2        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        3     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             4        3     
;*      Bound(.L .S .D .LS .LSD)     8        9     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 110 Unsafe schedule for irregular loop
;*         ii = 110 Unsafe schedule for irregular loop
;*         ii = 110 Unsafe schedule for irregular loop
;*         ii = 110 Did not find schedule
;*         ii = 111 Unsafe schedule for irregular loop
;*         ii = 111 Unsafe schedule for irregular loop
;*         ii = 111 Unsafe schedule for irregular loop
;*         ii = 111 Did not find schedule
;*         ii = 113 Unsafe schedule for irregular loop
;*         ii = 113 Unsafe schedule for irregular loop
;*         ii = 113 Unsafe schedule for irregular loop
;*         ii = 113 Did not find schedule
;*         ii = 117 Unsafe schedule for irregular loop
;*         ii = 117 Unsafe schedule for irregular loop
;*         ii = 117 Unsafe schedule for irregular loop
;*         ii = 117 Did not find schedule
;*      Disqualified loop: Did not find schedule
;*----------------------------------------------------------------------------*
My code is as follows:

void Calculator_FetchData(volatile int * restrict p1, volatile int * restrict p2)
{
	volatile int tmpRead1;
	volatile int tmpRead2;
	volatile int tmpStore1;
	volatile int tmpStore2;
	volatile float * restrict pCH1;
	volatile float * restrict pCH2;
	volatile float * restrict pCH3;
	volatile float * restrict pCH4;
	volatile float * restrict pCH5;
	volatile float * restrict pCH6;

	const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
	const termValue = 0x84825131;

	pCH1 = &CH1.deloggedData[0];
	pCH2 = &CH2.deloggedData[0];
	pCH3 = &CH3.deloggedData[0];
	pCH4 = &CH4.deloggedData[0];
	pCH5 = &CH5.deloggedData[0];
	pCH6 = &CH6.deloggedData[0];

	while((*pCH1 < endCH1) & (tmpRead1 != termValue))
	{
		tmpRead1 = *p1;

		//CHANNEL 1
		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
		// CHANNEL 2
		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
		if(LRneeded == 1)
		{
			*pCH1 += *pCH2;
			if(*pCH1 > 5000)
			{
				*pCH1 = 5000;
			}
		}
		//CHANNEL 5
		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

		// CHANNEL 6
		*pCH6 = LUT1[tmpRead1 & 0xFF];
		
		tmpRead2 = *p2;

		// CHANNEL 3 this channel is always read for particle matching on this channel
		*pCH3 = LUT0[((tmpRead2 & 0xFF))];	
		// CHANNEL 4
		*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];

		pCH1++;
		pCH2++;
		pCH3++;
		pCH4++;
		pCH5++;
		pCH6++;
	}
	x = (int) (pCH1 - &CH1.deloggedData[0]);
}

Is there a way I can change my C-code so the DSP can Pipeline? I think it should
be possible to have at least 2 iterations in parallel:

1st       2nd
read1
delog1    
read2     read1
delog2    delog1
etc..
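
The two-column overlap above can also be written by hand in C. What follows is only a sketch under assumptions and has not been run on a C6713: the name `fetch_overlapped`, its parameters, and the single-channel lookup are hypothetical simplifications of the loop in this post. It issues the read for sample N+1 before doing the lookup for sample N, which gives the compiler the chance to schedule the lookup while the read is outstanding. Whether that hides any EMIF latency on real hardware is not something this sketch demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: on the board, `fifo` would be the EMIF FIFO address
 * and `lut` one of the 256-entry delog tables from the post. */
size_t fetch_overlapped(const volatile uint32_t *fifo, const float *lut,
                        float *out, size_t max_samples, uint32_t term_value)
{
    size_t n = 0;
    uint32_t current = *fifo;             /* prologue: read sample 0 */

    while (n < max_samples && current != term_value) {
        uint32_t next = *fifo;            /* start the read of sample N+1 ... */
        out[n++] = lut[current & 0xFFu];  /* ... then delog sample N          */
        current = next;
    }
    /* Note: if the loop stops because the buffer is full, one word has
     * already been taken from the FIFO and is discarded here. */
    return n;
}
```

The return value is the number of samples stored, so the caller does not need a separate counter.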

I've already used the "restrict" keyword on the pointers I use since
these pointers do not overlap. I'm using the following compiler options:

-k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug"
-d"CHIP_6713" -d"DEBUG" -mt -mw -mh -mr1 -mv6710
--mem_model:data=far --consultant

Any advice or pointers to documents regarding how to enable pipelining, other
than the ones mentioned above, would be helpful.

With kind regards,

Dominic Stuart

PS: I don't know if it's useful, but the following asm code is being produced:

$C$L6:    
$C$DW$L$_Calculator_FetchData$2$B:
	.dwpsn	file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
;**	-----------------------g3:
;** 364	-----------------------    tmpRead1 = *p1;
;** 367	-----------------------    *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
;** 369	-----------------------    *(++pCH2) = K$32[((unsigned)tmpRead1>>22>>2)];
;** 370	-----------------------    if ( LRneeded != 1 ) goto g6;
;** 372	-----------------------    *pCH1 = *pCH1+*pCH2;
;** 373	-----------------------    if ( *pCH1 <= K$36 ) goto g6;
;** 375	-----------------------    *pCH1 = K$36;
;**	-----------------------g6:
;** 379	-----------------------    *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
;** 382	-----------------------    *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
;** 384	-----------------------    tmpRead2 = *p2;
;** 387	-----------------------    *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
;** 389	-----------------------    *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
;** 397	-----------------------    if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
           LDW     .D1T2   *A4,B4            ; |364| 
           ZERO    .L1     A1
           NOP             3
           STW     .D2T2   B4,*+SP(4)        ; |364| 
           LDW     .D2T2   *+SP(4),B4        ; |367| 
           NOP             4
           EXTU    .S2     B4,8,24,B4        ; |367| 
           LDW     .D2T1   *+B7[B4],A9       ; |367| 
           NOP             4
           STW     .D1T1   A9,*A8            ; |367| 
           LDW     .D2T2   *+SP(4),B4        ; |369| 
           NOP             4
           SHRU    .S2     B4,24,B4          ; |369| 
           LDW     .D2T2   *+B7[B4],B4       ; |369| 
           NOP             4
           STW     .D2T2   B4,*++B5          ; |369| 
           LDHU    .D1T2   *A11,B4           ; |370| 
           NOP             4
           CMPEQ   .L2     B4,1,B0           ; |370| 

   [ B0]   LDW     .D1T1   *A8,A9            ; |372| 
|| [ B0]   LDW     .D2T2   *B5,B4            ; |372| 

           NOP             4
   [ B0]   ADDSP   .L1X    B4,A9,A9          ; |372| 
           NOP             3
   [ B0]   STW     .D1T1   A9,*A8            ; |372| 
   [ B0]   LDW     .D1T1   *A8,A9            ; |373| 
           NOP             4
   [ B0]   CMPGTSP .S1     A9,A2,A9          ; |373| 
   [ B0]   MV      .L1     A9,A1
   [ A1]   STW     .D1T1   A2,*A8            ; |375| 
           LDW     .D2T1   *+SP(4),A9        ; |379| 
           NOP             4
           EXTU    .S1     A9,16,24,A9       ; |379| 
           LDW     .D1T1   *+A10[A9],A9      ; |379| 
           NOP             4
           STW     .D1T1   A9,*A5++          ; |379| 
           LDW     .D2T1   *+SP(4),A9        ; |382| 
           NOP             4
           EXTU    .S1     A9,24,24,A9       ; |382| 
           LDW     .D1T1   *+A10[A9],A9      ; |382| 
           NOP             4
           STW     .D1T1   A9,*A3++          ; |382| 
           LDW     .D1T2   *A0,B4            ; |384| 
           NOP             4
           STW     .D2T2   B4,*+SP(8)        ; |384| 
           LDW     .D2T2   *+SP(8),B4        ; |387| 
           NOP             4
           EXTU    .S2     B4,24,24,B4       ; |387| 
           LDW     .D2T1   *+B7[B4],A9       ; |387| 
           NOP             4
           STW     .D1T1   A9,*A6++          ; |387| 
           LDW     .D2T2   *+SP(8),B4        ; |389| 
           NOP             4
           EXTU    .S2     B4,16,24,B4       ; |389| 
           LDW     .D2T1   *+B7[B4],A9       ; |389| 
           NOP             4
           STW     .D1T1   A9,*A7++          ; |389| 
           LDW     .D2T2   *+SP(12),B4       ; |397| 

           LDW     .D1T1   *++A8,A9          ; |397| 
||         LDW     .D2T2   *+SP(4),B8        ; |397| 

           NOP             4

           CMPEQ   .L2     B8,B6,B8          ; |397| 
||         CMPLTSP .S2X    A9,B4,B4          ; |397| 

           XOR     .L2     1,B8,B8           ; |397| 
           AND     .L2     B8,B4,B0          ; |397| 
   [ B0]   B       .S1     $C$L6             ; |397| 
	.dwpsn	file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
           NOP             5
           ; BRANCHCC OCCURS {$C$L6}         ; |397| 

--- In c...@yahoogroups.com, "Richard Williams" <rkwill@...>
wrote:
> d.stuartnl,
> 
> The reason the time is quicker, even though there is more code, is that the
> code produced to do:
> CH1.deloggedData[x]
> includes quite a lot of math; calculating an address in an array is slow
> compared to incrementing a pointer.
> 
> R. Williams
> ---------- Original Message -----------
> From: "d.stuartnl" <d.stuartnl@...>
> To: c...@yahoogroups.com
> Sent: Fri, 17 Jul 2009 17:06:32 -0000
> Subject: [c6x] Re: Slow EMIF transfer
> 
> > Dear R.Williams,
> > 
> > I changed my code to your suggestion:
> > 
> > void Calculator_FetchData()
> > {
> > 	volatile float * pCH1;
> > 	volatile float * pCH2;
> > 	volatile float * pCH3;
> > 	volatile float * pCH4;
> > 	volatile float * pCH5;
> > 	volatile float * pCH6;
> > 
> > 	const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > 	const termValue = 0x84825131;
> > 
> > 	pCH1 = &CH1.deloggedData[0];
> > 	pCH2 = &CH2.deloggedData[0];
> > 	pCH3 = &CH3.deloggedData[0];
> > 	pCH4 = &CH4.deloggedData[0];
> > 	pCH5 = &CH5.deloggedData[0];
> > 	pCH6 = &CH6.deloggedData[0];
> > 
> > 	
> > 
> > 	tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> > 
> > 	while(*pCH1 < endCH1)
> > 	{
> > 		tmpRead1 = *read1;
> > 		if(tmpRead1 == termValue) break;
> > 		//CHANNEL 1
> > 		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > 		// CHANNEL 2
> > 		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > 		if(LRneeded == 1)
> > 		{
> > 			*pCH1 += *pCH2;
> > 			if(*pCH1 > 5000)
> > 			{
> > 				*pCH1 = 5000;
> > 			}
> > 		}
> > 		// CHANNEL 5
> > 		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > 
> > 		// CHANNEL 6
> > 		*pCH6 = LUT1[tmpRead1 & 0xFF];
> > 
> > 		tmpRead2 = *read2;
> > 		
> > 		// CHANNEL 3 this channel is always read for particle matching on this channel
> > 		*pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > 		// CHANNEL 4
> > 		*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > 
> > 		pCH1++;
> > 		pCH2++;
> > 		pCH3++;
> > 		pCH4++;
> > 		pCH5++;
> > 		pCH6++;
> > 		x++;
> > 	}
> > 	if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> > 	{
> > 		processTime = 	TIMER(1)->cnt - tmpprocessTime;
> > 	}
> > }
> > 
> > On my testrig I'm offering particles with a fixed length of 985. My 
> > previous code could read 985 samples for 6 channels in 681us. Your 
> > suggestion cut that time down to 601us!!! My first reaction was WOW 
> > :P. I have a couple of questions though, if you can forgive my 
> > ignorance. The big question is WHY? Because it looks like it's 
> > calculating more (6 pointers instead of 1 "x"). I still left in the 
> > x++; because I need to know how many samples have been read.
> > 
> > With kind regards,
> > 
> > Dominic Stuart
> > 
> > --- In c...@yahoogroups.com, "Richard Williams"
<rkwill@> wrote:
> > >
> > > d.stuartnl,
> > > 
> > > I notice that the code, during the first loop, checks for the termination
> > > value, then throws away the first read values (by reading from read1 and
> > > read2 again). Is that what you wanted to do?
> > > 
> > > Execution could be made much faster by eliminating the calculations
> > > related to 'x' and using pointers to:
> > > CH1.deloggedData, 
> > > CH2.deloggedData, 
> > > CH3.deloggedData, 
> > > CH4.deloggedData, 
> > > CH5.deloggedData, 
> > > CH6.deloggedData.  
> > > Initialize the pointers before the loop and increment them at the end of
> > > the loop.
> > > Also, eliminate 'x' and the related calculation by precalculating the end
> > > address for the loop as:
> > > const endCH1 = &CH1.deloggedData[0x1000];
> > > const termValue = 0x84825131;
> > > 
> > > pCH1 = &CH1.deloggedData[0];
> > > pCH2 = &CH2.deloggedData[0];
> > > --- // rest of initialization
> > > while( pCH1 < endCH1 )
> > > {
> > > ---// processing
> > > pCH1++;
> > > pCh2++;
> > > ...// rest of incrementing
> > > } // end while()
> > > 
> > > to avoid processing the termination value from *read1
> > > and to exit when the termination value is read:
> > > The first code within the 'while' loop would be:
> > > tmpRead1 = *read1;
> > > if (tmpRead1 == termValue ) break;
> > > tmpRead2 = *read2;
> > > 
> > > R. Williams
> > > 
> > > 
> > > ---------- Original Message -----------
> > > From: "d.stuartnl" <d.stuartnl@>
> > > To: c...@yahoogroups.com
> > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > Subject: [c6x] Re: Slow EMIF transfer
> > > 
> > > > R. Williams,
> > > <snip>
> > > > 
> > > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > > I've been analyzing the compiler's feedback and it's stating that it
> > > > cannot implement software pipelining because there's a function call
> > > > (AddSample()) in the loop. I've removed the AddSample() function and
> > > > put the code from the function directly into the loop (see source);
> > > > there are still some problems (Disqualified loop: Loop carried
> > > > dependency bound too large). But I'm working on it :) I've also found
> > > > out that pipelining is not being used in a lot of my loops, so I'm
> > > > guessing that if I adjust my C code so that software pipelining is
> > > > possible I will notice an increase in performance.
> > > > 
> > > > Source:
> > > > 
> > > > read1 = (int*) 0x90300004;
> > > > read2 = (int*) 0x90300008;
> > > > 
> > > > tmpRead1 = *read1;
> > > > tmpRead2 = *read2;
> > > > x = 0;
> > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > {
> > > >    tmpRead1 = *read1;
> > > >    tmpRead2 = *read2;
> > > > 
> > > >    CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > >    CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > >    // FWS R+L Add
> > > >    if(LRneeded == 1)
> > > >    {
> > > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > > >       if(CH1.deloggedData[x] > 5000)
> > > >       {
> > > >          CH1.deloggedData[x] = 5000;
> > > >       }
> > > >    }
> > > >    CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > >    binData[x] = (tmpRead2 & 0xFF);
> > > >    CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > >    CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > >    CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > >    x++;
> > > > }
> > > > 
> > > > With kind regards,
> > > > 
> > > > Dominic
> > > > 
> > > > > 
> > > > > However, your idea of just using the read operation,
since it is much longer
> > > > > than a write, is a good one.
> > > > > 
> > > > > R. Williams
> > > > >  
> > > > > 
> > > > > 
> > > ------- End of Original Message -------
> > >
> ------- End of Original Message -------
>

_____________________________________


Re: Re: Slow EMIF transfer - Richard Williams - Jul 22 21:10:56 2009

d.stuartnl,

Assuming that the included code is actually the code being compiled:
The line: x = (int) (pCH1 - &CH1.deloggedData[0]);
already gives the number of entries rather than the number of bytes, because
C scales a pointer subtraction by the size of the pointed-to type.
No division by sizeof(float) is needed; dividing would make 'x' too small
by a factor of sizeof(float).
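
A quick check on the pointer arithmetic discussed above: in standard C, subtracting two pointers into the same array yields a count of elements, not bytes. The helper name `samples_written` below is hypothetical and exists only to make that visible.

```c
#include <stddef.h>

/* Returns how many floats lie between `start` and `cursor`.
 * Both pointers must point into (or one past) the same array. */
ptrdiff_t samples_written(const float *start, const float *cursor)
{
    return cursor - start;   /* already in units of float, not bytes */
}
```

With a buffer `float buf[16]`, `samples_written(buf, buf + 5)` is 5, regardless of `sizeof(float)`.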
The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
is using a local variable tmpRead1 before it is set, 
so garbage is being used in the comparison.
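
One way to avoid testing a value before it has been read is to do the read at the top of the loop body and break on the terminator, as suggested earlier in this thread. The sketch below is hypothetical (the names `copy_until_terminator`, `src_reg`, `dst` and `capacity` are not from the posted code) and simply copies words instead of doing the lookups. It also bounds the loop with a count rather than with the contents of the output buffer, which is an assumption about what the original condition was meant to do.

```c
#include <stddef.h>
#include <stdint.h>

#define TERM_VALUE 0x84825131u   /* terminator word used in the posted code */

size_t copy_until_terminator(const volatile uint32_t *src_reg,
                             uint32_t *dst, size_t capacity)
{
    size_t count = 0;

    while (count < capacity) {
        uint32_t word = *src_reg;   /* read first ...                      */
        if (word == TERM_VALUE) {   /* ... then test a value that is known */
            break;
        }
        dst[count++] = word;
    }
    return count;
}
```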
The lines:
volatile int tmpStore1; 
volatile int tmpStore2; 
describe two local variables that are not being used, so should be deleted.
The lines:
>    volatile int * restrict p1, 
>    volatile int * restrict p2) 
have the parameter names p1 and p2.
This yields no useful information about the target of the pointers,
so should be renamed to something meaningful.
The loading of the local variable tmpRead2 
tmpRead2 = *p2;
is being immediately used in the next line.
It takes some 4 cycles for the load to complete, so the 
load should be placed several cycles/lines earlier in the source.
The line: if(LRneeded == 1)
is referencing a global variable.  
This makes for maintenance problems and pipelining problems.
It would be better passed in as one of the parameters (and have a
local/parameter name).
We have previously discussed the use of the <tab> character and the problems it
produces.
I replaced the <tab> characters with spaces in the copied code.
The term 'volatile' is not needed in any of the variables as the code does not
have repeating lines and the variables will not (unexpectedly) change during
the execution of the code.
Removing the 'volatile' will speed up the code because the values, once into a
CPU register, will not have to be re-read at each usage of the value.
The line: const termValue = 0x84825131
is missing the 'type' for the constant.
I would suggest adding 'int' after the 'const'.
The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
is performing a bit-wise 'and' between two logical conditions.
It should be: while( (*pCH1 < endCH1) && (tmpRead1 != termValue) )
so it performs a logical 'and' between two logical conditions.
The use of the global variable 'x' is a maintenance problem.
I would suggest a modification to have a local variable 'x' and return the
value of 'x' rather than returning 'void';
let the caller assign the returned value to the global variable 'x'.
In general, for a loop to be pipelined, the loop must be relatively simple.
Therefore, I would suggest making this two loops,
one for tmpRead1 reading and calculations
one for tmpRead2 reading and calculations
However, for parallel operation, the tmpRead1 and tmpRead2 operations could be
(somewhat) merged in a single loop.
To help absorb the needed CPU cycles after a read of p1 and p2, I would put the
first read(s) before the 'while' statement(s) and read again at the end of the
loop, just before the incrementing of the pCHx pointers.
R. Williams
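
The read-ahead arrangement from the last point (first read before the loop, next read at the end of the body) might be sketched as below. This is only a host-side model under stated assumptions: the FIFO is replaced by a hypothetical `fifo_read()` that walks an ordinary array, a single output channel stands in for the six channels, and `fetch_primed` is an illustrative name, not code from the thread. On the target the reads would be the volatile EMIF accesses.

```c
#include <assert.h>
#include <stddef.h>

#define TERM_VALUE 0x84825131u   /* termination word used in the thread */

/* Hypothetical FIFO model: each call returns the next word of a test vector. */
static const unsigned int *fifo_src;
static unsigned int fifo_read(void) { return *fifo_src++; }

/* Stores the low byte of each word in out[] until the terminator is read
 * or max_samples entries are stored; returns the number of entries stored. */
static size_t fetch_primed(unsigned char *out, size_t max_samples)
{
    size_t n = 0;
    unsigned int word;

    assert(out != NULL);
    word = fifo_read();                      /* first read, before the loop */

    while (n < max_samples && word != TERM_VALUE) {
        out[n++] = (unsigned char)(word & 0xFFu);  /* process the current word */
        word = fifo_read();                  /* next read, at the end of the body */
    }
    return n;
}
```

With this shape the terminator is never processed as a sample. One consequence worth noting: when the loop stops because the buffer is full, one word has already been read from the FIFO and is dropped.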

---------- Original Message -----------
From: "d.stuartnl" <d...@yahoo.com>
To: c...@yahoogroups.com
Sent: Wed, 22 Jul 2009 13:43:32 -0000
Subject: [c6x] Re: Slow EMIF transfer

> Hi all,
> 
> I'm trying to further optimize my code but for some reason I cannot 
> get pipelining to work. I've checked several documents (SPRU425, 
> Optimizing C Compiler Tutorial, SPRA666, Hand Tuning Loops and Control 
> Code). These documents primarily focus on improving pipelines but in 
> my .asm file it keeps stating "Unsafe schedule for irregular loop". It 
> produces the following Software Pipeline Information:
> 
<snip>

> -------* My code is as follows:
> 
> void Calculator_FetchData(
>    volatile int * restrict p1, 
>    volatile int * restrict p2) 
> {
> volatile int tmpRead1; 	
> volatile int tmpRead2; 
> volatile int tmpStore1; 
> volatile int tmpStore2; 
> volatile float * restrict pCH1; 	
> volatile float * restrict pCH2;
> volatile float * restrict pCH3; 
> volatile float * restrict pCH4; 
> volatile float * restrict pCH5; 
> volatile float * restrict pCH6;
> 
> const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> const termValue = 0x84825131;
> 
> pCH1 = &CH1.deloggedData[0];
> pCH2 = &CH2.deloggedData[0];
> pCH3 = &CH3.deloggedData[0];
> pCH4 = &CH4.deloggedData[0];
> pCH5 = &CH5.deloggedData[0];
> pCH6 = &CH6.deloggedData[0];
> 
>     while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
>     {
>         tmpRead1 = *p1;
> 
>         // CHANNEL 1
>         *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> 		// CHANNEL 2
>         *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
>         if(LRneeded == 1)
>         {
>             *pCH1 += *pCH2;
>
>             if(*pCH1 > 5000)
>             {
>                 *pCH1 = 5000;
>             }
>         }
> 		//CHANNEL 5
>         *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> 
> 		// CHANNEL 6
>         *pCH6 = LUT1[tmpRead1 & 0xFF];
> 		
>         tmpRead2 = *p2;
> 
>         // CHANNEL 3 this channel is always read for particle matching on this channel
>         *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>         // CHANNEL 4
>         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> 
>         pCH1++;
>         pCH2++;
>         pCH3++;
>         pCH4++;
>         pCH5++;
>         pCH6++;
>     }
>
>     x = (int) (pCH1 - &CH1.deloggedData[0]);
> }
> 
> Is there a way I can change my C-code so the DSP can Pipeline? I think 
> it should be possible to have at least 2 iterations in parallel:
> 
> 1st       2nd
> read1
> delog1    
> read2     read1
> delog2    delog1
> etc..
> 
> I've already used the "restrict" keyword on the pointers I use since 
> these pointers do not overlap. I'm using the following compiler options:
> 
> -k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug" -d"CHIP_6713" 
> -d"DEBUG" -mt -mw -mh -mr1 -mv6710 --mem_model:data=far --consultant
> 
> Any advice or pointers to documents regarding how to enable pipelining 
> other than the ones mentioned above would be helpful.
> 
> With kind regards,
> 
> Dominic Stuart
> 
> PS: I don't know if it's useful but the following asm code is being produced:
> 
> C$L6:    
> $C$DW$L$_Calculator_FetchData$2$B:
>    .dwpsn  file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
> ;**     -----------------------g3:
> ;** 364 -----------------------    tmpRead1 = *p1;
> ;** 367 -----------------------    *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
> ;** 369 -----------------------    *(++pCH2) = K$32[((unsigned)tmpRead1>>22>>2)];
> ;** 370 -----------------------    if ( LRneeded != 1 ) goto g6;
> ;** 372 -----------------------    *pCH1 = *pCH1+*pCH2;
> ;** 373 -----------------------    if ( *pCH1 <= K$36 ) goto g6;
> ;** 375 -----------------------    *pCH1 = K$36;
> ;**     -----------------------g6:
> ;** 379 -----------------------    *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
> ;** 382 -----------------------    *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
> ;** 384 -----------------------    tmpRead2 = *p2;
> ;** 387 -----------------------    *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
> ;** 389 -----------------------    *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
> ;** 397 -----------------------    if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
>            LDW     .D1T2   *A4,B4            ; |364| 
>            ZERO    .L1     A1
>            NOP             3
>            STW     .D2T2   B4,*+SP(4)        ; |364| 
>            LDW     .D2T2   *+SP(4),B4        ; |367| 
>            NOP             4
>            EXTU    .S2     B4,8,24,B4        ; |367| 
>            LDW     .D2T1   *+B7[B4],A9       ; |367| 
>            NOP             4
>            STW     .D1T1   A9,*A8            ; |367| 
>            LDW     .D2T2   *+SP(4),B4        ; |369| 
>            NOP             4
>            SHRU    .S2     B4,24,B4          ; |369| 
>            LDW     .D2T2   *+B7[B4],B4       ; |369| 
>            NOP             4
>            STW     .D2T2   B4,*++B5          ; |369| 
>            LDHU    .D1T2   *A11,B4           ; |370| 
>            NOP             4
>            CMPEQ   .L2     B4,1,B0           ; |370|
> 
>    [ B0]   LDW     .D1T1   *A8,A9            ; |372| 
> || [ B0]   LDW     .D2T2   *B5,B4            ; |372|
> 
>            NOP             4
>    [ B0]   ADDSP   .L1X    B4,A9,A9          ; |372| 
>            NOP             3
>    [ B0]   STW     .D1T1   A9,*A8            ; |372| 
>    [ B0]   LDW     .D1T1   *A8,A9            ; |373| 
>            NOP             4
>    [ B0]   CMPGTSP .S1     A9,A2,A9          ; |373| 
>    [ B0]   MV      .L1     A9,A1
>    [ A1]   STW     .D1T1   A2,*A8            ; |375| 
>            LDW     .D2T1   *+SP(4),A9        ; |379| 
>            NOP             4
>            EXTU    .S1     A9,16,24,A9       ; |379| 
>            LDW     .D1T1   *+A10[A9],A9      ; |379| 
>            NOP             4
>            STW     .D1T1   A9,*A5++          ; |379| 
>            LDW     .D2T1   *+SP(4),A9        ; |382| 
>            NOP             4
>            EXTU    .S1     A9,24,24,A9       ; |382| 
>            LDW     .D1T1   *+A10[A9],A9      ; |382| 
>            NOP             4
>            STW     .D1T1   A9,*A3++          ; |382| 
>            LDW     .D1T2   *A0,B4            ; |384| 
>            NOP             4
>            STW     .D2T2   B4,*+SP(8)        ; |384| 
>            LDW     .D2T2   *+SP(8),B4        ; |387| 
>            NOP             4
>            EXTU    .S2     B4,24,24,B4       ; |387| 
>            LDW     .D2T1   *+B7[B4],A9       ; |387| 
>            NOP             4
>            STW     .D1T1   A9,*A6++          ; |387| 
>            LDW     .D2T2   *+SP(8),B4        ; |389| 
>            NOP             4
>            EXTU    .S2     B4,16,24,B4       ; |389| 
>            LDW     .D2T1   *+B7[B4],A9       ; |389| 
>            NOP             4
>            STW     .D1T1   A9,*A7++          ; |389| 
>            LDW     .D2T2   *+SP(12),B4       ; |397|
> 
>            LDW     .D1T1   *++A8,A9          ; |397| 
> ||         LDW     .D2T2   *+SP(4),B8        ; |397|
> 
>            NOP             4
> 
>            CMPEQ   .L2     B8,B6,B8          ; |397| 
> ||         CMPLTSP .S2X    A9,B4,B4          ; |397|
> 
>            XOR     .L2     1,B8,B8           ; |397| 
>            AND     .L2     B8,B4,B0          ; |397| 
>    [ B0]   B       .S1     $C$L6             ; |397| 
>    .dwpsn  file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
>            NOP             5
>            ; BRANCHCC OCCURS {$C$L6}         ; |397|
> 
> --- In c...@yahoogroups.com, "Richard Williams"
<rkwill@...> wrote:
> >
> > 
> > d.stuartnl,
> > 
> > The reason the time is quicker, even though there is more code, is because the
> > code produced to do:
> > CH1.deloggedData[x]  
> > includes quite a lot of math; calculation of an address in an array is slow
> > compared to incrementing a pointer
> > 
> > R. Williams
> > 
> > 
> > ---------- Original Message -----------
> > From: "d.stuartnl" <d.stuartnl@...>
> > To: c...@yahoogroups.com
> > Sent: Fri, 17 Jul 2009 17:06:32 -0000
> > Subject: [c6x] Re: Slow EMIF transfer
> > 
> > > Dear R.Williams,
> > > 
> > > I changed my code to your suggestion:
> > > 
> > > void Calculator_FetchData()
> > > {
> > > 	volatile float * pCH1;
> > > 	volatile float * pCH2;
> > > 	volatile float * pCH3;
> > > 	volatile float * pCH4;
> > > 	volatile float * pCH5;
> > > 	volatile float * pCH6;
> > > 
> > >     const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > > 	const termValue = 0x84825131;
> > > 
> > > 	pCH1 = &CH1.deloggedData[0];
> > > 	pCH2 = &CH2.deloggedData[0];
> > > 	pCH3 = &CH3.deloggedData[0];
> > > 	pCH4 = &CH4.deloggedData[0];
> > > 	pCH5 = &CH5.deloggedData[0];
> > > 	pCH6 = &CH6.deloggedData[0];
> > > 
> > > 	
> > > 
> > >     tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> > > 
> > > 	while(*pCH1 < endCH1)
> > > 	{
> > > 		tmpRead1 = *read1;
> > > 		if(tmpRead1 == termValue) break;
> > > 		//CHANNEL 1
> > > 		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > 		// CHANNEL 2
> > > 		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > 		if(LRneeded == 1)
> > > 		{
> > > 			*pCH1 += *pCH2;
> > > 			if(*pCH1 > 5000)
> > > 			{
> > > 				*pCH1 = 5000;
> > > 			}
> > > 		}
> > > 		// CHANNEL 5
> > > 		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > 
> > > 		// CHANNEL 6
> > > 		*pCH6 = LUT1[tmpRead1 & 0xFF];
> > > 
> > > 		tmpRead2 = *read2;
> > > 		
> > >         // CHANNEL 3 this channel is always read for particle matching on this channel
> > >         *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > >         // CHANNEL 4
> > >         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > 
> > > 		pCH1++;
> > > 		pCH2++;
> > > 		pCH3++;
> > > 		pCH4++;
> > > 		pCH5++;
> > > 		pCH6++;
> > > 		x++;
> > > 	}
> > > 	if((TIMER(1)->cnt - tmpprocessTime) > 0)//detect overflow
> > > 	{
> > > 		processTime = 	TIMER(1)->cnt - tmpprocessTime;
> > > 	}
> > > }
> > > 
> > > On my test rig I'm offering particles with a fixed length of 985. My 
> > > previous code could read 985 samples for 6 channels in 681us. Your 
> > > suggestion cut that time down to 601us!!! My first reaction was WOW 
> > > :P. I have a couple of questions though if you can forgive my 
> > > ignorance. The big question is WHY? Because it looks like it's 
> > > calculating more (6 pointers instead of 1 "x"). I still left in the 
> > > x++; because I need to know how many samples have been read.
> > > 
> > > With kind regards,
> > > 
> > > Dominic Stuart
> > > 
> > > --- In c...@yahoogroups.com, "Richard Williams"
<rkwill@> wrote:
> > > >
> > > > d.stuartnl,
> > > > 
> > > > I notice that the code, during the first loop, checks for the termination value
> > > > then throws away the first read values (by reading from read1 and read2 again).
> > > > Is that what you wanted to do?
> > > > 
> > > > Execution could be made much faster, by eliminating the calculations related to
> > > > 'x' by using pointers to:
> > > > CH1.deloggedData, 
> > > > CH2.deloggedData, 
> > > > CH3.deloggedData, 
> > > > CH4.deloggedData, 
> > > > CH5.deloggedData, 
> > > > CH6.deloggedData.  
> > > > Initialize the pointers before the loop and increment them at the end of the loop.
> > > > Also, eliminate 'x' and related calculation by precalculating the end address
> > > > for the loop as: 
> > > > const endCH1 = &CH1.deloggedData[0x1000];
> > > > const termValue = 0x84825131;
> > > > 
> > > > pCH1 = &CH1.deloggedData[0];
> > > > pCH2 = &CH2.deloggedData[0];
> > > > --- // rest of initialization
> > > > while( pCH1 < endCH1 )
> > > > {
> > > > ---// processing
> > > > pCH1++;
> > > > pCh2++;
> > > > ...// rest of incrementing
> > > > } // end while()
> > > > 
> > > > to avoid processing the termination value from *read1
> > > > and to exit when the termination value is read:
> > > > The first code within the 'while' loop would be:
> > > > tmpRead1 = *read1;
> > > > if (tmpRead1 == termValue ) break;
> > > > tmpRead2 = *read2;
> > > > 
> > > > R. Williams
> > > > 
> > > > 
> > > > ---------- Original Message -----------
> > > > From: "d.stuartnl" <d.stuartnl@>
> > > > To: c...@yahoogroups.com
> > > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > 
> > > > > R. Williams,
> > > > <snip>
> > > > > 
> > > > > > x and tmpRead1 are updated in the AddSample() routine. Furthermore,
> > > > > > I've been analyzing the compiler's feedback and it's stating that it 
> > > > > > cannot implement software pipelining because there's a function call 
> > > > > > (AddSample()) in the loop. I've removed the AddSample() function and 
> > > > > > put the code from the function directly into the loop (see source);
> > > > > > there are still some problems (Disqualified loop: Loop carried 
> > > > > > dependency bound too large). But I'm working on it :) I've also found 
> > > > > > out that pipelining is not being used in a lot of my loops so I'm 
> > > > > > guessing if I adjust my C-code so that software pipelining will be 
> > > > > > possible I will notice an increase in performance.
> > > > > 
> > > > > Source:
> > > > > 
> > > > > read1 = (int*) 0x90300004;
> > > > > read2 = (int*) 0x90300008;
> > > > > 
> > > > > tmpRead1 = *read1;
> > > > > tmpRead2 = *read2;
> > > > > x = 0;
> > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > {
> > > > >    tmpRead1 = *read1;
> > > > >    tmpRead2 = *read2;YouTube - Dilbert - The Knack
> > > > > 
> > > > >    CH1.deloggedData[x] = LUT0[((tmpRead1 &
0xFF0000) >> 16)];
> > > > >    CH2.deloggedData[x] = LUT0[((tmpRead1 &
0xFF000000) >> 24)];
> > > > >    // FWS R+L Add
> > > > >    if(LRneeded == 1)
> > > > >    {
> > > > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > >       if(CH1.deloggedData[x] > 5000)
> > > > >       {
> > > > >          CH1.deloggedData[x] = 5000;
> > > > >       }
> > > > >    }
> > > > >    CH3.deloggedData[x] = LUT0[((tmpRead2 &
0xFF))];
> > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > >    CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00)
>> 8)];
> > > > >    CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00)
>> 8)];
> > > > >    CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > > >    x++;
> > > > > }
> > > > > 
> > > > > With kind regards,
> > > > > 
> > > > > Dominic
> > > > > 
> > > > > > 
> > > > > > However, your idea of just using the read
operation, since it is
much longer
> > > > > > than a write, is a good one.
> > > > > > 
> > > > > > R. Williams
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > ---------- Original Message -----------
> > > > > > From: Jeff Brower <jbrower@>
> > > > > > To: Dominic Stuart <d.stuartnl@>
> > > > > > Cc: c...@yahoogroups.com
> > > > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > > > 
> > > > > > > Dominic-
> > > > > > > 
> > > > > > > > I am indeed trying to avoid delay in
processing flow. The data needs
> > to be
> > > > > > > > decompressed asap. When that is done the
DSP performs calculations
> > on the
> > > > > > > > data and based on the outcome of those
calculations the DSP
generates a
> > > > > > > > trigger (GPIO). Your idea of a code loop
got me thinking... If a
read
> > > > > > > > always takes longer than a write, I
don't have to pull the Empty
> > Flag and
> > > > > > > > can just read the data through a loop
like so:
> > > > > > > > 
> > > > > > > > while(tmpRead1 != 0x84825131 & (x
<= 0x1000))
> > > > > > > > {
> > > > > > > >    Calculator_AddSample();
> > > > > > > > }
> > > > > > > 
> > > > > > > Ok, so what you're saying is that once you
see a "not empty" flag, 
> > > > > > > then you know the agent on the other side of
the FIFO is writing a 
> > > > > > > known block size, and will write it faster
than you can read, so your 
> > > > > > > code just needs to read.
> > > > > > > 
> > > > > > > > I've tested this and it did improve the
performance but nothing
> > shocking,
> > > > > > > > it seems the decompressing via the
LookUp Table is creating the
bottle
> > > > > > > > neck. I've already split the two
dimensional LUT into 2 one
dimensional
> > > > > > > > array's. This also helped a bit.
> > > > > > > 
> > > > > > > One thing you might try is hand-optimized asm
code just for the
read / 
> > > > > > > look-up sequence, using techniques that
Richard was describing.  If 
> > > > > > > you take advantage of the pipeline, you can
improve performance.  For 
> > > > > > > example you can read sample N, then in the
next 4 instructions
process 
> > > > > > > the lookup on N-1, waiting for N to become
valid.  It sounds to me 
> > > > > > > like it wouldn't be that much code in your
loop, maybe a dozen or
less 
> > > > > > > asm instructions.
> > > > > > > 
> > > > > > > -Jeff
> > > > > > > 
> > > > > > > PS. Please post to the group, not to me. 
Thanks.
> > > > > > > 
> > > > > > > > --- In c...@yahoogroups.com, Jeff Brower
<jbrower@> wrote:
> > > > > > > > >
> > > > > > > > > Dominic-
> > > > > > > > >
> > > > > > > > > > Thanks for the information, I
think I will refrain from
using block
> > > > > > > > > > transfers because I want to
process the data as the DSP
receives it.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > >
> > > > > > > > > > At the moment I am starting
this "prefetch" function when a
burst
> > > > > > > > > > starts and execute this
function every time there is data
available
> > > > > > > > > > in the FIFO's (polling the
Empty Flag). I'm prefeteching
27.6% of
> > > > > > > > > > the data before the burst
ends. All variables are in IRAM.
> > > > > > > > >
> > > > > > > > > The typical reason for doing it
that way is to avoid delay
> > (latency) in
> > > > > > your signal
> > > > > > > > > processing flow, relative to some
output (DAC, GPIO line, digital
> > > > > > transmission,
> > > > > > > > > etc).  Is that the case?  If not
then a block based method
would be
> > > > > > better, otherwise
> > > > > > > > > you will waste a lot of time
polling for each element.  You don't
> > have to
> > > > > > implement
> > > > > > > > > DMA as a first step to get that
working, you could use a code
> > loop.  Then
> > > > > > implement
> > > > > > > > > DMA in order to further improve
performance.
> > > > > > > > >
> > > > > > > > > -Jeff
> > > > > > > > >
> > > > > > > > > > My function looks like this:
> > > > > > > > > >
> > > > > > > > > > > void Calculator_AddSample()
> > > > > > > > > > > {
> > > > > > > > > > >    x++;
> > > > > > > > > > >
> > > > > > > > > > >    read1 = (int*) 0x90300004;
> > > > > > > > > > >    read2 = (int*) 0x90300008;
> > > > > > > > > > >
> > > > > > > > > > >    tmpRead1 = *read1;
> > > > > > > > > > >    tmpRead2 = *read2;
> > > > > > > > > > >
> > > > > > > > > > >    // CHANNEL 1
> > > > > > > > > > >    CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > > > >    // CHANNEL 2
> > > > > > > > > > >    CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > > > > > >    // FWS R+L Add
> > > > > > > > > > >    if(LRneeded == 1)
> > > > > > > > > > >    {
> > > > > > > > > > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > > > >       if(CH1.deloggedData[x] > 5000)
> > > > > > > > > > >       {
> > > > > > > > > > >          CH1.deloggedData[x] = 5000;
> > > > > > > > > > >       }
> > > > > > > > > > >    }
> > > > > > > > > > >    // CHANNEL 3 this channel is always read for particle matching on this channel
> > > > > > > > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > > > >    CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > > > > >
> > > > > > > > > > >    // CHANNEL 4
> > > > > > > > > > >    CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > > > >    // CHANNEL 5
> > > > > > > > > > >    CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > > > >    // CHANNEL 6
> > > > > > > > > > >    CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > > > > }
> > > > > > > > > > This function executes 2 reads
from 2 different FIFO's and then
> > > > > > seperates the different datachannels and
decompresses the value's with a
> > LookUp
> > > > > > Table.
> > > > > > > > > >
> > > > > > > > > > I am trying to streamline this
function so it can keep up
with the
> > > > > > incoming data. The data is written to the FIFO's
with 4MHz. The data
> > consists of
> > > > > > small burst packets ranging from 3 to 4096 bytes
per channel.
> > > > > > > > > >
> > > > > > > > > > At the moment I am starting
this "prefetch" function when a
> > burst starts
> > > > > > and execute this function every time there is data
available in the
FIFO's
> > > > > > (polling the Empty Flag). I'm prefeteching 27.6%
of the data before the
> > burst
> > > > > > ends. All variables are in IRAM.
> > > > > > > > > >
> > > > > > > > > > I think I made an error in
suspecting the EMIF transfer speed
> > and I now
> > > > > > suspect that there may be some overhead in the
polling scheme I use for
> > calling
> > > > > > this function that results in the slow transfer
speed. I will look into
> > this. I
> > > > > > would like to thank everyone for there input.
> > > > > > > > > >
> > > > > > > > > > With kind regards,
> > > > > > > > > >
> > > > > > > > > > Dominic
> > > > > > > > > >
> > > > > > > > > > --- In c...@yahoogroups.com,
Adolf Klemenz <adolf.klemenz@>
wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Dominic,
> > > > > > > > > > >
> > > > > > > > > > > At 16:45 13.07.2009
+0000, d.stuartnl wrote:
> > > > > > > > > > > >as I understand DMA,
I would need to work in "blocks" of
data but
> > > > that
> > > > > > > > > > > >would be very tricky
in my application since I do not
know how
> > > > big the
> > > > > > > > > > > >datastream is gonna
be. Or is it possible to use DMA for
> > single byte
> > > > > > transfers?
> > > > > > > > > > >
> > > > > > > > > > > using DMA makes sense for
block transfers only. Typical Fifo
> > > > applications
> > > > > > > > > > > will use the Fifo's
half-full flag (or a similar signal) to
> > > > trigger a DMA
> > > > > > > > > > > block read.
> > > > > > > > > > > You may use
element-synchronized DMA (each trigger
transfers only
> > > > one data
> > > > > > > > > > > word), but there will be
no speed improvement: It takes about
> > > > 100ns from
> > > > > > > > > > > the EDMA sync event to
the actual data transfer on a C6713.
> > > > > > > > > > >
> > > > > > > > > > > Attached is a scope
screenshot generated by this test program
> > > > > > > > > > >
> > > > > > > > > > > // compiled with -o2 and
without debug info:
> > > > > > > > > > >
> > > > > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > > > >                       // optimizer from code removal
> > > > > > > > > > > > for (;;)
> > > > > > > > > > > > {
> > > > > > > > > > > >      buffer = *(volatile int*)0x90300000;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > The screenshot shows chip select and read signal with the expected timings
> > > > > > > > > > > > (20ns strobe width). The gap between successive reads is caused by the DSP
> > > > > > > > > > > > architecture. Here it is 200ns because a 225MHz DSP was used, which should
> > > > > > > > > > > > translate to 150ns on a 300MHz device.
> > > > > > > > > > > >
> > > > > > > > > > > > If this isn't fast enough, you must use block transfers.
> > > > > > > > > > >
> > > > > > > > > > >    Best Regards,
> > > > > > > > > > >    Adolf Klemenz,
D.SignT
> > > > > > > > >
> > > > > > ------- End of Original Message -------
> > > > > >
> > > > ------- End of Original Message -------
> > > >
> > ------- End of Original Message -------
> >
------- End of Original Message -------

_____________________________________


Re: Slow EMIF transfer - "d.stuartnl" - Jul 24 9:04:08 2009

R.Williams,

SUCCESS! Loop time has almost halved! Software pipelining is working now thanks
to your tips:

--- In c...@yahoogroups.com, "Richard Williams" <rkwill@...>
wrote:
>
> d.stuartnl,
> 
> under the assumption that the included code is actually the code being compiled...
> The line: x = (int) (pCH1 - &CH1.deloggedData[0]);
> will give the number of addresses rather than the number of entries,
> Therefore it should be: 
> x = ((int) (pCH1 - &CH1.deloggedData[0]) / sizeof(float));
> 

For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
is working fine as it is. Don't know why, though.
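
A likely explanation, as far as standard C goes: subtracting two pointers into the same array yields the number of elements between them, not the number of bytes, so no division by sizeof(float) is needed (dividing would make the count four times too small where float is 4 bytes). A minimal sketch, assuming a stand-in array `deloggedData` in place of CH1.deloggedData; the helper names are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CH1.deloggedData; 0x1000 entries as in the thread. */
static float deloggedData[0x1000];

/* Number of float entries from first up to (not including) current.
 * Both pointers must point into the same array. */
static ptrdiff_t elements_between(const float *first, const float *current)
{
    assert(current >= first);
    return current - first;          /* counted in elements, not bytes */
}

/* The same distance measured in bytes, for comparison. */
static ptrdiff_t bytes_between(const float *first, const float *current)
{
    return (const char *)current - (const char *)first;
}
```

For a pointer advanced 985 times from the start of the array, `elements_between` returns 985 while `bytes_between` returns 985 * sizeof(float).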

> 
> The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> is using a local variable tmpRead1 before it is set, 
> so garbage is being used in the comparison.
I now initialize tmpRead1 = 0 where the variable is declared, so it is reset
every time the function is called.
> The lines:
> volatile int tmpStore1; 
> volatile int tmpStore2; 
> describe two local variables that are not being used, so should be deleted.
>
check 
> 
> The lines:
> >    volatile int * restrict p1, 
> >    volatile int * restrict p2) 
> have the parameter names p1 and p2.
> This yields no useful information about the target of the pointers,
> so should be renamed to something meaningful.
> 
check, renamed them.
> 
> The loading of the local variable tmpRead2 
> tmpRead2 = *p2;
> is being immediately used in the next line.
> it takes some 4 cycles for the load to complete, so the 
> loading should be several cycles/lines earlier in the source.
> 
check, moved the read operations to the top of my while loop.
> 
> The line: if(LRneeded == 1)
> is referencing a global variable.  
> This makes for maintenance problems and pipelining problems.
> It would be better passed in as one of the parameters (and have a
> local/parameter name).
> 
check
> 
> We have previously discussed the use of the <tab> character and the problems it
> produces.
> I replaced the <tab> characters with spaces in the copied code.
> 
sorry, will do that from now on...
> 
> The term 'volatile' is not needed in any of the variables as the code does not
> have repeating lines and the variables will not (unexpectedly) change during the
> execution of the code.
> Removing the 'volatile' will speed up the code because the values, once into a
> CPU register, will not have to be re-read at each usage of the value.
>
removed volatile keyword. 
> 
> The line: const termValue = 0x84825131
> is missing the 'type' for the constant.
> I would suggest adding 'int' after the 'const'.
> 
check
> 
> The line: while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> is performing a bit-wise 'and' between two logical conditions.
> It should be: while( (*pCH1 < endCH1) && (tmpRead1 != termValue) )
> so it performs a logical 'and' between two logical conditions.
> 
check
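
For two plain comparisons like the ones in this loop condition, `&` and `&&` give the same truth value; the practical difference is that `&&` skips the right-hand operand when the left-hand one is false, while `&` always evaluates both. A small sketch of that difference, using a hypothetical helper `right_operand()` that counts how often it is evaluated (none of these names come from the thread):

```c
#include <assert.h>

static int calls;               /* counts evaluations of the right-hand operand */

static int right_operand(void)  /* hypothetical helper, for illustration only */
{
    calls++;
    return 1;
}

/* Bitwise AND of two 0/1 values: both operands are always evaluated. */
static int bitwise_and(int left)
{
    return (left != 0) & right_operand();
}

/* Logical AND: the right operand is skipped when the left one is false. */
static int logical_and(int left)
{
    return (left != 0) && right_operand();
}
```

Calling `bitwise_and(0)` increments `calls`, whereas `logical_and(0)` leaves it unchanged; both return 0.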
> 
> The use of the global variable 'x' is a maintenance problem.
> I would suggest a modification to have a local variable 'x' and return the value
> of 'x' rather than returning 'void'
> let the caller assign the returned value to the global variable 'x'.
> 
check
> 
> In general, for a loop to be pipelined, the loop must be relatively simple.
> Therefore, I would suggest making this two loops,
> one for tmpRead1 reading and calculations
> one for tmpRead2 reading and calculations
> However, for parallel operation, the tmpRead1 and tmpRead2 operations could be
> (somewhat) merged in a single loop.
> 
still have them in a single loop and it's pipelining. Do you think it's worth
considering splitting it into two loops and check if there's (an even better)
speed increase?
> 
> To help absorb the needed CPU cycles after a read of p1 and p2, I would put the
> first read(s) before the 'while' statement(s) and read again at the end of the
> loop, just before the incrementing of the pCHx pointers.
> 
For some reason, when I move the read operations as you suggest, software
pipelining is no longer possible (Cannot find schedule).

My new and improved function:

unsigned int Calculator_FetchData(volatile int * restrict pFifo12,
                                  volatile int * restrict pFifo3,
                                  Bool curvature)
{
   unsigned int tmpRead1 = 0;
   unsigned int tmpRead2 = 0;
   unsigned int sampleCount;
   float * restrict pCH1;
   float * restrict pCH2;
   float * restrict pCH3;
   char * restrict pBinData3;
   float * restrict pCH4;
   float * restrict pCH5;
   float * restrict pCH6;
   
   const float * endCH1 = &CH1.deloggedData[0x1000];
   const int termValue = 0x84825131;

   pCH1 = &CH1.deloggedData[0];
   pCH2 = &CH2.deloggedData[0];
   pCH3 = &CH3.deloggedData[0];
   pBinData3 = &binData3[0];
   pCH4 = &CH4.deloggedData[0];
   pCH5 = &CH5.deloggedData[0];
   pCH6 = &CH6.deloggedData[0];
   
   while((pCH1 < endCH1) && (tmpRead1 != termValue)) // was: (*pCH1 < endCH1) & (tmpRead1 != termValue)
   {
      tmpRead1 = *pFifo12;
      tmpRead2 = *pFifo3;
      
      //CHANNEL 1
      *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
      // CHANNEL 2
      *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
      if(curvature)
      {
         *pCH1 += *pCH2;
         if(*pCH1 > 5000)
         {
            *pCH1 = 5000;
         }
      }
      //CHANNEL 5
      *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
      // CHANNEL 6
      *pCH6 = LUT1[tmpRead1 & 0xFF];
      // CHANNEL 3 this channel is always read for particle matching on this channel
      *pCH3 = LUT0[((tmpRead2 & 0xFF))];	
      *pBinData3 = tmpRead2 & 0xFF;
      // CHANNEL 4
      *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
      
      pCH1++;
      pCH2++;
      pCH3++;
      pBinData3++;
      pCH4++;
      pCH5++;
      pCH6++;
   }
   sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;

   return sampleCount;
}

I would like to thank you very much for helping me improve this code, I'm
learning more and more :)

As you might have seen in my code, the second read (tmpRead2) is a 32-bit int,
but I'm only interested in the low 16 bits (where channels 3 and 4 reside). Is
there a way I can tell the compiler to ignore the other 16 bits, for possibly
more efficient performance?

I had to leave pFifo12 and pFifo3 volatile because when I removed these keywords
the software pipelining was disabled again (Cannot find schedule).

With kind regards,

Dominic

> 
> R. Williams
> 
> ---------- Original Message -----------
> From: "d.stuartnl" <d.stuartnl@...>
> To: c...@yahoogroups.com
> Sent: Wed, 22 Jul 2009 13:43:32 -0000
> Subject: [c6x] Re: Slow EMIF transfer
> 
> > Hi all,
> > 
> > I'm trying to further optimize my code but for some reason I cannot 
> > get pipelining to work. I've checked several documents (SPRU425, 
> > Optimizing C Compiler Tutorial, SPRA666, Hand Tuning Loops and Control
> > Code). These documents primarily focus on improving pipelines but in 
> > my .asm file it keeps stating "Unsafe schedule for irregular loop". It 
> > produces the following Software Pipeline Information:
> > 
> <snip> > -------* My code is as follows:
> > 
> > void Calculator_FetchData(
> >    volatile int * restrict p1, 
> >    volatile int * restrict p2) 
> { 
> > volatile int tmpRead1; 	
> > volatile int tmpRead2; 
> > volatile int tmpStore1; 
> > volatile int tmpStore2; 
> > volatile float * restrict pCH1; 	
> > volatile float * restrict pCH2;
> > volatile float * restrict pCH3; 
> > volatile float * restrict pCH4; 
> > volatile float * restrict pCH5; 
> > volatile float * restrict pCH6;
> > 
> > const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > const termValue = 0x84825131;
> > 
> > pCH1 = &CH1.deloggedData[0];
> > pCH2 = &CH2.deloggedData[0];
> > pCH3 = &CH3.deloggedData[0];
> > pCH4 = &CH4.deloggedData[0];
> > pCH5 = &CH5.deloggedData[0];
> > pCH6 = &CH6.deloggedData[0];
> > 
> >     while( (*pCH1 < endCH1) & (tmpRead1 != termValue) )
> >     {
> >         tmpRead1 = *p1;
> > 
> > 		//CHANNEL 1if(LRneeded == 1)
> >         *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > 		// CHANNEL 2
> >         *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> >         if(LRneeded == 1)
> >         {
> >             *pCH1 += *pCH2;
> >
> >             if(*pCH1 > 5000)
> >             {
> >                 *pCH1 = 5000;
> >             }
> >         }
> > 		//CHANNEL 5
> >         *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > 
> > 		// CHANNEL 6
> >         *pCH6 = LUT1[tmpRead1 & 0xFF];
> > 		
> >         tmpRead2 = *p2;
> > 
> >         // CHANNEL 3 this channel is always read for particle matching on this channel
> >         *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> >         // CHANNEL 4
> >         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > 
> >         pCH1++;
> >         pCH2++;
> >         pCH3++;
> >         pCH4++;
> >         pCH5++;
> >         pCH6++;
> >     }
> >
> >     x = (int) (pCH1 - &CH1.deloggedData[0]);
> > }
> > 
> > Is there a way I can change my C-code so the DSP can Pipeline? I think
> > it should be possible to have at least 2 iterations in parallel:
> > 
> > 1st       2nd
> > read1
> > delog1    
> > read2     read1
> > delog2    delog1
> > etc..
> > 
> > I've already used the "restrict" keyword on the pointers I use since 
> > these pointers do not overlap. I'm using the following compiler options:
> > 
> > -k -s -pm -os -on1 -op3 -o3 -fr"$(Proj_dir)\Debug" -d"CHIP_6713" 
> > -d"DEBUG" -mt -mw -mh -mr1 -mv6710 --mem_model:data=far --consultant
> > 
> > Any advice or pointers to documents regarding how to enable pipelining
> > other than the ones mentioned above would be helpful.
> > 
> > With kind regards,
> > 
> > Dominic Stuart
> > 
> > PS: I don't know if it's useful but the following asm code is being produced:
> > 
> > $C$L6:    
> > $C$DW$L$_Calculator_FetchData$2$B:
> > 	.dwpsn	file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 363,column 0,is_stmt
> > ;**	-----------------------g3:
> > ;** 364	-----------------------    tmpRead1 = *p1;
> > ;** 367	-----------------------    *pCH1 = K$32[_extu((unsigned)tmpRead1, 8u, 24u)];
> > ;** 369	-----------------------    *(++pCH2) = K$32[((unsigned)tmpRead1>>22>>2)];
> > ;** 370	-----------------------    if ( LRneeded != 1 ) goto g6;
> > ;** 372	-----------------------    *pCH1 = *pCH1+*pCH2;
> > ;** 373	-----------------------    if ( *pCH1 <= K$36 ) goto g6;
> > ;** 375	-----------------------    *pCH1 = K$36;
> > ;**	-----------------------g6:
> > ;** 379	-----------------------    *pCH5++ = K$39[_extu((unsigned)tmpRead1, 16u, 24u)];
> > ;** 382	-----------------------    *pCH6++ = K$39[_extu((unsigned)tmpRead1, 24u, 24u)];
> > ;** 384	-----------------------    tmpRead2 = *p2;
> > ;** 387	-----------------------    *pCH3++ = K$32[_extu((unsigned)tmpRead2, 24u, 24u)];
> > ;** 389	-----------------------    *pCH4++ = K$32[_extu((unsigned)tmpRead2, 16u, 24u)];
> > ;** 397	-----------------------    if ( (*(++pCH1) < endCH1)&(tmpRead1 != K$27) ) goto g3;
> >            LDW     .D1T2   *A4,B4            ; |364| 
> >            ZERO    .L1     A1
> >            NOP             3
> >            STW     .D2T2   B4,*+SP(4)        ; |364| 
> >            LDW     .D2T2   *+SP(4),B4        ; |367| 
> >            NOP             4
> >            EXTU    .S2     B4,8,24,B4        ; |367| 
> >            LDW     .D2T1   *+B7[B4],A9       ; |367| 
> >            NOP             4
> >            STW     .D1T1   A9,*A8            ; |367| 
> >            LDW     .D2T2   *+SP(4),B4        ; |369| 
> >            NOP             4
> >            SHRU    .S2     B4,24,B4          ; |369| 
> >            LDW     .D2T2   *+B7[B4],B4       ; |369| 
> >            NOP             4
> >            STW     .D2T2   B4,*++B5          ; |369| 
> >            LDHU    .D1T2   *A11,B4           ; |370| 
> >            NOP             4
> >            CMPEQ   .L2     B4,1,B0           ; |370|
> > 
> >    [ B0]   LDW     .D1T1   *A8,A9            ; |372| 
> > || [ B0]   LDW     .D2T2   *B5,B4            ; |372|
> > 
> >            NOP             4
> >    [ B0]   ADDSP   .L1X    B4,A9,A9          ; |372| 
> >            NOP             3
> >    [ B0]   STW     .D1T1   A9,*A8            ; |372| 
> >    [ B0]   LDW     .D1T1   *A8,A9            ; |373| 
> >            NOP             4
> >    [ B0]   CMPGTSP .S1     A9,A2,A9          ; |373| 
> >    [ B0]   MV      .L1     A9,A1
> >    [ A1]   STW     .D1T1   A2,*A8            ; |375| 
> >            LDW     .D2T1   *+SP(4),A9        ; |379| 
> >            NOP             4
> >            EXTU    .S1     A9,16,24,A9       ; |379| 
> >            LDW     .D1T1   *+A10[A9],A9      ; |379| 
> >            NOP             4
> >            STW     .D1T1   A9,*A5++          ; |379| 
> >            LDW     .D2T1   *+SP(4),A9        ; |382| 
> >            NOP             4
> >            EXTU    .S1     A9,24,24,A9       ; |382| 
> >            LDW     .D1T1   *+A10[A9],A9      ; |382| 
> >            NOP             4
> >            STW     .D1T1   A9,*A3++          ; |382| 
> >            LDW     .D1T2   *A0,B4            ; |384| 
> >            NOP             4
> >            STW     .D2T2   B4,*+SP(8)        ; |384| 
> >            LDW     .D2T2   *+SP(8),B4        ; |387| 
> >            NOP             4
> >            EXTU    .S2     B4,24,24,B4       ; |387| 
> >            LDW     .D2T1   *+B7[B4],A9       ; |387| 
> >            NOP             4
> >            STW     .D1T1   A9,*A6++          ; |387| 
> >            LDW     .D2T2   *+SP(8),B4        ; |389| 
> >            NOP             4
> >            EXTU    .S2     B4,16,24,B4       ; |389| 
> >            LDW     .D2T1   *+B7[B4],A9       ; |389| 
> >            NOP             4
> >            STW     .D1T1   A9,*A7++          ; |389| 
> >            LDW     .D2T2   *+SP(12),B4       ; |397|
> > 
> >            LDW     .D1T1   *++A8,A9          ; |397| 
> > ||         LDW     .D2T2   *+SP(4),B8        ; |397|
> > 
> >            NOP             4
> > 
> >            CMPEQ   .L2     B8,B6,B8          ; |397| 
> > ||         CMPLTSP .S2X    A9,B4,B4          ; |397|
> > 
> >            XOR     .L2     1,B8,B8           ; |397| 
> >            AND     .L2     B8,B4,B0          ; |397| 
> >    [ B0]   B       .S1     $C$L6             ; |397| 
> > 	.dwpsn	file "C:\Documents and Settings\User\Desktop\20090722 works\Calculator.c",line 397,column 0,is_stmt
> >            NOP             5
> >            ; BRANCHCC OCCURS {$C$L6}         ; |397|
> > 
> > --- In c...@yahoogroups.com, "Richard Williams" <rkwill@> wrote:
> > >
> > > 
> > > d.stuartnl,
> > > 
> > > The reason the time is quicker, even though there is more code, is because the
> > > code produced to do:
> > > CH1.deloggedData[x]  
> > > includes quite a lot of math, calculation of an address in a array is slow
> > > compared to incrementing a pointer
> > > 
> > > R. Williams
> > > 
> > > 
> > > ---------- Original Message -----------
> > > From: "d.stuartnl" <d.stuartnl@>
> > > To: c...@yahoogroups.com
> > > Sent: Fri, 17 Jul 2009 17:06:32 -0000
> > > Subject: [c6x] Re: Slow EMIF transfer
> > > 
> > > > Dear R.Williams,
> > > > 
> > > > I changed my code to your suggestion:
> > > > 
> > > > void Calculator_FetchData()
> > > > {
> > > > 	volatile float * pCH1;
> > > > 	volatile float * pCH2;
> > > > 	volatile float * pCH3;
> > > > 	volatile float * pCH4;
> > > > 	volatile float * pCH5;
> > > > 	volatile float * pCH6;
> > > > 
> > > > 	const volatile float endCH1 = (const) &CH1.deloggedData[0x1000];
> > > > 	const termValue = 0x84825131;
> > > > 
> > > > 	pCH1 = &CH1.deloggedData[0];
> > > > 	pCH2 = &CH2.deloggedData[0];
> > > > 	pCH3 = &CH3.deloggedData[0];
> > > > 	pCH4 = &CH4.deloggedData[0];
> > > > 	pCH5 = &CH5.deloggedData[0];
> > > > 	pCH6 = &CH6.deloggedData[0];
> > > > 
> > > > 	
> > > > 
> > > > 	tmpprocessTime = TIMER(1)->cnt; //just in here for measuring performance...
> > > > 
> > > > 	while(*pCH1 < endCH1)
> > > > 	{
> > > > 		tmpRead1 = *read1;
> > > > 		if(tmpRead1 == termValue) break;
> > > > 		//CHANNEL 1
> > > > 		*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > 		// CHANNEL 2
> > > > 		*pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > 		if(LRneeded == 1)
> > > > 		{
> > > > 			*pCH1 += *pCH2;
> > > > 			if(*pCH1 > 5000)
> > > > 			{
> > > > 				*pCH1 = 5000;
> > > > 			}
> > > > 		}
> > > > 		// CHANNEL 5
> > > > 		*pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > > 
> > > > 		// CHANNEL 6
> > > > 		*pCH6 = LUT1[tmpRead1 & 0xFF];
> > > > 
> > > > 		tmpRead2 = *read2;
> > > > 		
> > > > 		// CHANNEL 3 this channel is always read for particle matching on this channel
> > > > 		*pCH3 = LUT0[((tmpRead2 & 0xFF))];
> > > > 		// CHANNEL 4
> > > > 		*pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > > 
> > > > 		pCH1++;
> > > > 		pCH2++;
> > > > 		pCH3++;
> > > > 		pCH4++;
> > > > 		pCH5++;
> > > > 		pCH6++;
> > > > 		x++;
> > > > 	}
> > > > 	if((TIMER(1)->cnt - tmpprocessTime) > 0) //detect overflow
> > > > 	{
> > > > 		processTime = 	TIMER(1)->cnt - tmpprocessTime;
> > > > 	}
> > > > }
> > > > 
> > > > On my testrig I'm offering particles with a fixed length of 985. My 
> > > > previous code could read 985 samples for 6 channels in 681us. Your 
> > > > suggestion cut that time down to 601us!!! My first reaction was WOW 
> > > > :P. I have a couple of questions though if you can forgive my 
> > > > ignorance. The big question is WHY? Because it looks like it's 
> > > > calculating more (6 pointers instead of 1 "x"). I still left in the 
> > > > x++; because I need to know how many samples have been read.
> > > > 
> > > > With kind regards,
> > > > 
> > > > Dominic Stuart
> > > > 
> > > > --- In c...@yahoogroups.com, "Richard Williams"
<rkwill@> wrote:
> > > > >
> > > > > d.stuartnl,
> > > > > 
> > > > > I notice that the code, during the first loop, checks
for the
> termination value
> > > > > then throws away the first read values (by reading from
read1 and read2
> again).
> > > > > is that you wanted to do?
> > > > > 
> > > > > Execution could be made much faster, by eliminating the
calculations
> related to
> > > > > 'x' by using pointers to:
> > > > > CH1.deloggedData, 
> > > > > CH2.deloggedData, 
> > > > > CH3.deloggedData, 
> > > > > CH4.deloggedData, 
> > > > > CH5.deloggedData, 
> > > > > CH6.deloggedData.  
> > > > > Initialize the pointers before the loop and increment
them at the end of the
> > > loop.
> > > > > Also, eliminate 'x' and related calculation by
precalculating the end
> address
> > > > > for the loop as: 
> > > > > const endCH1 = &CH1.deloggedData[0x1000];
> > > > > const termValue = 0x84825131;
> > > > > 
> > > > > pCH1 = &CH1.deloggedData[0];
> > > > > pCH2 = &CH2.deloggedData[0];
> > > > > --- // rest of initialization
> > > > > while( pCH1 < endCH1 )
> > > > > {
> > > > > ---// processing
> > > > > pCH1++;
> > > > > pCh2++;
> > > > > ...// rest of incrementing
> > > > > } // end while()
> > > > > 
> > > > > to avoid processing the termination value from *read1
> > > > > and to exit when the termination value is read:
> > > > > The first code within the 'while' loop would be:
> > > > > tmpRead1 = *read1;
> > > > > if (tmpRead1 == termValue ) break;
> > > > > tmpRead2 = *read2;
> > > > > 
> > > > > R. Williams
> > > > > 
> > > > > 
> > > > > ---------- Original Message -----------
> > > > > From: "d.stuartnl" <d.stuartnl@>
> > > > > To: c...@yahoogroups.com
> > > > > Sent: Fri, 17 Jul 2009 10:11:36 -0000
> > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > > 
> > > > > > R. Williams,
> > > > > <snip>
> > > > > > 
> > > > > > x and tmpRead1 are updated in the AddSample()
routine. Furthermore,
> > > > > >  I've been analyzing the compilers feedback and
it's stating that it 
> > > > > > cannot implement software pipelining because
there's a function call 
> > > > > > (AddSample()) in the loop. I've removed the
AddSample() function and 
> > > > > > put the code from the function directly into the
loop (see source),
> > > > > >  there's still some problems (Disqualified loop:
Loop carried 
> > > > > > dependency bound too large). But I'm working on it
:) I've also found 
> > > > > > out that pipelining is not being used in a lot of
my loops so I'm 
> > > > > > guessing if I adjust my C-code so that software
pipelining will be 
> > > > > > possible I will notice an increase in
performance.
> > > > > > 
> > > > > > Source:
> > > > > > 
> > > > > > read1 = (int*) 0x90300004;
> > > > > > read2 = (int*) 0x90300008;
> > > > > > 
> > > > > > tmpRead1 = *read1;
> > > > > > tmpRead2 = *read2;
> > > > > > x = 0;
> > > > > > > while(tmpRead1 != 0x84825131 & (x <= 0x1000))
> > > > > > > {
> > > > > > >    tmpRead1 = *read1;
> > > > > > >    tmpRead2 = *read2;
> > > > > > > 
> > > > > > >    CH1.deloggedData[x] = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > >    CH2.deloggedData[x] = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > >    // FWS R+L Add
> > > > > > >    if(LRneeded == 1)
> > > > > > >    {
> > > > > > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > >       if(CH1.deloggedData[x] > 5000)
> > > > > > >       {
> > > > > > >          CH1.deloggedData[x] = 5000;
> > > > > > >       }
> > > > > > >    }
> > > > > > >    CH3.deloggedData[x] = LUT0[((tmpRead2 & 0xFF))];
> > > > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > > > >    CH4.deloggedData[x] = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> > > > > > >    CH5.deloggedData[x] = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> > > > > > >    CH6.deloggedData[x] = LUT1[tmpRead1 & 0xFF];
> > > > > >    x++;
> > > > > > }
> > > > > > 
> > > > > > With kind regards,
> > > > > > 
> > > > > > Dominic
> > > > > > 
> > > > > > > 
> > > > > > > However, your idea of just using the read
operation, since it is
> much longer
> > > > > > > than a write, is a good one.
> > > > > > > 
> > > > > > > R. Williams
> > > > > > >  
> > > > > > > 
> > > > > > > 
> > > > > > > ---------- Original Message -----------
> > > > > > > From: Jeff Brower <jbrower@>
> > > > > > > To: Dominic Stuart <d.stuartnl@>
> > > > > > > Cc: c...@yahoogroups.com
> > > > > > > Sent: Wed, 15 Jul 2009 11:07:55 -0500
> > > > > > > Subject: [c6x] Re: Slow EMIF transfer
> > > > > > > 
> > > > > > > > Dominic-
> > > > > > > > 
> > > > > > > > > I am indeed trying to avoid delay
in processing flow. The data needs
> > > to be
> > > > > > > > > decompressed asap. When that is
done the DSP performs calculations
> > > on the
> > > > > > > > > data and based on the outcome of
those calculations the DSP
> generates a
> > > > > > > > > trigger (GPIO). Your idea of a code
loop got me thinking... If a
> read
> > > > > > > > > always takes longer than a write, I
don't have to pull the Empty
> > > Flag and
> > > > > > > > > can just read the data through a
loop like so:
> > > > > > > > > 
> > > > > > > > > while(tmpRead1 != 0x84825131 &
(x <= 0x1000))
> > > > > > > > > {
> > > > > > > > >    Calculator_AddSample();
> > > > > > > > > }
> > > > > > > > 
> > > > > > > > Ok, so what you're saying is that once
you see a "not empty" flag, 
> > > > > > > > then you know the agent on the other
side of the FIFO is writing a 
> > > > > > > > known block size, and will write it
faster than you can read, so your 
> > > > > > > > code just needs to read.
> > > > > > > > 
> > > > > > > > > I've tested this and it did improve
the performance but nothing
> > > shocking,
> > > > > > > > > it seems the decompressing via the
LookUp Table is creating the
> bottle
> > > > > > > > > neck. I've already split the two
dimensional LUT into 2 one
> dimensional
> > > > > > > > > array's. This also helped a bit.
> > > > > > > > 
> > > > > > > > One thing you might try is
hand-optimized asm code just for the
> read / 
> > > > > > > > look-up sequence, using techniques that
Richard was describing.  If 
> > > > > > > > you take advantage of the pipeline, you
can improve performance.  For 
> > > > > > > > example you can read sample N, then in
the next 4 instructions
> process 
> > > > > > > > the lookup on N-1, waiting for N to
become valid.  It sounds to me 
> > > > > > > > like it wouldn't be that much code in
your loop, maybe a dozen or
> less 
> > > > > > > > asm instructions.
> > > > > > > > 
> > > > > > > > -Jeff
> > > > > > > > 
> > > > > > > > PS. Please post to the group, not to me.
 Thanks.
> > > > > > > > 
> > > > > > > > > --- In c...@yahoogroups.com, Jeff
Brower <jbrower@> wrote:
> > > > > > > > > >
> > > > > > > > > > Dominic-
> > > > > > > > > >
> > > > > > > > > > > Thanks for the
information, I think I will refrain from
> using block
> > > > > > > > > > > transfers because I want
to process the data as the DSP
> receives it.
> > > > > > > > > > .
> > > > > > > > > > .
> > > > > > > > > > .
> > > > > > > > > >
> > > > > > > > > > > At the moment I am
starting this "prefetch" function when a
> burst
> > > > > > > > > > > starts and execute this
function every time there is data
> available
> > > > > > > > > > > in the FIFO's (polling
the Empty Flag). I'm prefeteching
> 27.6% of
> > > > > > > > > > > the data before the burst
ends. All variables are in IRAM.
> > > > > > > > > >
> > > > > > > > > > The typical reason for doing
it that way is to avoid delay
> > > (latency) in
> > > > > > > your signal
> > > > > > > > > > processing flow, relative to
some output (DAC, GPIO line, digital
> > > > > > > transmission,
> > > > > > > > > > etc).  Is that the case?  If
not then a block based method
> would be
> > > > > > > better, otherwise
> > > > > > > > > > you will waste a lot of time
polling for each element.  You don't
> > > have to
> > > > > > > implement
> > > > > > > > > > DMA as a first step to get
that working, you could use a code
> > > loop.  Then
> > > > > > > implement
> > > > > > > > > > DMA in order to further
improve performance.
> > > > > > > > > >
> > > > > > > > > > -Jeff
> > > > > > > > > >
> > > > > > > > > > > My function looks like
this:
> > > > > > > > > > >
> > > > > > > > > > > > void Calculator_AddSample()
> > > > > > > > > > > > {
> > > > > > > > > > > >    x++;
> > > > > > > > > > > >
> > > > > > > > > > > >    read1 = (int*) 0x90300004;
> > > > > > > > > > > >    read2 = (int*) 0x90300008;
> > > > > > > > > > > >
> > > > > > > > > > > >    tmpRead1 = *read1;
> > > > > > > > > > > >    tmpRead2 = *read2;
> > > > > > > > > > > >
> > > > > > > > > > > >    // CHANNEL 1
> > > > > > > > > > > >    CH1.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF0000) >> 16)];
> > > > > > > > > > > >    // CHANNEL 2
> > > > > > > > > > > >    CH2.deloggedData[x] = LUT[0][((tmpRead1 & 0xFF000000) >> 24)];
> > > > > > > > > > > >    // FWS R+L Add
> > > > > > > > > > > >    if(LRneeded == 1)
> > > > > > > > > > > >    {
> > > > > > > > > > > >       CH1.deloggedData[x] += CH2.deloggedData[x];
> > > > > > > > > > > >       if(CH1.deloggedData[x] > 5000)
> > > > > > > > > > > >       {
> > > > > > > > > > > >          CH1.deloggedData[x] = 5000;
> > > > > > > > > > > >       }
> > > > > > > > > > > >    }
> > > > > > > > > > > >    // CHANNEL 3 this channel is always read for particle matching on this channel
> > > > > > > > > > > >    binData[x] = (tmpRead2 & 0xFF);
> > > > > > > > > > > >    CH3.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF))];
> > > > > > > > > > > >
> > > > > > > > > > > >    // CHANNEL 4
> > > > > > > > > > > >    CH4.deloggedData[x] = LUT[0][((tmpRead2 & 0xFF00) >> 8)];
> > > > > > > > > > > >    // CHANNEL 5
> > > > > > > > > > > >    CH5.deloggedData[x] = LUT[1][((tmpRead1 & 0xFF00) >> 8)];
> > > > > > > > > > > >    // CHANNEL 6
> > > > > > > > > > > >    CH6.deloggedData[x] = LUT[1][tmpRead1 & 0xFF];
> > > > > > > > > > > > }
> > > > > > > > > > > This function executes 2
reads from 2 different FIFO's and then
> > > > > > > seperates the different datachannels and
decompresses the value's with a
> > > LookUp
> > > > > > > Table.
> > > > > > > > > > >
> > > > > > > > > > > I am trying to streamline
this function so it can keep up
> with the
> > > > > > > incoming data. The data is written to the
FIFO's with 4MHz. The data
> > > consists of
> > > > > > > small burst packets ranging from 3 to 4096
bytes per channel.
> > > > > > > > > > >
> > > > > > > > > > > At the moment I am
starting this "prefetch" function when a
> > > burst starts
> > > > > > > and execute this function every time there is
data available in the
> FIFO's
> > > > > > > (polling the Empty Flag). I'm prefeteching
27.6% of the data before the
> > > burst
> > > > > > > ends. All variables are in IRAM.
> > > > > > > > > > >
> > > > > > > > > > > I think I made an error
in suspecting the EMIF transfer speed
> > > and I now
> > > > > > > suspect that there may be some overhead in
the polling scheme I use for
> > > calling
> > > > > > > this function that results in the slow
transfer speed. I will look into
> > > this. I
> > > > > > > would like to thank everyone for there
input.
> > > > > > > > > > >
> > > > > > > > > > > With kind regards,
> > > > > > > > > > >
> > > > > > > > > > > Dominic
> > > > > > > > > > >
> > > > > > > > > > > --- In
c...@yahoogroups.com, Adolf Klemenz <adolf.klemenz@>
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Dominic,
> > > > > > > > > > > >
> > > > > > > > > > > > At 16:45 13.07.2009
+0000, d.stuartnl wrote:
> > > > > > > > > > > > >as I understand
DMA, I would need to work in "blocks" of
> data but
> > > > > that
> > > > > > > > > > > > >would be very
tricky in my application since I do not
> know how
> > > > > big the
> > > > > > > > > > > > >datastream is
gonna be. Or is it possible to use DMA for
> > > single byte
> > > > > > > transfers?
> > > > > > > > > > > >
> > > > > > > > > > > > using DMA makes
sense for block transfers only. Typical Fifo
> > > > > applications
> > > > > > > > > > > > will use the Fifo's
half-full flag (or a similar signal) to
> > > > > trigger a DMA
> > > > > > > > > > > > block read.
> > > > > > > > > > > > You may use
element-synchronized DMA (each trigger
> transfers only
> > > > > one data
> > > > > > > > > > > > word), but there
will be no speed improvement: It takes about
> > > > > 100ns from
> > > > > > > > > > > > the EDMA sync event
to the actual data transfer on a C6713.
> > > > > > > > > > > >
> > > > > > > > > > > > Attached is a scope
screenshot generated by this test program
> > > > > > > > > > > >
> > > > > > > > > > > > > // compiled with -o2 and without debug info:
> > > > > > > > > > > > >
> > > > > > > > > > > > > volatile int buffer; // must be volatile to prevent
> > > > > > > > > > > > >                      // optimizer from code removal
> > > > > > > > > > > > > for (;;)
> > > > > > > > > > > > > {
> > > > > > > > > > > > >      buffer = *(volatile int*)0x90300000;
> > > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > The screenshot shows
chip select and read signal with the
> expected
> > > > > timings
> > > > > > > > > > > > (20ns strobe width).
The gap between sucessive reads is
> caused by
> > > > > the DSP
> > > > > > > > > > > > architecture. Here
it is 200ns because a 225MHz DSP was used,
> > > > > which should
> > > > > > > > > > > > translate to 150ns
on a 300MHz device.
> > > > > > > > > > > >
> > > > > > > > > > > > If this isn't fast
enough, you must use block transfers.
> > > > > > > > > > > >
> > > > > > > > > > > >    Best Regards,
> > > > > > > > > > > >    Adolf Klemenz,
D.SignT
> > > > > > > > > >
> > > > > > > ------- End of Original Message -------
> > > > > > >
> > > > > ------- End of Original Message -------
> > > > >
> > > ------- End of Original Message -------
> > >
> ------- End of Original Message -------
>

_____________________________________


Re: Re: Slow EMIF transfer - Richard Williams - Jul 24 11:12:41 2009

d.stuartnl,

my comments in-line and prefixed with <rkw>

R. Williams
---------- Original Message -----------
From: "d.stuartnl" <d...@yahoo.com>
To: c...@yahoogroups.com
Sent: Fri, 24 Jul 2009 09:26:55 -0000
Subject: [c6x] Re: Slow EMIF transfer

> R.Williams,
> 
> SUCCESS! Looptime has almost halved! Software pipelining is working 
> now thanks to your tips:

<rkw> congratulations!!

<snip>

> 
> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> is working fine as it is. Dont know why though.

<rkw> two reasons:
1) the data size of a float is the same as the address data size
2) the '-1' because the pCH1 pointer is incremented at the end of the loop
to point 1 past the last location used.
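
A side note for readers, offered as a sketch rather than a correction: as far as I know, standard C scales pointer subtraction by the element size, so the difference of two pointers into the same array is already an element count whatever sizeof the element is. The helper names below are made up for the illustration.

```c
#include <stddef.h>

/* Number of float elements between base and the next free slot.
 * The subtraction is scaled by sizeof(float) by the language itself. */
size_t elements_written(const float *base, const float *next)
{
    return (size_t)(next - base);
}

/* The same count for 1-byte elements, to show that the result does
 * not depend on the element size. */
size_t byte_elements_written(const char *base, const char *next)
{
    return (size_t)(next - base);
}
```

As for the '-1': in the posted loop the word that equals the terminator still appears to be stored, and the pointers advanced, before the loop condition is evaluated again, so subtracting 1 drops that terminator entry from the count. If the loop instead ends because the buffer is full, the same '-1' would drop a real sample.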
<snip>

> > 
> still have them in a single loop and it's pipelining. Do you think 
> it's worth considering splitting it into two loops and check if 
> there's (an even better) speed increase?

<rkw> you could experiment, but it looks like it is not necessary to separate
the code into two loops.
<snip>

> My new and improved function:
<snip>

>       // CHANNEL 3 this channel is always read for particle matching 
> on this channel      *pCH3 = LUT0[((tmpRead2 & 0xFF))];	     
>  *pBinData3 = tmpRead2 & 0xFF;      // CHANNEL 4      *pCH4 = 
> LUT0[((tmpRead2 & 0xFF00) >> 8)];

<rkw> there seems to be a problem in the editing of the above 4 lines
It looks like pCH3 is not being used; however, pCH3 is still being initialized
and incremented in the code.
Also when testing for execution speed, adding new operations (pBinData3) makes
it very difficult to make timing comparisons.
<snip>

> 
> As you might have seen in my code the second read (tempRead2) is a 32 
> bits int but I'm only interrested in the first 16 bits (where channel 
> 3 and 4 reside), is there a way i can inform the compiler

<rkw> the natural size of an operation is 32 bits; changing to a 16 bit operation
would <probably> slow the code execution.

> 
> I had to leave pFifo12 and pFifo3 volatile because when i removed 
> these keywords the software pipelining was disabled again (Cannot find 
> schedule).

<rkw> the 'volatile' is needed for the two parameters because they DO change
between reads.  I had suggested to remove the 'volatile' from the variables, not
the parameters.
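
To illustrate the distinction with a minimal sketch (the function name and the two-channel layout are assumptions for the example, not the code from the thread): the pointed-to device location is declared volatile, while the local copy is an ordinary variable that the compiler is free to keep in a register.

```c
#include <stdint.h>

/* Read one word from a memory-mapped location and split it into two
 * 8-bit channels. Only the pointed-to location is volatile; the local
 * copy 'word' is a plain variable. */
void read_two_channels(volatile const uint32_t *reg,
                       uint8_t *ch3, uint8_t *ch4)
{
    uint32_t word = *reg;                    /* single access to the device */
    *ch3 = (uint8_t)(word & 0xFFu);          /* bits 7..0  */
    *ch4 = (uint8_t)((word >> 8) & 0xFFu);   /* bits 15..8 */
}
```

With a volatile local, every use of the variable forces a store to and a reload from the stack, which matches the repeated *+SP(4) loads in the earlier assembly listing.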
<snip>

> 
> With kind regards,
> 
> Dominic
<snip>

_____________________________________


Re: Re: Slow EMIF transfer - Michael Dunn - Jul 24 12:18:13 2009

Congratulations, Dominic!!

I'll top post this minor comment wrt 16/32 bit memory accesses and speed.

Assuming that you have 32 bit wide memory with aligned accesses, 32,
16, and 8 bit accesses will be the same speed.
Only if your external memory is 8 or 16 bits wide would there be any
potential advantage in performing 16 bit accesses instead of 32 bit
accesses.
Also, there would be an advantage in fetching 32 bits at a time if you
have an entire array of 8 or 16 bit values.

I haven't looked at the details of your code, but if you always fetch
48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
that your hardware addresses are preventing you from picking up some
additional speed.  *If* the input addresses began on a 64 bit boundary
[0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
any memory fetch would coerce the compiler into performing an 'LDDW' [64
bit read].
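
A rough sketch of the 64-bit idea in portable C, for illustration only. WordPair, split64 and read_pair are hypothetical names; whether the TI compiler really emits an LDDW for this, and whether the FIFO hardware tolerates a 64-bit access, are assumptions that have not been checked here. It also assumes a little-endian target, where the low half of the 64-bit value comes from the lower address.

```c
#include <stdint.h>

typedef struct {
    uint32_t lo;   /* word at the lower address (little-endian) */
    uint32_t hi;   /* word at the higher address */
} WordPair;

/* Split one 64-bit value into its two 32-bit halves. */
WordPair split64(uint64_t v)
{
    WordPair p;
    p.lo = (uint32_t)(v & 0xFFFFFFFFu);
    p.hi = (uint32_t)(v >> 32);
    return p;
}

/* One 64-bit access to an 8-byte-aligned location, then split. */
WordPair read_pair(volatile const uint64_t *src)
{
    return split64(*src);
}
```

With the addresses in this thread (0x90300004 and 0x90300008) the two words do not share an 8-byte-aligned location, which is the limitation described above.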

Since your hardware addresses are fixed, you only need 1 pointer.  You could use
tmpRead2 = *(read1 + 1);   /* read1 is an int *, so +1 advances 4 bytes */
This would free up one register and, depending on register
utilization, could improve the performance.
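
A small sketch of the single-pointer idea, with made-up helper names. Pointer arithmetic on a 32-bit word pointer is scaled by 4 bytes, so getting from 0x90300004 to 0x90300008 takes an offset of one element, while an offset of 4 elements would land on 0x90300014.

```c
#include <stdint.h>

/* Address reached by adding 'words' elements to a 32-bit word pointer
 * that starts at 'base'. Each element is sizeof(uint32_t) = 4 bytes. */
uintptr_t word_offset_address(uintptr_t base, unsigned words)
{
    return base + (uintptr_t)words * sizeof(uint32_t);
}

/* Read two adjacent 32-bit locations through a single base pointer. */
void read_adjacent(volatile const uint32_t *base,
                   uint32_t *first, uint32_t *second)
{
    *first  = base[0];   /* base + 0 bytes */
    *second = base[1];   /* base + 4 bytes */
}
```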

mikedunn
-- 
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________





(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: Slow EMIF transfer - "d.stuartnl" - Jul 24 23:50:50 2009

Thanks Mike,

I'm starting to enjoy this "tweaking" and am trying to push it as far
as I can, because every microsecond I gain means the DSP can handle more
particles/second. I've applied the tips I got on this forum to the rest of
my source code as well (the actual loops in my program that do the calculations
on the data), and those are pipelining as well now. Compared to the initial source,
the total improvement is over 900%! Amazing (it looks like I was using the DSP as a
glorified MCU), but the true power of the DSP is starting to show! Thank you
for your input; it does raise some questions, if you don't mind:

--- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@...> wrote:
>
> Congratulations, Dominic!!
> 
> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
> 
> Assuming that you have 32 bit wide memory with aligned accesses, 32,
> 16, and 8 bit accesses will be the same speed.

What exactly do you mean by "aligned"?

> Only if your external memory is 8 or 16 bits wide would there be any
> potential advantage in performing 16 bit accesses instead of 32 bit
> accesses.
> Also, there would be an advantage in fetching 32 bits at a time if you
> are reading an entire array of 8 or 16 bit values.
> 

I'm reading from 3 (16 bits) FIFO's. I've hooked them up so tempRead1 reads the
first two together (logic tied together so they "act" like 1 32 bits
wide FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).

> I haven't looked at the details of your code, but if you always fetch
> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
> that your hardware addresses are preventing you from picking up some
> additional speed.  *If* the input addresses began on a 64 bit boundary
> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
> any memory fetch would coerce the compiler to performing an 'LDDW' [64
> bit read].

I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by telling my
compiler to do a 64 bit read? (I mean, this still has to be split into 2
read cycles somehow?)
> 
> Since your hardware addresses are fixed, you only need 1 pointer.  You could use
> tmpRead2 = *(read1 + 1);  /* read1 is an int *, so +1 advances 4 bytes to 0x90300008 */
> This would free up one register and, depending on register
> utilization, could improve the performance.
> 

Improve performance, that's what I like to hear ;) I hope my questions aren't too
"basic".

Dominic


_____________________________________





Re: Re: Slow EMIF transfer - William C Bonner - Jul 25 9:37:13 2009

I just wanted to mention that I've enjoyed following this thread, because it
started with somewhat general code and has allowed me to follow as you've
learned a bunch of optimization tricks.  I've gone through several of these
stages over the past couple of years, but that doesn't mean that I remember
to use the tricks all the time.  Seeing them again has been helpful.

Wim.







Re: Re: Slow EMIF transfer - Michael Dunn - Jul 25 10:38:28 2009

Dominic,

On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl<d...@yahoo.com> wrote:
> Thanks Mike,
>
> I'm starting to enjoy this "tweaking" and am trying to push it as far as I
> can because every microsecond that i gain means the DSP can handle more
> particles/second. I've applied the tips I've gotten on this forum on the
> rest of my source code as well (the actual loops in my program that do the
> calculations on the data) and those are pipelining aswell now. Compared to
> the initial source total improvement is over 900%! Amazing (looks like I was
> using the DSP as a glorified MCU) but the true power of the DSP is starting
> to show! I thank you for your input but it raises some questions if you
> don't mind:
>
> --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@...> wrote:
>>
>> Congratulations, Dominic!!
>>
>> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
>>
>> Assuming that you have 32 bit wide memory with aligned accesses, 32,
>> 16, and 8 bit accesses will be the same speed.
>
> What do you mean with aligned exactly?
<mld>
'Evenly divisible by the access size' or if 'myAddress % myAccessSize
== 0' then it is aligned.
For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
are aligned [byte addresses are always aligned].
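The divisibility rule above can be written as a one-line check. This is only a sketch in portable C; the function name is_aligned is made up for illustration, and the addresses in the usage note are the FIFO addresses quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when addr is evenly divisible by access_size (in bytes),
 * i.e. when an access of that size at addr is "aligned", else 0.
 * is_aligned is a hypothetical helper name, not a TI library call. */
int is_aligned(uintptr_t addr, size_t access_size)
{
    assert(access_size != 0);   /* a zero-byte access makes no sense */
    return (addr % access_size) == 0;
}
```

By this rule 0x90300004 is aligned for a 32 bit access but not for a 64 bit one, while 0x90300008 is aligned for both, so a single aligned 64 bit load could not cover the pair 0x90300004/0x90300008 as the hardware is currently mapped.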

>
>> Only if your external memory is 8 or 16 bits wide would there be any
>> potential advantage in performing 16 bit accesses instead of 32 bit
>> accesses.
>> Also, there would be an advantage in fetching 32 bits at a time if you
>> are reading an entire array of 8 or 16 bit values.
>
> I'm reading from 3 (16 bits) FIFO's. I've hooked them up so tempRead1 reads
> the first two together (logic tied together so they "act" like 1 32bits wide
> FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
>
>> I haven't looked at the details of your code, but if you always fetch
>> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
>> that your hardware addresses are preventing you from picking up some
>> additional speed. *If* the input addresses began on a 64 bit boundary
>> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
>> any memory fetch would coerce the compiler to performing an 'LDDW' [64
>> bit read].
>
> I do always fetch 48 bits (1x 32, 1x 16) but what would i gain by telling my
> compiler to fetch a 64 bit read (I mean this still has to be split somehow
> in 2 read cycles somehow?)
<mld>
First of all, I wrote this before I had the idea of using a single
pointer.  Your code has 2 pointers that load data - this means that
you are using 4 processor registers.  Changing to a single 64 bit read
[32 x 2] would result in requiring only 3 registers.  If your routine
has a lot of register pressure [utilization] where it is loading and
unloading CPU registers, then a 'register reduction change' would help
performance.

As I finished writing about the double read, I thought of 'plan B' -
just use one pointer with an offset.  When you look at the asm
listing, it should give you some register usage info.  If you are
getting 'spills' then definitely try this.
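As a rough illustration of 'plan B', the two reads can go through one base pointer. This sketch assumes the two FIFO words sit at consecutive 32 bit addresses (0x90300004 and 0x90300008 in this thread); the name read_fifo_pair is hypothetical. Pointer arithmetic on a 32 bit pointer counts in elements, so the second word is one element (4 bytes) past the base.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read both FIFO words through ONE base pointer.
 * On the target, base would be (const volatile uint32_t *)0x90300004.
 * volatile keeps the compiler from caching or dropping the hardware reads. */
void read_fifo_pair(const volatile uint32_t *base,
                    uint32_t *word1, uint32_t *word2)
{
    assert(base != 0 && word1 != 0 && word2 != 0);
    *word1 = base[0];   /* 32 bit word at the base address            */
    *word2 = base[1];   /* next 32 bit word: base + 4 BYTES, index 1  */
}
```

Whether this actually frees a register depends on what the compiler already does with the two constant addresses, so the register usage table in the asm listing is the place to check.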

>
>> Since your hardware addresses are fixed, you only need 1 pointer. You
>> could use
>> tmpRead2 = *(read1 + 1);  /* read1 is an int *, so +1 advances 4 bytes */
>> This would free up one register and, depending on register
>> utilization, could improve the performance.
>
> Improve performance, that's what I like to hear ;) I hope my questions aren't
> too "basic".
<mld>
Most active members of this group are willing to help someone who
wants to learn.  As long as your questions are informed and you show a
willingness to participate, most of us will help if we can.  We come
from a variety of backgrounds and each of us ends up learning something
from time to time.

As you are learning, 'performance improvement' is not something that
has a single solution.  Rather, it is a journey with many stops along
the way.

mikedunn
-- 
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________





Re: Slow EMIF transfer - "d.stuartnl" - Sep 2 9:43:58 2009

Dear Mike,

I have been away on holiday; I hope those of you reading this had a holiday too
(and a good one). Right, back to work it is! I was wondering about your earlier
idea of using longer reads to force the compiler to use the LDDW instruction.
My question now is: how do I do this?

the 2 reads are as follows:

tmpRead1 = *(volatile int*) 0x90300004;
tmpRead2 = *(volatile int*) 0x90300008;

These are both 32 bit reads, but I only need 48 bits in total: 32 bits from
tmpRead1 and the 16 least significant bits of tmpRead2. Furthermore, these 2
reads represent 6 (8-bit) channels:

read:         tmpRead2                        tmpRead1
     M                              L M                              L
     S                              S S                              S
     B                              B B                              B  
bit :******************************** ******************************** 
use :----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6

(forgive my primitive ASCII art :P)
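The bit layout in the diagram can be spelled out as a small unpacking function. This is a sketch in portable C, not the routine from the post: the struct and function names are invented for illustration, and it returns the raw 8-bit codes before any LUT lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Raw 8-bit codes for the six channels, before any lookup table is applied.
 * struct channel_codes and unpack_channels are hypothetical names. */
struct channel_codes {
    uint8_t ch1, ch2, ch3, ch4, ch5, ch6;
};

/* Split the two FIFO words according to the diagram:
 *   word1 (tmpRead1): [31:24]=CH2  [23:16]=CH1  [15:8]=CH5  [7:0]=CH6
 *   word2 (tmpRead2): [15:8]=CH4   [7:0]=CH3    (upper 16 bits unused) */
struct channel_codes unpack_channels(uint32_t word1, uint32_t word2)
{
    struct channel_codes c;
    c.ch2 = (uint8_t)((word1 >> 24) & 0xFFu);
    c.ch1 = (uint8_t)((word1 >> 16) & 0xFFu);
    c.ch5 = (uint8_t)((word1 >> 8) & 0xFFu);
    c.ch6 = (uint8_t)(word1 & 0xFFu);
    c.ch4 = (uint8_t)((word2 >> 8) & 0xFFu);
    c.ch3 = (uint8_t)(word2 & 0xFFu);
    return c;
}
```

Shifting first and masking afterwards on an unsigned value keeps every index in the range 0..255, whatever the top bits of the word contain.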

In my fetchData routine (which is pipelining!!) I fetch the data into these 2
variables and then distribute the data in 6 channels. My code is as follows:

unsigned int Calculator_FetchData(Bool curvature)
{
   unsigned int tmpRead1 =0;
   unsigned int tmpRead2;
   unsigned int sampleCount;
   float * restrict pCH1;
   float * restrict pCH2;
   float * restrict pCH3;
   char *  restrict pBinData3;
   float * restrict pCH4;
   float * restrict pCH5;
   float * restrict pCH6;
   
   const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
   const unsigned int termValue = 0x84825131u; // unsigned: compared with tmpRead1, and the value does not fit in a signed 32-bit int

   pCH1 = &CH1.deloggedData[0];
   pCH2 = &CH2.deloggedData[0];
   pCH3 = &CH3.deloggedData[0];
   pBinData3 = &binData3[0];
   pCH4 = &CH4.deloggedData[0];
   pCH5 = &CH5.deloggedData[0];
   pCH6 = &CH6.deloggedData[0];
   
   #pragma MUST_ITERATE(16,4096,2);
   while(tmpRead1 != termValue)
   {
      tmpRead1 = *(volatile int*) 0x90300004;
      tmpRead2 = *(volatile int*) 0x90300008;
		
      //CHANNEL 1
      *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
		
      // CHANNEL 2
      *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];

      if(curvature)
      {
         *pCH1 += *pCH2;
         if(*pCH1 > 5000)
         {
            *pCH1 = 5000;
         }
      }
      
      //CHANNEL 5
      *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];

      // CHANNEL 6
      *pCH6 = LUT1[tmpRead1 & 0xFF];
	
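      // CHANNEL 3: this channel is always read for particle matching on this channel
      *pBinData3 = tmpRead2 & 0xFF;
      // Index with the masked value itself: reading *pBinData3 back could give
      // a negative index if plain char is signed, and adds a store-to-load
      // dependency inside the loop.
      *pCH3 = LUT0[tmpRead2 & 0xFF];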

      // CHANNEL 4
      *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
		
      pCH1++;
      pCH2++;
      pCH3++;	
      pBinData3++;
      pCH4++;
      pCH5++;
      pCH6++;
	
      if(pCH1 > endCH1)//Check for sample overflow (4096 samples max)
      {
         tmpRead1 = termValue;
      }
   }
   sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
   Screen_updateSamples(sampleCount);
   return sampleCount;
}

At the moment I'm getting the following pipeline information in the ASM file:

_Calculator_FetchData:
;** --------------------------------------------------------------------------*
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 219
;*      Loop opening brace source line   : 220
;*      Loop closing brace source line   : 266
;*      Known Minimum Trip Count         : 16                    
;*      Known Maximum Trip Count         : 4096                    
;*      Known Max Trip Count Factor      : 2
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 9
;*      Partitioned Resource Bound(*)    : 9
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     2        1     
;*      .S units                     4        4     
;*      .D units                     8        9*    
;*      .M units                     0        0     
;*      .X cross paths               1        1     
;*      .T address paths             9*       8     
;*      Long read paths              5        4     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           1        1     (.L or .S unit)
;*      Addition ops (.LSD)          3        3     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             4        3     
;*      Bound(.L .S .D .LS .LSD)     6        6     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 9  Unsafe schedule for irregular loop
;*         ii = 9  Did not find schedule
;*         ii = 10 Unsafe schedule for irregular loop
;*         ii = 10 Unsafe schedule for irregular loop
;*         ii = 10 Did not find schedule
;*         ii = 11 Unsafe schedule for irregular loop
;*         ii = 11 Unsafe schedule for irregular loop
;*         ii = 11 Did not find schedule
;*         ii = 12 Unsafe schedule for irregular loop
;*         ii = 12 Unsafe schedule for irregular loop
;*         ii = 12 Did not find schedule
;*         ii = 13 Unsafe schedule for irregular loop
;*         ii = 13 Unsafe schedule for irregular loop
;*         ii = 13 Unsafe schedule for irregular loop
;*         ii = 13 Did not find schedule
;*         ii = 14 Unsafe schedule for irregular loop
;*         ii = 14 Unsafe schedule for irregular loop
;*         ii = 14 Unsafe schedule for irregular loop
;*         ii = 14 Did not find schedule
;*         ii = 15 Unsafe schedule for irregular loop
;*         ii = 15 Unsafe schedule for irregular loop
;*         ii = 15 Unsafe schedule for irregular loop
;*         ii = 15 Did not find schedule
;*         ii = 16 Unsafe schedule for irregular loop
;*         ii = 16 Unsafe schedule for irregular loop
;*         ii = 16 Unsafe schedule for irregular loop
;*         ii = 16 Did not find schedule
;*         ii = 17 Unsafe schedule for irregular loop
;*         ii = 17 Unsafe schedule for irregular loop
;*         ii = 17 Unsafe schedule for irregular loop
;*         ii = 17 Did not find schedule
;*         ii = 18 Unsafe schedule for irregular loop
;*         ii = 18 Unsafe schedule for irregular loop
;*         ii = 18 Unsafe schedule for irregular loop
;*         ii = 18 Did not find schedule
;*         ii = 19 Unsafe schedule for irregular loop
;*         ii = 19 Schedule found with 1 iterations in parallel
;*
;*      Register Usage Table:
;*          +---------------------------------+
;*          |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;*          |0000000000111111|0000000000111111|
;*          |0123456789012345|0123456789012345|
;*          |----------------+----------------|
;*       0: |* *   ********* | ***** *  ***   |
;*       1: |* *  ********** | ***** *  ***   |
;*       2: |* *  ********** | ***** *  ***   |
;*       3: |* *  ********** | ***** *  ***   |
;*       4: |* *  ********** | ***** *  ***   |
;*       5: |* *  ********** | ***** *  ***   |
;*       6: |* ************* | ***** *  ***   |
;*       7: |* **************| ***** *  ***   |
;*       8: |* **************| ***** ** ***   |
;*       9: |* **************| ******** ***   |
;*      10: |* **************|********* ***   |
;*      11: |****************| ***** ******   |
;*      12: |****************| ***** ******   |
;*      13: |****************| ************   |
;*      14: |* **************| ***** ******   |
;*      15: |* **************| ***** * ****   |
;*      16: |*************** | *******  ***   |
;*      17: |*** * ********* | *******  ***   |
;*      18: |*** * ********* | *******  ***   |
;*          +---------------------------------+
;*
;*      Done
;*
;*      Loop is interruptible
;*      Collapsed epilog stages     : 0
;*      Collapsed prolog stages     : 0
;*
;*      Minimum safe trip count     : 1

Looking at the register usage, it looks like it's using quite a lot of registers,
and I thought that maybe the LDDW would relieve some registers. I was also
wondering whether I can force-align arrays in memory. If that's possible, I can use 1
pointer to access all the channels (by using an offset when addressing):
----------------------------------------------------------------
IDEA:
	//CHANNEL 1
	*pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];

	// CHANNEL 2
	*(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];

        // OTHER CHANNELS addressed using bigger offsets...
-------------------------------------------------------------------

I don't know if this idea is feasible, but if it is, I think it would relieve some
more of the pressure on register usage.
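One way the single-pointer idea might look is to interleave the channels, one record per sample, so that every channel sits at a small fixed offset from one pointer. This is a sketch under assumptions: struct sample and store_sample are hypothetical names, and the rest of the program would have to accept this layout instead of six separate arrays. For the alignment question, TI's compiler has a DATA_ALIGN pragma for global symbols; it is left out here so the sketch builds with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: one record per sample instead of six separate arrays.
 * Each channel is then a constant, small offset from a single pointer. */
struct sample {
    float ch1, ch2, ch3, ch4, ch5, ch6;
};

/* Store one sample's six values through a single pointer and return the
 * pointer advanced to the next record (one increment instead of six). */
struct sample *store_sample(struct sample *p, const float v[6])
{
    assert(p != NULL && v != NULL);
    p->ch1 = v[0];
    p->ch2 = v[1];
    p->ch3 = v[2];
    p->ch4 = v[3];
    p->ch5 = v[4];
    p->ch6 = v[5];
    return p + 1;
}
```

The sample count then falls out of one pointer difference, as before. Whether this actually lowers register pressure or shortens the schedule can only be judged from the software pipeline information the compiler reports for the changed loop.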

Anyone's ideas/comments are welcome. At the moment the code is running too
slowly: its run time needs to drop by about 33%. If I offer 4000 samples @ 4MHz,
the data takes 1000 us to load into my FIFO's, but it takes the DSP 1500 us to
run the FetchData routine. In an ideal situation I would like to complete the
FetchData routine in 1000 us (not any shorter, or I would read faster than data
is being written :P).
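The numbers above can be turned into a per-sample cycle budget. The sketch below is plain arithmetic in C with made-up function names; the 300 MHz CPU clock is an assumption, inferred from the first post in this thread where 10 clocks were reported as 0.033 us.

```c
#include <assert.h>

/* Time in microseconds for n_samples arriving at rate_mhz
 * (MHz is the same as samples per microsecond). */
double fill_time_us(double n_samples, double rate_mhz)
{
    assert(rate_mhz > 0.0);
    return n_samples / rate_mhz;
}

/* CPU cycles per sample when total_us microseconds are spent on n_samples.
 * cpu_mhz is the CPU clock in MHz (cycles per microsecond); 300 is assumed. */
double cycles_per_sample(double total_us, double n_samples, double cpu_mhz)
{
    assert(n_samples > 0.0);
    return total_us * cpu_mhz / n_samples;
}
```

With these assumptions, 4000 samples at 4 MHz take 1000 us to arrive, which leaves a budget of 75 CPU cycles per sample, while the measured 1500 us corresponds to 112.5 cycles per sample. The pipeline report above found a schedule at ii = 19, so the loop kernel itself would need only about 19 cycles per iteration; that suggests, without proving it, that most of the measured time goes to stalls such as waiting on the external reads rather than to the instructions in the loop.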

With kind regards,

Dominic Stuart

--- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@...> wrote:
>
> Dominic,
> 
> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl<d.stuartnl@...> wrote:
> >
> >
> > Thanks Mike,
> >
> > I'm starting to enjoy this "tweaking" and am trying to push
it as far as I
> > can because every microsecond that i gain means the DSP can handle
more
> > particles/second. I've applied the tips I've gotten on this forum on
the
> > rest of my source code as well (the actual loops in my program that do
the
> > calculations on the data) and those are pipelining aswell now.
Compared to
> > the initial source total improvement is over 900%! Amazing (looks like
I was
> > using the DSP as a glorified MCU) but the true power of the DSP is
starting
> > to show! I thank you for your input but it raises some questions if
you
> > don't mind:
> >
> > --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@>
wrote:
> >>
> >> Congratulations, Dominic!!
> >>
> >> I'll top post this minor comment wrt 16/32 bit memory accesses and
speed.
> >>
> >> Assuming that you have 32 bit wide memory with aligned accesses,
32,
> >> 16, and 8 bit accesses will be the same speed.
> >
> > What do you mean with aligned exactly?
> <mld>
> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
> == 0' then it is aligned.
> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
> are aligned [byte addresses are always aligned].
> 
> >
> >> Only if your external memory is 8 or 16 bits wide would there be
any
> >> potential advantage in performing 16 bit accesses instead of 32
bit
> >> accesses.
> >> Also, there would be an advantage in fetching 32 bits at a time if
you
> >> an entire array of 8 or 16 bit values.
> >>
> >
> > I'm reading from 3 (16 bits) FIFO's. I've hooked them up so tempRead1
reads
> > the first two together (logic tied together so they "act"
like 1 32bits wide
> > FIFO. tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
> >
> >> I haven't looked at the details of your code, but if you always
fetch
> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is
*possible*
> >> that your hardware addresses are preventing you from picking up
some
> >> additional speed. *If* the input addresses began on a 64 bit
boundary
> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64
bits],
> >> any memory fetch would coerce the compiler to performing an 'LDDW'
[64
> >> bit read].
> >
> > I do always fetch 48 bits (1x 32, 1x 16) but what would i gain by
telling my
> > compiler to fetch a 64 bit read (I mean this still has to be split
somehow
> > in 2 read cycles somehow?)
> <mld>
> First of all, I wrote this before I had the idea of using a single
> pointer.  Your code has 2 pointers that load data - this means that
> you are using 4 processor registers.  Changing to a single 64 bit read
> [32 x 2] would result in requiring only 3 registers.  If your routine
> has a lot of register pressure [utilization] where it is loading and
> unloading CPU registers, then a 'register reduction change' would help
> performance.
> 
> As I finished writing about the double read, I thought of 'plan B' -
> just use one pointer with an offset.  When you look at the asm
> listing, it should give you some register usage info.  If you are
> getting 'spills' then definitely try this.
> 
> >
> >>
> >> Since your hardware addresses are fixed, you only need 1 pointer. You
> >> could use
> >> tmpRead2 = *(read1 + 1);  /* read1 is an int *, so +1 advances 4 bytes */
> >> This would free up one register and, depending on register
> >> utilization, could improve the performance.
> >>
> >
> > Improve performance, that's what I like to hear ;) I hope my questions aren't
> > too "basic".
> <mld>
> Most active members of this group are willing to help someone who
> wants to learn.  As long as your questions are informed and you show a
> willingness to participate, most of us will help if we can.  We come
> from a variety of backgrounds and each of us end up learning something
> from time to time.
> 
> As you are learning, 'performance improvement' is not something that
> has a single solution.  Rather, it is a journey with many stops along
> the way.
> 
> mikedunn
> >
> > Dominic
> >
> >> mikedunn
> >>
> >>
> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams<rkwill@>
wrote:
> >> >
> >> >
> >> > d.stuartnl,
> >> >
> >> > my comments in-line and prefixed with <rkw>
> >> >
> >> > R. Williams
> >> >
> >> > ---------- Original Message -----------
> >> > From: "d.stuartnl" <d.stuartnl@>
> >> > To: c...@yahoogroups.com
> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
> >> > Subject: [c6x] Re: Slow EMIF transfer
> >> >
> >> >> R.Williams,
> >> >>
> >> >> SUCCESS! Looptime has almost halved! Software pipelining
is working
> >> >> now thanks to your tips:
> >> >
> >> > <rkw> congratulations!!
> >> >
> >> > <snip>
> >> >
> >> >>
> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> >> >> is working fine as it is. Don't know why though.
> >> >
> >> > <rkw> two reasons:
> >> > 1) the data size of a float is the same as the address data size
> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
> >> > loop to point 1 past the last location used.
> >> > <snip>
> >> >
> >> >> >
> >> >> still have them in a single loop and it's pipelining. Do
you think
> >> >> it's worth considering splitting it into two loops and
check if
> >> >> there's (an even better) speed increase?
> >> >
> >> > <rkw> you could experiment, but it looks like it is not
necessary to
> >> > separate
> >> > the code into two loops.
> >> > <snip>
> >> >
> >> >> My new and improved function:
> >> > <snip>
> >> >
> >> >> // CHANNEL 3 this channel is always read for particle
matching
> >> >> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> >> >> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 =
> >> >> LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >> >
> >> > <rkw> there seems to be a problem in the editing of the
above 4 lines
> >> > It looks like pCH3 is not being used; however, pCH3 is still
being
> >> > initialized
> >> > and incremented in the code.
> >> > Also when testing for execution speed, adding new operations
(pBinData3)
> >> > makes
> >> > it very difficult to make timing comparisons.
> >> > <snip>
> >> >
> >> >>
> >> >> As you might have seen in my code the second read
(tempRead2) is a 32
> >> >> bits int but I'm only interrested in the first 16 bits
(where channel
> >> >> 3 and 4 reside), is there a way i can inform the
compiler
> >> >
> >> > <rkw> the natural size of a operation is 32bits,
changing to a 16 bit
> >> > operation
> >> > would <probably> slow the code execution.
> >> >
> >> >>
> >> >> I had to leave pFifo12 and pFifo3 volatile because when i
removed
> >> >> these keywords the software pipelining was disabled again
(Cannot find
> >> >> schedule).
> >> >
> >> > <rkw> the 'volatile' is needed for the two parameters
because they DO
> >> > change
> >> > between reads. I had suggested to remove the 'volatile' from
the
> >> > variables,
> >> > not
> >> > the parameters.
> >> > <snip>
> >> >
> >> >>
> >> >> With kind regards,
> >> >>
> >> >> Dominic
> >> > <snip>
> >> >
> >>
> >>
> >>
> >> --
> >> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
> >>
> >
> > -- 
> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>

_____________________________________





(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: Re: Slow EMIF transfer - Michael Dunn - Sep 2 11:02:45 2009

Hello Dominic,

On Wed, Sep 2, 2009 at 8:13 AM, d.stuartnl<d...@yahoo.com> wrote:
> Dear Mike,
>
> I have been away on holiday; I hope the people reading this
> had a holiday too (and a good one). Right, back to work it is! I was
> wondering about your earlier idea about longer reads to force the compiler
> to use the LDDW instruction. My question now is: how do I do this?

<mld>
I probably did not express my point well [I sometimes answer these
posts while I am on the telephone or during a break from another
task].  *IF* the hardware had been designed with knowledge of the software,
the first register would have been on a 64 bit boundary [address ends in 0
or 8].
Ignore that thought for your case.
More comments near the end.
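
A minimal sketch of the alignment point, assuming nothing beyond the two addresses quoted in this thread: an access is aligned when the address is a multiple of the access size, so a 64 bit (LDDW-style) read would have to start at an address ending in 0 or 8. The helper names are made up for illustration.

```c
#include <stdint.h>

/* The two FIFO addresses used in this thread. */
#define FIFO12_ADDR 0x90300004u  /* 32 bit read: channels 2, 1, 5, 6 */
#define FIFO3_ADDR  0x90300008u  /* low 16 bits: channels 4, 3       */

/* Returns 1 when addr is a multiple of access_bytes, 0 otherwise.
 * access_bytes is assumed to be a power of two (1, 2, 4 or 8). */
static int is_aligned(uint32_t addr, uint32_t access_bytes)
{
    return (addr & (access_bytes - 1u)) == 0u;
}

/* Returns 1 when a single aligned 8-byte load starting at first_addr
 * would also cover the 32 bit word at second_addr. */
static int pair_fits_one_doubleword(uint32_t first_addr, uint32_t second_addr)
{
    return is_aligned(first_addr, 8u) && (second_addr == first_addr + 4u);
}
```

With the FIFOs at 0x90300004 and 0x90300008, the first address is aligned for a 32 bit read but not for a 64 bit one, which is why the double-word idea does not apply to this board as wired.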

>
> the 2 reads are as follows:
>
> tmpRead1 = *(volatile int*) 0x90300004;
> tmpRead2 = *(volatile int*) 0x90300008;
>
> These are both 32 bit reads, but I only need 48 bits in total (32 bits from
> tmpRead1 and the 16 least significant bits of tmpRead2). Furthermore, these 2
> reads represent 6 (8-bit) channels:
>
> read : tmpRead2                         tmpRead1
>        MSB                          LSB MSB                          LSB
> bit  : ******************************** ********************************
> use  : ----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6
>
> (forgive my primitive ASCII art :P)
>
> In my fetchData routine (which is pipelining!!) I fetch the data into these
> 2 variables and then distribute the data over 6 channels. My code is as
> follows:
>
> unsigned int Calculator_FetchData(Bool curvature)
> {
>     unsigned int tmpRead1 = 0;
>     unsigned int tmpRead2;
>
>     unsigned int sampleCount;
>     float * restrict pCH1;
>     float * restrict pCH2;
>     float * restrict pCH3;
>     char * restrict pBinData3;
>     float * restrict pCH4;
>     float * restrict pCH5;
>     float * restrict pCH6;
>
>     const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
>     const int termValue = 0x84825131;
>
>     pCH1 = &CH1.deloggedData[0];
>     pCH2 = &CH2.deloggedData[0];
>     pCH3 = &CH3.deloggedData[0];
>     pBinData3 = &binData3[0];
>     pCH4 = &CH4.deloggedData[0];
>     pCH5 = &CH5.deloggedData[0];
>     pCH6 = &CH6.deloggedData[0];
>
>     #pragma MUST_ITERATE(16,4096,2);
>     while(tmpRead1 != termValue)
>     {
>         tmpRead1 = *(volatile int*) 0x90300004;
>         tmpRead2 = *(volatile int*) 0x90300008;
>
>         // CHANNEL 1
>         *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>
>         // CHANNEL 2
>         *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
>         if(curvature)
>         {
>             *pCH1 += *pCH2;
>             if(*pCH1 > 5000)
>             {
>                 *pCH1 = 5000;
>             }
>         }
>
>         // CHANNEL 5
>         *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>
>         // CHANNEL 6
>         *pCH6 = LUT1[tmpRead1 & 0xFF];
>
>         // CHANNEL 3 this channel is always read for particle matching on this channel
>         *pBinData3 = tmpRead2 & 0xFF;
>         *pCH3 = LUT0[*pBinData3];
>
>         // CHANNEL 4
>         *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
>         pCH1++;
>         pCH2++;
>         pCH3++;
>         pBinData3++;
>         pCH4++;
>         pCH5++;
>         pCH6++;
>
>         if(pCH1 > endCH1) // Check for sample overflow (4096 samples max)
>         {
>             tmpRead1 = termValue;
>         }
>     }
>     sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
>     Screen_updateSamples(sampleCount);
>     return sampleCount;
> }
> At the moment I'm getting the following pipeline information in the ASM file:
>
> _Calculator_FetchData:
> ;** ----------------------------------------------------------*
> ;*----------------------------------------------------------*
> ;* SOFTWARE PIPELINE INFORMATION
> ;*
> ;* Loop source line : 219
> ;* Loop opening brace source line : 220
> ;* Loop closing brace source line : 266
> ;* Known Minimum Trip Count : 16
> ;* Known Maximum Trip Count : 4096
> ;* Known Max Trip Count Factor : 2
> ;* Loop Carried Dependency Bound(^) : 7
> ;* Unpartitioned Resource Bound : 9
> ;* Partitioned Resource Bound(*) : 9
> ;* Resource Partition:
> ;*                                A-side   B-side
> ;* .L units                          2        1
> ;* .S units                          4        4
> ;* .D units                          8        9*
> ;* .M units                          0        0
> ;* .X cross paths                    1        1
> ;* .T address paths                  9*       8
> ;* Long read paths                   5        4
> ;* Long write paths                  0        0
> ;* Logical ops (.LS)                 1        1     (.L or .S unit)
> ;* Addition ops (.LSD)               3        3     (.L or .S or .D unit)
> ;* Bound(.L .S .LS)                  4        3
> ;* Bound(.L .S .D .LS .LSD)          6        6
> ;*
> ;* Searching for software pipeline schedule at ...
> ;* ii = 9 Unsafe schedule for irregular loop
> ;* ii = 9 Did not find schedule
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Unsafe schedule for irregular loop
> ;* ii = 10 Did not find schedule
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Unsafe schedule for irregular loop
> ;* ii = 11 Did not find schedule
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Unsafe schedule for irregular loop
> ;* ii = 12 Did not find schedule
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Unsafe schedule for irregular loop
> ;* ii = 13 Did not find schedule
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Unsafe schedule for irregular loop
> ;* ii = 14 Did not find schedule
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Unsafe schedule for irregular loop
> ;* ii = 15 Did not find schedule
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Unsafe schedule for irregular loop
> ;* ii = 16 Did not find schedule
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Unsafe schedule for irregular loop
> ;* ii = 17 Did not find schedule
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Unsafe schedule for irregular loop
> ;* ii = 18 Did not find schedule
> ;* ii = 19 Unsafe schedule for irregular loop
> ;* ii = 19 Schedule found with 1 iterations in parallel
> ;*
> ;* Register Usage Table:
> ;* +---------------------------------+
> ;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> ;* |0000000000111111|0000000000111111|
> ;* |0123456789012345|0123456789012345|
> ;* |----------------+----------------|
> ;* 0: |* * ********* | ***** * *** |
> ;* 1: |* * ********** | ***** * *** |
> ;* 2: |* * ********** | ***** * *** |
> ;* 3: |* * ********** | ***** * *** |
> ;* 4: |* * ********** | ***** * *** |
> ;* 5: |* * ********** | ***** * *** |
> ;* 6: |* ************* | ***** * *** |
> ;* 7: |* **************| ***** * *** |
> ;* 8: |* **************| ***** ** *** |
> ;* 9: |* **************| ******** *** |
> ;* 10: |* **************|********* *** |
> ;* 11: |****************| ***** ****** |
> ;* 12: |****************| ***** ****** |
> ;* 13: |****************| ************ |
> ;* 14: |* **************| ***** ****** |
> ;* 15: |* **************| ***** * **** |
> ;* 16: |*************** | ******* *** |
> ;* 17: |*** * ********* | ******* *** |
> ;* 18: |*** * ********* | ******* *** |
> ;* +---------------------------------+
> ;*
> ;* Done
> ;*
> ;* Loop is interruptible
> ;* Collapsed epilog stages : 0
> ;* Collapsed prolog stages : 0
> ;*
> ;* Minimum safe trip count : 1
>
> Looking at the register usage table, the loop seems to be using quite a lot of
> registers, and I thought that maybe the LDDW would free some of them up.
> Also, I was wondering whether I can force-align arrays in memory. If that's
> possible I could use 1 pointer to access all the channels (by using an offset
> when addressing):
> ----------------------------------------------------------
> IDEA:
> // CHANNEL 1
> *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> // CHANNEL 2
> *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
> // OTHER CHANNELS addressed using bigger offsets...
> ----------------------------------------------------------
>
> I don't know if this idea is feasible, but if it is I think it would relieve
> some more of the pressure on register usage.
>
> Anyone's ideas/comments are welcome. At the moment the code is running too
> slowly: its run time needs to drop by about 33%. If I offer 4000 samples
> @ 4 MHz, the data takes 1000 us to load into my FIFOs, but it takes the DSP
> 1500 us to run the FetchData routine. In an ideal situation I would like to
> complete the FetchData routine in 1000 us (not any shorter, or I would read
> faster than the data is being written :P).
>
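
A sketch of the single-pointer IDEA above, written so that it compiles. Note the parentheses: the offset has to be added to the pointer before the dereference, i.e. `*(p + offset)` or `p[offset]`. The array `allChannels` and the helper `store_channel` are hypothetical; they assume all six channels can live back to back in one block, which the original CH1..CH6 structures may not allow.

```c
#define MAX_SAMPLES  4096u
#define NUM_CHANNELS 6u

/* Hypothetical single block holding all channels back to back:
 * channel c (0-based), sample n lives at index c * MAX_SAMPLES + n. */
static float allChannels[NUM_CHANNELS * MAX_SAMPLES];

/* Store one decoded value through a single base pointer plus an offset. */
static void store_channel(float *base, unsigned channel, unsigned sample,
                          float value)
{
    /* Same as *(base + offset) = value; */
    base[channel * MAX_SAMPLES + sample] = value;
}
```

Whether this actually frees registers is not certain. With 4096 floats per channel the offsets are large, so the compiler may still keep them in registers; the Register Usage Table in the ASM listing is the place to check.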
<mld>
I took a quick look at your code and comments above.  You only mention
acquisition time - nothing about processing time.
If you are only interested in acquisition time, you could do something like:
1. Declare two raw buffers:
unsigned int aMyIntArray[numberOfSamples];
unsigned short aMyShortArray[numberOfSamples];
2. Just grab the raw data [this would allow a future EDMA implementation].
3. Process the data from the arrays.
4. You will now have a new problem - it runs too fast.
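
The four steps above might look roughly like this. It is only a sketch: the buffer names come from the list, everything else (the function names, passing the FIFO addresses in as pointers, stopping before the terminator word is stored) is an assumption made so that the code can also be tried on a PC.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 4096u
#define TERM_VALUE  0x84825131u  /* terminator word from the FetchData loop */

/* Raw capture buffers, named after the arrays in the list above. */
static uint32_t aMyIntArray[MAX_SAMPLES];
static uint16_t aMyShortArray[MAX_SAMPLES];

/* Phase 1: copy raw words until the terminator or the buffer limit.
 * No decoding happens here, so the loop body stays very small. */
static size_t capture_raw(const volatile uint32_t *fifo12,
                          const volatile uint32_t *fifo3)
{
    size_t n = 0;
    while (n < MAX_SAMPLES) {
        uint32_t w1 = *fifo12;
        uint32_t w2 = *fifo3;
        if (w1 == TERM_VALUE) {
            break;               /* assumption: terminator is not stored */
        }
        aMyIntArray[n]   = w1;
        aMyShortArray[n] = (uint16_t)(w2 & 0xFFFFu);
        n++;
    }
    return n;
}

/* Phase 2 (one channel shown): decode channel 1, bits 23..16 of the
 * 32 bit word, through a 256-entry lookup table. */
static void decode_ch1(const float *lut, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = lut[(aMyIntArray[i] >> 16) & 0xFFu];
    }
}
```

On the target the two pointers would be `(const volatile uint32_t *)0x90300004` and `(const volatile uint32_t *)0x90300008`.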

mikedunn
> With kind regards,
>
> Dominic Stuart

-- 
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php

_____________________________________





Re: Slow EMIF transfer - "d.stuartnl" - Sep 2 11:49:59 2009

Hi Mike,

--- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@...> wrote:
> <mld>
> I took a quick look at your code and comments above.  You only mention
> acquisition time - nothing about processing time.
> If you are only interested in acquisition time, you could do something like:
> 1. Declare two raw buffers:
> unsigned int aMyIntArray[numberOfSamples];
> unsigned short aMyShortArray[numberOfSamples];
> 2. Just grab the raw data [this would allow a future EDMA implementation].
> 3. Process the data from the arrays.
> 4. You will now have a new problem - it runs too fast.

By acquisition of the data I mean acquiring the true data. I'll try to
explain:

The raw data is 8-bit compressed, packed into one 32 bit read and one 16 bit
read. To be able to "work" with the data I need to separate and decompress it
(with a LUT, which yields floats), so that every channel ends up in its own
(float) array.
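
As a sketch of the separation step just described, using the bit layout from the ASCII diagram and the masks in Calculator_FetchData. The struct and function names are hypothetical, and the codes are kept as unsigned 8-bit values so they can index a 256-entry LUT directly.

```c
#include <stdint.h>

/* One sample's six 8-bit compressed codes, before the LUT lookup. */
typedef struct {
    uint8_t ch1, ch2, ch3, ch4, ch5, ch6;
} RawSample;

/* read1 is the 32 bit word from the first two FIFOs,
 * read2 is the word from the third FIFO (only the low 16 bits are used). */
static RawSample unpack_sample(uint32_t read1, uint32_t read2)
{
    RawSample s;
    s.ch2 = (uint8_t)((read1 >> 24) & 0xFFu);  /* bits 31..24 */
    s.ch1 = (uint8_t)((read1 >> 16) & 0xFFu);  /* bits 23..16 */
    s.ch5 = (uint8_t)((read1 >> 8)  & 0xFFu);  /* bits 15..8  */
    s.ch6 = (uint8_t)(read1 & 0xFFu);          /* bits 7..0   */
    s.ch4 = (uint8_t)((read2 >> 8)  & 0xFFu);  /* bits 15..8  */
    s.ch3 = (uint8_t)(read2 & 0xFFu);          /* bits 7..0   */
    return s;
}
```

Each code would then be decompressed with a table lookup, e.g. LUT0[s.ch1] for channels 1 to 4 and LUT1[s.ch5] for channels 5 and 6, as in the original loop.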

I start my "fetching" algorithm after the first byte has been written
to the FIFO. Fetching the (raw) data faster than it is being written is of no
use, because I need to process the (separated and decompressed) data ASAP. So
in the 1000 us it takes to write all the samples (let's say 4000) to my FIFO, I
want not only to acquire the compressed data but also to decompress it (using
the LUT) and separate the channels. This way I can start processing the
decompressed data ASAP.
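
To put numbers on that, here is a small worked calculation. It assumes a 300 MHz CPU clock (implied by the "10 clocks - 0.033 us" figure in the first post) and uses the 4000 samples, 4 MHz and 1500 us figures quoted in this thread; the function names are only for illustration.

```c
/* Assumed figures, see the paragraph above. */
#define CPU_HZ     300000000.0  /* assumption: 300 MHz CPU clock */
#define SAMPLE_HZ    4000000.0  /* 4 MHz sample rate from the thread */

/* CPU cycles available per sample if the loop must keep up with the FIFO. */
static double cycles_per_sample_budget(void)
{
    return CPU_HZ / SAMPLE_HZ;
}

/* CPU cycles actually spent per sample, given a measured total run time
 * in microseconds and the number of samples fetched in that time. */
static double cycles_per_sample_measured(double total_us, double samples)
{
    return (total_us * 1e-6 * CPU_HZ) / samples;
}
```

That gives a budget of 75 CPU cycles per sample against roughly 112 measured for 1500 us and 4000 samples. The schedule the compiler found (ii = 19) is well below both, which suggests, but does not prove, that most of the time is spent waiting on the EMIF reads rather than in the instructions themselves.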

In an ideal world (read: not very likely) I would like to do some of the
calculations on the decompressed and separated data while I am fetching it.
This would give me a huge boost, because it would let me decide, based on
these calculations, whether to ignore the current data stream and wait
for the next one or to continue processing it.

I think some calculations will be possible if I can tweak this fetching routine
somehow.

With kind regards,

Dominic
> 
> mikedunn

_____________________________________


Re: Re: Slow EMIF transfer - Jeff Brower - Sep 2 15:37:56 2009

Dominic-

> I have been away on holiday; I hope the people who are
> reading this had a holiday too (and a good one).
> Right, back to work it is! I was wondering about your earlier
> idea about longer reads to force the compiler to use the
> LDDW instruction. Now my question is, how do I do this?
>
> The 2 reads are as follows:
>
> tmpRead1 = *(volatile int*) 0x90300004;
> tmpRead2 = *(volatile int*) 0x90300008;
>
> These are both 32 bit reads but I only need 48 bits in total
> (32 bits from tmpRead1, and the 16 least significant bits
> of tmpRead2).

One general comment I might make...

Normally it's a good idea to put some FPGA logic between a FIFO and the DSP, so
if you're required to collate FIFO values into larger words -- or split them
apart, or arrange data into chunks that are a natural size for the DSP's DMA
engine, whatever is needed -- it's do-able *after* the hardware design process,
with minimal effort required for trial-and-error.

I know in this case that you have a direct EMIF interface to the FIFO, without
intermediate FPGA logic... and I'm not trying to say anything negative about
your method, which seems to be working fine.  My point is, it might be better
not to spend too much time worrying about how you split 6 channels out in
software and how to optimize the last 10 to 20% of your performance, but
instead just call it a "phase 1" or proof-of-concept.  If increased performance
is needed, then recommend to your managers that the hardware design should be
modified and flexibility added.

Otherwise, you could work on this many months, and write increasingly specific,
non-general C code that fits only one type of hardware model, and still not hit
the MByte/sec performance level that you really need.

Again, just a general comment... comes from many years of experience at this
stuff.

-Jeff

> Furthermore, these 2 reads represent 6 (8-bit) channels:
>
> read:         tmpRead2                        tmpRead1
>      M                              L M                              L
>      S                              S S                              S
>      B                              B B                              B
> bit :******************************** ********************************
> use :----------------CHANNEL4CHANNEL3 CHANNEL2CHANNEL1CHANNEL5CHANNEL6
>
> (forgive my primitive ASCII art :P)
>
> In my fetchData routine (which is pipelining!!) I fetch the data into these 2 variables and then distribute the data
> in 6 channels. My code is as follows:
>
> unsigned int Calculator_FetchData(Bool curvature)
> {
>    unsigned int tmpRead1 =0;
>    unsigned int tmpRead2;
>    unsigned int sampleCount;
>    float * restrict pCH1;
>    float * restrict pCH2;
>    float * restrict pCH3;
>    char *  restrict pBinData3;
>    float * restrict pCH4;
>    float * restrict pCH5;
>    float * restrict pCH6;
>
>    const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
>    const int termValue = 0x84825131;
>
>    pCH1 = &CH1.deloggedData[0];
>    pCH2 = &CH2.deloggedData[0];
>    pCH3 = &CH3.deloggedData[0];
>    pBinData3 = &binData3[0];
>    pCH4 = &CH4.deloggedData[0];
>    pCH5 = &CH5.deloggedData[0];
>    pCH6 = &CH6.deloggedData[0];
>
>    #pragma MUST_ITERATE(16,4096,2);
>    while(tmpRead1 != termValue)
>    {
>       tmpRead1 = *(volatile int*) 0x90300004;
>       tmpRead2 = *(volatile int*) 0x90300008;
>
>       //CHANNEL 1
>       *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>
>       // CHANNEL 2
>       *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
>       if(curvature)
>       {
>          *pCH1 += *pCH2;
>          if(*pCH1 > 5000)
>          {
>             *pCH1 = 5000;
>          }
>       }
>
>       //CHANNEL 5
>       *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
>
>       // CHANNEL 6
>       *pCH6 = LUT1[tmpRead1 & 0xFF];
>
>       // CHANNEL 3 this channel is always read for particle matching on this channel
>       *pBinData3 = tmpRead2 & 0xFF;
>       *pCH3 = LUT0[*pBinData3];
>
>       // CHANNEL 4
>       *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
>
>       pCH1++;
>       pCH2++;
>       pCH3++;
>       pBinData3++;
>       pCH4++;
>       pCH5++;
>       pCH6++;
>
>       if(pCH1 > endCH1)//Check for sample overflow (4096 samples max)
>       {
>          tmpRead1 = termValue;
>       }
>    }
>    sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
>    Screen_updateSamples(sampleCount);
>    return sampleCount;
> }
> At the moment I'm getting the following pipeline information in the ASM file:
>
> _Calculator_FetchData:
> ;** --------------------------------------------------------------------------*
> ;*----------------------------------------------------------------------------*
> ;*   SOFTWARE PIPELINE INFORMATION
> ;*
> ;*      Loop source line                 : 219
> ;*      Loop opening brace source line   : 220
> ;*      Loop closing brace source line   : 266
> ;*      Known Minimum Trip Count         : 16
> ;*      Known Maximum Trip Count         : 4096
> ;*      Known Max Trip Count Factor      : 2
> ;*      Loop Carried Dependency Bound(^) : 7
> ;*      Unpartitioned Resource Bound     : 9
> ;*      Partitioned Resource Bound(*)    : 9
> ;*      Resource Partition:
> ;*                                A-side   B-side
> ;*      .L units                     2        1
> ;*      .S units                     4        4
> ;*      .D units                     8        9*
> ;*      .M units                     0        0
> ;*      .X cross paths               1        1
> ;*      .T address paths             9*       8
> ;*      Long read paths              5        4
> ;*      Long write paths             0        0
> ;*      Logical  ops (.LS)           1        1     (.L or .S unit)
> ;*      Addition ops (.LSD)          3        3     (.L or .S or .D unit)
> ;*      Bound(.L .S .LS)             4        3
> ;*      Bound(.L .S .D .LS .LSD)     6        6
> ;*
> ;*      Searching for software pipeline schedule at ...
> ;*         ii = 9  Unsafe schedule for irregular loop
> ;*         ii = 9  Did not find schedule
> ;*         ii = 10 Unsafe schedule for irregular loop
> ;*         ii = 10 Unsafe schedule for irregular loop
> ;*         ii = 10 Did not find schedule
> ;*         ii = 11 Unsafe schedule for irregular loop
> ;*         ii = 11 Unsafe schedule for irregular loop
> ;*         ii = 11 Did not find schedule
> ;*         ii = 12 Unsafe schedule for irregular loop
> ;*         ii = 12 Unsafe schedule for irregular loop
> ;*         ii = 12 Did not find schedule
> ;*         ii = 13 Unsafe schedule for irregular loop
> ;*         ii = 13 Unsafe schedule for irregular loop
> ;*         ii = 13 Unsafe schedule for irregular loop
> ;*         ii = 13 Did not find schedule
> ;*         ii = 14 Unsafe schedule for irregular loop
> ;*         ii = 14 Unsafe schedule for irregular loop
> ;*         ii = 14 Unsafe schedule for irregular loop
> ;*         ii = 14 Did not find schedule
> ;*         ii = 15 Unsafe schedule for irregular loop
> ;*         ii = 15 Unsafe schedule for irregular loop
> ;*         ii = 15 Unsafe schedule for irregular loop
> ;*         ii = 15 Did not find schedule
> ;*         ii = 16 Unsafe schedule for irregular loop
> ;*         ii = 16 Unsafe schedule for irregular loop
> ;*         ii = 16 Unsafe schedule for irregular loop
> ;*         ii = 16 Did not find schedule
> ;*         ii = 17 Unsafe schedule for irregular loop
> ;*         ii = 17 Unsafe schedule for irregular loop
> ;*         ii = 17 Unsafe schedule for irregular loop
> ;*         ii = 17 Did not find schedule
> ;*         ii = 18 Unsafe schedule for irregular loop
> ;*         ii = 18 Unsafe schedule for irregular loop
> ;*         ii = 18 Unsafe schedule for irregular loop
> ;*         ii = 18 Did not find schedule
> ;*         ii = 19 Unsafe schedule for irregular loop
> ;*         ii = 19 Schedule found with 1 iterations in parallel
> ;*
> ;*      Register Usage Table:
> ;*          +---------------------------------+
> ;*          |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> ;*          |0000000000111111|0000000000111111|
> ;*          |0123456789012345|0123456789012345|
> ;*          |----------------+----------------|
> ;*       0: |* *   ********* | ***** *  ***   |
> ;*       1: |* *  ********** | ***** *  ***   |
> ;*       2: |* *  ********** | ***** *  ***   |
> ;*       3: |* *  ********** | ***** *  ***   |
> ;*       4: |* *  ********** | ***** *  ***   |
> ;*       5: |* *  ********** | ***** *  ***   |
> ;*       6: |* ************* | ***** *  ***   |
> ;*       7: |* **************| ***** *  ***   |
> ;*       8: |* **************| ***** ** ***   |
> ;*       9: |* **************| ******** ***   |
> ;*      10: |* **************|********* ***   |
> ;*      11: |****************| ***** ******   |
> ;*      12: |****************| ***** ******   |
> ;*      13: |****************| ************   |
> ;*      14: |* **************| ***** ******   |
> ;*      15: |* **************| ***** * ****   |
> ;*      16: |*************** | *******  ***   |
> ;*      17: |*** * ********* | *******  ***   |
> ;*      18: |*** * ********* | *******  ***   |
> ;*          +---------------------------------+
> ;*
> ;*      Done
> ;*
> ;*      Loop is interruptible
> ;*      Collapsed epilog stages     : 0
> ;*      Collapsed prolog stages     : 0
> ;*
> ;*      Minimum safe trip count     : 1
>
> Looking at the register usage it looks like it's using quite a lot of registers and I thought that maybe the LDDW
> would relieve some registers. Also I was wondering if I can force-align arrays in memory? If that's possible I can
> use 1 pointer to access all the channels (by using an offset when addressing):
> ----------------------------------------------------------------
> IDEA:
>       //CHANNEL 1
>       *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
>       // CHANNEL 2
>       *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
>
>       // OTHER CHANNELS addressed using bigger offsets...
> -------------------------------------------------------------------
>
> I don't know if this idea is feasible but if it is I think it would relieve some more pressure on the register usage.
>
> Anyone's ideas/comments are welcome. At the moment the code is running 33% too slow. If I offer 4000 samples @ 4MHz
> (data will take 1000 us to load into my FIFOs), it takes the DSP 1500 us to run the FetchData routine. In an ideal
> situation I would like to complete the FetchData routine in 1000 us (not any shorter or I would read faster than data
> is being written :P).
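
The timing figures above can be turned into a per-sample cycle budget. This is a rough sketch that assumes a 300 MHz CPU clock, which is what the first post's measurement (10 clocks in 0.033 us) implies; the helper name cycles_per_sample is made up for illustration:

```c
/* Cycles available (or spent) per sample.
 * MHz multiplied by microseconds gives a cycle count, so
 * cpu_mhz * window_us is the total number of CPU cycles in the window. */
static inline double cycles_per_sample(double cpu_mhz, double window_us,
                                       double samples)
{
    return cpu_mhz * window_us / samples;
}
```

Under that assumption, cycles_per_sample(300.0, 1000.0, 4000.0) gives a budget of 75 cycles per sample, while the measured 1500 us corresponds to 112.5 cycles per sample. Since the pipeline report shows a schedule at ii = 19, this suggests (but does not prove) that most of each iteration is spent waiting on the two EMIF reads rather than on the lookup and store work.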
>
> With kind regards,
>
> Dominic Stuart
> --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@...> wrote:
>>
>> Dominic,
>>
>> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl<d.stuartnl@...> wrote:
>> >
>> >
>> > Thanks Mike,
>> >
>> > I'm starting to enjoy this "tweaking" and am trying to push it as far as I
>> > can because every microsecond that I gain means the DSP can handle more
>> > particles/second. I've applied the tips I've gotten on this forum on the
>> > rest of my source code as well (the actual loops in my program that do the
>> > calculations on the data) and those are pipelining as well now. Compared to
>> > the initial source total improvement is over 900%! Amazing (looks like I was
>> > using the DSP as a glorified MCU) but the true power of the DSP is starting
>> > to show! I thank you for your input but it raises some questions if you
>> > don't mind:
>> >
>> > --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@> wrote:
>> >>
>> >> Congratulations, Dominic!!
>> >>
>> >> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
>> >>
>> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
>> >> 16, and 8 bit accesses will be the same speed.
>> >
>> > What do you mean with aligned exactly?
>> <mld>
>> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
>> == 0' then it is aligned.
>> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
>> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
>> are aligned [byte addresses are always aligned].
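
The alignment rule quoted above ('myAddress % myAccessSize == 0') can be written as a small helper. This is a plain C99 sketch for illustration; is_aligned is a made-up name and the access size is given in bytes:

```c
#include <stdint.h>
#include <stdbool.h>

/* True when addr is a multiple of access_size (in bytes). */
static inline bool is_aligned(uint32_t addr, uint32_t access_size)
{
    return (addr % access_size) == 0u;
}
```

Applied to the two FIFO addresses in this thread, 0x90300004 is aligned for a 4-byte access but not for an 8-byte one, while 0x90300008 is aligned for both.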
>>
>> >
>> >> Only if your external memory is 8 or 16 bits wide would there be any
>> >> potential advantage in performing 16 bit accesses instead of 32 bit
>> >> accesses.
>> >> Also, there would be an advantage in fetching 32 bits at a time if you
>> >> read an entire array of 8 or 16 bit values.
>> >>
>> >
>> > I'm reading from 3 (16 bits) FIFOs. I've hooked them up so tempRead1 reads
>> > the first two together (logic tied together so they "act" like one 32-bit
>> > wide FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
>> >
>> >> I haven't looked at the details of your code, but if you always fetch
>> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
>> >> that your hardware addresses are preventing you from picking up some
>> >> additional speed. *If* the input addresses began on a 64 bit boundary
>> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
>> >> any memory fetch would coerce the compiler to performing an 'LDDW' [64
>> >> bit read].
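
To illustrate what a single 64-bit fetch would hand back, here is a sketch of splitting one 64-bit value into the two 32-bit words the routine needs. It assumes a little-endian target and that the two FIFO words sit at an 8-byte aligned address (which is not the case for 0x90300004); whether the compiler really emits LDDW for such a read is something to confirm in the asm listing. The names lo32 and hi32 are made up:

```c
#include <stdint.h>

/* Low 32 bits: on a little-endian target this is the word stored at the
 * lower address of the 64-bit location. */
static inline uint32_t lo32(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/* High 32 bits: on a little-endian target this is the word stored at the
 * lower address plus 4. */
static inline uint32_t hi32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}
```

With that layout, lo32() would play the role of tmpRead1 and hi32() the role of tmpRead2.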
>> >
>> > I do always fetch 48 bits (1x 32, 1x 16) but what would I gain by telling my
>> > compiler to fetch a 64 bit read (I mean this still has to be split somehow
>> > into 2 read cycles?)
>> <mld>
>> First of all, I wrote this before I had the idea of using a single
>> pointer.  Your code has 2 pointers that load data - this means that
>> you are using 4 processor registers.  Changing to a single 64 bit read
>> [32 x 2] would result in requiring only 3 registers.  If your routine
>> has a lot of register pressure [utilization] where it is loading and
>> unloading CPU registers, then a 'register reduction change' would help
>> performance.
>>
>> As I finished writing about the double read, I thought of 'plan B' -
>> just use one pointer with an offset.  When you look at the asm
>> listing, it should give you some register usage info.  If you are
>> getting 'spills' then definitely try this.
>>
>> >
>> >>
>> >> Since your hardware addresses are fixed, you only need 1 pointer. You
>> >> could use
>> >> tmpRead2 = *(read1 + 1); /* read1 is an int*, so +1 advances 4 bytes */
>> >> This would free up one register and, depending on register
>> >> utilization, could improve the performance.
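
A small sketch of the single-pointer idea, written so it can be tried on a host against an ordinary array instead of the real FIFO addresses. Note that C pointer arithmetic is scaled by the element size: for a pointer to a 32-bit word, adding 1 advances 4 bytes (0x90300004 to 0x90300008). The name read_pair is made up for illustration:

```c
#include <stdint.h>

/* Read two consecutive 32-bit words through one base pointer.
 * base[1] is the same as *(base + 1): one element (4 bytes) past base. */
static inline void read_pair(const volatile uint32_t *base,
                             uint32_t *first, uint32_t *second)
{
    *first  = base[0];
    *second = base[1];
}
```

On the target, base would be the fixed address cast to a volatile pointer, as in the existing reads; whether this actually frees a register is best checked in the register usage table of the asm listing.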
>> >>
>> >
>> > Improve performance, that's what I like to hear ;) I hope my questions aren't
>> > too "basic".
>> <mld>
>> Most active members of this group are willing to help someone who
>> wants to learn.  As long as your questions are informed and you show a
>> willingness to participate, most of us will help if we can.  We come
>> from a variety of backgrounds and each of us end up learning something
>> from time to time.
>>
>> As you are learning, 'performance improvement' is not something that
>> has a single solution.  Rather, it is a journey with many stops along
>> the way.
>>
>> mikedunn
>> >
>> > Dominic
>> >
>> >> mikedunn
>> >>
>> >>
>> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams<rkwill@> wrote:
>> >> >
>> >> >
>> >> > d.stuartnl,
>> >> >
>> >> > my comments in-line and prefixed with <rkw>
>> >> >
>> >> > R. Williams
>> >> >
>> >> > ---------- Original Message -----------
>> >> > From: "d.stuartnl" <d.stuartnl@>
>> >> > To: c...@yahoogroups.com
>> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
>> >> > Subject: [c6x] Re: Slow EMIF transfer
>> >> >
>> >> >> R.Williams,
>> >> >>
>> >> >> SUCCESS! Looptime has almost halved! Software pipelining is working
>> >> >> now thanks to your tips:
>> >> >
>> >> > <rkw> congratulations!!
>> >> >
>> >> > <snip>
>> >> >
>> >> >>
>> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
>> >> >> is working fine as it is. Don't know why though.
>> >> >
>> >> > <rkw> two reasons:
>> >> > 1) the data size of a float is the same as the address data size
>> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
>> >> > loop to point 1 past the last location used.
>> >> > <snip>
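
On the sampleCount expression: in standard C, subtracting two pointers into the same array yields a count of elements, not bytes, whatever the element size is, so no division by sizeof(float) is needed. A minimal sketch (elements_written is a made-up name):

```c
#include <stddef.h>

/* Number of elements written, given the start of the buffer and a cursor
 * that points one past the last element written. */
static inline size_t elements_written(const float *start, const float *cursor)
{
    return (size_t)(cursor - start);
}
```

Any further "-1" or "-2" adjustment, as in the thread's code, then depends only on how many trailing entries (for example the terminator word) should be excluded from the count.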
>> >> >
>> >> >> >
>> >> >> still have them in a single loop and it's pipelining. Do you think
>> >> >> it's worth considering splitting it into two loops and check if
>> >> >> there's (an even better) speed increase?
>> >> >
>> >> > <rkw> you could experiment, but it looks like it is not necessary to
>> >> > separate the code into two loops.
>> >> > <snip>
>> >> >
>> >> >> My new and improved function:
>> >> > <snip>
>> >> >
>> >> >> // CHANNEL 3 this channel is always read for particle matching
>> >> >> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
>> >> >> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 =
>> >> >> LUT0[((tmpRead2 & 0xFF00) >> 8)];
>> >> >
>> >> > <rkw> there seems to be a problem in the editing of the above 4 lines
>> >> > It looks like pCH3 is not being used; however, pCH3 is still being
>> >> > initialized and incremented in the code.
>> >> > Also when testing for execution speed, adding new operations (pBinData3)
>> >> > makes it very difficult to make timing comparisons.
>> >> > <snip>
>> >> >
>> >> >>
>> >> >> As you might have seen in my code the second read (tempRead2) is a 32
>> >> >> bits int but I'm only interested in the first 16 bits (where channel
>> >> >> 3 and 4 reside), is there a way I can inform the compiler
>> >> >
>> >> > <rkw> the natural size of an operation is 32 bits, changing to a 16 bit
>> >> > operation would <probably> slow the code execution.
>> >> >
>> >> >>
>> >> >> I had to leave pFifo12 and pFifo3 volatile because when I removed
>> >> >> these keywords the software pipelining was disabled again (Cannot find
>> >> >> schedule).
>> >> >
>> >> > <rkw> the 'volatile' is needed for the two parameters because they DO
>> >> > change between reads. I had suggested to remove the 'volatile' from the
>> >> > variables, not the parameters.
>> >> > <snip>
>> >> >
>> >> >>
>> >> >> With kind regards,
>> >> >>
>> >> >> Dominic
>> >> > <snip>

_____________________________________






Re: Slow EMIF transfer - "d.stuartnl" - Sep 3 12:12:10 2009

Jeff,

thanks for the feedback, it is appreciated, and you are totally right that it
would have been a good idea to add additional FPGA logic between the FIFO and the
DSP ;). I will take your idea into consideration when a re-design/hardware update
takes place. But for now I am "stuck" with my current design, and I
was not intending to spend months on this particular problem; I was just
wondering if there is some more fine-tuning possible. Not only would that
benefit the current design but it would also benefit me because I am learning a
lot based on the tips you guys are providing me.

With kind regards,

Dominic

--- In c...@yahoogroups.com, "Jeff Brower" <jbrower@...> wrote:
>
> Dominic-
> 
> > I have been away on holiday, I hope (for the people who are
> > reading this they had a holiday too (and a good one).
> > Right, back to work it is! I was wondering about your earlier
> > idea about longer reads to force the compiler to use the
> > LDDW instruction. Now is my question, how do I do this?
> >
> > the 2 reads are as follows:
> >
> > tmpRead1 = *(volatile int*) 0x90300004;
> > tmpRead2 = *(volatile int*) 0x90300008;
> >
> > these are both 32 bit reads but i only need 48 bits in total
> > (32 bits from tmpRead1, and 16 (least significant) bits
> > of tmpRead2.
> 
> One general comment I might make...
> 
> Normally it's a good idea to put some FPGA logic between a FIFO and the
DSP, so if you're required to collate FIFO
> values into larger words -- or split them apart, or arrange data into
chunks that are natural size for the DSP's DMA
> engine, whatever is needed -- it's do-able *after* the hardware design
process, with minimal effort required for
> trial-and-error.
> 
> I know in this case that you have a direct EMIF interface to the FIFO,
without intermediate FPGA logic... and I'm not
> trying to say anything negative about your method, which seems to be
working fine.  My point is, it might be better to
> not to spend too much time worrying about how you split 6 channels out in
software and how to optimize the last 10 to
> 20% of your performance, but instead just call it a "phase 1" or
proof-of-concept.  If increased performance is
> needed, then recommend to your managers that the hardware design should be
modified and flexibility added.
> 
> Otherwise, you could work on this many months, and write increasingly
specific, non-general C code that fits only one
> type of hardware model, and still not hit the MByte/sec performance level
that you really need.
> 
> Again, just a general comment... comes from many years of experience at
this stuff.
> 
> -Jeff
> 
> > Further more these 2 reads represent 6 (8-bit) channels:
> >
> > read:         tmpRead2                        tmpRead1
> >      M                              L M                             
L
> >      S                              S S                             
S
> >      B                              B B                             
B
> > bit :********************************
********************************
> > use :----------------CHANNEL4CHANNEL3
CHANNEL2CHANNEL1CHANNEL5CHANNEL6
> >
> > (forgive my primitive ASCII art :P)
> >
> > In my fetchData routine (which is pipelining!!) I fetch the data into
these 2 variables and then distribute the data
> > in 6 channels. My code is as follows:
> >
> > unsigned int Calculator_FetchData(Bool curvature)
> > {
> >    unsigned int tmpRead1 =0;
> >    unsigned int tmpRead2;
> >    unsigned int sampleCount;
> >    float * restrict pCH1;
> >    float * restrict pCH2;
> >    float * restrict pCH3;
> >    char *  restrict pBinData3;
> >    float * restrict pCH4;
> >    float * restrict pCH5;
> >    float * restrict pCH6;
> >
> >    const float * restrict endCH1 = &CH1.deloggedData[0xFFF];
> >    const int termValue = 0x84825131;
> >
> >    pCH1 = &CH1.deloggedData[0];
> >    pCH2 = &CH2.deloggedData[0];
> >    pCH3 = &CH3.deloggedData[0];
> >    pBinData3 = &binData3[0];
> >    pCH4 = &CH4.deloggedData[0];
> >    pCH5 = &CH5.deloggedData[0];
> >    pCH6 = &CH6.deloggedData[0];
> >
> >    #pragma MUST_ITERATE(16,4096,2);
> >    while(tmpRead1 != termValue)
> >    {
> >       tmpRead1 = *(volatile int*) 0x90300004;
> >       tmpRead2 = *(volatile int*) 0x90300008;
> >
> >       //CHANNEL 1
> >       *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> >
> >       // CHANNEL 2
> >       *pCH2 = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> >       if(curvature)
> >       {
> >          *pCH1 += *pCH2;
> >          if(*pCH1 > 5000)
> >          {
> >             *pCH1 = 5000;
> >          }
> >       }
> >
> >       //CHANNEL 5
> >       *pCH5 = LUT1[((tmpRead1 & 0xFF00) >> 8)];
> >
> >       // CHANNEL 6
> >       *pCH6 = LUT1[tmpRead1 & 0xFF];
> >
> >       // CHANNEL 3 this channel is always read for particle matching
on this channel
> >       *pBinData3 = tmpRead2 & 0xFF;
> >       *pCH3 = LUT0[*pBinData3];
> >
> >       // CHANNEL 4
> >       *pCH4 = LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >
> >       pCH1++;
> >       pCH2++;
> >       pCH3++;
> >       pBinData3++;
> >       pCH4++;
> >       pCH5++;
> >       pCH6++;
> >
> >       if(pCH1 > endCH1)//Check for sample overflow (4096 samples
max)
> >       {
> >          tmpRead1 = termValue;
> >       }
> >    }
> >    sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -2;
> >    Screen_updateSamples(sampleCount);
> >    return sampleCount;
> > }
> > At the moment I'm getting the folowing pipeline information is the ASM
file:
> >
> > _Calculator_FetchData:
> > ;**
--------------------------------------------------------------------------*
> >
;*----------------------------------------------------------------------------*
> > ;*   SOFTWARE PIPELINE INFORMATION
> > ;*
> > ;*      Loop source line                 : 219
> > ;*      Loop opening brace source line   : 220
> > ;*      Loop closing brace source line   : 266
> > ;*      Known Minimum Trip Count         : 16
> > ;*      Known Maximum Trip Count         : 4096
> > ;*      Known Max Trip Count Factor      : 2
> > ;*      Loop Carried Dependency Bound(^) : 7
> > ;*      Unpartitioned Resource Bound     : 9
> > ;*      Partitioned Resource Bound(*)    : 9
> > ;*      Resource Partition:
> > ;*                                A-side   B-side
> > ;*      .L units                     2        1
> > ;*      .S units                     4        4
> > ;*      .D units                     8        9*
> > ;*      .M units                     0        0
> > ;*      .X cross paths               1        1
> > ;*      .T address paths             9*       8
> > ;*      Long read paths              5        4
> > ;*      Long write paths             0        0
> > ;*      Logical  ops (.LS)           1        1     (.L or .S unit)
> > ;*      Addition ops (.LSD)          3        3     (.L or .S or .D unit)
> > ;*      Bound(.L .S .LS)             4        3
> > ;*      Bound(.L .S .D .LS .LSD)     6        6
> > ;*
> > ;*      Searching for software pipeline schedule at ...
> > ;*         ii = 9  Unsafe schedule for irregular loop
> > ;*         ii = 9  Did not find schedule
> > ;*         ii = 10 Unsafe schedule for irregular loop
> > ;*         ii = 10 Unsafe schedule for irregular loop
> > ;*         ii = 10 Did not find schedule
> > ;*         ii = 11 Unsafe schedule for irregular loop
> > ;*         ii = 11 Unsafe schedule for irregular loop
> > ;*         ii = 11 Did not find schedule
> > ;*         ii = 12 Unsafe schedule for irregular loop
> > ;*         ii = 12 Unsafe schedule for irregular loop
> > ;*         ii = 12 Did not find schedule
> > ;*         ii = 13 Unsafe schedule for irregular loop
> > ;*         ii = 13 Unsafe schedule for irregular loop
> > ;*         ii = 13 Unsafe schedule for irregular loop
> > ;*         ii = 13 Did not find schedule
> > ;*         ii = 14 Unsafe schedule for irregular loop
> > ;*         ii = 14 Unsafe schedule for irregular loop
> > ;*         ii = 14 Unsafe schedule for irregular loop
> > ;*         ii = 14 Did not find schedule
> > ;*         ii = 15 Unsafe schedule for irregular loop
> > ;*         ii = 15 Unsafe schedule for irregular loop
> > ;*         ii = 15 Unsafe schedule for irregular loop
> > ;*         ii = 15 Did not find schedule
> > ;*         ii = 16 Unsafe schedule for irregular loop
> > ;*         ii = 16 Unsafe schedule for irregular loop
> > ;*         ii = 16 Unsafe schedule for irregular loop
> > ;*         ii = 16 Did not find schedule
> > ;*         ii = 17 Unsafe schedule for irregular loop
> > ;*         ii = 17 Unsafe schedule for irregular loop
> > ;*         ii = 17 Unsafe schedule for irregular loop
> > ;*         ii = 17 Did not find schedule
> > ;*         ii = 18 Unsafe schedule for irregular loop
> > ;*         ii = 18 Unsafe schedule for irregular loop
> > ;*         ii = 18 Unsafe schedule for irregular loop
> > ;*         ii = 18 Did not find schedule
> > ;*         ii = 19 Unsafe schedule for irregular loop
> > ;*         ii = 19 Schedule found with 1 iterations in parallel
> > ;*
> > ;*      Register Usage Table:
> > ;*          +---------------------------------+
> > ;*          |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
> > ;*          |0000000000111111|0000000000111111|
> > ;*          |0123456789012345|0123456789012345|
> > ;*          |----------------+----------------|
> > ;*       0: |* *   ********* | ***** *  ***   |
> > ;*       1: |* *  ********** | ***** *  ***   |
> > ;*       2: |* *  ********** | ***** *  ***   |
> > ;*       3: |* *  ********** | ***** *  ***   |
> > ;*       4: |* *  ********** | ***** *  ***   |
> > ;*       5: |* *  ********** | ***** *  ***   |
> > ;*       6: |* ************* | ***** *  ***   |
> > ;*       7: |* **************| ***** *  ***   |
> > ;*       8: |* **************| ***** ** ***   |
> > ;*       9: |* **************| ******** ***   |
> > ;*      10: |* **************|********* ***   |
> > ;*      11: |****************| ***** ******   |
> > ;*      12: |****************| ***** ******   |
> > ;*      13: |****************| ************   |
> > ;*      14: |* **************| ***** ******   |
> > ;*      15: |* **************| ***** * ****   |
> > ;*      16: |*************** | *******  ***   |
> > ;*      17: |*** * ********* | *******  ***   |
> > ;*      18: |*** * ********* | *******  ***   |
> > ;*          +---------------------------------+
> > ;*
> > ;*      Done
> > ;*
> > ;*      Loop is interruptible
> > ;*      Collapsed epilog stages     : 0
> > ;*      Collapsed prolog stages     : 0
> > ;*
> > ;*      Minimum safe trip count     : 1
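
The search in the listing above starts at ii = 9 because the lower bound on the iteration interval is the larger of the loop-carried dependency bound (7) and the partitioned resource bound (9). A minimal C sketch of that arithmetic follows. The function names are hypothetical, and the cycle estimate is idealized: it ignores prolog/epilog overhead and any stalls on external memory.

```c
#include <assert.h>

/* Lower bound on the iteration interval (ii): the schedule can be no
   tighter than either the dependency bound or the resource bound. */
static int min_iteration_interval(int loop_carried_bound, int resource_bound)
{
    assert(loop_carried_bound > 0 && resource_bound > 0);
    return loop_carried_bound > resource_bound ? loop_carried_bound
                                               : resource_bound;
}

/* Rough cycle count for a loop scheduled at a given ii, ignoring
   prolog/epilog and memory stalls (an idealized estimate only). */
static long ideal_loop_cycles(int ii, long trip_count)
{
    assert(ii > 0 && trip_count >= 0);
    return (long)ii * trip_count;
}
```

With the values from the listing, min_iteration_interval(7, 9) gives 9, while the schedule that was actually found runs at ii = 19, so an idealized 4000-sample pass costs 19 * 4000 = 76000 cycles before any EMIF wait time is counted.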
> >
> > Looking at the register usage table, the loop seems to be using quite a lot of
> > registers, and I thought that maybe an LDDW would relieve some of them. Also, I
> > was wondering whether I can force-align arrays in memory. If that's possible, I
> > can use one pointer to access all the channels (by using an offset when
> > addressing):
> > ----------------------------------------------------------------
> > IDEA:
> >    //CHANNEL 1
> >    *pCH1 = LUT0[((tmpRead1 & 0xFF0000) >> 16)];
> >
> >    // CHANNEL 2
> >    *(pCH1 + offset) = LUT0[((tmpRead1 & 0xFF000000) >> 24)];
> >
> >    // OTHER CHANNELS addressed using bigger offsets...
> > -------------------------------------------------------------------
> >
> > I don't know if this idea is feasible, but if it is, I think it would relieve
> > some more of the register pressure.
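
One way to sketch the single-pointer idea in C is to place all channels in one contiguous block, so that each channel is reached from the same base pointer with a constant offset. Everything below is an assumption made for illustration: the `channels` buffer, the `store_sample` helper, the `float` element type, and the channel order are not taken from the original code. Whether the compiler actually saves registers this way would need to be checked in the ASM listing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES  4096   /* matches the overflow check in the loop */
#define NUM_CHANNELS 6

/* Hypothetical layout: channel c, sample i lives at
   channels[c * MAX_SAMPLES + i], so one base pointer reaches all six. */
static float channels[NUM_CHANNELS * MAX_SAMPLES];

/* Decode one pair of FIFO words into the six channel slots for sample i.
   The LUTs are assumed to be 256-entry float tables. */
static void store_sample(float *base, size_t i,
                         const float *lut0, const float *lut1,
                         unsigned int word1, unsigned int word2)
{
    assert(i < MAX_SAMPLES);
    base[0 * MAX_SAMPLES + i] = lut0[(word1 >> 16) & 0xFF]; /* CHANNEL 1 */
    base[1 * MAX_SAMPLES + i] = lut0[(word1 >> 24) & 0xFF]; /* CHANNEL 2 */
    base[2 * MAX_SAMPLES + i] = lut0[word2 & 0xFF];         /* CHANNEL 3 */
    base[3 * MAX_SAMPLES + i] = lut0[(word2 >> 8) & 0xFF];  /* CHANNEL 4 */
    base[4 * MAX_SAMPLES + i] = lut1[(word1 >> 8) & 0xFF];  /* CHANNEL 5 */
    base[5 * MAX_SAMPLES + i] = lut1[word1 & 0xFF];         /* CHANNEL 6 */
}
```

Because the offsets are compile-time constants, only the base pointer and the sample index change per iteration, instead of seven separately incremented pointers.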
> >
> > Anyone's ideas/comments are welcome. At the moment the code is running 33% too
> > slow. If I offer 4000 samples @ 4 MHz, the data will take 1000 us to load into
> > my FIFOs, but it takes the DSP 1500 us to run the FetchData routine. In an ideal
> > situation I would like to complete the FetchData routine in 1000 us (not any
> > shorter, or I would read faster than data is being written :P).
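
The figures in this paragraph can be checked with a few lines of C. The helper names are hypothetical. The "33%" corresponds to the fraction of the current runtime that has to be removed (500 us out of 1500 us), which is the same as saying the routine must run 1.5 times faster.

```c
#include <assert.h>

/* Time for the FIFOs to fill, in microseconds, for a sample rate in MHz. */
static double fill_time_us(double samples, double rate_mhz)
{
    assert(rate_mhz > 0.0);
    return samples / rate_mhz;
}

/* Factor by which the routine must speed up to meet the budget. */
static double required_speedup(double measured_us, double budget_us)
{
    assert(budget_us > 0.0);
    return measured_us / budget_us;
}

/* Fraction of the measured runtime that must be removed. */
static double fraction_to_cut(double measured_us, double budget_us)
{
    assert(measured_us > 0.0);
    return (measured_us - budget_us) / measured_us;
}
```

For 4000 samples at 4 MHz the budget is 1000 us in total, or 0.25 us per sample, against a measured 0.375 us per sample.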
> >
> > With kind regards,
> >
> > Dominic Stuart
> >
> >
> > --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@> wrote:
> >>
> >> Dominic,
> >>
> >> On Fri, Jul 24, 2009 at 4:58 PM, d.stuartnl<d.stuartnl@> wrote:
> >> >
> >> >
> >> > Thanks Mike,
> >> >
> >> > I'm starting to enjoy this "tweaking" and am trying to push it as far as I
> >> > can because every microsecond that I gain means the DSP can handle more
> >> > particles/second. I've applied the tips I've gotten on this forum to the
> >> > rest of my source code as well (the actual loops in my program that do the
> >> > calculations on the data) and those are pipelining as well now. Compared to
> >> > the initial source, total improvement is over 900%! Amazing (looks like I was
> >> > using the DSP as a glorified MCU) but the true power of the DSP is starting
> >> > to show! I thank you for your input, but it raises some questions, if you
> >> > don't mind:
> >> >
> >> > --- In c...@yahoogroups.com, Michael Dunn <mike.dunn.001@> wrote:
> >> >>
> >> >> Congratulations, Dominic!!
> >> >>
> >> >> I'll top post this minor comment wrt 16/32 bit memory accesses and speed.
> >> >>
> >> >> Assuming that you have 32 bit wide memory with aligned accesses, 32,
> >> >> 16, and 8 bit accesses will be the same speed.
> >> >
> >> > What do you mean with aligned exactly?
> >> <mld>
> >> 'Evenly divisible by the access size' or if 'myAddress % myAccessSize
> >> == 0' then it is aligned.
> >> For a 32 bit EMIF with 32 bit memory, all 16 bit addresses ending in
> >> 0,2,4,6,8,A,C,E are aligned and all 32 bit addresses ending in 0,4,8,C
> >> are aligned [byte addresses are always aligned].
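
The alignment rule above translates directly into a small C check. This is only an illustrative sketch; `is_aligned` is a hypothetical helper, not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when addr is evenly divisible by access_size (in bytes),
   i.e. when myAddress % myAccessSize == 0. */
static int is_aligned(uint32_t addr, uint32_t access_size)
{
    assert(access_size != 0);
    return (addr % access_size) == 0;
}
```

Applied to the addresses in this thread, 0x90300004 is aligned for a 4-byte access but not for an 8-byte one, while 0x90300008 is aligned for both.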
> >>
> >> >
> >> >> Only if your external memory is 8 or 16 bits wide would there be any
> >> >> potential advantage in performing 16 bit accesses instead of 32 bit
> >> >> accesses.
> >> >> Also, there would be an advantage in fetching 32 bits at a time if you
> >> >> have an entire array of 8 or 16 bit values.
> >> >>
> >> >
> >> > I'm reading from 3 (16 bit) FIFOs. I've hooked them up so tempRead1 reads
> >> > the first two together (logic tied together so they "act" like one 32 bit
> >> > wide FIFO). tempRead2 reads the 3rd FIFO (first 16 bits of the FIFO).
> >> >
> >> >> I haven't looked at the details of your code, but if you always fetch
> >> >> 48 bits [32 from 0x90300004 and 16 from 0x90300008] it is *possible*
> >> >> that your hardware addresses are preventing you from picking up some
> >> >> additional speed. *If* the input addresses began on a 64 bit boundary
> >> >> [0x90300000, 0x90300008, etc.] and you defined a long long [64 bits],
> >> >> any memory fetch would coerce the compiler into performing an 'LDDW' [64
> >> >> bit read].
> >> >
> >> > I do always fetch 48 bits (1x 32, 1x 16), but what would I gain by telling my
> >> > compiler to do a 64 bit read? (I mean, this still has to be split into
> >> > 2 read cycles somehow?)
> >> <mld>
> >> First of all, I wrote this before I had the idea of using a single
> >> pointer.  Your code has 2 pointers that load data - this means that
> >> you are using 4 processor registers.  Changing to a single 64 bit read
> >> [32 x 2] would result in requiring only 3 registers.  If your routine
> >> has a lot of register pressure [utilization] where it is loading and
> >> unloading CPU registers, then a 'register reduction change' would help
> >> performance.
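
A rough C sketch of the "single 64 bit read" idea is shown below. It assumes the two words sit next to each other on an 8-byte boundary (which is not the case for 0x90300004 in this thread) and that the lower-addressed word ends up in the low 32 bits, as on a little-endian device. `fifo_pair` and `read_pair` are hypothetical names. Whether the compiler really emits one LDDW, and whether the attached FIFO hardware tolerates such a read, would have to be verified on the target.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32 bit words recovered from one 64 bit load. */
typedef struct {
    uint32_t word1;  /* lower address: FIFO 1+2 */
    uint32_t word2;  /* higher address: FIFO 3  */
} fifo_pair;

/* One pointer, one 64 bit load, then split in registers. */
static fifo_pair read_pair(const volatile uint64_t *fifo)
{
    uint64_t both;
    fifo_pair p;

    assert(fifo != 0);
    both = *fifo;                                /* single 64 bit read   */
    p.word1 = (uint32_t)(both & 0xFFFFFFFFu);    /* low half             */
    p.word2 = (uint32_t)(both >> 32);            /* high half            */
    return p;
}
```

The point of the sketch is register count: one address register plus a register pair for the data, instead of two address registers and two data registers.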
> >>
> >> As I finished writing about the double read, I thought of 'plan B' -
> >> just use one pointer with an offset.  When you look at the asm
> >> listing, it should give you some register usage info.  If you are
> >> getting 'spills' then definitely try this.
> >>
> >> >
> >> >>
> >> >> Since your hardware addresses are fixed, you only need 1 pointer. You
> >> >> could use
> >> >> tmpRead2 = *(read1 + 1);
> >> >> This would free up one register and, depending on register
> >> >> utilization, could improve the performance.
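
A minimal sketch of the one-pointer approach follows. Note that pointer arithmetic on an `int *` is scaled by `sizeof(int)`, so with 4-byte ints the word at 0x90300008 is one element, not four, past 0x90300004. `read_fifos` is a hypothetical helper written so that it can also be exercised on a host against a plain array.

```c
#include <assert.h>

/* Read both FIFO words through a single base pointer.
   On the target, fifo_base would be (volatile int *)0x90300004. */
static void read_fifos(const volatile int *fifo_base, int *word1, int *word2)
{
    assert(fifo_base != 0 && word1 != 0 && word2 != 0);
    *word1 = fifo_base[0];   /* base address               */
    *word2 = fifo_base[1];   /* base + 4 bytes (1 element) */
}
```

Only one address register is live across the loop, and the second access uses a constant offset from it.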
> >> >>
> >> >
> >> > Improve performance, that's what I like to hear ;) I hope my questions aren't
> >> > too "basic".
> >> <mld>
> >> Most active members of this group are willing to help someone who
> >> wants to learn.  As long as your questions are informed and you show a
> >> willingness to participate, most of us will help if we can.  We come
> >> from a variety of backgrounds and each of us ends up learning something
> >> from time to time.
> >>
> >> As you are learning, 'performance improvement' is not something that
> >> has a single solution.  Rather, it is a journey with many stops along
> >> the way.
> >>
> >> mikedunn
> >> >
> >> > Dominic
> >> >
> >> >> mikedunn
> >> >>
> >> >>
> >> >> On Fri, Jul 24, 2009 at 9:19 AM, Richard Williams<rkwill@> wrote:
> >> >> >
> >> >> >
> >> >> > d.stuartnl,
> >> >> >
> >> >> > my comments in-line and prefixed with <rkw>
> >> >> >
> >> >> > R. Williams
> >> >> >
> >> >> > ---------- Original Message -----------
> >> >> > From: "d.stuartnl" <d.stuartnl@>
> >> >> > To: c...@yahoogroups.com
> >> >> > Sent: Fri, 24 Jul 2009 09:26:55 -0000
> >> >> > Subject: [c6x] Re: Slow EMIF transfer
> >> >> >
> >> >> >> R.Williams,
> >> >> >>
> >> >> >> SUCCESS! Loop time has almost halved! Software pipelining is working
> >> >> >> now thanks to your tips:
> >> >> >
> >> >> > <rkw> congratulations!!
> >> >> >
> >> >> > <snip>
> >> >> >
> >> >> >>
> >> >> >> For some reason, sampleCount = (int) (pCH1 - &CH1.deloggedData[0]) -1;
> >> >> >> is working fine as it is. Don't know why though.
> >> >> >
> >> >> > <rkw> two reasons:
> >> >> > 1) the data size of a float is the same as the address data size
> >> >> > 2) the '-1' because the pCH1 pointer is incremented at the end of the
> >> >> > loop to point 1 past the last location used.
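
As a general C point that bears on this expression: subtracting two pointers into the same array yields a count of elements, not bytes, whatever the element size. The sketch below uses a hypothetical `samples_written` helper to show this. Whether a further -1 or -2 is needed afterwards depends on what the loop stores on its final pass, which is not fully visible in the quoted code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements between the start of a buffer and a pointer that
   has been post-incremented after each store. */
static int samples_written(const float *start, const float *next)
{
    ptrdiff_t n = next - start;   /* element count, not a byte count */
    assert(n >= 0);
    return (int)n;
}
```

After three stores followed by three increments, the pointer sits one past the last element written and the difference is 3.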
> >> >> > <snip>
> >> >> >
> >> >> >> >
> >> >> >> still have them in a single loop and it's pipelining. Do you think
> >> >> >> it's worth considering splitting it into two loops and checking whether
> >> >> >> there's (an even better) speed increase?
> >> >> >
> >> >> > <rkw> you could experiment, but it looks like it is not necessary to
> >> >> > separate the code into two loops.
> >> >> > <snip>
> >> >> >
> >> >> >> My new and improved function:
> >> >> > <snip>
> >> >> >
> >> >> >> // CHANNEL 3 this channel is always read for particle matching
> >> >> >> on this channel *pCH3 = LUT0[((tmpRead2 & 0xFF))];
> >> >> >> *pBinData3 = tmpRead2 & 0xFF; // CHANNEL 4 *pCH4 =
> >> >> >> LUT0[((tmpRead2 & 0xFF00) >> 8)];
> >> >> >
> >> >> > <rkw> there seems to be a problem in the editing of the above 4 lines.
> >> >> > It looks like pCH3 is not being used; however, pCH3 is still being
> >> >> > initialized and incremented in the code.
> >> >> > Also, when testing for execution speed, adding new operations (pBinData3)
> >> >> > makes it very difficult to make timing comparisons.
> >> >> > <snip>
> >> >> >
> >> >> >>
> >> >> >> As you might have seen in my code, the second read (tempRead2) is a 32
> >> >> >> bit int but I'm only interested in the first 16 bits (where channels
> >> >> >> 3 and 4 reside). Is there a way I can inform the compiler
> >> >> >
> >> >> > <rkw> the natural size of an operation is 32 bits; changing to a 16 bit
> >> >> > operation would <probably> slow the code execution.
> >> >> >
> >> >> >>
> >> >> >> I had to leave pFifo12 and pFifo3 volatile because when I removed
> >> >> >> these keywords the software pipelining was disabled again (Cannot find
> >> >> >> schedule).
> >> >> >
> >> >> > <rkw> the 'volatile' is needed for the two parameters because they DO
> >> >> > change between reads. I had suggested removing the 'volatile' from the
> >> >> > variables, not the parameters.
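
The distinction between volatile parameters and ordinary variables can be sketched as follows. The pointed-to FIFO location is declared `volatile` so that every loop pass performs a real read, while the local working variables stay plain so the compiler is free to keep them in registers. `drain_until` and its arguments are hypothetical names, not taken from the original function.

```c
#include <assert.h>

/* Read from a memory-mapped location until a terminator value is seen
   or max_reads is reached. Returns the number of reads performed. */
static int drain_until(const volatile int *fifo, int term_value, int max_reads)
{
    int count = 0;   /* plain local: no volatile needed */
    int word;        /* plain local: holds the latest value read */

    assert(fifo != 0 && max_reads > 0);
    do {
        word = *fifo;    /* volatile access: re-read on every pass */
        count++;
    } while (word != term_value && count < max_reads);
    return count;
}
```

Without `volatile` on the pointed-to type, the compiler would be allowed to hoist the read out of the loop, which is wrong for a FIFO whose contents change between reads.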
> >> >> > <snip>
> >> >> >
> >> >> >>
> >> >> >> With kind regards,
> >> >> >>
> >> >> >> Dominic
> >> >> > <snip>

_____________________________________
