Reply by October 1, 2009
"Michael Kedem"  writes:

> Hi
> My C6711, which runs on a custom board, runs way too slow.
> My peripherals include one asynchronous device (a PAL) and one SDRAM.
>
> The Profiler Clock shows 6 cycles/ASM statement when single-stepping.
> However, when not single-stepping, instruction timings are even worse.
>
> I thought this was a cache problem, so I configured the system to work
> without an L2 cache - this did not seem to make a difference.
>
> When looking at data fetches from SDRAM on a logic analyzer,
> the Read and Write cycles look perfectly normal. But when looping
> on an instruction that repeatedly READs from a constant address in SDRAM,
> there are about 500 ns between reads, which is about 50 cycles/instruction.
> (To make sure there is no looping overhead, I unrolled the loop.)
>
> Any pointers or advice?
Are your EMIF settings correct? 50 cycles sounds like a very long time though - even if they're wrong!

Is the code in SDRAM as well as the data? Does the assembly code for your continuous loop look like you might expect?

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt, Solihull, UK
http://www.trw.com/conekt
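On the "is the code in SDRAM" point, a quick way to take instruction fetches out of the picture is to pin the hot routine into on-chip RAM explicitly. Below is a minimal sketch using the TI C6000 compiler's CODE_SECTION pragma; the section name and loop body are placeholders, and the section still has to be mapped to internal memory in your linker command file for it to have any effect:

  /* Hypothetical sketch: place a time-critical routine in its own section so
   * the linker command file can map it to internal RAM instead of the SDRAM
   * CE space. ".fast_text" is an arbitrary section name chosen here.
   */
  #pragma CODE_SECTION(inner_loop, ".fast_text")
  void inner_loop(volatile unsigned int *src, unsigned int *dst, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          dst[i] = src[i];   /* placeholder body standing in for the real work */
      }
  }

If the slowdown disappears when only the code moves on-chip, the bottleneck is instruction fetch from SDRAM rather than the data accesses themselves.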
Reply by October 9, 2003
howard@zaxcom.com (howy) writes:

> > My C6711, which runs on a custom board, runs way too slow.
> > My peripherals include one asynchronous device (a PAL) and one SDRAM.
> > Michael Kedem
>
> I would like to know more about this too. I have a custom 6713 board
> and it does run pretty damn fast as long as the cache is enabled
> (32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
> touch screen and a full-blown graphical user interface). However, when
> the cache is turned off I get a 50x reduction (yes, 50 times) in
> overall performance. This was also true of the 6711 DSK board.
>
> My regular program with the cache enabled can update the screen at 60
> frames per second. I wrote a simple loop to fill the screen with
> pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
> With the cache disabled it can only fill the screen at a rate of 1 or
> 2 frames per second! Changing the wait states of the LCD controller
> from 200 ns to 1000 ns made little difference in performance. Therefore
> I was under the impression that since the DSP's instructions are up to
> 256 bits wide, the DSP is starved by the SDRAM. But maybe there
> is something more to it than that. I recall seeing that you can get
> more than 12 wait states for every out-of-page SDRAM access, but I
> never measured it on my board.
As an aside, remember the L1 cache is always enabled for program RAM.

Back to the SDRAM side: page 10-63 of the peripherals guide has the cycles of interest... So, as I understand it, an out-of-page access (worst case) causes:

  cache miss           (1 cycle)
  SDRAM deactivate     (1 cycle)
  SDRAM row activate   (2 cycles)
  SDRAM column read    (2 or 3 cycles latency)

In order to fill an instruction word (256 bits) from 32-bit RAM, the read will then take 8 cycles. The cache line is then full. There may be another cycle to transfer this to the DSP afterwards. This makes a total of 15-16 EMIF cycles (23-24 CPU cycles @ 150 MHz CPU vs 100 MHz EMIF).

Best case, assuming the DSP then continues to read to fill the cache line *and* doesn't cross any row/page boundaries, it can continue to read one 32-bit word per EMIF cycle, which still means it can only execute one instruction per 8 EMIF cycles (12 CPU cycles at a 150 MHz internal clock), at the absolute fastest! The cache only gains you in loops which fit within the cache (as we expect).

Once you throw writes into the fray as well, things get even nastier. Even a single 32-bit write will take at least 4 cycles (if in an active row) as the EMIF only ever writes bursts of 4, using the BEs to mask off the unwanted data! At least with lazy writing, the core won't stall while the write takes place... unless it needs its next instruction from the EMIF :-)

Hurrah for the cache when it fits the application - we've usually found it better to use our intimate knowledge of our application to shift the time-critical code and data into internal memory, paging it in and out of external SDRAM by EDMA - effectively doing our own caching.

"All programming can be viewed as an exercise in caching" -- Terje Mathisen

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt, Solihull, UK
http://www.trw.com/conekt
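To make the arithmetic above easy to play with, here is a small self-contained sketch that just re-derives the worst-case and best-case figures from the cycle counts quoted in the post; the 150 MHz / 100 MHz clock ratio and the 2-3 cycle CAS latency come from the text, and nothing here is measured:

  /* Back-of-the-envelope model of the figures above: worst-case and best-case
   * EMIF cycles to fetch one 256-bit fetch packet from 32-bit SDRAM, and the
   * equivalent CPU cycles for a 150 MHz core with a 100 MHz EMIF.
   */
  #include <stdio.h>

  int main(void)
  {
      const double cpu_per_emif = 150.0 / 100.0; /* CPU clocks per EMIF clock */

      int miss       = 1;  /* cache miss detection            */
      int deactivate = 1;  /* SDRAM deactivate                */
      int activate   = 2;  /* SDRAM row activate              */
      int cas        = 3;  /* column read latency (2 or 3)    */
      int burst      = 8;  /* 8 x 32-bit reads = 256 bits     */
      int xfer       = 1;  /* hand the filled line to the CPU */

      int worst = miss + deactivate + activate + cas + burst + xfer;
      int best  = burst;   /* already in-page, streaming the burst */

      printf("worst case: %d EMIF cycles (~%.0f CPU cycles) per fetch packet\n",
             worst, worst * cpu_per_emif);
      printf("best case : %d EMIF cycles (~%.0f CPU cycles) per fetch packet\n",
             best, best * cpu_per_emif);
      return 0;
  }

With the 3-cycle CAS figure it prints 16 EMIF cycles (~24 CPU cycles) worst case and 8 EMIF cycles (~12 CPU cycles) best case, matching the numbers above.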
Reply by Roger Larsson October 8, 2003
howy wrote:

>> My C6711, which runs on a custom board, runs way too slow.
>> My peripherals include one asynchronous device (a PAL) and one SDRAM.
>> Michael Kedem
>
> I would like to know more about this too. I have a custom 6713 board
> and it does run pretty damn fast as long as the cache is enabled
> (32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
> touch screen and a full-blown graphical user interface). However, when
> the cache is turned off I get a 50x reduction (yes, 50 times) in
> overall performance. This was also true of the 6711 DSK board.
>
> My regular program with the cache enabled can update the screen at 60
> frames per second. I wrote a simple loop to fill the screen with
> pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
> With the cache disabled it can only fill the screen at a rate of 1 or
> 2 frames per second! Changing the wait states of the LCD controller
> from 200 ns to 1000 ns made little difference in performance. Therefore
> I was under the impression that since the DSP's instructions are up to
> 256 bits wide, the DSP is starved by the SDRAM. But maybe there
> is something more to it than that. I recall seeing that you can get
> more than 12 wait states for every out-of-page SDRAM access, but I
> never measured it on my board.
If I remember correctly it is more like that many wait states for every non-cached access!

Count the stages the request has to go through:
* the EMIF
* the SDRAM
* back through the EMIF

Details are in the Peripheral Reference.

When running cached, the first access hurts, but the following accesses will already have been retrieved and will be in the cache. Matrix access order is important if the matrix is big. Compare:

  for (r = 0; r < rmax; r++)
    for (c = 0; c < cmax; c++)
      m[r][c]++;

with:

  for (c = 0; c < cmax; c++)
    for (r = 0; r < rmax; r++)
      m[r][c]++;

Try with different sizes of 'rmax * sizeof(m[0][0])', especially those bigger than L1 and L2.

/RogerL

--
Roger Larsson
Skellefteå, Sweden
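As a self-contained way to try Roger's comparison, here is a small host-side sketch; the 1024x1024 dimensions and the use of the standard clock() function are illustrative assumptions - on the C6711 itself you would size the matrix against L1/L2 and time with an on-chip timer instead:

  /* Host-side sketch of the row-order vs column-order comparison. */
  #include <stdio.h>
  #include <time.h>

  #define RMAX 1024
  #define CMAX 1024

  static int m[RMAX][CMAX];

  int main(void)
  {
      clock_t t0, t1, t2;
      int r, c;

      t0 = clock();
      for (r = 0; r < RMAX; r++)       /* row order: walks memory sequentially, */
          for (c = 0; c < CMAX; c++)   /* so each cache line is fully reused    */
              m[r][c]++;
      t1 = clock();
      for (c = 0; c < CMAX; c++)       /* column order: strides CMAX ints per   */
          for (r = 0; r < RMAX; r++)   /* access, missing far more often        */
              m[r][c]++;
      t2 = clock();

      printf("row order   : %ld ticks\n", (long)(t1 - t0));
      printf("column order: %ld ticks\n", (long)(t2 - t1));
      return 0;
  }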
Reply by howy October 8, 2003
> My C6711, which runs on a custom board, runs way too slow.
> My peripherals include one asynchronous device (a PAL) and one SDRAM.
> Michael Kedem
I would like to know more about this too. I have a custom 6713 board
and it does run pretty damn fast as long as the cache is enabled
(32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
touch screen and a full-blown graphical user interface). However, when
the cache is turned off I get a 50x reduction (yes, 50 times) in
overall performance. This was also true of the 6711 DSK board.

My regular program with the cache enabled can update the screen at 60
frames per second. I wrote a simple loop to fill the screen with
pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
With the cache disabled it can only fill the screen at a rate of 1 or
2 frames per second! Changing the wait states of the LCD controller
from 200 ns to 1000 ns made little difference in performance. Therefore
I was under the impression that since the DSP's instructions are up to
256 bits wide, the DSP is starved by the SDRAM. But maybe there
is something more to it than that. I recall seeing that you can get
more than 12 wait states for every out-of-page SDRAM access, but I
never measured it on my board.

-howy
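For concreteness, here is a sketch of the kind of fill loop described above, with the frame-time arithmetic worked out in the comments; the LCD data-register address and the two-pixels-per-16-bit-write packing are assumptions made for illustration, not details from howy's board:

  /* Sketch of the screen-fill test. LCD_DATA_REG is a hypothetical
   * memory-mapped address standing in for the real LCD controller register.
   */
  #define LCD_DATA_REG ((volatile unsigned short *)0xA0000000) /* assumed CE2 address */

  void fill_screen(unsigned short two_pixels)
  {
      /* 320 x 240 pixels, two 8-bit pixels packed per 16-bit write:
       *   (320 * 240) / 2 = 38400 transfers.
       * At 200 ns per asynchronous write that is 38400 * 200 ns = 7.68 ms,
       * i.e. roughly 130 frames/s of raw bus bandwidth.
       */
      int i;
      for (i = 0; i < (320 * 240) / 2; i++) {
          *LCD_DATA_REG = two_pixels;
      }
  }

Even at 200 ns per write the bus alone could sustain roughly 130 frames/s, so the observed 1-2 frames/s with the cache off points at instruction fetches from SDRAM rather than at the LCD timing - consistent with the wait-state change making little difference.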
Reply by Michael Kedem October 8, 2003
Hi
My C6711, which runs on a custom board, runs way too slow.
My peripherals include one asynchronous device (a PAL) and one SDRAM.

The Profiler Clock shows 6 cycles/ASM statement when single-stepping.
However, when not single-stepping, instruction timings are even worse.

I thought this was a cache problem, so I configured the system to work
without an L2 cache - this did not seem to make a difference.

When looking at data fetches from SDRAM on a logic analyzer,
the Read and Write cycles look perfectly normal. But when looping
on an instruction that repeatedly READs from a constant address in SDRAM,
there are about 500 ns between reads, which is about 50 cycles/instruction.
(To make sure there is no looping overhead, I unrolled the loop.)
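(For concreteness, a minimal sketch of that unrolled read test follows; the SDRAM base address, unroll factor and iteration count below are placeholders rather than the actual code.)

  /* Unrolled SDRAM read test: eight back-to-back reads of the same word per
   * iteration, so loop overhead is negligible compared with the accesses.
   */
  #define SDRAM_WORD ((volatile unsigned int *)0x80000000) /* assumed CE0 address */

  unsigned int read_test(void)
  {
      unsigned int v = 0;
      int i;
      for (i = 0; i < 1000; i++) {
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
      }
      return v;
  }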

Any pointers or advice?

Thanks,

Michael Kedem