Reply by October 1, 2009
"Michael Kedem"  writes:

> Hi
> My C6711, which runs on a custom board, runs way too slow.
> My peripherals include one asynchronous device (a PAL) and one SDRAM.
>
> The Profiler Clock shows 6 cycles/ASM statement when single-stepping.
> However, when not single-stepping, instruction timings are even worse.
>
> I thought this was a cache problem, so I configured the system to work
> without an L2 cache - this did not seem to make a difference.
>
> When looking at data fetches from SDRAM on a logic analyzer,
> the Read and Write cycles look perfectly normal. But when looping
> on an instruction that repeatedly READs from a constant address in SDRAM,
> there are about 500 ns between reads, which is about 50 cycles/instruction.
> (To make sure there is no looping overhead, I unrolled the loop.)
>
> Any pointers or advice?
Are your EMIF settings correct? 50 cycles sounds like a very long time though - even if they're wrong!

Is the code in SDRAM as well as the data? Does the assembly code for your continuous loop look like you might expect?

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt, Solihull, UK
http://www.trw.com/conekt
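On the "is the code in SDRAM" point, a quick way to take instruction fetches out of the picture is to pin the hot routine into on-chip RAM explicitly. Below is a minimal sketch using the TI C6000 compiler's CODE_SECTION pragma; the section name and loop body are placeholders, and the section still has to be mapped to internal memory in your linker command file for it to have any effect:

  /* Hypothetical sketch: place a time-critical routine in its own section so
   * the linker command file can map it to internal RAM instead of the SDRAM
   * CE space. ".fast_text" is an arbitrary section name chosen here.
   */
  #pragma CODE_SECTION(inner_loop, ".fast_text")
  void inner_loop(volatile unsigned int *src, unsigned int *dst, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          dst[i] = src[i];   /* placeholder body standing in for the real work */
      }
  }

If the slowdown disappears when only the code moves on-chip, the bottleneck is instruction fetch from SDRAM rather than the data accesses themselves.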
Reply by October 9, 2003
howard@zaxcom.com (howy) writes:

> > My C6711, which runs on a custom board, runs way too slow.
> > My peripherals include one asynchronous device (a PAL) and one SDRAM.
> > Michael Kedem
>
> I would like to know more about this too. I have a custom 6713 board
> and it does run pretty damn fast as long as the cache is enabled
> (32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
> touch screen and a full-blown graphical user interface). However, when
> the cache is turned off I get a 50x reduction (yes, 50 times) in
> overall performance. This was also true of the 6711 DSK board.
>
> My regular program with the cache enabled can update the screen at 60
> frames per second. I wrote a simple loop to fill the screen with
> pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
> With the cache disabled it can only fill the screen at a rate of 1 or
> 2 frames per second! Changing the wait states of the LCD controller
> from 200 ns to 1000 ns made little difference in performance. Therefore
> I was under the impression that since the DSP's instructions are up to
> 256 bits wide, the DSP is starved by the SDRAM. But maybe there
> is something more to it than that. I recall seeing that you can get
> more than 12 wait states for every out-of-page SDRAM access, but I
> never measured it on my board.
As an aside, remember the L1 cache is always enabled for program RAM.

Back to the SDRAM side: page 10-63 of the peripherals guide has the cycles of interest... So, as I understand it, an out-of-page access (worst case) causes:

  cache miss           (1 cycle)
  SDRAM deactivate     (1 cycle)
  SDRAM row activate   (2 cycles)
  SDRAM column read    (2 or 3 cycles latency)

In order to fill an instruction word (256 bits) from 32-bit RAM, the read will then take 8 cycles. The cache line is then full. There may be another cycle to transfer this to the DSP afterwards. This makes a total of 15-16 EMIF cycles (23-24 CPU cycles @ 150 MHz CPU vs 100 MHz EMIF).

Best case, assuming the DSP then continues to read to fill the cache line *and* doesn't cross any row/page boundaries, it can continue to read one 32-bit word per EMIF cycle, which still means it can only execute one instruction per 8 EMIF cycles (12 CPU cycles at a 150 MHz internal clock), at the absolute fastest! The cache only gains you in loops which fit within the cache (as we expect).

Once you throw writes into the fray as well, things get even nastier. Even a single 32-bit write will take at least 4 cycles (if in an active row) as the EMIF only ever writes bursts of 4, using the BEs to mask off the unwanted data! At least with lazy writing, the core won't stall while the write takes place... unless it needs its next instruction from the EMIF :-)

Hurrah for the cache when it fits the application - we've usually found it better to use our intimate knowledge of our application to shift the time-critical code and data into internal memory, paging it in and out of external SDRAM by EDMA - effectively doing our own caching.

"All programming can be viewed as an exercise in caching" -- Terje Mathisen

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt, Solihull, UK
http://www.trw.com/conekt
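To make the arithmetic above easy to play with, here is a small self-contained sketch that just re-derives the worst-case and best-case figures from the cycle counts quoted in the post; the 150 MHz / 100 MHz clock ratio and the 2-3 cycle CAS latency come from the text, and nothing here is measured:

  /* Back-of-the-envelope model of the figures above: worst-case and best-case
   * EMIF cycles to fetch one 256-bit fetch packet from 32-bit SDRAM, and the
   * equivalent CPU cycles for a 150 MHz core with a 100 MHz EMIF.
   */
  #include <stdio.h>

  int main(void)
  {
      const double cpu_per_emif = 150.0 / 100.0; /* CPU clocks per EMIF clock */

      int miss       = 1;  /* cache miss detection            */
      int deactivate = 1;  /* SDRAM deactivate                */
      int activate   = 2;  /* SDRAM row activate              */
      int cas        = 3;  /* column read latency (2 or 3)    */
      int burst      = 8;  /* 8 x 32-bit reads = 256 bits     */
      int xfer       = 1;  /* hand the filled line to the CPU */

      int worst = miss + deactivate + activate + cas + burst + xfer;
      int best  = burst;   /* already in-page, streaming the burst */

      printf("worst case: %d EMIF cycles (~%.0f CPU cycles) per fetch packet\n",
             worst, worst * cpu_per_emif);
      printf("best case : %d EMIF cycles (~%.0f CPU cycles) per fetch packet\n",
             best, best * cpu_per_emif);
      return 0;
  }

With the 3-cycle CAS figure it prints 16 EMIF cycles (~24 CPU cycles) worst case and 8 EMIF cycles (~12 CPU cycles) best case, matching the numbers above.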
Reply by Roger Larsson October 8, 2003
howy wrote:

>> My C6711, which runs on a custom board, runs way too slow.
>> My peripherals include one asynchronous device (a PAL) and one SDRAM.
>> Michael Kedem
>
> I would like to know more about this too. I have a custom 6713 board
> and it does run pretty damn fast as long as the cache is enabled
> (32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
> touch screen and a full-blown graphical user interface). However, when
> the cache is turned off I get a 50x reduction (yes, 50 times) in
> overall performance. This was also true of the 6711 DSK board.
>
> My regular program with the cache enabled can update the screen at 60
> frames per second. I wrote a simple loop to fill the screen with
> pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
> With the cache disabled it can only fill the screen at a rate of 1 or
> 2 frames per second! Changing the wait states of the LCD controller
> from 200 ns to 1000 ns made little difference in performance. Therefore
> I was under the impression that since the DSP's instructions are up to
> 256 bits wide, the DSP is starved by the SDRAM. But maybe there
> is something more to it than that. I recall seeing that you can get
> more than 12 wait states for every out-of-page SDRAM access, but I
> never measured it on my board.
If I remember correctly it is more like that many wait states for every non-cached access!

Count the stages the request has to go through:
* the EMIF
* the SDRAM
* back through the EMIF

Details are in the Peripheral Reference.

When running cached, the first access hurts, but the following accesses will already have been retrieved and will be in the cache. Matrix access order is important if the matrix is big. Compare:

  for (r = 0; r < rmax; r++)
    for (c = 0; c < cmax; c++)
      m[r][c]++;

with:

  for (c = 0; c < cmax; c++)
    for (r = 0; r < rmax; r++)
      m[r][c]++;

Try with different sizes of 'rmax * sizeof(m[0][0])', especially those bigger than L1 and L2.

/RogerL

--
Roger Larsson
Skellefteå, Sweden
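As a self-contained way to try Roger's comparison, here is a small host-side sketch; the 1024x1024 dimensions and the use of the standard clock() function are illustrative assumptions - on the C6711 itself you would size the matrix against L1/L2 and time with an on-chip timer instead:

  /* Host-side sketch of the row-order vs column-order comparison. */
  #include <stdio.h>
  #include <time.h>

  #define RMAX 1024
  #define CMAX 1024

  static int m[RMAX][CMAX];

  int main(void)
  {
      clock_t t0, t1, t2;
      int r, c;

      t0 = clock();
      for (r = 0; r < RMAX; r++)       /* row order: walks memory sequentially, */
          for (c = 0; c < CMAX; c++)   /* so each cache line is fully reused    */
              m[r][c]++;
      t1 = clock();
      for (c = 0; c < CMAX; c++)       /* column order: strides CMAX ints per   */
          for (r = 0; r < RMAX; r++)   /* access, missing far more often        */
              m[r][c]++;
      t2 = clock();

      printf("row order   : %ld ticks\n", (long)(t1 - t0));
      printf("column order: %ld ticks\n", (long)(t2 - t1));
      return 0;
  }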
Reply by howy October 8, 2003
> My C6711, which runs on a custom board, runs way too slow.
> My peripherals include one asynchronous device (a PAL) and one SDRAM.
> Michael Kedem
I would like to know more about this too. I have a custom 6713 board
and it does run pretty damn fast as long as the cache is enabled
(32-bit SDRAM, many 16-bit peripherals including a 1/4 VGA color LCD
touch screen and a full-blown graphical user interface). However, when
the cache is turned off I get a 50x reduction (yes, 50 times) in
overall performance. This was also true of the 6711 DSK board.

My regular program with the cache enabled can update the screen at 60
frames per second. I wrote a simple loop to fill the screen with
pixels, which takes (320*240)/2 = 38400 16-bit transfers at 200 ns each.
With the cache disabled it can only fill the screen at a rate of 1 or
2 frames per second! Changing the wait states of the LCD controller
from 200 ns to 1000 ns made little difference in performance. Therefore
I was under the impression that since the DSP's instructions are up to
256 bits wide, the DSP is starved by the SDRAM. But maybe there
is something more to it than that. I recall seeing that you can get
more than 12 wait states for every out-of-page SDRAM access, but I
never measured it on my board.

-howy
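For concreteness, here is a sketch of the kind of fill loop described above, with the frame-time arithmetic worked out in the comments; the LCD data-register address and the two-pixels-per-16-bit-write packing are assumptions made for illustration, not details from howy's board:

  /* Sketch of the screen-fill test. LCD_DATA_REG is a hypothetical
   * memory-mapped address standing in for the real LCD controller register.
   */
  #define LCD_DATA_REG ((volatile unsigned short *)0xA0000000) /* assumed CE2 address */

  void fill_screen(unsigned short two_pixels)
  {
      /* 320 x 240 pixels, two 8-bit pixels packed per 16-bit write:
       *   (320 * 240) / 2 = 38400 transfers.
       * At 200 ns per asynchronous write that is 38400 * 200 ns = 7.68 ms,
       * i.e. roughly 130 frames/s of raw bus bandwidth.
       */
      int i;
      for (i = 0; i < (320 * 240) / 2; i++) {
          *LCD_DATA_REG = two_pixels;
      }
  }

Even at 200 ns per write the bus alone could sustain roughly 130 frames/s, so the observed 1-2 frames/s with the cache off points at instruction fetches from SDRAM rather than at the LCD timing - consistent with the wait-state change making little difference.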
Reply by Michael Kedem October 8, 2003
Hi
My C6711, which runs on a custom board, runs way too slow.
My peripherals include one asynchronous device (a PAL) and one SDRAM.

The Profiler Clock shows 6 cycles/ASM statement when single-stepping.
However, when not single-stepping, instruction timings are even worse.

I thought this was a cache problem, so I configured the system to work
without an L2 cache - this did not seem to make a difference.

When looking at data fetches from SDRAM on a logic analyzer,
the Read and Write cycles look perfectly normal. But when looping
on an instruction that repeatedly READs from a constant address in SDRAM,
there are about 500 ns between reads, which is about 50 cycles/instruction.
(To make sure there is no looping overhead, I unrolled the loop.)
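(For concreteness, a minimal sketch of that unrolled read test follows; the SDRAM base address, unroll factor and iteration count below are placeholders rather than the actual code.)

  /* Unrolled SDRAM read test: eight back-to-back reads of the same word per
   * iteration, so loop overhead is negligible compared with the accesses.
   */
  #define SDRAM_WORD ((volatile unsigned int *)0x80000000) /* assumed CE0 address */

  unsigned int read_test(void)
  {
      unsigned int v = 0;
      int i;
      for (i = 0; i < 1000; i++) {
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
          v += *SDRAM_WORD; v += *SDRAM_WORD;
      }
      return v;
  }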

Any pointers or advice?

Thanks,

Michael Kedem