c6x | Software pipeline - wrong number of cycles

Hi everybody, I'm new and glad to be here.

I am working on a C6713 DSK. I wrote pipelined, optimized assembly code, and it
consumed more than twice as many execution cycles as expected. To understand the
problem, I wrote a very simple test:

   LDDW  .D1  *A4++,A7:A6    ; #1
|| LDDW  .D2  *B4++,B7:B6
   LDDW  .D1  *A4++,A7:A6    ; #2
|| LDDW  .D2  *B4++,B7:B6
   .
   .
   .
   LDDW  .D1  *A4++,A7:A6    ; #512
|| LDDW  .D2  *B4++,B7:B6

It should consume a little more than 512 cycles, but it actually takes about 820.
Then I replaced the LDDW instructions with LDW ones, and the number of cycles
was considerably different. I am using internal memory (for both code and data),
and I've verified that the accesses go to banks 0 and 4 each cycle. Hardware
interrupts (HWI) are disabled during execution. Am I missing something?

Thanks,

Charly

Reply by Bernhard Gustl Bauer ● February 9, 2006

Hi Charly,

feringlo@feri... wrote:
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and
> the number of cycles was considerably different. I am using internal
> mem (for both code and data) and I've verified that accesses are at
> banks 0 and 4 each cycle. The HWI interrupts are disabled during the
> execution. Am I missing something?

How do you count the cycles you need? By hand, with the simulator, or by
measuring some hardware pins? Maybe you are missing the stalls. Try the
simulator.

HTH Gustl

Reply by Jeff Brower ● February 9, 2006

Charly-

> I am working on a C6713DSK. I wrote a pipelined optimized ASM code, and it
> consumed more than twice execution cycles than expected. To understand
> the problem, I wrote a very simple code:
>
>    LDDW  .D1  *A4++,A7:A6    ; #1
> || LDDW  .D2  *B4++,B7:B6
>    LDDW  .D1  *A4++,A7:A6    ; #2
> || LDDW  .D2  *B4++,B7:B6
>    .
>    .
>    .
>    LDDW  .D1  *A4++,A7:A6    ; #512
> || LDDW  .D2  *B4++,B7:B6
>
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and the
> number of cycles was considerably different. I am using internal mem (for
> both code and data) and I've verified that accesses are at banks 0 and 4
> each cycle. The HWI interrupts are disabled during the execution. Am
> I missing something?

How are you measuring cycles?  Have you verified your measurement method, for
example by writing a loop that iterates over NOPs many times and collecting
some average figures?

-Jeff

Reply by Charly Feringlo ● February 9, 2006

I measured the time using the low-resolution clock, whose interval is set to 10 us. I am taking the difference between the time at which it leaves the optimized algorithm and the time at which it starts again in the next cycle. I'm not measuring the length of the execution itself, because the CLK interrupts are disabled during it.

The algorithm is called to fill a buffer of 1280 samples (640 frames) to be output through a SIO stream to the codec at a rate of 44100 frames per second. The optimized algorithm is therefore called at a frequency of (44100 / 640) Hz, which means an interval of about 14500 us. The pipelined code is about 1800 execution cycles long for each frame and should take approx. 1800 * 640 / 225,000,000 s = 5120 us, leaving us about 14500 - 5120 = 9380 us free (a little less than that). The measurements gave us only 2220 us free, less than half of that. The execution graph shows the same results, and the max. number of execution cycles is 2100, with almost full use of the processor.
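
The budget arithmetic in this post can be checked with a small host-side C sketch. It only uses the figures quoted above (225 MHz, 44100 frames/s, 640 frames per buffer, 1800 cycles per frame, 2220 us measured free). The function names are made up for illustration. Note that the figure returned by `implied_cycles_per_frame` is only an upper bound on the algorithm's own cost, since the busy time also contains whatever else runs in that period.

```c
#include <assert.h>

/* Figures quoted in the post above. */
#define CPU_HZ            225000000.0  /* CPU clock used in the post */
#define FRAME_RATE_HZ     44100.0      /* codec frame rate */
#define FRAMES_PER_BUFFER 640.0        /* frames produced per call */

/* Time between two buffer requests, in microseconds. */
static double buffer_period_us(void) {
    return FRAMES_PER_BUFFER / FRAME_RATE_HZ * 1e6;
}

/* Expected processing time per buffer, in microseconds, if every frame
 * costs cycles_per_frame CPU cycles. */
static double processing_us(double cycles_per_frame) {
    return cycles_per_frame * FRAMES_PER_BUFFER / CPU_HZ * 1e6;
}

/* Cycles per frame implied by a measured amount of free time per buffer.
 * Everything that is not free time is counted as busy, so this is an
 * upper bound on the algorithm's own cost. */
static double implied_cycles_per_frame(double free_us) {
    double busy_us;
    assert(free_us >= 0.0 && free_us <= buffer_period_us());
    busy_us = buffer_period_us() - free_us;
    return busy_us * 1e-6 * CPU_HZ / FRAMES_PER_BUFFER;
}
```

With the numbers from the post, `buffer_period_us()` gives roughly 14512 us, `processing_us(1800.0)` gives 5120 us, and `implied_cycles_per_frame(2220.0)` comes out near 4300 cycles per frame, which matches the "more than twice" observation.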

Charly


Reply by feringlo ● February 9, 2006

> How are you measuring cycles?  Have you verified your measurement
> method, for example write a loop that iterates NOPs many times and get
> some average figures?

I measured the time using the low-resolution clock, whose interval is
set to 10 us. I am taking the difference between the time at which it
leaves the optimized algorithm and the time at which it starts again
in the next cycle. I'm not measuring the length of the execution
itself, because the CLK interrupts are disabled during it.

The algorithm is called to fill a buffer of 1280 samples (640 frames)
to be output through a SIO stream to the codec at a rate of 44100
frames per second. The optimized algorithm is therefore called at a
frequency of (44100 / 640) Hz, which means an interval of about
14500 us. The pipelined code is about 1800 execution cycles long for
each frame and should take approx. 1800 * 640 / 225,000,000 s =
5120 us, leaving us about 14500 - 5120 = 9380 us free (a little less
than that). The measurements gave us only 2220 us free, less than
half of that. The execution graph shows the same results, and the
max. number of execution cycles is 2100, with almost full use of the
processor.

Thanks for your time,

Charly

Reply by Mike Dunn ● February 9, 2006

Hello Charly,

--- feringlo@feri... wrote:

> Hi everybody, I'm new and glad to be here
> 
> I am working on a C6713DSK. I wrote a pipelined
> optimized ASM code, and it consumed more than twice
> execution cycles than expected. To understand the
> problem, I wrote a very simple code:
> 
>    LDDW  .D1  *A4++,A7:A6    ; #1
> || LDDW  .D2  *B4++,B7:B6
>    LDDW  .D1  *A4++,A7:A6    ; #2
> || LDDW  .D2  *B4++,B7:B6
>    .
>    .
>    .
>    LDDW  .D1  *A4++,A7:A6    ; #512
> || LDDW  .D2  *B4++,B7:B6
> 
> It should consume little more than 512 cycles,

Really??  There is a problem with your assumption [as I see it]:
1. As I understand it, in order to get 'single cycle performance' on the
6713, your code + data need to be in L1 cache. They obviously will not be
there on pass #1.
2. As I understand your example, it is made up of 1024 words [4k bytes]
of code that fetch 2048 words [8k bytes] of data. The L1P cache size is
4k bytes and the L1D cache size is also 4k bytes - a bit small to hold
8k bytes of data.  L1D misses are guaranteed.  What happens [on the
second pass] if you cut the number of LDDWs in half??
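
The footprint argument above can be written out as a short C sketch. The sizes are the ones quoted in this reply (4-byte instruction words, 8-byte LDDW loads, 4k-byte L1P and L1D). The function names are hypothetical and exist only for this illustration. Fitting by size is a necessary condition only: two data streams can still evict each other depending on how their addresses map into the cache.

```c
#include <assert.h>
#include <stddef.h>

#define INSTR_BYTES 4u     /* one 32-bit instruction word */
#define LDDW_BYTES  8u     /* one LDDW loads a 64-bit doubleword */
#define L1_BYTES    4096u  /* L1P and L1D size quoted in the reply */

/* Code size in bytes for a straight-line run of execute packets. */
static size_t code_bytes(size_t packets, size_t instrs_per_packet) {
    return packets * instrs_per_packet * INSTR_BYTES;
}

/* Bytes of data touched when every load is an LDDW with post-increment,
 * so no address is read twice. */
static size_t data_bytes(size_t packets, size_t loads_per_packet) {
    return packets * loads_per_packet * LDDW_BYTES;
}

/* Nonzero if the footprint could fit in a cache of cache_bytes. */
static int fits(size_t footprint, size_t cache_bytes) {
    assert(cache_bytes > 0);
    return footprint <= cache_bytes;
}
```

For the 512-packet test, `code_bytes(512, 2)` is 4096 and just fits in L1P, while `data_bytes(512, 2)` is 8192 and cannot fit in L1D. Halving the test to 256 packets brings the data down to 4096 bytes, which is the experiment suggested above.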

> but it actually takes about 820. Then I replaced the
> LDDW instructions with LDW ones, and the number of
> cycles was considerably different.
What number did you get??

> I am using
> internal mem (for both code and data) and I've
> verified that accesses are at banks 0 and 4 each
> cycle. The HWI interrupts are disabled during the
> execution. Am I missing something?
yes. :-)

mikedunn

Reply by Mike Dunn ● February 9, 2006

Charly,

--- feringlo <feringlo@feri...> wrote:

> > How are you measuring cycles?  Have you verified your measurement
> > method, for example write a loop that iterates NOPs many times and
> > get some average figures?
>
> I measured the time using the low resolution clock, which interval is
> set to 10us. I am taking the difference between the time at which it
> leaves the optimized algorithm and the time at which it starts again
> in the next cycle. I'm not measuring the length of the execution
> itself, because the CLK interruptions are disabled during it.
>
> The algorithm is called to fill a buffer of 1280 samples (640 frames)
> to be outputted through a SIO stream to the codec at a rate of 44100
> Hz fps. Then, the optimized algorithm is called with a frequency of
> (44100 / 640) Hz which means an interval of 14500 us. The pipelined
> code is about 1800 execution cycles long for each frame and should
> take approx. 1800 * 640 / 225,000,000 = 5120 us,

I think that you need to review TI's docs on the 6713. Your assumption
is too optimistic.

mikedunn

> letting us free about 14500 - 5120 = 9380 us (a little less than
> that). The measurements gave us only 2220 us free, less than the
> half. The execution graph shows the same results, and the max. number
> of execution cycles is 2100, with an almost full use of the processor.
> 
> Thanks for your time,
> 
> Charly

Reply by Jeff Brower ● February 9, 2006

Charly-

> I measured the time using the low resolution clock, which interval is set
> to 10us. I am taking the difference between the time at which it leaves the
> optimized algorithm and the time at which it starts again in the next
> cycle. I'm not measuring the length of the execution itself, because the
> CLK interruptions are disabled during it.

I would not do this.  If you are depending on DSP/BIOS, RTDX, etc. to make cycle
measurements, you are introducing extra factors before you have a full
understanding of your system and your measurement process.

The better approach is to enable TIMER1 (not TIMER0, which is used by DSP/BIOS),
read its value prior to your test code section, then read it again after the
section finishes.  This method is insensitive to interrupts, RTDX, DMA -- just
about anything except for stop-mode emulation.  You will get super-precise
figures, you can store an array (short history) of figures to determine what's
happening with cache (re. Mike's comments), and you will not be subject to
CCS + JTAG weirdnesses that might jump out to bite you.
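
A minimal C sketch of the read-before, read-after pattern described above, with a short history array. It is not Jeff's code. The counter is reached through a hypothetical `read_counter_fn` hook so that the sketch runs on any host; on the DSP that hook would wrap a read of the TIMER1 count register. The fake counter and fake workload are stand-ins with made-up tick values, present only so the example can be exercised. As far as I know the C6713 timers are clocked at a fraction of the CPU clock, so ticks would still need scaling to CPU cycles; check the device documentation for the exact ratio.

```c
#include <assert.h>
#include <stdint.h>

#define HISTORY_LEN 8

/* Hypothetical hook: returns the current value of a free-running 32-bit
 * up-counter (on the DSP, a read of the TIMER1 count register). */
typedef uint32_t (*read_counter_fn)(void);

/* Ticks elapsed between two reads. Unsigned subtraction stays correct
 * across a single wraparound of the counter. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end) {
    return end - start;
}

/* Run the code under test `runs` times and store each duration. */
static void record_history(read_counter_fn read_counter,
                           void (*code_under_test)(void),
                           uint32_t history[], int runs) {
    int i;
    assert(runs > 0 && runs <= HISTORY_LEN);
    for (i = 0; i < runs; i++) {
        uint32_t start = read_counter();
        code_under_test();
        history[i] = elapsed_ticks(start, read_counter());
    }
}

/* Host-side stand-ins (assumptions, demo only): a fake counter that
 * starts near wraparound and a fake workload whose first call is
 * slower, mimicking a cold cache. The tick values are invented. */
static uint32_t fake_count = 0xFFFFFF00u;
static int fake_calls = 0;

static uint32_t fake_read_counter(void) {
    return fake_count;
}

static void fake_workload(void) {
    fake_count += (fake_calls++ == 0) ? 300u : 100u;
}
```

Calling `record_history(fake_read_counter, fake_workload, history, 4)` fills `history` with one slow first pass followed by faster ones, which is the kind of pattern a cold cache would produce in a real measurement.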

-Jeff


Reply by pavan kumar ● February 10, 2006

Jeff,

I guess there can be some cache misses initially. I am not familiar with the
C6713 specifically.

I once got a cycle count even higher than the worst-case hand-counted cycles
of a certain algorithm on a particular DSP. I then realised that it was
because of cache misses: each cache miss costs some number of cycles, and
these add up and show up as a high cycle count.

You can think along these lines as well. Sorry if I have strayed from your
main problem.

Anyone, please correct me if I am wrong!

Thanks,
Pavan Kumar

Reply by Jeff Brower ● February 10, 2006

Pavan-

>   Guess there can some Cache misses initially ..i am not aware
>  of C6713  stuff....
> 
>   actually i even got more cycle count that the worst case  hand
> counted cycles of a certain algorithm ,for a particular DSP
>   then i realised and found that its bcoz of Cache misses and for
> each cache miss there would be some amount of cycles used .which
> would add up and reflect as our high cycle count

This is partly why I suggested to Charly to use the onchip TIMER1, and code a
short array that keeps track of execution time in successive loops through the
code.  If cache plays a role, then it will jump out when Charly plots the time
history.
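
One way to read such a time history without plotting it, sketched in C. The helper names and the `slack` threshold are assumptions for illustration, not anything from the thread. The idea is to take the fastest pass as the steady state and compare every other pass against it. A large penalty on the first pass only suggests a one-time warm-up cost, while a steady state that is itself well above the hand count would point to stalls that recur on every pass.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest duration in the history: the best-case, "warm" figure. */
static uint32_t steady_state(const uint32_t history[], int n) {
    uint32_t best;
    int i;
    assert(n > 0);
    best = history[0];
    for (i = 1; i < n; i++) {
        if (history[i] < best) {
            best = history[i];
        }
    }
    return best;
}

/* Extra ticks the first pass cost compared with the steady state. */
static uint32_t first_pass_penalty(const uint32_t history[], int n) {
    return history[0] - steady_state(history, n);
}

/* Number of passes that exceed the steady state by more than `slack`. */
static int slow_passes(const uint32_t history[], int n, uint32_t slack) {
    uint32_t best = steady_state(history, n);
    int i;
    int count = 0;
    for (i = 0; i < n; i++) {
        if (history[i] - best > slack) {
            count++;
        }
    }
    return count;
}
```

For an invented history such as `{300, 100, 104, 100, 180}`, the steady state is 100, the first-pass penalty is 200, and two passes are slower than the steady state by more than 10 ticks.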

-Jeff
