Pavan-
> Guess there can be some cache misses initially. I am not aware
> of C6713 stuff...
>
> Actually, I even got a higher cycle count than the worst-case hand-counted
> cycles of a certain algorithm on a particular DSP. Then I realised that it
> was because of cache misses: each cache miss uses some number of cycles,
> which add up and show up as a high cycle count.
This is partly why I suggested to Charly to use the on-chip TIMER1 and code a
short array that keeps track of execution time in successive loops through the
code. If cache plays a role, then it will jump out when Charly plots the time
history.
-Jeff
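A minimal sketch of how such a time history could be checked once it has been captured. It assumes the per-pass counts are already stored in an array; the names `history_summary`, `summarize_history` and `looks_like_warmup` are made up for illustration and are not from any TI library.

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of a per-pass timing history (hypothetical helper type). */
typedef struct {
    uint32_t first;      /* count for pass 0, when the caches are cold */
    uint32_t steady_min; /* smallest count over passes 1..n-1 */
    uint32_t steady_max; /* largest count over passes 1..n-1 */
} history_summary;

/* Summarize n >= 2 per-pass counts. Returns 0 on success, -1 on bad input. */
int summarize_history(const uint32_t *counts, size_t n, history_summary *out)
{
    size_t i;
    if (counts == NULL || out == NULL || n < 2) {
        return -1;
    }
    out->first = counts[0];
    out->steady_min = counts[1];
    out->steady_max = counts[1];
    for (i = 2; i < n; i++) {
        if (counts[i] < out->steady_min) out->steady_min = counts[i];
        if (counts[i] > out->steady_max) out->steady_max = counts[i];
    }
    return 0;
}

/* Nonzero if the first pass is slower than every later pass, which is the
 * pattern a cache warm-up would be expected to leave in the history. */
int looks_like_warmup(const history_summary *s)
{
    return s->first > s->steady_max;
}
```

If the first pass stands out and later passes settle to a narrow band, cache warm-up is a plausible suspect; if every pass is equally slow, the cause is more likely something that repeats on each pass.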
Reply by pavan kumar●February 10, 2006
Jeff,
Guess there can be some cache misses initially. I am not aware of C6713
stuff...
Actually, I even got a higher cycle count than the worst-case hand-counted
cycles of a certain algorithm on a particular DSP. Then I realised that it was
because of cache misses: each cache miss uses some number of cycles, which add
up and show up as a high cycle count.
You can think along this angle also. Sorry if I deviated from your main problem.
Anyone correct me if I am wrong!
Thanks
Pavan Kumar
Reply by Jeff Brower●February 9, 2006
Charly-
> I measured the time using the low resolution clock, whose interval is set
> to 10us. I am taking the difference between the time at which it leaves the
> optimized algorithm and the time at which it starts again in the next cycle.
> I'm not measuring the length of the execution itself, because the CLK
> interrupts are disabled during it.
I would not do this. If you are depending on DSP/BIOS, RTDX etc. to make cycle
measurements, you are introducing extra factors that are best avoided until you
have a full understanding of your system and your measurement process.
The better approach is to enable TIMER1 (not TIMER0, which is used by DSP/BIOS),
read its value prior to your test code section, then read it again after the
section finishes. This method is insensitive to interrupts, RTDX, DMA -- just
about anything except for stop-mode emulation. You will get super-precise
figures, you can store an array (short history) of figures to determine what's
happening with cache (re. Mike's comments), and you will not be subject to
CCS + JTAG weirdnesses that might jump out to bite you.
-Jeff
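The read-before, read-after approach with a short history array could be sketched roughly as follows. This is only an illustration under stated assumptions: the timer is taken to be a free-running 32-bit up-counter with its period set to the maximum, the ratio of CPU cycles per timer tick (`CPU_CYCLES_PER_TICK`) is assumed to be 4 and should be confirmed against the device data sheet, and all names here are hypothetical rather than part of DSP/BIOS or the chip support library.

```c
#include <stdint.h>

#define HISTORY_LEN 16u
/* Assumption: the timer counts at CPU clock / 4; check the data sheet. */
#define CPU_CYCLES_PER_TICK 4u

typedef struct {
    uint32_t ticks[HISTORY_LEN]; /* most recent elapsed-tick figures, circular */
    uint32_t count;              /* total number of sections recorded */
} timing_history;

/* Elapsed ticks between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction gives the right answer across a single wrap. */
uint32_t elapsed_ticks(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Store one measurement, overwriting the oldest once the array is full. */
void history_record(timing_history *h, uint32_t start, uint32_t stop)
{
    h->ticks[h->count % HISTORY_LEN] = elapsed_ticks(start, stop);
    h->count++;
}

/* Convert timer ticks to CPU cycles under the assumed clock ratio. */
uint32_t ticks_to_cpu_cycles(uint32_t ticks)
{
    return ticks * CPU_CYCLES_PER_TICK;
}
```

On the target, `start` and `stop` would be two reads of the TIMER1 count register taken immediately before and after the section under test, with `history_record()` called after the second read so that the bookkeeping is not part of the measured interval.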
>
> The algorithm is called to fill a buffer of 1280 samples (640 frames) to be
> output through a SIO stream to the codec at a rate of 44100 Hz (frames per
> second). The optimized algorithm is therefore called at a frequency of
> (44100 / 640) Hz, which means an interval of about 14500 us. The pipelined
> code is about 1800 execution cycles long for each frame and should take
> approx. 1800 * 640 / 225,000,000 = 5120 us, leaving us about
> 14500 - 5120 = 9380 us free (a little less than that). The measurements gave
> us only 2220 us free, less than half. The execution graph shows the same
> results, and the max. number of execution cycles is 2100, with almost full
> use of the processor.
>
> Charly
>
> -
> From: Jeff Brower <jbrower@jbro...>
> To: Charly Feringlo <feringlo@feri...>
> CC: c6x@c6x@...
> Subject: Re: [c6x] Software pipeline - wrong number of cycles
> Date: Thu, 09 Feb 2006 09:44:24 -0600
> >Charly-
> >
> > > I am working on a C6713DSK. I wrote pipelined, optimized ASM code, and
> > > it consumed more than twice the execution cycles expected. To
> > > understand the problem, I wrote a very simple piece of code:
> > >
> > >        LDDW .D1 *A4++,A7:A6 ; #1
> > >     || LDDW .D2 *B4++,B7:B6
> > >        LDDW .D1 *A4++,A7:A6 ; #2
> > >     || LDDW .D2 *B4++,B7:B6
> > >     .
> > >     .
> > >     .
> > >        LDDW .D1 *A4++,A7:A6 ; #512
> > >     || LDDW .D2 *B4++,B7:B6
> > >
> > > It should consume little more than 512 cycles, but it actually takes
> > > about 820. Then I replaced the LDDW instructions with LDW ones, and the
> > > number of cycles was considerably different. I am using internal mem
> > > (for both code and data) and I've verified that accesses are at banks 0
> > > and 4 each cycle. The HWI interrupts are disabled during the execution.
> > > Am I missing something?
> >
> >How are you measuring cycles? Have you verified your measurement method,
> >for example by writing a loop that iterates NOPs many times and getting
> >some average figures?
> >
> >-Jeff
Reply by Mike Dunn●February 9, 2006
Charly,
--- feringlo <feringlo@feri...> wrote:
>
> > How are you measuring cycles? Have you verified your measurement
> > method, for example by writing a loop that iterates NOPs many times
> > and getting some average figures?
>
> I measured the time using the low resolution clock, whose interval is
> set to 10us. I am taking the difference between the time at which it
> leaves the optimized algorithm and the time at which it starts again
> in the next cycle. I'm not measuring the length of the execution
> itself, because the CLK interrupts are disabled during it.
>
> The algorithm is called to fill a buffer of 1280 samples (640 frames)
> to be output through a SIO stream to the codec at a rate of 44100 Hz
> (frames per second). The optimized algorithm is therefore called at a
> frequency of (44100 / 640) Hz, which means an interval of about
> 14500 us. The pipelined code is about 1800 execution cycles long for
> each frame and should
> take approx. 1800 * 640 / 225,000,000 = 5120 us,
I think that you need to review TI's docs on the 6713.
Your assumption is too optimistic.
mikedunn
> leaving us about 14500 - 5120 = 9380 us free (a little less than
> that). The measurements gave us only 2220 us free, less than half.
> The execution graph shows the same results, and the max. number of
> execution cycles is 2100, with almost full use of the processor.
>
> Thanks for your time,
>
> Charly
Reply by Mike Dunn●February 9, 2006
Hello Charly,
--- feringlo@feri... wrote:
> Hi everybody, I'm new and glad to be here.
>
> I am working on a C6713DSK. I wrote pipelined, optimized ASM code, and
> it consumed more than twice the execution cycles expected. To
> understand the problem, I wrote a very simple piece of code:
>
>        LDDW .D1 *A4++,A7:A6 ; #1
>     || LDDW .D2 *B4++,B7:B6
>        LDDW .D1 *A4++,A7:A6 ; #2
>     || LDDW .D2 *B4++,B7:B6
>     .
>     .
>     .
>        LDDW .D1 *A4++,A7:A6 ; #512
>     || LDDW .D2 *B4++,B7:B6
>
> It should consume little more than 512 cycles,
Really?? There is a problem with your assumption [as I see it].
1. As I understand it, in order to get 'single cycle performance' on the
6713, your code + data need to be in L1 cache. They obviously will not be
there on pass #1.
2. As I understand your example, it is made up of 1024 instruction words
[4k bytes] that fetch 2048 data words [8k bytes]. The L1P cache size is
4k bytes and the L1D cache size is also 4k bytes - a bit small to hold
8k bytes of data. L1D misses are guaranteed. What happens [on the second
pass] if you cut the number of LDDWs in half??
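The footprint arithmetic in point 2 can be written out as a small sketch. The 4k-byte cache sizes are the figures quoted in this post; the L1D line size of 32 bytes is an assumption that is not stated anywhere in the thread, and the function names are hypothetical.

```c
#include <stdint.h>

enum {
    BYTES_PER_INSTRUCTION = 4,    /* one 32-bit instruction word */
    BYTES_PER_LDDW        = 8,    /* LDDW loads one 64-bit doubleword */
    L1P_BYTES             = 4096, /* figure quoted in the post */
    L1D_BYTES             = 4096, /* figure quoted in the post */
    L1D_LINE_BYTES        = 32    /* assumption, not from the thread */
};

/* Code size of a straight-line run of execute packets. */
uint32_t code_bytes(uint32_t packets, uint32_t instr_per_packet)
{
    return packets * instr_per_packet * BYTES_PER_INSTRUCTION;
}

/* Data touched when every packet issues loads_per_packet LDDWs and the
 * two pointers walk separate, non-overlapping buffers. */
uint32_t data_bytes(uint32_t packets, uint32_t loads_per_packet)
{
    return packets * loads_per_packet * BYTES_PER_LDDW;
}

/* Nonzero if a footprint fits in a cache of the given size. */
int fits_in_cache(uint32_t bytes, uint32_t cache_bytes)
{
    return bytes <= cache_bytes;
}

/* Number of distinct cache lines covered by a contiguous footprint. */
uint32_t lines_touched(uint32_t bytes, uint32_t line_bytes)
{
    return (bytes + line_bytes - 1u) / line_bytes;
}
```

With 512 packets of two LDDWs each, `code_bytes(512, 2)` gives 4096 and `data_bytes(512, 2)` gives 8192, so the code just fills L1P while the data is twice the size of L1D. Halving the number of packets brings the data down to 4096 bytes, which is the experiment suggested above.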
> but it actually takes about 820. Then I replaced the LDDW instructions
> with LDW ones, and the number of cycles was considerably different.
What number did you get??
> I am using internal mem (for both code and data) and I've verified that
> accesses are at banks 0 and 4 each cycle. The HWI interrupts are disabled
> during the execution. Am I missing something?
yes. :-)
mikedunn
>
> Thanks,
>
> Charly
Reply by feringlo●February 9, 2006
>
> How are you measuring cycles? Have you verified your measurement method,
> for example by writing a loop that iterates NOPs many times and getting
> some average figures?
>
I measured the time using the low resolution clock, whose interval is
set to 10us. I am taking the difference between the time at which it
leaves the optimized algorithm and the time at which it starts again
in the next cycle. I'm not measuring the length of the execution
itself, because the CLK interrupts are disabled during it.
The algorithm is called to fill a buffer of 1280 samples (640 frames)
to be output through a SIO stream to the codec at a rate of 44100 Hz
(frames per second). The optimized algorithm is therefore called at a
frequency of (44100 / 640) Hz, which means an interval of about 14500 us.
The pipelined code is about 1800 execution cycles long for each frame and
should take approx. 1800 * 640 / 225,000,000 = 5120 us, leaving us about
14500 - 5120 = 9380 us free (a little less than that). The measurements
gave us only 2220 us free, less than half. The execution graph
shows the same results, and the max. number of execution cycles is
2100, with almost full use of the processor.
Thanks for your time,
Charly
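The timing budget in this post can be reproduced with a few lines of arithmetic. The sketch below uses only the figures given above (640 frames, 44100 Hz, 1800 cycles per frame, 225 MHz, 2220 us measured free); the function names are hypothetical, and the last helper assumes that all of the non-free time in a period is spent inside the algorithm, which the thread does not establish.

```c
/* Time between successive calls, in microseconds. */
double call_interval_us(double frames, double frame_rate_hz)
{
    return frames / frame_rate_hz * 1e6;
}

/* Expected execution time for one buffer, in microseconds. */
double exec_time_us(double cycles_per_frame, double frames, double cpu_hz)
{
    return cycles_per_frame * frames / cpu_hz * 1e6;
}

/* Cycles per frame implied by a measured busy time.
 * Assumption: the whole busy time belongs to the algorithm. */
double implied_cycles_per_frame(double busy_us, double frames, double cpu_hz)
{
    return busy_us * 1e-6 * cpu_hz / frames;
}
```

With these inputs the interval comes out near 14512 us and the expected execution time at 5120 us. Taking the measured 2220 us of free time at face value, the busy time is roughly 12290 us, which would correspond to about 4300 cycles per frame instead of 1800, in line with the "more than twice" observation in the original post.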
Reply by Jeff Brower●February 9, 2006
Charly-
> I am working on a C6713DSK. I wrote pipelined, optimized ASM code, and it
> consumed more than twice the execution cycles expected. To understand
> the problem, I wrote a very simple piece of code:
>
>        LDDW .D1 *A4++,A7:A6 ; #1
>     || LDDW .D2 *B4++,B7:B6
>        LDDW .D1 *A4++,A7:A6 ; #2
>     || LDDW .D2 *B4++,B7:B6
>     .
>     .
>     .
>        LDDW .D1 *A4++,A7:A6 ; #512
>     || LDDW .D2 *B4++,B7:B6
>
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and the
> number of cycles was considerably different. I am using internal mem (for
> both code and data) and I've verified that accesses are at banks 0 and 4
> each cycle. The HWI interrupts are disabled during the execution. Am
> I missing something?
How are you measuring cycles? Have you verified your measurement method, for
example by writing a loop that iterates NOPs many times and getting some
average figures?
-Jeff
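One way to act on this question is to time a section whose length is known in advance, such as a NOP loop, average several runs, and compare the average with the expected figure. The sketch below covers only the averaging and comparison; the helper names and the tolerance parameter are hypothetical, and the samples are assumed to come from whatever timer is being validated.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer average of n timing samples; returns 0 for an empty input. */
uint32_t average_ticks(const uint32_t *samples, size_t n)
{
    uint64_t sum = 0;
    size_t i;
    if (samples == NULL || n == 0) {
        return 0;
    }
    for (i = 0; i < n; i++) {
        sum += samples[i];
    }
    return (uint32_t)(sum / n);
}

/* Nonzero if the measured figure is within +/- tolerance of the expected one. */
int within_tolerance(uint32_t measured, uint32_t expected, uint32_t tolerance)
{
    uint32_t diff = measured > expected ? measured - expected
                                        : expected - measured;
    return diff <= tolerance;
}
```

If the average for the known-length loop does not land close to its hand-counted length, the measurement method itself needs attention before any conclusion is drawn about the LDDW code.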
Reply by Bernhard Gustl Bauer●February 9, 2006
Hi Charly,
feringlo@feri... wrote:
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and
> the number of cycles was considerably different. I am using internal
> mem (for both code and data) and I've verified that accesses are at
> banks 0 and 4 each cycle. The HWI interrupts are disabled during the
> execution. Am I missing something?
How do you count the cycles you need? By hand, with the simulator, or by
measuring some HW pins? Maybe you are missing the stalls. Try the simulator.
HTH Gustl
Reply by feri...@hotmail.com●February 9, 2006
Hi everybody, I'm new and glad to be here.
I am working on a C6713DSK. I wrote pipelined, optimized ASM code, and it
consumed more than twice the execution cycles expected. To understand the
problem, I wrote a very simple piece of code:
        LDDW .D1 *A4++,A7:A6 ; #1
     || LDDW .D2 *B4++,B7:B6
        LDDW .D1 *A4++,A7:A6 ; #2
     || LDDW .D2 *B4++,B7:B6
     .
     .
     .
        LDDW .D1 *A4++,A7:A6 ; #512
     || LDDW .D2 *B4++,B7:B6
It should consume little more than 512 cycles, but it actually takes about 820.
Then I replaced the LDDW instructions with LDW ones, and the number of cycles
was considerably different. I am using internal mem (for both code and data) and
I've verified that accesses are at banks 0 and 4 each cycle. The HWI
interrupts are disabled during the execution. Am I missing something?
Thanks,
Charly