Hi everybody, I'm new and glad to be here.

I am working on a C6713DSK. I wrote pipelined, optimized ASM code, and it consumed more than twice the expected number of execution cycles. To understand the problem, I wrote a very simple test:

       LDDW .D1 *A4++,A7:A6 ; #1
    || LDDW .D2 *B4++,B7:B6
       LDDW .D1 *A4++,A7:A6 ; #2
    || LDDW .D2 *B4++,B7:B6
       .
       .
       .
       LDDW .D1 *A4++,A7:A6 ; #512
    || LDDW .D2 *B4++,B7:B6

It should consume a little more than 512 cycles, but it actually takes about 820. Then I replaced the LDDW instructions with LDW ones, and the number of cycles was considerably different. I am using internal memory (for both code and data), and I've verified that the accesses go to banks 0 and 4 each cycle. The HWI interrupts are disabled during execution. Am I missing something?

Thanks,
Charly
Software pipeline - wrong number of cycles
Started by ●February 9, 2006
Reply by ●February 9, 2006
Hi Charly,
feringlo@feri... wrote:
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and
> the number of cycles was consideratly different. I am using internal
> mem (for both code and data) and I've verified that accesses are at
> banks 0 and 4 each cycle. The HWI interrupts are disabled during the
> execution. Am I missing something?
How do you count the cycles? By hand, with the simulator, or by measuring
some hardware pins? Maybe you are missing the stalls. Try the simulator.
HTH Gustl
Reply by ●February 9, 2006
Charly-
> I am working on a C6713DSK. I wrote a pipelined optimized ASM code, and it
> consumed more than twice execution cycles than expected. To understand
> the problem, I wrote a very simple code:
>
> LDDW .D1 *A4++,A7:A6 ; #1
> || LDDW .D2 *B4++,B7,B6
> LDDW .D1 *A4++,A7:A6 ; #2
> || LDDW .D2 *B4++,B7,B6
> .
> .
> .
> LDDW .D1 *A4++,A7:A6 ; #512
> || LDDW .D2 *B4++,B7,B6
>
> It should consume little more than 512 cycles, but it actually takes
> about 820. Then I replaced the LDDW instructions with LDW ones, and the
> number of cycles was consideratly different. I am using internal mem (for
> both code and data) and I've verified that accesses are at banks 0 and 4
> each cycle. The HWI interrupts are disabled during the execution. Am
> I missing something?
How are you measuring cycles? Have you verified your measurement method, for
example by writing a loop that iterates NOPs many times and collecting some
average figures?
-Jeff
Reply by ●February 9, 2006
I measured the time using the low-resolution clock, whose interval is set
to 10 us. I am taking the difference between the time at which execution leaves
the optimized algorithm and the time at which it starts again in the next cycle.
I'm not measuring the length of the execution itself, because
the CLK interrupts are disabled during it.
The algorithm is called to fill a buffer of 1280 samples (640 frames)
to be output through an SIO stream to the codec at a rate of 44100 frames per
second. The optimized algorithm is therefore called at a frequency of
(44100 / 640) Hz, which means an interval of about 14500 us. The pipelined code
is about 1800 execution cycles long for each frame and should take approx.
1800 * 640 / 225,000,000 s = 5120 us, leaving us about 14500 - 5120 = 9380 us
free (a little less than that).
The measurements gave us only 2220 us free, less than half of that. The execution
graph shows the same results, and the max. number of execution cycles is 2100,
with almost full use of the processor.
Charly
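The budget arithmetic in the message above can be checked with a short Python sketch. The variable names are illustrative only, and the last step rests on an assumption that is not in the post: that all non-free time in each buffer interval is spent inside the algorithm (DSP/BIOS and I/O overhead are ignored).

```python
# Back-of-the-envelope check of the frame budget described in the post.
CPU_HZ = 225_000_000              # core clock used in the post
FRAME_RATE_HZ = 44_100            # codec frame rate
FRAMES_PER_BUFFER = 640           # frames produced per call
EXPECTED_CYCLES_PER_FRAME = 1800  # hand-counted figure from the post
MEASURED_FREE_US = 2220           # free time reported in the post

# Time between two calls of the algorithm, in microseconds.
buffer_interval_us = FRAMES_PER_BUFFER / FRAME_RATE_HZ * 1e6

# Expected busy and free time if each frame really costs 1800 cycles.
expected_busy_us = EXPECTED_CYCLES_PER_FRAME * FRAMES_PER_BUFFER / CPU_HZ * 1e6
expected_free_us = buffer_interval_us - expected_busy_us

# Work backwards from the measured free time to an implied cost per frame.
# Assumption: everything that is not free time is the algorithm itself.
measured_busy_us = buffer_interval_us - MEASURED_FREE_US
implied_cycles_per_frame = measured_busy_us * 1e-6 * CPU_HZ / FRAMES_PER_BUFFER

print(f"buffer interval : {buffer_interval_us:8.0f} us")
print(f"expected busy   : {expected_busy_us:8.0f} us")
print(f"expected free   : {expected_free_us:8.0f} us")
print(f"implied cycles  : {implied_cycles_per_frame:8.0f} per frame")
```

Under that assumption the measured free time corresponds to roughly 4300 cycles per frame, a bit more than twice the 1800 that were counted by hand, which agrees with the "more than twice" figure from the first message.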
From: Jeff Brower <j...@signalogic.com>
To: Charly Feringlo <f...@hotmail.com>
CC: c...@yahoogroups.com
Subject: Re: [c6x] Software pipeline - wrong number of cycles
Date: Thu, 09 Feb 2006 09:44:24 -0600
Reply by ●February 9, 2006
Hello Charly,

--- feringlo@feri... wrote:
> Hi everybody, I'm new and glad to be here
>
> I am working on a C6713DSK. I wrote a pipelined
> optimized ASM code, and it consumed more than twice
> execution cycles than expected. To understand the
> problem, I wrote a very simple code:
>
> LDDW .D1 *A4++,A7:A6 ; #1
> || LDDW .D2 *B4++,B7,B6
> LDDW .D1 *A4++,A7:A6 ; #2
> || LDDW .D2 *B4++,B7,B6
> .
> .
> .
> LDDW .D1 *A4++,A7:A6 ; #512
> || LDDW .D2 *B4++,B7,B6
>
> It should consume little more than 512 cycles,

Really?? There is a problem with your assumption [as I see it]:

1. As I understand it, in order to get 'single cycle performance' on the 6713, your code + data need to be in L1 cache. They obviously will not be there on pass #1.
2. As I understand your example, it is made up of 1024 words [4k bytes] that fetch 2048 words [8k bytes]. The L1P cache size is 4k bytes and the L1D cache size is also 4k bytes - a bit small to hold 8k bytes of data. L1D misses are guaranteed.

What happens [on the second pass] if you cut the number of LDDWs in half??

> but
> it actually takes about 820. Then I replaced the
> LDDW instructions with LDW ones, and the number of
> cycles was consideratly different.

What number did you get??

> I am using
> internal mem (for both code and data) and I've
> verified that accesses are at banks 0 and 4 each
> cycle. The HWI interrupts are disabled during the
> execution. Am I missing something?

yes. :-)

mikedunn

> Thanks,
>
> Charly
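The footprint argument in the reply above can be written out as a small Python sketch. It is a size-only check: the cache sizes are the ones stated in the reply, the function and constant names are made up for illustration, and associativity, line size, and any surrounding code or data are ignored.

```python
INSTRUCTION_BYTES = 4      # one 32-bit word per C67x instruction
LDDW_BYTES = 8             # each LDDW loads a 64-bit doubleword
LOADS_PER_PACKET = 2       # one LDDW on .D1 in parallel with one on .D2
L1P_BYTES = 4 * 1024       # L1P size as stated in the reply
L1D_BYTES = 4 * 1024       # L1D size as stated in the reply


def footprint(execute_packets):
    """Return (code_bytes, data_bytes) for a run of parallel LDDW pairs."""
    loads = execute_packets * LOADS_PER_PACKET
    return loads * INSTRUCTION_BYTES, loads * LDDW_BYTES


def fits_in_l1(execute_packets):
    """Size-only check: does the test fit in L1P and L1D respectively?"""
    code_bytes, data_bytes = footprint(execute_packets)
    return code_bytes <= L1P_BYTES, data_bytes <= L1D_BYTES


for packets in (512, 256):
    code_bytes, data_bytes = footprint(packets)
    print(packets, code_bytes, data_bytes, fits_in_l1(packets))
```

With 512 packets the code just fills a 4k L1P while the data is twice the size of L1D. Halving the test to 256 packets brings the data down to 4k bytes, which is why the second pass of the halved test is the interesting measurement.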
Reply by ●February 9, 2006
Charly,

--- feringlo <feringlo@feri...> wrote:
> > How are you measuring cycles? Have you verified your measurement
> > method, for example write a loop that iterates NOPs many times and get
> > some average figures?
>
> I measured the time using the low resolution clock, which interval is
> set to 10us. I am taking the difference betwen the time at which it
> leaves the optimized algorythm and the time at which it starts again
> in the next cycle. I'm not measuring the length of the execution
> itself, because the CLK interruptions are disabled during it.
>
> The algorythm is called to fill a buffer of 1280 samples (640 frames)
> to be outputted through a SIO stream to the codec at a rate of 44100
> Hz fps. Then, the optimized algorythm is called with a frequency of
> (44100 / 640) Hz which means an interval of 14500 us. The pipelined
> code is about 1800 execution cycles long. for each frame and should
> take aprox. 1800 * 640 / 225,000,000 = 5120 us,

I think that you need to review TI's docs on the 6713. Your assumption is too optimistic.

mikedunn

> letting us free about 14500 - 5120 = 9380 us (a little less tan that).
> The measurements gave us only 2220 us free, less than the half. The
> execution graph shows the same results, and the max. number of
> execution cycles is 2100, with an almost full use of the processor.
>
> Thanks for your time,
>
> Charly
Reply by ●February 9, 2006
Charly-

> I measured the time using the low resolution clock, which interval is set to 10us.
> I am taking the difference betwen the time at which it leaves the optimized
> algorythm and the time at which it starts again in the next cycle. I'm not
> measuring the length of the execution itself, because the CLK interruptions are
> disabled during it.

I would not do this. If you are depending on DSP/BIOS, RTDX, etc. to make cycle measurements, you are introducing extra factors, at least until you have a full understanding of your system and your measurement process.

The better approach is to enable TIMER1 (not TIMER0, which is used by DSP/BIOS), read its value prior to your test code section, then read it again after the section finishes. This method is insensitive to interrupts, RTDX, DMA -- just about anything except for stop-mode emulation. You will get super-precise figures, you can store an array (short history) of figures to determine what's happening with cache (re. Mike's comments), and you will not be subject to CCS + JTAG weirdnesses that might jump out to bite you.

-Jeff

> The algorythm is called to fill a buffer of 1280 samples (640 frames) to be
> outputted through a SIO stream to the codec at a rate of 44100 Hz fps. Then, the
> optimized algorythm is called with a frequency of (44100 / 640) Hz which means an
> interval of 14500 us. The pipelined code is about 1800 execution cycles long. for
> each frame and should take aprox. 1800 * 640 / 225,000,000 = 5120 us, letting us
> free about 14500 - 5120 = 9380 us (a little less tan that). The measurements gave
> us only 2220 us free, less than the half. The execution graph shows the same
> results, and the max. number of execution cycles is 2100, with an almost full use
> of the processor.
>
> Charly
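The bookkeeping behind the read-before / read-after approach can be modelled on the host in Python. This is a sketch of the arithmetic only, not DSP code, and it makes several assumptions that are not in the thread: the timer is a free-running 32-bit up-counter (period register at its maximum), it ticks once every 4 CPU cycles (check the device data sheet), and the names `elapsed_ticks`, `record_run` and `HISTORY_LEN` are hypothetical.

```python
TIMER_MASK = 0xFFFFFFFF      # assumption: 32-bit free-running counter
CPU_CYCLES_PER_TICK = 4      # assumption: timer input clock = CPU clock / 4
HISTORY_LEN = 16             # hypothetical length of the "short history"


def elapsed_ticks(start, end):
    """Ticks between two readings of a free-running 32-bit up-counter.

    Masking keeps the result correct across a single counter wrap.
    """
    return (end - start) & TIMER_MASK


def record_run(history, start, end, overhead_ticks=0):
    """Append one measurement in CPU cycles, keeping the newest HISTORY_LEN.

    overhead_ticks is the cost of the two timer reads themselves, as found
    by timing an empty section first.
    """
    ticks = elapsed_ticks(start, end) - overhead_ticks
    history.append(ticks * CPU_CYCLES_PER_TICK)
    del history[:-HISTORY_LEN]   # drop the oldest entries beyond the limit
    return history[-1]


history = []
record_run(history, 1000, 1205)           # 205 ticks
record_run(history, 0xFFFFFFF0, 0x10)     # reading taken across a wrap
print(history)
```

Under the divide-by-4 assumption the resolution is 4 CPU cycles, so a figure such as 820 cycles would show up as 205 ticks. On the target the two readings would come from the TIMER1 count register, taken immediately before and after the code under test.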
Reply by ●February 10, 2006
Jeff,

I guess there could be some cache misses initially. I am not familiar with the C6713 specifically.

I once measured a higher cycle count than the worst-case hand-counted figure for a certain algorithm on a particular DSP. I then found that this was because of cache misses: each miss costs some number of cycles, and these add up and show as a high cycle count.

You could consider this angle as well. Sorry if I have strayed from your main problem. Anyone, please correct me if I am wrong!

Thanks,
Pavan Kumar
Reply by ●February 10, 2006
Pavan-
> Guess there can some Cache misses initially ..i am not aware
> of C6713 stuff....
>
> actually i even got more cycle count that the worst case hand
> counted cycles of a certain algorithm ,for a particular DSP
> then i realised and found that its bcoz of Cache misses and for
> each cache miss there would be some amount of cycles used .which
> would add up and reflect as our high cycle count
This is partly why I suggested to Charly to use the onchip TIMER1, and code a
short array that keeps track of execution time in successive loops through the
code. If cache plays a role, then it will jump out when Charly plots the time
history.
-Jeff
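One way to read such a time history, once it has been copied off the target, is to compare the first pass with the later ones. The Python sketch below is only an illustration: `summarize_history` is a hypothetical helper, and the sample numbers are made up to show the shape of the output, not measurements from this thread.

```python
def summarize_history(cycles):
    """Compare the first (cold) pass with the later (warm) passes.

    A first pass that is much slower than the rest points at one-time costs
    such as cache fills; warm passes that stay high point at something that
    repeats on every pass.
    """
    if len(cycles) < 2:
        raise ValueError("need at least two passes to compare")
    cold, warm = cycles[0], cycles[1:]
    return {
        "cold": cold,
        "warm_min": min(warm),
        "warm_max": max(warm),
        "cold_penalty": cold - min(warm),
    }


# Made-up history for illustration only.
example = [1400, 830, 824, 828, 824]
print(summarize_history(example))
```

If the warm passes stay well above the hand-counted figure, the extra cycles repeat on every pass, which is consistent with data that does not fit in L1D and misses each time through.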