DSPRelated.com
Forums

Function Profile Debug giving lower cycle count

Started by piyush kaul May 21, 2004
Hi All,

When I compile my code with "function profile debug", I
seem to get a lower overall cycle count than with "no
debug info". This is totally unexpected, since some
optimizations should be turned off for it. I have
measured the cycles both with the clock and with the
timer registers, and consistently found it so.
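
As an illustration of the clock-based measurement mentioned above, here is a minimal sketch in C. It assumes the standard clock() from <time.h> is available and that its tick is fine enough to be useful on the target, which depends on the run-time support and setup. measure_ticks() and sample_work() are hypothetical names, not part of the original code.

```c
#include <assert.h>
#include <time.h>

typedef void (*work_fn)(void);

/* Hypothetical stand-in for the function under test. */
static void sample_work(void)
{
    volatile int acc = 0;
    int i;
    for (i = 0; i < 1000; i++)
        acc += i;
}

/* Elapsed clock() ticks for one call of fn, with the cost of a
   back-to-back clock() pair subtracted. The result is clamped at
   zero because the subtraction can go slightly negative when the
   clock is coarse. */
static long measure_ticks(work_fn fn)
{
    clock_t t0, t1;
    long overhead, elapsed;

    assert(fn != 0);

    /* Cost of the measurement itself. */
    t0 = clock();
    t1 = clock();
    overhead = (long)(t1 - t0);

    /* Cost of the call under test. */
    t0 = clock();
    fn();
    t1 = clock();
    elapsed = (long)(t1 - t0) - overhead;

    return elapsed < 0 ? 0 : elapsed;
}
```

Calling measure_ticks(sample_work) once per build configuration gives numbers that can be compared directly, as long as both builds run on the same input.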

This does not happen if I compile the code with "dwarf
debug info" or "full debug info". In those two cases
the cycle count is higher than in the "no debug
info" case, which is as expected.

I have checked the generated asm and it looks
very similar, especially for the linear asm code.

Has anybody else seen a similar result, and is there
any possible explanation for it?

Regards
Piyush

=====
**************************************
And---"A blind Understanding!" Heav'n replied.

Piyush Kaul
http://www.geocities.com/piyushkaul

__________________________________



Is it possible for you to share the code with a wider audience?
Please also provide more info about your compiler flags.

Regds
JS

_____________________________________
Note: If you do a simple "reply" with your email client, only the author
of this message will receive your answer. You need to do a "reply all"
if you want your answer to be distributed to the entire group.

_____________________________________
About this discussion group:

To Join: Send an email to

To Post: Send an email to

To Leave: Send an email to

Archives: http://www.yahoogroups.com/group/c6x

Other Groups: http://www.dsprelated.com

Yahoo! Groups Links



Hi Sankaran,

The code I was talking about is an MPEG-4 ASP decoder.
It might not be possible to share the entire code, but
I have extracted a single function, implemented both
in C and in l-asm, which you can compile with and
without function profile debug to see the anomaly.
There is a difference of 9 cycles between the two
compilation modes. For the entire decoder the
difference is about 20%.
You can see the flags in the attached project file.

Regards
Piyush

PS: I hope nobody has a problem with attaching zip
files on this newsgroup. The size is pretty small at
3K.


Attachment (not stored)
test.zip
Type: application/x-zip-compressed



A possible *informal* explanation of this "phenomenon" :)

I emphasise *informal* quite on purpose, since I won't
add a single formula to it.

The original problem is to find an optimal schedule for a given
linear assembly or C code.

This problem is reduced (read this as "reformulated", since the
reduction decreases neither the complexity nor the irregularity
of the original problem) to a problem of constrained optimization
(finding a minimum of a function on a constrained support set) of
a *discrete* target function with locally changing constraints.

Discrete functions are very difficult to minimize, since they
do not have derivatives, unlike continuous functions,
for which quickly convergent Newton's methods can be applied.

Thus, constrained minimization of a discrete function is a
difficult task on its own. Further, the behaviour of the target
function is usually unknown. The function can have a number
of local extrema, so the task is to find a global minimum
over a set of local minima that are spaced very irregularly.
And this is the answer to your original question: two different
schedules (i.e. minima) were found for the two settings, with
and without profile information, where the first minimum happens
to be smaller than the second.
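
The point about local minima can be shown with a toy example. The sketch below is not how the TI scheduler works. It only assumes a made-up discrete cost table (think of hypothetical schedule lengths) and a greedy search that steps to a cheaper neighbour until none exists. The names cost, COST_LEN and local_descent are invented for the illustration.

```c
#include <assert.h>

/* Made-up discrete cost function: hypothetical "schedule lengths"
   indexed by an abstract search position. */
static const int cost[] = { 40, 38, 35, 37, 39, 36, 31, 33, 34 };
#define COST_LEN ((int)(sizeof cost / sizeof cost[0]))

/* Greedy local descent: move to the cheaper neighbour until neither
   neighbour is cheaper, then return the index reached. */
static int local_descent(int start)
{
    int i = start;

    assert(start >= 0 && start < COST_LEN);

    for (;;) {
        int best = i;
        if (i > 0 && cost[i - 1] < cost[best])
            best = i - 1;
        if (i + 1 < COST_LEN && cost[i + 1] < cost[best])
            best = i + 1;
        if (best == i)
            return i;   /* local minimum reached */
        i = best;
    }
}
```

Starting from index 0 the search stops at index 2 (cost 35), while starting from index 8 it stops at index 6 (cost 31). The search procedure is identical in both runs. Only the starting conditions differ, which is the situation described for the two compiler settings.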

Once again, I have not given any mathematical treatment here, because
the pages here are too small for it :)

Rgds,
Andrew



I am on vacation but will take a look at it as soon as I can.
Nonetheless I will get back to you on the problem. Also please
let me know the version of the tools you have been using.

Regds
JS



Piyush,
First of all, thanks for the detailed inputs. I compiled the files
with version 4.31 of the tools. I am attaching the scheduled output
from the compiler as a reference, in case your tools do not produce
the same. The resulting code produces a 7-cycle loop, which means
that for eight iterations you should get about 56 cycles in the core
inner loop + prolog + epilog + setup overhead. I measured 91 cycles
for the ASM code and 528 cycles for the C code. Please check the
compiled output of the serial assembly to make sure you get a similar
software-pipelined loop.
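
The arithmetic behind that estimate can be written down as a small sketch. It assumes the simple model used above, where the kernel costs the loop length times the number of iterations and everything else is lumped together as overhead. kernel_cycles() and implied_overhead() are hypothetical helper names. The figure of 35 is just 91 - 56 and was not reported separately in the thread.

```c
#include <assert.h>

/* Kernel cycles of a software-pipelined loop under the simple model:
   one iteration completes every `ii` cycles. */
static int kernel_cycles(int ii, int trip_count)
{
    assert(ii > 0 && trip_count >= 0);
    return ii * trip_count;
}

/* What remains of a measured total once the kernel is accounted for:
   prolog, epilog, setup and call/return, all lumped together. */
static int implied_overhead(int measured_total, int ii, int trip_count)
{
    return measured_total - kernel_cycles(ii, trip_count);
}
```

With the numbers quoted, kernel_cycles(7, 8) gives 56 and implied_overhead(91, 7, 8) gives 35, so a little over a third of the measured ASM time sits outside the kernel.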

One alternative on the C64x, given the number of registers it has,
is to completely unroll the loop and perform the computations of all
eight rows in parallel, and to run this modified loop for as many
half-pel interpolation cases as you may have, by building such a
worklist ahead of time and calling this function once.

Regds
Jagadeesh Sankaran

Disclaimer:
The comments in this e-mail are solely my own opinions and do not imply
any written consent or permission from Texas Instruments. The views and
opinions in this e-mail are solely my own and do not constitute any
approval from Texas Instruments.

Attachment (not stored)
profile.zip
Type: application/x-zip-compressed

Attachment (not stored)
testHandAsm.asm
Type: application/octet-stream


Hi Sankaran, Andrew,

I am using tools version 4.32.
The problem I mentioned can be seen only for the C version
of the sample function InterpolateBlockOddEven that I sent
(though the same problem exists in l-asm too for some other
functions). When you compile and run the program, you should
see that the C version takes 8-9 cycles less with "function
profile debug" than with the corresponding "no debug" build.

When I count the cycles from the generated asm for the
function InterpolateBlockOddEven, I see the following
breakdown:

For "Function Profile Debug"
1. Initialization+Prolog = 4 cycles
2. Kernel = 14 cycles (14 scd./3 in ||)
3. Epilog+Exit = 13 cycles

For "No Debug"
1. Initialization+Prolog = 7 cycles
2. Kernel = 14 cycles (14 scd./3 in ||)
3. Epilog+Exit = 18 cycles

What I noticed is that the "No Debug" version seems to
avoid using parallelism in some places where the "function
profile debug" version uses it.
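
The two breakdowns above can be compared section by section. The sketch below is only bookkeeping over the figures quoted. The struct and function names are hypothetical. Because the kernel figure is the same in both builds, it cancels out of the difference however many times the kernel runs.

```c
#include <assert.h>

/* Cycle counts per section, as read off the generated asm. */
struct cycle_breakdown {
    int init_prolog;
    int kernel;
    int epilog_exit;
};

static const struct cycle_breakdown FP_DEBUG = { 4, 14, 13 };
static const struct cycle_breakdown NO_DEBUG = { 7, 14, 18 };

/* Section-by-section difference a - b, summed over all sections. */
static int cycle_delta(struct cycle_breakdown a, struct cycle_breakdown b)
{
    int d = (a.init_prolog - b.init_prolog)
          + (a.kernel - b.kernel)
          + (a.epilog_exit - b.epilog_exit);
    assert(a.kernel >= 0 && b.kernel >= 0);
    return d;
}
```

cycle_delta(NO_DEBUG, FP_DEBUG) comes to 8 (3 cycles from Initialization+Prolog plus 5 from Epilog+Exit), which is in line with the 8-9 cycle difference seen when running the program.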

Moreover, I noticed that the RET instruction is issued
5 cycles earlier in the "function profile debug" version,
so no NOPs are required. In the "No Debug" version,
RETNOP 5 is issued right at the end.
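
The effect of issuing the return early can be modelled in a few lines. C6000 branches have five delay slots, so a branch issued in the last packet of a function costs five extra cycles unless other work fills those slots. The sketch below is a simplified model, not output from the tools. tail_cycles() is a hypothetical name, and the work length of 13 is chosen only so that the results line up with the Epilog+Exit figures quoted. It was not read from the generated asm.

```c
#include <assert.h>

/* A C6000 branch takes effect 5 cycles after it is issued. */
#define BRANCH_DELAY_SLOTS 5

/* Cycles spent in a function tail holding `work` cycles of useful
   execute packets (numbered 0 .. work-1) when the return branch is
   issued in packet `branch_at`. Control leaves after the later of
   the last work packet and the last delay slot. */
static int tail_cycles(int work, int branch_at)
{
    int after_branch;

    assert(work > 0 && branch_at >= 0 && branch_at < work);

    after_branch = branch_at + 1 + BRANCH_DELAY_SLOTS;
    return after_branch > work ? after_branch : work;
}
```

Under this model tail_cycles(13, 7) is 13, because the delay slots are fully hidden behind useful work, while tail_cycles(13, 12) is 18, which corresponds to a return in the last packet followed by five NOP cycles. The 5-cycle gap matches the difference between the two Epilog+Exit counts, though that alone does not explain why the compiler placed the return differently.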

So one possible inference is that the difference is
not because of software pipelining but because of some
dependency miscalculation. Is this right? What could be
the cause, and is there a solution?

Regards
Piyush

PS: Sankaran, the linear asm code has the inner loop
fully unrolled, i.e. the entire row is worked on in the
same iteration. However, the generated asm is almost
identical for both "function profile debug" and
"no debug", so the problem is not visible for it.




Slowing down of the code due to compiling in debug is to be
expected. As for unrolling, I suggest you do away with the
loop that you have and write the code for all 8 rows explicitly.
Then software pipeline across multiple such motion vectors
that need half-pel interpolation.

Regds
JS




Hi Sankaran,

That is what I was also saying: slowing down is
expected for debug, not for "No Debug". Yet the code is
faster for "FP Debug" than for "No Debug".

I would expect the function to return with RETNOP 5
for function profile debug, to get clearly
demarcated functions, but it is happening the other
way round.

I strongly suspect a bug in the
compiler/assembler, though it seems improbable that it
would have escaped notice till now.

I am attaching the generated asm for the same function
with and without the -gp option. Please have a look at the
RET instruction at the end of the function in both.

As for the total unroll across all the rows, I think I would
run out of registers. I am currently using 12 registers
for a single row. For 8 rows that would become 96, which
is clearly not possible. I foresee that reusing the same
registers across rows would cause problems in software
pipelining w.r.t. loop carry paths, as it usually does.
Please let me know if you think otherwise.

Regards
Piyush --- sankaran <> wrote:
> Slowing down of the code due to compiling in debug
> is to be
> expected. As far as unrolling I suggest you do away
> with the
> loop that you have and write the code for all 8 rows
> explicitly.
> Then software pipeline across multiple such motion
> vectors
> that need half-pel interpolation.
>
> Regds
> JS
>
> -----Original Message-----
> From: piyush kaul [mailto:]
> Sent: Monday, May 24, 2004 1:14 AM
> To: sankaran; Andrew Nesterov;
> Subject: RE: [c6x] Function Profile Debug giving
> lower cycle count
>
> Hi Sankaran,Andrew
>
> I am using the tools version 4.32.
> The problem which I mentioned can be seen only for
> the C version of sample function
> InterpolateBlockOddEven I had sent.(thought the same
> problem exists in l-asm too for some other
> functions). When you compile and run the program,
> you
> may see that the C version gives 8-9 cycles less for
> "function profile debug" than for the corrsponding
> "without debug" version.
>
> When I count the cycles from the generated asm for
> the
> function IntepolateBlockOddEven I see the following
> cycle count division
>
> For "Function Profile Debug"
> 1. Initialization+Prolog = 4 cycles
> 2. Kernel = 14 cycles(14 scd./3 in
> ||)
>
> 3. Epilog+Exit = 13 cycles
>
> For "No Debug"
> 1. Initialization+Prolog = 7 cycles
> 2. Kernel = 14 cycles(14 scd./3 in
> ||)
> 3. Epilog+Exit = 18 cycles > What I noticed is that "No Debug" version seems to
> avoid using parallism in some places where "function
> profile" debug uses them.
>
> Moreover, I noticed that the RET instruction is
> being
> called 5 cycles before in the "function profile
> debug" version, thus requiring no NOPs. For the "No
> Debug" version the RETNOP 5, is being called right
> at
> the end.
>
> So one possible inference is that the difference is
> not because of of software pipelining but some
> dependency miscalculation. Is this right. What can
> be the possible cause,solution for this.
>
> Regards
> Piyush
>
> PS: Sankaran, the linear asm code has the inner loop
> fully unrolled i.e. the entire row is being worked
> upon in same iteration. The generated asm is however
> almost identical for both "function profile debug"
> and
> "no debug", so the problem is not visible for it.
>
> --- sankaran <> wrote:
> > Piyush,
> > First of all thanks for the detailed
> inputs.
> > I compiled the files
> > with vfersion 4.31 tools, I am attaching the
> > scheduled output from the
> > compiler as a reference, in case your tools do not
> > produce the same. The
> > resulting code produces a 7 cycle loop, which
> means
> > that for eight
> > iterations you should get about 56 cycles in core
> > inner loop + prolog
> > + epilog + setup overhead. I measured 91 cycles
> for
> > ASM code and C: 528
> > cycles. Please check compiled output of serial
> > assembly to make sure you
> > get a similar software pipelined loop.
> >
> > One alternative on the C64x, given the number of
> > registers it has, is to completely unroll the loop and
> > perform the computations of all eight rows in parallel,
> > and to run this modified loop for as many half-pel
> > interpolation cases as you may have, by building such a
> > worklist ahead of time and calling this function once.
> >
> > Regds
> > Jagadeesh Sankaran
> >
> > Disclaimer:
> > The comments in this e-mail are solely my own
> > opinions and do not imply
> > any written consent or permission from Texas
> > Instruments. The views and
> > opinions in this e-mail are solely my own and do
> not
> > constitute any
> > approval from Texas Instruments.
> >
> > -----Original Message-----
> > From: piyush kaul [mailto:]
> > Sent: Saturday, May 22, 2004 5:02 AM
> > To: sankaran;
> > Subject: RE: [c6x] Function Profile Debug giving
> > lower cycle count
> >
> > Hi Sankaran,
> >
> > The code I was talking about is an MPEG-4 ASP decoder.
> > It might not be possible to share the entire code, but
> > I have extracted a single function, implemented both
> > in C and l-asm, which you can compile with and
> > without function profile debug to see the anomaly.
> > There is a difference of 9 cycles between the two
> > compilation modes. For the entire decoder the
> > difference is about 20%.
> > You can see the flags in the attached project file.
> >
> > Regards
> > Piyush
> >
> > PS: I hope nobody has a problem with attaching zip
> > files on this newsgroup. The size is pretty small at
> > 3K.
> >
> >
> > --- sankaran <> wrote:
> > > Is it possible for you to share the code with a
> > > wider audience?
> > > Also, please provide more info about your compiler flags.
> > >
> > > Regds
> > > JS
> > >
> > > -----Original Message-----
> > > From: piyush kaul [mailto:]
> > > Sent: Friday, May 21, 2004 4:23 AM
> > > To:
> > > Subject: [c6x] Function Profile Debug giving
> lower
> > > cycle count
> > >
> > > Hi All,
> > >
> > > When I compile my code by "function profile debug" I
> > > seem to get lower overall cycle count than with "no
> > > debug info". This is totally unexpected, since some
> > > optimizations should be turned off for it. I have
> > > tried to measure the cycles both with the clock and
> > > using timer registers and consistently found it so.
> > >
> > > This does not happen if I compile the code with "dwarf
> > > debug info" or "full debug info". In those two cases
> > > the cycle count is higher than that for the "no debug
> > > info" case, which is very much on expected lines.
>
=== message truncated ===

=====
**************************************
And---"A blind Understanding!" Heav'n replied.

Piyush Kaul
http://www.geocities.com/piyushkaul

__________________________________

Attachment (not stored)
testAsm.zip
Type: application/x-zip-compressed


The fact that "no debug" runs slower is indeed strange. As for
unrolling the rows, I would still suggest you try it, because register
allocation is not single-assignment: multiple values can share the same
register over time. So try it! I have written similar unrolled code for
the 8x89 interpolation case for MPEG-2.
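
A minimal C sketch of the structure being suggested, for illustration only. It assumes a plain horizontal half-pel filter with rounding, (a + b + 1) >> 1, over an 8x8 block; the names halfpel_job, halfpel_row and halfpel_worklist are hypothetical and do not come from the attached code. The eight rows are written out explicitly and the only explicit loop left in the function runs over a worklist of blocks. Temporaries of one row are dead before the next row starts, so the register allocator is free to reuse the same registers.

```c
#include <stdint.h>

/* Hypothetical job descriptor: one 8x8 block needing half-pel interpolation. */
typedef struct {
    const uint8_t *src;  /* top-left of a 9-pixel-wide source region */
    uint8_t       *dst;  /* top-left of the 8x8 destination block */
} halfpel_job;

/* One row: 9 source pixels in, 8 rounded averages out.
   The fixed trip count of 8 is something an optimizing compiler may unroll. */
static inline void halfpel_row(const uint8_t *src, uint8_t *dst)
{
    for (int x = 0; x < 8; x++)
        dst[x] = (uint8_t)((src[x] + src[x + 1] + 1) >> 1);
}

/* The row loop is written out explicitly; the outer loop runs over the
   worklist of blocks. Whether a given compiler software pipelines that
   outer loop depends on it fully unrolling the inner row code, which is
   not guaranteed. */
void halfpel_worklist(const halfpel_job *jobs, int njobs,
                      int src_stride, int dst_stride)
{
    for (int j = 0; j < njobs; j++) {
        const uint8_t *s = jobs[j].src;
        uint8_t *d = jobs[j].dst;

        halfpel_row(s + 0 * src_stride, d + 0 * dst_stride);
        halfpel_row(s + 1 * src_stride, d + 1 * dst_stride);
        halfpel_row(s + 2 * src_stride, d + 2 * dst_stride);
        halfpel_row(s + 3 * src_stride, d + 3 * dst_stride);
        halfpel_row(s + 4 * src_stride, d + 4 * dst_stride);
        halfpel_row(s + 5 * src_stride, d + 5 * dst_stride);
        halfpel_row(s + 6 * src_stride, d + 6 * dst_stride);
        halfpel_row(s + 7 * src_stride, d + 7 * dst_stride);
    }
}
```

The caller would fill an array of halfpel_job entries ahead of time, one per block that needs half-pel interpolation, and call halfpel_worklist once for the whole list.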

Regds
JS

-----Original Message-----
From: piyush kaul [mailto:]
Sent: Monday, May 24, 2004 2:35 AM
To: sankaran; 'Andrew Nesterov';
Subject: RE: [c6x] Function Profile Debug giving lower cycle count

Hi Sankaran,

That is what I was also saying: slowing down is
expected for debug, not for "No Debug". Here the code is
faster for "FP Debug" than for "No Debug".

I would expect the function to return with RETNOP 5
for function profile debug, to get clearly
demarcated functions, but it is happening the other
way round.

I strongly suspect a bug in the
compiler/assembler, though it seems improbable that it
would have escaped notice until now.

I am attaching the generated asm for the same function
with and without -gp option. Please have a look at the
RET instruction at the end of the function for both.

As for fully unrolling all the rows, I think I would
run out of registers. I am currently using 12 registers
for a single row. For 8 rows that would become 96, which
is clearly not possible. I foresee that reusing the same
registers across rows would cause problems in software
pipelining w.r.t. loop carry paths, as it usually does.
Please let me know if you think otherwise.
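
To put rough numbers on the register concern, here is a small sketch of the arithmetic. It assumes the C64x's 64 general-purpose registers (A0-A31 and B0-B31); the function names and the idea of a "reserved" count (pointers, loop counter, stack pointer and so on) are illustrative assumptions, not figures from the compiler.

```c
/* C64x register file size: 32 registers on each of the A and B sides. */
enum { C64X_GP_REGISTERS = 64 };

/* Registers needed if every row keeps its own private set (no reuse). */
int naive_register_demand(int regs_per_row, int rows)
{
    return regs_per_row * rows;
}

/* How many rows can be live at once if `reserved` registers are set
   aside for other purposes (a hypothetical figure chosen by the caller). */
int max_rows_in_flight(int regs_per_row, int reserved)
{
    return (C64X_GP_REGISTERS - reserved) / regs_per_row;
}
```

With 12 registers per row, naive_register_demand(12, 8) gives the 96 mentioned above, which exceeds 64. If, say, 8 registers were reserved, max_rows_in_flight(12, 8) would allow 4 rows to be live at the same time, so any full unroll has to rely on registers being reused once a row's values are dead.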

Regards
Piyush

--- sankaran <> wrote:
> Slowing down of the code due to compiling in debug
> is to be expected. As far as unrolling goes, I suggest
> you do away with the loop that you have and write the
> code for all 8 rows explicitly. Then software pipeline
> across multiple such motion vectors that need half-pel
> interpolation.
>
> Regds
> JS
>
> -----Original Message-----
> From: piyush kaul [mailto:]
> Sent: Monday, May 24, 2004 1:14 AM
> To: sankaran; Andrew Nesterov;
> Subject: RE: [c6x] Function Profile Debug giving
> lower cycle count
>
> Hi Sankaran, Andrew
>
> I am using tools version 4.32.
> The problem I mentioned can be seen only for
> the C version of the sample function
> InterpolateBlockOddEven I had sent (though the same
> problem exists in l-asm too for some other
> functions). When you compile and run the program, you
> may see that the C version takes 8-9 cycles less for
> "function profile debug" than for the corresponding
> "without debug" version.
>
> When I count the cycles from the generated asm for the
> function InterpolateBlockOddEven, I see the following
> cycle count breakdown:
>
> For "Function Profile Debug"
> 1. Initialization+Prolog = 4 cycles
> 2. Kernel = 14 cycles (14 scd./3 in ||)
> 3. Epilog+Exit = 13 cycles
>
> For "No Debug"
> 1. Initialization+Prolog = 7 cycles
> 2. Kernel = 14 cycles (14 scd./3 in ||)
> 3. Epilog+Exit = 18 cycles
>
> What I noticed is that the "No Debug" version seems to
> avoid using parallelism in some places where the
> "function profile debug" version uses it.
>
> Moreover, I noticed that the RET instruction is
> issued 5 cycles earlier in the "function profile
> debug" version, thus requiring no NOPs. For the "No
> Debug" version, RETNOP 5 is issued right at the end.
>
> So one possible inference is that the difference is
> not because of software pipelining but because of some
> dependency miscalculation. Is this right? What could
> be the possible cause of, and solution for, this?
>
> Regards
> Piyush
>
> PS: Sankaran, the linear asm code has the inner loop
> fully unrolled, i.e. the entire row is worked on in
> the same iteration. The generated asm is, however,
> almost identical for both "function profile debug"
> and "no debug", so the problem is not visible for it.
>