Hi All,

When I compile my code with "function profile debug" I seem to get a lower overall cycle count than with "no debug info". This is totally unexpected, since some optimizations should be turned off in that mode. I have measured the cycles both with the clock and with the timer registers, and the result is consistently the same.

This does not happen if I compile the code with "dwarf debug info" or "full debug info". In those two cases the cycle count is higher than for the "no debug info" case, which is very much along expected lines.

I have checked the generated asm and it looks very similar, especially for the linear asm code.

Has anybody else seen a similar result, and is there any possible explanation for it?

Regards
Piyush

=====
And---"A blind Understanding!" Heav'n replied.

Piyush Kaul
http://www.geocities.com/piyushkaul
Function Profile Debug giving lower cycle count
Started by ●May 21, 2004
Reply by ●May 22, 2004
Is it possible for you to share the code with a wider audience? Also, please provide more information about your compiler flags.

Regds
JS

_____________________________________
Note: If you do a simple "reply" with your email client, only the author of this message will receive your answer. You need to do a "reply all" if you want your answer to be distributed to the entire group.
_____________________________________
About this discussion group:
To Join: Send an email to
To Post: Send an email to
To Leave: Send an email to
Archives: http://www.yahoogroups.com/group/c6x
Other Groups: http://www.dsprelated.com
Reply by ●May 22, 2004
Hi Sankaran,

The code I was talking about is an MPEG-4 ASP decoder. It might not be possible to share the entire code, but I have extracted a single function, implemented in both C and linear assembly (l-asm), which you can compile with and without function profile debug to see the anomaly. There is a difference of 9 cycles between the two compilation modes. For the entire decoder the difference is about 20%.

You can see the flags in the attached project file.

Regards
Piyush

PS: I hope nobody has a problem with attaching zip files on this newsgroup. The size is pretty small at 3K.
Reply by ●May 23, 2004
A possible *informal* explanation of this "phenomenon" :) I emphasise *informal* quite on purpose, since I won't add a single formula to it.

The original problem is to find an optimal schedule for a given piece of linear assembly or C code. This problem is reduced (read this as "reformulated", since the reduction decreases neither the complexity nor the irregularity of the original problem) to a problem of constrained optimization (finding a minimum of a function on a constrained support set) of a *discrete* target function with locally changing constraints.

Discrete functions are very difficult to minimize, since they do not have derivatives, unlike continuous functions, for which quickly convergent Newton's methods can be applied. Thus, constrained minimization of a discrete function is a difficult task on its own.

Further, the behaviour of the target function is usually unknown. The function can have a number of local extrema, so the task is to find a global minimum over a set of local minima that are spaced very irregularly.

And this is the answer to your original question: two different schedules (i.e. minima) were found for the two settings, with and without profile information, and the first minimum happened to be smaller than the second.

Once again, I did not give any mathematical treatment here, because the size of the pages here is too small for it :)

Rgds,
Andrew
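The argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `COST` table is invented for illustration, the two starting points merely stand in for "two compiler settings", and nothing here reflects how the TI scheduler actually searches. It shows a greedy descent over a discrete cost function stopping at different local minima depending on where it starts.

```python
# Hypothetical cost (in cycles) of ten candidate schedules; values are made up.
COST = [12, 10, 11, 13, 12, 9, 7, 8, 11, 14]


def local_descent(cost, start):
    """Greedy descent on a discrete function.

    Repeatedly move to the cheaper of the two neighbouring candidates and
    stop as soon as neither neighbour is strictly cheaper.
    Returns (index, cost) of the local minimum that was reached.
    """
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(cost)]
        best = min(neighbours, key=lambda j: cost[j])
        if cost[best] >= cost[i]:
            return i, cost[i]
        i = best


# Two different starting points stand in for two different compiler settings.
setting_a = local_descent(COST, start=0)
setting_b = local_descent(COST, start=4)
print(setting_a, setting_b)
```

With this table the first search stops at a cost of 10 and the second at 7, even though both minimise the same function. A small change in the starting conditions is enough to land in a different, and possibly better, local minimum.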
Reply by ●May 23, 2004
I am on vacation, but I will take a look at it as soon as I can and get back to you on the problem. Also, please let me know the version of the tools you have been using.

Regds
JS
Reply by ●May 23, 2004
Piyush,

First of all, thanks for the detailed inputs. I compiled the files with version 4.31 of the tools. I am attaching the scheduled output from the compiler as a reference, in case your tools do not produce the same. The resulting code produces a 7-cycle loop, which means that for eight iterations you should get about 56 cycles in the core inner loop, plus prolog, epilog and setup overhead. I measured 91 cycles for the ASM code and 528 cycles for the C code. Please check the compiled output of the serial assembly to make sure you get a similar software-pipelined loop.

One alternative on the C64x, given the number of registers it has, is to completely unroll the loop and perform the computations of all eight rows in parallel, and then run this modified loop for as many half-pel interpolation cases as you may have, by building such a worklist ahead of time and calling this function once.

Regds
Jagadeesh Sankaran

Disclaimer:
The comments in this e-mail are solely my own opinions and do not imply any written consent or permission from Texas Instruments. The views and opinions in this e-mail are solely my own and do not constitute any approval from Texas Instruments.
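The estimate in the reply above is simple arithmetic, sketched here with the numbers quoted in that message. The function and variable names are made up for illustration, and the "overhead" figure is just the remainder after subtracting the kernel estimate, so it may also include measurement overhead.

```python
def pipelined_kernel_cycles(iteration_interval, trip_count):
    """Rough kernel cost of a software-pipelined loop: one new iteration
    is started every `iteration_interval` cycles."""
    return iteration_interval * trip_count


core = pipelined_kernel_cycles(7, 8)    # 7-cycle loop, eight iterations

measured_asm = 91                       # cycles measured for the ASM version
measured_c = 528                        # cycles measured for the C version

overhead = measured_asm - core          # prolog + epilog + setup (+ measurement)
c_to_asm_ratio = measured_c / measured_asm

print(core, overhead, round(c_to_asm_ratio, 1))
```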
Reply by ●May 24, 2004
Hi Sankaran, Andrew

I am using tools version 4.32.

The problem I mentioned can be seen only in the C version of the sample function InterpolateBlockOddEven that I sent (though the same problem exists in l-asm too for some other functions). When you compile and run the program, you should see that the C version takes 8-9 cycles less with "function profile debug" than the corresponding "without debug" version.

When I count the cycles from the generated asm for the function InterpolateBlockOddEven, I see the following breakdown.

For "Function Profile Debug":
1. Initialization+Prolog = 4 cycles
2. Kernel = 14 cycles (14 scd./3 in ||)
3. Epilog+Exit = 13 cycles

For "No Debug":
1. Initialization+Prolog = 7 cycles
2. Kernel = 14 cycles (14 scd./3 in ||)
3. Epilog+Exit = 18 cycles

What I noticed is that the "No Debug" version seems to avoid using parallelism in some places where the "function profile debug" version uses it.

Moreover, I noticed that the RET instruction is issued 5 cycles earlier in the "function profile debug" version, so no NOPs are required. In the "No Debug" version, RETNOP 5 is issued right at the end.

So one possible inference is that the difference is not caused by software pipelining but by some dependency miscalculation. Is this right? What could be the possible cause of, and solution for, this?

Regards
Piyush

PS: Sankaran, the linear asm code has the inner loop fully unrolled, i.e. the entire row is worked on in the same iteration. The generated asm is, however, almost identical for both "function profile debug" and "no debug", so the problem is not visible for it.
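The per-phase counts in the message above can be tallied to see where the gap comes from. This is a minimal sketch that uses only the figures quoted in that message; the dictionary keys and helper names are made up for illustration.

```python
# Cycle counts per phase, copied from the message above.
FP_DEBUG = {"init_prolog": 4, "kernel": 14, "epilog_exit": 13}
NO_DEBUG = {"init_prolog": 7, "kernel": 14, "epilog_exit": 18}


def total(phases):
    """Total cycles for one build: the sum over all phases."""
    return sum(phases.values())


def per_phase_delta(faster, slower):
    """Extra cycles the slower build spends in each phase."""
    return {name: slower[name] - faster[name] for name in faster}


fp_total = total(FP_DEBUG)
nd_total = total(NO_DEBUG)
delta = per_phase_delta(FP_DEBUG, NO_DEBUG)

print(fp_total, nd_total, nd_total - fp_total)
print(delta)
```

The totals come to 31 and 39 cycles, a gap of 8 that is in line with the 8-9 cycles reported. The kernel contributes nothing to the gap: 3 cycles come from initialization and prolog, and 5 from epilog and exit. A branch on the C6000 has five delay slots, so a 5-cycle difference at the exit is at least consistent with the observation about where the return is placed, although this tally alone does not prove that is the cause.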
Reply by ●May 24, 2004
Slowing down of the code due to compiling in debug is to be expected. As far as unrolling goes, I suggest you do away with the loop that you have and write the code for all 8 rows explicitly. Then software pipeline across multiple such motion vectors that need half-pel interpolation.

Regds
JS
Reply by ●May 24, 2004
Hi Sankaran,

That is what I was saying as well: slowing down is expected for debug, not for "No Debug". Yet the code is faster for "FP Debug" than for "No Debug".

I would expect the function to return with RETNOP 5 under function profile debug, to get clearly demarcated functions, but it is happening the other way round. I strongly suspect a bug in the compiler/assembler, though it seems improbable that it would have escaped notice until now.

I am attaching the generated asm for the same function with and without the -gp option. Please have a look at the RET instruction at the end of the function in both.

As for fully unrolling all the rows, I think I would run out of registers. I am currently using 12 registers for a single row. For 8 rows that would become 96, which is clearly not possible. I foresee that reusing the same registers across rows would cause problems in software pipelining with respect to loop-carried paths, as it usually does.

Please let me know if you think otherwise.

Regards
Piyush
| |||
|
|
Reply by ●May 25, 2004
|
The fact that no debug runs slower is indeed strange. As for unrolling the rows, I would still suggest you try it: registers are not single-assignment, and multiple values can share the same register over time. So try it! I have written similar unrolled code for the 8x8 interpolation case for MPEG-2.

Regds
JS

-----Original Message-----
From: piyush kaul [mailto:]
Sent: Monday, May 24, 2004 2:35 AM
To: sankaran; 'Andrew Nesterov';
Subject: RE: [c6x] Function Profile Debug giving lower cycle count

Hi Sankaran,

That is what I was also saying: slowing down is expected for debug, not for "No Debug". The code is faster for "FP Debug" than for "No Debug". I would expect the function to return with RETNOP 5 for function profile debug, to get clearly demarcated functions, but it is happening the other way round. I strongly suspect a bug in the compiler/assembler, though it seems improbable that it would have escaped notice till now.

I am attaching the generated asm for the same function with and without the -gp option. Please have a look at the RET instruction at the end of the function in both.

As for the total unroll across all the rows, I think I would run out of registers. I am currently using 12 registers for a single row. For 8 rows that would become 96, which is clearly not possible. I foresee that reusing the same registers across rows would cause problems in software pipelining w.r.t. loop carry paths, as it usually does. Please let me know if you think otherwise.

Regards
Piyush

--- sankaran <> wrote:
> Slowing down of the code due to compiling in debug is to be
> expected. As far as unrolling goes, I suggest you do away with the
> loop that you have and write the code for all 8 rows explicitly.
> Then software pipeline across multiple such motion vectors that
> need half-pel interpolation.
>
> Regds
> JS
>
> -----Original Message-----
> From: piyush kaul [mailto:]
> Sent: Monday, May 24, 2004 1:14 AM
> To: sankaran; Andrew Nesterov;
> Subject: RE: [c6x] Function Profile Debug giving lower cycle count
>
> Hi Sankaran, Andrew
>
> I am using the tools version 4.32.
> The problem which I mentioned can be seen only for the C version of
> the sample function InterpolateBlockOddEven I had sent (though the
> same problem exists in l-asm too for some other functions). When you
> compile and run the program, you may see that the C version takes
> 8-9 cycles less for "function profile debug" than for the
> corresponding "without debug" version.
>
> When I count the cycles from the generated asm for the function
> InterpolateBlockOddEven, I see the following cycle count breakdown:
>
> For "Function Profile Debug"
> 1. Initialization + Prolog = 4 cycles
> 2. Kernel = 14 cycles (14 scd./3 in ||)
> 3. Epilog + Exit = 13 cycles
>
> For "No Debug"
> 1. Initialization + Prolog = 7 cycles
> 2. Kernel = 14 cycles (14 scd./3 in ||)
> 3. Epilog + Exit = 18 cycles
>
> What I noticed is that the "No Debug" version seems to avoid using
> parallelism in some places where the "function profile debug" version
> uses it.
>
> Moreover, I noticed that the RET instruction is issued 5 cycles
> earlier in the "function profile debug" version, thus requiring no
> NOPs. In the "No Debug" version, RETNOP 5 is issued right at the end.
>
> So one possible inference is that the difference is not because of
> software pipelining but because of some dependency miscalculation. Is
> this right? What could be the possible cause and solution?
>
> Regards
> Piyush
>
> PS: Sankaran, the linear asm code has the inner loop fully unrolled,
> i.e. the entire row is worked on in the same iteration. The generated
> asm is, however, almost identical for both "function profile debug"
> and "no debug", so the problem is not visible for it.
|






