Re: Mismatch in Timing measurements

Started by Andrew Nesterov November 15, 2010
Hi Jeff,

Why do you believe that this is a "measurement issue"? Looking at Andrey's
figures obtained with the best possible RAM placement (1046 in theory, 1075 in
the experimental run), it seems to me that they match quite closely. I am not
sure whether TSC_read() is a function call or a macro that simply reads the
counter register; if it is a call, I would guess that the 29 extra cycles are
call overhead:

long long T_0, T_C, T_V, T_X;   // TSC snapshots, call overhead, exec time

// measure call overhead
T_0 = TSC_read ();
T_C = TSC_read ();
T_V = T_C - T_0;

// benchmark
T_0 = TSC_read ();
fir8 (n, x, h); // or whatever should be measured here
T_C = TSC_read ();

// exec time
T_X = T_C - T_0 - T_V;

assuming that TSC_read () is a call. If it is a macro, then I would rather
look at the generated assembly code.
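
If it helps, a minimal inline version could look something like the sketch below. This
assumes the TI compiler's <c6x.h> header, which declares TSCL/TSCH as control registers,
and it may well differ from the code Jagadeesh attached:

#include <c6x.h>   /* TI CGT header declaring the TSCL/TSCH control registers */

/* Writing any value to TSCL starts the free-running CPU cycle counter. */
static inline void TSC_enable (void)
{
    TSCL = 0;
}

/* Reading TSCL latches TSCH, so the two reads form a coherent 64-bit snapshot. */
static inline long long TSC_read (void)
{
    unsigned int lo = TSCL;
    unsigned int hi = TSCH;
    return ((long long) hi << 32) | lo;
}

Built like this, the read should compile down to little more than a pair of MVC
instructions, and the call-overhead correction above becomes almost negligible.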

It could also be that the formula does not take into account the call overhead
of the filter function itself, which is very possible; that usually adds another
10 to 15 cycles.

The ~1.75 factor for the worse RAM placement is, as Jagadeesh has noted, due to
memory access path delays. For the mere purpose of benchmarking I would have kept
TI's original RAM placement and exactly the same array sizes, which would give
directly comparable numbers.
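
As a quick sanity check on the figures quoted below, TI's cycle formula can be re-derived
in a few lines. This is only a sketch; the constants are simply copied from Andrey's
message, nothing here is measured:

#include <stdio.h>

/* Re-derive the expected cycle counts from TI's fir_r8 formula NH*NR/8 + 22. */
int main (void)
{
    long long small_expected = 32LL * 256 / 8 + 22;         /* 1046 (measured: 1075)   */
    long long large_expected = 40000LL * 230000 / 8 + 22;   /* 1150000022              */
    long long large_actual   = 2013247548LL;                /* measured on the EVM6472 */

    printf ("small case: expected %lld cycles, measured 1075 (+29)\n", small_expected);
    printf ("large case: expected %lld, actual %lld, ratio %.2f\n",
            large_expected, large_actual,
            (double) large_actual / (double) large_expected);
    return 0;
}

The ratio prints as 1.75, i.e. the constant slow-down factor discussed below.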

Rgds,

Andrew

> Re: Mismatch in Timing measurements [1 Attachment]
> Posted by: "Jeff Brower" j...@signalogic.com jbrower888
> Date: Tue Oct 12, 2010 3:51 pm (PDT)
>
> All-
>
> I wanted to add that I'm still concerned about possible JTAG + USB "background communication", a theory that would fit
> the constant slow-down factor, assuming that CCS4 talks with the chip on a periodic basis. I didn't get any reply to
> my question about "where is Free Run" in CCS4.
>
> A couple of other notes:
>
> 1) Andrey's x, h, and y data are aligned to a multiple of 8 (we can see it in the .map file).
>
> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y is in SL2 (shared mem, onchip). If we had
> pipeline stalls due to internal memory bus conflicts, I would have expected the slow-down factor to get worse for
> increased data lengths.
>
> 3) Andrey says "main issue" but that really should be "measurement issue". Our main issue is still to generate
> fastest possible "natural C code". But without accurate measurement using the EVM C6472, we are stuck between two
> problems, so we're trying to tackle one at a time.
>
> -Jeff
>
> ------------------ Original Message ----------------
> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
> From: "Andrey Krokhin"
> Date: Tue, October 12, 2010 5:23 pm
> To: "Sankaran, Jagadeesh"
> Cc: "Vikram Ragukumar"
> "Jeff Brower"
> "c..."
> ----------------
>
> Hello everyone,
>
> In order to focus on the main issue, the difference between expected and real clock cycles, I tried a completely new
> approach. I created a new project using TI's "benchmark" fir_r8 routine, which has been hand-coded in assembly. I used
> their sample project which just called the fir_r8 function once from main. To get the cycle count, I used the approach
> suggested by Jagadeesh: do a TSC_read() before and after the function call.
>
> The original TI code had NH = 32, NR = 256. When I ran that code, I got a cycle count of 1075, which is close to
> what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried increasing the array size. I ran into a memory
> limitation, since TI was placing everything in DMC_SRAM. So, I placed the input and filter arrays in LL2RAM and the
> output array in SL2RAM (shared memory). I ran the program with larger array sizes and the results I got were:
>
> NH: 40000
> NR: 230000
> Expected: 1150000022 cycles
> Actual: 2013247548 cycles
>
> So, it's off by a factor of ~1.75. This factor stays constant when I vary the array size.
>
> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a legacy CCS 3 project), and makes a single
> function call to the fir_r8 assembly routine from main, so there are no overhead issues. The previous discussion about
> whether we are counting the cycles correctly doesn't apply, since I am measuring the cycles directly from hardware.
> Any suggestions on what else might account for the difference? In particular, does placement of the data matter
> (e.g. LL2RAM vs DMC_SRAM)?
>
> Regards,
> Andrey
>> All,
>>
>> The outer loop trip count dominates the inner loop trip count by a factor of 232/36 ≈ 6.4x, and further, looking at
>> the generated assembly code, we see 55 cycles of overhead for every 8 output samples, so this alone accounts for
>> 232000/8 * 55 ≈ 1.6 Mcycles.
>>
>> The intrinsic C code does a better job of reducing outer loop overhead, which is what I think you are seeing.
>>
>> Please make sure that the effects of a "cold cache" are taken care of by running the code twice: ignore the results
>> of the first run and measure the second.
>>
>> The cycle count formula you are using seems to be off. In the most optimal loop in your assembly code, which will be
>> the path taken given your number of inputs and filter taps, we see 32 multiplies in 9 cycles, or about 3.5
>> multiplies/cycle. So the formula is roughly:
>>
>> x_len/8 * [55 + h_len/4]
>>
>> It is working on 8 output samples and 4 filter taps in parallel.
>>
>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6 Mcycles. At 700 MHz that should be
>> about 262.6/700 ≈ 0.38 seconds.
>>
>> Please use the built-in 64-bit counter available on all C64x+ cores, as it runs in lock step with the CPU. This can
>> be done by reading the TSCL and TSCH registers.
>>
>> I am attaching the relevant code. First call TSC_enable() to enable the CPU counter, then call TSC_read(), which
>> returns the result as a 64-bit value (long long), so the sequence is as follows:
>>
>> long long t0, t1;
>> unsigned int diff;
>>
>> TSC_enable();
>>
>> t0 = TSC_read();
>> benchmark();
>> t1 = TSC_read();
>>
>> diff = (unsigned int)(t1 - t0);   /* fits in 32 bits for runs under ~6 s at 700 MHz */
>>
>> printf ("diff: %u cycles\n", diff);
>> Regards
>> JS
>>
>> -----Original Message-----
>> From: c... [mailto:c...] On Behalf Of Vikram Ragukumar
>> Sent: Friday, October 08, 2010 3:05 PM
>> To: c...
>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>
>> All,
>>
>> We are in the process of verifying theoretical cycle time calculations, based on information from the generated asm file.
>>
>> Following results and questions pertain to the attached conv.c and a section of the generated conv.asm file.
>> conv.c implements a basic convolution function.
>>
>> Our setup details are :
>> - EVM6472, used to run .out file generated from a SYS/BIOS based project
>> - CCS v4.2
>> - SYS/BIOS (BIOSv6)
>> - CGT v7.0.3
>>
>> There appears to be a mismatch in timing measurements, with "actual run time" being about 9 times the "computed run time".
>>
>> From the C code,
>> Number of outer loop iterations = x_len + h_len = 196000 + 36000 = 232000
>> Number of inner loop iterations = h_len = 36000
>>
>> From the asm file, the cycle time for the software pipelined loop is
>>
>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions for a SINGLE SCHEDULED ITERATION.
>>
>> Hence, computed run time = (x_len + h_len) * (10 + trip_cnt*2), where
>> trip_cnt = h_len/4 (2 DOTP2's per iteration).
>>
>> Therefore,
>> Computed run time = (x_len + h_len) * (10 + h_len/2) = 232000 * 18010 = 4178320000 cycles.
>> At 700 MHz, this corresponds to approximately 6 s.
>>
>> However, the actual run time of the program was determined to be 52 s. Run time was calculated by inserting SYS/BIOS
>> (BIOS6) Clock_getTicks() calls right before and immediately after the outer loop. We also made sure that interrupts
>> were disabled during execution of the loops; this had no effect on the measured run time.
>>
>> (1) Is there something wrong with how "Computed run time" is being calculated, or with how we are measuring "Actual
>> run time"?
>>
>> (2) As part of the Software Pipeline information, the compiler outputs a line that reads something like "ii = 2
>> Schedule found with 6 iterations in parallel". Does the word "iterations" in that phrase correspond to actual inner
>> loop iterations in the C code? If so, should the number 6 in the above example be used in computing "expected run
>> time"?
>>
>> Thank you in advance for your help.
>>
>> Regards,
>> Vikram.

_____________________________________
Andrew-

Yes, as it turned out, both the non-BIOS and BIOS profiling methods (TSC_read() and Clock_getTicks()) worked fine.
There are notes in that thread about cycle counting formulas as well.

This TI E2E thread gives a summary of where we ended up on this issue (look for the second-to-last entry):

http://e2e.ti.com/support/development_tools/compiler/f/343/t/66581.aspx

-Jeff
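
For anyone landing on this thread later, the SYS/BIOS side of the measurement was
essentially the pattern below. This is only a sketch, assuming the default Clock
configuration in which Clock_tickPeriod is the tick period in microseconds (1000 us,
i.e. 1 ms, by default):

#include <stdio.h>
#include <xdc/std.h>
#include <ti/sysbios/knl/Clock.h>

void time_with_clock (void)
{
    UInt32 t0, t1;

    t0 = Clock_getTicks ();
    /* ... code under test, e.g. the convolution outer loop ... */
    t1 = Clock_getTicks ();

    /* Convert ticks to microseconds using the configured tick period. */
    printf ("elapsed: %u ticks = %u us\n",
            (unsigned) (t1 - t0),
            (unsigned) ((t1 - t0) * Clock_tickPeriod));
}

Note that the 1 ms tick is far coarser than the TSC, so it is only meaningful for long
runs such as the 52 s case above; for something like the ~1000-cycle fir_r8 benchmark,
TSC_read() is the right tool.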

_____________________________________