
RE: Mismatch in Timing measurements

Started by "Sankaran, Jagadeesh" October 13, 2010
The cycle count formula "NH*NR/8+22" counts only CPU cycles; it does not, and cannot, account for memory sub-system effects. When you work on arrays this large, you see additional stall cycles due to the memory hierarchy.

Your results will be best if everything fits in the L2 memory of the
processor, the section you refer to as DMC_SRAM. The next best results come
from a combination of DMC_SRAM and SL2RAM (shared L2) when the arrays do not fit.

The absolute worst results will be when everything is linked to off-chip memory.

There are things you can do to minimize cache thrashing when you work on arrays this large, by taking advantage of the cache allocation policy and of how you link your arrays in external memory; see the sketch below.
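
For example, from C you can give each large array its own named section and
cache-line alignment, and then map those sections in the linker .cmd file.
A minimal sketch, assuming illustrative section names, the LL2RAM/SL2RAM
memory names from your .cmd, and a 128-byte L2 line size (check your
device's data sheet):

/* Map the sections in the .cmd file, e.g.:
   SECTIONS { .fir_coeffs > LL2RAM
              .fir_input  > LL2RAM
              .fir_output > SL2RAM }                                  */
#define NH 40000
#define NR 230000

#pragma DATA_SECTION(h, ".fir_coeffs")
#pragma DATA_ALIGN(h, 128)            /* assumed 128-byte L2 line size */
short h[NH];                          /* filter coefficients           */

#pragma DATA_SECTION(x, ".fir_input")
#pragma DATA_ALIGN(x, 128)
short x[NR + NH - 1];                 /* input samples                 */

#pragma DATA_SECTION(r, ".fir_output")
#pragma DATA_ALIGN(r, 128)
short r[NR];                          /* output samples                */

When arrays must live off chip, staggering their start addresses by
different multiples of the line size also keeps them from contending for
the same cache sets.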

Regards
JS

-----Original Message-----
From: Andrey Krokhin [mailto:a...@signalogic.com]
Sent: Tuesday, October 12, 2010 5:32 PM
To: Sankaran, Jagadeesh
Cc: Andrey Krokhin; Vikram Ragukumar; Jeff Brower
Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]

Sure Jagadeesh... there it is.

Thanks,
Andrey

> Can you rename fir_r8.cmd to fir_r8.txt, as TI's mail system blocks .cmd
> files. I will reply in detail once I get your .cmd.
>
> Regards
> JS
> -----Original Message-----
> From: Andrey Krokhin [mailto:a...@signalogic.com]
> Sent: Tuesday, October 12, 2010 5:24 PM
> To: Sankaran, Jagadeesh
> Cc: Vikram Ragukumar; Jeff Brower; c...
> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>
> Hello everyone,
>
> In order to focus on the main issue, the difference between expected and
> real clock cycles, I tried a completely new approach. I created a new
> project using TI's "benchmark" fir_r8 routine, which has been hand-coded
> in assembly. I used their sample project which just called the fir_r8
> function once from main. To get the cycle count, I used the approach
> suggested by Jagadeesh: do a TSC_read() before and after the function
> call.
>
> The original TI code had NH = 32 and NR = 256. When I ran that code, I
> got a cycle count of 1075, which is close to what TI's formula predicts:
> NH*NR/8+22 = 1046. Next, I tried increasing the array sizes. I ran into a
> memory limitation, since TI was placing everything in DMC_SRAM, so I
> placed the input and filter arrays in LL2RAM and the output array in
> SL2RAM (shared memory). With the larger array sizes, the results were:
>
> NH: 40000
> NR: 230000
> Expected: 1150000022 cycles
> Actual: 2013247548 cycles
>
> So, it's off by a factor of ~1.75. This factor stays constant when I vary
> the array size.
>
> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
> legacy CCS 3 project) and makes a single call to the fir_r8 assembly
> routine from main, so there are no overhead issues. The earlier
> discussion about whether we are counting the cycles correctly doesn't
> apply, since I am measuring the cycles directly in hardware. Any
> suggestions on what else might account for the difference? In particular,
> does placement of the data matter (e.g. LL2RAM vs. DMC_SRAM)?
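>
> For reference, the measurement harness amounts to something like the
> following sketch, where the fir_r8 signature is assumed from TI's
> benchmark code, TSC_enable()/TSC_read() come from the code Jagadeesh
> attached, and x, h, r, NH, NR are the arrays and sizes described above:
>
> #include <stdio.h>
>
> extern void fir_r8(const short *x, const short *h, short *r,
>                    int nh, int nr);
>
> int main(void)
> {
>     long long t0, t1;
>
>     TSC_enable();              /* enable the CPU cycle counter     */
>     fir_r8(x, h, r, NH, NR);   /* warm-up call to prime the caches */
>
>     t0 = TSC_read();
>     fir_r8(x, h, r, NH, NR);   /* measured call                    */
>     t1 = TSC_read();
>
>     printf("cycles: %lld\n", t1 - t0);
>     return 0;
> }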
>
> Regards,
> Andrey
>> All,
>>
>> The outer-loop trip count dominates the inner-loop trip count, by a
>> factor of 232/36, about 6x. Further, looking at the generated assembly
>> code, we see 55 cycles of overhead for every 8 output samples, so this
>> alone generates 232000/8 * 55 = 1.6 Mcycles.
>>
>> The intrinsic C code does a better job of reducing the outer-loop
>> overhead, so you get less of it, which is what I think you are seeing.
>>
>> Please make sure that the effects of a "cold cache" are taken care of:
>> run the code twice, ignore the results of the first run, and measure the
>> second.
>>
>> The cycle count formula you are using seems to be off. In the most
>> optimized loop in your assembly code, which is the path taken for your
>> number of inputs and filter taps, we see 32 multiplies in 9 cycles, or
>> about 3.5 multiplies/cycle. So the formula is roughly:
>>
>> x_len/8 * [55 + h_len/4]
>>
>> since it works on 8 output samples and 4 filter taps in parallel.
>>
>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles
>> = 262.6 Mcycles, which at 700 MHz should be about 262.6/700 = 0.375
>> seconds.
>>
>> Please use the built-in 64-bit counter available on all C64x+ cores, as
>> it runs in lockstep with the CPU. This is done by reading the TSCL and
>> TSCH registers.
>>
>> I am attaching the relevant code. First call TSC_enable() to enable the
>> CPU counter, then call TSC_read(), which returns the count as a 64-bit
>> value (long long), so the sequence is as follows:
>>
>> #include <stdio.h>
>>
>> long long t0, t1;
>> unsigned long long diff;
>>
>> TSC_enable();              /* start the free-running cycle counter   */
>>
>> t0 = TSC_read();           /* timestamp before the benchmarked call  */
>> benchmark();
>> t1 = TSC_read();           /* timestamp after                        */
>>
>> diff = (unsigned long long)(t1 - t0);
>>
>> printf("diff: %llu cycles\n", diff);
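>>
>> (The attached file is not reproduced here; on C64x+ the two helpers
>> typically amount to something like the following, using the compiler's
>> control-register access to the time-stamp counter. This is a sketch,
>> not the actual attachment:)
>>
>> #include <c6x.h>   /* declares the TSCL/TSCH control registers */
>>
>> void TSC_enable(void)
>> {
>>     TSCL = 0;                 /* any write to TSCL starts the counter */
>> }
>>
>> long long TSC_read(void)
>> {
>>     unsigned int lo = TSCL;   /* reading TSCL latches TSCH */
>>     unsigned int hi = TSCH;
>>     return ((long long)hi << 32) | lo;
>> }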
>> Regards
>> JS
>>
>> -----Original Message-----
>> From: c... [mailto:c...] On Behalf Of
>> Vikram
>> Ragukumar
>> Sent: Friday, October 08, 2010 3:05 PM
>> To: c...
>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>
>> All,
>>
>> We are in the process of verifying theoretical cycle time calculations,
>> based on information from the generated asm file.
>>
>> The following results and questions pertain to the attached conv.c,
>> which implements a basic convolution function, and to a section of the
>> generated conv.asm file.
>>
>> Our setup details are:
>> - EVM6472, used to run .out file generated from a SYS/BIOS based
>> project
>> - CCS v4.2
>> - SYS/BIOS (BIOSv6)
>> - CGT v7.0.3
>>
>> There appears to be a mismatch in timing measurements, with "actual run
>> time" being about 9 times the "computed run time".
>>
>> From the C code,
>> Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000
>> Number of inner loop iterations = h_len = 36000
>>
>> From the asm file, the cycle count for the software-pipelined inner loop
>> is:
>>
>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions per
>> SINGLE SCHEDULED ITERATION.
>>
>> Hence, computed run time = (x_len + h_len) * (10 + trip_cnt * 2), with
>> trip_cnt = h_len/4 (2 DOTP2's, i.e. 4 taps, per iteration).
>>
>> Therefore,
>> Computed run time = (x_len + h_len) * (10 + h_len/2) = 232000 * 18010
>> = 4178320000 cycles.
>> At 700 MHz, this corresponds to a time of approximately 6 s.
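>>
>> (For reference, the loop structure these counts assume is along the
>> following lines. This is only a sketch, since the actual code is in the
>> attached conv.c; x is assumed to point into a buffer zero-padded by
>> h_len samples on each side so the index never runs off the ends.)
>>
>> void conv(const short *x, const short *h, int *y, int x_len, int h_len)
>> {
>>     int n, k;
>>     for (n = 0; n < x_len + h_len; n++) {  /* 232000 outer iterations */
>>         int sum = 0;
>>         for (k = 0; k < h_len; k++)        /* 36000 inner iterations, */
>>             sum += x[n - k] * h[k];        /* paired by the compiler  */
>>         y[n] = sum;                        /* into DOTP2 instructions */
>>     }
>> }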
>>
>> However, the actual run time of the program was measured to be 52 s. Run
>> time was obtained by inserting SYS/BIOS (BIOS6) Clock_getTicks() calls
>> immediately before and after the outer loop. We also made sure that
>> interrupts were disabled during execution of the loops; this had no
>> effect on the measured run time.
>>
>> (1) Is there something wrong with how the "computed run time" is being
>> calculated, or with how we are measuring the "actual run time"?
>>
>> (2) As part of the software-pipeline information, the compiler outputs a
>> line that reads something like "ii = 2 Schedule found with 6 iterations
>> in parallel". Does the word "iterations" in that phrase correspond to
>> actual inner-loop iterations in the C code? If so, should the number 6
>> in the above example be used in computing the "expected run time"?
>>
>> Thank you in advance for your help.
>>
>> Regards,
>> Vikram.
>>
>>
>> _____________________________________
>

_____________________________________
Jagadeesh-

> The cycle count formula "NH*NR/8+22" counts only CPU cycles; it
> does not, and cannot, account for memory sub-system
> effects. When you work on arrays this large, you see additional
> stall cycles due to the memory hierarchy.

Is there a method or CCS tool that we can use to figure out which instructions have incurred the stall cycles?

> Your results will be best if everything fits in the L2 memory of
> the processor, the section you refer to as DMC_SRAM. The next
> best results come from a combination of DMC_SRAM and SL2RAM
> (shared L2) when the arrays do not fit.
> The absolute worst results will be when everything is linked to
> off-chip memory.

Ok. Of course we wish that the L1D memory size were larger than 32 KB.
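
One common way to live within it is to block the computation so that each
inner kernel's working set fits in L1D. A rough sketch in plain C, where
the block sizes, coefficient ordering, and Q15 output scaling are all
assumptions to adjust:

/* Double-blocked FIR: each inner pass touches ~8 KB of h, ~12 KB of x,
   and 8 KB of partial sums -- about 28 KB, inside the 32 KB L1D. */
#define NBLK 2048               /* output samples per block */
#define HBLK 4096               /* filter taps per block    */

void fir_blocked(const short *x, const short *h, short *r, int nh, int nr)
{
    static int acc[NBLK];       /* partial sums for one output block */
    int nb, kb, n, k;

    for (nb = 0; nb < nr; nb += NBLK) {
        int ncnt = (nr - nb < NBLK) ? nr - nb : NBLK;
        for (n = 0; n < ncnt; n++)
            acc[n] = 0;

        for (kb = 0; kb < nh; kb += HBLK) {      /* keep h chunk cached */
            int kcnt = (nh - kb < HBLK) ? nh - kb : HBLK;
            for (n = 0; n < ncnt; n++)
                for (k = 0; k < kcnt; k++)
                    acc[n] += x[nb + n + kb + k] * h[kb + k];
        }
        for (n = 0; n < ncnt; n++)
            r[nb + n] = (short)(acc[n] >> 15);   /* Q15 scaling assumed */
    }
}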

> There are things you can do to minimize cache thrashing when
> you work on arrays this large, by taking advantage
> of the cache allocation policy and of how you link your arrays
> in external memory.

Ok. Testing with huge arrays located in external memory is on our list, but we have a ways to go yet.

-Jeff

_____________________________________