Jagadeesh-
> The cycle count of "NH*NR/8+22" assumes only CPU cycles and
> does not, and cannot, account for memory sub-system
> effects. When you work on such large arrays, you see additional
> stall cycles due to the memory hierarchy.
Is there a method or CCS tool that we can use to figure out which instructions
have incurred the stall cycles?
> Your results will be best if you can fit everything in the L2 memory of
> the processor, the section you refer to as DMC_SRAM. The next
> best set of results will be when you use a combination of
> DMC_SRAM & SL2RAM (shared level-2) when the arrays do not fit.
> The absolute worst results will be when everything is linked in
> off-chip memory.
Ok. Of course we wish the L1D memory were larger than 32 KB.
> There are things you can do to minimize cache thrashing
> when you work on such large arrays, by taking advantage
> of the cache allocation policy and of how you link your arrays
> in external memory.
Ok. Testing with huge arrays located in external memory is on our list, but
we have a ways to go yet.
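One illustrative technique (a sketch under assumptions, not something prescribed in this thread) is to stagger the base addresses of the large arrays so sequential walks through them do not map to the same L1D sets. On C64x+, L1D is 32 KB, 2-way set associative with 64-byte lines, i.e. 16 KB per way. This sketch assumes the linker places the arrays contiguously in declaration order; array names are hypothetical:

/* Hypothetical names; illustrative only. */
#pragma DATA_ALIGN(x, 64)
short x[196000];                /* input samples */

#pragma DATA_ALIGN(set_offset_pad, 64)
char set_offset_pad[4096];      /* shifts h's L1D set mapping by 4 KB */

#pragma DATA_ALIGN(h, 64)
short h[36000];                 /* filter taps */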
-Jeff
> -----Original Message-----
> From: Andrey Krokhin [mailto:a...@signalogic.com]
> Sent: Tuesday, October 12, 2010 5:32 PM
> To: Sankaran, Jagadeesh
> Cc: Andrey Krokhin; Vikram Ragukumar; Jeff Brower
> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>
> Sure Jagadeesh... there it is.
>
> Thanks,
> Andrey
>
>> Can you rename fir_r8.cmd to fir_r8.txt? TI mail blocks .cmd
>> files.
>> I will reply in detail once I get your .cmd.
>>
>> Regards
>> JS
>> -----Original Message-----
>> From: Andrey Krokhin [mailto:a...@signalogic.com]
>> Sent: Tuesday, October 12, 2010 5:24 PM
>> To: Sankaran, Jagadeesh
>> Cc: Vikram Ragukumar; Jeff Brower; c...
>> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>>
>> Hello everyone,
>>
>> In order to focus on the main issue, the difference between expected and
>> real clock cycles, I tried a completely new approach. I created a new
>> project using TI's "benchmark" fir_r8 routine, which has been hand-coded
>> in assembly. I used their sample project, which just calls the fir_r8
>> function once from main. To get the cycle count, I used the approach
>> suggested by Jagadeesh: do a TSC_read() before and after the function
>> call.
>>
>> The original TI code had NH = 32, NR = 256. When I ran that code, I
>> got a cycle count of 1075, which is close to what it should be with TI's
>> formula: NH*NR/8 + 22 = 1046. Next, I tried increasing the array size. I ran
>> into a memory limitation, since TI was placing everything in DMC_SRAM. So,
>> I placed the input and filter arrays in LL2RAM and the output array in
>> SL2RAM (shared memory). I ran the program with larger array sizes and the
>> results I got were:
>>
>> NH: 40000
>> NR: 230000
>> Expected: 1150000022 cycles
>> Actual: 2013247548 cycles
>>
>> So, it's off by a factor of ~1.75. This factor stays constant when I vary
>> the array size.
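For reference, that kind of placement is typically done with DATA_SECTION pragmas on the C side plus matching SECTIONS entries in the linker command file. A minimal sketch with hypothetical section names (the actual fir_r8.cmd is the attachment discussed in this thread, not reproduced here):

/* C side: give each array its own named section (names hypothetical). */
#pragma DATA_SECTION(x, ".fir_input")
short x[230000 + 40000];        /* input samples  */
#pragma DATA_SECTION(h, ".fir_coeffs")
short h[40000];                 /* filter taps    */
#pragma DATA_SECTION(r, ".fir_output")
short r[230000];                /* output samples */

/* Linker .cmd side: map those sections onto the desired memories. */
SECTIONS
{
    .fir_input  > LL2RAM
    .fir_coeffs > LL2RAM
    .fir_output > SL2RAM
}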
>>
>> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
>> legacy CCS 3 project), and makes a single function call to the fir_r8
>> assembly routine from main, so there are no overhead issues. The previous
>> discussion about whether we are counting the cycles correctly doesn't
>> apply, since I am measuring the cycles directly from hardware. Any
>> suggestions on what else might account for the difference? In particular,
>> does placement of the data matter (e.g. LL2RAM vs DMC_SRAM)?
>>
>> Regards,
>> Andrey
>>> All,
>>>
>>> The outer loop count dominates the inner loop count by a factor of
>>> 232/36 = ~6x, and further, looking at the generated assembly code, we
>>> see 55 cycles of overhead for every 8 output samples, so this alone
>>> generates 232000/8 * 55 = ~1.6 Mcycles.
>>>
>>> The intrinsic C code does a better job of reducing outer loop
>>> overhead, which is what I think you are seeing.
>>>
>>> Please make sure that the effects of "cold cache" are taken care of by
>>> running the code twice: ignore the results of the first run, then
>>> measure the results from the second run.
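Concretely, using the TSC_enable()/TSC_read() helpers shown further down in this message, that might look like the following sketch, where benchmark() stands for the code under test:

long long t0, t1;

TSC_enable();

benchmark();           /* first run: warms the caches, result discarded */

t0 = TSC_read();
benchmark();           /* second run: warm cache, report this measurement */
t1 = TSC_read();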
>>>
>>> The cycle count formula you are using seems to be off. In the most
>>> optimal loop in your assembly code, which is the path taken based on
>>> your number of inputs and filter taps, we see 32 multiplies in 9 cycles,
>>> or ~3.5 multiplies/cycle. So, the formula is roughly:
>>>
>>> x_len/8 * [55 + h_len/4]
>>>
>>> It works on 8 output samples and 4 filter taps in parallel.
>>>
>>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles
>>> = 262.6 Mcycles.
>>> So, it should take about 262.6/700, or roughly 0.38 seconds, at 700 MHz.
>>>
>>> Please use the built-in 64-bit counter available on all C64x+ cores, as
>>> it runs in lock step with the CPU. This can be done by reading the TSCL
>>> and TSCH registers.
>>>
>>> I am attaching the relevant code. First call the function TSC_enable()
>>> to enable the CPU counter, then call TSC_read(), which returns the
>>> result as a 64-bit value (long long), so the sequence is as follows:
>>>
>>> long long t0, t1;
>>> unsigned int diff;
>>>
>>> TSC_enable();                    /* start the free-running counter */
>>>
>>> t0 = TSC_read();
>>> benchmark();                     /* code under test */
>>> t1 = TSC_read();
>>>
>>> diff = (unsigned int)(t1 - t0);  /* elapsed CPU cycles */
>>>
>>> printf("diff: %u cycles\n", diff);
>>>
>>>
>>> Regards
>>> JS
>>>
>>> -----Original Message-----
>>> From: c... [mailto:c...] On Behalf Of Vikram Ragukumar
>>> Sent: Friday, October 08, 2010 3:05 PM
>>> To: c...
>>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>>
>>> All,
>>>
>>> We are in the process of verifying theoretical cycle time calculations,
>>> based on information from the generated asm file.
>>>
>>> Following results and questions pertain to the attached conv.c and a
>>> section of the generated conv.asm file.
>>> conv.c implements a basic convolution function.
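The attached conv.c is not reproduced here; a generic convolution of the shape described in this message (outer loop of x_len + h_len iterations, inner loop of h_len multiply-accumulates) might look like the following sketch, assuming x[] is zero-padded so the inner loop needs no bounds checks:

/* Hypothetical sketch, not the attached conv.c. Written in the FIR /
   correlation form (flip h[] for textbook convolution); the branch-free
   inner loop is what the compiler software-pipelines into DOTP2 pairs. */
void conv(const short *x, int x_len, const short *h, int h_len, int *y)
{
    int n, k;
    for (n = 0; n < x_len + h_len; n++) {    /* outer loop */
        int acc = 0;
        for (k = 0; k < h_len; k++)          /* inner loop: h_len MACs */
            acc += x[n + k] * h[k];
        y[n] = acc;
    }
}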
>>>
>>> Our setup details are :
>>> - EVM6472, used to run the .out file generated from a SYS/BIOS-based
>>> project
>>> - CCS v4.2
>>> - SYS/BIOS (BIOSv6)
>>> - CGT v7.0.3
>>>
>>> There appears to be a mismatch in timing measurements, with "actual run
>>> time" being about 9 times the "computed run time".
>>>
>>> From the C code,
>>> Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000
>>> Number of inner loop iterations = h_len = 36000
>>>
>>> From the asm file, the cycle time for the software-pipelined loop is:
>>>
>>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions for a
>>> SINGLE SCHEDULED ITERATION.
>>>
>>> Hence, computed run time = (x_len + h_len) * (10 + trip_cnt*2), where
>>> trip_cnt = h_len/4 (2 DOTP2's per iteration).
>>>
>>> Therefore,
>>> Computed run time = (x_len + h_len) * (10 + h_len/2) = 4178320000 cycles.
>>> At 700 MHz, this corresponds to a time of approximately 6 s.
>>>
>>> However, the actual run time of the program was determined to be 52s. Run
>>> time was calculated by inserting SYS/BIOS (BIOS6) Clock_getTicks() calls
>>> right before and immediately after the outer loop. We also made sure
>>> that interrupts were disabled during execution of the loops; this had no
>>> effect on the run time measured.
>>>
>>> (1) Is there something wrong with how "Computed run time" is being
>>> calculated, or with how we are measuring "Actual run time"?
>>>
>>> (2) As part of the Software Pipeline information, the compiler outputs a
>>> line that reads something like "ii = 2 Schedule found with 6 iterations
>>> in parallel". Does the word "iterations" in that phrase correspond
>>> to actual inner loop iterations in the C code? If so, is the number 6 in
>>> the above example to be used in computing "expected run time"?
>>>
>>> Thank you in advance for your help.
>>>
>>> Regards,
>>> Vikram.