Andrew-
Yes, as it turned out, both non-BIOS and BIOS profiling methods (TSC_read()
and Clock_getTicks()) worked fine. There
are notes in there about cycle counting formulas.
This e2e thread gives a summary of where we ended up on this issue (look for the
2nd to last entry):
http://e2e.ti.com/support/development_tools/compiler/f/343/t/66581.aspx
-Jeff
> Why do you believe that this is the "measurement issue"? Looking at
> Andrey's figures obtained with the best possible RAM placement (1046 in
> theory, 1075 in the experimental run), methinks they match almost
> perfectly. Not sure if TSC_read() is a call or a macro that simply reads
> the counter register; I would guess that the 29 cycles off is call
> overhead:
>
> // measure call overhead
> T_0 = TSC_read ();
> T_C = TSC_read ();
> T_V = T_C - T_0;
>
> // benchmark
> T_0 = TSC_read ();
> fir8 (n, x, h); // or whatever should be measured here
> T_C = TSC_read ();
>
> // exec time
> T_X = T_C - T_0 - T_V;
>
> assuming that TSC_read() is a call. If it is a macro, then I'd rather look
> at the generated assembly code.
>
> It could also be that the formula does not take into consideration the
> call overhead of the filter function itself - very possible. That usually
> adds 10 to 15 cycles.
>
> The 1.7... factor for the worse RAM placement map is, as Jagadeesh has
> noted, due to memory access path delays. For the mere purpose of
> benchmarking I would have kept TI's original RAM placement and exactly the
> same array sizes, which would give comparable numbers.
>
> Rgds,
>
> Andrew
>
>> Re: Mismatch in Timing measurements [1 Attachment]
>> Posted by: "Jeff Brower" j...@signalogic.com jbrower888
>> Date: Tue Oct 12, 2010 3:51 pm ((PDT))
>>
>> All-
>>
>> I wanted to add that I'm still concerned about possible JTAG + USB
>> "background communication", a theory that would fit the constant slow-down
>> factor, assuming that CCS4 talks with the chip on a periodic basis. I
>> didn't get any reply to my question about "where is Free Run" in CCS4.
>>
>> A couple of other notes:
>>
>> 1) Andrey's x, h, and y data are aligned to a multiple of 8 (we can see
>> it in the .map file).
>>
>> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y is
>> in SL2 (shared mem, onchip). If we had pipeline stalls due to internal
>> memory bus conflicts, I would have expected the slow-down factor to get
>> worse for increased data lengths.
>>
>> 3) Andrey says "main issue" but that really should be "measurement
>> issue". Our main issue is still to generate the fastest possible "natural
>> C code". But without accurate measurement using the EVM C6472, we are
>> stuck between two problems, so we're trying to tackle them one at a time.
>>
>> -Jeff
>>
>> ------------------ Original Message ----------------
>> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>> From: "Andrey Krokhin"
>> Date: Tue, October 12, 2010 5:23 pm
>> To: "Sankaran, Jagadeesh"
>> Cc: "Vikram Ragukumar"
>> "Jeff Brower"
>> "c..."
>> ----------------
>>
>> Hello everyone,
>>
>> In order to focus on the main issue, the difference between expected and
>> real clock cycles, I tried a completely new approach. I created a new
>> project using TI's "benchmark" fir_r8 routine, which has been hand-coded
>> in assembly. I used their sample project, which just calls the fir_r8
>> function once from main. To get the cycle count, I used the approach
>> suggested by Jagadeesh: do a TSC_read() before and after the function
>> call.
>>
>> The original TI code had NH = 32, NR = 256. When I ran that code, I got a
>> cycle count of 1075, which is close to what it should be per TI's
>> formula: NH*NR/8 + 22 = 1046. Next, I tried increasing the array size. I
>> ran into a memory limitation, since TI was placing everything in
>> DMC_SRAM. So, I placed the input and filter arrays in LL2RAM and the
>> output array in SL2RAM (shared memory). I ran the program with larger
>> array sizes and the results I got were:
>>
>> NH: 40000
>> NR: 230000
>> Expected: 1150000022 cycles
>> Actual: 2013247548 cycles
>>
>> So, it's off by a factor of ~1.75. This factor stays constant when I vary
>> the array size.
>>
>> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
>> legacy CCS 3 project), and makes a single function call to the fir_r8
>> assembly routine from main, so there are no overhead issues. The previous
>> discussion about whether we are counting the cycles correctly doesn't
>> apply, since I am measuring the cycles directly from hardware. Any
>> suggestions on what else might account for the difference? In particular,
>> does placement of the data matter (e.g. LL2RAM vs DMC_SRAM)?
>>
>> Regards,
>> Andrey
>>> All,
>>>
>>> The outer loop count dominates the inner loop count by a factor of
>>> 232/36 = ~6x, and further, by looking at the generated assembly code, we
>>> see 55 cycles of overhead for every 8 output samples, so this alone
>>> generates 232000/8 * 55 = 1.6 Mcycles.
>>>
>>> The intrinsic C code does a better job of decreasing outer loop
>>> overheads. So, one gets less outer loop overhead, which is what I think
>>> you are seeing.
>>>
>>> Please make sure that the effects of "cold cache" are taken care of: run
>>> the code twice, ignore the results of the first run, then measure the
>>> results from the second run.
>>>
>>> The cycle count formula you are using seems to be off. In the most
>>> optimal loop in your assembly code, which will be the path taken based on
>>> your number of inputs and filter taps, we see 32 multiplies in 9 cycles,
>>> or ~3.5 multiplies/cycle. So, the formula is roughly:
>>>
>>> x_len/8 * [55 + h_len/4]
>>>
>>> It works on 8 output samples and 4 filter taps in parallel.
>>>
>>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles =
>>> 262.6 Mcycles. So, at 700 MHz, it should take about 262.6/700 = 0.375
>>> seconds.
>>>
>>> Please use the built-in 64-bit counter available on all C64x+ cores, as
>>> it runs in lock step with the CPU. This can be done by reading the TSCL
>>> and TSCH registers.
>>>
>>> I am attaching the relevant code. First call the function TSC_enable()
>>> to enable the CPU counter. Then call TSC_read(), which returns the
>>> result as a 64-bit value (long long). The sequence is as follows:
>>>
>>> long long t0, t1;
>>> unsigned int diff;
>>>
>>> TSC_enable();
>>>
>>> t0 = TSC_read();
>>> benchmark();
>>> t1 = TSC_read();
>>>
>>> diff = (unsigned int)(t1 - t0);
>>>
>>> printf("diff: %u cycles\n", diff);
>>>
>>>
>>> Regards
>>> JS
>>>
>>> -----Original Message-----
>>> From: c... [mailto:c...] On Behalf Of Vikram Ragukumar
>>> Sent: Friday, October 08, 2010 3:05 PM
>>> To: c...
>>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>>
>>> All,
>>>
>>> We are in the process of verifying theoretical cycle time calculations,
>>> based on information from the generated asm file.
>>>
>>> The following results and questions pertain to the attached conv.c and a
>>> section of the generated conv.asm file. conv.c implements a basic
>>> convolution function.
>>>
>>> Our setup details are:
>>> - EVM6472, used to run the .out file generated from a SYS/BIOS based
>>>   project
>>> - CCS v4.2
>>> - SYS/BIOS (BIOSv6)
>>> - CGT v7.0.3
>>>
>>> There appears to be a mismatch in timing measurements, with "actual run
>>> time" being about 9 times the "computed run time".
>>>
>>> From the C code,
>>> Number of outer loop iterations = x_len + h_len = 196000 + 36000 = 232000
>>> Number of inner loop iterations = h_len = 36000
>>>
>>> From the asm file, the cycle time for the software pipelined loop is:
>>>
>>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions for a
>>> SINGLE SCHEDULED ITERATION.
>>>
>>> Hence, computed run time = (x_len + h_len) * (10 + trip_cnt*2), where
>>> trip_cnt = h_len/4 (2 DOTP2's, i.e. 4 MACs, per iteration)
>>>
>>> Therefore,
>>> Computed run time = (x_len + h_len) * (10 + h_len/2) = 4178320000 cycles.
>>> At 700 MHz, this corresponds to a time of approximately 6 s.
>>>
>>> However, the actual run time of the program was determined to be 52s.
>>> Run time was calculated by inserting SYS/BIOS (BIOS6) Clock_getTicks()
>>> calls right before and immediately after the outer loop. We also made
>>> sure that interrupts were disabled during execution of the loops; this
>>> had no effect on the measured run time.
>>>
>>> (1) Is there something wrong with how "Computed run time" is being
>>> calculated, or with how we are measuring "Actual run time"?
>>>
>>> (2) As part of the Software Pipeline information, the compiler outputs a
>>> line that reads something like "ii = 2 Schedule found with 6 iterations
>>> in parallel". Does the word "iterations" in that phrase correspond to
>>> actual inner loop iterations in the C code? If so, is the number 6 in the
>>> above example to be used in computing "expected run time"?
>>>
>>> Thank you in advance for your help.
>>>
>>> Regards,
>>> Vikram.
_____________________________________