Reply by Jeff Brower October 13, 2010
Mike-

> On 10/13/2010 1:28 PM, Jeff Brower wrote:
>> Mike-
>>
>> Thanks for the detailed answer. We're not using an XDS560, but that's a good trick to keep in mind. In our current
>> tests we have SYS/BIOS (the new TI name for DSP/BIOS, actually BIOS v6) disabled, so we can rule that out.
>>
>> Thanks for the heads-up about AET. I don't think the C6472 has this though.
> DS p259 7.20.1 Advanced Event Triggering (AET)

Right, I see this now. AET isn't mentioned in the first few sections of the data sheet, so I had cut my search short.
My bad.

-Jeff

>> I'll report back once I get an answer about Free Run.
>>
>> -Jeff
>>
>>> On 10/12/2010 6:13 PM, Jeff Brower wrote:
>>>> All-
>>>>
>>>> I wanted to add that I'm still concerned about possible JTAG + USB
>>>> "background communication", a theory that would fit
>>>> the constant slow-down factor, assuming that CCS4 talks with the chip
>>>> on a periodic basis.
>>>>
>>> If you have no BPs set, no stdio, and no RTDX, I am pretty sure that the
>>> only JTAG activity is "Are you halted?" polling from CCS. This polling
>>> only reads a status register and checks the CPU halted status. The
>>> status register logic is located entirely in the JTAG clock domain and
>>> does not affect processor operation.
>>> Since we have all been burned by 'things *should* work this way' in the
>>> past, you may be able to verify this empirically.
>>> If you have an XDS560 emulator, run tests as normal.
>>> Then change the JTAG clock rate to 1Mhz and repeat the tests.
>>> If there is any 'JTAG interference', you could see a higher cycle count.
>>> FYI - I haven't worked with the 6472/CCS4 combination - on some
>>> target/emulator combinations you can run into strange side effects if
>>> you set the JTAG clock rate below 1Mhz.
>>> There might be a wild card relating to DSP/BIOS...
>>>> I didn't get any reply to
>>>> my question about "where is Free Run" in CCS4.
>>>>
>>> I have very minimal use with CCS4/DSP so I can't answer the 'Run Free'
>>> question, but I would expect that the wizards of TI would provide an
>>> answer - even if it is "we don't need no stinkin' Free Run" :-)
>>>>
>>>> A couple of other notes:
>>>>
>>>> 1) Andrey's x, h, and y data are aligned to multiple of 8 (we can see
>>>> it in the .map file).
>>>>
>>>> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y
>>>> is in SL2 (shared mem, onchip). If we had
>>>> pipeline stalls due to internal memory bus conflicts, I would have
>>>> expected the slow-down factor to get worse for
>>>> increased data lengths.
>>>>
>>> My _opinion_ is...
>>> if pipeline stalls are a problem in a synchronous test case, it is
>>> very possible that they will occur at a synchronous rate and provide a
>>> constant overhead [or slow down factor].
>>> If your device has AET, you should be able to count pipeline stalls.
>>> Note:
>>> When you select 'Run' the actual AET setup takes place prior to 'pulling
>>> the trigger' for the run. All detection, counting, etc. is handled by HW
>>> and incurs 0 overhead.
>>>
>>> mikedunn
>>>>
>>>> 3) Andrey says "main issue" but that really should be "measurement
>>>> issue". Our main issue is still to generate
>>>> fastest possible "natural C code". But without accurate measurement
>>>> using the EVM C6472, we are stuck between two
>>>> problems, so we're trying to tackle one at a time.
>>>>
>>>> -Jeff
>>>>
>>>> ------------------ Original Message ----------------
>>>> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>>>> From: "Andrey Krokhin" >>>> >
>>>> Date: Tue, October 12, 2010 5:23 pm
>>>> To: "Sankaran, Jagadeesh">
>>>> Cc: "Vikram Ragukumar" >>>> >
>>>> "Jeff Brower">
>>>> "c..."
>>>> >
>>>> ----------------
>>>>
>>>> Hello everyone,
>>>>
>>>> In order to focus on the main issue, the difference between expected
>>>> and real clock cycles, I tried a completely new
>>>> approach. I created a new project using TI's "benchmark" fir_r8
>>>> routine, which has been hand-coded in assembly. I used
>>>> their sample project which just called the fir_r8 function once from
>>>> main. To get the cycle count, I used the approach
>>>> suggested by Jagadeesh: do a TSC_read() before and after the function
>>>> call.
>>>>
>>>> The original TI code had NH = 32, NR = 256. When I ran that code,
>>>> I got a cycle count of 1075, which is close to
>>>> what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried
>>>> increasing the array size. I ran into a memory
>>>> limitation, since TI was placing everything in DMC_SRAM. So, I placed
>>>> the input and filter arrays in LL2RAM and the
>>>> output array in SL2RAM (shared memory). I ran the program with larger
>>>> array sizes and the results I got were:
>>>>
>>>> NH: 40000
>>>> NR: 230000
>>>> Expected: 1150000022 cycles
>>>> Actual: 2013247548 cycles
>>>>
>>>> So, it's off by a factor of ~1.75. This factor stays constant when I
>>>> vary the array size.
>>>>
>>>> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
>>>> legacy CCS 3 project), and makes a single
>>>> function call to the fir_r8 assembly routine from main, so there are
>>>> no overhead issues. The previous discussion about
>>>> whether we are counting the cycles correctly doesn't apply, since I am
>>>> measuring the cycles directly from hardware.
>>>> Any suggestions on what else might account for the difference? In
>>>> particular, does placement of the data matter (e.g.
>>>> LL2RAM vs DMC_SRAM?)
>>>>
>>>> Regards,
>>>> Andrey
>>>>
>>>>> All,
>>>>>
>>>>> The outer loop's loop count dominates the inner loop's loop count by a
>>>> factor of 232/36 (about 6.4x), and further, by looking at
>>>> the generated assembly code, we see 55 cycles of overhead for every 8
>>>> output samples, so just this generates
>>>>> about 232000/8 * 55 = 1.6 Mcycles.
>>>>>
>>>>> The intrinsic C code does a better job of decreasing outer loop
>>>> overheads. So, one gets lesser outer loop overhead,
>>>> which is what I think you are seeing.
>>>>> Please make sure that the effects of "cold cache" are taken care
>>>> of by running the code twice: ignore the results
>>>> of the first time, then measure the results from the second time.
>>>>> The cycle count formula you are using seems to be off. In the most
>>>> optimal loop in your assembly code, which will be
>>>> the path taken based on your number of inputs and filter taps, we see
>>>> 32 multiplies in 9 cycles, or 3.5
>>>> multiplies/cycle. So, formula is roughly:
>>>>> x_len/8 * [55 + h_len/4]
>>>>>
>>>>> It is working on 8 o/p samples and 4 filter taps in parallel.
>>>>>
>>>>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6
>>>> Mcycles. So, it should be about 262.6/700, or roughly 0.38 seconds at 700 MHz.
>>>>> Please use the built in 64-bit counter available on all C64x+ cores
>>>> as it runs lock step with the CPU. This can be
>>>> done by reading TSCL and TSCH registers.
>>>>> I am attaching the relevant code, first call the function
>>>> TSC_enable() to enable the CPU counter. Then call TSC_read
>>>> which returns the result as a 64-bit value or long long, so the
>>>> sequence is as follows:
>>>>> long long t0, t1;
>>>>> unsigned int diff;
>>>>>
>>>>> TSC_enable();
>>>>>
>>>>> t0 = TSC_read();
>>>>> benchmark();
>>>>> t1 = TSC_read();
>>>>>
>>>>> diff = (t1 - t0);
>>>>>
>>>>> printf ("diff: %d cycles \n", diff);
>>>>>
>>>>>
>>>>> Regards
>>>>> JS
>>>>>
>>>>> -----Original Message-----
>>>>> From: c...
>>>> [mailto:c...] On Behalf
>>>> Of Vikram Ragukumar
>>>>> Sent: Friday, October 08, 2010 3:05 PM
>>>>> To: c...
>>>>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>>>>
>>>>> All,
>>>>>
>>>>> We are in the process of verifying theoretical cycle time
>>>> calculations, based on information from the generated asm
>>>> file.
>>>>> Following results and questions pertain to the attached conv.c and a
>>>> section of the generated conv.asm file.
>>>>> conv.c implements a basic convolution function.
>>>>>
>>>>> Our setup details are :
>>>>> - EVM6472, used to run .out file generated from a SYS/BIOS based
>>>>> project
>>>>> - CCS v4.2
>>>>> - SYS/BIOS (BIOSv6)
>>>>> - CGT v7.0.3
>>>>>
>>>>> There appears to be a mismatch in timing measurements, with "actual
>>>> run time" being about 9 times the "computed run
>>>> time".
>>>>> From the C code,
>>>> Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000 Number of inner loop iterations = h_len = 36000
>>>>> From asm file, Cycle time for software pipelined loop is
>>>>>
>>>>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions
>>>> for a SINGLE SCHEDULED ITERATION.
>>>>> Hence, computed run time = x_len*(10+trip_cnt*2),
>>>>> trip_cnt = h_len/4 (2 DOTP2's per iteration)
>>>>>
>>>>> Therefore,
>>>>> Computed run time = x_len*(10+h_len/2) = 4178320000 cycles.
>>>>> At 700Mhz, this would correspond to a time of 6s (approx).
>>>>>
>>>>> However actual run time of the program was determined to be 52s. Run
>>>> time was calculated by inserting SYS/BIOS
>>>> (BIOS6) Clock_getTicks() calls right before and immediately after the
>>>> outer loop. Also, we made sure that interrupts
>>>> were disabled during execution of the loops, this had no effect on run
>>>> time measured.
>>>>> (1) Is there something wrong with how "Computed run time" is being
>>>> calculated or how we are measuring "Actual run
>>>> time" ?
>>>>> (2) As part of the Software Pipeline information, the compiler
>>>> outputs a line that reads something like "ii = 2
>>>> Schedule found with 6 iterations in parallel". Does the word
>>>> "iterations" in the above phrase correspond to actual
>>>> inner loop iteration in c code ? If so, is the number 6 in the above
>>>> example, to be used in computing "expected run
>>>> time" ?
>>>>> Thank you in advance for your help.
>>>>>
>>>>> Regards,
>>>>> Vikram.
>>>

_____________________________________
Reply by mikedunn October 13, 2010
Jeff,

On 10/13/2010 1:28 PM, Jeff Brower wrote:
> Mike-
>
> Thanks for the detailed answer. We're not using an XDS560, but that's a good trick to keep in mind. In our current
> tests we have SYS/BIOS (the new TI name for DSP/BIOS, actually BIOS v6) disabled, so we can rule that out.
>
> Thanks for the heads-up about AET. I don't think the C6472 has this though.
DS p259 7.20.1 Advanced Event Triggering (AET)

mikedunn
> I'll report back once I get an answer about Free Run.
>
> -Jeff
>
>> On 10/12/2010 6:13 PM, Jeff Brower wrote:
>>> All-
>>>
>>> I wanted to add that I'm still concerned about possible JTAG + USB
>>> "background communication", a theory that would fit
>>> the constant slow-down factor, assuming that CCS4 talks with the chip
>>> on a periodic basis.
>>>
>> If you have no BPs set, no stdio, and no RTDX, I am pretty sure that the
>> only JTAG activity is "Are you halted?" polling from CCS. This polling
>> only reads a status register and checks the CPU halted status. The
>> status register logic is located entirely in the JTAG clock domain and
>> does not affect processor operation.
>> Since we have all been burned by 'things *should* work this way' in the
>> past, you may be able to verify this empirically.
>> If you have an XDS560 emulator, run tests as normal.
>> Then change the JTAG clock rate to 1Mhz and repeat the tests.
>> If there is any 'JTAG interference', you could see a higher cycle count.
>> FYI - I haven't worked with the 6472/CCS4 combination - on some
>> target/emulator combinations you can run into strange side effects if
>> you set the JTAG clock rate below 1Mhz.
>> There might be a wild card relating to DSP/BIOS...
>>> I didn't get any reply to
>>> my question about "where is Free Run" in CCS4.
>>>
>> I have very minimal use with CCS4/DSP so I can't answer the 'Run Free'
>> question, but I would expect that the wizards of TI would provide an
>> answer - even if it is "we don't need no stinkin' Free Run" :-)
>>>
>>> A couple of other notes:
>>>
>>> 1) Andrey's x, h, and y data are aligned to multiple of 8 (we can see
>>> it in the .map file).
>>>
>>> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y
>>> is in SL2 (shared mem, onchip). If we had
>>> pipeline stalls due to internal memory bus conflicts, I would have
>>> expected the slow-down factor to get worse for
>>> increased data lengths.
>>>
>> My _opinion_ is...
>> if pipeline stalls are a problem in a synchronous test case, it is
>> very possible that they will occur at a synchronous rate and provide a
>> constant overhead [or slow down factor].
>> If your device has AET, you should be able to count pipeline stalls.
>> Note:
>> When you select 'Run' the actual AET setup takes place prior to 'pulling
>> the trigger' for the run. All detection, counting, etc. is handled by HW
>> and incurs 0 overhead.
>>
>> mikedunn
>>>
>>> 3) Andrey says "main issue" but that really should be "measurement
>>> issue". Our main issue is still to generate
>>> fastest possible "natural C code". But without accurate measurement
>>> using the EVM C6472, we are stuck between two
>>> problems, so we're trying to tackle one at a time.
>>>
>>> -Jeff
>>>
>>> ------------------ Original Message ----------------
>>> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>>> From: "Andrey Krokhin"
>>> Date: Tue, October 12, 2010 5:23 pm
>>> To: "Sankaran, Jagadeesh"
>>> Cc: "Vikram Ragukumar"
>>> "Jeff Brower"
>>> "c..."
>>> ----------------
>>>
>>> Hello everyone,
>>>
>>> In order to focus on the main issue, the difference between expected
>>> and real clock cycles, I tried a completely new
>>> approach. I created a new project using TI's "benchmark" fir_r8
>>> routine, which has been hand-coded in assembly. I used
>>> their sample project which just called the fir_r8 function once from
>>> main. To get the cycle count, I used the approach
>>> suggested by Jagadeesh: do a TSC_read() before and after the function
>>> call.
>>>
>>> The original TI code had NH = 32, NR = 256. When I ran that code,
>>> I got a cycle count of 1075, which is close to
>>> what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried
>>> increasing the array size. I ran into a memory
>>> limitation, since TI was placing everything in DMC_SRAM. So, I placed
>>> the input and filter arrays in LL2RAM and the
>>> output array in SL2RAM (shared memory). I ran the program with larger
>>> array sizes and the results I got were:
>>>
>>> NH: 40000
>>> NR: 230000
>>> Expected: 1150000022 cycles
>>> Actual: 2013247548 cycles
>>>
>>> So, it's off by a factor of ~1.75. This factor stays constant when I
>>> vary the array size.
>>>
>>> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
>>> legacy CCS 3 project), and makes a single
>>> function call to the fir_r8 assembly routine from main, so there are
>>> no overhead issues. The previous discussion about
>>> whether we are counting the cycles correctly doesn't apply, since I am
>>> measuring the cycles directly from hardware.
>>> Any suggestions on what else might account for the difference? In
>>> particular, does placement of the data matter (e.g.
>>> LL2RAM vs DMC_SRAM?)
>>>
>>> Regards,
>>> Andrey
>>>
>>>> All,
>>>>
>>>> The outer loop's loop count dominates the inner loop's loop count by a
>>> factor of 232/36 (about 6.4x), and further, by looking at
>>> the generated assembly code, we see 55 cycles of overhead for every 8
>>> output samples, so just this generates
>>>> about 232000/8 * 55 = 1.6 Mcycles.
>>>>
>>>> The intrinsic C code does a better job of decreasing outer loop
>>> overheads. So, one gets lesser outer loop overhead,
>>> which is what I think you are seeing.
>>>> Please make sure that the effects of "cold cache" are taken care
>>> of by running the code twice: ignore the results
>>> of the first time, then measure the results from the second time.
>>>> The cycle count formula you are using seems to be off. In the most
>>> optimal loop in your assembly code, which will be
>>> the path taken based on your number of inputs and filter taps, we see
>>> 32 multiplies in 9 cycles, or 3.5
>>> multiplies/cycle. So, formula is roughly:
>>>> x_len/8 * [55 + h_len/4]
>>>>
>>>> It is working on 8 o/p samples and 4 filter taps in parallel.
>>>>
>>>> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6
>>> Mcycles. So, it should be about 262.6/700, or roughly 0.38 seconds at 700 MHz.
>>>> Please use the built in 64-bit counter available on all C64x+ cores
>>> as it runs lock step with the CPU. This can be
>>> done by reading TSCL and TSCH registers.
>>>> I am attaching the relevant code, first call the function
>>> TSC_enable() to enable the CPU counter. Then call TSC_read
>>> which returns the result as a 64-bit value or long long, so the
>>> sequence is as follows:
>>>> long long t0, t1;
>>>> unsigned int diff;
>>>>
>>>> TSC_enable();
>>>>
>>>> t0 = TSC_read();
>>>> benchmark();
>>>> t1 = TSC_read();
>>>>
>>>> diff = (t1 - t0);
>>>>
>>>> printf ("diff: %d cycles \n", diff);
>>>>
>>>>
>>>> Regards
>>>> JS
>>>>
>>>> -----Original Message-----
>>>> From: c...
>>> [mailto:c...] On Behalf
>>> Of Vikram Ragukumar
>>>> Sent: Friday, October 08, 2010 3:05 PM
>>>> To: c...
>>>> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>>>>
>>>> All,
>>>>
>>>> We are in the process of verifying theoretical cycle time
>>> calculations, based on information from the generated asm
>>> file.
>>>> Following results and questions pertain to the attached conv.c and a
>>> section of the generated conv.asm file.
>>>> conv.c implements a basic convolution function.
>>>>
>>>> Our setup details are :
>>>> - EVM6472, used to run .out file generated from a SYS/BIOS based
>>>> project
>>>> - CCS v4.2
>>>> - SYS/BIOS (BIOSv6)
>>>> - CGT v7.0.3
>>>>
>>>> There appears to be a mismatch in timing measurements, with "actual
>>> run time" being about 9 times the "computed run
>>> time".
>>>> From the C code,
>>>> Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000 Number of inner loop iterations = h_len = 36000
>>>> From asm file, Cycle time for software pipelined loop is
>>>>
>>>> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions
>>> for a SINGLE SCHEDULED ITERATION.
>>>> Hence, computed run time = x_len*(10+trip_cnt*2),
>>>> trip_cnt = h_len/4 (2 DOTP2's per iteration)
>>>>
>>>> Therefore,
>>>> Computed run time = x_len*(10+h_len/2) = 4178320000 cycles.
>>>> At 700Mhz, this would correspond to a time of 6s (approx).
>>>>
>>>> However actual run time of the program was determined to be 52s. Run
>>> time was calculated by inserting SYS/BIOS
>>> (BIOS6) Clock_getTicks() calls right before and immediately after the
>>> outer loop. Also, we made sure that interrupts
>>> were disabled during execution of the loops, this had no effect on run
>>> time measured.
>>>> (1) Is there something wrong with how "Computed run time" is being
>>> calculated or how we are measuring "Actual run
>>> time" ?
>>>> (2) As part of the Software Pipeline information, the compiler
>>> outputs a line that reads something like "ii = 2
>>> Schedule found with 6 iterations in parallel". Does the word
>>> "iterations" in the above phrase correspond to actual
>>> inner loop iteration in c code ? If so, is the number 6 in the above
>>> example, to be used in computing "expected run
>>> time" ?
>>>> Thank you in advance for your help.
>>>>
>>>> Regards,
>>>> Vikram.
>>>

_____________________________________
Reply by Jeff Brower October 13, 2010
Mike-

Thanks for the detailed answer. We're not using an XDS560, but that's a good trick to keep in mind. In our current
tests we have SYS/BIOS (the new TI name for DSP/BIOS, actually BIOS v6) disabled, so we can rule that out.

Thanks for the heads-up about AET. I don't think the C6472 has this though.

I'll report back once I get an answer about Free Run.

-Jeff

>
> On 10/12/2010 6:13 PM, Jeff Brower wrote:
>>
>> All-
>>
>> I wanted to add that I'm still concerned about possible JTAG + USB
>> "background communication", a theory that would fit
>> the constant slow-down factor, assuming that CCS4 talks with the chip
>> on a periodic basis.
>>
> If you have no BPs set, no stdio, and no RTDX, I am pretty sure that the
> only JTAG activity is "Are you halted?" polling from CCS. This polling
> only reads a status register and checks the CPU halted status. The
> status register logic is located entirely in the JTAG clock domain and
> does not affect processor operation.
> Since we have all been burned by 'things *should* work this way' in the
> past, you may be able to verify this empirically.
> If you have an XDS560 emulator, run tests as normal.
> Then change the JTAG clock rate to 1Mhz and repeat the tests.
> If there is any 'JTAG interference', you could see a higher cycle count.
> FYI - I haven't worked with the 6472/CCS4 combination - on some
> target/emulator combinations you can run into strange side effects if
> you set the JTAG clock rate below 1Mhz.
> There might be a wild card relating to DSP/BIOS...
>>
>> I didn't get any reply to
>> my question about "where is Free Run" in CCS4.
>>
> I have very minimal use with CCS4/DSP so I can't answer the 'Run Free'
> question, but I would expect that the wizards of TI would provide an
> answer - even if it is "we don't need no stinkin' Free Run" :-)
>> A couple of other notes:
>>
>> 1) Andrey's x, h, and y data are aligned to multiple of 8 (we can see
>> it in the .map file).
>>
>> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y
>> is in SL2 (shared mem, onchip). If we had
>> pipeline stalls due to internal memory bus conflicts, I would have
>> expected the slow-down factor to get worse for
>> increased data lengths.
>>
> My _opinion_ is...
> if pipeline stalls are a problem in a synchronous test case, it is
> very possible that they will occur at a synchronous rate and provide a
> constant overhead [or slow down factor].
> If your device has AET, you should be able to count pipeline stalls.
> Note:
> When you select 'Run' the actual AET setup takes place prior to 'pulling
> the trigger' for the run. All detection, counting, etc. is handled by HW
> and incurs 0 overhead.
>
> mikedunn
>> 3) Andrey says "main issue" but that really should be "measurement
>> issue". Our main issue is still to generate
>> fastest possible "natural C code". But without accurate measurement
>> using the EVM C6472, we are stuck between two
>> problems, so we're trying to tackle one at a time.
>>
>> -Jeff
>>
>> ------------------ Original Message ----------------
>> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
>> From: "Andrey Krokhin"
>> Date: Tue, October 12, 2010 5:23 pm
>> To: "Sankaran, Jagadeesh"
>> Cc: "Vikram Ragukumar"
>> "Jeff Brower"
>> "c..."
>> ----------------
>>
>> Hello everyone,
>>
>> In order to focus on the main issue, the difference between expected
>> and real clock cycles, I tried a completely new
>> approach. I created a new project using TI's "benchmark" fir_r8
>> routine, which has been hand-coded in assembly. I used
>> their sample project which just called the fir_r8 function once from
>> main. To get the cycle count, I used the approach
>> suggested by Jagadeesh: do a TSC_read() before and after the function
>> call.
>>
>> The original TI code had NH = 32, NR = 256. When I ran that code,
>> I got a cycle count of 1075, which is close to
>> what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried
>> increasing the array size. I ran into a memory
>> limitation, since TI was placing everything in DMC_SRAM. So, I placed
>> the input and filter arrays in LL2RAM and the
>> output array in SL2RAM (shared memory). I ran the program with larger
>> array sizes and the results I got were:
>>
>> NH: 40000
>> NR: 230000
>> Expected: 1150000022 cycles
>> Actual: 2013247548 cycles
>>
>> So, it's off by a factor of ~1.75. This factor stays constant when I
>> vary the array size.
>>
>> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
>> legacy CCS 3 project), and makes a single
>> function call to the fir_r8 assembly routine from main, so there are
>> no overhead issues. The previous discussion about
>> whether we are counting the cycles correctly doesn't apply, since I am
>> measuring the cycles directly from hardware.
>> Any suggestions on what else might account for the difference? In
>> particular, does placement of the data matter (e.g.
>> LL2RAM vs DMC_SRAM?)
>>
>> Regards,
>> Andrey
>>
>> > All,
>> >
>> > The outer loop's loop count dominates the inner loop's loop count by a
>> factor of 232/36 (about 6.4x), and further, by looking at
>> the generated assembly code, we see 55 cycles of overhead for every 8
>> output samples, so just this generates
>> > about 232000/8 * 55 = 1.6 Mcycles.
>> >
>> > The intrinsic C code does a better job of decreasing outer loop
>> overheads. So, one gets lesser outer loop overhead,
>> which is what I think you are seeing.
>> >
>> > Please make sure that the effects of "cold cache" are taken care
>> of by running the code twice: ignore the results
>> of the first time, then measure the results from the second time.
>> >
>> > The cycle count formula you are using seems to be off. In the most
>> optimal loop in your assembly code, which will be
>> the path taken based on your number of inputs and filter taps, we see
>> 32 multiplies in 9 cycles, or 3.5
>> multiplies/cycle. So, formula is roughly:
>> >
>> > x_len/8 * [55 + h_len/4]
>> >
>> > It is working on 8 o/p samples and 4 filter taps in parallel.
>> >
>> > Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6
>> Mcycles. So, it should be about 262.6/700, or roughly 0.38 seconds at 700 MHz.
>> >
>> > Please use the built in 64-bit counter available on all C64x+ cores
>> as it runs lock step with the CPU. This can be
>> done by reading TSCL and TSCH registers.
>> >
>> > I am attaching the relevant code, first call the function
>> TSC_enable() to enable the CPU counter. Then call TSC_read
>> which returns the result as a 64-bit value or long long, so the
>> sequence is as follows:
>> >
>> > long long t0, t1;
>> > unsigned int diff;
>> >
>> > TSC_enable();
>> >
>> > t0 = TSC_read();
>> > benchmark();
>> > t1 = TSC_read();
>> >
>> > diff = (t1 - t0);
>> >
>> > printf ("diff: %d cycles \n", diff);
>> >
>> >
>> > Regards
>> > JS
>> >
>> > -----Original Message-----
>> > From: c...
>> [mailto:c... ] On Behalf
>> Of Vikram Ragukumar
>> > Sent: Friday, October 08, 2010 3:05 PM
>> > To: c...
>> > Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>> >
>> > All,
>> >
>> > We are in the process of verifying theoretical cycle time
>> calculations, based on information from the generated asm
>> file.
>> >
>> > Following results and questions pertain to the attached conv.c and a
>> section of the generated conv.asm file.
>> > conv.c implements a basic convolution function.
>> >
>> > Our setup details are :
>> > - EVM6472, used to run .out file generated from a SYS/BIOS based
>> > project
>> > - CCS v4.2
>> > - SYS/BIOS (BIOSv6)
>> > - CGT v7.0.3
>> >
>> > There appears to be a mismatch in timing measurements, with "actual
>> run time" being about 9 times the "computed run
>> time".
>> >
>> > From the C code,
>> > Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000 Number of inner loop iterations = h_len = 36000
>> >
>> > From asm file, Cycle time for software pipelined loop is
>> >
>> > Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions
>> for a SINGLE SCHEDULED ITERATION.
>> >
>> > Hence, computed run time = x_len*(10+trip_cnt*2),
>> > trip_cnt = h_len/4 (2 DOTP2's per iteration)
>> >
>> > Therefore,
>> > Computed run time = x_len*(10+h_len/2) = 4178320000 cycles.
>> > At 700Mhz, this would correspond to a time of 6s (approx).
>> >
>> > However actual run time of the program was determined to be 52s. Run
>> time was calculated by inserting SYS/BIOS
>> (BIOS6) Clock_getTicks() calls right before and immediately after the
>> outer loop. Also, we made sure that interrupts
>> were disabled during execution of the loops, this had no effect on run
>> time measured.
>> >
>> > (1) Is there something wrong with how "Computed run time" is being
>> calculated or how we are measuring "Actual run
>> time" ?
>> >
>> > (2) As part of the Software Pipeline information, the compiler
>> outputs a line that reads something like "ii = 2
>> Schedule found with 6 iterations in parallel". Does the word
>> "iterations" in the above phrase correspond to actual
>> inner loop iteration in c code ? If so, is the number 6 in the above
>> example, to be used in computing "expected run
>> time" ?
>> >
>> > Thank you in advance for your help.
>> >
>> > Regards,
>> > Vikram.
>>
>

_____________________________________
Reply by mikedunn October 13, 2010
Hello Jeff,

On 10/12/2010 6:13 PM, Jeff Brower wrote:
>
> All-
>
> I wanted to add that I'm still concerned about possible JTAG + USB
> "background communication", a theory that would fit
> the constant slow-down factor, assuming that CCS4 talks with the chip
> on a periodic basis.
>
If you have no BPs set, no stdio, and no RTDX, I am pretty sure that the
only JTAG activity is "Are you halted?" polling from CCS. This polling
only reads a status register and checks the CPU halted status. The
status register logic is located entirely in the JTAG clock domain and
does not affect processor operation.
Since we have all been burned by 'things *should* work this way' in the
past, you may be able to verify this empirically.
If you have an XDS560 emulator, run tests as normal.
Then change the JTAG clock rate to 1Mhz and repeat the tests.
If there is any 'JTAG interference', you could see a higher cycle count.
FYI - I haven't worked with the 6472/CCS4 combination - on some
target/emulator combinations you can run into strange side effects if
you set the JTAG clock rate below 1Mhz.
There might be a wild card relating to DSP/BIOS...
>
> I didn't get any reply to
> my question about "where is Free Run" in CCS4.
>
I have very minimal use with CCS4/DSP so I can't answer the 'Run Free'
question, but I would expect that the wizards of TI would provide an
answer - even if it is "we don't need no stinkin' Free Run" :-)
> A couple of other notes:
>
> 1) Andrey's x, h, and y data are aligned to multiple of 8 (we can see
> it in the .map file).
>
> 2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y
> is in SL2 (shared mem, onchip). If we had
> pipeline stalls due to internal memory bus conflicts, I would have
> expected the slow-down factor to get worse for
> increased data lengths.
>
My _opinion_ is...
if pipeline stalls are a problem in a synchronous test case, it is
very possible that they will occur at a synchronous rate and provide a
constant overhead [or slow down factor].
If your device has AET, you should be able to count pipeline stalls.
Note:
When you select 'Run' the actual AET setup takes place prior to 'pulling
the trigger' for the run. All detection, counting, etc. is handled by HW
and incurs 0 overhead.

mikedunn
> 3) Andrey says "main issue" but that really should be "measurement
> issue". Our main issue is still to generate
> fastest possible "natural C code". But without accurate measurement
> using the EVM C6472, we are stuck between two
> problems, so we're trying to tackle one at a time.
>
> -Jeff
>
> ------------------ Original Message ----------------
> Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
> From: "Andrey Krokhin"
> Date: Tue, October 12, 2010 5:23 pm
> To: "Sankaran, Jagadeesh"
> Cc: "Vikram Ragukumar"
> "Jeff Brower"
> "c..."
> ----------------
>
> Hello everyone,
>
> In order to focus on the main issue, the difference between expected
> and real clock cycles, I tried a completely new
> approach. I created a new project using TI's "benchmark" fir_r8
> routine, which has been hand-coded in assembly. I used
> their sample project which just called the fir_r8 function once from
> main. To get the cycle count, I used the approach
> suggested by Jagadeesh: do a TSC_read() before and after the function
> call.
>
> The original TI code had NH = 32, NR = 256. When I ran that code,
> I got a cycle count of 1075, which is close to
> what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried
> increasing the array size. I ran into a memory
> limitation, since TI was placing everything in DMC_SRAM. So, I placed
> the input and filter arrays in LL2RAM and the
> output array in SL2RAM (shared memory). I ran the program with larger
> array sizes and the results I got were:
>
> NH: 40000
> NR: 230000
> Expected: 1150000022 cycles
> Actual: 2013247548 cycles
>
> So, it's off by a factor of ~1.75. This factor stays constant when I
> vary the array size.
>
> TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a
> legacy CCS 3 project), and makes a single
> function call to the fir_r8 assembly routine from main, so there are
> no overhead issues. The previous discussion about
> whether we are counting the cycles correctly doesn't apply, since I am
> measuring the cycles directly from hardware.
> Any suggestions on what else might account for the difference? In
> particular, does placement of the data matter (e.g.
> LL2RAM vs DMC_SRAM?)
>
> Regards,
> Andrey
>
> > All,
> >
> > The outer loop's loop count dominates the inner loop's loop count by a
> factor of 232/36 (about 6.4x), and further, by looking at
> the generated assembly code, we see 55 cycles of overhead for every 8
> output samples, so just this generates
> > about 232000/8 * 55 = 1.6 Mcycles.
> >
> > The intrinsic C code does a better job of decreasing outer loop
> overheads. So, one gets lesser outer loop overhead,
> which is what I think you are seeing.
> >
> > Please make sure that the effects of "cold cache" are taken care
> of by running the code twice: ignore the results
> of the first time, then measure the results from the second time.
> >
> > The cycle count formula you are using seems to be off. In the most
> optimal loop in your assembly code, which will be
> the path taken based on your number of inputs and filter taps, we see
> 32 multiplies in 9 cycles, or 3.5
> multiplies/cycle. So, formula is roughly:
> >
> > x_len/8 * [55 + h_len/4]
> >
> > It is working on 8 o/p samples and 4 filter taps in parallel.
> >
> > Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6
> Mcycles. So, it should be about 262.6/700, or roughly 0.38 seconds at 700 MHz.
> >
> > Please use the built in 64-bit counter available on all C64x+ cores
> as it runs lock step with the CPU. This can be
> done by reading TSCL and TSCH registers.
> >
> > I am attaching the relevant code, first call the function
> TSC_enable() to enable the CPU counter. Then call TSC_read
> which returns the result as a 64-bit value or long long, so the
> sequence is as follows:
> >
> > long long t0, t1;
> > unsigned int diff;
> >
> > TSC_enable();
> >
> > t0 = TSC_read();
> > benchmark();
> > t1 = TSC_read();
> >
> > diff = (t1 - t0);
> >
> > printf ("diff: %d cycles \n", diff);
> >
> >
> > Regards
> > JS
> >
> > -----Original Message-----
> > From: c...
> [mailto:c... ] On Behalf
> Of Vikram Ragukumar
> > Sent: Friday, October 08, 2010 3:05 PM
> > To: c...
> > Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
> >
> > All,
> >
> > We are in the process of verifying theoretical cycle time
> calculations, based on information from the generated asm
> file.
> >
> > Following results and questions pertain to the attached conv.c and a
> section of the generated conv.asm file.
> > conv.c implements a basic convolution function.
> >
> > Our setup details are :
> > - EVM6472, used to run .out file generated from a SYS/BIOS based
> > project
> > - CCS v4.2
> > - SYS/BIOS (BIOSv6)
> > - CGT v7.0.3
> >
> > There appears to be a mismatch in timing measurements, with "actual
> run time" being about 9 times the "computed run
> time".
> >
> > From the C code,
> > Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000 Number of inner loop iterations = h_len = 36000
> >
> > From asm file, Cycle time for software pipelined loop is
> >
> > Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions
> for a SINGLE SCHEDULED ITERATION.
> >
> > Hence, computed run time = x_len*(10+trip_cnt*2),
> > trip_cnt = h_len/4 (2 DOTP2's per iteration)
> >
> > Therefore,
> > Computed run time = x_len*(10+h_len/2) = 4178320000 cycles.
> > At 700Mhz, this would correspond to a time of 6s (approx).
> >
> > However actual run time of the program was determined to be 52s. Run
> time was calculated by inserting SYS/BIOS
> (BIOS6) Clock_getTicks() calls right before and immediately after the
> outer loop. Also, we made sure that interrupts
> were disabled during execution of the loops, this had no effect on run
> time measured.
> >
> > (1) Is there something wrong with how "Computed run time" is being
> calculated or how we are measuring "Actual run
> time" ?
> >
> > (2) As part of the Software Pipeline information, the compiler
> outputs a line that reads something like "ii = 2
> Schedule found with 6 iterations in parallel". Does the word
> "iterations" in the above phrase correspond to actual
> inner loop iteration in c code ? If so, is the number 6 in the above
> example, to be used in computing "expected run
> time" ?
> >
> > Thank you in advance for your help.
> >
> > Regards,
> > Vikram.
Reply by Jeff Brower October 12, 2010
All-

I wanted to add that I'm still concerned about possible JTAG + USB "background communication", a theory that would fit
the constant slow-down factor, assuming that CCS4 talks with the chip on a periodic basis. I didn't get any reply to
my question about "where is Free Run" in CCS4.

A couple of other notes:

1) Andrey's x, h, and y data are aligned to multiple of 8 (we can see it in the .map file).

2) Code (.text) is in L1P (onchip), x and h are in LL2 (onchip), and y is in SL2 (shared mem, onchip). If we had
pipeline stalls due to internal memory bus conflicts, I would have expected the slow-down factor to get worse for
increased data lengths.

3) Andrey says "main issue" but that really should be "measurement issue". Our main issue is still to generate
fastest possible "natural C code". But without accurate measurement using the EVM C6472, we are stuck between two
problems, so we're trying to tackle one at a time.

-Jeff

------------------ Original Message ----------------
Subject: RE: [c6x] Mismatch in Timing measurements [1 Attachment]
From: "Andrey Krokhin"
Date: Tue, October 12, 2010 5:23 pm
To: "Sankaran, Jagadeesh"
Cc: "Vikram Ragukumar"
"Jeff Brower"
"c..."
----------------

Hello everyone,

In order to focus on the main issue, the difference between expected and real clock cycles, I tried a completely new
approach. I created a new project using TI's "benchmark" fir_r8 routine, which has been hand-coded in assembly. I used
their sample project which just called the fir_r8 function once from main. To get the cycle count, I used the approach
suggested by Jagadeesh: do a TSC_read() before and after the function call.

The original TI code had NH = 32, NR = 256. When I ran that code, I got a cycle count of 1075, which is close to
what it should be with TI's formula: NH*NR/8+22 = 1046. Next, I tried increasing the array size. I ran into a memory
limitation, since TI was placing everything in DMC_SRAM. So, I placed the input and filter arrays in LL2RAM and the
output array in SL2RAM (shared memory). I ran the program with larger array sizes and the results I got were:

NH: 40000
NR: 230000
Expected: 1150000022 cycles
Actual: 2013247548 cycles

So, it's off by a factor of ~1.75. This factor stays constant when I vary the array size.

TI's sample project for fir_r8 doesn't use SYS/BIOS (in fact, it's a legacy CCS 3 project), and makes a single
function call to the fir_r8 assembly routine from main, so there are no overhead issues. The previous discussion about
whether we are counting the cycles correctly doesn't apply, since I am measuring the cycles directly from hardware.
Any suggestions on what else might account for the difference? In particular, does placement of the data matter (e.g.
LL2RAM vs DMC_SRAM?)

Regards,
Andrey
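
A side note on the placement question above: with TI's compiler, this kind of placement is normally expressed with DATA_SECTION/DATA_ALIGN pragmas plus matching SECTIONS entries in the linker command file. A minimal sketch, assuming 16-bit data and purely illustrative array, length, and section names (not the actual project sources):

/* Illustrative only: the ".ll2data"/".sl2data" section names must be mapped
   onto the LL2 and SL2 address ranges in the project's linker .cmd file. */
#define NH 40000               /* filter taps, as in the test above     */
#define NR 230000              /* output samples, as in the test above  */

#pragma DATA_SECTION(x, ".ll2data")   /* input samples in local L2 (LL2)  */
#pragma DATA_ALIGN(x, 8)
short x[NR + NH - 1];

#pragma DATA_SECTION(h, ".ll2data")   /* filter coefficients in local L2  */
#pragma DATA_ALIGN(h, 8)
short h[NH];

#pragma DATA_SECTION(y, ".sl2data")   /* output array in shared L2 (SL2)  */
#pragma DATA_ALIGN(y, 8)
short y[NR];
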
> All,
>
> The outer loop's loop count dominates the inner loop's loop count by a factor of 232/36 (about 6.4x), and further, by looking at
the generated assembly code, we see 55 cycles of overhead for every 8 output samples, so just this generates
> about 232000/8 * 55 = 1.6 Mcycles.
>
> The intrinsic C code does a better job of decreasing outer loop overheads. So, one gets lesser outer loop overhead,
which is what I think you are seeing.
>
> Please make sure that the effects of "cold cache" are taken care of by running the code twice: ignore the results
of the first time, then measure the results from the second time.
>
> The cycle count formula you are using seems to be off. In the most optimal loop in your assembly code, which will be
the path taken based on your number of inputs and filter taps, we see 32 multiplies in 9 cycles, or 3.5
multiplies/cycle. So, formula is roughly:
>
> x_len/8 * [55 + h_len/4]
>
> It is working on 8 o/p samples and 4 filter taps in parallel.
>
> Based on this: 232000/8 * [55 + 36000/4] = 1.6 Mcycles + 261 Mcycles = 262.6 Mcycles. So, it should be about 262.6/700, or roughly 0.38 seconds at 700 MHz.
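
A quick back-of-envelope check of the estimate above, using only the numbers already quoted in this thread (700 MHz being the EVM6472 CPU clock mentioned elsewhere in the discussion):

#include <stdio.h>

int main(void)
{
    long long x_len = 232000, h_len = 36000;            /* outer/inner trip counts    */
    long long cycles = (x_len / 8) * (55 + h_len / 4);  /* 29000 * 9055 = 262,595,000 */

    printf("estimated: %lld cycles, %.2f s at 700 MHz\n",
           cycles, (double)cycles / 700e6);             /* prints roughly 0.38 s      */
    return 0;
}
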
>
> Please use the built in 64-bit counter available on all C64x+ cores as it runs lock step with the CPU. This can be
done by reading TSCL and TSCH registers.
>
> I am attaching the relevant code, first call the function TSC_enable() to enable the CPU counter. Then call TSC_read
which returns the result as a 64-bit value or long long, so the sequence is as follows:
>
> long long t0, t1;
> unsigned int diff;
>
> TSC_enable();
>
> t0 = TSC_read();
> benchmark();
> t1 = TSC_read();
>
> diff = (t1 - t0);
>
> printf ("diff: %d cycles \n", diff);
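
The attached TSC code itself is not reproduced in this archive. As a rough sketch, TSC_enable() and TSC_read() are commonly written along these lines with TI's C6000 compiler, using the TSCL/TSCH control registers declared in c6x.h (this is an assumption about what the attachment contains, not a copy of it):

#include <c6x.h>   /* TI CGT header declaring the TSCL/TSCH control registers */

void TSC_enable(void)
{
    TSCL = 0;   /* any write to TSCL starts the free-running 64-bit counter */
}

long long TSC_read(void)
{
    unsigned int lo = TSCL;   /* read TSCL first: this latches the upper half, */
    unsigned int hi = TSCH;   /* so the following TSCH read is coherent        */
    return ((long long)hi << 32) | lo;
}

Per the cold-cache note above, calling benchmark() once untimed before the measured call keeps compulsory cache misses out of the reported cycle count.
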
> Regards
> JS
>
> -----Original Message-----
> From: c... [mailto:c...] On Behalf Of Vikram Ragukumar
> Sent: Friday, October 08, 2010 3:05 PM
> To: c...
> Subject: [c6x] Mismatch in Timing measurements [2 Attachments]
>
> All,
>
> We are in the process of verifying theoretical cycle time calculations, based on information from the generated asm
file.
>
> Following results and questions pertain to the attached conv.c and a section of the generated conv.asm file.
> conv.c implements a basic convolution function.
>
> Our setup details are :
> - EVM6472, used to run .out file generated from a SYS/BIOS based
> project
> - CCS v4.2
> - SYS/BIOS (BIOSv6)
> - CGT v7.0.3
>
> There appears to be a mismatch in timing measurements, with "actual run time" being about 9 times the "computed run
time".
>
> From the C code,
> Number of outer loop iterations = x_len + h_len = 196000+36000 = 232000 Number of inner loop iterations = h_len = 36000
>
> From asm file, Cycle time for software pipelined loop is
>
> Total cycles (est) = 10 + trip_cnt * 2, with 2 DOTP2 instructions for a SINGLE SCHEDULED ITERATION.
>
> Hence, computed run time = x_len*(10+trip_cnt*2),
> trip_cnt = h_len/4 (2 DOTP2's per iteration)
>
> Therefore,
> Computed run time = x_len*(10+h_len/2) = 4178320000 cycles.
> At 700Mhz, this would correspond to a time of 6s (approx).
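
The arithmetic above can be checked the same way; a short illustrative snippet (the 10-cycle and 2-cycle figures are the compiler's software-pipelining estimates from conv.asm):

#include <stdio.h>

int main(void)
{
    long long x_len = 232000, h_len = 36000;
    long long trip_cnt = h_len / 4;                   /* two DOTP2s per scheduled iteration */
    long long cycles = x_len * (10 + trip_cnt * 2);   /* 232000 * 18010 = 4,178,320,000     */

    printf("computed: %lld cycles, %.2f s at 700 MHz\n",
           cycles, (double)cycles / 700e6);           /* prints roughly 5.97 s              */
    return 0;
}
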
>
> However actual run time of the program was determined to be 52s. Run time was calculated by inserting SYS/BIOS
(BIOS6) Clock_getTicks() calls right before and immediately after the outer loop. Also, we made sure that interrupts
were disabled during execution of the loops, this had no effect on run time measured.
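
Disabling interrupts around a measured region is usually done with the C6000 compiler's CSR intrinsics; a minimal sketch, assuming the standard intrinsics rather than the project's actual code:

#include <c6x.h>   /* TI C6000 header; _disable_interrupts()/_restore_interrupts() are compiler intrinsics */

void run_measured_region(void)
{
    unsigned int old_csr = _disable_interrupts();   /* clears the CSR GIE bit, returns the previous CSR */

    /* ... timed convolution loops go here ... */

    _restore_interrupts(old_csr);                   /* restores GIE to its previous state */
}
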
>
> (1) Is there something wrong with how "Computed run time" is being calculated or how we are measuring "Actual run
time" ?
>
> (2) As part of the Software Pipeline information, the compiler outputs a line that reads something like "ii = 2
Schedule found with 6 iterations in parallel". Does the word "iterations" in the above phrase correspond to actual
inner loop iteration in c code ? If so, is the number 6 in the above example, to be used in computing "expected run
time" ?
>
> Thank you in advance for your help.
>
> Regards,
> Vikram.

_____________________________________