DSPRelated.com
Forums

C64X+ DSP runs inefficiently - How to solve?

Started by Coskun AYYILDIZ April 25, 2011
Hi everyone,

When my code turned out to be slower than I expected, I wrote a simple test program and saw the following problem.

My test code is the following:

int i, p1, p2, p3;

void main()
{
    *(int*)0x01840000 = 0x02; // use 64KB of L2 cache
    *(int*)0x01840020 = 0x04; // use 32KB of L1P cache
    *(int*)0x01840040 = 0x07; // use 32KB of L1D cache
    *(int*)0x01840200 = 0x01; // MAR register, make 0x80000000 memory region cachable.

    p1 = 0xCDF54785; // p1 and p2 values are random
    p2 = 0x28D9A25C;
    p3 = 0;

    for(i = 0; i < 10000; i++)
    {
        p3 += (p2 - p1);
    }
    return 0;
}

After executing this little piece of code, I read the values of the TSCL and TSCH registers to see how many cycles were spent, and I saw 107153 cycles.
So, to perform 10000 summations, the DSP spent 107153 cycles. That is, on average, 10.7 cycles per summation.

Obviously, this is unexpected and it is ruining my code. How can I make the DSP do this task as expected?

It is really important to solve this problem and I'll appreciate your help.

PS: I am building the code with -O3 and -ms3.

Thanks.
You may want to ignore the cycle count for the first run: the cache is cold, so you are also measuring the cycles spent allocating lines into L2 and then into L1P or L1D.
Measuring from the second run onwards will give you steady-state performance.

Regards
JS
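
The warm-up advice above can be sketched as a small harness. This is only an illustration: `measure_steady_state`, `fake_now`, and `fake_kernel` are hypothetical names, and on the target `now()` would read the cycle counter while `kernel()` would be the loop under test. The stand-ins simply model a kernel that is expensive on its first (cold) run.

```c
#include <stdint.h>

/* Hypothetical hooks: on the target, now() would read the cycle counter
   and kernel() would be the code under test. */
typedef uint64_t (*cycle_reader_t)(void);
typedef void (*kernel_t)(void);

/* Run the kernel once to warm the caches, then return the smallest
   cycle count seen over `runs` timed repetitions. */
uint64_t measure_steady_state(kernel_t kernel, cycle_reader_t now, int runs)
{
    uint64_t best = UINT64_MAX;
    int r;

    kernel();                      /* cold run: result deliberately discarded */
    for (r = 0; r < runs; r++) {
        uint64_t t0 = now();
        kernel();
        uint64_t t1 = now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Stand-ins so the sketch can be exercised off-target: a fake counter and
   a fake kernel that costs 1000 "cycles" cold and 100 once warm. */
static uint64_t fake_cycles = 0;
static int fake_calls = 0;

uint64_t fake_now(void) { return fake_cycles; }

void fake_kernel(void)
{
    fake_cycles += (fake_calls == 0) ? 1000u : 100u;
    fake_calls++;
}
```

Taking the minimum over several warm runs is one design choice among several; an average would also work if interrupts and other disturbances are under control.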

Coskun-

> ps: I am building the code with -O3 and -ms3.

1) Isn't -ms3 the "size most critical" option? I think it disables software pipelining, which you don't want if you're worried about speed.

2) Your whole code can be rewritten as:

p3 = 10000*0x5AE45AD7;

and since you have -O3 enabled, I would have thought your code takes no cycles. Have you looked at your generated asm code to see what's really going on?

-Jeff
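
The constant in point 2 can be checked with a few lines of portable C. This is a sketch, not the poster's build: it uses unsigned 32-bit arithmetic so that wrap-around is well defined, and the function names are made up for illustration.

```c
#include <stdint.h>

/* The loop from the post, in unsigned 32-bit arithmetic so that
   wrap-around is well defined. */
uint32_t sum_by_loop(uint32_t p1, uint32_t p2, uint32_t n)
{
    uint32_t p3 = 0;
    uint32_t i;

    for (i = 0; i < n; i++)
        p3 += p2 - p1;
    return p3;
}

/* The same result as a single multiply, modulo 2^32. */
uint32_t sum_closed_form(uint32_t p1, uint32_t p2, uint32_t n)
{
    return n * (p2 - p1);
}
```

With p1 = 0xCDF54785 and p2 = 0x28D9A25C, the difference p2 - p1 wraps to 0x5AE45AD7, and both functions agree for n = 10000, which is why an optimizer is free to drop the loop.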

_____________________________________
Coskun,

It is nice that you are setting certain key registers.
However, the example seems to be missing a few key details, such as
where in memory (as set in the .cmd file) the variables i, p1, p2, and p3 are
located.

I would suggest putting some study into the 'circular code buffer' that is built
into the DSP, into linear assembly, and into whether the variables are being kept
in registers and/or being saved after each iteration/update.

It does not help to set all the key code/data areas if the code is not using them.

As a start, I would suggest adding the 'register' keyword before each
variable (which means declaring them inside main(), since 'register' is not
allowed at file scope), and making the unchanging values into constants.
Something like
register int i;
register int p3;
const int p1 = 0xCDF54785;
const int p2 = 0x28D9A25C;

Then apply #pragma MUST_ITERATE() to the loop.
If you do not want to use 'register' or 'const',
then I suggest properly aligning the data via something like
#pragma DATA_ALIGN(i, 32)
#pragma DATA_ALIGN(p1, 32)
#pragma DATA_ALIGN(p2, 32)
#pragma DATA_ALIGN(p3, 32)

NOTE: I may not have the syntax of the #pragmas exactly right, so you will have to
look them up.

You will also have to decide exactly where the data is to be located, I suggest
the data portion of the L1 cache.
You will also have to decide exactly where the code is to be located, I suggest
the code portion of the L1 cache.

For each directive/variable/code section that you provide, your .cmd file will need to
properly place that item in the memory map.

However, I notice that the code does nothing with the p3 result.
Therefore, I suspect that maximum optimization would eliminate the calculation of p3.
Then it would realize that p2 and p1 are not being used and might optimize them out
of existence.

Another detail that would speed up your code is to move the calculation of
(p2 - p1) out of the loop; what remains is just p3 = 10000 * (p2 - p1), which
needs no loop either.

Now your loop looks like
for (i = 0; i < 10000; i++)
{
;
}

Now, without any of the items still in the loop (there are none), the optimizer
may eliminate the loop altogether, unless you add '#pragma MUST_ITERATE()'
to the loop.

BTW:
My understanding is that the TSCL and TSCH registers do not necessarily start at 0.
Therefore, what you really need to do is something like this:
unsigned long long startTime = TSCL + ((unsigned long long)TSCH << 32);
unsigned long long stopTime = 0;

and at the end of your code:
stopTime = TSCL + ((unsigned long long)TSCH << 32);

then calculate the difference: elapsedTime = stopTime - startTime.
Otherwise, I think you would get all the DSP boot cycles included in the
combined TSCL/TSCH value.
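
A minimal sketch of the start/stop arithmetic, assuming TSCL and TSCH are the low and high 32-bit halves of one 64-bit cycle counter (worth confirming in the CPU reference guide, along with how the counter is started and in which order the halves should be read). The function names are illustrative, and plain integers stand in for the register reads so the arithmetic can be checked off-target.

```c
#include <stdint.h>

/* Combine the two 32-bit halves of a 64-bit counter. The high half is
   shifted by 32 bits, not 16, so the halves do not overlap. */
uint64_t combine_counter(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Elapsed cycles between two snapshots. Unsigned subtraction also gives
   the right answer when the low half has wrapped in between. */
uint64_t elapsed_cycles(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```

For a measurement as short as the 10000-iteration loop, the difference of two TSCL reads alone would also fit in 32 bits; the 64-bit form matters for longer runs.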


_____________________________________
Coskun,
On 4/25/2011 8:24 PM, Jeff Brower wrote:
>
> 1) Isn't -ms3 the option for "size most critical" ? I think it
> disables software pipelining, which you don't want if
> you're worried about speed.
>
> 2) Your whole code can be rewritten:
>
> p3 = 10000*0x5AE45AD7;
>

Jeff, I think that this was written as a test loop to fool the compiler
into running.
> and you have -O3 enabled, so I would have thought your code takes no
> cycles. Have you looked at your generated asm
> code to see what's really going on?
>

Coskun, you took the step to create a test loop [fixed environment].
Now you need to analyze what is going on. Take Jeff's suggestion and
look at the asm code that is generated to analyze the problem.
The majority of "the DSP is running slow" problems are fixed by
1. changing the code
2. rearranging the code to be more 'memory access friendly'
3. correcting the PLL setup/clock dividers to make the DSP run at the
speed that it 'should have been running'

mikedunn

_____________________________________
Coskun,

I see a few possibilities.
1) the statement that sets the L2CFG register is missing a trailing ';'
2) the statement that sets the L1PCFG register is missing a trailing ';'
3) the statement that sets the L1DCFG register is missing a trailing ';'
4) the statement that sets the MAR register is missing a trailing ';'
- I'm surprised that it compiled.
-- or maybe the cut/paste operation was faulty?

You might change the transfer size to 'long long int' rather than 'int', to cut
the number of transfers in half.
You definitely need to move the '720*576 / 4' calculation out of the loop (which
the optimizer may have done for you).

You could repeat this statement:
*(output_buffer++) = *(input_buffer++);
2, 4, 8, or 16 times in the body of the loop to reduce the number of times
the loop overhead is executed.

You could make use of the 'linear assembly' techniques, which mostly means the
addition of the appropriate prolog and epilog to the loop.
(I'm a little vague on the details, but it is demonstrated in the Rulph Chassaing
book, volume 2.)

You could declare the input and output buffer pointers 'restrict', so the
generated code does not have to check/allow for buffer overlap.

You could use the (E)DMA facilities to perform the actual transfer.

you might want to read:


R. Williams
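
Three of the suggestions above (wider transfers, manual unrolling, and 'restrict') can be combined in a portable sketch. It uses standard C99 types rather than TI intrinsics, and `copy_words64` and `regions_overlap` are hypothetical helper names. Since 'restrict' is a promise that the buffers do not overlap, the sketch includes a check: taken literally, the addresses in the copy-loop post (0x80010000 and 0x80029500) are 0x19500 bytes apart while the loop copies 720*576 = 0x65400 bytes, so those two regions do appear to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte ranges [a, a+len) and [b, b+len) share any byte. */
int regions_overlap(uint32_t a, uint32_t b, uint32_t len)
{
    uint32_t lo = a < b ? a : b;
    uint32_t hi = a < b ? b : a;

    return hi - lo < len;
}

/* Copy n64 64-bit words, four per loop iteration. n64 must be a multiple
   of 4. 'restrict' promises the compiler that src and dst never alias. */
void copy_words64(uint64_t *restrict dst, const uint64_t *restrict src, size_t n64)
{
    size_t i;

    for (i = 0; i < n64; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}
```

With an optimizing compiler the manual unrolling may be unnecessary; it is shown here only to make the reduced loop overhead explicit.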


_____________________________________
Hi everyone, thanks for your messages,

I am still having the problem. Here is a better example that expresses it more clearly.

I am reading an image of size 720x576 from the input buffer and writing the image "without any processing" to another buffer. I am measuring the cycles spent per pixel and see that 6.61 cycles are spent per pixel, which is far more than I expect.
My code and the .cmd file are shown below:

/******************************************START OF APPLICATION CODE*************************************************/
#include <stdio.h>
#include "c6x.h"

void main()
{
    *((int*) 0x01840000) = 0x02// L2CFG REGISTER: 64KB of L2 Cache
    *((int*) 0x01840020) = 0x04// L1PCFG REGISTER: 32KB of L1P Cache
    *((int*) 0x01840040) = 0x04// L1DCFG REGISTER: 32KB of L1D Cache

    *((int*) 0x01848200) = 0x1// MAR REGISTER: Memory between 0x80000000 and 0x81000000 is cachable.

    register unsigned int* input_buffer = (unsigned int*)0x80010000;
    register unsigned int* output_buffer = (unsigned int*)0x80029500;
    int start_cycle, end_cycle, i;

    start_cycle = TSCL;
    #pragma UNROLL(16)// is this usage useful?
    #pragma MUST_ITERATE(720*576/4, 720*576/4, 16)// is this usage useful?
    for(i = 0; i < 720*576 / 4; i++)
    {
        *(output_buffer++) = *(input_buffer++);
    }
    end_cycle = TSCL;
    printf("Cycle per pixel is: %f", (float)(end_cycle - start_cycle) / (720*576));
}
/*********************************************END OF APPLICATION CODE*************************************************/
/**********************************************START OF LINKER.CMD FILE*************************************************/
-c
-heap 0x1000
-stack 0x1000

MEMORY{
IRAM:origin = 0x80000000, len = 0xFFFF
}

SECTIONS
{

vectors:> IRAM
.cinit:> IRAM
.text:> IRAM
.stack:> IRAM
.bss:> IRAM
.const:> IRAM
.data:> IRAM
.far:> IRAM
.switch:> IRAM
.sysmem:> IRAM
.tables:> IRAM
.cio:> IRAM
}
/***********************************************END OF LINKER.CMD FILE*************************************************/

I am building this time with -O3 and the --no_bad_aliases option.

When I looked at the generated ASM code, I saw that 8 instructions are generated for the for loop; however, the DSP takes 20 cycles to execute these 8 instructions. Isn't the DSP supposed to handle multiple instructions per cycle?
As a result, 6.61 cycles are spent per pixel. This means 3.43 ms is spent on one frame. This is not an acceptable amount of time.

Q1) Is there a way that I can read/write more than 4 bytes from/to the buffer? (like 128 or 256 bits?)
Q2) How can I make DSP do multiple instructions per cycle?
Q3) Is it enough to set the L1D and L1P cache sizes? How can I make sure that the data and code is actually using the L1P and L1D caches?

Thanks for your replies in advance. Hopefully I'll manage this problem soon.

Coskun.
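
The step from cycles per pixel to time per frame can be written out as below. The post does not state the CPU clock; 800 MHz is purely an assumption here, chosen because it reproduces the quoted 3.43 ms, and `frame_time_ms` is an illustrative name.

```c
/* Milliseconds per frame for a given per-pixel cost and CPU clock.
   clock_hz is an assumption supplied by the caller; the thread does
   not state the actual clock rate. */
double frame_time_ms(double cycles_per_pixel, long pixels, double clock_hz)
{
    return cycles_per_pixel * (double)pixels / clock_hz * 1000.0;
}
```

For example, `frame_time_ms(6.61, 720L * 576L, 800e6)` covers 414720 pixels, about 2.74 million cycles, which is roughly 3.43 ms at that assumed clock.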
A few quick things to check:
a) Make sure input_buffer and output_buffer are indeed in the address range 0x8000_0000 to 0x8100_0000, as one MAR only controls a 16 MB region.

b) If they are outside this range, please set additional MAR registers: MAR128 - MAR143 are at 0x0184 8200 - 0x0184 823C.

c) Your address for MAR128 is correct.

d) You can load/store a maximum of 128 bits per cycle. You can cast the pointer to unsigned long long and use _amem8(&input_buffer[i]), then use _loll() and _hill() to get the low and high 32 bits.

Regards
JS
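
Points a) to c) imply a simple mapping from a data address to its MAR register, sketched below. The base address is not quoted in the thread; it is inferred here from the MAR128 and MAR143 addresses given above (0x01848200 - 128 * 4), so treat `MAR_BASE` and both function names as assumptions to be checked against the device documentation.

```c
#include <stdint.h>

/* Inferred from the addresses in the reply: MAR128 at 0x01848200 and
   MAR143 at 0x0184823C imply one 32-bit register per index from here. */
#define MAR_BASE 0x01848000u

/* Each MAR covers a 16 MB region, so the index is the top 8 address bits. */
uint32_t mar_index(uint32_t addr)
{
    return addr >> 24;
}

/* Address of the MAR register that controls cacheability of addr. */
uint32_t mar_register_address(uint32_t addr)
{
    return MAR_BASE + 4u * mar_index(addr);
}
```

Both buffer addresses from the copy-loop post (0x80010000 and 0x80029500) map to index 128, that is, to the register at 0x01848200.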