Cycle count in Blackfin BF535

Started by Dhaval Parekh March 28, 2006
Hi All,
I would highly appreciate it if anyone could answer my query.
I have written a small program in C to check the cycle consumption of different operations on the Blackfin (BF535).
I am using the simulator in the demo version of VisualDSP++ 3.5 for this purpose.
The code I have written is as follows:

void main(void)
{
    long int a, b, i1, i2, i3, i5, i6;
    a = 4;
    b = 2;
    i1 = a + b;
    i2 = a * b;
    i3 = a - b;
    i5 = a >> 1;
    i6 = b << 1;
}

As the aim of this program is just to check the cycle count, I have kept it as simple as possible.
Blackfin DSPs are said to be able to perform each of these operations in a single cycle, but the cycle counts I got are totally different.
The addition took 25 cycles, the subtraction 9 cycles, and the multiplication again 25 cycles. Each shift took 16 cycles.
(I found these counts by putting a breakpoint after each operation and looking at the cycle count each time.)
I also looked at the assembly code for the program as shown in the simulator, which looks something like this:

R3 = 4 ;
[ FP + -100 ] = R3 ;
R2 = 2 ;
[ FP + -96 ] = R2 ;
W [ FP + -32 ] = R3 ;
W [ FP + -28 ] = R2 ;
R1 = 4 ;
[ FP + -68 ] = R1 ;
R0 = 6 ;
[ FP + -36 ] = R0 ;
R7 = 20 ;
[ FP + -64 ] = R7 ;
R6.L = 26214 ;
R6.H = 16518 ;
[ FP + -124 ] = R6 ;
R2.L = -13107 ;
R2.H = 16396 ;
[ FP + -120 ] = R2 ;
R3 = [ FP + -100 ] ;
R5 = [ FP + -96 ] ;
R4 = R3 + R5 ;
[ FP + -92 ] = R4 ;
R3 *= R5 ;
[ FP + -88 ] = R3 ;
R7 = [ FP + -100 ] ;
R1 = R7 - R5 ;
[ FP + -84 ] = R1 ;
R3 = [ FP + -100 ] ;
R3 >>>= 0x1 ;
[ FP + -76 ] = R3 ;
R2 = [ FP + -96 ] ;
R2 <<= 0x1 ;
[ FP + -72 ] = R2 ;

Kindly tell me if I am making any mistake, and whether the cycle count I got is the minimum or can be reduced in any way.
Any help is highly appreciated.
Thanks in advance, and sorry for such a long mail.

DP

Hi DP,

My very first advice to you would be to use the actual hardware (and not
the simulator) if you are serious about benchmarking.

Secondly, while benchmarking the code, you should not step through it
(stepping through the code will not give you a correct result). Rather, you
could write two macros: (a) one to clear the cycle count registers and (b) one
to read them. Call macro (a) before you start the computations and macro (b)
after they have finished.
If you are unable to write these macros, you can download the code for
EE-271 from the following link; one of the software packages contains similar macros:
http://www.analog.com/processors/resources/technicalLibrary/appNotes.html
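
A minimal sketch of the two macros described above, written so that it also compiles on a host PC. The names BENCH_START, BENCH_STOP, read_counter and measure_ops are hypothetical, and clock() is only a stand-in: on a Blackfin build, read_counter() would return the value of the cycle count register instead. Note that this variant records the starting count rather than clearing the register, which is a swapped-in simplification.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical counter source. clock() is a host-side stand-in so the
   sketch runs anywhere; on the target this would read the cycle counter. */
static uint32_t read_counter(void)
{
    return (uint32_t)clock();
}

static uint32_t bench_start;   /* counter value captured by BENCH_START */

/* Macro (a): record the starting count (instead of clearing the register). */
#define BENCH_START()  (bench_start = read_counter())

/* Macro (b): count elapsed since BENCH_START; unsigned wrap-around is harmless. */
#define BENCH_STOP()   ((uint32_t)(read_counter() - bench_start))

/* volatile keeps the compiler from deleting the work being timed. */
volatile long sink;

/* Times the five operations from the original post as one region. */
uint32_t measure_ops(long a, long b)
{
    BENCH_START();
    sink = a + b;
    sink = a * b;
    sink = a - b;
    sink = a >> 1;
    sink = b << 1;
    return BENCH_STOP();
}
```

Timing the whole region between the two macros, rather than halting after each statement, avoids the stepping that the advice above warns against.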

Thirdly, the situation in your code is complicated by the fact that you are
accessing local variables and the compiler is not able to optimize the code.
You could make the following modifications to the code:
(a) Declare the variables (operands) as global.
(b) For statements like i1 = a + b; the variables a and b should be
stored in different internal memory banks (this is a constraint on dual DAG
access; refer to the BF HRM for more information).
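
A sketch of modifications (a) and (b), assuming the VisualDSP++ section("...") qualifier. The section names L1_data_a and L1_data_b are an assumption (they depend on the project's LDF file and may differ for the BF535), and the BANK_A/BANK_B macros and the run_ops function are hypothetical names. On any other compiler the macros expand to nothing, so the code still builds on a host PC.

```c
/* Assumed placement macros: the section names come from the project's
   LDF file and may differ per part; elsewhere the macros are empty. */
#if defined(__ADSPBLACKFIN__) && defined(__ECC__)
#define BANK_A section("L1_data_a")
#define BANK_B section("L1_data_b")
#else
#define BANK_A
#define BANK_B
#endif

/* (a) Operands and results are globals, not locals on the stack frame.
   (b) a and b are requested in different data banks. */
BANK_A long a = 4;
BANK_B long b = 2;
long i1, i2, i3, i5, i6;

/* The five operations from the original post, now on global operands. */
void run_ops(void)
{
    i1 = a + b;
    i2 = a * b;
    i3 = a - b;
    i5 = a >> 1;
    i6 = b << 1;
}
```

Because a and b are externally visible globals, the optimizer in general cannot fold the expressions into constants the way it can with locals that are assigned literal values, so the arithmetic instructions remain in the generated code.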

Lastly, with the above modifications you will need to analyze the
compiler-generated code. Further, I assume that you are building the project with
(a) compiler optimizations enabled for maximum speed and (b) Release mode (not
Debug mode). Please do so if this is not the case.
Regards,
kunal
There is already a cycle count register available in the Blackfin. You can use it
in the following way. (I am sure this works on the BF531/BF532/BF533.)

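Reset the cycle count register before any operation/function:
#include <sysreg.h>   /* header name was lost in the original post; the VisualDSP++ sysreg.h is assumed */
#include <stdio.h>    /* for printf */
sysreg_write(reg_SYSCFG, sysreg_read(reg_SYSCFG) | SYSCFG_CCCEN);
sysreg_write(reg_CYCLES, 0);

Get the cycle count after any operation/function:
printf("%d\n", sysreg_read(reg_CYCLES));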
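
As an alternative to writing zero to the counter, the count can be read before and after the operation and the two readings subtracted. A minimal sketch, assuming each reading is the 32-bit value of the cycle count register; elapsed_cycles is a hypothetical helper and not part of any vendor header.

```c
#include <stdint.h>

/* Difference of two 32-bit counter readings. Unsigned subtraction is
   modulo 2^32, so the result is still correct if the counter wrapped
   once between the two reads. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, start and end would each come from a read of the cycle count register taken immediately before and after the code being measured.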

Regards,
Nitin

Hi Dhaval,
Your observation may be right.
The cycle count can be reduced considerably by some hand optimization.

regards,
-Sharath Malve


I don't think this whole sequence will take 1 cycle even if it is written in assembly.
As far as I remember, these operations cannot all be performed in parallel.

i1=(a+b)|i2=(a-b);
i3 = a*b|NOP|store(i1)|store(i2);
i4 = a<<1|store(i3);
i5 = b<<1|store(i4);

Also, i1 and i2 should be stored in different locations, or else there will be stalls. If you have another multiply operation, you could place it in the NOP slot.

You can use add/sub or multiply and shift operations as below.
a1 and a2 are accumulators, and i1 and i2 are index registers.

If i1 & i2 are addresses assigned to different
a1 += r1*r2(f)|a2+=r3*r4(uf)|*i1 = a1.l|*i2.l

With the f option, the a1 and a2 results are shifted as needed for fractional operation.
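
To illustrate the shift mentioned above: multiplying two signed 1.15 fractions gives a 2.30 product, and shifting it left by one bit gives the 1.31 format normally kept in an accumulator. A minimal C model of that step follows; mac_frac16 is a hypothetical name, a 64-bit integer stands in for the accumulator, and saturation is not modelled.

```c
#include <stdint.h>

/* Fractional multiply-accumulate model.
   x and y are signed 1.15 fractions; their raw product is in 2.30 format.
   Doubling it (a left shift by one bit) yields 1.31, which is added to acc. */
int64_t mac_frac16(int64_t acc, int16_t x, int16_t y)
{
    int64_t product = (int64_t)x * (int64_t)y;   /* 2.30 */
    return acc + product * 2;                    /* 1.31 after the shift */
}
```

For example, 0x4000 represents 0.5 in 1.15 format, and mac_frac16(0, 0x4000, 0x4000) returns 0x20000000, which is 0.25 in 1.31 format.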

Check the optimisation level set for the C code.
