Hi, all
Lately I have been reading these documents about optimization:
spru187o TMS320C6000 Optimizing Compiler v6.1.pdf
spru198i TMS320C6000 Programmer’s Guide.pdf
spru732c TMS320C64xC64x+ DSP CPU and Instruction Set Reference Guide.pdf.
The C64x+ is a special architecture, and its instructions have different latencies
depending on the type of instruction. At first, I thought that refining C/C++ code
with pragmas and programming with intrinsics could solve the problem of
optimization. But I found some presentations (ppt) on the internet, and it seems
that assembly code is sometimes necessary.
In your experience, is assembly code or linear assembly code necessary?
Under what conditions, and for what applications?
Thanks in advance.
Jogging
_____________________________________
Is Assembly code/linear assembly code necessary?
Started by ●March 30, 2009
Reply by ●March 31, 2009
Hi,
It is difficult to give an answer without a complete understanding of the real-time deadlines of the system.
Let me take a swing at it in the most general fashion.
If you are looking for an average issue slot usage of 6-7 or above (out of a possible 8) for the inner loops/kernel of your algorithm, then there might be a need for pipelined or linear assembly (more likely pipelined assembly). But pipelined assembly takes a long time to develop and is harder to maintain, by an order of magnitude.
Linear assembly is much easier to code and somewhat easier to maintain than pipelined assembly. Linear assembly has given me an average issue slot usage on the order of 5 to 6 for inner loops. But it is possible that, out of 10 cycles, two or three issue only 4 instructions out of a possible 8.
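As an illustration of how such an average works out, here is a small C sketch. The function name avg_issue_slots and the sample per-cycle counts mentioned afterwards are hypothetical, not measured values from any real kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Average number of instructions issued per cycle over a loop kernel.
 * slots[i] is the number of functional units busy in cycle i; the C6X
 * core has 8 functional units, so each entry must lie in 0..8. */
double avg_issue_slots(const int *slots, size_t cycles)
{
    long total = 0;
    size_t i;

    for (i = 0; i < cycles; i++) {
        assert(slots[i] >= 0 && slots[i] <= 8);
        total += slots[i];
    }
    /* Guard against an empty kernel to avoid dividing by zero. */
    return cycles ? (double)total / (double)cycles : 0.0;
}
```

For a made-up 10-cycle kernel where seven cycles issue 6 instructions and three cycles issue 4, the total is 54 instructions, so the average is 5.4 slots per cycle. A few weak cycles pull the average down noticeably even when most cycles are well packed.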
Obviously you have to make an initial target mapping analysis of your requirements by mapping the loads/stores and arithmetic of your algorithm onto the C6X VLIW instruction set capabilities, keeping in mind all the restrictions of the processor (cross path stalls etc.). If you are using existing library functions, this might become a little difficult. But in general this analysis gives you a good idea of what is achievable.
In general, intrinsics with good compiler directives (pragmas) do the job for most applications. C6X provides very efficient pragmas for optimization. C6X intrinsics with good pragmas easily give you an average issue slot usage of 4-6.
Pragmas are critical, but equally critical is the use of type qualifiers like restrict, const etc.
In addition, C6X provides pragmas to align memory elements. Completely avoiding unaligned accesses can be a benefit. The C6X compiler also provides good debug info (I think you need to turn it on) on what exactly can improve the algorithm's performance, for example, whether there are excessive register-to-memory spills.
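A minimal sketch of what these hints can look like in C is below. The function dot_q15 and the array coeffs are hypothetical names chosen for illustration. The TI-specific lines (the DATA_ALIGN and MUST_ITERATE pragmas and the _nassert intrinsic) are guarded by _TMS320C6X, so the same file should also build with a host compiler for testing. The 8-byte alignment and the "multiple of 4" trip count are assumptions the caller has to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Ask the TI compiler to place coeffs on an 8-byte boundary so that
 * wide aligned loads can be used. Ignored when building on a host. */
#ifdef _TMS320C6X
#pragma DATA_ALIGN(coeffs, 8)
#endif
int16_t coeffs[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Dot product of two 16-bit vectors.
 * const + restrict promise the compiler that x and y are read-only here
 * and do not alias, which lets it reorder and combine the loads. */
int32_t dot_q15(const int16_t *restrict x, const int16_t *restrict y, int n)
{
    int32_t acc = 0;
    int i;

    /* Precondition matching the MUST_ITERATE hint below. */
    assert(n >= 4 && n % 4 == 0);

#ifdef _TMS320C6X
    /* Assumption: both buffers are 8-byte aligned. */
    _nassert(((int)x & 7) == 0);
    _nassert(((int)y & 7) == 0);
    /* At least 4 iterations, trip count always a multiple of 4. */
    #pragma MUST_ITERATE(4, , 4)
#endif
    for (i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];

    return acc;
}
```

Whether the compiler actually reaches a higher issue slot usage with these hints depends on the loop, so it is worth checking the software pipelining information it emits for the loop before and after adding them.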
Regards
_____________________________________
Reply by ●April 22, 2009
Hi,
Thanks for your opinion. I agree with you completely.
Recently I found that memory access may influence performance more than
assembly code does.
To learn more about the effect of memory access, I did some tests.
I ran the IMG_perimeter function from the imglib library on the DM6437 EVM.
In the example, the test program runs the C version of the function and then
the assembly version.
First, I put the data in L2 RAM. The resulting times are below:
IMG_perimeter asm cycle: 1029
IMG_perimeter c cycle: 2941
Then I put the data in external DDR2 memory. The resulting times are below:
IMG_perimeter asm cycle: 6250
IMG_perimeter c cycle: 13234
We can see that putting the data in L2 RAM reduces the time of the C version
from 13234 to 2941 cycles. That is a much bigger gain than assembly code
optimization, which reduces the time from 13234 to 6250 cycles.
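To put the two gains side by side, the ratios can be worked out with a tiny helper. This is only a sketch: the function name speedup is hypothetical, and the inputs are the cycle counts posted above.

```c
#include <assert.h>

/* Speedup factor when a cycle count drops from `before` to `after`. */
double speedup(unsigned long before, unsigned long after)
{
    assert(after > 0);
    return (double)before / (double)after;
}
```

With the figures above, speedup(13234, 2941) is about 4.5 for moving the data from DDR2 to L2 RAM, while speedup(13234, 6250) is about 2.1 for switching from C to assembly with the data left in DDR2. Doing both, speedup(13234, 1029), gives about 12.9.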
Before this, I paid attention only to assembly code optimization and had not
noticed the effect of memory access.
My other question is this: memory access latency is multiple cycles in the
C64x+ pipeline.
A load instruction needs five cycles to obtain its data. If a queue or tree
data structure is used,
I don't know how to optimize it. Can anyone share their experience with this?
Thanks in advance.
Jogging
_____________________________________
Reply by ●April 22, 2009
Hi,
I have some doubts about your figures. Are you sure you had the cache enabled when running from external memory?
Where was the data to process? In internal SDRAM as well?
I wouldn't use the term "internal L2 RAM": L2 means Level 2 cache, and internal RAM is internal RAM. They sound like two different things to me.
Normally, with some good pragmas and optimisation instructions to the compiler, you can get the same result as assembly code with far less effort.
Regards
_____________________________________
Reply by ●April 22, 2009
Hi,
I am sure that the external memory is cacheable, because I obtained three sets
of figures.
The third set of figures is with the cache turned off for external memory:
IMG_perimeter asm cycle: 28444
IMG_perimeter c cycle: 298242
The function IMG_perimeter needs one input and one output.
In each test I put both of them either in internal RAM or in DDR2.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 12:23 AM, christophe blouet <
c...@hotmail.com> wrote:
> Hi,
>
> I have some doubts on your figures, are you sure you had Cache enabled when
> running in external memory?
> Where were the data to process? in internal SDRAM as well?
> I wouldn't use the term internal L2 RAM, L2 means Level 2 Cache, internal
> RAM is internal RAM, it sounds 2 different things to me.
>
> Normally with some good pragmas and optimise instructions to the compiler
> you can get the same result as assembly code, but for far less efforts.
>
> Regards
> > To: r...@yahoo.com
> > CC: c...
> > From: j...@gmail.com
> > Date: Wed, 22 Apr 2009 20:38:20 +0800
> > Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
>
> >
> > Hi,
> > Thanks for your opinion. I agree with you completely.
> > Recently I find memory access may influence the performance more than
> > assembly code.
> > In order to learn more about the memory access effect, I do some tests.
> > I run the IMG_perimeter function from imglib library on DM6437 EVM.
> > In the example, test program runs the function in c and then the function
> in
> > assembly code.
> > At first, I put the data in L2 RAM, the resulting time is below:
> > IMG_perimeter asm cycle: 1029
> > IMG_perimeter c cycle: 2941
> >
> > Then I put the data in external memory DDR2, the resulting time is below.
> > IMG_perimeter asm cycle: 6250
> > IMG_perimeter c cycle: 13234
> >
> > We can see that if the data is put in L2 RAM, the time can be reduced
> > from 13234 to 2941. It is much better than assembly code optimization
> > which reduces time from 13234 to 6250.
> >
> > Before I pay my attention to assembly code optimization, and haven't
> > found memory access effect.
> >
> > My another question is that: memory access latency is multiple cycles in
> the
> > C64x+ pipeline.
> > For load instruction, it needs five cycles to obtain data. If queue or
> tree
> > data structure is used,
> > I don't know how to optimize it. Can anyone share his experience with it?
> >
> > Thanks in advance.
> > Jogging
> >
> > On Tue, Mar 31, 2009 at 1:13 PM, rvsasi wrote:
> >
> > > Hi,
> > > It is difficult to give an answer without a complete understanding of
> the
> > > real-time deadlines of the system.
> > >
> > > Let me take a swing at it in the most general fashion.
> > >
> > > If your are looking for an average issue slot usage of 6-7 or above
> (out of
> > > a total possible 8) for inner loops/kernel of your algorithm, then
> there
> > > might be a need for pipelined/linear assembly (more likely that you
> would
> > > need pipelined assembly). But pipelined assembly takes long time to
> develop
> > > and difficult to maintain, by an order.
> > >
> > > Linear assembly is much easier to code and somewhat easier to maintain
> than
> > > pipelined assembly. Linear assembly has given me outputs of the order 5
> to 6
> > > (average) for the inner loops. But it is possible that out of 10
> cycles, two
> > > or three might be at, 4 out of a total of 8 per cycle.
> > >
> > > Obviously you have to make an initial target mapping analysis of your
> > > requirement by mapping the loads/stores and arithmetic of your
> algorithm
> > > into C6X VLIW instruction set capabilities, keeping in mind all
> restrictions
> > > of the processor (cross path stalls etc..). If you are using existing
> > > library functions, this might become a little difficult. But in general
> this
> > > analysis gives you a good idea of what is achievable.
> > >
> > > In general intrinsics with good compiler directives (pragmas),does the
> job
> > > for most applications. C6X provides very efficient pragmas for
> optimizations
> > > C6X intrinsics with good pragma's easily give you an average issue slot
> > > usage 4-6.
> > >
> > > Pragma's are critical, but also critical are usage of type qualifiers
> like
> > > restrict, const etc..
> > > In addition C6X provide pragma's to align memory elements. Completely
> > > avoiding unaligned accesses, can be a benefit. In addition,C6X compiler
> > > provides good debug info, (I think you need to turn it on), on what
> exactly
> > > can improve the algorithm performance. For example, if there are
> excessive
> > > register to memory pills.
> > >
> > >
> > > Regards
> > >
> > > --- On *Sun, 3/29/09, j...@gmail.com > >*wrote:
> > >
> > > From: j...@gmail.com
> > > Subject: [c6x] Is Assembly code/linear assembly code necessary?
> > > To: c...
> > > Date: Sunday, March 29, 2009, 7:41 PM
> > >
> > >
> > > Hi, all
> > >
> > >
> > > Nowadays I am reading the documents about optimization:
> > > spru187o TMS320C6000 Optimizing Compiler v6.1.pdf
> > > spru198i TMS320C6000 Programmers Guide.pdf
> > > spru732c TMS320C64xC64x+ DSP CPU and Instruction Set Reference
> Guide.pdf.
> > >
> > > C64x+ is a special architecture, and instructions has different
> latencies
> > > depending on the type fo inctructions. At first, I think refining C/C++
> code
> > > with pragmas and programming with intrinsics can solve the problem of
> > > optimization. But I got some ppt from internet. It seems assembly code
> is
> > > necessary sometimes.
> > > For your experience, do you think assembly code/linear assembly code is
> > > necessary?Under what conditions and for what application?
> > >
> > > Thanks in advance.
> > > Jogging
> > >
> >
> >
> >
> >
> >
> >
> > _____________________________________
> >
> >
> >
>
I assure that external memory is cacheable because I obtain three sets
of figures.
The third set of figure is with cache off on external memory.
IMG_perimeter asm cycle: 28444
IMG_perimeter c cycle: 298242
In the function IMG_perimeter needs one input and one output.
In the test I put them both in internal RAM or in DDR2.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 12:23 AM, christophe blouet <
c...@hotmail.com> wrote:
> Hi,
>
> I have some doubts on your figures, are you sure you had Cache enabled when
> running in external memory?
> Where were the data to process? in internal SDRAM as well?
> I wouldn't use the term internal L2 RAM, L2 means Level 2 Cache, internal
> RAM is internal RAM, it sounds 2 different things to me.
>
> Normally with some good pragmas and optimise instructions to the compiler
> you can get the same result as assembly code, but for far less efforts.
>
> Regards
> > To: r...@yahoo.com
> > CC: c...
> > From: j...@gmail.com
> > Date: Wed, 22 Apr 2009 20:38:20 +0800
> > Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
>
> >
> > Hi,
> > Thanks for your opinion. I agree with you completely.
> > Recently I found that memory access may influence performance more than
> > assembly code does.
> > To learn more about the effect of memory access, I ran some tests.
> > I ran the IMG_perimeter function from the imglib library on a DM6437 EVM.
> > The test program runs the C version of the function and then the
> > assembly version.
> > First I put the data in L2 RAM. The resulting times are:
> > IMG_perimeter asm cycle: 1029
> > IMG_perimeter c cycle: 2941
> >
> > Then I put the data in external DDR2 memory. The resulting times are:
> > IMG_perimeter asm cycle: 6250
> > IMG_perimeter c cycle: 13234
> >
> > We can see that putting the data in L2 RAM reduces the time of the C version
> > from 13234 to 2941 cycles. That is a much bigger gain than assembly code
> > optimization, which reduces the time from 13234 to 6250 cycles.
> >
> > Previously I paid attention only to assembly code optimization and had not
> > noticed the effect of memory access.
> >
> > My other question: memory access latency is multiple cycles in the
> > C64x+ pipeline.
> > A load instruction needs five cycles to obtain its data. If a queue or tree
> > data structure is used,
> > I don't know how to optimize it. Can anyone share their experience with this?
> >
> > Thanks in advance.
> > Jogging
Reply by ●April 23, 2009
jogging,
On Wed, Apr 22, 2009 at 7:38 AM, jogging song wrote:
> My other question: memory access latency is multiple cycles in the
> C64x+ pipeline.
> A load instruction needs five cycles to obtain its data. If a queue or tree
> data structure is used,
> I don't know how to optimize it. Can anyone share their experience with this?
Check out 'delay slots' and 'load instructions' in spru732c. If you
look at the assembly code generated by the C compiler, you will
probably see that it makes use of the delay slots.
Q1. Are you comparing optimized [by the compiler] C code with assembly code??
mikedunn
--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
_____________________________________
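On the queue/tree question, the difficulty is that in a linked structure the address of each load comes from the previous load, so the loads form a serial chain and the delay slots are hard to fill, whereas array loads are independent of each other. A minimal sketch in plain C, assuming a hypothetical `node_t` type and function names invented for illustration; whether the compiler actually overlaps the loads depends on the optimization options and should be checked in the generated assembly:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Array walk: every load address is known from i alone, so successive
 * loads do not depend on each other and can be overlapped. */
int sum_array(const int *values, size_t n)
{
    size_t i;
    int sum = 0;
    assert(values != NULL || n == 0);
    for (i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum;
}

/* Single chain: the load of p->next must complete before the next
 * iteration can issue its loads (pointer chasing). */
int sum_list(const node_t *p)
{
    int sum = 0;
    while (p != NULL) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}

/* Two independent chains walked in one loop: the loads for list a do not
 * depend on the loads for list b, so they may be scheduled together. */
int sum_two_lists(const node_t *a, const node_t *b)
{
    int sum_a = 0, sum_b = 0;
    while (a != NULL && b != NULL) {
        sum_a += a->value;
        sum_b += b->value;
        a = a->next;
        b = b->next;
    }
    /* Finish whichever list is longer. */
    return sum_a + sum_b + sum_list(a) + sum_list(b);
}
```

If the data can be kept in an array instead of linked nodes, `sum_array` is the shape a software pipeliner typically handles best. When the links are unavoidable, walking two independent chains in one loop, as in `sum_two_lists`, may at least give the scheduler unrelated loads to place in each other's delay slots.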
Reply by ●April 23, 2009
christophe,
On Wed, Apr 22, 2009 at 11:23 AM, christophe blouet wrote:
> Hi,
>
> I have some doubts about your figures. Are you sure the cache was enabled when running from external memory?
>
> Where was the data to be processed? In internal SDRAM as well?
If we are being picky about terminology [I do not care for the term
'L2 RAM'], should we not say 'internal SDRAM'?? :-)
mikedunn
>
> I wouldn't use the term "internal L2 RAM": L2 means Level 2 cache, and internal RAM is internal RAM. They sound like two different things to me.
>
> Normally, with some good pragmas and optimisation directives for the compiler, you can get the same result as assembly code with far less effort.
>
> Regards
_____________________________________
Reply by ●April 23, 2009
OK, that sounds good as far as the cache being enabled, but how big is your cache? It can change
the results if your program is big. If it is a small one, then once it is loaded into
cache you wouldn't see much difference between internal SDRAM ;-) and
external DDR.
Really do have a look at the C optimisations: if you give the compiler a minimum loop
count, it can unroll the loop so that more calculations are done per iteration, and then
your code won't suffer from pipeline delays. Using this method I got the same results
as the best optimised asm routine.
Regards
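The minimum loop count mentioned above is most likely what the TI compiler's `MUST_ITERATE` pragma (described in spru187) provides, and it is usually combined with `restrict`-qualified pointers. A minimal sketch, assuming a hypothetical `vec_add` function and a caller that always passes a length that is at least 8 and a multiple of 8; the pragma is guarded so the code also builds with other compilers:

```c
#include <assert.h>

/* Hypothetical example: element-wise add of two 16-bit vectors.
 * `restrict` promises the compiler that the three buffers do not overlap,
 * so loads and stores from different iterations can be reordered. */
void vec_add(const short *restrict a, const short *restrict b,
             short *restrict out, int n)
{
    int i;

    /* The pragma below is a promise, not a check: verify it in debug builds. */
    assert(n >= 8 && n % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* TI-specific: trip count is at least 8, no stated maximum,
     * and always a multiple of 8. */
    #pragma MUST_ITERATE(8, , 8)
#endif
    for (i = 0; i < n; i++) {
        out[i] = (short)(a[i] + b[i]);
    }
}
```

If the promise is broken (for example `n` is 5), the generated loop may compute wrong results, so the trip-count contract is worth stating in the function's documentation as well.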
Regards
Date: Thu, 23 Apr 2009 09:45:30 +0800
Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
From: j...@gmail.com
To: c...@hotmail.com
CC: r...@yahoo.com; c...
Hi,
I assure that external memory is cacheable because I obtain three sets of figures.
The third set of figure is with cache off on external memory.
IMG_perimeter asm cycle: 28444
IMG_perimeter c cycle: 298242
In the function IMG_perimeter needs one input and one output.
In the test I put them both in internal RAM or in DDR2.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 12:23 AM, christophe blouet wrote:
Hi,
I have some doubts on your figures, are you sure you had Cache enabled when running in external memory?
Where were the data to process? in internal SDRAM as well?
I wouldn't use the term internal L2 RAM, L2 means Level 2 Cache, internal RAM is internal RAM, it sounds 2 different things to me.
Normally with some good pragmas and optimise instructions to the compiler you can get the same result as assembly code, but for far less efforts.
Regards
> To: r...@yahoo.com
> CC: c...
> From: j...@gmail.com
> Date: Wed, 22 Apr 2009 20:38:20 +0800
> Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
>
> Hi,
> Thanks for your opinion. I agree with you completely.
> Recently I found that memory access may influence performance more than
> assembly code does.
> To learn more about the effect of memory access, I did some tests.
> I ran the IMG_perimeter function from the imglib library on a DM6437 EVM.
> The test program runs the C version of the function and then the
> assembly version.
> First I put the data in L2 RAM. The resulting times are below:
> IMG_perimeter asm cycle: 1029
> IMG_perimeter c cycle: 2941
>
> Then I put the data in external DDR2 memory. The resulting times are below:
> IMG_perimeter asm cycle: 6250
> IMG_perimeter c cycle: 13234
>
> We can see that putting the data in L2 RAM reduces the C time
> from 13234 to 2941 cycles. That is a much bigger gain than assembly optimization,
> which reduces the time from 13234 to 6250 cycles.
>
> Previously I paid attention only to assembly code optimization and had not
> noticed the effect of memory access.
>
> My other question: memory access latency is multiple cycles in the
> C64x+ pipeline.
> A load instruction needs five cycles to obtain its data. If a queue or tree
> data structure is used,
> I don't know how to optimize it. Can anyone share their experience with it?
>
> Thanks in advance.
> Jogging
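
On the queue/tree question above, one common approach is to avoid pointer chasing where possible. The sketch below is only an illustration under assumptions: the names node, ring_queue, sum_list and sum_ring are made up for this example. It contrasts a linked list, where each load address comes from the previous load, with a queue kept in a contiguous ring buffer, where addresses depend only on the loop index. In the second form the loads do not depend on each other, which may let the compiler overlap them across the load delay slots.

```c
#include <assert.h>
#include <stddef.h>

/* Linked-list node: the address of the next load is itself loaded data. */
struct node {
    int value;
    struct node *next;
};

/* Pointer chasing: each iteration waits for the previous load of ->next. */
int sum_list(const struct node *head)
{
    int sum = 0;
    while (head != NULL) {
        sum += head->value;
        head = head->next;
    }
    return sum;
}

/* Queue kept in a contiguous ring buffer (capacity is a power of two). */
struct ring_queue {
    const int *data;
    size_t capacity;
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of queued elements */
};

/* Addresses depend only on the loop index, so the loads are independent. */
int sum_ring(const struct ring_queue *q)
{
    int sum = 0;
    size_t i;
    size_t mask = q->capacity - 1;

    /* Power-of-two capacity lets the wrap-around be a single AND. */
    assert(q->capacity != 0 && (q->capacity & mask) == 0);
    for (i = 0; i < q->count; i++) {
        sum += q->data[(q->head + i) & mask];
    }
    return sum;
}
```

For trees, the same idea would be to store the nodes in an array and link them by index, though how much that helps depends on the traversal order.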
Reply by ●April 23, 2009
Hi, Michael
First, I would like to know why linear assembly code is necessary.
I can provide information to the C compiler with pragmas and restrict.
Intrinsics can be used for instruction selection. So in my opinion linear
assembly code is not necessary. The benefit of assembly code is instruction selection.
With pragmas, restrict and intrinsics I can implement most of what assembly code offers.
I have worked on optimization for a while, and I find that memory access is more
important, because it influences performance greatly.
So the first step in the workflow for improving the performance of C code should
be to improve the memory access pattern.
I have no experience of using DMA on C64x+. Can anyone share their experience
of using DMA? How does DMA improve performance? I find that DMA is not part of
DSP/BIOS.
I want to know whether DMA can be used without DSP/BIOS.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 1:32 PM, Michael Dunn wrote:
> jogging,
>
> On Wed, Apr 22, 2009 at 7:38 AM, jogging song wrote:
> > Hi,
> > Thanks for your opinion. I agree with you completely.
> > Recently I found that memory access may influence performance more than
> > assembly code does.
> > To learn more about the effect of memory access, I did some tests.
> > I ran the IMG_perimeter function from the imglib library on a DM6437 EVM.
> > The test program runs the C version of the function and then the
> > assembly version.
> > First I put the data in L2 RAM. The resulting times are below:
> > IMG_perimeter asm cycle: 1029
> > IMG_perimeter c cycle: 2941
> >
> > Then I put the data in external DDR2 memory. The resulting times are below:
> > IMG_perimeter asm cycle: 6250
> > IMG_perimeter c cycle: 13234
> >
> > We can see that putting the data in L2 RAM reduces the C time
> > from 13234 to 2941 cycles. That is a much bigger gain than assembly optimization,
> > which reduces the time from 13234 to 6250 cycles.
> >
> > Previously I paid attention only to assembly code optimization and had not
> > noticed the effect of memory access.
> >
> > My other question: memory access latency is multiple cycles in the
> > C64x+ pipeline.
> > A load instruction needs five cycles to obtain its data. If a queue or tree
> > data structure is used,
> > I don't know how to optimize it. Can anyone share their experience with it?
>
> Check out 'delay slots' and 'load instructions' in spru732c. If you
> look at the assembly code generated by the C compiler, you will
> probably see that it makes use of the delay slots.
> Q1. Are you comparing optimized [by the compiler] C code with assembly
> code??
>
> mikedunn
> >
> > Thanks in advance.
> > Jogging
> > --
> www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
>
Reply by ●April 23, 2009
jogging,
On Thu, Apr 23, 2009 at 4:21 AM, jogging song wrote:
> Hi, Michael
> First, I would like to know why linear assembly code is
> necessary.
Maybe you misunderstood. I am not saying that coding in assembly is necessary.
What is necessary is to understand what assembly code is generated by
the C compiler.
You might effectively optimize C code by carefully using pragmas,
intrinsics, and restrict. IMO, you cannot evaluate the effectiveness
of pragmas, intrinsics, and restrict without looking at before and
after versions of the assembly listing.
> I can provide information to the C compiler with pragmas and restrict.
> Intrinsics can be used for instruction selection. So in my opinion linear
> assembly code is not necessary. The benefit of assembly code is instruction selection.
and sequence.
> With pragmas, restrict and intrinsics I can implement most of what assembly code offers.
>
> I have worked on optimization for a while, and I find that memory access is more
> important, because it influences performance greatly.
> So the first step in the workflow for improving the performance of C code should
> be to improve the memory access pattern.
>
> I have no experience of using DMA on C64x+. Can anyone share their experience
> of using DMA? How does DMA improve performance? I find that DMA is not part of
> DSP/BIOS.
DSP/BIOS supports DMA. Look up 'Direct Memory Access' on Wikipedia.
The short version is that DMA uses a state machine to perform memory
[or peripheral] accesses while the CPU is executing instructions.
mikedunn
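
To make the "DMA works while the CPU executes" idea above concrete, here is a minimal double-buffering (ping-pong) sketch in portable C. It is not real driver code: dma_start and dma_wait are hypothetical stand-ins invented for this example, and dma_start simply calls memcpy so the sketch runs anywhere. On real hardware the transfer of the next block from slow memory would proceed in the background while the CPU sums the current block in fast memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4  /* elements per block; tiny for illustration */

/* Hypothetical stand-in for starting a DMA transfer. Real hardware would
 * program the DMA controller and return at once; here it just copies. */
static void dma_start(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Hypothetical stand-in for waiting on completion (nothing to wait for). */
static void dma_wait(void)
{
}

/* Sum `total` ints from slow memory, staged through two fast buffers. */
long sum_double_buffered(const int *slow, size_t total)
{
    int fast[2][BLOCK];
    long sum = 0;
    size_t nblocks, b, i;
    int cur = 0;

    /* Assumption of this sketch: total is a multiple of BLOCK. */
    assert(total % BLOCK == 0);
    nblocks = total / BLOCK;
    if (nblocks == 0) {
        return 0;
    }

    dma_start(fast[cur], slow, BLOCK);          /* prefetch block 0 */
    for (b = 0; b < nblocks; b++) {
        dma_wait();                             /* block b has arrived */
        if (b + 1 < nblocks) {                  /* fetch block b+1 into the other buffer */
            dma_start(fast[cur ^ 1], slow + (b + 1) * BLOCK, BLOCK);
        }
        for (i = 0; i < BLOCK; i++) {           /* CPU works on block b meanwhile */
            sum += fast[cur][i];
        }
        cur ^= 1;                               /* swap ping and pong */
    }
    return sum;
}
```

The design point is that the CPU only ever touches the fast buffers, so the cost of the slow memory is hidden as long as a block transfer takes no longer than processing one block.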
> I want to know whether DMA can be used without DSP/BIOS.
>
> Best Regards
> Jogging
--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
_____________________________________