
Technical discussions about the TI C6000 DSPs (including the c62x, c64x and c67x DSPs).
Subject: [c6x] Is Assembly code/linear assembly code necessary?
On Sunday, March 29, 2009, 7:41 PM, j...@gmail.com wrote:

Hi all,

I have been reading these documents about optimization:
spru187o TMS320C6000 Optimizing Compiler v6.1.pdf
spru198i TMS320C6000 Programmer's Guide.pdf
spru732c TMS320C64xC64x+ DSP CPU and Instruction Set Reference Guide.pdf

The C64x+ is a specialized architecture, and its instructions have different latencies depending on the type of instruction. At first I thought that refining the C/C++ code with pragmas and programming with intrinsics would solve the optimization problem. But some presentation slides I found on the internet suggest that assembly code is sometimes necessary.

In your experience, is assembly code or linear assembly code necessary? Under what conditions, and for what applications?

Thanks in advance.
Jogging
___________________________________________________________________
On Tue, Mar 31, 2009 at 1:13 PM, rvsasi <r...@yahoo.com> wrote:

Hi,

It is difficult to give an answer without a complete understanding of the real-time deadlines of the system. Let me take a swing at it in the most general fashion.

If you are looking for an average issue slot usage of 6-7 or above (out of a possible 8) for the inner loops/kernels of your algorithm, then there might be a need for pipelined or linear assembly (more likely pipelined assembly). But pipelined assembly takes a long time to develop and is harder to maintain, by an order of magnitude.

Linear assembly is much easier to code and somewhat easier to maintain than pipelined assembly. Linear assembly has given me an average of about 5 to 6 issue slots per cycle for inner loops. But it is possible that, out of 10 cycles, two or three use only 4 of the 8 slots.

Obviously you have to make an initial target-mapping analysis of your requirements by mapping the loads/stores and arithmetic of your algorithm onto the C6x VLIW instruction set capabilities, keeping in mind all the restrictions of the processor (cross-path stalls, etc.). If you are using existing library functions, this might become a little difficult. But in general this analysis gives you a good idea of what is achievable.

In general, intrinsics with good compiler directives (pragmas) do the job for most applications. The C6x compiler provides very efficient pragmas for optimization, and intrinsics with good pragmas easily give you an average issue slot usage of 4-6.

Pragmas are critical, but so is the use of type qualifiers such as restrict and const. In addition, the C6x compiler provides pragmas to align memory elements; completely avoiding unaligned accesses can be a benefit. The compiler also provides good debug info (I think you need to turn it on) about what exactly can improve the algorithm's performance, for example whether there are excessive register-to-memory spills.

Regards
_____________________________________
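The advice above about type qualifiers and alignment can be sketched in C roughly as follows. This is only an illustration under assumptions: the buffer names, the length `N`, and the function `vec_add` are made up for the example, and the `_nassert` fallback macro exists only so the code also builds with a non-TI compiler. `DATA_ALIGN`, `_nassert`, and `restrict` are the TI compiler features the message refers to in general terms; whether they help a given loop depends on the code and the compiler options.

```c
#include <stddef.h>

/* _TMS320C6X is predefined by the TI C6000 compiler. On other compilers
   _nassert does not exist, so make it a no-op (assumption for portability). */
#ifndef _TMS320C6X
#define _nassert(cond) ((void)0)
#endif

#define N 256  /* hypothetical buffer length */

/* Ask the TI tools to place the buffers on 8-byte boundaries so that
   wide (aligned) loads and stores are possible. */
#ifdef _TMS320C6X
#pragma DATA_ALIGN(in_a, 8)
#pragma DATA_ALIGN(in_b, 8)
#pragma DATA_ALIGN(result, 8)
#endif
short in_a[N];
short in_b[N];
short result[N];

/* restrict promises that out, a and b never overlap, so the compiler may
   reorder loads and stores; const marks the inputs as read-only. */
void vec_add(short *restrict out,
             const short *restrict a,
             const short *restrict b,
             size_t n)
{
    size_t i;

    /* Tell the TI compiler the pointers are 8-byte aligned. On the target,
       the caller must really pass aligned buffers (such as those above). */
    _nassert(((size_t)out & 7) == 0);
    _nassert(((size_t)a & 7) == 0);
    _nassert(((size_t)b & 7) == 0);

    for (i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}
```

A call such as `vec_add(result, in_a, in_b, N)` uses the aligned buffers defined above.
_____________________________________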
On Wed, 22 Apr 2009 20:38:20 +0800, j...@gmail.com wrote:

Hi,
Thanks for your opinion. I agree with you completely.
Recently I found that memory access may influence performance more than
assembly code does.
To learn more about the memory access effect, I ran some tests.
I ran the IMG_perimeter function from the imglib library on the DM6437 EVM.
In the example, the test program runs the C version of the function and then
the assembly version.
First I put the data in L2 RAM. The resulting times are:
IMG_perimeter asm cycle: 1029
IMG_perimeter c cycle: 2941
Then I put the data in external DDR2 memory. The resulting times are:
IMG_perimeter asm cycle: 6250
IMG_perimeter c cycle: 13234
We can see that putting the data in L2 RAM reduces the time of the C version
from 13234 to 2941 cycles. That is a larger gain than the assembly code
optimization, which reduces the time from 13234 to 6250 cycles.
Until now I had paid attention only to assembly code optimization and had not
noticed the memory access effect.
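To make the comparison concrete, the ratios can be worked out from the four cycle counts reported above. The following is only arithmetic on those figures; the constant and function names are invented for the sketch.

```c
#include <stdio.h>

/* Cycle counts reported in the message (IMG_perimeter on the DM6437 EVM). */
enum {
    ASM_L2   = 1029,   /* assembly version, data in L2 RAM */
    C_L2     = 2941,   /* C version, data in L2 RAM        */
    ASM_DDR2 = 6250,   /* assembly version, data in DDR2   */
    C_DDR2   = 13234   /* C version, data in DDR2          */
};

/* Speedup factor when a run drops from `before` to `after` cycles. */
double speedup(unsigned before, unsigned after)
{
    return (double)before / (double)after;
}

/* Print the four pairwise comparisons discussed in the message. */
void print_speedups(void)
{
    printf("C,   DDR2 -> L2 RAM: %.2fx\n", speedup(C_DDR2, C_L2));
    printf("asm, DDR2 -> L2 RAM: %.2fx\n", speedup(ASM_DDR2, ASM_L2));
    printf("DDR2,   C -> asm:    %.2fx\n", speedup(C_DDR2, ASM_DDR2));
    printf("L2 RAM, C -> asm:    %.2fx\n", speedup(C_L2, ASM_L2));
}
```

With these numbers, moving the data to L2 RAM gains about 4.50x for the C version and about 6.07x for the assembly version, while switching from C to assembly gains about 2.12x in DDR2 and about 2.86x in L2 RAM.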
My other question is this: a memory access takes multiple cycles in the
C64x+ pipeline.
A load instruction needs five cycles to deliver its data. If a queue or tree
data structure is used,
I don't know how to optimize it. Can anyone share their experience with this?
Thanks in advance.
Jogging
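One possible direction for the queue case, offered as a sketch rather than as an answer from the thread: if the queue is stored in an array (a ring buffer) instead of linked nodes, the address of each element depends only on an index, not on a pointer that must first be loaded. That gives the compiler independent loads it can overlap. The type and function names below are hypothetical.

```c
#define QUEUE_CAP 64u  /* hypothetical capacity; power of two so wrap is a mask */

typedef struct {
    int      data[QUEUE_CAP];
    unsigned head;    /* index of the oldest element */
    unsigned count;   /* number of stored elements   */
} ring_queue;

void rq_init(ring_queue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Append a value; returns 0 if the queue is full, 1 on success. */
int rq_push(ring_queue *q, int value)
{
    if (q->count == QUEUE_CAP)
        return 0;
    q->data[(q->head + q->count) & (QUEUE_CAP - 1u)] = value;
    q->count++;
    return 1;
}

/* Remove the oldest value; returns 0 if the queue is empty, 1 on success. */
int rq_pop(ring_queue *q, int *value)
{
    if (q->count == 0)
        return 0;
    *value = q->data[q->head];
    q->head = (q->head + 1u) & (QUEUE_CAP - 1u);
    q->count--;
    return 1;
}

/* Walk the queue without removing anything. Each element address is computed
   from i alone, so no load has to wait for the result of the previous load. */
int rq_sum(const ring_queue *q)
{
    int sum = 0;
    unsigned i;

    for (i = 0; i < q->count; i++)
        sum += q->data[(q->head + i) & (QUEUE_CAP - 1u)];
    return sum;
}
```

The same idea can be applied to a tree by storing nodes in an array and linking them by index, although whether that suits a particular algorithm is a separate question.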
_____________________________________
On Thu, Apr 23, 2009 at 12:23 AM, christophe blouet <c...@hotmail.com> wrote:

Hi,

I have some doubts about your figures. Are you sure you had the cache enabled when running from external memory?
Where was the data to process? In internal SDRAM as well?
I wouldn't use the term "internal L2 RAM": L2 means Level 2 cache, and internal RAM is internal RAM. They sound like two different things to me.

Normally, with some good pragmas and optimization options for the compiler, you can get the same result as assembly code with far less effort.

Regards
___________________________________________________________________
On Thu, 23 Apr 2009 09:45:30 +0800, j...@gmail.com wrote:

Hi,
I am sure that the external memory is cacheable, because I obtained three sets
of figures.
The third set is with the cache turned off for external memory:
IMG_perimeter asm cycle: 28444
IMG_perimeter c cycle: 298242
The IMG_perimeter function needs one input buffer and one output buffer.
In the tests I put both of them either in internal RAM or in DDR2.
Best Regards
Jogging
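For readers who want to repeat this kind of experiment, one way to control where the input and output buffers live with the TI compiler is the `DATA_SECTION` pragma. The sketch below is an assumption about how such a test could be set up, not the code used in the thread: the section name `.img_internal`, the buffer sizes, and the helper function are hypothetical, and the section still has to be mapped to internal RAM or DDR2 in the project's linker command file.

```c
#include <string.h>

#define IMG_W 64  /* hypothetical image width  */
#define IMG_H 64  /* hypothetical image height */

/* On the TI compiler, put both buffers in a named section. The linker
   command file then decides whether that section lands in internal RAM
   or in external DDR2. Other compilers skip the pragmas. */
#ifdef _TMS320C6X
#pragma DATA_SECTION(img_in, ".img_internal")
#pragma DATA_SECTION(img_out, ".img_internal")
#endif
unsigned char img_in[IMG_W * IMG_H];
unsigned char img_out[IMG_W * IMG_H];

/* Reset both buffers before a timing run so every run starts from the same state. */
void clear_image_buffers(void)
{
    memset(img_in, 0, sizeof img_in);
    memset(img_out, 0, sizeof img_out);
}
```

Re-linking with the section mapped to a different memory range changes the placement without touching the C source.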
_____________________________________
jogging,

On Wed, Apr 22, 2009 at 7:38 AM, jogging song <j...@gmail.com> wrote:
> My another question is that: memory access latency is multiple cycles in the
> C64x+ pipeline.
> For load instruction, it needs five cycles to obtain data. If queue or tree
> data structure is used,
> I don't know how to optimize it. Can anyone share his experience with it?

<mld>
Check out 'delay slots' and 'load instructions' in spru732c. If you look at the assembly code generated by the C compiler, you will probably see that it makes use of the delay slots.

Q1. Are you comparing optimized [by the compiler] C code with assembly code?

mikedunn

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
___________________________________________________________________
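The effect of load delay slots on the queue/tree question can be illustrated with a deliberately idealized cycle count. The model below assumes the four delay slots of C64x+ load instructions (which matches the "five cycles" mentioned in the question), one load issued per cycle, and no memory stalls; the function names are invented for the sketch, and real schedules will differ.

```c
/* Idealized model: ignores cache misses, memory stalls and the fact that
   the CPU has two .D units. */
enum { LOAD_DELAY_SLOTS = 4 };  /* load result is usable 5 cycles after issue */

/* Pointer chasing: each load address comes from the previous load's result,
   so every load waits out the full latency before the next one can issue. */
unsigned dependent_chain_cycles(unsigned n_loads)
{
    return n_loads * (1u + LOAD_DELAY_SLOTS);
}

/* Independent loads (for example array elements): one load issues per cycle,
   so only the delay slots of the last load are exposed. */
unsigned pipelined_cycles(unsigned n_loads)
{
    return n_loads == 0 ? 0 : n_loads + LOAD_DELAY_SLOTS;
}
```

For 10 loads the model gives 50 cycles for the dependent chain and 14 cycles for independent loads, which is why linked structures are harder to speed up than array accesses unless other useful work can be placed in the delay slots.
_____________________________________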
christophe,

On Wed, Apr 22, 2009 at 11:23 AM, christophe blouet <c...@hotmail.com> wrote:
> Where were the data to process? in internal SDRAM as well?

<mld>
If we are being picky about terminology [I do not care for the term 'L2 RAM'], should we not say 'internal SDRAM'?? :-)

mikedunn

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
_____________________________________
OK, that sounds good regarding the cache being enabled, but how big is your cache? It can change
the results if your program is big. If it is a small one, then once it is loaded in
cache you wouldn't see much difference between internal SDRAM ;-) and external
DDR.
Do have a good look at the C optimizations: if you give the compiler a minimum loop
count, it will expand the number of calculations in each loop iteration, and then your
code won't suffer from pipeline delays. Using this method I got the same results as
the best optimized routine in asm.
Regards
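Giving the compiler a minimum loop count is usually done with the TI `MUST_ITERATE` pragma, optionally combined with `UNROLL`. The sketch below shows the general shape under assumptions: the function `sum_samples`, the trip-count limits (8, 1024, multiple of 4) and the unroll factor are chosen only for illustration, and the results described in the message are not claimed for this particular loop.

```c
/* Sum n 16-bit samples.
   Contract assumed by the pragmas: 8 <= n <= 1024 and n is a multiple of 4. */
int sum_samples(const short *restrict x, int n)
{
    int sum = 0;
    int i;

    /* On the TI compiler, promise the trip-count limits and suggest
       unrolling by 4. Other compilers skip the pragmas. */
#ifdef _TMS320C6X
#pragma MUST_ITERATE(8, 1024, 4)
#pragma UNROLL(4)
#endif
    for (i = 0; i < n; i++)
        sum += x[i];

    return sum;
}
```

The promise must hold for every call: if `sum_samples` were called with a count outside the stated limits on the target, the generated code could be wrong, so the limits belong in the function's documented contract.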
Date: Thu, 23 Apr 2009 09:45:30 +0800
Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
From: j...@gmail.com
To: c...@hotmail.com
CC: r...@yahoo.com; c...@yahoogroups.com
Hi,
I assure that external memory is cacheable because I obtain three sets of
figures.
The third set of figure is with cache off on external memory.
IMG_perimeter asm cycle: 28444
IMG_perimeter c cycle: 298242
In the function IMG_perimeter needs one input and one output.
In the test I put them both in internal RAM or in DDR2.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 12:23 AM, christophe blouet <c...@hotmail.com>
wrote:
Hi,
I have some doubts on your figures, are you sure you had Cache enabled when
running in external memory?
Where were the data to process? in internal SDRAM as well?
I wouldn't use the term internal L2 RAM, L2 means Level 2 Cache, internal RAM is
internal RAM, it sounds 2 different things to me.
Normally with some good pragmas and optimise instructions to the compiler you
can get the same result as assembly code, but for far less efforts.
Regards
> To: r...@yahoo.com
> CC: c...@yahoogroups.com
> From: j...@gmail.com
> Date: Wed, 22 Apr 2009 20:38:20 +0800
> Subject: Re: [c6x] Is Assembly code/linear assembly code necessary?
>
> Hi,
> Thanks for your opinion. I agree with you completely.
> Recently I found that memory access may influence performance more than
> assembly coding does.
> To learn more about the memory access effect, I ran some tests.
> I ran the IMG_perimeter function from the imglib library on the DM6437 EVM.
> The test program runs the C version of the function and then the
> assembly version.
> First I put the data in L2 RAM. The resulting times are:
> IMG_perimeter asm cycle: 1029
> IMG_perimeter c cycle: 2941
>
> Then I put the data in external DDR2 memory. The resulting times are:
> IMG_perimeter asm cycle: 6250
> IMG_perimeter c cycle: 13234
>
> We can see that putting the data in L2 RAM reduces the time of the C version
> from 13234 to 2941 cycles. That is a much bigger gain than assembly code
> optimization, which reduces the time from 13234 to 6250 cycles.
>
> Previously I paid attention only to assembly code optimization and had not
> noticed the memory access effect.
>
> Another question: memory access latency is multiple cycles in the
> C64x+ pipeline.
> A load instruction needs five cycles to obtain its data. If a queue or tree
> data structure is used,
> I don't know how to optimize it. Can anyone share their experience with this?
>
> Thanks in advance.
> Jogging
>
______________________________
Hi, Michael
At first I hoped to learn why linear assembly code is necessary.
I can provide information to the C compiler with pragmas and restrict,
and intrinsics can be used for instruction selection. So in my opinion linear
assembly code is not necessary. The benefit of assembly code is instruction
selection. With pragmas, restrict and intrinsics I can achieve most of what
assembly code offers.
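As a rough illustration of what "providing information to the compiler" can look like, here is a minimal sketch using the restrict qualifier, the _nassert intrinsic and the MUST_ITERATE pragma of the TI compiler. The function vec_add and its trip-count and alignment promises are invented for this example, and the TI-specific parts are guarded so that the code also builds with a host C99 compiler:

```c
#include <assert.h>

#ifndef _TMS320C6X
/* _nassert is a TI compiler intrinsic; make it a no-op elsewhere. */
#define _nassert(cond) ((void)0)
#endif

/* out[i] = a[i] + b[i].
   Promises made by the caller (assumptions of this sketch):
   - the three buffers do not overlap (restrict),
   - n is a multiple of 4 with 8 <= n <= 1024,
   - on the DSP, each buffer is 8-byte aligned. */
void vec_add(const short *restrict a, const short *restrict b,
             short *restrict out, int n)
{
    int i;

    assert(n >= 8 && n <= 1024 && n % 4 == 0);

    /* Alignment hints: let the compiler use wide aligned loads/stores. */
    _nassert((int)a % 8 == 0);
    _nassert((int)b % 8 == 0);
    _nassert((int)out % 8 == 0);

#ifdef _TMS320C6X
    /* Trip-count hint: minimum 8, maximum 1024, always a multiple of 4. */
    #pragma MUST_ITERATE(8, 1024, 4)
#endif
    for (i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}
```

On the target, the buffers passed in would typically be placed with `#pragma DATA_ALIGN(symbol, 8)` so that the alignment promise actually holds. Whether these hints pay off is best judged from the compiler's generated assembly and software-pipelining feedback.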
I have worked on optimization for a while and find that memory access is more
important, because it influences performance greatly.
So the first step in the workflow for improving the performance of C code
should be to improve the memory access pattern.
I have no experience of using DMA on the C64x+. Can anyone share their
experience of using DMA? How does DMA improve performance? I find that DMA is
not part of DSP/BIOS.
I want to know whether DMA can be used without DSP/BIOS.
Best Regards
Jogging
On Thu, Apr 23, 2009 at 1:32 PM, Michael Dunn <m...@gmail.com> wrote:
> jogging,
>
> On Wed, Apr 22, 2009 at 7:38 AM, jogging song <j...@gmail.com>
> wrote:
> > Another question: memory access latency is multiple cycles in the
> > C64x+ pipeline.
> > A load instruction needs five cycles to obtain its data. If a queue or tree
> > data structure is used,
> > I don't know how to optimize it. Can anyone share their experience with it?
> <mld>
> Check out 'delay slots' and 'load instructions' in spru732c. If you
> look at the assembly code generated by the C compiler, you will
> probably see that it makes use of the delay slots.
> Q1. Are you comparing optimized [by the compiler] C code with assembly
> code??
>
> mikedunn
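To make the delay-slot point concrete for queues and trees: a load's result only arrives after its delay slots, so loads can overlap only when the next address does not depend on the value just loaded. The sketch below contrasts the two cases. It is plain C with made-up names (struct node, sum_list, sum_array), not code from imglib or from this thread:

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Pointer chasing: the address of each load is the result of the previous
   load, so every iteration has to wait out the full load latency. */
long sum_list(const struct node *head)
{
    long sum = 0;
    while (head != NULL) {
        sum += head->value;
        head = head->next;
    }
    return sum;
}

/* Same data in a contiguous array: every address follows from i alone,
   so loads for later iterations can be issued while earlier loads are
   still in flight (this is what software pipelining exploits). */
long sum_array(const int *restrict values, size_t count)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < count; i++)
        sum += values[i];
    return sum;
}
```

For a queue, a ring buffer over a fixed array has the same property as sum_array. Comparing the compiler's assembly listing for the two functions should show the difference in how the delay slots get filled.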
______________________________
jogging,

On Thu, Apr 23, 2009 at 4:21 AM, jogging song <j...@gmail.com> wrote:
> Hi, Michael
> At first I hope to know the reason why the linear assembly code is
> necessary.
<mld>
Maybe you misunderstood. I am not saying that coding in assembly is
necessary. What is necessary is to understand what assembly code is
generated by the C compiler. You might effectively optimize C code by
carefully using pragmas, intrinsics, and restrict. IMO, you cannot
evaluate the effectiveness of pragmas, intrinsics, and restrict without
looking at before and after versions of the assembly listing.

> I can provide information to c compiler with pragma and restrict.
> Intrinsics can be used to instruction selection. So in my opinion linear
> assembly code
> is not necessary. The benefit of assembly code is instruction selection.
<mld>
and sequence.

> With pragma,
> restrict and intrinsics I can implement the most function of assembly code.
>
> I work on optimization for a while, and find memory access is more
> important,
> because it influences the performance greatly.
> So the first step of the workflow of improving the performance of C should
> be
> improve memory access pattern.
>
> I have no experience of using DMA on C64x+. Can anyone share his experience
> of
> using DMA. How does DMA improve the performance. I find DMA is not part of
> DSP/BIOS.
<mld>
DSP/BIOS supports DMA. Look up 'Direct Memory Access' on Wikipedia. The
short version is that DMA uses a state machine to perform memory [or
peripheral] accesses while the CPU is executing instructions.

mikedunn

> I want to know whether DMA can be used without DSP/BIOS.
>
> Best Regards
> Jogging

--
www.dsprelated.com/blogs-1/nf/Mike_Dunn.php
___________________________________________________________________
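On the question of how DMA improves performance: a common pattern is double ("ping-pong") buffering, where the DMA fetches the next block from external memory into internal RAM while the CPU processes the current one. The sketch below only shows the control flow. dma_start and dma_wait are hypothetical wrappers, not a TI API; here memcpy stands in for the transfer, so nothing actually overlaps on a host machine. On the DSP they would program the device's EDMA controller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 256   /* samples per block (illustrative value) */

/* Hypothetical DMA wrappers: memcpy stands in for a background transfer,
   so the "transfer" has already finished when dma_start returns. */
static void dma_start(short *dst, const short *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
}
static void dma_wait(void) { }

/* Stand-in for the real per-block processing. */
static long process_block(const short *buf, size_t count)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < count; i++)
        sum += buf[i];
    return sum;
}

/* Process nblocks blocks from "external" memory via two internal buffers. */
long process_stream(const short *ext, size_t nblocks)
{
    static short ping[BLOCK], pong[BLOCK];
    short *work = ping, *fill = pong, *tmp;
    long total = 0;
    size_t b;

    if (nblocks == 0)
        return 0;

    dma_start(work, ext, BLOCK);            /* prime the first buffer */
    dma_wait();
    for (b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)                /* fetch next block in background */
            dma_start(fill, ext + (b + 1) * BLOCK, BLOCK);
        total += process_block(work, BLOCK);    /* CPU works on current block */
        dma_wait();                         /* next block must be complete */
        tmp = work; work = fill; fill = tmp;    /* swap roles */
    }
    return total;
}
```

With real DMA the CPU only ever touches internal buffers, so the external-memory latency seen in the DDR2 measurements is hidden behind the computation as long as the transfer is faster than the processing.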
Hi,
I am facing a problem with a filter implementation.
The sampling frequency of my ADC is 2560 Hz. The ADC will give data with a frequency range of 0-5120 Hz. Because of a hardware limitation I cannot reduce the sampling rate below 2560 Hz, but the signal of interest is 0 to 50 Hz.
So after acquiring data from the ADC I apply an FIR filter whose sampling frequency is 10240 Hz and whose cutoff frequency is 50 Hz. After that I decimate the filtered data by 20. In other words, I am oversampling the signal 20 times.
Fs = 2560 Hz
Fc = 50 Hz
Decimation factor = 20
So the effective sampling frequency is 2560/20 = 128 Hz, and my FFT spectrum shows frequencies from 0 to 64 Hz.
According to my system requirement, anything above -80 dB will be considered to be signal. To achieve this, I should select a filter stop-band attenuation > 90 dB, but a conventional FIR filter cannot give this much attenuation.
Can anyone suggest a technique to achieve this?
Thanks in advance,
Regards,
Ramaraju
______________________________
Why don't you cascade two filters?
Are you sure you have enough dynamic range to achieve more than 90 dB after computation?

To: c...@yahoogroups.com
From: r...@lntemsys.com
Date: Sat, 25 Apr 2009 17:18:17 +0530
Subject: [c6x] Fir filter with high stop band attenuation
___________________________________________________________________