DSPRelated.com
Forums

problem with paralleling TMS320C55

Started by Gilles RONSIN September 20, 2005
hi,

I'm currently optimizing some assembly code (initially generated
from a C source with Code Composer 2.20.20). The source contains
some instructions in parallel, but I need to break up some dual
operations to separate functionality. I don't understand why, if I
remove the || (so the instructions become sequential), the program
no longer runs correctly (erroneous values). Can somebody explain?

thx

-- 
Embryo of a website: http://gilles.ronsin.free.fr
Manage your unread messages: http://gilles.ronsin.free.fr/#nonlus V3.0
It is impossible for an optimist to be pleasantly surprised.
Gilles RONSIN wrote:
> hi,
>
> I'm currently optimizing some assembly code (initially generated
> from a C source with Code Composer 2.20.20). The source contains
> some instructions in parallel, but I need to break up some dual
> operations to separate functionality. I don't understand why, if I
> remove the || (so the instructions become sequential), the program
> no longer runs correctly (erroneous values). Can somebody explain?

An example of a failed serialization and a few instructions around it might help someone to understand the problem.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
Jerry Avins <jya@ieee.org> wrote on Tue. 20 Sept. 2005 15:32:48:

Hi

> An example of a failed serialization and a few instructions around
> it might help someone to understand the problem.

Thanks for the answer. I don't understand all the subtleties of assembly language; my test only looks at the overall result (an audio file).

Working part (the comments are my interpretation):

       mov r2, hi(ac0)          ; ac0 = r2<<16
    || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
       mov dbl(*sp(#0)),ac0     ; get tab[0]
    || sftl ac0,#0,ac1          ; set ac1 with ac0 and reset flag?
       mac ac1,t1,ac0           ; ac0=ac0*t1+ac1
    || mov *sp(#7),t2           ; t2=tab[4]&0xFF

I tried this:

       mov r2, hi(ac0)
       mac ac0,t1,ac0
       mov *sp(#7),t2

and the result (sound) is bad.

I also tried removing all the || while keeping all the instructions, but the sound fails too.

This is just one of my examples...
Gilles RONSIN wrote:
> Jerry Avins <jya@ieee.org> wrote on Tue. 20 Sept. 2005 15:32:48:
>
> Hi
>
>> An example of a failed serialization and a few instructions around
>> it might help someone to understand the problem.
>
> Thanks for the answer. I don't understand all the subtleties of
> assembly language; my test only looks at the overall result (an
> audio file).
>
> Working part (the comments are my interpretation):
>
>    mov r2, hi(ac0)          ; ac0 = r2<<16
> || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
>    mov dbl(*sp(#0)),ac0     ; get tab[0]
> || sftl ac0,#0,ac1          ; set ac1 with ac0 and reset flag?
>    mac ac1,t1,ac0           ; ac0=ac0*t1+ac1
> || mov *sp(#7),t2           ; t2=tab[4]&0xFF
>
> I tried this:
>    mov r2, hi(ac0)
>    mac ac0,t1,ac0
>    mov *sp(#7),t2
>
> and the result (sound) is bad

That doesn't run the same sequence at all. You want

    mov r2, hi(ac0)
    mov ac0, dbl(*sp(#0))
    mov dbl(*sp(#0)),ac0
    sftl ac0,#0,ac1
    mac ac1,t1,ac0
    mov *sp(#7),t2

> I also tried removing all the || while keeping all the instructions,
> but the sound fails too

You can do the separate operations sequentially instead of in parallel (of course, it takes longer that way) but you can't just not do them.

Jerry
Jerry Avins <jya@ieee.org> wrote on Tue. 20 Sept. 2005 17:44:31:

Hi Jerry

>>    mov r2, hi(ac0)          ; ac0 = r2<<16
>> || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
>>    mov dbl(*sp(#0)),ac0     ; get tab[0]
>> || sftl ac0,#0,ac1          ; set ac1 with ac0 and reset flag?
>>    mac ac1,t1,ac0           ; ac0=ac0*t1+ac1
>> || mov *sp(#7),t2           ; t2=tab[4]&0xFF
>>
>> I tried this:
>>    mov r2, hi(ac0)
>>    mac ac0,t1,ac0
>>    mov *sp(#7),t2
>>
>> and the result (sound) is bad
>
> That doesn't run the same sequence at all. You want
>
>    mov r2, hi(ac0)
>    mov ac0, dbl(*sp(#0))
>    mov dbl(*sp(#0)),ac0

Why save ac0 only to read it back immediately afterwards? (I didn't mention that the line after this sequence is mov ac0, dbl(*sp(#0)) again; that's why I want to remove them.)

>    sftl ac0,#0,ac1
>    mac ac1,t1,ac0

Why copy ac0 into ac1 when you can directly do a mac ac0,t1,ac0?

>    mov *sp(#7),t2

I just need to recover as many cycles as I can. Anyway, if I write it the way you suggest, the result is still not good... that is what I don't understand.
Gilles RONSIN wrote:
> Jerry Avins <jya@ieee.org> wrote on Tue. 20 Sept. 2005 17:44:31:
>
> Hi Jerry
>
>>>    mov r2, hi(ac0)          ; ac0 = r2<<16
>>> || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
>>>    mov dbl(*sp(#0)),ac0     ; get tab[0]
>>> || sftl ac0,#0,ac1          ; set ac1 with ac0 and reset flag?
>>>    mac ac1,t1,ac0           ; ac0=ac0*t1+ac1
>>> || mov *sp(#7),t2           ; t2=tab[4]&0xFF
>>>
>>> I tried this:
>>>    mov r2, hi(ac0)
>>>    mac ac0,t1,ac0
>>>    mov *sp(#7),t2
>>>
>>> and the result (sound) is bad
>>
>> That doesn't run the same sequence at all. You want
>>
>>    mov r2, hi(ac0)
>>    mov ac0, dbl(*sp(#0))
>>    mov dbl(*sp(#0)),ac0
>
> Why save ac0 only to read it back immediately afterwards? (I didn't
> mention that the line after this sequence is mov ac0, dbl(*sp(#0))
> again; that's why I want to remove them.)
>
>>    sftl ac0,#0,ac1
>>    mac ac1,t1,ac0
>
> Why copy ac0 into ac1 when you can directly do a
> mac ac0,t1,ac0?

You don't save time by removing the instructions. Parallel execution means that both instructions execute at the same time.

>>    mov *sp(#7),t2
>
> I just need to recover as many cycles as I can.

The operations are there for a reason. You can't just not do them; the best you can do is run them in parallel, and that's what || accomplishes.

> Anyway, if I write it the way you suggest, the result is still not
> good... that is what I don't understand.

It could be that the effective order of execution changes when you remove the parallelism. Anyhow, you can only make it slower that way, even if you make it run correctly. If you look at the op code, you'll see that

       mov r2, hi(ac0)          ; ac0 = r2<<16
    || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]

is _one_ instruction with _one_ execution time. You'll have to get your speedup somewhere else.

Jerry
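
As a rough illustration of how dropping || could change the effective order of execution, here is a toy Python model of the first two pairs from the listing in this thread. It rests on an assumption that is not stated in the thread and should be checked against the C55x manual: both halves of a || pair read their source operands before either half writes its result. The register names mirror the listing; run_group, paired, serialized, mem0 and the starting values are made up for the sketch, and sftl ac0,#0,ac1 is treated as a plain copy.

```python
# Toy model only. ASSUMPTION: every instruction in a "||" group reads the
# machine state as it was before the group started, then all results are
# written. Real C55x pipeline behaviour must be checked in the manual.

def mov_r2_to_hi_ac0(s):   # mov r2, hi(ac0)
    return {"ac0": s["r2"] << 16}

def store_ac0(s):          # mov ac0, dbl(*sp(#0))
    return {"mem0": s["ac0"]}

def load_ac0(s):           # mov dbl(*sp(#0)), ac0
    return {"ac0": s["mem0"]}

def copy_ac0_to_ac1(s):    # sftl ac0,#0,ac1 (modelled as a plain copy)
    return {"ac1": s["ac0"]}

def run_group(state, *ops):
    """Run ops as one group: all read the old state, then all write."""
    snapshot = dict(state)
    for op in ops:
        state.update(op(snapshot))

def paired(initial):
    """The two pairs as the compiler emitted them."""
    s = dict(initial)
    run_group(s, mov_r2_to_hi_ac0, store_ac0)
    run_group(s, load_ac0, copy_ac0_to_ac1)
    return s

def serialized(initial):
    """The same four instructions with the || removed."""
    s = dict(initial)
    for op in (mov_r2_to_hi_ac0, store_ac0, load_ac0, copy_ac0_to_ac1):
        run_group(s, op)
    return s

# Hypothetical starting values, chosen only to make the difference visible.
START = {"r2": 3, "ac0": 1000, "ac1": 0, "mem0": 0}

if __name__ == "__main__":
    print("paired:    ", paired(START))
    print("serialized:", serialized(START))
```

Under that assumption, the paired version stores the old ac0 and gets it back in ac0 while ac1 receives r2<<16. The serialized version overwrites ac0 before it is stored, so the old accumulator value is lost. If the hardware behaves this way, the store and reload are not redundant: together with the pairing they preserve the previous accumulator value for the mac that follows.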
Jerry Avins <jya@ieee.org> wrote on Wed. 21 Sept. 2005 16:04:34:

> You don't save time by removing the instructions. Parallel
> execution means that both instructions execute at the same time.

Sure, that's a very good principle.

>>>    mov *sp(#7),t2
>>
>> I just need to recover as many cycles as I can.
>
> The operations are there for a reason. You can't just not do them;
> the best you can do is run them in parallel, and that's what ||
> accomplishes.

Yes, but I also need to break up some parts of the assembly code so that the same compilation options can be used everywhere, without generating as many code variants as there are existing option combinations (it is a big project). Otherwise the code would be too difficult to maintain.

>> Anyway, if I write it the way you suggest, the result is still not
>> good... that is what I don't understand.
>
> It could be that the effective order of execution changes when you
> remove the parallelism. Anyhow, you can only make it slower that
> way, even if you make it run correctly. If you look at the op
> code, you'll see that
>
>    mov r2, hi(ac0)          ; ac0 = r2<<16
> || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
>
> is _one_ instruction with _one_ execution time. You'll have to get
> your speedup somewhere else.

You are right. My observation was that a compiler has to follow fixed patterns when it translates code. For example, after each calculation a memory write is done (mov ac0,dbl(*sp(#0))), even if the value is read back by the very next instruction... If you have 10 consecutive C statements, you will get 9 "mov ac0,dbl(*sp(#0))", each followed by "mov dbl(*sp(#0)),ac0", between the statements. When optimizing, it's easy to remove all the intermediate saves and keep only the first read of the value and the last write. Admittedly, this can sometimes break the benefit of dual instructions.

Thank you for your remarks.

Regards
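
The clean-up described here (keep the first read and the last write, drop the saves in between) can be sketched as a small peephole pass. This is only a Python toy over a made-up instruction list, not a tool for real C55x listings: the tuples, the slot number and drop_redundant_spills are hypothetical, and it assumes straight-line code, a single accumulator, no other access to the slot, and instructions that stand alone rather than inside a || pair.

```python
# Hypothetical mini instruction list:
#   ("load", slot)   accumulator <- stack slot
#   ("store", slot)  stack slot  <- accumulator
#   ("op", text)     any computation on the accumulator

def drop_redundant_spills(program):
    """Remove store/reload traffic on a stack slot in straight-line code."""
    # Pass 1: a load right after a store to the same slot re-reads a value
    # the accumulator already holds, so the load can go.
    kept = []
    for ins in program:
        if ins[0] == "load" and kept and kept[-1] == ("store", ins[1]):
            continue
        kept.append(ins)

    # Pass 2: a store is dead when the next access to its slot is another
    # store (nothing reads the value in between).
    result = []
    for i, ins in enumerate(kept):
        if ins[0] == "store":
            later = [x for x in kept[i + 1:]
                     if x[0] in ("load", "store") and x[1] == ins[1]]
            if later and later[0][0] == "store":
                continue
        result.append(ins)
    return result

# Three consecutive "C statements", each reloading and saving slot 0.
program = [
    ("load", 0), ("op", "statement 1"), ("store", 0),
    ("load", 0), ("op", "statement 2"), ("store", 0),
    ("load", 0), ("op", "statement 3"), ("store", 0),
]
optimized = drop_redundant_spills(program)

if __name__ == "__main__":
    for ins in optimized:
        print(ins)
```

The pass is only safe under the stated assumptions. A store that sits inside a || pair may be saving a different value from the one just computed, and an interrupt or another pointer may read the slot, which could be why a compiler keeps such stores.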
Gilles RONSIN wrote:
> Jerry Avins <jya@ieee.org> wrote on Wed. 21 Sept. 2005 16:04:34:
>
>> You don't save time by removing the instructions. Parallel
>> execution means that both instructions execute at the same time.
>
> Sure, that's a very good principle.
>
>>>>    mov *sp(#7),t2
>>>
>>> I just need to recover as many cycles as I can.
>>
>> The operations are there for a reason. You can't just not do them;
>> the best you can do is run them in parallel, and that's what ||
>> accomplishes.
>
> Yes, but I also need to break up some parts of the assembly code so
> that the same compilation options can be used everywhere, without
> generating as many code variants as there are existing option
> combinations (it is a big project). Otherwise the code would be too
> difficult to maintain.
>
>>> Anyway, if I write it the way you suggest, the result is still not
>>> good... that is what I don't understand.
>>
>> It could be that the effective order of execution changes when you
>> remove the parallelism. Anyhow, you can only make it slower that
>> way, even if you make it run correctly. If you look at the op
>> code, you'll see that
>>
>>    mov r2, hi(ac0)          ; ac0 = r2<<16
>> || mov ac0, dbl(*sp(#0))    ; store in 32-bit tab[0]
>>
>> is _one_ instruction with _one_ execution time. You'll have to get
>> your speedup somewhere else.
>
> You are right. My observation was that a compiler has to follow fixed
> patterns when it translates code. For example, after each calculation
> a memory write is done (mov ac0,dbl(*sp(#0))), even if the value is
> read back by the very next instruction... If you have 10 consecutive
> C statements, you will get 9 "mov ac0,dbl(*sp(#0))", each followed by
> "mov dbl(*sp(#0)),ac0", between the statements. When optimizing, it's
> easy to remove all the intermediate saves and keep only the first
> read of the value and the last write. Admittedly, this can sometimes
> break the benefit of dual instructions.

Those look like register indirect writes that all go to different places. The code steps through one table of values and writes another.

Jerry
Jerry Avins wrote:

   ...

> Those look like register indirect writes that all go to different
> places. The code steps through one table of values and writes another.

The compiler is pretty smart. Before altering its output, check with the manual to be sure that you know exactly what each instruction does. If necessary, sketch for yourself a picture of what the code does. More of your time savings will come from cleverer algorithms, some of which can't be expressed well in C, than from beating the compiler at its own game. Sometimes, you may be able to code a zero-overhead loop that (for reasons of safety in the general case) the compiler avoided. If the original program assumes no initialized data, you may be able to save some code space. With a good compiler, you will have to settle for an accumulation of small gains.

Jerry
Jerry Avins <jya@ieee.org> wrote on Wed. 21 Sept. 2005 19:33:48:

For information, I've found a good document that studies optimization for
the C55x:

http://www.ktu.lt/ultra/journal/pdf_43_2/43-2002-Vol.2_05-B.Varnagiryte.pdf

