DSPRelated.com
Forums

Load (LDW/LDH/LDB) Instruction

Started by Henrry Andrian April 1, 2004


Jeff-

Thanks for the detailed explanation, I really appreciate it. And to be
clear, I am not disputing anybody on the pipeline structure of the
C6000 series here at all, and I am not disagreeing with the concept of
delay slots either!

Those basics are stated very clearly in the manuals, very lucidly I
might add. No complaints there!

And I fully understand the need for the delay slots and the implications
of a deep pipeline with the PG, PS, PW, PR, DP, DC and E1 to E10 stages
of the C67x pipeline.

And again, I am not talking about *filling* the delay slots either!

I am talking about L1D vs. L2 SRAM. There is no way that there can be
*no difference* between the access time of a value stored in L1D and one
stored in L2. Come on, I can't accept that both have the same access
time!

-bhooshan

>From: "jfbuggen" <>
>To:
>Subject: [c6x] Re: Load (LDW/LDH/LDB) Instruction
>Date: Thu, 01 Apr 2004 14:19:34 -0000
>
>Hi Bhooshan,
>
> > You are probably right as far as C64x devices are concerned, not
> > sure about that device, but you are definitely wrong on 621x and
> > 671x devices. This
>
>I really don't agree with you. C62x and C64x have the same
>pipeline architecture. C67x is somewhat different, but its
>load instructions are performed the same way.
>Please read the pipeline chapter of spru189f carefully.
>
>Let me first make one remark: I'm talking about hand-optimised
>assembly (.asm).
>If you're writing linear assembly (.sa), then you don't
>have to worry about load delay slots, because the assembler
>will insert them for you. With linear assembly, you can consider
>that any instruction is completely finished before the next one
>starts.
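>
>As a tiny illustration (my own sketch, not taken from any TI
>document), the same load in linear assembly versus hand-scheduled
>assembly might look like this:
>
>  ; linear assembly (.sa) - the assembler schedules the delay slots
>        LDW   *A4++, A3
>        ADD   A3, A5, A5
>
>  ; hand-written assembly (.asm) - you must cover the 4 delay slots
>        LDW   *A4++, A3
>        NOP   4              ; A3 is not valid before this point
>        ADD   A3, A5, A5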
>
>I'll try to explain this better than in my previous post.
>Every instruction goes through 3 pipeline stages: fetch, decode
>and execute. The "execute" stage is made up of a variable
>number of phases, depending on the instruction.
>
>The load instruction requires 5 execute phases. It is only at
>the end of the 5th phase that the loaded data is actually written
>into the destination register.
>
>Let's say you're performing a load, followed by
>4 instructions A, B, C, D :
> LDW *A4++, A3
> A
> B
> C
> D
>
>I'll show what's happening at every cycle, considering that
>Fn(X) = Fetch phase n of instruction X n = 1..4
>Dn(X) = Decode phase n of instruction X n = 1..2
>En(X) = Execute phase n of instruction X n = 1..5 (max)
> if needed for this instruction
>
>Cycle N : F1(LDW)
>Cycle N+1 : F1(A) F2(LDW)
>Cycle N+2 : F1(B) F2(A) F3(LDW)
>Cycle N+3 : F1(C) F2(B) F3(A) F4(LDW)
>Cycle N+4 : ... F2(C) F3(B) F4(A) D1(LDW)
>Cycle N+5 : ... F3(C) F4(B) D1(A) D2(LDW)
>Cycle N+6 : ... F4(C) D1(B) D2(A) E1(LDW) // LDW starts execution
>Cycle N+7 : ... D1(C) D2(B) E1(A) E2(LDW) // A starts execution
>Cycle N+8 : ... D2(C) E1(B) E2(A) E3(LDW) // B starts execution
>Cycle N+9 : ... E1(C) E2(B) E3(A) E4(LDW) // C starts execution
>Cycle N+10 : ... E2(C) E3(B) E4(A) E5(LDW) // D starts execution
> -> LDW has now finished its 5 execution stages,
> and A3 contains loaded data
>
>This means that the LDW started executing at cycle N+6, but the data
>is only ready after cycle N+10! So there are what are called 4 delay
>slots, no matter where the data was (in cache or not).
>
>BUT :
>1) You can execute other instructions during those 4 cycles, as
> long as those instructions don't need the loaded data
>2) The address modification (A4++ in my example) is performed
>   at the first execute phase (E1), so the "A" instruction in
>   my example can already use the post-incremented value
>   (both points are illustrated in the short sketch below)
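>
>Just a rough sketch of mine to illustrate both points (not meant
>as production code):
>
>        LDW   *A4++, A3    ; load issued, 4 delay slots follow
>        MV    A4, A6       ; A4 is already post-incremented here
>        ADD   A7, A8, A7   ; independent work in a delay slot
>        NOP   2            ; nothing useful left to schedule
>        ADD   A3, A7, A9   ; first cycle where A3 holds the data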
>
>Note also that the instruction A in my example can be another
>LDW instruction, and its result will then be available at cycle N+11,
>i.e. one cycle after the result of the first LDW. This is a very
>effective use of pipelining, but it doesn't actually remove
>the 4 delay slots. It's just an optimised way of combining loads.
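>
>A rough sketch of that back-to-back case (again, just to show the
>timing):
>
>        LDW   *A4++, A3    ; data valid 4 delay slots later
>        LDW   *A4++, A5    ; issued the next cycle
>        NOP   3
>        ADD   A3, A7, A7   ; A3 is valid on this cycle
>        ADD   A5, A7, A7   ; A5 becomes valid one cycle later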
>
>If the data is not in L1D, it doesn't change anything in this
>story, but THE EXECUTION UNIT IS STALLED until the data is
>ready, so that more CPU cycles will be required.
>
> > The answer is no, you don't have to insert 4 NOPs after every
> > load. Particularly if your data is very *locally* accessed (read-one
>
>I agree, you don't HAVE TO insert nops, because you can perform
>other calculations while waiting for the data to be loaded,
>but you ALWAYS have to wait 4 cycles before the data is actually
>loaded into the register.
>
>This causes some problems with interruptible code. I have discussed
>this in depth with TI support:
>
>If you perform this :
> LDW *A4++, A3
> ZERO A3
> STW A3, *A5
> NOP 2
> // A3 loaded here, let's use it...
>
>If an interrupt occurs between the LDW and the
>ZERO instruction, then the loaded value can land in A3
>before the ZERO instruction actually executes, and
>that value is then overwritten.
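>
>One simple way around it (just the idea, not an official TI
>recommendation) is to avoid reusing the load's destination
>register inside its own delay slots:
>
>        LDW   *A4++, A3    ; load into A3
>        ZERO  A6           ; use a different register as the temp
>        STW   A6, *A5
>        NOP   2
>        ; A3 loaded here, safe even if an interrupt was taken above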
>
>I hope that it is clear now. If not, don't hesitate to
>contact an official TI support center, they will be happy
>to confirm all this to you.
>
>Cheers
>
>J-F




--- jfbuggen <> wrote:
> Hi Bhooshan,

well - I'm not Bhooshan, but hi ;-)

> I really don't agree with you.

what is this? a political debate? are we politicians? or
engineers? I'll be the judge here ;-) Bhooshan is
totally wrong in his theories (I doubt he has ever
written any "optimized assembly" code for C6000).

> I'll try to explain this better than in my previous
> post.

well - anybody has the right to explain anything at whatever
length they want - however - I believe what you
already said is enough for a serious engineer to write
down a list of "things to learn" and start reading the
documents...

> If the data is not in L1D, it doesn't change
> anything in this
> story, but THE EXECUTION UNIT IS STALLED until the
> data is
> ready, so that more CPU cycles will be required.

clear and precise... explained in spru198f.pdf on page
6-26 - especially figure 6-25 is quite good ;-)

and getting back to my first question - what is this?
a political debate? ;-) come on people! we are
engineers! there is just one solution to the OP's
problem - so - if two people have two different opinions
about it - that means at least one of them is wrong!

so - to the OP - don't bother with cache problems
because cache has nothing to do with it - just read
about "program and data memory stalls"

Wojciech Rewers
PS: please - could anybody offer me a DSP/embedded job?






Nice to watch a long and lively discussion here. A few more
comments to add.

Just in case: I am not going to dispute that an LDx instr has
4 delay slots (latency cycles) and may stall the CPU for
the number of cycles required to fetch data from an external memory.

Now that Andrew and Jeff have given an excellent insight into the
pipeline operation and single-cycle throughput, what is still unknown
to me is how to precalculate (I do not mean measure it with the CCS
stall counters) the number of CPU stalls for edma fetches of
cache-missed data from an external memory.

I think the number of stalls varies depending on the CPU clock rate and
the DRAM type and clock rate, and differs from one piece of hardware to
another.

Is it enough to know, e.g., the CPU clock rate and the DRAM clock rate
and use a certain formula to calculate the number of stalls, even if
the formula does not take into account that L1-to-L2 misses in a C64xx
are also pipelined, from 8 cycles worst case (a single miss) down to
2 cycles for a sequence of misses?

And would the number of stalls be a constant for a fixed CPU/DRAM clock
rate ratio and an isolated L1-to-L2 miss?

Thanks,

Andrew


Hi-

If only the manuals had said *four more* clock cycles! Anyway, Andrew E
and Andrew Nesterov, thanks. I am clear now. I think I made a mistake in
interpreting the L1D-miss/L2-hit scenario; the four clock cycles
mentioned there kind of confused me.
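
So, if I now read your numbers right, an isolated LDW on a C671x that
misses L1D but hits in L2 SRAM has a total access time of roughly
4 + 4 = 8 CPU cycles before the register holds valid data, versus 4
cycles for an L1D hit. Please correct me if that arithmetic is off.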

-bhooshan

>From: "Andrew Nesterov" <>
>To:
>Subject: [c6x] Re: Load (LDW/LDH/LDB) Instruction
>Date: Thu, 01 Apr 2004 20:30:20 -0000
>
>--- In , "Bhooshan iyer" <bhooshaniyer@h...> wrote:
>
> > I am talking about L1D vs. L2 SRAM. There is no way that there can
> > be *no difference* between the access time of a value stored in L1D
> > and one stored in L2. Come on, I can't accept that both have the
> > same access time!
>
>That's how it works! L1D does not stall the CPU. An L2 access (or
>better to say, an L1D miss that hits in L2 cache or L2 SRAM) stalls
>the CPU for 2-8 cycles (C64xx) or 4 cycles (C671x). Finally, an
>external DRAM access stalls the CPU until the edma fetches the data,
>perhaps for more than 8 cycles.
>
>Hence, the total access time is
>
>4 cycles for an L1D access,
>4 + (2 to 8) cycles for an L2 access,
>4 + (more than 8) cycles for an external memory access.
>
>They are not the same; what stays the same is only the number of
>the LDx's delay slots.
>
>Rgds,
>
>Andrew



Hello,

> well - I'm not Bhooshan, but hi ;-)
>
> > I really don't agree with you.

Sorry ;-) I mixed up the messages when replying...

> what is this? a political debate? are we politicians? or
> engineers? I'll be the judge here ;-) Bhooshan is
> totally wrong in his theories (I doubt he has ever
> written any "optimized assembly" code for C6000).

In fact, I'm sure that engineers can accept a debate,
and this has nothing to do with politics.
I don't pretend to hold the truth, I simply state my
opinions. If Bhooshan doesn't agree with me, it is
perhaps because he or I misunderstood a technical
point, or perhaps because I didn't fully understand his
question. It's always worth writing out some explanation,
even a slightly long one, just to share our ideas.

> well - anybody has the right to explain anything at whatever
> length they want - however - I believe what you
> already said is enough for a serious engineer to write
> down a list of "things to learn" and start reading the
> documents...

Unfortunately not, as you can see from Bhooshan's last posts.
In fact, I think that he is mixing up the mandatory delay
slots that apply to every load and the additional CPU stalls
for loads that don't hit in L1D.

> clear and precise... explained in spru198f.pdf on page
> 6-26 - especially figure 6-25 is quite good ;-)

Yes, you're right, I should have given more references
instead of writing all this stuff down ;-)

> and getting back to my first question - what is this?
> a political debate? ;-) come on people! we are

I never saw any politics in this whole thread. Hopefully...

> engineers! there is just one solution to the OP's
> problem - so - if two people have two different opinions
> about it - that means at least one of them is wrong!

May I use this clever quotation as a footer for my posts? ;-)

Cheers

J-F