Reply by jfbuggen April 2, 2004
Hello,

> well - I'm not Bhooshan, but hi ;-)
>
> > I really don't agree with you.

Sorry ;-) I mixed the messages when replying...

> what is this? a political debate? are we politics? or
> engineers? I'll be the judge here ;-) Bhooshan it
> totally wrong in his theories (I doubt he has ever
> written any "optimized assembly" code for C6000).

In fact, I'm sure that engineers can handle a debate,
and this has nothing to do with politics.
I don't claim to hold the truth, I just give my
opinions. If Bhooshan doesn't agree with me, it is
perhaps because one of us misunderstood a technical
point, or perhaps because I misunderstood his
question. It's always worth writing some explanation,
even if it is a little long, just to share our ideas.

> well - anybody has the right to explain anything as
> long as one wants - however - I believe what you
> already said is enough for a serious engineer to write
> down the list of "things to learn" and start reading
> documents...

Unfortunately not, as you can see from Bhooshan's last posts.
In fact, I think that he's mixing up the mandatory delay
slots that every load has with the additional CPU stalls
for loads that miss L1D.

> clear and precise... explained in spru198f.pdf on page
> 6-26 - especially figure 6-25 is quite good ;-)

Yes, you're right, I should have given more references
instead of writing all this stuff down ;-)

> and getting back to my first question - what is this?
> a political debate? ;-) come on people! we are

I never saw any politics anywhere in this thread. Hopefully...

> engineers! there is just one solution to the OP's
> problem - so - if two people have two various opinions
> about it - that means at least one of them is wrong!

May I use this clever quotation as a footer for my posts? ;-)

Cheers

J-F


Reply by Bhooshan iyer April 2, 2004
Hi-

If only the manuals had said *four more* clock cycles! Anyway, Andrew E. and
Andrew Nesterov, thanks. I'm clear now. I think I made a mistake in
interpreting the L1D miss / L2 hit scenario. The four clock cycles mentioned
there kind of confused me.

-bhooshan

>From: "Andrew Nesterov" <>
>To:
>Subject: [c6x] Re: Load (LDW/LDH/LDB) Instruction
>Date: Thu, 01 Apr 2004 20:30:20 -0000
>
>--- In , "Bhooshan iyer" <bhooshaniyer@h...> wrote:
>
> > Am talking abt L1D vs L2 SRAM. There is no way that there can be
> > *no difference* between the access time of a value stored in L1D
> > and L2. Cmon, i cant accpet that both will have same access time!
>
>That's how it works! L1D does not stall the CPU. L2 (or better
>to say, an L1D miss that hits in L2) and L2 SRAM stall the CPU for
>2-8 cycles (C64xx) and for 4 cycles (C671x). Finally, an external
>DRAM access stalls the CPU until the EDMA fetches the data, perhaps
>for more than 8 cycles.
>
>Hence, the total access time is
>
>4 cycles for an L1D access
>4+(2 to 8) for an L2 access
>4+(more than 8) for an external memory access.
>
>They are not the same; what stays the same is the number of LDx delay slots.
>
>Rgds,
>
>Andrew



Reply by Andrew Nesterov April 1, 2004

Nice to watch a long and excited discussion here. A few more
comments to add.

Just in case: I am not going to dispute that an LDx instruction has
4 delay slots, that it adds no extra latency cycles when the data is in L1D,
and that it may stall the CPU for the number of cycles required to fetch data
from an external memory.

After Andrew and Jeff have given an excellent insight into the pipeline
operation and single-cycle throughput, what is still unknown to me
is how to precalculate (I do not mean measure it using the CCS stall
counter) the number of CPU stalls for EDMA fetches of cache-missed data
from an external memory.

I think the number of stalls varies depending on the CPU clock rate and the
DRAM type and clock rate, and differs for various hardware.

Is it enough to know, e.g., the CPU clock rate and the DRAM clock rate and
use a certain formula to calculate the number of stalls, even if the formula
might not take into account that L1 to L2 misses in a C64xx are also
pipelined, from 8 cycles worst case (a single miss) down to 2 cycles
for a sequence of misses?

I would think the number of stalls should be a constant for a fixed CPU/DRAM
clock rate ratio and an isolated L1 to L2 miss?

Thanks,

Andrew


Reply by Wojciech Rewers April 1, 2004
--- jfbuggen <> wrote:
> Hi Bhooshan,

well - I'm not Bhooshan, but hi ;-)

> I really don't agree with you.

what is this? a political debate? are we politicians? or
engineers? I'll be the judge here ;-) Bhooshan is
totally wrong in his theories (I doubt he has ever
written any "optimized assembly" code for the C6000).

> I'll try to explain this better than in my previous
> post.

well - anybody has the right to explain anything as
long as one wants - however - I believe what you
already said is enough for a serious engineer to write
down the list of "things to learn" and start reading
documents...

> If the data is not in L1D, it doesn't change
> anything in this
> story, but THE EXECUTION UNIT IS STALLED until the
> data is
> ready, so that more CPU cycles will be required.

clear and precise... explained in spru198f.pdf on page
6-26 - especially figure 6-25 is quite good ;-)

and getting back to my first question - what is this?
a political debate? ;-) come on people! we are
engineers! there is just one solution to the OP's
problem - so - if two people have two various opinions
about it - that means at least one of them is wrong!

so - to the OP - don't bother with cache problems
because cache has nothing to do with it - just read
about "program and data memory stalls"

Wojciech Rewers
PS: please - could anybody offer me a DSP/embedded job?





Reply by Bhooshan iyer April 1, 2004


Jeff-

Thanks for the detailed explanation, I really appreciate it. And please be
clear: I am not disputing anybody on the pipeline structure of the c6000
series here at all! I am not disagreeing with the concept of delay slots
either!

Those basic truths are very clearly mentioned in the manuals. Very lucidly,
I might add. No complaints there!

And I fully understand the need for the delay slots and the implications of
a deep pipeline with the PG, PS, PW, PR, DP, DC, E1 to E10 stages of the c67x
pipeline.

And again, I am not talking about *filling* the delay slots either!

I am talking about L1D vs L2 SRAM. There is no way that there can be *no
difference* between the access time of a value stored in L1D and one stored
in L2. Come on, I can't accept that both have the same access time!

-bhooshan

>From: "jfbuggen" <>
>To:
>Subject: [c6x] Re: Load (LDW/LDH/LDB) Instruction
>Date: Thu, 01 Apr 2004 14:19:34 -0000
>
>Hi Bhooshan,
>
> > You are probably right as far as c64x devices are concerned, not
> > sure about that device, but you are definitely wrong on 621x and 671x
> > devices.
>
>I really don't agree with you. C62x and C64x have the same
>pipeline architecture. C67x is somewhat different, but
>the load instructions are performed the same way.
>Please read carefully spru189f's chapter on pipeline.
>
>Let me first make one remark : I'm talking about
>optimised assembly (.asm).
>If you're writing linear assembly (.sa), then you don't
>have to worry about load delay slots, because the assembler
>will insert them for you. With linear assembly, you can consider
>that any instruction is completely finished before the next one.
>
>I'll try to explain this better than in my previous post.
>Every instruction is made of 3 pipeline stages: fetch, decode
>and execute. The "execute" stage is made of a variable
>number of phases, depending on the instruction.
>
>The load instruction requires 5 execute phases. It is only at
>the end of the 5th stage that the loaded data is actually put
>into the destination register.
>
>Let's say you're performing a load, followed by
>4 instructions A, B, C, D :
> LDW *A4++, A3
> A
> B
> C
> D
>
>I'll show what's happening at every cycle, considering that
>Fn(X) = Fetch phase n of instruction X n = 1..4
>Dn(X) = Decode phase n of instruction X n = 1..2
>En(X) = Execute phase n of instruction X n = 1..5 (max)
> if needed for this instruction
>
>Cycle N : F1(LDW)
>Cycle N+1 : F1(A) F2(LDW)
>Cycle N+2 : F1(B) F2(A) F3(LDW)
>Cycle N+3 : F1(C) F2(B) F3(A) F4(LDW)
>Cycle N+4 : ... F2(C) F3(B) F4(A) D1(LDW)
>Cycle N+5 : ... F3(C) F4(B) D1(A) D2(LDW)
>Cycle N+6 : ... F4(C) D1(B) D2(A) E1(LDW) // LDW starts execution
>Cycle N+7 : ... D1(C) D2(B) E1(A) E2(LDW) // A starts execution
>Cycle N+8 : ... D2(C) E1(B) E2(A) E3(LDW) // B starts execution
>Cycle N+9 : ... E1(C) E2(B) E3(A) E4(LDW) // C starts execution
>Cycle N+10 : ... E2(C) E3(B) E4(A) E5(LDW) // D starts execution
> -> LDW has now finished its 5 execution stages,
> and A3 contains loaded data
>
>This means that the LDW was started at cycle N+6, but the data
>is ready only after cycle N+10! That is what is called the 4 delay
>slots, no matter where the data was (in cache or not).
>
>BUT :
>1) You can execute other instructions during those 4 cycles, as
> long as those instructions don't need the loaded data
>2) The address modification (A4++ in my example) is performed
> at the first execute phase, so that the "A" instruction in
> my example can use the post-incremented value
>
>Note also that the instruction A in my example can be another
>LDW instruction, and its results will be available at cycle N+11,
>so one cycle after the results of the first LDW. This is a very
>effective use of pipelining, but it doesn't actually remove
>the 4 delay slots. It's just an optimised way of combining loads.
>
>If the data is not in L1D, it doesn't change anything in this
>story, but THE EXECUTION UNIT IS STALLED until the data is
>ready, so that more CPU cycles will be required.
>
> > The answer is no you dont have to insert 4 nops after every
> > load.Particularly if your data is very *locally* accessed(read-one
>
>I agree, you don't HAVE TO insert nops, because you can perform
>other calculations while waiting for the data to be loaded,
>but you ALWAYS have to wait 4 cycles before the data is actually
>loaded into the register.
>
>This gives some problems with interruptible code. I have deeply
>discussed this with TI support :
>
>If you perform this :
> LDW *A4++, A3
> ZERO A3
> STW A3, *A5
> NOP 2
> // A3 loaded here, let's use it...
>
>If an interrupt occurs between the LDW and the
>ZERO instruction, then the values can be loaded to A3
>before actually executing the ZERO instruction, and
>those values can be overwritten.
>
>I hope that it is clear now. If not, don't hesitate to
>contact an official TI support center, they will be happy
>to confirm all this to you.
>
>Cheers
>
>J-F




Reply by Bhooshan iyer April 1, 2004

Andrew-

>I'm confident that Jeff has the correct interpretation here.
>See comments embedded in your email.

I am not yet convinced. I'll wait for more evidence. :)

>Actually, the c64x and C621x have the same load characteristics.

Not exactly, they seem to have some slight(?) differences:
1] C64x employs a banked memory structure (the MEM_BANK pragma comes into
the picture)
2] Line sizes are different

I quote from spru609a.pdf section 3.2.1

"The C621x/C671x DSP does not employ a banked memory structure in L1D.
The L1D is implemented with a single bank of dual-ported, 64-bit memory.
This
allows two simultaneous accesses on each cycle with no stalls. This is in
contrast to the C620x/C670x and C64x devices. The C64x devices employ a
least-significant bit (LSB) based memory banking structure that only allows
one access to each bank on each cycle."

So that proves that their characteristics are *quite* different.

Moreover, in another place the document says "An L1D read miss that hits L2
SRAM or L2 cache stalls the CPU for 4 cycles." That implies that if there was
no L1D miss, then there wouldn't be a stall of 4 cycles! (See, this is
precisely my point: TI documents make me *interpret* things!)

So, I strongly suspect the argument that every load *requires* 4 delay
slots. Not in 621x/671x at least.

>Back to back loads can occur in a pipelined loop, but that is a different
>discussion.

I am not sure I understand what you mean here.

>The 4 delay slots are due to the deep pipeline on the C6xxx family. It
>doesn't matter where the data is, the DSP uses the 4 delay slots to go
>through the gymnastics of fetching the data.

I understand that the 4 delay slots are for:

1] pg (data address generate)
2] ps (send to memory)
3] pw (wait for data to be ready)
4] pr (read)

Now my point is: if the value is *found* in L1D, where is the question of
"sending to memory", "waiting for the value to be ready" and "reading"? Why??

Remember the document says the L1D CPU access time is 1 cycle? If every load,
regardless of where the data is stored, is going to take 4 delay slots (hence 4
clock cycles), then what is the point in having a single-cycle access memory?

-bhooshan




Reply by Wojciech Rewers April 1, 2004
--- Henrry Andrian <> wrote:

> LDW .D1T1 *A1,A2
> NOP 4

> Above instruction will load the 32-bit data pointed to
> by A1 into A2,
> with four NOPs, which means 4 delay slots.

> Do we always have to wait for the 4 delay slots in
> order to get data using a Load instruction ?

YES - if the LDW instruction takes 5 cycles - there are
always 4 processor cycles until the data can be
used!!! and caching has nothing to do with that!!!

now - a couple of issues - although you always have to
"wait" those 4 delay slots - that does not mean that
you always have to put those 4 NOPS after each LDW

you can execute other instructions that don't depend
on the data you're loading with that LDW - this way
you're not losing any "power" for executing NOPS
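
for example - a minimal sketch (register and unit choices are purely
illustrative, not from any real code) of filling the delay slots with
unrelated work instead of NOPs:

      LDW   .D1T1  *A4++, A3      ; issue the load - A3 is valid only after 4 delay slots
      MPY   .M1    A6, A7, A8     ; delay slot 1: unrelated multiply
      ADD   .L1    A10, A11, A12  ; delay slot 2: unrelated add
      SUB   .S1    A13, 1, A13    ; delay slot 3: unrelated decrement
      NOP                         ; delay slot 4: nothing independent left to schedule
      ADD   .L1    A3, A2, A2     ; first cycle where the loaded value in A3 may be used

the load still costs its 4 delay slots - but only one of them is wasted
on a NOP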

finally - if you master the "software pipeline" idea -
you'll be able to pipeline loads with whatever you're
performing on the data and then you can bring down
the "average execution time" - but again - that has
nothing to do with the delay slots needed to load the
data...

well - if you want - read the "sample-by-sample FIR
optimized" thread that I started around summer 2003 -
there is a full explanation of a FIR engine using
a "software pipeline" - using a single-cycle loop I was
loading/multiplying/accumulating data - even though
your application is totally different - I believe
the idea of the "software pipeline" is clearly shown
there...
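
just to show the shape of it - a rough generic sketch of a single-cycle
MAC kernel (illustrative registers and units, prolog/epilog omitted - this
is NOT the actual code from that thread):

loop:
        LDH   .D1T1  *A4++, A2     ; fetch a sample for a future iteration
||      LDH   .D2T2  *B4++, B2     ; fetch a coefficient for a future iteration
||      MPY   .M1X   A2, B2, A6    ; multiply values loaded in earlier iterations
||      ADD   .L1    A6, A7, A7    ; accumulate a product from an earlier iteration
|| [B0] B     .S1    loop          ; branch issued early (it has delay slots too)
|| [B0] SUB   .L2    B0, 1, B0     ; decrement the trip counter

a new load is issued every cycle - each load still has its 4 delay slots -
they are just hidden behind the work done for the neighbouring iterations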

anyway - good luck to all ;-)

Wojciech Rewers

PS. Can anybody offer me a job as embedded/DSP
engineer? I'm ready to relocate anywhere!




Reply by Andrew Elder April 1, 2004

Henrry,

Can you post the C code for this ?
Are you sure that C code isn't fast enough ?

At 04:42 AM 4/1/2004 -0800, henrry wrote:
>Below are two answers to my previous email:
>
>>>My questions are:
>>>1. Do we always have to wait for 4 delay slots in
>>>order to get data using a Load instruction ?
>
>>>Ganesh wrote:
>It isn't always the case that you need to wait for 4 delay slots
>for LOAD instructions i.e. for every load you needn't
>have to insert 4 NOPS.
>
>Henrry writes: (related to Ganesh's answer)
>According to Ganesh, every load needn't have
>4 NOPs inserted. It means that after the Load instruction I can put
>other instructions and run them first.
>Example:
>
>LDW .D1T1 *A1, A2
>XXXX
>XXXX
>XXXX
>XXXX
>
>Where XXXX is some other instruction, so here we don't
>need NOP 4.

Yes, that is correct. You just need to wait the 4 delay slots for *A1 to actually
appear in A2.

>In my application (image processing), where I work
>with window masks (3x3 / 7x7), I need to load the data
>from memory first, for further processing. So I still
>have to wait for the data of the window mask (7x7 / 3x3) to be ready
>in registers (A/B) before further processing. So
>I think the 4 delay slots are still a bottleneck in my
>application. My application is using a 320 x 240 image,
>so any delay slot put in the loop will make the
>program slower :-(. The simple example is my
>downsampling program: I first load a 2x2 window mask
>into registers; before I have got the image data, I
>couldn't process the data, so I haven't...
>My program example:
> ADD  .L1   Ex1_addr_A, idx_val_A, Ex1_val_A
>|| ADD  .L2   Ex2_addr_B, idx_val_A, Ex2_val_B
>
> ADD  .S1   Ey1_addr_A, idx_val_A, Ey1_val_A    ; Get Ey1 address
>|| ADD  .S2   Ey2_addr_B, idx_val_A, Ey2_val_B
>
> LDW  .D1T1 *Ex1_val_A, Ex1_val_A               ; Get Ex1 value
>|| LDW  .D2T2 *Ex2_val_B, Ex2_val_B               ; Get Ex2 value
> LDW  .D1T1 *Ey1_val_A, Ey1_val_A               ; Get Ey1 value
>|| LDW  .D2T2 *Ey2_val_B, Ey2_val_B               ; Get Ey2 value
> NOP  4
>
> ADD  .L1   Ex1_val_A, Ex2_val_B, Ex_val_A
>|| ADD  .L2   Ey1_val_A, Ey2_val_B, Ey_val_B
>
>This is the bottleneck of my program; I need to repeat
>it across my whole image, so the delay slots will be 4 x
>320 x 240. It is just part of my program; it will appear
>frequently in other parts of my program. Please correct me
>if I am wrong? Thx.
>
>Par Ligander wrote:
> Yes, always. If it was memory speed dependent it would
>be very difficult to make any code binary portable.
>
>Henrry writes: (related to Par Ligander)
>Could you explain in more detail about binary portable ? Or
>maybe, if you don't mind, could you give a simple example
>of it.
>
>>>2. Assume the data located in L2 Cache (0x0000
>0010)? So If I use
>>>the Load instruction to load data from 0x0000 0010
>should I also
>>>need 4 delay slot ?
>
>Ganesha wrote:
>You don't need the 4 delay slots if it's in the cache.
>Kindly go through the function of L2 hits and misses
>w.r.t. cache and how it affects your execution.
>If you have multiple misses, they will be pipelined
>and your average delay will be less than 4.
>
>Henrry writes:
>According to your answer, I had the wrong perception
>about the L2 Cache. I think the L2 Cache that I mention in
>my previous email is L2 SRAM, which is located from 0x0000
>0000 until 0x0000 10000.

Henrry, read SPRU656A.PDF and see if that helps.

>3. Assume the data located in SDRAM (which is external
>memory in
>DSK6711 board), should I also need 4 delay slot ?
>Yes if you have a single load. if you have multiple
>loads, then my previous answer should provide you with
>enough information.
>4. I have planned to use QDMA to move a small block of data from
>SDRAM to L2 Cache to speed up my program. But it seemed useless,
>because when the data is already in L2 Cache, I still need 4 delay slots
>in order to load the data from L2 Cache, which is the same number of delay
>slots as while loading data from SDRAM. So in my opinion, I can just load
>directly from SDRAM instead of using QDMA to move the data to L2
>Cache. I think that if there were no delay slots while loading data
>from L2 cache, the QDMA would be useful. Any correction ?
>
>I don't understand how you can perform a DMA into a cache
>area. I guess the cache is controlled by the cache controller
>and to the best of my knowledge, you don't program to
>write data into a memory area configured as cache. You
>might use DMA to transfer data from SDRAM to L2 SRAM,
>but not cache. Kindly cross-check and cross-verify.
>Cache is used to hold the memory segment that your
>program is currently accessing and in all probability
>will be holding in future.
>It is controlled by the logic of your software,
>depending on the way memory is accessed, be it program
>or data memory.
>You never write anything explicit. The cache
>controller uses LRU (Least Recently Used) algorithm to
>update the cache lines. You need to maintain coherency
>for which you need to perform a cache clean.
>In a nutshell, you can only Clean the cache from your
>control. To gain very good cache performance, you need
>to restructure your program suitably.
>
>Henrry writes:
>About the cache, I think the C6711 provides two levels of cache,
>which are L1P and L2. According to your explanation
>about cache, both of the caches cannot be accessed by
>our program, but can be accessed by the logic of our
>program. So they will be used by the cache controller
>automatically when there is a load or store
>instruction, for temporary storing.
>
>Par Ligander wrote:
>Correct QDMA, can not reduce the load delay. You can
>use a DMA scheme
>as you suggest to reduce the number of stall cycles
>infliced by
>the slow external memory but the cache is in most
>cases a better
>mechanism to do that.
>
>Henrry writes:
>I have not fully understood your answer. You
>mean that I could still use a DMA scheme to move data
>from external memory (SDRAM) to L2 SRAM to handle any
>stall cycles ? Correct me if I am wrong.

Henrry, on the C6711 you can select the size of the L2 cache, up to a maximum of 64
Kbytes. You could have 32 Kbytes of SRAM and 32 Kbytes of L2 cache. In that case
you could QDMA into the SRAM region. You can not QDMA into L2 cache memory.

I would think that in all probability, you could just set the whole SRAM to be
L2 cache and let the processor take care of fetching and caching memory.

- Andrew E.

>=====
>Best Regards,
>
>Henrry Andrian - Researcher
>ISCI Lab (http://isci.cn.nctu.edu.tw)
>Office Ph. +886 3 5712121 ext: 54358
>Mobile Ph. +886 931198986
>National Chiao Tung University (http://www.nctu.edu.tw)
>Hsinchu - Taiwan, ROC
>




Reply by jfbuggen April 1, 2004
Hi Bhooshan,

> You are probably right as far as c64x devices are concerned, not
> sure about that device, but you are definitely wrong on 621x and 671x
> devices.

I really don't agree with you. C62x and C64x have the same
pipeline architecture. C67x is somewhat different, but
the load instructions are performed the same way.
Please read carefully spru189f's chapter on the pipeline.

Let me first make one remark : I'm talking about
optimised assembly (.asm).
If you're writing linear assembly (.sa), then you don't
have to worry about load delay slots, because the assembler
will insert them for you. With linear assembly, you can consider
that any instruction is completely finished before the next one.
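
For illustration, a linear assembly fragment in that style could look like
this (just a minimal sketch with made-up symbolic names; the assembly
optimizer assigns the units and registers and schedules the delay slots):

_load_add:  .cproc  a_ptr, sum
            .reg    val
            LDW     *a_ptr, val       ; no NOPs, no units, no delay-slot bookkeeping
            ADD     val, sum, sum     ; written as if the LDW had already completed
            .return sum
            .endproc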

I'll try to explain this better than in my previous post.
Every instruction is made of 3 pipeline stages: fetch, decode
and execute. The "execute" stage is made of a variable
number of phases, depending on the instruction.

The load instruction requires 5 execute phases. It is only at
the end of the 5th phase that the loaded data is actually put
into the destination register.

Let's say you're performing a load, followed by
4 instructions A, B, C, D :
LDW *A4++, A3
A
B
C
D

I'll show what's happening at every cycle, considering that
Fn(X) = Fetch phase n of instruction X n = 1..4
Dn(X) = Decode phase n of instruction X n = 1..2
En(X) = Execute phase n of instruction X n = 1..5 (max)
if needed for this instruction

Cycle N : F1(LDW)
Cycle N+1 : F1(A) F2(LDW)
Cycle N+2 : F1(B) F2(A) F3(LDW)
Cycle N+3 : F1(C) F2(B) F3(A) F4(LDW)
Cycle N+4 : ... F2(C) F3(B) F4(A) D1(LDW)
Cycle N+5 : ... F3(C) F4(B) D1(A) D2(LDW)
Cycle N+6 : ... F4(C) D1(B) D2(A) E1(LDW) // LDW starts execution
Cycle N+7 : ... D1(C) D2(B) E1(A) E2(LDW) // A starts execution
Cycle N+8 : ... D2(C) E1(B) E2(A) E3(LDW) // B starts execution
Cycle N+9 : ... E1(C) E2(B) E3(A) E4(LDW) // C starts execution
Cycle N+10 : ... E2(C) E3(B) E4(A) E5(LDW) // D starts execution
-> LDW has now finished its 5 execution stages,
and A3 contains loaded data

This means that the LDW was started at cycle N+6, but the data
is ready only after cycle N+10! That is what is called the 4 delay
slots, no matter where the data was (in cache or not).

BUT :
1) You can execute other instructions during those 4 cycles, as
long as those instructions don't need the loaded data
2) The address modification (A4++ in my example) is performed
at the first execute phase, so that the "A" instruction in
my example can use the post-incremented value

Note also that the instruction A in my example can be another
LDW instruction, and its results will be available at cycle N+11,
so one cycle after the results of the first LDW. This is a very
effective use of pipelining, but it doesn't actually remove
the 4 delay slots. It's just an optimised way of combining loads.
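
To illustrate this point (a small sketch with register choices of my own,
not taken from any real code):

  LDW   .D1T1  *A4++, A5     ; issued in cycle n:   A5 usable from cycle n+5
  LDW   .D1T1  *A4++, A6     ; issued in cycle n+1: A6 usable from cycle n+6
  NOP   3                    ; the remaining delay slots of both loads overlap here
  ADD   .L1    A5, A0, A0    ; cycle n+5: the first load's result is ready
  ADD   .L1    A6, A0, A0    ; cycle n+6: the second load's result, one cycle later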

If the data is not in L1D, it doesn't change anything in this
story, but THE EXECUTION UNIT IS STALLED until the data is
ready, so that more CPU cycles will be required.

> The answer is no you dont have to insert 4 nops after every
> load.Particularly if your data is very *locally* accessed(read-one

I agree, you don't HAVE TO insert nops, because you can perform
other calculations while waiting for the data to be loaded,
but you ALWAYS have to wait 4 cycles before the data is actually
loaded into the register.

This gives some problems with interruptible code. I have discussed
this in depth with TI support:

If you perform this :
LDW *A4++, A3
ZERO A3
STW A3, *A5
NOP 2
// A3 loaded here, let's use it...

If an interrupt occurs between the LDW and the
ZERO instruction, then the value can be loaded into A3
before the ZERO instruction actually executes, and
that loaded value will be overwritten.
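
One common way to protect such a sequence (this is the usual CSR trick, not
something specific to this case, and the registers used to save it are only
illustrative) is to disable interrupts around it by clearing the GIE bit:

  MVC   .S2   CSR, B6        ; save the current control status register
  AND   .L2   -2, B6, B7     ; clear bit 0 (GIE)
  MVC   .S2   B7, CSR        ; interrupts are now disabled
  ; ... the LDW / ZERO / STW sequence goes here ...
  MVC   .S2   B6, CSR        ; restore the saved CSR (re-enables interrupts)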

I hope that it is clear now. If not, don't hesitate to
contact an official TI support center, they will be happy
to confirm all this to you.

Cheers

J-F



Reply by Andrew Elder April 1, 2004

Bhooshan,

I'm confident that Jeff has the correct interpretation here.
See comments embedded in your email.

As an aside, Henrry, if you are really interested in what is going on, use the
expert: the TI C compiler. Write some C code and study the assembly that it
produces. That said, I HIGHLY recommend that you stay away from assembly
programming on the C6711. The rules of C6xxx programming are:
1) Get everything working in C.
2) If code isn't fast enough, use profiler to figure out what is running too
slow.
2.5) Check for TI Signal Processing libraries that implement what you need.
3) Optimize C code.
4) Use TI optimizer tools to optimize C code.
5) Use TI app notes and documentation to optimize C code
6) Repeat from 4 :-)

The main point is that you should try REALLY HARD to implement everything in C
code. The C compiler does a pretty good job "out of the box" and does an
excellent job if you read up some on what the optimization options are.

Finally, if you still don't have enough performance, you can resort to linear
assembly and, in the worst case, actual hand-written assembly.
At 01:04 PM 4/1/2004 +0000, Bhooshan iyer wrote:

>Jeff-
>
>>Yes, you need always to wait 4 delay slots. Those slots are
>>required because of the pipeline architecture. This is the minimum
>>required time, assuming that the data is already present in
>>L1D memory.
>>May I suggest to have a look at the spru189f document, which
>>has two chapters for explaining the pipeline of the C62x/C64x
>>and C67x processors.
>>The load instruction requires 5 execute cycles, so it has 4
>>delay slots.
>
>You are probably right as far as c64x devices are concerned, not sure about
>that device, but you are definitely wrong on 621x and 671x devices. This
>question is a classic question. Quite vexing too, because there is no *direct*
>and *easy* documentation on the same (so, what's new? TI and their
>documentation!)

Actually, the c64x and C621x have the same load characteristics.
>The answer is no, you don't have to insert 4 nops after every
>load. Particularly if your data is very *locally* accessed (read one after
>another in a tight set, say 4k!) then the first load needs to be *delayed*
>and the rest not so! Every data access in 1x devices that encounters a cache
>miss picks up 32 bytes of data with 2 *fetches* from L2 (if available), and
>after that, for the next nearest data, the access time is *just 1 cycle* for
>every load! (read: L1D CPU access time is only 1 cycle)

If data is in L1D the load has 4 delay slots. If the data is not in L1D it will
take longer.

Back to back loads can occur in a pipelined loop, but that is a different
discussion.

>There are several other complications in these load operations, like
>*banking* (again, 1x devices for some vague reason need not be
>*banked*, meaning MEM_BANK pragmas are not required for 1x devices. Does that
>make sense?), L1D conflicts etc., which again have penalties associated with
>them. If required we can open that topic, otherwise I'll just leave it aside.
>
>Again, the simple answer: no, 4 nops are not required every time. But unless you
>are a past master at cache programming (am not!) it is very unlikely you can
>bank (a different sort of bank, this... :) ) on not inserting NOPs. CCS 2.2
>has some new features like the cache analysis toolkit which help you to
>analyse the cache and make some changes to your code. Well, good luck!

4 nops are required every time.
>>>2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
>> > the Load instruction to load data from 0x0000 0010 should I also
>> > need 4 delay slot ?
>yes, this is what the 4 delay slots are for, if it is in L2. But the next
>time, the next nearest data is already likely to be in L1, so you won't need
>those 4 delay slots again, ideally.

The 4 delay slots are due to the deep pipeline on the C6xxx family. It
doesn't matter where the data is, the DSP uses the 4 delay slots to go through
the gymnastics of fetching the data.
>> > 3. Assume the data locate in SDRAM (which is external memory in
>> > DSK6711 board), should I also need 4 delay slot ?
>
>No, it will require more. The L2 line size is 128 bytes, so it is highly unlikely
>that this can take just 4 cycles, as is the case for the line size of 32 bytes!

Yes. You always need 4 delay slots.

- Andrew E.

>-bhooshan