DSPRelated.com
Forums

Load (LDW/LDH/LDB) Instruction

Started by Henrry Andrian April 1, 2004
Dear all,

I have a question about loading data from memory (LDW/LDH/LDB)
in assembly language. I am currently using the DSK6711 board.
According to SPRU189 (TMS320C6000 CPU and Instruction Set
Reference Guide), a load instruction needs 4 delay slots
to load 32-bit/16-bit/8-bit data from memory into a register.
Example:
LDW .D1T1 *A1,A2
NOP 4
The instruction above loads the 32-bit word pointed to by A1 into A2;
the NOP 4 fills the 4 delay slots.

My questions are:
1. Do we always have to wait 4 delay slots to get data
with a load instruction?
2. Assume the data is located in L2 cache (0x0000 0010). If I use
the load instruction to load data from 0x0000 0010, do I still
need 4 delay slots?
3. Assume the data is located in SDRAM (external memory on the
DSK6711 board). Do I still need 4 delay slots?
4. I had planned to use QDMA to move a small block of data from
SDRAM to L2 cache to speed up my program. But it seems useless,
because once the data is in L2 cache I still need 4 delay slots
to load it, the same as when loading from SDRAM. So in my opinion
I should just load directly from SDRAM instead of using QDMA to
move the data to L2 cache. QDMA would only be useful if there were
no delay slots when loading from L2 cache. Any correction?

Those are all my questions about the load instruction on the C6000,
specifically on the DSK6711 board. I would appreciate any
answers. Thank you.

Best Regards,

Henrry Andrian
Graduated Student - ISCI LAB (http://isci.cn.nctu.edu.tw)
National Chiao Tung University - Hsinchu, ROC.

Lab Phone. +886-3-5712121 ext.54358
Cellular Phone. +886-931-198986



Hi Henry,
Kindly find my answers embedded in your mail.
----- Original Message -----
From: Henrry Andrian
To: c...@yahoogroups.com
Sent: Thursday, April 01, 2004 1:12 PM
Subject: [c6x] Load (LDW/LDH/LDB) Instruction

Dear all,

Hi, I have question related in Load Data from memory (LDW/LDH/LDB)
using assembly language. Currently I am using DSK6711 board.
According to the SPRU189(TMS320C6000 CPU and Instruction Set
Reference Guide) documentation, Load Instruction need 4 delay slot
to Load 32bit/16 bit/8 bit data from memory into register.
Example:
LDW .D1T1  *A1,A2
NOP        4
Above instruction will load 32 bit data which point by A1 into A2
with four NOP which mean 4 delay slot.

My question are:
1. Does we always have to wait for 4 delay slot in order to get data
using Load instruction ?
It isn't always necessary to wait for 4 delay slots after a LOAD instruction, i.e. you needn't insert 4 NOPs after every load.
2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
the Load instruction to load data from 0x0000 0010 should I also
need 4 delay slot ?
You don't need 4 delay slots if it's in the cache. Kindly go through how L2 hits and misses work w.r.t. the cache and how they affect your execution.
If you have multiple misses, they will be pipelined and your average delay will be less than 4.
3. Assume the data locate in SDRAM (which is external memory in
DSK6711 board), should I also need 4 delay slot ?
Yes if you have a single load. if you have multiple loads, then my previous answer should provide you with enough information.
4. I have planned to use QDMA to move a small block of data from
SDRAM to L2 Cache to speed up my program. But it seemed useless,
because when the data already in L2 Cache, I still need 4 delay slot
in order to load the data from L2 Cache which is same delay slot
while loading data from SDRAM.  So In my opinion, I just load
directly from SDRAM instead of using QDMA to move the data to L2
Cache. I think that if there is no delay slot while loading data
from L2 cache, the QDMA will be usefull. Any correction ?
I don't understand how you can perform a DMA into the cache area. I guess the cache is controlled by the cache controller and, to the best of my knowledge, you don't program writes into a memory area configured as cache. You might use DMA to transfer data from SDRAM to L2 SRAM, but not to cache. Kindly cross-check and cross-verify.
Cache is used to hold the memory segment that your program is currently accessing and in all probability will be accessing in the future.
It is controlled by the logic of your software, depending on the way memory is accessed, be it program or data memory.
You never write anything to it explicitly. The cache controller uses an LRU (Least Recently Used) algorithm to update the cache lines. You need to maintain coherency, for which you need to perform a cache clean.
In a nutshell, the only thing you can do to the cache under your own control is clean it. To get very good cache performance, you need to restructure your program suitably.
Above are all my question about Load instruction in C6000,
especially I implement in DSK6711 Board.  I will appreaciate any
answer of my question. Thank you
Hope this helps.
Thanks and Regards,
Ganesh


Hello,

Let me answer your questions based on my experience
with the C64x DSPs. It is not exactly the same processor, but it is
similar enough to the one you are using, so my answers should
be valid for you.

> 1. Does we always have to wait for 4 delay slot in order to get data
> using Load instruction ?

Yes, you always need to wait 4 delay slots. Those slots are
required because of the pipeline architecture. This is the minimum
required time, assuming that the data is already present in
L1D memory.
May I suggest having a look at the spru189f document, which
has two chapters explaining the pipeline of the C62x/C64x
and C67x processors.
The load instruction requires 5 execute cycles, so it has 4
delay slots.

> 2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
> the Load instruction to load data from 0x0000 0010 should I also
> need 4 delay slot ?
> 3. Assume the data locate in SDRAM (which is external memory in
> DSK6711 board), should I also need 4 delay slot ?

A first warning before answering: L2 cache is a part of
L2 memory that is used as cache, so the DSP itself takes care
of fetching data from external memory into the L2 cache, by
issuing special QDMA transfers. Please don't perform any
transfer to this zone on your own, as it can trash some cache data.
You are allowed to make QDMA transfers to L2 memory that is NOT
configured as cache. The size of the L2 cache is configurable,
and the remaining space of L2 can be used for your program/data.

The 4 delay slots are required for every load, because of
the pipeline architecture.
When the data is not in the L1D cache, the DSP has to
find the data in its actual place.
1) If the load address is somewhere in L2 (not used as L2 cache),
it has to fetch the data from this L2 memory
2) If the load address is somewhere in external memory, and if
L2 cache is enabled (and this page of external memory is
configured as cacheable through MAR register), then the DSP
looks in the L2 cache if the data is present. If no, it has
to fetch the data from the external memory.
3) If the load address is somewhere in external memory, and if
L2 cache is not enabled, the data is directly fetched from
external memory.

In all those cases where the data was not initially in L1D, some
additional time is needed to fetch the data to L1D. During this
time, the execution of your program is SUSPENDED. This means that
during some additional CPU cycles, no execute packet will be
processed. This is not reflected in your assembly code, as the
execution is really suspended. You keep with your 4 delay slots
in your assembly code, but sometimes, you will actually wait
more than 4 cycles.
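
To make the distinction between delay slots and stalls concrete, here is a minimal sketch of the accounting described above. It is a toy model in Python, not DSP code: the schedule written in the assembly is fixed, and any L1D miss simply adds stall cycles on top. The penalty value used below is a made-up placeholder, not a figure from any TI document.

```python
# Toy model: delay slots are part of the written schedule,
# stalls are extra cycles added by the memory system.
DELAY_SLOTS = 4  # fixed by the pipeline, as stated in the thread

# Hypothetical placeholder, NOT a real C6711 number.
HYPOTHETICAL_MISS_PENALTY = 10


def total_cycles(scheduled_cycles, miss_stalls):
    """Cycles actually spent by a code sequence.

    scheduled_cycles: cycles the assembly takes if every load hits L1D.
    miss_stalls: one entry per load that missed L1D, in stall cycles.
    """
    return scheduled_cycles + sum(miss_stalls)


# LDW + NOP 4 + one instruction that consumes the data.
SCHEDULE = 1 + DELAY_SLOTS + 1

if __name__ == "__main__":
    print("all hits :", total_cycles(SCHEDULE, []))
    print("one miss :", total_cycles(SCHEDULE, [HYPOTHETICAL_MISS_PENALTY]))
```

The NOP count in the source never changes between the two cases; only the measured cycle count does.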

> 4. I have planned to use QDMA to move a small block of data from
> SDRAM to L2 Cache to speed up my program. But it seemed useless,

You can choose to control the process of transferring the
appropriate data from external memory to L2 (not cache!) through
QDMA transfers, and then issue load operations to this L2 zone.
This has the advantage of giving you control over what is
put in L2 and when.
Another possibility is to let the DSP manage it by enabling L2
cache and enabling caching for the page of external memory where
your data resides. This makes things easier, but sometimes
less efficient, since you don't have much control over cache
allocation and thrashing.

For the C64x, there is a very interesting document called
"two level internal memory reference guide" (spru610).
There should be a similar document for your DSP, but I don't
have its reference.

I hope it helps

J-F


Below are the two answers to my previous email, with my replies:

>>My question are:
>>1. Does we always have to wait for 4 delay slot in order to get data
>>using Load instruction ?

>>Ganesh wrote:
It isn't always necessary to wait for 4 delay slots
after a LOAD instruction, i.e. you needn't
insert 4 NOPs after every load.

Henrry wrote (regarding Ganesh's answer):
According to Ganesh, not every load needs 4 NOPs inserted
after it. That means that after the load instruction I can put
other instructions, so that they run first.
Example:

LDW .D1T1 *A1, A2
XXXX
XXXX
XXXX
XXXX

Where XXXX is any other instruction, so here we don't
need NOP 4.

In my application (image processing), I work
with window masks (3x3 / 7x7), so I need to load the data
from memory first, before further processing. I still
have to wait until the window mask data (7x7 / 3x3) is ready
in the registers (A/B) before I can process it. So
I think the 4 delay slots are still a bottleneck in my
application. My application uses 320 x 240 images,
so any delay slot inside the loop will make the
program slower. A simple example is my
downsampling program: I first load a 2x2 window mask
into registers, and until I have the image data I
can't process it, so I haven't
My program example:
ADD .L1 Ex1_addr_A,idx_val_A,Ex1_val_A
|| ADD .L2 Ex2_addr_B,idx_val_A,Ex2_val_B
ADD .S1 Ey1_addr_A,idx_val_A,Ey1_val_A ; Get Ey1 address
|| ADD .S2 Ey2_addr_B,idx_val_A,Ey2_val_B

LDW .D1T1 *Ex1_val_A,Ex1_val_A ; Get Ex1 value
|| LDW .D2T2 *Ex2_val_B,Ex2_val_B ; Get Ex2 value
LDW .D1T1 *Ey1_val_A,Ey1_val_A ; Get Ey1 value
|| LDW .D2T2 *Ey2_val_B,Ey2_val_B ; Get Ey2 value
NOP 4

ADD .L1 Ex1_val_A,Ex2_val_B,Ex_val_A
|| ADD .L2 Ey1_val_A,Ey2_val_B,Ey_val_B

This is the bottleneck of my program; I need to repeat
it across the whole image, so the delay slots will cost 4 x
320 x 240 cycles. This is just one part of my program; the same
pattern appears frequently elsewhere. Please correct me
if I am wrong. Thx.
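
As a rough check of the arithmetic above, the sketch below counts the cycles spent in NOP 4 alone. It is plain Python used as a calculator. Two things are assumptions rather than facts from the post: whether the loop runs once per pixel or once per 2x2 block, and the 150 MHz clock used to convert cycles to time.

```python
WIDTH, HEIGHT = 320, 240
NOPS_PER_ITERATION = 4

# Assumed CPU clock for the time conversion; check your board's setting.
ASSUMED_CLOCK_HZ = 150_000_000


def nop_cycles(iterations, nops_per_iteration=NOPS_PER_ITERATION):
    """Cycles spent only in the NOP 4 of each loop iteration."""
    return iterations * nops_per_iteration


def cycles_to_ms(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock."""
    return cycles / clock_hz * 1e3


# One iteration per pixel, as in the "4 x 320 x 240" estimate.
PER_PIXEL = nop_cycles(WIDTH * HEIGHT)
# One iteration per 2x2 block, if the downsampler steps by 2 in x and y.
PER_BLOCK = nop_cycles((WIDTH // 2) * (HEIGHT // 2))

if __name__ == "__main__":
    for label, cycles in (("per pixel", PER_PIXEL), ("per 2x2 block", PER_BLOCK)):
        print(f"{label}: {cycles} cycles, {cycles_to_ms(cycles):.3f} ms per frame")
```

This counts only the idle cycles; it says nothing about stalls caused by cache misses, which the other replies cover.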

Par Ligander wrote:
Yes, always. If it were memory-speed dependent it would
be very difficult to make any code binary portable.

Henrry wrote (regarding Par Ligander's answer):
Could you explain binary portability in more detail? Or,
if you don't mind, could you give a simple example
of it?

>>2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
>>the Load instruction to load data from 0x0000 0010 should I also
>>need 4 delay slot ?

Ganesha wrote:
You don't need a 4 delay slot if its in the cache.
Kindly go through the function of L2 Hits and misses
w.r.t. cache and how it affects your execution.
If you have multiple misses, they will be pipelined
and you average delay will be less than 4.

Henrry wrote:
According to your answer, I had the wrong idea
about L2 cache. I think the "L2 Cache" that I mentioned in
my previous email is actually L2 SRAM, which is located from 0x0000
0000 until 0x0000 10000.

3. Assume the data locate in SDRAM (which is external
memory in
DSK6711 board), should I also need 4 delay slot ?
Yes if you have a single load. if you have multiple
loads, then my previous answer should provide you with
enough information.
4. I have planned to use QDMA to move a small block of
data from
SDRAM to L2 Cache to speed up my program. But it
seemed useless,
because when the data already in L2 Cache, I still
need 4 delay slot
in order to load the data from L2 Cache which is same
delay slot
while loading data from SDRAM. So In my opinion, I
just load
directly from SDRAM instead of using QDMA to move the
data to L2
Cache. I think that if there is no delay slot while
loading data
from L2 cache, the QDMA will be usefull. Any
correction ?

I dont understand how can you perform a DMA into cache
area. I guess cache is controlled by cache controller
and to the best of my knowledge, you don't program to
write data into memory area configured as Cache. you
might use DMA to transfer data from SDRAM to L2 SRAM,
but not cache. Kindly cross-check and cross-verify.
Cache is used to hold the memory segment that your
program is currently accessing and in all probability
will be holding in future.
It is controlled by the logic of your software,
depending on the way memory is accessed, be it program
or data memory.
You never write anything explicit. The cache
controller uses LRU (Least Recently Used) algorithm to
update the cache lines. You need to maintain coherency
for which you need to perform a cache clean.
In a nutshell, you can only Clean the cache from your
control. To gain very good cache performance, you need
to restructure your program suitably.

Henrry wrote:
About the cache, I think the C6711 provides two levels of cache,
which are L1P and L2. According to your explanation
about cache, neither cache can be accessed directly by
our program; they are driven by the memory accesses our
program makes. So they are used by the cache controller
automatically, as temporary storage, whenever there is a load or store
instruction.

Par Ligander wrote:
Correct, QDMA cannot reduce the load delay. You can use a DMA scheme
as you suggest to reduce the number of stall cycles inflicted by
the slow external memory, but the cache is in most cases a better
mechanism to do that.

Henrry wrote:
I have not fully understood your answer. Do you
mean that I could still use a DMA scheme to move data
from external memory (SDRAM) to L2 SRAM to reduce the
stall cycles? Correct me if I am wrong.

=====
Best Regards,

Henrry Andrian - Researcher
ISCI Lab (http://isci.cn.nctu.edu.tw)
Office Ph. +886 3 5712121 ext: 54358
Mobile Ph. +886 931198986
National Chiao Tung University (http://www.nctu.edu.tw)
Hsinchu - Taiwan, ROC

__________________________________




Jeff-

>Yes, you need always to wait 4 delay slots. Those slots are
>required because of the pipeline architecture. This is the minimum
>required time, assuming that the data is already present in
>L1D memory.
>May I suggest to have a look at the spru189f document, which
>has two chapters for explaining the pipeline of the C62x/C64x
>and C67x processors.
>The load instruction requires 5 execute cycles, so it has 4
>delay slots.

You are probably right as far as C64x devices are concerned (not sure about
that device), but you are definitely wrong on 621x and 671x devices. This
is a classic question. Quite vexing too, because there is no *direct*
and *easy* documentation on it (so, what's new? TI and their
documentation!)

The answer is no, you don't have to insert 4 NOPs after every
load. Particularly if your data is accessed very *locally* (read one after
another in a tight set, say 4k!), then the first load needs to be *delayed*
and the rest do not! Every data access on 1x devices that encounters a cache
miss picks up 32 bytes of data with 2 *fetches* from L2 (if available), and
after that, for the next nearest data, the access time is *just 1 cycle* for
every load! (read: L1D CPU access time is only 1 cycle)

There are several other complications in these load operations, like
*banking* (again, 1x devices for some vague reason need not be
*banked*, meaning MEM_BANK pragmas are not required for 1x devices. Does that
make sense?), L1D conflicts etc., which again have penalties associated with
them. If required we can open that topic; otherwise I'll just leave it aside.

Again, the simple answer: no, 4 NOPs are not required every time. But unless you
are a past master at cache programming (I am not!) it is very unlikely you can
bank (a different sort of bank, this... :) ) on not inserting NOPs. CCS 2.2
has some new features, like the cache analysis toolkit, which help you
analyse the cache and make some changes to your code. Well, good luck!

> > 2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
> > the Load instruction to load data from 0x0000 0010 should I also
> > need 4 delay slot ?
Yes, this is what the 4 delay slots are for, if it is in L2. But the next
time, the next nearest data is already likely to be in L1, so you won't need
those 4 delay slots again, ideally.

> > 3. Assume the data locate in SDRAM (which is external memory in
> > DSK6711 board), should I also need 4 delay slot ?

No, it will require more. The L2 line size is 128 bytes, so it is highly unlikely that
this can take just 4 cycles, as is the case for the line size of 32 bytes!

-bhooshan





Bhooshan,

I'm confident that Jeff has the correct interpretation here.
See comments embedded in your email.

As an aside, Henrry, if you are really interested in what is going on, use the
expert, the TI C compiler. Write some C code and study the assembly that it
produces. That said, I HIGHLY recommend that you stay away from assembly
programming on the C6711. Rules of C6xxx programming are
1) Get everything working in C.
2) If code isn't fast enough, use profiler to figure out what is running too
slow.
2.5) Check for TI Signal Processing libraries that implement what you need.
3) Optimize C code.
4) Use TI optimizer tools to optimize C code.
5) Use TI app notes and documentation to optimize C code
6) Repeat from 4 :-)

The main point is that you should try REALLY HARD to implement everything in C
code. The C compiler does a pretty good job "out of the box" and does an
excellent job if you read up some on what the optimization options are.

Finally, if you still don't have enough performance, you can resort to linear
assembly and worst case actual assembly.
At 01:04 PM 4/1/2004 +0000, Bhooshan iyer wrote:

>Jeff-
>
>>Yes, you need always to wait 4 delay slots. Those slots are
>>required because of the pipeline architecture. This is the minimum
>>required time, assuming that the data is already present in
>>L1D memory.
>>May I suggest to have a look at the spru189f document, which
>>has two chapters for explaining the pipeline of the C62x/C64x
>>and C67x processors.
>>The load instruction requires 5 execute cycles, so it has 4
>>delay slots.
>
>You are probably right as far as c64x devices are concerned not sure abt
>that device bit u are deifinetly wrong on 621x and 671x devices.This
>question is a classic question.Quite vexing too because there is no *direct
>* and *easy* documentation on the same(so,whats new? TI and their
>documentation!)

Actually, the c64x and C621x have the same load characteristics.
>The answer is no you dont have to insert 4 nops after every
>load.Particularly if your data is very *locally* accessed(read-one after
>another in a tight set, say 4k!) then the first load needs to be *delayed*
>and the rest not so! Evry data access in 1x devices when encountered a cache
>miss picks up 32 bytes of data with 2 *fetches* from L2(if available) and
>after that for the next nearest data the access time is *just 1cycle* for
>every load! (read L1D CPU access time is only 1 cycle)

If data is in L1D the load has 4 delay slots. If the data is not in L1D it will
take longer.

Back to back loads can occur in a pipelined loop, but that is a different
discussion.

>There are several other complications in this load operations like
>*banking*(again 1x devices for some vague reason need not be
>*banked*.Meaning MEM_BANK pragmas are not reuired for 1x devices.does that
>make sense?), L1D conflict etc...which again have penalties associated with
>them. IF required we can open that topic otherwise ill just leave it aside.
>
>Again the simple answer -no 4 nops is not reuiredevery time.But unless you
>are a past master at cache programming(am not!) it is very unlikely you can
>bank(a different sort of bank,this... :) ) on it not insert NOPs. CCS 2.2
>has some new features like the cache analysis tool kit which help you to
>analyse the cache and make some changes to ur code.Well, good luck!

4 nops are required every time.
>>>2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
>> > the Load instruction to load data from 0x0000 0010 should I also
>> > need 4 delay slot ?
>yes.this is what the 4 delay slots are for, if it is in L2. but the next
>time the next nearest data is already likey to be in L1 so you wont need
>that 4 delay slots again, ideally.

The 4 delay slots are due to the deep pipeline on the C6xxx family. It
doesn't matter where the data is; the DSP uses the 4 delay slots to go through
the gymnastics of fetching the data.
>> > 3. Assume the data locate in SDRAM (which is external memory in
>> > DSK6711 board), should I also need 4 delay slot ?
>
>No it will require more.L2 line size is 128 bytes so it highly unlikely that
>this can just take 4 cycles as is the case for the line size of 32 bytes!

Yes. You always need 4 delay slots.

- Andrew E.

>-bhooshan




Hi Bhooshan,

> You are probably right as far as c64x devices are concerned not sure abt
> that device bit u are deifinetly wrong on 621x and 671x devices.This

I really don't agree with you. C62x and C64x have the same
pipeline architecture. C67x is somewhat different, but
its load instructions are performed the same way.
Please read carefully spru189f's chapter on the pipeline.

Let me first make one remark: I'm talking about
optimised assembly (.asm).
If you're writing linear assembly (.sa), then you don't
have to worry about load delay slots, because the assembly optimizer
will schedule them for you. With linear assembly, you can consider
that any instruction is completely finished before the next one starts.

I'll try to explain this better than in my previous post.
Every instruction is made of 3 pipeline stages: fetch, decode
and execute. The "execute" stage is made of a variable
number of phases, depending on the instruction.

The load instruction requires 5 execute phases. It is only at
the end of the 5th phase that the loaded data is actually put
into the destination register.

Let's say you're performing a load, followed by
4 instructions A, B, C, D :
LDW *A4++, A3
A
B
C
D

I'll show what's happening at every cycle, where
Fn(X) = Fetch phase n of instruction X, n = 1..4
Dn(X) = Decode phase n of instruction X, n = 1..2
En(X) = Execute phase n of instruction X, n = 1..5 (max),
        if needed for this instruction

Cycle N    : F1(LDW)
Cycle N+1  : F1(A) F2(LDW)
Cycle N+2  : F1(B) F2(A) F3(LDW)
Cycle N+3  : F1(C) F2(B) F3(A) F4(LDW)
Cycle N+4  : ...   F2(C) F3(B) F4(A) D1(LDW)
Cycle N+5  : ...   F3(C) F4(B) D1(A) D2(LDW)
Cycle N+6  : ...   F4(C) D1(B) D2(A) E1(LDW)  // LDW starts execution
Cycle N+7  : ...   D1(C) D2(B) E1(A) E2(LDW)  // A starts execution
Cycle N+8  : ...   D2(C) E1(B) E2(A) E3(LDW)  // B starts execution
Cycle N+9  : ...   E1(C) E2(B) E3(A) E4(LDW)  // C starts execution
Cycle N+10 : ...   E2(C) E3(B) E4(A) E5(LDW)  // D starts execution
  -> LDW has now finished its 5 execution phases,
     and A3 contains the loaded data
This means that the LDW started executing at cycle N+6, but the data
is ready only after cycle N+10! So there are what are called 4 delay
slots, no matter where the data was (in cache or not).
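
The cycle table above can be reproduced with a few lines of Python. This is only a toy model of the table, not a simulator of the real CPU: it assumes one instruction is fetched per cycle with no stalls, and it models A, B, C and D as single-cycle instructions (an assumption, so they drop out of the printout after E1). The function names are made up for this sketch.

```python
# Toy model of the pipeline table: 4 fetch + 2 decode phases,
# followed by a per-instruction number of execute phases.
STAGES = ["F1", "F2", "F3", "F4", "D1", "D2", "E1", "E2", "E3", "E4", "E5"]
FRONT_END = 6  # fetch + decode phases before E1


def stage_at(fetch_offset, cycle, execute_phases):
    """Stage occupied at cycle N+cycle by an instruction fetched at N+fetch_offset."""
    pos = cycle - fetch_offset
    if 0 <= pos < FRONT_END + execute_phases:
        return STAGES[pos]
    return None  # not fetched yet, or already finished


def result_cycle(fetch_offset, execute_phases):
    """Cycle (relative to N) in which the last execute phase runs."""
    return fetch_offset + FRONT_END + execute_phases - 1


def delay_slots(execute_phases):
    """Delay slots are the execute phases after E1."""
    return execute_phases - 1


if __name__ == "__main__":
    # (name, execute phases); single-cycle A..D is an assumption.
    program = [("LDW", 5), ("A", 1), ("B", 1), ("C", 1), ("D", 1)]
    for cycle in range(11):
        cells = []
        for offset, (name, phases) in enumerate(program):
            stage = stage_at(offset, cycle, phases)
            if stage:
                cells.append(f"{stage}({name})")
        print(f"Cycle N+{cycle}: " + " ".join(reversed(cells)))
```

In this model `result_cycle(0, 5)` lands on N+10 for the LDW, and a second LDW in slot A would land on N+11.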

BUT :
1) You can execute other instructions during those 4 cycles, as
long as those instructions don't need the loaded data
2) The address modification (A4++ in my example) is performed
at the first execute phase, so that the "A" instruction in
my example can use the post-incremented value

Note also that the instruction A in my example can be another
LDW instruction, and its result will be available at cycle N+11,
one cycle after the result of the first LDW. This is a very
effective use of pipelining, but it doesn't actually remove
the 4 delay slots. It's just an optimised way of combining loads.

If the data is not in L1D, it doesn't change anything in this
story, but THE EXECUTION UNIT IS STALLED until the data is
ready, so more CPU cycles will be required.

> The answer is no you dont have to insert 4 nops after every
> load.Particularly if your data is very *locally* accessed(read-one

I agree, you don't HAVE TO insert nops, because you can perform
other calculations while waiting for the data to be loaded,
but you ALWAYS have to wait 4 cycles before the data is actually
loaded into the register.

This causes some problems with interruptible code. I have discussed
this in depth with TI support:

If you perform this:
LDW *A4++, A3
ZERO A3
STW A3, *A5
NOP 2
// A3 loaded here, let's use it...

If an interrupt occurs between the LDW and the
ZERO instruction, then the load can complete and write A3
before the ZERO instruction actually executes, and
the loaded value is then overwritten by the ZERO.
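
The hazard can be illustrated with a small Python model of when each write to A3 lands. It is a sketch under simplifying assumptions, not a description of the real interrupt mechanism: the interrupt is modeled as a plain delay of `isr_cycles` inserted right after the LDW issues, the load in flight is assumed to complete on schedule, and the case where both writes land in the same cycle is not modeled. `run_sequence` is a made-up name.

```python
LOAD_DELAY_SLOTS = 4


def run_sequence(isr_cycles=0, loaded=0x1234):
    """Return (value stored by STW, value of A3 at the 'use it' point).

    isr_cycles models an interrupt taken right after the LDW issues:
    the following instructions are delayed, the load in flight is not.
    """
    ldw_issue = 0
    zero_issue = 1 + isr_cycles
    stw_issue = zero_issue + 1
    use_issue = stw_issue + 3  # STW, then NOP 2

    # (cycle in which the write to A3 lands, value written)
    writes = [
        (ldw_issue + LOAD_DELAY_SLOTS, loaded),  # LDW result
        (zero_issue, 0),                         # ZERO A3
    ]

    def a3_seen_at(cycle):
        # An instruction sees the most recent write from an earlier cycle.
        landed = [w for w in writes if w[0] < cycle]
        return max(landed, key=lambda w: w[0])[1]

    return a3_seen_at(stw_issue), a3_seen_at(use_issue)


if __name__ == "__main__":
    print("no interrupt  :", run_sequence())
    print("with interrupt:", run_sequence(isr_cycles=20))
```

Without the interrupt the STW stores 0 and the later code sees the loaded value; with a long enough interrupt the load lands first and the ZERO wipes it out, so the later code sees 0.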

I hope that it is clear now. If not, don't hesitate to
contact an official TI support center, they will be happy
to confirm all this to you.

Cheers

J-F




Henrry,

Can you post the C code for this ?
Are you sure that C code isn't fast enough ?

At 04:42 AM 4/1/2004 -0800, henrry wrote:
>Below is two answer of my previous email:
>
>>>My question are:
>>>1. Does we always have to wait for 4 delay slot in
>order to get data
>using Load instruction ?
>
>>>Ganesh wrote:
> Itsn't always that you need to wait for 4 delay slots
>for LOAD instructions i.e. for every load you needn't
>have to insert 4 NOPS.
>
>Henrry write: (Related to Ganesh answer)
>According to Ganesh, Every load neednt have to insert
>4 NOPS, It means that after Load Instruction I put
>other instruction to running it first.
>Example:
>
>LDW .D1T1 *A1, A2
>XXXX
>XXXX
>XXXX
>XXXX
>
>Where: XXXX is other instruction, so here we dont
>need NOP 4.

Yes, that is correct. You just need to wait the 4 delay slot for *A1 to actually
appear in A2.

>In my application (image processing), where I work
>with window mask (3x3 / 7x7), I need to load the data
>from memory first, for further processing. So I still
>have to wait the data of window mask (7x7 / 3x3) ready
>in register (A/B) first before further processing. So
>I think the 4 delay slot still a bottleneck in my
>application. My application is using Image 320 x 240,
>so any delay slot put in the loop will make the
>program slower. . The simple example is my
>downsampling program, I first load 2x2 window mask
>into register, before I havent get the image data, I
>couldnt process the data so I havent
>My program example:
> ADD .L1 Ex1_addr_A,idx_val_A,Ex1_val_A
>|| ADD .L2 Ex2_addr_B,idx_val_A,Ex2_val_B
> ADD .S1 Ey1_addr_A,idx_val_A,Ey1_val_A ; Get Ey1 address
>|| ADD .S2 Ey2_addr_B,idx_val_A,Ey2_val_B
>
> LDW .D1T1 *Ex1_val_A,Ex1_val_A ; Get Ex1 value
>|| LDW .D2T2 *Ex2_val_B,Ex2_val_B ; Get Ex2 value
> LDW .D1T1 *Ey1_val_A,Ey1_val_A ; Get Ey1 value
>|| LDW .D2T2 *Ey2_val_B,Ey2_val_B ; Get Ey2 value
> NOP 4
>
> ADD .L1 Ex1_val_A,Ex2_val_B,Ex_val_A
>|| ADD .L2 Ey1_val_A,Ey2_val_B,Ey_val_B
>
>This is my bottleneck of my program; I need to repeat
>it for equally my image size so delay slot will be 4 x
>320 x 240. It just part of my program, It will appear
>frequently in other of my program. Please correct me
>if I am wrong? Thx.
>
>Par Ligander wrote:
> Yes, always. If was memory speed dependant it would
>be very difficult so make any code binary portable
>
>Henrry write: (Related to Par Ligander)
>Could explain more detail about binary portable ? Or
>maybe if you dont mind could you give simple example
>of it.
>
>>>2. Assume the data located in L2 Cache (0x0000 0010)? So If I use
>>>the Load instruction to load data from 0x0000 0010 should I also
>>>need 4 delay slot ?
>
>Ganesha wrote:
>You don't need a 4 delay slot if its in the cache.
>Kindly go through the function of L2 Hits and misses
>w.r.t. cache and how it affects your execution.
>If you have multiple misses, they will be pipelined
>and you average delay will be less than 4.
>
>Henrry write:
>According to your answer, I have wrong perception
>about L2 Cache. I think the L2 Cache that I mention in
>my previous email is L2 SRAM which located from 0x0000
>0000 until 0x0000 10000.

Henrry, read SPRU656A.PDF and see if that helps.

>3. Assume the data locate in SDRAM (which is external
>memory in
>DSK6711 board), should I also need 4 delay slot ?
>Yes if you have a single load. if you have multiple
>loads, then my previous answer should provide you with
>enough information.
>4. I have planned to use QDMA to move a small block of
>data from
>SDRAM to L2 Cache to speed up my program. But it
>seemed useless,
>because when the data already in L2 Cache, I still
>need 4 delay slot
>in order to load the data from L2 Cache which is same
>delay slot
>while loading data from SDRAM. So In my opinion, I
>just load
>directly from SDRAM instead of using QDMA to move the
>data to L2
>Cache. I think that if there is no delay slot while
>loading data
>from L2 cache, the QDMA will be usefull. Any
>correction ?
>
>I dont understand how can you perform a DMA into cache
>area. I guess cache is controlled by cache controller
>and to the best of my knowledge, you don't program to
>write data into memory area configured as Cache. you
>might use DMA to transfer data from SDRAM to L2 SRAM,
>but not cache. Kindly cross-check and cross-verify.
>Cache is used to hold the memory segment that your
>program is currently accessing and in all probability
>will be holding in future.
>It is controlled by the logic of your software,
>depending on the way memory is accessed, be it program
>or data memory.
>You never write anything explicit. The cache
>controller uses LRU (Least Recently Used) algorithm to
>update the cache lines. You need to maintain coherency
>for which you need to perform a cache clean.
>In a nutshell, you can only Clean the cache from your
>control. To gain very good cache performance, you need
>to restructure your program suitably.
>
>Henrry write:
>About the cache, I think C6711 provide to level cache
>which are L1P and L2. According to your explanation
>about cache, both of the cache cannot be accessed by
>our program, but can be access by the logic of our
>program. So it will use by the Cache controller
>automatically when there is a load or store
>instruction for temporary storing.
>
>Par Ligander wrote:
>Correct QDMA, can not reduce the load delay. You can
>use a DMA scheme
>as you suggest to reduce the number of stall cycles
>infliced by
>the slow external memory but the cache is in most
>cases a better
>mechanism to do that.
>
>Henrry write:
>I have not fully understand about your answer. You
>mean that I still could use DMA sheme to move data
>from external memory (SDRAM) to L2 SRAM to handle any
>stall cycles ? Correct me if I am wrong

Henrry, on the C6711 you can select the size of the L2 cache, up to a maximum of 64
Kbytes. You could have 32 Kbytes of SRAM and 32 Kbytes of L2 cache. In that case
you could QDMA into the SRAM region. You cannot QDMA into L2 cache memory.

I would think that in all probability, you could just set the whole SRAM to be
L2 cache and let the processor take care of fetching and caching memory.

- Andrew E.

>=====
>Best Regards,
>
>Henrry Andrian - Researcher
>ISCI Lab (http://isci.cn.nctu.edu.tw)
>Office Ph. +886 3 5712121 ext: 54358
>Mobile Ph. +886 931198986
>National Chiao Tung University (http://www.nctu.edu.tw)
>Hsinchu - Taiwan, ROC
>




--- Henrry Andrian wrote:

> LDW .D1T1 *A1,A2
> NOP 4

> The above instruction loads the 32-bit data pointed to
> by A1 into A2,
> with NOP 4 covering the 4 delay slots.

> Do we always have to wait for 4 delay slots in
> order to get data using a load instruction?

YES - the LDW instruction takes 5 cycles, so there are
always 4 processor cycles before the data can be
used!!! and caching has nothing to do with that!!!
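
A minimal C sketch of that distinction, under stated assumptions: the 4 delay slots are a fixed property of the load pipeline, while a cache miss adds CPU stall cycles on top of them. The 4-cycle stall for an L1D miss that hits L2 is the figure quoted from spru609a elsewhere in this thread. The enum and function names are made up for illustration, and external-memory (SDRAM) stalls are left out because no figure for them appears here.

```c
#include <assert.h>

/* Where the loaded data is found (illustrative, not a TI type). */
enum load_source { HIT_L1D, MISS_L1D_HIT_L2 };

#define LOAD_DELAY_SLOTS  4u /* fixed: same for every LDW/LDH/LDB */
#define L1D_MISS_L2_STALL 4u /* figure quoted from spru609a in this thread */

/* Extra cycles during which the whole CPU is stalled by the cache. */
static unsigned stall_cycles(enum load_source src)
{
    assert(src == HIT_L1D || src == MISS_L1D_HIT_L2);
    return src == HIT_L1D ? 0u : L1D_MISS_L2_STALL;
}

/* Cycles from issuing the load until the destination register is usable:
   1 issue cycle + 4 delay slots + any cache stall. */
static unsigned cycles_until_data_usable(enum load_source src)
{
    return 1u + LOAD_DELAY_SLOTS + stall_cycles(src);
}
```

Under this model an L1D hit gives 5 cycles and an L1D miss that hits L2 gives 9. The number of delay slots the code has to cover is 4 in both cases.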

now - a couple of issues - although you always have to
"wait" those 4 delay slots - that does not mean that
you always have to put those 4 NOPs after each LDW

you can execute other instructions that don't depend
on the data you're loading with that LDW - this way
you're not losing any "power" executing NOPs

finally - if you master the "software pipeline" idea -
you'll be able to pipeline loads with whatever you're
performing on the data and then you can bring down
the "average execution time" - but again - that has
nothing to do with the delay slots needed to load the
data...
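
A rough cycle-count model of the two points above, written in C as a sketch. It assumes an idealized loop that issues one load per cycle once it is software-pipelined, ignores cache stalls and the work done on the data, and uses hypothetical function names.

```c
#include <assert.h>

#define LOAD_DELAY_SLOTS 4u /* from the thread: LDW needs 4 delay slots */

/* Every load is followed by NOP 4, so each one costs 1 + 4 cycles. */
static unsigned cycles_nop_padded(unsigned n_loads)
{
    return n_loads * (1u + LOAD_DELAY_SLOTS);
}

/* Loads issued back to back, one per cycle: the delay slots of each load
   overlap with the following loads, so only the last load's 4 delay
   slots add to the total. */
static unsigned cycles_pipelined(unsigned n_loads)
{
    return n_loads == 0u ? 0u : n_loads + LOAD_DELAY_SLOTS;
}

/* Average cost per load for a given total cycle count. */
static double avg_cycles_per_load(unsigned total_cycles, unsigned n_loads)
{
    assert(n_loads > 0u);
    return (double)total_cycles / (double)n_loads;
}
```

For 100 loads the NOP-padded version costs 500 cycles under this model and the pipelined one 104, even though every individual load still has its 4 delay slots.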

well - if you want - read the "sample-by-sample FIR
optimized" thread that I started around summer 2003 -
there is a full explanation of an FIR engine using
"software pipeline" - using a single-cycle loop I was
loading/multiplying/accumulating data - even though
your application is totally different - I believe
the idea of "software pipeline" is clearly shown
there...

anyway - good luck to all ;-)

Wojciech Rewers

PS. Can anybody offer me a job as an embedded/DSP
engineer? I'm ready to relocate anywhere!





Andrew-

>I'm confident that Jeff has the correct interpretation here.
>See comments embedded in your email.

I'm not yet convinced. I'll wait for more evidence. :)

>Actually, the c64x and C621x have the same load characteristics.

Not exactly; they seem to have some slight(?) differences:
1] The C64x employs a banked memory structure (this is where the MEM_BANK pragma
comes into the picture)
2] The line sizes are different

I quote from spru609a.pdf section 3.2.1

"The C621x/C671x DSP does not employ a banked memory structure in L1D.
The L1D is implemented with a single bank of dual-ported, 64-bit memory.
This allows two simultaneous accesses on each cycle with no stalls. This is in
contrast to the C620x/C670x and C64x devices. The C64x devices employ a
least-significant bit (LSB) based memory banking structure that only allows
one access to each bank on each cycle."

So that proves that their characteristics are *quite* different.

Moreover, in another place the document says "An L1D read miss that hits L2
SRAM or L2 cache stalls the CPU for 4 cycles." That implies that if there were
no L1D miss, there wouldn't be a 4-cycle stall! (See, this is precisely my point:
TI documents make me *interpret* things!)

So I strongly doubt the argument that every load *requires* 4 delay
slots. Not on the 621x/671x at least.

>Back to back loads can occur in a pipelined loop, but that is a different
>discussion

I'm not sure I understand what you mean here.

>The 4 delay slots are due to the deep pipeline on the C6xxx family. It
>doesn't matter where the data is, the DSP uses the 4 delay slots to go
>through the gymnastics of fetching the data.

I understand that the 4 delay slots are for:

1] pg (data address generate)
2] ps (send to memory)
3] pw (wait for data to be ready)
4] pr (read)

Now my point is: if the value is *found* in L1D, where is the question of
"sending to memory", "waiting for the value to be ready" and "reading"? Why??

Remember, the document says the L1D CPU access time is 1 cycle. If every load,
regardless of where the data is stored, is going to take 4 delay slots (hence 4
clock cycles), then what is the point of having a single-cycle access memory?

-bhooshan
