c6x | Load (LDW/LDH/LDB) Instruction

Dear all,

I have a question about loading data from memory (LDW/LDH/LDB) in assembly language. I am currently using the DSK6711 board. According to SPRU189 (TMS320C6000 CPU and Instruction Set Reference Guide), a load instruction needs 4 delay slots to load 32-bit/16-bit/8-bit data from memory into a register.

Example:

    LDW .D1T1 *A1,A2
    NOP 4

The instruction above loads the 32-bit word pointed to by A1 into A2; the NOP 4 fills the 4 delay slots.

My questions are:

1. Do we always have to wait 4 delay slots to get data using a load instruction?
2. Assume the data is located in L2 cache (0x0000 0010). If I use a load instruction to load data from 0x0000 0010, do I also need 4 delay slots?
3. Assume the data is located in SDRAM (external memory on the DSK6711 board). Do I also need 4 delay slots?
4. I have planned to use QDMA to move a small block of data from SDRAM to L2 cache to speed up my program. But it seems useless, because once the data is in L2 cache I still need 4 delay slots to load it, the same as when loading from SDRAM. So in my opinion I could just load directly from SDRAM instead of using QDMA to move the data to L2 cache. QDMA would only be useful if there were no delay slots when loading from L2 cache. Any corrections?

Those are all my questions about the load instruction on the C6000, specifically on the DSK6711 board. I would appreciate any answers. Thank you.

Best Regards,
Henrry Andrian
Graduated Student - ISCI LAB (http://isci.cn.nctu.edu.tw)
National Chiao Tung University - Hsinchu, ROC.
Lab Phone. +886-3-5712121 ext.54358
Cellular Phone. +886-931-198986

Reply by Ganesh Vijayan ● April 1, 2004

Hi Henry,

Kindly find my answers embedded in your mail.

----- Original Message -----

From: Henrry Andrian

To: c...@yahoogroups.com

Sent: Thursday, April 01, 2004 1:12 PM

Subject: [c6x] Load (LDW/LDH/LDB) Instruction

> Dear all,
>
> Hi, I have a question about loading data from memory (LDW/LDH/LDB)
> in assembly language. Currently I am using the DSK6711 board.
> According to SPRU189 (TMS320C6000 CPU and Instruction Set
> Reference Guide), a load instruction needs 4 delay slots
> to load 32-bit/16-bit/8-bit data from memory into a register.
> Example:
> LDW .D1T1 *A1,A2
> NOP 4
> The instruction above loads the 32-bit word pointed to by A1 into A2;
> the NOP 4 fills the 4 delay slots.
>
> My questions are:
> 1. Do we always have to wait 4 delay slots to get data
> using a load instruction?

It isn't always the case that you need to wait for 4 delay slots after a LOAD instruction, i.e. you needn't insert 4 NOPs after every load.

> 2. Assume the data is located in L2 cache (0x0000 0010). If I use
> a load instruction to load data from 0x0000 0010, do I also
> need 4 delay slots?

You don't need 4 delay slots if it's in the cache. Kindly go through how L2 hits and misses work and how they affect your execution.

If you have multiple misses, they will be pipelined and your average delay will be less than 4.

> 3. Assume the data is located in SDRAM (external memory on the
> DSK6711 board). Do I also need 4 delay slots?

Yes, if you have a single load. If you have multiple loads, my previous answer should give you enough information.

> 4. I have planned to use QDMA to move a small block of data from
> SDRAM to L2 cache to speed up my program. But it seems useless,
> because once the data is in L2 cache I still need 4 delay slots
> to load it, the same as when loading from SDRAM. So in my opinion
> I could just load directly from SDRAM instead of using QDMA to move
> the data to L2 cache. QDMA would only be useful if there were no
> delay slots when loading from L2 cache. Any corrections?

I don't understand how you can perform a DMA into the cache area. I believe the cache is controlled by the cache controller and, to the best of my knowledge, you don't program writes into a memory area configured as cache. You might use DMA to transfer data from SDRAM to L2 SRAM, but not to cache. Kindly cross-check and verify.

Cache is used to hold the memory segment that your program is currently accessing and will in all probability access again in the future.

It is controlled by the logic of your software, depending on the way memory is accessed, be it program or data memory.

You never write anything to it explicitly. The cache controller uses an LRU (Least Recently Used) algorithm to update the cache lines. You need to maintain coherency, for which you need to perform a cache clean.

In a nutshell, the only thing under your control is cleaning the cache. To get very good cache performance, you need to restructure your program suitably.

> Those are all my questions about the load instruction on the C6000,
> specifically on the DSK6711 board. I would appreciate any
> answers. Thank you.

Hope this helps.

Thanks and Regards,

Ganesh

> Best Regards,
>
> Henrry Andrian
> Graduated Student - ISCI LAB (http://isci.cn.nctu.edu.tw)
> National Chiao Tung University - Hsinchu, ROC.
>
> Lab Phone. +886-3-5712121 ext.54358
> Cellular Phone. +886-931-198986

Reply by jfbuggen ● April 1, 2004

Hello,

Let me answer your questions based on my experience with the C64xx DSPs. It is not exactly the same processor, but it is similar enough to the one you are using that my answers should be valid for you.

> 1. Do we always have to wait 4 delay slots to get data
> using a load instruction?

Yes, you always need to wait 4 delay slots. Those slots are required because of the pipeline architecture. This is the minimum required time, assuming that the data is already present in L1D memory.

May I suggest having a look at the spru189f document, which has two chapters explaining the pipeline of the C62x/C64x and C67x processors. The load instruction requires 5 execute cycles, so it has 4 delay slots.

> 2. Assume the data is located in L2 cache (0x0000 0010). If I use
> a load instruction to load data from 0x0000 0010, do I also
> need 4 delay slots?
> 3. Assume the data is located in SDRAM (external memory on the
> DSK6711 board). Do I also need 4 delay slots?

A first warning before answering: L2 cache is a part of L2 memory that is used as cache, so the DSP itself takes care of fetching data from external memory into the L2 cache, by issuing special QDMA transfers. Please don't perform any transfer to this zone on your own, as it can trash some cache data. You are allowed to make QDMA transfers to L2 memory that is NOT configured as cache. The size of the L2 cache is configurable, and the remaining space of L2 can be used for your program/data.

The 4 delay slots are required for every load, because of the pipeline architecture. When the data is not in the L1D cache, the DSP has to find the data in its actual place.

1) If the load address is somewhere in L2 (not used as L2 cache), it has to fetch the data from this L2 memory.
2) If the load address is somewhere in external memory, and L2 cache is enabled (and this page of external memory is configured as cacheable through the MAR register), then the DSP checks whether the data is present in the L2 cache. If not, it has to fetch the data from external memory.
3) If the load address is somewhere in external memory, and L2 cache is not enabled, the data is fetched directly from external memory.

In all those cases where the data was not initially in L1D, some additional time is needed to fetch the data into L1D. During this time, the execution of your program is SUSPENDED. This means that during some additional CPU cycles, no execute packet will be processed. This is not reflected in your assembly code, as the execution is really suspended. You keep your 4 delay slots in your assembly code, but sometimes you will actually wait more than 4 cycles.

> 4. I have planned to use QDMA to move a small block of data from
> SDRAM to L2 cache to speed up my program. But it seems useless,

You can choose to control the process of transferring the appropriate data from external memory to L2 (not cache!) through QDMA transfers, and then issue load operations to this L2 zone. This has the advantage of giving you control over what is put in L2 and when. Another possibility is to let the DSP manage it, by enabling L2 cache and enabling caching for the page of external memory where your data resides. This makes things easier, but sometimes less efficient, since you don't have much control over cache allocation and thrashing.

For the C64x, there is a very interesting document called "two level internal memory reference guide", spru610. There should be a similar document for your DSP, but I don't have its reference.

I hope it helps

J-F
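The distinction J-F draws between delay slots (fixed, visible in the code) and stall cycles (variable, invisible in the code) can be sketched as a small model. This is only an illustration: the function names, the flags, and the external-memory stall value are hypothetical, and the 4-cycle L2 figure is the one quoted from spru609a later in this thread, not something verified here.

```python
# Toy model of the lookup order described above. Not a cycle-accurate simulator.
LOAD_DELAY_SLOTS = 4  # fixed by the pipeline: 5 execute phases, so 4 delay slots

# Extra cycles during which the CPU is suspended, per place the data is found.
STALL_CYCLES = {
    "L1D": 0,        # data already in L1D: no stall
    "L2": 4,         # figure quoted from spru609a elsewhere in the thread
    "external": 30,  # HYPOTHETICAL placeholder; depends on SDRAM and EMIF setup
}

def where_is_data(in_l1d, in_l2_sram, l2_cache_enabled, mar_cacheable, in_l2_cache):
    """Return where a load finds its data, following cases 1) to 3) above."""
    if in_l1d:
        return "L1D"
    if in_l2_sram:                          # case 1: address in L2 SRAM
        return "L2"
    if l2_cache_enabled and mar_cacheable and in_l2_cache:
        return "L2"                         # case 2: hit in L2 cache
    return "external"                       # case 2 miss, or case 3

def load_cost(source):
    """Return (delay slots written in the code, extra stall cycles)."""
    return LOAD_DELAY_SLOTS, STALL_CYCLES[source]

if __name__ == "__main__":
    src = where_is_data(False, False, True, True, False)
    print(src, load_cost(src))
```

Whatever the source, the first element of the result stays at 4; only the stall term changes, which is the point made in the reply.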

Reply by henrry ● April 1, 2004

Below are my replies to two answers to my previous email:

>> My questions are:
>> 1. Do we always have to wait 4 delay slots to get data using a load instruction?

Ganesh wrote:
> It isn't always the case that you need to wait for 4 delay slots after a LOAD instruction, i.e. you needn't insert 4 NOPs after every load.

Henrry writes (related to Ganesh's answer):
According to Ganesh, not every load needs 4 NOPs. That means that after the load instruction I can put other instructions to run first. Example:

    LDW .D1T1 *A1, A2
    XXXX
    XXXX
    XXXX
    XXXX

where XXXX is another instruction, so here we don't need NOP 4.

In my application (image processing), where I work with a window mask (3x3 / 7x7), I need to load the data from memory first, before further processing. So I still have to wait for the window mask data (7x7 / 3x3) to be ready in the registers (A/B) before further processing. So I think the 4 delay slots are still a bottleneck in my application. My application uses a 320 x 240 image, so any delay slot inside the loop makes the program slower. A simple example is my downsampling program: I first load a 2x2 window mask into registers; until I have the image data I cannot process it, so I haven't

My program example:

       ADD .L1 Ex1_addr_A,idx_val_A,Ex1_val_A
    || ADD .L2 Ex2_addr_B,idx_val_A,Ex2_val_B
       ADD .S1 Ey1_addr_A,idx_val_A,Ey1_val_A    ; Get Ey1 address
    || ADD .S2 Ey2_addr_B,idx_val_A,Ey2_val_B
       LDW .D1T1 *Ex1_val_A,Ex1_val_A            ; Get Ex1 value
    || LDW .D2T2 *Ex2_val_B,Ex2_val_B            ; Get Ex2 value
       LDW .D1T1 *Ey1_val_A,Ey1_val_A            ; Get Ey1 value
    || LDW .D2T2 *Ey2_val_B,Ey2_val_B            ; Get Ey2 value
       NOP 4
       ADD .L1 Ex1_val_A,Ex2_val_B,Ex_val_A
    || ADD .L2 Ey1_val_A,Ey2_val_B,Ey_val_B

This is the bottleneck of my program; I need to repeat it as many times as my image size, so the delay slots add up to 4 x 320 x 240. This is just one part of my program; the same pattern appears frequently elsewhere in it. Please correct me if I am wrong. Thx.

Par Ligander wrote:
> Yes, always. If it was memory speed dependent it would be very difficult to make any code binary portable.

Henrry writes (related to Par Ligander's answer):
Could you explain "binary portable" in more detail? Or, if you don't mind, could you give a simple example of it?

>> 2. Assume the data is located in L2 cache (0x0000 0010). If I use
>> a load instruction to load data from 0x0000 0010, do I also
>> need 4 delay slots?

Ganesh wrote:
> You don't need 4 delay slots if it's in the cache. Kindly go through how L2 hits and misses work and how they affect your execution. If you have multiple misses, they will be pipelined and your average delay will be less than 4.

Henrry writes:
According to your answer, I had the wrong idea about L2 cache. I think the "L2 Cache" that I mentioned in my previous email is actually L2 SRAM, which is located from 0x0000 0000 until 0x0000 10000.

>> 3. Assume the data is located in SDRAM (external memory on the DSK6711 board). Do I also need 4 delay slots?

> Yes, if you have a single load. If you have multiple loads, my previous answer should give you enough information.

>> 4. I have planned to use QDMA to move a small block of data from SDRAM to L2 cache to speed up my program. But it seems useless, because once the data is in L2 cache I still need 4 delay slots to load it, the same as when loading from SDRAM. So in my opinion I could just load directly from SDRAM instead of using QDMA to move the data to L2 cache. QDMA would only be useful if there were no delay slots when loading from L2 cache. Any corrections?

> I don't understand how you can perform a DMA into the cache area. I believe the cache is controlled by the cache controller and, to the best of my knowledge, you don't program writes into a memory area configured as cache. You might use DMA to transfer data from SDRAM to L2 SRAM, but not to cache. Kindly cross-check and verify.
> Cache is used to hold the memory segment that your program is currently accessing and will in all probability access again in the future. It is controlled by the logic of your software, depending on the way memory is accessed, be it program or data memory. You never write anything to it explicitly. The cache controller uses an LRU (Least Recently Used) algorithm to update the cache lines. You need to maintain coherency, for which you need to perform a cache clean. In a nutshell, the only thing under your control is cleaning the cache. To get very good cache performance, you need to restructure your program suitably.

Henrry writes:
About the cache, I think the C6711 provides a two-level cache, namely L1P and L2. According to your explanation, neither cache can be accessed directly by our program, but both are used through the logic of our program. So the cache controller will use them automatically for temporary storage whenever there is a load or store instruction.

Par Ligander wrote:
> Correct, QDMA cannot reduce the load delay. You can use a DMA scheme as you suggest to reduce the number of stall cycles inflicted by the slow external memory, but the cache is in most cases a better mechanism to do that.

Henrry writes:
I have not fully understood your answer. Do you mean that I could still use a DMA scheme to move data from external memory (SDRAM) to L2 SRAM to deal with stall cycles? Correct me if I am wrong.

=====
Best Regards,

Henrry Andrian - Researcher
ISCI Lab (http://isci.cn.nctu.edu.tw)
Office Ph. +886 3 5712121 ext: 54358
Mobile Ph. +886 931198986
National Chiao Tung University (http://www.nctu.edu.tw)
Hsinchu - Taiwan, ROC
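The "4 x 320 x 240" estimate in the message above can be worked out explicitly. A minimal sketch, assuming one NOP 4 per pixel as described; the clock rate is a hypothetical parameter chosen only to turn cycles into time, so check the actual clock of your board.

```python
WIDTH, HEIGHT = 320, 240          # image size from the post
NOP_CYCLES_PER_ITERATION = 4      # the NOP 4 after the loads

def wasted_nop_cycles(width=WIDTH, height=HEIGHT, nops=NOP_CYCLES_PER_ITERATION):
    """Cycles spent in NOPs if the load/NOP/add block runs once per pixel."""
    return nops * width * height

def cycles_to_ms(cycles, clock_hz):
    """Convert a cycle count to milliseconds for an assumed CPU clock."""
    return 1000.0 * cycles / clock_hz

if __name__ == "__main__":
    total = wasted_nop_cycles()
    print(total)                          # 307200
    # ASSUMPTION: 150 MHz clock, used here purely as an example value.
    print(cycles_to_ms(total, 150e6))
```

This counts only the NOP cycles written in the code. Any stall cycles caused by the data not being in L1D come on top of this and are not visible in the assembly.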

Reply by Bhooshan iyer ● April 1, 2004

Jeff-

>Yes, you always need to wait 4 delay slots. Those slots are
>required because of the pipeline architecture. This is the minimum
>required time, assuming that the data is already present in
>L1D memory.
>May I suggest having a look at the spru189f document, which
>has two chapters explaining the pipeline of the C62x/C64x
>and C67x processors.
>The load instruction requires 5 execute cycles, so it has 4
>delay slots.

You are probably right as far as C64x devices are concerned (I am not sure about that device), but you are definitely wrong on 621x and 671x devices. This is a classic question. It is quite vexing too, because there is no *direct* and *easy* documentation on it (so, what's new? TI and their documentation!).

The answer is no, you don't have to insert 4 NOPs after every load. Particularly if your data is accessed very *locally* (read: one after another in a tight set, say 4k!), then the first load needs to be *delayed* and the rest do not! On 1x devices, every data access that encounters a cache miss picks up 32 bytes of data with 2 *fetches* from L2 (if available), and after that the access time for the next nearest data is *just 1 cycle* for every load! (Read: L1D CPU access time is only 1 cycle.)

There are several other complications in load operations, like *banking* (again, 1x devices for some vague reason need not be *banked*, meaning MEM_BANK pragmas are not required for 1x devices. Does that make sense?), L1D conflicts etc., which again have penalties associated with them. If required we can open that topic; otherwise I'll just leave it aside.

Again, the simple answer: no, 4 NOPs are not required every time. But unless you are a past master at cache programming (I am not!), it is very unlikely you can bank (a different sort of bank, this... :) ) on not inserting NOPs. CCS 2.2 has some new features, like the cache analysis tool kit, which help you analyse the cache and make changes to your code. Well, good luck!

>> 2. Assume the data is located in L2 cache (0x0000 0010). If I use
>> a load instruction to load data from 0x0000 0010, do I also
>> need 4 delay slots?

Yes, this is what the 4 delay slots are for, if it is in L2. But the next time, the next nearest data is already likely to be in L1, so ideally you won't need those 4 delay slots again.

>> 3. Assume the data is located in SDRAM (external memory on the
>> DSK6711 board). Do I also need 4 delay slots?

No, it will require more. The L2 line size is 128 bytes, so it is highly unlikely that this can take just 4 cycles, as is the case for the line size of 32 bytes!

-bhooshan
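The locality argument above rests on the 32-byte L1D line size mentioned in the post. A minimal sketch of how many line fills a run of sequential LDW loads would trigger, assuming a cold cache and no evictions (both simplifications; the function name is made up for illustration). It counts line fills only, which relate to stall cycles; whether the 4 delay slots are affected is exactly what is disputed in this thread.

```python
LINE_SIZE = 32   # bytes per L1D line, as stated in the post above
WORD_SIZE = 4    # bytes loaded by one LDW

def count_line_fills(start_addr, n_words, line_size=LINE_SIZE):
    """Count distinct cache lines touched by n_words sequential word loads."""
    seen = set()
    for i in range(n_words):
        line = (start_addr + i * WORD_SIZE) // line_size
        seen.add(line)
    return len(seen)

if __name__ == "__main__":
    # One 320-word row starting on a line boundary: 320 * 4 / 32 lines.
    print(count_line_fills(0, 320))   # 40
    # Eight words starting mid-line straddle two lines.
    print(count_line_fills(16, 8))    # 2
```

With sequential access, at most one load in eight starts a new line, which is why access order matters so much for stall cycles.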

Reply by Andrew Elder ● April 1, 2004

Bhooshan,

I'm confident that Jeff has the correct interpretation here. See comments embedded in your email.

As an aside, Henrry, if you are really interested in what is going on, use the expert, the TI C compiler. Write some C code and study the assembly that it produces. That said, I HIGHLY recommend that you stay away from assembly programming on the C6711. The rules of C6xxx programming are:

1) Get everything working in C.
2) If the code isn't fast enough, use the profiler to figure out what is running too slow.
2.5) Check for TI Signal Processing libraries that implement what you need.
3) Optimize the C code.
4) Use the TI optimizer tools to optimize the C code.
5) Use TI app notes and documentation to optimize the C code.
6) Repeat from 4 :-)

The main point is that you should try REALLY HARD to implement everything in C code. The C compiler does a pretty good job "out of the box" and does an excellent job if you read up some on what the optimization options are. Finally, if you still don't have enough performance, you can resort to linear assembly and, worst case, actual assembly.

At 01:04 PM 4/1/2004 +0000, Bhooshan iyer wrote:
>Jeff-
>
>>Yes, you always need to wait 4 delay slots. Those slots are
>>required because of the pipeline architecture. This is the minimum
>>required time, assuming that the data is already present in
>>L1D memory.
>>May I suggest having a look at the spru189f document, which
>>has two chapters explaining the pipeline of the C62x/C64x
>>and C67x processors.
>>The load instruction requires 5 execute cycles, so it has 4
>>delay slots.
>
>You are probably right as far as C64x devices are concerned (I am not sure
>about that device), but you are definitely wrong on 621x and 671x devices.
>This is a classic question. It is quite vexing too, because there is no
>*direct* and *easy* documentation on it (so, what's new? TI and their
>documentation!).

Actually, the C64x and C621x have the same load characteristics.

>The answer is no, you don't have to insert 4 NOPs after every
>load. Particularly if your data is accessed very *locally* (read: one after
>another in a tight set, say 4k!), then the first load needs to be *delayed*
>and the rest do not! On 1x devices, every data access that encounters a cache
>miss picks up 32 bytes of data with 2 *fetches* from L2 (if available), and
>after that the access time for the next nearest data is *just 1 cycle* for
>every load! (Read: L1D CPU access time is only 1 cycle.)

If the data is in L1D, the load has 4 delay slots. If the data is not in L1D, it will take longer. Back-to-back loads can occur in a pipelined loop, but that is a different discussion.

>There are several other complications in load operations, like
>*banking* (again, 1x devices for some vague reason need not be
>*banked*, meaning MEM_BANK pragmas are not required for 1x devices. Does that
>make sense?), L1D conflicts etc., which again have penalties associated with
>them. If required we can open that topic; otherwise I'll just leave it aside.
>
>Again, the simple answer: no, 4 NOPs are not required every time. But unless
>you are a past master at cache programming (I am not!), it is very unlikely
>you can bank (a different sort of bank, this... :) ) on not inserting NOPs.
>CCS 2.2 has some new features, like the cache analysis tool kit, which help
>you analyse the cache and make changes to your code. Well, good luck!

4 NOPs are required every time.

>>> 2. Assume the data is located in L2 cache (0x0000 0010). If I use
>>> a load instruction to load data from 0x0000 0010, do I also
>>> need 4 delay slots?
>Yes, this is what the 4 delay slots are for, if it is in L2. But the next
>time, the next nearest data is already likely to be in L1, so ideally you
>won't need those 4 delay slots again.

The 4 delay slots are due to the deep pipeline of the C6xxx family. It doesn't matter where the data is; the DSP uses the 4 delay slots to go through the gymnastics of fetching the data.

>>> 3. Assume the data is located in SDRAM (external memory on the
>>> DSK6711 board). Do I also need 4 delay slots?
>
>No, it will require more. The L2 line size is 128 bytes, so it is highly
>unlikely that this can take just 4 cycles, as is the case for the line size
>of 32 bytes!

Yes. You always need 4 delay slots.

- Andrew E.

>-bhooshan

Reply by jfbuggen ● April 1, 2004

Hi Bhooshan,

> You are probably right as far as C64x devices are concerned (I am not sure
> about that device), but you are definitely wrong on 621x and 671x devices.

I really don't agree with you. C62x and C64x have the same pipeline architecture. C67x is somewhat different, but the load instructions are performed the same way. Please read spru189f's chapter on the pipeline carefully.

Let me first make one remark: I'm talking about optimised assembly (.asm). If you're writing linear assembly (.sa), then you don't have to worry about load delay slots, because the assembler will insert them for you. With linear assembly, you can consider that any instruction is completely finished before the next one.

I'll try to explain this better than in my previous post. Every instruction goes through 3 pipeline stages: fetch, decode and execute. The "execute" stage is made of a variable number of phases, depending on the instruction. The load instruction requires 5 execute phases. It is only at the end of the 5th phase that the loaded data is actually put into the destination register.

Let's say you're performing a load, followed by 4 instructions A, B, C, D:

    LDW *A4++, A3
    A
    B
    C
    D

I'll show what's happening at every cycle, considering that

    Fn(X) = Fetch phase n of instruction X, n = 1..4
    Dn(X) = Decode phase n of instruction X, n = 1..2
    En(X) = Execute phase n of instruction X, n = 1..5 (max), if needed for this instruction

    Cycle N    :       F1(LDW)
    Cycle N+1  :       F1(A) F2(LDW)
    Cycle N+2  :       F1(B) F2(A) F3(LDW)
    Cycle N+3  :       F1(C) F2(B) F3(A) F4(LDW)
    Cycle N+4  : ...   F2(C) F3(B) F4(A) D1(LDW)
    Cycle N+5  : ...   F3(C) F4(B) D1(A) D2(LDW)
    Cycle N+6  : ...   F4(C) D1(B) D2(A) E1(LDW)   // LDW starts execution
    Cycle N+7  : ...   D1(C) D2(B) E1(A) E2(LDW)   // A starts execution
    Cycle N+8  : ...   D2(C) E1(B) E2(A) E3(LDW)   // B starts execution
    Cycle N+9  : ...   E1(C) E2(B) E3(A) E4(LDW)   // C starts execution
    Cycle N+10 : ...   E2(C) E3(B) E4(A) E5(LDW)   // D starts execution

-> LDW has now finished its 5 execution phases, and A3 contains the loaded data.

This means that the LDW was started at cycle N+6, but the data is ready only after cycle N+10! So there are what are called 4 delay slots, no matter where the data was (in cache or not).

BUT:

1) You can execute other instructions during those 4 cycles, as long as those instructions don't need the loaded data.
2) The address modification (A4++ in my example) is performed in the first execute phase, so the "A" instruction in my example can use the post-incremented value.

Note also that instruction A in my example can be another LDW instruction, and its results will be available at cycle N+11, so one cycle after the results of the first LDW. This is a very effective use of pipelining, but it doesn't actually remove the 4 delay slots. It's just an optimised way of combining loads.

If the data is not in L1D, it doesn't change anything in this story, but THE EXECUTION UNIT IS STALLED until the data is ready, so that more CPU cycles will be required.

> The answer is no, you don't have to insert 4 NOPs after every
> load. Particularly if your data is accessed very *locally* (read: one

I agree, you don't HAVE TO insert NOPs, because you can perform other calculations while waiting for the data to be loaded, but you ALWAYS have to wait 4 cycles before the data is actually loaded into the register.

This gives some problems with interruptible code. I have discussed this in depth with TI support. If you perform this:

    LDW *A4++, A3
    ZERO A3
    STW A3, *A5
    NOP 2
    // A3 loaded here, let's use it...

If an interrupt occurs between the LDW and the ZERO instruction, then the values can be loaded into A3 before the ZERO instruction is actually executed, and those values can be overwritten.

I hope that it is clear now. If not, don't hesitate to contact an official TI support center; they will be happy to confirm all this to you.

Cheers

J-F
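The cycle table above can be reproduced with a few lines of code. This is a simplified sketch, not a model of the real hardware: it assumes one instruction enters the pipeline per cycle, that nothing stalls, and that every instruction is drawn with all five execute phases (a single-cycle instruction would really finish at E1). The stage labels follow the table; the function names are invented for illustration.

```python
# Stage sequence used in the table: 4 fetch, 2 decode, up to 5 execute phases.
STAGES = ["F1", "F2", "F3", "F4", "D1", "D2", "E1", "E2", "E3", "E4", "E5"]

PROGRAM = ["LDW", "A", "B", "C", "D"]  # LDW followed by four instructions

def stage_at(issue_offset, cycle):
    """Stage of an instruction issued at N+issue_offset during cycle N+cycle."""
    idx = cycle - issue_offset
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

def cycle_of(program, name, stage):
    """Cycle offset (relative to N) at which `name` is in `stage`."""
    return program.index(name) + STAGES.index(stage)

def table(program, n_cycles):
    """Build one row per cycle, newest instruction first as in the post."""
    rows = []
    for cycle in range(n_cycles):
        cells = []
        for offset, name in reversed(list(enumerate(program))):
            stage = stage_at(offset, cycle)
            if stage is not None:
                cells.append(f"{stage}({name})")
        rows.append(f"Cycle N+{cycle}: " + " ".join(cells))
    return rows

if __name__ == "__main__":
    for row in table(PROGRAM, 11):
        print(row)
    slots = cycle_of(PROGRAM, "LDW", "E5") - cycle_of(PROGRAM, "LDW", "E1")
    print("delay slots:", slots)
```

In this model LDW reaches E1 at N+6 and E5 at N+10, and D reaches E1 at N+10, matching the table. Replacing A with a second LDW puts its E5 at N+11, one cycle after the first, which is the back-to-back load case mentioned above.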

Reply by Andrew Elder ● April 1, 2004

Henrry,

Can you post the C code for this? Are you sure that C code isn't fast enough?

At 04:42 AM 4/1/2004 -0800, henrry wrote:
>Below are my replies to two answers to my previous email:
>
>>>My questions are:
>>>1. Do we always have to wait 4 delay slots to get data
>>>using a load instruction?
>
>Ganesh wrote:
>> It isn't always the case that you need to wait for 4 delay slots
>> after a LOAD instruction, i.e. you needn't insert 4 NOPs after
>> every load.
>
>Henrry writes (related to Ganesh's answer):
>According to Ganesh, not every load needs 4 NOPs. That means that
>after the load instruction I can put other instructions to run first.
>Example:
>
>LDW .D1T1 *A1, A2
>XXXX
>XXXX
>XXXX
>XXXX
>
>where XXXX is another instruction, so here we don't
>need NOP 4.

Yes, that is correct. You just need to wait the 4 delay slots for *A1 to actually appear in A2.

>In my application (image processing), where I work
>with a window mask (3x3 / 7x7), I need to load the data
>from memory first, before further processing. So I still
>have to wait for the window mask data (7x7 / 3x3) to be ready
>in the registers (A/B) before further processing. So
>I think the 4 delay slots are still a bottleneck in my
>application. My application uses a 320 x 240 image,
>so any delay slot inside the loop makes the
>program slower. A simple example is my
>downsampling program: I first load a 2x2 window mask
>into registers; until I have the image data I
>cannot process it, so I haven't
>My program example:
>   ADD .L1 Ex1_addr_A,idx_val_A,Ex1_val_A
>|| ADD .L2 Ex2_addr_B,idx_val_A,Ex2_val_B
>   ADD .S1 Ey1_addr_A,idx_val_A,Ey1_val_A    ; Get Ey1 address
>|| ADD .S2 Ey2_addr_B,idx_val_A,Ey2_val_B
>
>   LDW .D1T1 *Ex1_val_A,Ex1_val_A            ; Get Ex1 value
>|| LDW .D2T2 *Ex2_val_B,Ex2_val_B            ; Get Ex2 value
>   LDW .D1T1 *Ey1_val_A,Ey1_val_A            ; Get Ey1 value
>|| LDW .D2T2 *Ey2_val_B,Ey2_val_B            ; Get Ey2 value
>   NOP 4
>
>   ADD .L1 Ex1_val_A,Ex2_val_B,Ex_val_A
>|| ADD .L2 Ey1_val_A,Ey2_val_B,Ey_val_B
>
>This is the bottleneck of my program; I need to repeat
>it as many times as my image size, so the delay slots add up to 4 x
>320 x 240. This is just one part of my program; the same pattern
>appears frequently elsewhere in it. Please correct me
>if I am wrong. Thx.
>
>Par Ligander wrote:
>> Yes, always. If it was memory speed dependent it would
>> be very difficult to make any code binary portable.
>
>Henrry writes (related to Par Ligander's answer):
>Could you explain "binary portable" in more detail? Or,
>if you don't mind, could you give a simple example
>of it?
>
>>>2. Assume the data is located in L2 cache (0x0000
>>>0010). If I use a load instruction to load data from
>>>0x0000 0010, do I also need 4 delay slots?
>
>Ganesh wrote:
>> You don't need 4 delay slots if it's in the cache.
>> Kindly go through how L2 hits and misses work
>> and how they affect your execution.
>> If you have multiple misses, they will be pipelined
>> and your average delay will be less than 4.
>
>Henrry writes:
>According to your answer, I had the wrong idea
>about L2 cache. I think the "L2 Cache" that I mentioned in
>my previous email is actually L2 SRAM, which is located from 0x0000
>0000 until 0x0000 10000.

Henrry, read SPRU656A.PDF and see if that helps.

>>>3. Assume the data is located in SDRAM (external
>>>memory on the DSK6711 board). Do I also need 4 delay slots?
>> Yes, if you have a single load. If you have multiple
>> loads, my previous answer should give you
>> enough information.
>>>4. I have planned to use QDMA to move a small block of
>>>data from SDRAM to L2 cache to speed up my program. But it
>>>seems useless, because once the data is in L2 cache I still
>>>need 4 delay slots to load it, the same as when loading from
>>>SDRAM. So in my opinion I could just load directly from SDRAM
>>>instead of using QDMA to move the data to L2 cache. QDMA would
>>>only be useful if there were no delay slots when loading
>>>from L2 cache. Any corrections?
>>
>> I don't understand how you can perform a DMA into the cache
>> area. I believe the cache is controlled by the cache controller
>> and, to the best of my knowledge, you don't program
>> writes into a memory area configured as cache. You
>> might use DMA to transfer data from SDRAM to L2 SRAM,
>> but not to cache. Kindly cross-check and verify.
>> Cache is used to hold the memory segment that your
>> program is currently accessing and will in all probability
>> access again in the future.
>> It is controlled by the logic of your software,
>> depending on the way memory is accessed, be it program
>> or data memory.
>> You never write anything to it explicitly. The cache
>> controller uses an LRU (Least Recently Used) algorithm to
>> update the cache lines. You need to maintain coherency,
>> for which you need to perform a cache clean.
>> In a nutshell, the only thing under your control is cleaning
>> the cache. To get very good cache performance, you need
>> to restructure your program suitably.
>
>Henrry writes:
>About the cache, I think the C6711 provides a two-level cache,
>namely L1P and L2. According to your explanation,
>neither cache can be accessed directly by
>our program, but both are used through the logic of our
>program. So the cache controller will use them
>automatically for temporary storage whenever there is a load
>or store instruction.
>
>Par Ligander wrote:
>> Correct, QDMA cannot reduce the load delay. You can
>> use a DMA scheme as you suggest to reduce the number
>> of stall cycles inflicted by the slow external memory,
>> but the cache is in most cases a better
>> mechanism to do that.
>
>Henrry writes:
>I have not fully understood your answer. Do you
>mean that I could still use a DMA scheme to move data
>from external memory (SDRAM) to L2 SRAM to deal with
>stall cycles? Correct me if I am wrong.

Henry, on the C6711 you can select the size of the cache, up to a maximum of 64 Kbytes. You could have 32 Kbytes of SRAM and 32 Kbytes of L2 cache. In that case you could QDMA into the SRAM region. You cannot QDMA into L2 cache memory. I would think that in all probability you could just set the whole SRAM to be L2 cache and let the processor take care of fetching and caching memory.

- Andrew E.

>=====
>Best Regards,
>
>Henrry Andrian - Researcher
>ISCI Lab (http://isci.cn.nctu.edu.tw)
>Office Ph. +886 3 5712121 ext: 54358
>Mobile Ph. +886 931198986
>National Chiao Tung University (http://www.nctu.edu.tw)
>Hsinchu - Taiwan, ROC

Reply by Wojciech Rewers ● April 1, 2004

--- Henrry Andrian wrote:

> LDW .D1T1 *A1,A2
> NOP 4
> Above instruction will load 32 bit data which point by A1 into A2 with four NOP which mean 4 delay slot.
> Does we always have to wait for 4 delay slot in order to get data using Load instruction ?

YES - if the LDW instruction takes 5 cycles - there are always 4 processor cycles until the data can be used!!! And caching has nothing to do with that!!!

Now - a couple of issues - although you always have to "wait" those 4 delay slots - that does not mean that you always have to put those 4 NOPs after each LDW. You can execute other instructions that don't depend on the data you're loading with that LDW - this way you're not losing any "power" on executing NOPs.

Finally - if you master the "software pipeline idea" - you'll be able to pipeline loads with whatever you're performing on the data, and then you can bring down the "average execution time" - but again - that has nothing to do with the delay slots needed to load the data...

Well - if you want - read the "sample-by-sample FIR optimized" thread that I started around summer 2003 - there is a full explanation of a FIR engine using "software pipeline" - using a single cycle loop I was loading/multiplying/accumulating data - even though your application is totally different - I believe the idea of "software pipeline" is clearly shown there...

Anyway - good luck to all ;-)

Wojciech Rewers

PS. Can anybody offer me a job as an embedded/DSP engineer? I'm ready to relocate anywhere!
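
The difference between load latency and average execution time can be put into numbers with a rough model. This is a back-of-the-envelope sketch, not a cycle-accurate simulation: it assumes one load issues per cycle, no memory stalls, and enough independent work to fill every delay slot. The function names are made up for illustration.

```python
LOAD_DELAY_SLOTS = 4  # from the thread: the result is usable 4 cycles after the load issues


def cycles_with_nops(n_loads):
    """Each load is followed by NOP 4, so every load costs 1 + 4 cycles."""
    return n_loads * (1 + LOAD_DELAY_SLOTS)


def cycles_pipelined(n_loads):
    """One load issues per cycle; only the last load's delay slots stay exposed."""
    return n_loads + LOAD_DELAY_SLOTS if n_loads else 0
```

For 100 loads the model gives 500 cycles with a NOP 4 after every load and 104 cycles when the loads are overlapped. Each individual load still has its 4 delay slots in both cases; only the average cost per load changes.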

Reply by Bhooshan iyer ● April 1, 2004

Andrew-

>I'm confident that Jeff has the correct interpretation here.
>See comments embedded in your email.

Am not yet convinced. I'll wait for more evidence. :)

>Actually, the c64x and C621x have the same load characteristics.

Not exactly, they seem to have some slight(?) differences.

1] C64x employs a banked memory structure (the MEM_BANK pragma comes into the picture)
2] Line sizes are different

I quote from spru609a.pdf section 3.2.1:

"The C621x/C671x DSP does not employ a banked memory structure in L1D. The L1D is implemented with a single bank of dual-ported, 64-bit memory. This allows two simultaneous accesses on each cycle with no stalls. This is in contrast to the C620x/C670x and C64x devices. The C64x devices employ a least-significant bit (LSB) based memory banking structure that only allows one access to each bank on each cycle."

So that proves that their characteristics are *quite* different.

Moreover, in another place the document says "An L1D read miss that hits L2 SRAM or L2 cache stalls the CPU for 4 cycles." That implies that if there was no L1D miss then there wouldn't be a stall for 4 cycles! (See, this is precisely my point: TI documents make me *interpret* things!)

So, I strongly doubt the argument that every load *requires* 4 delay slots. Not in 621x/671x at least.

>Back to back loads can occur in a pipelined loop, but that is a different
>discussion

Am not sure I understand what you mean here.

>The 4 delay slots are due to the deep pipeline on the C6xxx family. It
>doesn't matter where the data is, the DSP uses the 4 delay slots to go
>through the gymnastics of fetching the data.

I understand that the 4 delay slots are for:

1] pg (data address generate)
2] ps (send to memory)
3] pw (wait for data to be ready)
4] pr (read)

Now my point is: if the value is *found* in L1D, where is the question of "sending to memory", "waiting for value to be ready" and "reading"? Why?? Remember the document says L1D CPU access time is 1 cycle? If every load, regardless of where the data is stored, is going to take 4 delay slots (hence 4 clock cycles), then what is the point in having a single cycle access memory?

-bhooshan
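
One reading that reconciles the two figures quoted in this post, offered as a sketch rather than as TI's wording: the 4 delay slots are pipeline latency that the programmer schedules around, while a cache miss adds stall cycles on top, during which the whole CPU waits. On that reading a "single cycle" L1D means zero stall cycles, not zero delay slots. The 4-cycle stall is the figure quoted from spru609a above; the cost of an external memory access is not given in the thread, so it is left as a parameter. The names below are hypothetical.

```python
DELAY_SLOTS = 4            # fixed by the CPU pipeline for LDB/LDH/LDW
L1D_MISS_L2_HIT_STALL = 4  # figure quoted from spru609a in the post


def cycles_until_usable(stall_cycles=0):
    """Cycles that pass from issuing a load until a dependent instruction can run.

    stall_cycles is 0 for an L1D hit; anything larger models a miss.
    """
    return 1 + DELAY_SLOTS + stall_cycles


l1d_hit = cycles_until_usable()                        # L1D hit, no stall
l2_hit = cycles_until_usable(L1D_MISS_L2_HIT_STALL)    # L1D miss that hits L2
```

In this model an L1D hit costs the load cycle plus 4 delay slots, and an L1D miss that hits L2 costs 4 stall cycles more. The delay slots can be filled with independent instructions; the stall cycles cannot.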
