Henrry,
Can you post the C code for this?
Are you sure that the C code isn't fast enough?
At 04:42 AM 4/1/2004 -0800, henrry wrote:
>Below are two answers to my previous email:
>
>>>My questions are:
>>>1. Do we always have to wait for 4 delay slots in order to get data
>>>using a Load instruction?
>
>>>Ganesh wrote:
> You don't always need to wait out the 4 delay slots
>of a LOAD instruction with NOPs, i.e. you needn't
>insert 4 NOPs after every load.
>
>Henrry writes: (regarding Ganesh's answer)
>According to Ganesh, not every load needs 4 NOPs
>inserted after it. That means that after the Load
>instruction I can put other instructions to run first.
>Example:
>
>LDW .D1T1 *A1, A2
>XXXX
>XXXX
>XXXX
>XXXX
>
>where XXXX is some other instruction, so here we don't
>need NOP 4.
Yes, that is correct. You just need to wait out the 4 delay slots before the
value at *A1 actually appears in A2. Any instruction in those slots that reads
A2 will still see its old value.
>In my application (image processing), where I work
>with a window mask (3x3 / 7x7), I need to load the data
>from memory first, for further processing. So I still
>have to wait until the window mask data (7x7 / 3x3) is ready
>in the registers (A/B) before further processing. So
>I think the 4 delay slots are still a bottleneck in my
>application. My application uses a 320 x 240 image,
>so any delay slot put in the loop will make the
>program slower. A simple example is my
>downsampling program: I first load a 2x2 window mask
>into registers. Until I have the image data, I
>can't process it, so I haven't...
>My program example:
> ADD .L1 Ex1_addr_A,idx_val_A,Ex1_val_A
>|| ADD .L2 Ex2_addr_B,idx_val_A,Ex2_val_B
> ADD .S1 Ey1_addr_A,idx_val_A,Ey1_val_A ; Get Ey1 address
>|| ADD .S2 Ey2_addr_B,idx_val_A,Ey2_val_B
>
> LDW .D1T1 *Ex1_val_A,Ex1_val_A ; Get Ex1 value
>|| LDW .D2T2 *Ex2_val_B,Ex2_val_B ; Get Ex2 value
> LDW .D1T1 *Ey1_val_A,Ey1_val_A ; Get Ey1 value
>|| LDW .D2T2 *Ey2_val_B,Ey2_val_B ; Get Ey2 value
> NOP 4
>
> ADD .L1 Ex1_val_A,Ex2_val_B,Ex_val_A
>|| ADD .L2 Ey1_val_A,Ey2_val_B,Ey_val_B
>
>This is the bottleneck of my program; I need to repeat
>it for every pixel of my image, so the delay slots will
>cost 4 x 320 x 240 cycles. This is just part of my program;
>it appears frequently in my other programs. Please correct me
>if I am wrong. Thx.
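For comparison, here is a minimal C sketch of the same computation, in case it
helps to see what the compiler would be given. It assumes the four operands
live in separate, non-overlapping 32-bit arrays; the function and array names
are hypothetical and not taken from Henrry's program. With `restrict` and
optimization enabled, a C6000 compiler is generally able to software-pipeline
a loop like this so that loads for later iterations fill the delay slots of
earlier ones. The helper `nop_cycles` only does the arithmetic from the
paragraph above (4 x 320 x 240 = 307200 cycles spent in NOP 4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the quoted assembly: for every index,
 * load Ex1, Ex2, Ey1, Ey2 and form Ex = Ex1 + Ex2, Ey = Ey1 + Ey2.
 * The restrict qualifiers promise that the arrays do not overlap, so the
 * compiler is free to start loads for later iterations early. */
void sum_gradients(const int32_t *restrict ex1, const int32_t *restrict ex2,
                   const int32_t *restrict ey1, const int32_t *restrict ey2,
                   int32_t *restrict ex, int32_t *restrict ey, size_t n)
{
    assert(ex1 && ex2 && ey1 && ey2 && ex && ey);
    for (size_t i = 0; i < n; ++i) {
        ex[i] = ex1[i] + ex2[i];
        ey[i] = ey1[i] + ey2[i];
    }
}

/* Cycles spent purely in NOPs if the hand-written loop body,
 * with its NOP count, runs once per pixel. */
unsigned long nop_cycles(unsigned long width, unsigned long height,
                         unsigned long nops_per_iter)
{
    return width * height * nops_per_iter;
}
```

Whether the compiled loop is fast enough would have to be checked against the
compiler's own loop feedback; this sketch makes no claim about the resulting
cycle count.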
>
>Par Ligander wrote:
> Yes, always. If it were memory-speed dependent, it would
>be very difficult to make any code binary portable.
>
>Henrry writes: (regarding Par Ligander's answer)
>Could you explain binary portability in more detail? Or,
>if you don't mind, could you give a simple example
>of it?
>>>2. Assume the data is located in L2 Cache (0x0000 0010). If I use
>>>the Load instruction to load data from 0x0000 0010, do I also
>>>need 4 delay slots?
>
>Ganesha wrote:
>You don't need 4 delay slots if it's in the cache.
>Kindly go through how L2 hits and misses work
>w.r.t. the cache and how they affect your execution.
>If you have multiple misses, they will be pipelined
>and your average delay will be less than 4.
>
>Henrry writes:
>According to your answer, I had the wrong idea
>about L2 Cache. I think the L2 Cache that I mentioned in
>my previous email is actually L2 SRAM, which is located from
>0x0000 0000 to 0x0001 0000.
Henrry, read SPRU656A.PDF and see if that helps.
>>>3. Assume the data is located in SDRAM (which is external memory on the
>>>DSK6711 board). Do I also need 4 delay slots?
>
>Yes, if you have a single load. If you have multiple
>loads, then my previous answer should provide you with
>enough information.
>>>4. I had planned to use QDMA to move a small block of data from
>>>SDRAM to L2 Cache to speed up my program. But it seems useless,
>>>because when the data is already in L2 Cache, I still need 4 delay slots
>>>to load the data from L2 Cache, which is the same number of delay slots
>>>as when loading data from SDRAM. So in my opinion, I should just load
>>>directly from SDRAM instead of using QDMA to move the data to L2
>>>Cache. I think that if there were no delay slots when loading data
>>>from L2 cache, the QDMA would be useful. Any correction?
>
>I don't understand how you can perform a DMA into the cache
>area. I guess the cache is controlled by the cache controller,
>and to the best of my knowledge, you don't program
>writes into a memory area configured as cache. You
>might use DMA to transfer data from SDRAM to L2 SRAM,
>but not to cache. Kindly cross-check and cross-verify.
>Cache is used to hold the memory segment that your
>program is currently accessing and in all probability
>will be accessing in the near future.
>Its contents are determined by the logic of your software,
>that is, by the way memory is accessed, be it program
>or data memory.
>You never write anything to it explicitly. The cache
>controller uses an LRU (Least Recently Used) algorithm to
>update the cache lines. You need to maintain coherency,
>for which you need to perform a cache clean.
>In a nutshell, the only thing under your control is
>cleaning the cache. To gain very good cache performance, you need
>to restructure your program suitably.
>
>Henrry writes:
>About the cache, I think the C6711 provides two levels of cache,
>which are L1P and L2. According to your explanation
>about cache, neither cache can be accessed directly by
>our program; they are driven by the logic of our
>program. So the cache is used automatically by the cache
>controller as temporary storage whenever there is a load
>or store instruction.
>
>Par Ligander wrote:
>Correct, QDMA cannot reduce the load delay. You can
>use a DMA scheme as you suggest to reduce the number of
>stall cycles inflicted by the slow external memory, but
>the cache is in most cases a better mechanism to do that.
>
>Henrry writes:
>I have not fully understood your answer. Do you
>mean that I could still use a DMA scheme to move data
>from external memory (SDRAM) to L2 SRAM to reduce the
>stall cycles? Correct me if I am wrong.
Henrry, on the C6711 you can select the size of the L2 cache, up to a maximum of
64 Kbytes. For example, you could have 32 Kbytes of SRAM and 32 Kbytes of L2
cache. In that case you could QDMA into the SRAM region. You cannot QDMA into
L2 cache memory. I would think that, in all probability, you could just set the
whole SRAM to be L2 cache and let the processor take care of fetching and
caching memory.
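
If part of L2 is kept as SRAM, the block-wise scheme discussed above could
look roughly like the sketch below. It is only an illustration under stated
assumptions: `fetch_tile` is a hypothetical stand-in that uses memcpy where a
real program would submit a QDMA transfer, `TILE_BYTES` is an arbitrary size,
and the `tile` buffer is merely assumed to be placed in the SRAM region. No
real DMA or chip-support call is shown, and the transfer is not overlapped
with the processing as a real double-buffered scheme would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_BYTES 1024  /* hypothetical tile size, to be tuned to the SRAM region */

/* Stand-in for a QDMA transfer from SDRAM to L2 SRAM. memcpy is used
 * only so that the sketch runs anywhere; it is not a DMA. */
static void fetch_tile(uint8_t *sram_dst, const uint8_t *sdram_src, size_t len)
{
    memcpy(sram_dst, sdram_src, len);
}

/* Sum all pixels of an image by pulling it through a small buffer,
 * one tile at a time, and working only on the local copy. */
uint32_t sum_pixels_tiled(const uint8_t *image, size_t len)
{
    static uint8_t tile[TILE_BYTES];  /* assumed to live in L2 SRAM */
    uint32_t sum = 0;

    assert(image != NULL || len == 0);
    for (size_t off = 0; off < len; off += TILE_BYTES) {
        size_t chunk = (len - off < TILE_BYTES) ? (len - off) : TILE_BYTES;
        fetch_tile(tile, image + off, chunk);
        for (size_t i = 0; i < chunk; ++i)
            sum += tile[i];
    }
    return sum;
}
```

With the whole of L2 configured as cache instead, the inner loop could read
`image` directly and the copy step would disappear, which is the simpler
option suggested above.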
- Andrew E.
>=====
>Best Regards,
>
>Henrry Andrian - Researcher
>ISCI Lab (http://isci.cn.nctu.edu.tw)
>Office Ph. +886 3 5712121 ext: 54358
>Mobile Ph. +886 931198986
>National Chiao Tung University (http://www.nctu.edu.tw)
>Hsinchu - Taiwan, ROC