EDMA data cache problem

Started by carlferns August 3, 2006
William/Jeff,

> I think Carl wants to do it this way because he's got a large amount of
> external SDRAM data (16 Mbyte) and he's sort of "double buffering": moving
> large slices to internal memory while the CPU is chugging away at another
> slice. This method might better utilize CPU internal memory bus bandwidth
> and keep both CPU and DMA units busy (hopefully).

That is correct. I intend to implement the ping/pong scheme that the TI
documentation suggests as good practice, but I still haven't gotten to it.
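
For readers following along, the ping/pong idea can be sketched roughly as below. This is a host-side illustration only, not TI code: `dma_start` and `dma_wait` are hypothetical stand-ins (here they just `memcpy`), `CHUNK` is an arbitrary small size standing in for the 96k/128k slices, and `process` mimics the "set all pixels to white" test described later in the thread. On a real target the two stand-ins would be replaced by whatever asynchronous copy and completion-wait calls your DMA driver provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 1024u   /* bytes per slice; illustrative stand-in for 96k/128k */

/* Hypothetical stand-ins for an asynchronous DMA engine. On the host they
 * copy immediately; on a target they would submit a transfer and later
 * block until that transfer has completed. */
typedef int xfer_id;

static xfer_id dma_start(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    return 0;
}

static void dma_wait(xfer_id id)
{
    (void)id;   /* nothing to wait for in the host simulation */
}

/* The "work" step: set every byte (pixel) to white. */
static void process(uint8_t *buf, size_t n)
{
    memset(buf, 255, n);
}

/* Process `total` bytes of external memory in place, one CHUNK at a time,
 * using two internal buffers so one can be filled while the other is
 * being processed. */
void pingpong_process(uint8_t *ext, size_t total)
{
    static uint8_t internal[2][CHUNK];   /* ping and pong buffers */
    size_t nchunks = total / CHUNK;
    size_t i;
    xfer_id in_id, out_id = 0;
    int have_out = 0;

    assert(total % CHUNK == 0);
    if (nchunks == 0)
        return;

    in_id = dma_start(internal[0], ext, CHUNK);      /* prefetch slice 0 */
    for (i = 0; i < nchunks; i++) {
        uint8_t *cur = internal[i & 1];
        uint8_t *nxt = internal[(i + 1) & 1];

        dma_wait(in_id);          /* slice i must be fully in place */
        if (have_out)
            dma_wait(out_id);     /* nxt was the source of the last write-back */
        if (i + 1 < nchunks)
            in_id = dma_start(nxt, ext + (i + 1) * CHUNK, CHUNK);

        process(cur, CHUNK);      /* CPU touches only the internal buffer */
        out_id = dma_start(ext + i * CHUNK, cur, CHUNK);
        have_out = 1;
    }
    dma_wait(out_id);             /* last write-back must finish too */
}
```

The point of the two `dma_wait` calls at the top of the loop is that the CPU never reads a buffer before its inbound transfer has finished, and never lets a buffer be refilled while its outbound transfer is still in flight.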

> Although I still don't fully understand why Carl doesn't DMA directly
> into cache space...

Not sure I understood. DMA to cache? You mean SDRAM (EMIF), right?

> Otherwise you're right -- don't use EDMA between SDRAM and internal
> memory, let the CPU do the work, and keep cache enabled.

William, if the DSP alone accesses memory, you will never have a
cache coherency problem.

Regards,
C

--- In c..., Jeff Brower wrote:
>
> William-
>
> > I have what may be a stupid question, but since I'm having some issues
> > programming a 6713 system that may be similar, it's better to ask the
> > question and get clarification.
> >
> > If you enable the cache controller on the DSP, do you really need to make
> > any other cache calls unless you want to free up the L2 RAM that the cache
> > is using? My limited understanding of caching is that once it is enabled,
> > you access memory using standard memory access commands, and the cache
> > controller optimizes what it thinks it needs to do, in the chunk sizes it
> > wants to. Issuing a command to invalidate and flush the cache gives you a
> > controlled state of knowing when the cache has been flushed, but should
> > not be necessary.
> >
> > If you wanted complete control of what was in the L2 RAM versus the
> > external RAM, you'd be better off disabling cache altogether, freeing up
> > the cache RAM for general purpose use, and paging the data in manually
> > to do your manipulations.
> >
> > Is the 6416 processor significantly different from the 6713 in how it
> > works with cache? Am I completely off base in how to deal with a cache
> > controller?
>
> When you use EDMA to move data between external memory and internal SRAM
> (not cache), the CPU doesn't "know" the internal memory has been changed;
> i.e. there is no snooping. Code has to manually invalidate that area of
> cache. I think Carl wants to do it this way because he's got a large
> amount of external SDRAM data (16 Mbyte) and he's sort of "double
> buffering": moving large slices to internal memory while the CPU is
> chugging away at another slice. This method might better utilize CPU
> internal memory bus bandwidth and keep both CPU and DMA units busy
> (hopefully).
>
> Although I still don't fully understand why Carl doesn't DMA directly
> into cache space...
>
> Otherwise you're right -- don't use EDMA between SDRAM and internal
> memory, let the CPU do the work, and keep cache enabled.
>
> -Jeff
>
> > carlferns wrote:
> > >
> > > Jeff,
> > >
> > > Thanks for your comments.
> > > I guess I was not very clear with what I am doing.
> > > My issue is with DMA and the cache coherency.
> > > I have to process a lot more data (16 MB) in SDRAM and am slicing it up
> > > to be processed in ISRAM.
> > > What I keep seeing is that despite invalidating L2, the output data in
> > > SDRAM at the very end (having processed all the input data from SDRAM)
> > > is corrupt. The only processing going on is, as I mentioned earlier,
> > > STEPS 1-5.
> > > If I put a breakpoint and view memory at any stage between Steps 1-5,
> > > the debugger seems to handle the cache correctly and the output data
> > > is good.
> > >
> > > Conclusion - Cache controller is acting up or the API is not doing
> > > what it is supposed to do.
> > >
> > > >> If the issue is that at some other time you need internal memory for
> > > >> another reason, then I would first try with L2 data cache enabled
> > > >> and no EDMA.
> > >
> > > Here's what I have tried thus far.
> > > NO EDMA and NO cache - the algorithm works great.
> > > NO EDMA and ENABLED cache - no problem, since I do not use any of the
> > > caching API.
> > > ENABLED EDMA and ENABLED cache - the output data is bad.
> > >
> > > What I have noticed is it is more of a cache issue rather than a DMA
> > > problem since the data can be verified. It is just that without the
> > > CPU intervening and using the CACHE API, the data gets distorted.
> > >
> > > Thanks,
> > > C
> > >
> > > P.S : How do I get the posts to show up in this group as a continuous
> > > thread and without the wait.... that would be really cool.
> > >
> > >
> > > --- In c..., "Jeff Brower" wrote:
> > > >
> > > > Carl-
> > > >
> > > > > I have a real big problem with EDMA and cache coherency.
> > > > > Board :6416 Spectrum digital
> > > > > Here's what I am doing.
> > > > > 1) Transfer data from SDRAM to ISRAM.
> > > > > 2) Work with data in ISRAM
> > > > > 3) Transfer back to SDRAM
> > > > >
> > > > > 4) Repeat process for next block in SDRAM to same block in ISRAM
> > > > > .
> > > > > .
> > > > > 5) Finally use SDRAM data.
> > > > >
> > > > > Blocks are 128-byte aligned in ISRAM and SDRAM and processed in
> > > > > chunks that are multiples of 128 bytes (actually 96k).
> > > > > L2 cache is 128k and enabled.
> > > > >
> > > > > Now before transfer from SDRAM to ISRAM in step 1, I always call
> > > > > CACHE_wbInvL2 (SDRAM block, block size, CACHE_WAIT). I think that
> > > > > should be enough for cache coherency because the docs say L1D is
> > > > > handled by EDMA. Also, the ISRAM block is always cache coherent.
> > > > > Correct?
> > > > >
> > > > > But the data gets all screwed up....
> > > > >
> > > > > Logically, I think I am doing things right.
> > > > >
> > > > > Is there a way to check cache coherency without the debugger, and
> > > > > without comparing memory via the CPU? That check in itself pulls
> > > > > the data into L2/L1D cache.
> > > > > It looks like EDMA is doing its job but caching isn't.
> > > > >
> > > > > What may be the problem here.... Appreciate some ideas.
> > > >
> > > > I think Guy is asking a good question. Won't 96k x 32 fit in internal
> > > > memory for 6416? So why use SDRAM and EDMA?
> > > >
> > > > If the issue is that at some other time you need internal memory for
> > > > another reason, then I would first try with L2 data cache enabled and
> > > > no EDMA. The first time through your data loop you lose the speed
> > > > advantage of EDMA, but subsequent times your performance is just as
> > > > good. And more importantly, that mode forces you to make absolutely
> > > > sure your data is organized in the most efficient manner, and you have
> > > > "thought through" exactly the sequence that data moves and cache is
> > > > used.
> > > >
> > > > Then, enable EDMA as your last step. The performance gain it's going
> > > > to give you in this situation is minimal, so you should get it working
> > > > last.
> > > >
> > > > -Jeff
Carl-

> If it is programmer error, I sure enough will change the code, but I'd
> like to understand the reason. I also noticed there was a bug with the
> cache API wherein it needed to be called twice to work correctly.
> Unfortunately, after poring through the forum, the general consensus is
> to use a form of CACHE_CLEAN_ALL_L2, which seems like overkill.
>
> Here's the pseudo code for this test exercise using first principles.
> 1) Read a 16 MB file into SDRAM (a simple image that is all zeros).
> 2) Loop for each 128k chunk in the 16 MB SDRAM:
> Invalidate the SDRAM 128k chunk address in L2 cache, size 128k.
> Synchronous transfer of 128k to IRAM from SDRAM using EDMA.

Invalidate step not needed for C64x, since EDMA destination is L2 SRAM. Cache
controller will snoop this and invalidate for you.

> //Verify IRAM and SDRAM block data by calling a routine similar to
> what one would see in the sample DAT examples
> Change IRAM data - in this case just change all the pixels to white
> (255).

Ok as long as code *only* accesses IRAM (L2 SRAM).

> Synchronous transfer of the 128k back to the same address in SDRAM from
> IRAM using EDMA.
> //Verify IRAM and SDRAM block data by calling a routine similar to
> what one would see in the sample DAT examples
> 3) Write out the 16 MB file from SDRAM to file
>
> The exact locations of the resulting bad data in SDRAM are random but
> consistently within the first 128 bytes of every block.

This sounds like a synchronization issue. What if the cache line gets invalidated
once DMA "touches" its associated address in L2 SRAM, but your code somehow accesses
that line before DMA is finished? Only the first line is affected, after that DMA is
faster than your code so the rest of the block looks normal.

-Jeff
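
One way to test the "only the first line is affected" idea is to scan the final output and classify where the bad bytes fall within each block. The sketch below is a hypothetical helper, not something from the thread or from CSL: `summarize_bad` and `bad_summary` are illustrative names, and `LINE_BYTES` assumes the 128-byte L2 line size of the C64x. Keep in mind that a scan done by the CPU sees memory through the cache, so it reports the CPU's view of the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 128u   /* assumed C64x L2 cache line size */

typedef struct {
    size_t first_line_only;  /* blocks whose bad bytes are all in the first line */
    size_t elsewhere;        /* blocks with at least one bad byte past the first line */
} bad_summary;

/* Scan `total` bytes in blocks of `block` bytes and count, per block,
 * whether bytes differing from `expected` are confined to the first
 * cache line or also appear further into the block. */
bad_summary summarize_bad(const uint8_t *buf, size_t total,
                          size_t block, uint8_t expected)
{
    bad_summary s = {0, 0};
    size_t base, off;

    assert(block >= LINE_BYTES && total % block == 0);

    for (base = 0; base < total; base += block) {
        int head = 0, tail = 0;
        for (off = 0; off < block; off++) {
            if (buf[base + off] != expected) {
                if (off < LINE_BYTES)
                    head = 1;
                else
                    tail = 1;
            }
        }
        if (tail)
            s.elsewhere++;
        else if (head)
            s.first_line_only++;
    }
    return s;
}
```

If `elsewhere` stays at zero across many runs while `first_line_only` does not, that is consistent with a timing problem at the start of each transfer rather than a general coherency failure.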
> --- In c..., Jeff Brower wrote:
> >
> > Carl-
> >
> > > I guess I was not very clear with what I am doing.
> > > My issue is with DMA and the cache coherency.
> > > I have to process a lot more data (16 MB) in SDRAM and am slicing it up
> > > to be processed in ISRAM.
> > > What I keep seeing is that despite invalidating L2, the output data in
> > > SDRAM at the very end (having processed all the input data from SDRAM)
> > > is corrupt. The only processing going on is, as I mentioned earlier,
> > > STEPS 1-5.
> > > If I put a breakpoint and view memory at any stage between Steps 1-5,
> > > the debugger seems to handle the cache correctly and the output data
> > > is good.
> > >
> > > Conclusion - Cache controller is acting up or the API is not doing
> > > what it is supposed to do.
> >
> > Or programmer error. I know, I know, not what you want to hear... but
> > 1000s of engineers have used C64x EDMA and cache over the last few years.
> >
> > > What I have noticed is it is more of a cache issue rather than a DMA
> > > problem since the data can be verified. It is just that without the
> > > CPU intervening and using the CACHE API, the data gets distorted.
> >
> > What do you mean "output data in SDRAM at the very end"? End of what?
> > Each block? Or end of a bunch of blocks that consume all of SDRAM? If
> > it's just the last block or so, then what happens if you reduce your data
> > set to use only 1/2 of SDRAM? If the situation still occurs, then I might
> > say it's a "boundary condition" type of error, which usually implies an
> > application / programmer issue rather than something else.
> >
> > > P.S : How do I get the posts to show up in this group as a continuous
> > > thread and without the wait.... that would be really cool.
> >
> > The group is moderated so posts can take a while to appear. That's a good
> > thing or the group would die to spam, but this group has been strong
> > since 1999.
> >
> > -Jeff
Carl-

> > Although I still don't fully understand why Carl doesn't DMA directly
> > into cache space...
>
> Not sure I understood. DMA to cache? You mean SDRAM (EMIF), right?

Sorry I was confusing -- I just mean let the CPU and cache controller do the work.
If your code *always, without exception* accesses L2 SRAM, and you DMA *only* between
SDRAM and L2 SRAM (ping-pong buffers located in L2 SRAM), then the CPU should handle
all cache coherency issues for you. You should not need to make any cache API calls
in CSL.
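
Keeping both ping-pong buffers in L2 SRAM only works if they fit in whatever part of L2 is not configured as cache. A quick budget check might look like the sketch below. `buffers_fit` is a hypothetical helper, and the numbers in the usage note assume 1024 KB of total L2 on the C6416 (worth confirming against the device data sheet) together with the 128k cache setting mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define KB ((size_t)1024)

/* Returns 1 if `nbuf` buffers of `buf_bytes` each fit in the part of L2
 * that remains plain SRAM after `l2_cache` bytes are configured as cache
 * and `reserved` bytes are set aside for code, stack and other data. */
int buffers_fit(size_t l2_total, size_t l2_cache,
                size_t nbuf, size_t buf_bytes, size_t reserved)
{
    size_t sram;

    assert(l2_cache <= l2_total);
    sram = l2_total - l2_cache;       /* L2 left over as addressable SRAM */
    if (reserved > sram)
        return 0;
    return nbuf * buf_bytes <= sram - reserved;
}
```

Under those assumptions, `buffers_fit(1024 * KB, 128 * KB, 2, 128 * KB, 0)` returns 1: two 128k buffers need 256k, and 896k of SRAM is left after the cache is carved out.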

Probably you have looked this over already, but just in case:

http://focus.ti.com/lit/ug/spru656a/spru656a.pdf

-Jeff
