TMS320DM642 8bit QDMA transfers and subsampling

Started by Mark Robinson April 6, 2005
Hi all.

I'm using a DM642 to capture a PAL frame from a video port as 8 bit
YCbCr. I want to subsample this frame by half, but I can't use the
video port scaler because I need the full frame too. I tried this
using 8 bit QDMA transfers set up as indexed source/incremented dest.
The problem is that this is probably the most inefficient use of
the EDMA engine possible, and the whole thing grinds to a halt
(bearing in mind that the EDMA is already loaded by servicing the
video ports).

I wondered if it might be better to do 32 bit QDMA transfers into
internal memory, do the subsampling, and transfer back to external
heap memory. Is this likely to be better, or has anyone any other
suggestions?

Cheers

mark-r

-- 
"Let's meet the panel. You couldn't ask for four finer comedians -
so that answers your next question..."
 -- Humphrey Lyttleton
> Hi all.
>
> I'm using a DM642 to capture a PAL frame from a video port as 8 bit
> YCbCr. I want to subsample this frame by half, but I can't use the
> video port scaler because I need the full frame too. I tried this
> using 8 bit QDMA transfers set up as indexed source/incremented dest.
> The problem is that this is probably the most inefficient use of
> the EDMA engine possible, and the whole thing grinds to a halt
> (bearing in mind that the EDMA is already loaded by servicing the
> video ports).
>
> I wondered if it might be better to do 32 bit QDMA transfers into
> internal memory, do the subsampling, and transfer back to external
> heap memory. Is this likely to be better, or has anyone any other
> suggestions?
>
> Cheers
>
> mark-r
I am assuming that you are trying to subsample the frame in the horizontal direction, e.g. 640x480 comes out as 320x480. You are right that you can do this with the EDMA engine alone, but it will be so inefficient that it is infeasible. The reason is that the EDMA (and the QDMA, by the way) is optimized for 32-bit transfers and for contiguous data streams. If you put gaps between the data elements to be transferred, the EDMA engine gets much slower (it actually submits a transfer request for each element).

Your best bet is to bring large chunks of the image into internal memory, do the subsampling locally in the register file, and transfer the subsampled image back out to external memory. You can do this in a double-buffering scheme using two QDMAs, with one of them always in flight (i.e. the DMA transfers are interleaved with the CPU operations):

    hinCurrent = DAT_copy(currentSlice);
    while ( there are more slices ) {
        if ( more slices needed ) {
            hinNext = DAT_copy(nextSlice);
        }
        DAT_wait(hinCurrent);
        slice = subsample(currentSlice);
        DAT_wait(houtCurrent);
        houtCurrent = QDMA(slice);
        // rotate handles
    }

(I omitted most of the parameters.) This will give you much better efficiency.

This message was sent using the Comp.DSP web interface on www.DSPRelated.com
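For anyone who wants to try the double-buffering idea on a PC first, here is a minimal host-side sketch. It is not the poster's code: `dma_start()` and `dma_wait()` are hypothetical stand-ins built on `memcpy` (on the DSP they would be replaced by the real asynchronous copy and wait calls), `SLICE_BYTES` is an arbitrary slice size, and the subsampling is plain 2:1 decimation of a single plane (for example luma only) with no filtering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLICE_BYTES 1024  /* hypothetical slice size; must be even */

/* Hypothetical stand-ins for an asynchronous DMA copy and its wait call.
   Here the copy completes immediately, so the handle is a dummy. */
static unsigned dma_start(const void *src, void *dst, size_t n)
{
    memcpy(dst, src, n);
    return 0u;
}
static void dma_wait(unsigned handle) { (void)handle; }

/* Keep every second byte of src (n must be even); returns bytes written. */
static size_t subsample2(const uint8_t *src, uint8_t *dst, size_t n)
{
    size_t i;
    assert(n % 2 == 0);
    for (i = 0; i < n / 2; i++)
        dst[i] = src[2 * i];
    return n / 2;
}

/* Halve `total` bytes from ext_in into ext_out, slice by slice, with two
   input and two output buffers so transfers can overlap the CPU work. */
static void subsample_frame(const uint8_t *ext_in, uint8_t *ext_out, size_t total)
{
    static uint8_t in_buf[2][SLICE_BYTES];      /* "internal memory" */
    static uint8_t out_buf[2][SLICE_BYTES / 2];
    unsigned h_in[2] = {0u, 0u}, h_out[2] = {0u, 0u};
    size_t nslices, s;

    assert(total % SLICE_BYTES == 0);
    nslices = total / SLICE_BYTES;
    if (nslices == 0)
        return;

    h_in[0] = dma_start(ext_in, in_buf[0], SLICE_BYTES);
    for (s = 0; s < nslices; s++) {
        size_t cur = s & 1u, nxt = cur ^ 1u, out_n;

        if (s + 1 < nslices)   /* prefetch the next slice */
            h_in[nxt] = dma_start(ext_in + (s + 1) * SLICE_BYTES,
                                  in_buf[nxt], SLICE_BYTES);

        dma_wait(h_in[cur]);   /* current input slice has arrived */
        if (s >= 2)
            dma_wait(h_out[cur]);  /* out_buf[cur] was sent two slices ago */

        out_n = subsample2(in_buf[cur], out_buf[cur], SLICE_BYTES);
        h_out[cur] = dma_start(out_buf[cur],
                               ext_out + s * (SLICE_BYTES / 2), out_n);
    }
    dma_wait(h_out[0]);        /* drain the outstanding output transfers */
    dma_wait(h_out[1]);
}
```

The point of the two output buffers is that the wait on an output handle only happens just before that buffer is overwritten, so a write-back can still be in flight while the next slice is being processed.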
 > [DAT_copy() function]

be careful with the caching ...


I tried to change my processing routine to fetch some data with EDMA and 
did not touch the writing back (one step at a time) ... ehh - gave 
strange results ...

sometimes you manually need to _wb-invalidate some cache_

after figuring out the problems I finally found the right document:
For my DSP it's the document spru610: "TMS320C64x DSP Two-Level Internal 
Memory Reference Guide"


bye,
Michael
PS: the original posting did not reach me :-/
[ Thanks to mbelge for a good suggestion ]

Michael Schoeberl wrote:
> > be careful with the caching ...
If you do a DMA from external to internal memory, or vice versa, the cache is not touched at all. So, if any part of your external memory is cached, the cached copy may become stale. I guess the way to avoid this is to avoid caching your buffer, by not accessing it with the CPU at all (you also need to align buffers on cache line boundaries).
> PS: the original posting did not reach me :-/
I sent it over a month ago, so I guess it will have expired from some places!

Cheers

mark-r

-- 
"Let's meet the panel. You couldn't ask for four finer comedians -
so that answers your next question..."
 -- Humphrey Lyttleton
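On the cache-line alignment point, a small sketch of the address arithmetic may help. The 128-byte `CACHE_LINE` value is an assumption (check the line size for your device and cache level), and the helper names are made up for illustration. The idea is that a DMA buffer should start on a line boundary and be padded to a whole number of lines, so that no cache line is shared between the buffer and unrelated data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128u  /* assumed line size in bytes; check your device */

/* Round n up to the next multiple of CACHE_LINE. */
static size_t round_up_line(size_t n)
{
    return (n + CACHE_LINE - 1u) & ~(size_t)(CACHE_LINE - 1u);
}

/* Return the first cache-line-aligned address at or after p. */
static uint8_t *align_to_line(uint8_t *p)
{
    uintptr_t a = (uintptr_t)p;
    a = (a + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u);
    return (uint8_t *)a;
}

/* Carve an aligned, line-padded region of at least `need` bytes out of a
   raw pool. The padded size is returned through *padded. */
static uint8_t *carve_dma_buffer(uint8_t *pool, size_t pool_size,
                                 size_t need, size_t *padded)
{
    uint8_t *start = align_to_line(pool);
    *padded = round_up_line(need);
    assert((size_t)(start - pool) + *padded <= pool_size);
    return start;
}
```

On the target a compiler alignment directive or an aligned allocator would normally do this job; the arithmetic above only shows what "aligned and padded" means.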
> I guess the way to avoid
> it is to avoid cacheing your buffer, by not accessing it with the CPU
> at all (you also need to align buffers on cache line boundaries).
That's the easy solution, but it is not necessary ... I'll just describe what I'm doing on my C6416 (I'm sure someone else out there is looking for this like I was ;-)

My data is in external SDRAM (it was previously used and might still be in L2) and I want to put it into ISRAM for fast processing.

- call CACHE_wbinvL2 on your data; this writes back L1D and L2 and invalidates both (!) for the data ...
- call CACHE_wbinvL1 on your destination ...
- call DAT_COPY to transfer the data

The function CACHE_wbinvL2 did not work for invalidating huge data arrays (600 kByte) at a time, but it works in small chunks ... I guess the problem is the limited register that passes the size to the DMA controller: there is a limit of 256k bytes ... (the API reference guide spru401f does not mention this, it's only in the spru610 document)

bye,
Michael
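The "small chunks" workaround described above could look roughly like this. It is only a sketch: the cache operation is passed in as a function pointer so the loop can be checked on a PC, `MAX_CACHE_OP_BYTES` is a hypothetical chunk size chosen to stay well under the 256k limit mentioned in the post, and `stub_cache_op()` is a counting stub, not a real cache call. On the DSP a thin wrapper around the actual writeback-invalidate function would be passed instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk size, kept below the 256k limit reported above. */
#define MAX_CACHE_OP_BYTES (64u * 1024u)

typedef void (*cache_op_fn)(void *block, size_t bytes);

/* Apply `op` to [buf, buf + bytes) in pieces of at most
   MAX_CACHE_OP_BYTES. Returns the number of calls made. */
static unsigned cache_op_chunked(void *buf, size_t bytes, cache_op_fn op)
{
    uint8_t *p = (uint8_t *)buf;
    unsigned calls = 0;

    assert(op != NULL);
    while (bytes > 0) {
        size_t n = bytes < MAX_CACHE_OP_BYTES ? bytes : MAX_CACHE_OP_BYTES;
        op(p, n);
        p += n;
        bytes -= n;
        calls++;
    }
    return calls;
}

/* Host-side stub standing in for the real cache operation: it only
   records how much it was asked to handle. */
static size_t stub_total, stub_largest;
static void stub_cache_op(void *block, size_t bytes)
{
    (void)block;
    stub_total += bytes;
    if (bytes > stub_largest)
        stub_largest = bytes;
}
```

For a 600 kByte array this issues ten calls, none larger than the chunk size, instead of one oversized request.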
Michael Schoeberl wrote:
> > > it is to avoid cacheing your buffer, by not accessing it with the CPU
> thats the easy solution - but this is not necessary ... I'll just
It is necessary within an XDAIS algorithm, since you're not allowed to fiddle with the cache. One mistake I have made is to use memcpy (because I couldn't be bothered to implement IDMA2) on a buffer that a previous algorithm in the channel had DMAed. Disastrous!
> describe what I'm doing on my C6416
[snip]

All filed away in my "things that are bound to come in useful" folder, thanks.

Cheers

mark-r

-- 
"Let's meet the panel. You couldn't ask for four finer comedians -
so that answers your next question..."
 -- Humphrey Lyttleton