DSPRelated.com
Forums

read w/o cache on blackfin?

Started by Andre December 14, 2012
Hi all,

I have the following issue.

An ADI Blackfin processor has SDRAM attached, and data cache enabled.
When I read from SDRAM, cache would normally increase performance.

However, I am accessing some data at random, reading only a single byte 
or a few bytes in a row. Speculative reads of the cache engine seem to eat 
performance in this case quite significantly.

I can of course declare memory regions to be non-cacheable.
However, it would be more elegant (and portable) from my point of view 
to have a way to access SDRAM from code without caching.

Any hints?

Best regards,

Andre
"Andre" <lodwig@pathme.de> wrote in message 
news:kaf453$mac$1@speranza.aioe.org...
> Hi all,
>
> I have the following issue.
>
> A ADI blackfin processor has SDRAM attached, and data cache enabled.
> When I read from SDRAM, cache would normally increase performance.

Indeed. Non-cached RAM access by the core is very inefficient on Blackfin.

> However, I am accessing some data on a random base, only reading single or
> few bytes in a row. Speculative reads of the cache engine seem to eat
> performance in this case quite significantly.

1. Speculative reads and cache line fetches are unrelated things.
2. It would be quite unusual for the cache not to improve the performance of a real DSP job (as opposed to an artificial problem created to mess with the cache). What are you doing?

> I can of course declare memory regions to be not cacheable.
> However, it would be more elegant (and portable) from my point of view to
> have a way to accesss SDRAM with no caching from code.
>
> Any hints?

It is a very unlucky and unlikely situation if the cache does not pay for itself. The best thing would be to rearrange your algorithm and/or your data so that the RAM is accessed in a more or less local or sequential manner.

Vladimir Vassilevsky
DSP and Mixed Signal Consultant
www.abvolt.com
Vladimir Vassilevsky <nospam@nowhere.com> wrote:
 
> "Andre" <lodwig@pathme.de> wrote in message
> news:kaf453$mac$1@speranza.aioe.org...

(snip)

>> A ADI blackfin processor has SDRAM attached, and data cache enabled.
>> When I read from SDRAM, cache would normally increase performance.

> Indeed. Non-cached RAM acess by core is very inefficient on BlackFin.

>> However, I am accessing some data on a random base, only reading single or
>> few bytes in a row. Speculative reads of the cache engine seem to eat
>> performance in this case quite significantly.

> 1. Speculative read and cache line fetch are unrelated things.
> 2. It would be quite unusual if cache does not improve performance of a real
> DSP job (as opposed to artificially created problem to mess up with cache).
> What are you doing?

(snip)

> It is very unlucky and unlikely situation if the cache does not recoup
> itself.
> The best thing would be rearrange your algorithm and/or your data so the RAM
> would be accessed in more or less local or sequential manner.

I don't know about the OP, but I was once working with very large
state machines, maybe hundreds of millions of states. Pretty much,
each transition was to a line that wasn't in cache.

(I was doing it on an IA32 machine, but the problem would be the same
for any other cached memory system.)

Now, the executable code would be a very small loop, which I hope would
be in cache.

-- glen
Dear all,

thanks a lot for the hints.
I have ended up declaring a segment to be not cached and can live with this.

The scenario is as follows:
I want to generate several sine signals at different frequencies from 
within an ISR at audio rate (48 kHz).
I do this with a 32-bit state (phase) value per signal, which I increment 
by a value that is calculated from the desired frequency. The state is then 
shifted 17 bits right to give a 15-bit index into a 32k-entry sine table.

Putting this table to non-cached SDRAM increased performance by a factor 
of about 5.
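
For readers following along, the scheme described above can be sketched in C roughly as follows. This is only an illustration: the names `oscillator_t`, `phase_increment` and `oscillator_next` are made up, the increment formula (frequency times 2^32 divided by the sample rate) is the usual phase-accumulator convention and is assumed rather than taken from the post, and the table is passed in so that its placement in memory stays a linker/CPLB matter.

```c
#include <stdint.h>

#define SINE_TABLE_BITS 15u                      /* 32k-entry table, as in the post */
#define SINE_TABLE_SIZE (1u << SINE_TABLE_BITS)
#define SAMPLE_RATE_HZ  48000u                   /* audio rate from the post */

/* Hypothetical per-signal state: a 32-bit phase and its per-sample step. */
typedef struct {
    uint32_t phase;
    uint32_t increment;
} oscillator_t;

/* Assumed formula: increment = freq * 2^32 / fs (valid for freq < fs). */
static uint32_t phase_increment(uint32_t freq_hz)
{
    return (uint32_t)(((uint64_t)freq_hz << 32) / SAMPLE_RATE_HZ);
}

/* One sample: shift the phase right by 17 bits to get a 15-bit index. */
static int16_t oscillator_next(oscillator_t *osc, const int16_t *sine_table)
{
    int16_t sample = sine_table[osc->phase >> (32u - SINE_TABLE_BITS)];
    osc->phase += osc->increment;                /* wraps modulo 2^32 */
    return sample;
}
```

With several oscillators at unrelated frequencies, consecutive lookups land far apart in the 64 KB table, which is the random access pattern the thread is about.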

best regards,

Andre


On 14.12.2012 21:32, glen herrmannsfeldt wrote:
> Vladimir Vassilevsky <nospam@nowhere.com> wrote:
>
>> "Andre" <lodwig@pathme.de> wrote in message
>> news:kaf453$mac$1@speranza.aioe.org...
>
> (snip)
>>> A ADI blackfin processor has SDRAM attached, and data cache enabled.
>>> When I read from SDRAM, cache would normally increase performance.
>
>> Indeed. Non-cached RAM acess by core is very inefficient on BlackFin.
>
>>> However, I am accessing some data on a random base, only reading single or
>>> few bytes in a row. Speculative reads of the cache engine seem to eat
>>> performance in this case quite significantly.
>
>> 1. Speculative read and cache line fetch are unrelated things.
>> 2. It would be quite unusual if cache does not improve performance of a real
>> DSP job (as opposed to artificially created problem to mess up with cache).
>> What are you doing?
>
> (snip)
>
>> It is very unlucky and unlikely situation if the cache does not recoup
>> itself.
>> The best thing would be rearrange your algorithm and/or your data so the RAM
>> would be accessed in more or less local or sequential manner.
>
> I don't know about the OP, but I was once working with very large
> state machines, maybe hundreds of millions of states. Pretty much,
> each transition was to a line that wasn't in cache.
>
> (I was doing it on an IA32 machine, but the problem would be the same
> for any other cached memory system.)
>
> Now, the executable code would be a very small loop, which I hope would
> be in cache.
>
> -- glen
"Andre" <lodwig@pathme.de> wrote in message 
news:kav7g9$6dn$1@speranza.aioe.org...
> Dear all,
>
> thanks a lot for the hints.
> I have ended up declaring a segment to be not cached and can live with
> this.
>
> The scenario is as follows:
> I want to generate several sine signals at different frequencies from
> within a ISR at audio rate (48k).

This is very inefficient. A good approach would be to generate a bunch of samples into a buffer in batch mode, and then output this buffer from the ISR or, even better, by DMA.
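
A rough sketch of what batch mode could look like in C is given here for illustration. The names `osc_t` and `render_block` are hypothetical, the 15-bit table index follows the original post, and how the finished block is handed to the ISR or to DMA is deliberately left out because it depends on the system.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 15u   /* 32k-entry sine table, as in the original post */

/* Hypothetical oscillator state: 32-bit phase plus per-sample increment. */
typedef struct {
    uint32_t phase;
    uint32_t increment;
} osc_t;

/* Render n samples into out[], summing all oscillators.
 * Each oscillator is processed in its own inner loop, so its phase and
 * increment can stay in registers for the whole block. */
static void render_block(osc_t *oscs, size_t num_osc,
                         const int16_t *table, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 0;
    }
    for (size_t k = 0; k < num_osc; k++) {
        uint32_t phase = oscs[k].phase;
        const uint32_t inc = oscs[k].increment;
        for (size_t i = 0; i < n; i++) {
            out[i] += table[phase >> (32u - TABLE_BITS)];
            phase += inc;                 /* wraps modulo 2^32 */
        }
        oscs[k].phase = phase;            /* carry the phase into the next block */
    }
}
```

The block length trades latency against per-sample overhead, so it has to be chosen with the rest of the signal chain in mind.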
> I do this with a 32 bit status per signal, which I increment by a value
> that is calculated from the desired frequency and shifted 17 bits right to
> result in a 15 bit index into a 32k sine table.

With the simplest linear interpolation, a 512-entry table with phase wrapping into 1/4 of a period would give 16-bit accuracy.
> Putting this table to non-cached SDRAM increased performance by a factor
> of about 5.

If you generate the signal in batch mode, implement a more efficient sine, and enable the cache, that would increase performance by a factor of about 100.

Vladimir Vassilevsky
DSP and Mixed Signal Consultant
www.abvolt.com
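
One possible reading of the quarter-wave suggestion, sketched in C. The layout is an assumption, not something stated in the thread: 512 intervals cover a quarter period, with one extra guard entry for the peak, so the table holds 513 values of sin(pi/2 * i/512) scaled to 16 bits. The function name `sine_interp` is made up, and the table contents are supplied by the caller.

```c
#include <stdint.h>

#define Q_BITS    9u                     /* 512 intervals per quarter period */
#define Q_SIZE    (1u << Q_BITS)
#define FRAC_BITS (32u - 2u - Q_BITS)    /* 21 fractional phase bits */

/* quarter[] must hold Q_SIZE + 1 rising values from sin(0) to sin(pi/2). */
static int16_t sine_interp(uint32_t phase, const int16_t *quarter)
{
    uint32_t quadrant = phase >> 30;           /* top 2 bits select the quadrant */
    uint32_t pos = phase & 0x3FFFFFFFu;        /* position inside the quadrant */

    if (quadrant & 1u) {
        pos = 0x40000000u - pos;               /* quadrants 1 and 3 run backwards */
    }

    uint32_t idx  = pos >> FRAC_BITS;          /* 0..512 */
    uint32_t frac = pos & ((1u << FRAC_BITS) - 1u);

    int32_t a = quarter[idx];
    int32_t b = (idx < Q_SIZE) ? quarter[idx + 1u] : a;

    /* Linear interpolation; b >= a because the quarter table is rising. */
    int32_t v = a + (int32_t)(((int64_t)(b - a) * (int64_t)frac) >> FRAC_BITS);

    return (int16_t)((quadrant & 2u) ? -v : v);  /* second half period is negative */
}
```

At 16 bits per entry the table is about 1 KB instead of 64 KB, so it is far more likely to stay resident in the data cache or fit in on-chip memory.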
Vladimir Vassilevsky wrote:
> "Andre" <lodwig@pathme.de> wrote in message
>>thanks a lot for the hints.
>>I have ended up declaring a segment to be not cached and can live with
>>this.
>>
>>The scenario is as follows:
>>I want to generate several sine signals at different frequencies from
>>within a ISR at audio rate (48k).
>
> This is very inefficient.
> Good approach would be generate bunch of samples into a buffer in batch
> mode, and then output this buffer from ISR or, even better, by DMA.

I don't know what Andre is doing in addition to generating sines, but buffering and DMA add latency, which may or may not be acceptable. Plus, the batch mode buffer gives the cache something to work with, but doesn't eliminate the bad cache performance of the sine table.

>>I do this with a 32 bit status per signal, which I increment by a value
>>that is calculated from the desired frequency and shifted 17 bits right to
>>result in a 15 bit index into a 32k sine table.

(That value is called "angular velocity", just for the record :-)

> With simplest linear interpolation, a 512-size table with phase wrapping
> into 1/4 of period would make for 16 bit accuracy.

That would be the knob I'd try tuning, too. I believe my current signal generator uses somewhere on the order of 128 samples.

On the other hand, an uncached SDRAM read on the Blackfin takes around 5 SCLK = 20 CCLK, if I recall correctly. So your interpolator would have to fit into those 20 cycles. Definitely doable, but also easy to exceed, especially when you're tight on registers.

Stefan
"Stefan Reuther" <stefan.news@arcor.de> wrote:
> Vladimir Vassilevsky wrote:
>> "Andre" <lodwig@pathme.de> wrote:

>>>I do this with a 32 bit status per signal, which I increment by a value
>>>that is calculated from the desired frequency and shifted 17 bits right
>>>to
>>>result in a 15 bit index into a 32k sine table.
>
> (That value is called "angular velocity", just for the record :-)

Huh?

>> With simplest linear interpolation, a 512-size table with phase wrapping
>> into 1/4 of period would make for 16 bit accuracy.
>
> That would be the knob I'd try tuning, too. I believe my current signal
> generator uses somewhere in the magnitude of 128 samples.
>
> On the other hand, an uncached SDRAM on the Blackfin read takes around 5
> SCLK = 20 CCLK, if I recall correctly. So your interpolator would have
> to fit into those 20 cycles. Definitely doable, but also easy to exceed,
> especially when you're tight on registers.

A non-cached SDRAM read by the core takes ~10 SCLKs @ 133 MHz = ~45 CCLKs @ 600 MHz.
It is very inefficient on Blackfin.

VLV
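
The two estimates in this exchange differ both in the SCLK count and in the assumed clock ratio. A small helper makes the conversion explicit; the function name `sclk_to_cclk` is made up, and the only inputs are the figures quoted in the thread.

```c
/* Convert a latency given in system clock (SCLK) cycles into core clock
 * (CCLK) cycles, given both clock frequencies in the same unit. */
static double sclk_to_cclk(double sclk_cycles, double cclk_mhz, double sclk_mhz)
{
    return sclk_cycles * cclk_mhz / sclk_mhz;
}
```

With the figures above, `sclk_to_cclk(10.0, 600.0, 133.0)` gives about 45 core cycles, while 5 SCLK at a 4:1 core-to-system ratio, which is what "5 SCLK = 20 CCLK" implies, gives 20.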
Vladimir Vassilevsky <nospam@nowhere.com> wrote:

(snip, someone wrote)

>> (That value is called "angular velocity", just for the record :-)
> Huh?
I didn't trace back to where that came from, but it is probably right.

If theta is angular position, then d(theta)/dt is angular velocity, usually omega, and d(omega)/dt is angular acceleration, alpha. Then all the equations come out in a form similar to the linear case, with no extraneous 2pi around. For some reason there is torque (usually tau) instead of angular force. Then there is angular momentum, and the moment of inertia (in place of mass) is a tensor.

Everything works out nicely if you use omega, angular velocity, instead of frequency.

-- glen
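
To tie this back to the 32-bit phase value in Andre's post: under the usual phase-accumulator convention (an assumption here, since the post does not give the formula), the per-sample increment is the angular velocity expressed in units of 2^32 per turn instead of 2pi per turn. With f the desired frequency and f_s the sample rate:

```latex
\omega = 2\pi f, \qquad
\Delta = \frac{2^{32} f}{f_s} = \frac{2^{32}\,\omega}{2\pi f_s}, \qquad
\phi[n+1] = \bigl(\phi[n] + \Delta\bigr) \bmod 2^{32}, \qquad
\mathrm{index}[n] = \left\lfloor \frac{\phi[n]}{2^{17}} \right\rfloor
```

For example, f = 1 kHz at f_s = 48 kHz gives an increment of 2^32 / 48, roughly 89,478,485 per sample.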
Hi all,

I am running the Blackfin at "only" 240 MHz, for power consumption and EMI 
reasons, so the ratio is not that bad and will end up in the range of 
what Stefan estimated.

Thanks again,

Andre


On 20.12.2012 17:15, Vladimir Vassilevsky wrote:
> "Stefan Reuther" <stefan.news@arcor.de> wrote:
>> Vladimir Vassilevsky wrote:
>>> "Andre" <lodwig@pathme.de> wrote:
>
>>>> I do this with a 32 bit status per signal, which I increment by a value
>>>> that is calculated from the desired frequency and shifted 17 bits right
>>>> to
>>>> result in a 15 bit index into a 32k sine table.
>>
>> (That value is called "angular velocity", just for the record :-)
>
> Huh?
>
>>> With simplest linear interpolation, a 512-size table with phase wrapping
>>> into 1/4 of period would make for 16 bit accuracy.
>>
>> That would be the knob I'd try tuning, too. I believe my current signal
>> generator uses somewhere in the magnitude of 128 samples.
>>
>> On the other hand, an uncached SDRAM on the Blackfin read takes around 5
>> SCLK = 20 CCLK, if I recall correctly. So your interpolator would have
>> to fit into those 20 cycles. Definitely doable, but also easy to exceed,
>> especially when you're tight on registers.
>
> Non cached SDRAM read by core is ~10 SCLKs @133 MHz = ~ 45 CCLKs @ 600 MHz
> It is very inefficient on BlackFin.
>
> VLV