comp.dsp | read w/o cache on blackfin?

Hi all,

I have the following issue.

An ADI Blackfin processor has SDRAM attached and the data cache enabled.
When I read from SDRAM, the cache would normally increase performance.

However, I am accessing some data on a random basis, reading only a single 
byte or a few bytes in a row. Speculative reads by the cache engine seem to 
eat performance quite significantly in this case.

I can of course declare memory regions as non-cacheable.
However, from my point of view it would be more elegant (and portable) 
to have a way to access SDRAM without caching from code.

Any hints?

Best regards,

Andre

Reply by Vladimir Vassilevsky ● December 14, 2012

"Andre" <lodwig@pathme.de> wrote in message 
news:kaf453$mac$1@speranza.aioe.org...
> Hi all,
>
> I have the following issue.
>
> An ADI Blackfin processor has SDRAM attached, and data cache enabled.
> When I read from SDRAM, cache would normally increase performance.

Indeed. Non-cached RAM access by the core is very inefficient on Blackfin.

> However, I am accessing some data on a random basis, only reading single or 
> few bytes in a row. Speculative reads of the cache engine seem to eat 
> performance in this case quite significantly.

1. Speculative reads and cache line fetches are unrelated things.
2. It would be quite unusual for the cache not to improve the performance of 
a real DSP job (as opposed to an artificially created problem meant to mess 
with the cache). What are you doing?

> I can of course declare memory regions to be not cacheable.
> However, it would be more elegant (and portable) from my point of view to 
> have a way to access SDRAM with no caching from code.
>
> Any hints?

It is a very unlucky and unlikely situation if the cache does not pay for 
itself.
The best thing would be to rearrange your algorithm and/or your data so that 
the RAM is accessed in a more or less local or sequential manner.

Vladimir Vassilevsky
DSP and Mixed Signal Consultant
www.abvolt.com

Reply by glen herrmannsfeldt ● December 14, 2012

Vladimir Vassilevsky <nospam@nowhere.com> wrote:
 
> "Andre" <lodwig@pathme.de> wrote in message 
> news:kaf453$mac$1@speranza.aioe.org...

(snip)
>> An ADI Blackfin processor has SDRAM attached, and data cache enabled.
>> When I read from SDRAM, cache would normally increase performance.
 
> Indeed. Non-cached RAM access by core is very inefficient on BlackFin.
 
>> However, I am accessing some data on a random basis, only reading single or 
>> few bytes in a row. Speculative reads of the cache engine seem to eat 
>> performance in this case quite significantly.
 
> 1. Speculative read and cache line fetch are unrelated things.
> 2. It would be quite unusual if cache does not improve performance of a real 
> DSP job (as opposed to artificially created problem to mess up with cache). 
> What are you doing?
 
(snip)

> It is very unlucky and unlikely situation if the cache does not recoup 
> itself.
> The best thing would be rearrange your algorithm and/or your data so the RAM 
> would be accessed in more or less local or sequential manner.

I don't know about the OP, but I was once working with very large
state machines, maybe hundreds of millions of states. Pretty much, 
each transition was to a line that wasn't in cache. 

(I was doing it on an IA32 machine, but the problem would be the same
for any other cached memory system.)

Now, the executable code would be a very small loop, which I hope would
be in cache.

-- glen

Reply by Andre ● December 20, 2012

Dear all,

thanks a lot for the hints.
I have ended up declaring a segment as non-cached and can live with this.

The scenario is as follows:
I want to generate several sine signals at different frequencies from 
within an ISR at audio rate (48 kHz).
I do this with a 32-bit state per signal, which I increment by a value 
calculated from the desired frequency and then shift 17 bits right 
to obtain a 15-bit index into a 32k-entry sine table.
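
A minimal sketch of that phase accumulator in C, for readers who want to see the arithmetic. The function and macro names are made up for illustration, and the increment formula (frequency divided by sample rate, scaled to 2^32) is an assumption about how "calculated from the desired frequency" is meant; only the 32-bit state, the 17-bit shift, the 32k table and the 48 kHz rate come from the description.

```c
#include <stdint.h>

#define SINE_TABLE_BITS 15u                      /* 32k entries, as described */
#define SINE_TABLE_SIZE (1u << SINE_TABLE_BITS)
#define SAMPLE_RATE_HZ  48000.0                  /* audio-rate ISR */

/* Assumed increment formula: freq / fs scaled to the full 32-bit phase
   range, so the accumulator wraps exactly once per sine period.
   Valid for 0 <= freq_hz < SAMPLE_RATE_HZ. */
static uint32_t phase_increment(double freq_hz)
{
    return (uint32_t)(freq_hz / SAMPLE_RATE_HZ * 4294967296.0 + 0.5);
}

/* Return the table index for the current phase, then advance the phase by
   one sample. The upper 15 bits select the entry (shift right by 17). */
static uint32_t next_table_index(uint32_t *phase, uint32_t increment)
{
    uint32_t index = *phase >> (32u - SINE_TABLE_BITS);
    *phase += increment;    /* unsigned overflow is the phase wrap-around */
    return index;
}
```

Because consecutive indices jump by a frequency-dependent stride through a 64 KB table (assuming 16-bit entries), neighbouring lookups rarely land in the same cache line, which fits the access pattern described in the first post.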

Putting this table to non-cached SDRAM increased performance by a factor 
of about 5.

best regards,

Andre


Reply by Vladimir Vassilevsky ● December 20, 2012

"Andre" <lodwig@pathme.de> wrote in message 
news:kav7g9$6dn$1@speranza.aioe.org...
> Dear all,
>
> thanks a lot for the hints.
> I have ended up declaring a segment to be not cached and can live with 
> this.
>
> The scenario is as follows:
> I want to generate several sine signals at different frequencies from 
> within an ISR at audio rate (48k).

This is very inefficient.
A good approach would be to generate a bunch of samples into a buffer in 
batch mode, and then output this buffer from the ISR or, even better, by DMA.
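
A rough sketch of what block-wise generation could look like in C. The function name, the parameter layout and the plain table lookup are illustrative assumptions; the thread does not show any actual code, and the ISR or DMA side is left out because it depends on the hardware setup.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill one block of samples for a single oscillator.
   `table` holds 2^table_bits entries covering one full sine period,
   with 1 <= table_bits <= 31. */
static void fill_block(int16_t *out, size_t n,
                       uint32_t *phase, uint32_t increment,
                       const int16_t *table, unsigned table_bits)
{
    uint32_t ph = *phase;               /* keep the state local in the loop */
    for (size_t i = 0; i < n; ++i) {
        out[i] = table[ph >> (32u - table_bits)];
        ph += increment;                /* wraps naturally at 2^32 */
    }
    *phase = ph;                        /* write the state back once per block */
}
```

The point of the loop form is that the per-sample interrupt entry and exit cost is paid once per block instead of once per sample, and the oscillator state stays in registers between samples.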

> I do this with a 32 bit status per signal, which I increment by a value 
> that is calculated from the desired frequency and shifted 17 bits right to 
> result in a 15 bit index into a 32k sine table.

With the simplest linear interpolation, a 512-entry table with phase wrapping 
into 1/4 of a period would give 16-bit accuracy.
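
One possible reading of this in C, as a sketch only. It assumes the 512 entries span a quarter period (plus one extra end point for the interpolation), a 32-bit phase, and Q15 output; all names are hypothetical. The table is built with a rotation recurrence from the constants sin(pi/1024) and cos(pi/1024) so that the example does not need the math library; on a real target the table would more likely be precomputed offline.

```c
#include <stdint.h>

#define QUARTER_BITS 9u                        /* 512 intervals per quarter period */
#define QUARTER_SIZE (1u << QUARTER_BITS)

static int16_t quarter_table[QUARTER_SIZE + 1];   /* extra entry holds sin(pi/2) */

/* Build sin(0 .. pi/2) in Q15 by rotating a unit vector by pi/1024 per step. */
static void init_quarter_table(void)
{
    const double step_sin = 0.0030679567629659763;  /* sin(pi/1024) */
    const double step_cos = 0.9999952938095762;     /* cos(pi/1024) */
    double s = 0.0, c = 1.0;
    for (uint32_t i = 0; i <= QUARTER_SIZE; ++i) {
        quarter_table[i] = (int16_t)(32767.0 * s + 0.5);
        double s_next = s * step_cos + c * step_sin;
        c = c * step_cos - s * step_sin;
        s = s_next;
    }
}

/* Phase layout: bits 31..30 quadrant, bits 29..21 table index,
   bits 20..0 fraction between neighbouring entries. */
static int16_t sine_interp(uint32_t phase)
{
    uint32_t quadrant = phase >> 30;
    uint32_t pos = phase & 0x3FFFFFFFu;            /* position inside the quadrant */
    if (quadrant & 1u)
        pos = 0x40000000u - pos;                   /* mirror quadrants 1 and 3 */
    uint32_t idx  = pos >> 21;                     /* 0..512 */
    int32_t  frac = (int32_t)((pos >> 6) & 0x7FFFu);   /* top 15 fraction bits */
    int32_t  a = quarter_table[idx];
    int32_t  b = (idx < QUARTER_SIZE) ? quarter_table[idx + 1] : a;
    int32_t  y = a + (((b - a) * frac) >> 15);     /* linear interpolation */
    return (int16_t)((quadrant & 2u) ? -y : y);    /* second half period is negative */
}
```

The table here occupies about 1 KB, so unlike a 64 KB table it can plausibly stay resident in cache or on-chip memory.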

> Putting this table to non-cached SDRAM increased performance by a factor 
> of about 5.

If you generate the signal in batch mode, implement a more efficient sine, 
and enable the cache, that would increase performance by a factor of about 100.

Vladimir Vassilevsky
DSP and Mixed Signal Consultant
www.abvolt.com

Reply by Stefan Reuther ● December 20, 2012

Vladimir Vassilevsky wrote:
> "Andre" <lodwig@pathme.de> wrote in message 
>>thanks a lot for the hints.
>>I have ended up declaring a segment to be not cached and can live with 
>>this.
>>
>>The scenario is as follows:
>>I want to generate several sine signals at different frequencies from 
>>within an ISR at audio rate (48k).
> 
> This is very inefficient.
> Good approach would be generate bunch of samples into a buffer in batch 
> mode, and then output this buffer from ISR or, even better, by DMA.

I don't know what Andre is doing in addition to generating sines, but
buffering and DMA add latency, which may or may not be acceptable.

Plus, the batch mode buffer gives the cache something to work with, but
doesn't eliminate the bad cache performance of the sine table.

>>I do this with a 32 bit status per signal, which I increment by a value 
>>that is calculated from the desired frequency and shifted 17 bits right to 
>>result in a 15 bit index into a 32k sine table.

(That value is called "angular velocity", just for the record :-)

> With simplest linear interpolation, a 512-size table with phase wrapping 
> into 1/4 of  period would make for 16 bit accuracy.

That would be the knob I'd try tuning, too. I believe my current signal
generator uses something on the order of 128 samples.

On the other hand, an uncached SDRAM read on the Blackfin takes around 5
SCLK = 20 CCLK, if I recall correctly. So your interpolator would have
to fit into those 20 cycles. Definitely doable, but also easy to exceed,
especially when you're tight on registers.

  Stefan

Reply by Vladimir Vassilevsky ● December 20, 2012

"Stefan Reuther" <stefan.news@arcor.de> wrote:
> Vladimir Vassilevsky wrote:
>> "Andre" <lodwig@pathme.de> wrote:

>>>I do this with a 32 bit status per signal, which I increment by a value
>>>that is calculated from the desired frequency and shifted 17 bits right 
>>>to
>>>result in a 15 bit index into a 32k sine table.
>
> (That value is called "angular velocity", just for the record :-)

Huh?

>> With simplest linear interpolation, a 512-size table with phase wrapping
>> into 1/4 of  period would make for 16 bit accuracy.
>
> That would be the knob I'd try tuning, too. I believe my current signal
> generator uses somewhere in the magnitude of 128 samples.
>
> On the other hand, an uncached SDRAM on the Blackfin read takes around 5
> SCLK = 20 CCLK, if I recall correctly. So your interpolator would have
> to fit into those 20 cycles. Definitely doable, but also easy to exceed,
> especially when you're tight on registers.

A non-cached SDRAM read by the core is ~10 SCLKs @ 133 MHz = ~45 CCLKs @ 600 MHz.
It is very inefficient on Blackfin.
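
The conversion behind these numbers is just a clock ratio. A small C helper, with a made-up name, spells it out; the cycle counts themselves are the recollections quoted in this thread, not datasheet values.

```c
/* Convert a latency given in system-clock (SCLK) cycles into core-clock
   (CCLK) cycles. The same time span contains cclk/sclk times as many
   core cycles, so the clock units only need to match each other. */
static double sclk_to_cclk_cycles(double sclk_cycles,
                                  double cclk_mhz, double sclk_mhz)
{
    return sclk_cycles * cclk_mhz / sclk_mhz;
}
```

With the figures above, sclk_to_cclk_cycles(10, 600, 133) comes to roughly 45 core cycles. The 5 SCLK = 20 CCLK estimate quoted earlier corresponds to a CCLK/SCLK ratio of 4.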

VLV

Reply by glen herrmannsfeldt ● December 20, 2012

Vladimir Vassilevsky <nospam@nowhere.com> wrote:

(snip, someone wrote)

>> (That value is called "angular velocity", just for the record :-)

> Huh?

I didn't trace back to where that came from, but it is probably right.

If theta is angular position, then d(theta)/dt is angular velocity,
usually omega, and d(omega)/dt angular acceleration, alpha.

Then all the equations come out in similar form to the linear case,
with no extraneous 2pi around. For some reason there is torque
(usually tau) instead of angular force. Then there is angular momentum,
and moment of inertia (in place of mass) is a tensor.

Everything works out nicely if you use omega, angular velocity,
instead of frequency.
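
To tie this back to the phase accumulator discussed earlier in the thread: assuming a 32-bit accumulator whose full range represents one period (2pi), a sample rate f_s, and an output frequency f, the per-sample increment is the angular velocity expressed in accumulator units per sample. The symbols are mine, not from the earlier posts.

```latex
\omega = 2\pi f, \qquad
\Delta\theta = \frac{\omega}{f_s} \ \text{(radians per sample)}, \qquad
\text{increment} = \frac{2^{32}}{2\pi}\,\Delta\theta = 2^{32}\,\frac{f}{f_s}
```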

-- glen

Reply by Andre ● December 21, 2012

Hi all,

I am running the Blackfin at "only" 240 MHz, for power consumption and EMI 
reasons, so the ratio is not that bad and will end up in the range of 
what Stefan estimated.

Thanks again,

Andre

