DSPRelated.com
Forums

Parallel Internal / External Memory Access

Started by andor_bariska May 6, 2004
I'm working with a dual 65L setup with shared SDRAM. Only one of the
processors is accessing the SDRAM.

I find that parallel internal / external memory reads result in
incorrect data being read from the external memory.

Here is an example. This loop is for windowing FFT data. It reads the
data from the input buffer (i5, in external SDRAM) and a windowing
coefficient (i15, internal memory), multiplies the two values
together and stores the result in alternating order for real (i0) and
imaginary (i8) parts in internal memory:

/*** START INCLUDED CODE: ***/

f4=dm(i5,m6), f0=pm(i15,m14);

lcntr = FFT_LENGTH/2, do (pc,3) until lce;
f2* f0, f4=dm(i5,m6), f0=pm(i15,m14);
f1* f0, f4=dm(i5,m6), f0=pm(i15,m14);
dm(i0,m6) pm(i8,m14)

/*** END INCLUDED CODE ***/
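
For comparing target output against a known-good result, the windowing step described above can be modelled in plain C. This is only a sketch under assumptions: the function and parameter names are hypothetical, the data is taken to be 32-bit float, and consecutive input samples are assumed to go alternately to the real and imaginary buffers, as the description suggests.

```c
#include <stddef.h>

/* Hypothetical reference model of the windowing loop (not the original code).
 * in  : input buffer (external SDRAM on the target), fft_length samples
 * win : window coefficients (internal memory), fft_length values
 * re  : real-part output, fft_length/2 values
 * im  : imaginary-part output, fft_length/2 values
 * Assumption: even-numbered samples go to re[], odd-numbered ones to im[]. */
void window_deinterleave(const float *in, const float *win,
                         float *re, float *im, size_t fft_length)
{
    for (size_t k = 0; k < fft_length / 2; ++k) {
        re[k] = in[2 * k]     * win[2 * k];      /* even sample -> real part */
        im[k] = in[2 * k + 1] * win[2 * k + 1];  /* odd sample  -> imaginary part */
    }
}
```

Running the same input through this model and through each assembly loop on the target would show at which index the external reads start to differ.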

This works in the simulator, but on the target hardware the FFT shows
erratic behaviour (the FFT is displayed on a LCD).

If the windowing loop uses a separate access to the external memory,
the FFT on the target hardware looks as it should. Here is the code:

/*** START INCLUDED CODE: ***/

f0=pm(i15,m14);
f4=dm(i5,m6);

lcntr = FFT_LENGTH / 2, do (pc,5) until lce;
f2* f0, f0=pm(i15,m14);
f4=dm(i5,m6);
f1* f0, f0=pm(i15,m14);
f4=dm(i5,m6);
dm(i0,m6) pm(i8,m14)

/*** END INCLUDED CODE ***/

In the simulator, these two loops are equivalent (apart from cycle
count). I therefore feel that this is an issue concerning the
hardware. Possibly it is an anomaly of the 65L.

Can anybody confirm / counter?

Regards,

Andor Bariska

WEISS ENGINEERING LTD. - Professional Digital Audio Products
Florastrasse 42 8610 Uster Switzerland
phone: +41 1 940 20 06, fax: +41 1 940 22 14
web: <http://www.weiss.ch/>
Maillist: http://groups.yahoo.com/group/weiss-audio




Hi,

Has anyone used the circindex function for loop
optimization?

The manual says the following:
The circindex function is used within a loop in order
to implement a circular buffer operation in C/C++.
When optimization is enabled, the operation will be
implemented using the appropriate hardware features
(B registers and L registers) of the SHARC DSP
architecture.

I have used the circindex function and simulated on
vdsp3.0,
but the loop takes 0.18 msec more than the loop
without the circindex function.

Any idea where the circindex and circptr functions are used?
What other ways are there to optimize a loop?

Bye
Liyju
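
As a rough illustration of the circular buffer operation the manual text describes, here is a portable C model of a wrapping index update. It is a sketch under assumptions: circindex_model and circ_sum are hypothetical names, not the compiler built-ins, and the model assumes the increment is never larger in magnitude than the buffer length.

```c
/* Portable model of a wrapping index update.
 * Hypothetical helper, NOT the compiler's circindex built-in.
 * Assumes 0 <= index < nitems and -nitems <= incr <= nitems. */
static int circindex_model(int index, int incr, int nitems)
{
    index += incr;
    if (index < 0)
        index += nitems;       /* wrapped below the start of the buffer */
    else if (index >= nitems)
        index -= nitems;       /* wrapped past the end of the buffer */
    return index;
}

/* Example use: sum `count` samples of a circular buffer, starting at `start`. */
float circ_sum(const float *buf, int nitems, int start, int count)
{
    float acc = 0.0f;
    int idx = start;
    for (int n = 0; n < count; ++n) {
        acc += buf[idx];
        idx = circindex_model(idx, 1, nitems);
    }
    return acc;
}
```

According to the manual text quoted above, the hardware mapping to B and L registers only happens when optimization is enabled, so a build without optimization may well run slower than a loop with plain indexing.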

__________________________________



Liyju Janardhan wrote:
...
> I have used the circindex function and simulated on
> vdsp3.0,
> but the loop takes 0.18 msec more than the loop
> without the circindex function.
>
> Any idea where the circindex and circptr functions are used?
> What other ways are there to optimize a loop?

Liyju, perhaps you should post your loop - this is the best way we
can describe how to optimize it.

Regards,
Andor




The following is the loop which I want to optimize.

for (i = 0; i < (num/2); i++)
{
    o = r_out[num-i] + r_out[i];
    x[i] = o*o;

    o = i_out[num-i] + i_out[i];
    y[i] = o*o;
}

r_out, i_out, x and y are in data memory. o is a local
variable and is therefore stored on the stack. Putting o in
program memory might increase the speed.

There are some loop optimization pragmas; how
are they used?
What is meant by vectorizing a loop?

regards,

Liyju



__________________________________



On Sun, 9 May 2004, Liyju Janardhan wrote:

>
> Following is the loop which I want to optimize.
>
> for (i=0;i<(num/2);i++)
> {
> o = r_out[num-i] + r_out[i];
> x[i] = o*o;
>
> o=i_out[num-i] + i_out[i];
> y[i] = o*o;
> }
>
> r_out, i_out, x and y are in data memory. o is a local
> variable and is therefore stored on the stack. Putting o in
> program memory might increase the speed.

o is a temporary; let the compiler keep it in a register for speed.

> There are some loop optimization pragmas; how
> are they used?
> What is meant by vectorizing a loop?

The modern SHARCs have 2 ALUs. A vector loop uses the same operation
on 2 different sets of data (Single Instruction, Multiple Data == SIMD).
In one ALU you perform the x calculation and in the other ALU you perform
the y calculation.

Getting a compiler to see this kind of optimization is really hard.
Usually you must do it by hand. Fortunately it's pretty easy for this
problem. You first have to set up the ALUs so all the pointers are
correct, then let 'em rip (as the beyblade kids say).

Patience, persistence, truth,
Dr. mike
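
To make the "same operation on 2 different sets of data" idea concrete, here is a plain C sketch of the posted loop in scalar form and in a two-lane form. The assumptions are mine: the function names are hypothetical, the data is taken to be float, and r_out and i_out must hold at least num+1 elements because index num is read when i is 0. Plain C does not force SIMD execution; the second function only shows how the x and y calculations line up as two independent lanes.

```c
#include <stddef.h>

/* Scalar form of the posted loop. `o` is an ordinary local, so the
 * compiler is free to keep it in a register. */
void sum_sq_scalar(const float *r_out, const float *i_out,
                   float *x, float *y, size_t num)
{
    for (size_t i = 0; i < num / 2; ++i) {
        float o = r_out[num - i] + r_out[i];
        x[i] = o * o;
        o = i_out[num - i] + i_out[i];
        y[i] = o * o;
    }
}

/* Two-lane form: lane 0 carries the x calculation, lane 1 the y
 * calculation. Each step applies one operation to both lanes. */
void sum_sq_two_lane(const float *r_out, const float *i_out,
                     float *x, float *y, size_t num)
{
    const float *src[2] = { r_out, i_out };
    float *dst[2] = { x, y };

    for (size_t i = 0; i < num / 2; ++i) {
        float o[2];
        for (int lane = 0; lane < 2; ++lane)   /* one add per lane */
            o[lane] = src[lane][num - i] + src[lane][i];
        for (int lane = 0; lane < 2; ++lane)   /* one multiply and store per lane */
            dst[lane][i] = o[lane] * o[lane];
    }
}
```

Both functions produce the same x and y; on a SIMD part the two lanes would run in the two ALUs at once, which is the hand optimization described above.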



Mike Rosing wrote:
...
> The modern SHARCs have 2 ALUs.

That's a good point - what processor is this loop to be coded on?


Thanks for pointing me in the right direction.

I can't vectorize the loop as I am using a SISD processor
(21060).

Anyway, I have implemented the same loop as an asm
function and am calling it from the C program.

I am saving 10,000+ cycles this way, which is what I
wanted to do.

Thanks again,
regards
Liyju


__________________________________