I'm working with a dual 65L setup with shared SDRAM. Only one of the processors accesses the SDRAM. I find that parallel internal / external memory reads result in incorrect data being read from the external memory.

Here is an example. This loop windows FFT data. It reads the data from the input buffer (i5, in external SDRAM) and a windowing coefficient (i15, internal memory), multiplies the two values together and stores the results alternately as real (i0) and imaginary (i8) parts in internal memory:

/*** START INCLUDED CODE: ***/
f4=dm(i5,m6), f0=pm(i15,m14);
lcntr = FFT_LENGTH/2, do (pc,3) until lce;
    f2=f4*f0, f4=dm(i5,m6), f0=pm(i15,m14);
    f1=f4*f0, f4=dm(i5,m6), f0=pm(i15,m14);
    dm(i0,m6)=f2, pm(i8,m14)=f1;
/*** END INCLUDED CODE ***/

This works in the simulator, but on the target hardware the FFT shows erratic behaviour (the FFT is displayed on an LCD). If the windowing loop uses a separate access to the external memory, the FFT on the target hardware looks as it should. Here is the code:

/*** START INCLUDED CODE: ***/
f0=pm(i15,m14);
f4=dm(i5,m6);
lcntr = FFT_LENGTH / 2, do (pc,5) until lce;
    f2=f4*f0, f0=pm(i15,m14);
    f4=dm(i5,m6);
    f1=f4*f0, f0=pm(i15,m14);
    f4=dm(i5,m6);
    dm(i0,m6)=f2, pm(i8,m14)=f1;
/*** END INCLUDED CODE ***/

In the simulator, these two loops are equivalent (apart from cycle count). I therefore suspect a hardware issue, possibly an anomaly of the 65L. Can anybody confirm / counter?

Regards,
Andor Bariska
Weiss Engineering Ltd.
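For checking what the target actually writes to internal memory, a plain C reference of the windowing step can be handy. The following is only a sketch: it assumes unit strides (m6 and m14 equal to 1, which the post does not state), float data, and that every input sample has its own window coefficient. The name `window_pack_ref` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Reference windowing: fft_length real samples are multiplied by their
 * window coefficients and stored alternately as real and imaginary parts.
 * Assumption: unit strides and one coefficient per input sample. */
void window_pack_ref(const float *in, const float *win,
                     float *re, float *im, size_t fft_length)
{
    for (size_t k = 0; k < fft_length / 2; ++k) {
        re[k] = in[2 * k]     * win[2 * k];      /* even sample -> real part */
        im[k] = in[2 * k + 1] * win[2 * k + 1];  /* odd sample  -> imaginary part */
    }
}
```

Running this on a host with the same input block and comparing against a memory dump from the target would show which samples come back wrong from the SDRAM.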
Parallel Internal / External Memory Access
Started by ●May 6, 2004
Reply by ●May 7, 2004
Hi,

Has anyone used the circindex function for loop optimization? The manual says the following:

The circindex function is used within a loop in order to implement a circular buffer operation in C/C++. When optimization is enabled, the operation will be implemented using the appropriate hardware features (B registers and L registers) of the SHARC DSP architecture.

I have used the circindex function and simulated on vdsp3.0, but the loop takes 0.18 msec longer than the loop without the circindex function.

Any idea where the circindex and circptr functions are used? What are other ways of optimizing loops?

Bye
Liyju
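The manual text quoted above describes stepping an index around a circular buffer. As a rough illustration of that wrap-around operation in portable C, here is a minimal sketch. It assumes add-then-wrap semantics with a step smaller than the buffer length; `circ_step` and `delay_line_sum` are hypothetical names, not the VisualDSP++ built-ins.

```c
/* Hand-written stand-in for a circular index step; NOT the compiler builtin.
 * Assumes 0 <= index < nitems and -nitems < incr < nitems. */
int circ_step(int index, int incr, int nitems)
{
    index += incr;
    if (index < 0)
        index += nitems;        /* wrapped past the start */
    else if (index >= nitems)
        index -= nitems;        /* wrapped past the end */
    return index;
}

/* Sum the most recent `taps` samples of a circular delay line,
 * walking backwards from the newest sample. */
float delay_line_sum(const float *buf, int nitems, int newest, int taps)
{
    float acc = 0.0f;
    int idx = newest;
    for (int t = 0; t < taps; ++t) {
        acc += buf[idx];
        idx = circ_step(idx, -1, nitems);
    }
    return acc;
}
```

Done in software like this, each step costs a compare and a conditional add on top of the memory access, which the hardware circular addressing mentioned in the manual avoids. One possible explanation for a slower loop is therefore that the hardware form is only generated when optimization is enabled, as the manual states.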
Reply by ●May 8, 2004
Liyju Janardhan wrote:
...
> I have used circindex function and simulated on
> vdsp3.0
> but the loop takes takes 0.18 msec more than the loop
> without the circindex function.
>
> Any idea where these circindex and circptr fn is used?
> What are other ways for loop optimization?

Liyju, perhaps you should post your loop - this is the best way we can describe how to optimize it.

Regards,
Andor
Reply by ●May 10, 2004
Following is the loop which I want to optimize.

for (i=0;i<(num/2);i++)
{
    o = r_out[num-i] + r_out[i];
    x[i] = o*o;

    o = i_out[num-i] + i_out[i];
    y[i] = o*o;
}

r_out, i_out, x and y are in data memory. o is a local variable and is hence stored on the stack. Putting o in program memory may increase the speed.

There are some loop optimization pragmas; how are they used? What does vectorizing a loop mean?

regards,
Liyju
Reply by ●May 10, 2004
On Sun, 9 May 2004, Liyju Janardhan wrote:

> Following is the loop which I want to optimize.
>
> for (i=0;i<(num/2);i++)
> {
>     o = r_out[num-i] + r_out[i];
>     x[i] = o*o;
>
>     o=i_out[num-i] + i_out[i];
>     y[i] = o*o;
> }
>
> r_out, i_out, x and y are in data memory. o is a local
> variable hence stored in stack. Putting o in program
> memory may increase the speed.

o is a temp; let the compiler leave it in a register for speed.

> There are some loop optimization pragmas, how
> are they used?
> What does it mean by vectorizing loop?

The modern SHARCs have 2 ALUs. A vector loop uses the same operation on 2 different sets of data (Single Instruction, Multiple Data == SIMD). In one ALU you perform the x calculation and in the other ALU you perform the y calculation.

Getting a compiler to see this kind of optimization is really hard. Usually you must do it by hand. Fortunately it's pretty easy for this problem. You first have to set up the ALUs so all the pointers are correct, then let 'em rip (as the beyblade kids say).

Patience, persistence, truth,
Dr. mike
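To make the two points above concrete (a register temporary, and x and y as two independent streams of the same operation), the posted loop could be rearranged roughly as follows. This is only a sketch in plain C, not SIMD code: the function name `sum_square`, the float element type, and the parameter list are assumptions, and it presumes the input arrays hold at least num + 1 elements, since index num is read when i is 0.

```c
/* Same arithmetic as the posted loop. The temporaries a and b are locals
 * the compiler can keep in registers, and the x and y computations no
 * longer share a variable, so they form two independent "lanes".
 * Assumptions: float data, inputs of at least num + 1 elements,
 * outputs that do not overlap the inputs. */
void sum_square(const float *r_out, const float *i_out,
                float *x, float *y, int num)
{
    for (int i = 0; i < num / 2; ++i) {
        float a = r_out[num - i] + r_out[i];  /* lane 1: feeds x */
        float b = i_out[num - i] + i_out[i];  /* lane 2: feeds y */
        x[i] = a * a;
        y[i] = b * b;
    }
}
```

On a SIMD part, each lane would map onto one of the two compute units; on a single-unit part, the same structure still lets the two add/multiply chains be interleaved by hand in assembly.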
Reply by ●May 10, 2004
Mike Rosing wrote:
...
> The modern SHARCs have 2 ALUs.

That's a good point - what processor is this loop to be coded on?
Reply by ●May 15, 2004
Thanks for giving the direction. I can't vectorize the loop as I am using a SISD processor (21060). Anyway, I have implemented the same loop as an asm function and am calling it from the C program. I am saving 10,000+ cycles by this... that's what I wanted to do.

Thanks again,
regards
Liyju