Forums

Loop optimization in C6416T Processor

Started by khizra · 5 months ago · 6 replies · latest reply 5 months ago · 68 views

I am performing cross-correlation for frame synchronization. A sample of my code is given below:

#include <csl.h>
#include <complex.h>
#include <math.h>
#include <mathf.h>
#include <float.h>
#include <string.h>
#include "stdbool.h"
#include <c6x.h>
#include "fastrts62x64x.h"

/* Loop for correlation computation. */
for (n1 = 0; n1 < 1400; n1++)
{
    fine_sync = 0.0 + 0.0*I;
    for (m1 = 0; m1 < 200; m1++)
    {
        fine_sync += mpysp(input[n1+m1], conjf(CMP_MOD_knownsym[m1]));
    }
    fine_sync1_vector_abs[n1] = cabsf(fine_sync);
}

This section of code takes about 54,000,000 cycles (54 ms), measured with the CCS profiling tool. That cannot be afforded in any real-time application. I am running this code on a C6416 chip with a 1 GHz clock.

I need help optimizing this loop. The outer loop runs 1400 times and the inner loop runs 200 times.

I know about loop unrolling, but that doesn't help much here. I want to know some other method to reduce the execution cycles of this loop.

I am running my code from IRAM, and all compiler optimizations are disabled.

Reply by rrlagic, April 13, 2021

Hi!

As was suggested on e2e, compiler optimization is the most painless and effortless way to improve code performance, and it should definitely be used.

However, floating-point math on a fixed-point processor defeats the key optimization techniques, so any substantial speedup will require effort.

Reply by rbj, April 13, 2021

Well, you're doing an mpysp() operation (and a complex addition, which doesn't cost much) 280,000 times. Why does one pass cost about 192 instruction cycles?

And is this a sliding window, or do you want to scale either n1 or m1 inside input[]? Maybe input[200*n1+m1]? Oops, I get it: n1 is the lag in the cross-correlation.



Reply by rrlagic, April 14, 2021

Well, mpysp is not a single operation, but a function call. As for the complex addition, just take a look at ccs\tools\compiler\c6000_7.4.24\lib\src\xxxcaddcc.h:

/* xxxcaddcc.h -- common _[FL]Caddcc functionality */
#include <complex.h>
#include "xmath.h"
_STD_BEGIN

FCTYPE (FNAME(Caddcc))(FCTYPE x, FCTYPE y)
    {    /* find complex sum */
    FTYPE xre = FFUN(creal)(x);
    FTYPE xim = FFUN(cimag)(x);
    FTYPE yre = FFUN(creal)(y);
    FTYPE yim = FFUN(cimag)(y);

    return (FNAME(Cbuild)(xre + yre, xim + yim));
    }
_STD_END

/*
 * Copyright (c) 1992-2004 by P.J. Plauger.  ALL RIGHTS RESERVED.
 * Consult your license regarding permissions and restrictions.
V4.02:1476 */

I bet that costs more than just four loads, two additions, and two stores.

Reply by dgshaw6, April 13, 2021

We don't know the format of your data, but the name "knownsym" suggests data communications, so I would guess that fixed-point math is the most logical choice, because symbols usually have a known dynamic range.
Maybe even in the ±1 to ±7 range, which leaves lots of room for the accumulations.
I don't know the C6416T processor, so I may be way off base here.
In theory, the complex multiply in the middle shouldn't take more than 4-6 clock ticks without dynamic-range checking, and maybe a couple more with range checking and scaling. That is a lot less than the 192 per inner-loop iteration you have right now.


Reply by dudelsound, April 14, 2021

Depending on the signals you are correlating, I have an algorithmic suggestion:

Make a coarse search followed by a finer search -> run the entire correlation at a much lower sample rate.

Afterwards:

Perform the high-resolution correlation only where the downsampled correlation has a high output.

Reply by dudelsound, April 14, 2021

Ahh, and obviously: correlation in the time domain is multiplication by the conjugate in the frequency domain. So bringing both signals to the same length, FFTing both, multiplying A * conj(B), and inverse-FFTing is probably a lot faster.