DSPRelated.com
Forums

how to optimize c code of Cordic algorithm

Started by praveen December 11, 2003
Hello,
I have implemented cordic for finding the atan in adsp 2191. But it
takes 3714 cycles for its execution. I have implemented it in c. Can
something tell me how can i optimize the code for that it takes less
than 500 cycles.
my code is
LUT is the lookup table
x and y are the two input whsoe atan to be determined

for(i=0;i<=25;i++)
	{
		x1=x;
		if (y>0)
		{
			x=x+(y>>i);
			y=y-(x1>>i);
			ang=ang+LUT[i];
		}
		else
		{
			x=x-(y>>i);
			y=y+(x1>>i);
			ang=ang-LUT[i];
		} 

Please suggest me technic my which i can reduce the number of cycles

Waiting for reply
With regards
praveen
Praveen:

In general, C compilers for DSPs do not always fully utilize the DSP
hardware. This includes the zero-overhead looping, multi-function
instructions, etc. Thus, the general answer to your problem is to look at
the assembly generated by your compiler for your algorithm (e.g., the .lst
file, or whatever your compiler produces), read the manual on the 2191 to
understand its (powerful) instruction set and hits taken for branches, etc.,
and then hand-optimize the assembly.

Most DSP engineers start by writing in assembly when they know that the
function is time-critical. Or, you can purchase (or be given free from ADI)
libraries of optimized assembly code implementations with C callable
wrappers. Either of these two approaches is much better then the answer to
your question, which is given in the above paragraph.

Also, in my opinion, and for the group, attempts to optimize C code given an
understanding of how it will be compiled requires such an understanding of
the C compiler that one is better off understanding the processor and
writing it in assembly to begin with. Comments?

Jim Gort

"praveen" <praveenkumar_11@yahoo.com> wrote in message
news:d8daf655.0312110246.5faf5bc2@posting.google.com...
> Hello, > I have implemented cordic for finding the atan in adsp 2191. But it > takes 3714 cycles for its execution. I have implemented it in c. Can > something tell me how can i optimize the code for that it takes less > than 500 cycles. > my code is > LUT is the lookup table > x and y are the two input whsoe atan to be determined > > for(i=0;i<=25;i++) > { > x1=x; > if (y>0) > { > x=x+(y>>i); > y=y-(x1>>i); > ang=ang+LUT[i]; > } > else > { > x=x-(y>>i); > y=y+(x1>>i); > ang=ang-LUT[i]; > } > > Please suggest me technic my which i can reduce the number of cycles > > Waiting for reply > With regards > praveen
"Jim Gort" <jgort@comcast.net> wrote in message
news:mJ9Cb.509423$Tr4.1413036@attbi_s03...
> Also, in my opinion, and for the group, attempts to optimize C code given
an
> understanding of how it will be compiled requires such an understanding of > the C compiler that one is better off understanding the processor and > writing it in assembly to begin with. Comments? > > Jim Gort
It's a shame you can't do both, really. C compilers often can't effectively determine when to use special processor features, but they are typically better than programmers at things like register allocation. If I were writing a development system for a DSP, it would probably be a C compiler with a few extra DSP datatypes and a lot of intrinsic functions. The intrinsic functions would be patterned after typical DSP processor features, would have C prototypes, and would be equivalent to C library functions. But the compiler for a given DSP would have a priori knowledge of these functions and would compile calls to them into assembler that explicitly uses the DSP processor features when present. It probably wouldn't take too much work to make a good library of intrinsics that have efficient implementations across a wide variety of DSPs.
Jim Gort wrote:

   ...


> Also, in my opinion, and for the group, attempts to optimize C code given an > understanding of how it will be compiled requires such an understanding of > the C compiler that one is better off understanding the processor and > writing it in assembly to begin with. Comments? > > Jim Gort
... The advantage of C or any other high-level language is that the code is portable. PPT, or Programmers Principal Tautology: portable code is useless if it too slow to be useful. (Otherwise known as AD -- Avins's Duh.) Personally I'd rather "just do it" in assembler than try to psych out a compiler. When I'm just learning a processor, fixing the compiled code (or trying to) makes sense. Once I know it fairly well, it doesn't. For one thing, I'm likely to factor the code differently for assembler, so the compiler output often isn't a good fit to my end result. Jerry Engineering is the art of making what you want from things you can get. &#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;
praveenkumar_11@yahoo.com (praveen) wrote in message news:<d8daf655.0312110246.5faf5bc2@posting.google.com>...
> Hello, > I have implemented cordic for finding the atan in adsp 2191. But it > takes 3714 cycles for its execution. I have implemented it in c. Can > something tell me how can i optimize the code for that it takes less > than 500 cycles. > my code is >
Optimisation of the algorithm: I guess, knowing the error (in the angle) that one can tolerate, optimisation on the number of entries in the LUT and the Number of iterations is possible. Bye Partha
"Matt Timmermans" <mt0000@sympatico.nospam-remove.ca> wrote in message news:<Q2bCb.11039$aF2.1223801@news20.bellglobal.com>...
> "Jim Gort" <jgort@comcast.net> wrote in message > news:mJ9Cb.509423$Tr4.1413036@attbi_s03... > > Also, in my opinion, and for the group, attempts to optimize C code given > an > > understanding of how it will be compiled requires such an understanding of > > the C compiler that one is better off understanding the processor and > > writing it in assembly to begin with. Comments? > > > > Jim Gort > > It's a shame you can't do both, really. C compilers often can't effectively > determine when to use special processor features, but they are typically > better than programmers at things like register allocation. If I were > writing a development system for a DSP, it would probably be a C compiler > with a few extra DSP datatypes and a lot of intrinsic functions. The > intrinsic functions would be patterned after typical DSP processor features, > would have C prototypes, and would be equivalent to C library functions. > But the compiler for a given DSP would have a priori knowledge of these > functions and would compile calls to them into assembler that explicitly > uses the DSP processor features when present.
Kinda' like Bittware, only free?
> It probably wouldn't take too much work to make a good library of intrinsics > that have efficient implementations across a wide variety of DSPs.
I'd suggest that meeting your goals of "efficient implementation" and "variety" are going to take a bit more work than you anticipate, given that one of my favorite "tricks" is to take working C or C++ code written by really smart people like you guys and make it run really fast (factors of 3-10) in assembly on the same processor or to migrate it to another processor. Within a family's architecture you're absolutely right but moving between vendors the internals can be significantly different especially when moving to or from PC-based code (modify Jerry's "AD" to apply to that latter!). With a little thought (I won't bore you with the details unless you're foolish enough to ask) one can make the translation highly testable and allow direct comparison of the results created by the two code sets. Further,you can provide for later recovery of the C code (using compiler switches) if needed. Warranted, this is not the most efficient code possible in any given environment because the differing methodology one could use starting from scratch in assembler but it seems to represent a suitable compromise between optimization, verification and maintenance to my customer base. Ken
"Ken Asbury" <avoidingspam2001@yahoo.com> wrote in message
news:30600528.0312120623.32e80468@posting.google.com...
> > Kinda' like Bittware, only free?
Not really like that. I'm talking about wee tiny simple functions that have obvious trivial intrinsic translations into CPU features. x86 CPUs, for example, have what amounts to a strcpy() instruction. Back in the days when function call overhead was considered significant, it was common for compilers to replace calls to strcpy with that simple instruction. It's like adding an operator to the language without changing the semantics of the language itself. For DSPs, you could use the same technique to add language support for lots of simple things like saturating arithmetic, fixed-point multiplication, circular buffers, zero overhead loops, etc.
> I'd suggest that meeting your goals of "efficient implementation" > and "variety" are going to take a bit more work than you > anticipate, given that one of my favorite "tricks" is to take > working C or C++ code written by really smart people like you > guys and make it run really fast (factors of 3-10) in assembly on > the same processor or to migrate it to another processor.
Yes, and the goal of an intrinsic library is to let you do much the same trick in C. The intrinsic functions would have trivial and obvious translations into assembly instructions for the features that the processor supports. For reatures that the processor doesn't support, you get some hand-optimized inline function that isn't quite as good. Because the translation to assembly on any given platform is obvious, you can hand-optimize C code for that platform in a predictable way, and your code would remain portable to the extent that your program would have the same semantic meaning across all platforms, even though you might want to re-optimize it for processors that were significantly different. You also get to let the compiler manage register allocation, stack shuffling, instruction scheduling, type checking, and all that tedious stuff that compilers are better at than people these days.
Praveen,

Please provide the following info:

Are you using single precision (16 bits) or double precision (32 bits)
each to store 'x' and 'y'?
Are you using integer or fractional math?
Are the values of 'x' and 'y' using the entire range of the number of
bits they are stored in?
How many bits represent 'ang'?
Why does i go from 0 to 25?
Describe the contents of your LUT.

Have you verified that the result at each iteration of the loop is
what you expected? How about the final results? For what range of
input angles?

Thanks,


Dirk

Dirk A. Bell
DSP Consultant


praveenkumar_11@yahoo.com (praveen) wrote in message news:<d8daf655.0312110246.5faf5bc2@posting.google.com>...
> Hello, > I have implemented cordic for finding the atan in adsp 2191. But it > takes 3714 cycles for its execution. I have implemented it in c. Can > something tell me how can i optimize the code for that it takes less > than 500 cycles. > my code is > LUT is the lookup table > x and y are the two input whsoe atan to be determined > > for(i=0;i<=25;i++) > { > x1=x; > if (y>0) > { > x=x+(y>>i); > y=y-(x1>>i); > ang=ang+LUT[i]; > } > else > { > x=x-(y>>i); > y=y+(x1>>i); > ang=ang-LUT[i]; > } > > Please suggest me technic my which i can reduce the number of cycles > > Waiting for reply > With regards > praveen
Jim Gort wrote:
 > [...]
> Also, in my opinion, and for the group, attempts to optimize C code given an > understanding of how it will be compiled requires such an understanding of > the C compiler that one is better off understanding the processor and > writing it in assembly to begin with. Comments?
AMEN BROTHER!!!!! Jim, you and I think alike! --Randy -- % Randy Yates % "...the answer lies within your soul %% Fuquay-Varina, NC % 'cause no one knows which side %%% 919-577-9882 % the coin will fall." %%%% <yates@ieee.org> % 'Big Wheels', *Out of the Blue*, ELO http://home.earthlink.net/~yatescr
Matt Timmermans wrote:

 >[...]
> It's a shame you can't do both, really.
You can. Write your time-critical code in assembly and make it C-callable. Intrinsics have the same problem as the one Jim addressed - in the time it takes to learn and apply them, you could've written in assembly, and the code is still less readable and unportable - two big reasons for writing in C to begin with. I think folks who cling to C are in denial - you're just gonna have to break down and code in assembly if you want optimum performance. Learn it. Live it. Love it. -- % Randy Yates % "...the answer lies within your soul %% Fuquay-Varina, NC % 'cause no one knows which side %%% 919-577-9882 % the coin will fall." %%%% <yates@ieee.org> % 'Big Wheels', *Out of the Blue*, ELO http://home.earthlink.net/~yatescr