Hello everybody,

I'm developing a program that calculates the parameters of an image segment. It will run on a TI TMS320C6416 DSP. For that purpose I'm using a routine based on the Levenberg-Marquardt optimization algorithm. The DSP is fixed point and the variables are double-precision floating point, so I'm using the RTS library. I would be very grateful for advice and answers to my questions:

- How can I make the program run faster? It calculates 6 parameters for ~3000 points, and this takes 3 minutes.

- I've tried the FastRTS library, but there was no obvious difference. How can I check which library the compiler used?

- How can I see the performance of the individual functions?

Thank you for your time.

Good luck,

Vladimir

# Extremely slow processing

Started by ●May 14, 2007

Reply by ●May 14, 2007

Vladimir-

> I'm developing program that will calculate the parameters of image segment,
> this program will operate on TI TMS320C6416 DSP. For that purpose I'm using
> routine that is based on Levenberg-Marquardt optimization algorithm. The DSP
> is fixed point and the variables are floating point of double type, thus I'm
> using RTS library. I would be very grateful for advice and answers for my
> questions:
>
> - How to make program run faster? The program calculates 6 parameters
> for ~3000 points, the time is 3 minutes.
> - I've tried to use the FastRTS library, but there was no obvious
> difference. How to check what library used compiler?
> - How to see the performance of the different functions?

Each floating-point operation on a 64x series device is a software routine. That's going to slow down your code by a factor of 100 or more. If you could replace any of these calls with fixed-point operations, it would help. 16x16 multiplies would be fastest (single instruction); 32x32 multiplies using intrinsics would take a few instructions but still be significantly faster.
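As a rough illustration of what replacing a floating-point multiply with a 16x16 fixed-point one can look like, here is a minimal sketch in portable C using the common Q15 convention. The Q15 format and the helper names `q15_from_double` and `q15_mul` are assumptions made for this example, not something from the original program, and whether the data can be scaled into [-1, 1) depends on the application.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a 16-bit signed integer v stands for the real value v / 32768. */
#define Q15_ONE 32768

/* Convert a real value in [-1, 1) to Q15 (hypothetical helper). */
static int16_t q15_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return (int16_t)(x * Q15_ONE);
}

/* Multiply two Q15 values with a single 16x16 -> 32-bit multiply. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* result is in Q30 */

    if (product == 0x40000000) {
        return INT16_MAX;                        /* -1 * -1 would overflow: saturate */
    }
    /* Shift back to Q15; assumes an arithmetic right shift for negative
       values, which mainstream compilers provide. */
    return (int16_t)(product >> 15);
}
```

For example, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` gives 8192, which is 0.25 in Q15.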

-Jeff


Reply by ●May 15, 2007

> Posted by: "Vladimir Matvejev" V...@gmail.com
> Date: Mon May 14, 2007 5:38 am (PDT)
>
> I'm developing program that will calculate the parameters of image segment,
> this program will operate on TI TMS320C6416 DSP. For that purpose I'm using
> routine that is based on Levenberg-Marquardt optimization algorithm. The DSP
> is fixed point and the variables are floating point of double type, thus I'm
> using RTS library. I would be very grateful for advice and answers for my
> questions:
>
> - How to make program run faster? The program calculates 6 parameters
> for ~3000 points, the time is 3 minutes.
> - I've tried to use the FastRTS library, but there was no obvious
> difference. How to check what library used compiler?
> - How to see the performance of the different functions?

Hi Vladimir,

Have you compared the number of iterations it takes for LM to converge on a Pentium vs. the 6416? It won't be exactly the same, but it should be of the same order of magnitude. If the iteration count is significantly different, the code may need checking. By the way, are you using the code from the NRC? If so, you'd better not :)

Have you tried calculating with single-precision floats? If you set the tolerance to e.g. 0.01%, that is still within single-precision accuracy.
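To make the tolerance argument concrete: 0.01% is a relative tolerance of 1e-4, while single precision resolves relative differences down to about 1.2e-7 (`FLT_EPSILON`). A minimal sketch of a relative convergence test in `float` follows; the name `converged_f` and the exact form of the test are assumptions for illustration, not taken from the original LM routine.

```c
#include <assert.h>
#include <float.h>

/* Absolute value without pulling in the math library. */
static float abs_f(float x)
{
    return x < 0.0f ? -x : x;
}

/* Return 1 when two successive estimates agree to within rel_tol
   (relative to the current estimate), 0 otherwise. */
static int converged_f(float prev, float curr, float rel_tol)
{
    /* The tolerance must be coarser than what float can resolve. */
    assert(rel_tol > FLT_EPSILON);
    return abs_f(curr - prev) <= rel_tol * abs_f(curr);
}
```

With `rel_tol = 1e-4f`, a change from 1.0 to 1.00001 counts as converged, while a change from 1.0 to 1.01 does not.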

The linker resolves each unresolved symbol from the first library in which it finds it, so you may want to set the link order so that the Fast RTS is searched before the standard RTS. I bet this is why you saw no difference between linking with the standard RTS and with the Fast RTS. Be careful: even if you use the Fast RTS, you'll still need to link in some things from the standard RTS, e.g. the stdio functions.

Performance can be measured with either the CPU clock counter (Profiler->Enable Clock) or a profile session (Profiler->Start New Session).

I forget what the order of LM is, and whether it is constrained or unconstrained. You might speed up convergence by using e.g. a second-order unconstrained NR with finite-difference derivative approximations.
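For reference, here is a one-dimensional sketch of a Newton-Raphson minimizer whose first and second derivatives are approximated by central finite differences. It only illustrates the idea; the real problem has 6 parameters and would need a gradient vector and a Hessian matrix. The names `newton_minimize_fd` and `example_objective`, and the step size and tolerance a caller picks, are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*objective_fn)(double);

/* Minimize f starting from x, using central differences with step h. */
static double newton_minimize_fd(objective_fn f, double x, double h,
                                 int max_iter, double tol)
{
    int i;

    assert(f != NULL && h > 0.0 && max_iter > 0);
    for (i = 0; i < max_iter; ++i) {
        double fp = f(x + h);
        double fm = f(x - h);
        double f0 = f(x);
        double grad = (fp - fm) / (2.0 * h);           /* approximates f'(x)  */
        double curv = (fp - 2.0 * f0 + fm) / (h * h);  /* approximates f''(x) */
        double step;

        if (curv <= 0.0) {
            break;               /* not locally convex: the Newton step is unreliable */
        }
        step = grad / curv;
        x -= step;               /* Newton update: x <- x - f'(x) / f''(x) */
        if ((step < 0.0 ? -step : step) < tol) {
            break;               /* step is small enough: treat as converged */
        }
    }
    return x;
}

/* Simple test objective with its minimum at x = 2. */
static double example_objective(double x)
{
    return (x - 2.0) * (x - 2.0) + 1.0;
}
```

Each iteration costs three function evaluations here; in several dimensions the number of evaluations needed for a finite-difference Hessian grows quickly, which is worth weighing against any gain in convergence speed.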

Rgds,

Andrew
