Hello everybody,
I'm developing a program that calculates the parameters of an image segment.
It will run on a TI TMS320C6416 DSP. For that purpose I'm using a routine
based on the Levenberg-Marquardt optimization algorithm. The DSP is
fixed-point and the variables are floating-point doubles, so I'm using the
RTS library. I would be very grateful for advice and answers to my
questions:
- How can I make the program run faster? It calculates 6 parameters
for ~3000 points, and that takes 3 minutes.
- I've tried the FastRTS library, but there was no obvious
difference. How can I check which library the compiler used?
- How can I see the performance of the individual functions?
Thank you for your time.
Good luck,
Vladimir
Extremely slow processing
Started by ●May 14, 2007
Reply by ●May 14, 2007
Vladimir-
> I'm developing program that will calculate the parameters of image segment,
> this program will operate on TI TMS320C6416 DSP. For that purpose I'm using
> routine that is based on Levenberg-Marquardt optimization algorithm. The DSP
> is fixed point and the variables are floating point of double type, thus I'm
> using RTS library. I would be very grateful for advice and answers for my
> questions:
>
> - How to make program run faster? The program calculates 6 parameters
> for ~3000 points, the time is 3 minutes.
> - I've tried to use the FastRTS library, but there was no obvious
> difference. How to check what library used compiler?
> - How to see the performance of the different functions?
Each floating-point operation on a 64x series device is a software routine. That's
going to slow down your code by a factor of 100 or more. If you could replace any of
these calls with fixed-point operations, it would help. 16x16 multiplies would be
fastest (single instruction); 32x32 multiplies using intrinsics would take a few
instructions but would still be significantly faster.
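To make the 16x16 suggestion concrete, here is a minimal sketch of a Q15 fixed-point multiply in portable C. The Q15 format, the type name q15_t and the helper names are my assumptions for illustration, not something from the original code, and whether Q15 has enough range for the LM parameters would have to be checked separately.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a signed 16-bit integer q stands for the real value q / 32768.
   The type and all function names here are hypothetical. */
typedef int16_t q15_t;

/* Convert a double in [-1, 1) to Q15, rounding to nearest and saturating. */
q15_t q15_from_double(double x)
{
    int32_t v;
    assert(x >= -1.0 && x < 1.0);
    v = (int32_t)(x * 32768.0 + (x >= 0.0 ? 0.5 : -0.5));
    if (v > 32767) v = 32767;   /* values just below 1.0 round up to 32768 */
    return (q15_t)v;
}

/* Convert Q15 back to double (for checking results only). */
double q15_to_double(q15_t q)
{
    return (double)q / 32768.0;
}

/* One 16x16 -> 32-bit multiply, then a rounding shift back to Q15.
   Assumes >> on a negative int32_t is an arithmetic shift, which is
   implementation-defined in C but usual on DSP and desktop compilers. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round to Q15 */
    if (p > 32767) p = 32767;              /* only (-1) * (-1) overflows */
    return (q15_t)p;
}
```

For example, q15_mul(q15_from_double(0.5), q15_from_double(0.5)) gives the Q15 value 8192, i.e. 0.25. The multiply itself is integer-only, so no floating-point RTS routine is called in the inner loop.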
-Jeff
Reply by ●May 15, 2007
> Posted by: "Vladimir Matvejev" V...@gmail.com
> Date: Mon May 14, 2007 5:38 am ((PDT))
>
> I'm developing program that will calculate the parameters of image segment,
> this program will operate on TI TMS320C6416 DSP. For that purpose I'm using
> routine that is based on Levenberg-Marquardt optimization algorithm. The DSP
> is fixed point and the variables are floating point of double type, thus I'm
> using RTS library. I would be very grateful for advice and answers for my
> questions:
>
> - How to make program run faster? The program calculates 6 parameters
> for ~3000 points, the time is 3 minutes.
> - I've tried to use the FastRTS library, but there was no obvious
> difference. How to check what library used compiler?
> - How to see the performance of the different functions?
Hi Vladimir,
Have you compared the number of iterations LM takes to converge on a Pentium
vs. the 6416? It won't be exactly the same, but it should be of the same
order of magnitude. If the iteration count is significantly different, the
code may need checking. By the way, are you using the code from the NRC? If
so, you'd better not :)
Have you tried calculating with single-precision floats? If you set the
tolerance to e.g. 0.01%, that is still within single-precision accuracy.
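As a rough illustration of the single-precision point: a relative tolerance of 0.01% is 1e-4, while float resolves relative differences down to about 1.2e-7 (FLT_EPSILON). The sketch below is only an assumption about how a convergence test might look. The function names and the factor-of-100 safety margin are made up for the example.

```c
#include <assert.h>
#include <float.h>

/* Absolute value without pulling in the math library. */
static float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Hypothetical stopping test: returns 1 when the relative change in the
   cost between two iterations is at most rel_tol. */
int lm_converged(float prev_cost, float cost, float rel_tol)
{
    float scale;
    assert(rel_tol > 0.0f);
    scale = abs_f(prev_cost) > FLT_MIN ? abs_f(prev_cost) : FLT_MIN;
    return abs_f(prev_cost - cost) <= rel_tol * scale;
}

/* Returns 1 when rel_tol sits comfortably above float resolution.
   The margin of 100 is an arbitrary choice for this sketch. */
int tol_fits_float(float rel_tol)
{
    return rel_tol >= 100.0f * FLT_EPSILON;
}
```

With these definitions tol_fits_float(1e-4f) is true, whereas a tolerance such as 1e-8 could never be met reliably in float and would need double.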
The linker resolves unresolved symbols from the first library it encounters,
so you may want to set the link order so that the Fast RTS is searched before
the standard RTS. I bet this is why you saw no difference between linking
with the standard RTS and with the Fast RTS. Be careful: even if you use the
Fast RTS, you'll still need to link in some things from the standard RTS,
e.g. the stdio functions.
Performance can be measured with either the CPU clock counter
(Profiler->Enable Clock) or in a profile session (Profiler->Start New Session).
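If you prefer to measure from inside the program, one possible approach is to bracket a call with the standard C clock() function. This is only a sketch: time_call and demo_workload are hypothetical names, and what unit clock() counts in on the target (cycles or something coarser, and whether it depends on the profiler clock being enabled) is an assumption you should verify for your tools.

```c
#include <assert.h>
#include <time.h>

/* Signature of the routine being timed (hypothetical). */
typedef void (*work_fn)(void);

/* Written by the demo workload so the compiler cannot discard the loop. */
volatile double demo_result;

/* Stand-in for the real routine: sums i*i in double arithmetic, which
   goes through software floating point on a fixed-point device. */
void demo_workload(void)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 10000; ++i) {
        sum += (double)i * (double)i;
    }
    demo_result = sum;
}

/* Returns the clock() ticks that elapsed while fn ran. */
clock_t time_call(work_fn fn)
{
    clock_t start;
    assert(fn != 0);
    start = clock();
    fn();
    return clock() - start;
}
```

Calling time_call(demo_workload) and then time_call on each stage of the LM routine gives a per-function comparison without a full profile session.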
I forget what order LM is, and whether it is constrained or unconstrained. You
might speed up convergence with e.g. a second-order unconstrained NR
using a finite-difference approximation of the derivatives.
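To show what I mean by NR with finite-difference derivatives, here is a one-parameter sketch. It is not LM and not your routine: the names, the step size h and the example cost are all assumptions, and with 6 parameters you would need a gradient vector and a 6x6 Hessian instead of two scalars.

```c
#include <assert.h>

/* Cost function of one parameter (hypothetical signature). */
typedef double (*cost_fn)(double);

/* Newton-Raphson minimization in one dimension. The first and second
   derivatives are approximated by central differences with step h. */
double newton_minimize_fd(cost_fn f, double x, double h, double tol,
                          int max_iter)
{
    int i;
    assert(h > 0.0 && tol > 0.0 && max_iter > 0);
    for (i = 0; i < max_iter; ++i) {
        double fp = f(x + h);
        double f0 = f(x);
        double fm = f(x - h);
        double grad = (fp - fm) / (2.0 * h);
        double hess = (fp - 2.0 * f0 + fm) / (h * h);
        double step;
        if (hess <= 0.0) {
            break;              /* not locally convex: plain NR would fail */
        }
        step = grad / hess;
        x -= step;
        if ((step < 0.0 ? -step : step) < tol) {
            break;              /* step is small enough: converged */
        }
    }
    return x;
}

/* Example cost with its minimum at x = 2 (made up for the demo). */
double example_cost(double x)
{
    return (x - 2.0) * (x - 2.0) + 1.0;
}
```

For example, newton_minimize_fd(example_cost, 10.0, 1e-4, 1e-9, 50) lands very close to 2 after a couple of iterations, because the cost is quadratic. Note that every iteration costs three function evaluations per parameter, so on a device without floating-point hardware this only pays off if it reduces the iteration count substantially.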
Rgds,
Andrew