DSPRelated.com
Forums

Need of efficient Sorting algorithm for C64x

Started by venk...@yahoo.co.in May 28, 2008
I am looking for most efficient sorting algorithm tailored for c64x (if not, for any dsp). My requirements and assumptions are

1. Number of elements for searching (Min, Max)= (2, 4300)
2. Data type is integer.
3. max. stack size = 4KB (if used by recursive sortings).

qosrt is available in rts.src and rts64plus.lib. I have checked the qsort.c in rts.src and observed it is not optimized. Under rts64plus.lib when I tried to extract the files using "ar6x.exe x rts64plus.lib" it generates the .obj files. I don't know whether this qosrt is optimized or not. Which document describes about these library contents?.

But anyway this is a recursive algorithm, which I would like to use at low priority, provided it is optimized, if there is no other choice. Can someone help on this.

Thanks in advance.

-Venki
One of my teacher told me one day, the bubble sort algorithm was the fastest one, he would like to find another one faster, but couldn't...And then came the qsort algorithm a lot faster (statistically of course).
Have you tried to use C directives for optimisation, it is usually really good.
It is strange to hear that the bubble sort could be fastest. Infact, it falls into the slowest sorting algorithms of complexity O(n2). Yeah I tried with C pragma directives, but no use. Any other algorithms??

-Venki
I am looking for most efficient sorting algorithm tailored for c64x (if not, for any dsp). My requirements and assumptions are
>
>1. Number of elements for searching (Min, Max)= (2, 4300)
>2. Data type is integer.
>3. max. stack size = 4KB (if used by recursive sortings).
>
>qosrt is available in rts.src and rts64plus.lib. I have checked the qsort.c in rts.src and observed it is not optimized. Under rts64plus.lib when I tried to extract the files using "ar6x.exe x rts64plus.lib" it generates the .obj files. I don't know whether this qosrt is optimized or not. Which document describes about these library contents?.
>
>But anyway this is a recursive algorithm, which I would like to use at low priority, provided it is optimized, if there is no other choice. Can someone help on this.
>
>Thanks in advance.
>
>-Venki
QSORT is the fastest (from the ons I know) but for sure faster than bubble.
So did yout try to optimise it with C directives?
Since the data type is known to be int then the fastest sort would generally
be a radix sort. The biggest problem that you could run into with using it
on a 32 bit type would be the maximum recursion depth can be 32.

Greg

On Wed, May 28, 2008 at 5:32 AM, wrote:

> It is strange to hear that the bubble sort could be fastest. Infact, it
> falls into the slowest sorting algorithms of complexity O(n2). Yeah I tried
> with C pragma directives, but no use. Any other algorithms??
>
> -Venki
> I am looking for most efficient sorting algorithm tailored for c64x (if
> not, for any dsp). My requirements and assumptions are
> >
> >1. Number of elements for searching (Min, Max)= (2, 4300)
> >2. Data type is integer.
> >3. max. stack size = 4KB (if used by recursive sortings).
> >
> >qosrt is available in rts.src and rts64plus.lib. I have checked the
> qsort.c in rts.src and observed it is not optimized. Under rts64plus.lib
> when I tried to extract the files using "ar6x.exe x rts64plus.lib" it
> generates the .obj files. I don't know whether this qosrt is optimized or
> not. Which document describes about these library contents?.
> >
> >But anyway this is a recursive algorithm, which I would like to use at low
> priority, provided it is optimized, if there is no other choice. Can someone
> help on this.
> >
> >Thanks in advance.
> >
> >-Venki
>
>
>
>
>
>
>
Christopher, others,

Actually, for large data sets, a sieve sort is much faster.
This is because a sieve sort uses QSORT for the major operations and when the interval size is below
~10 entries, invokes INSERTION sort for the small intervals,
Note:
INSERTION sort is much faster for small data set intervals (say 10 entries or less) than is QSORT.

Also, if the data is longer than say a word or two, then it is much faster to have an array of
pointers and only move the pointers during the sorting operation. This is because any data movement
is very time consuming.

Also Hashing the data as it is collected and then only sorting the individual hash lists is much
faster for certain kinds of data collections.
R. Williams
---------- Original Message -----------
From: christophe blouet
To: ,
Sent: Wed, 28 May 2008 16:56:22 +0200
Subject: RE: [c6x] Re: Need of efficient Sorting algorithm for C64x

> QSORT is the fastest (from the ons I know) but for sure faster than bubble.
> So did yout try to optimise it with C directives?
------- End of Original Message -------
> Subject: Need of efficient Sorting algorithm for C64x
> Posted by: "v...@yahoo.co.in" v...@yahoo.co.in
> Date: Wed May 28, 2008 12:12 am ((PDT))
>
> I am looking for most efficient sorting algorithm tailored for c64x (if not,
> for any dsp). My requirements and assumptions are
>
> 1. Number of elements for searching (Min, Max)= (2, 4300)
>
> 2. Data type is integer.

It suggests to use LDDW to bring several words into the registers for
comparisons.

> 3. max. stack size = 4KB (if used by recursive sortings).
>
> qosrt is available in rts.src and rts64plus.lib. I have checked the qsort.c
> in rts.src and observed it is not optimized. Under rts64plus.lib when I tried
> to extract the files using "ar6x.exe x rts64plus.lib" it generates the .obj
> files. I don't know whether this qosrt is optimized or not. Which document
> describes about these library contents?.

No, I do not think it is optimal. First of all, the qsort() uses (that is
calls) a function to compare two keys. Its swap_item() function is also
seem to be not optimal. It needs some work to be done before it processes
integers in a near best optimal way.

Basically I'd replaced two (*compar)() calls with simple < and > ops and
also perform swapping of words inline, i.e. to remove swap_item() calls.
But then, there are two recursion calls of qsort() remain, which result in
stack save/restore loads/stores and four jumps instructions.

> But anyway this is a recursive algorithm, which I would like to use at low
> priority, provided it is optimized, if there is no other choice. Can someone
> help on this.

There is a review of a variety of sorting algorithms on Wikipedia; however so
far I did not see any thorough analysis of a set of such in therms of performance
specifically for the C6000 architecture.

The problem is that the C6000 architecture is a deep pipeline. Once its flow
is interrupted (e.g. by a jump due to a comparison condition), the performance
will drop. Sorting with its plentitude of comparisons would interrupt the flow
many many times. Consider the price of restoring pipeline (in number of CPU
cycles) is higher that a compare operation, which is just a single CPU cycle,
sorting algorithms might be a nightmare to perform on the C6000. Of course it
depends on keys' type. If there is a complex algorithm to compare two keys,
one would need to optimize the (*compar)() function.

Good luck in the search.

--
Andrew

> Thanks in advance.
>
> -Venki
>
Yes I have already gone through the comprehensive description of all sorting algorithms from Wikipedia. But not convinced of using any of those algos for my requirement. But I hoped someone would have already used sorting algs as part of C6x optimization and I might get the one. I may need to tailor any of the existing algorithm to suit for my app..My search is on and on....
 
-Venki

--- On Thu, 29/5/08, Andrew Nesterov <a...@techemail.com> wrote:

From: Andrew Nesterov <a...@techemail.com>
Subject: [c6x] Re: Need of efficient Sorting algorithm for C64x
To: c...
Date: Thursday, 29 May, 2008, 2:04 AM

> Subject: Need of efficient Sorting algorithm for C64x
> Posted by: "venkisoft4u@ yahoo.co. in" venkisoft4u@ yahoo.co. in
> Date: Wed May 28, 2008 12:12 am ((PDT))
>
> I am looking for most efficient sorting algorithm tailored for c64x (if not,
> for any dsp). My requirements and assumptions are
>
> 1. Number of elements for searching (Min, Max)= (2, 4300)
>
> 2. Data type is integer.

It suggests to use LDDW to bring several words into the registers for
comparisons.

> 3. max. stack size = 4KB (if used by recursive sortings).
>
> qosrt is available in rts.src and rts64plus.lib. I have checked the qsort.c
> in rts.src and observed it is not optimized. Under rts64plus.lib when I tried
> to extract the files using "ar6x.exe x rts64plus.lib" it generates the .obj
> files. I don't know whether this qosrt is optimized or not. Which document
> describes about these library contents?.

No, I do not think it is optimal. First of all, the qsort() uses (that is
calls) a function to compare two keys. Its swap_item() function is also
seem to be not optimal. It needs some work to be done before it processes
integers in a near best optimal way.

Basically I'd replaced two (*compar)() calls with simple < and > ops and
also perform swapping of words inline, i.e. to remove swap_item() calls.
But then, there are two recursion calls of qsort() remain, which result in
stack save/restore loads/stores and four jumps instructions.

> But anyway this is a recursive algorithm, which I would like to use at low
> priority, provided it is optimized, if there is no other choice. Can someone
> help on this.

There is a review of a variety of sorting algorithms on Wikipedia; however so
far I did not see any thorough analysis of a set of such in therms of performance
specifically for the C6000 architecture.

The problem is that the C6000 architecture is a deep pipeline. Once its flow
is interrupted (e.g. by a jump due to a comparison condition), the performance
will drop. Sorting with its plentitude of comparisons would interrupt the flow
many many times. Consider the price of restoring pipeline (in number of CPU
cycles) is higher that a compare operation, which is just a single CPU cycle,
sorting algorithms might be a nightmare to perform on the C6000. Of course it
depends on keys' type. If there is a complex algorithm to compare two keys,
one would need to optimize the (*compar)() function.

Good luck in the search.

--
Andrew

> Thanks in advance.
>
> -Venki
>
> Posted by: "venkatesh m" v...@yahoo.co.in
> Date: Fri May 30, 2008 4:17 am ((PDT))
>
> Yes I have already gone through the comprehensive description of all sorting
> algorithms from Wikipedia. But not convinced of using any of those algos for
> my requirement. But I hoped someone would have already used sorting algs as
> part of C6x optimization and I might get the one. I may need to tailor any of
> the existing algorithm to suit for my app..My search is on and on....
>  
> -Venki

I thought that I had lost this reference, but eventually I found it
again on yet another machine. It has a collection of sorting algorithms:

http://www.cs.ubc.ca/spider/harrison/Java/sorting-demo.html

Once (in fact, long ago) I tried to benchmark these subroutines; without
assembly optimizations, just plain ANSI C code. The data was a 32 bit int
array of 1024 random numbers generated by rand() calls.

C6711 @ 150MHz CACHE OFF CACHE ON
CPU CLOCK 89601868 BidirectionalBubbleSort 3065746
CPU CLOCK 119410211 BubbleSort2 4206564
CPU CLOCK 4808573 CombSort11 173927
CPU CLOCK 4319722 EQSort 187900
CPU CLOCK 8669056 ExtraStorageMergeSort 440712
CPU CLOCK 2639429 FastQSort 133676
CPU CLOCK 3652629 HeapSort 262553
CPU CLOCK 48988231 InsertionSort 2027650
CPU CLOCK 52607276 MergeSort 1751966
CPU CLOCK 4246608 QSort 198221
CPU CLOCK 3863494 QubbleSort 167380
CPU CLOCK 884415975 RadixSort 101566934
CPU CLOCK 112162429 SelectionSort 3685924
CPU CLOCK 67119799 ShakerSort 1850762
CPU CLOCK 4964646 ShellSort 272722
CPU CLOCK 11947415 StdLib qsort 886936

C6416 @ 600MHz ONCHIP SRAM
CPU CLOCK 3065348 BidirectionalBubbleSort
CPU CLOCK 4205783 BubbleSort2
CPU CLOCK 176139 CombSort11
CPU CLOCK 186200 EQSort
CPU CLOCK 419065 ExtraStorageMergeSort
CPU CLOCK 138958 FastQSort
CPU CLOCK 257569 HeapSort
CPU CLOCK 2028026 InsertionSort
CPU CLOCK 1748467 MergeSort
CPU CLOCK 196739 QSort
CPU CLOCK 169104 QubbleSort
CPU CLOCK 91154554 RadixSort
CPU CLOCK 3750264 SelectionSort
CPU CLOCK 1885050 ShakerSort
CPU CLOCK 277583 ShellSort
CPU CLOCK 759184 StdLib qsort

Interesting, how a java example applet may differ from compiled code
performance, if you would compare e.g. the RadixSort(). The FastQSort()
definitely looks better than the RTS lib qsort().

Regards,
Andrew
Hi Andrew,
That was good information. My requirement point of view, the sorting is a part of my whole application and we expect its Mcps contiribution to the application to be ignorable. At last in our application, the no. of elements for comparision are reduced significantly to alleviate the sorting Mcps effect.
 
It was good learing from all these posts. Thanks for all the inputs and suggestions.
 
-Venki 

--- On Mon, 2/6/08, Andrew Nesterov <a...@techemail.com> wrote:

From: Andrew Nesterov <a...@techemail.com>
Subject: [c6x] Re: Need of efficient Sorting algorithm for C64x
To: c...
Date: Monday, 2 June, 2008, 1:34 AM

> Posted by: "venkatesh m" venkisoft4u@ yahoo.co. in
> Date: Fri May 30, 2008 4:17 am ((PDT))
>
> Yes I have already gone through the comprehensive description of all sorting
> algorithms from Wikipedia. But not convinced of using any of those algos for
> my requirement. But I hoped someone would have already used sorting algs as
> part of C6x optimization and I might get the one. I may need to tailor any of
> the existing algorithm to suit for my app..My search is on and on....
> &nbsp;
> -Venki

I thought that I had lost this reference, but eventually I found it
again on yet another machine. It has a collection of sorting algorithms:

http://www.cs. ubc.ca/spider/ harrison/ Java/sorting- demo.html

Once (in fact, long ago) I tried to benchmark these subroutines; without
assembly optimizations, just plain ANSI C code. The data was a 32 bit int
array of 1024 random numbers generated by rand() calls.

C6711 @ 150MHz CACHE OFF CACHE ON
CPU CLOCK 89601868 BidirectionalBubble Sort 3065746
CPU CLOCK 119410211 BubbleSort2 4206564
CPU CLOCK 4808573 CombSort11 173927
CPU CLOCK 4319722 EQSort 187900
CPU CLOCK 8669056 ExtraStorageMergeSo rt 440712
CPU CLOCK 2639429 FastQSort 133676
CPU CLOCK 3652629 HeapSort 262553
CPU CLOCK 48988231 InsertionSort 2027650
CPU CLOCK 52607276 MergeSort 1751966
CPU CLOCK 4246608 QSort 198221
CPU CLOCK 3863494 QubbleSort 167380
CPU CLOCK 884415975 RadixSort 101566934
CPU CLOCK 112162429 SelectionSort 3685924
CPU CLOCK 67119799 ShakerSort 1850762
CPU CLOCK 4964646 ShellSort 272722
CPU CLOCK 11947415 StdLib qsort 886936

C6416 @ 600MHz ONCHIP SRAM
CPU CLOCK 3065348 BidirectionalBubble Sort
CPU CLOCK 4205783 BubbleSort2
CPU CLOCK 176139 CombSort11
CPU CLOCK 186200 EQSort
CPU CLOCK 419065 ExtraStorageMergeSo rt
CPU CLOCK 138958 FastQSort
CPU CLOCK 257569 HeapSort
CPU CLOCK 2028026 InsertionSort
CPU CLOCK 1748467 MergeSort
CPU CLOCK 196739 QSort
CPU CLOCK 169104 QubbleSort
CPU CLOCK 91154554 RadixSort
CPU CLOCK 3750264 SelectionSort
CPU CLOCK 1885050 ShakerSort
CPU CLOCK 277583 ShellSort
CPU CLOCK 759184 StdLib qsort

Interesting, how a java example applet may differ from compiled code
performance, if you would compare e.g. the RadixSort(). The FastQSort()
definitely looks better than the RTS lib qsort().

Regards,
Andrew