Sort Home | Count Sort, Quick Sort, SQL |
Source Code | History |
Thanks to the National Institute of Standards for linking to this page.
Sorts using this method have been benchmarked at 40% to 300% faster than Quick Sort.
A count or radix sort can sort data in two passes. I find the term histogram sort has certain advantages, so I will use that term here, although the basic concept is the same as for the count or radix sort. I find the term histogram easier to work with than "a table of the counts of the keys of the items to be sorted", though the meaning is the same.
I believe these sorts to be currently thought of as double-buffered sorts, but I have discovered a method of performing a histogram sort with only one buffer, using an extra histogram instead of the second buffer. Practical version of the histogram sort use nested histograms. The double-buffered version of the histogram sort would use two buffers for the items to be sorted, and a number of histograms, depending on the length of the keys involved and the size of the histograms. If the double-buffered version of the sort requires N histograms then the single buffered version requires N+1 histograms.
The algorithm is as follows: The items to be sorted are divided into a number of regions, one for each key. If the key length for a particular pass of the algorithm is 8 bits, then there would be 256 such regions. Two identical histograms are created. The histograms are integrated using a process of numerical integration to convert them to memory allocation pointers. One histogram is designated as the active histogram, and the other is designated as the reference histogram.
Region zero is designated as a donor region. An item is removed from the donor region, and is converted into a index to another region. The index thus created is used to select an element of the active histogram, and the item is stored in a memory location pointed to by the selected element of the active histogram, but before that is done, the data in that memory location is saved to a variable. The selected element of the active histogram is incremented. The data saved to the variable is converted into an index to another region. The process is repeated until the selected region is the donor region.
At that point, the active data is placed into the vacant location of the donor region and the member of the active histogram is incremented. The incremented element of the active histogram is compared to the next element of the reference histogram. If the increment element is less than the reference element, then new data is taken from the donor region at the location pointed to by the incremented element, and the process is repeated.
If the incremented element equals the reference element, all the items in the donor region have been sorted to their correct locations, and a new donor region must be found. The donor region pointer is incremented. If the active histogram element pointed to by the new donor region is less than the next reference histogram element, then data is taken from that region to continue sorting. If the active histogram element is equal to the next element of the reference histogram, then the donor region pointer is incremented again. When the donor region pointer points past the last element in the active histogram, then the algorithm is completed, and all the items are sorted.
This sounds complex, but can be done in a few lines of code. A more complicated version of this sort runs faster than the double buffered version of the histogram sort on a Pentium system, probably due to caching. More details are available in PDF format. Right click to download. Left click to read