I have the impression that hashing as it is usually practiced requires storing both a hash data structure and the original data.
It occurred to me that histograms could be used to transform a sorted list of data items into a structure that is smaller than the original data but still supports hashing, so that only the hash data needs to be kept. The original data could then be removed and re-created from the hash data if required.
If this hasn't already been done, then here it is. If it has already been done, perhaps this code could still be of help to someone.
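To make the idea concrete, here is a rough sketch of one way a histogram can turn a sorted list of 32-bit keys into a smaller structure that still supports lookup. This is only my illustration of the principle, not the code from the archive: keys are bucketed on their high 16 bits, the histogram of bucket sizes is turned into a cumulative offset table, and only the low 16 bits of each key are stored, because the high bits are implied by the bucket a key sits in.

    /* Sketch only: histogram-based "hash" over sorted, unique 32-bit keys.
       Storage is 2 bytes per key plus a fixed offset table of (2^16 + 1)
       32-bit counts, roughly 256 KB.                                       */
    #include <stdint.h>
    #include <stdlib.h>

    #define BUCKET_BITS 16
    #define NBUCKETS    (1u << BUCKET_BITS)

    typedef struct {
        uint32_t *offsets;    /* cumulative histogram, NBUCKETS + 1 entries */
        uint16_t *residuals;  /* low 16 bits of each key, in sorted order   */
        uint32_t  count;      /* number of unique keys                      */
    } hist_hash;

    /* Build the structure from a sorted array of unique 32-bit keys. */
    static int hist_hash_build(hist_hash *h, const uint32_t *sorted, uint32_t n)
    {
        h->offsets   = calloc(NBUCKETS + 1, sizeof *h->offsets);
        h->residuals = malloc((size_t)n * sizeof *h->residuals);
        if (h->offsets == NULL || h->residuals == NULL)
            return -1;
        h->count = n;

        for (uint32_t i = 0; i < n; i++)          /* histogram of bucket sizes */
            h->offsets[(sorted[i] >> BUCKET_BITS) + 1]++;
        for (uint32_t b = 0; b < NBUCKETS; b++)   /* prefix sums -> offsets    */
            h->offsets[b + 1] += h->offsets[b];
        for (uint32_t i = 0; i < n; i++)          /* keep only the low bits    */
            h->residuals[i] = (uint16_t)sorted[i];
        return 0;
    }

    /* Return the position of key in the original sorted data, or -1 if absent. */
    static long hist_hash_find(const hist_hash *h, uint32_t key)
    {
        uint32_t lo  = h->offsets[key >> BUCKET_BITS];
        uint32_t hi  = h->offsets[(key >> BUCKET_BITS) + 1];
        uint32_t end = hi;
        uint16_t r   = (uint16_t)key;

        while (lo < hi) {                         /* binary search inside the bucket */
            uint32_t mid = lo + (hi - lo) / 2;
            if (h->residuals[mid] < r)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo < end && h->residuals[lo] == r) ? (long)lo : -1;
    }

    /* Re-create the i-th original key: find its bucket, re-attach the high bits. */
    static uint32_t hist_hash_item(const hist_hash *h, uint32_t i)
    {
        uint32_t b = 0;
        while (h->offsets[b + 1] <= i)            /* linear scan kept simple here */
            b++;
        return (b << BUCKET_BITS) | h->residuals[i];
    }

With 2^16 buckets this costs roughly two bytes per unique key plus about 256 KB of offsets, which would be consistent with the compression ratios close to 0.5 reported below for the larger inputs. The "max list length" figure in the output below presumably caps how many keys a single list (bucket) may hold, which is why longer lists trade compression against access speed.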
I have two programs. One processes a list of 32-bit integers, and the other a list of variable-length strings containing only the characters allowed in C symbols.
Here is the annotated output of the program creating a hash table for 50 million 32-bit integers. Source code for that program is at:
http://mcky.net/bin/hsh/hsh.tar.gz
http://mcky.net/bin/hsh/hsh.zip
The code was developed on a 64-bit Linux system.
Note that the compression could be improved if we only want to know whether a queried item was in the original data set and don't care about where it was located.
The hash data is smaller than the original data. It would not be possible to simply replace the original data with the hash data while building it, but it probably wouldn't be necessary to go with a completely double-buffered solution either: some extra storage would be needed, but not twice the size of the data.
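To illustrate why full double buffering should not be needed, under the simplified layout sketched earlier the two-byte residuals can be written over the front of the very buffer that holds the four-byte sorted keys: residual i lands at byte 2*i, which only overlaps keys that have already been read. Only the offset table then needs separate storage. (Again, this is my sketch; the layout in the real code may differ.)

    /* Illustration only: overwrite the front of the sorted-key buffer with the
       16-bit residuals.  Key i is read from byte offset 4*i before any write
       reaches it, since writes up to step i only cover bytes 0 .. 2*i + 1.
       Afterwards the buffer can be shrunk to 2*n bytes with realloc().        */
    #include <stdint.h>
    #include <string.h>

    static void residuals_in_place(uint32_t *sorted, uint32_t n)
    {
        unsigned char *base = (unsigned char *)sorted;
        for (uint32_t i = 0; i < n; i++) {
            uint16_t res = (uint16_t)sorted[i];   /* read the key first */
            memcpy(base + 2 * (size_t)i, &res, sizeof res);
        }
    }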
integers | 50000000 | The program has been told to create an array of 50 million integers |
max list length | 2048 | Longer lists give more compression but slower access |
count sort time | 11.4600 | Count sort time in seconds |
sort check | ok | Count sort results correct |
unique items | 49421938 | Count after removing duplicate integers |
hash bytes | 99172838 | Size of the hash table in bytes |
compression ratio | 0.501664 | Hash table size divided by the unique items' size in bytes (worked through below the table) |
hash time | 1.2100 | Time to create the hash table |
count sort time plus hash time | 12.6700 | Sorting plus building the hash is quicker than quick sort alone |
hash check | ok | Checks that every item can be found, with some coverage for false positives (a sketch of this check follows the tables) |
hash check time | 15.1500 | Time for more than 90 million accesses |
quick sort time | 17.0700 | Quick sort time in seconds |
sort check | ok | Quick sort done correctly |
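As a quick check of the compression ratio row: the 49421938 unique 32-bit integers occupy 49421938 * 4 = 197687752 bytes, and 99172838 / 197687752 = 0.501664, so the hash data costs just over two bytes per unique integer.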
Here are the collated results for several runs of the integer code:
Integer hash results |
integers | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 | 20000000 | 50000000 |
max list length | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
count sort time | 0.0300 | 0.0500 | 0.1000 | 0.2000 | 0.3900 | 1.0900 | 2.5400 | 5.3800 | 11.9700 |
count sort check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
unique items | 99998 | 199989 | 499943 | 999778 | 1999095 | 4994194 | 9976870 | 19907305 | 49422954 |
hash bytes | 301788 | 601761 | 1328848 | 2328518 | 4327152 | 10317350 | 20282702 | 40143572 | 99174870 |
compression ratio | 0.754485 | 0.752243 | 0.664500 | 0.582259 | 0.541139 | 0.516467 | 0.508243 | 0.504131 | 0.501664 |
hash time | 0.0000 | 0.0000 | 0.0200 | 0.0300 | 0.0600 | 0.1300 | 0.2500 | 0.4800 | 1.1900 |
count sort time plus hash time | 0.0300 | 0.0500 | 0.1200 | 0.2300 | 0.4500 | 1.2200 | 2.7900 | 5.8600 | 13.1600 |
hash check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
hash check time | 0.0300 | 0.0600 | 0.0800 | 0.1900 | 0.4200 | 1.1800 | 2.5500 | 5.5200 | 14.6400 |
quick sort time | 0.0300 | 0.0400 | 0.1200 | 0.2800 | 0.5600 | 1.5000 | 3.1300 | 6.5300 | 17.1700 |
quick sort check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
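For completeness, here is roughly the kind of test the "hash check" rows report, written against the sketch from the start of this post (hist_hash and hist_hash_find are my illustrative names, not the archive's): every original key must be found at its correct position, and values falling in the gaps between consecutive keys must be reported absent, which gives some coverage against false positives.

    /* Hypothetical driver for the kind of check reported above.  Uses the
       sketch structure and lookup function defined earlier in this post.  */
    #include <assert.h>
    #include <stdio.h>

    static void hash_check(const hist_hash *h, const uint32_t *sorted, uint32_t n)
    {
        for (uint32_t i = 0; i < n; i++)          /* every key is found at its rank */
            assert(hist_hash_find(h, sorted[i]) == (long)i);

        for (uint32_t i = 0; i + 1 < n; i++)      /* values in the gaps are absent  */
            if (sorted[i] + 1 < sorted[i + 1])
                assert(hist_hash_find(h, sorted[i] + 1) == -1);

        puts("hash check | ok");
    }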