Home About
Complex Discrete Logs Compressed Histogram Hash
DIY crypto Fusion
Heterodyne Physics McCoskey Count Sort
Oil Spill Fix Personal Rail
Rational Residue Residue Arithmetic
Reverse Translation Analysis Rotary Engine
Subatomic Energy

Histogram Hash

I have the impression that hashing as it is usually practiced requires storing both a hash data structure and the original data.

It occurred to me that it would be possible to use histograms to transform a sorted list of data items into a structure that was smaller than the original data, but also supported hashing, so that you would only need the hash data, You could then remove the original data, and re-create it from the hash data if required.

If this hasn't already been done, then here it is. If it has already been done, perhaps this code could still be of help to someone.

I have two programs. One processes a list of 32 bit integers, and the other does a list of variable length strings containing only the letters allowed in C symbols.

Here is the annotated output of the program creating a hash table for 50 million 32 bit integers. Source code for that program is at:



The code was developed on a Linux 64 bit system.

Note that data compression could be improved if we only want to know if the accessed data was in the original data set, but we don't care about where it was located.

The hash data is smaller than the original data. It would not be possible to just replace the original data with the hash data while bulding the hash data, but it probably woudn't be necessary to go with a completely double buffered solution. Some extra storage would be needed, but not two times the size of the data.

integers 50000000 The program has been told to create an array of 50 million integers
max list length 2048 Longer lists give more compression but slower access
count sort time 11.4600 Count sort time in seconds
sort check ok Count sort results correct
unique items 49421938 Removed redundant integers
hash bytes 99172838 Size of the hash table
compression ratio 0.501664 Size of hash table divided by bytes in unique items
hash time 1.2100 Time to create the hash table
count sort time plus hash time 12.6700 Sort and create hash quicker than quick sort
hash check ok Checks that all items can be accessed and some coverage for false positives
hash check time 15.1500 Time for more than 90 million accesses
quick sort time 17.0700 Quick sort time in seconds
sort check ok Quick sort done correctly

Here are the collated results for several runs of the integer code

Integer hash results

integers 100000 200000 500000 1000000 2000000 5000000 10000000 20000000 50000000
max list length 2048 2048 2048 2048 2048 2048 2048 2048 2048
count sort time 0.0300 0.0500 0.1000 0.2000 0.3900 1.0900 2.5400 5.3800 11.9700
count sort check ok ok ok ok ok ok ok ok ok
unique items 99998 199989 499943 999778 1999095 4994194 9976870 19907305 49422954
hash bytes 301788 601761 1328848 2328518 4327152 10317350 20282702 40143572 99174870
compression ratio 0.754485 0.752243 0.664500 0.582259 0.541139 0.516467 0.508243 0.504131 0.501664
hash time 0.0000 0.0000 0.0200 0.0300 0.0600 0.1300 0.2500 0.4800 1.1900
count sort time plus hash time 0.0300 0.0500 0.1200 0.2300 0.4500 1.2200 2.7900 5.8600 13.1600
hash check ok ok ok ok ok ok ok ok ok
hash check time 0.0300 0.0600 0.0800 0.1900 0.4200 1.1800 2.5500 5.5200 14.6400
quick sort time 0.0300 0.0400 0.1200 0.2800 0.5600 1.5000 3.1300 6.5300 17.1700
quick sort check ok ok ok ok ok ok ok ok ok