I have the impression that hashing as it is usually practiced requires storing both a hash data structure and the original data.
It occurred to me that histograms could be used to transform a sorted list of data items into a structure that is smaller than the original data but still supports hashing, so that only the hash data needs to be kept. The original data could then be removed and re-created from the hash data if required.
If this hasn't already been done, then here it is. If it has already been done, perhaps this code could still be of help to someone.
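To make the idea concrete, here is a rough sketch of one way a histogram can turn a sorted list of 32-bit keys into a smaller structure that still supports lookup. This is only my illustration of the principle, not the code from the archive: keys are bucketed on their high 16 bits, the histogram of bucket sizes is turned into a cumulative offset table, and only the low 16 bits of each key are stored, because the high bits are implied by the bucket a key sits in.

    /* Sketch only: histogram-based "hash" over sorted, unique 32-bit keys.
       Storage is 2 bytes per key plus a fixed offset table of (2^16 + 1)
       32-bit counts, roughly 256 KB.                                       */
    #include <stdint.h>
    #include <stdlib.h>

    #define BUCKET_BITS 16
    #define NBUCKETS    (1u << BUCKET_BITS)

    typedef struct {
        uint32_t *offsets;    /* cumulative histogram, NBUCKETS + 1 entries */
        uint16_t *residuals;  /* low 16 bits of each key, in sorted order   */
        uint32_t  count;      /* number of unique keys                      */
    } hist_hash;

    /* Build the structure from a sorted array of unique 32-bit keys. */
    static int hist_hash_build(hist_hash *h, const uint32_t *sorted, uint32_t n)
    {
        h->offsets   = calloc(NBUCKETS + 1, sizeof *h->offsets);
        h->residuals = malloc((size_t)n * sizeof *h->residuals);
        if (h->offsets == NULL || h->residuals == NULL)
            return -1;
        h->count = n;

        for (uint32_t i = 0; i < n; i++)          /* histogram of bucket sizes */
            h->offsets[(sorted[i] >> BUCKET_BITS) + 1]++;
        for (uint32_t b = 0; b < NBUCKETS; b++)   /* prefix sums -> offsets    */
            h->offsets[b + 1] += h->offsets[b];
        for (uint32_t i = 0; i < n; i++)          /* keep only the low bits    */
            h->residuals[i] = (uint16_t)sorted[i];
        return 0;
    }

    /* Return the position of key in the original sorted data, or -1 if absent. */
    static long hist_hash_find(const hist_hash *h, uint32_t key)
    {
        uint32_t lo  = h->offsets[key >> BUCKET_BITS];
        uint32_t hi  = h->offsets[(key >> BUCKET_BITS) + 1];
        uint32_t end = hi;
        uint16_t r   = (uint16_t)key;

        while (lo < hi) {                         /* binary search inside the bucket */
            uint32_t mid = lo + (hi - lo) / 2;
            if (h->residuals[mid] < r)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo < end && h->residuals[lo] == r) ? (long)lo : -1;
    }

    /* Re-create the i-th original key: find its bucket, re-attach the high bits. */
    static uint32_t hist_hash_item(const hist_hash *h, uint32_t i)
    {
        uint32_t b = 0;
        while (h->offsets[b + 1] <= i)            /* linear scan kept simple here */
            b++;
        return (b << BUCKET_BITS) | h->residuals[i];
    }

With 2^16 buckets this costs roughly two bytes per unique key plus about 256 KB of offsets, which would be consistent with the compression ratios close to 0.5 reported below for the larger inputs. The "max list length" figure in the output below presumably caps how many keys a single list (bucket) may hold, which is why longer lists trade compression against access speed.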
I have two programs. One processes a list of 32-bit integers, and the other a list of variable-length strings containing only the characters allowed in C symbols.
Here is the annotated output of the program creating a hash table for 50 million 32-bit integers. Source code for that program is at:
http://mcky.net/bin/hsh/hsh.tar.gz
http://mcky.net/bin/hsh/hsh.zip
The code was developed on a 64-bit Linux system.
Note that the compression could be improved if we only want to know whether a queried item was in the original data set and don't care about where it was located.
The hash data is smaller than the original data. It would not be possible to simply replace the original data with the hash data while building it, but it probably wouldn't be necessary to go with a completely double-buffered solution either: some extra storage would be needed, but not twice the size of the data.
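To illustrate why full double buffering should not be needed, under the simplified layout sketched earlier the two-byte residuals can be written over the front of the very buffer that holds the four-byte sorted keys: residual i lands at byte 2*i, which only overlaps keys that have already been read. Only the offset table then needs separate storage. (Again, this is my sketch; the layout in the real code may differ.)

    /* Illustration only: overwrite the front of the sorted-key buffer with the
       16-bit residuals.  Key i is read from byte offset 4*i before any write
       reaches it, since writes up to step i only cover bytes 0 .. 2*i + 1.
       Afterwards the buffer can be shrunk to 2*n bytes with realloc().        */
    #include <stdint.h>
    #include <string.h>

    static void residuals_in_place(uint32_t *sorted, uint32_t n)
    {
        unsigned char *base = (unsigned char *)sorted;
        for (uint32_t i = 0; i < n; i++) {
            uint16_t res = (uint16_t)sorted[i];   /* read the key first */
            memcpy(base + 2 * (size_t)i, &res, sizeof res);
        }
    }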
integers | 50000000 | The program has been told to create an array of 50 million integers |
max list length | 2048 | Longer lists give more compression but slower access |
count sort time | 11.4600 | Count sort time in seconds |
sort check | ok | Count sort results correct |
unique items | 49421938 | Count after removing duplicate integers |
hash bytes | 99172838 | Size of the hash table in bytes |
compression ratio | 0.501664 | Hash table size divided by the unique items' size in bytes (worked through below the table) |
hash time | 1.2100 | Time to create the hash table |
count sort time plus hash time | 12.6700 | Sorting plus building the hash is quicker than quick sort alone |
hash check | ok | Checks that every item can be found, with some coverage for false positives (a sketch of this check follows the tables) |
hash check time | 15.1500 | Time for more than 90 million accesses |
quick sort time | 17.0700 | Quick sort time in seconds |
sort check | ok | Quick sort done correctly |
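As a quick check of the compression ratio row: the 49421938 unique 32-bit integers occupy 49421938 * 4 = 197687752 bytes, and 99172838 / 197687752 = 0.501664, so the hash data costs just over two bytes per unique integer.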
Here are the collated results for several runs of the integer code:
Integer hash results |
integers | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 | 20000000 | 50000000 |
max list length | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
count sort time | 0.0300 | 0.0500 | 0.1000 | 0.2000 | 0.3900 | 1.0900 | 2.5400 | 5.3800 | 11.9700 |
count sort check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
unique items | 99998 | 199989 | 499943 | 999778 | 1999095 | 4994194 | 9976870 | 19907305 | 49422954 |
hash bytes | 301788 | 601761 | 1328848 | 2328518 | 4327152 | 10317350 | 20282702 | 40143572 | 99174870 |
compression ratio | 0.754485 | 0.752243 | 0.664500 | 0.582259 | 0.541139 | 0.516467 | 0.508243 | 0.504131 | 0.501664 |
hash time | 0.0000 | 0.0000 | 0.0200 | 0.0300 | 0.0600 | 0.1300 | 0.2500 | 0.4800 | 1.1900 |
count sort time plus hash time | 0.0300 | 0.0500 | 0.1200 | 0.2300 | 0.4500 | 1.2200 | 2.7900 | 5.8600 | 13.1600 |
hash check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
hash check time | 0.0300 | 0.0600 | 0.0800 | 0.1900 | 0.4200 | 1.1800 | 2.5500 | 5.5200 | 14.6400 |
quick sort time | 0.0300 | 0.0400 | 0.1200 | 0.2800 | 0.5600 | 1.5000 | 3.1300 | 6.5300 | 17.1700 |
quick sort check | ok | ok | ok | ok | ok | ok | ok | ok | ok |
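For completeness, here is roughly the kind of test the "hash check" rows report, written against the sketch from the start of this post (hist_hash and hist_hash_find are my illustrative names, not the archive's): every original key must be found at its correct position, and values falling in the gaps between consecutive keys must be reported absent, which gives some coverage against false positives.

    /* Hypothetical driver for the kind of check reported above.  Uses the
       sketch structure and lookup function defined earlier in this post.  */
    #include <assert.h>
    #include <stdio.h>

    static void hash_check(const hist_hash *h, const uint32_t *sorted, uint32_t n)
    {
        for (uint32_t i = 0; i < n; i++)          /* every key is found at its rank */
            assert(hist_hash_find(h, sorted[i]) == (long)i);

        for (uint32_t i = 0; i + 1 < n; i++)      /* values in the gaps are absent  */
            if (sorted[i] + 1 < sorted[i + 1])
                assert(hist_hash_find(h, sorted[i] + 1) == -1);

        puts("hash check | ok");
    }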