Click here to Skip to main content
15,860,972 members
Articles / Programming Languages / C#

A New Hash Function - ZobHash

Rate me:
Please Sign up or sign in to vote.
4.86/5 (68 votes)
25 Jan 2018CPOL5 min read 61.6K   1.2K   56   45
Can we write a new better hash function?

Introduction

I think most of the existing hash functions were developed many years ago, by very smart people. It is probably not possible to improve hashing algorithms any longer, still, I had to try this idea below.

Background

But first, as you know, a hash function is a function that converts any data of any size to a fixed sized number. It is widely used in, e.g., lookup tables.

In a previous article of mine, on chess engine programming1, I learned a very cool hashing technique for chess board states. Zobrist hashing5 proved to be able to give an almost unique hash code for every chess position in a game, even though there are almost infinitive states a chess game can be in. I think that we can use Zobrist algorithm to hash for example something more common like words. Results revealed at the end. (It worked!)

As you probably know, .NET already contains a hash function for strings. The method is called GetHashCode and it converts any object to an integer.

I want to know if we can create a better function using Zobrist ideas and with less collisions.

Hash Collisions

A hash collision occurs when two different objects, in this case strings, result in the same hash code.

I'm using a list of 466,545 English words3. The function GetHashCode in .NET produces only 48 collisions on this data set, so it is probably good enough in most scenarios. If you are fine with 48 collisions, you don’t have to read on. Here are few examples of words with same .NET hash code.

  • furnaces ferrate
  • matinee cistronic
  • toeholds argon
  • unsmart pulped

You might argue that since the dataset is known, it would be very easy to create a perfect hash algorithm without collisions. We could just give each word its unique number.

However, that is not the purpose of this investigation. We want the hash function to work for any word in any language and with a very low probability of collisions.

Zobrist Hash

Zobrist hashing works by bit-wise XOR-ing random numbers. In a chess engine, random numbers are generated for moves, and a move is defined by removing a piece from a square and then putting it on another.
There are two player colors, six piece types and 64 squares. 2 * 6 * 64 = 768 random numbers.
(Not taking into account castling and enpassant, but you get the point.)

When hashing words, I thought that we can generate random numbers for each character and then do the same thing as in Zobrist hashing. One bit-wise XOR operation for every character in a word.

C#
public static int ZobHash(this string text)
{
    var hash = ZobRandom.Start;
    foreach (var c in text)
        hash ^= ZobRandom.Numbers[c];
    return hash;
}

But when doing this, I get 218,163 collisions. Not very encouraging, but instead of giving up, let’s look at why there are so many collisions.

Here are a few examples:

  • abied abide
  • abiegh abeigh
  • acidly acidyl
  • acron acorn

As you can see, Zobrist hash gives the same hash code for words independent of the order of the letters, i.e., the order of the XOR operations.
That is how XOR works and it's perfect for storing chess game states, because the order of moves that led to a board position doesn’t matter, it is still the same game state. But for hashing of words, it does not work.

Improve Zobrist Algorithm

Let’s try to extend Zobrist hash so the order of the letters also affects the resulting hash code.
What if we try to do a cyclic bit-shift (very fast operations) for each character random number?
A cyclic bit shift pushes the bits in a number to the left (or to the right if you want) like this:

When the number 153, which is binary 10011001
is cyclically shifted to the left, the result is 00110011
and when you push it two times (next char) 01100110

This is the function for cyclic left bit shift.

C#
private static uint RotateLeft(uint value, int count)
{
    var r = count % 32;
    return (value << r) | (value >> (32 - r));
}

And the final hash function looks like:

C#
public static uint ZobHash(this string text)
{
    var hash = ZobRandom.Start;
    var i = 1;
    foreach (var c in text)
        hash ^= RotateLeft(ZobRandom.Numbers[c], i++);
    return hash;
}

Random Seeds

As I mentioned above, Zobrist hash uses a list of random numbers. For some reason, the function works better with some random numbers.
You might be unlucky to select a bad random seed that in the end generates more hash collisions.
So, implementing a Zobrist hashing function is also about selecting a good random seed.

The Results

Below is the result of counting performance and number of collisions for the list of 466,545 unique English words.
Also comparing with a few other hash functions from Brandon Dahler nuget System.Data.HashFunction4.
If you have a suggestion on more hash functions we should compare with, please comment below.

These are the results from hashing byte-arrays.

Functions with 32-bit hash codes:

Function 106 Words/s Collisions
ZobHash 4.67 5
GetHashCode .net 7.52 1592
JenkinsOneAtATime 3.99 36
BernsteinHash 4.57 343
CRC 3.17 23
ELF64 4.32 4347
JenkinsLookup2 3.67 23
JenkinsLookup3 3.64 30
ModifiedBernsteinHash 5.69 320
MurmurHash1 3.86 27
xxHash 3.15 16
Djb2 5.55 344

Functions with 64-bit hash codes:

Function 106 Words/s Collisions
ZobHashLong 5.3 0
FNV1 3.24 0
FNV1a 3.11 0

These are the results from hashing strings (a lot faster without conversion to byte-array):

Hash function 106 Words/s Collisions
ZobHash 24.6 5
ZobHashLong 25.9 0
GetHashCode .net 51.8 48

Conclusion

ZobHash performs well compared to all tested functions. For functions that generate 32-bit hash codes, it has the lowest rate of collisions. There are functions that are faster, but they produce more collisions.
Interestingly .NET GetHashCode gives different number of collisions depending on if you hash a byte array or a string.

For 64-bit hash code functions, none of the tested functions generates collisions but ZobHash is the fastest.

Links

  1. Test-Driven-Chess-Artificial-Intelligence
  2. https://en.wikipedia.org/wiki/Hash_function
  3. https://github.com/dwyl/english-words
  4. https://github.com/brandondahler/Data.HashFunction/
  5. https://en.wikipedia.org/wiki/Zobrist_hashing

That was all for now and don't forget to vote!

Thanks!

History

  • January 16, 2018 - Version 1.0.0
  • January 19, 2018 - Version 1.0.1
    • Fixed a bug on right shift for ints. Changed int to uint
    • Added simple implementation in C

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Sweden Sweden
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
Questionlot of collisions Pin
Member 146429263-Nov-19 1:18
Member 146429263-Nov-19 1:18 
Praisegood Pin
Member 1400182729-Sep-18 1:14
Member 1400182729-Sep-18 1:14 
GeneralRe: good Pin
KristianEkman30-May-19 22:17
KristianEkman30-May-19 22:17 
QuestionHash Function Pin
Member 1391942919-Jul-18 23:18
Member 1391942919-Jul-18 23:18 
PraiseRe: Hash Function Pin
KristianEkman2-Sep-18 6:16
KristianEkman2-Sep-18 6:16 
QuestionHash Function Pin
Member 1391942919-Jul-18 23:17
Member 1391942919-Jul-18 23:17 
QuestionProvide a link for Zobrist hashing Pin
RenniePet21-Jan-18 18:52
RenniePet21-Jan-18 18:52 
AnswerRe: Provide a link for Zobrist hashing Pin
KristianEkman22-Jan-18 0:20
KristianEkman22-Jan-18 0:20 
QuestionTable Searching Pin
Rick York19-Jan-18 5:06
mveRick York19-Jan-18 5:06 
AnswerRe: Table Searching Pin
KristianEkman19-Jan-18 8:08
KristianEkman19-Jan-18 8:08 
Questionsmall error in C implementation Pin
Armando A Bouza19-Jan-18 2:19
Armando A Bouza19-Jan-18 2:19 
Kristian, I am very happy that you provided a plain C implementation of your algorithm, as I am an "old fashion" user of C or C99.

The following line in ZobHash(char *data) is wrong:

int length = sizeof(data)/sizeof(data[0]);

I ran the ZobHash() for "Hello World" and for "Hello World 1234" and the result did not change:

<b>Hash is 0x4b13672e
Long Hash is 0xfcdd00a9e0149e6b</b>

<i>NOTE: I changed the output to hexa.</i>


I guess that you were trying to obtain the number of bytes pointed by "data" .
Unfortunately, in plain C you have to use strlen(data), if data is a zero terminated string, and this is a dangerous assumption.
A safer method would be:

<b>long ZobHash(const char *data, int length)
{
unsigned int hash = RandomStart;
for(int i = 0; i < length; i++)
hash ^= RotateLeft(RandomNumbers[(int)data[i]], i);
return hash;
}</b>

With this small change the result is:

<b>Hello World
Hash is e1a431a2
Long Hash is fcdd01c54aa3c98b
Hello World 1234
Hash is 329494ce
Long Hash is fcdd14a99993798b</b>

Also, putting "length" as a parameter, your hash function can be used on zero terminated strings of any binary data.

Thanks.
AnswerRe: small error in C implementation Pin
KristianEkman19-Jan-18 8:04
KristianEkman19-Jan-18 8:04 
Questionuseful for integrity check Pin
Armando A Bouza19-Jan-18 0:32
Armando A Bouza19-Jan-18 0:32 
AnswerRe: useful for integrity check Pin
KristianEkman19-Jan-18 0:41
KristianEkman19-Jan-18 0:41 
GeneralRe: useful for integrity check Pin
Armando A Bouza19-Jan-18 2:59
Armando A Bouza19-Jan-18 2:59 
Questionfinding a collision in your hash and crc Pin
Dismember18-Jan-18 8:09
Dismember18-Jan-18 8:09 
AnswerRe: finding a collision in your hash and crc Pin
KristianEkman18-Jan-18 8:28
KristianEkman18-Jan-18 8:28 
QuestionRandom factor in hash function Pin
peixaxo18-Jan-18 5:31
peixaxo18-Jan-18 5:31 
AnswerRe: Random factor in hash function Pin
KristianEkman18-Jan-18 5:50
KristianEkman18-Jan-18 5:50 
GeneralRe: Random factor in hash function Pin
Member 1232497918-Jan-18 8:51
Member 1232497918-Jan-18 8:51 
GeneralRe: Random factor in hash function Pin
KristianEkman18-Jan-18 9:33
KristianEkman18-Jan-18 9:33 
GeneralRe: Random factor in hash function Pin
42Bastian18-Jan-18 10:06
42Bastian18-Jan-18 10:06 
GeneralThanks for inspiration Pin
Julius Adam18-Jan-18 3:41
Julius Adam18-Jan-18 3:41 
GeneralRe: Thanks for inspiration Pin
KristianEkman18-Jan-18 5:53
KristianEkman18-Jan-18 5:53 
QuestionYou can strengthen it as follows ... Pin
Member 1330167918-Jan-18 3:06
Member 1330167918-Jan-18 3:06 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.