A single header hash function with both vectorized and scalar versions. The function is quite fast when vectorized achieving approximately an average of ~9.6 GB/s on a 7 year old Xeon E3-1230 v5.
Additionally, it also passes all the SMHasher hash function quality tests: https://github.com/rurban/smhasher
Moreover, it is quite easy to choose a new function at runtime by just using new seed as shown below:
#include "khashv.h"
void foo() {
/*
code ....
*/
khashvSeed seed;
khashv_prep_seed64(&seed, a_64_bit_value);
uint64_t hash = khashv64(&seed, your_data, data_len);
/*
code ....
*/
}
This is not a cryptographic hash function, and it should not be used in for such applications.
When testing on 1.25 GB and 512 KB of random data I get the following on averages:
Processor | 1.25 GB Time | 1.25 GB Speed | 512 KB Time | 512 KB Speed | OS | Compiler | Type |
---|---|---|---|---|---|---|---|
Xeon E3-1230 v5 | 0.1298 s | 9.6285 GB/s | 052.5107 us | 9.2987 GB/s | Linux | GCC 12.1.0 | Vectorized |
Xeon E3-1230 v5 | 1.1911 s | 1.0495 GB/s | 494.1932 us | 0.9880 GB/s | Linux | GCC 12.1.0 | Scalar |
Xeon E3-1230 v5 | 0.1418 s | 8.8142 GB/s | 055.9333 us | 8.7297 GB/s | Linux | Clang 14.0.6 | Vectorized |
The scalar version is slower at a tad over ~1 GB/s on my system when compiling test_speed.c with gcc using -O3
.
On windows Microsoft's compiler does not seem to generate as performant code from the intrinsics, but the GCC mingw64 compiler generates pretty comparable numbers for me at least.
Definitely, want to add other machines to this table. But if you are curious how it performs on your machine compile test_speed.c with -O3 -march=native
and -O3 -march=native -D KHASHV_SCALAR
.
// Prepares a seed from a 32-bit value
void khashv_prep_seed32(khashvSeed* seed_prepped, uint32_t seed)
// Prepares a seed from a 64-bit value
void khashv_prep_seed64(khashvSeed* seed_prepped, uint64_t seed)
// Sets 128-bits to be the seed
void khashv_prep_seed128(khashvSeed* seed_prepped, const uint32_t seed[4])
// Produces a 32-bit hash from the input data
uint32_t khashv32(const khashvSeed* seed, const uint8_t* data, size_t data_len)
// Produces a 64-bit hash from the input data
uint64_t khashv64(const khashvSeed* seed, const uint8_t* data, size_t data_len)
Here is the output of the 64 bit hash of the integers [0, 259199] using 0x1dcedff1a8b17e89 as the seed.
Here is the output of the 32 bit hash of the integers [0, 518399] using 0x6bb75f13 as the seed.
The output of the above images was generated by basically doing the following for a hash.
for(int i = 0; i < sizeof(hash_bytes); i++) {
pixel[img_offset + i].r = hash_bytes[i];
pixel[img_offset + i].g = hash_bytes[i];
pixel[img_offset + i].b = hash_bytes[i];
pixel[img_offset + i].a = 255;
}
When thinking about things to improve the code and hash function these are the first few things that come to mind for me.
-
The main thing would be try to get both Clang and MSVC to output code that runs as fast GCC or as close as possible. They both seem to do some silly things when compared to GCC losing some performance when looking at the generated assembly. Microsoft's compiler being the worst, and probably the fastest fix for me to implement would be to write some assembly code. However, it then would no longer be a single header file hash function since MSVC does not support inline assembly for 64-bit builds, and thusly would require a separate file.
-
Then probably consider using intrinsics for some other systems like ARM NEON, but the for now there is scalar code and code written using GCC's vector built-ins that will generate vectorized code for other architectures that GCC supports.
-
Probably, the next thing I could think of is to choose a better value for S1 and S2 that are used to basically substitute bytes. The current values where found randomly checking a small set of criteria. Mainly focusing on each bit of S1 and S2 as columns. Then Xor-ing them effectively creating an 8 bit input boolean function, and making sure the entire thing maps each input to a unique value. There likely are better values that could chosen, and criteria to look at that look at all bits at once. However, the search space is huge effectively 2^(2*8*16) possible permutations for S1 and S2. However, the current values do seem to work well, from my testing.
I am open to any other suggestions or improvments.