c++ - Normalize values to a smaller range -



Description

I have a rather big set of unique (string, string, string) tuples (around 40 million, possibly more). For each tuple I compute a single unsigned int value. I need to store those values somewhere so they can be reused after generation (even after the application goes down, so in-memory storage is out of the question, and so are databases, unfortunately).

At first I stored them in a file as (string, string, string, value) tuples, but reading in 40 million records takes time (and I need the values instantly).

I decided to first compute a hash value for each (string, string, string) tuple, normalize it to [0, n] (where n is the number of values), and store the values in a binary file in sorted order (sorted by the normalized hash value). After that I can mmap() the file and get a value with mmap[normalize(hash(string, string, string))].
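A minimal sketch of that lookup, assuming the binary file holds exactly n contiguous unsigned int slots and that the caller already computed index = normalize(hash(s1, s2, s3)); the function name and file layout are my assumptions, not taken from the question, and error handling is omitted for brevity:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Look up the value stored in slot `index` of the pre-sorted binary file.
unsigned int lookupValue(const char *path, std::size_t index)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; each slot is one unsigned int.
    unsigned int *values = static_cast<unsigned int *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));

    unsigned int result = values[index];

    munmap(values, st.st_size);
    close(fd);
    return result;
}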

My hash function is pretty straightforward and fast, and it works in my case (I didn't notice any collisions):

std::string concatenatedString = s1 + "." + s2 + "." + s3;
unsigned int hash = 31;
for (size_t i = 0; i < concatenatedString.length(); i++) {
    hash = hash * 101 + (unsigned int) concatenatedString[i];
}

Same for the normalization (straightforward):

((long) n * hash) / max_value

n - the upper bound of the normalized range (so around 40 million; I use n rather than (n - lower_bound) because lower_bound = 0)

max_value - the maximum value of the old range (UINT_MAX in my case; min_value = 0, so it does not appear in the equation)
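As a compilable version of that formula (the function and parameter names are mine), with a 64-bit intermediate so n * hash cannot overflow:

#include <climits>

// Map a hash from the old range [0, UINT_MAX] onto the new range [0, n].
unsigned int normalizeHash(unsigned int hash, unsigned int n)
{
    return static_cast<unsigned int>(
        (static_cast<unsigned long long>(n) * hash) / UINT_MAX);
}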

Problem

My hash function doesn't produce uniformly distributed values (I don't see how it could) in the range of 0 to 4,294,967,295 (unsigned int). Because of that, after normalization I get quite a few collisions, which result in information loss (values being overwritten under the same array index).

Are there any clever ways to do what I want without collisions?

I am aware that collisions might occur. The thing is that with this approach they tend to occur far too often. Since the hashing range is 100 times bigger than the number of elements, I'm guessing there might be a way to avoid this, but I haven't yet figured out how.

Solution: in the end I changed the hash to MurmurHash, changed the normalizing method to a simple "modulo newRange", and changed the format of the file (I now store all the info: (string, string, string, value)) - the file got pretty big, but thanks to that I was able to implement a simple collision detection mechanism (double hashing).
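A rough sketch of the collision handling described above; the record layout, the probe function, and the second hash are assumptions on my part (std::hash stands in for MurmurHash only to keep the sketch self-contained), and the table size is assumed to be a power of two so the odd step is coprime with it:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record layout: the full tuple key is kept so collisions can be detected.
struct Record {
    bool used = false;
    std::string key;       // s1 + "." + s2 + "." + s3
    unsigned int value = 0;
};

// Stand-ins for two independent hashes (e.g. MurmurHash with two different seeds).
static std::uint64_t hash1(const std::string &k) { return std::hash<std::string>{}(k); }
static std::uint64_t hash2(const std::string &k) { return std::hash<std::string>{}("#" + k) | 1; } // odd step

// Double hashing: probe slots (h1 + i * h2) mod size until the key matches
// or an empty slot is found.
Record *probe(std::vector<Record> &table, const std::string &key)
{
    std::uint64_t h1 = hash1(key), h2 = hash2(key);
    for (std::size_t i = 0; i < table.size(); ++i) {
        Record &slot = table[(h1 + i * h2) % table.size()];
        if (!slot.used || slot.key == key)
            return &slot;
    }
    return nullptr; // table is full
}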

I'm surprised you are not getting collisions even before you normalize the range of hash values. It looks like you are using an unnormalized range of [0, 2^32). Looking at the birthday problem chart here, the probability of a collision with 4*10^7 elements should be higher than 75%. In any case, normalizing the hash outputs to a range equal to the size of the set of elements practically guarantees a non-trivial number of collisions. Unless you're willing to use a counter for your hash values, I don't see how you'll be able to avoid that.

Edit: I saw your edit. Even with a range 100 times the number of elements (i.e. 4*10^9), you will still get a lot of collisions. As indicated above, the probability of 1 or more collisions is well over 75%.
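For reference, those figures follow from the standard birthday-problem approximation P ≈ 1 - exp(-k^2 / (2N)) for k elements hashed into a range of N values; a quick check of the two ranges mentioned (my own arithmetic, not part of the original answer):

#include <cmath>
#include <cstdio>

int main()
{
    double k  = 4e7;           // number of tuples
    double n1 = 4294967296.0;  // full 32-bit range, 2^32 values
    double n2 = 4e9;           // range 100x the number of elements

    // P(at least one collision) ~= 1 - exp(-k^2 / (2N)); both print ~1.0
    std::printf("full 32-bit range : %.6f\n", 1.0 - std::exp(-k * k / (2.0 * n1)));
    std::printf("100x element count: %.6f\n", 1.0 - std::exp(-k * k / (2.0 * n2)));
    return 0;
}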

There are 2 things I would suggest:

Choose a different hash function

As you noted, while your hash function is fast, it does not distribute values randomly in the range [0, 2^32). There are several hash functions out there that are both fast and do a better job of distributing hash values across the hash function's range. One I've used in the past is MurmurHash.
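A sketch of swapping the hand-rolled hash for the 32-bit MurmurHash3 variant; the header name and the MurmurHash3_x86_32 signature are taken from the smhasher reference implementation as I remember it, so treat them as assumptions:

#include <cstdint>
#include <string>
#include "MurmurHash3.h"   // reference implementation from the smhasher project

// Drop-in replacement for the custom hash: hash the concatenated tuple
// with MurmurHash3 instead of the multiply-and-add loop.
unsigned int hashTuple(const std::string &s1, const std::string &s2, const std::string &s3)
{
    std::string key = s1 + "." + s2 + "." + s3;
    std::uint32_t out = 0;
    MurmurHash3_x86_32(key.data(), static_cast<int>(key.size()), /*seed=*/31, &out);
    return out;
}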

Use a larger range

Using a larger range should cut down the risk of collisions. Looking at the chart here once again, it looks like 64 bits should be plenty to cut the risk of a collision down to less than 10^-6. The MurmurHash64A and MurmurHash64B variants would be useful in this case.
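A sketch of the 64-bit route, assuming the MurmurHash2 reference implementation's MurmurHash64A(key, len, seed) signature and header name; both are from memory, not from the answer:

#include <cstdint>
#include <string>
#include "MurmurHash2.h"   // provides MurmurHash64A in the reference implementation

// 64-bit hash of the tuple; with ~4*10^7 elements the chance of any
// collision in a 64-bit range is far smaller than with 32 bits.
std::uint64_t hashTuple64(const std::string &s1, const std::string &s2, const std::string &s3)
{
    std::string key = s1 + "." + s2 + "." + s3;
    return MurmurHash64A(key.data(), static_cast<int>(key.size()), /*seed=*/31);
}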

c++ linux algorithm

