Print

How to generate a Base36 quick code sequence using the SHA2 hash

On a 'slowish' 2GHz machine it can generate 20,000 quickcodes per minute. The spread of the resulting quickcodes is near-perfectly even, which means that substrings of the quickcode can be used to construct a hashed directory tree for holding, for example, the huge amount of product image files associated with a product catalog. Using a hashed directory tree is a quick and efficient method to host millions of separate files for quick, random access, as most file systems only perform optimally with less than 1,000 contained in a directory

This deterministic techique uses a hash of the sequential Integer Id of an item to generate a typical product quickcode for it, as found in many shopping catalogues. Example of how a quickcode is generated from its primary key integer Id:

You can download the Base36 Quickcode Generator test script, which demonstrates an implementation in Perl and MySQL.

What are Quickcodes?

Base36 Quickcodes are used as a human-friendly way to indentify items in very large catalogues as opposed to having to reel off endlessly-long and ambigous integer identifiers - consider why car registrations are not all integers. You nearly always find them alongside items in printed product catalogs and online shops. Airline bookings are also identified with such quickcodes, called PNR's (Passenger Name Records). One well known and altogether different quickcode from that shown here is the one from IKEA, which actually attempts to make spoken words out of their product quickcodes. This results in constant hilarity and occasionally, outrageous cultural insensitivity.

How do I use Quickcodes?

Why are Quickcodes useful?

Durability of this sequence

Durability of such a hash sequence can be quantified in terms of the length of the sequence before a previously-used value occurs and the evenness of the distribution of all the characters in any given column over the whole sequence. This sequence is based on the SHA hash. SHA1 and SHA2 work equally well, but avoid MD5, as it fails this important criterion; the Strict Avalanche Criterion (SAC), i.e. the inversion of any one bit in the input should cause half or more of the bits in the output to change. The result is that the output varies dramatically from one integer input to the next incremental integer input. Here is a statistical view of the evenness of the distribution after 4,500,000 quickcodes have been generated, looking at the first term only of all quickcodes:

# of quickcodesstddevvarianceavgminmaxStdDev/Avg %
4546094 1218.8684 1485640.1265 126280.3889 122419 127540 0.96520797%

The evenness of the distribution can be seen in the relatively low standard deviation from the average.

How does it work?

To make sure that you have a unique sequence that no-one else can replicate, you need to salt the sequence with a secret term. The function below takes the integer value and with your salt value generates an SHA1/SHA2 hash in hex. We keep 11 of the 40 hex digits and convert it to a base-36 value. Sometimes we may end  with a base36 result that is longer than 8 characters long, so we always keep only the right-most 8 characters. The important thing here is that the process is consistent. The code below can be

use Math::BigInt => "GMP";
use Math::Base36 qw(:all);
use Digest::SHA2;
use constant HASH_SALT             => "My Salt String";
# Creates a deterministic Quickcode
# Given an integer, generates an 8-digit base36 value
sub Int2QuickCode {
my $hash = Digest::SHA2->new;
$hash->add(shift);
$hash->add(HASH_SALT);
# Generate hash in hex form:
my $hexhash=$hash->hexdigest();
my $x = Math::BigInt->new('0x'.$hexhash);
$x->band('0xFFFFFFFFFFF');
my $quickcode = encode_base36($x);
# Rightmost 8 chars part of base36 hash
$quickcode=substr($quickcode,0,QUICKCODE_SIZE);
$quickcode=sprintf("%0".QUICKCODE_SIZE."s",$quickcode);
LOGDIE("$quickcode too short") if length($quickcode)!=QUICKCODE_SIZE;
return $quickcode;
}

But collisions will eventually occur!

Why? Because and 8-digit base36 value is 42 bits wide and the SHA2 hash is 256/384/512 bits wide . We only use the 42 bits of the SHA2 hash and discard the remainder of the hash - it is irrelevant which 42 bits we use, as long we are consistent through the sequence.

Experiments with the downloaded Base36 Quickcode Generator test script create unique sequences of  between 1,000,000 and 4,000,000 quickcodes for various salts before a new quickcode collides with a previously-created quickcode. This may or may not be a sufficiently long sequence for you. If you need a longer sequence, you can only find one through trial and error with different salts.