How to generate your own Base36 Product Quickcode sequence using the SHA2 hash
Run this process for a few hours and you too can have your very own unique sequence of millions of unique Quickcodes that will keep your inventory management system well-fed for life.
The statistical spread of the characters in the resulting series of Quickcodes is very even, which means that substrings of the Quickcode can be used to construct a hashed directory tree for holding. The typical example of where this would be used is to manage the huge amount of product image files associated with a product catalogue: Using a hashed directory tree is a quick and efficient method to host millions of separate files for quick, random access, as most file systems only perform optimally with less than 1,000 files or subdirectories held in a directory.
This deterministic technique uses a hash of the sequential Integer Id of an item to generate a typical product Quickcode for it, as found in many shopping catalogues. An arbitrary example of how a Quickcode is generated from its primary key integer Id would be something like:
- 1 => 8M9LFLN2
- 2 => HZ40H3K0
- 3 => 02LUJYQ2
- etc..
You can download the Base36 Quickcode Generator test script, which demonstrates an implementation in Perl and MySQL.
What are Quickcodes?
Base36 Quickcodes are used as a human-friendly way to identify items in very large catalogues as opposed to having to reel off endlessly-long and ambiguous integer identifiers - consider why car registrations are not all integers.
Using the counting base of 36 means that we start counting from as normal from 0, and after 9 we continue with capital A and end with capital Z, using the modern English version of the Latin alphabet: this amounts to 36 unique counting symbols before we "carry over" 1 to the adjacent left-hand column.
Base36 is also a very compact way of representing a value in string form. Consider the integer value 1234567890 (in Base10, the base that we mortals normally use), which can be represented in Base36 as 99T4C2. Not only is it shorter, but there would be less confusion of communicating this over a noisy telephone / radio connection.
One often finds them alongside items in printed product catalogs and online shops. Airline bookings are also identified with such quickcodes, called PNRs (Passenger Name Records), and are usually only 6 characters long.
One well-known but altogether different type of quickcode from that shown here is the one from IKEA, which actually attempts to make spoken words out of their product quickcodes. This results in constant hilarity and occasionally in cultural insensitivity, and are always guaranteed to have woke snowflakes twitching away in safe places from some-or-other perceived offence.
How do I use Quickcodes?
- Use Quickcodes for public references to your products, instead of integers. This way, no-one can track or infer your inventory size or product history.
- Use a sequential Integer key for internal database references of your product as you normally would, for best performance
- Your sequence should be unique to your business. You can do this my applying your own 'salt' to the initial hash.
- Decide how many Integer Keys your application may require over its lifetime.
- Run the test script for the number of Integer Keys and check that you don't get any duplicate collisions
- Store your unique sequence of Quickcodes in a table
Why are Quickcodes useful?
- It is a more human-friendly way of reading and writing a unique product identifier compared to say, an integer identifier.
- They hide your product tables primary key from the outside word, like web apps. This way no-one can guess the size of, say, your product inventory or when last you added new product to your inventory. An adversary can easily iterate along the integer sequence, starting at 1, and slurp your entire product inventory off your website.
- It is near impossible to guess an 8-character base36 Quickcode. For example; in a product database of 1,000,000 items, an adversary has a 1 in 2,800,000 chance of guessing a correct product. The chances of winning the UK Lottery are only 5 times worse.
- It is impossible to determine any sequential value in such a sequence, so it is great way to hide the size of your product range from competitive eyes. Not so with sequential integers for product codes: you simply add 1 to the key and test if the product code exists.
- Because of the random, and implied even distribution of Quickcodes (using this technique at least) one can use it as the basis of a hashed directory tree to hold millions of product images, which makes image harvesting techniques much more difficult.
Durability of this sequence
Durability of such a hash sequence can be quantified in terms of the length of the sequence before a previously-used value occurs and the evenness of the distribution of all the characters in any given column over the whole sequence. This sequence is based on the SHA hash. SHA1 and SHA2 work equally well, but avoid MD5, as it fails this important criterion; the Strict Avalanche Criterion (SAC), i.e. the inversion of any one bit in the input should cause half or more of the bits in the output to change. The result is that the output varies dramatically from one integer input to the next incremental integer input. Here is a statistical view of the evenness of the distribution after 4,500,000 Quickcodes have been generated, looking at the first term only of all quickcodes:
| # of quickcodes | stddev | variance | avg | min | max | StdDev/Avg % |
|---|---|---|---|---|---|---|
| 4546094 | 1218.8684 | 1485640.1265 | 126280.3889 | 122419 | 127540 | 0.96520797% |
The evenness of the distribution can be seen in the relatively low standard deviation from the average.
How does it work?
To make sure that you have a unique sequence that no-one else can replicate, you need to salt the sequence with a secret term. The function below takes the integer value and with your salt value generates an SHA1/SHA2 hash in hex. We keep 11 of the 40 hex digits and convert it to a base-36 value. Sometimes we may end up with a base36 result that is longer than 8 characters long, so we always keep only the right-most 8 characters. The important thing to note here is that the process is consistent. The prototype code demonstrates the Quickcode generation process:
use Math::BigInt => "GMP";use Math::Base36 qw(:all);use Digest::SHA2;use constant HASH_SALT => "My Salt String";# Creates a deterministic Quickcode # Given an integer, generates an 8-digit base36 valuesub Int2QuickCode {my $hash = Digest::SHA2->new;$hash->add(shift);$hash->add(HASH_SALT);# Generate hash in hex form:my $hexhash=$hash->hexdigest();my $x = Math::BigInt->new('0x'.$hexhash);$x->band('0xFFFFFFFFFFF');my $quickcode = encode_base36($x);# Rightmost 8 chars part of base36 hash$quickcode=substr($quickcode,0,QUICKCODE_SIZE);$quickcode=sprintf("%0".QUICKCODE_SIZE."s",$quickcode);LOGDIE("$quickcode too short") if length($quickcode)!=QUICKCODE_SIZE;return $quickcode;}But collisions will eventually occur!
Why? Because an 8-digit Base36 value is 42 bits wide and the SHA2 hash is 256/384/512 bits wide . We only use the 42 bits of the SHA2 hash and discard the remainder of the hash - it is irrelevant which 42 bits we use, as long we are consistent - a hash is a hash, after all.
The Base36 Quickcode Generator script below can create a unique sequence of between 1,000,000 and 4,000,000 Quickcodes for a given hash salt before a new Quickcode collides with a previously-created Quickcode. This may or may not be a sufficiently long sequence for you. If you need a longer sequence, you can generate further ones by trying different salts in the hash.
On a 'slowish' 2GHz machine it can generate 20,000 Quickcodes per minute.
#!/usr/bin/perl
use strict;use DBI;use Log::Log4perl qw(:easy);
use constant QUICKCODE_SIZE => 8;use constant HASH_SALT => "l2b";
# Test preparation:# Create a table in your test database:# create table quickcodes(id int, quickcode varchar(8), constraint unique key uc1 (quickcode));## Statistical Test:# Should have an even spread over the range of the count characters in string position.# Here, we look at character 1:# select sum(c) as quickcodes ,stddev(c), variance(c),avg(c), min(c), max(c), concat((stddev(c)/avg(c))*100,'%') as StdDevMean from # (select count(*) as c, substring(quickcode,1,1) from quickcodes where substring(quickcode,1,1) is not null group by substring(quickcode,1,1)) a;#+------------+-----------+-------------+-----------+--------+--------+-------------+#| quickcodes | stddev(c) | variance(c) | avg(c) | min(c) | max(c) | StdDevMean |#+------------+-----------+-------------+-----------+--------+--------+-------------+#| 45424 | 35.9746 | 1294.1728 | 1261.7778 | 1197 | 1361 | 2.85110536% |#+------------+-----------+-------------+-----------+--------+--------+-------------+
my $starttime = time;my ($db,$user,$pwd,$host) = ("test","root","tic206d2","localhost");
Log::Log4perl->easy_init({ level => $DEBUG, file => ">>Int2QuickCode.log", layout => '[%d][%p][%F{1}-%M-%L] %m%n'}, );
DEBUG "Connect to Database $db";my $connectionstring = "DBI:mysql:$db".(defined($host)?";host=$host":"");my ($dbh,$sql,$cursor);$dbh = DBI->connect($connectionstring, $user, $pwd, {RaiseError => 1}) || LOGDIE ("Unable to connect to $db: $DBI::errstr");DEBUG "Connected to Database $db";
DEBUG "Truncate test table";$dbh->do("truncate table quickcodes");
# Do a proper DBI prepare as we are going to perform this operation many times:my ($insert_sql, $insert_sth);$insert_sql = 'insert into quickcodes (id, quickcode) values (?,?)';$insert_sth = $dbh->prepare($insert_sql) || LOGDIE "Could not prepare sql statment:\n\$insert_sql\n".$insert_sth->errstr();
INFO("Test for collisions in sequence using salt value: ".HASH_SALT);for(my $id=1;$id<=100000000;$id++){ my $quickcode = Int2QuickCode($id); eval { $insert_sth->execute($id,$quickcode); }; if($@){ INFO "Duration: ".(time-$starttime)." seconds"; LOGDIE(sprintf("Quickcode Collision! Error inserting (%d, %s) for salt '%s': %s",$id, $quickcode,HASH_SALT,$insert_sth->errstr())); } INFO("$id quickcodes tested") if ($id % 10000 == 0);}INFO("Done");INFO(sprintf("Duration: %d minutes", (time-$starttime)/60));
# Creates a deterministic Quickcode # Given an integer, generates an 8-digit base36 valueuse Math::BigInt::GMP;use Math::Base36 qw(:all);use Digest::SHA2;
sub Int2QuickCode { my $hash = Digest::SHA2->new; $hash->add(shift); $hash->add(HASH_SALT); # Generate hash in hex form: my $hexhash=$hash->hexdigest(); my $x = Math::BigInt->new('0x'.$hexhash); $x->band('0xFFFFFFFFFFF'); my $quickcode = encode_base36($x,QUICKCODE_SIZE); # Rightmost 8 chars part of base36 hash $quickcode=substr($quickcode,0,QUICKCODE_SIZE); LOGDIE("$quickcode too short") if length($quickcode)!=QUICKCODE_SIZE; return $quickcode;}
