Math: ID System, Number Base vs Number of Digits

By Xah Lee. Date: . Last updated: .

ID is a sequence of numbers or letters. For example 8946729361, or h0fa48f2pt.

How many symbols you have to introduce in order to reduce the number of digits?

y = Log[10^10]/Log[x]. Number Base x vs Number of digits y, to reach 10^10. (that's 1 followed by 10 zeros.)

If you look at the plot, from x range 10 to 13, you can see that it takes 3 more symbols to drop a digit. From 13 to 18, it means it takes 5 more symbols to drop a digit. Then it takes 9….

We have:

β ^ δ = τ

For example, if β is 10, δ is 3, then τ is 10^3 or 1000. Examples of such ID string: {000, 001, 002, …, 999}.

For example, if β is 2, δ is 3, then τ is 2^3 or 8. If we use the symbol “0” and “1”, then the complete set of ID is: {000, 001, 010, 011, 100, 101, 110, 111}. If we use the symbol “a” and “b”, then the complete set of ID is: {aaa, aab, aba, abb, baa, bab, bba, bbb}.

Number of Digits vs Number of Symbols

Let's say we want a ID system for 10 billion IDs. Let's say we use ten decimal digits. So we have β = 10; δ = 10; τ = 10^10. That covers “10,000,000,000” items.

What if we increase the possible symbols to 20? Let's use these symbols: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, a, b, c, d, e, f, g, h, i, j}. If we use 7 such digits, then we have 20^7 = 1,280,000,000. Not quite enough. If we use 8 digits, then 20^8 = 25,600,000,000, that does it.

We increased our symbols by 10 yet we only save 2 digits!

In general, increasing the number of symbols does not help saving digits much. From our formula β ^ δ = τ, solving for β or δ, we get:

τ = β^δ
β = τ^(1/δ)       (* β is the δth root of τ *)
δ = Log[τ]/Log[β] (* Log is natural log. δ is Log τ of base β. *)

Look at the formula δ = Log[τ]/Log[β]. That means, δ is a function of τ and β. If the total τ is fixed, and the possible symbols β varies, then the total number of digits δ is a function of the form C/Log[x] for some constant C. This means, it gets smaller very very slowly. We have to increase a lot of symbols just to reduce one digit.

Now, let's look at our original question, about the design of a ID system, on what's the optimal number of symbols and digits. As we know now, that increasing number of symbols does not reduce digits much. Let's say we use 0 to 9 plus a to z. That's 36 symbols, and it requires 7 digits to cover 10^10. If we add 96 Japanese phonetic alphabets, we have a total of 36+96 = 132 symbols. We still need 5 digits (Log[10^10]/Log[96]). If we go to the extreme, using 3 thousand commonly used Chinese characters, we need 3 digits (Log[10^10]/Log[3000] ≈ 2.87).

If we use 60 thousand symbols (that's about all Chinese characters and all alphabets of the world's languages), we still need 2 digits. We'll need 10^10 symbols to cover 10^10 space with just one single symbol. At this extreme, there's a problem. ID string in general needs to be at least lots of digits (say, 10) to be recognized as ID string, otherwise we need to add some other way to identify it as a ID string. For example, if we use 60 thousand symbols with just 2 digits, then, of might be a ID string, but how do you know it's a English word or a ID string?

There's a standard ID system called Universally Unique Identifier (aka UUID or CLSID). It uses 32 digits of 16 symbols (conventional hexadecimal representation, with symbols 0 to 9 and a to f), covering possible total of 16^32, or 340282366920938463463374607431768211456. To cover that much with 65536 (2^16) symbols, we need 8 digits. 〔➤see What's Windows CLSID? Second Life UUID?