Friendlier UUID URLs in Ruby

In this article we will discuss and demonstrate how we can use Ruby to encode UUIDs into URL-friendly representations. This article does not assume any previous knowledge about UUIDs. Instead we will first discuss what exactly a UUID is. We then look at the reasons we might prefer UUIDs over conventional incremental integers.

You can look forward to some binary math and adding a simple but effective encoding algorithm to your tool belt.

What is a UUID

A UUID (Universally Unique IDentifier) or GUID (Globally Unique IDentifier) is a data type that can replace the frequently used integer data type for the ID column in a database table. A UUID has to adhere to the format set by the RFC 4122 specification.

A UUID consists of 32 hexadecimal digits separated by hyphens into five groups. The first group is 8 digits long followed by three groups of 4 digits and the last group has 12 digits. The individual groups don’t convey any significant meaning. We only care about the uniqueness of the entire string.

The algorithms that generate UUIDs, of which RFC 4122 defines five versions, are designed to make it extremely unlikely for the same sequence to ever be generated twice. This practical uniqueness allows separate parts of a distributed system to generate identifiers without needing to check for uniqueness against a central registry.

A UUID looks like this: 4302cfd8-a080-437d-b870-28730dc67498. Notice the grouping as we described it earlier. The example UUID shows that a hexadecimal digit ranges from 0-9 or from a-f. Since each hexadecimal digit can be represented by 4 bits we can see that a UUID requires 128 bits (32 x 4 = 128).
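As a quick sanity check of the arithmetic above, here is a small sketch that counts the groups, digits and bits of the example UUID. The variable names are only illustrative.

```ruby
uuid = "4302cfd8-a080-437d-b870-28730dc67498"

groups = uuid.split("-")
group_lengths = groups.map(&:length) # digits per group
hex_digits = groups.join.length      # total hexadecimal digits
bits = hex_digits * 4                # 4 bits per hexadecimal digit

puts group_lengths.inspect # prints [8, 4, 4, 4, 12]
puts hex_digits            # prints 32
puts bits                  # prints 128
```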

Generating a UUID in Ruby

Instead of only talking about UUIDs, how about we take a moment to generate one? Ruby ships with the securerandom library, which can generate a UUID for us. To generate a UUID in Ruby we can follow these steps:

  1. Open IRB, by typing irb in your shell/terminal.
  2. Once IRB is open you will first need to require securerandom.
  3. Then you can generate a UUID by executing:
    $ irb
    >> require 'securerandom'
    >> SecureRandom.uuid
    

A generated UUID will look similar to this: 4302cfd8-a080-437d-b870-28730dc67498.
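The same steps also work in a script. The sketch below additionally checks the generated value against the 8-4-4-4-12 layout described earlier. UUID_FORMAT is a name made up for this example. SecureRandom.uuid produces random (version 4) UUIDs, which is why the first digit of the third group is always 4.

```ruby
require "securerandom"

# Hypothetical helper pattern: five hyphen-separated groups of hexadecimal digits.
UUID_FORMAT = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

uuid = SecureRandom.uuid

puts uuid                     # a different value on every run
puts uuid.match?(UUID_FORMAT) # prints true
puts uuid[14]                 # prints 4, the version digit of a random UUID
```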

Why do we use UUIDs

There are many advantages as well as disadvantages of using UUIDs. In this section we will list the reasons for making use of UUIDs over frequently used incremental integers.

Improved privacy

Sequential integer IDs can provide a curious onlooker with information we might not want to share. One example is when a user signs up and they notice that they are user number 100. The newly signed up user can be fairly certain that there are 99 other users.

Organizations would typically not share this information with the public, and especially not with their competition. As software engineers it is good for us to be well aware of the information we share and it is our job to weigh up the alternatives.

A UUID is not sequential and it would therefore not give any information away about the size of our user base or database tables.

Improved security

When a malicious user realizes that there are, say, 1000 user profiles stored in our application, they can use this information to poke for holes in the security of our application.

The Insecure Direct Object Reference vulnerability arises when we take user supplied information to access records in our database. Consider for example the following scenario.

A user determines that they can update their email address by posting to users/6/update_email. Let us say that this application does not properly check user permissions and incorrectly takes user input to update database records. A malicious user then posts to users/9/update_email and updates the unsuspecting user’s email address. The malicious user can then gain access to that account and all they had to do was guess a number other than their own database id number.

This security vulnerability can and must be addressed by making sure we correctly apply permissions to database transactions. Additionally, if we made use of UUIDs instead it would be much harder to guess a user’s ID. We can therefore see the security benefit of using a UUID that has no guessable or sequential order.

Improved user experience

Frontend clients can generate UUIDs that are, for all practical purposes, unique across a distributed system. Now the frontend can generate new records on the fly, without needing to first persist records to a database.

This means that frontend clients can work and generate records while offline. Saving data to a server or API can become a background task which could lead to a snappier user experience. We would be able to get rid of a few loading or waiting screens which is also a user experience win.

Easier to have multiple databases

Sharding across databases is much easier when you know that the IDs used across all databases are unique. Data can be moved from one database to another or merged together without any ID conflicts.

Infinite scale

In 2018 Basecamp experienced an outage because the ID column for their tracking table was set as integer rather than big integer. The (signed 32-bit) integer data type runs out of numbers at 2147483647. Big integers can go all the way up to 9223372036854775807. A UUID, with its 128 bits, has room for about 3.4 × 10^38 values, so for practical purposes it can be thought of as inexhaustible when compared to integers or big integers.
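To put the three sizes side by side, here is a short sketch. Ruby integers have arbitrary precision, so we can compute the limits directly. The variable names are only illustrative.

```ruby
max_integer = 2**31 - 1 # largest signed 32-bit integer
max_bigint  = 2**63 - 1 # largest signed 64-bit integer
uuid_space  = 2**128    # number of distinct 128-bit values

puts max_integer # prints 2147483647
puts max_bigint  # prints 9223372036854775807
puts uuid_space  # prints 340282366920938463463374607431768211456
```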

UUIDs are a good choice when we anticipate storing large amounts of records.

How to make friendly UUID URLs

UUIDs have many advantages, but they have disadvantages too. One disadvantage is that some might argue that UUIDs are too long and ugly to be used in a URL. To overcome this negative aspect of UUIDs we can encode it into a more URL friendly representation.

We need to remember that all data can be represented by bits, and in this case we can represent a UUID with 128 bits. Next we need to find other URL safe ways to represent 128 bits.

URL safe characters

The RFC 3986 specification goes into detail to explain the syntax of a URL. Feel free to take a look at it, but for our purposes we will see what it says about safe characters to use in a URL.

The specification uses the term unreserved characters when it refers to characters that are allowed in a URL. The following characters are allowed:

  • Upper- and lowercase alphabet characters A-Z and a-z
  • Decimal digits 0-9
  • Hyphens -
  • Periods .
  • Underscores _
  • Tilde ~

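As a side note, ever wondered why some URLs have percent signs (%)? Characters that cannot appear literally are percent-encoded: each byte is written as a percent sign followed by its two-digit hexadecimal value, so a space becomes %20. The specification also permits percent-encoding unreserved characters, but recommends against it.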
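The list of unreserved characters translates directly into a regular expression. The sketch below is only an illustration: UNRESERVED, url_safe? and percent_encode are names made up for this example and are not part of any library.

```ruby
# Only the unreserved characters from RFC 3986: letters, digits, - . _ ~
UNRESERVED = /\A[A-Za-z0-9\-._~]+\z/

# True when the string consists solely of unreserved characters.
def url_safe?(string)
  string.match?(UNRESERVED)
end

# Writes every byte of the string as % followed by two hexadecimal digits.
def percent_encode(string)
  string.bytes.map { |byte| format("%%%02X", byte) }.join
end

puts url_safe?("1mg7d6RSDCJBHa2bL9OOr_") # prints true
puts url_safe?("a+b/c")                  # prints false
puts percent_encode("/")                 # prints %2F
puts percent_encode("+")                 # prints %2B
```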

Encoding a UUID

Now that we have a goal in mind (Make a UUID URL friendly) and constraints (Only URL safe characters allowed) we can proceed. Our approach is to take a UUID and encode or transform it so that it can be represented by a shorter string.

The steps we will take to transform or encode the UUID are:

  • Define our encoding alphabet.
  • Remove grouping hyphens.
  • Convert UUID to binary representation.
  • Convert binary into integers (base 10).
  • Use integers to reference characters from our encoding alphabet.
  • Return encoded string.

1. Define our encoding alphabet

As we have already seen, UUIDs consist of hexadecimal digits. There are 16 characters in the hexadecimal alphabet so to speak. We would need a larger alphabet to encode the same amount of data more compactly.

The alphabet we define will represent the set of characters we use to encode the UUID. If our alphabet contains only the letters a b c, then we will only be able to encode data using those letters.

Instead of coming up with our own URL safe alphabet we can use the Base64 encoding scheme. Base64 is an encoding scheme that represents bits in ASCII string format. Base64 is often used to encode data meant to be included in a URL.

There is a Ruby gem that takes integers and transforms them into Base62 characters. The transformed string then represents the original integer.

Let us not reinvent the wheel. Instead we can take inspiration from others that came before us.

The earlier mention of Base62 is no typo. One problem with using Base64 characters in a URL is that they might include + and / characters. These are not URL safe characters so we could decide to create our own encoding scheme derived from the Base64 character set.

Base62 is what remains when you remove the two unsafe characters. Another approach would be to replace + (plus) with - (hyphen) and / (forward slash) with _ (underscore). This way we still have 64 characters in our alphabet.

Let us not get overly mathematical, but consider for a moment why this alphabet will be able to represent data more compactly. Each hexadecimal digit carries 4 bits. Each character in a 64 character set carries 6 bits. That means each character in our alphabet represents 2 more bits, that is 1.5 times as many, than a hexadecimal digit can. All of this means that our alphabet will require roughly a third fewer characters than hexadecimal to represent the same amount of data.
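The same reasoning can be written down as a small calculation. This is only a sketch, and characters_needed is a helper name invented for the example. Because 128 is not a multiple of 6 we have to round up, so the real saving for a UUID is a little under a third.

```ruby
UUID_BITS = 128

# Number of characters needed when every character carries the given number of bits.
def characters_needed(bits_per_character)
  (UUID_BITS / bits_per_character.to_f).ceil
end

hex_length    = characters_needed(4) # hexadecimal: 4 bits per character
base64_length = characters_needed(6) # 64 character alphabet: 6 bits per character
saving        = 1 - base64_length / hex_length.to_f

puts hex_length    # prints 32
puts base64_length # prints 22
puts saving        # prints 0.3125
```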

Each character in our alphabet has an index starting from 0 all the way up to 63. Similar to how arrays work we will reference characters from our alphabet using their index numbers.

# our alphabet
ALPHABET = %w[0 1 2 3 4 5 6 7 8 9
              A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
              a b c d e f g h i j k l m n o p q r s t u v w x y z
              - _].freeze
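Before we rely on the alphabet it is worth a quick check that it really holds 64 distinct characters. The sketch below builds the same alphabet from ranges, which is just a shorter way of writing it, and looks up a few characters by index.

```ruby
# Same 64 characters as above, built from ranges: digits, uppercase, lowercase, - and _
ALPHABET = [*"0".."9", *"A".."Z", *"a".."z", "-", "_"].freeze

puts ALPHABET.length      # prints 64
puts ALPHABET.uniq.length # prints 64, so no character appears twice
puts ALPHABET[0]          # prints 0
puts ALPHABET[10]         # prints A
puts ALPHABET[63]         # prints _
puts ALPHABET.index("V")  # prints 31
```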

2. Remove grouping hyphens

We already know that UUID grouping is standardized which means we don’t need to encode the hyphens. Should we need to decode back into UUID form we can easily add the grouping hyphens back where they belong.

Removing the hyphens before we encode the UUID will save a further 4 characters.

There are many ways we can do that in Ruby, here is one simple way:

uuid = "bf39f02b-caa2-47e3-887b-b1a9a5849092"
result = uuid.split('-').join
# >>  "bf39f02bcaa247e3887bb1a9a5849092"
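Here is a sketch of the other direction, to show that nothing is lost by dropping the hyphens. It uses String#delete as an alternative way to strip them and unpack with fixed-width fields to put them back. The variable names are only illustrative.

```ruby
uuid = "bf39f02b-caa2-47e3-887b-b1a9a5849092"

# Strip the hyphens, then cut the 32 digits into groups of 8, 4, 4, 4 and 12.
stripped = uuid.delete("-")
restored = stripped.unpack("a8a4a4a4a12").join("-")

puts stripped         # prints bf39f02bcaa247e3887bb1a9a5849092
puts restored == uuid # prints true
```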

3. Convert to Binary

We cannot map directly from hexadecimal to our Base64 alphabet. Instead we should convert the hexadecimal digits into their binary representation.

We work with the assumption that we want to be able to decode back into UUID form later on. To make decoding possible our algorithm requires that we represent each hexadecimal digit with the same number of binary digits. That means that 0 will be represented by 0000 even though it can be represented by a single 0 binary digit. This fixed-width, 4 bits per digit encoding resembles binary-coded decimal (BCD), so we will loosely call it BCD in this article, although strictly speaking BCD only covers the decimal digits 0-9.

# convert to BCD
result =  "70a879c66dc34c4cb4640a5549618d7f"
result = result.chars.map { |c| c.hex.to_s(2).rjust(4,'0') }.join
# >> "0111000010101000011110011100011001101101110000110100110001001100101101...
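An alternative sketch, assuming we are happy to treat the whole UUID as one large number: convert all 32 digits at once and left-pad the result to 128 binary digits. Both approaches should give the same string, which the last lines confirm for this UUID.

```ruby
hex = "70a879c66dc34c4cb4640a5549618d7f"

# Digit by digit, 4 binary digits per hexadecimal digit.
per_digit = hex.chars.map { |c| c.hex.to_s(2).rjust(4, "0") }.join

# All at once, padded on the left so leading zeros are not lost.
whole = hex.to_i(16).to_s(2).rjust(128, "0")

puts per_digit.length   # prints 128
puts per_digit[0, 16]   # prints 0111000010101000
puts per_digit == whole # prints true
```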

4. Convert to base 10 integers

Let us take a moment to consider the entire hexadecimal set and look at their BCD values.

%w[0 1 2 3 4 5 6 7 8 9 A B C D E F].map { |c| c.hex.to_s(2).rjust(4, '0') }
# >> ["0000", # 0
#     "0001", # 1
#     "0010", # 2
#     "0011", # 3
#     "0100", # 4
#     "0101", # 5
#     "0110", # 6
#     "0111", # 7
#     "1000", # 8
#     "1001", # 9
#     "1010", # A
#     "1011", # B
#     "1100", # C
#     "1101", # D
#     "1110", # E
#     "1111"] # F

Imagine we take 32 hexadecimal digits and join them together in one 128 digit long binary sequence. Now we want to find a way to reference characters in our alphabet from this binary sequence. We need to convert the ones and zeros in the binary sequence into integers between 0 and 63.

To represent a number between 0 and 63 we need 6 binary digits. If you take 128 and you divide it by 6 you will notice that we get 21.33, which is not a whole number. If only we had 132 binary digits, then we would have been able to group them into exactly 22 groups.

Since we are working with binary, there would be no problem to just add a couple of zeros to the front (left) of the number. This works just like decimal numbers, 0008 remains 8 no matter how many zeros you add to the left of the number.

So the next step for us is to pad our binary sequence with four additional zeros, then we group the entire sequence into groups of 6 binary digits.

# add zero padding (a shortened bit string stands in for the full 128 digits)
result = "11011111001001110001100110"
result = "0000" + result
# >> "000011011111001001110001100110"

# group binary digits
result = result.scan(/.{6}/)
# >> ["000011", "011111", "001001", "110001", "100110"]
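The snippet above uses a short bit string to keep the output readable. As a sketch of the same two steps on a complete UUID, we can check that the padded sequence is 132 digits long and splits into exactly 22 groups.

```ruby
hex = "70a879c66dc34c4cb4640a5549618d7f"

# 128 binary digits for the UUID, plus four zeros of padding on the left.
bits   = hex.to_i(16).to_s(2).rjust(128, "0")
padded = "0000" + bits
groups = padded.scan(/.{6}/)

puts padded.length           # prints 132
puts groups.length           # prints 22
puts groups.first(3).inspect # prints ["000001", "110000", "101010"]
```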

5. Convert to encoding alphabet

Each group of six binary digits represents an integer number between 0 and 63. Now we can very easily convert the binary numbers into their integer (base 10) equivalent and use the value to reference a character from our encoding alphabet.

# convert to base 10 integers
result = ["000011", "011111", "001001", "110001", "100110"]
result = result.map { |x| x.to_i(2) }
# >> [3, 31, 9, 49, 38]
# retrieve alphabet characters
result = result.map { |x| ALPHABET[x] }
# >>  ["3", "V", "9", "n", "c"]

6. Return encoded string

All there is left to do is to join the characters from our alphabet together and return the encoded string back to our client caller.

result = ["3", "V", "9", "n", "c"]
result.join
# >> "3V9nc"
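Putting the six steps together gives a method along these lines. This is a sketch rather than a hardened implementation: encode_uuid is a name chosen for this example, and it assumes it is handed a well-formed UUID without validating it.

```ruby
# Step 1: the encoding alphabet (the same 64 characters as before, built from ranges).
ALPHABET = [*"0".."9", *"A".."Z", *"a".."z", "-", "_"].freeze

def encode_uuid(uuid)
  hex = uuid.delete("-")                                        # step 2: remove hyphens
  bits = hex.chars.map { |c| c.hex.to_s(2).rjust(4, "0") }.join # step 3: to binary
  padded = "0000" + bits                                        # step 4: pad to 132 digits
  indexes = padded.scan(/.{6}/).map { |group| group.to_i(2) }   # step 4: base 10 integers
  indexes.map { |index| ALPHABET[index] }.join                  # steps 5 and 6
end

puts encode_uuid("70a879c6-6dc3-4c4c-b464-0a5549618d7f") # prints 1mg7d6RSDCJBHa2bL9OOr_
```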

We can now save this encoded value to a database and use it just like we would have used a slug to retrieve a record from a table. Saving it to a database means we only need to compute the encoded value once. We also don’t have to be concerned about the uniqueness of the encoded value, since we already know that the UUID is unique.

If we don’t save the encoded value to a database, then we would need to decode it before we will be able to make use of it. There is no one size fits all solution, but it is important to consider the performance implications of saving and computing the encoded value.

Decoding back to UUID

It is completely possible to decode a string back into its original UUID form. For brevity I will present the algorithm below in the form of a Ruby script. It is essentially just a reversal of the six steps we took to encode the UUID.

#  "70a879c6-6dc3-4c4c-b464-0a5549618d7f"
encoded = "1mg7d6RSDCJBHa2bL9OOr_"

# 1. Split string
result =  encoded.split('')
# >> ["1","m","g","7","d","6","R","S","D","C","J","B","H","a","2","b","L","9","O","O","r","_"]

# 2. Convert to base 10 integer values
result = result.map { |x| ALPHABET.index(x) }
# >> [1,48,42,7,39,6,27,28,13,12,19,11,17,36,2,37,21,9,24,24,53,63]

# 3. Convert to 6 bit binaries
result = result.map { |x| x.to_s(2).rjust(6, '0') }
# ["000001",
#  "110000",
#  "101010",
#  "000111",
#  "100111",
#  "000110",
#  "011011",
#  "011100",
#  "001101",
#  "001100",
#  "010011",
#  "001011",
#  "010001",
#  "100100",
#  "000010",
#  "100101",
#  "010101",
#  "001001",
#  "011000",
#  "011000",
#  "110101",
#  "111111"]

# 4. Join binaries
result = result.join
# >> "000001110000101010000111100111000110011011011100001101001100010011001011010001100100000010100101010101001001011000011000110101111111"

# 5. Group into BCD
result = result.scan(/.{4}/)
# ["0000",
#  "0111",
#  "0000",
#  "1010",
#  "1000",
#  "0111",
#  "1001",
#  "1100",
#  "0110",
#  "0110",
#  "1101",
#  "1100",
#  "0011",
#  "0100",
#  "1100",
#  "0100",
#  "1100",
#  "1011",
#  "0100",
#  "0110",
#  "0100",
#  "0000",
#  "1010",
#  "0101",
#  "0101",
#  "0100",
#  "1001",
#  "0110",
#  "0001",
#  "1000",
#  "1101",
#  "0111",
#  "1111"]

# 6. Remove 4 leading zeros
# shift removes first item in result array
result.shift

# 7. Convert BCD to hexadecimal
result = result.map { |x| x.to_i(2).to_s(16) }
# ["7",
#  "0",
#  "a",
#  "8",
#  "7",
#  "9",
#  "c",
#  "6",
#  "6",
#  "d",
#  "c",
#  "3",
#  "4",
#  "c",
#  "4",
#  "c",
#  "b",
#  "4",
#  "6",
#  "4",
#  "0",
#  "a",
#  "5",
#  "5",
#  "4",
#  "9",
#  "6",
#  "1",
#  "8",
#  "d",
#  "7",
#  "f"]

# 8. Join and add hyphens
result =  result.join.unpack('a8a4a4a4a12').join('-')
# "70a879c6-6dc3-4c4c-b464-0a5549618d7f"
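The eight steps can be condensed into a single method. The sketch below uses the hypothetical name decode_uuid and adds one guard that the script above does not have: it rejects input that is not exactly 22 characters from our alphabet.

```ruby
ALPHABET = [*"0".."9", *"A".."Z", *"a".."z", "-", "_"].freeze

def decode_uuid(encoded)
  unless encoded.match?(/\A[0-9A-Za-z_-]{22}\z/)
    raise ArgumentError, "expected 22 characters from the encoding alphabet"
  end

  # Steps 1 to 4: characters to indexes to one long binary string.
  bits = encoded.chars.map { |c| ALPHABET.index(c).to_s(2).rjust(6, "0") }.join
  # Steps 5 and 6: group into 4 digit chunks and drop the padding chunk.
  nibbles = bits.scan(/.{4}/).drop(1)
  # Steps 7 and 8: back to hexadecimal, then restore the hyphens.
  hex = nibbles.map { |nibble| nibble.to_i(2).to_s(16) }.join
  hex.unpack("a8a4a4a4a12").join("-")
end

puts decode_uuid("1mg7d6RSDCJBHa2bL9OOr_") # prints 70a879c6-6dc3-4c4c-b464-0a5549618d7f
```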

Further considerations

A different encoding scheme

A Base64 encoded string is difficult to type and even harder to pronounce. We could consider a different encoding scheme that might be more user friendly. Base32 is just such an encoding scheme.

The Base32 encoding scheme is less compact than Base64 and would require 26 characters to encode the 32 hexadecimal digits of a UUID. Using this encoding scheme would produce strings that are URL safe and, because Base32 alphabets use a single letter case, also much more user friendly.
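As a rough sketch, the same padding and grouping idea works with groups of 5 binary digits. The alphabet below is the RFC 4648 Base32 alphabet in lowercase. Note that this mirrors our own scheme (zeros added on the left) and is not a standards-compliant Base32 encoder. BASE32_ALPHABET and encode_base32 are names made up for this example.

```ruby
# 32 characters: a-z followed by 2-7 (the RFC 4648 alphabet, lowercased).
BASE32_ALPHABET = [*"a".."z", *"2".."7"].freeze

def encode_base32(uuid)
  # 128 binary digits padded on the left to 130, which gives 26 groups of 5.
  bits = uuid.delete("-").to_i(16).to_s(2).rjust(130, "0")
  bits.scan(/.{5}/).map { |group| BASE32_ALPHABET[group.to_i(2)] }.join
end

slug = encode_base32("70a879c6-6dc3-4c4c-b464-0a5549618d7f")
puts slug.length # prints 26
```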

Using Ruby's unpack

I have not tried it yet, but we could likely simplify the algorithm quite a bit by making use of Ruby's pack and unpack. We could for example use pack to Base64 encode a UUID, and then use gsub to replace the + and / characters.
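Here is one way that idea might look. It is a sketch under a few assumptions: the method names are invented, tr is used in place of gsub for the character swap, and the trailing = padding is stripped. The result is also 22 characters long, but it is not the same string our own algorithm produces, because standard Base64 orders its alphabet differently and pads at the end rather than the front.

```ruby
def encode_uuid_packed(uuid)
  bytes = [uuid.delete("-")].pack("H*")         # 32 hexadecimal digits to 16 raw bytes
  [bytes].pack("m0").tr("+/", "-_").delete("=") # Base64, made URL safe, padding removed
end

def decode_uuid_packed(encoded)
  base64 = encoded.tr("-_", "+/") + "=="        # restore the alphabet and the padding
  hex = base64.unpack1("m0").unpack1("H*")      # back to bytes, then to hexadecimal
  hex.unpack("a8a4a4a4a12").join("-")
end

uuid = "70a879c6-6dc3-4c4c-b464-0a5549618d7f"
packed = encode_uuid_packed(uuid)

puts packed.length                      # prints 22
puts decode_uuid_packed(packed) == uuid # prints true
```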

Non-reversible shortened slugs

We don’t always need to be able to reverse the encoding process. Instead, if we are most interested in short slugs then we can use a substring of the encoded string. As long as we save the substring we will be able to make use of it very much like we would use a slug in other situations.

We could start off with a 4 character substring. Before we save the substring we would need to make sure that it is unique. If it is already taken, we gradually increase the length of our substring until we are able to confirm that it is unique. Only then do we save the generated substring, or slug.

This approach will work because we know that in the worst case scenario we would need to use the full length encoded string for it will always be unique. Doing multiple uniqueness checks will have a performance impact we need to be mindful of.
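The loop described above could be sketched as follows. A Set stands in for the database table here, and shortest_unique_slug is a hypothetical helper. A real application would check against the database instead, ideally backed by a unique index so that two concurrent requests cannot claim the same slug.

```ruby
require "set"

# Returns the shortest prefix of the encoded value that is not taken yet.
def shortest_unique_slug(encoded, taken, min_length: 4)
  (min_length..encoded.length).each do |length|
    candidate = encoded[0, length]
    return candidate unless taken.include?(candidate)
  end
  raise ArgumentError, "the full encoded value is already taken"
end

taken = Set["1mg7", "1mg7d"] # slugs that are already in use
slug = shortest_unique_slug("1mg7d6RSDCJBHa2bL9OOr_", taken)
taken << slug

puts slug # prints 1mg7d6
```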

There is also a similar but stateless approach taken by FriendlyUUID to compute a truncated UUID on the fly.

Conclusion

There are many other approaches we could have taken, and you don’t need to look too far on GitHub to find them. The goal of this post was to show you that this can be done without using a gem, but mostly to have some fun with UUIDs. And so I hope you had some fun, and if this was your first introduction to UUIDs then I hope it was informative.

You can take a look at these interesting approaches taken by others who had similar goals as us in mind. (Disclosure, not all of these approaches are in Ruby)

Further reading