blog/content/rnd/a_serialized_mystery/index.md
2023-06-28 11:28:46 -07:00

+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-22"
updated = "2023-06-22"

[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm"]

[extra]
toc = false
+++

Mise en Scene

I recently spent a couple days moving from one type of universally unique identifier to a different one, for an in-progress database-backed web-app. The initial work didn't take very long, but debugging the serialization and deserialization of the new IDs took another day and a half. So come with me on an exciting voyage of discovery, and once again, learn from my folly!

Keys, primarily

Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy facade for some kind of database. Facebook? That's a database. Gmail? That's a database.

that's a database

wikipedia? that's a database.

In most databases, each entry ("row") has a field that acts as a primary key, used to uniquely identify that row inside the table it's in. Databases typically contain multiple tables, but a primary key only has to be unique within its own table, so you could just use a simple integer that's automatically incremented every time you add a new record. In many databases, if you create a table without specifying a primary key, the database will automatically and implicitly use a mechanism like that. You may also recognize the idea of "serial numbers", which is what these sorts of IDs are.

This is often totally fine! If you only ever have one copy of the database, and never have to worry about inserting rows from a different instance of the database, then you can just use those simple values and move on your merry way.
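To make that implicit mechanism concrete, here's a minimal sketch using SQLite through Python's standard `sqlite3` module. The `notes` table and its `body` column are made up for illustration; they're not from the app this post is about.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")  # no primary key declared

for body in ("first", "second", "third"):
    conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))

# SQLite quietly gives every row an integer "rowid" that counts upward.
rows = conn.execute("SELECT rowid, body FROM notes ORDER BY rowid").fetchall()
print(rows)  # [(1, 'first'), (2, 'second'), (3, 'third')]
```

Nobody asked for those numbers, but there they are: serial IDs, handed out in insertion order.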

However, if you ever think you might want to have multiple instances of your database running, and want to make sure they're eventually consistent with each other, then you might want to use a fancier identifier for your primary keys, so that rows created on different instances don't collide.
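Here's a rough sketch of the kind of collision I mean, again with Python's `sqlite3` module. The two in-memory databases stand in for two independent instances, and the `users` table and `new_instance` helper are invented for this example.

```python
import sqlite3

def new_instance():
    """Create an independent in-memory database with a serial primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

# Two instances that know nothing about each other.
a, b = new_instance(), new_instance()
a.execute("INSERT INTO users (name) VALUES ('alice')")
b.execute("INSERT INTO users (name) VALUES ('bob')")

id_a = a.execute("SELECT id FROM users").fetchone()[0]
id_b = b.execute("SELECT id FROM users").fetchone()[0]

# Both instances handed out the same key to different rows, so copying
# b's row into a (keeping its original key) is rejected.
collided = False
try:
    a.execute("INSERT INTO users (id, name) VALUES (?, 'bob')", (id_b,))
except sqlite3.IntegrityError:
    collided = True

print(id_a, id_b, collided)  # 1 1 True
```

Each instance was perfectly consistent on its own; the trouble only shows up when you try to combine them.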

UUIDs

A popular type for these is the v4 UUID. These are 128-bit random numbers[1], and when turned into a string, usually look something like 1c20104f-e04f-409e-9ad3-94455e5f4fea; this is called the "hyphenated" form, for fairly obvious reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're a programmer, this sort of conspicuous waste is unconscionable.
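If you'd like to check those numbers yourself, here's a small sketch using Python's standard `uuid` module. That's just a convenient way to poke at UUIDs, not the code from the app.

```python
import uuid

u = uuid.uuid4()       # a fresh, random v4 UUID

hyphenated = str(u)    # looks like '1c20104f-e04f-409e-9ad3-94455e5f4fea'
raw = u.bytes          # the same 128-bit value as raw bytes

print(len(hyphenated), len(raw))  # 36 16
print(u.version)                  # 4

# Both forms describe the same ID, so either one can rebuild it.
assert uuid.UUID(hyphenated) == uuid.UUID(bytes=raw)
```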

You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes as the actual data requires. If you never have to actually display the ID inside the database, then the simplest thing to do is just store it as a blob of 16 bytes[2]. Finally, optimal representation and efficiency!
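As a sketch of what storing the ID as a blob might look like, here's a round trip through SQLite using Python's `sqlite3` and `uuid` modules. The `items` table and its columns are hypothetical, and this is only one plausible way to do it.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical table: the primary key is a raw 16-byte blob.
conn.execute("CREATE TABLE items (id BLOB PRIMARY KEY, name TEXT)")

item_id = uuid.uuid4()
dashless = item_id.hex  # the 32-character form with the dashes dropped

# Python `bytes` values are bound as SQLite blobs.
conn.execute(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    (item_id.bytes, "widget"),
)

# length() on a blob counts bytes, so this is the actual stored size.
stored, size = conn.execute("SELECT id, length(id) FROM items").fetchone()
restored = uuid.UUID(bytes=stored)

print(len(dashless), size, restored == item_id)  # 32 16 True
```

The catch is that the blob is opaque: anything that wants to show the ID to a human has to convert it back to a string first.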

Indexes?

Imagine, if you will, that you're a computer programmer. One common trait among such creatures is a desire to be "efficient".

  • programmers like efficiency
  • databases have primary keys and keep indices
  • uuids are useful but wasteful (note: NO BENCHMARKS!)
  • ulids seem cool
  • endianness
  • profit

First steps

A puzzling failure



  1. Technically, most v4 UUIDs have only 122 random bits, as six out of the 128 are reserved for version and variant metadata. ↩︎

  2. Some databases have direct support for 128-bit primitive values (numbers). The database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support arbitrary-length sequences of bytes called "blobs". ↩︎