diff --git a/content/rnd/a_serialized_mystery/index.md b/content/rnd/a_serialized_mystery/index.md index dd45fa7..7191f48 100644 --- a/content/rnd/a_serialized_mystery/index.md +++ b/content/rnd/a_serialized_mystery/index.md @@ -1,17 +1,71 @@ +++ -title = "A Serialized Mystery, in One Part" -slug = "serialized-mystery-one-part" +title = "A One-Part Serialized Mystery" +slug = "one-part-serialized-mystery" date = "2023-06-22" updated = "2023-06-22" [taxonomies] -tags = ["software", "rnd", "proclamation", "upcsm"] +tags = ["software", "rnd", "proclamation", "upscm"] [extra] toc = false +++ # *Mise en Scene* -Imagine, if you will, that you're a computer programmer. +I recently spent a couple days moving from [one type of universally unique +identifier](https://commons.apache.org/sandbox/commons-id/uuid.html) to a [different +one](https://github.com/ulid/spec), for an in-progress [database-backed +web-app](https://gitlab.com/nebkor/ww). The [initial +work](https://gitlab.com/nebkor/ww/-/commit/be96100237da56313a583be6da3dc27a4371e29d#f69082f7433f159d627269b207abdaf2ad52b24c) +didn't take very long, but debugging the [serialization and +deserialization](https://en.wikipedia.org/wiki/Serialization) of the new IDs took another day and a half. So come with me on +an exciting voyage of discovery, and [once again, learn from my +folly](@/sundries/a-thoroughly-digital-artifact/index.md)! + +# Keys, primarily + +Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy +facade for some kind of database. Facebook? Database. Gmail? Database. + +![that's a database][thats_a_database] +
wikipedia? that's a database.
+ +In most databases, each entry ("row") has a field that acts as a [primary +key](https://en.wikipedia.org/wiki/Primary_key), used to uniquely identify that row inside the table +it's in. Since databases typically contain multiple tables, and primary keys have to be unique only +within their own table, you could just use a simple integer that's automatically incremented every +time you add a new record, and in many databases, if you create a table without specifying a primary +key, they will [automatically and implicitly use a +mechanism](https://www.sqlite.org/lang_createtable.html#rowid) like that. + +This is often totally fine! If you only ever have one copy of the database, and never have to worry +about inserting rows from a different instance of the database, then you can just use those simple +values and move on your merry way. + +However, if you ever think you might want to have multiple instances of your database running, and +want to make sure they're eventually consistent with each other, then you might want to use a +fancier identifier for your primary keys, to avoid collisions between primary keys. + +## UUIDs + +One very common type for these is called a +[UUIDv4](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random +numbers[^uuidv4_random], and when turned into a string, usually look something like +`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious +reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to +store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're +a programmer, this sort of conspicuous waste is unconscionable. + +You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes +as the actual data requires. 
If you never have to actually display the ID inside the database, then +the simplest thing to do is just store it as a blob of 16 bytes[^blob-of-bytes]. Finally, optimal +representation and efficiency! + +## Indexes? + + + +Imagine, if you will, that you're a computer programmer. One common trait among such creatures is a +desire to be "efficient". - programmers like efficiency - databases have primary keys and keep indices @@ -23,3 +77,10 @@ Imagine, if you will, that you're a computer programmer. # First steps ## A puzzling failure + +---- + +[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are +reserved for version metadata. + +[thats_a_database]: ./thats_a_database.png "that's a database" diff --git a/content/rnd/a_serialized_mystery/thats_a_database.png b/content/rnd/a_serialized_mystery/thats_a_database.png new file mode 100644 index 0000000..a98b5af Binary files /dev/null and b/content/rnd/a_serialized_mystery/thats_a_database.png differ
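For anyone reviewing the new "UUIDs" section, the size claims (36 bytes hyphenated, 32 without dashes, 16 as a raw blob) can be checked with a small sketch. It uses Python's standard `uuid` module purely for illustration. The post does not say what language the web-app is written in, and the helper name `uuid_forms` is hypothetical.

```python
import uuid


def uuid_forms(text: str) -> dict:
    """Return the three representations discussed in the post for one UUID.

    `uuid_forms` is a hypothetical helper for illustration only.
    """
    u = uuid.UUID(text)
    return {
        "hyphenated": str(u),  # text form with dashes
        "simple": u.hex,       # text form without dashes
        "blob": u.bytes,       # raw big-endian bytes
        "version": u.version,  # version number encoded in the reserved bits
    }


# The example ID is the one quoted in the post.
forms = uuid_forms("1c20104f-e04f-409e-9ad3-94455e5f4fea")

for name in ("hyphenated", "simple", "blob"):
    print(name, len(forms[name]))

# The blob round-trips back to the same UUID, so no information is lost.
restored = uuid.UUID(bytes=forms["blob"])
print(restored == uuid.UUID(forms["hyphenated"]))
```

The lengths printed are in characters for the two text forms and in bytes for the blob. Since both text forms are pure ASCII, characters and bytes are the same count when they are stored as text.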