blog/content/rnd/a_serialized_mystery/index.md

158 lines
8.2 KiB
Markdown
Raw Normal View History

2023-06-24 21:00:00 +00:00
+++
2023-06-24 23:51:51 +00:00
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
2023-06-24 21:00:00 +00:00
date = "2023-06-22"
updated = "2023-06-22"
[taxonomies]
2023-06-24 23:51:51 +00:00
tags = ["software", "rnd", "proclamation", "upscm"]
2023-06-24 21:00:00 +00:00
[extra]
toc = false
+++
# *Mise en Scene*
2023-06-24 23:51:51 +00:00
I recently spent a couple days moving from [one type of universally unique
identifier](https://commons.apache.org/sandbox/commons-id/uuid.html) to a [different
one](https://github.com/ulid/spec), for an in-progress [database-backed
web-app](https://gitlab.com/nebkor/ww). The [initial
work](https://gitlab.com/nebkor/ww/-/commit/be96100237da56313a583be6da3dc27a4371e29d#f69082f7433f159d627269b207abdaf2ad52b24c)
didn't take very long, but debugging the [serialization and
2023-06-27 00:51:42 +00:00
deserialization](https://en.wikipedia.org/wiki/Serialization) of the new IDs took another day and a
half, and in the end, the alleged mystery of why it wasn't working was a red herring due to my own
stupidity. So come with me on an exciting voyage of discovery, and [once again, learn from my
2023-06-24 23:51:51 +00:00
folly](@/sundries/a-thoroughly-digital-artifact/index.md)!
# Keys, primarily
Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy
2023-06-26 19:16:33 +00:00
facade for some kind of database. Facebook? That's a database. Gmail? That's a database.
2023-06-24 23:51:51 +00:00
![that's a database][thats_a_database]
<center><span class="caption">wikipedia? that's a database.</span></center>
In most databases, each entry ("row") has a field that acts as a [primary
key](https://en.wikipedia.org/wiki/Primary_key), used to uniquely identify that row inside the table
it's in. Since databases typically contain multiple tables, and primary keys have to be unique only
within their own table, you could just use a simple integer that's automatically incremented every
time you add a new record, and in many databases, if you create a table without specifying a primary
key, they will [automatically and implicitly use a
2023-06-26 19:16:33 +00:00
mechanism](https://www.sqlite.org/lang_createtable.html#rowid) like that. You may also recognize the
idea of "serial numbers", which is what these sorts of IDs are.
2023-06-24 23:51:51 +00:00
This is often totally fine! If you only ever have one copy of the database, and never have to worry
about inserting rows from a different instance of the database, then you can just use those simple
values and move on your merry way.
However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between primary keys.
## UUIDs
2023-06-26 19:16:33 +00:00
A popular type for these is called a
[v4 UUIDs](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
2023-06-24 23:51:51 +00:00
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to
2023-06-26 19:16:33 +00:00
store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're
2023-06-24 23:51:51 +00:00
a programmer, this sort of conspicous waste is unconscionsable.
You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes
as the actual data requires. If you never have to actually display the ID inside the database, then
the simplest thing to do is just store it as a blob of 16 bytes[^blob-of-bytes]. Finally, optimal
representation and efficiency!
## Indexes?
2023-06-27 00:51:42 +00:00
And at first, that's what I did. The [external library](https://docs.rs/sqlx/latest/sqlx/) I'm using
to interface with my database automatically writes UUIDs as a sequence of sixteen bytes, if you
specified the type in the database[^sqlite-dataclasses] as "[blob](https://www.sqlite.org/datatype3.html)", which [I
did](https://gitlab.com/nebkor/ww/-/commit/65a32f1f20df6c572580d796e1044bce807fd3b6#f1043d50a0244c34e4d056fe96659145d03b549b_0_5).
2023-06-24 23:51:51 +00:00
2023-06-27 00:51:42 +00:00
But then I saw a [blog post](https://shopify.engineering/building-resilient-payment-systems) where
the following tidbit was mentioned:
2023-06-24 23:51:51 +00:00
2023-06-27 00:51:42 +00:00
> We prefer using an Universally Unique Lexicographically Sortable Identifier (ULID) for these
> idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by
> 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the
> b-tree data structure databases use for indexing. In one high-throughput system at Shopify weve
> seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for
> idempotency keys.
2023-06-24 21:00:00 +00:00
2023-06-27 00:51:42 +00:00
Whoa, that sounds great! But [this youtube
video](https://www.youtube.com/watch?v=f53-Iw_5ucA&t=590s) tempered my expectations a bit, by
describing the implementation-dependent reasons for that dramatic
improvement. Still, switching from UUIDs to ULIDs couldn't *hurt*[^no-stinkin-benches], right? Plus,
by encoding the time of creation (at least to the nearest millisecond), I could remove a "created
at" field from every table that used them as primary keys. Which, in my case, would be all of them,
and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.
Plus, I was familiar with the idea of using sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using KSUIDs from the get-go, but discarded that for two main reasons:
- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization for them anyway, since SQLx didn't
have native support for them
In reality, neither of those are real show-stoppers; 20 vs. 16 bytes is probably not that
significant, and I'd have to do the manual serialization stuff anyway.
I was ready to do this thing.
# Serial problems
"Deserilization" is the act of converting a static, non-native representation of some kind of
datatype into a dynamic, native computer programming object, so that you can do the right computer
programming stuff to it. It can be as simple as when a program reads in a string of digit characters
and parses it into a real number, but of course the ceiling on complexity is limitless.
In my case, it was about getting those sixteen bytes out of the database and turning them into
ULIDs. Technically, I could have let Rust [handle that for me](https://serde.rs/derive.html) by
automatically deriving that functionality. There were a couple snags with that course, though:
- the default serialized representation of a ULID in the library I was using to provide them [is as
26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html)
- you could tell it to serialize as a [128-bit
number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that only kicked the
problem one step down the road since SQLite can only handle up to 64-bit numbers, as previously
discussed, so I'd still have to manually do something for them
This meant going all-in on fully custom serialization and deserialization, something I'd never done
before, but how hard could it be? (actually not that hard!)
## Great coders steal
steal the uuid serde impls from sqlx
2023-06-24 21:00:00 +00:00
## A puzzling failure
2023-06-24 23:51:51 +00:00
2023-06-27 00:51:42 +00:00
# When in trouble, be sure to change many things at once
## Death to the littlendians, obviously
- endianness
- profit
2023-06-24 23:51:51 +00:00
----
[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are
reserved for version metadata.
2023-06-26 19:16:33 +00:00
[^blob-of-bytes]: Some databases have direct support for 128-bit primitive values (numbers). The
database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support
arbitrary-length sequences of bytes called "blobs".
2023-06-27 00:51:42 +00:00
[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I
plan to dive into in a different post, but "blob" is specific to SQLite. In general, you'll probably
want to take advantage of implementation-specific features of whatever database you're using, which
means that your table definitions won't be fully portable to a different database. This is fine and
good, actually!
[^no-stinkin-benches]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha,
you must have never met a programmer before! So, no, obviously not. But that's coming in a follow-up.
2023-06-24 23:51:51 +00:00
[thats_a_database]: ./thats_a_database.png "that's a database"