+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-22"
updated = "2023-06-22"
[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm"]
[extra]
toc = false
+++

# *Mise en Scene*

I recently spent a couple days moving from [one type of universally unique
identifier](https://commons.apache.org/sandbox/commons-id/uuid.html) to a [different
one](https://github.com/ulid/spec), for an in-progress [database-backed
web-app](https://gitlab.com/nebkor/ww). The [initial
work](https://gitlab.com/nebkor/ww/-/commit/be96100237da56313a583be6da3dc27a4371e29d#f69082f7433f159d627269b207abdaf2ad52b24c)
didn't take very long, but debugging the [serialization and
deserialization](https://en.wikipedia.org/wiki/Serialization) of the new IDs took another day and a
half, and in the end, the alleged mystery of why it wasn't working was a red herring due to my own
stupidity. So come with me on an exciting voyage of discovery, and [once again, learn from my
folly](@/sundries/a-thoroughly-digital-artifact/index.md)!

# Keys, primarily

Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy
facade for some kind of database. Facebook? That's a database. Gmail? That's a database.

![that's a database][thats_a_database]
<center><span class="caption">wikipedia? that's a database.</span></center>

In most databases, each entry ("row") has a field that acts as a [primary
key](https://en.wikipedia.org/wiki/Primary_key), used to uniquely identify that row inside the table
it's in. Since databases typically contain multiple tables, and primary keys have to be unique only
within their own table, you could just use a simple integer that's automatically incremented every
time you add a new record. In many databases, if you create a table without specifying a primary
key, they will [automatically and implicitly use a
mechanism](https://www.sqlite.org/lang_createtable.html#rowid) like that. You may also recognize the
idea of "serial numbers", which is what these sorts of IDs are.

This is often totally fine! If you only ever have one copy of the database, and never have to worry
about inserting rows from a different instance of the database, then you can just use those simple
values and move on your merry way.

However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between them.

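To see why serial numbers collide across instances, here's a toy in-memory sketch. It is not how any real database works internally; `Table` and `colliding_keys` are names made up for this illustration, and each "instance" simply keeps its own counter starting at 1.

```rust
use std::collections::HashMap;

/// A toy "table" whose primary key is an auto-incremented serial number.
/// (Hypothetical type, for illustration only.)
struct Table {
    next_id: u64,
    rows: HashMap<u64, String>,
}

impl Table {
    fn new() -> Self {
        Table { next_id: 1, rows: HashMap::new() }
    }

    /// Insert a row and return the serial number it was assigned.
    fn insert(&mut self, value: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.insert(id, value.to_string());
        id
    }
}

/// Count the primary keys that exist in both instances. Every one of them
/// is a conflict if the two instances ever need to be merged.
fn colliding_keys(a: &Table, b: &Table) -> usize {
    a.rows.keys().filter(|k| b.rows.contains_key(*k)).count()
}
```

Two instances that each insert their first row independently both hand out key `1`, so `colliding_keys` reports a conflict even though the rows have nothing to do with each other.
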
## UUIDs

A popular type for these is the
[v4 UUID](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to
store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're
a programmer, this sort of conspicuous waste is unconscionable.

You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes
as the actual data requires. If you never have to actually display the ID inside the database, then
the simplest thing to do is just store it as a blob of 16 bytes[^blob-of-bytes]. Finally, optimal
representation and efficiency!

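The 36 vs. 32 vs. 16 arithmetic is easy to check with a short sketch. `uuid_to_bytes` is a hypothetical helper written with only the standard library; it is not the API of the `uuid` crate or of SQLx, and it skips any validation of the version bits.

```rust
/// Parse a hyphenated (or dash-less) UUID string into its 16 raw bytes.
/// Returns `None` unless exactly 32 hex digits remain once dashes are removed.
fn uuid_to_bytes(text: &str) -> Option<[u8; 16]> {
    // Dropping the four dashes takes the 36-character form down to 32.
    let hex: String = text.chars().filter(|c| *c != '-').collect();
    if hex.len() != 32 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }

    let mut bytes = [0u8; 16];
    for (i, byte) in bytes.iter_mut().enumerate() {
        // Each pair of hex digits encodes one byte, taking 32 down to 16.
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(bytes)
}
```

Feeding it the example string from above yields a 16-byte array starting with `0x1c` and ending with `0xea`, which is all the information the 36-character form ever carried.
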
## Indexes?

And at first, that's what I did. The [external library](https://docs.rs/sqlx/latest/sqlx/) I'm using
to interface with my database automatically writes UUIDs as a sequence of sixteen bytes, if you
specify the type in the database[^sqlite-dataclasses] as "[blob](https://www.sqlite.org/datatype3.html)", which [I
did](https://gitlab.com/nebkor/ww/-/commit/65a32f1f20df6c572580d796e1044bce807fd3b6#f1043d50a0244c34e4d056fe96659145d03b549b_0_5).

But then I saw a [blog post](https://shopify.engineering/building-resilient-payment-systems) where
the following tidbit was mentioned:

> We prefer using an Universally Unique Lexicographically Sortable Identifier (ULID) for these
> idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by
> 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the
> b-tree data structure databases use for indexing. In one high-throughput system at Shopify we’ve
> seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for
> idempotency keys.

Whoa, that sounds great! But [this YouTube
video](https://www.youtube.com/watch?v=f53-Iw_5ucA&t=590s) tempered my expectations a bit, by
describing the implementation-dependent reasons for that dramatic
improvement. Still, switching from UUIDs to ULIDs couldn't *hurt*[^no-stinkin-benches], right? Plus,
by encoding the time of creation (at least to the nearest millisecond), I could remove a "created
at" field from every table that used them as primary keys. Which, in my case, would be all of them,
and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.

Plus, I was familiar with the idea of using sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using KSUIDs from the get-go, but discarded that for two main reasons:

- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization for them anyway, since SQLx didn't
  have native support for them

In reality, neither of those is a real show-stopper; 20 vs. 16 bytes is probably not that
significant, and I'd have to do the manual serialization stuff anyway.

I was ready to do this thing.

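To make the layout from the quote above concrete (a 48-bit timestamp followed by 80 random bits), here's a minimal sketch that packs and unpacks a ULID-shaped `u128`. The names `make_ulid` and `timestamp_ms` are made up for this example and are not the `ulid` crate's API; the random bits are passed in by the caller, since the standard library has no random number generator.

```rust
/// Pack a 48-bit millisecond timestamp and 80 bits of randomness into a
/// single 128-bit value, with the timestamp in the most significant bits.
fn make_ulid(timestamp_ms: u64, random: u128) -> u128 {
    // Mask both inputs so neither can spill into the other's bits.
    let ts = (timestamp_ms as u128) & ((1u128 << 48) - 1);
    let rand = random & ((1u128 << 80) - 1);
    (ts << 80) | rand
}

/// Recover the creation time (milliseconds since the Unix epoch) from an ID.
/// This is what would make a separate "created at" column redundant.
fn timestamp_ms(ulid: u128) -> u64 {
    (ulid >> 80) as u64
}
```

Because the timestamp sits in the high bits, an ID minted even one millisecond later compares greater than an earlier one regardless of the random part, both as an integer and as sixteen big-endian bytes. That ordering is the property the b-tree argument in the quote relies on.
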
# Serial problems

"Deserialization" is the act of converting a static, non-native representation of some kind of
datatype into a dynamic, native computer programming object, so that you can do the right computer
programming stuff to it. It can be as simple as when a program reads in a string of digit characters
and parses it into an actual number, but of course the ceiling on complexity is limitless.

In my case, it was about getting those sixteen bytes out of the database and turning them into
ULIDs. Technically, I could have let Rust [handle that for me](https://serde.rs/derive.html) by
automatically deriving that functionality. There were a couple snags with that course, though:

- the default serialized representation of a ULID in the library I was using to provide them [is as
  26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html)
- you could tell it to serialize as a [128-bit
  number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that only kicked the
  problem one step down the road since SQLite can only handle up to 64-bit numbers, as previously
  discussed, so I'd still have to manually do something for them

This meant going all-in on fully custom serialization and deserialization, something I'd never done
before, but how hard could it be? (actually not that hard!)

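For a sense of what the manual conversion boils down to, here's a minimal sketch using only the standard library. The `Id` newtype and its `to_blob`/`from_blob` methods are hypothetical names for illustration, not the project's actual code and not the serde or SQLx traits a real implementation would go through. It assumes the ID is held as a `u128` and written as a 16-byte blob in big-endian order.

```rust
/// Hypothetical newtype wrapping the 128-bit value of a ULID.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Id(u128);

impl Id {
    /// Serialize: sixteen bytes, most significant byte first (big-endian),
    /// so that comparing blobs byte-by-byte matches comparing the numbers.
    fn to_blob(self) -> [u8; 16] {
        self.0.to_be_bytes()
    }

    /// Deserialize: accept exactly sixteen bytes, reject anything else.
    fn from_blob(blob: &[u8]) -> Option<Id> {
        if blob.len() != 16 {
            return None;
        }
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(blob);
        Some(Id(u128::from_be_bytes(bytes)))
    }
}
```

A round trip through `to_blob` and `from_blob` gives back the same `Id`, and a slice of any other length is rejected instead of being silently padded or truncated. The value itself is far too large for a 64-bit integer column, which is why the blob is needed in the first place.
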
## Great coders steal

steal the uuid serde impls from sqlx

## A puzzling failure

# When in trouble, be sure to change many things at once

## Death to the littlendians, obviously
- endianness
- profit

----

[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are
reserved for version and variant metadata.

[^blob-of-bytes]: Some databases have direct support for 128-bit primitive values (numbers). The
database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support
arbitrary-length sequences of bytes called "blobs".

[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I
plan to dive into in a different post, but "blob" is specific to SQLite. In general, you'll probably
want to take advantage of implementation-specific features of whatever database you're using, which
means that your table definitions won't be fully portable to a different database. This is fine and
good, actually!

[^no-stinkin-benches]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha,
you must have never met a programmer before! So, no, obviously not. But that's coming in a follow-up.

[thats_a_database]: ./thats_a_database.png "that's a database"