blog/content/rnd/a_serialized_mystery/index.md
2023-07-29 16:46:18 -07:00

+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-29"
updated = "2023-07-29"
[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm", "rust", "ulid", "sqlite"]
+++
# *Mise en Scene*
I recently spent a couple days moving from [one type of universally unique
identifier](https://commons.apache.org/sandbox/commons-id/uuid.html) to a [different
one](https://github.com/ulid/spec), for an in-progress [database-backed
web-app](https://gitlab.com/nebkor/ww). The [initial
work](https://gitlab.com/nebkor/ww/-/commit/be96100237da56313a583be6da3dc27a4371e29d#f69082f7433f159d627269b207abdaf2ad52b24c)
didn't take very long, but debugging the [serialization and
deserialization](https://en.wikipedia.org/wiki/Serialization) of the new IDs took another day and a
half, and in the end, the alleged mystery of why it wasn't working was a red herring due to my own
stupidity. So come with me on an exciting voyage of discovery, and [once again, learn from my
folly](@/sundries/a-thoroughly-digital-artifact/index.md)!
# Keys, primarily
Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy
facade for some kind of database. Facebook? That's a database. Gmail? That's a database.
![that's a database][thats_a_database]
<div class="caption">wikipedia? that's a database.</div>
In most databases, each entry ("row") has a field that acts as a [primary
key](https://en.wikipedia.org/wiki/Primary_key), used to uniquely identify that row inside the table
it's in. Since databases typically contain multiple tables, and primary keys have to be unique only
within their own table, you could just use a simple integer that's automatically incremented every
time you add a new record, and in many databases, if you create a table without specifying a primary
key, they will [automatically and implicitly use a
mechanism](https://www.sqlite.org/lang_createtable.html#rowid) like that. You may also recognize the
idea of "serial numbers", which is what these sorts of IDs are.
This is often totally fine! If you only ever have one copy of the database, and never have to worry
about inserting rows from a different instance of the database, then you can just use those simple
values and move on your merry way.
However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between them.
## UUIDs
A popular type for fancy keys is the
[v4 UUID](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to
store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're
a programmer, this sort of conspicuous waste is unconscionable.
You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes
as the actual data requires. If you never have to actually display the ID inside the database, then
the simplest thing to do is just store it as a blob of 16 bytes[^blob-of-bytes]. Finally, optimal
representation and efficiency!
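To make the size comparison concrete, here's a small sketch that uses only the standard library. The `simple` and `hyphenated` helpers are hypothetical names of my own for illustration (a real project would lean on a UUID library's formatting), and the ID is the example value from above.

```rust
/// Format a 128-bit value as 32 hex digits with no separators.
fn simple(id: u128) -> String {
    format!("{:032x}", id)
}

/// Format a 128-bit value in the 8-4-4-4-12 "hyphenated" layout.
fn hyphenated(id: u128) -> String {
    let hex = simple(id);
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

fn main() {
    let id: u128 = 0x1c20104f_e04f_409e_9ad3_94455e5f4fea;
    println!("{} ({} bytes)", hyphenated(id), hyphenated(id).len());
    println!("{} ({} bytes)", simple(id), simple(id).len());
    println!("raw blob: {} bytes", id.to_be_bytes().len());
}
```

The three lengths come out to 36, 32, and 16 bytes, which is the whole argument for the blob.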
## Indexes?
And at first, that's what I did. The [external library](https://docs.rs/sqlx/latest/sqlx/) I'm using
to interface with my database automatically writes UUIDs as a sequence of sixteen bytes, if you
specified the type in the database[^sqlite-dataclasses] as "[blob](https://www.sqlite.org/datatype3.html)", which [I
did](https://gitlab.com/nebkor/ww/-/commit/65a32f1f20df6c572580d796e1044bce807fd3b6#f1043d50a0244c34e4d056fe96659145d03b549b_0_5).
But then I saw a [blog post](https://shopify.engineering/building-resilient-payment-systems) where
the following tidbit was mentioned:
> We prefer using an Universally Unique Lexicographically Sortable Identifier (ULID) for these
> idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by
> 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the
> b-tree data structure databases use for indexing. In one high-throughput system at Shopify we've
> seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for
> idempotency keys.

Whoa, that sounds great! But [this youtube
video](https://www.youtube.com/watch?v=f53-Iw_5ucA&t=590s) tempered my expectations a bit, by
describing the implementation-dependent reasons for that dramatic
improvement. Still, switching from UUIDs to ULIDs couldn't *hurt*[^no-stinkin-benches], right? Plus,
by encoding the time of creation (at least to the nearest millisecond), I could remove a "created
at" field from every table that used them as primary keys. Which, in my case, would be all of them,
and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.
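Here's a rough sketch of the layout described in that quote: 48 bits of timestamp on top, 80 bits of randomness below. `make_id` is a hypothetical helper, not the API of any ULID library, and the "random" bits are hard-coded so the example is repeatable.

```rust
/// Pack a millisecond timestamp and some random bits into a ULID-shaped u128:
/// the top 48 bits are the timestamp, the bottom 80 bits are the randomness.
fn make_id(timestamp_ms: u64, random: u128) -> u128 {
    let ts = (timestamp_ms as u128) & 0xFFFF_FFFF_FFFF; // keep 48 bits
    let rand = random & ((1u128 << 80) - 1); // keep 80 bits
    (ts << 80) | rand
}

fn main() {
    // Assumed values: two IDs created one millisecond apart.
    let earlier = make_id(1_688_000_000_000, 0xFFFF_FFFF);
    let later = make_id(1_688_000_000_001, 0x1);

    // The later timestamp wins, no matter what the random bits are.
    assert!(earlier < later);
    println!("{:032X}\n{:032X}", earlier, later);
}
```

Because the timestamp occupies the most-significant bits, comparing two IDs as plain numbers is the same as comparing their creation times first.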
I was actually already familiar with the idea of using time-based sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using them from the get-go, but discarded that for two main reasons:
- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't
  have native support for them

In reality, neither of those is a real show-stopper; 20 vs. 16 bytes is probably not that
significant, and I'd have to do the manual serialization stuff for anything besides a
less-than-8-bytes number or a normal UUID. Still, four bytes is four bytes, and all other things
being equal, I'd rather go for the trimmer, 64-bit-aligned value.
Finally, I'd recently finished with adding some ability to actually interact with data in a
meaningful way, and to add new records to the database, which meant that it was now or never for
standardizing on a type for the primary keys. I was ready to do this thing.
# Serial problems
"Deserialization" is the act of converting a static, non-native representation of some kind of
datatype into a dynamic, native computer programming object, so that you can do the right computer
programming stuff to it. It can be as simple as when a program reads in a string of digit characters
and parses it into a real number, but of course the ceiling on complexity is limitless.
In my case, it was about getting those sixteen bytes out of the database and turning them into
ULIDs. Technically, I could have let Rust [handle that for me](https://serde.rs/derive.html) by
automatically deriving that functionality. There were a couple snags with that course, though:
- the default serialized representation of a ULID in the library I was using to provide them [is as
  26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html), and I wanted to use only
  16 bytes in the database
- you could tell it to serialize as a [128-bit
  number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that merely kicked the
  problem one step down the road since SQLite can only handle up to 64-bit numbers, as previously
  discussed, so I'd still have to manually do something for them

This meant going all-in on fully custom serialization and deserialization, something I'd never done
before, but how hard could it be? (spoiler: actually not that hard!)
## Great coders steal
Something I appreciate about the [Rust programming language](https://www.rust-lang.org/) is that
because of the way the compiler works[^rust-generics], the full source code almost always has to be
available to you, the end-user coder. The culture around it is also very biased toward open source,
and so all the extremely useful libraries are just sitting there, ready to be studied and copied. So
the first thing I did was take a look at how [SQLx handled
UUIDs](https://github.com/launchbadge/sqlx/blob/main/sqlx-sqlite/src/types/uuid.rs):
``` rust
impl Type<Sqlite> for Uuid {
    fn type_info() -> SqliteTypeInfo {
        SqliteTypeInfo(DataType::Blob)
    }

    fn compatible(ty: &SqliteTypeInfo) -> bool {
        matches!(ty.0, DataType::Blob | DataType::Text)
    }
}

impl<'q> Encode<'q, Sqlite> for Uuid {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        args.push(SqliteArgumentValue::Blob(Cow::Owned(
            self.as_bytes().to_vec(),
        )));

        IsNull::No
    }
}

impl Decode<'_, Sqlite> for Uuid {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, BoxDynError> {
        // construct a Uuid from the returned bytes
        Uuid::from_slice(value.blob()).map_err(Into::into)
    }
}
```
There's not a ton going on there, as you can see. To "encode" it just gets the bytes out of the
UUID, and to "decode" it just gets the bytes out of the database. I couldn't use that exactly as
done by the SQLx authors, as they were using datatypes that were private to their crate, but it was
close enough; here's mine:
``` rust
impl sqlx::Type<sqlx::Sqlite> for DbId {
    fn type_info() -> <sqlx::Sqlite as sqlx::Database>::TypeInfo {
        <&[u8] as sqlx::Type<sqlx::Sqlite>>::type_info()
    }
}

impl<'q> Encode<'q, Sqlite> for DbId {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        args.push(SqliteArgumentValue::Blob(Cow::Owned(self.bytes().to_vec())));
        IsNull::No
    }
}

impl Decode<'_, Sqlite> for DbId {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, sqlx::error::BoxDynError> {
        let bytes = <&[u8] as Decode<Sqlite>>::decode(value)?;
        let bytes: [u8; 16] = bytes.try_into().unwrap_or_default();
        Ok(u128::from_ne_bytes(bytes).into())
    }
}
```
(In order to implement the required methods from SQLx, I had to wrap the ULID in a new, custom type,
which I called `DbId`, to comply with the [orphan rules](https://github.com/Ixrec/rust-orphan-rules).)
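If the orphan rules are new to you, here's a minimal sketch of the newtype trick, using only the standard library. Note the assumptions: this toy `DbId` wraps a bare `u128` rather than the ULID type from the library I was using, and `Display` is just standing in for "a trait defined in somebody else's crate".

```rust
use std::fmt;

/// A local wrapper type. Because `DbId` is defined in this crate, we're
/// allowed to implement traits from other crates for it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct DbId(u128);

impl From<u128> for DbId {
    fn from(value: u128) -> Self {
        DbId(value)
    }
}

// A foreign trait (`std::fmt::Display`) implemented for a local type: allowed.
impl fmt::Display for DbId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:032X}", self.0)
    }
}

fn main() {
    let id: DbId = 255u128.into();
    println!("{}", id);
}
```

Writing `impl fmt::Display for u128` directly would be rejected, since neither the trait nor the type belongs to my crate; the wrapper is what makes the implementation legal.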
That's only half the story, though. If all I had to worry about was getting data in and out of the
database, that would be fine, but because I'm building a web app, I need to be able to include my
new ID type in messages sent over a network or as part of a web page, and for that, it needed to
implement some functionality from a different library, called [Serde](https://serde.rs/). My
original implementation for *deserializing* looked like this:
``` rust
struct DbIdVisitor;

impl<'de> Visitor<'de> for DbIdVisitor {
    type Value = DbId;

    // make a DbId from a slice of bytes
    fn visit_bytes<E>(self, v: &[u8]) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // make a DbId from a Vec of bytes
    fn visit_byte_buf<E>(self, v: Vec<u8>) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // you get the picture
    fn visit_string() ...
    fn visit_u128() ...
    fn visit_i128() ...
}
```
In my mind, the only important pieces were the `visit_bytes()` and `visit_byte_buf()` methods,
which worked basically the same as the `decode()` function for SQLx. I mean, as far as I could tell,
the only time something would be encountering a serialized `DbId` would be in the form of raw bytes
from the database; no one else would be trying to serialize one as something else that I didn't
anticipate, right?

RIGHT???

(wrong)
## A puzzling failure
As soon as my code compiled, I ran my tests. Everything passed... except for one, that tested
logging in.
This was very strange. All the other tests were passing, and basically every operation requires
getting one of these IDs into or out of the database. But at this point, it was late, and I set it
down until the next day.
# When in doubt, change many things at once
The next day I sat back down to get back to work, and in the course of examining what was going on,
realized that I'd missed something crucial: these things were supposed to be *sortable*. But the way
I was inserting them meant that they weren't, because of endianness.
## More like shmexicographic, amirite
"ULID" stands for "Universally Unique Lexicographically Sortable
Identifier"[^uulsid]. "[Lexicographic order](https://en.wikipedia.org/wiki/Lexicographic_order)"
basically means, "like alphabetical, but for anything with a defined total order". Numbers have a
defined total order; bigger numbers always go after smaller.
But sometimes numbers get sorted out of order, if they're not treated as numbers. Like say you had a
directory with twelve files in it, called "1.txt" up through "12.txt". If you were to ask to see
them listed out in lexicographic order, it would go like:
``` text
$ ls
10.txt
11.txt
12.txt
1.txt
2.txt
3.txt
4.txt
5.txt
6.txt
7.txt
8.txt
9.txt
```
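This is because '10' is "less than" '2' when compared character by character (and in the collation
order `ls` is using here, '0' sorts before '.', which is why "10.txt" is before "1.txt"; a plain
byte-wise comparison would put "1.txt" first). The solution, as all data-entering people know, is to
pad the numbers with leading '0's: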
``` text
$ ls
01.txt
02.txt
03.txt
04.txt
05.txt
06.txt
07.txt
08.txt
09.txt
10.txt
11.txt
12.txt
```
Now the names are lexicographically sorted in the right numerical order[^confusing-yes].
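The same effect is easy to reproduce in Rust, where sorting `String`s compares them byte by byte. This is only a sketch, and `names` is a made-up helper; note that because Rust compares raw bytes rather than using a locale, "1.txt" lands before "10.txt" here, unlike in the `ls` listing.

```rust
/// Hypothetical helper: "1.txt" through "12.txt", optionally zero-padded.
fn names(pad: bool) -> Vec<String> {
    (1..=12)
        .map(|n| {
            if pad {
                format!("{:02}.txt", n)
            } else {
                format!("{}.txt", n)
            }
        })
        .collect()
}

fn main() {
    let mut plain = names(false);
    plain.sort(); // String ordering is lexicographic, byte by byte
    println!("{:?}", plain);

    let mut padded = names(true);
    padded.sort(); // already in numeric order, so sorting changes nothing
    println!("{:?}", padded);
}
```

The unpadded list still puts "10.txt", "11.txt", and "12.txt" ahead of "2.txt", while the padded list sorts into plain numeric order.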
So, now that we're all expert lexicographicographers, we understand that our IDs are just
supposed to naturally sort themselves in the correct order, based on when they were created; IDs
created later should sort after IDs created earlier.
The implementation for my ULIDs only guaranteed this property for the string form of them, but I was
not storing them in string form. Fundamentally, the ULID was a simple [128-bit primitive
integer](https://doc.rust-lang.org/std/primitive.u128.html), capable of holding values between 0 and
340,282,366,920,938,463,463,374,607,431,768,211,455.
But there's a problem: I was storing the ID in the database as a sequence of 16 bytes. I was asking
for those bytes in "native endian", which in my case, meant "little endian". If you're not familiar
with endianness, there are two varieties: big, and little. "Big" makes the most sense for a lot of
people; if you see a number like "512", it's big-endian; the end is the part that's left-most, and
"big" means that it is the most-significant-digit. This is the same as what westerners think of as
"normal" numbers. In the number "512", the "most significant digit" is `5`, which corresponds to
`500`, which is added to the next-most-significant digit, `1`, corresponding to `10`, which is added
to the next-most-significant digit, which is also the least-most-significant-digit, which is `2`,
which is just `2`, giving us the full number `512`.
If we put the least-significant-digit first, we'd write the number `512` as "215"; the order when
written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
come before "215", which is backwards.
Little-endianness is like that. If a multibyte numeric value is on a little-endian system, the
least-significant bytes will come first, and a lexicographic sorting of those bytes would be
non-numeric.
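A tiny sketch with two-byte numbers shows the problem. The helper names are mine, and I picked 255 and 256 because they make the flip obvious; arrays of bytes in Rust compare lexicographically, which is the same kind of comparison a database does on a blob.

```rust
/// Does comparing the little-endian bytes agree with comparing the numbers?
fn le_order_matches(a: u16, b: u16) -> bool {
    (a < b) == (a.to_le_bytes() < b.to_le_bytes())
}

/// Does comparing the big-endian bytes agree with comparing the numbers?
fn be_order_matches(a: u16, b: u16) -> bool {
    (a < b) == (a.to_be_bytes() < b.to_be_bytes())
}

fn main() {
    // 255 is [0xFF, 0x00] little-endian and [0x00, 0xFF] big-endian;
    // 256 is [0x00, 0x01] little-endian and [0x01, 0x00] big-endian.
    println!("little-endian keeps order: {}", le_order_matches(255, 256));
    println!("big-endian keeps order:    {}", be_order_matches(255, 256));
}
```

In little-endian form, 256 starts with a `0x00` byte and so sorts ahead of 255; in big-endian form the byte order and the numeric order agree.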
The solution, though, is simple: just write them out in big-endian order! This was literally a
one-line change in the code, to switch from `to_ne_bytes()` ("ne" for "native endian") to
`to_be_bytes()`. I confirmed that the bytes were being written into the database in the correct
lexicographic order:
``` sql
sqlite> select hex(id), username from users order by id asc;
018903CDDCAAB0C6872A4509F396D388|first_user
018903D0E591525EA42202FF461AA5FA|second_user
```
Note the first six characters are the same, for these two users created some time apart[^ulid-timestamps].
Boom. "Sorted".
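For the curious, the creation times can be pulled straight back out of those two hex values. This is a sketch using only the standard library; `timestamp_ms` is my own helper name, and it assumes the standard ULID layout where the top 48 bits are milliseconds since the Unix epoch.

```rust
/// The top 48 bits of a ULID are its creation time in milliseconds.
fn timestamp_ms(id: u128) -> u64 {
    (id >> 80) as u64
}

fn main() {
    // The two IDs from the query output above.
    let first_user: u128 = 0x018903CDDCAAB0C6872A4509F396D388;
    let second_user: u128 = 0x018903D0E591525EA42202FF461AA5FA;

    let elapsed = timestamp_ms(second_user) - timestamp_ms(first_user);
    println!("second_user was created {} ms after first_user", elapsed);
}
```

The difference works out to 198,887 milliseconds, so the second user was created a bit over three minutes after the first.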
## The actual problem
Except that the logins were still broken; it wasn't just the test. What was even stranger is that
with advanced debugging techniques[^advanced-debugging], I confirmed that the login *was*
working. By which I mean, when the user submitted a login request, the function that handled the
request was:
- correctly confirming password match
- retrieving the user from the database

The second thing was required for the first. It was even creating a session in the session table:
``` sql
sqlite> select * from async_sessions;
..|..|{"id":"ZY...","expiry":"...","data":{"_user_id":"[1,137,3,205,220,170,176,198,135,42,69,9,243,150,211,136]","_auth_id":"\"oM..."}}
```
I noticed that the ID was present in the session entry, but as what looked like an array of decimal
values. The less not-astute among you may have noticed that the session table seemed to be using
JSON to store information. This wasn't my code, but it was easy enough to find the
[culprit](https://github.com/http-rs/async-session/blob/d28cef30c7da38f52639b3d60fc8cf4489c92830/src/session.rs#L214):
``` rust
pub fn insert(&mut self, key: &str, value: impl Serialize) -> Result<(), serde_json::Error> {
    self.insert_raw(key, serde_json::to_string(&value)?);
    Ok(())
}
```
This was in the [external library](https://docs.rs/async-session/latest/async_session/) I was using
to provide cookie-based sessions for my web app, and was transitively invoked when I called the
`login()` method in my own code. Someone else was serializing my IDs, in a way I hadn't anticipated!
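To see what the session library had actually stored, it helps to decode that `_user_id` value by hand. The sketch below is standard-library-only and deliberately naive (a real program would use a JSON parser); `parse_byte_array` and `id_from_session` are hypothetical helpers, not anything from my code or the session library.

```rust
/// Hypothetical helper: parse text like "[1,137,3]" into bytes.
/// Returns None if the brackets are missing or an element isn't in 0..=255.
fn parse_byte_array(text: &str) -> Option<Vec<u8>> {
    let inner = text.trim().strip_prefix('[')?.strip_suffix(']')?;
    inner.split(',').map(|n| n.trim().parse::<u8>().ok()).collect()
}

/// Rebuild a 128-bit ID from a sixteen-element array, read as big-endian.
fn id_from_session(text: &str) -> Option<u128> {
    let bytes = parse_byte_array(text)?;
    if bytes.len() != 16 {
        return None;
    }
    let mut array = [0u8; 16];
    array.copy_from_slice(&bytes);
    Some(u128::from_be_bytes(array))
}

fn main() {
    let stored = "[1,137,3,205,220,170,176,198,135,42,69,9,243,150,211,136]";
    match id_from_session(stored) {
        Some(id) => println!("{:032X}", id),
        None => println!("not a sixteen-byte array"),
    }
}
```

Read that way, the array is `018903CDDCAAB0C6872A4509F396D388`, which is `first_user`'s ID from the earlier query; it just arrives as a list of sixteen numbers instead of a blob of bytes.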
The way that Serde decides what code to call is based on its [data
model](https://serde.rs/data-model.html). And wouldn't you know it, the following words are right
there, hiding in plain sight, as they had been all along:
> When deserializing a data structure from some format, the Deserialize implementation for the data
> structure is responsible for mapping the data structure into the Serde data model by passing to
> the Deserializer a Visitor implementation that can receive the various types of the data model...
>
> [...]
>
> * seq
> - A variably sized heterogeneous sequence of values, for example Vec&lt;T&gt; or HashSet&lt;T&gt;. ...
>
> [...]
>
> The flexibility around mapping into the Serde data model is profound and powerful. When
> implementing Serialize and Deserialize, be aware of the broader context of your type that may make
> the most instinctive mapping not the best choice.

Well, when you put it that way, I can't help but understand: I needed to implement a `visit_seq()`
method in my deserialization code.
![fine, fine, I see the light][see_the_light]
<div class = "caption">fine, fine, i see the light</div>
You can see that
[here](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L194-216)
if you'd like, but I'll actually come back to it in a second. The important part was that my logins
were working again; time to party!
## Wait, why *isn't* it broken?
I'd just spent the day banging my head against this problem, and so when everything worked again, I
committed and pushed the change and signed off. But something was still bothering me, and the next
day, I dove back into it.
All my serialization code was calling a method called
[`bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L18),
which simply called another method that would return an array of 16 bytes, in big-endian order, so
it could go into the database and be sortable, as discussed.
But all[^actually_not_all] my *deserialization* code was constructing the IDs as [though the bytes
were
*little*-endian](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L212). Which
led me to ask:

what the fuck?

Like, everything was *working*. Why did I need to construct from a different byte order? I felt like
I was losing my mind, so I reached out to the [Recurse Center](https://www.recurse.com) community
and presented my case.
Basically, I showed that bytes were written correctly, resident in the DB in big-endian form, but
then were "backwards" coming out and "had to be" cast using little-endian constructors
("`from_ne_bytes()`").
What was actually going on was that, as long as there was agreement about which byte order to use
when reconstructing the ID from the bytes, it didn't matter whether that order was big- or
little-endian; it just had to be the same on both the
[SQLx](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_106_105)
side and the
[Serde](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_210_209)
side. This is also irrespective of the order they were written out in, but again, the two sides must
agree on the convention used. Inside the Serde method, I had added some debug printing of the bytes
it was getting, and they were in little-endian order. What I had not realized was that this was
because they had first passed through the SQLx method, which reversed them.
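Stripped of SQLx and Serde, the principle fits in a few lines. This is a sketch, not my actual code: the three helpers are hypothetical, and I'm using `from_le_bytes()` explicitly instead of `from_ne_bytes()` so that it behaves the same on any machine.

```rust
/// Write an ID out in big-endian order, as for the blob column.
fn write_id(id: u128) -> [u8; 16] {
    id.to_be_bytes()
}

/// A reader that agrees with the writer's convention.
fn read_be(bytes: [u8; 16]) -> u128 {
    u128::from_be_bytes(bytes)
}

/// A reader that assumes the opposite convention.
fn read_le(bytes: [u8; 16]) -> u128 {
    u128::from_le_bytes(bytes)
}

fn main() {
    let id: u128 = 0x018903CDDCAAB0C6872A4509F396D388;
    let stored = write_id(id);

    // Matching conventions: the original value comes back.
    assert_eq!(read_be(stored), id);

    // Mismatched conventions: a byte-swapped value comes back instead.
    assert_eq!(read_le(stored), id.swap_bytes());

    // Two readers that make the same mistake still agree with each other.
    assert_eq!(read_le(stored), read_le(write_id(id)));
    println!("{:032X} vs {:032X}", read_be(stored), read_le(stored));
}
```

Two readers that share the same wrong convention will happily agree with each other, which is how everything can look like it's working while the in-memory value is quietly backwards.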
Mmmmm, delicious, delicious red herring.
Two people were especially helpful, Julia Evans and Nicole Tietz-Sokolskaya; Julia grabbed a copy of
my database file and poked it with Python, and could not replicate the behavior I was seeing, and
Nicole did the same but with a little Rust program she wrote. Huge thanks to both of them (but not
just them) for the extended [rubber ducking](https://en.wikipedia.org/wiki/Rubber_duck_debugging)!
And apologies for the initial gas-lighting; Julia was quite patient and diplomatic when pushing back
against "the bytes are coming out of the db backwards".
# Lessons learned
Welp, here we are, the end of the line; I hope this has been informative, or barring that, at least
entertaining. Or the other way around, I'm not that fussy!
Obviously, the biggest mistake was to futz with being clever about endianness before understanding
why the login code was now failing. Had I gotten it working correctly first, I would have been able to
figure out the requirement for agreement on convention between the two different serialization
systems much sooner, and I would not have wasted my own and others' time on a misunderstanding.
On the other hand, it's hard to see these things on the first try, especially when you're on your
own, and are on your first fumbling steps in a new domain or ecosystem; for me, that was getting
into the nitty-gritty with Serde, and for that matter, dealing directly with serialization-specific
issues. Collaboration is a great technique for navigating these situations, and I definitely need to
focus a bit more on enabling that[^solo-yolo-dev].
In the course of debugging this issue, I tried to get more insight via
[testing](https://gitlab.com/nebkor/ww/-/commit/656e6dceedf0d86e2805e000c9821e931958a920#ce34dd57be10530addc52a3273548f2b8d3b8a9b_143_251),
and though that helped a little, it was not nearly enough; the problem was that I misunderstood how
something worked, not that I had mistakenly implemented something I was comfortable with. Tests
aren't a substitute for understanding!
And of course, I'm now much more confident and comfortable with Serde; reading the Serde code for
other things, like [UUIDs](https://github.com/uuid-rs/uuid/blob/main/src/external/serde_support.rs),
is no longer an exercise in eye-glaze-control. Maybe this has helped you with that too?
----
[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are
reserved for version and variant information.
[^blob-of-bytes]: Some databases have direct support for 128-bit primitive values (numbers). The
database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support
arbitrary-length sequences of bytes called "blobs".
[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I plan
to dive into in a different post, but "blob" is specific to it. In general, you'll probably want
to take advantage of implementation-specific features of whatever database you're using, which
means that your table definitions won't be fully portable to a different database. This is fine
and good, actually!
[^no-stinkin-benches]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha,
you must have never met a programmer before! No, of course not. But, that's coming in a
follow-up.
[^rust-generics]: If the code you're using has
[generics](https://doc.rust-lang.org/book/ch10-01-syntax.html) in it, then the compiler needs to
generate specialized versions of that generic code based on how you use it; this is called
"[monomorphization](https://doc.rust-lang.org/book/ch10-01-syntax.html#performance-of-code-using-generics)",
and it requires the original generic source to work. That's also true in C++, which is why most
templated code is [header-only](https://isocpp.org/wiki/faq/templates#templates-defn-vs-decl),
but Rust doesn't have header files.
[^uulsid]: I guess the extra 'U' and 'S' are invisible.
[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.
[^ulid-timestamps]: The 6 most-significant bytes make up the timestamp in a ULID, which in the hex
dump form pasted there would be the first twelve characters, since each byte is two hex
digits.
[^advanced-debugging]: "adding `dbg!()` statements in the code"
[^actually_not_all]: Upon further review, I discovered that the only methods that were constructing
with little-endian order were the SQLx `decode()` method, and the Serde `visit_seq()` method,
which were also the only ones that were being called at all. The
[`visit_bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L152)
and `visit_byte_buf()` methods, that I had thought were so important, were correctly treating
the bytes as big-endian, but were simply never actually used. I fixed that [in the next
commit](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b).
[^solo-yolo-dev]: I've described my current practices as "solo-yolo", which has its plusses and
minuses, as you may imagine.
[thats_a_database]: ./thats_a_database.png "simpsons that's-a-paddling guy"
[see_the_light]: ./seen_the_light.png "jake blues seeing the light"