
+++ title = "A One-Part Serialized Mystery" slug = "one-part-serialized-mystery" date = "2023-06-22" updated = "2023-06-22" [taxonomies] tags = ["software", "rnd", "proclamation", "upscm"] [extra] toc = false +++

## Mise en Scène

I recently spent a couple of days moving from one type of universally unique identifier to a different one, for an in-progress database-backed web app. The initial work didn't take very long, but debugging the serialization and deserialization of the new IDs took another day and a half, and in the end, the alleged mystery of why it wasn't working turned out to be a red herring born of my own stupidity. So come with me on an exciting voyage of discovery, and once again, learn from my folly!

## Keys, primarily

Most large distributed programs that people interact with daily via HTTP are, in essence, a fancy facade for some kind of database. Facebook? That's a database. Gmail? That's a database.


Wikipedia? That's a database.

In most databases, each entry ("row") has a field that acts as a primary key, used to uniquely identify that row within its table. Since primary keys only have to be unique within their own table, a simple integer that's automatically incremented every time you add a new record is enough, and in many databases, if you create a table without specifying a primary key, a mechanism like that is used implicitly. You may also recognize this as the idea behind "serial numbers".
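To make that concrete, here's a minimal sketch of the implicit-key behavior. It uses the rusqlite crate for brevity, not the library this project actually uses:

```rust
// Sketch: SQLite assigns an implicit integer key (the rowid)
// when you don't declare a primary key yourself.
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    // no PRIMARY KEY declared: SQLite supplies an auto-assigned rowid
    conn.execute("CREATE TABLE users (name TEXT)", [])?;
    conn.execute("INSERT INTO users (name) VALUES ('alice')", [])?;
    conn.execute("INSERT INTO users (name) VALUES ('bob')", [])?;
    // rowids are handed out serially: alice got 1, bob got 2
    println!("last rowid: {}", conn.last_insert_rowid()); // prints 2
    Ok(())
}
```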

This is often totally fine! If you only ever have one copy of the database, and never have to worry about inserting rows from a different instance of the database, then you can just use those simple values and go on your merry way.

However, if you ever think you might want to have multiple instances of your database running, and want to make sure they're eventually consistent with each other, then you might want a fancier identifier for your primary keys, to avoid collisions between rows created on different instances.

## UUIDs

A popular type for these is the v4 UUID. These are 128-bit random numbers[^1], and when turned into a string, usually look something like `1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious reasons. Although they're sometimes stored in a DB in that form directly, that's using 36 bytes to store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're a programmer, this sort of conspicuous waste is unconscionable.

You can cut that to 32 bytes by just dropping the dashes, but that's still twice as many bytes as the actual data requires. If you never have to actually display the ID inside the database, the simplest thing to do is store it as a blob of 16 bytes[^2]. Finally, optimal representation and efficiency!
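For a concrete look at those three sizes, a quick sketch with the uuid crate:

```rust
// requires the uuid crate with its "v4" feature enabled
use uuid::Uuid;

fn main() {
    let id = Uuid::new_v4();
    // hyphenated form: 36 characters
    assert_eq!(id.hyphenated().to_string().len(), 36);
    // "simple" form, dashes dropped: 32 characters
    assert_eq!(id.simple().to_string().len(), 32);
    // the actual data: 16 bytes
    assert_eq!(id.as_bytes().len(), 16);
}
```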

## Indexes?

And at first, that's what I did. The external library I'm using to interface with my database automatically writes UUIDs as a sequence of sixteen bytes, if you specify the column's type in the database[^3] as "blob", which I did.

But then I saw a blog post where the following tidbit was mentioned:

> We prefer using an Universally Unique Lexicographically Sortable Identifier (ULID) for these idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the b-tree data structure databases use for indexing. In one high-throughput system at Shopify we've seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for idempotency keys.

Whoa, that sounds great! But this YouTube video tempered my expectations a bit, by describing the implementation-dependent reasons for that dramatic improvement. Still, switching from UUIDs to ULIDs couldn't hurt[^4], right? Plus, by encoding the time of creation (at least to the nearest millisecond), I could remove a "created at" field from every table that used them as primary keys. Which, in my case, would be all of them. And anyway, I'm less worried about the speed of inserts than I am about keeping total on-disk size down.
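To make that layout concrete, here's a sketch using the ulid crate (the `timestamp_ms()` method and the `u128` conversion are that crate's API, as I understand it):

```rust
use ulid::Ulid;

fn main() {
    let id = Ulid::new();
    // the canonical string form: 26 Crockford base32 characters
    assert_eq!(id.to_string().len(), 26);
    // the top 48 bits are a millisecond Unix timestamp...
    let millis = id.timestamp_ms();
    println!("created at {millis} ms since the epoch");
    // ...and the whole thing is still just a 128-bit number
    let n: u128 = id.into();
    assert_eq!(n >> 80, millis as u128);
}
```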

Plus, I was already familiar with the idea of sortable IDs from KSUIDs. It's an attractive concept to me, and I'd considered using KSUIDs from the get-go, but discarded that for two main reasons:

- they're FOUR WHOLE BYTES!!! larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't have native support for them

In reality, neither of those is a real show-stopper; 20 bytes vs. 16 is probably not that significant, and I'd have to do the manual serialization work for anything that wasn't either an integer of 8 bytes or fewer or a standard UUID anyway. Still, four bytes is four bytes, and all other things being equal, I'd rather go with the trimmer, 64-bit-aligned value.

Finally, I'd recently finished adding the ability to actually interact with data in a meaningful way and to add new records to the database, which meant it was now or never for standardizing on a type for the primary keys. I was ready to do this thing.

## Serial problems

"Deserilization" is the act of converting a static, non-native representation of some kind of datatype into a dynamic, native computer programming object, so that you can do the right computer programming stuff to it. It can be as simple as when a program reads in a string of digit characters and parses it into a real number, but of course the ceiling on complexity is limitless.

In my case, it was about getting those sixteen bytes out of the database and turning them into ULIDs. Technically, I could have let Rust handle that for me by automatically deriving that functionality. There were a couple snags with that course, though:

- the default serialized representation of a ULID in the library I was using to provide them is a 26-character string, and I wanted to use only 16 bytes in the database
- you could tell it to serialize as a 128-bit number (see the sketch after this list), but that merely kicks the problem one step down the road: as previously discussed, SQLite can only handle up to 64-bit numbers, so I'd still have to do something manual for them
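For the curious, that second option would have looked something like this sketch, assuming the ulid crate's optional serde helpers:

```rust
use serde::{Deserialize, Serialize};
use ulid::Ulid;

#[derive(Serialize, Deserialize)]
struct Record {
    // serialize the ULID as a u128 instead of its 26-character string;
    // ulid_as_u128 ships behind the ulid crate's "serde" feature
    #[serde(with = "ulid::serde::ulid_as_u128")]
    id: Ulid,
}
```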

This meant going all-in on fully custom serialization and deserialization, something I'd never done before, but how hard could it be? (spoiler: actually not that hard!)

## Great coders steal

Something I appreciate about the Rust programming language is that, because of the way the compiler works[^5], the full source code almost always has to be available to you, the end-user coder. The culture around it is also heavily biased toward open source, so all the extremely useful libraries are just sitting there, ready to be studied and copied. So the first thing I did was take a look at how SQLx handles UUIDs:

```rust
impl Type<Sqlite> for Uuid {
    fn type_info() -> SqliteTypeInfo {
        SqliteTypeInfo(DataType::Blob)
    }

    fn compatible(ty: &SqliteTypeInfo) -> bool {
        matches!(ty.0, DataType::Blob | DataType::Text)
    }
}

impl<'q> Encode<'q, Sqlite> for Uuid {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        args.push(SqliteArgumentValue::Blob(Cow::Owned(
            self.as_bytes().to_vec(),
        )));

        IsNull::No
    }
}

impl Decode<'_, Sqlite> for Uuid {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, BoxDynError> {
        // construct a Uuid from the returned bytes
        Uuid::from_slice(value.blob()).map_err(Into::into)
    }
}
```

There's not a ton going on there, as you can see. To "encode" it just gets the bytes out of the UUID, and to "decode" it just gets the bytes out of the database. I couldn't use that exactly as done by the SQLx authors, as they were using datatypes that were private to their crate, but it was close enough; here's mine:

```rust
impl sqlx::Type<sqlx::Sqlite> for DbId {
    fn type_info() -> <sqlx::Sqlite as sqlx::Database>::TypeInfo {
        // delegate to the built-in byte-slice type info, i.e. "blob"
        <&[u8] as sqlx::Type<sqlx::Sqlite>>::type_info()
    }
}

impl<'q> Encode<'q, Sqlite> for DbId {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        // store the ID as its raw sixteen bytes
        args.push(SqliteArgumentValue::Blob(Cow::Owned(self.bytes().to_vec())));
        IsNull::No
    }
}

impl Decode<'_, Sqlite> for DbId {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, sqlx::error::BoxDynError> {
        // pull the blob out of the row and rebuild the 128-bit ID from it
        let bytes = <&[u8] as Decode<Sqlite>>::decode(value)?;
        let bytes: [u8; 16] = bytes.try_into().unwrap_or_default();
        Ok(u128::from_ne_bytes(bytes).into())
    }
}
```

(In order to implement the required methods from SQLx, I had to wrap the ULID in a new, custom type, which I called DbId, to comply with the orphan rules.)
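The wrapper itself is nothing fancy; a minimal sketch of its shape (the real thing derives a few more traits), with the `bytes()` and `From<u128>` pieces the code above relies on:

```rust
use ulid::Ulid;

// a newtype over Ulid, so the SQLx traits can be implemented
// on a type this crate owns
pub struct DbId(Ulid);

impl From<u128> for DbId {
    fn from(n: u128) -> Self {
        // Ulid is a tuple struct over a public u128
        DbId(Ulid(n))
    }
}

impl DbId {
    // the bytes() used by encode_by_ref() above; native-endian here,
    // which is exactly the byte-order choice that matters later
    pub fn bytes(&self) -> [u8; 16] {
        self.0 .0.to_ne_bytes()
    }
}
```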

That's only half the story, though. If all I had to worry about was getting data in and out of the database, that would be fine, but because I'm building a web app, I need to be able to include my new ID type in messages sent over a network or as part of a web page, and for that, it needed to implement some functionality from a different library, called Serde. My original implementation for deserializing looked like this:

```rust
struct DbIdVisitor;

impl<'de> Visitor<'de> for DbIdVisitor {
    type Value = DbId;

    // make a DbId from a slice of bytes
    fn visit_bytes<E>(self, v: &[u8]) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // make a DbId from a Vec of bytes
    fn visit_byte_buf<E>(self, v: Vec<u8>) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // you get the picture
    fn visit_string() ...
    fn visit_u128() ...
    fn visit_i128() ...
}
```

In my mind, the only important pieces were the `visit_bytes()` and `visit_byte_buf()` methods, which worked basically the same as the `decode()` function for SQLx. I mean, as far as I could tell, the only time something would be encountering a serialized `DbId` would be in the form of raw bytes from the database; no one else would be trying to serialize one as something else that I didn't anticipate, right?

RIGHT???

(wrong)

## A puzzling failure

As soon as my code compiled, I ran my tests. Everything passed... except for the one that tested logging in.

This was very strange. All the other tests were passing, and basically every operation requires getting one of these IDs into or out of the database. But at this point, it was late, and I set it down until the next day.

## When in doubt, change many things at once

The next day I sat back down to get to work, and in the course of examining what was going on, I realized that I'd missed something crucial: these things were supposed to be sortable, but the way I was inserting them meant that they weren't, because of endianness.

## More like shmexicographic, amirite

"ULID" stands for "Universally Unique Lexicographically Sortable Identifier"6. "Lexicographic order" basically means, "like alphabetical, but for anything with a defined total order". Numbers have a defined total order; bigger numbers always go after smaller.

But sometimes numbers get sorted out of order when they're not treated as numbers. Say you had a directory with twelve files in it, named "1.txt" through "12.txt". If you asked to see them listed in lexicographic order, it would go like:

```
$ ls
10.txt
11.txt
12.txt
1.txt
2.txt
3.txt
4.txt
5.txt
6.txt
7.txt
8.txt
9.txt
```
This is because '10' is "less than" '2' when compared character by character. (The reason "10.txt" also lands before "1.txt" is that the locale-aware collation `ls` uses mostly ignores punctuation, so "10.txt" compares like "10txt" against "1txt", and '0' is less than 't'.) The solution, as all data-entering people know, is to pad the number with leading '0's:

```
$ ls
01.txt
02.txt
03.txt
04.txt
05.txt
06.txt
07.txt
08.txt
09.txt
10.txt
11.txt
12.txt
```

Now the names are lexicographically sorted in the right numerical order[^7].
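The same effect in Rust, comparing as strings vs. as numbers:

```rust
// comparing as strings: lexicographic, so "10" < "2"
let mut as_strings = vec!["1", "10", "2", "12"];
as_strings.sort();
assert_eq!(as_strings, vec!["1", "10", "12", "2"]);

// comparing as numbers: the order we actually wanted
let mut as_numbers = vec![1, 10, 2, 12];
as_numbers.sort();
assert_eq!(as_numbers, vec![1, 2, 10, 12]);
```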

So, now that we're all expert lexicographicographers, we understand that our IDs are just supposed to naturally sort themselves in the correct order, based on when they were created; IDs created later should sort after IDs created earlier.

The implementation of my ULIDs only guaranteed this property for their string form, but I was not storing them in string form. Fundamentally, a ULID is a simple 128-bit primitive integer, capable of holding values between 0 and 340,282,366,920,938,463,463,374,607,431,768,211,455.

But there's a problem: I was storing the ID in the database as a sequence of 16 bytes, and I was asking for those bytes in "native endian" order, which in my case meant "little endian". If you're not familiar with endianness, there are two varieties: big and little. "Big" makes the most sense to a lot of people: if you see a number like "512", it's big-endian; the part written first, at the left, is the most significant digit. This is the same as what westerners think of as "normal" numbers. In the number "512", the most significant digit is 5, which corresponds to 500; add the next-most-significant digit, 1, corresponding to 10, and then the least significant digit, 2, which is just 2, and you get the full number 512.

If we put the least significant digit first, we'd write the number 512 as "215"; the written order would be reversed. A lexicographic sort of 512 and 521 written this way would put "125" (that's 521) before "215" (that's 512), which is backwards.

Little-endianness is like that. When a multibyte numeric value is stored on a little-endian system, the least significant bytes come first, so a lexicographic sort of those bytes doesn't match numeric order.
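A two-byte example makes the problem visible:

```rust
// 1 and 256 as little-endian bytes: the byte comparison is backwards
assert_eq!(1u16.to_le_bytes(), [0x01, 0x00]);
assert_eq!(256u16.to_le_bytes(), [0x00, 0x01]);
// [0x00, 0x01] < [0x01, 0x00] lexicographically, so 256 sorts before 1

// as big-endian bytes, byte order agrees with numeric order
assert_eq!(1u16.to_be_bytes(), [0x00, 0x01]);
assert_eq!(256u16.to_be_bytes(), [0x01, 0x00]);
```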

The solution, though, is simple: just write them out in big-endian order! This was literally a one-line change in the code, switching from `to_ne_bytes()` ("ne" for "native-endian") to `to_be_bytes()`.

Boom. Sorted.

## The actual problem

The endianness change didn't fix the login test, because the actual problem was in my serde Visitor: it had no method for visiting a sequence. JSON has no native bytes type, so a JSON byte array is a sequence of numbers, and that's exactly what async_sessions was producing when it round-tripped my IDs through JSON. The bytes came back as a seq, and my visitor, which only handled raw bytes, strings, and 128-bit integers, refused them.
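The fix was one more method on the visitor. A sketch of the missing piece, reusing the `From<u128>` conversion from the SQLx code above and the big-endian byte order from the previous section:

```rust
// inside impl<'de> Visitor<'de> for DbIdVisitor { ... }

// serde_json has no native bytes type; it hands binary data to the
// visitor as a sequence of numbers, so visit_seq() is required
fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
where
    A: serde::de::SeqAccess<'de>,
{
    let mut bytes = [0u8; 16];
    for byte in bytes.iter_mut() {
        *byte = seq
            .next_element()?
            .ok_or_else(|| serde::de::Error::invalid_length(16, &self))?;
    }
    // big-endian, to match the sort-friendly byte order
    Ok(u128::from_be_bytes(bytes).into())
}
```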

## Wait, why isn't it broken?

Oh: it's not. The database path never goes through Serde at all; SQLx calls my `Encode` and `Decode` implementations directly, which is why every other test kept passing. The only place a `DbId` was ever serialized with Serde was the session handling, which is exactly where things broke.

## Lessons learned

- Don't change many things at once. I swapped ID types, wrote custom serialization, and changed byte order in the same stretch of work, which made it much harder to see which change had broken the login test.
- Automated tests aren't enough. Mine all passed except one, while deserialization was quietly broken for any format other than raw database bytes.



[^1]: Technically, most v4 UUIDs have only 122 random bits, as six of the 128 are reserved for version information.

[^2]: Some databases have direct support for 128-bit primitive values (numbers). The database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support arbitrary-length sequences of bytes called "blobs".

[^3]: I'm using SQLite for reasons that I plan to dive into in a different post, but "blob" is specific to it. In general, you'll probably want to take advantage of implementation-specific features of whatever database you're using, which means that your table definitions won't be fully portable to a different database. This is fine and good, actually!

[^4]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha, you must have never met a programmer before! No, of course not. But that's coming in a follow-up.

[^5]: If the code you're using has generics in it, then the compiler needs to generate specialized versions of that generic code based on how you use it; this is called "monomorphization", and it requires the original generic source to work. That's also true in C++, which is why most templated code is header-only, but Rust doesn't have header files.

[^6]: I guess the extra 'U' and 'S' are invisible.

[^7]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.