numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to
store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're
a programmer, this sort of conspicuous waste is unconscionable.
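
To make the arithmetic concrete, here's a quick sketch of the three representations; this is my own
illustration using the `uuid` crate, not code from the project:

``` rust
use uuid::Uuid;

fn main() {
    // the example ID from above
    let id = Uuid::parse_str("1c20104f-e04f-409e-9ad3-94455e5f4fea").unwrap();

    // the "hyphenated" string form: 36 bytes
    assert_eq!(id.hyphenated().to_string().len(), 36);

    // the same string with the dashes dropped: 32 bytes
    assert_eq!(id.simple().to_string().len(), 32);

    // the actual data: 16 bytes
    assert_eq!(id.as_bytes().len(), 16);
}
```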

You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes
as the actual data requires. If you never have to actually display the ID inside the database, then

using KSUIDs from the get-go, but discarded that for two main reasons:

- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't
  have native support for them

In reality, neither of those is a real show-stopper; 20 vs. 16 bytes is probably not that
significant, and I'd have to do the manual serialization stuff for anything besides a
less-than-8-byte number or a normal UUID. Still, four bytes is four bytes, and all other things
being equal, I'd rather go for the trimmer, 64-bit-aligned value.

Finally, I'd recently finished adding some ability to actually interact with data in a
meaningful way, and to add new records to the database, which meant that it was now or never for
standardizing on a type for the primary keys. I was ready to do this thing.

# Serial problems

Technically, I could have let Rust [handle that for me](https://serde.rs/) by automatically
deriving that functionality. There were a couple snags with that course, though:

- the default serialized representation of a ULID in the library I was using to provide them [is as
  26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html), and I wanted to use only
  16 bytes in the database
- you could tell it to serialize as a [128-bit
  number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that merely kicked the
  problem one step down the road since SQLite can only handle up to 64-bit numbers, as previously
  discussed, so I'd still have to manually do something for them
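
For a sense of scale, here's a quick sketch of those two representations next to the raw bytes I
actually wanted to store; this is my own illustration (assuming the `ulid` crate's `u128`
conversion), not code from the project:

``` rust
use ulid::Ulid;

fn main() {
    let id = Ulid::new();

    // the default serde form: a 26-character string
    assert_eq!(id.to_string().len(), 26);

    // the underlying value is one 128-bit integer...
    let n: u128 = id.into();

    // ...which is exactly 16 bytes, a perfect fit for a blob column
    assert_eq!(n.to_be_bytes().len(), 16);
}
```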

This meant going all-in on fully custom serialization and deserialization, something I'd never done
before, but how hard could it be? (spoiler: actually not that hard!)

## Great coders steal

Something I appreciate about the [Rust programming language](https://www.rust-lang.org/) is that
because of the way the compiler works[^rust-generics], the full source code almost always has to be
available to you, the end-user coder. The culture around it is also very biased toward open source,
and so all the extremely useful libraries are just sitting there, ready to be studied and copied. So
the first thing I did was take a look at how [SQLx handled
UUIDs](https://github.com/launchbadge/sqlx/blob/main/sqlx-sqlite/src/types/uuid.rs):

``` rust
impl Type<Sqlite> for Uuid {
    fn type_info() -> SqliteTypeInfo {
        SqliteTypeInfo(DataType::Blob)
    }

    fn compatible(ty: &SqliteTypeInfo) -> bool {
        matches!(ty.0, DataType::Blob | DataType::Text)
    }
}

impl<'q> Encode<'q, Sqlite> for Uuid {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        args.push(SqliteArgumentValue::Blob(Cow::Owned(
            self.as_bytes().to_vec(),
        )));

        IsNull::No
    }
}

impl Decode<'_, Sqlite> for Uuid {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, BoxDynError> {
        // construct a Uuid from the returned bytes
        Uuid::from_slice(value.blob()).map_err(Into::into)
    }
}
```

There's not a ton going on there, as you can see. To "encode" it just gets the bytes out of the
UUID, and to "decode" it just gets the bytes out of the database. I couldn't use that exactly as
done by the SQLx authors, as they were using datatypes that were private to their crate, but it was
close enough; here's mine:

``` rust
impl sqlx::Type<sqlx::Sqlite> for DbId {
    fn type_info() -> <sqlx::Sqlite as sqlx::Database>::TypeInfo {
        <&[u8] as sqlx::Type<sqlx::Sqlite>>::type_info()
    }
}

impl<'q> Encode<'q, Sqlite> for DbId {
    fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
        // store the ID as its raw 16 bytes in a blob column
        args.push(SqliteArgumentValue::Blob(Cow::Owned(self.bytes().to_vec())));
        IsNull::No
    }
}

impl Decode<'_, Sqlite> for DbId {
    fn decode(value: SqliteValueRef<'_>) -> Result<Self, sqlx::error::BoxDynError> {
        let bytes = <&[u8] as Decode<Sqlite>>::decode(value)?;
        // a blob of the wrong length silently becomes the all-zero ID
        let bytes: [u8; 16] = bytes.try_into().unwrap_or_default();
        // interpret the blob as a big-endian u128 and wrap it back up in a DbId
        Ok(u128::from_be_bytes(bytes).into())
    }
}
```

(In order to implement the required methods from SQLx, I had to wrap the ULID in a new, custom type,
which I called `DbId`, to comply with the [orphan rules](https://github.com/Ixrec/rust-orphan-rules).)
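
With those traits in place, a `DbId` can be bound to a query like any other value SQLx knows about.
Here's a quick sketch of what that looks like; the `posts` table and its columns are hypothetical,
not something from the actual schema:

``` rust
// a minimal usage sketch, assuming a table whose primary key is a 16-byte blob
async fn insert_post(pool: &sqlx::SqlitePool, id: DbId, title: &str) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO posts (id, title) VALUES (?, ?)")
        .bind(id) // uses the Type/Encode impls above
        .bind(title)
        .execute(pool)
        .await?;
    Ok(())
}
```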

That's only half the story, though. If all I had to worry about was getting data in and out of the
database, that would be fine, but because I'm building a web app, I need to be able to include my
new ID type in messages sent over a network or as part of a web page, and for that, it needed to
implement some functionality from a different library, called [Serde](https://serde.rs/). My
original implementation for *deserializing* looked like this:

``` rust
struct DbIdVisitor;

impl<'de> Visitor<'de> for DbIdVisitor {
    type Value = DbId;

    // make a DbId from a slice of bytes
    fn visit_bytes<E>(self, v: &[u8]) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // make a DbId from a Vec of bytes
    fn visit_byte_buf<E>(self, v: Vec<u8>) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        ...
    }

    // you get the picture
    fn visit_string() ...
    fn visit_u128() ...
    fn visit_i128() ...
}
```

In my mind, the only important pieces were the `visit_bytes()` and `visit_byte_buf()` methods,
which worked basically the same as the `decode()` function for SQLx. I mean, as far as I could tell,
the only time something would be encountering a serialized `DbId` would be in the form of raw bytes
from the database; no one else would be trying to serialize one as something else that I didn't
anticipate, right?

RIGHT???

(wrong)

## A puzzling failure

As soon as my code compiled, I ran my tests. Everything passed... except for the one that tested
logging in.

This was very strange. All the other tests were passing, and basically every operation requires
getting one of these IDs into or out of the database. But at this point, it was late, and I set it
down until the next day.

# When in doubt, change many things at once

The next day I sat back down to get back to work, and in the course of examining what was going on,
I realized that I'd missed something crucial: these things were supposed to be *sortable*. But the
way I was inserting them meant that they weren't, because of endianness.

## More like shmexicographic, amirite

"ULID" stands for "Universally Unique Lexicographically Sortable
Identifier"[^uulsid]. "[Lexicographic order](https://en.wikipedia.org/wiki/Lexicographic_order)"
basically means, "like alphabetical, but for anything with a defined total order". Numbers have a
defined total order; bigger numbers always go after smaller.

But sometimes numbers get sorted out of order, if they're not treated as numbers. Like say you had a
directory with twelve files in it, called "1.txt" up through "12.txt". If you were to ask to see
them listed out in lexicographic order, it would go like:

``` text
$ ls
10.txt
11.txt
12.txt
1.txt
2.txt
3.txt
4.txt
5.txt
6.txt
7.txt
8.txt
9.txt
```

This is because '10' is "less than" '2' (and, at least the way `ls` collates things here, '0' is
"less than" '.', which is why "10.txt" comes before "1.txt"). The solution, as all data-entering
people know, is to pad the numbers with leading '0's:

``` text
$ ls
01.txt
02.txt
03.txt
04.txt
05.txt
06.txt
07.txt
08.txt
09.txt
10.txt
11.txt
12.txt
```

Now the names are lexicographically sorted in the right numerical order[^confusing-yes].
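
The same effect is easy to see in code; this little sketch (mine, not from the post's project) uses
Rust's plain byte-wise string ordering, so unlike the `ls` output above there's no locale collation
involved:

``` rust
fn main() {
    // plain byte-wise string comparison: "10" sorts before "2"
    let mut unpadded = vec!["1", "2", "9", "10", "12"];
    unpadded.sort();
    assert_eq!(unpadded, vec!["1", "10", "12", "2", "9"]);

    // zero-padding makes lexicographic order agree with numeric order
    let mut padded = vec!["01", "02", "09", "10", "12"];
    padded.sort();
    assert_eq!(padded, vec!["01", "02", "09", "10", "12"]);
}
```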

So, now that we're all expert lexicographicographers, we understand that our IDs are just
supposed to naturally sort themselves in the correct order, based on when they were created; IDs
created later should sort after IDs created earlier.

The implementation for my ULIDs only guaranteed this property for the string form of them, but I was
not storing them in string form. Fundamentally, the ULID was a simple [128-bit primitive
integer](https://doc.rust-lang.org/std/primitive.u128.html), capable of holding values between 0 and
340,282,366,920,938,463,463,374,607,431,768,211,455.
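
(That upper bound is just `u128::MAX`; here's a one-line sanity check of my own, not from the post:)

``` rust
fn main() {
    // 2^128 - 1, the same 39-digit number quoted above
    assert_eq!(u128::MAX, 340_282_366_920_938_463_463_374_607_431_768_211_455);
}
```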

But there's a problem: we're storing our ID in the database as a sequence of 16 bytes. I was asking
for those bytes in "native endian", which in my case, meant "little endian". If you're not familiar
with endianness, there are two varieties: big, and little. "Big" makes the most sense; if you see a
number like "512", it's big-endian; the end is the part that's left-most, and "big" means that it is
the most-significant digit. This is the same as what westerners think of as "normal" numbers. In the
number "512", the "most significant digit" is `5`, which corresponds to `500`, which is added to the
next-most-significant digit, `1`, corresponding to `10`, which is added to the next-most-significant
digit, which is also the least-significant digit, which is `2`, which is just `2`, giving us
the full number `512`.

If we put the least-significant digit first, we'd write the number `512` as "215"; the order when
written out would be reversed. This means that a lexicographic sort of `512` and `521` (written out
as "215" and "125") would put "125" first, so `521` would come before `512`, which is backwards.

Little-endianness is like that. If a multibyte value is stored on a little-endian system, the
least-significant bytes come first, and sorting those bytes lexicographically won't match the
numeric order.

Unfortunately, my computer is based on the Intel x86 instruction set, which means that it represents
numbers in "little endian" form. So when I asked for the ULID's bytes in native-endian order and
stored those, the blobs in the database didn't sort by creation time at all.
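
Here's a tiny illustration of the difference (mine, not from the post): big-endian bytes compare the
same way the numbers themselves do, while little-endian bytes don't:

``` rust
fn main() {
    let smaller: u128 = 0x01;
    let larger: u128 = 0x0100;

    // big-endian bytes sort the same way the numbers do...
    assert!(smaller.to_be_bytes() < larger.to_be_bytes());

    // ...but little-endian ("native" on x86) bytes put the least-significant
    // byte first, so the bigger number's bytes compare as "smaller"
    assert!(smaller.to_le_bytes() > larger.to_le_bytes());
}
```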

## The actual problem

The actual problem: my `DbIdVisitor` had no `visit_seq()` method, so there was no visitor for
sequences. That matters because JSON has no native byte-array type; a serialized byte array comes
back as a sequence of numbers, and a sequence is exactly what `async_sessions` was handing to my
deserializer.
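
For illustration, a `visit_seq()` implementation might look something like this sketch; it's my own
guess at a fix, not the post's actual code, and it assumes `DbId` implements `From<u128>`, which the
`decode()` impl above already relies on:

``` rust
use serde::de::{Error, SeqAccess, Visitor};

impl<'de> Visitor<'de> for DbIdVisitor {
    type Value = DbId;

    fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "16 bytes making up a DbId")
    }

    // serde_json represents a byte array as a sequence of numbers,
    // so accept that form as well
    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: SeqAccess<'de>,
    {
        let mut bytes = [0u8; 16];
        for (i, byte) in bytes.iter_mut().enumerate() {
            *byte = seq
                .next_element()?
                .ok_or_else(|| A::Error::invalid_length(i, &self))?;
        }
        Ok(u128::from_be_bytes(bytes).into())
    }

    // ... plus the visit_bytes()/visit_byte_buf() and friends from above ...
}
```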

## Wait, why isn't it broken?

oh, it's not

# Lessons learned

- don't change many things at once
- automated tests aren't enough

----

[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are
reserved for version information.

[^blob-of-bytes]: Some databases have direct support for 128-bit primitive values (numbers). The
database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support
arbitrary-length sequences of bytes called "blobs".

[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I plan
to dive into in a different post, but "blob" is specific to it. In general, you'll probably want
to take advantage of implementation-specific features of whatever database you're using, which
means that your table definitions won't be fully portable to a different database. This is fine
and good, actually!

[^no-stinkin-benches]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha,
you must have never met a programmer before! No, of course not. But that's coming in a follow-up.

[^rust-generics]: If the code you're using has
[generics](https://doc.rust-lang.org/book/ch10-01-syntax.html) in it, then the compiler needs to
generate specialized versions of that generic code based on how you use it; this is called
"[monomorphization](https://doc.rust-lang.org/book/ch10-01-syntax.html#performance-of-code-using-generics)",
and it requires the original generic source to work. That's also true in C++, which is why most
templated code is [header-only](https://isocpp.org/wiki/faq/templates#templates-defn-vs-decl),
but Rust doesn't have header files.

[^uulsid]: I guess the extra 'U' and 'S' are invisible.

[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.

[thats_a_database]: ./thats_a_database.png "that's a database"