From 87bad5869e86e68cd4a16df6770b76428b95eb83 Mon Sep 17 00:00:00 2001
From: Joe Ardent
Date: Thu, 29 Jun 2023 11:23:34 -0700
Subject: [PATCH] ready to publish

---
 content/rnd/a_serialized_mystery/index.md | 82 ++++++++++++++++++-----
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/content/rnd/a_serialized_mystery/index.md b/content/rnd/a_serialized_mystery/index.md
index bf3cce0..29c35ba 100644
--- a/content/rnd/a_serialized_mystery/index.md
+++ b/content/rnd/a_serialized_mystery/index.md
@@ -1,12 +1,10 @@
 +++
 title = "A One-Part Serialized Mystery"
 slug = "one-part-serialized-mystery"
-date = "2023-06-22"
-updated = "2023-06-22"
+date = "2023-06-29"
+updated = "2023-06-29"
 [taxonomies]
-tags = ["software", "rnd", "proclamation", "upscm"]
-[extra]
-toc = false
+tags = ["software", "rnd", "proclamation", "upscm", "rust"]
 +++
 
 # *Mise en Scene*
@@ -45,11 +43,11 @@ values and move on your merry way.
 
 However, if you ever think you might want to have multiple instances of your database running, and
 want to make sure they're eventually consistent with each other, then you might want to use a
-fancier identifier for your primary keys, to avoid collisions between primary keys.
+fancier identifier for your primary keys, to avoid collisions between them.
 
 ## UUIDs
 
-A popular type for these is called a
+A popular type for fancy keys is called a
 [v4 UUIDs](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
 numbers[^uuidv4_random], and when turned into a string, usually look something like
 `1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
@@ -88,9 +86,9 @@ at" field from every table that used them as primary keys. Which, in my case, wo
 and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
 anyway.
 
-Plus, I was familiar with the idea of using sortable IDs, from
+I was actually already familiar with the idea of using time-based sortable IDs, from
 [KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
-using KSUIDs from the get-go, but discarded that for two main reasons:
+using them from the get-go, but discarded that for two main reasons:
 
 - they're **FOUR WHOLE BYTES!!!** larger than UUIDs
 - I'd have to manually implement serialization/deserialization, since SQLx doesn't
@@ -327,7 +325,7 @@ If we put the least-significant-digit first, we'd write the number `512` as "215
 written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
 come before "215", which is backwards.
 
-Little-endiannes is like that. If a multibyte numeric value is on a little-endian system, the
+Little-endianness is like that. If a multibyte numeric value is on a little-endian system, the
 least-significant bytes will come first, and a lexicographic sorting of those bytes would be
 non-numeric.
 
@@ -390,7 +388,7 @@ there, hiding in plain sight, as they had been all along:
 > [...]
 >
 > * seq
->   - A variably sized heterogeneous sequence of values, for example Vec or HashSet. ...
+>   - A variably sized heterogeneous sequence of values, for example Vec<T> or HashSet<T>. ...
 >
 > [...]
 >
 >
@@ -418,7 +416,7 @@ day, I dove back into it.
 
 All my serialization code was calling a method called
 [`bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L18),
-which simply called `another method that would return an array of 16 bytes, in big-endian order, so
+which simply called another method that would return an array of 16 bytes, in big-endian order, so
 it could go into the database and be sortable, as discussed.
 
 But all[^actually_not_all] my *deserialization* code was constructing the IDs as [though the bytes
@@ -432,11 +430,56 @@ Like, everything was *working*. Why did I need to construct from a different byt
 I was losing my mind, so I reached out to the [Recurse Center](https://www.recurse.com) community
 and presented my case.
 
+Basically, I showed that bytes were written correctly, resident in the DB in big-endian form, but
+then were "backwards" coming out and "had to be" cast using little-endian constructors
+("`from_ne_bytes()`").
+
+What had actually happened is that as long as there was agreement about what order to use for reconstructing the
+ID from the bytes, it didn't matter if it was big or little-endian, it just had to be the same on
+both the
+[SQLx](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_106_105)
+side and on the
+[Serde](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_210_209)
+side. This is also irrespective of the order they were written out in, but again, the two sides must
+agree on the convention used. Inside the Serde method, I had added some debug printing of the bytes
+it was getting, and they were in little-endian order. What I had not realized is that that was
+because they were first passing through the SQLx method which reversed them.
+
+Mmmmm, delicious, delicous red herring.
+
+Two people were especially helpful, Julia Evans and Nicole Tietz-Sokolskaya; Julia grabbed a copy of
+my database file and poked it with Python, and could not replicate the behavior I was seeing, and
+Nicole did the same but with a little Rust program she wrote. Huge thanks to both of them (but not
+just them) for the extended [rubber ducking](https://en.wikipedia.org/wiki/Rubber_duck_debugging)!
+And apologies for the initial gas-lighting; Julia was quite patient and diplomatic when pushing back
+against "the bytes are coming out of the db backwards".
+
+
 # Lessons learned
 
-don't change many things at once
+Welp, here we are, the end of the line; I hope this has been informative, or barring that, at least
+entertaining. Or the other way around, I'm not that fussy!
 
-automated tests aren't enough
+Obviously, the biggest mistake was to futz with being clever about endianness before understanding
+why the login code was now failing. Had I gotten it working correctly first, I would have been able to
+figure out the requirement for agreement on convention between the two different serialization
+systems much sooner, and I would not have wasted mine and others' time on misunderstanding.
+
+On the other hand, it's hard to see these things on the first try, especially when you're on your
+own, and are on your first fumbling steps in a new domain or ecosystem; for me, that was getting
+into the nitty-gritty with Serde, and for that matter, dealing directly with serialization-specific
+issues. Collaboration is a great technique for navigating these situations, and I definitely need to
+focus a bit more on enabling that[^solo-yolo-dev].
+
+In the course of debugging this issue, I tried to get more insight via
+[testing](https://gitlab.com/nebkor/ww/-/commit/656e6dceedf0d86e2805e000c9821e931958a920#ce34dd57be10530addc52a3273548f2b8d3b8a9b_143_251),
+and though that helped a little, it was not nearly enough; the problem was that I misunderstood how
+something worked, not that I had mistakenly implemented something I was comfortable with. Tests
+aren't a substitute for understanding!
+
+And of course, I'm now much more confident and comfortable with Serde; reading the Serde code for
+other things, like [UUIDs](https://github.com/uuid-rs/uuid/blob/main/src/external/serde_support.rs),
+is no longer an exercise in eye-glaze-control. Maybe this has helped you with that too?
 
 ----
 
@@ -469,8 +512,8 @@ automated tests aren't enough
 
 [^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.
 
-[^ulid-timestamps]: The 7 most-significant bytes make up the timestamp in a ULID, which in the hex
-    dump form pasted there would be the first fourteen characters, since each byte is two hex
+[^ulid-timestamps]: The 6 most-significant bytes make up the timestamp in a ULID, which in the hex
+    dump form pasted there would be the first twelve characters, since each byte is two hex
     digits.
 
 [^advanced-debugging]: "adding `dbg!()` statements in the code"
@@ -483,7 +526,10 @@ automated tests aren't enough
     the bytes as big-endian, but were simply never actually used. I fixed
     [in the next commit](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b)
 
+[^solo-yolo-dev]: I've described my current practices as "solo-yolo", which has its plusses and
+    minuses, as you may imagine.
 
 
-[thats_a_database]: ./thats_a_database.png "that's a database"
-[see_the_light]: ./seen_the_light.png
+[thats_a_database]: ./thats_a_database.png "simpsons that's-a-paddling guy"
+
+[see_the_light]: ./seen_the_light.png "jake blues seeing the light"