ready to publish

This commit is contained in:
Joe Ardent 2023-06-29 11:23:34 -07:00
parent a30f2188ca
commit 87bad5869e

View file

@ -1,12 +1,10 @@
+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-22"
updated = "2023-06-22"
date = "2023-06-29"
updated = "2023-06-29"
[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm"]
[extra]
toc = false
tags = ["software", "rnd", "proclamation", "upscm", "rust"]
+++
# *Mise en Scene*
@ -45,11 +43,11 @@ values and move on your merry way.
However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between primary keys.
fancier identifier for your primary keys, to avoid collisions between them.
## UUIDs
A popular type for these is called a
A popular type for fancy keys is called a
[v4 UUIDs](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
@ -88,9 +86,9 @@ at" field from every table that used them as primary keys. Which, in my case, wo
and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.
Plus, I was familiar with the idea of using sortable IDs, from
I was actually already familiar with the idea of using time-based sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using KSUIDs from the get-go, but discarded that for two main reasons:
using them from the get-go, but discarded that for two main reasons:
- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't
@ -327,7 +325,7 @@ If we put the least-significant-digit first, we'd write the number `512` as "215
written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
come before "215", which is backwards.
Little-endiannes is like that. If a multibyte numeric value is on a little-endian system, the
Little-endianness is like that. If a multibyte numeric value is on a little-endian system, the
least-significant bytes will come first, and a lexicographic sorting of those bytes would be
non-numeric.
@ -390,7 +388,7 @@ there, hiding in plain sight, as they had been all along:
> [...]
>
> * seq
> - A variably sized heterogeneous sequence of values, for example Vec<T> or HashSet<T>. ...
> - A variably sized heterogeneous sequence of values, for example Vec&lt;T&gt; or HashSet&lt;T&gt;. ...
>
> [...]
>
@ -418,7 +416,7 @@ day, I dove back into it.
All my serialization code was calling a method called
[`bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L18),
which simply called `another method that would return an array of 16 bytes, in big-endian order, so
which simply called another method that would return an array of 16 bytes, in big-endian order, so
it could go into the database and be sortable, as discussed.
But all[^actually_not_all] my *deserialization* code was constructing the IDs as [though the bytes
@ -432,11 +430,56 @@ Like, everything was *working*. Why did I need to construct from a different byt
I was losing my mind, so I reached out to the [Recurse Center](https://www.recurse.com) community
and presented my case.
Basically, I showed that bytes were written correctly, resident in the DB in big-endian form, but
then were "backwards" coming out and "had to be" cast using little-endian constructors
("`from_ne_bytes()`").
What had actually happened is that as long as there was agreement about what order to use for reconstructing the
ID from the bytes, it didn't matter if it was big or little-endian, it just had to be the same on
both the
[SQLx](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_106_105)
side and on the
[Serde](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_210_209)
side. This is also irrespective of the order they were written out in, but again, the two sides must
agree on the convention used. Inside the Serde method, I had added some debug printing of the bytes
it was getting, and they were in little-endian order. What I had not realized is that that was
because they were first passing through the SQLx method which reversed them.
Mmmmm, delicious, delicous red herring.
Two people were especially helpful, Julia Evans and Nicole Tietz-Sokolskaya; Julia grabbed a copy of
my database file and poked it with Python, and could not replicate the behavior I was seeing, and
Nicole did the same but with a little Rust program she wrote. Huge thanks to both of them (but not
just them) for the extended [rubber ducking](https://en.wikipedia.org/wiki/Rubber_duck_debugging)!
And apologies for the initial gas-lighting; Julia was quite patient and diplomatic when pushing back
against "the bytes are coming out of the db backwards".
# Lessons learned
don't change many things at once
Welp, here we are, the end of the line; I hope this has been informative, or barring that, at least
entertaining. Or the other way around, I'm not that fussy!
automated tests aren't enough
Obviously, the biggest mistake was to futz with being clever about endianness before understanding
why the login code was now failing. Had I gotten it working correctly first, I would have been able to
figure out the requirement for agreement on convention between the two different serialization
systems much sooner, and I would not have wasted mine and others' time on misunderstanding.
On the other hand, it's hard to see these things on the first try, especially when you're on your
own, and are on your first fumbling steps in a new domain or ecosystem; for me, that was getting
into the nitty-gritty with Serde, and for that matter, dealing directly with serialization-specific
issues. Collaboration is a great technique for navigating these situations, and I definitely need to
focus a bit more on enabling that[^solo-yolo-dev].
In the course of debugging this issue, I tried to get more insight via
[testing](https://gitlab.com/nebkor/ww/-/commit/656e6dceedf0d86e2805e000c9821e931958a920#ce34dd57be10530addc52a3273548f2b8d3b8a9b_143_251),
and though that helped a little, it was not nearly enough; the problem was that I misunderstood how
something worked, not that I had mistakenly implemented something I was comfortable with. Tests
aren't a substitute for understanding!
And of course, I'm now much more confident and comfortable with Serde; reading the Serde code for
other things, like [UUIDs](https://github.com/uuid-rs/uuid/blob/main/src/external/serde_support.rs),
is no longer an exercise in eye-glaze-control. Maybe this has helped you with that too?
----
@ -469,8 +512,8 @@ automated tests aren't enough
[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.
[^ulid-timestamps]: The 7 most-significant bytes make up the timestamp in a ULID, which in the hex
dump form pasted there would be the first fourteen characters, since each byte is two hex
[^ulid-timestamps]: The 6 most-significant bytes make up the timestamp in a ULID, which in the hex
dump form pasted there would be the first twelve characters, since each byte is two hex
digits.
[^advanced-debugging]: "adding `dbg!()` statements in the code"
@ -483,7 +526,10 @@ automated tests aren't enough
the bytes as big-endian, but were simply never actually used. I fixed [in the next
commit](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b)
[^solo-yolo-dev]: I've described my current practices as "solo-yolo", which has its plusses and
minuses, as you may imagine.
[thats_a_database]: ./thats_a_database.png "that's a database"
[see_the_light]: ./seen_the_light.png
[thats_a_database]: ./thats_a_database.png "simpsons that's-a-paddling guy"
[see_the_light]: ./seen_the_light.png "jake blues seeing the light"