+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-29"
updated = "2023-06-29"
[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm", "rust"]

[extra]
toc = false
+++

# *Mise en Scène*

values and move on your merry way.

However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between them.

## UUIDs

A popular type for fancy keys is the [v4 UUID](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious

at" field from every table that used them as primary keys. Which, in my case, wo
and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.

I was actually already familiar with the idea of using time-based sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using them from the get-go, but discarded that for two main reasons:

- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't

If we put the least-significant-digit first, we'd write the number `512` as "215
written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
come before "215", which is backwards.

Little-endianness is like that. If a multibyte numeric value is on a little-endian system, the
least-significant bytes will come first, and a lexicographic sorting of those bytes would be
non-numeric.
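
The sorting hazard is easy to demonstrate; here's a small standalone sketch (the values are mine,
chosen for illustration, not from the original post) comparing the byte representations of two
`u16`s:

```rust
fn main() {
    // 258 = 0x0102 and 513 = 0x0201: numerically, 258 < 513.
    let (a, b) = (258u16, 513u16);

    // Big-endian bytes ([1, 2] vs [2, 1]) compare in numeric order...
    assert!(a.to_be_bytes() < b.to_be_bytes());

    // ...but little-endian bytes ([2, 1] vs [1, 2]) compare backwards,
    // so a lexicographic sort of the stored bytes is non-numeric.
    assert!(a.to_le_bytes() > b.to_le_bytes());
}
```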

there, hiding in plain sight, as they had been all along:

> [...]
>
> * seq
>   - A variably sized heterogeneous sequence of values, for example `Vec<T>` or `HashSet<T>`. ...
>
> [...]

day, I dove back into it.

All my serialization code was calling a method called
[`bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L18),
which simply called another method that would return an array of 16 bytes, in big-endian order, so
it could go into the database and be sortable, as discussed.

But all[^actually_not_all] my *deserialization* code was constructing the IDs as [though the bytes
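
That serialization pattern can be sketched in a few lines; the `DbId` shape below is my assumption
for illustration, not the post's actual type:

```rust
// Hypothetical sketch: a 128-bit ID stored as 16 big-endian bytes,
// so that byte-wise (lexicographic) order matches numeric order.
struct DbId(u128);

impl DbId {
    fn bytes(&self) -> [u8; 16] {
        // most-significant byte first
        self.0.to_be_bytes()
    }
}

fn main() {
    let earlier = DbId(1);
    let later = DbId(1u128 << 64);
    // big-endian bytes preserve the numeric ordering of the IDs
    assert!(earlier.bytes() < later.bytes());
}
```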

Like, everything was *working*. Why did I need to construct from a different byt

I was losing my mind, so I reached out to the [Recurse Center](https://www.recurse.com) community
and presented my case.

Basically, I showed that bytes were written correctly, resident in the DB in big-endian form, but
then were "backwards" coming out and "had to be" cast using little-endian constructors
("`from_ne_bytes()`").

What had actually happened is that as long as there was agreement about what order to use for
reconstructing the ID from the bytes, it didn't matter if it was big- or little-endian; it just had
to be the same on both the
[SQLx](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_106_105)
side and on the
[Serde](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_210_209)
side. This is also irrespective of the order they were written out in, but again, the two sides must
agree on the convention used. Inside the Serde method, I had added some debug printing of the bytes
it was getting, and they were in little-endian order. What I had not realized was that this was
because they had first passed through the SQLx method, which reversed them.

Mmmmm, delicious, delicious red herring.
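
The convention-agreement point boils down to a few lines; this is a standalone illustration, not
the actual code from the project:

```rust
fn main() {
    let id: u128 = 0x0123456789abcdef_fedcba9876543210;

    // If writer and reader agree on a byte order, either convention round-trips:
    assert_eq!(u128::from_be_bytes(id.to_be_bytes()), id);
    assert_eq!(u128::from_le_bytes(id.to_le_bytes()), id);

    // But mixing conventions silently reverses all 16 bytes:
    assert_eq!(u128::from_le_bytes(id.to_be_bytes()), id.swap_bytes());
}
```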

Two people were especially helpful: Julia Evans and Nicole Tietz-Sokolskaya. Julia grabbed a copy of
my database file and poked it with Python, and could not replicate the behavior I was seeing, and
Nicole did the same but with a little Rust program she wrote. Huge thanks to both of them (but not
just them) for the extended [rubber ducking](https://en.wikipedia.org/wiki/Rubber_duck_debugging)!
And apologies for the initial gaslighting; Julia was quite patient and diplomatic when pushing back
against "the bytes are coming out of the db backwards".
# Lessons learned

Welp, here we are, the end of the line; I hope this has been informative, or barring that, at least
entertaining. Or the other way around, I'm not that fussy!

Obviously, the biggest mistake was to futz with being clever about endianness before understanding
why the login code was now failing. Had I gotten it working correctly first, I would have been able
to figure out the requirement for agreement on convention between the two different serialization
systems much sooner, and I would not have wasted my own and others' time on a misunderstanding.

On the other hand, it's hard to see these things on the first try, especially when you're on your
own and taking your first fumbling steps in a new domain or ecosystem; for me, that was getting
into the nitty-gritty with Serde and, for that matter, dealing directly with serialization-specific
issues. Collaboration is a great technique for navigating these situations, and I definitely need to
focus a bit more on enabling that[^solo-yolo-dev].

In the course of debugging this issue, I tried to get more insight via
[testing](https://gitlab.com/nebkor/ww/-/commit/656e6dceedf0d86e2805e000c9821e931958a920#ce34dd57be10530addc52a3273548f2b8d3b8a9b_143_251),
and though that helped a little, it was not nearly enough; the problem was that I misunderstood how
something worked, not that I had mistakenly implemented something I was comfortable with. Tests
aren't a substitute for understanding!

And of course, I'm now much more confident and comfortable with Serde; reading the Serde code for
other things, like [UUIDs](https://github.com/uuid-rs/uuid/blob/main/src/external/serde_support.rs),
is no longer an exercise in eye-glaze-control. Maybe this has helped you with that too?

----

[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.

[^ulid-timestamps]: The 6 most-significant bytes make up the timestamp in a ULID, which in the hex
    dump form pasted there would be the first twelve characters, since each byte is two hex
    digits.

[^advanced-debugging]: "adding `dbg!()` statements in the code"

the bytes as big-endian, but were simply never actually used. I fixed that [in the next
commit](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b).

[^solo-yolo-dev]: I've described my current practices as "solo-yolo", which has its plusses and
    minuses, as you may imagine.

[thats_a_database]: ./thats_a_database.png "simpsons that's-a-paddling guy"

[see_the_light]: ./seen_the_light.png "jake blues seeing the light"