commit 87bad5869e (parent a30f2188ca): "ready to publish"
1 changed file with 64 additions and 18 deletions
+++
title = "A One-Part Serialized Mystery"
slug = "one-part-serialized-mystery"
date = "2023-06-29"
updated = "2023-06-29"

[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm", "rust"]
+++

# *Mise en Scene*

However, if you ever think you might want to have multiple instances of your database running, and
want to make sure they're eventually consistent with each other, then you might want to use a
fancier identifier for your primary keys, to avoid collisions between them.

## UUIDs

A popular type for fancy keys is called a
[v4 UUID](https://datatracker.ietf.org/doc/html/rfc4122#page-14). These are 128-bit random
numbers[^uuidv4_random], and when turned into a string, usually look something like
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
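To make the hyphenated form concrete, here is a small std-only sketch that renders those same 16
bytes into that string; real code would just use the `uuid` crate, this is only illustrating the
8-4-4-4-12 grouping:

```rust
fn main() {
    // The example UUID from the text, as 16 raw bytes.
    let b: [u8; 16] = [
        0x1c, 0x20, 0x10, 0x4f, 0xe0, 0x4f, 0x40, 0x9e,
        0x9a, 0xd3, 0x94, 0x45, 0x5e, 0x5f, 0x4f, 0xea,
    ];

    // Each byte becomes two hex digits; the hyphenated form groups them 8-4-4-4-12.
    let hex: String = b.iter().map(|x| format!("{:02x}", x)).collect();
    let hyphenated = format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8], &hex[8..12], &hex[12..16], &hex[16..20], &hex[20..32]
    );

    // Note the leading '4' in the third group: that nibble is the version field.
    assert_eq!(hyphenated, "1c20104f-e04f-409e-9ad3-94455e5f4fea");
    println!("{hyphenated}");
}
```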

and I'm worried less about the speed of inserts than I am about keeping total on-disk size down
anyway.

I was actually already familiar with the idea of using time-based sortable IDs, from
[KSUIDs](https://github.com/segmentio/ksuid). It's an attractive concept to me, and I'd considered
using them from the get-go, but discarded that for two main reasons:

- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
- I'd have to manually implement serialization/deserialization, since SQLx doesn't

written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
come before "215", which is backwards.

Little-endianness is like that. If a multibyte numeric value is on a little-endian system, the
least-significant bytes will come first, and a lexicographic sorting of those bytes would be
non-numeric.

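The digit analogy carries over directly to bytes; a quick std-only sketch of the mismatch:

```rust
fn main() {
    let small: u32 = 1;   // 0x00000001
    let big: u32 = 256;   // 0x00000100

    // Big-endian bytes sort the same way the numbers do:
    // [0, 0, 0, 1] < [0, 0, 1, 0]
    assert!(small.to_be_bytes() < big.to_be_bytes());

    // Little-endian puts the least-significant byte first, so the
    // lexicographic comparison is backwards: [1, 0, 0, 0] > [0, 1, 0, 0]
    assert!(small.to_le_bytes() > big.to_le_bytes());
}
```

This is why sortable-ID schemes store the bytes big-endian: the raw byte order and the numeric
order agree, so a plain byte-wise index sort is also a chronological sort.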

> [...]
>
> * seq
>   - A variably sized heterogeneous sequence of values, for example Vec&lt;T&gt; or HashSet&lt;T&gt;. ...
>
> [...]
>


All my serialization code was calling a method called
[`bytes()`](https://gitlab.com/nebkor/ww/-/blob/656e6dceedf0d86e2805e000c9821e931958a920/src/db_id.rs#L18),
which simply called another method that would return an array of 16 bytes, in big-endian order, so
it could go into the database and be sortable, as discussed.

But all[^actually_not_all] my *deserialization* code was constructing the IDs as [though the bytes

I was losing my mind, so I reached out to the [Recurse Center](https://www.recurse.com) community
and presented my case.

Basically, I showed that bytes were written correctly, resident in the DB in big-endian form, but
then were "backwards" coming out and "had to be" cast using little-endian constructors
("`from_ne_bytes()`").

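The "backwards" illusion is just the two constructors reading the same array in opposite orders; a
minimal std-only sketch (with made-up byte values, not my actual IDs):

```rust
fn main() {
    // 16 bytes as they'd sit in the database, most-significant byte first.
    let stored: [u8; 16] = [
        0x01, 0x88, 0xb2, 0xf2, 0x9e, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x42,
    ];

    let as_be = u128::from_be_bytes(stored);
    let as_le = u128::from_le_bytes(stored);

    // Same bytes, opposite interpretations: each constructor reads the
    // array in the reverse order of the other.
    assert_eq!(as_be.swap_bytes(), as_le);
}
```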
What had actually happened is that as long as there was agreement about what order to use for
reconstructing the ID from the bytes, it didn't matter if it was big or little-endian; it just had
to be the same on both the
[SQLx](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_106_105)
side and on the
[Serde](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b_210_209)
side. This is also irrespective of the order they were written out in, but again, the two sides must
agree on the convention used. Inside the Serde method, I had added some debug printing of the bytes
it was getting, and they were in little-endian order. What I had not realized was that they were
first passing through the SQLx method, which reversed them.

Mmmmm, delicious, delicious red herring.

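The "agree on a convention" fix can be sketched as a pair of methods that both commit to one byte
order; this is a hypothetical stand-in, not my actual `DbId` implementation:

```rust
// Hypothetical sketch: the point is only that both directions use the
// same convention.
struct DbId(u128);

impl DbId {
    // Store as big-endian so the raw bytes sort the same as the numbers.
    fn to_bytes(&self) -> [u8; 16] {
        self.0.to_be_bytes()
    }

    // Must match to_bytes(); swapping in from_le_bytes() here would
    // silently byte-reverse every ID on the way back in.
    fn from_bytes(bytes: [u8; 16]) -> Self {
        DbId(u128::from_be_bytes(bytes))
    }
}

fn main() {
    let id = DbId(0x1c20104f_e04f_409e_9ad3_94455e5f4fea);
    let round_tripped = DbId::from_bytes(id.to_bytes());
    assert_eq!(id.0, round_tripped.0);
}
```

A matched little-endian pair would round-trip just as cleanly; the bug only appears when the two
sides disagree.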
Two people were especially helpful, Julia Evans and Nicole Tietz-Sokolskaya; Julia grabbed a copy of
my database file and poked it with Python, and could not replicate the behavior I was seeing, and
Nicole did the same but with a little Rust program she wrote. Huge thanks to both of them (but not
just them) for the extended [rubber ducking](https://en.wikipedia.org/wiki/Rubber_duck_debugging)!
And apologies for the initial gaslighting; Julia was quite patient and diplomatic when pushing back
against "the bytes are coming out of the db backwards".

# Lessons learned

Welp, here we are, the end of the line; I hope this has been informative, or barring that, at least
entertaining. Or the other way around, I'm not that fussy!

Obviously, the biggest mistake was to futz with being clever about endianness before understanding
why the login code was now failing. Had I gotten it working correctly first, I would have been able
to figure out the requirement for agreement on convention between the two different serialization
systems much sooner, and I would not have wasted my own and others' time on a misunderstanding.

On the other hand, it's hard to see these things on the first try, especially when you're on your
own, taking your first fumbling steps in a new domain or ecosystem; for me, that was getting into
the nitty-gritty with Serde, and for that matter, dealing directly with serialization-specific
issues. Collaboration is a great technique for navigating these situations, and I definitely need to
focus a bit more on enabling that[^solo-yolo-dev].

In the course of debugging this issue, I tried to get more insight via
[testing](https://gitlab.com/nebkor/ww/-/commit/656e6dceedf0d86e2805e000c9821e931958a920#ce34dd57be10530addc52a3273548f2b8d3b8a9b_143_251),
and though that helped a little, it was not nearly enough; the problem was that I misunderstood how
something worked, not that I had mistakenly implemented something I was comfortable with. Tests
aren't a substitute for understanding!

And of course, I'm now much more confident and comfortable with Serde; reading the Serde code for
other things, like [UUIDs](https://github.com/uuid-rs/uuid/blob/main/src/external/serde_support.rs),
is no longer an exercise in eye-glaze-control. Maybe this has helped you with that too?

----

[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.

[^ulid-timestamps]: The 6 most-significant bytes make up the timestamp in a ULID, which in the hex
    dump form pasted there would be the first twelve characters, since each byte is two hex
    digits.

[^advanced-debugging]: "adding `dbg!()` statements in the code"

    the bytes as big-endian, but were simply never actually used. I fixed this [in the next
    commit](https://gitlab.com/nebkor/ww/-/commit/84d70336d39293294fd47b4cf115c70091552c11#ce34dd57be10530addc52a3273548f2b8d3b8a9b)

[^solo-yolo-dev]: I've described my current practices as "solo-yolo", which has its plusses and
    minuses, as you may imagine.

[thats_a_database]: ./thats_a_database.png "simpsons that's-a-paddling guy"

[see_the_light]: ./seen_the_light.png "jake blues seeing the light"