checkpoint; working out endianness
parent
2aeec60470
commit
e7bd1bd5f8
1 changed files with 246 additions and 21 deletions
|
@@ -55,7 +55,7 @@ numbers[^uuidv4_random], and when turned into a string, usually look something l
|
|||
`1c20104f-e04f-409e-9ad3-94455e5f4fea`; this is called the "hyphenated" form, for fairly obvious
|
||||
reasons. Although sometimes they're stored in a DB in that form directly, that's using 36 bytes to
|
||||
store 16 bytes' worth of data, which is more than twice as many bytes as necessary. And if you're
|
||||
a programmer, this sort of conspicous waste is unconscionsable.
|
||||
a programmer, this sort of conspicuous waste is unconscionable.
|
||||
|
||||
You can cut that to 32 bytes by just dropping the dashes, but then that's still twice as many bytes
|
||||
as the actual data requires. If you never have to actually display the ID inside the database, then
|
||||
|
@@ -93,13 +93,17 @@ Plus, I was familiar with the idea of using sortable IDs, from
|
|||
using KSUIDs from the get-go, but discarded that for two main reasons:
|
||||
|
||||
- they're **FOUR WHOLE BYTES!!!** larger than UUIDs
|
||||
- I'd have to manually implement serialization/deserialization for them anyway, since SQLx didn't
|
||||
- I'd have to manually implement serialization/deserialization, since SQLx doesn't
|
||||
have native support for them
|
||||
|
||||
In reality, neither of those is a real show-stopper; 20 vs. 16 bytes is probably not that
|
||||
significant, and I'd have to do the manual serialization stuff anyway.
|
||||
significant, and I'd have to do the manual serialization stuff for anything besides a
|
||||
less-than-8-bytes number or a normal UUID. Still, four bytes is four bytes, and all other things
|
||||
being equal, I'd rather go for the trimmer, 64-bit-aligned value.
|
||||
|
||||
I was ready to do this thing.
|
||||
Finally, I'd recently finished with adding some ability to actually interact with data in a
|
||||
meaningful way, and to add new records to the database, which meant that it was now or never for
|
||||
standardizing on a type for the primary keys. I was ready to do this thing.
|
||||
|
||||
# Serial problems
|
||||
|
||||
|
@@ -113,45 +117,266 @@ ULIDs. Technically, I could have let Rust [handle that for me](https://serde.rs/
|
|||
automatically deriving that functionality. There were a couple snags with that course, though:
|
||||
|
||||
- the default serialized representation of a ULID in the library I was using to provide them [is as
|
||||
26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html)
|
||||
26-character strings](https://docs.rs/ulid/latest/ulid/serde/index.html), and I wanted to use only
|
||||
16 bytes in the database
|
||||
- you could tell it to serialize as a [128-bit
|
||||
number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that only kicked the
|
||||
number](https://docs.rs/ulid/latest/ulid/serde/ulid_as_u128/index.html), but that merely kicked the
|
||||
problem one step down the road since SQLite can only handle up to 64-bit numbers, as previously
|
||||
discussed, so I'd still have to manually do something for them
|
||||
|
||||
This meant going all-in on fully custom serialization and deserialization, something I'd never done
|
||||
before, but how hard could it be? (actually not that hard!)
|
||||
before, but how hard could it be? (spoiler: actually not that hard!)
|
||||
|
||||
## Great coders steal
|
||||
|
||||
steal the uuid serde impls from sqlx
|
||||
Something I appreciate about the [Rust programming language](https://www.rust-lang.org/) is that
|
||||
because of the way the compiler works[^rust-generics], the full source code almost always has to be
|
||||
available to you, the end-user coder. The culture around it is also very biased toward open source,
|
||||
and so all the extremely useful libraries are just sitting there, ready to be studied and copied. So
|
||||
the first thing I did was take a look at how [SQLx handled
|
||||
UUIDs](https://github.com/launchbadge/sqlx/blob/main/sqlx-sqlite/src/types/uuid.rs):
|
||||
|
||||
``` rust
|
||||
impl Type<Sqlite> for Uuid {
|
||||
fn type_info() -> SqliteTypeInfo {
|
||||
SqliteTypeInfo(DataType::Blob)
|
||||
}
|
||||
|
||||
fn compatible(ty: &SqliteTypeInfo) -> bool {
|
||||
matches!(ty.0, DataType::Blob | DataType::Text)
|
||||
}
|
||||
}
|
||||
|
||||
impl<'q> Encode<'q, Sqlite> for Uuid {
|
||||
fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
|
||||
args.push(SqliteArgumentValue::Blob(Cow::Owned(
|
||||
self.as_bytes().to_vec(),
|
||||
)));
|
||||
|
||||
IsNull::No
|
||||
}
|
||||
}
|
||||
|
||||
impl Decode<'_, Sqlite> for Uuid {
|
||||
fn decode(value: SqliteValueRef<'_>) -> Result<Self, BoxDynError> {
|
||||
// construct a Uuid from the returned bytes
|
||||
Uuid::from_slice(value.blob()).map_err(Into::into)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
There's not a ton going on there, as you can see. To "encode" it just gets the bytes out of the
|
||||
UUID, and to "decode" it just gets the bytes out of the database. I couldn't use that exactly as
|
||||
done by the SQLx authors, as they were using datatypes that were private to their crate, but it was
|
||||
close enough; here's mine:
|
||||
|
||||
``` rust
|
||||
impl sqlx::Type<sqlx::Sqlite> for DbId {
|
||||
fn type_info() -> <sqlx::Sqlite as sqlx::Database>::TypeInfo {
|
||||
<&[u8] as sqlx::Type<sqlx::Sqlite>>::type_info()
|
||||
}
|
||||
}
|
||||
|
||||
impl<'q> Encode<'q, Sqlite> for DbId {
|
||||
fn encode_by_ref(&self, args: &mut Vec<SqliteArgumentValue<'q>>) -> IsNull {
|
||||
args.push(SqliteArgumentValue::Blob(Cow::Owned(self.bytes().to_vec())));
|
||||
IsNull::No
|
||||
}
|
||||
}
|
||||
|
||||
impl Decode<'_, Sqlite> for DbId {
|
||||
fn decode(value: SqliteValueRef<'_>) -> Result<Self, sqlx::error::BoxDynError> {
|
||||
let bytes = <&[u8] as Decode<Sqlite>>::decode(value)?;
|
||||
let bytes: [u8; 16] = bytes.try_into().unwrap_or_default();
|
||||
Ok(u128::from_be_bytes(bytes).into())
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
(In order to implement the required methods from SQLx, I had to wrap the ULID in a new, custom type,
|
||||
which I called `DbId`, to comply with the [orphan rules](https://github.com/Ixrec/rust-orphan-rules).)
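
A minimal, dependency-free sketch of that wrapper pattern follows. It is not the actual `DbId` from this post (which wraps a ULID from the `ulid` crate); it assumes a bare `u128` inside, a big-endian `bytes()` to pair with the `from_be_bytes` call in `decode()`, and a hypothetical helper, `db_id_from_blob`, that reports a wrong-length blob instead of falling back to an all-zero ID:

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for the post's `DbId`: a newtype around a u128.
// Because the type is defined locally, foreign traits can be implemented for it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct DbId(pub u128);

impl DbId {
    /// The 16 bytes to store in the blob column, most significant byte first.
    /// (Big-endian is an assumption of this sketch.)
    pub fn bytes(&self) -> [u8; 16] {
        self.0.to_be_bytes()
    }
}

impl From<u128> for DbId {
    fn from(value: u128) -> Self {
        DbId(value)
    }
}

/// Same shape as the `decode()` above, but a blob that is not exactly
/// 16 bytes long becomes an error rather than a default value.
pub fn db_id_from_blob(blob: &[u8]) -> Result<DbId, String> {
    let bytes = <[u8; 16]>::try_from(blob)
        .map_err(|_| format!("expected 16 bytes, got {}", blob.len()))?;
    Ok(u128::from_be_bytes(bytes).into())
}
```

Round-tripping a value through `bytes()` and `db_id_from_blob()` should give back the same `DbId`, while a three-byte blob should produce an `Err`.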
|
||||
|
||||
That's only half the story, though. If all I had to worry about was getting data in and out of the
|
||||
database, that would be fine, but because I'm building a web app, I need to be able to include my
|
||||
new ID type in messages sent over a network or as part of a web page, and for that, it needed to
|
||||
implement some functionality from a different library, called [Serde](https://serde.rs/). My
|
||||
original implementation for *deserializing* looked like this:
|
||||
|
||||
``` rust
|
||||
struct DbIdVisitor;
|
||||
|
||||
impl<'de> Visitor<'de> for DbIdVisitor {
|
||||
type Value = DbId;
|
||||
|
||||
// make a DbId from a slice of bytes
|
||||
fn visit_bytes<E>(self, v: &[u8]) -> Result<Self::Value, E>
|
||||
where
|
||||
E: serde::de::Error,
|
||||
{
|
||||
...
|
||||
}
|
||||
|
||||
// make a DbId from a Vec of bytes
|
||||
fn visit_byte_buf<E>(self, v: Vec<u8>) -> Result<Self::Value, E>
|
||||
where
|
||||
E: serde::de::Error,
|
||||
{
|
||||
...
|
||||
}
|
||||
|
||||
// you get the picture
|
||||
fn visit_string() ...
|
||||
fn visit_u128() ...
|
||||
fn visit_i128() ...
|
||||
}
|
||||
```
|
||||
|
||||
In my mind, the only important pieces were the `visit_bytes()` and `visit_byte_buf()` methods,
|
||||
which worked basically the same as the `decode()` function for SQLx. I mean, as far as I could tell,
|
||||
the only time something would be encountering a serialized `DbId` would be in the form of raw bytes
|
||||
from the database; no one else would be trying to serialize one as something else that I didn't
|
||||
anticipate, right?
|
||||
|
||||
RIGHT???
|
||||
|
||||
(wrong)
|
||||
|
||||
## A puzzling failure
|
||||
|
||||
# When in trouble, be sure to change many things at once
|
||||
As soon as my code compiled, I ran my tests. Everything passed... except for one that tested
|
||||
logging in.
|
||||
|
||||
## Death to the littlendians, obviously
|
||||
- endianness
|
||||
- profit
|
||||
This was very strange. All the other tests were passing, and basically every operation requires
|
||||
getting one of these IDs into or out of the database. But at this point, it was late, and I set it
|
||||
down until the next day.
|
||||
|
||||
# When in doubt, change many things at once
|
||||
|
||||
The next day I sat back down to get back to work, and in the course of examining what was going on,
|
||||
realized that I'd missed something crucial: these things were supposed to be *sortable*. But the way
|
||||
I was inserting them meant that they weren't, because of endianness.
|
||||
|
||||
## More like shmexicographic, amirite
|
||||
|
||||
"ULID" stands for "Universally Unique Lexicographically Sortable
|
||||
Identifier"[^uulsid]. "[Lexicographic order](https://en.wikipedia.org/wiki/Lexicographic_order)"
|
||||
basically means, "like alphabetical, but for anything with a defined total order". Numbers have a
|
||||
defined total order; bigger numbers always go after smaller.
|
||||
|
||||
But sometimes numbers get sorted out of order, if they're not treated as numbers. Like say you had a
|
||||
directory with twelve files in it, called "1.txt" up through "12.txt". If you were to ask to see
|
||||
them listed out in lexicographic order, it would go like:
|
||||
|
||||
``` text
|
||||
$ ls
|
||||
10.txt
|
||||
11.txt
|
||||
12.txt
|
||||
1.txt
|
||||
2.txt
|
||||
3.txt
|
||||
4.txt
|
||||
5.txt
|
||||
6.txt
|
||||
7.txt
|
||||
8.txt
|
||||
9.txt
|
||||
```
|
||||
|
||||
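This is because '10' is "less than" '2' (and "10.txt" lands before "1.txt" because `ls` sorts using the locale's collation rules, which tend to skip punctuation like '.'; in raw byte order, '.' is "less than" '0', and "1.txt" would come first). The solution, as all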
|
||||
data-entering people know, is to pad the number with leading '0's:
|
||||
|
||||
``` text
|
||||
$ ls
|
||||
01.txt
|
||||
02.txt
|
||||
03.txt
|
||||
04.txt
|
||||
05.txt
|
||||
06.txt
|
||||
07.txt
|
||||
08.txt
|
||||
09.txt
|
||||
10.txt
|
||||
11.txt
|
||||
12.txt
|
||||
```
|
||||
|
||||
Now the names are lexicographically sorted in the right numerical order[^confusing-yes].
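
The same effect is easy to reproduce in a few lines of Rust. This is only a sketch, and `sorted_names` is a made-up helper for illustration. One caveat: Rust compares strings byte by byte, with no locale rules, so in the unpadded case "1.txt" comes out ahead of "10.txt"; the numbers are still out of numerical order either way.

```rust
/// Builds the names for files 1 through 12 and sorts them as strings.
/// `width` is the minimum number of digits; shorter numbers get leading zeros.
pub fn sorted_names(width: usize) -> Vec<String> {
    let mut names: Vec<String> = (1..=12)
        .map(|n| format!("{:0width$}.txt", n, width = width))
        .collect();
    // String ordering in Rust is byte-wise lexicographic.
    names.sort();
    names
}
```

With `sorted_names(1)`, "10.txt", "11.txt", and "12.txt" all land ahead of "2.txt"; with `sorted_names(2)`, the list runs from "01.txt" to "12.txt" in numerical order.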
|
||||
|
||||
So, now that we're all expert lexicographicographers, we understand that our IDs are just
|
||||
supposed to naturally sort themselves in the correct order, based on when they were created; IDs
|
||||
created later should sort after IDs created earlier.
|
||||
|
||||
The implementation for my ULIDs only guaranteed this property for the string form of them, but I was
|
||||
not storing them in string form. Fundamentally, the ULID was a simple [128-bit primitive
|
||||
integer](https://doc.rust-lang.org/std/primitive.u128.html), capable of holding values between 0 and
|
||||
340,282,366,920,938,463,463,374,607,431,768,211,455.
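
To see why the numeric value of a ULID tracks its creation time, it helps to look at how the 128 bits are laid out. Per the ULID spec, the top 48 bits are a millisecond timestamp and the bottom 80 bits are random. The sketch below packs a value by hand; `pack_ulid` is a hypothetical helper written for this illustration, not a function from the `ulid` crate:

```rust
/// Packs a ULID-style value: a 48-bit millisecond timestamp in the high bits,
/// and 80 bits of randomness in the low bits.
pub fn pack_ulid(timestamp_ms: u64, randomness: u128) -> u128 {
    // Mask both inputs down to their field widths before combining them.
    let ts = (timestamp_ms as u128) & ((1u128 << 48) - 1);
    let rand = randomness & ((1u128 << 80) - 1);
    (ts << 80) | rand
}
```

Because the timestamp occupies the most significant bits, an ID created even one millisecond later is a larger number, no matter what the random bits are.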
|
||||
|
||||
But there's a problem: we're storing our ID in the database as a sequence of 16 bytes. I was asking
|
||||
for those bytes in "native endian", which in my case, meant "little endian". If you're not familiar
|
||||
with endianness, there are two varieties: big, and little. "Big" makes the most sense; if you see a
|
||||
number like "512", it's big-endian; the end is the part that's left-most, and "big" means that it is
|
||||
the most-significant-digit. This is the same as what westerners think of as "normal" numbers. In the
|
||||
number "512", the "most significant digit" is `5`, which corresponds to `500`, which is added to the
|
||||
next-most-significant digit, `1`, corresponding to `10`, which is added to the next-most-significant
|
||||
digit, which is also the least-significant digit, `2`, which is just `2`, giving us
|
||||
the full number `512`.
|
||||
|
||||
If we put the least-significant-digit first, we'd write the number `512` as "215"; the order when
|
||||
written out would be reversed. This means that the lexicographic sort of `512, 521` would have "125"
|
||||
come before "215", which is backwards.
|
||||
|
||||
Little-endianness is like that. If a multibyte value is on a little-endian system, the least-significant
|
||||
bytes will come first, and the sorting would be non-numeric.
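
Here is a small sketch of that effect, using only the standard library. It sorts plain `u128` values by their encoded bytes, compared left to right, which is how SQLite compares blobs. `sort_as_blobs` is a hypothetical helper for illustration; `to_be_bytes` and `to_le_bytes` are the standard methods on `u128`:

```rust
/// Sorts IDs the way a database would sort their 16-byte blobs: by comparing
/// the encoded bytes left to right, never looking at the numeric value.
pub fn sort_as_blobs(ids: &[u128], encode: fn(u128) -> [u8; 16]) -> Vec<u128> {
    let mut sorted = ids.to_vec();
    // Arrays of bytes compare lexicographically, just like blobs do.
    sorted.sort_by_key(|&id| encode(id));
    sorted
}
```

Given the values 256, 1, and 255, `sort_as_blobs(&ids, u128::to_be_bytes)` yields them in numerical order. With `u128::to_le_bytes`, 256 sorts first, because its first stored byte is `0x00`, while 1 starts with `0x01` and 255 starts with `0xFF`.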
|
||||
|
||||
Unfortunately, my computer is based on the Intel x86 instruction set, which means that it represents
|
||||
numbers in "little endian" form. This means that
|
||||
|
||||
## The actual problem
|
||||
|
||||
there was no visitor for seqs, which is what json byte arrays are, and what async_sessions was doing.
|
||||
|
||||
## Wait, why isn't it broken?
|
||||
|
||||
oh, it's not
|
||||
|
||||
# Lessons learned
|
||||
|
||||
don't change many things at once
|
||||
|
||||
automated tests aren't enough
|
||||
|
||||
----
|
||||
|
||||
[^uuidv4_random]: Technically, most v4 UUIDs have only 122 random bits, as six out of 128 are
|
||||
reserved for version metadata.
|
||||
reserved for version information.
|
||||
|
||||
[^blob-of-bytes]: Some databases have direct support for 128-bit primitive values (numbers). The
|
||||
database I'm using, SQLite, only supports up to 64-bit primitive values, but it does support
|
||||
arbitrary-length sequences of bytes called "blobs".
|
||||
|
||||
[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I
|
||||
plan to dive into in a different post, but "blob" is specific to SQLite. In general, you'll probably
|
||||
want to take advantage of implementation-specific features of whatever database you're using, which
|
||||
means that your table definitions won't be fully portable to a different database. This is fine and
|
||||
good, actually!
|
||||
[^sqlite-dataclasses]: I'm using [SQLite](https://www.sqlite.org/index.html) for reasons that I plan
|
||||
to dive into in a different post, but "blob" is specific to it. In general, you'll probably want
|
||||
to take advantage of implementation-specific features of whatever database you're using, which
|
||||
means that your table definitions won't be fully portable to a different database. This is fine
|
||||
and good, actually!
|
||||
|
||||
[^no-stinkin-benches]: You may wonder: have I benchmarked this system with UUIDs vs. ULIDs? Ha ha,
|
||||
you must have never met a programmer before! So, no, obviously not. But that's coming in a follow-up.
|
||||
you must have never met a programmer before! No, of course not. But, that's coming in a
|
||||
follow-up.
|
||||
|
||||
[^rust-generics]: If the code you're using has
|
||||
[generics](https://doc.rust-lang.org/book/ch10-01-syntax.html) in it, then the compiler needs to
|
||||
generate specialized versions of that generic code based on how you use it; this is called
|
||||
"[monomorphization](https://doc.rust-lang.org/book/ch10-01-syntax.html#performance-of-code-using-generics)",
|
||||
and it requires the original generic source to work. That's also true in C++, which is why most
|
||||
templated code is [header-only](https://isocpp.org/wiki/faq/templates#templates-defn-vs-decl),
|
||||
but Rust doesn't have header files.
|
||||
|
||||
[^uulsid]: I guess the extra 'U' and 'S' are invisible.
|
||||
|
||||
[^confusing-yes]: Is this confusing? Yes, 100%, it is not just you. Don't get discouraged.
|
||||
|
||||
|
||||
[thats_a_database]: ./thats_a_database.png "that's a database"
|
||||
|
|