+++
title = "Presenting Julids, another fine sundry from Nebcorp Heavy Industries and Sundries"
slug = "presenting-julids"
date = "2023-07-31"

[taxonomies]
tags = ["software", "sundry", "proclamation", "sqlite", "rust", "ulid", "julid"]
+++
# Presenting Julids
Nebcorp Heavy Industries and Sundries, long the world leader in sundries, is proud to announce the
public launch of the official identifier type for all Nebcorp companies' assets and database
entries, [Julids](https://gitlab.com/nebkor/julid). Julids are globally unique sortable identifiers,
backwards-compatible with [ULIDs](https://github.com/ulid/spec), *but better*.

Inside your Rust program, simply add `julid-rs`[^julid-package] to your project's `Cargo.toml` file, and use it
like:
``` rust
use julid::Julid;

fn main() {
    let id = Julid::new();
    dbg!(id.created_at(), id.as_string());
}
```
Such a program would output something like:
``` text
[main.rs:5] id.created_at() = 2023-07-29T20:21:50.009Z
[main.rs:5] id.as_string() = "01H6HN10SS00020YT344XMGA3C"
```
However, it can also be built as a [loadable extension](https://www.sqlite.org/loadext.html) for
SQLite, adding database functions for creating and querying Julids:
``` text
$ sqlite3
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load ./libjulid
sqlite> select hex(julid_new());
018998768ACF000060B31DB175E0C5F9
sqlite> select julid_string(julid_new());
01H6C7D9CT00009TF3EXXJHX4Y
sqlite> select julid_seconds(julid_new());
1690480066.208
sqlite> select datetime(julid_timestamp(julid_new()), 'auto');
2023-07-27 17:47:50
sqlite> select julid_counter(julid_new());
0
```
Intrigued? Confused? Disgusted? Enraged?? Well, read on!

## Julids vs ULIDs

Julids are a drop-in replacement for ULIDs: all Julids are valid ULIDs, but not all ULIDs are valid
Julids.

Given their compatibility relationship, Julids and ULIDs must have quite a bit in common, and indeed
they do:
* they are 128 bits long
* they are lexicographically sortable
* they encode their creation time as the number of milliseconds since the [UNIX
  epoch](https://en.wikipedia.org/wiki/Unix_time) in their top 48 bits
* their string representation is a 26-character [base-32
  Crockford](https://en.wikipedia.org/wiki/Base32) encoding of their big-endian bytes
* IDs created within the same millisecond are still meant to sort in their order of creation

Julids and ULIDs have different ways to implement that last piece. If you look at the layout of bits
in a ULID, you see:

![ULID bit structure](./ulid.svg)

According to the ULID spec, for ULIDs created within the same millisecond, the least-significant bit
should be incremented for each new ID. Since that portion of the ULID is random, that means you may
not be able to increment it without spilling into the timestamp portion. Likewise, it's easy to
guess a new possibly-valid ULID simply by incrementing an already-known one. And finally, this means
that sorting will need to read all the way to the end of the ULID for IDs created in the same
millisecond.
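
To make the "spilling into the timestamp" problem concrete, here is a minimal sketch that treats a
ULID as a plain `u128` with 48 timestamp bits on top of 80 random bits. The helper names
(`timestamp_ms`, `naive_increment`) are made up for illustration and are not taken from any ULID
library; the point is only the arithmetic.

```rust
/// Number of random bits at the low end of a ULID.
const RANDOM_BITS: u32 = 80;

/// Extract the 48-bit millisecond timestamp from the top of a 128-bit ID.
fn timestamp_ms(id: u128) -> u64 {
    (id >> RANDOM_BITS) as u64
}

/// Hypothetical helper: "increment the least-significant bit" as a plain
/// 128-bit addition, with no check for overflow of the random portion.
fn naive_increment(id: u128) -> u128 {
    id.wrapping_add(1)
}

fn main() {
    let ts: u128 = 1_690_480_066_208; // some millisecond timestamp
    let all_ones: u128 = (1u128 << RANDOM_BITS) - 1;

    // A ULID whose 80 random bits happened to come up all ones.
    let unlucky = (ts << RANDOM_BITS) | all_ones;
    let next = naive_increment(unlucky);

    // The carry spilled into the timestamp: the "next" ID now claims to be
    // from one millisecond later, and its random portion is all zeros.
    assert_eq!(timestamp_ms(unlucky), 1_690_480_066_208);
    assert_eq!(timestamp_ms(next), 1_690_480_066_209);
    assert_eq!(next & all_ones, 0);
    println!("{:032X} -> {:032X}", unlucky, next);
}
```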

To address these shortcomings, Julids (Joe's[^httm] ULIDs) have the following structure:

![Julid bit structure](./julid.svg)

As with ULIDs, the 48 most-significant bits encode the time of creation. Unlike ULIDs, the next 16
most-significant bits are not random[^counter idea]: they're a monotonic counter for IDs created
within the same millisecond[^monotonic]. Since it's only 16 bits, it will saturate after 65,536
intra-millisecond creations, after which IDs in that same millisecond will not have an intrinsic
total order (the random bits will still be different, so you shouldn't have collisions). My PC,
which is no slouch, can only generate about 20,000 per millisecond, so hopefully this is not an
issue! Because the random bits are always fresh, it's not possible to easily guess a valid Julid if
you already have one.
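
The layout described above can be sketched with nothing but shifts and masks on a `u128`. This is
an illustration of the bit structure only, not the crate's actual implementation; the free
functions `pack`, `timestamp_ms`, `counter`, and `sortable` are names I'm assuming for the example
(the real API hangs similar accessors off the `Julid` type). The sample value is the hex blob from
the `sqlite3` session at the top of this post.

```rust
/// Field widths as described in the text: a 48-bit timestamp on top,
/// then a 16-bit counter, then 64 random bits.
const COUNTER_BITS: u32 = 16;
const RANDOM_BITS: u32 = 64;

/// Pack the three fields into one 128-bit value, most-significant first.
fn pack(timestamp_ms: u64, counter: u16, random: u64) -> u128 {
    // Only the low 48 bits of the timestamp fit in the layout.
    let ts = (timestamp_ms & 0xFFFF_FFFF_FFFF) as u128;
    (ts << (COUNTER_BITS + RANDOM_BITS)) | ((counter as u128) << RANDOM_BITS) | random as u128
}

/// The top 48 bits: milliseconds since the UNIX epoch.
fn timestamp_ms(id: u128) -> u64 {
    (id >> (COUNTER_BITS + RANDOM_BITS)) as u64
}

/// The 16 bits just below the timestamp (the cast keeps the low 16 bits).
fn counter(id: u128) -> u16 {
    (id >> RANDOM_BITS) as u16
}

/// The top 64 bits: timestamp and counter concatenated.
fn sortable(id: u128) -> u64 {
    (id >> RANDOM_BITS) as u64
}

fn main() {
    let id = pack(0x0189_9876_8ACF, 0, 0x60B3_1DB1_75E0_C5F9);
    assert_eq!(format!("{:032X}", id), "018998768ACF000060B31DB175E0C5F9");
    assert_eq!(timestamp_ms(id), 0x0189_9876_8ACF);
    assert_eq!(counter(id), 0);
    assert_eq!(sortable(id), 0x0189_9876_8ACF_0000);

    // Within one millisecond the counter decides the order, no matter
    // what the random bits happen to be.
    let first = pack(1, 0, u64::MAX);
    let second = pack(1, 1, 0);
    assert!(first < second);
}
```

Because the counter sits directly under the timestamp, comparing the top 64 bits is enough to order
any two IDs from the same process until the counter saturates.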
# How to use
The Julid crate can be used in two different ways: as a regular Rust library, declared in your Rust
project's `Cargo.toml` file (say, by running `cargo add julid-rs`), and used as shown above. There's
a rudimentary [benchmark](https://gitlab.com/nebkor/julid/-/blob/main/examples/benchmark.rs) example
in the repo, which I'll talk more about below. But the primary use case for me was as a loadable
SQLite extension, as I [previously
wrote](/rnd/one-part-serialized-mystery-part-2/#next-steps-with-ids). Both are covered in the
[documentation](https://docs.rs/julid-rs/latest/julid/), but let's go over them here, starting with
the extension.
## Inside SQLite as a loadable extension
The extension, when loaded into SQLite, provides the following functions:
* `julid_new()`: create a new Julid and return it as a 16-byte
[blob](https://www.sqlite.org/datatype3.html#storage_classes_and_datatypes)
* `julid_seconds(julid)`: get the number of seconds (as a 64-bit float) since the UNIX epoch that this
  julid was created
* `julid_counter(julid)`: show the value of this julid's monotonic counter
* `julid_sortable(julid)`: return the 64-bit concatenation of the timestamp and counter
* `julid_string(julid)`: show the [base-32 Crockford](https://en.wikipedia.org/wiki/Base32)
encoding of this julid; the raw bytes of Julids won't be valid UTF-8, so use this or the built-in
`hex()` function to `select` a human-readable representation
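
For the curious, the string form is easy to sketch by hand: 128 bits padded with two leading zero
bits gives 130 bits, or 26 five-bit characters. The following is a minimal, from-scratch encoder
using the Crockford alphabet; it is an illustration under those assumptions, not the code the crate
or the extension actually runs, and `encode` is a name made up for the example.

```rust
/// Crockford's base-32 alphabet: digits, then letters without I, L, O, U.
const ALPHABET: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Encode a 128-bit ID as 26 characters, most-significant bits first.
/// The first character only carries the top 3 bits of the value.
fn encode(id: u128) -> String {
    let mut out = String::with_capacity(26);
    for i in 0..26 {
        let shift = 5 * (25 - i);
        let index = ((id >> shift) & 0x1F) as usize;
        out.push(ALPHABET[index] as char);
    }
    out
}

fn main() {
    // The hex blob from the sqlite3 session at the top of the post.
    let id: u128 = 0x0189_9876_8ACF_0000_60B3_1DB1_75E0_C5F9;
    let s = encode(id);
    assert_eq!(s, "01H6C7D2PF00061CRXP5TY1HFS");
    println!("{s}");
}
```

Since the encoding is big-endian and the alphabet is in ascending ASCII order, sorting the strings
gives the same order as sorting the raw bytes.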
### Building and loading
If you want to use it as a SQLite extension:
* clone the [repo](https://gitlab.com/nebkor/julid)
* build it with `cargo build --features plugin` (this builds the SQLite extension)
* copy the resulting `libjulid.[so|dylib|whatevs]` to some place where you can...
* load it into SQLite with `.load /path/to/libjulid` as shown at the top
* party

If you, like me, wish to use Julids as primary keys, just create your table like:
``` sql
create table users (
  id blob not null primary key default (julid_new()),
  ...
);
```
and you've got a first-class ticket straight to Julid City, baby!

For a table created like:
``` sql
-- table of things to watch
create table if not exists watches (
  id blob not null primary key default (julid_new()),
  kind int not null, -- enum for movie or tv show or whatev
  title text not null,
  length int,
  release_date int,
  added_by blob not null,
  last_updated int not null default (unixepoch()),
  foreign key (added_by) references users (id)
);
```
and then [some
code](https://gitlab.com/nebkor/ww/-/blob/cc14c30fcfbd6cdaecd85d0ba629154d098b4be9/src/import_utils.rs#L92-126)
that inserted rows into that table like
``` sql
insert into watches (kind, title, length, release_date, added_by) values (?,?,?,?,?)
```
where the placeholders get bound in a loop with unique values and the Julid `id` field is
generated by the extension for each row, I get over 100,000 insertions/second.
## Inside a Rust program
Of course, you can also use it outside of a database; the `Julid` type is publicly exported. There's
a simple benchmark in the examples folder of the repo, the important parts of which look like:
``` rust
use std::time::Instant;

use julid::Julid;

fn main() {
    /* snip some stuff, including setting up `num` and the vector `v` */

    // time only the generation of the IDs
    let start = Instant::now();
    for _ in 0..num {
        v.push(Julid::new());
    }
    let end = Instant::now();
    let dur = (end - start).as_micros();

    // details for each ID go to stderr, so they can be discarded
    for id in v.iter() {
        eprintln!(
            "{id}: created_at {}; counter: {}; sortable: {}",
            id.created_at(),
            id.counter(),
            id.sortable()
        );
    }
    println!("{num} Julids generated in {dur}us");
}
```
If you were to run it on a computer like mine[^my computer], you might see something like this:
``` text
$ cargo run --example=benchmark --release -- -n 30000 2> /dev/null
30000 Julids generated in 1240us
```
That's about 24,000 IDs/millisecond; 24 *MILLION* per second!

The default optional Cargo features include implementations of traits for getting Julids into and
out of SQLite with [SQLx](https://github.com/launchbadge/sqlx), and for generally
serializing/deserializing with [Serde](https://serde.rs/), via the `sqlx` and `serde` features,
respectively. One final default optional feature, `chrono`, uses the Chrono crate to return the
timestamp as a [`DateTime`](https://docs.rs/chrono/latest/chrono/struct.DateTime.html) by adding a
`created_at(&self)` method to `Julid`.

Something to note: don't enable the `plugin` feature in your Cargo.toml if you're using this crate
inside your Rust application, especially if you're *also* loading it as an extension in SQLite in
your application. You'll get a long and confusing runtime panic due to there being multiple
entrypoints defined with the same name.
# Why Julids?
The astute may have noticed that this is the third time I've written about globally unique sortable
IDs ([here is part one](/rnd/one-part-serialized-mystery), and [part two is
here](/rnd/one-part-serialized-mystery-part-2)). What's, uh... what's up with that?

![marge just thinks they're neat][marge ids]

<div class = "caption">we both just think they're neat</div>

Like Marge, I just think they're neat! We're not the only ones; here are just some related projects:
* Segment's [KSUID](https://segment.com/blog/a-brief-history-of-the-uuid/), released in 2017. This
  was possibly my first exposure to this idea. They're 32 bits larger than UUIDs or ULIDs, but
  otherwise very similar to ULIDs (and hence Julids)
* [ULIDs](https://github.com/ulid/spec), of course
* [UUIDv7](https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html#name-uuidv7-layout-and-bit-order);
  these are *very* similar to Julids; the primary difference is that the lower 62 bits are left up
  to the implementation, rather than always containing pseudorandom bits as in Julids (which use
  the lower 64 bits for that, instead of UUIDv7's 62)
* [Snowflake ID](https://en.wikipedia.org/wiki/Snowflake_ID), developed by Twitter in 2010; these
  are 63-bit identifiers (so they fit in a signed 64-bit number), where the top 41 bits are a
  millisecond timestamp, the next 10 bits are a machine identifier[^twitter machine count], and the
  last 12 bits are for an intra-millisecond sequence counter (what Julid calls a "monotonic
  counter"); unlike all the other IDs discussed, there are no random bits

and I'm sure the list can go on.

I wanted to use them in my SQLite-backed [web app](https://gitlab.com/nebkor/ww), in order to fix
some deficiencies in ULIDs and the way I was using them, as [I said
before](/rnd/one-part-serialized-mystery-part-2/#next-steps-with-ids):
> [...] it bothers me that ID generation is not done inside the database itself. Aside from being
> a generally bad idea, this lead to at least one frustrating debug session where I was inserting
> one ID but reporting back another. SQLite doesn't have native support for this, but it does have
> good native support for loading shared libraries as plugins in order to add functionality to it,
> and so my next step is to write one of those, and remove the ID generation logic from the
> application.

Now that I've accomplished all that I've set out to do, is this the last time I'll be
writing at length about these things? It's hard to say for sure, but signs point to "yes". I hope
you've found them at least a little interesting!
# Thanks

This project wouldn't have happened without a lot of inspiration (and a little shameless stealing)
from the [ulid-rs](https://github.com/dylanhart/ulid-rs) crate. For the loadable extension, the
[sqlite-loadable-rs](https://github.com/asg017/sqlite-loadable-rs) crate made it *extremely* easy to
write; what I thought would take a couple days instead took a couple hours. Thank you, authors of
those crates! Feel free to steal code from me any time!

----

[^julid-package]: The Rust crate *package's*
    [name](https://gitlab.com/nebkor/julid/-/blob/2484d5156bde82a91dcc106410ed56ee0a5c1e07/Cargo.toml#L2)
    is "julid-rs"; that's the name you add to your `Cargo.toml` file, that's how it's listed on
    [crates.io](https://crates.io/crates/julid-rs), etc. The crate's *library*
    [name](https://gitlab.com/nebkor/julid/-/blob/2484d5156bde82a91dcc106410ed56ee0a5c1e07/Cargo.toml#L24)
    is just "julid"; that's how you refer to it in a `use` statement in your Rust program.

[^httm]: Remember in *Hot Tub Time Machine*, where Rob Corddry's character, "Lew", decides to stay in
    the past and use his future-knowledge to amass wealth and power, and he makes his own versions
    of things that were done in his past, like forming a glam rock band called "Mötley Lew", and a
    search engine called "Loogle", etc.?

[^counter idea]: Putting the counter bits after the timestamp bits was stolen from
    <https://github.com/ahawker/ulid/issues/306#issuecomment-451850395>, though they use only 15 bits
    for the counter, due to each character in the string encoding representing five bits, and using
    three whole characters for the counter. That gives them one more random bit than Julids, and
    lowers the number of available unique intra-millisecond IDs in the same process to 32,768.

[^monotonic]: At least, they will still have a total order if they're all generated within the same
    process in the same way; the code uses a [64-bit atomic
    integer](https://gitlab.com/nebkor/julid/-/blob/2484d5156bde82a91dcc106410ed56ee0a5c1e07/src/julid.rs#L11-12)
    to ensure that IDs generated within the same millisecond have incremented counters, but that
    atomic counter is not global; calling `Julid::new()` in Rust and `select julid_new()` in SQLite
    would be as though they were generated on different machines. I just make sure to only generate
    them inside the DB.

[^my computer]: According to the output of `lscpu`, my computer has an "AMD Ryzen 9 3900X 12-Core
    Processor", running between 2.2 and 4.6 GHz. It's no slouch!

[^twitter machine count]: There are only ten bits for the machine ID, which means there are only
    1,024 possible machine IDs; did Twitter only have a thousand machines in production? Maybe only
    a thousand at a time, so you could use the timestamp to look up what machine any given 10-bit ID
    referred to?

[marge ids]: ./marge_thinks_theyre_neat.png 'marge simpson holding a potato labeled "globally unique sortable identifiers"'