Joe Ardent 7955a3aa69 checkpoint

2023-07-14 18:10:48 -07:00

12 KiB


+++
title = "A One-Part Serialized Mystery, Part 2: The Benchmarks"
slug = "one-part-serialized-mystery-part-2"
date = "2023-07-09"

[taxonomies]
tags = ["software", "rnd", "proclamation", "upscm", "rust", "macros"]
+++

A one-part serial mystery post-hoc prequel

I wrote recently about switching the types of the primary keys in the database for an in-progress web app I'm building. At that time, I'd not yet done any benchmarking, but had reason to believe that using sortable primary keys would yield some possibly-significant gains in performance, in both time and space. I'd also read accounts of regret that databases had not used ULIDs (instead of UUIDs) from the get-go, so I decided it couldn't hurt to switch to them before I had any actual data in my DB.
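For a concrete sense of what "sortable" means here: per the ULID spec, the binary form is a 48-bit millisecond timestamp followed by 80 bits of randomness, most significant byte first. The sketch below builds such a value by hand; the function name `ulid_bytes` is made up for illustration, and the caller supplies the random bytes so that no RNG crate is needed. It is not the code from my app.

```rust
/// Build the 16-byte binary form of a ULID: a 48-bit big-endian millisecond
/// timestamp followed by 80 bits of randomness.
/// `ulid_bytes` is a hypothetical helper; `random` is passed in by the caller.
fn ulid_bytes(timestamp_ms: u64, random: [u8; 10]) -> [u8; 16] {
    let mut out = [0u8; 16];
    // Keep only the low 48 bits of the timestamp, most significant byte first.
    out[..6].copy_from_slice(&timestamp_ms.to_be_bytes()[2..]);
    // The remaining 10 bytes are the random part.
    out[6..].copy_from_slice(&random);
    out
}
```

Because the timestamp occupies the leading bytes, a plain byte-wise comparison of two such IDs orders them by creation time whenever their timestamps differ, which is the property a random UUID lacks.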

And that was correct: it didn't hurt performance, but it also didn't help much either. I've spent a bunch of time now doing comparative benchmarks between ULIDs and UUIDs, and as I explain below, the anticipated space savings did not materialize, and the speed-up is merely augmenting what was already more than fast enough into slightly more faster than that. Of course, of course, and as always, the real treasure was the friends we made along the way etc., etc. So come along on a brief journey of discovery!

Bottom Line Up Front

With sqlite and my final table schema, the size and speed differences are negligible.

TODO MOR STUFFF

However, with my initial database layout and import code, ULIDs resulted in about 5% less space and took only about 2/3rds as much time as when using UUIDs (5.7 vs 9.8 seconds). The same space and time results held whether or not without rowid was specified on table creation, which was counter to expectation, though I now understand why; I'll explain at the end.

It's a setup

My benchmark is pretty simple: starting from an empty database, do the following things:

insert 10,000 randomly chosen movies (title and year of release, from between 1965 and 2023) into the database
create 1,000 random users¹
for each user, randomly select around 100 movies from the 10,000 available and put them on their list of things to watch

Only that last part is significant, and is where I got my timing information from.
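The shape of that timed step can be sketched roughly as follows. This is a stand-in, not the actual import code: `run_watchlist_step`, the `XorShift64` generator, and the `insert` callback are all hypothetical, and the callback takes the place of the real database write. It also picks exactly `per_user` movies rather than "around 100", and may pick the same movie twice.

```rust
use std::time::Instant;

/// Tiny xorshift64 generator so the sketch has no dependencies.
/// (Hypothetical; a real import would more likely use a proper RNG crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// For each of `users` users, pick `per_user` movie indexes out of `movies`
/// and hand each (user, movie) pair to `insert`. Returns the number of pairs
/// and the elapsed seconds for this step only. `movies` must be non-zero.
fn run_watchlist_step<F: FnMut(usize, usize)>(
    users: usize,
    movies: usize,
    per_user: usize,
    mut insert: F,
) -> (usize, f64) {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let start = Instant::now();
    let mut count = 0;
    for user in 0..users {
        for _ in 0..per_user {
            // Duplicates are possible; a real import would need to handle them.
            let movie = (rng.next_u64() % movies as u64) as usize;
            insert(user, movie);
            count += 1;
        }
    }
    (count, start.elapsed().as_secs_f64())
}
```

Calling it as `run_watchlist_step(1_000, 10_000, 100, |user, movie| { /* write a row */ })` mirrors the numbers above: only the work inside the loop is timed, not the movie or user setup.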

The table that keeps track of what users want to watch was defined² like this:

create table if not exists witch_watch (
  id blob not null primary key,
  witch blob not null, -- "user"
  watch blob not null, -- "thing to watch"
  [...]
  foreign key (witch) references witches (id) on delete cascade on update no action,
  foreign key (watch) references watches (id) on delete cascade on update no action
);
[...]
create index if not exists ww_witch_dex on witch_watch (witch);
create index if not exists ww_watch_dex on witch_watch (watch);

The kinds of queries I'm trying to optimize with those indices are "what movies does a certain user want to watch?" and "what users want to watch a certain movie?". The IDs are 16-byte blobs; an entire row in the table is less than 100 bytes.

A digression on SQLite and performance

I've mentioned once or twice before that I'm using SQLite for this project. Any time I need a database, my first reach is for SQLite:

the database is a single file, along with a couple temp files that live alongside it, simplifying management
there's no network involved between the client and the database; a connection to the database is a pointer to an object that lives in the same process as the host program; this means that read queries return data back in just a few microseconds
it scales vertically extremely well; it can handle database sizes of many terabytes
it's one of the most widely-installed pieces of software in the world; there's at least one sqlite database on every smartphone, and there's a robust ecosystem of useful extensions and other bits of complementary code freely available

And, it's extremely performant. When using the WAL journal mode and the recommended durability setting for WAL mode, along with all other production-appropriate settings, I got almost 20,000 writes per second³. There were multiple concurrent writers, and each write was a transaction that inserted about 100 rows at a time. I had retry logic in case a transaction failed due to the DB being locked by another writer, but that never happened: each write was just too fast.
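The retry logic mentioned above might look something like this minimal sketch. It is generic on purpose: `retry_while_busy`, the `is_busy` predicate, and the linear back-off are assumptions for illustration, not the code I actually ran and not the API of any particular SQLite crate.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` until it succeeds, retrying only on errors that `is_busy` accepts,
/// for at most `max_attempts` tries in total. Any other error is returned
/// immediately. (Hypothetical helper, not tied to a specific database crate.)
fn retry_while_busy<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    is_busy: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_busy(&err) => {
                // Back off a little longer after each failed attempt.
                sleep(Duration::from_millis(5 * u64::from(attempt)));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

Here `op` would wrap one whole write transaction, and `is_busy` would check whatever "database is locked" error the driver in use reports.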

Over-indexing on sortability

The reason I had hoped that ULIDs would help with keeping the sizes of the indexes down was the possibility of using clustered indexes. To paraphrase that link:

In an ordinary SQLite table, the PRIMARY KEY is really just a UNIQUE index. The key used to look up records on disk is the rowid. [...]any other kind of PRIMARY KEYs, including "INT PRIMARY KEY" are just unique indexes in an ordinary rowid table.

...

Consider querying this table to find the number of occurrences of the word "xsync": SELECT cnt FROM wordcount WHERE word='xsync';

This query first has to search the index B-Tree looking for any entry that contains the matching value for "word". When an entry is found in the index, the rowid is extracted and used to search the main table. Then the "cnt" value is read out of the main table and returned. Hence, two separate binary searches are required to fulfill the request.

A WITHOUT ROWID table uses a different data design for the equivalent table. [in those tables], there is only a single B-Tree... Because there is only a single B-Tree, the text of the "word" column is only stored once in the database. Furthermore, querying the "cnt" value for a specific "word" only involves a single binary search into the main B-Tree, since the "cnt" value can be retrieved directly from the record found by that first search and without the need to do a second binary search on the rowid.

Thus, in some cases, a WITHOUT ROWID table can use about half the amount of disk space and can operate nearly twice as fast. Of course, in a real-world schema, there will typically be secondary indices and/or UNIQUE constraints, and the situation is more complicated. But even then, there can often be space and performance advantages to using WITHOUT ROWID on tables that have non-integer or composite PRIMARY KEYs.

sorry what was that about secondary indices i didn't quite catch that

HALF the disk space and TWICE as fast?? Yes, sign me up, please!

Sorry, the best I can do is all the disk space

There are some guidelines about when to use without rowid:

The WITHOUT ROWID optimization is likely to be helpful for tables that have non-integer or composite (multi-column) PRIMARY KEYs and that do not store large strings or BLOBs.

[...]

WITHOUT ROWID tables work best when individual rows are not too large. A good rule-of-thumb is that the average size of a single row in a WITHOUT ROWID table should be less than about 1/20th the size of a database page. That means that rows should not contain more than ... about 200 bytes each for 4KiB page size.
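That rule of thumb is simple enough to write down as a check. The function name below is invented for illustration, and the "1/20th" threshold is only the approximate guideline from the quoted docs, not a hard limit.

```rust
/// Rule of thumb from the SQLite docs quoted above: the average row should be
/// smaller than about 1/20th of a database page.
/// (`within_without_rowid_guideline` is a hypothetical name.)
fn within_without_rowid_guideline(avg_row_bytes: u32, page_size_bytes: u32) -> bool {
    avg_row_bytes * 20 < page_size_bytes
}
```

With a 4 KiB page, `within_without_rowid_guideline(100, 4096)` is true, since 100 × 20 = 2000 is well under 4096; the cutoff sits at just over 200 bytes per row.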

As I mentioned, each row in that table was less than 100 bytes, so comfortably within the given heuristic. In order to test this out, all I had to do was change the table creation statement to:

create table if not exists witch_watch (
  id blob not null primary key,
  witch blob not null, -- "user"
  watch blob not null, -- "thing to watch"
  [...]
  foreign key (witch) references witches (id) on delete cascade on update no action,
  foreign key (watch) references watches (id) on delete cascade on update no action
) without rowid;

So I did.

Imagine my surprise when it took nearly 20% longer to run, and the total size on disk was nearly 5% larger. Using random UUIDs was even slower, so there's still a relative speed win for ULIDs, but it was still an overall loss to go without the rowid. Maybe it was time to think outside the box?

Schema pruning

I had several goals with this whole benchmarking endeavor. One, of course, was to get data on ULIDs vs. UUIDs in terms of performance, at the very least so that I could write about it when I publicly said I would. But another, and actually-more-important goal, was to optimize the design of my database and software, especially as it came to size on disk (my most-potentially-scarce computing resource; network and CPU are not problems until you get very large, and you would have long ago bottlenecked on disk size if you weren't careful).

So it was Cool and Fine to take advantage of the new capabilities that ULIDs offered if those new capabilities resulted in better resource use. Every table in my original, UUID-based schema had had a created_at column, stored as a 64-bit signed offset from the UNIX epoch. Because ULIDs encode their creation time, I could remove that column from every table that used ULIDs as their primary key. Doing so dropped the overall DB size by 5-10% compared to UUID-based tables with a created_at column.
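Getting the creation time back out of a ULID is just a matter of reading its first six bytes. A minimal sketch, assuming the ID is stored as the standard 16-byte binary ULID (48-bit big-endian millisecond timestamp first); `ulid_created_at_ms` is a made-up name, not a function from my codebase or from a ULID crate:

```rust
/// Recover the creation time, in milliseconds since the UNIX epoch, from the
/// first 6 bytes (48 bits, big-endian) of a 16-byte binary ULID.
/// (Hypothetical helper for illustration.)
fn ulid_created_at_ms(id: &[u8; 16]) -> u64 {
    let mut buf = [0u8; 8];
    // Left-pad the 48-bit timestamp with two zero bytes to fill a u64.
    buf[2..].copy_from_slice(&id[..6]);
    u64::from_be_bytes(buf)
}
```

Note the unit: this yields milliseconds, so it would need converting if the old created_at column held seconds.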

But I also realized that for the watch_quests table, no explicit

At last, I've reached my final form

In the course of writing this post, I had a minor epiphany: the reason for the regressed performance when using without rowid was that the secondary indices needed to point to the entries in the table, using the table's primary key as the target. So when there was a ULID or UUID primary key, each index entry looked like, e.g., this:

16-byte blob -> 16-byte blob

The left side is, e.g., a user ID, and the right side is the ID of a row in the quests table.
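To put rough numbers on that, here is a back-of-the-envelope sketch. It counts only the raw payload of each index entry and ignores B-tree page and record-header overhead, so treat it as a lower bound; the function name and the 100,000-entry figure are made up for illustration, and 8 bytes is used as a worst-case size for an integer rowid reference.

```rust
/// Raw payload bytes for a secondary index: each entry stores the indexed
/// column plus a reference back to the table row. B-tree page and record
/// header overhead is deliberately ignored. (Hypothetical helper.)
fn index_payload_bytes(entries: u64, key_bytes: u64, row_ref_bytes: u64) -> u64 {
    entries * (key_bytes + row_ref_bytes)
}
```

For 100,000 entries, `index_payload_bytes(100_000, 16, 16)` gives 3,200,000 bytes when each entry points at a 16-byte primary key, versus 2,400,000 bytes for `index_payload_bytes(100_000, 16, 8)` with an integer rowid, and that difference is paid once per secondary index.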

Using the implicit rowid with ULIDs:

*** Indices of table WATCH_QUESTS *********************************************

Percentage of total database......................  43.3%
Number of entries................................. 199296
Average fanout.................................... 106.00

$ cargo run --release --bin import_users -- -d ~/movies.db -u 2000 -m 200
[...]
Added 398119 quests in 20.818506 seconds

20k writes/second, baby

Size on disk is about 75% of the previous size (13M vs. 17M).
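Both of those figures are plain division over the numbers printed above; the helper names here are invented for illustration.

```rust
/// Items processed per second, e.g. quests added during the import.
/// (Hypothetical helper.)
fn per_second(count: u64, seconds: f64) -> f64 {
    count as f64 / seconds
}

/// New size as a fraction of the old size. (Hypothetical helper.)
fn size_ratio(new_size: f64, old_size: f64) -> f64 {
    new_size / old_size
}
```

`per_second(398_119, 20.818506)` comes out a little over 19,100, which I'm rounding up to "20k", and `size_ratio(13.0, 17.0)` is a bit over 0.76.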

¹ I did the classic "open /usr/share/dict/words and randomly select a couple things to stick together" method of username generation, which results in gems like "Hershey_motivations84" and "italicizes_creaminesss54". This is old-skool generative AI.
² The original schema was defined some time ago, and it took me a while to get to the point where I was actually writing code that used it. In the course of doing the benchmarks, and even in the course of writing this post, I've made changes in response to things I learned from the benchmarks and to things I realized by thinking more about it and reading more docs.
³ old job python 100 reqs/sec fall down
