Add ADR for our choice of SQLite as our primary database,

backed up by experiments demonstrating that SQLite will meet all of our
requirements.

This also introduces ADRs in the repo, and adds a README in preparation for
making the repository public.
Nicole Tietz-Sokolskaya 2024-03-16 11:12:46 -04:00
parent 05812a521e
commit 77d4ebb371
29 changed files with 6549 additions and 1 deletion

1
.adr-dir Normal file
View File

@ -0,0 +1 @@
_docs/decisions/

4
.gitignore vendored
View File

@ -1 +1,3 @@
/target
target/
*.db
*.xml

2456
Cargo.lock generated Normal file

File diff suppressed because it is too large

View File

@ -1,3 +1,4 @@
workspace = { members = ["_experiments/2024-03-02-database-benchmark"] }
[package]
name = "pique"
version = "0.1.0"

58
README.md Normal file
View File

@ -0,0 +1,58 @@
# Pique
Pique is project management software that is a delight to use!
This project is in very early stages, so here's what you need to know:
- It's being developed by [Nicole / ntietz](https://ntietz.com/) as a side project
- It's not production ready!
- It's **not open-source** and contributions are not welcome
- It will be free to use while it's in development, but will likely transition
  to paid plans pretty quickly. I hope to always offer some free plan, but only
  if I can do it without burning my budget.
**If it's not open-source, why can you see this?** Simply because I (Nicole)
find it much better and easier to work in the open. The code is available
because there is utility in that. It has few drawbacks. If someone wants to
steal it, they can, but that's pretty illegal. Eventually it *might* wind up
open-source, or as a coop, or just as a solo dev project. I don't know, but
openness is a core value for me, so here we are.
If you want to use it, and there is not a plan available yet, just let me know.
My personal email is [me@ntietz.com](mailto:me@ntietz.com) and I can get you
set up.
## Workflow and setup
### Rust
This project uses Rust. Set up the toolchain on your local machine as per usual.
We use nightly; installing and managing the toolchain with [rustup][rustup] is
recommended.
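For example, one way to get set up (assuming you don't already have a nightly toolchain):

```bash
# Install the nightly toolchain and pin it for this repository.
rustup toolchain install nightly
rustup override set nightly
```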
### Docs
Decisions are recorded in ADRs[^adr] using a command-line tool to create and
manage them. You can install it with:
```bash
cargo install adrs
```
See the [adrs docs](https://crates.io/crates/adrs) for more information on
usage.
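Creating a new record should look roughly like this (the title here is just an
illustration; check the adrs docs for the exact flags):

```bash
adrs new "Use SQLite as the primary database"
```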
---
[^adr]: [Architecture Decision Records](https://adr.github.io/) are a
lightweight way of recording decisions made on a project.
[rustup]: https://rustup.rs/

View File

@ -0,0 +1,20 @@
# 1. Record architecture decisions
Date: 2024-03-16
## Status
Accepted
## Context
We need to record the architectural decisions made on this project.
## Decision
We will use Architecture Decision Records, as [described by Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
## Consequences
See Michael Nygard's article, linked above. For a lightweight ADR toolset, see Nat Pryce's [adr-tools](https://github.com/npryce/adr-tools).

View File

@ -0,0 +1,56 @@
# 2. Primary database choice
Date: 2024-03-16
## Status
Accepted
## Context
Pique has to store data somewhere. We're going to use a database for this, and
have to choose which one to use.
Constraints:
- Should require minimal ops
- Should support storing large-ish rows (about 64 kB)
- Should support fast random reads (page loads should be under 50 ms at p99,
  and the DB's share of that budget is a small fraction)
## Decision
We are going to use SQLite as our primary database and [SeaORM](https://github.com/SeaQL/sea-orm)
as the ORM. We will limit rows to 8 kB or smaller to have performance margin.
This decision was made using an [experiment](../../_experiments/2024-03-02-database-benchmark),
which found that:
- The ops burden for MariaDB would be unsuitably high, requiring work to get
  it set up for our size of data and some work for performance tuning
- PostgreSQL cannot meet our performance requirements on larger documents
- SQLite can meet our performance requirements on documents up to 64 kB, and
  possibly larger.
These experiments were done with memory constraints on both SQLite and Postgres,
with SQLite having about 10x faster random reads.
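For reference, a minimal sketch of what opening the database through SeaORM could
look like, using the same `sqlite:...?mode=rwc` URL style as the experiment's
`.env.sqlite` (the path and options here are illustrative, not settled
configuration):

```rust
use sea_orm::{ConnectOptions, Database, DatabaseConnection, DbErr};
use std::time::Duration;

// Illustrative only: the real database path and connection options are not decided yet.
async fn connect() -> Result<DatabaseConnection, DbErr> {
    // `mode=rwc` creates the database file if it does not exist yet.
    let mut opts = ConnectOptions::new("sqlite:./pique.db?mode=rwc");
    opts.connect_timeout(Duration::from_secs(2));
    Database::connect(opts).await
}
```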
## Consequences
This has a few consequences for Pique.
First, it means that **we will be limited to single-node hosting** unless we
implement read replication using something like [litestream](https://litestream.io/).
This is acceptable given our focus on smaller organizations, and we can shard
the application if we need to.
Second, it means that **self-hosting is more feasible**. We can more easily offer
backup downloads from within the app itself, leveraging SQLite's features for
generating a backup, and we can have everything run inside one executable with
data stored on the disk. Not requiring a separate DB process makes the hosting
story simpler.
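As a sketch of the backup point: one SQLite feature we could lean on is
`VACUUM INTO`, which writes a consistent, compacted copy of the live database to
a new file. The helper name and path handling below are hypothetical; only the
`VACUUM INTO` statement itself is SQLite's.

```rust
use sea_orm::{ConnectionTrait, DatabaseConnection, DbErr};

// Hypothetical helper: write a consistent snapshot of the live database to `path`.
async fn backup_to(db: &DatabaseConnection, path: &str) -> Result<(), DbErr> {
    // Escape single quotes so the path can be embedded in the SQL literal.
    let stmt = format!("VACUUM INTO '{}'", path.replace('\'', "''"));
    db.execute_unprepared(&stmt).await?;
    Ok(())
}
```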

View File

@ -0,0 +1,2 @@
DATABASE_URL=postgresql://postgres:password@localhost/postgres
#DATABASE_URL=sqlite:./database.db?mode=rwc

View File

@ -0,0 +1 @@
DATABASE_URL=postgresql://postgres:password@localhost/postgres

View File

@ -0,0 +1 @@
DATABASE_URL=sqlite:./database.db?mode=rwc

File diff suppressed because it is too large

View File

@ -0,0 +1,28 @@
[package]
name = "bench"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow = "1.0.80"
chrono = { version = "0.4.35", features = ["now"] }
criterion = { version = "0.5.1", features = ["async", "async_tokio", "async_futures", "async_std"] }
dotenvy = "0.15.7"
entity = { version = "0.1.0", path = "entity" }
env_logger = "0.11.3"
futures = "0.3.30"
log = "0.4.21"
migration = { version = "0.1.0", path = "migration" }
rand = "0.8.5"
sea-orm = { version = "0.12.14", features = ["sqlx-mysql", "sqlx-sqlite", "sqlx-postgres", "macros", "runtime-async-std-rustls"] }
serde = { version = "1.0.197", features = ["derive"] }
tokio = { version = "1.36.0", features = ["full", "rt"] }
[workspace]
members = [".", "entity", "migration"]
[[bench]]
name = "db"
harness = false

View File

@ -0,0 +1,9 @@
entity: FORCE
	sea-orm-cli generate entity -o entity/src/ -l

migrate: FORCE
	sea-orm-cli migrate up

FORCE:

View File

@ -0,0 +1,133 @@
use bench::data::random_entities;
use criterion::async_executor::{AsyncExecutor, FuturesExecutor};
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use rand::distributions::{Distribution, Uniform};
use std::sync::Arc;
use std::time::Duration;
use entity::prelude::*;
use migration::Migrator;
use migration::MigratorTrait;
use sea_orm::ConnectOptions;
use sea_orm::Database;
use sea_orm::{prelude::*, Condition};
// Fetch a handful of rows at uniformly random IDs to simulate random page reads.
async fn load_row(db: &DatabaseConnection, count: &i32) {
let mut rng = rand::thread_rng();
let ids: Vec<i32> = Uniform::new(0, *count)
.sample_iter(&mut rng)
.take(5)
.collect();
let _ = Page::find()
.filter(Condition::all().add(entity::page::Column::Id.is_in(ids)))
.all(db)
.await
.unwrap();
//let _ = Page::find_by_id(id).one(db).await.unwrap().unwrap();
}
async fn setup_db(
db_url: &str,
dsize: usize,
dcount: usize,
) -> anyhow::Result<Arc<DatabaseConnection>> {
let mut opts = ConnectOptions::new(db_url);
opts.connect_timeout(Duration::from_secs(2));
opts.max_connections(50);
let db = Database::connect(opts).await?;
Migrator::reset(&db).await?;
Migrator::refresh(&db).await?;
match db.get_database_backend() {
sea_orm::DatabaseBackend::MySql => {
let _ = db
.execute(sea_orm::Statement::from_string(
db.get_database_backend(),
"SET GLOBAL max_allowed_packet=1073741824;",
))
.await?;
}
sea_orm::DatabaseBackend::Postgres => {}
sea_orm::DatabaseBackend::Sqlite => {}
};
// Insert in chunks so a single insert statement stays well under the databases' size limits.
let max_per_chunk = 32 * MB;
let num_chunks = (dsize * dcount) / max_per_chunk;
let pages_per_chunk = std::cmp::min(dcount / num_chunks, 5000);
let pages = random_entities(dcount, dsize);
for chunk in pages.chunks(pages_per_chunk) {
let _ = Page::insert_many(chunk.to_vec()).exec(&db).await?;
}
Ok(Arc::new(db))
}
const SQLITE_URL: &str = "sqlite:./database.db?mode=rwc";
const POSTGRES_URL: &str = "postgresql://postgres:password@localhost/postgres";
static KB: usize = 1024;
static MB: usize = 1024 * KB;
static GB: usize = 1024 * MB;
fn load_from_sqlite(c: &mut Criterion) {
let mut group = c.benchmark_group("sqlite");
//for document_size in [KB, 8 * KB, 64 * KB, 512 * KB, 4 * MB, 32 * MB] {
for document_size in [8 * KB, 64 * KB] {
let document_count = 3 * GB / document_size;
println!(
"attempting {} documents of size {}",
document_count, document_size
);
let db = FuturesExecutor
.block_on(setup_db(SQLITE_URL, document_size, document_count))
.unwrap();
println!("db setup, about to abuse it");
FuturesExecutor.block_on(async {
// Cap SQLite's heap at 1 GiB to roughly match the memory limit given to the server databases.
let res = db.execute_unprepared("PRAGMA hard_heap_limit = 1073741824").await.unwrap();
println!("{:?}", res);
});
group.throughput(Throughput::Bytes(document_size as u64));
group.bench_with_input(
BenchmarkId::from_parameter(document_size),
&(db, document_size, document_count as i32),
|b, (db, _size, count)| {
b.to_async(FuturesExecutor).iter(|| async {
load_row(&db, count).await;
});
},
);
}
group.finish();
}
fn load_from_postgres(c: &mut Criterion) {
let mut group = c.benchmark_group("postgres");
//for document_size in [KB, 8 * KB, 64 * KB, 512 * KB, 4 * MB, 32 * MB] {
for document_size in [8 * KB, 64 * KB] {
let document_count = 3 * GB / document_size;
let db = FuturesExecutor
.block_on(setup_db(POSTGRES_URL, document_size, document_count))
.unwrap();
group.throughput(Throughput::Bytes(document_size as u64));
group.bench_with_input(
BenchmarkId::from_parameter(document_size),
&(db, document_size, document_count as i32),
|b, (db, _size, count)| {
b.to_async(FuturesExecutor).iter(|| async {
load_row(db, count).await;
});
},
);
}
group.finish();
}
criterion_group!(benches, load_from_postgres, load_from_sqlite,);
criterion_main!(benches);

View File

@ -0,0 +1,15 @@
[package]
name = "entity"
version = "0.1.0"
edition = "2021"
[lib]
name = "entity"
path = "src/lib.rs"
[dependencies.sea-orm]
version = "0.12.0"
features = [
"runtime-tokio-rustls",
"sqlx-sqlite",
]

View File

@ -0,0 +1,5 @@
//! `SeaORM` Entity. Generated by sea-orm-codegen 0.12.14
pub mod prelude;
pub mod page;

View File

@ -0,0 +1,5 @@
//! `SeaORM` Entity. Generated by sea-orm-codegen 0.12.14
pub mod prelude;
pub mod page;

View File

@ -0,0 +1,18 @@
//! `SeaORM` Entity. Generated by sea-orm-codegen 0.12.14
use sea_orm::entity::prelude::*;
#[derive(Clone, Debug, PartialEq, DeriveEntityModel, Eq)]
#[sea_orm(table_name = "page")]
pub struct Model {
#[sea_orm(primary_key)]
pub id: i32,
pub external_id: i64,
pub title: String,
pub text: String,
}
#[derive(Copy, Clone, Debug, EnumIter, DeriveRelation)]
pub enum Relation {}
impl ActiveModelBehavior for ActiveModel {}

View File

@ -0,0 +1,3 @@
//! `SeaORM` Entity. Generated by sea-orm-codegen 0.12.14
pub use super::page::Entity as Page;

View File

@ -0,0 +1,24 @@
[package]
name = "migration"
version = "0.1.0"
edition = "2021"
publish = false
[lib]
name = "migration"
path = "src/lib.rs"
[dependencies]
async-std = { version = "1", features = ["attributes", "tokio1"] }
[dependencies.sea-orm-migration]
version = "0.12.0"
features = [
# Enable at least one `ASYNC_RUNTIME` and `DATABASE_DRIVER` feature if you want to run migration via CLI.
# View the list of supported features at https://www.sea-ql.org/SeaORM/docs/install-and-config/database-and-async-runtime.
# e.g.
"runtime-tokio-rustls", # `ASYNC_RUNTIME` feature
"sqlx-sqlite", # `DATABASE_DRIVER` feature
"sqlx-postgres", # `DATABASE_DRIVER` feature
"sqlx-mysql", # `DATABASE_DRIVER` feature
]

View File

@ -0,0 +1,41 @@
# Running Migrator CLI
- Generate a new migration file
```sh
cargo run -- generate MIGRATION_NAME
```
- Apply all pending migrations
```sh
cargo run
```
```sh
cargo run -- up
```
- Apply first 10 pending migrations
```sh
cargo run -- up -n 10
```
- Rollback last applied migrations
```sh
cargo run -- down
```
- Rollback last 10 applied migrations
```sh
cargo run -- down -n 10
```
- Drop all tables from the database, then reapply all migrations
```sh
cargo run -- fresh
```
- Rollback all applied migrations, then reapply all migrations
```sh
cargo run -- refresh
```
- Rollback all applied migrations
```sh
cargo run -- reset
```
- Check the status of all migrations
```sh
cargo run -- status
```

View File

@ -0,0 +1,12 @@
pub use sea_orm_migration::prelude::*;
mod m20240307_110706_create_tables;
pub struct Migrator;
#[async_trait::async_trait]
impl MigratorTrait for Migrator {
fn migrations() -> Vec<Box<dyn MigrationTrait>> {
vec![Box::new(m20240307_110706_create_tables::Migration)]
}
}

View File

@ -0,0 +1,54 @@
use std::fmt;
use sea_orm_migration::prelude::*;
#[derive(DeriveMigrationName)]
pub struct Migration;
#[async_trait::async_trait]
impl MigrationTrait for Migration {
async fn up(&self, manager: &SchemaManager) -> Result<(), DbErr> {
manager
.create_table(
Table::create()
.table(Page::Table)
.if_not_exists()
.col(
ColumnDef::new(Page::Id)
.integer()
.not_null()
.auto_increment()
.primary_key(),
)
.col(ColumnDef::new(Page::ExternalId).big_integer().not_null())
.col(ColumnDef::new(Page::Title).string().not_null())
.col(ColumnDef::new(Page::Text).string().not_null())
//.col(ColumnDef::new(Page::Text).custom(LongText).not_null())
.to_owned(),
)
.await
}
async fn down(&self, manager: &SchemaManager) -> Result<(), DbErr> {
manager
.drop_table(Table::drop().table(Page::Table).to_owned())
.await
}
}
pub struct LongText;
impl Iden for LongText {
fn unquoted(&self, s: &mut dyn fmt::Write) {
s.write_str("LongText").unwrap();
}
}
#[derive(DeriveIden)]
enum Page {
Table,
Id,
ExternalId,
Title,
Text,
}

View File

@ -0,0 +1,6 @@
use sea_orm_migration::prelude::*;
#[async_std::main]
async fn main() {
cli::run_cli(migration::Migrator).await;
}

View File

@ -0,0 +1,38 @@
# Overview
The goal of this experiment is to determine which database to use for Pique.
Normally, we can just go with a tried-and-true option, like PostgreSQL. However,
a few factors are working against us here:
- The goal is for every page load to take under 50 ms on the server side
- PostgreSQL uses different (compressed) storage for values over 2 kB, which can
  lead to much slower reads if those reads go to disk.
- We'll be storing documents up to 1 MB each in the database. This is just text
content and does *not* include resources like images.
This combination may make Postgres an unsuitable choice! I have seen it prove
slow in the past: at a previous job, someone put large JSON blobs (multiple kB)
into a column, and queries involving those columns took over 100 ms. I don't
know how much of that was a Postgres limitation and how much was the particular
schema and hardware we had, so I want to find out!
# Experiment design
I'm going to run a benchmark on three databases: Postgres, MariaDB, and SQLite.
Each run will start by loading a batch of text documents into the database,
then we will do some random reads and measure the time of each read. Memory and
CPU limits will be set on the non-embedded databases.
The text documents will be generated randomly, with size and count chosen to
approximate the amount of data Pique will probably accumulate after a few years.
The experiment is not particularly valid if it only covers a year's worth of data.
To sample, we will pick random IDs in the range (0, count), since the databases
we have chosen assign IDs in monotonically increasing order.
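The benchmark in this commit samples reads roughly like the sketch below (the
imports and the wrapper function are added here for illustration; the query
itself mirrors the Criterion bench):

```rust
use entity::prelude::Page;
use rand::distributions::{Distribution, Uniform};
use sea_orm::{prelude::*, Condition};

// Fetch a handful of rows at uniformly random IDs in [0, count),
// mirroring how the Criterion bench samples random reads.
async fn sample_reads(db: &DatabaseConnection, count: i32) -> Result<(), DbErr> {
    let mut rng = rand::thread_rng();
    let ids: Vec<i32> = Uniform::new(0, count).sample_iter(&mut rng).take(5).collect();
    let _pages = Page::find()
        .filter(Condition::all().add(entity::page::Column::Id.is_in(ids)))
        .all(db)
        .await?;
    Ok(())
}
```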
# Results

View File

@ -0,0 +1,84 @@
use bench::data::random_entities;
use env_logger::{Builder, Env};
use log::info;
use migration::{Migrator, MigratorTrait};
use rand::prelude::*;
use sea_orm::prelude::*;
use sea_orm::sea_query::{Func, SimpleExpr};
use sea_orm::ConnectOptions;
use sea_orm::{Database, QuerySelect};
use entity::prelude::*;
async fn run() -> Result<(), anyhow::Error> {
dotenvy::dotenv()?;
let db_url = std::env::var("DATABASE_URL")?;
info!("starting db");
let opts = ConnectOptions::new(db_url);
let db = Database::connect(opts).await?;
Migrator::refresh(&db).await?;
let db = &db;
info!("connected to db");
info!("starting data load");
let pages = random_entities(1000, 1_000_000);
info!("finished data load");
info!("starting db insert");
for chunk in pages.chunks(5000) {
let _ = Page::insert_many(chunk.to_vec()).exec(db).await?;
}
info!("finished db insert");
let length_expr: SimpleExpr = Func::char_length(Expr::col((
entity::page::Entity,
entity::page::Column::Text,
)))
.into();
info!("fetching big row count");
let mut large_row_ids: Vec<i32> = entity::page::Entity::find()
.filter(length_expr.binary(migration::BinOper::GreaterThan, Expr::val(8 * 1024)))
.column(entity::page::Column::Id)
.into_tuple()
.all(db)
.await?;
info!("counted {} big rows", large_row_ids.len());
let num_rows = Page::find().count(db).await?;
info!("inserted {} rows", num_rows);
let mut rng = thread_rng();
large_row_ids.shuffle(&mut rng);
info!("starting");
let mut bytes_read = 0;
for id in large_row_ids.iter().take(1000) {
let row = Page::find_by_id(*id).one(db).await?.unwrap();
bytes_read += row.text.len();
}
println!("read {} bytes", bytes_read);
info!("done");
Ok(())
}
#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
init_logger();
run().await?;
Ok(())
}
fn init_logger() {
let env = Env::default().filter_or("BENCH_LOG_LEVEL", "info,sqlx=error");
Builder::from_env(env).format_timestamp_millis().init();
}

View File

@ -0,0 +1,26 @@
use rand::{
distributions::{Alphanumeric, DistString},
thread_rng,
};
use sea_orm::ActiveValue;
pub fn random_entities(count: usize, text_length: usize) -> Vec<entity::page::ActiveModel> {
let mut pages = vec![];
let mut rng = thread_rng();
for idx in 0..count {
let _id = idx as i32;
let title = "dummy_title";
let text = Alphanumeric.sample_string(&mut rng, text_length);
pages.push(entity::page::ActiveModel {
external_id: ActiveValue::Set(1),
title: ActiveValue::Set(title.to_owned()),
text: ActiveValue::Set(text),
..Default::default()
});
}
pages
}

View File

@ -0,0 +1 @@
pub mod data;

View File

@ -0,0 +1,3 @@
#!/bin/bash
# Run Postgres in a container, capped at 2 CPUs and 1 GB of memory for the benchmark.
podman run --name postgres --network=host --cpus 2 --memory 1g -e POSTGRES_PASSWORD=password -d postgres