1 MOOODB Storage Engine Design

1.1 Overview
MOOODB is a Copy-on-Write, B+tree based storage engine written in C, designed prioritizing correctness, performance, and deterministic simulation testing. It will eventually be extended with SQL query planning and exeuction.
The existing Rust implementation is a functional CoW B-tree (LMDB-style). This design will extend it with an update-chain layer to amortize write amplification, a WAL for durability, and io_uring-native I/O.
const Su64 = [8]u8;
fn to_serialized(x: u64) Su64 {
return mem.toBytes(switch (builtin.cpu.arch.endian()) {
.little => @byteSwap(x),
.big => x,
});
}
fn from_serialized(x: Su64) u64 {
const x_: u64 = @bitCast(x);
return switch (builtin.cpu.arch.endian()) {
.little => @byteSwap(x_),
.big => x_,
};
}
fn f(x: anytype) @TypeOf(x) {
comptime {
switch (@typeInfo(@TypeOf(x))) {
.int, .float, .comptime_int, .comptime_float, => {},
else => @compileError("must pass int or float"),
}
}
return x + 1;
}1.2 Core Architecture
1.2.1 Two-Layer Write Design
The fundamental insight: decouple API transaction commit from CoW checkpoint.
- API transaction commit = append to WAL + insert into in-memory update chain. Fast, returns immediately to caller.
- CoW checkpoint = flush update buffer -> rewrite affected leaf pages -> atomic superblock swap. A background/triggered operation, invisible to the API caller.
The CoW B-tree becomes a read-optimized checkpoint structure that is rewritten periodically in bulk. The update buffer absorbs writes between checkpoints.
This amortizes CoW write amplification: instead of copying O(log N) pages per write, each write pays ~O(1) (WAL append + update chain prepend). Reconciliation of many writes into one leaf rewrite happens at checkpoint time.
1.2.2 Update Chains
Each key on a leaf page can have a linked list of UPDATE
nodes hanging off it, representing versions written since the last
checkpoint. Writers prepend to the chain; readers walk it from newest to
oldest to find their visible version.
New keys that don’t yet exist in the reconciled page image are held in a per-page insert skiplist - an ordered structure needed for range scan merging.
At checkpoint, both update chains and insert skiplists are collapsed into the new reconciled page image.
1.2.3 WAL
The WAL covers only the update buffer - committed API transactions that have not yet been checkpointed. It does not need to cover the B-tree itself because the CoW structure guarantees the on-disk B-tree is always consistent.
Recovery: load the last checkpointed superblock (always consistent), replay WAL forward. Clean and simple.
After a successful checkpoint, the WAL up to the checkpoint LSN can be discarded.
1.3 Memory Model
All memory is statically allocated at startup. No dynamic allocation after initialization. This is a hard constraint (TigerStyle).
1.3.1 Two Fixed Pools
Page cache - N pages of size PAGE_SIZE, allocated upfront. Pure LRU eviction. No interaction with update buffer pressure. PAGE_SIZE is a compile-time constant (4KB / 8KB / 16KB / 32KB are all reasonable; larger pages favor write amortization).
Update buffer - M bytes, allocated upfront. A single flat bump allocator shared across all dirty pages. Both update chain nodes and insert skiplist nodes are carved from this pool. When the pool is full, a checkpoint is triggered.
The two pools are completely independent. No dynamic rebalancing between them.
Cleanup after checkpoint: reset the bump pointer to zero. The entire update buffer is freed in one operation - no per-node cleanup needed.
1.3.2 Update Buffer Sizing
Required size scales with
write_rate * checkpoint_interval. If writes average 100
MB/s and checkpoints run every second, ~100 MB of update buffer is
needed. Size the page cache based on working set size. The two are
orthogonal concerns.
1.3.3 B-tree Traversal
Tree traversal uses a fixed-length (depth 64) stack-allocated array
of page references - no recursion. A static_assert should
verify that 64 levels is unreachable given PAGE_SIZE and maximum file
size.
1.4 Freelist
The freelist is a CoW B-tree pointed to by the superblock, identical in mechanism to the main data B-tree. It is only touched during checkpoint - normal API writes that only hit update chains have zero interaction with the freelist.
At checkpoint, newly freed pages (old versions of rewritten leaves) and newly allocated pages are tracked, then the superblock swap atomically updates both the data B-tree root and the freelist root together. This is unchanged from the existing Rust implementation.
Freed pages during a checkpoint are tracked as a linked list threaded through the freed page buffers themselves - no allocations needed, safe to persist even if the checkpoint aborts.
1.5 Key/Value Layout
Values are stored directly in leaf pages (clustered index). The primary key IS the B-tree. Secondary indexes store the primary key as their value.
Large values that exceed a threshold get overflow pages; the leaf cell stores a pointer to the overflow page. This keeps leaf pages from blowing up on large values.
1.6 I/O Layer
1.6.1 io_uring Native
All I/O is io_uring-based. No blocking syscalls on the critical path.
Writes (checkpoint): After computing all CoW page rewrites, batch-submit all writes to io_uring in one shot, wait for all completions, then perform the superblock swap. Simple, bulk, naturally suited to io_uring’s batching.
Reads: Submit a read to io_uring, return
PENDING to the caller, resume when the completion
arrives.
1.6.2 DST Abstraction
All I/O goes through an abstract interface - the storage engine never calls io_uring directly. Two implementations:
- Production: submits to io_uring, reaps completions
- Simulator: fake completion queue with full fault injection - arbitrary latency, reordering, partial writes, injected crashes at any point
This makes the entire engine deterministically simulatable. Crash safety and recovery - the hardest things to test - can be exhaustively explored by crashing at arbitrary I/O completion points and verifying the on-disk state is always recoverable.
The simulator controls what drain_completions() returns
and when. All non-determinism (I/O, time, randomness) is injectable
through this interface.
1.7 Concurrency & Threading
Single-threaded event loop with explicit state machines. No coroutines at the storage engine level.
Read operations return PENDING when a page is not in
cache. The caller loop:
This is trivially DST-able and requires no stack switching or scheduler complexity.
1.7.1 Cursor API
The public interface is cursor-based: - cursor_open /
cursor_close - cursor_step -> returns ROW,
PENDING, DONE, or NOT_FOUND
PENDING is the async contract. The storage engine never blocks; it always returns immediately.
1.7.2 Future SQL Layer
The SQL query executor will manage multiple cursors for joins and complex queries. Coroutines are a candidate at that layer - each coroutine drives one cursor or sub-operation, treating PENDING as its yield signal, with a scheduler multiplexing over io_uring completions.
This keeps coroutine complexity out of the storage engine entirely. The storage engine’s PENDING return is sufficient async contract for the SQL layer to build on.
1.8 Style
Follows TigerStyle. Key constraints:
- All memory statically allocated at startup
- No dynamic allocation after init
- No recursion
- Fixed upper bounds on all loops and queues
- Minimum 2 assertions per function (pre/postconditions, invariants)
- Hard limit of 70 lines per function
static_asserton compile-time constant relationships- All errors handled
- All I/O abstracted for DST