1 MOOODB Storage Engine Design

1.1 Overview

MOOODB is a Copy-on-Write, B+tree based storage engine written in C, designed prioritizing correctness, performance, and deterministic simulation testing. It will eventually be extended with SQL query planning and exeuction.

The existing Rust implementation is a functional CoW B-tree (LMDB-style). This design will extend it with an update-chain layer to amortize write amplification, a WAL for durability, and io_uring-native I/O.

const Su64 = [8]u8;

fn to_serialized(x: u64) Su64 {
    return mem.toBytes(switch (builtin.cpu.arch.endian()) {
        .little => @byteSwap(x),
        .big    => x,
    });
}

fn from_serialized(x: Su64) u64 {
    const x_: u64 = @bitCast(x);
    return switch (builtin.cpu.arch.endian()) {
        .little => @byteSwap(x_),
        .big    => x_,
    };
}

fn f(x: anytype) @TypeOf(x) {
    comptime {
        switch (@typeInfo(@TypeOf(x))) {
            .int, .float, .comptime_int, .comptime_float,  => {},
            else => @compileError("must pass int or float"),
        }
    }

    return x + 1;
}

1.2 Core Architecture

1.2.1 Two-Layer Write Design

The fundamental insight: decouple API transaction commit from CoW checkpoint.

The CoW B-tree becomes a read-optimized checkpoint structure that is rewritten periodically in bulk. The update buffer absorbs writes between checkpoints.

This amortizes CoW write amplification: instead of copying O(log N) pages per write, each write pays ~O(1) (WAL append + update chain prepend). Reconciliation of many writes into one leaf rewrite happens at checkpoint time.

1.2.2 Update Chains

Each key on a leaf page can have a linked list of UPDATE nodes hanging off it, representing versions written since the last checkpoint. Writers prepend to the chain; readers walk it from newest to oldest to find their visible version.

New keys that don’t yet exist in the reconciled page image are held in a per-page insert skiplist - an ordered structure needed for range scan merging.

At checkpoint, both update chains and insert skiplists are collapsed into the new reconciled page image.

1.2.3 WAL

The WAL covers only the update buffer - committed API transactions that have not yet been checkpointed. It does not need to cover the B-tree itself because the CoW structure guarantees the on-disk B-tree is always consistent.

Recovery: load the last checkpointed superblock (always consistent), replay WAL forward. Clean and simple.

After a successful checkpoint, the WAL up to the checkpoint LSN can be discarded.

1.3 Memory Model

All memory is statically allocated at startup. No dynamic allocation after initialization. This is a hard constraint (TigerStyle).

1.3.1 Two Fixed Pools

Page cache - N pages of size PAGE_SIZE, allocated upfront. Pure LRU eviction. No interaction with update buffer pressure. PAGE_SIZE is a compile-time constant (4KB / 8KB / 16KB / 32KB are all reasonable; larger pages favor write amortization).

Update buffer - M bytes, allocated upfront. A single flat bump allocator shared across all dirty pages. Both update chain nodes and insert skiplist nodes are carved from this pool. When the pool is full, a checkpoint is triggered.

The two pools are completely independent. No dynamic rebalancing between them.

Cleanup after checkpoint: reset the bump pointer to zero. The entire update buffer is freed in one operation - no per-node cleanup needed.

1.3.2 Update Buffer Sizing

Required size scales with write_rate * checkpoint_interval. If writes average 100 MB/s and checkpoints run every second, ~100 MB of update buffer is needed. Size the page cache based on working set size. The two are orthogonal concerns.

1.3.3 B-tree Traversal

Tree traversal uses a fixed-length (depth 64) stack-allocated array of page references - no recursion. A static_assert should verify that 64 levels is unreachable given PAGE_SIZE and maximum file size.

1.4 Freelist

The freelist is a CoW B-tree pointed to by the superblock, identical in mechanism to the main data B-tree. It is only touched during checkpoint - normal API writes that only hit update chains have zero interaction with the freelist.

At checkpoint, newly freed pages (old versions of rewritten leaves) and newly allocated pages are tracked, then the superblock swap atomically updates both the data B-tree root and the freelist root together. This is unchanged from the existing Rust implementation.

Freed pages during a checkpoint are tracked as a linked list threaded through the freed page buffers themselves - no allocations needed, safe to persist even if the checkpoint aborts.

1.5 Key/Value Layout

Values are stored directly in leaf pages (clustered index). The primary key IS the B-tree. Secondary indexes store the primary key as their value.

Large values that exceed a threshold get overflow pages; the leaf cell stores a pointer to the overflow page. This keeps leaf pages from blowing up on large values.

1.6 I/O Layer

1.6.1 io_uring Native

All I/O is io_uring-based. No blocking syscalls on the critical path.

Writes (checkpoint): After computing all CoW page rewrites, batch-submit all writes to io_uring in one shot, wait for all completions, then perform the superblock swap. Simple, bulk, naturally suited to io_uring’s batching.

Reads: Submit a read to io_uring, return PENDING to the caller, resume when the completion arrives.

1.6.2 DST Abstraction

All I/O goes through an abstract interface - the storage engine never calls io_uring directly. Two implementations:

This makes the entire engine deterministically simulatable. Crash safety and recovery - the hardest things to test - can be exhaustively explored by crashing at arbitrary I/O completion points and verifying the on-disk state is always recoverable.

The simulator controls what drain_completions() returns and when. All non-determinism (I/O, time, randomness) is injectable through this interface.

1.7 Concurrency & Threading

Single-threaded event loop with explicit state machines. No coroutines at the storage engine level.

Read operations return PENDING when a page is not in cache. The caller loop:

This is trivially DST-able and requires no stack switching or scheduler complexity.

1.7.1 Cursor API

The public interface is cursor-based: - cursor_open / cursor_close - cursor_step -> returns ROW, PENDING, DONE, or NOT_FOUND

PENDING is the async contract. The storage engine never blocks; it always returns immediately.

1.7.2 Future SQL Layer

The SQL query executor will manage multiple cursors for joins and complex queries. Coroutines are a candidate at that layer - each coroutine drives one cursor or sub-operation, treating PENDING as its yield signal, with a scheduler multiplexing over io_uring completions.

This keeps coroutine complexity out of the storage engine entirely. The storage engine’s PENDING return is sufficient async contract for the SQL layer to build on.

1.8 Style

Follows TigerStyle. Key constraints: