Why scrolls are the right fixture

A scroll is a complete, causally-ordered, typed record of a workflow. Loading one into a test means the test sees exactly what production saw. Mocking the LLM isn't needed — the recorded ai.response is the mock. Mocking the clock isn't needed — events carry timestamps that the runner honors. Mocking side-effects isn't needed — a sandboxed replay stubs externals at the tool-handler boundary, deterministic and contained.

The result: tests that exercise the real system against real inputs, without the three things that usually make AI tests brittle — drift between mock and prod, wall-clock coupling, and expensive live LLM calls per test run.

pkg/scrolltest: the fixture toolkit

pkg/scrolltest is the Go-side primitive for writing tests against scrolls. It has three pieces: a Scenario (the library plus a fake client), a Seed API (queue events to commit before the test runs), and AssertScroll (predicates over the resulting stream).

func TestReconcileQuestAcceptsNovelRumor(t *testing.T) {
    s := scrolltest.New(t)
    defer s.Close()

    // Seed the signals scroll with one proposed quest.
    s.Seed("quest_signals:tavern-42").
        With("quest_signals.quest_proposed", QuestProposed{
            SignalID: "sig_1",
            Title:    "Slay the goblin raiding the north road",
            Evidence: "three merchants reported the same ambush",
        }).
        Commit()

    // Record the response the LLM would return to reconcile-quest.
    s.WithAIResponse(`{"verdict":"accept","reasoning":"novel"}`)

    // Execute the reactor once.
    require.NoError(t, s.RunReactor("reconcile-quest"))

    // Assert on the resulting event stream.
    s.AssertScroll("quest_signals:tavern-42").
        EventCount(2).
        HasTopic("quest_signals.signal_accepted").
        PayloadMatches("quest_signals.signal_accepted", map[string]any{
            "signalId": "sig_1",
        })
}

Zero live LLM calls. Zero database writes. The test is deterministic by construction because the fake client and scroll-first replay give the runner the ai.response it needs without going external. A test failure is a real logic error — never a flaky network call.

Test generation from scrolls

When something breaks in production, the scroll of the incident is already a complete reproduction. Scry turns it into a test with one command:

scry test-gen <scroll-id> \
    --target reconcile-quest \
    --out internal/tavern/reconcile_quest_regression_test.go

# What the generator does:
#   1. Reads the scroll, identifies the reactor's input events
#   2. Identifies the reactor's output events (the ones to assert)
#   3. Extracts the ai.response events used during the run
#   4. Synthesizes a Go test that seeds inputs, records responses,
#      invokes the reactor, and asserts outputs match.
#
# The generated test is review-ready Go, matching the idioms of
# the existing test file it lands next to.
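Steps 1 to 3 above amount to partitioning a scroll's events. The sketch below shows one way that could work, assuming each event records which reactor produced it. The Event, Plan, and PlanFor names and the Producer field are hypothetical, and for simplicity it collects every ai.response in the scroll rather than only those used by the target.

```go
package main

// Event is a hypothetical, simplified scroll event. Producer names the
// reactor that emitted it and is empty for external input.
type Event struct {
	Topic    string
	Producer string
	Payload  string
}

// Plan holds the three groups a generated test needs.
type Plan struct {
	Inputs      []Event // seeded before the run
	AIResponses []Event // served by the fake client
	Outputs     []Event // asserted after the run
}

// PlanFor splits a scroll's events for one target reactor.
func PlanFor(scroll []Event, target string) Plan {
	var p Plan
	for _, e := range scroll {
		switch {
		case e.Topic == "ai.response":
			p.AIResponses = append(p.AIResponses, e)
		case e.Producer == target:
			p.Outputs = append(p.Outputs, e)
		default:
			p.Inputs = append(p.Inputs, e)
		}
	}
	return p
}
```

Step 4, synthesizing the Go test, would then be a matter of rendering the three groups into the seed, record, and assert sections of a test like the one shown earlier.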

Every bug fixed on a weave pipeline has, in principle, a scroll that reproduces it, and therefore a generatable regression test. The backlog of "bugs we fixed but never wrote a test for" becomes a script that walks the git history and emits test files.

Corpus replay at scale

A single test verifies one case. A corpus replay verifies a population. Given a directory of scrolls and a candidate repository state, Scry runs each scroll through the new code and reports the divergence profile.

scry replay-corpus ./corpus --against HEAD --out results/

# results/
#   ├── done.json              — "complete" marker, written last
#   ├── summary.json           — aggregate counts, cost, latency
#   ├── pass/                  — scrolls that replayed identically
#   ├── improved/              — scrolls where outcome got better
#   ├── regressed/             — scrolls where outcome got worse
#   │   ├── scroll-abc.diff    — per-event divergence
#   │   └── scroll-abc.ai.md   — narrator's explanation of the diff
#   └── errored/               — scrolls that failed to replay

Used as a pre-merge gate, corpus replay rejects refactors that silently change behavior, even when every unit test still passes. Used as a nightly sweep, it catches drift introduced by upstream model updates, prompt edits, or dependency bumps. Runs are resumable by design: a 45-minute run that fails at minute 44 picks up where it left off.
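Resumability comes down to recording each outcome as soon as it is known and skipping finished scrolls on restart. A minimal sketch, assuming an in-memory map stands in for the results directory; Outcome, ReplayFunc, and RunCorpus are hypothetical names, and the improved bucket is left out because it needs a judgment about outcome quality.

```go
package main

// Outcome is one of the per-scroll result buckets.
type Outcome string

const (
	Pass      Outcome = "pass"
	Regressed Outcome = "regressed"
	Errored   Outcome = "errored"
)

// ReplayFunc replays one scroll against the candidate code and reports
// whether the replayed stream matched the recorded one.
type ReplayFunc func(scrollID string) (identical bool, err error)

// RunCorpus replays every scroll not already present in done and records
// each outcome immediately, so a restarted run skips finished work.
func RunCorpus(ids []string, done map[string]Outcome, replay ReplayFunc) {
	for _, id := range ids {
		if _, ok := done[id]; ok {
			continue // finished in an earlier run
		}
		identical, err := replay(id)
		switch {
		case err != nil:
			done[id] = Errored
		case identical:
			done[id] = Pass
		default:
			done[id] = Regressed
		}
	}
}
```

In a real run the map would be persisted per scroll, with a completion marker such as done.json written only after the loop finishes.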

Invariants as scroll predicates

Some correctness properties aren't about what a single reactor does — they're about the shape of the scroll as a whole. Every proposal eventually gets a verdict. No signal is accepted without a gate check. Every chat turn has a matching assistant response. These are invariants, not tests: they should hold on every scroll, not just the ones you happened to record.

every-proposal-has-a-verdict

For every quest_proposed (or reward_proposed, rumor_heard, etc.) event, there must exist a signal_accepted or validator_rejected referencing its signalId within N events downstream. Surfaces stuck proposals.

accepted-signals-project

Every signal_accepted event must be followed by a projection marker (quest_projected, monster_projected) within its projection window. Surfaces orphaned acceptances.

gate-precedes-acceptance

Every signal_accepted must be preceded by a gate_checked event on the same signalId. Surfaces gate bypass.

single-verdict-per-signal

No signalId appears on more than one verdict event. Surfaces double-processing or cursor advancement bugs.

conversation-continuity

Every message.assistant event on a chat scroll has a preceding message.user with a matching sourceMessageId chain. Surfaces dangling turns.

Invariants run as a separate Scry command over a corpus. Violations are ranked by severity and come with narration: "scroll X violates gate-precedes-acceptance at sequence 43; the likely cause is Y." Adding a new invariant is a DSL change, not a code change; each invariant is a named, versioned fact about your system.
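The DSL itself is not shown here, so the semantics of three of the invariants above are sketched as plain Go predicates instead. Event, Violation, and the function names are hypothetical, and the topics are written without the quest_signals. prefix to match the invariant descriptions.

```go
package main

// Event is a hypothetical, simplified view of a scroll event.
type Event struct {
	Seq      int
	Topic    string
	SignalID string
}

// Violation names the broken invariant and where it was broken.
type Violation struct {
	Invariant string
	Seq       int
	SignalID  string
}

func isVerdict(topic string) bool {
	return topic == "signal_accepted" || topic == "validator_rejected"
}

// GatePrecedesAcceptance flags every signal_accepted that has no earlier
// gate_checked for the same signal.
func GatePrecedesAcceptance(scroll []Event) []Violation {
	checked := map[string]bool{}
	var out []Violation
	for _, e := range scroll {
		switch e.Topic {
		case "gate_checked":
			checked[e.SignalID] = true
		case "signal_accepted":
			if !checked[e.SignalID] {
				out = append(out, Violation{Invariant: "gate-precedes-acceptance", Seq: e.Seq, SignalID: e.SignalID})
			}
		}
	}
	return out
}

// SingleVerdictPerSignal flags every verdict after the first for a signal.
func SingleVerdictPerSignal(scroll []Event) []Violation {
	seen := map[string]bool{}
	var out []Violation
	for _, e := range scroll {
		if !isVerdict(e.Topic) {
			continue
		}
		if seen[e.SignalID] {
			out = append(out, Violation{Invariant: "single-verdict-per-signal", Seq: e.Seq, SignalID: e.SignalID})
		}
		seen[e.SignalID] = true
	}
	return out
}

// EveryProposalHasVerdict flags each quest_proposed with no verdict for
// its signal within the next n events.
func EveryProposalHasVerdict(scroll []Event, n int) []Violation {
	var out []Violation
	for i, e := range scroll {
		if e.Topic != "quest_proposed" {
			continue
		}
		found := false
		for j := i + 1; j < len(scroll) && j <= i+n; j++ {
			if scroll[j].SignalID == e.SignalID && isVerdict(scroll[j].Topic) {
				found = true
				break
			}
		}
		if !found {
			out = append(out, Violation{Invariant: "every-proposal-has-a-verdict", Seq: e.Seq, SignalID: e.SignalID})
		}
	}
	return out
}
```

Each predicate takes a whole scroll and returns violations with sequence numbers, which is the shape the narration step needs to say where an invariant broke.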

Property-based testing over folds

A fold is a pure function: events in, accumulator out. That makes folds the most test-friendly primitive in weave. Scry generates arbitrary event sequences, runs them through the fold, and checks that fold-level invariants hold — associativity where appropriate, monotonicity, upper bounds on accumulator size.
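As an illustration of the idea, the sketch below checks three properties of a toy fold using Go's standard testing/quick package. This is a stand-in for whatever generator Scry uses, which this page does not specify; CountAccepted, fromBytes, and CheckFoldProperties are hypothetical names.

```go
package main

import "testing/quick"

// Event is a hypothetical, simplified scroll event.
type Event struct{ Topic string }

// CountAccepted is an example fold: events in, accumulator out.
func CountAccepted(events []Event) int {
	n := 0
	for _, e := range events {
		if e.Topic == "signal_accepted" {
			n++
		}
	}
	return n
}

// fromBytes maps random bytes to events so testing/quick can generate
// arbitrary sequences.
func fromBytes(raw []byte) []Event {
	topics := []string{"quest_proposed", "gate_checked", "signal_accepted", "validator_rejected"}
	events := make([]Event, len(raw))
	for i, b := range raw {
		events[i] = Event{Topic: topics[int(b)%len(topics)]}
	}
	return events
}

// CheckFoldProperties returns nil when all properties hold on random input.
func CheckFoldProperties() error {
	property := func(a, b []byte) bool {
		left, right := fromBytes(a), fromBytes(b)
		whole := append(append([]Event{}, left...), right...)
		// Splitting the sequence anywhere and adding the partial counts
		// gives the same result as folding the whole sequence.
		additive := CountAccepted(whole) == CountAccepted(left)+CountAccepted(right)
		// Appending events never decreases the accumulator.
		monotone := CountAccepted(left) <= CountAccepted(whole)
		// The accumulator never exceeds the number of events.
		bounded := CountAccepted(whole) <= len(whole)
		return additive && monotone && bounded
	}
	return quick.Check(property, nil)
}
```

Note that testing/quick reports a failing input but does not shrink it, so minimizing the counterexample is a separate step.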

When a property fails, the counterexample gets shrunk to the minimal event sequence that reproduces the failure. That minimal scroll becomes a .scroll fixture committed alongside a regression test — the same pattern as test-gen, applied to the shrunk input.
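Shrinking can be sketched as a greedy loop that drops one event at a time while the failure persists. This is a deliberately simple single-pass strategy chosen for illustration, not necessarily what Scry does; it yields a small counterexample but does not guarantee the global minimum. Shrink and its fails callback are hypothetical names.

```go
package main

// Event is a hypothetical, simplified scroll event.
type Event struct{ Topic string }

// Shrink removes events one at a time, keeping a removal only when the
// remaining sequence still makes fails return true.
func Shrink(events []Event, fails func([]Event) bool) []Event {
	current := append([]Event{}, events...)
	for i := 0; i < len(current); {
		candidate := append(append([]Event{}, current[:i]...), current[i+1:]...)
		if fails(candidate) {
			current = candidate // event i was not needed; retry the same index
		} else {
			i++ // event i is needed to reproduce the failure
		}
	}
	return current
}
```

The returned slice is what would be written out as the .scroll fixture.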

The CI story

Three levels, escalating cost and depth:

Per-commit: unit tests via pkg/scrolltest, plus invariants on the canonical corpus. Runs in under a minute.
Per-PR: corpus replay across the curated 200-case corpus. Runs in minutes. Gates merge on regression-free status.
Nightly: full-corpus sweep across every production scroll ever recorded. Runs in the background, reports drift, files regression issues automatically.

The cost curve matches the feedback curve. Fast signal per commit. Deeper signal per PR. Maximum signal as a background sweep the team reviews in the morning.

Status

pkg/scrolltest is the first piece and is under active shaping. Test generation, corpus replay, and invariants land as the fork/substitute/diff surface matures — each builds on the primitives described in the Fork, substitute, replay page.

pkg/scrolltest — deterministic fixture toolkit

sdk-shimmed

Scenario, Seed, AssertScroll exist. Stable enough to use inside the weave repo; still shaping the public API. No official release.

Test generation (scry test-gen)

designed

Input: scroll ID + optional target reactor or agent. Output: a Go test file that seeds the relevant events, invokes the target, and asserts the expected output stream. Deterministic — uses the scroll's recorded ai.response events.

Corpus replay (scry replay-corpus)

designed

Batch replay across N scrolls. Reports per-case pass / improved / regressed / errored. Resumable. Runs in a sandbox so it never competes with production runner capacity.

Invariant engine (scry check-invariants)

designed

Scroll predicates expressed in a small DSL. Each invariant runs against every scroll in a corpus and reports violations. Built-in invariants cover the canonical anti-patterns.

Property-based testing over folds

designed

Generate scrolls that exercise a fold under stress — add/remove/reorder arbitrary events — and check fold invariants hold. Shrinks on failure to the minimal counterexample.

Scroll snapshot diffing in tests

designed

Standard snapshot testing, adapted for scrolls: a test captures a scroll's event stream, future runs diff against it, snapshots live as .scroll fixtures next to test files.
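The comparison step could be as small as the sketch below, which diffs only the sequence of event topics and reports the first divergence. DiffTopics is a hypothetical helper; a real snapshot would presumably compare payloads as well.

```go
package main

import "fmt"

// DiffTopics compares a stored snapshot with the topics produced by the
// current run. It returns an empty string when they are identical.
func DiffTopics(snapshot, current []string) string {
	for i := 0; i < len(snapshot) && i < len(current); i++ {
		if snapshot[i] != current[i] {
			return fmt.Sprintf("event %d: snapshot has %q, run produced %q", i, snapshot[i], current[i])
		}
	}
	if len(snapshot) != len(current) {
		return fmt.Sprintf("snapshot has %d events, run produced %d", len(snapshot), len(current))
	}
	return ""
}
```

A test would fail with the returned message whenever it is non-empty, and updating the snapshot would mean rewriting the .scroll fixture.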