The core operation

Fork a scroll. Substitute an event. Replay the workflow. Diff the results. That's it. Every higher-level Scry capability — test generation, regression sweeps, prompt bisection, incident replay — is a specific composition of these four.

The value comes from what the primitives don't require. You never have to fabricate a scenario from scratch: real scrolls are the fixtures. You never have to mock the LLM: the recorded ai.response is the mock, and substituting it is a scroll operation. You never have to worry about test pollution: the origin scroll is unchanged, and forks live in their own namespace.

The primitives

Fork

Clone a scroll's lineage, get a new addressable scroll.

Creates a new scroll that inherits the target's events up to some point. The origin is unchanged. The fork is independently appendable and independently addressable. Lineage metadata records which scroll it came from and at what sequence.
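As a rough illustration of the fork semantics described above, here is a minimal in-memory sketch. The Event, Scroll, and Lineage types and the Fork function are hypothetical stand-ins invented for this example, not the scroll-server API; the sketch copies events eagerly and only shows that the origin stays untouched while the fork stays appendable.

```go
package main

import "fmt"

// Event is a simplified, hypothetical scroll event.
type Event struct {
	Seq   int
	Topic string
	Data  string
}

// Lineage records which scroll a fork came from and at what sequence.
type Lineage struct {
	ParentID string
	AtSeq    int
}

// Scroll is an in-memory stand-in for an append-only event log.
type Scroll struct {
	ID      string
	Events  []Event
	Lineage *Lineage // nil for an origin scroll
}

// Append adds an event at the next sequence number.
func (s *Scroll) Append(topic, data string) {
	next := 1
	if n := len(s.Events); n > 0 {
		next = s.Events[n-1].Seq + 1
	}
	s.Events = append(s.Events, Event{Seq: next, Topic: topic, Data: data})
}

// Fork copies the origin's events with Seq <= atSeq into a new scroll.
// The origin is only read, never written.
func Fork(origin *Scroll, newID string, atSeq int) *Scroll {
	fork := &Scroll{ID: newID, Lineage: &Lineage{ParentID: origin.ID, AtSeq: atSeq}}
	for _, e := range origin.Events {
		if e.Seq <= atSeq {
			fork.Events = append(fork.Events, e)
		}
	}
	return fork
}

func main() {
	origin := &Scroll{ID: "session-1"}
	origin.Append("user.input", "post a dragon quest")
	origin.Append("ai.response", "quest accepted")
	origin.Append("quest.created", "q-17")

	// Fork just after the user input, then take the fork somewhere else.
	fork := Fork(origin, "session-1/fork-a", 1)
	fork.Append("ai.response", "quest rejected: duplicate")

	fmt.Println("origin events:", len(origin.Events))
	fmt.Println("fork events:", len(fork.Events),
		"forked from", fork.Lineage.ParentID, "at", fork.Lineage.AtSeq)
}
```

Because the fork builds its own slice, appending to it cannot disturb the origin, which is the property the rest of this page relies on.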

Substitute

On a fork, replace a recorded event with a new value.

Appends a replacement event at the same logical position. Downstream consumers on the fork see the substituted value instead of the original. Typical use: replace a recorded ai.response so replay re-derives everything downstream with a different model output.
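One way to picture "append, don't overwrite" is a replacement event that points at the sequence number it overrides, with readers resolving the override. The sketch below assumes exactly that; the Replaces field and the Effective view are hypothetical names for illustration, and the real causal positioning may work differently.

```go
package main

import "fmt"

// Event is a simplified, hypothetical scroll event.
type Event struct {
	Seq      int
	Topic    string
	Data     string
	Replaces int // 0 for an ordinary event, otherwise the Seq it overrides
}

// Scroll is an in-memory, append-only stand-in for a fork.
type Scroll struct {
	Events []Event
}

// Append assigns the next sequence number and stores the event.
func (s *Scroll) Append(e Event) {
	e.Seq = len(s.Events) + 1
	s.Events = append(s.Events, e)
}

// Substitute appends a replacement for the original event at targetSeq.
// The original event stays in the log untouched.
func (s *Scroll) Substitute(targetSeq int, data string) error {
	for _, e := range s.Events {
		if e.Seq == targetSeq && e.Replaces == 0 {
			s.Append(Event{Topic: e.Topic, Data: data, Replaces: targetSeq})
			return nil
		}
	}
	return fmt.Errorf("no original event at seq %d", targetSeq)
}

// Effective returns what downstream consumers see: original events in
// order, each carrying the data of its most recent substitution, if any.
func (s *Scroll) Effective() []Event {
	latest := map[int]Event{}
	for _, e := range s.Events {
		if e.Replaces != 0 {
			latest[e.Replaces] = e
		}
	}
	var out []Event
	for _, e := range s.Events {
		if e.Replaces != 0 {
			continue // replacement events are not shown as separate entries
		}
		if sub, ok := latest[e.Seq]; ok {
			e.Data = sub.Data
		}
		out = append(out, e)
	}
	return out
}

func main() {
	fork := &Scroll{}
	fork.Append(Event{Topic: "user.input", Data: "post a dragon quest"})
	fork.Append(Event{Topic: "ai.response", Data: "quest accepted"})

	if err := fork.Substitute(2, "quest rejected: duplicate"); err != nil {
		fmt.Println("substitute failed:", err)
		return
	}
	for _, e := range fork.Effective() {
		fmt.Println(e.Seq, e.Topic, e.Data)
	}
	fmt.Println("raw events kept:", len(fork.Events))
}
```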

Overlay

Non-destructive read-time override.

A read-time lens that rewrites events on the fly without committing a fork. Useful for one-shot what-if queries where persisting a new scroll would be wasteful. Overlays stack; their composition order is recorded.
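A small sketch of the overlay idea, assuming an overlay is just a named rewrite function applied while reading. The Lens and Overlay types are hypothetical; the point is that the base events are never modified and the stacking order is kept alongside the lens.

```go
package main

import "fmt"

// Event is a simplified, hypothetical scroll event.
type Event struct {
	Seq   int
	Topic string
	Data  string
}

// Overlay is a named read-time rewrite.
type Overlay struct {
	Name    string
	Rewrite func(Event) Event
}

// Lens reads base events through a stack of overlays.
type Lens struct {
	base     []Event
	overlays []Overlay
}

// NewLens wraps a slice of events without copying or changing it.
func NewLens(base []Event) *Lens {
	return &Lens{base: base}
}

// With returns a new lens with o stacked on top; the receiver is unchanged.
func (l *Lens) With(o Overlay) *Lens {
	stacked := append(append([]Overlay{}, l.overlays...), o)
	return &Lens{base: l.base, overlays: stacked}
}

// Order reports the recorded composition order, first-applied first.
func (l *Lens) Order() []string {
	names := make([]string, 0, len(l.overlays))
	for _, o := range l.overlays {
		names = append(names, o.Name)
	}
	return names
}

// Read applies every overlay, in order, to a copy of each base event.
func (l *Lens) Read() []Event {
	out := make([]Event, 0, len(l.base))
	for _, e := range l.base {
		for _, o := range l.overlays {
			e = o.Rewrite(e)
		}
		out = append(out, e)
	}
	return out
}

func main() {
	base := []Event{
		{Seq: 1, Topic: "user.input", Data: "post a dragon quest"},
		{Seq: 2, Topic: "ai.response", Data: "quest accepted"},
	}
	reject := Overlay{Name: "reject-duplicate", Rewrite: func(e Event) Event {
		if e.Topic == "ai.response" {
			e.Data = "quest rejected: duplicate"
		}
		return e
	}}
	tag := Overlay{Name: "tag-what-if", Rewrite: func(e Event) Event {
		e.Data = "[what-if] " + e.Data
		return e
	}}

	lens := NewLens(base).With(reject).With(tag)
	fmt.Println("order:", lens.Order())
	for _, e := range lens.Read() {
		fmt.Println(e.Seq, e.Topic, e.Data)
	}
	fmt.Println("base unchanged:", base[1].Data)
}
```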

Replay

Re-derive the workflow against the scroll.

Runs the runner against the target scroll. External commits (ai.response, tool.result) are read from the scroll when present; live calls happen only where the scroll is missing them. A fully recorded scroll replays deterministically and makes no live calls, so it incurs no model or tool spend.
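The read-before-call rule can be sketched with a toy runner. Everything here is an assumption made for illustration: the Runner and Scroll types, the Live callback standing in for a model client, and the LiveCalls counter. Matching is positional, by call order.

```go
package main

import "fmt"

// Scroll holds recorded ai.response contents in call order (toy model).
type Scroll struct {
	AIResponses []string
}

// Runner resolves model calls scroll-first and falls back to a live call.
type Runner struct {
	Scroll    *Scroll
	Live      func(prompt string) string // stand-in for a real model client
	LiveCalls int
	cursor    int
}

// Complete returns the recorded response at the current position if one
// exists; otherwise it calls live and appends the result to the scroll.
func (r *Runner) Complete(prompt string) string {
	if r.cursor < len(r.Scroll.AIResponses) {
		out := r.Scroll.AIResponses[r.cursor]
		r.cursor++
		return out
	}
	out := r.Live(prompt)
	r.LiveCalls++
	r.Scroll.AIResponses = append(r.Scroll.AIResponses, out)
	r.cursor++ // skip past the response that was just recorded
	return out
}

func main() {
	live := func(prompt string) string { return "echo: " + prompt }
	scroll := &Scroll{}

	// First run: nothing is recorded, so both calls go live.
	first := &Runner{Scroll: scroll, Live: live}
	first.Complete("describe the tavern")
	first.Complete("name the innkeeper")
	fmt.Println("first run live calls:", first.LiveCalls)

	// Second run over the same scroll: everything resolves from the record.
	second := &Runner{Scroll: scroll, Live: live}
	fmt.Println(second.Complete("describe the tavern"))
	fmt.Println(second.Complete("name the innkeeper"))
	fmt.Println("second run live calls:", second.LiveCalls)
}
```

Both runs go through the same Complete method; only the contents of the scroll decide whether a call is live.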

Diff

Compare two scrolls event-by-event.

Aligns by causal chain (not just sequence number) and reports the first divergence and its downstream impact. Outputs the typed event payloads for each side. Used as the verification layer on top of replay.
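To make "align by causal chain" concrete, the sketch below keys each event by the chain of topics leading to it, then compares payloads under equal keys. This is only one possible reading of causal alignment: the CausedBy field, the key format, and the Divergence record are all assumptions made for the example, not the designed diff engine.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a simplified, hypothetical scroll event with a causation link.
type Event struct {
	ID       string
	CausedBy string // ID of the causing event, "" for a root
	Topic    string
	Data     string
}

// Divergence describes one aligned position where the scrolls differ.
type Divergence struct {
	Key   string // causal key the two sides were aligned on
	Kind  string // "changed", "only-left", or "only-right"
	Left  string
	Right string
}

// causalKey joins the topics on the causation chain from the root to e.
func causalKey(e Event, byID map[string]Event) string {
	parts := []string{e.Topic}
	// The hop limit guards against malformed, cyclic causation links.
	for hops := 0; e.CausedBy != "" && hops < len(byID); hops++ {
		parent, ok := byID[e.CausedBy]
		if !ok {
			break
		}
		parts = append([]string{parent.Topic}, parts...)
		e = parent
	}
	return strings.Join(parts, ">")
}

// index returns causal keys in scroll order plus a key-to-event lookup.
// A "#n" suffix separates events that share the same chain of topics.
func index(events []Event) ([]string, map[string]Event) {
	byID := make(map[string]Event, len(events))
	for _, e := range events {
		byID[e.ID] = e
	}
	seen := map[string]int{}
	keys := make([]string, 0, len(events))
	byKey := make(map[string]Event, len(events))
	for _, e := range events {
		base := causalKey(e, byID)
		seen[base]++
		key := fmt.Sprintf("%s#%d", base, seen[base])
		keys = append(keys, key)
		byKey[key] = e
	}
	return keys, byKey
}

// Diff aligns two scrolls by causal key. Records for keys present on the
// left come first, in left scroll order; right-only keys follow.
func Diff(left, right []Event) []Divergence {
	leftKeys, leftByKey := index(left)
	rightKeys, rightByKey := index(right)
	var out []Divergence
	for _, k := range leftKeys {
		l := leftByKey[k]
		r, ok := rightByKey[k]
		switch {
		case !ok:
			out = append(out, Divergence{Key: k, Kind: "only-left", Left: l.Data})
		case l.Data != r.Data:
			out = append(out, Divergence{Key: k, Kind: "changed", Left: l.Data, Right: r.Data})
		}
	}
	for _, k := range rightKeys {
		if _, ok := leftByKey[k]; !ok {
			out = append(out, Divergence{Key: k, Kind: "only-right", Right: rightByKey[k].Data})
		}
	}
	return out
}

func main() {
	baseline := []Event{
		{ID: "u1", Topic: "user.input", Data: "post a dragon quest"},
		{ID: "a1", CausedBy: "u1", Topic: "ai.response", Data: "quest accepted"},
		{ID: "q1", CausedBy: "a1", Topic: "quest.created", Data: "q-17"},
	}
	candidate := []Event{
		{ID: "u1", Topic: "user.input", Data: "post a dragon quest"},
		{ID: "a2", CausedBy: "u1", Topic: "ai.response", Data: "quest rejected: duplicate"},
	}
	for _, d := range Diff(baseline, candidate) {
		fmt.Printf("%-10s %s\n", d.Kind, d.Key)
	}
}
```

In this toy run the changed ai.response is reported first, and the quest.created event that exists only on the baseline side shows up after it as a downstream consequence.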

Scroll-first replay: one change, everything else

The load-bearing implementation detail is in the runner. When the runner needs an ai.response, it reads the scroll before calling the model. If a recorded response is there, use it. If not, call live and append the result.

Consequence: production and replay are the same code path. A fresh scroll has no recorded externals, so everything goes live. A substituted fork has the externals you want, so nothing does. There is no "test mode" switch to get wrong. The runner doesn't know or care whether it's serving a prod request or re-deriving an incident.

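// internal/runner/sync/replay.go (shipped)

// ensureReplayLoaded reads every recorded ai.response from the scroll once
// and caches the contents in call order.
func (r *SyncRunner) ensureReplayLoaded(ctx context.Context) {
    if r.replayLoaded { return }
    r.replayLoaded = true

    // Read and unmarshal errors are ignored in this excerpt.
    events, _ := r.scroll.Read(ctx, scroll.WithTopic(TopicAIResponse))
    for _, e := range events {
        var ev aiResponseEvent
        json.Unmarshal(e.Data, &ev)
        r.replayedAIResponses = append(r.replayedAIResponses, ev.Content)
    }
}

// tryReplayAIResponse returns the next recorded response by position, or
// ("", false) once the recordings are exhausted and the caller must go live.
func (r *SyncRunner) tryReplayAIResponse(ctx context.Context) (string, bool) {
    r.ensureReplayLoaded(ctx)
    if r.replayCursor >= len(r.replayedAIResponses) {
        return "", false
    }
    content := r.replayedAIResponses[r.replayCursor]
    r.replayCursor++
    return content, true
}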

This is v0.1 — positional match by call order, good enough for single-call agents like the narrator. v0.2 matches by content hash over the request, so a re-ordered or re-fanned-out workflow still resolves the right recorded response.
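For a feel of what request-hash matching could look like, here is a hedged sketch. The Request shape, the choice of SHA-256 over a JSON encoding, and the ReplayIndex type are assumptions made for the example; the v0.2 design may hash different fields or canonicalize differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Request is a hypothetical model request; the fields are illustrative.
type Request struct {
	Model  string `json:"model"`
	System string `json:"system"`
	Prompt string `json:"prompt"`
}

// RequestHash hashes the JSON encoding of the request. Struct fields are
// encoded in declaration order, so equal requests yield equal hashes.
func RequestHash(r Request) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err) // not expected for a struct of plain strings
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// ReplayIndex maps request hashes to recorded responses.
type ReplayIndex struct {
	byHash map[string][]string
}

// NewReplayIndex returns an empty index.
func NewReplayIndex() *ReplayIndex {
	return &ReplayIndex{byHash: map[string][]string{}}
}

// Record stores a response under the hash of the request that produced it.
func (ix *ReplayIndex) Record(r Request, content string) {
	h := RequestHash(r)
	ix.byHash[h] = append(ix.byHash[h], content)
}

// Lookup consumes the oldest unconsumed recording for this exact request.
func (ix *ReplayIndex) Lookup(r Request) (string, bool) {
	h := RequestHash(r)
	queue := ix.byHash[h]
	if len(queue) == 0 {
		return "", false
	}
	ix.byHash[h] = queue[1:]
	return queue[0], true
}

func main() {
	narrate := Request{Model: "m1", System: "You narrate.", Prompt: "describe the tavern"}
	reconcile := Request{Model: "m1", System: "You dedupe quests.", Prompt: "dragon slain twice?"}

	ix := NewReplayIndex()
	ix.Record(narrate, "A smoky room full of rumors.")
	ix.Record(reconcile, "duplicate")

	// Calls arrive in a different order than they were recorded.
	got, ok := ix.Lookup(reconcile)
	fmt.Println(got, ok)
	got, ok = ix.Lookup(narrate)
	fmt.Println(got, ok)

	// A request with a changed system prompt has a different hash and misses.
	changed := reconcile
	changed.System = "You dedupe quests strictly."
	_, ok = ix.Lookup(changed)
	fmt.Println("changed prompt found:", ok)
}
```

Under this scheme a request whose inputs changed simply misses the record, so that call would fall through to the live path while unchanged calls still resolve from the scroll.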

Worked example: a production prompt change

The scenario: the tavern's reconcile-quest agent is accepting too many duplicate quests, and adventurers keep finding the same slain dragon on the board under two different names. The hypothesis: tightening the system prompt will fix it without regressing anything else.

Without Scry, the options are unpleasant: ship and hope; manually replay a handful of cases; mock out the LLM entirely and test something that isn't the real system.

With Scry:

# 1. Identify a representative corpus
scry search --topic quest_signals.quest_proposed \
           --filter "dedup_false_negative" \
           --limit 40 \
           --out corpus/

# 2. Replay the corpus against HEAD — baseline
scry replay-corpus corpus/ --out baseline/

# 3. Substitute the reconciler's system prompt on each scroll
scry corpus-substitute corpus/ \
    --event system.prompt \
    --agent reconcile-quest \
    --from-file new-prompt.tmpl \
    --out corpus-candidate/

# 4. Replay the candidate corpus
scry replay-corpus corpus-candidate/ --out candidate/

# 5. Diff the outcomes
scry diff-corpus baseline/ candidate/ --out report.html

# report.html summary:
#   34 cases: identical outcome (pass)
#    5 cases: improved (dedup fires correctly now)
#    1 case: regressed (rejects a legitimate second quest)

The one regression is actionable. Fork that specific scroll, inspect the rejection, consider whether to adjust the prompt further or accept the trade-off, move on. No guesswork. No live LLM dollars burned on exploration. No deployment required.

This is the same shape as every other prompt change on every other reactor — reward phrasing, difficulty grounding, rumor-to-quest extraction. The pattern generalizes.

Safety: immutability is preserved

Fork and substitute never mutate an existing event. A fork is a new scroll; substitutions are appends on that new scroll positioned to override, not overwrite. The origin scroll — the production record — is read-only from Scry's point of view.

This matters for two reasons. First, compliance: the event log is append-only by construction, and becomes tamper-evident once hash chaining lands. Second, parallel experimentation: ten analysts can fork the same scroll and not collide. Each fork has its own address, its own lineage metadata, its own acceptance history.
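Hash chaining is mentioned above as future work. As a generic illustration of what it adds to an append-only log, the sketch below links each entry to the hash of the previous one so that any later edit is detectable. The Entry layout and the use of SHA-256 are assumptions, not the planned scroll format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical hash-chained log entry.
type Entry struct {
	Data string
	Prev string // hash of the previous entry, "" for the first
	Hash string // hash over Prev and Data
}

// hashEntry derives an entry hash from the previous hash and the payload.
func hashEntry(prev, data string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + data))
	return hex.EncodeToString(sum[:])
}

// Append returns the log extended by one entry chained to the last one.
func Append(log []Entry, data string) []Entry {
	prev := ""
	if n := len(log); n > 0 {
		prev = log[n-1].Hash
	}
	return append(log, Entry{Data: data, Prev: prev, Hash: hashEntry(prev, data)})
}

// Verify returns the index of the first entry that breaks the chain,
// or -1 if every entry is consistent with its predecessor.
func Verify(log []Entry) int {
	prev := ""
	for i, e := range log {
		if e.Prev != prev || e.Hash != hashEntry(prev, e.Data) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []Entry
	log = Append(log, "user.input: post a dragon quest")
	log = Append(log, "ai.response: quest accepted")
	log = Append(log, "quest.created: q-17")
	fmt.Println("first broken entry:", Verify(log))

	// Editing a stored payload in place invalidates that entry's hash.
	log[1].Data = "ai.response: quest rejected"
	fmt.Println("first broken entry after edit:", Verify(log))
}
```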

What doesn't need replay

Not every Scry operation needs to re-run a workflow. Some are pure scroll analysis — counting, aggregating, narrating, searching. Those don't touch the runner at all; they just read the scroll.

Replay is the operation that matters when you want to know what the system would do if some input changed. For questions about what the system did, read the scroll directly.

Status

Scroll-first replay ships inside SyncRunner today; it is what makes the narrator deterministic when given a session scroll. The networked fork / substitute / diff RPCs on scroll-server are the v0.2 surface that unlocks Scry at scale; they are designed and scoped but not yet wired.

Scroll-first replay in the runner

implemented

internal/runner/sync/replay.go. If the runner has a scroll configured, ai.response events are read from it in order. Foundation for every Scry operation that re-derives a workflow.

Fork RPC on scroll-server

designed

RPC that takes (scrollID, at?) and returns a new scroll with lineage metadata. v0.2 copies the inherited events at fork time for simplicity; copy-on-write is a later scale optimization.

Substitute RPC on scroll-server

designed

RPC that takes (forkID, eventKey, newPayload) and appends the replacement with the correct causal positioning. Immutability preserved: substitutions live on the fork, never mutate origin events.

Overlay read-time rewrites

designed

Read-path lens compiled into a single iterator. Overlay keys are content-hash-based (over the preceding causation chain) for robustness to upstream insertions.

Diff engine

designed

Causal alignment + typed event comparison. Emits structured divergence records suitable for both human review and agent reasoning.

Scry CLI: fork, substitute, replay, diff

designed

One-shot invocation per operation. JSON output. Artifact directory for long runs. Resumable via done.json markers.
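The overlay entry in the status list above says keys are content hashes over the preceding causation chain. A rough sketch of why that survives upstream insertions follows; the Event fields, the separator bytes, and the OverlayKey function are assumptions made for illustration only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a simplified, hypothetical scroll event with a causation link.
type Event struct {
	Seq      int
	ID       string
	CausedBy string // ID of the causing event, "" for a root
	Topic    string
	Data     string
}

// OverlayKey hashes the topic and data of every ancestor of target, root
// first, followed by the target's own topic. Sequence numbers are not part
// of the key, so unrelated insertions upstream do not change it.
func OverlayKey(target Event, byID map[string]Event) string {
	var chain []string
	cur := target
	// The hop limit guards against malformed, cyclic causation links.
	for hops := 0; cur.CausedBy != "" && hops < len(byID); hops++ {
		parent, ok := byID[cur.CausedBy]
		if !ok {
			break
		}
		chain = append(chain, parent.Topic+"="+parent.Data)
		cur = parent
	}
	h := sha256.New()
	for i := len(chain) - 1; i >= 0; i-- {
		h.Write([]byte(chain[i]))
		h.Write([]byte{0}) // separator between chain elements
	}
	h.Write([]byte(target.Topic))
	return hex.EncodeToString(h.Sum(nil))
}

// indexByID builds the lookup used to walk causation links.
func indexByID(events []Event) map[string]Event {
	byID := make(map[string]Event, len(events))
	for _, e := range events {
		byID[e.ID] = e
	}
	return byID
}

func main() {
	original := []Event{
		{Seq: 1, ID: "u1", Topic: "user.input", Data: "post a dragon quest"},
		{Seq: 2, ID: "a1", CausedBy: "u1", Topic: "ai.response", Data: "quest accepted"},
	}
	// Same causal chain, but an unrelated event was inserted upstream,
	// shifting every sequence number by one.
	shifted := []Event{
		{Seq: 1, ID: "h1", Topic: "heartbeat", Data: "tick"},
		{Seq: 2, ID: "u1", Topic: "user.input", Data: "post a dragon quest"},
		{Seq: 3, ID: "a1", CausedBy: "u1", Topic: "ai.response", Data: "quest accepted"},
	}

	k1 := OverlayKey(original[1], indexByID(original))
	k2 := OverlayKey(shifted[2], indexByID(shifted))
	fmt.Println("keys match despite insertion:", k1 == k2)
}
```

A key based on sequence number would have pointed at a different event after the insertion; the chain-based key still lands on the same ai.response.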