Fork, substitute, replay
Four primitives — fork, substitute, overlay, replay — and one comparison — diff — that everything else in Scry is built on. Testing, debugging, prompt iteration, and counterfactual analysis all compose from this set.
The core operation
Fork a scroll. Substitute an event. Replay the workflow. Diff the results. That's it. Every higher-level Scry capability (test generation, regression sweeps, prompt bisection, incident replay) is a specific composition of these four steps.
The value comes from what the primitives don't require.
You never have to fabricate a scenario from scratch: real scrolls are the fixtures.
You never have to mock the LLM: the recorded ai.response is the mock, and substituting it is a scroll operation.
You never have to worry about test pollution: the origin scroll is unchanged, and forks live in their own namespace.
The primitives
Fork
Clone a scroll's lineage, get a new addressable scroll.
Creates a new scroll that inherits the target's events up to some point. The origin is unchanged. The fork is independently appendable and independently addressable. Lineage metadata records which scroll it came from and at what sequence.
Substitute
On a fork, replace a recorded event with a new value.
Appends a replacement event at the same logical position. Downstream consumers on the fork see the substituted value instead of the original. Typical use: replace a recorded ai.response so replay re-derives everything downstream with a different model output.
Overlay
Non-destructive read-time override.
A read-time lens that rewrites events on the fly without committing a fork. Useful for one-shot what-if queries where persisting a new scroll would be wasteful. Overlays stack; their composition order is recorded.
Replay
Re-derive the workflow against the scroll.
Runs the runner against the target scroll. External commits (ai.response, tool.result) are read from the scroll when present; live calls happen only where the scroll is missing them. A fully recorded scroll replays deterministically and makes no live calls, so it incurs no model or tool cost.
Diff
Compare two scrolls event-by-event.
Aligns by causal chain (not just sequence number) and reports the first divergence and its downstream impact. Outputs the typed event payloads for each side. Used as the verification layer on top of replay.
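The fork, substitute, and overlay semantics described above can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions, not Scry's implementation: the Event and Scroll types, the Seq field, and the View and ReadWith helpers are hypothetical names introduced here, and "same logical position" is modeled as "same Seq, latest append wins".

```go
package main

import "fmt"

// Event is a hypothetical, simplified scroll event.
type Event struct {
	Seq   int    // logical position in the workflow
	Topic string // e.g. "ai.response"
	Data  string // payload, kept as a string for brevity
}

// Scroll is an append-only event list plus lineage metadata.
type Scroll struct {
	ID       string
	ParentID string // origin scroll; empty for a root scroll
	ForkSeq  int    // last sequence number inherited from the parent
	events   []Event
}

// Append adds an event; existing events are never modified.
func (s *Scroll) Append(e Event) { s.events = append(s.events, e) }

// Fork copies events up to and including atSeq into a new scroll.
func (s *Scroll) Fork(id string, atSeq int) *Scroll {
	f := &Scroll{ID: id, ParentID: s.ID, ForkSeq: atSeq}
	for _, e := range s.events {
		if e.Seq <= atSeq {
			f.events = append(f.events, e)
		}
	}
	return f
}

// Substitute appends a replacement at an existing logical position.
func (s *Scroll) Substitute(seq int, topic, data string) {
	s.Append(Event{Seq: seq, Topic: topic, Data: data})
}

// View resolves the scroll to one event per position: the latest append wins.
func (s *Scroll) View() []Event {
	var out []Event
	index := map[int]int{} // Seq -> position in out
	for _, e := range s.events {
		if i, ok := index[e.Seq]; ok {
			out[i] = e
			continue
		}
		index[e.Seq] = len(out)
		out = append(out, e)
	}
	return out
}

// Overlay is a read-time rewrite; it never touches the scroll.
type Overlay func(Event) Event

// ReadWith applies overlays, in the order given, to a fresh copy of the view.
func (s *Scroll) ReadWith(overlays ...Overlay) []Event {
	view := s.View()
	for i := range view {
		for _, o := range overlays {
			view[i] = o(view[i])
		}
	}
	return view
}

func main() {
	origin := &Scroll{ID: "origin"}
	origin.Append(Event{Seq: 1, Topic: "user.input", Data: "slay the dragon"})
	origin.Append(Event{Seq: 2, Topic: "ai.response", Data: "quest accepted"})

	fork := origin.Fork("fork-1", 2)
	fork.Substitute(2, "ai.response", "quest rejected: duplicate")

	fmt.Println(origin.View()[1].Data) // origin still holds the recorded response
	fmt.Println(fork.View()[1].Data)   // fork resolves to the substituted response
}
```

The substitution is an append on the fork, so the origin's events are untouched, and an overlay only changes what a single read returns.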
Scroll-first replay: one change, everything else
The load-bearing implementation detail is in the runner.
When the runner needs an ai.response, it
reads the scroll before calling the model. If a recorded
response is there, use it. If not, call live and append
the result.
Consequence: production and replay are the same code path. A fresh scroll has no recorded externals, so everything goes live. A substituted fork has the externals you want, so nothing does. There is no "test mode" switch to get wrong. The runner doesn't know or care whether it's serving a prod request or re-deriving an incident.
// internal/runner/sync/replay.go (shipped)
// ensureReplayLoaded reads every recorded ai.response from the scroll, once.
func (r *SyncRunner) ensureReplayLoaded(ctx context.Context) {
	if r.replayLoaded {
		return
	}
	r.replayLoaded = true
	events, _ := r.scroll.Read(ctx, scroll.WithTopic(TopicAIResponse))
	for _, e := range events {
		var ev aiResponseEvent
		json.Unmarshal(e.Data, &ev)
		r.replayedAIResponses = append(r.replayedAIResponses, ev.Content)
	}
}
// tryReplayAIResponse returns the next recorded response in call order, if any.
func (r *SyncRunner) tryReplayAIResponse(ctx context.Context) (string, bool) {
	r.ensureReplayLoaded(ctx)
	if r.replayCursor >= len(r.replayedAIResponses) {
		return "", false
	}
	content := r.replayedAIResponses[r.replayCursor]
	r.replayCursor++
	return content, true
}
This is v0.1: positional match by call order, good enough for single-call agents like the narrator. v0.2 matches by content hash over the request, so a re-ordered or re-fanned-out workflow still resolves the right recorded response.
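To make the content-hash idea concrete, here is a minimal sketch of what matching by request hash could look like. It is an illustration only, not the v0.2 design: requestKey, replayIndex, and the choice of (model, prompt) as the hashed request fields are all assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// requestKey hashes the request fields that determine the response.
// Hashing (model, prompt) is an assumption made for illustration.
func requestKey(model, prompt string) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(prompt))
	return hex.EncodeToString(h.Sum(nil))
}

// replayIndex maps a request hash to its recorded responses, in recorded order.
type replayIndex map[string][]string

// record stores a response under the hash of the request that produced it.
func (idx replayIndex) record(model, prompt, response string) {
	k := requestKey(model, prompt)
	idx[k] = append(idx[k], response)
}

// take returns the next recorded response for this request, if any.
// Identical requests are consumed in the order they were recorded.
func (idx replayIndex) take(model, prompt string) (string, bool) {
	k := requestKey(model, prompt)
	queue := idx[k]
	if len(queue) == 0 {
		return "", false // nothing recorded: the caller would go live
	}
	idx[k] = queue[1:]
	return queue[0], true
}

func main() {
	idx := replayIndex{}
	idx.record("narrator", "describe the tavern", "A smoky room.")
	idx.record("narrator", "describe the dragon", "Scales like slate.")

	// Calls arrive in a different order than they were recorded.
	second, _ := idx.take("narrator", "describe the dragon")
	first, _ := idx.take("narrator", "describe the tavern")
	fmt.Println(second)
	fmt.Println(first)

	_, ok := idx.take("narrator", "describe the moon")
	fmt.Println(ok) // false: no recording for this request
}
```

Unlike the positional cursor, the lookup does not depend on call order, which is the property the re-ordered and re-fanned-out cases need.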
Worked example: a production prompt change
The scenario: the tavern's reconcile-quest reconciler is accepting too many duplicate quests —
adventurers keep finding the same slain dragon on the
board under two different names. The hypothesis:
tightening the system prompt will fix it without
regressing anything else.
Without Scry, the options are unpleasant: ship and hope, manually replay a handful of cases, or mock out the LLM entirely and test something that isn't the real system.
With Scry:
# 1. Identify a representative corpus
scry search --topic quest_signals.quest_proposed \
    --filter "dedup_false_negative" \
    --limit 40 \
    --out corpus/
# 2. Replay the corpus against HEAD — baseline
scry replay-corpus corpus/ --out baseline/
# 3. Substitute the reconciler's system prompt on each scroll
scry corpus-substitute corpus/ \
    --event system.prompt \
    --agent reconcile-quest \
    --from-file new-prompt.tmpl \
    --out corpus-candidate/
# 4. Replay the candidate corpus
scry replay-corpus corpus-candidate/ --out candidate/
# 5. Diff the outcomes
scry diff-corpus baseline/ candidate/ --out report.html
# report.html summary:
# 34 cases: identical outcome (pass)
# 5 cases: improved (dedup fires correctly now)
# 1 case: regressed (rejects a legitimate second quest)
The one regression is actionable. Fork that specific scroll, inspect the rejection, consider whether to adjust the prompt further or accept the trade-off, move on. No guesswork. No live LLM dollars burned on exploration. No deployment required.
This is the same shape as every other prompt change on
every other reactor — reward phrasing, difficulty
grounding, rumor-to-quest extraction. The pattern
generalizes.
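The comparison in step 5 boils down to finding, for each baseline and candidate pair, where the two scrolls first disagree. The sketch below is a simplified stand-in, not Scry's diff: it aligns events by position only, whereas the diff described earlier aligns by causal chain. Event, FirstDivergence, and Tally are hypothetical names introduced for this example.

```go
package main

import "fmt"

// Event is a hypothetical, simplified typed event payload.
type Event struct {
	Topic string
	Data  string
}

// FirstDivergence returns the index of the first position at which the two
// event lists differ, or -1 if they are identical. Alignment is positional.
func FirstDivergence(a, b []Event) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n // one side has extra events after the shared prefix
	}
	return -1
}

// Tally compares each baseline scroll to the candidate with the same name
// and counts how many pairs are identical and how many diverge.
func Tally(baseline, candidate map[string][]Event) (identical, diverged int) {
	for name, events := range baseline {
		if FirstDivergence(events, candidate[name]) == -1 {
			identical++
		} else {
			diverged++
		}
	}
	return identical, diverged
}

func main() {
	baseline := map[string][]Event{
		"case-1": {{Topic: "quest.proposed", Data: "dragon"}, {Topic: "ai.response", Data: "accept"}},
		"case-2": {{Topic: "quest.proposed", Data: "wyrm"}, {Topic: "ai.response", Data: "accept"}},
	}
	candidate := map[string][]Event{
		"case-1": {{Topic: "quest.proposed", Data: "dragon"}, {Topic: "ai.response", Data: "accept"}},
		"case-2": {{Topic: "quest.proposed", Data: "wyrm"}, {Topic: "ai.response", Data: "reject"}},
	}
	same, changed := Tally(baseline, candidate)
	fmt.Printf("identical=%d diverged=%d\n", same, changed)
	fmt.Println("case-2 diverges at index", FirstDivergence(baseline["case-2"], candidate["case-2"]))
}
```

A tally like this only separates identical from diverged cases. Deciding whether a divergence is an improvement or a regression still needs an expected outcome or a human look at the diverging event.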
Safety: immutability is preserved
Fork and substitute never mutate an existing event. A fork is a new scroll; substitutions are appends on that new scroll positioned to override, not overwrite. The origin scroll — the production record — is read-only from Scry's point of view.
This matters for two reasons. First, compliance: the production record is never rewritten, so the event log stays an append-only audit trail (and becomes hash-chainable, and so cryptographically tamper-evident, when that lands). Second, parallel experimentation: ten analysts can fork the same scroll and not collide. Each fork has its own address, its own lineage metadata, its own acceptance history.
What doesn't need replay
Not every Scry operation needs to re-run a workflow. Some are pure scroll analysis — counting, aggregating, narrating, searching. Those don't touch the runner at all; they just read the scroll.
Replay is the operation that matters when you want to know what the system would do if some input changed. For questions about what the system did, read the scroll directly.
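As an illustration of the read-only side, here is a minimal sketch of a pure scroll analysis: counting events by topic with no runner involved. The Event type and CountByTopic are hypothetical names for this example, and a slice stands in for whatever a scroll read returns.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical, simplified scroll event.
type Event struct {
	Topic string
	Data  string
}

// CountByTopic aggregates events per topic. It only reads its input.
func CountByTopic(events []Event) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		counts[e.Topic]++
	}
	return counts
}

func main() {
	events := []Event{
		{Topic: "user.input", Data: "slay the dragon"},
		{Topic: "ai.response", Data: "quest accepted"},
		{Topic: "tool.result", Data: "board updated"},
		{Topic: "ai.response", Data: "anything else?"},
	}
	counts := CountByTopic(events)

	// Sort topics so the output order is stable.
	topics := make([]string, 0, len(counts))
	for t := range counts {
		topics = append(topics, t)
	}
	sort.Strings(topics)
	for _, t := range topics {
		fmt.Printf("%s: %d\n", t, counts[t])
	}
}
```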
Status
Scroll-first replay ships inside the sync runner (SyncRunner) today; it is what makes the narrator deterministic when given a session scroll. The networked fork / substitute / diff RPCs on scroll-server are the v0.2 surface that unlocks Scry at scale; they are designed and scoped but not yet wired.