How do you know a correction held? Instrumenting an agent in production

When I shipped functional scars, I made a quiet claim and then walked past it. The essay said the loop is closeable because underneath it runs a logging primitive: every scar that fires writes a line, so recall becomes computable. I even drew the diagram with observability running along the bottom.

I used it for a month. The logging primitive turned out to answer a smaller question than the one I actually had.

A scar’s log tells you that specific correction fired. It says nothing about the corrections you never wrote a scar for. It says nothing about whether the assistant, taken as a whole, is drifting back toward its prior in some corner you stopped watching. fscars gives you recall on the rules you remembered to encode. It gives you nothing on the regressions you didn’t see coming.

So the honest question was never “did scar number seven fire.” It was “is this system getting better or am I just adding rules and feeling productive.”

You cannot answer that from inside a single session. You have to instrument the whole thing.

What I built

Two layers. Neither of them is a product, and I’ll come back to that.

The two-layer loop: every session, a SessionEnd hook (session_eval.py) writes a deterministic, zero-token, never-blocking daily JSON record; once a month, monthly.py aggregates the records, a model proposes a mechanism, and a human decides. — *Capture is cheap and deterministic; analysis is slow and monthly. No model touches the capture loop.*

The first layer is capture. There is a hook on SessionEnd — the event Claude Code fires when a session closes via /clear, logout, or exit. It runs a small Python script that reads the session transcript and writes one JSON record to disk. The parser is deterministic. It spends zero model tokens. It is wrapped so that it can never block the session from closing: if it throws, it drops a marker file and exits clean, because an instrument that crashes the thing it measures is worse than no instrument.

The second layer is analysis, and it runs once a month. A separate script aggregates the month’s records, applies a handful of hard heuristics, and emits a structured file plus a readable one. Then — once, because once a month is cheap — a capable model reads the structured signals and proposes changes: a new scar, a rule to tighten in the project’s instructions, a recalibration of which model I should be reaching for. It proposes. A human decides. The script never edits anything on its own.

The governing principle is one I keep stealing from the rest of the shop: mechanism, not catalog. A failure that repeats is not a “known limitation” you write down and live with. It is a missing mechanism. The monthly pass exists to turn repetition into the proposal that closes the gap, not into a longer list of things you already knew were broken.

What one session record holds

Metrics and short labels only — never the raw transcript. Privacy lives in the capture layer.

The record is deliberately cheap. Counts — prompts, assistant turns, tool calls, errors, subagents, skills. Which models did the turns, and the token split across input, output, and cache. Which tools and skills got used. Which areas of the workspace the work actually touched, derived by mapping edited files back to their folder, so the focus of a session is a fact instead of a memory. The deliverables it produced. How many times each scar fired, which is the friction signal. And a flag for whether the session touched anything sensitive.

That last one matters more than it looks. This is the instrumentation for a working engineering office, not a lab toy. Some sessions touch financial paths. When a session does, the record marks itself sensitive and suppresses every text sample, keeping only the structural counts. The instrument is allowed to know that the work happened and where; it is not allowed to keep what was said. Privacy is a property of the capture layer, not a policy I have to remember to apply later.

What it does not store is the raw text of prompts and responses. There is no transcript warehouse. The daily layer is metrics and short labels, by design.

The first thing the data caught was me

I seeded the corpus from the transcripts I already had — a hundred and twenty-two of them — and looked at May. Eighty-nine sessions. Two numbers stopped me.

The first: a scar I had written to force a self-review pass on large code writes fired a hundred and forty-three times, across thirty-three different sessions. Read that the right way. The scar working is good — it caught the situation a hundred and forty-three times. But a hundred and forty-three is also the number of times I walked up to a large code write without reviewing it first and needed a deterministic hook to stop me. The scar wasn’t the success story. The scar was the smoke, and the fire was a habit I hadn’t fixed at the source.

The second: model usage. I have a table that says which class of task deserves which model — the whole point of which is to not reach for the heaviest model on reflexes. The data showed one model dominating almost everything, with the lighter ones barely registering. I preach right-sizing and the instrument caught me not doing it. There is no version of reading your own evals honestly where this doesn’t sting a little.

This is the part I’d defend to anyone skeptical of self-instrumentation: a system that only logs the rules you remembered to write will always flatter you, because you only wrote rules for mistakes you already respect. A system that counts everything will eventually point at the mistake you’ve been making the whole time and not noticing.

Reading scars as symptoms

The monthly pass doesn’t just count fires. It reads them as symptoms and looks one layer down.

Read a scar as a symptom, not a tally: a low-level scar firing four times in one session is the symptom; look one layer down and the cause is a higher-level rule getting skipped; the proposal is the upstream mechanism, not another low-level rule. — *The recurring low-level fire is the smoke. The fix belongs one layer up.*

Concrete example. There is a scar that fires when I run a certain fix-up step after generating a document. There is a different, higher-order rule that says: use the native tooling first, so you never need the fix-up step. The heuristic in the aggregator is blunt — if the low-level scar fires more than twice in a single session, that is not a low-level problem, it’s evidence the high-level rule got skipped. The eval is built to surface the cause, not just tally the effect. The recurring scar of the month becomes a candidate for an upstream mechanism, which is exactly the move that keeps the catalog from growing forever.

That inversion is the whole reason to instrument above the level of the individual rule. Per-scar logs make each correction observable. Aggregate evals make the shape of your mistakes observable, which is the thing you actually want to fix.

The instrument needed its own instrument

The watcher needed its own watcher: a SessionStart check (freshness_check.py) watches the eval (session_eval.py), which watches the sessions. If a capture fails it drops a _last_error.json marker that the next session start reads and warns about. — *The capture must not be able to fail invisibly, so a session-start check watches it. Turtles.*

Here is the part I find funniest, in the way that production engineering is funny.

The eval engine went dark for four days and I didn’t know. Zero new records. And I could not tell, from the outside, whether that meant “I simply hadn’t closed any sessions” or “the hook has been failing silently the whole time.” It couldn’t tell me, because the command was wired with the shell-script reflex of swallowing errors — the trailing || true that turns every failure into a clean exit. The thing whose entire job is to make failures visible was hiding its own.

The fix was not a better hook. The fix was a second instrument pointed at the first. The capture script now, on any failure, writes a small error marker with a timestamp and a trace and then exits clean. And there is a check at session start — deterministic, zero tokens, same family as the validators that already guard the workspace — that reads two things: is there an error marker lying around, and is the most recent record older than it should be. If either is true, it warns me at the top of the next session. The observability got observability. It is turtles, and on the day you ship the turtle that watches the other turtles, you’ve understood the problem.

What this is not

It is not a product or a SaaS — no pip, no login, no remote telemetry. The wiring I run is local to one specific shop, mine. But the pattern travels, so I genericized the core into a small reference repo: the record schema, a SessionEnd capture hook, the monthly aggregator, and the session-start check that watches the capture. It is a sibling in spirit to lucy-syndrome — a spec plus reference code — not a maintained package like fscars. The domain-specific parts (how files map to areas of work, the bridge to a correction-rule system) stay pluggable.

It is not a model in the loop where it counts. Capture is pure parsing — no inference, no judgment, fully traceable. The only place a model touches the system is the monthly reading, once, on signals a deterministic script already extracted.

And it is not a dashboard I check for reassurance. A metric I don’t turn into a decision changes nothing. The output that matters is the monthly proposal — a new mechanism — and the only thing that closes the loop is acting on it.

If you want to run it

The reference repo is github.com/VDP89/agent-evals. Two files and a settings.json snippet: wire the SessionEnd hook and the session-start check, close a session, and a record appears under .agent-evals/. At the end of the month, python monthly.py. The README has the five-minute flow and schema/session-eval.v1.json is the full field list. Apache-2.0.

One rule it ships with that I’d underline: do not wire the hook with 2>/dev/null || true. Swallowing the error is the exact bug that made my own capture go dark for four days. Let it leave a marker; let the session-start check surface it.

Where this sits

The arc has a shape now. Lucy Syndrome named the problem: corrections don’t survive across sessions. From memory to scar argued the fix has to run outside the model. Functional scars shipped the primitive that makes a single correction persist. This is the layer above all of it — the one that tells you whether any of it is actually working, and points at the next thing to fix when it isn’t. The reference code is agent-evals.

Persisting a correction is step one. Knowing whether your corrections are compounding, or whether you’re just accumulating rules and self-regard, is step two. The first time the monthly pass hands me a proposal I wouldn’t have thought to write myself, step two will have earned its keep.