Victor Del Puerto

A month of functional scars: 934 fires, one broken validation loop, and what it cost

When I shipped fscars two and a half weeks ago, the framing was “this is the cleaner version of what I run at home.” That was true. What I did not say is that the version I run at home had not been audited in seven weeks, that half the volume came from a single hook with documented false positives, and that the observer was filling a JSONL file nobody read.

This post is the audit. It is also the reason fscars shipped v0.2.0 today.

The setup

The system I run at home is a set of functional scars wired into Claude Code, plus an opportunity observer that captures every time a scar could have fired (whether or not it did). The first hook started logging on April 25, 2026. Today is May 26, 2026. That is thirty-one days of data.

There are ten scars in the workspace. Seven have hook code wired into .claude/settings.json and run on real events. The other three live as session-start reminders only: they show up in the system prompt at the beginning of a session but never intercept a tool call. The audit treats both as part of the same surface.

                     wired hook   doctrine only   total
fired at least once           6               0       6
never fired                   1               3       4
total                         7               3      10

Six scars produced signal in a month. Four did not. That last column matters: the system feels like ten scars, but it operates as six.

Thirty-one days of fires

The hooks emitted 934 fire records into .fscars/logs/fires.jsonl over the month. The distribution is not flat.

[Figure: bar chart of fires per hook. scar_004 leads with 468 fires (50.1%), followed by scar_002 (142), session_start (138), scar_005 (56), scar_011 (39), skill_suggest (37), scar_001 (34), scar_010 (20).]
Half the signal of the entire system lives in one hook.
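
A minimal sketch of how a distribution like this can be tallied from a JSONL fire log. It assumes each line is a JSON object with a "hook" field; the actual fires.jsonl schema is not shown in this post, so that field name and the fire_shares helper are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def fire_shares(lines: Iterable[str], key: str = "hook") -> List[Tuple[str, int, float]]:
    """Return (hook, count, percent of total) tuples, most frequent first.

    `key` is an assumed field name; adjust it to the real log schema.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        counts[json.loads(line)[key]] += 1
    total = sum(counts.values())
    return [(hook, n, round(100 * n / total, 1)) for hook, n in counts.most_common()]


# Usage against a real log (path taken from the post):
# with open(".fscars/logs/fires.jsonl") as fh:
#     for hook, n, pct in fire_shares(fh):
#         print(f"{hook:15s} {n:5d} {pct:5.1f}%")
```

Sorting by count puts the dominant hook on the first row, which is where a concentration like scar_004's becomes hard to overlook.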

scar_004 is a knowledge-base expander. When my prompt mentions a topic I have written about before, it injects pointers to those notes so I do not start from cold. It is the most useful scar on paper and also the one with the documented false-positive problem: a prompt that says “let us close the session” can trigger it on the verb “close.” I raised its match threshold once already. It still leads the count.

The cadence at the session level is rising, not flat.

[Figure: sparkline of Claude Code sessions per ISO week. W17=4, W18=22, W19=25, W20=37, W21=40, W22=10 (partial).]
Adoption is going up. Whether that means the system is helping is a separate question.

W21 had forty sessions. Four weeks earlier, in W17, the first week of instrumentation, the same metric was four. That is a 10× increase in surface area for a system whose precision I had never measured.
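
The weekly bucketing can be sketched as follows, assuming one date per session is available. The sessions_per_iso_week helper is hypothetical and says nothing about how sessions are actually recorded.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def sessions_per_iso_week(session_dates: Iterable[date]) -> Dict[str, int]:
    """Count sessions per ISO week, keyed as 'YYYY-Www' and sorted by week."""
    counts: Counter = Counter()
    for d in session_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[f"{iso[0]}-W{iso[1]:02d}"] += 1
    return dict(sorted(counts.items()))


# The first logging day and the audit day fall in the weeks the sparkline shows.
weeks = sessions_per_iso_week([date(2026, 4, 25), date(2026, 5, 26)])
print(weeks)
```

Using the ISO year rather than the calendar year keeps the buckets stable around New Year, when the two can differ.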

The hidden backlog

While the seven wired hooks were firing, the opportunity observer was doing its own job: capturing every event where a scar could have fired so I could later check recall. By May 26 the observer had collected 3,838 candidate opportunities.

None of them had been validated. The script that would have processed them sat on disk for the whole month, unrun.

[Figure: bar chart of opportunities per scar versus validated count. scar_010 1577, scar_002 1454, scar_011 573, scar_004 233, with zero validated for every row.]
The observer never stops. The validator never started.

This is the part of running observability that I had underweighted. The hook code is the easy half: an if statement on a tool payload. The hard half is the loop back — the part where you confirm that the fires you logged were the right fires, and that the misses you captured were really misses. Without that loop, every metric you compute is either trivially 1.0 (recall on fires-only) or unknowable (precision without ground truth).
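
A small sketch of why those two metrics degenerate without the loop. The function names and the True/False/None label encoding are assumptions made for illustration.

```python
from typing import List, Optional


def recall(caught: int, missed: int) -> Optional[float]:
    """Recall = caught / (caught + missed); None when there is nothing to measure."""
    denom = caught + missed
    return caught / denom if denom else None


def precision(labels: List[Optional[bool]]) -> Optional[float]:
    """Precision over reviewed fires only.

    Each entry is True (useful), False (noise), or None (never reviewed).
    With no reviewed fires there is no ground truth, so the result is None.
    """
    reviewed = [x for x in labels if x is not None]
    if not reviewed:
        return None
    return sum(reviewed) / len(reviewed)


# Fires-only view: every logged fire counts as caught and no miss is ever recorded.
print(recall(caught=468, missed=0))   # 1.0 by construction
print(precision([None] * 468))        # None: unknowable without labels
```

Returning None instead of 0.0 for the unlabeled case keeps "unknown" from being mistaken for "bad" in a dashboard.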

For thirty-one days, my “instrumented” system was producing logs that nobody, including me, was reading.

What the audit caught

I ran a self-audit on May 26 with the lens of “a product manager at Anthropic looking at this customer’s footprint.” Anthropic was not actually auditing me; I framed the question that way to keep myself honest. The conclusions were not generous.

The most-fired hook had no precision metric attached. Half the signal of the system was coming from scar_004, a hook whose false-positive cases I had personally documented on three separate days without changing the heuristic. There was no useful=true|false field on any fire record, so I could not tell what fraction of those 468 fires was helping me and what fraction was noise. Calibrating the hook in that state was guessing.

The structural finding was harder. I had been building infrastructure inward and not outward. The runtime, the paper, and the package all shipped inside the same five-week window, and only the package had any external feedback channel turned on at all. At audit time the package had three stars, no forks, and no pull requests from anyone who was not me. The piece I had operated longest and trusted most, the internal scars, had been validated by exactly one person.

And there was a timer. A hook system without monthly review decays to a placebo inside three to six months: the reviewer stops noticing the fires, the calibration drifts, and the surface keeps growing because adding a new scar is cheaper than checking whether the existing ones still earn their cost. I had not run a formal audit since April 8. That was forty-eight days of drift.
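
The drift figure is plain date arithmetic, and the same check can act as a guard. In this sketch the 30-day interval is an assumed reading of "monthly", and both function names are hypothetical.

```python
from datetime import date

REVIEW_INTERVAL_DAYS = 30  # assumption: "monthly" interpreted as 30 days


def days_since_audit(last_audit: date, today: date) -> int:
    """Whole days elapsed between the last formal audit and today."""
    return (today - last_audit).days


def review_overdue(last_audit: date, today: date,
                   interval: int = REVIEW_INTERVAL_DAYS) -> bool:
    """True when the gap since the last audit exceeds the review interval."""
    return days_since_audit(last_audit, today) > interval


gap = days_since_audit(date(2026, 4, 8), date(2026, 5, 26))
print(gap, review_overdue(date(2026, 4, 8), date(2026, 5, 26)))
```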

Building the validation loop

The fix had to be the part I had been deferring: a way to process the backlog without paying for it in afternoons of manual review. Three tiers, in order of cost.

[Figure: three-tier validation diagram. 3838 opportunities go into Capa 4 deterministic rules (2784 resolved, 72.5%), the remainder flows to the Capa 3 LLM classifier (273 resolved, 7.1%), and unresolved rows go to human review (781, 20.4%).]
The cheap tier should resolve most rows. The expensive tier should only see what is hard.

Capa 4 is a deterministic rules classifier. One callable per scar reads the opportunity row and answers auto_tp, auto_fp, or ambiguous. It costs nothing to run. On the first pass it resolved 2,784 of the 3,838 opportunities.
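
A sketch of the one-callable-per-scar shape. Only the three verdict names come from the description above; the row fields ("scar", "prompt", "matched_notes") and the example rule are hypothetical and are not the fscars.validation API.

```python
from typing import Callable, Dict

Verdict = str  # one of "auto_tp", "auto_fp", "ambiguous"
Rule = Callable[[dict], Verdict]


def rule_scar_004(row: dict) -> Verdict:
    """Hypothetical rule for the knowledge-base expander."""
    prompt = row.get("prompt", "").lower()
    if "close the session" in prompt:
        return "auto_fp"  # housekeeping phrase, not a topic mention
    if row.get("matched_notes", 0) >= 2:
        return "auto_tp"  # several notes matched: assume a real topic hit
    return "ambiguous"    # leave the row for the next tier


# Registry: one callable per scar id.
RULES: Dict[str, Rule] = {"scar_004": rule_scar_004}


def classify(row: dict) -> Verdict:
    """Dispatch to the scar's rule; a scar without a rule stays ambiguous."""
    rule = RULES.get(row.get("scar", ""))
    return rule(row) if rule else "ambiguous"


print(classify({"scar": "scar_004", "prompt": "Let us close the session"}))
```

Defaulting to "ambiguous" means a missing or weak rule pushes work downstream instead of silently discarding rows.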

Capa 3 is an LLM classifier. It only sees rows Capa 4 flagged as ambiguous, runs claude -p --model haiku against each one with the file content attached, and writes the verdict back if the model’s reported confidence clears 0.8. That layer resolved 273 more, for about half a US dollar in inference.
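
The confidence gate can be sketched without committing to how the model is invoked. Here the model call is an injected callable returning a (verdict, confidence) pair; that interface, the "llm_" verdict prefix, and treating "clears 0.8" as greater than or equal are all assumptions.

```python
from typing import Callable, Tuple

CONFIDENCE_FLOOR = 0.8  # threshold from the post; ">=" is an assumed reading of "clears"

# Hypothetical interface: takes an opportunity row, returns ("tp" | "fp", confidence).
AskModel = Callable[[dict], Tuple[str, float]]


def llm_tier(row: dict, ask_model: AskModel, floor: float = CONFIDENCE_FLOOR) -> dict:
    """Resolve an ambiguous row only when the model is confident enough."""
    if row.get("verdict") != "ambiguous":
        return row  # already decided by the rules tier; never re-ask
    verdict, confidence = ask_model(row)
    if verdict in ("tp", "fp") and confidence >= floor:
        return {**row, "verdict": f"llm_{verdict}", "confidence": confidence}
    return row  # stays ambiguous and falls through to human review


def stub_model(row: dict) -> Tuple[str, float]:
    """Stand-in for the real model call, used here only to exercise the gate."""
    return ("fp", 0.93)


print(llm_tier({"scar": "scar_010", "verdict": "ambiguous"}, stub_model))
```

Injecting the callable keeps the gate testable offline and makes the inference cost explicit: one call per ambiguous row, none for rows the rules already settled.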

What is left is what should be left: 781 rows the deterministic rules could not decide and the LLM was not confident about. Those wait for me, or for sharper rules.

[Figure: stacked bar chart of resolution outcomes. 731 true positives (19.0%), 2326 false positives (60.6%), 781 pending (20.4%).]
Of 3,838 opportunities, the system kept 731 as real and discarded 2,326 as noise.

Sixty-one percent of what the observer captured turned out to be false positives once the rules and the LLM had classified it. That number is the price of having no validation loop for thirty-one days. It is also the answer to “is the observer worth running”: yes, but only with the loop attached.
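
The funnel numbers can be cross-checked with a few lines of arithmetic. The counts are the ones reported above; the funnel helper itself is a hypothetical sketch.

```python
from typing import Dict


def funnel(total: int, resolved_by_tier: Dict[str, int]) -> Dict[str, float]:
    """Share of the total resolved by each tier, plus the unresolved remainder."""
    if total <= 0:
        raise ValueError("total must be positive")
    resolved = sum(resolved_by_tier.values())
    if resolved > total:
        raise ValueError("tiers resolved more rows than exist")
    shares = {name: n / total for name, n in resolved_by_tier.items()}
    shares["pending"] = (total - resolved) / total
    return shares


shares = funnel(3838, {"rules": 2784, "llm": 273})
auto_rate = shares["rules"] + shares["llm"]  # fraction resolved without a human

# The outcome split must account for exactly the auto-resolved rows.
true_positives, false_positives = 731, 2326
assert true_positives + false_positives == 2784 + 273

print(f"auto-resolved: {auto_rate:.1%}, false positives: {false_positives / 3838:.1%}")
```

Checking that the outcome counts sum to the auto-resolved count is a cheap guard against a tier double-counting or dropping rows.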

The same three-tier architecture is what shipped today as fscars.validation in v0.2.0 of the package. The runtime that produced these numbers is private; the abstractions are not.

What it cost me to learn this

The clearest thing in retrospect is that a hook without a useful field is not finished. You do not know it is useful, the model does not know it is useful, and a month later you cannot remember why you wrote it. The flag does not need to be elaborate. A keystroke at session end works. A retrospective auto-classifier works. But the field has to exist before the hook ships, not after the backlog is in the thousands.
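
One way the field could look, as a sketch: a fire record that carries useful from the moment it is written, defaulting to "not yet reviewed". The FireRecord shape and the helper names are hypothetical, not the fscars schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass
class FireRecord:
    hook: str
    ts: str                        # ISO 8601 timestamp as a string
    useful: Optional[bool] = None  # None = not yet reviewed


def to_jsonl(record: FireRecord) -> str:
    """Serialize one record as a single JSONL line."""
    return json.dumps(asdict(record))


def label(line: str, useful: bool) -> str:
    """Return the same JSONL line with the useful flag set."""
    row = json.loads(line)
    row["useful"] = useful
    return json.dumps(row)


def review_coverage(lines: Iterable[str]) -> float:
    """Fraction of fire records that carry a True/False label."""
    rows = [json.loads(line) for line in lines]
    if not rows:
        return 0.0
    return sum(r.get("useful") is not None for r in rows) / len(rows)


fire = to_jsonl(FireRecord(hook="scar_004", ts="2026-05-26T10:00:00"))
print(review_coverage([fire, label(fire, False)]))
```

Keeping None distinct from False is the point of the design: coverage then measures how much of the log has been reviewed, separately from how much of it was useful.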

The observer turned out to be more expensive to operate than to build. Capturing opportunities is an afternoon of code. Reviewing what they say takes hours per month forever, unless the cheap classifier exists from day one. I waited a month and paid for the wait in a 60% false-positive rate that I had no way to see from the outside.

The last lesson is harder to admit. Infrastructure built without an external feedback loop ends up confirming what the builder already believes, and a system I run on my own logs, with my own classifiers, is exactly that. The package shipped outward is what tells me whether anyone else found the framing useful, and at thirty-one days I do not have that signal yet.

What is next

There is a control review scheduled for June 25, 2026, at the thirty-day mark for v0.2.0. The questions for that review are written down already. Did the validation loop hold its 80% auto-resolve rate as the volume grew? Did anyone outside my workspace install the package, file an issue, or mention it in a context that was not me posting about it?

The first answer comes from the data. The second is the one I cannot fake, and it is the one that decides whether functional scars are useful as a practice or only useful as a personal tool. I will know more in a month.

If you want to try the validation loop on your own observer logs, pip install fscars will install v0.2.0. The architecture and the worked example live in docs/advanced_validation.md. The earlier post, Functional Scars — turning corrections into a primitive, covers what a scar is and why it exists in the first place.

